[pfSense] System stats: HUGE SPIKE, then failed.

Karl Fife

2018-04-03 21:50:39 UTC

We just had a sudden spike in states, to ~100x the normal number, which
hit the system maximum in about an hour and caused the system to fail.

With a maxed out state table, of course the system fails to process
traffic. Has anyone seen something like this before, or have any ideas
what kinds of things would look like this?

Monitoring PNG attached.

For us, on a normal day, the system hovers around 7-15K states. Just
before noon today, the system suddenly started adding states at a rate
of about 9K per minute until the system maxed out (at 800K states in
just under an hour and fifteen minutes).
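
For a rough sense of the time budget, the figures above can be plugged
into a small calculation. This is only a sketch: the function name is
made up for illustration, and the inputs are the approximate numbers
quoted in this message (800K limit, a 7-15K baseline, ~9K new states per
minute), so the result is an estimate rather than a measurement.

```python
def minutes_to_exhaustion(limit, current, rate_per_min):
    """Minutes until the state table fills at a constant growth rate."""
    return (limit - current) / float(rate_per_min)

# Approximate figures from this message: 800K limit, ~15K baseline.
from_baseline = minutes_to_exhaustion(800000, 15000, 9000)

# Average rate implied by filling the table in roughly 75 minutes.
implied_rate = (800000 - 15000) / 75.0
```

At a steady 9K per minute the table would take closer to an hour and a
half to fill from the baseline, so the real rate was probably somewhat
higher than 9K per minute on average, or not constant.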

Failure mode analysis was difficult because we couldn't access the WebUI
or SSH because (of course) the LAN interface couldn't allocate a state
for the connection, so we had to restart, hoping to find something in
the logs. The logs were not helpful because the circular logs were too
small (subsequently "embiggened", of course). More to the point, the
offending states wouldn't be logged anyway, so the logs can't tell us
which IP or IPs the offending states belong to.
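
Since the logs can't answer that, one option while the table is still
filling is to tally the live state table by address. The following is a
minimal sketch, not a tested tool. It assumes a Python interpreter is
available on the firewall, that it runs as root, and that `pfctl -ss`
prints one state per line with IPv4 endpoints written as address:port.
The function names are made up for illustration, and IPv6 states are
ignored.

```python
import re
import subprocess
from collections import Counter

# Matches IPv4 "address:port" endpoints; IPv6 states are ignored in this sketch.
ENDPOINT = re.compile(r"(\d{1,3}(?:\.\d{1,3}){3}):\d+")

def count_addresses(state_lines):
    """Count how many state entries each IPv4 address appears in."""
    counts = Counter()
    for line in state_lines:
        # Use a set so an address appearing twice on one line counts once.
        counts.update(set(ENDPOINT.findall(line)))
    return counts

def top_talkers(n=10):
    """Run `pfctl -ss` (needs root) and return the n busiest addresses."""
    out = subprocess.check_output(["pfctl", "-ss"]).decode("utf-8", "replace")
    return count_addresses(out.splitlines()).most_common(n)
```

Calling top_talkers() during a spike should put the host or hosts behind
most of the states at the top of the list. Counting both ends of every
state avoids having to interpret the direction arrows in the output.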

Going forward:

The ~1 hour window in which to do forensics (when/if this happens again)
is quite small, so I wonder if there is a way to have Growl generate a
notification when, say, states exceed a certain threshold, so we can at
least pay attention while it's happening. Any tips on notifications?
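
To make the question concrete, the kind of check in mind could look
roughly like the sketch below, run periodically (from cron, for
example). It assumes Python is available, that it runs as root, and that
`pfctl -si` reports the table size on a "current entries" line. The
threshold value and function names are hypothetical, and delivering the
alert (Growl, email, or anything else) is deliberately left out.

```python
import re
import subprocess

# Hypothetical alert level: well above the normal 7-15K, well below the 800K limit.
THRESHOLD = 100000

def parse_current_states(pfctl_info):
    """Extract the 'current entries' value from `pfctl -si` output."""
    match = re.search(r"current entries\s+(\d+)", pfctl_info)
    if match is None:
        raise ValueError("no 'current entries' line found")
    return int(match.group(1))

def states_exceed(threshold=THRESHOLD):
    """Return (count, exceeded) for the live state table; needs root."""
    out = subprocess.check_output(["pfctl", "-si"]).decode("utf-8", "replace")
    count = parse_current_states(out)
    return count, count > threshold
```

A wrapper would call states_exceed() every minute or so and hand the
count to whatever notifier is in use when the second value is True.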

Probably irrelevant, but this is: pfSense 2.4.2R p1 AMD64 on a
Supermicro Rangely/Atom ECC, ZFS

Thanks!
-Karl