
EdgeRouter X Crashing

We've been having a lot of trouble with an EdgeRouter X based infrastructure. We have nearly 10 EdgeRouters, working in pairs with VRRP. None of them stays up for more than 1 to 3 days.
For some reason, some of these ERs crash or freeze, and the only way to regain access is to restart them manually. When they freeze they stop working completely: we cannot reach the interface and no traffic passes through them. All of them are connected to the same VPN, with many subnets.

What we already tried:

  • Because both ERs in each pair were connected to the same switch, we first thought it was a DHCP problem, so we built a script that enables DHCP only on the master ER and keeps it stopped on the backup. That didn't work;
  • Then we thought it was a memory problem, so we built a script to clean the RAM every 5 minutes;
  • We stopped the SNMP services, and that didn't work either;
  • Lastly, we considered a disk usage problem, because the issue appears to be cumulative, so we built a new script to erase some of the log files every 5 minutes. All syslogs are sent over the network to our rsyslog server at the information log level. That seemed to work well until now (most ERs reached 5 days of uptime), but today one ER crashed.
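
For reference, a cleanup script of the kind described in the last item might look roughly like the sketch below. It is not the exact script running on the routers: the function name `truncate_large_logs`, the 1 MiB threshold and the script path in the cron comment are assumptions made for illustration. It truncates files in place instead of deleting them, so a daemon that still has the file open keeps writing to the same (now empty) file.

```shell
#!/bin/sh
# Sketch only: truncate_large_logs, the threshold and the paths below
# are illustrative assumptions, not the script actually deployed.

# truncate_large_logs DIR MAX_BYTES
# Empties every regular file under DIR that is larger than MAX_BYTES
# and prints one line per truncated file.
truncate_large_logs() {
    dir=$1
    max_bytes=$2
    find "$dir" -type f | while IFS= read -r f; do
        # wc -c may pad its output with spaces on some systems
        size=$(wc -c < "$f" | tr -d ' ')
        if [ "$size" -gt "$max_bytes" ]; then
            : > "$f"    # truncate in place, keeping the same inode
            printf 'truncated %s (%s bytes)\n' "$f" "$size"
        fi
    done
}

# Example call (run as root):
#   truncate_large_logs /var/log 1048576
# Example cron entry for a 5 minute interval (hypothetical script path):
#   */5 * * * * /path/to/clean-logs.sh
```

The printed lines can be forwarded to the rsyslog server, which would at least show which files were large before a crash.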

Relevant info:

  • Our logs (on the rsyslog server) show nothing relevant or related to the crash. They simply stop as soon as the ER crashes.
  • Log files like /var/log/messages and /var/log/dmesg don't have a high write rate (we ran tail -f and saw only a few messages).
  • Firmware version: 2.0.9 (with no hotfixes).
  • Bootloader version: e50_001_1e49c.

Present situation:

The problem seems to be disk related, because periodically cleaning some log files almost solved it. But we still can't isolate which files are growing: while troubleshooting, commands like ls, df and du don't seem to reflect any growth (the reported log file sizes don't increase).
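
One way to narrow this down would be to take two snapshots of file sizes a few minutes apart and compare them. The sketch below does that with standard tools only; the function names `snapshot_sizes` and `growing_files` and the snapshot paths are hypothetical. Note that it only sees files that are still linked in the filesystem, so space held by deleted-but-still-open files, or growth in memory rather than on disk, would not show up here.

```shell
#!/bin/sh
# Sketch only: function names and paths are illustrative assumptions.

# snapshot_sizes DIR
# Prints "<bytes><TAB><path>" for every regular file under DIR.
snapshot_sizes() {
    find "$1" -type f | while IFS= read -r f; do
        size=$(wc -c < "$f" | tr -d ' ')
        printf '%s\t%s\n' "$size" "$f"
    done
}

# growing_files OLD_SNAPSHOT NEW_SNAPSHOT
# Prints "<growth in bytes><TAB><path>" for files that grew or are new,
# largest growth first. Assumes OLD_SNAPSHOT is not empty.
growing_files() {
    awk -F '\t' '
        NR == FNR   { old[$2] = $1; next }            # first file: remember sizes
        !($2 in old) { printf "%d\t%s\n", $1, $2; next }  # new file
        $1 + 0 > old[$2] + 0 { printf "%d\t%s\n", $1 - old[$2], $2 }
    ' "$1" "$2" | sort -rn
}

# Example usage:
#   snapshot_sizes /var/log > /tmp/sizes.old
#   sleep 300
#   snapshot_sizes /var/log > /tmp/sizes.new
#   growing_files /tmp/sizes.old /tmp/sizes.new
```

Running the same comparison against other writable directories (for example /tmp or /run) may help if nothing under /var/log turns out to be growing.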
