Name(s): Yurich, Serhii, Roman, Anton L., Martin
Date: 2025-02-06
Last modified: 2025-02-07
Summary
CRM was reported as slow. We investigated and found that Manager 1 was at 100% CPU load while Manager 2 was at 0%. Further investigation showed that Manager 2's disk was full because of logs. Removing them solved the issue.
Impact
CRM and Talent were slow, but reachable. No data loss.
Timeline
-
13:35: Manager 2 ran out of disk space and all of its services shut down. Manager 1 took over all the load, which resulted in 100% CPU load.
-
13:58: Beni contacted Martin via Teams to report that performance was low.
-
14:03: Martin wrote in the Development channel that there were performance issues with CRM.
-
14:10: Yurich and Anton started investigating but did not find a quick solution.
-
14:44: Martin started a video call in the production channel and created a task force with Anton, Roman, Serhii and Yurich. Other developers joined as well.
-
14:51: Communicated in Rocken Chat that the issue exists.
-
15:10: Serhii saw that the Manager 2 disk was full and removed old logs. After that, the servers stabilized.
-
15:21: Communicated in Rocken Chat that the issue was resolved.
-
15:42: Roman noticed that the websocket container’s log file kept growing.
-
15:45: Roman started a video call with Yurii and Serhii.
-
15:55: Serhii found the environment variable that enables debug mode for the websocket container.
-
16:05: Roman set the variable SOKETI_DEBUG: 0 for the websocket container and restarted the stack in Docker Swarm to fix the issue immediately.
-
16:06: Yurii set the same value for the variable in the GitLab CI variables to avoid this issue in the future.
Root Cause(s)
The /var folder of Manager 2 was filled with websocket logs. With no disk space left, the services on Manager 2 crashed. Their load could no longer be distributed and was directed mostly to Manager 1.
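A full disk like this could be caught earlier with a periodic usage check. The following is a minimal sketch, not something that was in place during the incident. The 90% threshold and the function names are assumptions made for illustration; only the `/var` path comes from the report.

```python
import shutil

# Hypothetical alert threshold; the report does not define one.
THRESHOLD_PERCENT = 90.0


def disk_used_percent(path: str) -> float:
    """Return how full the filesystem containing `path` is, in percent."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100.0


def is_disk_nearly_full(path: str, threshold: float = THRESHOLD_PERCENT) -> bool:
    """Return True when usage of the filesystem at `path` reaches `threshold`."""
    return disk_used_percent(path) >= threshold


if __name__ == "__main__":
    # /var is the folder that filled up on Manager 2.
    path = "/var"
    percent = disk_used_percent(path)
    status = "WARNING" if is_disk_nearly_full(path) else "ok"
    print(f"{status}: {path} is {percent:.1f}% full")
```

Run from cron or a monitoring agent on each manager, a check of this kind would report the problem before services start to fail.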
Action Items
-
Log rotation for all production services on Manager 1, 2 and 3
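Until rotation is in place, or to verify afterwards that it keeps log sizes bounded, oversized files can be listed with a small script. This is a sketch under assumptions: the function name, the 1 GiB limit, and scanning all of `/var` are illustrative choices, not part of the incident response.

```python
import os


def find_large_files(root: str, min_bytes: int) -> list[tuple[str, int]]:
    """Return (path, size) for files under `root` of at least `min_bytes`, largest first."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            try:
                if os.path.islink(full_path):
                    continue  # skip symlinks so targets are not counted twice
                size = os.path.getsize(full_path)
            except OSError:
                continue  # file vanished or is unreadable; ignore it
            if size >= min_bytes:
                found.append((full_path, size))
    found.sort(key=lambda item: item[1], reverse=True)
    return found


if __name__ == "__main__":
    # Hypothetical limit of 1 GiB; adjust to the expected log volume.
    limit = 1024 ** 3
    for path, size in find_large_files("/var", limit):
        print(f"{size / 1024 ** 2:10.1f} MiB  {path}")
```

Reading some directories under `/var` typically requires elevated permissions. Files that cannot be read are skipped silently in this sketch.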

