Devices disconnecting from the server
On a project that has a number of IoT devices attached to it, I got reports of devices going offline. This was odd, as the deployed code had been working for over a year without issue.
It was very clear from the logs that some devices were disconnecting after 30 seconds. The processes attached to the devices should have been pinging every 10 seconds, so something very odd was happening.
The servers running the code have no UI, so I used observer_cli within the application to peek at what was going on. Running this to catch a disconnecting device was fun with the 30 second window of opportunity. Fortunately I was successful on several occasions and found that the message queue for the process was growing.
I surmised from this that the queued messages must be the pings to the device, and looking at the process message queue proved this. Something was blocking the process! Looking at the current stack revealed that it was stalled on a database call.
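To make the failure mode concrete, here is a minimal, language-neutral sketch in Python. It is not the project's code. It assumes a device process that receives a ping timer message every 10 seconds and a device that is dropped after 30 seconds without a ping, as described above. The `blocked_for` parameter is a hypothetical duration for the stalled database call.

```python
from collections import deque

PING_INTERVAL = 10  # seconds between ping timer messages (from the text above)
TIMEOUT = 30        # seconds without a ping before the device drops (from the text above)


def simulate(blocked_for, horizon=60):
    """Simulate one device process that blocks on a call starting at t=0.

    blocked_for is a hypothetical length, in seconds, of the stalled call.
    Returns (max_queue_len, disconnected).
    """
    mailbox = deque()
    last_ping_sent = 0
    max_queue_len = 0
    for now in range(1, horizon + 1):
        if now % PING_INTERVAL == 0:
            # The timer message arrives whether or not the process is busy.
            mailbox.append(("ping", now))
        max_queue_len = max(max_queue_len, len(mailbox))
        if now >= blocked_for:
            # The process is free again: drain the mailbox and send pings.
            while mailbox:
                mailbox.popleft()
                last_ping_sent = now
        if now - last_ping_sent >= TIMEOUT:
            return max_queue_len, True
    return max_queue_len, False
```

With a short stall such as `simulate(5)` the mailbox never holds more than one message and the device stays connected. With `simulate(45)` three pings pile up unanswered and the device is dropped at the 30 second mark, which matches the pattern seen in the logs.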
Going back to the home page in observer_cli revealed that the SQLite driver had the most reductions, top of the list. Unfortunately, looking into that process also revealed a rising message queue, and showed that the driver was almost continually in the esqlite:receive_answer/2 function.
The application was making more requests than SQLite could handle in a timely way.
Now that I had an idea of what was happening, I tested the system under load locally. With my friend Google, I set off to optimise how the application interacted with SQLite.
Whilst this definitely improved things, the main problem was during a mass update of the devices. SQLite could not process the number of writes that were required and still allow the system to function normally.
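As an illustration only (the exact changes made are not listed here), one common way to reduce SQLite write pressure is to group many writes into a single transaction instead of committing each one separately. The sketch below uses Python's standard sqlite3 module. The `devices` table, its columns, and the function names are hypothetical.

```python
import sqlite3


def make_db(n):
    """Create an in-memory database with n hypothetical device rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE devices (id INTEGER PRIMARY KEY, status TEXT)")
    with conn:
        conn.executemany(
            "INSERT INTO devices VALUES (?, 'old')", [(i,) for i in range(n)]
        )
    return conn


def update_one_by_one(conn, updates):
    """Commit after every write: one transaction per device."""
    for device_id, status in updates:
        conn.execute(
            "UPDATE devices SET status = ? WHERE id = ?", (status, device_id)
        )
        conn.commit()


def update_batched(conn, updates):
    """Apply every write inside a single transaction."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "UPDATE devices SET status = ? WHERE id = ?",
            [(status, device_id) for device_id, status in updates],
        )
```

Both functions leave the table in the same state. The batched version pays the commit cost once rather than once per row. Even so, SQLite allows only one writer at a time, so a large enough burst of writes still queues up behind that single writer.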
Time to move away from SQLite.