Debugging with Rob

OK. Today I’m going to find out why my program is crashing. I’m going to do this in an “experimental” way: I’ll put forward a hypothesis (a theory about why something happens) and then try to test it. Before I do that though, perhaps I’d better describe the problem. I apologise if some of this explanation is a bit technical:

The program runs for a while and then stops.

The length of time the program runs for varies, and I can’t tell what happens at the point of failure because the program is running inside a device. I can’t take a look inside the running program either, because I don’t have the debugging tools that would let me do that.

So, let’s try my first hypothesis:

Hypothesis 1: The program flashes the NeoPixel LEDs at the same time as it receives data from the air quality sensor. Perhaps the code that controls the lights is affecting the code that gets the data.

Test:

  • Increased speed of pixel update.

  • Loaded up the serial interface with data.

  • Increased rate of MQTT update to once per second.

Result:

The display update slowed right down (as we might expect), but there was no crash.

On this basis I’ll conclude that this hypothesis is not valid. So let’s try another:

Hypothesis 2: Serial data reception is interfering with the WiFi transmission. When the device has got a complete reading from the air quality sensor it sends a message over WiFi to the server.

Test:

  • Turned off the Pixel updates

  • Loaded up the serial interface with data

  • Added code to confirm successful network message sent
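The last item in that list, confirming that each network message was sent, might look something like the sketch below. This is not the code from the device, just a minimal illustration in Python. It assumes a publish function that returns true on success and false on failure (which is how the Publish method behaves in the result that follows). The class name, the stand-in publish function and the topic string are all made up for the example.

```python
class PublishMonitor:
    """Wraps a publish function and keeps a tally of how each call went."""

    def __init__(self, publish):
        # publish: any callable taking (topic, payload) and returning
        # a truthy value on success. Assumed interface, not a real library call.
        self.publish = publish
        self.sent = 0
        self.failed = 0
        self.consecutive_failures = 0

    def send(self, topic, payload):
        ok = bool(self.publish(topic, payload))
        if ok:
            self.sent += 1
            self.consecutive_failures = 0
        else:
            self.failed += 1
            self.consecutive_failures += 1
        return ok


def make_flaky_publish(good_calls):
    """Stand-in for the real network call: succeeds good_calls times, then fails."""
    calls = {"count": 0}

    def publish(topic, payload):
        calls["count"] += 1
        return calls["count"] <= good_calls

    return publish


monitor = PublishMonitor(make_flaky_publish(good_calls=3))
# "airquality/reading" is a hypothetical topic name.
results = [monitor.send("airquality/reading", str(n)) for n in range(5)]
print(results, monitor.sent, monitor.failed, monitor.consecutive_failures)
```

Watching the run of consecutive failures is the useful part: a single failed send might just be a network blip, but a growing run of them is the kind of collapse to look out for.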

Result:

While serial data is being received, the network transfer is slower because the serial interface steals cycles from the processor. Eventually the transfer collapses and the Publish method starts to return false. Shortly after that the whole system falls over.

Further Test:

Restored the pixels and loaded up the serial port.

Result:

When the serial port is loaded the performance collapses as before.

So, it is not a good idea to use the network connection while you are receiving data from the serial port.

This is because I’m using a software simulation of the hardware that normally receives serial data. This simulation has specific timing constraints that mean it needs to “lock out” other processes when it runs. And it seems that this is causing the problem. Under normal circumstances the node only sends a network message every six minutes or so, and the chances of interference are small. But when I’m doing “proper” testing - sending lots of messages and receiving loads of data - I notice the problem.

The solution is to re-work the code so that the two things don’t occur at the same time. Which means I get to write more code. Which I quite like.
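One way such a re-work might go, as a rough sketch only: hold each outgoing message in a queue and only send it once the serial line has been quiet for a little while. This is an assumption about the shape of the fix, not the code that ended up on the device. The class, the quiet period and the topic name are all hypothetical, and times are passed in as plain millisecond counts so the logic is easy to follow.

```python
from collections import deque


class DeferredSender:
    """Holds network messages back until the serial line has gone quiet."""

    def __init__(self, publish, quiet_ms):
        self.publish = publish            # callable(topic, payload) -> True on success
        self.quiet_ms = quiet_ms          # serial silence required before sending
        self.last_serial_activity = None  # time of the most recent serial byte
        self.pending = deque()

    def on_serial_byte(self, now_ms):
        # Call this whenever a byte arrives from the sensor.
        self.last_serial_activity = now_ms

    def queue_message(self, topic, payload):
        self.pending.append((topic, payload))

    def serial_is_quiet(self, now_ms):
        if self.last_serial_activity is None:
            return True
        return now_ms - self.last_serial_activity >= self.quiet_ms

    def update(self, now_ms):
        """Call regularly from the main loop. Sends at most one message per call."""
        if not self.pending or not self.serial_is_quiet(now_ms):
            return False
        topic, payload = self.pending[0]
        if self.publish(topic, payload):
            self.pending.popleft()  # only drop the message once it has gone
            return True
        return False


sent = []


def fake_publish(topic, payload):
    # Stand-in for the real network call; always succeeds.
    sent.append((topic, payload))
    return True


sender = DeferredSender(fake_publish, quiet_ms=500)
sender.on_serial_byte(now_ms=10000)
sender.queue_message("airquality/reading", "42")
first = sender.update(now_ms=10100)   # serial was active 100 ms ago: hold back
second = sender.update(now_ms=10600)  # 600 ms of silence: safe to send
print(first, second, sent)
```

A message that fails to send stays at the front of the queue and is tried again on a later update, so a reading is not lost just because the first attempt went wrong.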

You might like to know (if you’ve read this far) that the notes above actually came from my diary. I always write these things up each day. The idea is that if I get a similar problem in the future I’ll have something to go back to. If you don’t write a diary/log when you do this kind of thing you’re really missing a trick.