Tuesday, September 22, 2015

Getting 2.5 Megalines of code to behave: On Curiosity and its software

http://jlouisramblings.blogspot.com/2012/08/getting-25-megalines-of-code-to-behave.html

I cannot help but speculate on how the software on the Curiosity rover has been constructed. We know that most of the code is written in C and that it comprises roughly 2.5 Megalines of code[1]. One may wonder how it is possible to write such a complex system and have it work. This is the Erlang programmer's view.

First some basics. The rover uses a radioactive power source which delivers power to it continuously. The power source also provides some heating to the rover, which is always nice given the extreme weather conditions on Mars.

The rover is mostly autonomous. It takes minutes to hours to send a message, and you can only transmit data during limited periods of the Mars day. The rover itself can talk with Earth directly, but that link is slow. It can also talk through satellites orbiting Mars, using them as an uplink, which is faster. The consequence is that the rover must act on its own. We cannot guide it by having a guy in a seat with a joystick back here on Earth.

There are two identical computers on the rover. We note that NASA acts in the words of Joe Armstrong: "To have a reliable system you need two computers". One of these is always dormant, ready to take over if the other one dies for some reason. This is a classic takeover scenario as seen in Erlang systems, the OpenBSD PF firewall and so on. The computers are BAE Systems RAD750 computers. They implement the PowerPC ISA and have modest specifications: a clock of 200 MHz, a 150 or 250 nm manufacturing process, and an impressive operating temperature range. They are also radiation hardened and can withstand lots of radiation. The memory is hardened against radiation as well. It is not an easy task to be a computer on board the Curiosity.

The operating system is VxWorks. This is a classic microkernel. A modest guess is that the kernel is less than 10 Kilolines of code and is quite battle tested. In other words, this kernel is near bug free. The key here is isolation. We isolate different parts of the rover from each other. There are certain subsystems which are outright crucial to the survival of the rover, whereas a scientific instrument is merely there for observation. Hence we can apply a nice fact, namely that only parts of our 2.5 million lines of code need to be deeply protected against error. There will be some parts which we can survive without.

NASA[2] uses every trick in the book to ensure good code quality. Recursion is frowned upon, for instance, simply because C compilers cannot guarantee the stack won't explode. Loops must have a bound that guarantees termination, so that a static analyzer can find problems. Memory is mostly statically allocated to avoid sudden collection calls and unpredictable performance. Also note that message passing is the preferred way of communicating between subsystems. Not mutexes. Not software transactional memory. Isolation is part of the coding guidelines as well. By using memory protection and singular ownership of data, we make it hard for subsystems to mess with each other. The Erlang programmer nods at the practices.
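To make these rules concrete, here is a minimal sketch in C of what statically allocated storage and a bounded loop can look like. The names (`sample_buffer`, `telemetry_buffer`, `MAX_SAMPLES`) are hypothetical and chosen for illustration only; this is not taken from the rover's code.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SAMPLES 8   /* hypothetical capacity, fixed at compile time */

/* All storage has a fixed size: no malloc, no free, no collector. */
typedef struct {
    int    values[MAX_SAMPLES];
    size_t count;
} sample_buffer;

/* Statically allocated and zero-initialized before the program starts. */
sample_buffer telemetry_buffer;

/* Store a sample. Returns 0 on success, -1 if the buffer is full. */
int buffer_push(sample_buffer *b, int value)
{
    assert(b != NULL);
    if (b->count >= MAX_SAMPLES) {
        return -1;   /* full: report the problem instead of growing */
    }
    b->values[b->count] = value;
    b->count++;
    return 0;
}

/* Sum the stored samples without recursion. The loop can never run
 * more than MAX_SAMPLES times, so termination is easy to check. */
long buffer_sum(const sample_buffer *b)
{
    long total = 0;
    assert(b != NULL);
    for (size_t i = 0; i < b->count && i < MAX_SAMPLES; i++) {
        total += b->values[i];
    }
    return total;
}
```

A caller would push samples with `buffer_push(&telemetry_buffer, 3)` and has to handle the -1 return when the buffer is full. The price of static allocation is that capacity must be decided up front, but memory use and worst-case running time become predictable.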

The architecture of the Mars Pathfinder software[3], which is the basis, turns out to be very Erlang-like. They have "Modules" which pass messages. They only wait when receiving messages; sends are void functions. They have a single event loop for receiving, probably much akin to an Erlang gen_server process. The different modules communicate by sending messages to each other over a protocol. You can access the memory space of another module, but it is frowned upon by the JPL coding guidelines. This is a difference from Erlang, which disallows it entirely. The Mars Exploration Rovers (Spirit and Opportunity) have many more modules but share the same software basis. And Curiosity is no different. They essentially built on the older software. The thread count is in the hundreds, which also neatly reflects what it would probably be in an Erlang system of this kind.
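As a rough illustration of the module idea, here is a single-threaded sketch in C: a module owns a fixed-size mailbox, sending is a void function that never waits, and one loop handles incoming messages. All names (`module`, `module_send`, `module_run`, `thermal_module`, the message kinds) are invented for this example. The real flight software blocks on receive inside an operating system task, which this sketch does not model.

```c
#include <assert.h>
#include <stddef.h>

#define MAILBOX_CAPACITY 4   /* hypothetical, fixed at compile time */

typedef enum { MSG_PING, MSG_SET_HEATER } msg_kind;

typedef struct {
    msg_kind kind;
    int      payload;
} message;

typedef struct {
    message  slots[MAILBOX_CAPACITY];  /* ring buffer of pending messages */
    size_t   head;                     /* index of the next message to read */
    size_t   count;                    /* number of messages waiting */
    unsigned dropped;                  /* messages lost to a full mailbox */
    int      pings_seen;               /* private state of the module */
    int      heater_level;             /* private state of the module */
} module;

module thermal_module;   /* statically allocated, zero-initialized */

/* Sending returns nothing and never waits. If the mailbox is full the
 * message is counted as dropped (an assumption made for this sketch). */
void module_send(module *m, message msg)
{
    assert(m != NULL);
    if (m->count == MAILBOX_CAPACITY) {
        m->dropped++;
        return;
    }
    m->slots[(m->head + m->count) % MAILBOX_CAPACITY] = msg;
    m->count++;
}

/* Handle one message. Returns 1 if one was handled, 0 if none waited. */
int module_receive_one(module *m)
{
    message msg;
    assert(m != NULL);
    if (m->count == 0) {
        return 0;
    }
    msg = m->slots[m->head];
    m->head = (m->head + 1) % MAILBOX_CAPACITY;
    m->count--;
    switch (msg.kind) {
    case MSG_PING:       m->pings_seen++;                break;
    case MSG_SET_HEATER: m->heater_level = msg.payload;  break;
    }
    return 1;
}

/* One pass of the event loop: handles at most MAILBOX_CAPACITY messages
 * and returns how many were handled. */
int module_run(module *m)
{
    int handled = 0;
    for (size_t i = 0; i < MAILBOX_CAPACITY; i++) {
        if (!module_receive_one(m)) {
            break;
        }
        handled++;
    }
    return handled;
}
```

Another module changes the heater only by calling `module_send(&thermal_module, (message){MSG_SET_HEATER, 7})`; it never touches `heater_level` directly. In C that is a convention the guidelines have to enforce, whereas an Erlang process cannot reach into another process at all.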

In Curiosity, they added "Components", which are groups of modules, in order to manage the complexity. Components are also needed to handle the fact that you have two redundant computers, and that many other subsystems are redundant for robustness as well. Interestingly, the Erlang designers also saw the need for such a thing. They just named them Applications. Nod.

Functions check all invariants with assertions: that input parameters satisfy a precondition, that a postcondition holds for the return value, and that various invariants are still true. The Erlang programmer nods again. Interestingly, there is a 60 line limit on functions so they can be printed on a single sheet of paper. The Erlang programmer prefers way shorter function bodies here, but the idea still holds. Make code simple and comprehensible.
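A small sketch of this style in C, using the standard `assert` macro for brevity. The function `clamp_speed` and the limits `SPEED_MIN` and `SPEED_MAX` are hypothetical, and how the flight code actually reacts to a failed assertion is not something I know.

```c
#include <assert.h>

#define SPEED_MIN (-100)   /* hypothetical hardware limits */
#define SPEED_MAX 100

/* Clamp a requested wheel speed into the range [lo, hi]. */
int clamp_speed(int requested, int lo, int hi)
{
    int result = requested;

    /* Preconditions: the range is well formed and within the limits. */
    assert(lo <= hi);
    assert(lo >= SPEED_MIN && hi <= SPEED_MAX);

    if (result < lo) {
        result = lo;
    }
    if (result > hi) {
        result = hi;
    }

    /* Postcondition: the returned value lies inside the range. */
    assert(result >= lo && result <= hi);
    return result;
}
```

For example, `clamp_speed(150, -50, 50)` returns 50, while calling it with `lo` greater than `hi` trips the first assertion instead of silently producing nonsense.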

Another interesting story is that in the past, one of the earlier missions, Mars Pathfinder, had problems with priority inversion. They saved it by using a debug console to inject a correction into the running system. This is very much like what we often do in Erlang systems. We can alter the running system as we see fit and upgrade it on the fly. We can monitor the system as it runs and make sure it runs as we would like. The ability to hot-fix the system is valuable. Also, development is done with extensive tracing and analysis of the traces, which corresponds to Erlang QuickCheck / PropEr, error logging and the tracing facilities.

It turns out that many of the traits of Erlang systems overlap with those of the rovers. I don't think this is a coincidence. The software has certain different properties: the rovers are hard realtime whereas Erlang systems are soft realtime. But by and large, the need to write robust systems means that you need to isolate parts of the system from each other. It is also food for thought, because it looks like the method works. These traits are important for highly reliable software. Perhaps more so than static type checks and verification.

The upshot is that of all the code lines in the rover, we probably do not have to trust them all to the maximal level of security. We can sandbox different parts and apply different levels of correctness checking to these parts. In other words, we can manage the errors and alleviate the risk by careful design. Thus for some modules, we can probably live with the fact that they might fail. Suppose that the uplink fails. We can probably restart it and have it survive. If not, we have another redundant uplink directly to Earth which is slower, but which can be used to restore the other uplink. This layering means that multiple components have to fail for the mission to abort. A science experiment can probably fail as well without aborting the mission. We could just take another picture after having restarted the module. There is a trusted computing base, but hopefully it is small and needs little change. It is also battle tested on the 3 earlier rovers that form its basis.
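The restart-then-fall-back idea can be sketched in a few lines of C. This is my own illustration of the layering, in the spirit of an Erlang supervisor with a restart limit. The names (`start_with_restarts`, `choose_link`, `MAX_RESTARTS`) and the simulated links are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_RESTARTS 3   /* hypothetical restart limit */

typedef enum { LINK_ORBITER, LINK_DIRECT, LINK_NONE } link_choice;

/* A start function returns 1 if the link came up, 0 if it failed. */
typedef int (*start_fn)(void);

/* Try to start a link: one initial attempt plus at most max_restarts
 * restarts. Returns 1 on success, 0 if every attempt failed. */
int start_with_restarts(start_fn start, int max_restarts)
{
    assert(start != NULL);
    assert(max_restarts >= 0 && max_restarts <= MAX_RESTARTS);
    for (int attempt = 0; attempt <= max_restarts; attempt++) {
        if (start()) {
            return 1;
        }
    }
    return 0;
}

/* Prefer the fast link through the orbiters. Only when it keeps failing
 * do we fall back to the slow direct link. */
link_choice choose_link(start_fn fast, start_fn slow)
{
    if (start_with_restarts(fast, MAX_RESTARTS)) {
        return LINK_ORBITER;
    }
    if (start_with_restarts(slow, MAX_RESTARTS)) {
        return LINK_DIRECT;
    }
    return LINK_NONE;
}

/* Simulated links for trying the sketch out. */
static int flaky_calls;
int flaky_link(void)   { flaky_calls++; return flaky_calls >= 3; }
int dead_link(void)    { return 0; }
int healthy_link(void) { return 1; }
```

With these stand-ins, `choose_link(flaky_link, healthy_link)` ends up on the orbiter link after two failed attempts, and `choose_link(dead_link, healthy_link)` falls back to the direct link. Only when both layers fail does the caller see `LINK_NONE`.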

The things that do not overlap have to do with the need for soft realtime vs hard realtime. In Erlang we can give up on delivering service in time. It is bad, but we can do it. On a rover it can be disastrous, especially in the flight control software. Fire a rocket too late and you are in trouble. This explains why they use static allocation and a fixed stack size over dynamic allocation. It also explains why they dislike recursion. On the other hand, we get to avoid manual memory management in Erlang. We also have the benefit of a very deterministic tail call optimization, so we can rely on its use.

TL;DR - Some of the traits of the Curiosity rover's software closely resemble the architecture of Erlang. Are these traits basic for writing robust software?

Sources:

[0] Wikipedia, the Curiosity rover page