Gene Cooperman and Twinkle Jain (Northeastern University)
The ROS master is well known to be a single point of failure. The DMTCP open-source package for transparent checkpoint-restart was recently extended to support checkpoint-restart of the ROS master. After a failure, the ROS master is rolled back to the last checkpoint and resumed from there. Checkpoints can be taken as often as every few seconds. The DMTCP plugin model also allows users to add plugins that model their external devices and restart them in a state equivalent to their state at the time of the checkpoint. Finally, we speculate on the potential of DMTCP’s distributed mode to support a global restore, given appropriate plugins, in the future.
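The rollback semantics described above can be illustrated with a toy model. The sketch below is not DMTCP and does not use any ROS API: `ToyMaster`, `checkpoint`, `restart`, and `demo` are hypothetical names, and the in-memory copy only stands in for the checkpoint image that DMTCP would write transparently for the whole process. It shows why frequent checkpoints matter: any registration made after the last checkpoint is absent after rollback.

```python
import copy


class ToyMaster:
    """Hypothetical stand-in for the ROS master's registration table."""

    def __init__(self):
        # Maps a topic name to the set of node names publishing on it.
        self.publishers = {}

    def register_publisher(self, topic, node):
        self.publishers.setdefault(topic, set()).add(node)


def checkpoint(master):
    """Return an independent snapshot of the master's state (assumed image format)."""
    return copy.deepcopy(master)


def restart(image):
    """Build a fresh master from a snapshot, leaving the snapshot reusable."""
    return copy.deepcopy(image)


def demo():
    master = ToyMaster()
    master.register_publisher("/scan", "lidar_driver")

    image = checkpoint(master)  # last checkpoint before the failure

    # This registration happens after the checkpoint, so rollback loses it.
    master.register_publisher("/odom", "base_controller")

    master = None  # simulated failure of the master
    return restart(image)


if __name__ == "__main__":
    restored = demo()
    print(sorted(restored.publishers))
```

In this model the `/odom` registration would have to be replayed or re-issued by the node after the restart, which is one reason a short checkpoint interval reduces the amount of lost state.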