LibNX/MPIckpt supports both C and FORTRAN programs.
Currently, checkpoint_here() must be called by all processes of an application to initiate a checkpoint. We chose this approach on the assumption that the programmer knows the best places to checkpoint the application, and that most programs running on the Intel Paragon are scientific computations, which usually display a high degree of symmetry among their processes.
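The symmetric calling convention described above can be sketched in C as follows. This is a minimal illustration, not the library's actual interface: the stub checkpoint_here(), its int argument and return value, the counter ckpt_calls, and the helper run_iterations() are assumptions made so the example compiles on its own. The CKPT_IMMEDIATE and CKPT_PERIODIC values are taken from the FORTRAN Parameter statement in this document. In a real program the stub would be replaced by the library's own declaration.

```c
#include <assert.h>

/* Values taken from the FORTRAN Parameter statement in this document. */
#define CKPT_IMMEDIATE 1
#define CKPT_PERIODIC  0

/* Hypothetical stand-in for the library call so that this sketch is
 * self-contained; it only counts how often it was invoked. */
static int ckpt_calls = 0;

int checkpoint_here(int mode)
{
    assert(mode == CKPT_IMMEDIATE || mode == CKPT_PERIODIC);
    ckpt_calls++;
    return 0;
}

/* Every process runs this same loop, so every process reaches
 * checkpoint_here() at the same point of every iteration. */
double run_iterations(int steps)
{
    double x = 1.0;
    for (int i = 0; i < steps; i++) {
        x = 0.5 * (x + 2.0 / x);        /* stand-in for the real computation */
        checkpoint_here(CKPT_PERIODIC); /* all processes call it here */
    }
    return x;
}
```

Placing the call at the end of an iteration is a common choice for symmetric scientific codes, because every process passes through that point the same number of times.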
LibNX/MPIckpt also provides a memory exclusion mechanism through two function calls: exclude_bytes() and include_bytes(). By default, the entire heap space, including the static data area and data malloc'ed by the program, is saved during a checkpoint. Some programs use temporary storage for calculation or I/O. Excluding such temporary storage from the checkpoint does not affect the correctness of the program. In other cases, memory regions will be overwritten with newly computed values after the call to checkpoint_here(), before they are read again, and hence their values need not be saved in a checkpoint. Identifying and excluding such memory locations can significantly improve the performance of the checkpointer.
Memory is excluded by calling exclude_bytes() before checkpoint_here(). include_bytes() can later be used to re-include this storage, or a subset of it, in subsequent checkpoints. Please see the man pages for more details. The paper "Memory Exclusion: Optimizing the Performance of Checkpointing Systems" is also a good source of explanation on memory exclusion and the performance of libNXckpt.
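The effect of exclusion and re-inclusion can be illustrated with a toy model in C. The two-argument (address, length) signatures used here are assumptions, as are the arena, the per-byte flag table, and the helper bytes_to_save(); the man pages remain the authority on the real interface. The stubs only record which bytes are excluded, so the sketch shows the bookkeeping, not an actual checkpoint.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ARENA_SIZE 4096

/* Toy model: one arena stands in for the checkpointed address space,
 * and one flag per byte records whether that byte is excluded. */
static char arena[ARENA_SIZE];
static unsigned char excluded[ARENA_SIZE];

/* Hypothetical stand-in; the real signature may differ. */
int exclude_bytes(char *addr, size_t len)
{
    assert(addr >= arena && addr + len <= arena + ARENA_SIZE);
    memset(excluded + (addr - arena), 1, len);
    return 0;
}

/* Hypothetical stand-in; the real signature may differ. */
int include_bytes(char *addr, size_t len)
{
    assert(addr >= arena && addr + len <= arena + ARENA_SIZE);
    memset(excluded + (addr - arena), 0, len);
    return 0;
}

/* Number of bytes a checkpoint of the arena would have to write. */
size_t bytes_to_save(void)
{
    size_t n = 0;
    for (size_t i = 0; i < ARENA_SIZE; i++)
        n += !excluded[i];
    return n;
}
```

In this model, excluding a 2048-byte scratch buffer halves the amount of data to save, and re-including a 512-byte subset of it adds exactly those bytes back.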
Finally, the checkpointing library provides an incremental checkpointing capability, which lets the checkpointer save only the pages that the application has modified since the last checkpoint, thereby reducing checkpoint data size and latency.
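The idea behind incremental checkpointing can be sketched with a small C model. This is not how the library is implemented: the sketch detects changed pages by comparing them against a copy kept from the previous checkpoint, which is an assumption chosen for simplicity, and the names PAGE_SIZE, NPAGES, write_byte() and incremental_checkpoint() are all hypothetical.

```c
#include <assert.h>
#include <string.h>

#define PAGE_SIZE 4096
#define NPAGES    8

static char memory[NPAGES][PAGE_SIZE]; /* live application data        */
static char saved[NPAGES][PAGE_SIZE];  /* contents at last checkpoint  */

/* Application-side write into the modelled memory. */
void write_byte(int page, int offset, char value)
{
    assert(page >= 0 && page < NPAGES);
    assert(offset >= 0 && offset < PAGE_SIZE);
    memory[page][offset] = value;
}

/* Save only the pages that differ from the previous checkpoint and
 * return how many pages were written. */
int incremental_checkpoint(void)
{
    int written = 0;
    for (int p = 0; p < NPAGES; p++) {
        if (memcmp(memory[p], saved[p], PAGE_SIZE) != 0) {
            memcpy(saved[p], memory[p], PAGE_SIZE); /* "write" the page */
            written++;
        }
    }
    return written;
}
```

If the application changes bytes on only two of the eight pages between checkpoints, only those two pages are written, and a checkpoint taken immediately afterwards writes nothing.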
In order to run the checkpointing version of the program, you need a startup script that sets a few parameters for the checkpointer.
You can then run your program as before. Nothing needs to be added to the command line. However, if you have disabled checkpointing in the startup script and wish to override that setting, you can specify =checkpoint as the last command-line argument to enable checkpointing.
Once your program starts running, libNXckpt takes care of checkpointing the entire application and writing the checkpoint data to files in the directory specified in your startup script.
External checkpoint_here, include_bytes, exclude_bytes
Integer checkpoint_here, include_bytes, exclude_bytes
Parameter (CKPT_IMMEDIATE=1, CKPT_PERIODIC=0)
With the above declarations in place, you can then call CHECKPOINT_HERE() in the program with CKPT_IMMEDIATE or CKPT_PERIODIC as the argument. Everything else stays the same as for C programs.
Currently libNXckpt does not support hrecv, hrecvx, and hsend. These NX calls are rarely used, and supporting them is a non-trivial task.
Incremental checkpointing cannot be used with the -plk flag. This is due to the Mach virtual memory mapping policy.
Various NX command-line switches such as -noc and -mbf have not been extensively tested with libNXckpt.
Source code for CLIP is provided to the public as is. The gzip version of the source files can be obtained here
In designing our checkpointer library, we used results from J.S. Plank's
Ph.D. thesis "Efficient Checkpointing on MIMD Architectures." Plank also
gave us many good suggestions in building libNXckpt.
Please send your comments and/or questions to:
yuqun@cs.princeton.edu