LibNXckpt -- Startup Script






Overview

A startup script is required to run programs linked with libNXckpt. The startup script, .ckptrc, controls the behavior of the checkpointing library.

By setting parameters in the startup script, the user can tailor the checkpointing library to the situation at hand without recompiling the program.

The startup script consists of lines of ASCII text, separated by ordinary carriage returns. Each line is either a (parameter, value) pair or a comment. A comment line either starts with # (pound sign) as its first character or is a blank line. A parameter specification line starts with the name of the parameter (no leading white space allowed), followed by a tab and then the value for the parameter. There are three types of parameters in the startup script: BOOLEAN, STRING, and INTEGER.

The startup script .ckptrc can reside either in the current working directory where the program starts or in the user's home directory. If startup scripts are found at both locations, the one in the current working directory takes effect and the other is ignored.
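The file format and lookup order described above can be sketched in a few lines of Python. This is only an illustration, not the library's own parser: the function names find_ckptrc and parse_ckptrc and the SAMPLE text are hypothetical, values are kept as plain strings, and treating a whitespace-only line as blank is an assumption.

```python
import os

# Hypothetical sample script; fields are separated by a single tab.
SAMPLE = (
    "# sample startup script\n"
    "checkpointing\tON\n"
    "directory\t/pfs/yuqun\n"
    "\n"
    "maxfiles\t2\n"
)

def find_ckptrc(cwd, home, name=".ckptrc"):
    """Return the script path to use: the working directory wins over home."""
    for directory in (cwd, home):
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate):
            return candidate
    return None

def parse_ckptrc(text):
    """Parse 'name<TAB>value' lines, skipping comments and blank lines."""
    params = {}
    for line in text.splitlines():
        # Comment lines start with '#' in column one; blank lines are skipped
        # (whitespace-only lines are treated as blank, which is an assumption).
        if line.startswith("#") or not line.strip():
            continue
        name, sep, value = line.partition("\t")
        if not sep or name != name.strip():
            raise ValueError("expected 'name<TAB>value': %r" % line)
        params[name] = value
    return params
```

Calling parse_ckptrc(SAMPLE) yields a dictionary with the three keys checkpointing, directory, and maxfiles. Converting each string to its BOOLEAN, STRING, or INTEGER type is left to the caller.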



Parameters for .ckptrc

checkpointing
If ON (the default), checkpointing is enabled. When this parameter is OFF, the user can still enable checkpointing by specifying =checkpoint on the command line of the program.

verbose
OFF by default. If turned ON, messages about checkpointing status are printed to STDOUT for each checkpoint.

directory
A path name that specifies where to save the checkpoints, for example /pfs/yuqun.

maxfiles
A positive integer that specifies how many checkpoints to keep on disk. The default is 1.

exclude
If ON (the default), exclude_bytes() and include_bytes() take effect; otherwise, these two function calls are null procedure calls that do nothing.

incremental
An ON value turns on the page-protection-based incremental checkpointing feature. Under incremental checkpointing, the first checkpoint is a full checkpoint, which saves the entire data section and the stack of your program. It is followed by a number of incremental checkpoints, which save only the parts of the address space that have changed since the previous checkpoint. The number of incremental checkpoints is set by the parameter maxincfiles. After maxincfiles incremental checkpoints, a full checkpoint is taken again. For programs that touch only part of their address space at some execution stages (e.g., iterating over one array in a loop), this feature can greatly lower checkpointing overhead by reducing the amount of checkpoint data that needs to be saved on disk.

The default value for incremental is OFF.

maxincfiles
As mentioned above, this parameter tells the checkpointer how many incremental checkpoints to take after a full checkpoint. After this many incremental checkpoints, a full checkpoint is taken again, followed by another set of maxincfiles incremental checkpoints.

If incremental checkpointing is enabled, maxincfiles has to be greater than zero.
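The resulting cycle of one full checkpoint followed by maxincfiles incremental ones can be illustrated with a short Python sketch. The function checkpoint_kind and its 0-based numbering are hypothetical and only model the schedule described above; they are not part of the library.

```python
def checkpoint_kind(n, maxincfiles):
    """Return 'full' or 'incremental' for the n-th checkpoint (0-based).

    Models the documented cycle: one full checkpoint, then maxincfiles
    incremental checkpoints, then a full checkpoint again.
    """
    if maxincfiles <= 0:
        raise ValueError("maxincfiles has to be greater than zero")
    cycle_length = maxincfiles + 1  # one full plus maxincfiles incrementals
    return "full" if n % cycle_length == 0 else "incremental"
```

With maxincfiles set to 2, checkpoints 0, 3, and 6 are full and the ones in between are incremental.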

mintime
Sets the minimum time that must elapse between two consecutive periodic checkpoints. A timer is maintained on node 0 of your application. When all nodes call checkpoint_here(CKPT_PERIODIC), they consult node 0 to check whether the timer has expired. If mintime has passed, a checkpoint is initiated, after which the timer is reset to mintime. A simple synchronization protocol ensures that either all nodes take this periodic checkpoint or none do.

The default is 0, which disables the timer-based PERIODIC checkpointing.
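The timer decision made on node 0 can be sketched as follows. This is a simplified single-process model, not the library's implementation: the class PeriodicTimer is hypothetical, the timer is assumed to start when the object is created, the timer is re-armed at decision time rather than after the checkpoint completes, and mintime is taken to be in the same (unspecified) units as the now argument. The inter-node synchronization protocol is not modeled.

```python
class PeriodicTimer:
    """Hypothetical model of the mintime timer kept on node 0."""

    def __init__(self, mintime, now):
        self.mintime = mintime
        self.deadline = now + mintime  # assumed to start at creation time

    def should_checkpoint(self, now):
        """Return True if a periodic checkpoint should be taken at time 'now'."""
        if self.mintime <= 0:
            return False  # mintime of 0 disables periodic checkpointing
        if now < self.deadline:
            return False  # mintime has not yet passed
        # Re-arm the timer for the next interval.
        self.deadline = now + self.mintime
        return True
```

For example, with mintime set to 10 and the timer created at time 0, a call at time 5 is declined, a call at time 10 triggers a checkpoint, and the next one cannot occur before time 20.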

singlefile
If ON, this parameter tells the checkpointer to save the checkpoint data from all the nodes for one checkpoint to a single file. Writing all the data to one file has a performance advantage in most cases. However, when you run your program on more than 160 nodes, it is recommended that this parameter be turned OFF. The default is ON.

maxiobuf
This sets the upper limit on the user-space buffer allocated for checkpointing I/O. The buffer is allocated only once, prior to the first checkpoint. The checkpointer tries to allocate a buffer whose size equals the size of a full stripe on the PFS directory where the checkpoint data is saved, which gives checkpointing the maximum I/O performance. Some PFS file systems, however, have a large striping factor (e.g., 80 on Caltech's /pfs-sio). This can cause the checkpointer to use too much buffer space and sometimes results in heavy paging. Hence it is recommended that maxiobuf be set to an appropriate value (around 1 megabyte). The default value is 0, in which case the checkpointer uses the full stripe size of the PFS file system as its buffer size.
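The effect of maxiobuf on the buffer size can be sketched in Python. The function io_buffer_size is hypothetical, and computing the full stripe as the striping factor times the stripe unit is an assumption made for illustration. The 64 KiB stripe unit used in the example is likewise hypothetical and not taken from any real file system.

```python
def io_buffer_size(full_stripe_bytes, maxiobuf):
    """Return the I/O buffer size in bytes under the documented rule.

    A maxiobuf of 0 means no limit: the full stripe size is used.
    Otherwise the buffer is capped at maxiobuf bytes.
    """
    if maxiobuf == 0:
        return full_stripe_bytes
    return min(full_stripe_bytes, maxiobuf)

# Hypothetical numbers: striping factor 80, stripe unit 64 KiB.
FULL_STRIPE = 80 * 64 * 1024   # 5 MiB full stripe (assumed formula)
ONE_MEGABYTE = 1024 * 1024
```

With these assumed numbers, io_buffer_size(FULL_STRIPE, 0) gives the 5 MiB full stripe, while io_buffer_size(FULL_STRIPE, ONE_MEGABYTE) caps the buffer at 1 MiB.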

maxhdrbuf
This sets the upper limit on the checkpointing header buffer, whose size defaults to the striping unit size on your PFS, or to the file system block size if you are saving the checkpoint data to a UFS directory.

cleanup
If ON, the checkpoint files will be removed after the program successfully completes. The default is ON.

abortonerror
If ON, the checkpointer will force the program to abort when encountering an error during checkpointing. The default is ON.


Please send your comments and/or questions to:
yuqun@cs.princeton.edu