checkpointing
If ON (the default), checkpointing is enabled. When this
parameter is OFF, you can still enable checkpointing by
specifying =checkpoint on the command line of the program.
verbose
OFF by default. If turned ON, messages reporting
checkpointing status are printed to STDOUT for each
checkpoint.
directory
The path name of the directory where the checkpoints are saved, for example,
/pfs/yuqun.
maxfiles
A positive integer that specifies how many checkpoints
to keep on disk. The default is 1.
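
The retention rule can be illustrated with a small Python model. This is only
a sketch and not part of the checkpointing library: the function name
retained_checkpoints is hypothetical, and the sketch assumes that the oldest
checkpoints are the ones discarded once more than maxfiles exist, which the
description above does not state explicitly.

```python
from collections import deque


def retained_checkpoints(checkpoint_ids, maxfiles=1):
    """Return the checkpoints still on disk after all of checkpoint_ids are taken.

    Assumption (not stated in the manual): once more than maxfiles
    checkpoints exist, the oldest ones are discarded first.
    """
    if maxfiles < 1:
        raise ValueError("maxfiles must be a positive integer")
    kept = deque(maxlen=maxfiles)  # a bounded deque drops its oldest entry
    for ckpt in checkpoint_ids:
        kept.append(ckpt)
    return list(kept)
```

Under that assumption, taking checkpoints 1 through 4 with maxfiles set to 3
leaves checkpoints 2, 3, and 4 on disk.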
exclude
If ON (the default), exclude_bytes() and
include_bytes() take effect; otherwise, calls to these two
functions are null procedure calls that do nothing.
incremental
An ON value turns on the page-protection-based incremental
checkpointing feature. Under incremental checkpointing, the first checkpoint
is a full checkpoint, which saves the entire data section and the stack of
your program. It is followed by a number of incremental checkpoints,
which save only the parts of the address space that have changed since the
previous checkpoint. The number of incremental checkpoints is set by the
parameter maxincfiles. After maxincfiles incremental checkpoints, a full
checkpoint is taken again. For programs that touch only part of their
address space at some execution stages (e.g., iterating over one array in a loop),
this feature can substantially reduce checkpointing overhead by reducing the
amount of checkpoint data that needs to be saved on disk.
The default value for incremental is OFF.
maxincfiles
As mentioned above, this parameter tells the checkpointer how many
incremental checkpoints to take after a full checkpoint. After this many
incremental checkpoints, a full checkpoint is taken again, which is
then followed by another set of maxincfiles incremental checkpoints.
If incremental checkpointing is enabled, maxincfiles must be greater than zero.
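
The interplay between incremental and maxincfiles can be sketched as a small
Python model of the checkpoint sequence. This is an illustration only: the
function checkpoint_schedule is hypothetical and is not part of the
checkpointing library. It simply labels successive checkpoints according to
the rule described above.

```python
def checkpoint_schedule(num_checkpoints, incremental=False, maxincfiles=0):
    """Return a list of 'full'/'incremental' labels for successive checkpoints."""
    if not incremental:
        # With incremental OFF, every checkpoint is a full checkpoint.
        return ["full"] * num_checkpoints
    if maxincfiles <= 0:
        raise ValueError(
            "maxincfiles must be greater than zero when incremental is ON"
        )
    # One cycle is a full checkpoint followed by maxincfiles incremental ones.
    cycle = maxincfiles + 1
    return [
        "full" if i % cycle == 0 else "incremental"
        for i in range(num_checkpoints)
    ]
```

For example, with incremental ON and maxincfiles set to 2, the first seven
checkpoints are: full, incremental, incremental, full, incremental,
incremental, full.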
mintime
Sets the minimum time that must elapse between two consecutive
checkpoints. A timer is maintained on node 0 of your application. When all
nodes call checkpoint_here(CKPT_PERIODIC), they consult node 0
to check whether the timer has expired. If mintime has passed,
a checkpoint is initiated, after which the timer is again set to mintime.
A simple synchronization protocol ensures that either all nodes take this
periodic checkpoint or none does.
The default is 0, which disables the timer-based PERIODIC checkpointing.
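
The timer logic can be modelled roughly as follows. This Python sketch is not
the library's implementation: the class PeriodicTimer and the function
periodic_checkpoint_decision are hypothetical names, time is passed in
explicitly in unspecified units, and the sketch assumes the timer starts when
the application starts and is re-armed at the moment the decision is made.
The real synchronization protocol between nodes is not shown; the model only
captures its outcome, namely that every node receives the same answer.

```python
class PeriodicTimer:
    """Models the timer kept on node 0 (hypothetical, not library API)."""

    def __init__(self, mintime, now=0.0):
        self.mintime = mintime
        self.deadline = now + mintime  # assumed: timer starts at program start

    def should_checkpoint(self, now):
        # mintime == 0 disables timer-based periodic checkpointing.
        if self.mintime == 0:
            return False
        if now < self.deadline:
            return False
        # Timer expired: re-arm it for another mintime interval.
        self.deadline = now + self.mintime
        return True


def periodic_checkpoint_decision(timer, now, num_nodes):
    """Node 0 decides once; all nodes take the checkpoint or none does."""
    decision = timer.should_checkpoint(now)
    return [decision] * num_nodes
```

With mintime set to 10, a call at time 5 yields no checkpoint on any node, a
call at time 10 yields a checkpoint on every node, and the next checkpoint
cannot happen before time 20.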
singlefile
If ON, this parameter tells the checkpointer to save the
checkpoint data from all the nodes for one checkpoint to a single file. Writing
all the data to one file has a performance advantage in most cases. However, when
you run your program on more than 160 nodes, it is recommended that this
parameter be turned OFF. The default is ON.
maxiobuf
This sets the upper limit on the user-space buffer allocated for
doing checkpointing I/O. The buffer is allocated only once, prior to the
first checkpoint. The checkpointer tries to allocate a buffer whose size
is equal to the size of a full stripe on the PFS directory where the checkpoint
data is saved. This size gives checkpointing the maximum I/O performance.
Some PFS file systems, however, have a large striping factor
(e.g., 80 on Caltech's /pfs-sio), which would cause the checkpointer to
use up too much buffer space and sometimes results in heavy paging. Hence it
is recommended that maxiobuf be set to an appropriate value (around
1 megabyte). The default value is 0. In this case, the checkpointer uses
the full stripe size of the PFS file system as its buffer size.
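
The buffer sizing rule can be sketched in Python as follows. This is a model
for illustration, not the checkpointer's code: the function io_buffer_size is
hypothetical, and it assumes that the size of a full stripe equals the
striping unit size multiplied by the striping factor. The 64 KB striping unit
used in the example below is likewise only an illustrative value.

```python
def io_buffer_size(stripe_unit, stripe_factor, maxiobuf=0):
    """Return the checkpoint I/O buffer size in bytes.

    Assumption: a full stripe is stripe_unit * stripe_factor bytes.
    """
    full_stripe = stripe_unit * stripe_factor
    if maxiobuf == 0:
        # Default: no upper limit, so use the full stripe size.
        return full_stripe
    # Otherwise the buffer never exceeds the configured upper limit.
    return min(full_stripe, maxiobuf)
```

Under these assumptions, a striping factor of 80 with a 64 KB striping unit
gives a 5 MB buffer when maxiobuf is 0, while setting maxiobuf to 1048576
caps the buffer at 1 MB.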
maxhdrbuf
This sets the upper limit on the checkpointing header buffer, whose
size defaults to the striping unit size on your PFS, or to the file system
block size if you are saving the checkpoint data to a UFS directory.
cleanup
If ON, the checkpoint files will be removed after the program
successfully completes. The default is ON.
abortonerror
If ON, the checkpointer will force the program to abort when
encountering an error during checkpointing. The default is ON.