Single file mode
In this mode, the checkpoints from all nodes for the same generation are stored in ONE file. Checkpoints for different generations are stored in different files. The naming scheme for this mode is NXckpt.?.data, where "?" is a number calculated from the checkpoint generation.
Multiple file mode
Under this mode, each node stores its checkpoint for one checkpoint generation in a separate file. The files have names like NXckpt.?.node.#. The "?" is calculated from the checkpoint generation, while "#" refers to the node number.
The "?" in the checkpoint file name is the checkpoint generation modulo (maxfiles + 1), where maxfiles is specified in the startup script. The "+ 1" is needed because one extra file is used as temporary storage for the current, unfinished checkpoint. The maxfiles parameter defaults to 1, in which case the checkpoint file number "?" is either 0 or 1.
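The numbering rule above can be sketched in a few lines of Python. This is only an illustration of the arithmetic described in this section, not the checkpointer's own code; the function names file_number, single_file_name, and multi_file_name are hypothetical.

```python
def file_number(generation: int, maxfiles: int = 1) -> int:
    """Return the "?" part of a checkpoint file name (illustrative only)."""
    # maxfiles slots hold completed checkpoints; the extra slot holds the
    # current, unfinished checkpoint, hence the "+ 1".
    return generation % (maxfiles + 1)


def single_file_name(generation: int, maxfiles: int = 1) -> str:
    """Single file mode: one file per generation, shared by all nodes."""
    return f"NXckpt.{file_number(generation, maxfiles)}.data"


def multi_file_name(generation: int, node: int, maxfiles: int = 1) -> str:
    """Multiple file mode: one file per node for each generation."""
    return f"NXckpt.{file_number(generation, maxfiles)}.node.{node}"
```

With the default maxfiles of 1, consecutive generations simply alternate between file numbers 0 and 1, so the newest complete checkpoint is never overwritten while the next one is being written.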
With incremental checkpointing, the total number of files is maxfiles + maxincfiles + 1, because all the incremental checkpoints taken since the most recent full checkpoint have to be kept.
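As a rough sketch of this count, the following Python function adds up the file slots for a given configuration. The name total_file_slots is hypothetical, and treating maxincfiles = 0 as "no incremental checkpointing" is an assumption made for illustration. Whether the count applies per node in multiple file mode is not stated here, so the function only counts slots per "?" number.

```python
def total_file_slots(maxfiles: int = 1, maxincfiles: int = 0) -> int:
    """Number of checkpoint file slots kept on disk (illustrative only).

    maxfiles    -- full checkpoints to retain
    maxincfiles -- incremental checkpoints to retain (0 is assumed here
                   to mean incremental checkpointing is not used)
    """
    # One extra slot is always reserved for the checkpoint in progress.
    return maxfiles + maxincfiles + 1
```

For example, maxfiles = 1 without incrementals gives 2 slots, matching the file numbers 0 and 1 described above.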
Users rarely need to be concerned with these file name conventions. The files are managed by the checkpointer, and users should not rename them.
In the same directory where the checkpoints are stored, there is a control file named ckpt.cntl. It is a regular ASCII file, so you can view it with any text viewer.
Here is a sample directory listing of the checkpoint directory /pfs-sio/yuqun.
Here is a sample ckpt.cntl file. In this file, File Number Low specifies the earliest checkpoint generation that is available in this directory, and File Number High refers to the most recent valid checkpoint stored in the directory. In this specific example, MAXFILES is 1, so the corresponding file number for generation 10 (both File Number Low and File Number High) is 10 modulo (1 + 1) = 0.