Checkpointing OpenMPI applications
Installing and configuring the nodes
TODO: create a kameleon recipe
- Retrieve the fixed openmpi-checkpoint package and copy it to the node image directory:
shell# wget https://gforge.inria.fr/frs/download.php/30817/openmpi-checkpoint_1.4.2-4_amd64.deb -O /cm/debian/orig/tmp/openmpi-checkpoint_1.4.2-4_amd64.deb
- Install the required packages in the node image:
shell# chroot /cm/debian/orig
shell# apt-get install openmpi-bin blcr-dkms blcr-util
shell# dpkg -i /tmp/openmpi-checkpoint_1.4.2-4_amd64.deb
shell# rm /tmp/openmpi-checkpoint_1.4.2-4_amd64.deb
shell# exit
- Configure the blcr kernel module to be loaded at boot on the nodes:
- Make sure that there is a "blcr" line in /cm/debian/patch/etc/modules :
# /etc/modules: kernel modules to load at boot time.
#
# This file should contain the names of kernel modules that are
# to be loaded at boot time, one per line. Comments begin with
# a "#", and everything on the line after them are ignored.
#af_packet
blcr
softdog
rtc
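The check above can be scripted. The following is a minimal sketch, not part of the ComputeMode tooling: the function names has_module_line and ensure_module_line are hypothetical helpers introduced here for illustration, and they only assume that the modules file lists one module per line, as described in its header comment.

```shell
#!/bin/sh
# Hypothetical helpers (not shipped with ComputeMode or BLCR).

# Return 0 if MODULE appears on an uncommented line of MODULES_FILE.
has_module_line() {
    modules_file="$1"
    module="$2"
    # The module name must start the line (leading blanks allowed) and be
    # followed by whitespace or end of line, so "#blcr" does not match.
    grep -Eq "^[[:space:]]*${module}([[:space:]]|\$)" "$modules_file"
}

# Append MODULE to MODULES_FILE only if it is missing (idempotent).
ensure_module_line() {
    modules_file="$1"
    module="$2"
    if has_module_line "$modules_file" "$module"; then
        return 0
    fi
    # If the file does not end with a newline, add one first so the new
    # entry does not get glued to the last existing line.
    if [ -n "$(tail -c 1 "$modules_file")" ]; then
        printf '\n' >> "$modules_file"
    fi
    printf '%s\n' "$module" >> "$modules_file"
}
```

With these functions loaded, ensure_module_line /cm/debian/patch/etc/modules blcr would add the line only when it is absent. On a booted node, lsmod | grep blcr should then show the module as loaded.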
User configuration
- Create a directory to store the checkpoints in the NFS-shared home:
shell$ mkdir $HOME/checkpoints
- Create or edit the OpenMPI MCA configuration file $HOME/.openmpi/mca-params.conf with the following options:
- OpenMPI < 1.5.1 (Debian stable package)
# Remote snapshot directory (globally mounted file system)
snapc_base_global_snapshot_dir=/home/user/checkpoints
- OpenMPI >= 1.5.1
# Remote snapshot directory (globally mounted file system)
sstore_base_global_snapshot_dir=/home/user/checkpoints
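The choice between the two parameter names can be automated. This is a rough sketch under stated assumptions: snapshot_param_name and write_mca_params are hypothetical helper names, the OpenMPI version is passed in by hand rather than detected, and the comparison relies on sort -V from GNU coreutils.

```shell
#!/bin/sh
# Hypothetical helpers (not part of OpenMPI).

# Print the snapshot-directory parameter name for a given OpenMPI version.
snapshot_param_name() {
    version="$1"
    # sort -V orders version strings; if 1.5.1 sorts first (or ties),
    # the given version is >= 1.5.1.
    lowest=$(printf '%s\n%s\n' "$version" "1.5.1" | sort -V | head -n 1)
    if [ "$lowest" = "1.5.1" ]; then
        echo sstore_base_global_snapshot_dir
    else
        echo snapc_base_global_snapshot_dir
    fi
}

# Create the checkpoint directory and append the matching option to the
# MCA configuration file. Appending keeps any options already present.
write_mca_params() {
    version="$1"
    ckpt_dir="$2"
    conf_file="$3"
    mkdir -p "$ckpt_dir" "$(dirname "$conf_file")"
    {
        echo "# Remote snapshot directory (globally mounted file system)"
        echo "$(snapshot_param_name "$version")=$ckpt_dir"
    } >> "$conf_file"
}
```

For the Debian stable package mentioned above, a call could look like write_mca_params 1.4.2 "$HOME/checkpoints" "$HOME/.openmpi/mca-params.conf". Running it twice appends the option twice, so check the file afterwards if it already existed.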
Simple usage
The '-am ft-enable-cr' option must be passed to mpirun to enable checkpointing:
shell$ mpirun -am ft-enable-cr my-app <args>
At any time, call ompi-checkpoint with the PID of the running mpirun process to checkpoint the job:
shell$ ompi-checkpoint 2405
To restart a saved process, call ompi-restart with the basename (not the full path!) of the dump:
shell$ ompi-restart ompi_global_snapshot_2405.ckpt
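Since ompi-restart expects the basename rather than a path, two small helpers can avoid mistakes. This is only a sketch: snapshot_name_for_pid and snapshot_ref are hypothetical names, and the ompi_global_snapshot_<PID>.ckpt pattern is inferred from the example above, not taken from the OpenMPI documentation.

```shell
#!/bin/sh
# Hypothetical helpers (not part of OpenMPI).

# Build the snapshot name for a given mpirun PID, assuming the naming
# pattern seen in the example above (ompi_global_snapshot_<PID>.ckpt).
snapshot_name_for_pid() {
    printf 'ompi_global_snapshot_%s.ckpt\n' "$1"
}

# Reduce a snapshot path to the basename that ompi-restart expects.
snapshot_ref() {
    basename "$1"
}
```

For instance, if mpirun was started in the background from the same shell, ompi-checkpoint "$!" followed later by ompi-restart "$(snapshot_name_for_pid "$!")" would reuse the PID recorded by the shell, and ompi-restart "$(snapshot_ref "$HOME/checkpoints/ompi_global_snapshot_2405.ckpt")" would strip the directory part of a path.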
Using the start_checkpointed.pl wrapper script
TODO