HPC Clusters on GCP — Running LS-Dyna jobs on Slurm using Intel MPI
We recently started helping one of our clients explore a migration of a datacenter-hosted HPC workload from Sun Grid Engine to a Slurm cluster in the cloud. Creating a Slurm cluster itself is not a difficult task on GCP, thanks to the multiple options available. There is a marketplace offering that lets you spin up a cluster quickly, and SchedMD's Terraform template is easy to get started with. There is also the Cloud HPC Toolkit, which makes deploying a highly customizable, production-grade cluster easy.
An HPC cluster is not just the scheduler, but also the software that actually runs on it. In this case, among other things, we had to make sure we could run distributed LS-Dyna jobs using Intel MPI. A lot of things can go wrong when making these systems work together (a Slurm cluster running on Google Compute Engine, scheduling LS-Dyna jobs with the Intel Message Passing Interface (MPI) library), and they did go wrong. We thought it would be worth documenting how we went about it for the future random internet stranger, just as we benefitted from random internet strangers while trying to make this whole thing work. In this blog post, we will set up a Slurm cluster on GCE from scratch with Cloud Filestore as the shared storage, install and configure Intel MPI and LS-Dyna, and finally run a sample job.
Deploy a Slurm cluster
As mentioned above, there are multiple (easy) ways to create a Slurm cluster on GCP: the marketplace offering, the Terraform template, or the HPC Toolkit. For this blog post, we will use SchedMD's Terraform template to build a basic cluster. You can create the VPC and Cloud Filestore instance as part of the template itself, but in our example we are going to create them separately. Let's get started.
If you already have a VPC and subnets, feel free to skip this section. If you don’t have one already, follow along.
Configure your project and region in gcloud
gcloud auth login
gcloud config set project YOUR-PROJECT-NAME
gcloud config set compute/region asia-south1
We will create a VPC and one subnet in the Mumbai region, plus a firewall rule that allows all traffic within the subnet.
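As a sketch, the VPC and subnet can be created like this (the subnet name is my choice; the IP range matches the firewall rule's source range):

```shell
# create a custom-mode VPC for the cluster
gcloud compute networks create slurm-vpc --subnet-mode=custom

# one subnet in the Mumbai region
gcloud compute networks subnets create slurm-subnet \
    --network=slurm-vpc \
    --region=asia-south1 \
    --range=10.10.0.0/24
```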
gcloud compute --project=projectname firewall-rules create slurm-vpc-internal --direction=INGRESS --priority=1000 --network=slurm-vpc --action=ALLOW --rules=all --source-ranges=10.10.0.0/24
You might also want to create a NAT gateway for this VPC.
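Since the compute nodes typically have no external IPs, Cloud NAT gives them outbound internet access for things like package downloads. A sketch (the router and NAT names are my choice):

```shell
# Cloud NAT is attached to a Cloud Router in the same region
gcloud compute routers create slurm-router \
    --network=slurm-vpc --region=asia-south1

gcloud compute routers nats create slurm-nat \
    --router=slurm-router --region=asia-south1 \
    --auto-allocate-nat-external-ips \
    --nat-all-subnet-ip-ranges
```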
Go to the Cloud Filestore service in the GCP console and create a basic file share, choosing connectivity with the VPC we just created above. Also, configure the “private connection” under “advanced network options” to allow the Slurm nodes to freely mount the file share. I am naming this file share slurmfs.
Let’s now use the Terraform template to create a Slurm cluster. We will be creating a “basic”, small cluster with 2 compute nodes in the debug partition. You might want to use Cloud Shell for running terraform apply.
git clone https://github.com/SchedMD/slurm-gcp.git && cd slurm-gcp/terraform/slurm_cluster/examples/slurm_cluster/cloud/basic
You may choose the cluster type that makes the best sense for your scenario; for the sake of this blog post, we will choose a basic cluster. Edit example.tfvars using your favorite text editor with the project, VPC, filestore, and other necessary details.
Under network_storage, fill in the details of the Cloud Filestore instance you created above.
network_storage = [{
server_ip = "10.x.x.x"
remote_mount = "/slurmfs"
local_mount = "/data"
fs_type = "nfs"
mount_options = null
},
]
Similarly, fill in login_network_storage for the login node.
login_network_storage = [
{
server_ip = "10.x.x.x" #change this depending on your env
remote_mount = "/slurmfs"
local_mount = "/data"
fs_type = "nfs"
mount_options = null
},
]
Below is my example.tfvars file; you can make changes accordingly.
Run terraform
terraform init
terraform apply -var-file=example.tfvars
If everything went well, you should see the Slurm cluster up and running in your GCP console. Connect to the controller node (I am assuming you have the necessary firewall rules allowing SSH access via IAP).
My Slurm cluster is now up and running. So that you can follow along, here is what my cluster looks like: a controller node, a login node, and 2 compute nodes in the debug partition.
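For reference, sinfo on the login node reports something like the following (an illustrative sketch; the `~` suffix means slurm-gcp has the cloud nodes powered down until a job arrives):

```shell
$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      2  idle~ hpc-debug-worker-[0-1]
```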
Intel MPI (mpirun) and Host-based Passwordless SSH
Before we get to the reason behind configuring host-based SSH, we should understand MPI (Message Passing Interface) in the HPC world. MPI is a library specification for parallel computing that allows processes running across the nodes of a cluster to exchange information. In our scenario, we are going to run LS-Dyna jobs using the Intel MPI library's mpirun.
If you are new to the Slurm + Intel MPI combination, you are likely to spend a ton of time trying to trigger distributed jobs and running into multiple obscure error messages. Please note that Intel MPI requires passwordless authentication between all nodes (not just between the controller and compute nodes). Intel provides a utility, sshconnectivity.exp, along with the installation files, but I was not able to get it to work. I ended up configuring host-based authentication between the nodes.
Here is a great explanation of how host-based SSH authentication works, and this cookbook will come in handy too. We will collect the public host keys of all nodes in the cluster and then distribute them to /etc/ssh/ssh_known_hosts on each node.
SSH to the login node
gcloud compute ssh --zone "asia-south1-a" "hpc-login-zd18txv1-001" --tunnel-through-iap --project "PROJECTNAME"
Collect all the public host keys using ssh-keyscan and save them to a temporary file, preferably on the Cloud Filestore mount; in this post, it is mounted at /data/ on each node. This will allow you to easily copy the file between the nodes.
sudo -i
ssh-keyscan -t rsa hpc-controller hpc-debug-worker-0 hpc-debug-worker-1 > /data/known_hosts
You will end up with an output like this.
hpc-controller ssh-rsa AAAAB3NzaC1yc2.........
hpc-debug-worker-0 ssh-rsa AAAAB3NzaC1yc2E...........
hpc-debug-worker-1 ssh-rsa AAAAB3NzaC1yc2..........
To avoid running into name resolution errors, add all variants of each hostname. Open the temporary file and add the IP address and FQDN of each node (you can cat /etc/hosts to get the FQDNs). After making these changes, I ended up with the following content in the /data/known_hosts file.
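As an illustrative sketch (the IP addresses here are made up, and PROJECTID stands for your project): each line of ssh_known_hosts can carry comma-separated name variants for the same host key, so the result looks roughly like this:

```
10.10.0.5,hpc-controller,hpc-controller.c.PROJECTID.internal ssh-rsa AAAAB3NzaC1yc2.........
10.10.0.6,hpc-debug-worker-0,hpc-debug-worker-0.c.PROJECTID.internal ssh-rsa AAAAB3NzaC1yc2E...........
10.10.0.7,hpc-debug-worker-1,hpc-debug-worker-1.c.PROJECTID.internal ssh-rsa AAAAB3NzaC1yc2..........
```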
Now, copy this file to /etc/ssh/ssh_known_hosts on ALL nodes: hpc-controller, hpc-debug-worker-0, and hpc-debug-worker-1.
# SSH to each VM and copy the file
sudo cp /data/known_hosts /etc/ssh/ssh_known_hosts
Next, on all nodes, configure the following in /etc/ssh/ssh_config
sudo tee -a /etc/ssh/ssh_config << EOF
HostbasedAuthentication yes
EnableSSHKeysign yes
EOF
Next, on each node, enable host-based authentication in /etc/ssh/sshd_config
HostbasedAuthentication yes
On all nodes, restart sshd with systemctl restart sshd. Then, on all nodes, create the file /etc/ssh/shosts.equiv listing the hosts that are allowed to authenticate. In my case, I ended up with the following contents
hpc-controller.c.PROJECTID.internal
hpc-debug-worker-0.c.PROJECTID.internal
hpc-debug-worker-1.c.PROJECTID.internal
Finally, on each node, fix the SELinux context of these files
chcon system_u:object_r:etc_t:s0 /etc/ssh/sshd_config /etc/ssh/ssh_config /etc/ssh/shosts.equiv /etc/ssh/ssh_known_hosts
At this point, we should be good to go, but I ran into ssh-keysign related errors because the CentOS 7 image the Terraform template was using had incorrect permissions on the SSH host keys.
Fix permissions on each node
sudo chgrp ssh_keys /etc/ssh/*_key
sudo chmod g+r /etc/ssh/*_key
The updated permissions should look like this
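On CentOS 7, the host keys should be owned by root with group ssh_keys and be group-readable; the listing should look roughly like this (sizes and dates are illustrative):

```shell
$ ls -l /etc/ssh/*_key
-rw-r-----. 1 root ssh_keys  227 Jan 10 10:00 /etc/ssh/ssh_host_ecdsa_key
-rw-r-----. 1 root ssh_keys  387 Jan 10 10:00 /etc/ssh/ssh_host_ed25519_key
-rw-r-----. 1 root ssh_keys 1679 Jan 10 10:00 /etc/ssh/ssh_host_rsa_key
```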
Now, test host-based authentication from the controller (or any node) with ssh -vvv hpc-debug-worker-0. Remember to run ssh as a non-root user.
Install Intel MPI
Now that we are done with host-based authentication, let's set up Intel MPI on all the nodes.
Download the stand-alone offline installer from here
SSH to the hpc-controller node and download the installer
wget https://registrationcenter-download.intel.com/akdlm/irc_nas/18926/l_mpi_oneapi_p_2021.7.0.8711_offline.sh
chmod a+x l_mpi_oneapi_p_2021.7.0.8711_offline.sh
# start installation
./l_mpi_oneapi_p_2021.7.0.8711_offline.sh
The installation is pretty much self-explanatory.
Thankfully, Slurm takes care of propagating this installation to the other nodes as well.
Source the environment variables file and verify mpirun
source intel/oneapi/setvars.sh
mpirun --version
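If the environment is set up correctly, you should see output along these lines (the exact build string will depend on the version you downloaded):

```
Intel(R) MPI Library for Linux* OS, Version 2021.7 Build 20220909
Copyright 2003-2022, Intel Corporation.
```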
Install LS-Dyna
LS-Dyna requires a license to run. In this blog post, we will use an LS-Dyna R9.3.1 release with a network license. We assume that a license server already exists in your environment.
Download the LS-Dyna binary
wget https://www.oasys-software.com/dyna/wp-content/uploads/2018/01/ls-dyna_mpp_d_R9_3_1_x64_centos65_ifort131_sse2_platformmpi.tar.gz
tar -xvf ls-dyna_mpp_d_R9_3_1_x64_centos65_ifort131_sse2_platformmpi.tar.gz
mv ls-dyna_mpp_d_R9_3_1_x64_centos65_ifort131_sse2_platformmpi lsdyna-931
chmod a+x lsdyna-931
Run LS-Dyna job using mpirun
We can now test a distributed job using both srun and mpirun. There is a sample job available on the internet that you can download and run
wget https://hcc.unl.edu/docs/submitting_jobs/app_specific/job-examples/ls-dyna/ls-dyna.files/pendulum.k
Running the distributed job using srun: we can invoke the job with srun by specifying the --mpi=pmi2 option, which lets srun launch the Intel MPI ranks through Slurm's PMI2 interface instead of mpirun. You can also wrap this in an sbatch script and submit it.
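For example, a minimal sbatch wrapper for the same job might look like this (the install path, working directory, and node counts are assumptions based on the setup above):

```shell
#!/bin/bash
#SBATCH --job-name=pendulum
#SBATCH --partition=debug
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --output=pendulum_%j.out

# load the Intel MPI environment (install path assumed from the steps above)
source ~/intel/oneapi/setvars.sh
export LSTC_LICENSE=network
export LSTC_LICENSE_SERVER=10.50.16.4   # your license server IP

# srun launches the MPI ranks via PMI2
srun --mpi=pmi2 ./lsdyna-931 i=pendulum.k
```

Submit it with sbatch from the directory containing the binary and the input deck.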
source intel/oneapi/setvars.sh
export LSTC_LICENSE=network
export LSTC_LICENSE_SERVER=10.50.16.4 # this is our license server

# use srun to invoke the job on specific nodes
srun --nodelist=hpc-debug-worker-0,hpc-debug-worker-1 --mpi=pmi2 lsdyna-931 i=pendulum.k
And it ran successfully!
You may also trigger the same job using mpirun directly, specifying slurm as the bootstrap option
mpirun -bootstrap slurm -hosts hpc-debug-worker-0,hpc-debug-worker-1 ./lsdyna-931 i=pendulum.k
Hope you find this useful! Happy high-performance computing! :)
Here are some resources I found useful