HPC Clusters on GCP — Running LS-Dyna jobs on Slurm using Intel MPI

RK Kuppala
Published in The Cloudside View
Oct 3, 2022


We recently started helping one of our clients explore a migration from a datacenter-hosted HPC workload that uses Sun Grid Engine to a Slurm HPC cluster in the cloud. Creating a Slurm cluster itself is not a difficult task on GCP, thanks to the multiple options available. There is a marketplace offering that lets you spin up a cluster quickly, and SchedMD’s Terraform template is easy to get started with. There is also the Cloud HPC Toolkit, which makes deploying a highly customizable, production-grade cluster easy.

An HPC cluster is not just the scheduler; it is also the actual software that runs on it. In this case, among other things, we had to make sure we could run distributed LS-Dyna jobs using Intel MPI. A lot of things could go wrong while making these systems work together (a Slurm cluster running on Google Compute Engine, scheduling LS-Dyna jobs with the Intel Message Passing Interface, or MPI, library), and they did go wrong. We thought it would be useful to document how we went about it for the future random internet stranger, just as we benefited from random internet strangers while making this whole thing work. In this blog post, we will set up a Slurm cluster on GCE from scratch with Cloud Filestore as the shared storage, install and configure Intel MPI and LS-Dyna, and finally run a sample job.

Deploy a Slurm cluster

As mentioned above, there are multiple (easy) ways to create a Slurm cluster on GCP: the marketplace offering, the Terraform template, or the HPC Toolkit. For this blog post, we will use SchedMD’s Terraform template to build a basic cluster. You may create the VPC and Cloud Filestore as part of the template itself, but in our example we are going to create them separately. Let’s get started.

If you already have a VPC and subnets, feel free to skip this section. If you don’t have one already, follow along.

Configure your project and region in gcloud

gcloud auth login 
gcloud config set project YOUR-PROJECT-NAME
gcloud config set compute/region asia-south1

We will create a VPC and one subnet in the Mumbai region, along with a firewall rule that allows internal traffic within the subnet.
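A minimal sketch of the VPC and subnet, assuming the VPC is named slurm-vpc and the subnet uses the 10.10.0.0/24 range that the firewall rule below refers to:

gcloud compute networks create slurm-vpc --subnet-mode=custom
gcloud compute networks subnets create slurm-subnet --network=slurm-vpc --region=asia-south1 --range=10.10.0.0/24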

gcloud compute --project=projectname firewall-rules create slurm-vpc-internal --direction=INGRESS --priority=1000 --network=slurm-vpc --action=ALLOW --rules=all --source-ranges=10.10.0.0/24

You might also want to create a NAT gateway for this VPC.
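If you do, a Cloud Router plus a Cloud NAT configuration is enough; a quick sketch (the router and NAT names here are arbitrary placeholders):

gcloud compute routers create slurm-router --network=slurm-vpc --region=asia-south1
gcloud compute routers nats create slurm-nat --router=slurm-router --region=asia-south1 --auto-allocate-nat-external-ips --nat-all-subnet-ip-ranges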

Go to the Cloud Filestore service in GCP and create a basic file share, choosing connectivity with the VPC we just created above. Also, configure the “private connection” under “advanced network options” to allow Slurm nodes to freely mount the file share. I am naming this file share slurmfs.
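If you prefer the CLI to the console, something like the following should create an equivalent instance (the tier, capacity, and zone here are assumptions; adjust them to your needs):

gcloud filestore instances create slurmfs --zone=asia-south1-a --tier=BASIC_HDD --file-share=name=slurmfs,capacity=1TB --network=name=slurm-vpc,connect-mode=PRIVATE_SERVICE_ACCESS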

Let’s now use the Terraform template to create a Slurm cluster. We will be creating a small “basic” cluster with 2 compute nodes in the debug partition. You might want to use Cloud Shell for running terraform apply.

git clone https://github.com/SchedMD/slurm-gcp.git && cd slurm-gcp/terraform/slurm_cluster/examples/slurm_cluster/cloud/basic

You may choose the cluster type that makes the best sense for your scenario. For the sake of this blog post, we will choose a basic cluster. Edit example.tfvars with your favorite text editor, filling in the project, VPC, Filestore, and other necessary details.

Under network storage, fill in the details of the Cloud Filestore instance you created above:

network_storage = [{
  server_ip     = "10.x.x.x"
  remote_mount  = "/slurmfs"
  local_mount   = "/data"
  fs_type       = "nfs"
  mount_options = null
}]

Similarly, fill in the login node storage:

login_network_storage = [{
  server_ip     = "10.x.x.x" # change this depending on your env
  remote_mount  = "/slurmfs"
  local_mount   = "/data"
  fs_type       = "nfs"
  mount_options = null
}]

Below is my example.tfvars file. You can make changes accordingly.

Run terraform

terraform init
terraform apply -var-file=example.tfvars

If everything went well, you should see the Slurm cluster up and running in your GCP console. Connect to the controller node (I am assuming you have the necessary firewall rules in place allowing IAP access via SSH).
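Assuming the controller instance is named hpc-controller (as it is throughout this post), connecting over the IAP tunnel looks like this:

gcloud compute ssh --zone "asia-south1-a" "hpc-controller" --tunnel-through-iap --project "PROJECTNAME"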

My Slurm cluster is now up and running. Just so that you can follow along, here is what my cluster looks like: a controller node, a login node, and 2 compute nodes in the debug partition.
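Running sinfo on the controller should report something along these lines (the ~ suffix simply means the cloud compute nodes are powered down until a job needs them):

$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE  NODELIST
debug*       up   infinite      2  idle~  hpc-debug-worker-[0-1]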

Intel MPI (mpirun) and Host-based Passwordless SSH

Before we get to the reason for configuring host-based SSH, we should understand MPI (Message Passing Interface) in the HPC world. MPI is a library specification for parallel computing that lets processes running across the nodes of a cluster exchange data. In our scenario, we are going to run LS-Dyna jobs using the Intel MPI library and its launcher, mpirun.

If you are new to the Slurm + Intel MPI combination, you are likely to spend a ton of time trying to trigger distributed jobs and running into multiple obscure error messages. Please note that Intel MPI requires passwordless authentication between all nodes (not just between the controller and compute nodes). Intel ships a utility, sshconnectivity.exp, along with the installation files, but I was not able to get it to work. I ended up configuring host-based authentication between the nodes.

Here is a great explanation of how host-based SSH authentication works, and this cookbook will come in handy too. We will collect the public host keys of all nodes in the cluster and then distribute them to /etc/ssh/ssh_known_hosts on each node.

SSH to the login node

gcloud compute ssh --zone "asia-south1-a" "hpc-login-zd18txv1-001"  --tunnel-through-iap --project "PROJECTNAME"

Collect all the public host keys using ssh-keyscan and save them to a temporary file, preferably on the Cloud Filestore mount. In this post, the share is mounted at /data on each node, which makes it easy to copy the file between nodes.

sudo -i
ssh-keyscan -t rsa hpc-controller hpc-debug-worker-0 hpc-debug-worker-1 > /data/known_hosts

You will end up with an output like this.

hpc-controller ssh-rsa AAAAB3NzaC1yc2.........
hpc-debug-worker-0 ssh-rsa AAAAB3NzaC1yc2E...........
hpc-debug-worker-1 ssh-rsa AAAAB3NzaC1yc2..........

To avoid running into name resolution errors, add all variants of each hostname. Open the temporary known_hosts file and edit it to include the IP address and FQDN of each node. You can cat /etc/hosts to get the FQDNs. After making these changes, I ended up with content along the following lines in the /data/known_hosts file.
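Your keys, project ID, and addresses will of course differ; the important part is that the short name, FQDN, and IP appear as comma-separated aliases in front of each key, roughly like this:

hpc-controller,hpc-controller.c.PROJECTID.internal,10.x.x.x ssh-rsa AAAAB3NzaC1yc2.........
hpc-debug-worker-0,hpc-debug-worker-0.c.PROJECTID.internal,10.x.x.x ssh-rsa AAAAB3NzaC1yc2E...........
hpc-debug-worker-1,hpc-debug-worker-1.c.PROJECTID.internal,10.x.x.x ssh-rsa AAAAB3NzaC1yc2..........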

Now, copy this file to /etc/ssh/ssh_known_hosts on ALL nodes (hpc-controller, hpc-debug-worker-0, and hpc-debug-worker-1).

# SSH to each VM and copy the file
sudo cp /data/known_hosts /etc/ssh/ssh_known_hosts

Next, on all nodes, configure the following in /etc/ssh/ssh_config

sudo tee -a /etc/ssh/ssh_config << EOF
HostbasedAuthentication yes
EnableSSHKeysign yes
EOF

Next, on each node — enable host-based authentication in /etc/ssh/sshd_config

HostbasedAuthentication yes

On all nodes, restart sshd — systemctl restart sshd

On all nodes, create the file /etc/ssh/shosts.equiv (vi /etc/ssh/shosts.equiv). In my case, I ended up with the following contents:

hpc-controller.c.PROJECTID.internal
hpc-debug-worker-0.c.PROJECTID.internal
hpc-debug-worker-1.c.PROJECTID.internal

Finally, on each node, fix the SELinux context on these files

chcon system_u:object_r:etc_t:s0 /etc/ssh/sshd_config /etc/ssh/ssh_config /etc/ssh/shosts.equiv /etc/ssh/ssh_known_hosts

At this point, we should be good to go, but I ran into ssh_keysign-related errors because the CentOS 7 image the Terraform template was using had incorrect permissions on the host keys.

Fix permissions on each node

sudo chgrp ssh_keys /etc/ssh/*_key
sudo chmod g+r /etc/ssh/*_key

The updated permissions should look like this
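Each host key should now be owned by root, group ssh_keys, and group-readable; roughly (sizes and timestamps elided):

$ ls -l /etc/ssh/*_key
-rw-r----- 1 root ssh_keys ... /etc/ssh/ssh_host_ecdsa_key
-rw-r----- 1 root ssh_keys ... /etc/ssh/ssh_host_ed25519_key
-rw-r----- 1 root ssh_keys ... /etc/ssh/ssh_host_rsa_key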

Now, test host-based authentication from the controller (or any node): ssh -vvv hpc-debug-worker-0 . Remember to run ssh as a non-root user.

Install Intel MPI

Now that we are done with host-based authentication, let’s set up Intel MPI on all the nodes.

Download the stand-alone offline installer from here

SSH to the hpc-controller node and download

wget https://registrationcenter-download.intel.com/akdlm/irc_nas/18926/l_mpi_oneapi_p_2021.7.0.8711_offline.sh
chmod a+x l_mpi_oneapi_p_2021.7.0.8711_offline.sh
# start installation
./l_mpi_oneapi_p_2021.7.0.8711_offline.sh

The installation is pretty much self-explanatory.
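If you would rather script the installation (for example, to repeat it on several machines), the offline installer also supports a silent mode; these flags come from Intel’s installer documentation, so double-check them against your installer version:

./l_mpi_oneapi_p_2021.7.0.8711_offline.sh -a --silent --eula accept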

Thankfully, slurm takes care of propagating this installation to the other nodes as well.

Source the environment variables file and invoke mpirun to verify the installation

source intel/oneapi/setvars.sh 
mpirun --version

Install LS-Dyna

LS-Dyna requires a license to run. In this blog post, we will run the LS-Dyna R9.3.1 release with a network license. We assume that a license server already exists in your environment.

Download ls-dyna binary

wget https://www.oasys-software.com/dyna/wp-content/uploads/2018/01/ls-dyna_mpp_d_R9_3_1_x64_centos65_ifort131_sse2_platformmpi.tar.gz
tar -xvf ls-dyna_mpp_d_R9_3_1_x64_centos65_ifort131_sse2_platformmpi.tar.gz
mv ls-dyna_mpp_d_R9_3_1_x64_centos65_ifort131_sse2_platformmpi lsdyna-931
chmod a+x lsdyna-931

Run LS-Dyna job using mpirun

We can now test a distributed job using both srun and mpirun. There is a sample job on the internet that you can download and run

wget https://hcc.unl.edu/docs/submitting_jobs/app_specific/job-examples/ls-dyna/ls-dyna.files/pendulum.k

Running the distributed job using srun: we can invoke the job with srun by specifying --mpi=pmi2, which tells Slurm to launch the tasks through the PMI2 interface that Intel MPI understands. You can wrap this in an sbatch script and invoke it that way as well (a sketch follows below).

source intel/oneapi/setvars.sh
export LSTC_LICENSE=network
export LSTC_LICENSE_SERVER=10.50.16.4 # this is our license server
# use srun to invoke the job on specific nodes
srun --nodelist=hpc-debug-worker-0,hpc-debug-worker-1 --mpi=pmi2 lsdyna-931 i=pendulum.k

And it ran successfully!
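As mentioned above, the same srun invocation can also be wrapped in an sbatch script. Here is a minimal sketch, assuming the debug partition, one MPI rank per node, and the oneAPI install location from earlier; treat it as a starting point rather than a drop-in file:

#!/bin/bash
#SBATCH --job-name=pendulum
#SBATCH --partition=debug
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --output=pendulum_%j.out

# same environment and license setup as the interactive run;
# adjust the setvars.sh path to wherever you installed oneAPI
source ~/intel/oneapi/setvars.sh
export LSTC_LICENSE=network
export LSTC_LICENSE_SERVER=10.50.16.4

# launch LS-Dyna through Slurm's PMI2 interface
srun --mpi=pmi2 ./lsdyna-931 i=pendulum.k

Submit it with sbatch from a directory under the shared /data mount so that the lsdyna-931 binary and pendulum.k are visible on every node.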

You may also trigger the same job using mpirun directly, specifying slurm as the bootstrap option

mpirun -bootstrap slurm -hosts hpc-debug-worker-0,hpc-debug-worker-1  ./lsdyna-931 i=pendulum.k
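As far as we can tell, when mpirun is launched from inside a Slurm allocation (sbatch or salloc) with -bootstrap slurm, Intel MPI picks the node list up from Slurm itself, so the -hosts flag can usually be dropped, for example:

mpirun -bootstrap slurm -n 2 -ppn 1 ./lsdyna-931 i=pendulum.k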

Hope you find this useful! Happy high-performance computing! :)

Here are some resources I found useful
