SLURM Multinode on AWS
Launch Login Node
Prepare User Data
When launching a login node it is worth considering what user data options to provide. While it is not required, user data can provide powerful customisation at launch that can further streamline the cluster build process.
There are several options that can be added to change how the login node handles compute nodes that contact it on startup.

- Share the public SSH key with clients:
    - Instead of manually obtaining and sharing the root public SSH key (passwordless root SSH is required for the flight profile), add the line `SHAREPUBKEY=true` to share it over the local network.
- Add an auth key:
    - Add the line `AUTH_KEY=<string>`. The login node will then only accept incoming flight hunter nodes that provide a matching authorisation key.

For example:
```
#cloud-config
write_files:
  - content: |
      SHAREPUBKEY=true
      AUTH_KEY=banana
    path: /opt/flight/cloudinit.in
    permissions: '0600'
    owner: root:root
```
Info
More information on available user data options for Flight Solo can be found in the user data documentation.
Deploy
- Find the Flight Solo image here or by searching the marketplace for "Flight Solo".
- Click "Continue to Subscribe".
- Read the terms and conditions, then click "Continue to Configuration".
- Configure region, software version (if unsure, use the latest), and fulfillment option (if unsure, use the default), then click "Continue to Launch". Make sure the region is the same for all nodes to be used in a cluster.
- Click on "Usage Instructions" to see some instructions on how to get started, and a link to this documentation.
- Select the "Launch from Website" action.
- Choose an instance type to use.
- Choose VPC settings. Remember which VPC was used to create this instance, as it should also be used for any associated compute nodes.
- Choose a subnet. Remember which subnet was used to create this instance, as it should also be used for any associated compute nodes.
- A security group is needed to associate with all nodes in the cluster. It is recommended to use a security group with rules limiting traffic to:
    - HTTP
    - HTTPS
    - SSH
    - Port 8888
    - Ports 5900-5903
    - All traffic from within the security group should be allowed (this rule can only be added after the group is created)
Note
If you already have a security group which does this, use it here and make sure to use it again for the compute nodes. Otherwise, a security group can be made from the launch page, or through the security groups page
Describing exactly how to create a security group is out of scope for this documentation, but covered by the AWS documentation.
Tip
The seller's settings can be used as a reference for creating a security group for a Flight Solo cluster.
- After a security group has been made, click "Select existing security group" and select it from the drop-down menu.
- Choose which key pair to use. It is good practice for this to be the same on all nodes in a cluster.
- Click "Launch".
Alternatively, to set up a cluster from an imported image, you will need to import a Flight Solo image.
- Go to the EC2 instance console.
- Click "Launch" to go to the EC2 instance setup page.
- Set the number of instances to 1, and the instance name to something descriptive.
- Confirm that the region (top right, next to the username) is correct.
- In the "Application and OS Images" section, choose the "My AMIs" tab and select your imported Solo AMI.
- In the "Instance type" section, choose the required instance size.
- In the "Key pair" section, select a key pair to use. It is good practice to use the same key pair for the login and compute nodes.
- In the "Network settings" section, click the "Edit" button to set the network and subnet. Remember what these are, as they should be the same for any associated compute nodes.
- A security group is also needed to associate with all nodes in the cluster. It is recommended to use a security group with rules limiting traffic to:
    - HTTP
    - HTTPS
    - SSH
    - Port 8888
    - Ports 5900-5903
    - All traffic from within the security group should be allowed (this rule can only be added after the group is created)
Note
If you already have a security group which does this, use it here and make sure to use it again for the compute nodes. Otherwise, a security group can be made from the launch page, or through the security groups page
Describing exactly how to create a security group is out of scope for this documentation, but covered by the AWS documentation.
- After a security group has been made, click "Choose Existing" and select it from the drop-down menu.
- In the "Configure Storage" section, allocate as much storage as needed. 8GB is the minimum required for Flight Solo, so it is likely the compute nodes will not need much more than that, as the login node hosts most data.
- Finally, click "Launch Instance".
Launch Compute Nodes
Prepare User Data
Setting up compute nodes is done slightly differently from the login node. The basic steps are the same, except that the subnet, network, and security group need to match those used for the login node.
This is the smallest amount of cloud-init data necessary. It allows the login node to find the compute nodes, as long as they are on the same network, and to SSH into them as the root user (which is necessary for setup):
```
#cloud-config
users:
  - default
  - name: root
    ssh_authorized_keys:
      - <Content of ~/.ssh/id_alcescluster.pub from root user on login node>
```
Tip
The above is not required if the `SHAREPUBKEY` option was provided to the login node. In that case, the `SERVER` option provided to the compute node will be enough to enable root access from the login node.
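For example, if `SHAREPUBKEY=true` was used on the login node, a compute node's user data can shrink to just the flight options file. This is a sketch, and the IP below is a placeholder for your login node's private IP:

```
#cloud-config
write_files:
  - content: |
      SERVER=10.10.0.1
    path: /opt/flight/cloudinit.in
    permissions: '0600'
    owner: root:root
```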
There are several options that can be added to change how a compute node contacts the login node on startup.

- Sending to a specific server:
    - Instead of broadcasting across a range, add the line `SERVER=<private server IP>` to send the hunter packet specifically to that node, which would be your login node.
- Add an auth key:
    - Add the line `AUTH_KEY=<string>`. The compute node will then send its flight hunter packet with this key, which must match the auth key provided to your login node.

For example:
```
#cloud-config
write_files:
  - content: |
      SERVER=10.10.0.1
      AUTH_KEY=banana
    path: /opt/flight/cloudinit.in
    permissions: '0600'
    owner: root:root
users:
  - default
  - name: root
    ssh_authorized_keys:
      - <Content of ~/.ssh/id_alcescluster.pub from root user on login node>
```
Info
More information on available user data options for Flight Solo can be found in the user data documentation.
Deploy
- Go to the EC2 instance setup page through the marketplace.
- Find the Flight Solo image here or by searching the marketplace for "Flight Solo".
- Click "Continue to Subscribe".
- Read the terms and conditions, then click "Continue to Configuration".
- Configure region, software version (if unsure, use the latest), and fulfillment option (if unsure, use the default), then click "Continue to Launch". Make sure the region is the same for all nodes to be used in a cluster.
- Click on "Usage Instructions" to see some instructions on how to get started, and a link to this documentation.
- Select the "Launch from EC2" action.
- Click "Launch" to go to the EC2 instance setup page.
- Set the instance name and number of instances.
- Confirm that the region (top right, next to the username) is the same as the region the login node was created in.
- In the "Application and OS Images" section, confirm that Flight Solo is the selected AMI.
- In the "Instance type" section, choose the required instance size.
- In the "Key pair" section, select a key pair to use. It is good practice to use the same key pair for the login and compute nodes.
- In the "Network settings" section, select the same network, subnet, and security group as the login node. To change the network and subnet, click the "Edit" button, and then use the drop-downs to find the correct network and subnet.
- In the "Configure Storage" section, allocate as much storage as needed. 8GB is the minimum required for Flight Solo, so it is likely the compute nodes will not need much more than that, as the login node hosts most data.
- In the "Advanced details" section there are many settings; at the bottom is a text box labeled "User data".
- Write a cloud-init script in the user data section, see here for details:
    - To get the information necessary for the cloud-init script, go to the EC2 console. Make sure your region is set to the one used for the login and compute nodes.
    - Select the created login node to see more details about it, including the private IP.
    - Log in to the login node.
    - Become the root user and open the file `~/.ssh/id_alcescluster.pub`, then copy the contents into the cloud-init script.

Tip
If the login node was launched with the `SHAREPUBKEY` option, there is no need to perform the key-copying steps above, as this will be handled automatically.
- Back on the compute node creation page, click "Launch Instance".
Note
Repeat this process for any other types of nodes that need to be added to the cluster.
- Go to the EC2 instance console.
- Click "Launch Instance" to go to the EC2 instance setup page.
- Set the instance name and number of instances.
- Confirm that the region (top right, next to the username) is the same as the region the login node was created in.
- In the "Application and OS Images" section, choose the "My AMIs" tab and select your imported Solo AMI.
- In the "Instance type" section, choose the required instance size.
- In the "Key pair" section, select a key pair to use. It is good practice to use the same key pair for the login and compute nodes.
- In the "Network settings" section, select the same network, subnet, and security group as the login node. To change the network and subnet, click the "Edit" button, and then use the drop-downs to find the correct network and subnet.
- In the "Configure Storage" section, allocate as much storage as needed. 8GB is the minimum required for Flight Solo, so it is likely the compute nodes will not need much more than that, as the login node hosts most data.
- In the "Advanced details" section there are many settings; at the bottom is a text box labeled "User data".
- Write a cloud-init script in the user data section, see here for details:
    - To get the information necessary for the cloud-init script, go to the EC2 console.
    - Select the created login node to see more details about it, including the private IP.
    - Log in to the login node.
    - Become the root user and open the file `~/.ssh/id_alcescluster.pub`, then copy the contents into the cloud-init script.

Tip
If the login node was launched with the `SHAREPUBKEY` option, there is no need to perform the key-copying steps above, as this will be handled automatically.
- Back on the compute node creation page, click "Launch Instance".
Note
Repeat this process for any other types of nodes that need to be added to the cluster.
General Configuration
Create Node Inventory
- Parse your node(s) with the command `flight hunter parse`.
    - This will display a list of hunted nodes, for example:
```
[flight@login-node.novalocal ~]$ flight hunter parse
Select nodes:
(Scroll for more nodes)
‣ ⬡ login-node.novalocal - 10.10.0.1
  ⬡ compute-node-1.novalocal - 10.10.101.1
```
- Select the desired node to be parsed with Space, and you will be taken to the label editor:
```
Choose label: login-node.novalocal
```
- Here, you can edit the label like plain text:
```
Choose label: login1
```
Tip
You can clear the current node name by pressing Down in the label editor.
- When done editing, press Enter to save. The modified node label will appear next to the IP address and original node label:
```
Select nodes: login-node.novalocal - 10.10.0.1 (login1)
(Scroll for more nodes)
‣ ⬢ login-node.novalocal - 10.10.0.1 (login1)
  ⬡ compute-node-1.novalocal - 10.10.101.1
```
- From this point, you can either hit Enter to finish parsing and process the selected nodes, or continue changing nodes. Either way, you can return to this list by running `flight hunter parse`.
- Save the node inventory before moving on to the next step.
Tip
See `flight hunter parse -h` for more ways to parse nodes.
- Add genders
    - Optionally, you may add genders to the newly parsed node. For example, if the node should have the genders `cluster` and `all`, run the command:
```
flight hunter modify-groups --add cluster,all login1
```
SLURM Multinode Configuration
- Configure the profile:
```
flight profile configure
```
    - This brings up a UI where several options need to be set. Use the up and down arrow keys to scroll through options and Enter to move to the next option. Options in brackets coloured yellow are the default options that will be applied if nothing is entered.
        - Cluster type: The type of cluster setup needed, in this case `Slurm Multinode`.
        - Cluster name: The name of the cluster.
        - Setup Multi User Environment with IPA?: Boolean value that determines whether to configure a multi-user environment with IPA. If set to true, the following will also need to be filled in:
            - IPA domain: The domain for the IPA server to use.
            - IPA secure admin password: The password used by the `admin` user of the IPA installation to manage the server.
        - Default user: The user that you log in with.
        - Set user password: Set a password to be used for the chosen default user.
        - IP or FQDN for Web Access: As described here, this could be the public IP or public hostname.
        - IP range of compute nodes: The IP range of the compute nodes used; remember to add the netmask, e.g. `172.31.16.0/20`.
- Apply identities by running the command `flight profile apply`.
    - First, apply an identity to the login node:
```
flight profile apply login1 login
```
    - Wait for the login node identity to finish applying. You can check the status of all nodes with `flight profile list`.

Tip
You can watch the progress of the application with `flight profile view login1 --watch`.
- Apply an identity to each of the compute nodes (in this example, genders-style syntax is used to apply to `node01` and `node02`):
```
flight profile apply node[01-02] compute
```
Tip
You can check all available identities for the current profile with `flight profile identities`.
Success
Congratulations, you've now created a SLURM Multinode environment! Learn more about SLURM in the HPC Environment docs.
Verifying Functionality
- Create a file called `simplejobscript.sh` and copy this into it:
```
#!/bin/bash -l
echo "Starting running on host $HOSTNAME"
sleep 30
echo "Finished running - goodbye from $HOSTNAME"
```
- Run the script with `sbatch simplejobscript.sh`. To test all your nodes, try queuing up enough jobs that every node will have to run one.
In the directory that the job was submitted from there should be a
slurm-X.out
whereX
is the Job ID returned from thesbatch
command. This will contain the echo messages from the script created in step 1