Campus Cluster

General

How to log in

ssh <username>@cc-login.campuscluster.illinois.edu

Here <username> is your University NetID, i.e. whatever comes before the @ in your official university email address (<username>@illinois.edu). The password is your official university password.

Scratch folder

You have 5 GB in your home directory; this is enough for code etc. but not enough to run anything. Run simulations on /scratch instead:

cd /scratch/users/<username>/
# or
cd scratch # from home

The scratch space is not backed up, and files older than 30 days are deleted on a regular basis (every day). Copy data out and make backups!
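To check how much space you are currently using, the standard du command works (the cluster may also provide its own quota tools):

du -sh $HOME                        # size of your home directory
du -sh /scratch/users/<username>    # size of your scratch space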

Until you are certain of what you want to keep and discard from /scratch/, running the following script once per week should prevent any files in /scratch/ from being deleted.

Copy the following into a new file in the home directory and name it “touch_old_files.sh”

#!/bin/bash
# touch every file in scratch that is two weeks old or older,
# so it does not reach the 30-day deletion limit
find /scratch/users/<USERNAME_HERE> -type f -mtime +14 | while read -r f
do
	echo "$f"
	stat -c %y "$f"
	touch "$f"
done

Make it executable with chmod +x touch_old_files.sh and run it with ./touch_old_files.sh (or simply bash touch_old_files.sh).
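If you want to run it automatically once per week (assuming cron is available to you on the login node; check the cluster's usage rules first), a crontab entry could look like this:

# edit your crontab with: crontab -e
# run the touch script every Sunday at 03:00
0 3 * * 0 bash $HOME/touch_old_files.sh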

There is now a /projects/statt folder, at the same level as scratch, where files will not be deleted automatically. This is a 12 TB space shared among all group members.

Transfer files

Use rsync or scp. Both commands can transfer to and from the cluster and are usually run from the laptop that connects to the cluster, not from the cluster itself. The reason is that the remote end of the transfer needs a static IP address or hostname, which laptops usually do not have. All PCs in the lab do have a static IP and hostname.

# rsync [options] [source path] [destination path]
rsync -avr ./* <username>@cc-login.campuscluster.illinois.edu:/scratch/users/<username>/...

Notable options:

  • -r: recursive (descends into sub-directories; required if you are syncing a directory)
  • -a: archive mode (implies -r and preserves modification times, permissions, symlinks, etc.)
  • -v: verbose (prints information about the sync as it is running)
  • -h: human-readable (prints sizes in a human-readable format)
  • -i: itemize changes (prints a detailed summary of what is changed for each file)
  • -n: dry run (does NOT do the transfer, but shows what it would have done)

For the -i option, refer to http://www.staroceans.org/e-book/understanding-the-output-of-rsync-itemize-changes.html to understand the output.
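For example, a dry run that shows what would be copied to the cluster without actually transferring anything (the my_project paths are just placeholders):

rsync -avhn ./my_project/ <username>@cc-login.campuscluster.illinois.edu:/scratch/users/<username>/my_project/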

Alternatively, to transfer a file from the cluster to your local computer, you can run the following command in a local terminal:

# scp [options] [source path] [destination path]
scp <username>@cc-xfer.campuscluster.illinois.edu:/home/<username>/scratch/<filename> /<directory at your local computer to store the file>

Notable options:

  • -r: recursive (copies sub-directories and files recursively; required if you are copying a directory)
  • … and probably more

When transferring a lot of files, use the cc-xfer.campuscluster.illinois.edu nodes instead.

See also the Campus Cluster help pages on storage, transfer, and data.

Setting up a Python Environment

First-time users of the campus cluster need to set up a personal conda environment to access the majority of Python and conda functionality. The following snippet creates and activates such an environment:

module load anaconda/2022-May/3
conda init bash    # then log out and back in, or source ~/.bashrc
conda create --name <environment_name_here>
conda activate <environment_name_here>
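Packages can then be installed into the environment as needed, for example (the package list here is just an illustration):

conda install numpy scipy matplotlib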

There is also a premade conda environment with many of the required packages. Activate via:

conda activate /projects/statt/software/conda/hoomd

Software requirements

Modules: cuda/11.7 anaconda/3 (These should be loaded automatically when loading the hoomd modules)

Conda: python 3.10, numpy, scipy, freud, networkx, omnia::eigen3, pybind11, hoomd 4.0.1, gsd, signac 2.X, signac-flow (all installed on cluster in conda environment /projects/statt/software/conda/hoomd)

On cluster:

module load /projects/statt/modulefiles/hoomd/4.0.1
conda activate /projects/statt/software/conda/hoomd
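For running a simulation through SLURM (see the next section), the same two lines go into the job script before the actual simulation command. A minimal sketch, where run.py is just a placeholder for your own script:

#!/bin/bash
#SBATCH --job-name=hoomd_run
#SBATCH --time=01:00:00
#SBATCH --partition=secondary
#SBATCH --ntasks=1

# load the hoomd module (pulls in cuda and anaconda) and activate the shared environment
module load /projects/statt/modulefiles/hoomd/4.0.1
conda activate /projects/statt/software/conda/hoomd

python run.py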

Basic SLURM

The cluster uses a job management system called SLURM; almost all SLURM commands start with an 's'. The Campus Cluster guides have information on all of this, but here are some useful commands:

sinfo                 # shows information about all available queues
squeue                # shows all running/submitted jobs (of everyone)
squeue -u <username>  # shows the jobs of <username>, i.e. yourself
sbatch <script>       # submits a job script for later execution

Useful SLURM snippets

This shows detailed information about one particular job (running or not):

scontrol show jobid -dd <jobid>

Replace <jobid> with the job number.

Using one of the following two commands, you can see the available partitions on the campus cluster and the GPUs available to run your simulations, as well as time limits and available nodes.

sinfo -o "%20N %10c %10m %25f %10G %R" | grep gpu
sinfo -s -o "%.14R %.12l %.12L %.5D"

You can cancel ALL your jobs with scancel -u <username>
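To cancel a single job instead, pass its job ID:

scancel <jobid>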

Many jobs with one SLURM script

If you have many “small” or identical jobs with similar runtimes, you can submit ONE slurm job that runs them in batches. The jobs are assumed to be independent, e.g. they don't communicate with each other and don't rely on each other's output. There are multiple ways of doing this; see https://www.chpc.utah.edu/documentation/software/serial-jobs.php for more information.

The simplest is if you want to run everything on ONE node (forced by #SBATCH --nodes=1-1):

#!/bin/bash
 
#SBATCH --job-name=test
#SBATCH --output=%x_%J.out
#SBATCH --error=%x_%J.err
#SBATCH --time=00:02:00
#SBATCH --partition=secondary
#SBATCH --nodes=1-1 
#SBATCH --ntasks=8
#SBATCH --mem=2GB
 
# just printing some information
echo $SLURM_JOB_ID 
echo "nodelist SLURM"
echo $SLURM_JOB_NODELIST 
echo $SLURM_NNODES
echo $SLURM_NTASKS 
 
# should be done in for loop (assuming 1 CPU per task here)
python simple_test.py first 13 &
python simple_test.py second 1 & 
python simple_test.py third 7 &
python simple_test.py fourth 1 &
python simple_test.py fifth 1 &
python simple_test.py sixth 15 &
python simple_test.py seven 10 &
python simple_test.py last 1 &
 
wait 

Here, the ampersand & puts the actual task (here a test python script) into the background, and wait ensures that the script waits until all jobs are completed. It is important to note that $SLURM_NNODES must equal one, and $SLURM_JOB_NODELIST should only have one entry, otherwise it will not work. You can run as many as $SLURM_NTASKS tasks at the same time; here this is set to 8 by #SBATCH --ntasks=8.

Submit this script as usual with sbatch run_job.sh. This approach limits the total number of simultaneous tasks to whatever the CPU count of a node is on the cluster, and it may also result in a longer waiting time, since slurm needs to wait until a whole node (or a part of a node with 8 free CPUs) is available. The python commands above are written out one by one for clarity; a loop version is sketched below.
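A loop version of the same eight background tasks could look like this (a sketch; the names and sleep times are the same as above, and $args is left unquoted on purpose so it splits into the two arguments):

for args in "first 13" "second 1" "third 7" "fourth 1" "fifth 1" "sixth 15" "seven 10" "last 1"
do
    python simple_test.py $args &
done
wait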

Alternatively, one can use a built-in slurm functionality:

#!/bin/bash
 
#SBATCH --job-name=test
#SBATCH --output=%x_%J.out
#SBATCH --error=%x_%J.err
#SBATCH --time=00:02:00
#SBATCH --partition=secondary
#SBATCH --ntasks=8
#SBATCH --mem=2GB
 
echo "jobid SLURM"
echo $SLURM_JOB_ID 
echo "nodelist SLURM"
echo $SLURM_JOB_NODELIST 
echo $SLURM_NNODES
echo $SLURM_NODEID
echo $SLURM_NTASKS 
 
srun --multi-prog conf_for_jobs

Here, we use srun --multi-prog, which lets slurm execute multiple tasks at the same time. This script can be submitted the usual way with sbatch run_job.sh. The file conf_for_jobs is a configuration file that should have three fields per line, separated by spaces: task number, executable, arguments to the executable. So, for example:

0 python simple_test.py first 5
1 python simple_test.py second 1
2 python simple_test.py third 3
3 python simple_test.py fourth 10
4 python simple_test.py fifth 1
5 python simple_test.py sixth 10
6 python simple_test.py seventh 3
7 python simple_test.py eighth 1

This will work on any number of nodes, so slurm can spread the tasks out and the job might spend less time in the queue. The test script simple_test.py used here doesn't do anything useful; it just prints where it is running and the arguments given, and then sleeps for some seconds:

import sys
import time
import socket
 
print("running on ", socket.gethostname())
time.sleep(1)
print("arguments given")
print(str(sys.argv))
time.sleep(int(sys.argv[-1]))
print("DONE")

When testing, it is useful to print a bunch of slurm variables (e.g. $SLURM_JOB_NODELIST) and to use a simple script like the one above to check if and where it is running (socket.gethostname() prints the name of the node it is running on).

Queue/Node info

Everyone should have access to secondary, statt, GSEG_CC, and eng-research. Type sinfo to see.

Nodes that have GPUs in them on secondary:

ccc[0038,0093-0096,0130,0156,0158,0183,0203-0206,0212,0215,0278-0279,0286-0298,0303-0305,0308-0311,0324,0333-0337]
golub[001-002,101-104,121-128,133-136,138,159-162,167-170,175-178,209-210,223-226,283,292-298,305,346-359,368-369,374,378-380]

Nodes                Type      Amount   Queue
ccc0324              RTXA60    1        secondary
ccc0333              A40       3        secondary
ccc[0312-0316]       A10       8        eng-research-gpu
golub[346-349]       K80       4        secondary
golub[305,378-380]   K80       4        secondary
ccc[0035-0036]       K80       4        secondary-eth
ccc0037              P100      4        secondary-eth
ccc0060              V100      2        secondary-eth
ccc[0076-0077]       V100      2        secondary-eth
ccc[0078-0084]       V100      2        eng-research-gpu
ccc0215              V100      1        secondary
ccc[0286-0287]       TeslaT4   1        GSEG_CC,secondary,secondary-sg
golub[121-128]       TeslaK40  1        secondary
A similar list can be generated with:

sinfo -o "%20N  %10c  %10m  %25f  %10G " | grep -v "NoGPU"

Shared Project storage

The group has 12 TB of shared storage on the cluster. Please be nice and don't store excessive amounts of data that are not needed. In your home, execute

ln -s /projects/statt project 

to create a symlink to our shared project space. You don't have to do this; you can also just type cd /projects/statt/folder-you-want-to-go-to every time you want to switch folders. The symlink allows you to shortcut this to cd project/folder-you-want-to-go-to instead.

We have some hoomd and azplugins installations pre-installed as modules on the shared project storage space. These are in the folders /projects/statt/modulefiles and /projects/statt/bin.

Do not write or modify files in /projects/statt/modulefiles and /projects/statt/bin without explicit discussion and permission!

Shared modules

CUDA 8.0 works with GCC 7 and 8; CUDA 11.7 works with GCC <= 11.

Currently Loaded Modulefiles (for compiling hoomd 2.9.7):

1) cuda/11.7             2) anaconda/2023-Mar/3   3) gcc/11.2.0       4) cmake/3.18.4
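To load them in one go (assuming these module versions are still current):

module load cuda/11.7 anaconda/2023-Mar/3 gcc/11.2.0 cmake/3.18.4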

Accessing the Cluster without a Password

*The following was copy-pasted from ChatGPT

To avoid entering your password every time you SSH into a remote server, you can set up SSH key-based authentication. This involves generating a public/private key pair and adding the public key to the authorized_keys file on the remote server. Here's a step-by-step guide to setting this up:

1) Generate SSH Key Pair: Run the following command on your local machine to generate an SSH key pair if you haven't already:

ssh-keygen -t rsa

This command will prompt you to choose a location to save the key pair (leaving it empty will choose a default location) and to set a passphrase (you can leave it empty for passwordless access, but adding a passphrase enhances security).

2) Copy Public Key to Remote Server: Use the ssh-copy-id command to copy your public key to the remote server. Replace username and remote_server_ip with your actual username and the IP address of the remote server:

ssh-copy-id username@remote_server_ip

This command will prompt you for your password on the remote server. After entering your password, your public key will be added to the authorized_keys file on the remote server.

2 alt) If ssh-copy-id is not available for you, you can manually copy the public key by appending the content of your local ~/.ssh/id_rsa.pub file to the ~/.ssh/authorized_keys file on the remote server.

Configure SSH on your local machine: if ssh-agent is not running, start it:

eval $(ssh-agent)

Add your private key to the SSH agent:

ssh-add ~/.ssh/id_rsa

3) Test SSH Connection: Try SSHing into the remote server again:

ssh username@remote_server_ip

You should now be able to SSH into the remote server without being prompted for a password. Remember, while this setup improves convenience, it's important to ensure the security of your private key. Keep it safe and don't share it with anyone you don't trust.
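Optionally, you can also add a host alias to your local ~/.ssh/config so that a short name works for ssh, scp, and rsync (the alias cc and the key path are examples; adjust to your setup):

Host cc
    HostName cc-login.campuscluster.illinois.edu
    User <username>
    IdentityFile ~/.ssh/id_rsa

Afterwards, ssh cc logs you in directly.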
