Running Jobs with Slurm¶
Slurm (Simple Linux Utility for Resource Management) is the workload manager. You describe what resources your job needs (CPUs, memory, time, GPUs), and Slurm schedules your job to run when those resources are available. You never SSH directly to a compute node — Slurm handles that for you.
Key Slurm Concepts¶
- Partition: A logical group of nodes, often differentiated by hardware (e.g., general, astro, teraram, teslap100). Choose the right one for your job.
- Job: A unit of work submitted to Slurm. Can be a single task or a complex multi-node parallel job.
- Allocation: The set of resources Slurm reserves for your job.
Useful Slurm Commands¶
# View the job queue
squeue
# View only your jobs
squeue -u your_username
# View available partitions and their status
sinfo
# Cancel a job
scancel JOB_ID
# Detailed accounting for a completed job
sacct -j JOB_ID --format=JobID,JobName,Elapsed,CPUTime,MaxRSS,State
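If a job sits in the queue longer than expected, scontrol prints the full job record, including the Reason field that explains why it is still pending:
# Show full details for one job, including why it is pending
scontrol show job JOB_ID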
Batch Jobs¶
A batch job is the standard workflow: you write a shell script describing your job and its resource requirements, submit it, and retrieve results when it finishes. This is ideal for long runs, parameter sweeps, and any job that does not need interactive supervision.
Anatomy of a Slurm Batch Script¶
Lines beginning with #SBATCH are directives to the Slurm scheduler: they are not executed as shell commands but are parsed by Slurm before your job starts.
#!/bin/bash
# ─────────────────────────────────────────────────────
# Slurm directives
# ─────────────────────────────────────────────────────
#SBATCH --job-name=thrust_nnlo # A human-readable name for your job
#SBATCH --partition=general # Which partition to run on
#SBATCH --nodes=1 # Number of compute nodes
#SBATCH --ntasks=1 # Number of parallel tasks (MPI ranks)
#SBATCH --cpus-per-task=8 # CPU cores per task (for OpenMP/threads)
#SBATCH --mem=16G # Total memory for the job
#SBATCH --time=04:00:00             # Maximum wall time (HH:MM:SS, or days-HH:MM:SS)
#SBATCH --output=logs/job_%j.out # Standard output (%j = job ID)
#SBATCH --error=logs/job_%j.err # Standard error
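# Note: Slurm does not create the logs/ directory automatically; run
# "mkdir -p logs" before submitting, or the output files cannot be written.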
#SBATCH --mail-type=END,FAIL # Email on completion or failure
#SBATCH --mail-user=user@lcm.mi.infn.it
# ─────────────────────────────────────────────────────
# Environment setup
# ─────────────────────────────────────────────────────
module purge
module load gcc/13
module load mpi/mpich-x86_64
# ─────────────────────────────────────────────────────
# Job execution
# ─────────────────────────────────────────────────────
# Set OpenMP thread count to match requested CPUs
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
echo "Job started on $(hostname) at $(date)"
echo "Running with $SLURM_CPUS_PER_TASK threads"
# Change to the node-local temporary directory for faster I/O
cd $SLURM_TMPDIR
# Run your compiled executable
srun thrust_calculator --order NNLO --output results_nnlo.dat
# Copy the results back to the submission directory; node-local
# scratch is typically cleaned up when the job ends
cp results_nnlo.dat $SLURM_SUBMIT_DIR/
echo "Job finished at $(date)"
Submit the script with sbatch; assuming it is saved as my_job.sh:
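sbatch my_job.sh
# Slurm confirms with the assigned job ID, e.g.:
# Submitted batch job 123456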
A Fortran/C++ Compilation + Run Example¶
Because you cannot compile on the login node, a common pattern is to use a short "build job" first, then a "run job":
#!/bin/bash
#SBATCH --job-name=compile_mc
#SBATCH --partition=general
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=4G
#SBATCH --time=00:30:00
#SBATCH --output=logs/compile_%j.out
#SBATCH --mail-type=END,FAIL # Email on completion or failure
#SBATCH --mail-user=user@lcm.mi.infn.it
module purge
module load gcc/15
cd /home/your_username/myproject
mkdir -p build && cd build
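# Note: -march=native (below) optimizes for the CPU of the node running this
# build job; this is only safe if all compute nodes share the same architecture.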
cmake .. -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS="-O3 -march=native"
make -j $SLURM_CPUS_PER_TASK
echo "Build complete."
Once the build job finishes (check with squeue or wait for the email), your
executable is ready and you can submit your production run job.
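Alternatively, chain the two steps with a job dependency, so the production job is queued right away but starts only once the build job exits successfully. A minimal sketch, assuming the build script is saved as compile_mc.sh and run_mc.sh is your production script:
# Submit the build job; --parsable makes sbatch print just the job ID
BUILD_ID=$(sbatch --parsable compile_mc.sh)
# Queue the production job; afterok = start only if the build exits with code 0
sbatch --dependency=afterok:$BUILD_ID run_mc.sh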
Job Arrays — Running Parameter Sweeps¶
If you need to run the same code with many different inputs (e.g., different values of a coupling constant, different scales μ), use a job array instead of submitting dozens of identical scripts:
#!/bin/bash
#SBATCH --job-name=scale_scan
#SBATCH --array=1-50 # Launches 50 jobs, SLURM_ARRAY_TASK_ID = 1..50
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=02:00:00
#SBATCH --output=logs/scan_%A_%a.out # %A = array job ID, %a = task ID
module purge
module load gcc/14
# Use the task ID to select a parameter from a list
SCALE_FILE=/home/your_username/scales.txt
SCALE=$(sed -n "${SLURM_ARRAY_TASK_ID}p" $SCALE_FILE)
echo "Running with scale mu = ${SCALE} GeV"
# Make sure the output directory exists
mkdir -p results
./my_program --mu $SCALE --output results/output_${SLURM_ARRAY_TASK_ID}.dat
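The parameter file is plain text with one value per line, so line N supplies the scale for array task N. A hypothetical file with 50 equally spaced values could be generated with:
# Write 50 scale values (50, 60, ..., 540), one per line
seq 50 10 540 > /home/your_username/scales.txt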
A job array of 500 tasks could flood the queue. Use the %N suffix to cap the number of tasks that run concurrently: for example, #SBATCH --array=1-500%20 runs at most 20 tasks at a time.
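Command-line options to sbatch override the #SBATCH directives in the script, which is handy when a few array tasks fail and you only want to repeat those indices (assuming the script above is saved as scale_scan.sh):
# Re-run only tasks 7, 23 and 41 of the sweep, without editing the script
sbatch --array=7,23,41 scale_scan.sh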
Interactive Sessions¶
Sometimes you need to work interactively on a compute node — for example, to
test a compilation, debug a script, or run Mathematica interactively. Use
salloc or srun for this.
An interactive session holds a reservation on compute nodes for its entire
duration, even if you are idle. Request only what you need, and
exit your session as soon as you are done.
salloc — Request an Allocation, Then Work¶
salloc requests resources and drops you into a new shell. From there, you
can run commands directly on the allocated node(s).
# Request 1 node, 4 CPUs, 8 GB RAM, for up to 2 hours
salloc --nodes=1 --ntasks=1 --cpus-per-task=4 --mem=8G --time=02:00:00
# Once the allocation is granted, your prompt changes.
# You are now running on a compute node.
# Load modules and work interactively:
module load gcc/15
make -j 4
./my_program --test
# When done, exit the allocation
exit
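The same pattern works for parallel tests: request several tasks, then use srun inside the allocation to launch the MPI ranks. A minimal sketch, assuming an MPI executable named my_mpi_program:
# Request 4 MPI ranks on one node for up to an hour
salloc --nodes=1 --ntasks=4 --time=01:00:00
# Inside the allocation shell, srun starts one copy per requested task
srun ./my_mpi_program
# Release the allocation when done
exit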
srun — Run Commands on Compute Nodes¶
srun launches tasks directly on compute nodes under a Slurm allocation. It’s ideal for:
- Running a single command non-interactively
- Launching parallel jobs
- Opening an interactive session on a compute node
Unlike sbatch, srun runs your command in the foreground and blocks until it completes (it still waits in the queue if resources are busy), and unlike ssh, it ensures proper resource accounting and isolation.
Run a Single Command¶
Use srun when you need to execute one command with allocated resources:
# Compile using 8 CPU cores on a compute node
srun --ntasks=1 --cpus-per-task=8 --mem=4G --time=00:15:00 \
make -j8
Interactive Shell on a Compute Node¶
To debug, test code, or run commands manually:
# Open an interactive bash shell on a compute node
srun --ntasks=1 --cpus-per-task=4 --mem=8G --time=01:00:00 --pty bash
This gives you a shell on a compute node, not the login node. Type exit when you are done to release the allocation.