Troubleshooting & FAQ¶
My job is stuck in PENDING — why?¶
Run `squeue -j JOB_ID` and inspect the REASON column (a fuller example follows the table). Common reasons:
| Reason | Meaning |
|---|---|
| Resources | The scheduler is waiting for the requested resources to free up; your job will start as soon as they do |
| Priority | Higher-priority jobs are ahead of yours in the queue |
| QOSMaxCpuPerUserLimit | You have hit your per-user CPU quota |
| ReqNodeNotAvail | You requested a specific node that is down or reserved |
| InvalidAccount | The account you submitted under is not a valid Slurm account; contact support |
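To check the reason for every pending job you own in one pass, something like the following works with standard `squeue` format options (the column list is just an illustration):

```bash
# Job ID, name, state, and the reason each pending job is still waiting
squeue -u $USER -t PENDING -o "%.12i %.20j %.8T %r"
```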
My job failed immediately — where do I look?¶
```bash
# Check the job's exit state
sacct -j JOB_ID --format=JobID,State,ExitCode

# Read the error log (the file you specified with --error in your script)
cat logs/job_JOB_ID.err
```
Common causes: a missing `module load` in the script, incorrect file paths, or insufficient memory (the job is OOM-killed). A minimal script that sidesteps these pitfalls is sketched below.
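The sketch assumes a Python job; the module name, environment path, file paths, and resource numbers are placeholders to adapt:

```bash
#!/bin/bash
#SBATCH --job-name=my_analysis
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --mem=8G
#SBATCH --time=02:00:00
#SBATCH --output=logs/job_%j.out
#SBATCH --error=logs/job_%j.err
# Note: create the logs/ directory before submitting; Slurm writes these
# files at job start, so a missing directory means lost output or a failed job.

# Load the software stack the job needs (placeholder module name)
module load python/3.11
source ~/venvs/myenv/bin/activate

# Use absolute paths so the job does not depend on where it was submitted from
python ~/projects/my_analysis/run.py
```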
How do I know how much memory my job actually used?¶
Query the accounting database with `sacct`: the MaxRSS field reports the peak resident memory the job used. Use this value to size the `--mem` request of future jobs.
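For example (the memory figure is usually reported on the `.batch` step line):

```bash
# Peak resident memory (MaxRSS) versus requested memory for a finished job
sacct -j JOB_ID --format=JobID,JobName,State,ReqMem,MaxRSS,Elapsed
```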
I need to run a Jupyter notebook — how?¶
Launch Jupyter on a compute node via `srun`, then forward its port to your local machine through an SSH tunnel:
```bash
# Step 1: on the cluster login node, start an interactive session
srun --ntasks=1 --cpus-per-task=2 --mem=8G --time=04:00:00 --pty bash

# Step 2: on the compute node (note its hostname, e.g. node07)
source ~/venvs/myenv/bin/activate
jupyter lab --no-browser --port=8888 --ip=0.0.0.0

# Step 3: on your local machine, open an SSH tunnel to that compute node
ssh -N -L 8888:node07:8888 your_username@galileo.mi.infn.it

# Step 4: open http://localhost:8888 in your browser
```
I accidentally ran something heavy on the login node — what do I do?¶
Kill the process immediately:
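For example (the script name below is just a placeholder for whatever you started):

```bash
# List your processes on the login node to find the PID
ps -u $USER -o pid,etime,%cpu,%mem,cmd

# Kill a single process by its PID ...
kill PID
# ... or kill by name, matching only your own processes (placeholder name)
pkill -u $USER -f heavy_script.py
```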
Then submit the work as a proper Slurm job. If you are unsure whether a task is "too heavy" for the login node, assume it is: request an interactive allocation with `salloc` (or submit a batch job) to be safe, as in the example below.
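A minimal interactive allocation looks something like this (the resource numbers are placeholders to adjust for your task):

```bash
# Request an interactive allocation; run the heavy command inside it, then exit
salloc --ntasks=1 --cpus-per-task=4 --mem=16G --time=01:00:00
```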
📬 Need help?
For account issues, software requests, or anything not covered here, contact the HPC support team at admins@lcm.mi.infn.it. Please include your username, job ID(s), and the relevant error output in your message.