Troubleshooting & FAQ¶
My job is stuck in PENDING — why?¶
Run `squeue -j JOB_ID` and inspect the REASON column (a fuller example follows the table). Common reasons:
| Reason | Meaning |
|---|---|
| Resources | The scheduler is waiting for the requested resources to free up; your job will start as soon as they do |
| Priority | Higher-priority jobs are ahead of yours in the queue |
| QOSMaxCpuPerUserLimit | You have hit your per-user CPU quota |
| ReqNodeNotAvail | You requested a specific node that is down or reserved |
| InvalidAccount | The account you submitted under is not a valid Slurm account; contact support |
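To check the reason for every pending job you own in one pass, something like the following works with standard `squeue` format options (the column list is just an illustration):

```bash
# Job ID, name, state, and the reason each pending job is still waiting
squeue -u $USER -t PENDING -o "%.12i %.20j %.8T %r"
```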
My job failed immediately — where do I look?¶
```bash
# Check the job's exit state
sacct -j JOB_ID --format=JobID,State,ExitCode

# Read the error log (the file you specified with --error in your script)
cat logs/job_JOB_ID.err
```
Common causes: a missing `module load` in the script, incorrect file paths, or insufficient memory (the job is OOM-killed). A minimal script that sidesteps these pitfalls is sketched below.
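The sketch assumes a Python job; the module name, environment path, file paths, and resource numbers are placeholders to adapt:

```bash
#!/bin/bash
#SBATCH --job-name=my_analysis
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --mem=8G
#SBATCH --time=02:00:00
#SBATCH --output=logs/job_%j.out
#SBATCH --error=logs/job_%j.err
# Note: create the logs/ directory before submitting; Slurm writes these
# files at job start, so a missing directory means lost output or a failed job.

# Load the software stack the job needs (placeholder module name)
module load python/3.11
source ~/venvs/myenv/bin/activate

# Use absolute paths so the job does not depend on where it was submitted from
python ~/projects/my_analysis/run.py
```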
How do I know how much memory my job actually used?¶
Query the accounting database with `sacct`: the MaxRSS field reports the peak resident memory the job used. Use this value to size the `--mem` request of future jobs.
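For example (the memory figure is usually reported on the `.batch` step line):

```bash
# Peak resident memory (MaxRSS) versus requested memory for a finished job
sacct -j JOB_ID --format=JobID,JobName,State,ReqMem,MaxRSS,Elapsed
```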
I need to run a Jupyter notebook — how?¶
Launch Jupyter on a compute node via `srun`, then forward its port to your local machine through an SSH tunnel:
```bash
# Step 1: on the cluster login node, start an interactive session
srun --ntasks=1 --cpus-per-task=2 --mem=8G --time=04:00:00 --pty bash

# Step 2: on the compute node (note its hostname, e.g. node07)
source ~/venvs/myenv/bin/activate
jupyter lab --no-browser --port=8888 --ip=0.0.0.0

# Step 3: on your local machine, open an SSH tunnel to that compute node
ssh -N -L 8888:node07:8888 your_username@galileo.mi.infn.it

# Step 4: open http://localhost:8888 in your browser
```
I accidentally ran something heavy on the login node — what do I do?¶
Kill the process immediately:
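For example (the script name below is just a placeholder for whatever you started):

```bash
# List your processes on the login node to find the PID
ps -u $USER -o pid,etime,%cpu,%mem,cmd

# Kill a single process by its PID ...
kill PID
# ... or kill by name, matching only your own processes (placeholder name)
pkill -u $USER -f heavy_script.py
```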
Then submit the work as a proper Slurm job. If you are unsure whether a task is "too heavy" for the login node, assume it is: request an interactive allocation with `salloc` (or submit a batch job) to be safe, as in the example below.
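A minimal interactive allocation looks something like this (the resource numbers are placeholders to adjust for your task):

```bash
# Request an interactive allocation; run the heavy command inside it, then exit
salloc --ntasks=1 --cpus-per-task=4 --mem=16G --time=01:00:00
```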
📬 Need help?
For account issues, software requests, or anything not covered here, contact the HPC support team at admins@lcm.mi.infn.it. Please include your username, job ID(s), and the relevant error output in your message.