I like my job, so no screenshots. Sorry.

Notes:

  • sbatch is the Slurm command for submitting batch jobs to high-performance compute nodes
  • the huge-n128-512g node has 128 cores and 512 GiB of memory
  • this took place at a medical research nonprofit

User: Hello everyone, this is the first time I’m using GCP. I’m trying to run a job, but it keeps failing. These are the sbatch headers I’m using:

#SBATCH --partition=huge-n128-512g
#SBATCH --nodes=8
#SBATCH [email protected]
#SBATCH --mail-type=FAIL
#SBATCH --mem-per-cpu=32G

IT: Please make sure you actually need that node; each one costs $4,500/month to use. Can you describe the job you’re trying to run?

User: I’m doing high-depth genetic sequencing using 3 GB BAM files.

(additional note: there’s usually only one BAM file per chromosome, so 69 GB total. Nice.)

IT: Those BAM files are pretty small. I’d recommend starting with the med-n16-64g node and moving up if needed. We’re only billed for run time, so if the jobs take the same amount of time, it would be about 13% of the cost.
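For what it’s worth, a right-sized submission along the lines of IT’s suggestion might look something like this (the partition name and mail settings are taken from the thread; the node count and per-CPU memory are illustrative guesses chosen to fit inside a 64 GiB node, not a tested config):

```shell
#SBATCH --partition=med-n16-64g
#SBATCH --nodes=1
#SBATCH [email protected]
#SBATCH --mail-type=FAIL
#SBATCH --mem-per-cpu=3G    # 16 cores x 3 GiB = 48 GiB, within the node's 64 GiB
```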

The astute among you will notice that 8 nodes × 128 cores × 32 GiB of memory per core comes out to 32 TiB in total. The job was failing because --mem-per-cpu requested 4 TiB per node (128 cores × 32 GiB), far more than each node’s 512 GiB. Even without that flag, the swarm would have claimed all 8 nodes’ memory: 4 TiB. Holy overallocation, Batman!
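Running the numbers with shell arithmetic, using only the figures from the thread:

```shell
# Total memory requested across the swarm: 8 nodes x 128 cores x 32 GiB per core
echo $(( 8 * 128 * 32 )) GiB    # 32768 GiB = 32 TiB
# Requested per node vs what's installed: 128 cores x 32 GiB per core
echo $(( 128 * 32 )) GiB        # 4096 GiB requested, but each node has only 512 GiB
# Even without --mem-per-cpu, 8 whole nodes x 512 GiB each still claim
echo $(( 8 * 512 )) GiB         # 4096 GiB = 4 TiB
```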