I like my job, so no screenshots. Sorry.

Notes:

  • sbatch is the Slurm command for submitting batch jobs to high-performance compute nodes (see the quick example after these notes)
  • each huge-n128-512g node has 128 cores and 512 GiB of memory
  • This happened at a medical research nonprofit
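
For anyone who hasn't touched Slurm before: a job script is just a shell script with #SBATCH lines at the top, and you hand the whole file to sbatch. A bare-bones sketch — the filename and contents here are placeholders of mine, not anything from the thread:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --mem=4G
# the scheduler runs everything below once the job starts
echo "hello from $(hostname)"

Submitted with: sbatch hello.sh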

User: Hello everyone, this is the first time I’m using GCP. I’m trying to run a job, but it keeps failing. These are the sbatch headers I’m using:

#SBATCH --partition=huge-n128-512g
#SBATCH --nodes=8
#SBATCH [email protected]
#SBATCH --mail-type=FAIL
#SBATCH --mem-per-cpu=32G

IT: Please make sure you actually need that node; each one costs $4,500/month to use. Can you describe the job you’re trying to do?

User: I’m doing high-depth genetic sequencing using 3 GB BAM files.

(additional note: there’s usually only one BAM file per chromosome, so 69 GB total across the 23 chromosomes. Nice.)

IT: Those BAM files are pretty small. I’d recommend starting with the med-n16-64g node and moving up if needed. We’re only billed for run time. If the jobs take the same amount of time, it would be 13% of the cost.

The astute among you will notice that an 8-node swarm at 32 GiB of memory per core is 32 TiB total. The job was failing because the --mem-per-cpu flag worked out to 4 TiB per node (128 cores × 32 GiB), eight times the 512 GiB each node actually has. Even without that flag, the swarm would still have claimed 4 TiB of memory across its 8 nodes. Holy overallocation, Batman!
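
For the curious, a right-sized version of those headers might look something like the sketch below. The med-n16-64g partition comes from IT's suggestion; the single node and the 3 GiB-per-core figure are my assumptions, not what the user actually ended up running:

#SBATCH --partition=med-n16-64g
#SBATCH --nodes=1
#SBATCH [email protected]
#SBATCH --mail-type=FAIL
# 16 cores x 3 GiB = 48 GiB per node, safely under the 64 GiB limit
#SBATCH --mem-per-cpu=3G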

  • Pika@sh.itjust.works · 4 days ago

    Was this used as an eye-opener to put some restrictions on resource allocation in place? I don’t see any real need for a newbie to even have access to that level of allocation.

    This time it was only caught because the nodes physically didn’t have the resources; y’all got lucky. Next time that might not be the case.

    Granted, it seems it wouldn’t have been a $36,000 job since you mentioned billing is based on run time, but it’s still wasteful of resources and cost prohibitive.

  • flandish@lemmy.world · 4 days ago

    I worked in medical research not too long ago as an ETL guy, using Netezza for holding data and making cohorts. I don’t envy also having to spin compute nodes up and down these days. Good luck out there.

  • Paragone@lemmy.world · 4 days ago

    In CFD, that wouldn’t even get you 1 litre of physics-correct simulation, ttbomk…

    (I read in one paper that you need 11-micron cells in your mesh for physics correctness: bigger didn’t work right. And there are one HELL of a lot of 11-micron cells in an aircraft’s boundary layer.

    Which explains why airliner simulation runs can be priced in the $0.1B+ range, from what I’ve read.)

    _ /\ _