Troubleshooting

When running AlphaFold on the HPC, you may encounter errors from several sources. Most are caused by a wrong configuration or by resource exhaustion (e.g., when running large proteins or proteins with very large MSAs). The most common errors are listed below, along with possible workarounds.

The errors described here show up in one of two places:

  1. At submission time (qsub), on the command line, which replies with an error message.
  2. In the error file that is generated for each job running on the HPC. This file can be found in your $VSC_DATA/alphafold directory, and is named <job_name>.e<job_id> (for instance: example_run.e1502930). Display its contents with less or cat. The actual error message is most often found at the bottom of the file.
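
Since the relevant message is usually at the bottom of the error file, printing only the last lines is often enough. The following is a minimal sketch; show_last_error is a hypothetical helper name, and it assumes the error files follow the <job_name>.e<job_id> pattern described above.

```shell
#!/bin/bash
# show_last_error is a hypothetical helper, not part of the AlphaFold setup.
# It prints the last lines of the most recently modified job error file.
show_last_error() {
    dir="${1:-$VSC_DATA/alphafold}"   # directory holding the error files
    lines="${2:-20}"                  # number of trailing lines to print
    # Newest file whose name ends in .e<job_id>
    latest=$(ls -t "$dir"/*.e[0-9]* 2>/dev/null | head -n 1)
    if [ -z "$latest" ]; then
        echo "No error files found in $dir" >&2
        return 1
    fi
    echo "== $latest =="
    tail -n "$lines" "$latest"
}
```

Called without arguments, show_last_error looks in $VSC_DATA/alphafold and prints the last 20 lines of the newest error file.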

At submission time

Wrong configuration

On the #PBS lines in the script, use only the ‘keys’ that were specified in the example (e.g., nodes, ppn, gpus, mem). An unsupported key, such as cpu, causes an error at submission time.
You will also get an error at submission time when you request an invalid amount of resources, for instance 5 GPUs on a node with just 4 GPUs, or too much memory.
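
As an illustration, a header that uses only the keys named above could look like the sketch below. The job name and all values are assumptions chosen for the example, not recommendations; they must stay within the limits of the cluster you submit to.

```shell
#!/bin/bash
# Illustrative values only: adjust them to the node type you are targeting.
#PBS -N example_run
#PBS -l nodes=1:ppn=8:gpus=1
#PBS -l mem=64gb
```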

Wrong setup: module not loaded

Submitting a job without first swapping to the correct cluster (for instance with module swap cluster/joltik) produces an error at submission time.

In the output error file

Wrong setup: wrong FASTA file name

The file XX.fasta is not present in $VSC_DATA/alphafold/fastas/. Check that the file name you specified matches a file in that directory.
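
A quick check before submitting can catch this early. The sketch below is one possible way to do it; check_fasta is a hypothetical helper name, and the .fasta extension and directory are taken from the path above.

```shell
#!/bin/bash
# check_fasta is a hypothetical helper, not part of the AlphaFold setup.
# It verifies that <name>.fasta exists in the FASTA directory.
check_fasta() {
    name="$1"                                      # file name without extension
    fasta_dir="${2:-$VSC_DATA/alphafold/fastas}"   # directory to look in
    if [ -f "$fasta_dir/$name.fasta" ]; then
        echo "Found $fasta_dir/$name.fasta"
    else
        echo "Missing $fasta_dir/$name.fasta" >&2
        echo "Available files:" >&2
        ls "$fasta_dir" >&2
        return 1
    fi
}
```

For example, check_fasta XX reports whether $VSC_DATA/alphafold/fastas/XX.fasta exists and, if it does not, lists the files that are there.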

Insufficient CPU memory (RAM)

Not enough memory was allocated to hold the MSAs and compute the model predictions. Try allocating more memory in the #PBS header of the script.
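
In the header this amounts to raising the value of the mem key. The figure below is an assumption for illustration; the amount you can request depends on the node.

```shell
# Illustrative value only: request more RAM than in the failed run.
#PBS -l mem=128gb
```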

Insufficient GPU memory

The GPU memory was not large enough to fit the sequence and compute its structure. This should only occur if you are trying to run a very large protein, or if you run on a smaller GPU (<16 GB). You can try to allocate more than one GPU in the #PBS header.
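
A header line requesting two GPUs might look like the sketch below. The nodes and ppn values are assumptions for illustration, and the number of GPUs cannot exceed what a single node offers.

```shell
# Illustrative values only: two GPUs on one node.
#PBS -l nodes=1:ppn=8:gpus=2
```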

Insufficient storage

There is not enough disk space on the drive where your alphafold/ directory is located. If this is in $VSC_DATA, you can try to (a) free up space, or (b) join a virtual organization (VO), where you will have more storage capacity.
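
To see how much space is in use before deciding what to free up, something like the following sketch can help. report_usage is a hypothetical helper name; note that a personal quota, if one applies, may be lower than the free space that df reports.

```shell
#!/bin/bash
# report_usage is a hypothetical helper, not part of the AlphaFold setup.
report_usage() {
    dir="${1:-$VSC_DATA/alphafold}"
    # Total size of the directory and everything below it
    du -sh "$dir"
    # Free space on the file system that holds the directory
    df -h "$dir"
}
```

Called without arguments, report_usage inspects $VSC_DATA/alphafold.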

Time limit exceeded

The time limit of the job was reached. You can try to increase the allocated time in the #PBS header (max. 3 days). If the run takes far longer than expected, make sure that you allocated a GPU in the #PBS header.
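
As a sketch, the maximum of 3 days corresponds to 3 × 24 = 72 hours. The text above does not name the key for the time limit; walltime is the standard PBS resource for it and is assumed here. The nodes and ppn values are likewise illustrative.

```shell
# Assumes the standard PBS walltime resource (HH:MM:SS); 72:00:00 equals 3 days.
#PBS -l walltime=72:00:00
# Illustrative values: request a GPU so the prediction does not run on CPU only.
#PBS -l nodes=1:ppn=8:gpus=1
```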