Berzelius AlphaFold Guide

AlphaFold is a deep-learning-based protein structure prediction program developed by DeepMind. It uses a neural network to predict the 3D structure of a protein from its amino acid sequence. The first version of AlphaFold, introduced in 2018, was considered a breakthrough in the field of protein structure prediction. In 2020, AlphaFold2 won CASP14, the biennial competition that evaluates state-of-the-art methods in protein structure prediction, predicting protein structures with remarkable accuracy; this has implications for drug discovery and for understanding disease at the molecular level.

AlphaFold's official documentation recommends running AlphaFold with Docker, which is not available on Berzelius. Therefore, in this user guide we provide three ways to run AlphaFold: using Conda, the Container Module and Apptainer.

0. Preparations

0.1 Set the paths

We specify the paths for the AlphaFold database, the AlphaFold installation and the results. Adjust them to your own project directories as needed.

ALPHAFOLD_DB=/proj/nsc/users/alphafold_db_2.3.1
ALPHAFOLD_DIR=/proj/nsc_testing/xuan/alphafold_2.3.1
ALPHAFOLD_RESULTS=/proj/nsc_testing/xuan/alphafold_results_2.3.1
mkdir -p ${ALPHAFOLD_DB} ${ALPHAFOLD_DIR} ${ALPHAFOLD_RESULTS}
mkdir -p ${ALPHAFOLD_RESULTS}/output ${ALPHAFOLD_RESULTS}/input
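
These variables must be set in every new shell and in every sbatch script. One convenient option is to keep them in a small file that you source when needed (the file name is just a suggestion):

# alphafold_env.sh - source this in new shells and sbatch scripts
export ALPHAFOLD_DB=/proj/nsc/users/alphafold_db_2.3.1
export ALPHAFOLD_DIR=/proj/nsc_testing/xuan/alphafold_2.3.1
export ALPHAFOLD_RESULTS=/proj/nsc_testing/xuan/alphafold_results_2.3.1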

0.2 Download genetic databases

A copy of the AlphaFold database is available on Berzelius at /proj/nsc/users/alphafold_db_2.3.1. It should only be used for test purposes; please download your own copy for your research.

The download script download_all_data.sh uses aria2c, so we first load the aria2 module.

module load aria2/1.36.0
wget -O /tmp/v2.3.1.tar.gz https://github.com/deepmind/alphafold/archive/refs/tags/v2.3.1.tar.gz
tar -xf /tmp/v2.3.1.tar.gz -C ${ALPHAFOLD_DIR} --strip-components=1
cd ${ALPHAFOLD_DIR}
scripts/download_all_data.sh ${ALPHAFOLD_DB}
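
The full database set is on the order of terabytes once extracted, so make sure the target file system has enough space. If you only need the smaller databases for the reduced_dbs preset, the download script accepts an optional mode argument:

scripts/download_all_data.sh ${ALPHAFOLD_DB} reduced_dbs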

0.3 Download the test data

The test input T1050.fasta can be found on this page. Download and save it to ${ALPHAFOLD_RESULTS}/input.

1. Running AlphaFold using Conda

1.1 Create a conda env

We first load the Anaconda module.

module load Anaconda/2021.05-nsc1

We create a conda env from a yml file.

git clone https://gitlab.liu.se/xuagu37/berzelius-alphafold-guide /tmp/berzelius-alphafold-guide
conda env create -f /tmp/berzelius-alphafold-guide/alphafold_2.3.1.yml
conda activate alphafold_2.3.1

1.2 Install AlphaFold

1.2.1 Download AlphaFold

You can specify the version in the download link.

wget -O /tmp/v2.3.1.tar.gz https://github.com/deepmind/alphafold/archive/refs/tags/v2.3.1.tar.gz
tar -xf /tmp/v2.3.1.tar.gz -C ${ALPHAFOLD_DIR} --strip-components=1

1.2.2 Apply OpenMM patch

cd ~/.conda/envs/alphafold_2.3.1/lib/python3.8/site-packages/
patch -p0 < ${ALPHAFOLD_DIR}/docker/openmm.patch
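
The path above assumes the default conda env location under your home directory. If your envs live elsewhere, you can locate the site-packages directory programmatically from inside the activated env (a small sketch):

# Ask Python for the active env's site-packages directory
SITE_PACKAGES=$(python -c "import site; print(site.getsitepackages()[0])")
cd ${SITE_PACKAGES}
patch -p0 < ${ALPHAFOLD_DIR}/docker/openmm.patch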

1.2.3 Download chemical properties

wget -q -P ${ALPHAFOLD_DIR}/alphafold/common/ https://git.scicore.unibas.ch/schwede/openstructure/-/raw/7102c63615b64735c4941278d92b554ec94415f8/modules/mol/alg/src/stereo_chemical_props.txt

1.2.4 Install patch

This patch includes two new input arguments in run_alphafold.py.

  • n_parallel_msa
    --n_parallel_msa=1: the searches are not parallelized.
    --n_parallel_msa=3: the searches are all parallelized.
    This new flag has been wrapped as -P in the wrapper.

  • run_feature_only
    --run_feature_only=true: to only run MSA and template searches.
    This new flag has been wrapped as -F in the wrapper.

The patch also provides the flexibility to choose the number of threads used for the MSA searches. Read the Optimization section for more details.

To install the patch, run the following commands.

git clone https://gitlab.liu.se/xuagu37/berzelius-alphafold-guide /tmp/berzelius-alphafold-guide
cd /tmp && bash berzelius-alphafold-guide/patch/patch.sh ${ALPHAFOLD_DIR}
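
As a quick sanity check (not part of the official instructions), you can verify that the patch was applied by confirming that the new flag is now present in run_alphafold.py:

grep -n "n_parallel_msa" ${ALPHAFOLD_DIR}/run_alphafold.py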

1.3 Run an example

cd ${ALPHAFOLD_DIR}
bash run_alphafold.sh \
  -d ${ALPHAFOLD_DB} \
  -o ${ALPHAFOLD_RESULTS}/output \
  -f ${ALPHAFOLD_RESULTS}/input/T1050.fasta \
  -t 2021-11-01 \
  -g true \
  -P 3 \
  -F false

Please check the input arguments in run_alphafold.py. A complete list of input arguments is given in the Appendix for reference.

2. Running AlphaFold using Container Module

2.1 Load the module

module load AlphaFold/2.3.1-apptainer

2.2 Run an example

bash run_alphafold.sh \
  -d ${ALPHAFOLD_DB} \
  -o ${ALPHAFOLD_RESULTS}/output \
  -f ${ALPHAFOLD_RESULTS}/input/T1050.fasta \
  -t 2021-11-01 \
  -g true \
  -P 3 \
  -F false

3. Running AlphaFold using Apptainer

3.1 Get the container image

A prebuilt Apptainer image of AlphaFold 2.3.1 is available at /software/sse/containers.
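
The run example below assumes the image file is available in your working directory; copy it there first (the exact file name under /software/sse/containers is an assumption, adjust as needed):

cp /software/sse/containers/alphafold_2.3.1.sif .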

3.2 Run an example

apptainer exec --nv alphafold_2.3.1.sif bash -c "cd /app/alphafold && bash run_alphafold.sh \
  -d ${ALPHAFOLD_DB} \
  -o ${ALPHAFOLD_RESULTS}/output \
  -f ${ALPHAFOLD_RESULTS}/input/T1050.fasta \
  -t 2021-11-01 \
  -g true \
  -P 3 \
  -F false"

4. Optimization

4.1 MSA searches in parallel

The three independent MSA searches, which run sequentially by default, can be arranged in parallel to accelerate the job. The parallelisation is enabled by setting the flag -P 3. The searches are grouped as follows.

  • jackhmmer(uniref90) + template_searcher(pdb)
  • jackhmmer(mgnify)
  • hhblits(bfd) or jackhmmer(small_bfd)

Ref 1: AlphaFold PR 399 Parallel execution of MSA tools.
Ref 2: Zhong et al. 2022, ParaFold: Paralleling AlphaFold for Large-Scale Predictions.

4.2 Multithreading for MSA searches

AlphaFold 2.3.1 uses a default of 8, 8 and 4 threads for the three MSA searches, which is not always optimal. The hhblits search is the most time-consuming, so we can manually assign it a larger number of threads. The number of threads used for the three searches can be set in alphafold/data/pipeline.py at lines 131 to 134.

For multimer models, the jackhmmer (uniprot) search starts when the first three searches have finished. Its number of threads can be set in alphafold/data/pipeline_multimer.py at line 179.

We recommend using n_cpu = 8, 8, 16 and 32 on Berzelius for jackhmmer (uniref90), jackhmmer (mgnify), hhblits (bfd) and jackhmmer (uniprot), respectively.
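
Since exact line numbers drift between releases, it is safer to locate the relevant lines by searching for the thread-count argument. Assuming the patch exposes the thread counts via the tool wrappers' n_cpu parameter (the parameter name used by the upstream jackhmmer and hhblits wrappers):

grep -n "n_cpu" ${ALPHAFOLD_DIR}/alphafold/data/pipeline.py \
    ${ALPHAFOLD_DIR}/alphafold/data/pipeline_multimer.py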

4.3 Separation of CPU part (MSA and template searches) and GPU part (predictions)

A flag --run_feature_only has been added to separate the CPU and GPU parts. AlphaFold only uses GPUs for the prediction part of the modelling, which can be a small fraction of the total running time; most of the operations (the MSA and template searches) are CPU-based. We therefore strongly suggest running the CPU part on Tetralith and the GPU part on Berzelius, as described in section 5.

4.4 I/O optimization

Each compute node has a local scratch file system used for temporary storage while the job is running; the data is deleted when the job finishes. On Berzelius this disk, mounted at /scratch/local, offers 15 TB of NVMe SSD storage. The AlphaFold database can be copied to /scratch/local at the beginning of a job to achieve better I/O performance. However, in our experiments on Berzelius, copying the database to the node-local storage did not give any significant improvement in job running time.
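
For completeness, a sketch of what the staging would look like at the top of an sbatch script (note that copying a multi-terabyte database itself takes considerable time):

# Stage the database on node-local NVMe, then point the job at the copy
cp -r ${ALPHAFOLD_DB} /scratch/local/alphafold_db
ALPHAFOLD_DB=/scratch/local/alphafold_db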

5. Best practice for running AlphaFold on Berzelius

To make the best use of the GPU resources on Berzelius, we strongly suggest separating the CPU and GPU parts when running AlphaFold jobs: run the CPU part on Tetralith or on your local computer, and then run the GPU part on Berzelius.

  1. Run the CPU part job on Tetralith.

You need to set -F true in the command to run only the MSA and template searches.

Also, set -P 1 to run the MSA searches sequentially, since the parallelisation on Tetralith gives no improvement due to the I/O bottleneck. The CPU part job needs a large amount of memory, so make sure you have requested enough CPU cores. An sbatch script example has been prepared for you here; a minimal sketch is also shown after this item.

On Tetralith, the AlphaFold database can be found at /proj/common_datasets/alphafold/v2.3.1.
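
A minimal sketch of such a script (the project name, core count and wall time are placeholders; adjust them to your allocation, and set the ALPHAFOLD_* variables as in section 0.1):

#!/bin/bash
#SBATCH -A <your-project>   # your Tetralith project allocation
#SBATCH -n 32               # the MSA searches need plenty of cores and memory
#SBATCH -t 24:00:00

cd ${ALPHAFOLD_DIR}
bash run_alphafold.sh \
  -d /proj/common_datasets/alphafold/v2.3.1 \
  -o ${ALPHAFOLD_RESULTS}/output \
  -f ${ALPHAFOLD_RESULTS}/input/T1050.fasta \
  -t 2021-11-01 \
  -g false \
  -P 1 \
  -F true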

  2. Transfer the CPU part results from Tetralith to Berzelius via your local computer.

  3. Run the GPU part job on Berzelius. You need to set -F false in the command; the run will then skip the already completed MSA and template searches and jump directly to the predictions.

  4. To achieve better GPU utilisation, you can run several AlphaFold GPU part jobs concurrently. See the example sbatch script, in which 5 GPU part jobs are executed concurrently; a minimal sketch follows below.
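
A minimal sketch of the concurrent variant (assumes the CPU part features for each target are already in the output directory):

cd ${ALPHAFOLD_DIR}
for fasta in ${ALPHAFOLD_RESULTS}/input/*.fasta; do
  # Launch one GPU part run per target in the background
  bash run_alphafold.sh \
    -d ${ALPHAFOLD_DB} \
    -o ${ALPHAFOLD_RESULTS}/output \
    -f ${fasta} \
    -t 2021-11-01 \
    -g true \
    -F false &
done
wait   # block until all background runs have finished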

Appendix

A. A list of input arguments for run_alphafold.py

Name Default Description
fasta_paths None Paths to FASTA files, each containing a prediction target that will be folded one after another.
data_dir None Path to directory of supporting data.
output_dir None Path to a directory that will store the results.
uniref90_database_path None Path to the Uniref90 database for use by JackHMMER.
mgnify_database_path None Path to the MGnify database for use by JackHMMER.
bfd_database_path None Path to the BFD database for use by HHblits.
small_bfd_database_path None Path to the small version of BFD used with the "reduced_dbs" preset.
uniref30_database_path None Path to the UniRef30 database for use by HHblits.
uniprot_database_path None Path to the UniProt database for use by JackHMMER.
pdb70_database_path None Path to the PDB70 database for use by HHsearch.
pdb_seqres_database_path None Path to the PDB seqres database for use by hmmsearch.
template_mmcif_dir None Path to a directory with template mmCIF structures, each named <pdb_id>.cif.
max_template_date None Maximum template release date to consider.
obsolete_pdbs_path None Path to file containing a mapping from obsolete PDB IDs to the PDB IDs of their replacements.
db_preset full_dbs Choose preset MSA database configuration - smaller (reduced_dbs) or full (full_dbs).
model_preset monomer Choose preset model configuration - monomer, monomer_casp14, monomer_ptm, or multimer.
benchmark False Run multiple JAX model evaluations to obtain a timing that excludes the compilation time, which should be more indicative of the time required for inferencing many proteins.
random_seed None The random seed for the data pipeline.
num_multimer_predictions_per_model 5 How many predictions (each with a different random seed) will be generated per model.
use_precomputed_msas False Whether to read MSAs that have been written to disk instead of running the MSA tools.
run_relax True Whether to run the final relaxation step on the predicted models.
use_gpu_relax None Whether to relax on GPU.
n_parallel_msa 1 Whether to parallelize MSA searches. 1: not parallelized; 3: all parallelized.
run_feature_only False Whether to only run MSA and template searches.
jackhmmer_binary_path shutil.which('jackhmmer') Path to the JackHMMER executable.
hhblits_binary_path shutil.which('hhblits') Path to the HHblits executable.
hhsearch_binary_path shutil.which('hhsearch') Path to the HHsearch executable.
hmmsearch_binary_path shutil.which('hmmsearch') Path to the hmmsearch executable.
hmmbuild_binary_path shutil.which('hmmbuild') Path to the hmmbuild executable.
kalign_binary_path shutil.which('kalign') Path to the Kalign executable.
