Using AlphaFold on Berzelius

Introduction

AlphaFold is a deep learning-based protein structure prediction program developed by DeepMind. It uses a neural network to predict the 3D structure of a protein from its amino acid sequence. The first version of AlphaFold was released in 2018 and was considered a breakthrough in protein structure prediction. In 2020, AlphaFold2 won CASP14, a biennial competition that evaluates state-of-the-art methods in protein structure prediction. AlphaFold2 predicted protein structures with remarkable accuracy, which has implications for drug discovery and for understanding diseases at the molecular level.

Preparations

Setting the Paths

We specify the paths for the AlphaFold database, the AlphaFold installation, and the results.

ALPHAFOLD_DB=/proj/common-datasets/AlphaFold
ALPHAFOLD_DIR=/proj/nsc_testing/xuan/alphafold_2.3.1
ALPHAFOLD_RESULTS=/proj/nsc_testing/xuan/alphafold_results_2.3.1
mkdir -p ${ALPHAFOLD_DB} ${ALPHAFOLD_DIR} ${ALPHAFOLD_RESULTS}
mkdir -p ${ALPHAFOLD_RESULTS}/output ${ALPHAFOLD_RESULTS}/input

Downloading Genetic Databases

We have a copy of the AlphaFold database on Berzelius at /proj/common-datasets.

If you need to download your own copy, load the aria2 module first, since the AlphaFold download scripts use aria2c.

module load aria2/1.36.0-gcc-8.5.0
wget -O /tmp/v2.3.1.tar.gz https://github.com/deepmind/alphafold/archive/refs/tags/v2.3.1.tar.gz
tar -xf /tmp/v2.3.1.tar.gz -C ${ALPHAFOLD_DIR} --strip-components=1
cd ${ALPHAFOLD_DIR}
scripts/download_all_data.sh ${ALPHAFOLD_DB}

Downloading Test Data

The test input T1050.fasta can be found on this page. Download and save it to ${ALPHAFOLD_RESULTS}/input.
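
As a minimal sketch of this step (the URL below is a placeholder for the actual download link on that page):

# Hypothetical: replace <URL-to-T1050.fasta> with the actual download link
wget -O ${ALPHAFOLD_RESULTS}/input/T1050.fasta "<URL-to-T1050.fasta>"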

Patch

Our patch includes two new input arguments in run_alphafold.py.

  • n_parallel_msa
    --n_parallel_msa=1: the MSA searches run sequentially.
    --n_parallel_msa=3: all three MSA searches run in parallel.
    This flag is exposed as -P in the wrapper.

  • run_feature_only
    --run_feature_only=true: run only the MSA and template searches.
    This flag is exposed as -F in the wrapper.

The patch also provides the flexibility to choose the number of threads used for the MSA searches. Read the Optimization section for more details.
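
For example, to run only the MSA and template searches with the three searches in parallel, using the wrapper described in the sections below:

run_alphafold.sh \
  -d ${ALPHAFOLD_DB} \
  -o ${ALPHAFOLD_RESULTS}/output \
  -f ${ALPHAFOLD_RESULTS}/input/T1050.fasta \
  -t 2021-11-01 \
  -g false \
  -P 3 \
  -F true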

Running AlphaFold Using the Module

On Berzelius, AlphaFold 2.3.1 is preinstalled and available as a module.

Loading the Module

On a compute node we load the AlphaFold module.

module load AlphaFold/2.3.1-hpc1

Running an Example

We run an example.

run_alphafold.sh \
  -d ${ALPHAFOLD_DB} \
  -o ${ALPHAFOLD_RESULTS}/output \
  -f ${ALPHAFOLD_RESULTS}/input/T1050.fasta \
  -t 2021-11-01 \
  -g true \
  -P 3 \
  -F false

Please run run_alphafold.sh -h to check the usage.

To unload the module

module rm AlphaFold/2.3.1-hpc1

Running AlphaFold Using Conda

Creating a Conda Env

We first load the Anaconda module.

module load Anaconda/2023.09-0-hpc1-bdist

We create a conda env from a yml file.

git clone https://gitlab.liu.se/xuagu37/berzelius-alphafold-guide /tmp/berzelius-alphafold-guide
conda env create -f /tmp/berzelius-alphafold-guide/alphafold_2.3.1.yml
conda activate alphafold_2.3.1

Installing AlphaFold

To download AlphaFold

wget -O /tmp/v2.3.1.tar.gz https://github.com/deepmind/alphafold/archive/refs/tags/v2.3.1.tar.gz
tar -xf /tmp/v2.3.1.tar.gz -C ${ALPHAFOLD_DIR} --strip-components=1

To apply the OpenMM patch

cd /home/xuan/.conda/envs/alphafold_2.3.1/lib/python3.8/site-packages/ 
patch -p0 < ${ALPHAFOLD_DIR}/docker/openmm.patch

To download the stereo chemical properties file

wget -q -P ${ALPHAFOLD_DIR}/alphafold/common/ https://git.scicore.unibas.ch/schwede/openstructure/-/raw/7102c63615b64735c4941278d92b554ec94415f8/modules/mol/alg/src/stereo_chemical_props.txt

To install the patch

git clone https://gitlab.liu.se/xuagu37/berzelius-alphafold-guide /tmp/berzelius-alphafold-guide
cd /tmp && bash berzelius-alphafold-guide/patch/patch_2.3.1.sh ${ALPHAFOLD_DIR}

Running an Example

cd ${ALPHAFOLD_DIR}
bash run_alphafold.sh \
  -d ${ALPHAFOLD_DB} \
  -o ${ALPHAFOLD_RESULTS}/output \
  -f ${ALPHAFOLD_RESULTS}/input/T1050.fasta \
  -t 2021-11-01 \
  -g true \
  -P 3 \
  -F false

Please check the input arguments in run_alphafold.py. A complete list of input arguments is attached here for reference.
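
Since run_alphafold.py defines its options with absl flags, you can also print the full list from the activated conda env (a quick check rather than an official reference):

cd ${ALPHAFOLD_DIR}
python run_alphafold.py --helpfull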

Running AlphaFold Using Apptainer

Creating the Container Image

There is an Apptainer image of AlphaFold 2.3.1 at /software/sse/containers.

Running an Example

apptainer exec --nv alphafold_2.3.1.sif bash -c "cd /app/alphafold && bash run_alphafold.sh \
  -d ${ALPHAFOLD_DB} \
  -o ${ALPHAFOLD_RESULTS}/output \
  -f ${ALPHAFOLD_RESULTS}/input/T1050.fasta \
  -t 2021-11-01 \
  -g true \
  -P 3 \
  -F false"

Optimization

MSA Searches in Parallel

The three independent MSA searches, which run sequentially by default, can be run in parallel to accelerate the job. You can enable parallelization by setting the flag -P 3:

  • jackhmmer(uniref90) + template_searcher(pdb)
  • jackhmmer(mgnify)
  • hhblits(bfd) or jackhmmer(small_bfd)

Ref 1: AlphaFold PR 399 Parallel execution of MSA tools.
Ref 2: Zhong et al. 2022, ParaFold: Paralleling AlphaFold for Large-Scale Predictions.

Multithreading for MSA Searches

AlphaFold 2.3.1 uses 8, 8, and 4 threads by default for the three MSA searches, which is not always optimal. The hhblits search is the most time-consuming, so we can manually allocate more threads to it. You can set the number of threads for the three searches in alphafold/data/pipeline.py at lines 131 to 134.

For multimer models, the jackhmmer (uniprot) search starts when the first three searches finish. You can set its number of threads in alphafold/data/pipeline_multimer.py at line 179.

We recommend n_cpu=8, 8, 16, and 32 on Berzelius for jackhmmer (uniref90), jackhmmer (mgnify), hhblits (bfd), and jackhmmer (uniprot), respectively.
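
A quick way to locate the relevant settings in the patched sources (the exact line numbers may shift between releases):

# List every place where n_cpu is set in the data pipelines and MSA tool wrappers
grep -n "n_cpu" ${ALPHAFOLD_DIR}/alphafold/data/pipeline.py \
  ${ALPHAFOLD_DIR}/alphafold/data/pipeline_multimer.py \
  ${ALPHAFOLD_DIR}/alphafold/data/tools/*.py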

Separation of the CPU Part (MSA and Template Searches) and the GPU Part (Predictions)

A flag --run_feature_only has been added to separate the CPU and GPU parts. AlphaFold uses GPUs only for the prediction part of the modelling, which can be a small fraction of the total running time. Most of the operations executed by the script are CPU-based (MSA and template searches). We therefore strongly suggest running the CPU part on Tetralith and the GPU part on Berzelius.

Best Practice of Running AlphaFold on Berzelius

To make the best use of the GPU resources on Berzelius, we strongly suggest separating the CPU and GPU parts when running AlphaFold jobs. You should run the CPU part on Tetralith or your local computer, and then run the GPU part on Berzelius.

  1. Run the CPU part of the job on Tetralith.

You need to set -F true in the command to run MSA and template searches only.

Additionally, set -P 1 to run the MSA searches sequentially, as parallelization on Tetralith will not provide any improvement due to the I/O bottleneck. The CPU part of the job requires a large amount of memory, so make sure you request enough CPU cores. An sbatch script example has been prepared for you here.

On Tetralith, the AlphaFold database can be found at /proj/common_datasets/AlphaFold.

ALPHAFOLD_DB=/proj/common_datasets/AlphaFold
ALPHAFOLD_RESULTS=/proj/nsc/users/xuan/alphafold_results_2.3.1/
module load AlphaFold/2.3.1-hpc1
run_alphafold.sh \
  -d ${ALPHAFOLD_DB} \
  -o ${ALPHAFOLD_RESULTS}/output \
  -f ${ALPHAFOLD_RESULTS}/input/T1050.fasta \
  -t 2021-11-01 \
  -g false \
  -P 1 \
  -F true
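
If you prefer to write your own submission script rather than using the prepared example, a minimal sketch for this step could look as follows (the account name, core count, and walltime are placeholders; adjust them to your project and target size):

#!/bin/bash
#SBATCH --account=<your-tetralith-project>   # placeholder
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32                   # the MSA searches need plenty of cores and memory
#SBATCH --time=12:00:00

ALPHAFOLD_DB=/proj/common_datasets/AlphaFold
ALPHAFOLD_RESULTS=/proj/nsc/users/xuan/alphafold_results_2.3.1/
module load AlphaFold/2.3.1-hpc1
run_alphafold.sh \
  -d ${ALPHAFOLD_DB} \
  -o ${ALPHAFOLD_RESULTS}/output \
  -f ${ALPHAFOLD_RESULTS}/input/T1050.fasta \
  -t 2021-11-01 \
  -g false \
  -P 1 \
  -F true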

  2. Transfer the CPU part results from Tetralith to Berzelius via your local computer.

  3. Run the GPU part of the job on Berzelius.

You need to set -F false in the command. Since the MSA and template search results are already in the output directory, these searches will be skipped and AlphaFold will proceed directly to the predictions.

ALPHAFOLD_DB=/proj/common-datasets/AlphaFold
ALPHAFOLD_RESULTS=/proj/nsc_testing/xuan/alphafold_results_2.3.1/
module load AlphaFold/2.3.1-hpc1
run_alphafold.sh \
  -d ${ALPHAFOLD_DB} \
  -o ${ALPHAFOLD_RESULTS}/output \
  -f ${ALPHAFOLD_RESULTS}/input/T1050.fasta \
  -t 2021-11-01 \
  -g true \
  -P 1 \
  -F false

  4. To achieve better GPU utilization, you can run several AlphaFold GPU part jobs concurrently. See the example sbatch script, which demonstrates how to execute five GPU part jobs at once.
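
As a rough illustration of the idea (not the prepared example script), several GPU part runs can share one allocation by launching them in the background and waiting for all of them to finish; the input file names below are placeholders, and depending on GPU memory use you may need to lower the number of concurrent runs:

#!/bin/bash
#SBATCH --gpus=1                 # one GPU shared by the concurrent runs
#SBATCH --time=06:00:00

ALPHAFOLD_DB=/proj/common-datasets/AlphaFold
ALPHAFOLD_RESULTS=/proj/nsc_testing/xuan/alphafold_results_2.3.1/
module load AlphaFold/2.3.1-hpc1

# Launch one GPU part run per input (MSA results must already be in the output directory)
for fasta in target1.fasta target2.fasta target3.fasta target4.fasta target5.fasta; do
  run_alphafold.sh \
    -d ${ALPHAFOLD_DB} \
    -o ${ALPHAFOLD_RESULTS}/output \
    -f ${ALPHAFOLD_RESULTS}/input/${fasta} \
    -t 2021-11-01 \
    -g true \
    -P 1 \
    -F false &
done
wait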

AlphaFold Alternatives

LocalColabFold

On Berzelius, LocalColabFold 1.5.5 is preinstalled and available as a module.

Loading the Module

On a compute node we load the LocalColabFold module.

module load LocalColabFold/1.5.5-hpc1

Running an Example

We run an example.

colabfold_batch --data /proj/common-datasets/AlphaFold input/ output/
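
The input argument can be either a directory of FASTA files (as above) or a single FASTA file; for example, reusing the paths from earlier in this guide:

colabfold_batch --data /proj/common-datasets/AlphaFold ${ALPHAFOLD_RESULTS}/input/T1050.fasta ${ALPHAFOLD_RESULTS}/output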

OpenFold

On Berzelius, OpenFold 2.1.0 is preinstalled and available as a module.

Loading the Module

On a compute node we load the OpenFold module.

module load OpenFold/2.1.0-hpc1

Running an Example

Model inference without pre-computed alignments

export BASE_DATA_DIR=/proj/common-datasets/AlphaFold
export TEMPLATE_MMCIF_DIR=${BASE_DATA_DIR}/pdb_mmcif/mmcif_files
export INPUT_FASTA_DIR=/proj/nsc_testing/xuan/alphafold_results_2.3.1/input/T1050
export OUTPUT_DIR=/proj/nsc_testing/xuan/alphafold_results_2.3.1/output

cd ${OPENFOLD_PREFIX}
python3 run_pretrained_openfold.py \
    $INPUT_FASTA_DIR \
    $TEMPLATE_MMCIF_DIR \
    --output_dir $OUTPUT_DIR \
    --config_preset model_1_ptm \
    --uniref90_database_path $BASE_DATA_DIR/uniref90/uniref90.fasta \
    --mgnify_database_path $BASE_DATA_DIR/mgnify/mgy_clusters_2022_05.fa \
    --pdb70_database_path $BASE_DATA_DIR/pdb70/pdb70 \
    --uniclust30_database_path $BASE_DATA_DIR/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
    --bfd_database_path $BASE_DATA_DIR/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
    --model_device "cuda:0" 

For more information on using OpenFold, please refer to the OpenFold documentation.

