AlphaFold is a deep-learning-based protein structure prediction program developed by DeepMind. It uses a neural network to predict the 3D structure of a protein from its amino acid sequence. The first version of AlphaFold was released in 2018 and was considered a breakthrough in the field of protein structure prediction. In 2020, AlphaFold2 won the CASP14 competition, a biennial competition that evaluates state-of-the-art methods in protein structure prediction. AlphaFold2 predicted protein structures with remarkable accuracy, which has implications for drug discovery and for understanding diseases at the molecular level.
AlphaFold's official documentation recommends running AlphaFold with Docker, which is not available on Berzelius. Therefore, in this user guide we provide three ways to run AlphaFold: using Conda, using the container module, and using Apptainer.
We specify the paths for the AlphaFold database, the AlphaFold installation, and the results.
ALPHAFOLD_DB=/proj/nsc/users/alphafold_db_2.3.1
ALPHAFOLD_DIR=/proj/nsc_testing/xuan/alphafold_2.3.1
ALPHAFOLD_RESULTS=/proj/nsc_testing/xuan/alphafold_results_2.3.1
mkdir -p ${ALPHAFOLD_DB} ${ALPHAFOLD_DIR} ${ALPHAFOLD_RESULTS}
mkdir -p ${ALPHAFOLD_RESULTS}/output ${ALPHAFOLD_RESULTS}/input
We keep a copy of the AlphaFold database on Berzelius at /proj/nsc/users/alphafold_db_2.3.1, which should only be used for test purposes. Please download your own copy for your research.
The aria2 module will be used for downloading the AlphaFold database.
module load aria2/1.36.0
wget -O /tmp/v2.3.1.tar.gz https://github.com/deepmind/alphafold/archive/refs/tags/v2.3.1.tar.gz
tar -xf /tmp/v2.3.1.tar.gz -C ${ALPHAFOLD_DIR} --strip-components=1
cd ${ALPHAFOLD_DIR}
scripts/download_all_data.sh ${ALPHAFOLD_DB}
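If disk space is a concern, the download script also accepts a preset for the smaller databases, matching --db_preset=reduced_dbs at runtime. A sketch, assuming the stock AlphaFold 2.3.1 download script:
# Optional: download only the reduced databases (for use with --db_preset=reduced_dbs)
scripts/download_all_data.sh ${ALPHAFOLD_DB} reduced_dbs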
The test input T1050.fasta can be found on this page. Download and save it to ${ALPHAFOLD_RESULTS}/input.
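If you just want a quick smoke test, any valid single-sequence FASTA file works as input. The sequence below is an arbitrary short example, not the CASP target T1050:
# Create a minimal test input (arbitrary example sequence)
cat > ${ALPHAFOLD_RESULTS}/input/test.fasta << 'EOF'
>test_sequence
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF
EOF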
We first load the Anaconda module.
module load Anaconda/2021.05-nsc1
We create a conda env from a yml file.
git clone https://gitlab.liu.se/xuagu37/berzelius-alphafold-guide /tmp/berzelius-alphafold-guide
conda env create -f /tmp/berzelius-alphafold-guide/alphafold_2.3.1.yml
conda activate alphafold_2.3.1
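A quick sanity check that the environment resolved correctly (assuming jax is pinned in the yml file, as AlphaFold requires it):
# Verify the environment activates and a key package imports
python -c "import jax; print(jax.__version__)"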
You can specify the version in the download link.
wget -O /tmp/v2.3.1.tar.gz https://github.com/deepmind/alphafold/archive/refs/tags/v2.3.1.tar.gz
tar -xf /tmp/v2.3.1.tar.gz -C ${ALPHAFOLD_DIR} --strip-components=1
cd ~/.conda/envs/alphafold_2.3.1/lib/python3.8/site-packages/
patch -p0 < ${ALPHAFOLD_DIR}/docker/openmm.patch
wget -q -P ${ALPHAFOLD_DIR}/alphafold/common/ https://git.scicore.unibas.ch/schwede/openstructure/-/raw/7102c63615b64735c4941278d92b554ec94415f8/modules/mol/alg/src/stereo_chemical_props.txt
This patch adds two new input arguments to run_alphafold.py:
--n_parallel_msa: --n_parallel_msa=1 runs the MSA searches sequentially; --n_parallel_msa=3 runs them all in parallel. This new flag has been wrapped as -P in the wrapper.
--run_feature_only: --run_feature_only=true runs only the MSA and template searches. This new flag has been wrapped as -F in the wrapper.
The patch also provides the flexibility to choose the number of threads used for the MSA searches. Read the Optimization section for more details.
To install the patch:
git clone https://gitlab.liu.se/xuagu37/berzelius-alphafold-guide /tmp/berzelius-alphafold-guide
cd /tmp && bash berzelius-alphafold-guide/patch/patch.sh ${ALPHAFOLD_DIR}
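After patching, you can check that the new flags are present; an illustrative check, assuming the patch edits run_alphafold.py in place:
# The patched script should now define both new flags
grep -n "n_parallel_msa\|run_feature_only" ${ALPHAFOLD_DIR}/run_alphafold.py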
cd ${ALPHAFOLD_DIR}
bash run_alphafold.sh \
-d ${ALPHAFOLD_DB} \
-o ${ALPHAFOLD_RESULTS}/output \
-f ${ALPHAFOLD_RESULTS}/input/T1050.fasta \
-t 2021-11-01 \
-g true \
-P 3 \
-F false
Please check the input arguments in run_alphafold.py. A complete list of input arguments is attached here for reference.
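You can also print every flag directly; run_alphafold.py uses absl flags, so the standard --helpfull option should list them (assuming the Conda environment is active):
cd ${ALPHAFOLD_DIR}
python run_alphafold.py --helpfull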
module load AlphaFold/2.3.1-apptainer
bash run_alphafold.sh \
-d ${ALPHAFOLD_DB} \
-o ${ALPHAFOLD_RESULTS}/output \
-f ${ALPHAFOLD_RESULTS}/input/T1050.fasta \
-t 2021-11-01 \
-g true \
-P 3 \
-F false
There is an Apptainer image of AlphaFold 2.3.1 at /software/sse/containers.
apptainer exec --nv alphafold_2.3.1.sif bash -c "cd /app/alphafold && bash run_alphafold.sh \
-d ${ALPHAFOLD_DB} \
-o ${ALPHAFOLD_RESULTS}/output \
-f ${ALPHAFOLD_RESULTS}/input/T1050.fasta \
-t 2021-11-01 \
-g true \
-P 3 \
-F false"
The three independent MSA searches can be run in parallel to accelerate the job. The parallelisation is enabled by setting the flag -P 3.
Ref 1: AlphaFold PR 399 Parallel execution of MSA tools.
Ref 2: Zhong et al. 2022, ParaFold: Paralleling AlphaFold for Large-Scale Predictions.
AlphaFold 2.3.1 uses a default of 8, 8, and 4 threads for the three MSA searches, which is not always optimal. The hhblits search is the most time-consuming, so we can manually set a larger number of threads for it. The number of threads used for the three searches can be set in alphafold/data/pipeline.py at lines 131 to 134.
For multimer models, the jackhmmer (uniprot) search starts when the first three searches finish. Its number of threads can be set in alphafold/data/pipeline_multimer.py at line 179.
We recommend using n_cpu = 8, 8, 16, 32 on Berzelius for jackhmmer (uniref90), jackhmmer (mgnify), hhblits (bfd), and jackhmmer (uniprot), respectively.
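To see exactly where to change these values in your checkout (line numbers may drift between patch versions), you can search for n_cpu; an illustrative check:
# Locate the thread settings for the MSA searches
grep -n "n_cpu" ${ALPHAFOLD_DIR}/alphafold/data/pipeline.py ${ALPHAFOLD_DIR}/alphafold/data/pipeline_multimer.py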
A flag --run_feature_only has been added to separate the CPU and GPU parts. AlphaFold only uses GPUs for the prediction part of the modelling, which can be a small fraction of the total running time; most of the operations executed by the script (the MSA and template searches) are CPU-based. We therefore strongly suggest running the CPU part on Tetralith and the GPU part on Berzelius.
Each compute node has a local scratch file system used for temporary storage while the job is running; the data is deleted when the job finishes. This disk, mounted at /scratch/local on Berzelius, provides 15 TB of NVMe SSD storage. You can copy the AlphaFold database to /scratch/local at the beginning of a job for better I/O performance. However, in our experiments on Berzelius, copying the database to node-local storage did not give any significant improvement in job running time.
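If you nevertheless want to try it, a job can stage the database like this (a sketch; adjust paths to your setup):
# Stage the database on node-local NVMe storage at the start of the job
cp -r ${ALPHAFOLD_DB} /scratch/local/alphafold_db
ALPHAFOLD_DB=/scratch/local/alphafold_db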
To make the best use of the GPU resources on Berzelius, we strongly suggest separating the CPU and GPU parts when running AlphaFold jobs: run the CPU part on Tetralith or on your local computer, and then run the GPU part on Berzelius.
You need to set -F true in the command to run the MSA and template searches only. Also set -P 1 to run the MSA searches sequentially, since parallelisation on Tetralith gives no improvement due to the I/O bottleneck; see the sketch below. The CPU part job needs a large amount of memory, so make sure you have requested enough CPU cores. An sbatch script example has been prepared for you here.
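Putting the two flags together, the CPU-part command looks like this (a sketch reusing the wrapper and paths from above; -g false is assumed to disable GPU use, as Tetralith nodes have no GPUs):
bash run_alphafold.sh \
-d ${ALPHAFOLD_DB} \
-o ${ALPHAFOLD_RESULTS}/output \
-f ${ALPHAFOLD_RESULTS}/input/T1050.fasta \
-t 2021-11-01 \
-g false \
-P 1 \
-F true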
On Tetralith, the AlphaFold database can be found at /proj/common_datasets/alphafold/v2.3.1.
Transfer the CPU part results from Tetralith to Berzelius via your local computer.
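For example, from your local computer (hostnames and the placeholder paths in angle brackets are illustrative; substitute your own):
# Pull the feature results from Tetralith, then push them to Berzelius
rsync -av tetralith.nsc.liu.se:<results_on_tetralith>/output/ ./alphafold_output/
rsync -av ./alphafold_output/ berzelius.nsc.liu.se:<ALPHAFOLD_RESULTS>/output/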
Run the GPU part job on Berzelius. You need to set -F false in the command; the run will then skip the MSA and template searches and jump directly to the predictions.
To achieve better GPU utilisation, you can run several AlphaFold GPU part jobs concurrently. See the example sbatch script, in which 5 GPU part jobs are executed concurrently.
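The pattern inside such a script is to background each run and then wait for all of them. A minimal sketch; the five target names are illustrative:
# Run five GPU-part predictions concurrently, then wait for all of them
for target in T1050 T1051 T1052 T1053 T1054; do
    bash run_alphafold.sh \
        -d ${ALPHAFOLD_DB} \
        -o ${ALPHAFOLD_RESULTS}/output \
        -f ${ALPHAFOLD_RESULTS}/input/${target}.fasta \
        -t 2021-11-01 \
        -g true \
        -F false &
done
wait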
| Name | Default | Description |
|---|---|---|
| fasta_paths | None | Paths to FASTA files, each containing a prediction target that will be folded one after another. |
| data_dir | None | Path to directory of supporting data. |
| output_dir | None | Path to a directory that will store the results. |
| uniref90_database_path | None | Path to the Uniref90 database for use by JackHMMER. |
| mgnify_database_path | None | Path to the MGnify database for use by JackHMMER. |
| bfd_database_path | None | Path to the BFD database for use by HHblits. |
| small_bfd_database_path | None | Path to the small version of BFD used with the "reduced_dbs" preset. |
| uniref30_database_path | None | Path to the UniRef30 database for use by HHblits. |
| uniprot_database_path | None | Path to the Uniprot database for use by JackHMMER. |
| pdb70_database_path | None | Path to the PDB70 database for use by HHsearch. |
| pdb_seqres_database_path | None | Path to the PDB seqres database for use by hmmsearch. |
| template_mmcif_dir | None | Path to a directory with template mmCIF structures, each named .cif. |
| max_template_date | None | Maximum template release date to consider. |
| obsolete_pdbs_path | None | Path to file containing a mapping from obsolete PDB IDs to the PDB IDs of their replacements. |
| db_preset | full_dbs | Choose preset MSA database configuration - smaller or full. |
| model_preset | monomer | Choose preset model configuration - monomer, monomer_casp14, monomer_ptm, or multimer. |
| benchmark | False | Run multiple JAX model evaluations to obtain a timing that excludes the compilation time, which should be more indicative of the time required for inferencing many proteins. |
| random_seed | None | The random seed for the data pipeline. |
| num_multimer_predictions_per_model | 5 | How many predictions (each with a different random seed) will be generated per model. |
| use_precomputed_msas | False | Whether to read MSAs that have been written to disk instead of running the MSA tools. |
| run_relax | True | Whether to run the final relaxation step on the predicted models. |
| use_gpu_relax | None | Whether to relax on GPU. |
| n_parallel_msa | 1 | Whether to parallelize MSA searches. 1: not parallelized; 3: all parallelized. |
| run_feature_only | False | Whether to only run MSA and template searches. |
| jackhmmer_binary_path | shutil.which('jackhmmer') | Path to the JackHMMER executable. |
| hhblits_binary_path | shutil.which('hhblits') | Path to the HHblits executable. |
| hhsearch_binary_path | shutil.which('hhsearch') | Path to the HHsearch executable. |
| hmmsearch_binary_path | shutil.which('hmmsearch') | Path to the hmmsearch executable. |
| hmmbuild_binary_path | shutil.which('hmmbuild') | Path to the hmmbuild executable. |
| kalign_binary_path | shutil.which('kalign') | Path to the Kalign executable. |