Difference between revisions of "AlphaFold2"

From ScientificComputing

Revision as of 14:27, 22 March 2022

AlphaFold2 predicts a protein's 3D structure from its amino acid sequence with an accuracy that is competitive with experimental results. This AI-powered structure prediction was recognized as the scientific breakthrough of the year 2021. The AlphaFold package is now installed in the new software stack on Euler.

Load modules

The AlphaFold module can be loaded as follows.

$ env2lmod
$ module load gcc/6.3.0 openmpi/4.0.2 alphafold/2.1.1
Now run 'alphafold_init' to initialize the virtual environment

The following have been reloaded with a version change:
  1) gcc/4.8.5 => gcc/6.3.0

$ alphafold_init
(venv_alphafold) [jarunanp@eu-login-18 ~]$ 

Databases

The AlphaFold databases are available for all cluster users at /cluster/project/alphafold.

To download the databases separately, see the instructions here.

Create a job script

A job script is a Bash script containing commands to request computing resources, set up the computing environment, run the application and retrieve the results.

Here is a breakdown of a job script called run_alphafold.bsub.

Request computing resources

AlphaFold2 can run on CPUs only, or on CPUs and GPUs, which speed up the computation significantly. Here we request 12 CPU cores, a total of 120 GB of memory, a total of 120 GB of local scratch space, and one GPU.

#!/usr/bin/bash
#BSUB -n 12                                                    # Number of CPUs
#BSUB -W 24:00                                                  # Runtime
#BSUB -R "rusage[mem=10000, scratch=10000]"                    # CPU memory and scratch space per CPU core
#BSUB -R "rusage[ngpus_excl_p=1] select[gpu_mtotal0>=10240]"   # Number of GPUs and GPU memory 
#BSUB -R "span[hosts=1]"                                       # All CPUs in the same host
#BSUB -J alphafold                                             # Job name
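
The memory and scratch values in the rusage string are per CPU core, so the totals follow from multiplying by the number of cores. A minimal sketch of that arithmetic, assuming the values are in MB (consistent with the 120 GB totals stated above); the variable names are illustrative and not part of the job script:

```shell
# Per-core values copied from the #BSUB lines above (assumed to be in MB)
NCPUS=12
MEM_PER_CORE_MB=10000
SCRATCH_PER_CORE_MB=10000

# Totals for the whole job
TOTAL_MEM_MB=$((NCPUS * MEM_PER_CORE_MB))
TOTAL_SCRATCH_MB=$((NCPUS * SCRATCH_PER_CORE_MB))

echo "Total memory:  ${TOTAL_MEM_MB} MB (~$((TOTAL_MEM_MB / 1000)) GB)"
echo "Total scratch: ${TOTAL_SCRATCH_MB} MB (~$((TOTAL_SCRATCH_MB / 1000)) GB)"
```

If you change the number of cores with -n, the totals scale accordingly, so adjust the per-core values if you want to keep the same overall request.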

Set up a computing environment for AlphaFold

source /cluster/apps/local/env2lmod.sh
module load gcc/6.3.0 openmpi/4.0.2 alphafold/2.1.1
source /cluster/apps/nss/alphafold/venv_alphafold/bin/activate

Enable Unified Memory (if needed)

If the input protein sequence is too large for the memory of a single GPU (roughly more than 1500 aa), enable Unified Memory to bridge the system memory and the GPU memory so that you can oversubscribe the memory of a single GPU.

...
#BSUB -R "rusage[ngpus_excl_p=4] select[gpu_mtotal0>=10240]"
...
export TF_FORCE_UNIFIED_MEMORY=1
export XLA_PYTHON_CLIENT_MEM_FRACTION="4.0"
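
One way to decide whether these settings are needed is to count the residues in the input FASTA file and compare against the approximate 1500 aa threshold mentioned above. The following is only a sketch: count_residues is a hypothetical helper, the toy sequence is a placeholder rather than a real protein, and for a multi-sequence FASTA file the helper sums all chains, which may or may not be the right criterion for your case.

```shell
# Approximate threshold from the text above
MAX_RESIDUES_SINGLE_GPU=1500

# Hypothetical helper: count residues in a FASTA file,
# skipping header lines (starting with ">") and stripping whitespace
count_residues() {
    awk '!/^>/ { gsub(/[ \t\r]/, ""); n += length($0) } END { print n + 0 }' "$1"
}

# Toy input for illustration only (placeholder sequence, not a real protein)
FASTA_FILE="$(mktemp)"
printf '>toy_sequence\nMKTAYIAKQR\nQISFVKSHFS\n' > "$FASTA_FILE"

N_RES="$(count_residues "$FASTA_FILE")"
if [ "$N_RES" -gt "$MAX_RESIDUES_SINGLE_GPU" ]; then
    # Same settings as shown above
    export TF_FORCE_UNIFIED_MEMORY=1
    export XLA_PYTHON_CLIENT_MEM_FRACTION="4.0"
    echo "Unified Memory enabled for $N_RES residues"
else
    echo "Unified Memory not needed for $N_RES residues"
fi
rm -f "$FASTA_FILE"
```

In a real job script you would point FASTA_FILE at your own input instead of the temporary toy file.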

Define paths

# Define paths to databases, fasta file and output directory
DATA_DIR="/cluster/project/alphafold"
FASTA_DIR="/cluster/home/jarunanp/fastafiles"
OUTPUT_DIR=$TMPDIR/output
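
Because a mistyped path only surfaces after the job has started, it can help to check the paths up front. A minimal sketch, assuming the same DATA_DIR and FASTA_DIR as above; check_path is a hypothetical helper, not something provided by the cluster or by AlphaFold:

```shell
# Hypothetical helper: succeed if the path exists, otherwise report it on stderr
check_path() {
    if [ -e "$1" ]; then
        return 0
    fi
    echo "Missing path: $1" >&2
    return 1
}

DATA_DIR="/cluster/project/alphafold"
FASTA_DIR="/cluster/home/jarunanp/fastafiles"

# Count how many of the required inputs are missing
MISSING=0
for p in "$DATA_DIR" "$FASTA_DIR/ubiquitin.fasta"; do
    check_path "$p" || MISSING=$((MISSING + 1))
done
echo "Missing paths: $MISSING"
```

In a job script you might stop early when MISSING is greater than zero instead of only printing the count.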

Call Python run script

python /cluster/apps/nss/alphafold/alphafold-2.1.1/run_alphafold.py \
--data_dir=$DATA_DIR \
--output_dir=$OUTPUT_DIR \
--max_template_date="2021-12-06" \
--bfd_database_path=$DATA_DIR/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
--uniref90_database_path=$DATA_DIR/uniref90/uniref90.fasta \
--uniclust30_database_path=$DATA_DIR/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
--mgnify_database_path=$DATA_DIR/mgnify/mgy_clusters_2018_12.fa \
--template_mmcif_dir=$DATA_DIR/pdb_mmcif/mmcif_files \
--obsolete_pdbs_path=$DATA_DIR/pdb_mmcif/obsolete.dat \

Then, define the input fasta file, select the model preset (monomer or multimer) and define the path to structure databases accordingly.

  • For a monomeric protein
--fasta_paths=$FASTA_DIR/ubiquitin.fasta \
--model_preset=monomer \
--pdb70_database_path=$DATA_DIR/pdb70/pdb70
  • For a multimeric protein
--fasta_paths=$FASTA_DIR/IFGSC_6mer.fasta \
--model_preset=multimer \
--pdb_seqres_database_path=$DATA_DIR/pdb_seqres/pdb_seqres.txt \
--uniprot_database_path=$DATA_DIR/uniprot/uniprot.fasta
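
The two variants above differ only in the last few flags, so the choice can be captured in a small function. This is a sketch under assumptions: preset_flags is a hypothetical helper, it uses only the flags listed above, and it assumes the paths contain no spaces (the output is meant to be word-split by the shell).

```shell
# Hypothetical helper: print the preset-specific flags, one per line
# Arguments: preset (monomer|multimer), data directory, FASTA file
preset_flags() {
    preset="$1"
    data_dir="$2"
    fasta="$3"
    case "$preset" in
        monomer)
            printf '%s\n' \
                "--fasta_paths=$fasta" \
                "--model_preset=monomer" \
                "--pdb70_database_path=$data_dir/pdb70/pdb70"
            ;;
        multimer)
            printf '%s\n' \
                "--fasta_paths=$fasta" \
                "--model_preset=multimer" \
                "--pdb_seqres_database_path=$data_dir/pdb_seqres/pdb_seqres.txt" \
                "--uniprot_database_path=$data_dir/uniprot/uniprot.fasta"
            ;;
        *)
            echo "Unknown preset: $preset" >&2
            return 1
            ;;
    esac
}

# Show the flags that would be appended for the multimer example
preset_flags multimer /cluster/project/alphafold /cluster/home/jarunanp/fastafiles/IFGSC_6mer.fasta
```

The output could then be appended to the run_alphafold.py call, for example with $(preset_flags monomer "$DATA_DIR" "$FASTA_DIR/ubiquitin.fasta") as the last argument.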

AlphaFold 2.1.2

In this version, relaxation can be run on the GPU with the option --use_gpu_relax. However, do not enable this option on the Euler cluster: AlphaFold tries to create multiple GPU contexts, but the default GPU compute mode is exclusive, which prevents more than one context from being created.

--use_gpu_relax=0

Copy the results from the compute node

mkdir -p output
cp -r $OUTPUT_DIR/* output
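
If the run fails before writing anything, the plain cp command reports an error about the unmatched wildcard. A slightly more defensive variant is sketched below; copy_results is a hypothetical helper, and the demo uses temporary directories with a placeholder file so that it can be tried outside a job:

```shell
# Hypothetical helper: copy the contents of a results directory,
# but only if the directory exists and is not empty
copy_results() {
    src="$1"
    dest="$2"
    if [ ! -d "$src" ] || [ -z "$(ls -A "$src")" ]; then
        echo "No results found in $src" >&2
        return 1
    fi
    mkdir -p "$dest" && cp -r "$src"/. "$dest"/
}

# Demo with temporary directories and a placeholder file (not real results)
DEMO_SRC="$(mktemp -d)"
DEMO_DEST="$(mktemp -d)/output"
echo "placeholder" > "$DEMO_SRC/ranked_0.pdb"
copy_results "$DEMO_SRC" "$DEMO_DEST"
ls "$DEMO_DEST"
```

In the job script the call would be copy_results "$OUTPUT_DIR" output, matching the two commands above.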

Submit a job

Submit a job with the command

$ bsub < run_alphafold.bsub

The screen output is saved in a file whose name starts with lsf.o followed by the JobID, e.g., lsf.o195525946. Please see this page for how to read the output file.
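
Since the output file name encodes the JobID, the two can be related with a few lines of shell. A small sketch, assuming the lsf.o naming described above; job_id_from_logfile is a hypothetical helper:

```shell
# Hypothetical helper: strip the "lsf.o" prefix to recover the JobID
job_id_from_logfile() {
    name="$(basename "$1")"
    echo "${name#lsf.o}"
}

job_id_from_logfile lsf.o195525946   # prints 195525946

# Show the end of the most recent LSF output file in the current directory, if any
LATEST="$(ls -t lsf.o* 2>/dev/null | head -n 1)"
if [ -n "$LATEST" ]; then
    echo "Latest output file: $LATEST (JobID $(job_id_from_logfile "$LATEST"))"
    tail -n 20 "$LATEST"
fi
```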

In our benchmark, it took around 40 minutes to fold Ubiquitin (76 aa) and 2.5 hours to fold T1050 (779 aa).

Further readings
