AlphaFold2
Revision as of 13:27, 17 June 2022
AlphaFold2 predicts a protein's 3D structure from its amino acid sequence with an accuracy that is competitive with experimental results. This AI-powered structure prediction was recognized as the scientific breakthrough of the year 2021. The AlphaFold package is now installed in the new software stack on Euler.
Load modules
The AlphaFold module can be loaded as follows.
    $ env2lmod
    $ module load gcc/6.3.0 openmpi/4.0.2 alphafold/2.1.1
    Now run 'alphafold_init' to initialize the virtual environment

    The following have been reloaded with a version change:
      1) gcc/4.8.5 => gcc/6.3.0

    $ alphafold_init
    (venv_alphafold) [jarunanp@eu-login-18 ~]$
Databases
The AlphaFold databases are available for all cluster users at /cluster/project/alphafold.
If you wish to download the databases separately, see the instructions on the "Downloading Alphafold databases" page.
Postprocessing
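Plots similar to those generated by the ColabFold Jupyter notebook (https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/AlphaFold2.ipynb) can be created with the alphafold-postprocessing Python script (https://gitlab.ethz.ch/sis/alphafold-postprocessing). It is available on Euler as a module: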
    module load gcc/6.3.0 alphafold-postprocessing
    postprocessing.py -o plots/ work_directory/
The above command will process pkl files generated by alphafold in the folder work_directory/ and put the resulting plots into a folder plots/.
The postprocessing is integrated in the setup script described below.
Create a job script
A job script is a Bash script containing commands to request computing resources, set up the computing environment, run the application and retrieve the results.
Here is a breakdown of a job script called run_alphafold.bsub.
Request computing resources
AlphaFold2 can run on CPUs only, or on CPUs and GPUs, which speed up the computation significantly. Here we request 12 CPU cores, 120 GB of memory in total, 120 GB of local scratch space in total, and one GPU with at least 10 GB of GPU memory.
    #!/usr/bin/bash
    #BSUB -n 12                                                    # Number of CPUs
    #BSUB -W 24:00                                                 # Runtime
    #BSUB -R "rusage[mem=10000, scratch=10000]"                    # CPU memory and scratch space per CPU core
    #BSUB -R "rusage[ngpus_excl_p=1] select[gpu_mtotal0>=10240]"   # Number of GPUs and GPU memory
    #BSUB -R "span[hosts=1]"                                       # All CPUs in the same host
    #BSUB -J alphafold                                             # Job name
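The memory and scratch values in the rusage string are per CPU core, so the totals quoted above are the per-core values multiplied by the number of cores. A minimal sketch of that arithmetic, with illustrative variable names that are not part of the job script:

```shell
#!/usr/bin/env bash
# Illustrative variable names; the values mirror the #BSUB lines above.
ncores=12                # from: #BSUB -n 12
mem_per_core=10000       # MB, from: rusage[mem=10000]
scratch_per_core=10000   # MB, from: rusage[scratch=10000]

# Totals are per-core values times the number of cores.
total_mem=$(( ncores * mem_per_core ))
total_scratch=$(( ncores * scratch_per_core ))

echo "Total memory:  ${total_mem} MB"
echo "Total scratch: ${total_scratch} MB"
```

Changing the number of cores therefore changes the total memory and scratch space unless the per-core values are adjusted as well.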
Set up a computing environment for AlphaFold
    source /cluster/apps/local/env2lmod.sh
    module load gcc/6.3.0 openmpi/4.0.2 alphafold/2.1.1
    source /cluster/apps/nss/alphafold/venv_alphafold/bin/activate
Enable Unified Memory (if needed)
If the input protein sequence is too large for the memory of a single GPU (roughly more than 1500 aa), enable Unified Memory. It bridges the system memory to the GPU memory so that you can oversubscribe the memory of a single GPU.
    ...
    #BSUB -R "rusage[ngpus_excl_p=4] select[gpu_mtotal0>=10240]"
    ...
    export TF_FORCE_UNIFIED_MEMORY=1
    export XLA_PYTHON_CLIENT_MEM_FRACTION="4.0"
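Whether Unified Memory is needed depends on the sequence length, which can be checked before submitting. The sketch below counts the residues in a FASTA file and exports the two variables only above a threshold. The function names are hypothetical, the 1500 aa default is just the rough figure quoted above, and for a file with several sequences the count is the sum over all of them.

```shell
#!/usr/bin/env bash
# Count residues in a FASTA file: skip header lines (starting with '>'),
# strip blanks and carriage returns, and sum the remaining line lengths.
count_residues() {
    awk '!/^>/ { gsub(/[ \t\r]/, ""); n += length($0) } END { print n + 0 }' "$1"
}

# Hypothetical helper: enable Unified Memory only for long inputs.
# Usage: maybe_enable_unified_memory <fasta file> [threshold, default 1500]
maybe_enable_unified_memory() {
    local fasta="$1" threshold="${2:-1500}"
    local n
    n=$(count_residues "$fasta")
    if [ "$n" -gt "$threshold" ]; then
        export TF_FORCE_UNIFIED_MEMORY=1
        export XLA_PYTHON_CLIENT_MEM_FRACTION="4.0"
    fi
}
```

In a job script this could be called as `maybe_enable_unified_memory "$FASTA_DIR/ubiquitin.fasta"` before the Python run script is started.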
Define paths
    # Define paths to databases, fasta file and output directory
    DATA_DIR="/cluster/project/alphafold"
    FASTA_DIR="/cluster/home/jarunanp/fastafiles"
    OUTPUT_DIR=${SCRATCH}/protein_name/output
For the output directory, there are two options.
- Use $SCRATCH (max. 2.7 TB), $HOME (max. 20 GB) or group storage (/cluster/project or /cluster/work), e.g.,
OUTPUT_DIR=${SCRATCH}/protein_name/output
- Use the local /scratch as the output directory. To do so, request the scratch space with BSUB options, e.g., here requesting 120GB scratch space in total. At the end of the computation, don't forget to copy the result from there.
    #BSUB -n 12
    #BSUB -R "rusage[scratch=10000]"
    ...
    OUTPUT_DIR=${TMPDIR}/output
    ...
    python /path/run_alphafold.py ...
    ...
    cp -r ${TMPDIR}/output ${SCRATCH}/protein_name
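With the local scratch option, the copy step is easy to lose if the run fails before reaching it. One possible way to guard against that is to register the copy as an exit handler. The sketch below uses a hypothetical helper `copy_back`; it would not help if the job is killed outright, for example when the run-time limit is exceeded.

```shell
#!/usr/bin/env bash
# Hypothetical helper: copy a result directory into a destination directory.
# Usage: copy_back <source dir> <destination dir>
copy_back() {
    local src="$1" dest="$2"
    [ -d "$src" ] || return 0          # nothing was written, nothing to copy
    mkdir -p "$dest" && cp -r "$src" "$dest"
}

# In a job script, the handler could be registered right after
# OUTPUT_DIR is defined, so that it also runs when a later command fails:
#   trap 'copy_back "${TMPDIR}/output" "${SCRATCH}/protein_name"' EXIT
```

As with the plain `cp -r` command, the results end up in `${SCRATCH}/protein_name/output`.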
Start Multi-Process Service on GPU (version >= 2.1.2)
As of version 2.1.2, relaxation can be run on the GPU with the option --use_gpu_relax=1. This option tries to create multiple contexts on the GPU, but the default GPU compute mode is exclusive and does not allow multiple contexts. This can be circumvented by starting the Multi-Process Service (MPS) with the command
nvidia-cuda-mps-control -d
Call Python run script
    python /cluster/apps/nss/alphafold/alphafold-2.1.1/run_alphafold.py \
    --data_dir=$DATA_DIR \
    --output_dir=$OUTPUT_DIR \
    --max_template_date="2021-12-06" \
    --bfd_database_path=$DATA_DIR/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
    --uniref90_database_path=$DATA_DIR/uniref90/uniref90.fasta \
    --uniclust30_database_path=$DATA_DIR/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
    --mgnify_database_path=$DATA_DIR/mgnify/mgy_clusters_2018_12.fa \
    --template_mmcif_dir=$DATA_DIR/pdb_mmcif/mmcif_files \
    --obsolete_pdbs_path=$DATA_DIR/pdb_mmcif/obsolete.dat \
Then, define the input FASTA file, select the model preset (monomer or multimer) and define the paths to the structure databases accordingly.
- For a monomeric protein
    --fasta_paths=$FASTA_DIR/ubiquitin.fasta \
    --model_preset=monomer \
    --pdb70_database_path=$DATA_DIR/pdb70/pdb70
- For a multimeric protein
    --fasta_paths=$FASTA_DIR/IFGSC_6mer.fasta \
    --model_preset=multimer \
    --pdb_seqres_database_path=$DATA_DIR/pdb_seqres/pdb_seqres.txt \
    --uniprot_database_path=$DATA_DIR/uniprot/uniprot.fasta
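The choice between the two presets can also be derived from the FASTA file. The sketch below assumes that a file with more than one sequence is a multimer and a file with one sequence is a monomer; the function names are hypothetical, and DATA_DIR is expected to be set as in "Define paths".

```shell
#!/usr/bin/env bash
# Count sequences: every FASTA record starts with a '>' header line.
count_sequences() {
    grep -c '^>' "$1"
}

# Hypothetical helper: print the preset-specific options, one per line.
# Assumption: more than one sequence in the file means a multimer.
preset_flags() {
    local fasta="$1"
    if [ "$(count_sequences "$fasta")" -gt 1 ]; then
        printf '%s\n' \
            "--model_preset=multimer" \
            "--pdb_seqres_database_path=${DATA_DIR}/pdb_seqres/pdb_seqres.txt" \
            "--uniprot_database_path=${DATA_DIR}/uniprot/uniprot.fasta"
    else
        printf '%s\n' \
            "--model_preset=monomer" \
            "--pdb70_database_path=${DATA_DIR}/pdb70/pdb70"
    fi
}
```

The printed options could be collected with `mapfile -t flags < <(preset_flags "$fasta")` and appended to the Python call as `"${flags[@]}"`.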
Enable relaxation on GPU (version >= 2.1.2)
As of version 2.1.2, relaxation can be run on the GPU with the option --use_gpu_relax. MPS must be started beforehand, as described in the section on starting Multi-Process Service.
--use_gpu_relax=1
Disable Multi-Process Service (version >= 2.1.2)
If MPS was started before running AlphaFold, stop it after the run with the command
echo quit | nvidia-cuda-mps-control
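The start and stop commands for MPS belong together, so it can help to wrap them in two small functions and register the stop as an exit handler. This is only a sketch: the function names and the MPS_CONTROL variable are hypothetical, and the variable exists solely so that the control command can be swapped out when trying the sketch on a machine without a GPU.

```shell
#!/usr/bin/env bash
# Command used to control MPS; overridable for testing (hypothetical variable).
MPS_CONTROL="${MPS_CONTROL:-nvidia-cuda-mps-control}"

# Start the MPS daemon (same command as in the job script above).
start_mps() {
    "$MPS_CONTROL" -d
}

# Stop the MPS daemon by sending it the quit command.
stop_mps() {
    echo quit | "$MPS_CONTROL"
}

# In a job script:
#   start_mps
#   trap stop_mps EXIT      # stop MPS even if the AlphaFold run fails
#   python .../run_alphafold.py ... --use_gpu_relax=1
```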
Submit a job
Submit a job with the command
$ bsub < run_alphafold.bsub
The screen output is saved in an output file whose name starts with lsf.o followed by the job ID, e.g., lsf.o195525946. Please see this page for how to read the output file.
In our benchmark, it took around 40 minutes to fold Ubiquitin (76 aa) and 2.5 hours to fold T1050 (779 aa).
Setup script
This setup script creates a job script with estimated computing resources depending on the input protein sequence. To download the setup script:
git clone https://gitlab.ethz.ch/sis/alphafold_on_euler
Usage:
./setup_alphafold_run_script.sh -f [Fasta file] -w [work directory] --max_template_date yyyy-mm-dd
Example:
    $ ./setup_alphafold_run_script.sh -f ../../fastafiles/IFGSC_6mer.fasta -w $SCRATCH
    Reading /cluster/home/jarunanp/alphafold_run/fastafiles/IFGSC_6mer.fasta
    Protein name: IFGSC_6mer
    Number of sequences: 6
    Protein type: multimer
    Number of amino acids:
        sum: 1246
        max: 242
    Estimate required resources:
        Run time: 24:00
        Number of CPUs: 12
        Total CPU memory: 120000
        Number of GPUs: 1
        Total GPU memory: 20480
        Total scratch space: 120000
    Output an LSF run script for AlphaFold2:
        /cluster/scratch/jarunanp/run_alphafold.bsub
Further readings
- DeepMind Blog post: "AlphaFold: a solution to a 50-year-old grand challenge in biology"
- ETH News: "Computer algorithms are currently revolutionising biology"
- AlphaFold2 presentation slides 21 March 2022
- Downloading AlphaFold databases and benchmark results