AlphaFold2

From ScientificComputing

Revision as of 10:05, 17 December 2021

Load modules

AlphaFold2 is installed in the new software stack and can be loaded as follows.

$ env2lmod
$ module load gcc/6.3.0 openmpi/4.0.2 alphafold/2.1.1
Now run 'alphafold_init' to initialize the virtual environment

The following have been reloaded with a version change:
  1) gcc/4.8.5 => gcc/6.3.0

$ alphafold_init
(venv_alphafold) [jarunanp@eu-login-18 ~]$ 

Databases

The AlphaFold databases have a total size of 2.2 TB when unzipped. Users can download the databases to $SCRATCH. However, if there are several AlphaFold users in your group, institute or department, we recommend using a group storage.

For D-BIOL members, the AlphaFold databases are currently located at /cluster/work/biol/alphafold.

Download the AlphaFold databases to your $SCRATCH

  • Download and install aria2c in your $HOME
$ cd $HOME
$ wget https://github.com/aria2/aria2/releases/download/release-1.36.0/aria2-1.36.0.tar.gz
$ tar xvzf aria2-1.36.0.tar.gz
$ cd aria2-1.36.0
$ module load gcc/6.3.0 gnutls/3.5.13 openssl/1.0.1e
$ ./configure --prefix=$HOME/.local
$ make
$ make install
$ export PATH="$HOME/.local/bin:$PATH"
$ which aria2c
~/.local/bin/aria2c
  • Check if you have enough space in your $SCRATCH. You may need to free up space in your $SCRATCH if there is not enough.
$ lquota
+-----------------------------+-------------+------------------+------------------+------------------+
| Storage location:           | Quota type: | Used:            | Soft quota:      | Hard quota:      |
+-----------------------------+-------------+------------------+------------------+------------------+
| /cluster/home/jarunanp      | space       |         10.38 GB |         17.18 GB |         21.47 GB |
| /cluster/home/jarunanp      | files       |            85658 |           160000 |           200000 |
+-----------------------------+-------------+------------------+------------------+------------------+
| /cluster/shadow             | space       |         16.38 kB |          2.15 GB |          2.15 GB |
| /cluster/shadow             | files       |                7 |            50000 |            50000 |
+-----------------------------+-------------+------------------+------------------+------------------+
| /cluster/scratch/jarunanp   | space       |          2.42 TB |          2.50 TB |          2.70 TB |
| /cluster/scratch/jarunanp   | files       |           201844 |          1000000 |          1500000 |
+-----------------------------+-------------+------------------+------------------+------------------+
  • Create a folder for the databases
$ cd $SCRATCH
$ mkdir alphafold_databases
  • Download the databases: you can call one script to download all the databases or call a separate script for each database. These scripts are all in the directory $ALPHAFOLD_ROOT/scripts/.
$ bsub -W 24:00 "$ALPHAFOLD_ROOT/scripts/download_all_data.sh $SCRATCH/alphafold_databases"
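The space check in the steps above comes down to simple arithmetic: the space already used plus the 2.2 TB of databases has to stay below the quota. The following is a minimal sketch of that comparison. The function name fits_in_quota is made up for illustration, and the numbers are typed in by hand from the example lquota table; the sketch does not parse the lquota output itself.

```shell
#!/usr/bin/env bash
# Sketch: does a download of NEEDED TB still fit under a quota?
# All values are in TB and are copied by hand from the lquota output.
# fits_in_quota is a hypothetical helper, not a tool provided on the cluster.

# fits_in_quota USED QUOTA NEEDED -> prints "yes" or "no"
fits_in_quota() {
    awk -v used="$1" -v quota="$2" -v needed="$3" \
        'BEGIN { if (used + needed <= quota) print "yes"; else print "no" }'
}

# Values from the example table: 2.42 TB used, 2.50 TB soft quota,
# plus the 2.2 TB needed for the databases.
fits_in_quota 2.42 2.50 2.2   # prints "no"

# The same quota after cleaning up to 0.10 TB used.
fits_in_quota 0.10 2.50 2.2   # prints "yes"
```

With the values shown in the example table, the databases would not fit, so the scratch directory would have to be cleaned up first.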

Submit a job

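Here is an example of a job submission script (job_script.bsub) which requests 12 CPU cores, one GPU, 120 GB of memory in total and 120 GB of local scratch space in total. The mem and scratch values in the rusage string are given per core in MB, so 10000 MB on each of the 12 cores adds up to 120 GB.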

#!/usr/bin/bash
#BSUB -n 12
#BSUB -W 4:00
#BSUB -R "rusage[mem=10000, scratch=10000, ngpus_excl_p=1]"
#BSUB -J alphafold

source /cluster/apps/local/env2lmod.sh
module load gcc/6.3.0 openmpi/4.0.2 alphafold/2.1.1
source /cluster/apps/nss/alphafold/venv_alphafold/bin/activate

# Define the path to the databases (adjust to your own download location)
DATA_DIR="/cluster/scratch/jarunanp/21_10_alphafold_databases"

python /cluster/apps/nss/alphafold/alphafold-2.1.1/run_alphafold.py \
--data_dir="$DATA_DIR" \
--output_dir="$TMPDIR" \
--max_template_date="2021-12-06" \
--bfd_database_path="$DATA_DIR/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt" \
--uniref90_database_path="$DATA_DIR/uniref90/uniref90.fasta" \
--uniclust30_database_path="$DATA_DIR/uniclust30/uniclust30_2018_08/uniclust30_2018_08" \
--mgnify_database_path="$DATA_DIR/mgnify/mgy_clusters_2018_12.fa" \
--pdb70_database_path="$DATA_DIR/pdb70/pdb70" \
--template_mmcif_dir="$DATA_DIR/pdb_mmcif/mmcif_files" \
--obsolete_pdbs_path="$DATA_DIR/pdb_mmcif/obsolete.dat" \
--fasta_paths=ubiquitin.fasta

# Copy the results from the compute node
mkdir -p output
cp -r $TMPDIR/* output
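
The resource request in the script above can be derived from the totals. The sketch below assumes that the mem and scratch values in the rusage string are counted per requested core in MB, and that 1 GB is taken as 1000 MB, which is what the numbers in the script imply (10000 MB on 12 cores for 120 GB). The helper name per_core_mb is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: turn a total memory or scratch request into the per-core value
# used in "#BSUB -R rusage[...]". Assumes per-core accounting in MB and
# 1 GB = 1000 MB, as implied by the example job script.
# per_core_mb is a hypothetical helper, not part of LSF.

# per_core_mb TOTAL_GB CORES -> prints the per-core value in MB
per_core_mb() {
    local total_gb="$1" cores="$2"
    echo $(( total_gb * 1000 / cores ))
}

cores=12
mem=$(per_core_mb 120 "$cores")       # 120 GB of memory in total
scratch=$(per_core_mb 120 "$cores")   # 120 GB of local scratch in total

# Print the two header lines as they appear in the job script.
echo "#BSUB -n $cores"
echo "#BSUB -R \"rusage[mem=$mem, scratch=$scratch, ngpus_excl_p=1]\""
```

For 12 cores and 120 GB in total, this prints the same mem=10000 and scratch=10000 values as the job script. With a different number of cores, the per-core values change accordingly.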

Submit a job with the command

$ bsub < job_script.bsub
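
Before submitting, it may help to check that the database directory contains the sub-directories the job script refers to, since a missing one would only show up after the job has started. The following is a minimal sketch of such a check. The function name check_databases is made up for illustration, the list of sub-directories is taken from the database paths in the job script, and the default path assumes the download location used earlier on this page.

```shell
#!/usr/bin/env bash
# Sketch: verify that the database sub-directories named in the job script
# exist. check_databases is a hypothetical helper, not part of AlphaFold.

# check_databases DATA_DIR -> prints each missing sub-directory,
# returns 0 if all are present and 1 otherwise
check_databases() {
    local data_dir="$1" missing=0 sub
    for sub in bfd uniref90 uniclust30 mgnify pdb70 pdb_mmcif; do
        if [ ! -d "$data_dir/$sub" ]; then
            echo "missing: $data_dir/$sub"
            missing=1
        fi
    done
    return "$missing"
}

# Example call: uses DATA_DIR if it is set, otherwise falls back to the
# assumed download location $SCRATCH/alphafold_databases.
if check_databases "${DATA_DIR:-$SCRATCH/alphafold_databases}"; then
    echo "all database directories found"
fi
```

The check only looks at directory names; it does not verify that the downloads are complete.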

The screen output is saved in an output file whose name starts with lsf.o followed by the JobID, e.g., lsf.o195525946. Please see this page for how to read the output file.

In our test, folding ubiquitin.fasta with AlphaFold took around 40 minutes with the databases stored on $SCRATCH.

Benchmark results

AlphaFold2 uses HHsearch and HHblits from the HH-suite to perform protein sequence searches. The HH-suite searches perform many random file accesses and read operations. Therefore, we recommend storing the AlphaFold databases on a solid state drive (SSD), which delivers significantly more input/output operations per second (IOPS) than a traditional mechanical hard disk drive (HDD).

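We tested the performance of AlphaFold by folding two proteins, ubiquitin.fasta with a length of 76 amino acids (https://www.ebi.ac.uk/pdbe/entry/pdb/3h7p/protein/1) and T1050.fasta with a length of 779 amino acids (https://www.predictioncenter.org/casp14/target.cgi?target=T1050), while reading the AlphaFold databases from our three central storage systems.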

  • /cluster/scratch is a fast, short-term, personal storage system based on SSD
  • /cluster/project is a long-term group storage system which uses HDD for the permanent storage and NVMe flash caches to accelerate the reading speed
  • /cluster/work is a fast, long-term, group storage system based on HDD and suitable for large files

The tests ran on four NVIDIA GPU models available on Euler: RTX 2080 Ti, TITAN RTX, GTX 1080 Ti and GTX 1080. All jobs were allocated 12 CPU cores, 1 GPU, 120 GB of memory in total and 120 GB of scratch space in total. The figures below show the benchmark results, which are the average runtimes of five runs for the tests with the databases on /cluster/scratch and /cluster/project. The tests with the databases on /cluster/work were run only once because the small reads on this storage system significantly degrade not only the performance of these particular tests but also the overall performance of the whole /cluster/work storage system. The tested compute nodes were not reserved for testing and we did not control

[Figure: Benchmark ubiquitin 1gpu.jpg] [Figure: Benchmark T1050 1gpu.jpg]

In the tests with the two proteins, AlphaFold performed best when reading the databases from /cluster/scratch or /cluster/project and running on RTX 2080 Ti or TITAN RTX. Reading the databases from /cluster/scratch and from /cluster/project shows comparable performance, while /cluster/work is around 10 times slower in this case. As /cluster/scratch is for short-term storage and only for personal use, /cluster/project is the best choice for a group of AlphaFold users.
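
The averaging and the slowdown factor mentioned above could be computed along the following lines. This is only a sketch: the function names mean_runtime and slowdown are made up for illustration, and the runtimes in the example are placeholders chosen to make the arithmetic easy to follow, not measurements from this benchmark.

```shell
#!/usr/bin/env bash
# Sketch: average several runtimes and compare two storage systems.
# mean_runtime and slowdown are hypothetical helpers.

# mean_runtime T1 T2 ... -> prints the arithmetic mean with one decimal
mean_runtime() {
    printf '%s\n' "$@" | awk '{ sum += $1 } END { if (NR > 0) printf "%.1f\n", sum / NR }'
}

# slowdown SLOW FAST -> prints how many times slower SLOW is than FAST
slowdown() {
    awk -v slow="$1" -v fast="$2" 'BEGIN { printf "%.1f\n", slow / fast }'
}

# Placeholder runtimes in minutes, NOT measurements from the benchmark:
# five runs on one storage system and a single run on another.
fast_mean=$(mean_runtime 40 42 38 41 39)
slow_single=400

echo "mean of five runs: $fast_mean min"
echo "slowdown of the single run: $(slowdown "$slow_single" "$fast_mean")x"
```

Because the compute nodes were shared, a mean over several runs smooths out some of the run-to-run variation, while a single run, as on /cluster/work, cannot.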

Further readings