Downloading AlphaFold databases


The AlphaFold databases are available for all cluster users at /cluster/project/alphafold.

Alternatively, users can download the databases to their personal $SCRATCH. If there are several AlphaFold users in your group, institute or department, we recommend using a group storage under /cluster/project instead. The total size of the AlphaFold databases when unzipped is ~2.3TB.

Download the AlphaFold databases to your $SCRATCH

  • Download and install aria2c in your $HOME
$ cd $HOME
$ wget
$ tar xvzf aria2-1.36.0.tar.gz
$ cd aria2-1.36.0
$ module load gcc/6.3.0 gnutls/3.5.13 openssl/1.0.1e
$ ./configure --prefix=$HOME/.local
$ make
$ make install
$ export PATH="$HOME/.local/bin:$PATH"
$ which aria2c
  • Check whether you have enough space in your $SCRATCH for the ~2.3TB of unzipped databases. You may need to free up space in your $SCRATCH if there is not enough.
$ lquota
| Storage location:           | Quota type: | Used:            | Soft quota:      | Hard quota:      |
| /cluster/home/jarunanp      | space       |         10.38 GB |         17.18 GB |         21.47 GB |
| /cluster/home/jarunanp      | files       |            85658 |           160000 |           200000 |
| /cluster/shadow             | space       |         16.38 kB |          2.15 GB |          2.15 GB |
| /cluster/shadow             | files       |                7 |            50000 |            50000 |
| /cluster/scratch/jarunanp   | space       |          2.42 TB |          2.50 TB |          2.70 TB |
| /cluster/scratch/jarunanp   | files       |           201844 |          1000000 |          1500000 |
  • Create a folder for the databases in your $SCRATCH
$ mkdir $SCRATCH/alphafold_databases
  • During the download, we saw an error that rsync could not rename the file. Therefore, the option -O (--omit-dir-times) had to be added to the rsync command:
rsync --recursive --links --perms --times --compress --info=progress2 --delete --port=33444 -O \
  • Download the databases: you can call a script to download all the databases or call a script for each database one-by-one. These scripts are in the same directory $ALPHAFOLD_ROOT/scripts/.
$ bsub -W 24:00 "$ALPHAFOLD_ROOT/scripts/ $SCRATCH/alphafold_databases"
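
The space check and folder creation from the steps above can also be scripted. The following is a minimal sketch, not part of the official instructions: has_enough_space is a hypothetical helper name, and it relies on df, which reports the free space of the whole file system rather than your personal quota, so lquota remains the authoritative check on the cluster.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of AlphaFold or the cluster tools):
# succeeds if the file system holding DIR reports at least REQUIRED_KB
# kilobytes available. Note: df does not know about per-user quotas.
has_enough_space() {
    local dir="$1" required_kb="$2" avail_kb
    # -P forces one output line per file system, -k reports 1024-byte blocks
    avail_kb=$(df -Pk "$dir" 2>/dev/null | awk 'NR==2 {print $4}')
    [ -n "$avail_kb" ] && [ "$avail_kb" -ge "$required_kb" ]
}

# Roughly 2.3 TB expressed in kB (rounded, an approximation)
REQUIRED_KB=$(( 2300 * 1000 * 1000 ))
```

A possible use is `has_enough_space "$SCRATCH" "$REQUIRED_KB" && mkdir -p "$SCRATCH/alphafold_databases"`, which creates the folder only if the check passes.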

Benchmark results

AlphaFold2 uses HHsearch and HHblits from the HH-suite to perform protein sequence searches. The HH-suite searches perform many random file accesses and small read operations. Therefore, it is recommended to store the AlphaFold databases on a solid state drive (SSD), which offers significantly more input/output operations per second (IOPS) than a traditional mechanical hard disk drive (HDD).

We tested the performance of AlphaFold in folding two proteins (ubiquitin with a length of 76 amino acids, T1050 with a length of 779 amino acids) while reading the AlphaFold databases from our three central storage systems.

  • /cluster/scratch is a fast, short-term, personal storage system based on SSD
  • /cluster/project is a long-term group storage system which uses HDD for the permanent storage and NVMe flash caches to accelerate the reading speed
  • /cluster/work is a fast, long-term, group storage system based on HDD and suitable for large files

The tests ran on two of the NVIDIA GPU models available on Euler, RTX 2080 Ti and TITAN RTX (see the GPU specs here). Each job was allocated 12 CPU cores, 1 GPU, 120 GB of total memory and 120 GB of total scratch space. The figures below show the benchmark results, which are the average runtime of five runs for the tests with the databases on /cluster/scratch and /cluster/project. The tests with the databases on /cluster/work were run only once because the many small reads on this storage system significantly degrade not only the performance of these particular tests but also the overall performance of the whole /cluster/work storage system. The tested compute nodes were not reserved for testing, i.e., they might have been loaded by other computations while the AlphaFold tests were running.
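
As an illustration of the resource allocation described above, the sketch below assembles a bsub resource request for 12 cores, 1 GPU, 120 GB of memory and 120 GB of scratch. It rests on assumptions that are not stated on this page: that memory and scratch are requested per core in MB, and that the resource names scratch and ngpus_excl_p exist on the cluster. build_bsub_cmd is a hypothetical helper; the actual commands used for the benchmark are not shown here.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print a bsub command prefix for the given totals.
# Assumptions: rusage values are per core and in MB; "scratch" and
# "ngpus_excl_p" are site-specific resource names that may differ.
build_bsub_cmd() {
    local cores="$1" total_mem_mb="$2" total_scratch_mb="$3" gpus="$4"
    local mem_per_core=$(( total_mem_mb / cores ))
    local scratch_per_core=$(( total_scratch_mb / cores ))
    printf 'bsub -n %d -R "rusage[mem=%d,scratch=%d,ngpus_excl_p=%d]"\n' \
        "$cores" "$mem_per_core" "$scratch_per_core" "$gpus"
}

# 12 cores, 120 GB memory, 120 GB scratch, 1 GPU
build_bsub_cmd 12 120000 120000 1
```

The printed prefix would still need a run time limit and the job command appended before it could be submitted.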

Benchmark ubiquitin 1gpu.jpg

Fig 1: The performance results of AlphaFold2 in folding the Ubiquitin structure

Benchmark T1050 1gpu.jpg

Fig 2: The performance results of AlphaFold2 in folding the T1050 structure

Alphafold ubiquitin.png

Fig 3: This figure shows a cartoon representation of two superimposed ubiquitin structures. Ubiquitin is a small monomeric protein with 76 amino acids. The structure in blue has been determined experimentally (X-ray crystallography, PDB code: 1upq.pdb). The model in green shows the structure predicted by AlphaFold2. The RMSD (root-mean-square deviation) between the two structures is 0.797 Å. The RMSD has been calculated for the backbone atoms. (Image and caption text by Dr. Simon Rüdisser, BNSP)

Alphafold T1050.png

Fig 4: The five models of T1050 generated by AlphaFold2, shown in cartoon representation. T1050 is a monomeric protein with 779 amino acids. T1050 is one of the targets from the CASP (Critical Assessment of Techniques for Protein Structure Prediction) initiative. (Image and caption text by Dr. Simon Rüdisser, BNSP)

In our tests folding the two proteins with AlphaFold, /cluster/project proved to be the best choice of group storage for the AlphaFold databases. The performance of AlphaFold when reading the data from /cluster/scratch and /cluster/project is comparable, and around 10 times faster than when reading the data from /cluster/work. /cluster/scratch is intended for short-term, personal storage only and is therefore not an optimal solution for a group of users.
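
To repeat this kind of comparison with your own runs, the averaging and the speedup factor can be computed as in the sketch below. The helper names mean and speedup are hypothetical, and the numbers in the usage note are made-up placeholders, not the measurements behind the figures on this page.

```shell
#!/usr/bin/env bash

# Hypothetical helper: arithmetic mean of the runtimes given as arguments.
mean() {
    printf '%s\n' "$@" | LC_ALL=C awk '{ sum += $1 } END { if (NR > 0) printf "%.1f\n", sum / NR }'
}

# Hypothetical helper: how many times faster the second runtime is
# compared to the first (slow runtime divided by fast runtime).
speedup() {
    LC_ALL=C awk -v slow="$1" -v fast="$2" 'BEGIN { printf "%.1f\n", slow / fast }'
}
```

For example, with placeholder runtimes in minutes, `mean 10 20 30 40 50` prints 30.0 and `speedup 300 30` prints 10.0.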
