AlphaFold2


Load modules

AlphaFold2 is installed in the new software stack and can be loaded as follows:

$ env2lmod
$ module load gcc/6.3.0 openmpi/4.0.2 alphafold/2.1.1
Now run 'alphafold_init' to initialize the virtual environment

The following have been reloaded with a version change:
  1) gcc/4.8.5 => gcc/6.3.0

$ alphafold_init
(venv_alphafold) [jarunanp@eu-login-18 ~]$ 
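
The (venv_alphafold) prefix in the prompt shows that the virtual environment is active. As an optional sanity check, you can confirm that python now resolves to the interpreter of the AlphaFold virtual environment (the path below assumes the central installation used in the job script further down; it may differ on your system):

$ which python
/cluster/apps/nss/alphafold/venv_alphafold/bin/python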

Databases

The AlphaFold databases have a total size of 2.2 TB when unzipped. Users can download the databases to their $SCRATCH. However, if there are several AlphaFold users in your group, institute or department, we recommend using a group storage share instead.

For D-BIOL members, the AlphaFold databases are currently located at /cluster/work/biol/alphafold.

Download the AlphaFold databases to your $SCRATCH

  • Download and install aria2c in your $HOME
$ cd $HOME
$ wget https://github.com/aria2/aria2/releases/download/release-1.36.0/aria2-1.36.0.tar.gz
$ tar xvzf aria2-1.36.0.tar.gz
$ cd aria2-1.36.0
$ module load gcc/6.3.0 gnutls/3.5.13 openssl/1.0.1e
$ ./configure --prefix=$HOME/.local
$ make
$ make install
$ export PATH="$HOME/.local/bin:$PATH"
$ which aria2c
~/.local/bin/aria2c
  • Check whether you have enough space in your $SCRATCH. You may need to free up space if there is not enough.
$ lquota
+-----------------------------+-------------+------------------+------------------+------------------+
| Storage location:           | Quota type: | Used:            | Soft quota:      | Hard quota:      |
+-----------------------------+-------------+------------------+------------------+------------------+
| /cluster/home/jarunanp      | space       |         10.38 GB |         17.18 GB |         21.47 GB |
| /cluster/home/jarunanp      | files       |            85658 |           160000 |           200000 |
+-----------------------------+-------------+------------------+------------------+------------------+
| /cluster/shadow             | space       |         16.38 kB |          2.15 GB |          2.15 GB |
| /cluster/shadow             | files       |                7 |            50000 |            50000 |
+-----------------------------+-------------+------------------+------------------+------------------+
| /cluster/scratch/jarunanp   | space       |          2.42 TB |          2.50 TB |          2.70 TB |
| /cluster/scratch/jarunanp   | files       |           201844 |          1000000 |          1500000 |
+-----------------------------+-------------+------------------+------------------+------------------+
  • Create a folder for the databases
$ cd $SCRATCH
$ mkdir alphafold_databases
  • Download the databases: you can either call one script that downloads all of the databases at once, or call a separate script for each database (see the example after this list). All of these scripts are located in $ALPHAFOLD_ROOT/scripts/. To download everything in a single batch job:
$ bsub -W 24:00 "$ALPHAFOLD_ROOT/scripts/download_all_data.sh $SCRATCH/alphafold_databases"
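
If you only need one particular database, the per-database scripts can be submitted the same way. A sketch, assuming the scripts follow the download_<name>.sh naming of the AlphaFold repository (here UniRef90):

$ bsub -W 24:00 "$ALPHAFOLD_ROOT/scripts/download_uniref90.sh $SCRATCH/alphafold_databases"

Once the downloads have finished, you can verify the size of each database directory with du -sh $SCRATCH/alphafold_databases/*.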

Submit a job

Here is an example of a job submission script (job_script.bsub) which requests 12 CPU cores, 120 GB of memory in total, 120 GB of local scratch space in total, and one GPU. Note that LSF rusage values are specified per core, so mem=10000 and scratch=10000 (MB per core) amount to 12 × 10 GB each.

#!/usr/bin/bash
#BSUB -n 12
#BSUB -W 4:00
#BSUB -R "rusage[mem=10000, scratch=10000, ngpus_excl_p=1]"
#BSUB -J alphafold

source /cluster/apps/local/env2lmod.sh
module load gcc/6.3.0 openmpi/4.0.2 alphafold/2.1.1
source /cluster/apps/nss/alphafold/venv_alphafold/bin/activate

# Define paths to databases
DATA_DIR="/cluster/scratch/jarunanp/21_10_alphafold_databases"
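# (adapt DATA_DIR to where you downloaded the databases,
#  e.g., $SCRATCH/alphafold_databases)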

python /cluster/apps/nss/alphafold/alphafold-2.1.1/run_alphafold.py \
--data_dir=$DATA_DIR \
--output_dir=$TMPDIR \
--max_template_date="2021-12-06" \
--bfd_database_path=$DATA_DIR/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
--uniref90_database_path=$DATA_DIR/uniref90/uniref90.fasta \
--uniclust30_database_path=$DATA_DIR/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
--mgnify_database_path=$DATA_DIR/mgnify/mgy_clusters_2018_12.fa \
--pdb70_database_path=$DATA_DIR/pdb70/pdb70 \
--template_mmcif_dir=$DATA_DIR/pdb_mmcif/mmcif_files \
--obsolete_pdbs_path=$DATA_DIR/pdb_mmcif/obsolete.dat \
--fasta_paths=ubiquitin.fasta

# Copy the results from the compute node
mkdir -p output
cp -r $TMPDIR/* output
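
The script expects the input file ubiquitin.fasta in the directory from which the job is submitted. For reference, a minimal FASTA input could look as follows (the sequence shown is the canonical 76-residue human ubiquitin sequence; replace it with your own target):

>ubiquitin
MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG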

Submit a job with the command

$ bsub < job_script.bsub
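
You can check the status of the job with the standard LSF command bjobs:

$ bjobs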

The screen output is saved in an output file whose name starts with lsf.o followed by the job ID, e.g., lsf.o195525946. Please see this page for how to read the output file.

For a first test we used ubiquitin.fasta; the run took around 40 minutes with the databases stored on $SCRATCH.
