MATLAB PCT


MATLAB's Parallel Computing Toolbox (PCT) lets you run suitably written programs in parallel or as a set of independent jobs. Several cores calculate different parts of a problem, possibly at the same time, to reduce the total time-to-solution.

A trivial program that uses a parpool (a pool of workers) is shown below. It calculates the squares of the first ten integers in parallel and stores them in an array:

squares = zeros(10,1);
pool = parpool(4);
parfor i = 1:10
    squares(i) = i^2;
end
disp(squares)
pool.delete()

You can use the Parallel Computing Toolbox (PCT) on Euler in two ways; which one is best depends on the properties of your program. The first is to submit a multi-core job to the batch system and use the local parpool: the parallel part of your program (for example, the parfor loop above) runs within your job. The second is to submit a single-core master job and use the SLURM parpool: MATLAB itself submits a parallel job to compute just the parallel part of your program.


Local parpool

Set up MATLAB to use SLURM local parpool

One-time preparation: Before using the SLURM job pool for the first time, you need to import a cluster profile. For that, start MATLAB and call configCluster. For each cluster, configCluster only needs to be called once per version of MATLAB. Be aware that running this command more than once per version resets your cluster profile to its default settings and erases any saved modifications to the profile. Also note that this command only works with the most recent MATLAB versions on the cluster (R2023b and later).
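
You can check which release a MATLAB session is running before calling configCluster; the snippet below is a minimal sketch using MATLAB's standard version function:

>> % configCluster requires R2023b or newer on the cluster
>> version('-release')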

Use a local parpool

Illustration of a typical parallel job using the local pool. The job has three computational parts: A, B, and C, where part B can run in parallel. Gray rectangles show busy cores. White rectangles show idle cores (wasted time).

When you use the local parpool, you submit a multi-core job to SLURM. MATLAB will run additional worker processes within your multi-core job to process the parallel part of your program. A diagram of this is shown to the right.

A trivial parallel program (simulation.m) is shown below:

% Preallocate the output array
squares = zeros(100,1);
% Start a thread-based pool of 4 workers inside this SLURM job
pool = parpool("threads", 4);
parfor i = 1:100
    squares(i) = i^2;   % iterations are distributed across the workers
end
disp(squares)
% Shut down the pool when done
pool.delete()

To submit this program, pass the number of cores to the sbatch --cpus-per-task argument. This number should be greater than or equal to the pool size requested in your MATLAB script (e.g., 4).

sbatch --ntasks=1  --cpus-per-task=4 --time=1:00:00 --mem-per-cpu=2g --wrap="matlab -nodisplay -singleCompThread -r simulation"

You must not use the -nojvm MATLAB argument but you should include the -singleCompThread MATLAB argument. MATLAB is quite memory-hungry, so request at least 2 GB of memory per core as shown above.
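
To keep the pool size in your script consistent with the sbatch request, you can read the core count from SLURM's environment instead of hard-coding it. A minimal sketch, assuming the job was submitted with --cpus-per-task as above:

num_workers = str2double(getenv('SLURM_CPUS_PER_TASK'));  % set by SLURM when --cpus-per-task is given
pool = parpool("threads", num_workers);                   % pool size matches the job allocation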

The local parpool is limited to 12 cores in releases up to R2016a (8.7/9.0). From release R2016b (9.1) on, you can use all the cores of an Euler node (effectively up to 192).

SLURM parpool

Please note that SLURM parpools with a smaller number of cores currently work well, while SLURM parpools with 100 or more cores fail due to an error that we are still investigating.

Set up MATLAB to use SLURM parpool

One-time preparation: Before using the SLURM job pool for the first time, you need to import a cluster profile. For that, start MATLAB and call configCluster. For each cluster, configCluster only needs to be called once per version of MATLAB. Be aware that running this command more than once per version resets your cluster profile to its default settings and erases any saved modifications to the profile.
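
For reference, the one-time import looks like this in an interactive MATLAB session (a sketch; configCluster may print additional, cluster-specific instructions):

>> % Import the cluster profile (once per MATLAB version)
>> configCluster

>> % Afterwards, parcluster returns the imported SLURM profile by default
>> c = parcluster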

Use a SLURM parpool

Illustration of a typical parallel job using the SLURM pool. The job has three computational parts: A, B, and C, where part B can run in parallel. Gray rectangles show busy cores. White rectangles show idle cores (wasted time).

When you use the SLURM parpool, you submit a single-core job to SLURM. MATLAB will submit an additional parallel job to run the MATLAB workers to process the parallel part of your program. A diagram of this is shown to the right.

A trivial parallel program (simulation.m) is shown below:

% Preallocate the output array
squares = zeros(10,1);
% Get a handle to the SLURM cluster profile imported by configCluster
batch_job = parcluster;
% Start a pool of 4 workers; MATLAB submits a separate SLURM job for them
pool = parpool(batch_job, 4);
parfor i = 1:10
    squares(i) = i^2;   % iterations are distributed across the workers
end
disp(squares)
% Shut down the pool (and its worker job) when done
pool.delete()

To submit this program, just submit your MATLAB program (the master job) as a serial (single-core) job:

sbatch -n 1 --time=120:00:00 --mem-per-cpu=2g  --wrap="matlab -nodisplay -singleCompThread -r simulation"

The master job is assumed not to need much CPU power; however, it may need to run for a long time, since it waits for the parallel pool job to start and run.

You must not use the -nojvm MATLAB argument but you should include the -singleCompThread MATLAB argument. MATLAB is quite memory-hungry, so request at least 2 GB of memory as shown above.

Older versions of MATLAB used a matlabpool instead of a parpool.

Change the settings of a SLURM parpool

You can change the settings of the SLURM jobs that the SLURM parpool will submit, such as requesting more time or memory. To do this, set the corresponding fields of the cluster object's AdditionalProperties in MATLAB. Here are a few examples:

>> % First, get a handle to the cluster:
>> c = parcluster;

>> % Specify the account to use
>> c.AdditionalProperties.AccountName = 'account-name';

>> % Request email notification of job status
>> c.AdditionalProperties.EmailAddress = 'user-id@id.ethz.ch';

>> % Specify GPU options
>> c.AdditionalProperties.GpusPerNode = 1;
>> c.AdditionalProperties.GpuMem = '10g';

>> % Specify memory to use, per core (default: 4gb)
>> c.AdditionalProperties.MemUsage = '6gb';

>> % Specify the wall time (e.g., 5 hours)
>> c.AdditionalProperties.WallTime = '05:00:00';

For the above changes to persist between MATLAB sessions, save the profile after modifying AdditionalProperties:

>> c.saveProfile

To see the values of the current configuration options, display AdditionalProperties.

>> % To view current properties
>> c.AdditionalProperties


Submit an independent batch job

Use the batch command to submit asynchronous jobs to the cluster. The batch command will return a job object which is used to access the output of the submitted job. See the MATLAB documentation for more help on batch.

>> % First, get a handle to the cluster
>> c = parcluster;

>> % Then submit job to query where MATLAB is running on the cluster
>> job = c.batch(@pwd, 1, {}, 'CurrentFolder','.');

>> % Query job for state
>> job.State

>> % If state is finished, fetch the results
>> job.fetchOutputs{:}

>> % Delete the job after results are no longer needed
>> job.delete

To retrieve a list of currently running or completed jobs, call parcluster to retrieve the cluster object. The cluster object stores an array of jobs that were run, are running, or are queued to run:

>> c = parcluster;
>> jobs = c.Jobs;

To view the results of a previously completed job, first get a handle to it by its ID, then fetch its outputs:

>> % Get a handle to the job with ID 2
>> job2 = c.Jobs(2);

>> % Fetch its results
>> job2.fetchOutputs{:}

To see how to submit parallel workflows with the batch command, let’s use the following example, which is saved as parallel_example.m.

function [t, A] = parallel_example(iter)
% Run 'iter' parfor iterations and return the elapsed time t and the results A.
if nargin==0
   iter = 8;      % default number of iterations
end

disp('Start sim')

t0 = tic;
parfor idx = 1:iter
   A(idx) = idx;  % A is a sliced output variable, filled in parallel
   pause(2)       % simulate two seconds of work per iteration
   idx            % display the index being processed
end
t = toc(t0);

disp('Sim completed')

save RESULTS A    % write the results to RESULTS.mat

end


This time, to run a parallel job with the batch command, we'll also specify a pool of MATLAB workers.

>> % Get a handle to the cluster
>> c = parcluster;

>> % Submit a batch pool job using 4 workers for 16 simulations
>> job = c.batch(@parallel_example, 1, {16}, 'Pool',4, 'CurrentFolder','.');

>> % View current job status
>> job.State

>> % Fetch the results after a finished state is retrieved
>> job.fetchOutputs{:}
ans = 
	8.8872

The job ran in 8.89 seconds using four workers. Note that these jobs will always request N+1 CPU cores, since one worker is required to manage the batch job and pool of workers. For example, a job that needs eight workers will consume nine CPU cores.
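
If you would rather block until the job finishes than poll job.State, you can wait on the job object before fetching the outputs; wait is a standard method of MATLAB job objects:

>> % Block until the batch job reaches the finished state
>> job.wait

>> % Then fetch the results
>> job.fetchOutputs{:}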


Troubleshoot parallel jobs

Using parallel pools often results in hard-to-diagnose errors. Many of these errors are related to running several pools at the same time, which is not what MATLAB expects. If you encounter persistent problems starting pools, try one of the steps below. Before doing so, make sure that you do not have any MATLAB processes running.

  1. Remove the matlab_metadata.mat file in your current working directory.
  2. Remove the $HOME/.matlab/local_cluster_jobs directory.
  3. Remove the entire $HOME/.matlab directory. Warning: Your MATLAB settings on Euler will be lost.

If a parallel job produces an error, call the getDebugLog method to view the error log file:

>> c.getDebugLog(job)

When troubleshooting a job, the cluster administrators may request the scheduler ID of the job. You can obtain it by calling schedID:

>> schedID(job)
ans = 
         25539