From ScientificComputing

How to use this discussion page

  • If you are not logged in yet, login or create a new account using the link in the top right corner of this window
  • Click the "+" tab at the top of this window to add a new section
  • Enter a subject in the corresponding field
  • Type your comment or question in the main edit window
  • Write four tildes "~~~~" after a blank line to sign your post (name, time & date)
  • Press the "Show preview" button below to review your text
  • Do not forget to press the "Save page" button when you're done

Changelog (for scripts, profile)

R010 2016-02-22
Adds support for MATLAB 8.6 (R2015b).
R009 2015-05-05
Increases default job duration to 24 hours.
Adds support for MATLAB 8.4 and 8.5, so that all versions from 8.1 to 8.5 (R2013a–R2015a) are supported.
R008 2015-04-24
Adapts the scripts for Euler and adds support for MATLAB 8.2 (R2013b).
R007 2013-11-11
Disables verbose server-side debugging.
R006 2013-07-16
Provides a function to get the mirrored path on Calculus.
Adds a helper function to simplify getting files back to the client.
R005 2013-07-09
Maps MATLAB job tasks to LSF job arrays.
Submitting many tasks is now much faster.
R004 2013-06-20
The remote job storage directory reflects the local storage directory.
This reduces the chances that starting MATLAB from different directories will confuse it.
Note that the settings file has changed to allow this. You have to re-import it: delete the old Calculus profile and import it from the Calculus004.settings file.
R003 2013-06-18
Creates remote job storage directory on demand. This enables access from independent clients.
R002 2013-06-13
Job time limits can be set through a MATLAB preference.
Removes debugging print.
(R001) Initial release
Based on MATLAB R2013a (8.1) MATLAB integration scripts
Remembers the username as a MATLAB property.

Job timeout

Any way of setting the max runtime on single jobs? Some jobs seem to get stuck, and the wait fcn isn't elegant enough to avoid the problem with little effort.

Nzamboni 15:04, 29 May 2013 (UTC)

We can easily set a global run-time limit in the batch system (Calculus uses LSF in the background, like Brutus). The question is, what should this limit be? We want to gather some usage statistics before we make that decision. This will be done during the beta-testing phase.
@Urban: is it possible to set a run-time limit (bsub -W HH:MM) directly from MATLAB?
olivier 15:50, 29 May 2013 (UTC)

I meant on single jobs, flexibly before submission or at creation - not globally. That's the case for the Mathworks Job Scheduler, but I didn't find an option for the Calculus profile. Maybe it's something they changed in 2013a, so far I used only previous versions.

Nevertheless, a global run-time limit should likely be in the range of one to a few hours to avoid major issues and yet be on the safe side.

Thx, --Nzamboni 19:53, 29 May 2013 (UTC)

I have looked into allowing setting time limits on jobs.
Due to the way MATLAB handles the MDCS, I cannot find an effective way to set the time limit on a per-job basis. One option would be to set a variable with the time limit, but that would clutter MATLAB's workspace. Making it work the way it does with the MJS would mean changing the MATLAB files on your local workstation, which is not something we want to do.
-- Urbanb 09:20, 5 June 2013 (UTC) (edited)
-- Urbanb 08:08, 31 May 2013 (UTC)

Job timeouts can now be set as a preference for Calculus jobs. Until we get a better feel for the types of jobs being run, we will avoid setting a global time limit. -- Urbanb 11:56, 17 June 2013 (UTC)

To prevent runaway jobs, the default job length is set to 36 h (this duration will be subject to change). It is set on the Calculus server and can be overridden using the mechanism described on the main page. -- Urbanb 15:00, 28 June 2013 (UTC)
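
For readers unfamiliar with MATLAB preferences, here is a minimal sketch of how a time limit could be stored as a preference and turned into the HH:MM form that bsub -W takes. The preference group and name below are hypothetical and are not the ones the Calculus scripts read; those are described on the main page.

```matlab
% Hypothetical preference group and name, chosen for this sketch only.
% The names that the Calculus scripts actually read are given on the main page.
prefGroup = 'ExampleCluster';
prefName  = 'JobTimeLimitMinutes';

setpref(prefGroup, prefName, 90);            % ask for a 90-minute limit
limitMinutes = getpref(prefGroup, prefName); % read it back, as a script would

% Format the limit the way "bsub -W" takes it: hours and minutes as HH:MM.
wallTime = sprintf('%02d:%02d', floor(limitMinutes / 60), mod(limitMinutes, 60));
fprintf('bsub -W %s\n', wallTime);

rmpref(prefGroup);                           % remove the demo preference again
```

Preferences persist between MATLAB sessions, which is why the sketch removes its demo group at the end.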

Task status not updated

I sent a job with 50 identical tasks. After a few minutes the whole job is finished. However, several tasks (18 to be precise) are listed to be still pending. Any explanation?

                  ID: 8
                Type: independent
            Username: nzamboni
               State: finished
          SubmitTime: Wed May 29 16:50:14 CEST 2013
           StartTime: Wed May 29 16:51:00 CEST 2013
    Running Duration: 0 days 0h 0m 41s
     AutoAttachFiles: true
 Auto Attached Files: List files
       AttachedFiles: ...dfs\Groups\biol\sysbc\users\nzamboni\Documents\Matlab Work\synfr2\current\singlesim.m
     AdditionalPaths: E:\lib\lindoapi\bin
   Associated Tasks: 
      Number Pending: 18
      Number Running: 0
     Number Finished: 32
   Task ID of Errors: []

Nzamboni 15:05, 29 May 2013 (UTC)

Did you check this only once, or did you retry it some time (~several minutes) later, too? The task status files in the Calculus directory all showed a "finished" state, so there must be an issue with how MATLAB synced the status back to your workstation. Hopefully it is just a delay.
I submit()ed a few jobs with 100 to 1000 tasks and all were accounted as "Finished": none were stuck in "Pending" or "Running".
I did notice that the submission is very slow. I will see if this could be sped up, too.
-- Urbanb 08:13, 31 May 2013 (UTC)

Again, same problem. Five hours after the job finished, there are still 27 (out of 50) tasks pending.

                  ID: 10
                Type: independent
            Username: nzamboni
               State: finished
           StartTime: Tue Jun 04 09:52:14 CEST 2013
    Running Duration: 0 days 0h 0m 35s
   Associated Tasks: 
      Number Pending: 27
      Number Running: 0
     Number Finished: 23

-- Nzamboni 12:29, 4 June 2013 (UTC)

I have increased the length of time for which the jobs on Calculus are visible to the scheduling system from 1 hour to 12+ hours.
-- Urbanb 09:11, 5 June 2013 (UTC)
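
To re-check what the client believes about each task, the states can be tallied along these lines. This is only a sketch: the commented lines show how the list might be read from a job object (assuming the 'Calculus' profile and job ID 10 from the report above), while the runnable part rebuilds the list from the counts reported above so that it works without a cluster.

```matlab
% On a real client the states would come from the job object, e.g.
%   cluster = parcluster('Calculus');
%   job     = findJob(cluster, 'ID', 10);
%   states  = arrayfun(@(t) t.State, job.Tasks, 'UniformOutput', false);
% Here the list is rebuilt from the counts reported above (27 pending,
% 23 finished) so that the sketch runs anywhere.
states = [repmat({'pending'}, 1, 27), repmat({'finished'}, 1, 23)];

% Count how many tasks are in each distinct state.
[names, ~, idx] = unique(states);
counts = accumarray(idx(:), 1);

for k = 1:numel(names)
    fprintf('%-10s %d\n', names{k}, counts(k));
end
```

Running the tally again a few minutes later would show whether the pending count is still dropping or has stalled.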

Brutus access

I can't access the calculus functions and config files on brutus. Not everybody is by default a brutus user...

Nzamboni 07:38, 18 June 2013 (UTC)

Yes, I am aware of that but, unless I am mistaken, this wiki can only serve pictures and other media. I will try to see if the files can be hosted by a webserver, too.
-- Urbanb 08:45, 18 June 2013 (UTC)
The local files can now be downloaded directly from a webserver. The links to the .tgz or .zip files are provided in the instructions.
-- Urbanb 09:01, 18 June 2013 (UTC)

Number of nodes

Any chance to access more than 64 workers? Maybe temporarily...

Nzamboni 12:32, 23 January 2014 (UTC)

The current hard limit is effectively 128 cores per user, meaning that two 64-worker jobs could be run at once (if the cluster is empty).
The 64-worker limit for each pool is not a hard limit. It's a soft limit and the default pool size. You can change the soft limit by editing the definition of the Calculus cluster through the GUI, or through the MATLAB command interface:
cluster = parcluster('Calculus') ; cluster.NumWorkers=128 ; cluster.saveProfile() ;

Can't fetch all data

I have a problem fetching my data after a job with several tasks has finished. As described in the wiki, I submit my job, use the wait command and afterwards try to fetch the data using fetchOutputs. The return values of the tasks are arrays of a certain length. Depending on the number of tasks and the size of the arrays, this is rather likely to fail. The problem seems to be that the TaskX.out.mat files are transferred too slowly from the cluster back to the local machine. Most of them have not been transferred when the wait command finishes. The situation improves a bit if I repeat the fetchOutputs command several times, but there seems to be a timeout in the transfer function: roughly 5 minutes after the job has finished, no more data files seem to be transferred.

The problem is rather arbitrary regarding the number of tasks and the size of the arrays. Sometimes even a job containing 8 tasks and some hundreds of kilobytes of return data fails; sometimes 20 tasks or more with several megabytes of data finish without any problem. Using the cluster is therefore something of a lottery. I guess the problem would vanish if the time window for receiving the data were increased. Is this possible? Or is there another way to access the data?

--Tlotterm 08:04, 8 September 2014 (UTC)

Waiting for jobs is programmed within the main MATLAB installation. While there are many useful changes that we would like to make to that code, any change there is unsupported by Mathworks and would make the MATLAB-LSF-Calculus interfaces harder to maintain.
The data can only be usefully retrieved by MATLAB. This behavior is unexpected. Note that Mathworks states that the volume of data transferred is limited to 2 GB (the limit for the version of .mat files used).
Unfortunately, data transfer is the major problem we have seen with the MDCS.
-- Urbanb 09:02, 8 September 2014 (UTC)
Thanks for your reply. I don't think that the size of the .mat files is an issue. I'm far away from 2GB. As I said, sometimes 10MB is no problem and the next time 100KB are not transferred. Is there any possibility to access the files directly on the cluster and transfer them manually? I have no Brutus and/or Euler account and the Calculus cluster is in theory the ideal solution for me.
--Tlotterm 10:59, 8 September 2014 (UTC)
There is a way to transfer the MATLAB data, though you will probably want to wrap it in a script. Create a pmode session, which will basically give you a MATLAB shell on a Calculus node:
pmode start 'Calculus' 1
Once the window opens, you can try to find the data directory for the failed job/task. Let's assume you started the pmode in the same directory from which you submitted the failed job. Type this in the pmode window:
Now type
and you will see job files and directories. Go to the directory of the failed job (let's say number 14):
cd Job14
The output of a task i is stored as the argsout variable in the Taski.out.mat task output file. Let's assume you need the output from task 1. You can load this file and send it to your desktop client:
pmode lab2client argsout labindex
(You can store it to a variable with a different name in the client by appending that name to the command.)
-- Urbanb 12:31, 8 September 2014 (UTC)
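
When many tasks are affected, the loading step could be wrapped in a loop like the sketch below. It assumes only what is described above: that each task's output sits in the argsout variable of a TaskN.out.mat file inside the job directory, and that the code runs somewhere that directory is readable (for example inside the pmode session). So that it runs anywhere, the demo first creates a temporary stand-in directory with made-up outputs; on Calculus, jobDir would point at the real job directory instead.

```matlab
% Demo setup: a temporary stand-in for a job directory with made-up outputs.
jobDir = tempname;
mkdir(jobDir);
for i = 1:3
    argsout = {i * [1 2 3]};   % placeholder; real files are written by MATLAB
    save(fullfile(jobDir, sprintf('Task%d.out.mat', i)), 'argsout');
end

% Collect argsout from every TaskN.out.mat file found in the directory.
files = dir(fullfile(jobDir, 'Task*.out.mat'));
outputs = cell(numel(files), 1);
for k = 1:numel(files)
    % Recover the task number N from the file name.
    taskId = sscanf(files(k).name, 'Task%d.out.mat');
    data = load(fullfile(jobDir, files(k).name), 'argsout');
    outputs{taskId} = data.argsout;
end

rmdir(jobDir, 's');   % remove the demo directory again
```

The collected outputs variable could then be sent to the client in one go with pmode lab2client, instead of transferring each task's argsout separately.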