Euler III Beta Testing

From ScientificComputing
Latest revision as of 08:00, 13 November 2017

The information on this wiki page is obsolete, as the Euler III beta testing phase is over.

The Euler III extension to the Euler cluster was put into production on Wednesday, 10 May 2017.

Serial jobs and single-node parallel jobs that use 1 to 4 cores and up to 30 GB of total memory per node may run on these nodes from then on.

Select or avoid Euler III nodes

For a short time after these nodes enter production, you can force your jobs to run there or prevent them from running there. To force your job to run on these nodes, request the “-R beta” bsub option:

bsub -R beta [other bsub options] ./my_command

To prevent your job from running on these nodes, request the “-R stable” bsub option:

bsub -R stable [other bsub options] ./my_command
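
The two options above can be wrapped in a small helper. This is only a sketch: the function name euler_bsub_cmd and its "force"/"avoid"/"any" keywords are hypothetical and not part of LSF or Euler. It prints the bsub command line instead of submitting anything, so it can be tried safely.

```shell
# Hypothetical helper (not an Euler or LSF tool): map a placement choice
# to the "-R beta" / "-R stable" option described above and print the
# resulting bsub command line. Nothing is submitted.
euler_bsub_cmd() {
    placement="$1"   # "force", "avoid", or "any"
    shift
    case "$placement" in
        force) resource="-R beta " ;;    # run only on Euler III nodes
        avoid) resource="-R stable " ;;  # never run on Euler III nodes
        any)   resource="" ;;            # let the scheduler decide
        *)     echo "unknown placement: $placement" >&2; return 1 ;;
    esac
    echo "bsub ${resource}$*"
}

euler_bsub_cmd force ./my_command
euler_bsub_cmd avoid ./my_command
```

Once the printed command looks right, it can be copied to the prompt and run as usual.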

Known issues

See the Troubleshooting section below for solutions to issues you may encounter.

NAS NFS mounts
Euler III nodes are in a different IP range than the rest of the Euler nodes. If you use your own NAS, then you need to change the export rules and/or update your firewall to include the new IP addresses. The NAS shares provided by the Storage Group of the IT Services have been automatically changed to include the new IP ranges. You can test whether your NAS is affected by submitting a test job to list some files on your NAS:
bsub -R beta -Ip ls /nfs/my-nas-server/my-nas-volume (with appropriate substitutions).
Infiniband and MPI
Euler III nodes do not have an Infiniband network, but they do have a fast, low-latency Ethernet interconnect. MPI jobs requesting 5 or more cores will by default continue to run exclusively on Euler I and II nodes with the Infiniband network. Running parallel jobs on Euler III nodes is possible but not recommended; refer to Submitting parallel jobs below for details.
Back connections (from Euler to external server)
If your job connects to your workstation or another external server, such as MATLAB MDCS, you will need to change your firewall and/or access rules because Euler III nodes are in a different IP range than the rest of the Euler nodes.
Missing libraries
Euler III nodes run CentOS 7, unlike the rest of Euler, which runs CentOS 6. Some libraries may be missing, especially for self-compiled programs. Refer to Missing libraries below if you encounter problems.
Jobs submitted from Euler III nodes can only run on other Euler III nodes.
Jobs submitted from a CentOS 7 node must run on another CentOS 7 node. For now, only Euler III can run such jobs.
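
The NAS test described above can be made a little more explicit with a wrapper that reports success or failure instead of relying on the raw ls output. This is a minimal sketch; the function name check_nas_path is hypothetical, and the path is whatever NAS volume you normally use.

```shell
# Hypothetical helper: report whether a directory can be listed from the
# node this runs on. Returns non-zero if the listing fails, for example
# because the NAS export rules do not yet include the node's IP range.
check_nas_path() {
    if ls "$1" >/dev/null 2>&1; then
        echo "OK: $1 is readable"
    else
        echo "FAIL: cannot list $1"
        return 1
    fi
}
```

One possible way to use it is to save the function in a small script that calls check_nas_path with your NAS path, and to submit that script with the same "bsub -R beta -Ip" options as the ls test above.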

Submitting beta jobs

To submit a job to run on the beta Euler III nodes, you must request the beta resource, e.g.,

bsub -R beta [other bsub options] ./my_command

Submitting parallel jobs

While the Euler III nodes are intended for serial and shared-memory parallel jobs, multi-node parallel jobs are still accepted.

You need to tell the system that Infiniband is not available,

module load interconnect/ethernet

before loading the MPI module. Then you need to request at most four cores per node:

bsub -R beta -R "span[ptile=4]" [other bsub options] ./my_command
Open MPI
Open MPI 1.6.5 has been tested to work with acceptable performance.
Open MPI 2.0.2 has been tested to work.
MVAPICH2
MVAPICH2 2.1 works but preliminary results show low scalability. You need to load the interconnect/ethernet module.
Intel MPI
Intel MPI 5.1.3 has been tested.
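
The submission steps for parallel jobs can be sketched as follows. With "span[ptile=4]", a job is spread over at least ceil(cores / 4) nodes; the helper below computes that number and prints the commands in the required order. The function names, the module name open_mpi/1.6.5, the use of "bsub -n" to request cores, and the mpirun launcher are assumptions for illustration; check "module avail" for the module names that actually exist on the cluster.

```shell
# Minimum number of nodes used when at most 4 cores are placed on each
# node, i.e. the ceiling of cores / 4.
nodes_needed() {
    echo $(( ($1 + 3) / 4 ))
}

# Print (do not run) the commands for a multi-node MPI job on Euler III.
# $1 = number of cores, $2 = MPI module name (an assumption, see above).
print_mpi_submission() {
    cores="$1"
    mpi_module="$2"
    # The interconnect module must be loaded before the MPI module.
    echo "module load interconnect/ethernet"
    echo "module load $mpi_module"
    echo "bsub -n $cores -R beta -R \"span[ptile=4]\" mpirun ./my_command"
}

nodes_needed 8
print_mpi_submission 8 open_mpi/1.6.5
```

For 8 cores this reports 2 nodes, which is a quick way to see how much Ethernet traffic between nodes a job will involve.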

Troubleshooting

Missing libraries

Euler III nodes run CentOS 7, which includes many updated libraries. We have included as many backward-compatible libraries as possible in the default system. However, due to stability and operational concerns, there are some that we had to install as a separate module.

If your program aborts with an error message such as

[leonhard@eu-ms-001-01 ~]$ ./some_program
some_program: error while loading shared libraries: libpython2.6.so.1.0: cannot open shared object file: No such file or directory

but it works on the other, older Euler nodes, then a library that the program needs is not found on the Euler III nodes.
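
To see every library that cannot be resolved, rather than only the first one reported in the error message, the output of the standard ldd tool can be filtered. This is a sketch; the function name missing_libs is hypothetical, and it assumes the usual ldd output format in which unresolved libraries are shown as "name => not found".

```shell
# Read ldd-style output on standard input and print the name of each
# library that the dynamic loader could not resolve.
missing_libs() {
    awk '/=> not found/ { print $1 }'
}

# Typical use on an Euler III node (some_program is a placeholder):
#   ldd ./some_program | missing_libs
```

An empty result means that all shared libraries were found on the node where ldd was run.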

Self-compiled programs

If you have built your program yourself, then for now it is advisable to load the “legacy” and “centos_cruft/6” modules before submitting your beta job (or before calling your program within your job shell script). For example,

[leonhard@euler00 ~]$ module load legacy centos_cruft/6
[leonhard@euler00 ~]$ bsub -R beta ./my_program
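
The alternative mentioned above, loading the modules inside the job script itself, might look like the sketch below. The helper name make_job_script and the file name job.sh are hypothetical; the generated script simply loads the two modules before calling the program.

```shell
# Hypothetical helper: print a minimal job script that loads the
# compatibility modules before running the given program.
make_job_script() {
    printf '%s\n' \
        '#!/bin/bash' \
        'module load legacy centos_cruft/6' \
        "$1"
}

make_job_script ./my_program
```

The output can be redirected to a file such as job.sh and passed to bsub on standard input (bsub -R beta < job.sh), so that the modules are loaded on the compute node regardless of what is loaded in the submitting shell.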

Euler-provided programs and modules

Let us know if you encounter this problem when using a program provided by us so we can fix it for all users. Please include the error message in your report.