Difference between revisions of "Euler VI Testing"
Revision as of 12:31, 31 January 2020
The new Euler VI nodes are available for beta testing. Each node has 128 cores and 512 GB of memory, and the nodes are connected by a 200 Gbps EDR InfiniBand fabric.
Roadmap
Euler VI will be put into regular production through the following phases. The current phase is highlighted in bold.
- Closed beta testing: the new nodes are tested by the HPC group and interested users who contact us and agree to be beta-testers.
- Open beta testing: the new nodes can be tested by anyone who is interested.
- Gradual easement: a portion of all jobs may be able to run on the system, starting with a minimal set and gradually increasing.
- Production: the new nodes are treated like any other node in the cluster.
Select or avoid Euler VI nodes
During the testing and transition period you can force your job to use or avoid these nodes.
To force your job to run on these nodes, request the “-R beta” or “-R "select[model==EPYC_7742]"” bsub option:
bsub -R beta [other bsub options] ./my_command
bsub -R "select[model==EPYC_7742]" [other bsub options] ./my_command
To prevent your job from running on these nodes, request the “-R stable” bsub option:
bsub -R stable [other bsub options] ./my_command
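If you switch between these choices often, a small shell function can keep the option strings in one place. The following is only a sketch: the function name `euler_node_opt` and the keywords `beta`, `epyc` and `stable` are hypothetical and not part of the cluster tooling. It prints the command line instead of submitting it.

```shell
#!/bin/sh
# Hypothetical helper (not provided by the cluster): map a node
# preference keyword to the matching bsub -R option from this page.
euler_node_opt() {
    case "$1" in
        beta)   printf '%s\n' '-R beta' ;;                        # force Euler VI nodes
        epyc)   printf '%s\n' '-R "select[model==EPYC_7742]"' ;;  # same, selected by CPU model
        stable) printf '%s\n' '-R stable' ;;                      # avoid Euler VI nodes
        *)      printf '%s\n' "unknown preference: $1" >&2
                return 1 ;;
    esac
}

# Print (do not run) the resulting command line for inspection.
printf '%s\n' "bsub $(euler_node_opt stable) [other bsub options] ./my_command"
```

Printing the command first lets you check it before copying it to the prompt.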
If you encounter any problem with running your jobs on the new Euler VI nodes, then please report it to cluster support.
After the nodes are put into production, jobs submitted with the “stable” option may also run on the new nodes. Some time after that, jobs submitted with the “beta” option will no longer be able to run.
Changes in behavior
If you request Euler VI nodes, then the batch system will run jobs requesting up to 128 cores on a single node.
Threaded jobs
Non-threaded (multi-node MPI) jobs
If you intend to run multi-node jobs, use the “-R "span[ptile=128]"” bsub option (or another appropriate value instead of 128).
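As a rough illustration of how the ptile value relates to the number of nodes, the sketch below computes the node count for a given total core count and prints a matching command line without submitting it. The function name `print_mpi_bsub`, the program name `./my_mpi_program`, the use of `mpirun`, and the combination of “-R beta” with a separate span option are assumptions made for this example; `-n` is the bsub option for the total number of cores.

```shell
#!/bin/sh
# Hypothetical helper, not part of the cluster tooling.
# Usage: print_mpi_bsub TOTAL_CORES [PTILE]
print_mpi_bsub() {
    total=$1
    ptile=${2:-128}   # cores per node; defaults to a full Euler VI node
    # Ceiling division: a partially filled last node still counts as a node.
    nodes=$(( (total + ptile - 1) / ptile ))
    printf '%s\n' "# $total cores at up to $ptile per node: $nodes node(s)"
    printf '%s\n' "bsub -n $total -R beta -R \"span[ptile=$ptile]\" mpirun ./my_mpi_program"
}

# Example: a 256-core job filling two Euler VI nodes.
print_mpi_bsub 256
```

A smaller ptile value, for example 64, spreads the same 256 cores over four nodes.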