Resource Usage Monitoring


To improve the usage of the cluster, we have set up monitoring of all jobs. It tracks the resources requested and identifies users who systematically request far more resources than they need. Users are often unaware of the mismatch between what they request and what their jobs actually use, so we hope that this service will raise awareness of the issue and help users improve their utilization of the cluster.

The overall goals of this service are to:

  • Shorten the queue waiting time for all users
  • Lower the cost for shareholders (since they are charged for all the resources they request, even if they don't use them)
  • Increase the overall utilization of the cluster
  • Ensure that the space, power and cooling in the datacenter are used as efficiently as possible (we cannot keep expanding Euler indefinitely)

How to improve your usage / stop receiving emails

The best approach is to be proactive and understand the requirements of your jobs. To do so, we provide a few tools:

  • On the cluster, myjobs -j JOB_ID, where JOB_ID is the ID of a job in Slurm. This tool reports both CPU and RAM usage. In the future, it will also report GPU usage.
  • On the cluster, get_inefficient_jobs, which provides the same information as our new service.
  • Using the Slurm jobs webgui to analyze your jobs
  • Following the best practices page on the wiki
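
As a rough illustration of what "usage versus request" means, the sketch below computes CPU and RAM efficiency for a single job. It is not the code behind myjobs or get_inefficient_jobs; the function names and the input values are assumptions made for this example, and you would fill in the numbers from the tools listed above.

```python
def cpu_efficiency(cpu_time_s: float, elapsed_s: float, cores: int) -> float:
    """Fraction of the reserved core-seconds that the job actually used."""
    reserved_core_s = elapsed_s * cores
    return cpu_time_s / reserved_core_s if reserved_core_s > 0 else 0.0


def memory_efficiency(peak_mem_mb: float, requested_mem_mb: float) -> float:
    """Fraction of the requested RAM that the job used at its peak."""
    return peak_mem_mb / requested_mem_mb if requested_mem_mb > 0 else 0.0


# Hypothetical job: 8 cores reserved for 1 hour, 2 core-hours of CPU time
# consumed, 4 GB peak RAM out of 16 GB requested.
print(f"CPU efficiency: {cpu_efficiency(7200, 3600, 8):.0%}")
print(f"RAM efficiency: {memory_efficiency(4096, 16384):.0%}")
```

A job like this one would run just as well with a quarter of the cores and a quarter of the memory, which is the kind of mismatch the monitoring service looks for.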

How the service works

Each week, the service looks at the most active users, analyzes all the jobs they ran during the previous week, and checks for the following issues:

  • Low utilization of system RAM or CPU with respect to the requested resources
  • Large amount of system RAM or CPU requested on GPU nodes (which prevents other users from accessing the remaining GPUs on the node)
  • Large number of extremely short jobs in Slurm

If we detect any of the issues listed above, we send an email to the user and update a database. This database keeps track of who has been contacted so that we can check whether their usage improves over time. If it does not, we may take action to prevent them from wasting resources on the cluster.
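
The weekly checks could be sketched roughly as follows. This is only an illustration of the three issue types, not the service's actual implementation: the Job record, the weekly_issues function, and every threshold are hypothetical values chosen for the example.

```python
from dataclasses import dataclass

# Hypothetical thresholds; the real service may use different criteria.
LOW_UTILIZATION = 0.5        # below 50% of the request counts as "low"
MAX_CORES_PER_GPU = 16       # more cores than this per GPU counts as "large"
MAX_MEM_MB_PER_GPU = 65536   # more RAM than this per GPU counts as "large"
SHORT_JOB_SECONDS = 300      # jobs shorter than this are "extremely short"
MAX_SHORT_JOBS = 100         # tolerated number of short jobs per week


@dataclass
class Job:
    elapsed_s: float         # wall-clock run time
    cores: int               # CPU cores requested
    cpu_time_s: float        # CPU time actually consumed
    requested_mem_mb: float  # RAM requested
    peak_mem_mb: float       # peak RAM actually used
    gpus: int = 0            # GPUs requested


def weekly_issues(jobs: list[Job]) -> set[str]:
    """Return the issue types found in one user's jobs for the week."""
    issues = set()

    # Low CPU utilization: core-seconds used versus core-seconds reserved.
    reserved_core_s = sum(j.elapsed_s * j.cores for j in jobs)
    used_core_s = sum(j.cpu_time_s for j in jobs)
    if reserved_core_s > 0 and used_core_s / reserved_core_s < LOW_UTILIZATION:
        issues.add("low CPU utilization")

    # Low RAM utilization: peak usage versus request, weighted by run time.
    reserved_mem = sum(j.requested_mem_mb * j.elapsed_s for j in jobs)
    used_mem = sum(j.peak_mem_mb * j.elapsed_s for j in jobs)
    if reserved_mem > 0 and used_mem / reserved_mem < LOW_UTILIZATION:
        issues.add("low RAM utilization")

    # Large CPU or RAM request per GPU on GPU jobs.
    if any(
        j.gpus > 0
        and (j.cores / j.gpus > MAX_CORES_PER_GPU
             or j.requested_mem_mb / j.gpus > MAX_MEM_MB_PER_GPU)
        for j in jobs
    ):
        issues.add("large CPU or RAM request on GPU node")

    # Large number of extremely short jobs.
    if sum(j.elapsed_s < SHORT_JOB_SECONDS for j in jobs) > MAX_SHORT_JOBS:
        issues.add("many short jobs")

    return issues
```

In this sketch, a user would be contacted whenever weekly_issues returns a non-empty set for their jobs.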

In the future, we plan to add other checks to the list, such as GPU utilization.