Cluster user survey 2019

From ScientificComputing
Revision as of 15:51, 22 January 2020 by Byrdeo (talk | contribs) (Survey results)



Introduction

In September 2019, the HPC Group of Scientific IT Services launched a cluster user survey to learn about individual needs, assess user satisfaction, and further tailor our service portfolio to our customers and users. Moreover, we want to better understand how our customers differentiate us from other computing infrastructure providers and to address potential gaps.

All cluster users who used at least 1 CPU core on average in the past 6 months were invited to fill out a (very long) questionnaire. Of these 954 users, 234 participated in the survey. This corresponds to a return rate of 24.53%, which is very impressive for a large survey with many questions.
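
The return rate follows directly from the two counts given above. As a minimal sketch (the variable names are illustrative and not part of the survey tooling), the arithmetic can be reproduced as:

```python
# Counts taken from the text above; variable names are illustrative only.
invited = 954        # users invited to the survey
participants = 234   # users who filled out the questionnaire

# Return rate as a percentage of invited users
return_rate = participants / invited * 100

print(f"Return rate: {return_rate:.2f}%")
```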

We would like to take this opportunity to thank all these users for taking the time to help us improve our service.

Summary


Overall, the majority of cluster users are very satisfied with the infrastructure and services that the HPC group provides.


  • Participants
    • We have many experienced (2+ years) cluster users
    • Access to computing capacity is important for the entire ETH research community, from traditional quantitative disciplines such as physics and engineering, through life sciences, to social sciences
  • Application/software requirements
    • The wide range of centrally provided software is appreciated
    • There should be more documentation about the centrally provided applications and libraries
    • Containers are not yet as widely used as we thought
  • Code development
    • Python is the most used programming language on our clusters
    • Availability of tools for code development is good, but documentation could be better
  • Resource requirements
    • In addition to parallel jobs, there is a large number of serial computations
    • Most parallel jobs use a single full compute node
    • Users appreciate the possibility to choose among compute nodes with various memory sizes
    • Most jobs do not use a large amount of memory; however, availability of nodes with more than 1 TB of memory is important to some users
    • Different run time limits are justified; interestingly, a large fraction of users need run time limits beyond 24 hours
    • We observe very high user satisfaction with regard to the overall performance of the clusters
    • Most computations which use GPU acceleration use a single GPU
    • Half of the participants of the survey expect growing needs of computational resources
  • Batch system
    • Most users use bsub directly to submit their jobs on the cluster
    • Users are mostly happy about the configuration of the batch system. Runtime limits and waiting time could be improved
    • Many additional features (X11, interactive jobs, bjob_connect etc.) of the batch system are not well known, except bbjobs
  • System stability and availability
    • The perceived stability and availability of both the Euler and the Leonhard clusters are very good
    • Cluster maintenance windows are perceived as well organized, transparent to the users and well communicated
    • Almost no lost jobs are reported due to node failure or file system issues
  • Documentation and support
    • Information on the wiki is accurate and mostly up to date, but it could be more detailed and contain more examples
    • Cluster support is well reachable and provides fast, accurate and polite answers
    • There is a large demand for technical trainings and workshops
  • Shareholder model
    • Our current shareholder model is well accepted and established
    • Most shares are financed through department/institute budgets
  • Central cluster and computing infrastructure
    • For ETH researchers the most important infrastructures, next to Euler and Leonhard, are the ones at CSCS
    • Most users who use additional platforms alongside Euler/Leonhard consider the ETH clusters equal to or better than the other resources
  • Additional services offered by Scientific IT Services (SIS)
    • Roughly a third of the participants indicated interest in expertise and services that go beyond the basic support of the HPC team

Changes based on user feedback

Already implemented

  • We redesigned the front page of the wiki to show more information
  • In particular, the front page now shows the 20 most recently installed software packages on Euler and Leonhard (this was requested by multiple users participating in the survey)
  • The status of our clusters & services is now displayed at the top of every wiki page, which allows users to immediately see if a service is down

On our todo list

  • The documentation about our systems and services needs to be augmented and improved overall
  • We will add more examples to the wiki
  • We will create a short summary of the "Using the batch system" documentation

Results

This section provides aggregated results for all questions in the survey. To visualize the answers, pie charts and donut charts are used for single-choice questions and bar charts for multiple-choice questions. As in the survey, questions are grouped by topic.

About participants

What is your role?

In which field are you working?

How important are computations for your work?

What kind of Euler/Leonhard user are you?

How many years of experience do you have in using a cluster/supercomputer?

How often do you use the Euler/Leonhard cluster?

What is your account status?

Application/software requirements

Which type of software do you use?

Which are the top 3 centrally installed software that you are using?

Are you happy with the centrally provided software?

Is it important for you to keep the older versions of a software?

Do you need any software that is not provided centrally?

Do you plan to use Singularity containers on Euler/Leonhard?

Are you using other container technologies?

Code development

How important are the following programming languages for your work?

Are you happy with the centrally provided tools for the code development?

Which parallelization libraries are you using?

Which scientific libraries are you using?

Which machine learning libraries/frameworks are you using?

Resource requirements

What type of computation are you running?

How many cores do you typically use per job?

What is the maximum memory that you need per node (GB)?

What is the typical runtime of your computations?

How happy are you with the performance of the cluster?

What kind of GPU computations do you typically run (if applicable)?

What model of GPU are you using on Leonhard (if applicable)?

How many GPUs per job do you typically use (if applicable)?

How do you see your future needs?

Using the batch system

How do you use the batch system on Euler/Leonhard?

Can you split long computations into shorter jobs?

Are you happy with the current configuration of the batch system?

Do you use the following features of the batch system?

System stability and availability

How happy are you with the stability and availability of the cluster?

How happy are you with the planned cluster maintenance?

How often have you lost jobs due to a problem on the cluster, e.g. node failure or file system issue?

Documentation and support

How often do you search for information on the wiki scicomp.ethz.ch?

How satisfied are you with the wiki?

How often did you use the following support channels? (times per year)

Are you happy with the support?

Would you like to have regular trainings on the following topics?

Shareholder model

Are you happy with the shareholder model?

What could be improved/added to the shareholder model?

How did you finance your share?

If you had the choice, how would you prefer to get resources on Euler/Leonhard?

Central cluster and computing infrastructures

Which of the following computing infrastructures are the most important for your research?

How do you rate Euler/Leonhard in comparison with the other computing infrastructures used most frequently?

Additional services offered by Scientific IT Services (SIS)

Which of the following SIS services are relevant to you?