Weights and Biases blocked on the ETH Proxy Server



Weights and Biases (wandb) is a web service that many Euler users rely on as a dashboard for monitoring their machine learning jobs while they run. It is an external service that requires internet access. The compute nodes of Euler do not have direct internet access, so users need to go through the ETH proxy server to reach the wandb servers.

Denial of service

Recently, a large number of concurrent connections from Euler to the wandb service caused a denial of service (DoS) of the ETH proxy server. As a result, api.wandb.ai has been blocked on the ETH proxy server, and jobs using the service terminate with an error message because wandb can no longer be reached.

We are in contact with the service owners of the ETH proxy server and are trying to find a solution to this problem.


2022-09-07 08:30
We have implemented a workaround that makes wandb accessible again from the compute nodes of Euler. The only requirement is that the eth_proxy module is unloaded (module unload eth_proxy), because the wandb service is still blocked on the ETH proxy server. With the eth_proxy module unloaded, the wandb service works again.
2022-09-07 14:30
Some users reported that wandb works again at first, but that after some time they still get the error message "wandb: Network error (TransientError), entering retry loop.". We are investigating this problem and are trying to find a solution.
2022-09-13 16:10
We received more reports from users about the workaround currently in place for accessing the wandb service from Euler compute nodes. If you only send values to be plotted, the workaround does not produce errors. The errors only show up when you upload images. We are still working on a solution to make the service fully available again.
2022-09-15 15:20
From now on, you can load the eth_proxy module again, and your jobs can use the wandb service as they did before the incident with the ETH proxy server.
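
To confirm from inside a job whether proxy settings are active, for example after loading or unloading the eth_proxy module, a minimal Python sketch might look like the following. It assumes that the module works by setting the conventional http_proxy and https_proxy environment variables, which this page does not state. The function name active_proxy_settings is hypothetical and only used for illustration.

```python
import os

# Conventional proxy variable names. That eth_proxy sets exactly these
# is an assumption; this page does not document what the module exports.
PROXY_VARS = ("http_proxy", "https_proxy", "HTTP_PROXY", "HTTPS_PROXY")


def active_proxy_settings(env=None):
    """Return a dict of the proxy variables that are set and non-empty.

    `env` defaults to the environment of the current process.
    """
    env = os.environ if env is None else env
    return {name: env[name] for name in PROXY_VARS if env.get(name)}


if __name__ == "__main__":
    settings = active_proxy_settings()
    if settings:
        for name, value in settings.items():
            print(f"{name}={value}")
    else:
        print("No proxy variables are set in this environment.")
```

Running this at the start of a job script shows which proxy variables, if any, the process will inherit. It only inspects the environment and does not test whether the wandb servers are actually reachable.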