If you are reading this, you probably know that we did not…. as we could not…. in the face of sustained temperatures beyond the design specification of the Research Data Centre. As with all engineering, there are safe operational limits and decisions are made about the conditions in which the system will operate. Similarly, air conditioning for NHS clinics stopped working as it was ‘too hot for them to operate’, some other HPC facilities shut down or partly shut down, and we all know about the trains.
If you want your facility to be affordable to build, economical to run and minimally damaging to the environment, there are compromises to be made. Losing 36 hours of service, or even a day or two more, in 4 years due to extreme weather does not feel unreasonable for BEAR, though I do understand how frustrating it must be when you don’t have the tools to do your research.
There certainly will be lessons to learn from this episode and, potentially, improvements we can make. Once we are fully operational again, we will consider both short-term measures and longer-term improvements. We do have a project underway to increase cooling capacity. We think this will give us a little more headroom from next year, even though that is not the primary purpose of the work.
I want to say thank you to my team, Advanced Research Computing, and particularly the group responsible for the infrastructure, who spent a very long Monday 18th July fighting valiantly and trying every trick in the book to assist the cooling systems and keep at least the essential services working. By late afternoon it was apparent that this was not going to be possible and that overnight temperatures would not drop sufficiently to alleviate the problems. Reluctantly, we shut down. Despite transport challenges, they were back on site on Tuesday 19th, at times working in 40°C heat, to ensure the data centre was safe and to prepare for a restart (as well as doing some essential equipment moves to avoid future disruption).
On behalf of ARC I would like to apologise for this service outage.
By Carol Sandys, Head of Advanced Research Computing & Research Engagement
Note from Editor: We have subsequently heard that both Google and Oracle had to turn off parts of their systems to protect them from the heat – see https://www.bbc.co.uk/news/technology-62202125 for details.