Over the past few months we’ve had a number of service issues with BEAR and this is something we’d really like to apologise for – we pride ourselves on providing excellent services and recently we’ve not lived up to that.

Sorry about this! 🙇🏻

I thought I’d take some time to explain some of the issues and what our team has been doing to resolve them. Some of the issues are complicated; others are symptoms of the challenges we face in keeping our services closely coupled. Knock-on effects between services are sometimes inevitable, but when things are working, we think integrated access to storage across all our services is beneficial.

Avoiding maintenance windows

Some of the issues have arisen because we’ve been trying to avoid the usual quarterly BEAR maintenance windows. We understand how disruptive these can be and, particularly after the extended window we took for our data centre migration in May, we’ve been trying to avoid further service outages this year. This has meant some of our maintenance tasks have fallen behind. BUT, we’re now on top of this and are in the process of delivering rolling upgrades to our systems. The approach seems to be working well, though it has taken us quite a while to reach this point. A lot of preparation and co-ordination is required. For example, we don’t want to run an HPC job that spans nodes with different versions of the operating system. Equally, some of the HPC applications are quite closely tied to the drivers installed (e.g. those for our high speed InfiniBand network), so we have to test these carefully and rebuild when needed.
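To illustrate the "no mixed operating system versions within one job" constraint, here is a minimal Python sketch. The node names, version strings and the `pick_nodes` helper are all made up for illustration; a real scheduler would normally express this through its own node features or constraints rather than code like this.

```python
from collections import defaultdict


def group_nodes_by_os(node_os):
    """Map each OS version to the sorted list of nodes running it."""
    groups = defaultdict(list)
    for node, version in node_os.items():
        groups[version].append(node)
    return {version: sorted(nodes) for version, nodes in groups.items()}


def pick_nodes(node_os, count):
    """Return `count` nodes that all share one OS version.

    Returns None when no single version has enough nodes, i.e. the job
    would have to span versions and should wait instead.
    """
    for _version, nodes in sorted(group_nodes_by_os(node_os).items()):
        if len(nodes) >= count:
            return nodes[:count]
    return None


# Hypothetical inventory part-way through a rolling upgrade.
inventory = {
    "node01": "7.6",
    "node02": "7.6",
    "node03": "7.9",
    "node04": "7.9",
    "node05": "7.9",
}

if __name__ == "__main__":
    for wanted in (2, 3, 4):
        print(wanted, "nodes ->", pick_nodes(inventory, wanted))
```

In this toy inventory a 4-node job cannot be placed at all until more nodes have been upgraded, which is part of why rolling upgrades need careful co-ordination.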

We also don’t have a full-scale or even a large test system – this would be impractical as well as unaffordable given the scale of the HPC and storage systems. That isn’t to say we don’t test changes, but sometimes issues only become apparent once in service and when running ‘at scale’. One really useful recent advance is the development of a set of application tests, each with known input files and expected outputs so we can validate a large number of the HPC applications in an automated way. To put this into perspective, currently this validation of the application stack can take 3-4 days to complete.
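As a rough idea of what one of these application tests looks like, here is a minimal Python sketch: run a command against a known input file and compare its output with the expected text. This is not our actual test suite; `run_case` and the stand-in "application" (a one-line word counter run through the Python interpreter) are hypothetical.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_case(command, input_text, expected_output):
    """Run `command <input file>` and check exit status and stdout."""
    with tempfile.TemporaryDirectory() as tmp:
        input_path = Path(tmp) / "input.txt"
        input_path.write_text(input_text)
        result = subprocess.run(
            command + [str(input_path)],
            capture_output=True,
            text=True,
            timeout=60,
        )
    return result.returncode == 0 and result.stdout.strip() == expected_output.strip()


# Stand-in for a real HPC application: counts the words in the input file.
WORD_COUNT = [
    sys.executable,
    "-c",
    "import sys; print(len(open(sys.argv[1]).read().split()))",
]

# Each case: (name, command, known input, expected output).
cases = [
    ("word count", WORD_COUNT, "a b c\n", "3"),
]

results = {name: run_case(cmd, text, expected) for name, cmd, text, expected in cases}

if __name__ == "__main__":
    for name, passed in results.items():
        print(f"{name}: {'PASS' if passed else 'FAIL'}")
```

Real applications need much longer runs and often a tolerance on numerical output rather than an exact text match, which is part of why the full validation takes days.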

This doesn’t mean we won’t be taking maintenance windows in the future – there are some cases where it’s unavoidable, but we hope to keep them shorter and less frequent.

But my files keep reappearing (or they aren’t there yet)!

This is an interesting problem we’ve had, where users have deleted files and they have magically reappeared. The Research Data Store (RDS) service has grown and is no longer only used for HPC work, so users’ run-away or high IO HPC jobs can have quite a big impact on it, and the RDS storage is creaking a little (more on this later!). A year ago, to address this, we implemented a caching technology between the RDS and most of our HPC compute nodes. This buffered the IO between the RDS and compute jobs. However, the code running this is complicated and it has become clear that it contains a number of bugs.

The re-appearance of files occurred when users had deleted files from RDS but they were still in the cache system. (Normally when you access a directory on the cache system, it updates and deletes the files there too, to keep everything in sync.) In some cases the cache got stuck either collecting data from the RDS or writing data back, and it would go into a state where it wanted to re-synchronise. During that re-sync it could mistake files that were still in the cache, but had been deleted from RDS, for new files that needed copying to RDS … and as if by magic, the files reappear!

The cache can also cause delayed updates to files on RDS. For example, if there is a lot of storage traffic, updates are queued, so a file that your HPC job has created might be waiting behind a lot of others in the queue, i.e. it can take a while for it to end up back at RDS.

Related to this, we found a weird quirk of the caching system where quota information wasn’t passed over from RDS to the cache. This meant that a user could go ‘over quota’. As no quota was set on the cache, the write to a file completed without issue. However, when the cache system then tried to push the file back to RDS, the write would fail because the user was over quota. The cache system would keep retrying the write and keep failing, so if a user had a lot of files waiting to be pushed back, they would block up the queue and nothing else would move. Eventually the cache got so far behind that it gave up and tried to do a full re-sync – see above about files reappearing …
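The blocking behaviour is easier to see in a toy model. The Python sketch below is only an illustration of a first-in-first-out write-back queue where a failed write is retried in place; it is not how the cache software is actually implemented, and `drain`, the user names and `max_attempts` (standing in for "so far behind it gave up") are all hypothetical.

```python
from collections import deque


def drain(queue, over_quota_users, max_attempts):
    """Process a FIFO write-back queue where failed writes are retried in place.

    Returns (flushed files, files still queued, whether a full re-sync
    would be triggered because the queue never emptied).
    """
    pending = deque(queue)
    flushed = []
    attempts = 0
    while pending and attempts < max_attempts:
        attempts += 1
        user, filename = pending[0]
        if user in over_quota_users:
            # The back-end rejects the write; the same entry stays at the
            # head of the queue and is retried on the next pass.
            continue
        flushed.append(filename)
        pending.popleft()
    needs_resync = bool(pending)
    return flushed, [name for _user, name in pending], needs_resync


queue = [("alice", "a1"), ("bob", "b1"), ("alice", "a2"), ("carol", "c1")]

if __name__ == "__main__":
    print(drain(queue, over_quota_users={"bob"}, max_attempts=10))
    print(drain(queue, over_quota_users=set(), max_attempts=10))
```

With one over-quota user in the queue, everything behind that user’s file is stuck, even though the other users have done nothing wrong.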

So we fixed this by adding quotas on the cache. We also added monitoring to look for long queues so we could check for these issues. And then we found another weird thing.

When the cache knew a file had been deleted from RDS, it removed it from the cache. Except it didn’t really – it squirrelled it away into another directory that users can’t see, just in case it needed to be recovered. So again users could go over quota, as deleted files still existed in the cache, with no obvious sign of where they were. Once we understood this, we put in procedures to clean up these “keep the deleted files” files.

We’re in the middle of a process to replace all the RDS systems. As part of this, the caching layer will be removed and the hardware redeployed for other use. We’re already stopping nodes from using the cache as we apply rolling updates to them.

I can’t access my files on my VM!

This problem has mostly gone away (well we worked around it for a lot of users), but users of BEARCloud VMs had issues where they used to be able to access their project storage, but suddenly couldn’t.

First I should explain that VMs use NFS to access storage rather than SMB (Windows) shares or native access (as used by HPC nodes). BEARCloud didn’t exist when we built the RDS, so storage access from VMs was a feature we added later. All was going well until we had users who couldn’t access new storage projects. After a bit of head scratching, we found this was a Linux limitation with NFS on the number of groups a user can belong to. As a user was added to more projects, they eventually hit the group number limit and then couldn’t access the storage – but ONLY from NFS clients (VMs). It worked fine from HPC nodes and over SMB. We found there is an NFS config option on the servers to move the checking of groups from the Linux NFS client to the server and great … this works with more groups! A little while later users reported they couldn’t access data and, after a lot of digging, we found that in this mode complex file access permissions (ACLs) didn’t work.

So we fixed one issue and introduced a new one! There’s a workaround we can apply to this and have done so as people reported the issue to us. Longer term we will review how the BEARCloud VMs access storage and have already identified two options to make this work better.
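If you’re curious whether an account is likely to hit the group limit described above, a quick check can be sketched in Python. This assumes the limit in question is the well-known cap of 16 supplementary groups carried in a traditional AUTH_SYS NFS credential, which is our best guess at the mechanism rather than something stated above; the helper name is made up.

```python
import os

# Assumption: traditional AUTH_SYS credentials carry at most 16
# supplementary group IDs with each NFS request.
AUTH_SYS_MAX_GROUPS = 16


def exceeds_auth_sys_limit(group_ids, limit=AUTH_SYS_MAX_GROUPS):
    """Return True if this set of group IDs will not fit in one credential."""
    return len(set(group_ids)) > limit


if __name__ == "__main__":
    # os.getgroups() lists the supplementary group IDs of the current process.
    mine = os.getgroups()
    print(f"{len(set(mine))} groups; over the limit: {exceeds_auth_sys_limit(mine)}")
```

Which groups get left out once the limit is exceeded depends on the client, so access can fail for some projects and not others.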

Fighting fires!

So we feel like we’ve been fighting fires a lot recently, and it’s caused a number of service issues as a result. I’ll just add a little detail on what we’ve been doing in the past week or two to fight something else.

We’re fairly sure that we’ve been hitting a bug in our storage software that was running on some of the older HPC nodes. All those in the new data centre were upgraded during the data centre move, but the kit that remained in the old site wasn’t updated (see why rolling maintenance is important!). This meant that in some limited cases, a user application could cause the storage client to hang in a weird way – the node was contactable on the network, but couldn’t respond to certain types of storage operation. Now when a node has properly failed, it gets booted out of the storage cluster and things carry on, but this was a weird failure mode. When a node was acting like this, certain types of operation could cause the whole file-system to hang until we intervened or shut various things down. We’ve since upgraded the storage code on all of our compute nodes.

Related to this we’ve also been looking at problems with memory pressure on compute nodes. This presented similar symptoms to the problem above. In the tuning guide for our storage software, there is a recommendation that we keep 5-6% of memory clear to optimise performance. This was fine on our older compute nodes, but as memory per node has increased, it’s got quite high (64GB on some of our 1TB systems!). This is a Linux kernel setting and it’s the amount the kernel should aim to keep free. Whilst fixing some of the other issues, we noticed that the tuning guide has been updated to include “but not more than 2GB”. At the same time, we also upgraded the scheduling system to a newer release which is a bit more aggressive in how cgroups are used. These are used by the scheduler to limit how much impact one user job can have on another. We also found reference to a kernel bug where, when a cgroup is removed, it doesn’t always get cleaned up properly – e.g. memory might not be freed correctly. The scheduling system has an amount of memory it knows it can use for user jobs; however, when we added up the numbers, there was a mismatch. The scheduler was assigning user jobs, the user jobs were (normally) consuming memory, but the kernel thought it needed to free memory… Adding this all together led to cases where the compute nodes were under memory pressure, which in turn led to instability of the storage software. We’re currently applying a kernel update to the compute nodes as part of rolling maintenance.
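To put rough numbers on the memory reservation described above, here is a small Python sketch. It assumes the kernel setting involved is `vm.min_free_kbytes` (it isn’t named above), uses 6% as a stand-in for the 5-6% guidance, and the node sizes and the `reserve_bytes` helper are hypothetical.

```python
GIB = 1024 ** 3


def reserve_bytes(total_bytes, percent, cap_bytes=None):
    """Memory to keep free: `percent` of RAM, optionally capped."""
    reserve = total_bytes * percent // 100
    if cap_bytes is not None:
        reserve = min(reserve, cap_bytes)
    return reserve


def as_min_free_kbytes(reserve):
    """Convert a byte count to the kB units a sysctl such as
    vm.min_free_kbytes would expect (assumed setting, see above)."""
    return reserve // 1024


if __name__ == "__main__":
    for total_gib in (16, 128, 1024):
        total = total_gib * GIB
        uncapped = reserve_bytes(total, 6)
        capped = reserve_bytes(total, 6, cap_bytes=2 * GIB)
        print(
            f"{total_gib:>5} GiB node: "
            f"uncapped {uncapped / GIB:.1f} GiB, "
            f"capped {capped / GIB:.1f} GiB "
            f"({as_min_free_kbytes(capped)} kB)"
        )
```

On a 1 TiB node the uncapped figure comes out at roughly 61 GiB, in the same region as the 64GB mentioned above, while the capped figure stays at 2 GiB. That difference is memory the scheduler believed it could hand to jobs while the kernel was trying to keep it free.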

In addition to all of this, we’ve also added monitoring into our system to look for the signs of this sort of issue occurring so we can take action before it causes service issues.

Replacing the RDS hardware

Finally a word on some other developments. Our RDS hardware is now quite old and filling rapidly. So we’re in the process of refreshing the service offering. In preparation there have already been a number of behind the scenes updates to enable us to scale better and be more performant. We’re expecting a delivery of new hardware in the next few weeks and have planning in place to migrate all the data to the new system. We hope to do a lot of this in the background, but will need some interruption to service to complete the final data transfer – unfortunately this is one of those unavoidable maintenance windows!

As part of the new solution, we’re moving away from replicating all data by default to replicating only data with a high availability requirement. This doesn’t mean we take any less care of your data, but as volumes of data grow (we’re at the multi-petabyte scale), it’s more cost effective if we don’t replicate everything. With the new Park Grange Data Centre, we have a higher level of availability in the space, and the replication doesn’t help when we need to do critical maintenance anyway. We’ll still be able to do live upgrades on most of the system, and the high availability only really comes into play if the data centre environment fails for some reason. Backups will still be made to two sites so we’ll always have a disaster recovery version of the data if needed.

So overall I’m pretty optimistic that the worst of our service instability is behind us. The whole team has been working hard (including going beyond expectations and working through the night at times – thanks all!) to make the service reliable.

We’ve also got new technology coming along to better support a wider range of research including our announcement about delivering the UK’s largest IBM PowerAI cluster!

Thanks for understanding and bearing with us! And as ever, please log any service issues via the IT Service Desk.

Images: From Wikimedia Commons, attribution listed in each image link.