As users of some of our services will probably know, we’ve been having issues with our storage services on and off for a while now.
Firstly, we have never lost any user data during this period; for most users it has manifested itself as slow access at times. Occasionally there has been no access at all, though not usually for long, and this will mostly have affected HPC users or those doing data transfers overnight.
I can assure you that we take this very seriously and have been working away under the hood to fix things, and I’m really hopeful that we will soon be back to a stable storage platform. This post will get technical in places (see the Glossary at the end for technical terms/abbreviations), but we think it’s important to be honest about what we’ve seen, what we’ve done and why it’s taken a while to get back to a stable service.
Really, we’ve been suffering from growing pains: we’ve added a lot of services in the past 18 months while trying to build a flexible and useful set of services for researchers at Birmingham. We don’t like it when things don’t work as expected or when we give users a poor service.
First, let’s go back a few years to the days before we had the Research Data Store (RDS). Users dumped research data onto the HPC storage system; sometimes there was School or College storage, or even piles of CDs and DVDs in offices. We were lucky enough to secure funding from the University to support the installation of the RDS, and we now have researchers from across the University successfully using the system – we provide integrated access from Windows, Linux and Mac end-user systems, HPC systems and Cloud VMs.
Our RDS storage is designed to be a highly available system – this means that if one of the University’s data centres fails, we can continue to provide access to this vital research data. Sure, data centre failures are rare, but power issues can occur, contractors can cut through fibres, and so on. Whenever you write your data to the RDS, it is automatically and immediately written to storage in both data centres before your client is told that the write has completed, so both copies are safe should something happen to one of our systems.
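To make that concrete, here’s a minimal sketch in Python of what “write both copies before acknowledging” means. It is purely illustrative – the real replication is handled inside the storage software, and the mount points below are made up:

```python
# Illustrative sketch only -- the real RDS replication is done by the storage
# software itself. The two "data centre" paths below are invented examples.
import os

REPLICA_DIRS = ["/replica/dc1", "/replica/dc2"]   # hypothetical mount points, one per data centre

def replicated_write(relative_path: str, data: bytes) -> None:
    """Write the data to both data centres before telling the caller it's done."""
    for root in REPLICA_DIRS:
        target = os.path.join(root, relative_path)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        with open(target, "wb") as fh:
            fh.write(data)
            fh.flush()
            os.fsync(fh.fileno())   # make sure it really is on disk, not just in a buffer
    # Only now does the client get "write complete" -- both copies already exist.

replicated_write("project/results.csv", b"sample,42\n")
```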
When we designed the solution, we knew HPC users would want access to the storage, so initially it was mounted and accessible only on the login/head nodes. We wanted users to copy data to the local HPC storage system, work on it there and copy the results back. This approach is quite common in other university HPC systems, but of course we got requests to make it available on the compute nodes directly. You might think “surely we can just mount it” – that’s true (and we did in the end), but the reason it wasn’t designed like this to start with is related to the two copies being written. For a client using SMB (Windows-style access), the extra time each write takes to complete (latency) isn’t a big issue, but for HPC compute jobs that latency can introduce big delays in job completion. Heavy IO workloads from the HPC nodes can also cause performance issues for SMB users.
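To give a feel for why the same latency hurts one workload and not the other, here is some rough back-of-the-envelope arithmetic. The figures are assumptions for illustration, not measurements from our systems:

```python
# Rough illustration only -- the latency figure and write counts are assumptions,
# not measurements from our systems.
extra_rtt_s = 0.001        # assume each write picks up ~1 ms waiting for the second data centre
smb_writes = 1_000         # a typical interactive SMB session: relatively few writes
hpc_writes = 1_000_000     # an IO-heavy HPC job: huge numbers of small writes

print(f"SMB session overhead: ~{smb_writes * extra_rtt_s:.0f} s")  # ~1 s, nobody notices
print(f"HPC job overhead:     ~{hpc_writes * extra_rtt_s:.0f} s")  # ~1000 s added to the job
```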
Doing what the users wanted …
We eventually relented and agreed to make the storage accessible on the HPC nodes, and this is probably where our problems started. Sure, it’s easy for users to access their data, but it kicked off a whole host of other issues.
We use a parallel file-system for the storage, which means lots of HPC or SMB clients can access the storage really quickly, and we can sustain the loss of servers (e.g. a crash or hardware fault) without users noticing. But it also means that communication between the file-system clients is really important to ensure file-locking works properly – i.e. so data corruption doesn’t occur if you access the same file from two places. You might not even know you are doing this, by the way, as we have a cluster of servers providing access.
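For illustration, this is the kind of thing the file-system’s lock manager does for you across the whole cluster. The sketch below uses plain POSIX advisory locks on a single machine (and an invented path), so it is only an analogy for the cluster-wide locking the storage software provides:

```python
# Single-machine analogy of cluster-wide file locking; the path is invented.
import fcntl

def append_safely(path: str, line: str) -> None:
    """Take an exclusive lock before writing, so two writers can't interleave and corrupt the file."""
    with open(path, "a") as fh:
        fcntl.flock(fh, fcntl.LOCK_EX)      # blocks until nobody else holds the lock
        try:
            fh.write(line + "\n")
            fh.flush()
        finally:
            fcntl.flock(fh, fcntl.LOCK_UN)  # release so other clients can carry on

append_safely("/rds/projects/example/log.txt", "result from node 042")
```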
Growing storage
Incidentally, we added two new storage systems in the past year, providing several PB of storage to support life sciences. We also re-architected our backup and DR solution. We have two tape libraries; all our storage is backed up to tape and the tapes are copied between the two data centres. We have a DR solution that allows us to restore service in hours rather than weeks (given we have PB of data …!). Our tapes are all encrypted in case of theft or loss, and the tape libraries automatically verify every 90 days that each tape is still readable. We can reconstruct a failed tape from the copy in the other data centre. Like I said, we care about not losing your data …
This is all relevant here because, to do it, we’ve had to add extra servers to drive it all, and it takes staff effort to implement – and to implement without downtime.
Troubleshooting and fixing
During December we spent a lot of time working out why we were seeing the RDS file-system going offline periodically, and we found a root cause. This was related to the way some of the internal routing is done. When we first started, it was based on a couple of Linux boxes, which worked well when it was just some basic firewalling for management networks, but as things grew we started hitting limits on them. This actually also had an impact on the HPC service, though we didn’t realise it for a while – the nodes running the firewall were also the HA pair for the scheduler, and were being overwhelmed by network connections and missing IRQs from the network interfaces. This was causing issues with the HA and, more importantly, network issues for the storage systems. The storage systems automatically take action and “expel” nodes to prevent data corruption when they see network problems; this was causing the servers sharing the data via SMB to Windows (etc.) clients to temporarily (2–3 minutes) “lose” sight of the storage, which kicked off issues of their own. Additionally, when one of the Linux routers became overwhelmed with connections, it would crash. The second would take over, but would suddenly be “stormed” by several hundred HPC nodes trying to reconnect to the storage, causing it to, er … crash.
We learnt how to intervene manually, but couldn’t do anything out of hours, and resolution could take an hour or so before we got everything stable again.
The solution, of course, was to replace the Linux routers with something more capable. We had plans to do this in early January but, given it was high risk, it took us a while to get to the point where all the other bits were lined up and we could schedule a maintenance window. Being a service provider, we have to manage risk, impact and workarounds, and as this was a high-risk change we wanted to be sure it was the right choice and that we knew it would work.
I should note that, whilst planning this, we found a firewall config issue that was artificially limiting speed. We fixed it, and that caused us some more issues… Basically, we were accidentally NATing storage traffic; when we fixed this we increased performance, which made some of our other issues worse! There was also a quirk in routing that we didn’t find until much later, as it was masked by the other issues we were dealing with.
Now, all of this storage instability was having effects on other services. BEAR Cloud uses the same storage technology for the VM images, which means we can live-migrate VMs and spawn them on different hypervisors at will. But when the storage goes offline, the VMs’ discs become read-only, and a number of our pilot users will have experienced this. The storage system automatically heals itself, but each affected VM needs a reboot to make it work again. So we were chasing issues from this side of the system as well!
Once we’d replaced the routing, the storage system stabilised quite a lot; we all breathed a sigh of relief and turned back to what we had been doing – getting BEAR Cloud up and running and providing transparent access to data from within VMs.
We wanted to get good performance out of it and were finding the access method we’d designed was slower than we expected. There is a bug in the Linux kernel related to VXLAN offload technology (which exists to make networking go fast in a VM environment), so we changed how we did the networking. This meant we could use the standard export servers used for the SMB systems, and we got a massive performance boost from this – way more than we expected, but the protocol servers were spec’d for data transfer, so it makes sense really.
And then of course, we listened to the users again and mounted shared home directories, along with access to the apps we run on the HPC systems, in the VMs…
First off, we have fail-over NFS running to make this work in an HA manner, but due to a bug in the code that hands over between servers, and the way we had configured the LDAP services, the fail-over would periodically hang on some VMs. We have a workaround for this and are expecting a code fix soon as well, so this problem is cleared now.
Most recently, we’ve seen our export servers suddenly running really slowly. It will be one of the four servers: the load will rise to over 400 and even checking the storage directly will appear slow (i.e. it’s not just the protocol layer being slow). It will only ever be one node, and if we mark it as failed and then unfail it, it recovers. The slowness was also having an effect on SMB services and DataShare – our sync ’n’ share tool. That is HA, but HA didn’t help here: the load balancers could still talk to the services, they were just really, really slow at times.
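For the curious, “the load will rise to over 400” is just the standard Linux load average. Below is a minimal sketch of the sort of check we were running by hand on each export server – our own illustration, not a vendor tool, and the threshold is simply the kind of figure we were seeing:

```python
# Read the 1-minute load average on this node and flag it if it looks wildly
# out of range. Illustrative only; 400 is the sort of figure we were seeing.
LOAD_ALERT = 400.0

def one_minute_load() -> float:
    with open("/proc/loadavg") as fh:   # Linux-specific
        return float(fh.read().split()[0])

load = one_minute_load()
if load > LOAD_ALERT:
    print(f"load {load:.0f} - this node probably needs failing over and back")
else:
    print(f"load {load:.0f} - looks normal")
```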
Our storage software includes network testing tools, and using these we found the odd thing – a few DNS issues, but nothing major – and it has taken us a while to finally track this down (we think!).
I mentioned a quirk of routing earlier: this has been biting us for months, and we think it is the cause of the slowness we’ve seen in the past month. It is all related to the Linux kernel and rp_filter, which drops traffic it thinks has arrived on the wrong interface. Many of our storage systems have multiple network cards and access to multiple networks. When we were accidentally NATing the traffic this was hidden, but once we fixed that, the traffic wasn’t being NATed and so was hitting the rp_filter rules. This looks weird when you debug it, as ping etc. work but data doesn’t flow properly.
With a lot of effort, we have tweaked rp_filter across a number of servers (not just storage but also backup nodes). What is weird here is that we can mount the file-systems and access them fine on the protocol nodes, yet something clearly triggers the behaviour that makes one of them run slowly. Hopefully, with the rp_filter issue fixed, this is resolved now, but we need to wait a few days and see!
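For anyone wanting to poke at this themselves, rp_filter is just a per-interface kernel setting (0 = off, 1 = strict, 2 = loose). The sketch below shows the kind of check and change involved; the interface name is an example, and the exact values we chose depend on the server in question:

```python
# Inspect and relax rp_filter per interface -- equivalent to reading/writing
# net.ipv4.conf.<iface>.rp_filter with sysctl. The interface name is an example;
# writing requires root, and in practice you'd persist the setting via /etc/sysctl.d.
import glob

def show_rp_filter() -> None:
    for path in sorted(glob.glob("/proc/sys/net/ipv4/conf/*/rp_filter")):
        iface = path.split("/")[-2]
        with open(path) as fh:
            print(f"{iface:12s} rp_filter = {fh.read().strip()}")   # 0 = off, 1 = strict, 2 = loose

def set_loose(iface: str) -> None:
    """Move an interface from strict (1) to loose (2) reverse-path filtering."""
    with open(f"/proc/sys/net/ipv4/conf/{iface}/rp_filter", "w") as fh:
        fh.write("2\n")

show_rp_filter()
# set_loose("eth1")   # e.g. the storage-facing interface
```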
So far, each time we have fixed an issue, we’ve then run into something new…
I feel like I’ve been chasing storage issues forever now, but I’m hopeful we are finally getting there. Thanks to Rad and Ed for their perseverance in fixing bits along the way, putting up with the issues and dealing with the user tickets that resulted! And we are truly sorry if you have been affected by this, whether by poor performance or by loss of service for a while!
Hopefully you can see that we’ve been working hard to fix the issues, and have made significant architectural changes to help alleviate them and improve performance.
Next steps
So, I’ve explained what we’ve been up against and what we’ve been fixing. We’re not done yet, but we think we have now ironed out the little network issues scattered across the system.
So what next?
Well, we’ve just taken delivery of a new DSS-G storage system from Lenovo. We’re the first in Europe to get it, and amongst the first in the world. What we plan to do is remove the direct link between the HPC systems and the RDS to prevent them impacting each other. BUT we still want the data access to be transparent, so the new storage system will basically act as a big caching layer between the HPC nodes and the RDS storage. HPC nodes will benefit from the cache acknowledging their writes immediately; the cache will then sort out the copy to the second data centre as soon as possible, without passing that latency on to the HPC job. This will mean that we’ll be decommissioning the old HPC storage and moving all project data over to RDS storage. We’ll be working with research groups on that in the coming months.
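Here is a rough sketch of the write-back idea – this is our description of the intent, not Lenovo’s or the storage software’s actual implementation, and the paths are made up. The HPC write is acknowledged from the cache straight away, and the copy back to the RDS happens in the background:

```python
# Conceptual sketch of write-back caching; not the real implementation, paths invented.
import queue
import shutil
import threading

copy_queue: "queue.Queue[tuple[str, str]]" = queue.Queue()

def hpc_write(cache_path: str, data: bytes, rds_path: str) -> None:
    """Write to the local cache and return straight away; the copy to RDS happens later."""
    with open(cache_path, "wb") as fh:
        fh.write(data)                       # fast: local to the HPC side
    copy_queue.put((cache_path, rds_path))   # queue the copy back to the RDS

def background_sync() -> None:
    while True:
        src, dst = copy_queue.get()
        shutil.copyfile(src, dst)            # inter-data-centre latency is paid here, not by the job
        copy_queue.task_done()

threading.Thread(target=background_sync, daemon=True).start()
# hpc_write("/scratch/cache/out.dat", b"...", "/rds/projects/example/out.dat")  # example call, paths invented
```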
We’re also upgrading our core switches to give higher performance and enable us to increase the size of the data pipes between the data centres to ensure these aren’t causing us problems as we grow.
Finally, we have just completed an upgrade to the licences of the storage software which will allow us to use encryption at rest, meaning we can encrypt data on the fly as it is written to the storage systems without any user intervention. We need to look at the performance and user requirements of this, so it won’t be rolled out for all users.
Glossary and Abbreviations – for non-technical readers!
DNS – Domain Name System
DR – Disaster Recovery
DSS – Distributed Storage Solution
HA – High Availability
HPC – High Performance Computing
Hypervisors – Software, hardware or firmware that creates and runs virtual machines
IO – Input Output
IRQ – Interrupt Request – a hardware signal sent to the processor that temporarily stops a program running, allowing an interrupt handler program to run instead
LDAP – Lightweight Directory Access Protocol
NAT – Network Address Translation – rewriting addresses so that traffic appears to come from a different machine (often a single shared one)
NFS – Network File System
rp_filter – Reverse path filter – a Linux kernel feature that drops traffic which appears to arrive on the wrong network interface
VM – Virtual Machine
VXLAN – Virtual Extensible LAN – a network virtualisation (tunnelling) technology used to connect VMs across the network