Baskerville pushes the boundaries


In this case study, we hear from Christos Baziotis, from the University of Edinburgh, who has been making use of Baskerville to enable his research in text generation using machine learning models.

I am a third (final) year Ph.D. candidate at the ILCC at the University of Edinburgh and a member of EdinburghNLP, under the supervision of Barry Haddow and Alexandra Birch, in collaboration with Biao Zhang.

In my research, I aim to enable machine learning models to learn with limited supervision by exploiting prior knowledge from unlabelled data. I am particularly interested in text generation, such as summarization or machine translation, and multilingual natural language processing tasks. 

What is your research about?

Our research is on machine translation (MT), which is the technology used by tools like Google Translate to automatically translate text from one language to another. MT is an important tool for many people who want to communicate with others who do not share a common language, for example for those who work or study in another country, for cross-border commerce, or for international relief efforts.

Current state-of-the-art MT is implemented using a neural network that is trained on large amounts of parallel text. A parallel text corpus contains pairs of sentences written in two different languages, but with the same meaning, produced by human translators. Since parallel text is expensive to produce, researchers have developed ways of using monolingual text (which is much more plentiful) to improve MT systems. 
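To make the distinction between the two kinds of data concrete, here is a small illustrative sketch in Python (the sentence pairs below are invented examples, not drawn from any real corpus):

```python
# Invented examples, purely to illustrate the two kinds of training data.
# A parallel corpus pairs sentences with the same meaning in two languages:
parallel_corpus = [
    ("The cat sat on the mat.", "Die Katze saß auf der Matte."),
    ("Where is the train station?", "Wo ist der Bahnhof?"),
]

# Monolingual text has no translation attached and is far more plentiful:
monolingual_corpus = [
    "The weather was unusually warm for October.",
    "She finished reading the book in one sitting.",
]
```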

Training of neural networks uses a variant of stochastic gradient descent (SGD). Each step of SGD consists of an evaluation of the neural network on a subset of the training data (a batch) followed by a calculation of the gradient on the same batch. We then update the network parameters and iterate, continuing training steps until convergence. The evaluation and gradient calculation require a large number of matrix multiplications, meaning that the parallel computation power of a GPU is essential to be able to train the network in a reasonable time. Increasing the batch size allows training to converge faster, but requires more GPU memory.  
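As a minimal sketch of one such training loop in PyTorch (the tiny linear model and random data below are stand-ins for illustration; a real MT model is a large sequence-to-sequence network):

```python
import torch
import torch.nn as nn

# Toy stand-ins: a tiny linear "model" and random data, purely for illustration.
model = nn.Linear(10, 2)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Fake "training data": 100 examples split into batches of 20.
inputs = torch.randn(100, 10)
targets = torch.randint(0, 2, (100,))
batches = list(zip(inputs.split(20), targets.split(20)))

for epoch in range(5):                      # iterate (here for a fixed number of epochs)
    for batch_inputs, batch_targets in batches:
        optimizer.zero_grad()
        outputs = model(batch_inputs)       # evaluate the network on the batch
        loss = loss_fn(outputs, batch_targets)
        loss.backward()                     # compute the gradient on the same batch
        optimizer.step()                    # update the network parameters
```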

Baskerville’s many GPUs with large memory allow us to reach the desired batch sizes with limited or no gradient accumulation steps, reducing the waiting time in array jobs.

MT works well in languages and domains where there are plentiful parallel texts, but this mainly applies to languages spoken in Europe, plus Chinese. For most of the world’s languages, parallel text is in very short supply. One approach to expanding MT to languages with less data is to use multilingual machine translation (MMT), where a system is trained to translate many different language pairs simultaneously. This approach can improve the performance of low-resource language pairs, using transfer from high-resource languages. MMT is the basis of Meta’s “No Language Left Behind”, and is used in Google’s online translation service.
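One common recipe for MMT, used for example in Google’s multilingual NMT system, is to prepend a target-language tag to each source sentence so that a single model can handle all translation directions. A small illustrative sketch (the tags and sentences are invented):

```python
# Invented examples. Prepending a target-language tag to the source sentence
# lets one model translate between many language pairs.
training_examples = [
    ("<2de> The cat sat on the mat.", "Die Katze saß auf der Matte."),        # en -> de
    ("<2fr> The cat sat on the mat.", "Le chat s'est assis sur le tapis."),   # en -> fr
    ("<2en> Wo ist der Bahnhof?", "Where is the train station?"),             # de -> en
]
```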

Using a multitask learning approach, MMT training can use both monolingual and parallel data at the same time. An effective MMT system requires a high-capacity neural network, that is, one with many parameters, which means it is more resource-intensive to train than pairwise MT. We are seeking to improve MMT along several dimensions. Firstly, we have noted that the current way of doing MMT works well for certain datasets but is much less effective on others, and we want a deeper understanding of the success factors.

We are also seeking ways of better exploiting monolingual data, especially in related languages, using architectural variants known as adapters and hyper-adapters, and (non-parametric) retrieval methods. 

Why did you choose Baskerville?

In the beginning, we started using Baskerville for the quality of the hardware. Having A100 GPUs means that training steps run faster than on earlier types of GPUs. The larger-memory GPUs enable us to train the large-capacity models required for MMT without resorting to model parallelism (splitting the model across multiple GPUs), which would slow down training and make it more complex. The A100 GPUs also enable us to use mixed- or half-precision (also known as FP16) training, which further reduces memory consumption significantly, enabling us to use larger models and batch sizes.
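As a minimal sketch of what mixed-precision training looks like in PyTorch (the toy model and random data are stand-ins; our actual training runs inside a larger framework):

```python
import torch
import torch.nn as nn

# Toy stand-ins, purely for illustration; a real MT model is a large Transformer.
device = "cuda"
model = nn.Linear(10, 2).to(device)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()        # rescales the loss to avoid FP16 underflow

for step in range(100):
    inputs = torch.randn(32, 10, device=device)
    targets = torch.randint(0, 2, (32,), device=device)
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():         # forward pass runs in mixed precision
        loss = loss_fn(model(inputs), targets)
    scaler.scale(loss).backward()           # backward pass on the scaled loss
    scaler.step(optimizer)                  # unscales gradients, then updates parameters
    scaler.update()
```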


The quicker experimental turnaround is important for work in our area. Our work normally proceeds in an iterative cycle: think of an idea – run an experiment to test it – examine results and update idea – iterate. If the cycle becomes too long, because experiments run too slowly, then this type of iteration is not possible. There is too much time spent waiting for results. This means that certain types of work, for example multilingual machine translation (as we describe above) are only possible with sufficient computational resources. 

As we mentioned above, training time for MMT is dependent on batch size, in other words how much data we can push through each gradient update step. Having access to many GPUs in Baskerville means that we can do multi-GPU training using data parallelism. For instance, when we want to train a model with a batch size of 32K words, but our GPUs can only fit 4K words at a time, we split the batch into 8 shards and process each shard on a different GPU, making a gradient update after all the shards have been processed. The alternative to multi-GPU training would be gradient accumulation, where the shards are processed sequentially on the same GPU (see the sketch below). This is obviously much slower than multi-GPU training.
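To illustrate the gradient-accumulation alternative, here is a minimal PyTorch sketch with toy tensors (with multi-GPU data parallelism the shards would instead be processed concurrently, typically via DistributedDataParallel):

```python
import torch
import torch.nn as nn

# Toy stand-ins for illustration. In the real setting a shard is e.g. 4K words
# and 8 shards make up a 32K-word batch; here each shard is a small random tensor.
model = nn.Linear(10, 2)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
num_shards = 8

optimizer.zero_grad()
for i in range(num_shards):
    inputs = torch.randn(4, 10)
    targets = torch.randint(0, 2, (4,))
    # Divide by num_shards so the accumulated gradient matches that of the full batch.
    loss = loss_fn(model(inputs), targets) / num_shards
    loss.backward()                         # gradients accumulate in the .grad buffers
optimizer.step()                            # one parameter update for the whole batch
```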

Because of the long training time for our models, and the fact that the cluster necessarily limits job time, we use array jobs and checkpointing. This means that we potentially have to queue for a considerable time between jobs in the array. Using large batches to reduce training time also reduces the number of jobs in the array, and so we have less queuing overhead.
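A minimal sketch of how resuming from a checkpoint at the start of each array job can work (the file name, toy model, and step counts are assumptions for illustration, not our actual scripts):

```python
import os
import torch
import torch.nn as nn

CKPT = "checkpoint_last.pt"                 # hypothetical checkpoint path
model = nn.Linear(10, 2)                    # toy stand-in for a real MT model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
max_steps, save_every = 10_000, 500

# At the start of each array job, resume from the latest checkpoint if one exists.
start_step = 0
if os.path.exists(CKPT):
    state = torch.load(CKPT)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    start_step = state["step"] + 1

for step in range(start_step, max_steps):
    # ... one training step on a batch would go here ...
    if step % save_every == 0:              # save periodically so the next job can continue
        torch.save({"model": model.state_dict(),
                    "optimizer": optimizer.state_dict(),
                    "step": step}, CKPT)
```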

In addition to having state-of-the-art processing power, we have found that setting up and running experiments on Baskerville has been a smooth experience. In our group, we have significant experience in using Slurm-based HPC clusters, but every cluster has a slightly different way of doing things. The Baskerville configuration and documentation made it straightforward to get started.

What situation have you found benefits the use of the A100-80 over the A100-40 GPUs?

The main benefit of the large memory GPUs is that we can process larger batches of the training set at the same time, reducing overall training time (see the previous answer). The large capacity models required for MMT require large training batches for each training step.

Processing these large batches is only possible by splitting the batches across many GPUs (incurring communication overhead) and/or having each GPU process several shards per training step (increasing the time for each step). Having large memory GPUs reduces these overheads, meaning that each training step can be processed faster. 

The advantage of the A100-80 over the A100-40 GPUs is that doubling the size of the GPU memory enables us to use batch sizes that are more than double in size. This is because in data parallelism we split the batch into N parts and process them in parallel, but importantly, we load a copy of the model onto each GPU. This means that on smaller GPUs there is less memory left over to process each batch, and the problem becomes more pronounced as we use larger models. Figure 1 illustrates this with an example. With the A100-80s we can reach the desired batch size with (slightly) less than half the number of GPUs compared to the A100-40s.

Figure 1: Memory usage for A100-40GB vs A100-80GB. The model size is fixed, but the larger-memory GPU enables us to more than double the batch size processed in each iteration.
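As a back-of-the-envelope illustration of this effect (the numbers below are invented for clarity, not measurements from Figure 1):

```python
# Toy numbers, not measurements: suppose a copy of the model plus its optimizer
# state occupies 30 GB on every GPU, and the remaining memory holds the batch.
model_footprint_gb = 30

for gpu_memory_gb in (40, 80):
    batch_memory_gb = gpu_memory_gb - model_footprint_gb
    print(f"{gpu_memory_gb} GB GPU -> {batch_memory_gb} GB left for the batch")

# Output:
#   40 GB GPU -> 10 GB left for the batch
#   80 GB GPU -> 50 GB left for the batch
# Doubling the GPU memory leaves five times the room for the batch here, which is
# why the per-GPU batch size can more than double on the A100-80.
```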

We were so pleased to hear how Christos and his colleagues are able to make use of what is on offer from Advanced Research Computing, and particularly how he has made use of the GPUs on Baskerville. If you have any examples of how it has helped your research, then do get in contact with us at bearinfo@contacts.bham.ac.uk. We are always looking for good examples of the use of High Performance Computing to nominate for HPCwire Awards; see our recent winners for more details.