The popularity and effectiveness of Deep Learning (DL) have led to advances in several areas, including speech processing, image recognition, and natural-language processing. Data-parallelism has become an established paradigm for training Deep Neural Networks (DNNs) on multiple GPUs to reduce training time. However, data-parallelism cannot be used when the model does not fit in the memory of a single GPU. Such models are known as out-of-core DNNs, because their memory requirement even at batch size 1 exceeds the available GPU memory. This limitation makes DNN training on very large real-world images (512×512, 1024×1024, and 2048×2048 pixels) impossible.
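To see why large images push a model out-of-core, consider a rough, hypothetical estimate (not from this work) of the fp32 activation memory of a small stack of stride-2 convolutions at batch size 1. The function and channel counts below are illustrative assumptions; the point is that activation memory grows quadratically with image resolution, before gradients, optimizer state, or deeper architectures are even counted.

```python
def activation_gib(resolution, channels=(64, 128, 256, 512), bytes_per_val=4):
    """Sum fp32 activation sizes for a toy conv stack that halves the
    spatial size at every stage (stride-2 convs), at batch size 1.
    Returns the total in GiB. Purely illustrative numbers."""
    total_vals = 0
    h = w = resolution
    for c in channels:
        total_vals += c * h * w  # one feature map per stage
        h, w = h // 2, w // 2    # spatial downsampling
    return total_vals * bytes_per_val / 2**30

# Activation memory scales quadratically: each doubling of the input
# resolution quadruples the footprint of this toy stack.
for res in (512, 1024, 2048):
    print(res, round(activation_gib(res), 2))
```

Even this four-layer toy grows 16× between 512×512 and 2048×2048 inputs; a realistic network with tens of layers, stored gradients, and optimizer state multiplies that footprint well past the capacity of a single GPU.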
To address this fundamental limitation of data-parallelism, a new parallelization strategy called "Model-Parallelism" is gaining attention in the CS and AI communities. The inherent dependencies between the forward and backward passes serialize the workers involved in model-parallel training, leading to under-utilization of precious GPU resources. Existing state-of-the-art systems employ pipelining to overcome this limitation. However, pipelining is effective only when the batch size is greater than 1 (ideally, greater than the number of GPUs). Yet any increase in batch size overflows the GPU's memory, making the model untrainable. To overcome this issue, we present novel memory-aware optimizations that accelerate training and reduce the under-utilization of GPU resources. Our proposed designs enable the training of larger batch sizes and increase performance with the batch size on the same number of resources. We integrate our memory-aware designs with data-parallelism and scale the training to 1,024 GPUs.
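The requirement that the batch size exceed the number of GPUs follows from the pipeline "bubble": with S stages and B micro-batches, a GPipe-style schedule keeps each stage busy for only B of the B + S - 1 pipeline steps. The sketch below is a generic illustration of that arithmetic, not the system described here; `bubble_fraction` is a hypothetical helper.

```python
def bubble_fraction(num_stages, num_microbatches):
    """Idle-time fraction of a simple pipeline schedule: with S stages
    and B micro-batches, the pipeline runs for B + S - 1 steps, of which
    each stage is idle for S - 1 (the fill/drain bubble)."""
    total_steps = num_microbatches + num_stages - 1
    idle_steps = num_stages - 1
    return idle_steps / total_steps

# With 4 pipeline stages (GPUs), a single micro-batch leaves each GPU
# idle 75% of the time; the bubble shrinks only as B grows past S.
for b in (1, 4, 16):
    print(b, round(bubble_fraction(4, b), 2))
```

This is exactly the tension the memory-aware designs target: shrinking the bubble requires more micro-batches (a larger effective batch), but a larger batch is what overflows GPU memory in the first place.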