Maximizing efficiency with GPU Time Slicing and Multi-Instance GPU on AKS

In today’s data-driven world, high-performance computing (HPC) and machine learning workloads often rely on powerful GPU technology to handle complex calculations and large-scale data processing. GPUs excel in parallel processing, making them well-suited for applications such as image recognition, real-time analytics, and deep learning. However, managing GPU resources efficiently can be a challenge, especially as demands shift and infrastructure costs escalate.

This article explores two advanced GPU management techniques: Time Slicing and Multi-Instance GPU (MIG), that are transforming how businesses optimize GPU usage on platforms like Microsoft Azure. Time Slicing allows multiple tasks to share a single GPU in time-based intervals, maximizing resource utilization during fluctuating demand. Meanwhile, MIG technology splits a GPU into isolated instances, enabling multiple processes to run in parallel without interference. By implementing these strategies, companies can achieve better cost control, reduce idle time, and ensure their applications scale effectively to meet performance needs. As we dive deeper into these technologies, we’ll look at how they work, the benefits they offer, and practical insights for deploying them in cloud environments.

Use case

Cloud Adventures, a company that delivers custom solutions and infrastructure platforms to software companies using Microsoft Azure, encountered a significant challenge with one of their clients. The client required an application with an advanced image recognition feature. Cloud Adventures developed a proof-of-concept solution that worked well on regular CPUs. The client was impressed and wanted to move forward toward a production-ready scenario. However, scaling up revealed that training models on CPUs would take days, so they decided to switch to GPUs for improved efficiency. Using the NVIDIA CUDA toolkit, Cloud Adventures designed a solution that took full advantage of the GPUs. However, they noticed that infrastructure costs increased significantly with the use of GPUs. While the client understands that the change in hardware results in higher Azure consumption, they require the platform to be efficient and overhead to be minimized.

Faced with this challenge, Cloud Adventures documented the following requirements based on their own development work and conversations with the customer:

  • The application leverages the NVIDIA CUDA toolkit
  • They have a working proof of concept, but a production-ready solution requires GPUs
  • The solution needs to be efficient and minimize overhead
  • The workload does not run 24/7; demand changes throughout the day and week

GPU Time Slicing and Multi-Instance GPUs

What do we do now? Let's take a look at two concepts: Time Slicing and Multi-Instance GPUs. Combined, they can provide us with a cost-effective solution.

GPU Time Slicing

GPU Time Slicing is like a single-lane road where cars (or tasks) have to take turns passing through. Each car has a small amount of time to use the lane before moving aside, allowing the next car to go. Even if the lane is available for only short intervals, each car still makes gradual progress toward its destination without requiring an entirely dedicated road.

Example: In Cloud Adventures' case, if the image recognition application only needs GPU power intermittently (like a car that only enters the lane every few seconds), time slicing lets different tasks take turns on the same GPU lane. This frees up the "lane" for other applications or allows it to stay empty when demand is low, helping reduce costs.

Benefits:

  • Efficiency: Since tasks only use the lane when needed, there’s no wasted GPU power.
  • Cost Reduction: Sharing a single GPU resource in this way saves on infrastructure costs by minimizing idle time.
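As a sketch of how this looks in practice: with the NVIDIA GPU Operator installed (more on that below), time slicing is enabled through a device-plugin configuration. The commands below are a minimal sketch, assuming the operator runs in a `gpu-operator` namespace and that four-way sharing fits the workload; the name `time-slicing-config` and the `any` config key are illustrative.

```shell
# Sketch: expose each physical GPU as 4 schedulable nvidia.com/gpu resources.
# The ConfigMap name and the "any" key are illustrative choices.
kubectl apply -n gpu-operator -f - <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4
EOF

# Point the operator's device plugin at the config above.
kubectl patch clusterpolicies.nvidia.com/cluster-policy \
  -n gpu-operator --type merge \
  -p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config", "default": "any"}}}}'
```

After the device plugin restarts, each node should advertise four times as many `nvidia.com/gpu` resources as it has physical GPUs.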

Multi-Instance GPU (MIG)

Multi-Instance GPU (MIG) works like a multi-lane highway, where each lane is dedicated to a different car, allowing multiple cars to travel simultaneously without interfering with each other. Each lane is isolated, so a car in one lane doesn’t affect cars in the others. This setup is especially useful when you have different tasks that need to run at the same time.

Example: If Cloud Adventures’ image recognition app needs to process multiple image streams at once or run separate parts of its model simultaneously, MIG can split the GPU into several isolated "lanes." Each instance (or lane) can independently handle a different task, allowing multiple processes to happen at the same time without interference, even on a single physical GPU.

Benefits:

  • Resource Optimization: Each task gets its own lane, using only the power needed without blocking others, maximizing GPU utilization.
  • Cost Efficiency: By using a single GPU split into multiple lanes, the solution achieves high performance and reduces the need for additional GPUs, keeping Azure costs in check.
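On AKS, MIG is configured per node pool when the pool is created. The command below is a sketch, assuming an existing cluster; the resource group, cluster, and pool names are placeholders, and the `MIG1g` profile splits each A100 into seven small, isolated instances.

```shell
# Sketch: add an NDasrA100_v4 node pool with each A100 partitioned into
# 1g MIG instances. Resource group, cluster, and pool names are placeholders.
az aks nodepool add \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name migpool \
  --node-count 1 \
  --node-vm-size Standard_ND96asr_v4 \
  --gpu-instance-profile MIG1g
```

Other profiles (such as `MIG2g` or `MIG3g`) trade instance count for instance size; the profile is fixed for the lifetime of the node pool.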

Decision making

Do we need to decide between GPU Time Slicing and Multi-Instance GPU configurations? No, we can implement both!

By applying GPU Time Slicing (letting tasks take turns on a single lane) and MIG (giving each task its own lane on a multi-lane highway), we can create a highly efficient and adaptable GPU setup that accommodates changing demand. This solution minimizes resource waste and infrastructure costs on Azure, allowing Cloud Adventures to meet the client's performance needs while staying budget-friendly.

How it works

To implement Time Slicing and MIG, we need to ensure our workloads can access and leverage the NVIDIA hardware provided by Azure. We can use the CUDA toolkit to develop our application and provision the hardware on Azure. But then we are faced with a challenge: our containerized application needs to interact with that hardware. This is where the NVIDIA GPU Operator and the NVIDIA device plugin for Kubernetes come into play.

The GPU Operator automates the deployment of NVIDIA drivers, the CUDA toolkit, and other necessary software components on all GPU nodes in the Kubernetes environment. This enables applications to access GPU resources without complex, manual setup. With the GPU Operator, we can manage GPU resources, optimize utilization, and dynamically scale workloads based on demand, keeping infrastructure costs in check and improving system reliability. The operator streamlines the GPU integration process and gives us the ability to configure both Time Slicing and Multi-Instance GPUs.
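Installing the operator itself is typically done with Helm. The commands below are a sketch based on NVIDIA's published chart; on AKS, the GPU node image may already ship the NVIDIA driver, in which case the operator's own driver deployment can be disabled.

```shell
# Sketch: install the NVIDIA GPU Operator from its official Helm repository.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# driver.enabled=false assumes the AKS GPU node image already provides the
# NVIDIA driver; drop the flag if the operator should manage drivers too.
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --set driver.enabled=false
```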

The configuration

We don't need to reiterate the installation instructions; they are documented very well here:

There are a couple of things to remember:

  • Regional availability: To divide a GPU into smaller GPUs with MIG, we need a GPU that supports it. Azure provides NVIDIA's A100 GPUs through the NDasrA100_v4 series. These virtual machines are only available in specific regions, generally the more popular ones such as East US and West Europe. These regions also sometimes experience capacity constraints. Make sure to allocate your resources in time, and when requesting capacity, include a business justification to increase your chances of success.
  • Time Slicing is essentially overprovisioning: telling the system you have more GPU time (allocatable) than you really have. As long as you plan for this and accept a decrease in performance when things get busy, it should be fine.
  • With Node Auto Provisioning, new developments in KAITO, and other scaling capabilities in AKS, there may be simpler and more cost-efficient alternatives in the near future.
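To illustrate the overprovisioning point above: with four-way time slicing on a node holding one physical GPU, a deployment like the sketch below can schedule four replicas that each request one `nvidia.com/gpu`, even though only one physical GPU exists. The names and the container image are placeholders.

```shell
# Sketch: four replicas each request one nvidia.com/gpu. On a node whose
# single physical GPU is shared 4 ways, all four replicas can be scheduled.
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cuda-workload
spec:
  replicas: 4
  selector:
    matchLabels:
      app: cuda-workload
  template:
    metadata:
      labels:
        app: cuda-workload
    spec:
      containers:
        - name: cuda
          image: nvidia/cuda:12.4.1-base-ubuntu22.04
          command: ["sleep", "infinity"]
          resources:
            limits:
              nvidia.com/gpu: 1
EOF
```

If the replicas all become busy at once, they contend for the same physical GPU, which is exactly the performance trade-off to plan for.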

Closing thoughts

Time Slicing and Multi-Instance GPU definitely have a place in running GPU workloads, and solutions such as Azure Kubernetes Service make these configurations and use cases available to us. I do believe there are alternatives out there, and with the speed at which Microsoft is developing the AI-related portfolio on Azure, there are bound to be alternatives that fit most use cases. Configurations such as those described in this post may be suitable for a niche of use cases, but they are usually the foundation of something bigger. Without the NVIDIA GPU Operator and the AKS-managed components around solutions such as the NVIDIA device plugin, this would be a lot harder to configure and manage!