Write-Up: How to Run Two or More Pods on One GPU in a GKE Cluster


This is useful when you need to run small models without reserving an entire GPU per workload, especially for beginners who are experimenting with their solutions.

To launch a k8s cluster in GCP, it's very convenient to use public modules available at: https://registry.terraform.io/modules/terraform-google-modules/kubernetes-engine/google/latest/submodules/beta-private-cluster-update-variant

Depending on your requirements, you can launch a cluster with public or private nodes, and add a node pool with GPUs via the module's node_pools variable. For example:

node_pools = [
  {
    name              = "gpu"
    machine_type      = "g2-standard-8"
    node_locations    = "us-central1-a"
    node_count        = 1 # fixed size, since autoscaling is disabled below
    total_min_count   = 0 # only applies when autoscaling = true
    total_max_count   = 0 # only applies when autoscaling = true
    auto_upgrade      = true
    autoscaling       = false
    spot              = true
    disk_type         = "pd-ssd"
    accelerator_count = 1
    accelerator_type  = "nvidia-l4"
    enable_gcfs       = true
  },
]

However, with this configuration you still won't be able to run multiple GPU pods per node, because of how GPU resources are allocated. By default, when you add a GPU node pool, each node advertises:

nvidia.com/gpu: "1"

This means that the whole GPU is allocated to a single pod: each node exposes one nvidia.com/gpu unit, so only one GPU-requesting pod can be scheduled per node in our pool.
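For reference, this is roughly what the relevant excerpt of the node object looks like for such a node (a sketch: values are illustrative, and the real output of kubectl get node <node-name> -o yaml contains many more fields):

status:
  allocatable:
    nvidia.com/gpu: "1" # the entire GPU is a single schedulable unit
    # cpu, memory, and other resources omitted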

But what if we need to run multiple pods? One option is to use GPU time-sharing: https://cloud.google.com/kubernetes-engine/docs/concepts/timesharing-gpus

Let's describe the node pool and add it to the cluster.

Create a file named gpu-time-sharing.tf and describe all the necessary inputs. Here is an example of what it looks like:

resource "google_container_node_pool" "gpu_time_sharing" {
 name     = "gpu-time-sharing"
 cluster  = "cluster-id"
 project  = "my-project-id"
 location = "us-central1"
 version  = "1.27.3-gke.100"

 node_locations = ["us-central1-b"]

 autoscaling {
   location_policy      = "ANY"
   total_max_node_count = "0"
   total_min_node_count = "1"
}

 management {
   auto_upgrade = true
   auto_repair  = true
 }

 network_config {
   create_pod_range     = false
   enable_private_nodes = true
   pod_range            = "kubernetes-pods-subnet-name"
 }

 node_config {
   image_type      = "COS_CONTAINERD"
   machine_type    = "g2-standard-8"
   service_account = "cluster_service_account"
   spot            = true
   local_ssd_count = 0
   disk_size_gb    = 200
   disk_type       = "pd-ssd"
   logging_variant = null

   oauth_scopes = [
      "https://www.googleapis.com/auth/cloud-platform"
    ]

   gcfs_config {
     enabled = true   
 }

   gvnic {
     enabled = true
 }

   guest_accelerator {
     type  = "nvidia-l4"
     count = 1
     gpu_driver_installation_config {
       gpu_driver_version = "LATEST"
     }
     gpu_sharing_config {
       gpu_sharing_strategy       = "TIME_SHARING"
       max_shared_clients_per_gpu = 4
     }
   }

   workload_metadata_config {
     mode = "GCE_METADATA"
  }
   labels = {
     cluster_name = local.name
     purpose      = "gpu-time-sharing"
     node_pool    = "gpu-time-sharing"
  }

   taint = [
   {
     key    = "purpose"
     value  = "gpu-time-sharing"
     effect = "NO_SCHEDULE"
   },
   {
     effect = "NO_SCHEDULE"
     key    = "nvidia.com/gpu"
     value  = "present"
   },
  ]

   tags = ["gke-my-project-id-region",
"gke-my-project-id-region-gpu-time-sharing"]
  }

 timeouts {
   create = "30m"
   update = "20m"
 }
}

Note: Make sure to adjust these values according to your requirements.

Let's focus on the most critical part:

guest_accelerator {
  type  = "nvidia-l4"
  count = 1
  gpu_driver_installation_config {
    gpu_driver_version = "LATEST"
  }
  gpu_sharing_config {
    gpu_sharing_strategy       = "TIME_SHARING"
    max_shared_clients_per_gpu = 4
  }
}

Here we specify the accelerator type, nvidia-l4, and which NVIDIA driver GKE should install on the nodes; in our case LATEST, but you can instead use DEFAULT (the driver validated for the node's GKE version) if you require a different one.

gpu_sharing_config {
  gpu_sharing_strategy       = "TIME_SHARING"
  max_shared_clients_per_gpu = 4
}

And this block enables time-sharing with up to four clients per GPU. Note that time-sharing does not physically split the GPU into partitions (that would be multi-instance GPU); instead, the single GPU context-switches between up to four containers.

After applying, we obtain a node pool in which every node advertises nvidia.com/gpu: 4, giving us the ability to schedule up to four GPU pods per physical GPU.
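You can verify this by inspecting a node from the new pool, for example with kubectl get node <node-name> -o yaml. The excerpt below is a sketch of what to expect (the label names follow the time-sharing documentation linked above; the real object contains many more fields):

metadata:
  labels:
    cloud.google.com/gpu-sharing-strategy: time-sharing
    cloud.google.com/max-shared-clients-per-gpu: "4"
status:
  allocatable:
    nvidia.com/gpu: "4" # one physical GPU exposed as four schedulable units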

To launch our pod, we need to specify the resources, node selector, and tolerations for it:

resources:
  limits:
    cpu: 2
    memory: 8Gi
    nvidia.com/gpu: "1"
  requests:
    cpu: 2
    memory: 8Gi
    nvidia.com/gpu: "1"

nodeSelector:
  purpose: gpu-time-sharing

tolerations:
- effect: NoSchedule
  key: purpose
  operator: Equal
  value: gpu-time-sharing
- effect: NoSchedule
  key: nvidia.com/gpu
  operator: Equal
  value: present
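
Putting it all together, here is a minimal sketch of a complete pod manifest. The pod name and container image are hypothetical placeholders; everything else mirrors the fragments above. Note that each pod still requests nvidia.com/gpu: "1", which from the scheduler's point of view consumes one of the four time-shared units:

apiVersion: v1
kind: Pod
metadata:
  name: small-model # hypothetical name
spec:
  nodeSelector:
    purpose: gpu-time-sharing
  tolerations:
  - effect: NoSchedule
    key: purpose
    operator: Equal
    value: gpu-time-sharing
  - effect: NoSchedule
    key: nvidia.com/gpu
    operator: Equal
    value: present
  containers:
  - name: model
    image: registry.example.com/small-model:latest # hypothetical image
    resources:
      limits:
        cpu: 2
        memory: 8Gi
        nvidia.com/gpu: "1"
      requests:
        cpu: 2
        memory: 8Gi
        nvidia.com/gpu: "1"

Applying this manifest four times with different names would land all four pods on the same node (capacity permitting), each sharing the single L4 GPU.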

In conclusion, this setup allows efficient utilization of resources within the GPU node pool, and the approach caters to the needs of individuals experimenting with various solutions while minimizing resource allocation.