Introduction

AWS Spot Instances offer an excellent opportunity to reduce costs in your cloud computing infrastructure, but they come with a risk: AWS can reclaim them with very little warning. Gergely Orosz recently tweeted about high-profile outages caused by teams being unprepared for exactly this. In response, I shared my own solution: a combination of backup node pools and taints that maintains high availability in this scenario.

In this blog post, we'll discuss how to create a resilient infrastructure by utilizing AWS Spot Instances strategically along with backup node pools and taints, ensuring cost efficiency and availability.

Section 1: How to Create a Backup Node Pool

Having a backup node pool that does not run on Spot Instances is critical for maintaining high availability. In this section, we'll look at setting up a backup node pool and configuring it to scale and accept migrating pods.

  1. First, create the backup node pool. kubectl cannot create node pools by itself; on AWS this is done through EKS, for example with eksctl (cluster name and sizes here are placeholders):
eksctl create nodegroup --cluster my-cluster --name backup-node-pool --nodes-min 0 --nodes-max 10 --managed
  2. Configure the node pool to auto-scale by running the Kubernetes Cluster Autoscaler in the cluster, for example via its Helm chart:
helm install cluster-autoscaler autoscaler/cluster-autoscaler --set autoDiscovery.clusterName=my-cluster
  3. Set up the node pool to accept migrating pods by labeling its nodes (pod affinities will match this label later):
kubectl label nodes -l eks.amazonaws.com/nodegroup=backup-node-pool backup=ready
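The same backup pool can also be declared declaratively. Here is a minimal sketch of an eksctl ClusterConfig fragment, assuming a cluster named my-cluster in us-east-1 (both placeholders), with the label baked in so new nodes arrive pre-labeled:

```yaml
# Hypothetical eksctl config: an on-demand backup node group that can
# scale from zero and carries the backup=ready label out of the box.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster      # placeholder cluster name
  region: us-east-1     # placeholder region
managedNodeGroups:
- name: backup-node-pool
  instanceType: m5.large
  minSize: 0
  maxSize: 10
  desiredCapacity: 0
  labels:
    backup: ready
```

A config file like this is applied with eksctl create nodegroup -f <file>, which removes the manual labeling step entirely.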

Now, with the backup node pool set up, we can discuss using taints and affinities to manage pods on Spot Instances.

Section 2: Using Taints and Affinities for Pod Management on Spot Instances

Taints and node affinities let you direct pods toward Spot Instances while keeping other workloads off them: the taint keeps non-tolerating pods away from the Spot pool, while the affinity makes tolerating pods prefer it. Together they increase cost efficiency while maintaining stability.

  1. Taint the Spot Instance node pool so that only pods which tolerate the taint are scheduled there, and label the same nodes so affinities can match them (affinities match labels, not taints):
kubectl taint nodes -l eks.amazonaws.com/nodegroup=spot-node-pool spot=ready:NoSchedule
kubectl label nodes -l eks.amazonaws.com/nodegroup=spot-node-pool spot=ready
  2. Configure your pods' affinities to prefer running on Spot Instances (the pods also need a toleration for the spot=ready:NoSchedule taint, or they can never land on the Spot pool at all):
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 1
      preference:
        matchExpressions:
        - key: spot
          operator: In
          values:
          - ready

With the taint, label, and affinities in place, you can run pods that prefer Spot Instances without risking high availability: when Spot capacity disappears, the scheduler simply places those pods on the untainted backup pool instead.
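Because the Spot pool carries a NoSchedule taint, the affinity alone is not enough: a pod without a matching toleration can never be scheduled onto those nodes. A minimal sketch of a pod spec fragment that tolerates the taint, strongly prefers Spot, and weakly falls back to the backup pool (this assumes the Spot nodes also carry a spot=ready label, since affinities match labels rather than taints, and reuses the backup=ready label from Section 1):

```yaml
# Hypothetical pod spec fragment: tolerate the Spot taint, prefer Spot
# nodes heavily, and fall back to the labeled backup pool.
tolerations:
- key: spot
  operator: Equal
  value: ready
  effect: NoSchedule
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100           # strong preference for Spot nodes
      preference:
        matchExpressions:
        - key: spot
          operator: In
          values: [ready]
    - weight: 1             # weak fallback preference for the backup pool
      preference:
        matchExpressions:
        - key: backup
          operator: In
          values: [ready]
```

The weight gap (100 vs. 1) is what keeps pods on cheap Spot capacity whenever it exists, while still letting the scheduler use the backup pool the moment it doesn't.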

Section 3: Handling Long-Running Jobs with Drain Scripts

Long-running jobs may be more vulnerable to Spot Instance reclaims, but you can use drain scripts to reassign these tasks to on-demand nodes when required.

  1. Create a drain script that moves workloads from the Spot node pool to the on-demand node pool (drained nodes are deliberately left cordoned, so evicted pods cannot land back on Spot):
#!/bin/bash
# Drain every node in the Spot node pool. kubectl drain cordons each node
# first, so the evicted pods reschedule onto the on-demand/backup pool.
NODES=$(kubectl get nodes -l eks.amazonaws.com/nodegroup=spot-node-pool -o jsonpath='{.items[*].metadata.name}')
for NODE in $NODES; do
    kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data --force
done
  2. Execute the drain script when required, such as when a Spot termination notice arrives.
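The second step above can be automated: AWS announces an impending reclaim roughly two minutes ahead through the instance metadata service. Below is a minimal sketch of a poller; the HTTP status code is injected as a parameter so the decision logic is testable offline, and drain-spot-nodes.sh is a hypothetical name for the drain script from the first step:

```shell
#!/bin/bash
# Sketch: poll the EC2 instance metadata service for a Spot interruption
# notice (would run on each Spot node, e.g. as a DaemonSet or systemd timer).
# The endpoint answers 200 with an action payload only once AWS has scheduled
# the instance for interruption; before that it returns 404.
METADATA_URL="http://169.254.169.254/latest/meta-data/spot/instance-action"

should_drain() {
  # Decide from the metadata HTTP status whether to trigger the drain.
  local http_code="$1"
  [ "$http_code" = "200" ]
}

poll_once() {
  # In a real deployment the status comes from the metadata service:
  #   curl -s -o /dev/null -w '%{http_code}' "$METADATA_URL"
  local http_code="$1"   # injected here so the logic runs without AWS
  if should_drain "$http_code"; then
    echo "termination notice received - draining"
    # ./drain-spot-nodes.sh   # hypothetical: the drain script from step 1
  else
    echo "no notice"
  fi
}

poll_once 404
poll_once 200
```

Looping this with a short sleep gives each Spot node a head start on evacuating pods before AWS pulls the instance.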

This drain script lets you maintain high availability by quickly moving long-running jobs to on-demand nodes when a Spot reclaim is imminent.

Conclusion

Managing AWS Spot Instances for cost savings requires a mix of strategies, including backup node pools, taints, and drain scripts. We've walked you through setting up a backup node pool, applying taints and affinities, and handling long-running jobs with drain scripts. By implementing these steps, you can maximize cost efficiency without compromising reliability in your AWS infrastructure. Give it a try – you won't be disappointed!

Stephen Lizcano


© Copyright 2025 StarOps.
