Upgrading EKS to Kubernetes 1.32: Karpenter Compatibility, Pitfalls & Zero-Downtime Best Practices

Upgrading an AWS EKS cluster from Kubernetes 1.30 to 1.32 while maintaining Karpenter compatibility requires careful planning and execution. In this comprehensive guide, I'll walk you through the complete upgrade process based on real-world experience, highlighting critical pitfalls and best practices for achieving zero-downtime deployments.

Overview

This upgrade journey involves multiple components:

  • EKS cluster upgrade (1.30 → 1.31 → 1.32)
  • Karpenter upgrade (v0.37.6 → v1.0.9 → v1.2.3)
  • Node group management and application migration
  • EBS CSI driver configuration

Total Duration: Approximately 3–4 hours including validation and testing.

Pre-Upgrade Checklist

Before starting the upgrade process, ensure you have:

  • Administrative access to your AWS account
  • kubectl, helm, and eksctl properly configured
  • Backup of critical workloads and configurations
  • Maintenance window scheduled
  • ArgoCD or deployment tool access for application management
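A small preflight script can confirm the checklist before the maintenance window opens. The cluster name and region below are placeholders; this is a minimal sketch, not part of the upgrade itself:

```shell
#!/usr/bin/env bash
# Preflight: verify required CLIs are present and the cluster is reachable.
set -euo pipefail

for tool in kubectl helm eksctl aws jq; do
  command -v "$tool" >/dev/null || { echo "missing: $tool"; exit 1; }
done

CLUSTER_NAME="<your-cluster>"   # placeholder
AWS_REGION="<your-region>"      # placeholder

# Current control-plane version -- should report 1.30 before you start
aws eks describe-cluster --name "$CLUSTER_NAME" --region "$AWS_REGION" \
  --query 'cluster.version' --output text

# Confirm kubectl points at the same cluster and all nodes are Ready
kubectl get nodes -o wide
```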

Step-by-Step Upgrade Process

1. Upgrade Base Node Group (Duration: ~15 minutes)

Start by upgrading your existing base node group through the AWS Console:

  • Navigate to EKS → Clusters → Your Cluster → Compute → Node Groups
  • Select your base node group
  • Click "Update" and select the latest AMI version
  • Monitor the rolling update process

Best Practice: Refresh node groups to the latest AMI for their current Kubernetes version before touching the control plane. Keep the version-skew rule in mind: kubelet can never be newer than the API server, so node groups only move to the new minor version after the control-plane upgrade (step 4).
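If you prefer the CLI to the console, the same AMI refresh can be triggered with update-nodegroup-version; omitting --kubernetes-version rolls the group to the latest AMI release for its current version. The node group name is a placeholder:

```shell
# Roll the base node group to the latest AMI for its current Kubernetes version
aws eks update-nodegroup-version \
  --cluster-name <your-cluster> \
  --nodegroup-name <base-nodegroup>

# Block until the rolling update finishes
aws eks wait nodegroup-active \
  --cluster-name <your-cluster> \
  --nodegroup-name <base-nodegroup>
```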

2. Upgrade EKS Add-ons

Update all EKS add-ons to their latest compatible versions:

  • CoreDNS
  • kube-proxy
  • VPC CNI
  • AWS Load Balancer Controller (if installed)

Navigate to EKS → Clusters → Your Cluster → Add-ons and update each component individually.
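The add-on updates can also be scripted. describe-addon-versions reports which builds are compatible with a given cluster version; CoreDNS is shown as the example, and the version value is a placeholder:

```shell
# List add-on versions compatible with the target Kubernetes version
aws eks describe-addon-versions \
  --addon-name coredns \
  --kubernetes-version 1.31 \
  --query 'addons[0].addonVersions[*].addonVersion' --output text

# Update one add-on (repeat for kube-proxy, vpc-cni, ...)
aws eks update-addon \
  --cluster-name <your-cluster> \
  --addon-name coredns \
  --addon-version <compatible-version> \
  --resolve-conflicts OVERWRITE
```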

3. Upgrade EKS Cluster to v1.31 (Duration: ~15 minutes)

Perform the first cluster upgrade via AWS Console:

  • Navigate to EKS → Clusters → Your Cluster
  • Click "Update cluster version"
  • Select Kubernetes version 1.31
  • Monitor the upgrade progress

Critical Note: EKS upgrades can only increment by one minor version at a time.
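The console steps above have a direct CLI equivalent, which is convenient for scripting the two sequential upgrades:

```shell
# Kick off the control-plane upgrade (one minor version at a time)
aws eks update-cluster-version \
  --name <your-cluster> \
  --kubernetes-version 1.31

# Block until the cluster returns to ACTIVE, then confirm the version
aws eks wait cluster-active --name <your-cluster>
aws eks describe-cluster --name <your-cluster> \
  --query 'cluster.version' --output text
```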

4. Re-upgrade Base Node Group (Duration: ~15 minutes)

After the control plane upgrade, update the node group again to match the cluster version:

  • Return to the Node Groups section
  • Update the base node group to use 1.31-compatible AMIs
  • Wait for the rolling update to complete

5. Create Temporary Node Group

Create a new managed node group to serve as a temporary landing zone during Karpenter upgrades:

# Create a temporary node group with specs similar to your Karpenter nodes.
# Note: --instance-types and --subnets take space-separated lists (not commas),
# and the taint matches the toleration applied to ArgoCD in step 6.
aws eks create-nodegroup \
  --cluster-name <your-cluster> \
  --nodegroup-name temp-upgrade-nodes \
  --instance-types m5.large m5.xlarge \
  --node-role <your-node-role-arn> \
  --subnets <subnet-id-1> <subnet-id-2> \
  --taints key=temporary-upgrade,value=true,effect=NO_SCHEDULE \
  --scaling-config minSize=1,maxSize=10,desiredSize=3

6. Migrate ArgoCD or Deployment to Temporary Nodes

Temporarily move ArgoCD workloads to the new node group:

# Update ArgoCD deployment
spec:
  template:
    spec:
      nodeSelector:
        eks.amazonaws.com/nodegroup: "temp-upgrade-nodes"
      tolerations:
        - key: "temporary-upgrade"
          value: "true"
          effect: "NoSchedule"

Important: Disable auto-sync in ArgoCD during this process to prevent configuration conflicts.
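Auto-sync can be toggled per Application by removing the automated sync policy with a JSON merge patch. This assumes ArgoCD runs in the argocd namespace, and <app-name> is a placeholder:

```shell
# Disable auto-sync for one Application (re-enable after the upgrade)
kubectl patch application <app-name> -n argocd --type merge \
  -p '{"spec":{"syncPolicy":{"automated":null}}}'

# Verify: syncPolicy should no longer contain an "automated" block
kubectl get application <app-name> -n argocd \
  -o jsonpath='{.spec.syncPolicy}'
```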

7. Upgrade Karpenter: v0.37.6 → v1.0.9

This is the most critical step. Karpenter v1.0 introduces breaking changes requiring careful migration.

Environment Setup

export AWS_PARTITION="aws" 
export CLUSTER_NAME="<your-cluster-name>"
export AWS_REGION="<your-region>"
export AWS_ACCOUNT_ID="$(aws sts get-caller-identity --query Account --output text)"
export KARPENTER_NAMESPACE="karpenter"
export KARPENTER_IAM_ROLE_ARN="arn:${AWS_PARTITION}:iam::${AWS_ACCOUNT_ID}:role/${CLUSTER_NAME}-karpenter"
export KARPENTER_VERSION="1.0.9"

Update IAM Policies

Create and attach the new Karpenter v1 IAM policy:

# Create temporary v1 policy
POLICY_DOCUMENT=$(mktemp)
curl -fsSL https://raw.githubusercontent.com/aws/karpenter-provider-aws/13d6fc014ea59019b1c3b1953184efc41809df11/website/content/en/v1.0/upgrading/get-controller-policy.sh | sh | envsubst > ${POLICY_DOCUMENT}

POLICY_NAME="KarpenterControllerPolicy-${CLUSTER_NAME}-v1"
ROLE_NAME="${CLUSTER_NAME}-karpenter"
POLICY_ARN="$(aws iam create-policy --policy-name "${POLICY_NAME}" --policy-document "file://${POLICY_DOCUMENT}" | jq -r .Policy.Arn)"
aws iam attach-role-policy --role-name "${ROLE_NAME}" --policy-arn "${POLICY_ARN}"

Install v1 CRDs

helm upgrade --install karpenter-crd oci://public.ecr.aws/karpenter/karpenter-crd \
  --version "${KARPENTER_VERSION}" \
  --namespace "${KARPENTER_NAMESPACE}" \
  --create-namespace \
  --set webhook.enabled=true \
  --set webhook.serviceName="karpenter" \
  --set webhook.port=8443

Upgrade Karpenter Controller

helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --version ${KARPENTER_VERSION} \
  --namespace "${KARPENTER_NAMESPACE}" \
  --create-namespace \
  --set serviceAccount.annotations."eks\.amazonaws\.com/role-arn"=${KARPENTER_IAM_ROLE_ARN} \
  --set settings.clusterName=${CLUSTER_NAME} \
  --set controller.resources.requests.cpu=1 \
  --set controller.resources.requests.memory=1Gi \
  --set controller.resources.limits.cpu=1 \
  --set controller.resources.limits.memory=1Gi \
  --wait
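Before moving on, confirm the controller actually rolled out at the expected version and is reconciling cleanly:

```shell
# The controller pods should be Running with the new chart version
kubectl get pods -n "${KARPENTER_NAMESPACE}" \
  -l app.kubernetes.io/name=karpenter -o wide
helm list -n "${KARPENTER_NAMESPACE}"

# Tail the logs and watch for reconcile or webhook errors
kubectl logs -n "${KARPENTER_NAMESPACE}" \
  -l app.kubernetes.io/name=karpenter --tail=50
```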

8. Migrate to New Karpenter v1 API

Recreate your node provisioning resources against the v1 APIs. Karpenter v0.37 used the v1beta1 NodePool and EC2NodeClass kinds; v1.0 promotes both to karpenter.sh/v1 and karpenter.k8s.aws/v1 (the conversion webhook installed above can migrate existing objects in place, but defining them explicitly avoids surprises):

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default-nodepool
spec:
  template:
    spec:
      taints:
        - key: karpenter-managed
          value: "true"
          effect: NoSchedule
      requirements:
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["us-west-2a", "us-west-2b", "us-west-2c"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand", "spot"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "t", "m", "r"]
        - key: karpenter.k8s.aws/instance-cpu
          operator: In
          values: ["2", "4", "8", "16", "32"]
        - key: karpenter.k8s.aws/instance-memory
          operator: In
          values: ["8192", "10752", "16384", "32768"]
      nodeClassRef:
        name: default-ec2nodeclass
        kind: EC2NodeClass
        group: karpenter.k8s.aws
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default-ec2nodeclass
spec:
  amiFamily: AL2023
  amiSelectorTerms:
  - id: "ami-06cc3e2ef40b89309"
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "<your-cluster-name>"
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "<your-cluster-name>"
  role: "KarpenterNodeRole-<your-cluster>"
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 100
        volumeType: gp3
        iops: 3000
        deleteOnTermination: true
  metadataOptions:
    httpEndpoint: enabled
  tags:
    Name: "Karpenter-managed-node"
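After saving the manifests above (karpenter-v1.yaml is a placeholder filename), apply them and confirm both resources are accepted by the v1 API and report Ready:

```shell
kubectl apply -f karpenter-v1.yaml

# Both resources should list without errors
kubectl get nodepools.karpenter.sh
kubectl get ec2nodeclasses.karpenter.k8s.aws

# The EC2NodeClass status conditions should show Ready=True
kubectl get ec2nodeclass default-ec2nodeclass \
  -o jsonpath='{.status.conditions}'
```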

9. Update CloudFormation Stack and Clean Up

TEMPOUT=$(mktemp)
curl -fsSL https://raw.githubusercontent.com/aws/karpenter-provider-aws/v"${KARPENTER_VERSION}"/website/content/en/preview/getting-started/getting-started-with-karpenter/cloudformation.yaml > "${TEMPOUT}"
aws cloudformation deploy \
  --stack-name "Karpenter-${CLUSTER_NAME}" \
  --template-file "${TEMPOUT}" \
  --capabilities CAPABILITY_NAMED_IAM \
  --parameter-overrides "ClusterName=${CLUSTER_NAME}"

# Remove temporary IAM policy
ROLE_NAME="${CLUSTER_NAME}-karpenter"
POLICY_NAME="KarpenterControllerPolicy-${CLUSTER_NAME}-v1"
POLICY_ARN=$(aws iam list-policies --query "Policies[?PolicyName=='${POLICY_NAME}'].Arn" --output text)
aws iam detach-role-policy --role-name "${ROLE_NAME}" --policy-arn "${POLICY_ARN}"
aws iam delete-policy --policy-arn "${POLICY_ARN}"

10. Test Karpenter Functionality

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-test
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx-test
  template:
    metadata:
      labels:
        app: nginx-test
    spec:
      tolerations:
        - key: karpenter-managed
          value: "true"
          effect: NoSchedule
      containers:
      - name: nginx
        image: nginx:latest
        resources:
          requests:
            cpu: 1
            memory: 1Gi
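Apply the test deployment (nginx-test.yaml is a placeholder filename) and watch Karpenter respond. Because the requests exceed spare capacity on tainted nodes, Karpenter should create NodeClaims and launch instances:

```shell
kubectl apply -f nginx-test.yaml

# Karpenter should create NodeClaims to satisfy the pending pods
kubectl get nodeclaims -w

# Once Running, confirm the pods landed on Karpenter-managed nodes
kubectl get pods -l app=nginx-test \
  -o custom-columns=POD:.metadata.name,NODE:.spec.nodeName

# Clean up the test when done
kubectl delete deployment nginx-test
```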

11. Upgrade EKS Cluster to v1.32 (Duration: ~15 minutes)

Perform the second cluster upgrade via AWS Console:
Navigate to EKS → Clusters → Your Cluster → Update to Kubernetes version 1.32 → Monitor the upgrade process.

12. Final Node Group Upgrade (Duration: ~15 minutes)

Update the base node group to match the 1.32 cluster version.

13. Upgrade Karpenter to v1.2.3

export KARPENTER_VERSION="1.2.3"

# Upgrade CRDs
helm upgrade --install karpenter-crd oci://public.ecr.aws/karpenter/karpenter-crd \
  --version "${KARPENTER_VERSION}" \
  --namespace "${KARPENTER_NAMESPACE}" \
  --create-namespace \
  --set webhook.enabled=true \
  --set webhook.serviceName="karpenter" \
  --set webhook.port=8443

# Upgrade Karpenter
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --version ${KARPENTER_VERSION} \
  --namespace "${KARPENTER_NAMESPACE}" \
  --create-namespace \
  --set serviceAccount.annotations."eks\.amazonaws\.com/role-arn"=${KARPENTER_IAM_ROLE_ARN} \
  --set settings.clusterName=${CLUSTER_NAME} \
  --set controller.resources.requests.cpu=1 \
  --set controller.resources.requests.memory=1Gi \
  --set controller.resources.limits.cpu=1 \
  --set controller.resources.limits.memory=1Gi \
  --wait

14. Migrate Applications Back to Karpenter

Move your applications from temporary nodes back to Karpenter-managed nodes:

  • Update ArgoCD configurations to use Karpenter node selectors
  • Migrate Jenkins to on-demand Karpenter nodes
  • Clean up outdated nodeSelector and taint references in deployments

Common Issue: Some workloads may fail due to outdated node selectors referencing the temporary upgrade nodes. Clean these up manually:

kubectl patch deployment <deployment-name> -p '{"spec":{"template":{"spec":{"nodeSelector":null}}}}'

15. Clean Up Temporary Resources

aws eks delete-nodegroup \
  --cluster-name <your-cluster> \
  --nodegroup-name temp-upgrade-nodes
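Before deleting, it is worth confirming nothing but DaemonSet pods still runs on the temporary nodes; the wait subcommand then blocks until AWS finishes tearing the group down. The label selector is the standard managed-node-group label:

```shell
# Verify no application pods remain on the temporary nodes
for node in $(kubectl get nodes \
    -l eks.amazonaws.com/nodegroup=temp-upgrade-nodes -o name); do
  kubectl get pods --all-namespaces \
    --field-selector spec.nodeName=${node#node/}
done

# Block until deletion completes
aws eks wait nodegroup-deleted \
  --cluster-name <your-cluster> \
  --nodegroup-name temp-upgrade-nodes
```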

16. Final Validation

  • Test your CI/CD pipelines
  • Verify all services are running on Karpenter-managed nodes
  • Check application functionality and performance
  • Monitor logs for any errors or warnings
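The checklist above can be partially automated; these spot checks look for stale references to the deleted node group and for recent warning events:

```shell
# All nodes should now report a 1.32 kubelet
kubectl get nodes -o wide

# No workloads should still target the deleted temporary node group
kubectl get pods --all-namespaces \
  -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.spec.nodeSelector}{"\n"}{end}' \
  | grep temp-upgrade-nodes || echo "no stale node selectors"

# Scan recent events for scheduling or volume errors
kubectl get events --all-namespaces --field-selector type=Warning
```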

Critical Pitfall: Jenkins Volume Attachment Issue

During our upgrade, Jenkins encountered a persistent volume attachment error:

Error: jenkins-748597fb85-nxtrm.1841c67a5fceb9b9
FailedAttachVolume
AttachVolume.Attach failed for volume "pvc-828eb-4246-9a69-498401ed6a2e": 
volume attachment is being deleted

Resolution: EBS CSI Driver Service Account

  1. Check existing service accounts:

    eksctl get iamserviceaccount --cluster <your-cluster> --region <your-region>

  2. Create an IAM role for the EBS CSI driver:

    eksctl create iamserviceaccount \
      --name ebs-csi-controller-sa \
      --namespace kube-system \
      --cluster <your-cluster> \
      --region <your-region> \
      --attach-policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy \
      --override-existing-serviceaccounts \
      --approve \
      --role-only \
      --role-name AmazonEKS_EBS_CSI_DriverRole

  3. Annotate the service account with the new role (--overwrite replaces any stale annotation):

    export SERVICE_ACCOUNT_ROLE_ARN=$(aws iam get-role --role-name AmazonEKS_EBS_CSI_DriverRole | jq -r '.Role.Arn')

    kubectl annotate serviceaccount ebs-csi-controller-sa \
      -n kube-system \
      --overwrite \
      eks.amazonaws.com/role-arn=${SERVICE_ACCOUNT_ROLE_ARN}

  4. Restart the EBS CSI controller so it picks up the new credentials:

    kubectl rollout restart deployment ebs-csi-controller -n kube-system
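To confirm the fix, check the annotation and exercise a volume. The throwaway PVC below assumes a default StorageClass exists; with the usual WaitForFirstConsumer binding mode it stays Pending until a pod consumes it, which is still enough to surface permission errors in the controller logs:

```shell
# The role ARN should now appear on the service account
kubectl get serviceaccount ebs-csi-controller-sa -n kube-system \
  -o jsonpath='{.metadata.annotations.eks\.amazonaws\.com/role-arn}'

# Exercise the driver with a throwaway PVC
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ebs-csi-test
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 1Gi
EOF

kubectl get pvc ebs-csi-test
kubectl delete pvc ebs-csi-test
```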

Best Practices

  1. Staged Approach
    Always upgrade incrementally (1.30 → 1.31 → 1.32).
    Test each stage thoroughly before proceeding.
  2. Temporary Infrastructure
    Create temporary node groups for critical workload migration.
    This provides a safety net during Karpenter upgrades.
  3. Application Management
    Disable ArgoCD auto-sync during upgrades.
    Clean up legacy node selectors and taints post-migration.
  4. Volume Management
    Ensure EBS CSI driver has proper IAM permissions.
    Test persistent volume functionality after upgrades.
  5. Monitoring and Validation
    Monitor each step closely.
    Validate functionality at each stage.
    Keep rollback plans ready.

Common Issues

  • CronJob Conflicts: CronJobs may need to be deleted and recreated due to stale node selector references.
  • Volume Attachment Delays: EBS CSI driver permissions are critical.
  • Karpenter API Changes: v1.0 introduces breaking changes requiring complete resource recreation.
  • Node Selector Cleanup: Manual cleanup required for legacy references.

Conclusion

Upgrading EKS to Kubernetes 1.32 with Karpenter compatibility requires careful orchestration but is achievable with zero downtime when following these practices. The key is taking a staged approach, maintaining temporary infrastructure for workload migration, and thoroughly testing each component.

The most critical aspects are:

  • Proper Karpenter v1 migration with new API resources
  • EBS CSI driver configuration for persistent volumes
  • Systematic application migration with cleanup of legacy configurations

By following this guide, you can successfully upgrade your EKS infrastructure while maintaining service availability and taking advantage of the latest Kubernetes and Karpenter features.

Total Upgrade Time: 3–4 hours including validation
Downtime: Zero with proper planning and execution