Azure Kubernetes Service: Production Best Practices Guide

Running Kubernetes in production requires careful planning across networking, security, reliability, and operations. After managing multiple production AKS clusters, I have found that these are the practices that matter most.

Cluster Configuration

Start with the right foundation. These settings are hard to change later:

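  • Azure CNI Networking: Use Azure CNI for production. It offers better network performance than kubenet and is required for Windows node pools and some advanced networking scenarios.
  • Private Clusters: Keep the API server private. Reach it only over private connectivity such as a VPN, or from a jump box accessed through Azure Bastion.
  • Availability Zones: Spread nodes across zones for resilience. AKS does not charge extra for zone placement, but zones exist only in supporting regions and must be chosen when a node pool is created.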
  • Managed Identity: Use managed identity instead of service principals, so there are no credentials to store or rotate.

# Create production-ready cluster
az aks create \
  --name prod-cluster \
  --resource-group prod-rg \
  --node-count 3 \
  --zones 1 2 3 \
  --network-plugin azure \
  --network-policy azure \
  --enable-private-cluster \
  --enable-managed-identity \
  --enable-aad \
  --aad-admin-group-object-ids "$ADMIN_GROUP_ID" \
  --enable-addons azure-policy
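
Spreading nodes across zones only helps if the replicas of a workload are spread as well. A minimal sketch of how that might look, assuming a hypothetical Deployment labeled `app: api` and a placeholder image name (neither comes from a real cluster):

```yaml
# Sketch: ask the scheduler to balance replicas across availability zones.
# The Deployment name, label, and image are illustrative assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      topologySpreadConstraints:
        - maxSkew: 1                                # at most one replica of imbalance between zones
          topologyKey: topology.kubernetes.io/zone  # well-known zone label on nodes
          whenUnsatisfiable: ScheduleAnyway         # prefer spreading, but do not block scheduling
          labelSelector:
            matchLabels:
              app: api
      containers:
        - name: api
          image: example.azurecr.io/api:1.0.0       # placeholder image
```

`ScheduleAnyway` keeps pods schedulable if a zone is down; `DoNotSchedule` is the stricter choice when even spread matters more than availability of capacity.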

Node Pool Strategy

Use multiple node pools for workload isolation:

# System node pool for critical addons
az aks nodepool add \
  --resource-group prod-rg \
  --cluster-name prod-cluster \
  --name system \
  --node-count 3 \
  --mode System \
  --node-taints CriticalAddonsOnly=true:NoSchedule

# General workload pool (autoscaler keeps it between 3 and 20 nodes)
az aks nodepool add \
  --resource-group prod-rg \
  --cluster-name prod-cluster \
  --name workloads \
  --node-count 5 \
  --enable-cluster-autoscaler \
  --min-count 3 \
  --max-count 20

# High-memory pool for specific workloads
az aks nodepool add \
  --resource-group prod-rg \
  --cluster-name prod-cluster \
  --name highmem \
  --node-vm-size Standard_E8s_v3 \
  --node-count 2
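
Creating a pool does not by itself place workloads on it. One way to pin a workload to the highmem pool is a node selector on the `kubernetes.azure.com/agentpool` label that AKS sets on each node. The sketch below assumes a hypothetical `analytics-cache` Deployment and a placeholder image:

```yaml
# Sketch: schedule a memory-hungry workload onto the "highmem" node pool.
# The workload name, image, and resource figures are illustrative assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: analytics-cache
spec:
  replicas: 2
  selector:
    matchLabels:
      app: analytics-cache
  template:
    metadata:
      labels:
        app: analytics-cache
    spec:
      nodeSelector:
        kubernetes.azure.com/agentpool: highmem   # node pool name from the command above
      containers:
        - name: cache
          image: example.azurecr.io/analytics-cache:1.0.0   # placeholder image
          resources:
            requests:
              cpu: "2"
              memory: 16Gi
            limits:
              cpu: "4"
              memory: 16Gi
```

A node selector only attracts this workload to the pool. Because the highmem pool above has no taint, other pods can still land there; adding `--node-taints` when creating the pool, plus a matching toleration here, would reserve it.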

Security Essentials

  • Azure AD Integration: Use Azure AD (now Microsoft Entra ID) for authentication and Kubernetes RBAC for authorization.
  • Workload Identity: Workloads get Azure identities without stored secrets. This is the successor to pod managed identity, which has been deprecated.
  • Network Policies: Enforce network segmentation between namespaces.
  • Azure Policy: Block privileged containers and restrict images to approved registries.
  • Container Insights: Monitor cluster logs and metrics for security anomalies.
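
As an illustration of namespace segmentation, the two policies below deny all ingress by default and then allow traffic only from pods in the same namespace. This is a sketch: the `team-a` namespace is hypothetical, and the policies are enforced only if the cluster has a network policy engine enabled (for example via the `--network-policy` option of `az aks create`).

```yaml
# Sketch: isolate a namespace from the rest of the cluster.
# "team-a" is a hypothetical namespace name.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: team-a
spec:
  podSelector: {}          # empty selector: applies to every pod in the namespace
  policyTypes:
    - Ingress              # no ingress rules listed, so all inbound traffic is denied
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-same-namespace
  namespace: team-a
spec:
  podSelector: {}
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}  # any pod in this same namespace may connect
```

Network policies are additive, so the second policy opens intra-namespace traffic on top of the default deny. Traffic from an ingress controller in another namespace would need its own `namespaceSelector` rule.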

Reliability Patterns

# Pod Disruption Budget
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api

---
# Resource Quota per namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
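
Once a quota covers CPU and memory requests and limits, pods in that namespace that omit those values are rejected at admission. A LimitRange can supply defaults so that does not surprise teams. A minimal sketch, with a hypothetical name and illustrative values, applied to the same namespace as the quota:

```yaml
# Sketch: default requests and limits for containers that do not set their own.
# The name and all values are illustrative assumptions.
apiVersion: v1
kind: LimitRange
metadata:
  name: team-defaults
spec:
  limits:
    - type: Container
      defaultRequest:      # used when a container omits resources.requests
        cpu: 250m
        memory: 256Mi
      default:             # used when a container omits resources.limits
        cpu: 500m
        memory: 512Mi
```

Defaults are a safety net rather than a sizing strategy; workloads with known profiles should still set explicit values.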

Key Takeaways

  • Use Azure CNI, private clusters, and availability zones from the start
  • Separate node pools for system components and workloads
  • Integrate Azure AD, use workload identity (the successor to pod managed identity), and enable network policies
  • Define PDBs and resource quotas for reliability
