<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Vivek Dhami]]></title><description><![CDATA[I'm a technologist. I write about cloud-native architecture, Kubernetes, security, and the future of technology.]]></description><link>https://blog.vivekdhami.com</link><generator>RSS for Node</generator><lastBuildDate>Sun, 26 Apr 2026 16:52:14 GMT</lastBuildDate><atom:link href="https://blog.vivekdhami.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Terraform: Eliminating phantom diffs using ignore_changes and replace_triggered_by]]></title><description><![CDATA[The Problem: Write-Only Fields and Perpetual Diffs
A particularly common source of phantom diffs is Azure DevOps service endpoint resources. Passwords and API keys are write-only — the API accepts the]]></description><link>https://blog.vivekdhami.com/terraform-eliminating-phantom-diffs-using-ignore-changes-and-replace-triggered-by</link><guid isPermaLink="true">https://blog.vivekdhami.com/terraform-eliminating-phantom-diffs-using-ignore-changes-and-replace-triggered-by</guid><category><![CDATA[Terraform]]></category><category><![CDATA[Devops]]></category><category><![CDATA[Infrastructure as code]]></category><dc:creator><![CDATA[Vivek Dhami]]></dc:creator><pubDate>Fri, 20 Mar 2026 11:16:04 GMT</pubDate><content:encoded><![CDATA[<h2>The Problem: Write-Only Fields and Perpetual Diffs</h2>
<p>A particularly common source of phantom diffs is Azure DevOps service endpoint resources. Passwords and API keys are write-only — the API accepts them on create/update but never returns them on read, so state never holds the real value. Every <code>terraform plan</code> compares your configured secret against that empty state and reports a diff.</p>
<pre><code class="language-hcl">resource "azuredevops_serviceendpoint_nuget" "this" {
  project_id            = var.project_id
  service_endpoint_name = "nuget-feed"
  url                   = "https://artifacts.corp.example.com/repository/nuget-hosted/"
  username              = "deployer"
  password              = var.nuget_api_key
}
</code></pre>
<p>Every <code>terraform plan</code>, without fail:</p>
<pre><code># azuredevops_serviceendpoint_nuget.this will be updated in-place
~ resource "azuredevops_serviceendpoint_nuget" "this" {
      id                    = "..."
    ~ password              = (sensitive value)
      # (4 unchanged attributes hidden)
  }
</code></pre>
<h2>The Naive Fix (and Its Blind Spot)</h2>
<p>The obvious answer is <code>ignore_changes</code>:</p>
<pre><code class="language-hcl">resource "azuredevops_serviceendpoint_nuget" "this" {
  # ...
  password = var.nuget_api_key

  lifecycle {
    ignore_changes = [password]
  }
}
</code></pre>
<p>This kills the phantom diff, but creates a new problem: <strong>when you actually rotate the API key</strong>, Terraform silently ignores the change. You'd have to remember to manually run:</p>
<pre><code class="language-bash">terraform apply -replace="azuredevops_serviceendpoint_nuget.this"
</code></pre>
<p>That's fragile, easy to forget, and doesn't scale across dozens of service connections.</p>
<h2>The Better Fix: <code>ignore_changes</code> + <code>replace_triggered_by</code></h2>
<p>Combine both lifecycle features to get the best of both worlds — no phantom diffs during normal operation, but automatic recreation when the secret actually changes.</p>
<pre><code class="language-hcl"># A trigger resource that tracks the secret value.
# Only changes when the actual secret value changes.
resource "terraform_data" "nuget_api_key_version" {
  input = var.nuget_api_key
}

resource "azuredevops_serviceendpoint_nuget" "this" {
  project_id            = var.project_id
  service_endpoint_name = "nuget-feed"
  url                   = "https://artifacts.corp.example.com/repository/nuget-hosted/"
  username              = "deployer"
  password              = var.nuget_api_key

  lifecycle {
    ignore_changes = [password]
    replace_triggered_by = [
      terraform_data.nuget_api_key_version
    ]
  }
}
</code></pre>
<h2>How This Works</h2>
<ol>
<li><p><strong><code>ignore_changes = [password]</code></strong> eliminates the phantom diff on every plan — Terraform stops comparing the write-only password field against state.</p>
</li>
<li><p><strong><code>terraform_data.nuget_api_key_version</code></strong> stores the current value of the secret in state via its <code>input</code> attribute. It only reports a change when <code>var.nuget_api_key</code> <em>actually</em> changes.</p>
</li>
<li><p><strong><code>replace_triggered_by</code></strong> watches the trigger resource. When the secret rotates and <code>terraform_data</code> detects a real change, it forces a full replacement of the service endpoint — ensuring the new credential is applied.</p>
</li>
</ol>
<h3>Behavior Matrix</h3>
<table>
<thead>
<tr>
<th>Scenario</th>
<th>Behavior</th>
</tr>
</thead>
<tbody><tr>
<td>Normal plan, no changes</td>
<td>Clean plan, zero diffs</td>
</tr>
<tr>
<td>Secret rotated in <code>var.nuget_api_key</code></td>
<td><code>terraform_data</code> changes → service endpoint is <strong>replaced</strong> with new password</td>
</tr>
<tr>
<td>Other attributes changed (URL, name)</td>
<td>Normal in-place update, no interference</td>
</tr>
<tr>
<td><code>terraform plan</code> run repeatedly</td>
<td>Consistently clean — no phantom diff</td>
</tr>
</tbody></table>
<h2>Scaling the Pattern</h2>
<p>This pattern works for any write-only field: Docker registry passwords, Maven deploy tokens, PyPI API keys, generic service connection secrets, etc.</p>
<pre><code class="language-hcl"># --- Trigger resources (one per secret) ---

resource "terraform_data" "docker_password_version" {
  input = var.docker_password
}

resource "terraform_data" "maven_password_version" {
  input = var.maven_deploy_token
}

resource "terraform_data" "pypi_password_version" {
  input = var.pypi_api_token
}

resource "terraform_data" "npm_password_version" {
  input = var.npm_auth_token
}

# --- Service connections ---

resource "azuredevops_serviceendpoint_dockerregistry" "this" {
  project_id            = var.project_id
  service_endpoint_name = "docker-registry"
  docker_registry       = "https://registry.corp.example.com"
  docker_username       = "deployer"
  docker_password       = var.docker_password
  registry_type         = "Others"

  lifecycle {
    ignore_changes       = [docker_password]
    replace_triggered_by = [terraform_data.docker_password_version]
  }
}

resource "azuredevops_serviceendpoint_maven" "this" {
  project_id            = var.project_id
  service_endpoint_name = "maven-repo"
  url                   = "https://artifacts.corp.example.com/repository/maven-releases/"
  username              = "deployer"
  password              = var.maven_deploy_token

  lifecycle {
    ignore_changes       = [password]
    replace_triggered_by = [terraform_data.maven_password_version]
  }
}

resource "azuredevops_serviceendpoint_npm" "this" {
  project_id            = var.project_id
  service_endpoint_name = "npm-registry"
  url                   = "https://artifacts.corp.example.com/repository/npm-hosted/"
  access_token          = var.npm_auth_token

  lifecycle {
    ignore_changes       = [access_token]
    replace_triggered_by = [terraform_data.npm_password_version]
  }
}

resource "azuredevops_serviceendpoint_generic" "pypi" {
  project_id            = var.project_id
  service_endpoint_name = "pypi-repo"
  server_url            = "https://artifacts.corp.example.com/repository/pypi-hosted/"
  password              = var.pypi_api_token

  lifecycle {
    ignore_changes       = [password]
    replace_triggered_by = [terraform_data.pypi_password_version]
  }
}
</code></pre>
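<p>When the list of feeds grows further, the per-secret trigger resources can be generated with <code>for_each</code>. A minimal sketch, assuming a map-shaped variable (<code>generic_feeds</code> is an illustrative name, not part of the original config); <code>replace_triggered_by</code> accepts <code>each.key</code> indexing for <code>for_each</code> resources:</p>
<pre><code class="language-hcl"># Hypothetical input shape; adapt to your real feed definitions.
variable "generic_feeds" {
  type = map(object({
    url    = string
    secret = string
  }))
  sensitive = true
}

# One trigger per feed, keyed identically to the endpoints.
resource "terraform_data" "feed_secret_version" {
  for_each = var.generic_feeds
  input    = each.value.secret
}

resource "azuredevops_serviceendpoint_generic" "feed" {
  for_each              = var.generic_feeds
  project_id            = var.project_id
  service_endpoint_name = each.key
  server_url            = each.value.url
  password              = each.value.secret

  lifecycle {
    ignore_changes       = [password]
    replace_triggered_by = [terraform_data.feed_secret_version[each.key]]
  }
}
</code></pre>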
<h2>Modularizing the Pattern</h2>
<p>If you manage many service connections through a module, you can encapsulate the pattern:</p>
<pre><code class="language-hcl"># modules/service_connection_with_secret/main.tf

variable "project_id" {}
variable "endpoint_name" {}
variable "server_url" {}
variable "username" { default = "" }
variable "password" { sensitive = true }

resource "terraform_data" "password_version" {
  input = var.password
}

resource "azuredevops_serviceendpoint_generic" "this" {
  project_id            = var.project_id
  service_endpoint_name = var.endpoint_name
  server_url            = var.server_url
  username              = var.username
  password              = var.password

  lifecycle {
    ignore_changes       = [password]
    replace_triggered_by = [terraform_data.password_version]
  }
}

output "endpoint_id" {
  value = azuredevops_serviceendpoint_generic.this.id
}
</code></pre>
<p>Usage:</p>
<pre><code class="language-hcl">module "pypi_connection" {
  source        = "./modules/service_connection_with_secret"
  project_id    = azuredevops_project.this.id
  endpoint_name = "pypi-repo"
  server_url    = "https://artifacts.corp.example.com/repository/pypi-hosted/"
  password      = var.pypi_api_token
}
</code></pre>
<h2>Important Caveats</h2>
<h3>Replacement is Destructive</h3>
<p><code>replace_triggered_by</code> forces a <strong>destroy + create</strong>, not an in-place update. For service connections this is usually fine, but be aware that:</p>
<ul>
<li>The service endpoint ID changes on replacement</li>
<li>Pipelines referencing the endpoint by ID (not name) may briefly fail during apply</li>
<li>Any <code>azuredevops_resource_authorization</code> tied to the old endpoint ID needs to be recreated</li>
</ul>
<p>If zero-downtime is critical, add <code>create_before_destroy</code>. Note that this only helps if the provider allows the new and old endpoints to coexist briefly; where endpoint names must be unique within a project, the create step will fail on the name collision.</p>
<pre><code class="language-hcl">  lifecycle {
    ignore_changes         = [password]
    replace_triggered_by   = [terraform_data.nuget_api_key_version]
    create_before_destroy  = true
  }
</code></pre>
<h3>Sensitive Values in <code>terraform_data</code></h3>
<p>The <code>terraform_data</code> resource stores its <code>input</code> in state. If your state backend isn't encrypted, the secret will be visible in the state file. Ensure you:</p>
<ul>
<li>Use an encrypted state backend (Azure Blob with encryption, S3 with SSE, etc.)</li>
<li>Restrict state file access with proper IAM/RBAC policies</li>
<li>Consider using <code>sensitive = true</code> on the variable to suppress console output</li>
</ul>
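<p>To keep the raw secret out of the trigger's state entry, one variation (a sketch, not required by the pattern) is to store a hash of the secret; the digest still changes exactly when the key does:</p>
<pre><code class="language-hcl">resource "terraform_data" "nuget_api_key_version" {
  # Only the SHA-256 digest of the key lands in state.
  input = sha256(var.nuget_api_key)
}
</code></pre>
<p>The service endpoint resource itself may still record the password in state (that's provider-dependent), so an encrypted backend remains essential either way.</p>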
<h3>When NOT to Use This Pattern</h3>
<ul>
<li><strong>The provider correctly handles the field.</strong> Some providers do read secrets back from the API — no phantom diff means no need for this pattern.</li>
<li><strong>Terraform v1.11+ write-only arguments.</strong> If your provider exposes a write-only variant of the secret field, use that instead — it's the native solution to this problem.</li>
<li><strong>You want in-place updates, not replacements.</strong> If the resource is expensive to recreate (e.g., databases with data), the replacement semantics of <code>replace_triggered_by</code> may be too aggressive.</li>
</ul>
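<p>For comparison, here's what the write-only style looks like. This is illustrative only: it uses the <code>_wo</code>/<code>_wo_version</code> convention some HashiCorp providers have adopted (for example, the AWS provider's <code>aws_db_instance</code>), so check whether your provider exposes an equivalent before relying on it:</p>
<pre><code class="language-hcl">resource "aws_db_instance" "example" {
  # ... other required arguments elided ...
  password_wo         = var.db_password # never persisted to state
  password_wo_version = 1               # bump this to send a new value
}
</code></pre>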
<h2>Summary</h2>
<table>
<thead>
<tr>
<th>Approach</th>
<th>Phantom Diff?</th>
<th>Detects Secret Rotation?</th>
<th>Automation</th>
</tr>
</thead>
<tbody><tr>
<td>No lifecycle rules</td>
<td>Yes (every plan)</td>
<td>Yes</td>
<td>Automatic</td>
</tr>
<tr>
<td><code>ignore_changes</code> only</td>
<td>No</td>
<td>No</td>
<td>Manual <code>-replace</code> needed</td>
</tr>
<tr>
<td><code>ignore_changes</code> + <code>replace_triggered_by</code></td>
<td>No</td>
<td>Yes</td>
<td>Fully automatic</td>
</tr>
</tbody></table>
<p>The <code>ignore_changes</code> + <code>replace_triggered_by</code> pattern gives you clean plans <em>and</em> automatic secret rotation handling. It's the practical middle ground until write-only attributes become widespread across providers.</p>
]]></content:encoded></item><item><title><![CDATA[Managing Terraform Phantom Diffs: A Practical Guide]]></title><description><![CDATA[What Are Phantom Diffs?
Phantom diffs — also called perpetual diffs or ghost changes — are changes that appear in your terraform plan output every single run, even when you haven't modified any code o]]></description><link>https://blog.vivekdhami.com/managing-terraform-phantom-diffs-a-practical-guide</link><guid isPermaLink="true">https://blog.vivekdhami.com/managing-terraform-phantom-diffs-a-practical-guide</guid><category><![CDATA[Terraform]]></category><category><![CDATA[Devops]]></category><category><![CDATA[Infrastructure as code]]></category><dc:creator><![CDATA[Vivek Dhami]]></dc:creator><pubDate>Fri, 20 Mar 2026 11:06:53 GMT</pubDate><content:encoded><![CDATA[<h2>What Are Phantom Diffs?</h2>
<p>Phantom diffs — also called <strong>perpetual diffs</strong> or <strong>ghost changes</strong> — are changes that appear in your <code>terraform plan</code> output every single run, even when you haven't modified any code or infrastructure. They create noise, erode trust in your plans, and can mask real changes hiding among the false positives.</p>
<pre><code># azurerm_resource_group.example will be updated in-place
~ resource "azurerm_resource_group" "example" {
      id       = "/subscriptions/.../resourceGroups/my-rg"
      name     = "my-rg"
    ~ tags     = {
        - "created_by" = "terraform" -&gt; null
      }
  }
</code></pre>
<p>You stare at the diff. You didn't change anything. You run <code>plan</code> again — same diff. Welcome to the world of phantom diffs.</p>
<hr />
<h2>Why Do Phantom Diffs Happen?</h2>
<p>There are several root causes, and understanding them is the key to fixing them.</p>
<h3>1. Provider Defaults and API Normalization</h3>
<p>Cloud APIs often return values in a different format than what you send. The provider compares your config to the API response and sees a "difference" every time.</p>
<p><strong>Example: Case sensitivity</strong></p>
<pre><code class="language-hcl">resource "azurerm_resource_group" "example" {
  name     = "My-Resource-Group"
  location = "East US"
}
</code></pre>
<p>Azure might normalize <code>location</code> to <code>eastus</code> internally. Every plan, Terraform sees <code>"East US"</code> in your config vs <code>"eastus"</code> in state, and reports a diff.</p>
<p><strong>Fix:</strong> Match the API's canonical form:</p>
<pre><code class="language-hcl">resource "azurerm_resource_group" "example" {
  name     = "My-Resource-Group"
  location = "eastus"
}
</code></pre>
<h3>2. Computed Attributes Conflicting with Config</h3>
<p>Some attributes are both configurable <em>and</em> computed by the provider. If you set a value that the API overrides or augments, you'll get a perpetual diff.</p>
<p><strong>Example: Tags merged by Azure Policy</strong></p>
<pre><code class="language-hcl">resource "azurerm_resource_group" "example" {
  name     = "my-rg"
  location = "eastus"

  tags = {
    environment = "dev"
  }
}
</code></pre>
<p>If an Azure Policy automatically adds <code>"created_by" = "policy"</code>, every plan shows:</p>
<pre><code>~ tags = {
    - "created_by" = "policy" -&gt; null    # Terraform wants to remove this
      "environment" = "dev"
  }
</code></pre>
<p>Terraform wants to enforce <em>your</em> declared state, which doesn't include that tag.</p>
<p><strong>Fix:</strong> Use <code>ignore_changes</code> in a lifecycle block:</p>
<pre><code class="language-hcl">resource "azurerm_resource_group" "example" {
  name     = "my-rg"
  location = "eastus"

  tags = {
    environment = "dev"
  }

  lifecycle {
    ignore_changes = [tags["created_by"]]
  }
}
</code></pre>
<p>Or, if the external system adds many unpredictable tags:</p>
<pre><code class="language-hcl">  lifecycle {
    ignore_changes = [tags]
  }
</code></pre>
<h3>3. Attribute Ordering and Serialization</h3>
<p>Some resources contain list or set attributes where order is semantically irrelevant but syntactically significant to Terraform.</p>
<p><strong>Example: Security group rules</strong></p>
<pre><code class="language-hcl">resource "aws_security_group" "example" {
  name = "my-sg"

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["10.0.0.0/8", "172.16.0.0/12"]
  }
}
</code></pre>
<p>If the AWS API returns <code>cidr_blocks</code> as <code>["172.16.0.0/12", "10.0.0.0/8"]</code>, you get a phantom diff every run.</p>
<p><strong>Fix:</strong> Sort your values to match API return order, or use <code>ignore_changes</code>:</p>
<pre><code class="language-hcl">  cidr_blocks = ["172.16.0.0/12", "10.0.0.0/8"]  # Match API order
</code></pre>
<h3>4. Default Values Applied Server-Side</h3>
<p>Resources often have optional attributes that the API fills in with defaults. If the provider reads these back but you didn't set them, Terraform may show an unwanted diff.</p>
<p><strong>Example: AWS Launch Template</strong></p>
<pre><code class="language-hcl">resource "aws_launch_template" "example" {
  name_prefix   = "my-lt-"
  image_id      = "ami-0abcdef1234567890"
  instance_type = "t3.micro"
}
</code></pre>
<p>The API might populate <code>metadata_options</code> with defaults. Next plan:</p>
<pre><code>~ metadata_options {
    - http_endpoint               = "enabled" -&gt; null
    - http_put_response_hop_limit = 1 -&gt; null
    - http_tokens                 = "optional" -&gt; null
  }
</code></pre>
<p><strong>Fix:</strong> Explicitly declare the defaults in your config:</p>
<pre><code class="language-hcl">resource "aws_launch_template" "example" {
  name_prefix   = "my-lt-"
  image_id      = "ami-0abcdef1234567890"
  instance_type = "t3.micro"

  metadata_options {
    http_endpoint               = "enabled"
    http_put_response_hop_limit = 1
    http_tokens                 = "optional"
  }
}
</code></pre>
<h3>5. Sensitive or Computed-Only Values in State</h3>
<p>Some providers mark attributes as sensitive or don't store them in state properly. This causes Terraform to believe the value is missing or changed.</p>
<p><strong>Example: Passwords and secrets</strong></p>
<pre><code class="language-hcl">resource "azuredevops_serviceendpoint_generic" "example" {
  project_id            = azuredevops_project.example.id
  service_endpoint_name = "my-endpoint"
  server_url            = "https://example.com"
  password              = "my-secret"
}
</code></pre>
<p>If the provider can't read the password back from the API (it's write-only), every plan shows:</p>
<pre><code>~ password = (sensitive value)
</code></pre>
<p><strong>Fix:</strong> Use <code>ignore_changes</code> for write-only secrets:</p>
<pre><code class="language-hcl">  lifecycle {
    ignore_changes = [password]
  }
</code></pre>
<blockquote>
<p><strong>Note:</strong> Terraform v1.10 introduced ephemeral values, and v1.11 added write-only arguments, which handle this pattern more elegantly where providers support them.</p>
</blockquote>
<h3>6. Timestamp and Auto-Generated Fields</h3>
<p>Resources that include timestamps (<code>last_modified</code>, <code>updated_at</code>) or auto-generated IDs that change on read will always show diffs.</p>
<p><strong>Example:</strong></p>
<pre><code>~ last_modified = "2026-03-19T10:00:00Z" -&gt; "2026-03-20T08:30:00Z"
</code></pre>
<p><strong>Fix:</strong></p>
<pre><code class="language-hcl">  lifecycle {
    ignore_changes = [last_modified]
  }
</code></pre>
<h3>7. Inconsistent <code>jsonencode</code> / JSON Formatting</h3>
<p>When you pass JSON as a string (e.g., IAM policies, API definitions), whitespace or key-ordering differences between your config and the API cause phantom diffs.</p>
<p><strong>Example:</strong></p>
<pre><code class="language-hcl">resource "aws_iam_role" "example" {
  name = "my-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action    = "sts:AssumeRole"
      Effect    = "Allow"
      Principal = { Service = "ec2.amazonaws.com" }
    }]
  })
}
</code></pre>
<p>The API may return the policy with different key ordering or whitespace, causing:</p>
<pre><code>~ assume_role_policy = jsonencode(
    ~ {
        - Statement = [...]
        + Statement = [...]  # Same content, different formatting
      }
  )
</code></pre>
<p><strong>Fix:</strong> Use the dedicated policy resources or data sources:</p>
<pre><code class="language-hcl">data "aws_iam_policy_document" "assume_role" {
  statement {
    actions = ["sts:AssumeRole"]
    effect  = "Allow"
    principals {
      type        = "Service"
      identifiers = ["ec2.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "example" {
  name               = "my-role"
  assume_role_policy = data.aws_iam_policy_document.assume_role.json
}
</code></pre>
<hr />
<h2>Strategies for Handling Phantom Diffs</h2>
<h3>Strategy 1: <code>lifecycle { ignore_changes = [...] }</code></h3>
<p>The most common and direct fix. Use it when an external system modifies attributes outside Terraform's control.</p>
<pre><code class="language-hcl">resource "kubernetes_deployment" "example" {
  metadata {
    name = "my-app"
  }

  lifecycle {
    ignore_changes = [
      metadata[0].annotations["kubectl.kubernetes.io/last-applied-configuration"],
      spec[0].template[0].metadata[0].annotations,
    ]
  }
}
</code></pre>
<p><strong>When to use:</strong> External systems (policies, operators, other tools) modify resources after Terraform applies them.</p>
<p><strong>When NOT to use:</strong> You're hiding a real configuration problem. <code>ignore_changes</code> is a scalpel, not a sledgehammer.</p>
<h3>Strategy 2: Match the API's Canonical Form</h3>
<p>Before reaching for <code>ignore_changes</code>, check if you can simply adjust your config to match what the API returns.</p>
<pre><code class="language-bash"># Check what the API actually stores:
terraform show -json | jq '.values.root_module.resources[] | select(.address == "aws_s3_bucket.example")'
</code></pre>
<p>Then update your config to match. This is the cleanest solution.</p>
<h3>Strategy 3: Use <code>terraform state show</code> to Inspect</h3>
<pre><code class="language-bash">terraform state show 'module.my_module.azuredevops_serviceendpoint_nuget.this'
</code></pre>
<p>Compare the state values with your config to identify exactly which attribute drifts.</p>
<h3>Strategy 4: Pin Provider Versions</h3>
<p>Provider updates can introduce or fix phantom diffs. Always pin your versions:</p>
<pre><code class="language-hcl">terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~&gt; 3.85.0"
    }
  }
}
</code></pre>
<p>Check provider changelogs — phantom diff fixes are common in patch releases.</p>
<h3>Strategy 5: Use <code>replace_triggered_by</code> Instead of Fighting Diffs</h3>
<p>Sometimes a resource will always show a diff because a dependency changed. Rather than suppress it, embrace it:</p>
<pre><code class="language-hcl">resource "null_resource" "config_reload" {
  lifecycle {
    replace_triggered_by = [
      azuredevops_variable_group.secrets
    ]
  }
}
</code></pre>
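<p>On Terraform v1.4+, the built-in <code>terraform_data</code> resource covers this without pulling in the null provider. A sketch reusing the same names; <code>triggers_replace</code> here keys off the variable group's ID, so pick whichever attribute should actually drive the reload:</p>
<pre><code class="language-hcl">resource "terraform_data" "config_reload" {
  # Replaced whenever the referenced value changes.
  triggers_replace = [azuredevops_variable_group.secrets.id]
}
</code></pre>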
<h3>Strategy 6: Automate Phantom Diff Detection in CI/CD</h3>
<p>Add a check to your pipeline that flags plans with <em>only</em> known phantom diffs:</p>
<pre><code class="language-bash">#!/bin/bash
terraform plan -detailed-exitcode -out=plan.tfplan 2&gt;&amp;1
EXIT_CODE=$?

if [ $EXIT_CODE -eq 2 ]; then
  # Changes detected — check if they're all known phantoms
  terraform show -json plan.tfplan | jq -r '
    .resource_changes[]
    | select(.change.actions != ["no-op"])
    | .address
  ' &gt; changed_resources.txt

  KNOWN_PHANTOMS="azuredevops_serviceendpoint_generic.password_reset"

  while read -r resource; do
    if ! echo "$KNOWN_PHANTOMS" | grep -qF "$resource"; then
      echo "REAL CHANGE DETECTED: $resource"
      exit 2
    fi
  done &lt; changed_resources.txt

  echo "Only phantom diffs detected — safe to skip apply."
  exit 0
fi
</code></pre>
<hr />
<h2>Decision Framework</h2>
<table>
<thead>
<tr>
<th>Symptom</th>
<th>Root Cause</th>
<th>Fix</th>
</tr>
</thead>
<tbody><tr>
<td>Case/format differences</td>
<td>API normalization</td>
<td>Match canonical form</td>
</tr>
<tr>
<td>External tags/annotations</td>
<td>External system modification</td>
<td><code>ignore_changes</code> on specific keys</td>
</tr>
<tr>
<td>Sensitive values always changing</td>
<td>Write-only API fields</td>
<td><code>ignore_changes</code> on secret fields</td>
</tr>
<tr>
<td>JSON/policy reformatting</td>
<td>Serialization differences</td>
<td>Use dedicated data sources</td>
</tr>
<tr>
<td>Ordering changes in lists</td>
<td>Non-deterministic API responses</td>
<td>Sort values or use <code>ignore_changes</code></td>
</tr>
<tr>
<td>New attributes appearing after upgrade</td>
<td>Provider version change</td>
<td>Pin provider, set explicit defaults</td>
</tr>
</tbody></table>
<hr />
<h2>Key Takeaways</h2>
<ol>
<li><strong>Diagnose before suppressing.</strong> Use <code>terraform state show</code> and <code>terraform show -json</code> to understand <em>why</em> a diff appears before reaching for <code>ignore_changes</code>.</li>
<li><strong><code>ignore_changes</code> is a trade-off.</strong> It solves the noise problem but creates a blind spot — Terraform will never manage that attribute again until you remove the lifecycle rule.</li>
<li><strong>Match the API, not the docs.</strong> The API's canonical representation is the source of truth. Adjust your config to match it.</li>
<li><strong>Pin provider versions.</strong> Phantom diff behavior can change between provider versions. Pin versions and upgrade deliberately.</li>
<li><strong>Automate detection.</strong> In CI/CD, distinguish phantom diffs from real changes to prevent pipeline fatigue and rubber-stamp approvals.</li>
<li><strong>Report bugs upstream.</strong> Many phantom diffs are provider bugs. File issues — the maintainers fix them regularly.</li>
</ol>
<p>Phantom diffs are an inevitable part of working with Terraform at scale, but they don't have to be a source of constant frustration. A methodical approach to diagnosing and resolving them keeps your plans clean and trustworthy.</p>
]]></content:encoded></item><item><title><![CDATA[Understanding WASM and WASI: A complete guide]]></title><description><![CDATA[If you've been following the infrastructure and cloud-native space, you've probably heard whispers about WebAssembly (WASM) being the "next big thing" beyond just browser sandboxing. And honestly? The hype isn't entirely unfounded. When you pair WASM...]]></description><link>https://blog.vivekdhami.com/understanding-wasm-and-wasi-a-complete-guide</link><guid isPermaLink="true">https://blog.vivekdhami.com/understanding-wasm-and-wasi-a-complete-guide</guid><category><![CDATA[wasm]]></category><category><![CDATA[WebAssembly]]></category><dc:creator><![CDATA[Vivek Dhami]]></dc:creator><pubDate>Mon, 04 Aug 2025 22:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/8OyKWQgBsKQ/upload/89483122fb0cf9c4f92ffc454f324b41.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>If you've been following the infrastructure and cloud-native space, you've probably heard whispers about WebAssembly (WASM) being the "next big thing" beyond just browser sandboxing. And honestly? The hype isn't entirely unfounded. When you pair WASM with WASI (the WebAssembly System Interface), you get a genuinely compelling alternative to traditional container runtimes in certain scenarios.</p>
<p>This post breaks down what WASM and WASI actually are, why they're gaining traction in backend and edge computing, and where they fit in your infrastructure toolkit.</p>
<h2 id="heading-the-problem-runtime-portability-is-still-messy">THE PROBLEM: RUNTIME PORTABILITY IS STILL MESSY</h2>
<p>Let's be real—containers solved a lot of problems. Docker gave us "build once, run anywhere" at the OS level. But containers still carry baggage: you're shipping an entire userspace, you need a container runtime (Docker, containerd, CRI-O), and you're still fundamentally tied to the kernel your container host is running.</p>
<p>For lightweight, short-lived workloads—think serverless functions, edge compute, or plugin architectures—spinning up a full container can be overkill. You're paying the startup cost, the memory overhead, and the attack surface of a Linux userspace even when your actual code is a 2MB binary.</p>
<p>WASM offers a different trade-off: near-native performance, sub-millisecond cold starts, and true portability across architectures and operating systems. But to really unlock WASM outside the browser, you need WASI.</p>
<h2 id="heading-what-is-webassembly-wasm">WHAT IS WEBASSEMBLY (WASM)?</h2>
<p>WebAssembly is a low-level bytecode format designed to be a compilation target for higher-level languages. Originally created to let languages like C, C++, and Rust run in the browser at near-native speeds, WASM has evolved into a portable runtime standard.</p>
<h3 id="heading-key-characteristics">Key Characteristics</h3>
<ul>
<li><p><strong>Binary instruction format</strong>: Compact, fast to decode and execute</p>
</li>
<li><p><strong>Stack-based virtual machine</strong>: Executes in a sandboxed environment</p>
</li>
<li><p><strong>Language-agnostic</strong>: You can compile C, Rust, Go, C++, even Python to WASM</p>
</li>
<li><p><strong>Portable</strong>: The same <code>.wasm</code> binary runs on x86, ARM, in the browser, or on the server</p>
</li>
<li><p><strong>Secure by default</strong>: Runs in a memory-safe sandbox with no default access to the host system</p>
</li>
</ul>
<p>Here's a simple example. Take this Rust function:</p>
<pre><code class="lang-rust"><span class="hljs-meta">#[no_mangle]</span>
<span class="hljs-keyword">pub</span> <span class="hljs-keyword">extern</span> <span class="hljs-string">"C"</span> <span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">add</span></span>(a: <span class="hljs-built_in">i32</span>, b: <span class="hljs-built_in">i32</span>) -&gt; <span class="hljs-built_in">i32</span> {
    a + b
}
</code></pre>
<p>Compile it to WASM:</p>
<pre><code class="lang-bash">rustc --target wasm32-unknown-unknown -O add.rs
</code></pre>
<p>You now have a <code>.wasm</code> module that can execute in any WASM runtime—browser, Wasmtime, Wasmer, or even embedded systems. No OS dependencies, no libc mismatches, just the bytecode.</p>
<h2 id="heading-what-is-wasi-webassembly-system-interface">WHAT IS WASI (WEBASSEMBLY SYSTEM INTERFACE)?</h2>
<p>WASM by itself is intentionally isolated—it has no concept of filesystems, network sockets, or environment variables. That's great for browser security, but it's limiting if you want to use WASM for backend services or CLI tools.</p>
<p>Enter WASI: a standardized system interface for WASM modules. Think of it as a "POSIX for WASM"—a capability-based API that lets WASM code interact with the host system in a controlled, secure way.</p>
<h3 id="heading-core-wasi-capabilities">Core WASI Capabilities</h3>
<p>WASI provides access to:</p>
<ul>
<li><p><strong>File I/O</strong>: Read and write files via preopened directories</p>
</li>
<li><p><strong>Networking</strong>: Socket APIs for TCP/UDP (still evolving)</p>
</li>
<li><p><strong>Environment variables</strong>: Access to configuration</p>
</li>
<li><p><strong>Command-line arguments</strong>: Standard argc/argv handling</p>
</li>
<li><p><strong>Clocks and random numbers</strong>: System time and cryptographic randomness</p>
</li>
</ul>
<p>The key difference from traditional system calls? <strong>Capabilities</strong>. Instead of a process having blanket access to the filesystem, you explicitly grant access to specific directories. It's the principle of least privilege baked into the runtime.</p>
<h3 id="heading-wasi-in-action">WASI in Action</h3>
<p>Let's look at a Rust program using WASI to read a file:</p>
<pre><code class="lang-rust"><span class="hljs-keyword">use</span> std::fs;

<span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">main</span></span>() {
    <span class="hljs-comment">// This only works if the runtime preopens access to the directory</span>
    <span class="hljs-keyword">let</span> contents = fs::read_to_string(<span class="hljs-string">"/data/config.json"</span>)
        .expect(<span class="hljs-string">"Failed to read file"</span>);
​
    <span class="hljs-built_in">println!</span>(<span class="hljs-string">"Config: {}"</span>, contents);
}
</code></pre>
<p>Compile it to WASM with WASI support:</p>
<pre><code class="lang-bash">cargo build --target wasm32-wasi --release
</code></pre>
<p>Run it using Wasmtime (a WASM runtime):</p>
<pre><code class="lang-bash">wasmtime --dir=/host/path/to/data:/data target/wasm32-wasi/release/myapp.wasm
</code></pre>
<p>The <code>--dir</code> flag explicitly maps <code>/host/path/to/data</code> on the host to <code>/data</code> inside the WASM sandbox. Without that flag, the WASM module can't access the filesystem at all. This is capability-based security in action.</p>
<h2 id="heading-why-wasm-wasi-matters-for-infrastructure">WHY WASM + WASI MATTERS FOR INFRASTRUCTURE</h2>
<p>Alright, so WASM is portable and WASI makes it practical. Why should you care as someone running infrastructure?</p>
<h3 id="heading-1-cold-start-performance">1. Cold Start Performance</h3>
<p>WASM modules start in microseconds, not milliseconds or seconds. For serverless and edge computing, this is huge. Compare:</p>
<ul>
<li><p><strong>Container cold start</strong>: 100ms-1s (includes pulling layers, initializing userspace)</p>
</li>
<li><p><strong>WASM cold start</strong>: &lt;1ms (just instantiate the module)</p>
</li>
</ul>
<p>If you're running ephemeral workloads with high request-per-second variability, WASM can dramatically reduce idle resource consumption.</p>
<h3 id="heading-2-smaller-footprint">2. Smaller Footprint</h3>
<p>A typical container image for a Go or Rust service might be 50-200MB. The equivalent WASM binary? Often under 5MB. Less to transfer, less to store, less to scan for vulnerabilities.</p>
<h3 id="heading-3-true-multi-arch-support">3. True Multi-Arch Support</h3>
<p>Compile once to WASM, run on x86, ARM64, RISC-V, whatever. No more maintaining separate container images for different architectures or dealing with emulation layers.</p>
<h3 id="heading-4-enhanced-security-posture">4. Enhanced Security Posture</h3>
<p>WASM's sandbox is more restrictive than a container's by default. No filesystem access, no network, no syscalls unless explicitly granted via WASI capabilities. This makes it attractive for multi-tenant environments and plugin systems where you're running untrusted code.</p>
<h2 id="heading-the-wasm-runtime-ecosystem">THE WASM RUNTIME ECOSYSTEM</h2>
<p>To run WASM outside the browser, you need a runtime. Here are the main players:</p>
<h3 id="heading-wasmtime">Wasmtime</h3>
<p>Developed by the Bytecode Alliance, Wasmtime is a fast, secure WASM and WASI runtime. It's the reference implementation for WASI and integrates well with Rust, C, and Python.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Install Wasmtime</span>
curl https://wasmtime.dev/install.sh -sSf | bash
​
<span class="hljs-comment"># Run a WASM module</span>
wasmtime myapp.wasm
</code></pre>
<h3 id="heading-wasmer">Wasmer</h3>
<p>Another popular runtime with a focus on server-side WASM. Wasmer supports WASI and has a package manager (WAPM) for distributing WASM modules.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Install Wasmer</span>
curl https://get.wasmer.io -sSfL | sh
​
<span class="hljs-comment"># Run with filesystem access</span>
wasmer run myapp.wasm --dir=./data
</code></pre>
<h3 id="heading-wasmedge">WasmEdge</h3>
<p>Optimized for edge computing and cloud-native workloads. It's particularly fast and has Kubernetes integration via crun and containerd shims.</p>
<h3 id="heading-spin-fermyon">Spin (Fermyon)</h3>
<p>A framework specifically for building WASM-based microservices. Think of it as a lightweight alternative to running services in containers.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Install Spin</span>
curl -fsSL https://developer.fermyon.com/downloads/install.sh | bash
​
<span class="hljs-comment"># Create a new Spin app</span>
spin new http-rust my-service
<span class="hljs-built_in">cd</span> my-service
spin build
spin up  <span class="hljs-comment"># Runs locally, starts in milliseconds</span>
</code></pre>
<h2 id="heading-how-wasm-fits-into-your-stack">HOW WASM FITS INTO YOUR STACK</h2>
<p>Let's visualize where WASM sits relative to traditional deployments:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1770331516839/d08b9480-a313-4235-b36c-0a52989986f3.png" alt class="image--center mx-auto" /></p>
<p>WASM isn't replacing containers across the board. It's carving out use cases where its characteristics—fast startup, small size, strong isolation—provide meaningful advantages.</p>
<h2 id="heading-real-world-use-cases">REAL-WORLD USE CASES</h2>
<h3 id="heading-edge-computing">Edge Computing</h3>
<p>Deploying compute to the edge (CDN nodes, IoT gateways) benefits massively from WASM's portability and low overhead. Cloudflare Workers, Fastly Compute@Edge, and Fermyon Cloud all use WASM under the hood.</p>
<h3 id="heading-plugin-systems">Plugin Systems</h3>
<p>If you're building software that supports user-provided extensions (think VS Code plugins, database UDFs, or API middleware), WASM provides a safe sandbox. The host can grant specific capabilities without risking arbitrary code execution.</p>
<h3 id="heading-serverless-functions">Serverless Functions</h3>
<p>WASM-based FaaS platforms (like Spin or Wasmer Edge) can scale to zero more aggressively and respond faster than traditional container-based Lambda.</p>
<h3 id="heading-polyglot-microservices">Polyglot Microservices</h3>
<p>Run services written in Rust, Go, C++, and AssemblyScript side-by-side in a WASM runtime without worrying about language-specific runtimes or dependency conflicts.</p>
<h2 id="heading-wasm-in-kubernetes">WASM IN KUBERNETES</h2>
<p>Yes, you can run WASM workloads in Kubernetes. Projects like <strong>runwasi</strong> and <strong>containerd-wasm-shims</strong> let you treat WASM modules as container images.</p>
<p>Example: Running a WASM workload with containerd</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Pod</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">wasm-demo</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">runtimeClassName:</span> <span class="hljs-string">wasmtime</span>  <span class="hljs-comment"># Use WASM runtime instead of runc</span>
  <span class="hljs-attr">containers:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">wasm-app</span>
    <span class="hljs-attr">image:</span> <span class="hljs-string">myregistry/wasm-app:latest</span>
    <span class="hljs-attr">command:</span> [<span class="hljs-string">"/app.wasm"</span>]
</code></pre>
<p>Under the hood, containerd uses a WASM runtime shim to execute the module instead of spawning a traditional container. You get Kubernetes orchestration with WASM's performance characteristics.</p>
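<p>For <code>runtimeClassName</code> to resolve, the cluster also needs a matching RuntimeClass object whose handler points at the installed shim. A minimal sketch; the handler name depends on which shim you registered with containerd, and <code>wasmtime</code> here is an assumption:</p>

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: wasmtime    # referenced by pods via runtimeClassName
handler: wasmtime   # must match the runtime name registered in containerd's config
```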
<h2 id="heading-limitations-and-gotchas">LIMITATIONS AND GOTCHAS</h2>
<p>WASM and WASI aren't a silver bullet. Here's where they fall short (for now):</p>
<h3 id="heading-limited-language-support">Limited Language Support</h3>
<p>While Rust, C/C++, and AssemblyScript have excellent WASM support, languages with heavy runtimes (Java, Python, Ruby) produce large binaries or have limited functionality. Go's WASM support is improving but still has constraints around goroutines and syscalls.</p>
<h3 id="heading-wasi-spec-is-still-evolving">WASI Spec Is Still Evolving</h3>
<p>Networking support in WASI is incomplete. Socket APIs exist but aren't finalized. If your app does heavy network I/O, you might hit limitations.</p>
<h3 id="heading-ecosystem-maturity">Ecosystem Maturity</h3>
<p>Compared to Docker and containers, the WASM ecosystem is younger. Tooling, debugging, and observability aren't as polished. You'll encounter rough edges.</p>
<h3 id="heading-not-a-container-replacement-yet">Not a Container Replacement (Yet)</h3>
<p>For long-running stateful services with complex dependencies, containers are still the safer bet. WASM shines for stateless, ephemeral, or compute-bound workloads.</p>
<h2 id="heading-getting-started-a-practical-example">GETTING STARTED: A PRACTICAL EXAMPLE</h2>
<p>Let's build a simple HTTP service in Rust, compile it to WASM, and run it locally.</p>
<h3 id="heading-step-1-create-a-new-spin-app">Step 1: Create a New Spin App</h3>
<pre><code class="lang-bash">spin new http-rust hello-wasm
<span class="hljs-built_in">cd</span> hello-wasm
</code></pre>
<h3 id="heading-step-2-edit-srclibrshttplibrs">Step 2: Edit <code>src/lib.rs</code></h3>
<pre><code class="lang-rust"><span class="hljs-keyword">use</span> anyhow::<span class="hljs-built_in">Result</span>;
<span class="hljs-keyword">use</span> spin_sdk::{
    http::{Request, Response},
    http_component,
};

<span class="hljs-meta">#[http_component]</span>
<span class="hljs-function"><span class="hljs-keyword">fn</span> <span class="hljs-title">handle_request</span></span>(_req: Request) -&gt; <span class="hljs-built_in">Result</span>&lt;Response&gt; {
    <span class="hljs-literal">Ok</span>(http::Response::builder()
        .status(<span class="hljs-number">200</span>)
        .header(<span class="hljs-string">"Content-Type"</span>, <span class="hljs-string">"application/json"</span>)
        .body(<span class="hljs-literal">Some</span>(<span class="hljs-string">r#"{"message": "Hello from WASM!"}"#</span>.into()))?)
}
</code></pre>
<h3 id="heading-step-3-build-and-run">Step 3: Build and Run</h3>
<pre><code class="lang-bash">spin build
spin up
</code></pre>
<p>Hit <a target="_blank" href="http://localhost:3000"><code>http://localhost:3000</code></a> and you'll see your JSON response. The entire service compiles to a few MB and starts in under 10ms.</p>
<h3 id="heading-step-4-deploy-to-the-edge-optional">Step 4: Deploy to the Edge (Optional)</h3>
<pre><code class="lang-bash">spin deploy  <span class="hljs-comment"># Pushes to Fermyon Cloud or your configured WASM platform</span>
</code></pre>
<p>No Dockerfile, no Kubernetes manifest (unless you want it), no container registry juggling. Just a portable WASM module running at the edge.</p>
<h2 id="heading-wrapping-up">WRAPPING UP</h2>
<p>WebAssembly and WASI represent a genuine shift in how we think about runtime portability and isolation. They're not going to replace containers overnight, but for specific workloads—edge functions, plugin systems, fast-scaling microservices—they offer compelling advantages.</p>
<p><strong>Key takeaways:</strong></p>
<ul>
<li><p><strong>WASM</strong> is a portable, secure bytecode format with near-native performance</p>
</li>
<li><p><strong>WASI</strong> provides a standardized system interface, making WASM practical outside the browser</p>
</li>
<li><p><strong>Cold start times</strong> and <strong>binary sizes</strong> are orders of magnitude better than containers</p>
</li>
<li><p><strong>Capability-based security</strong> gives you fine-grained control over what code can access</p>
</li>
<li><p><strong>Ecosystem is maturing</strong> fast, but still has gaps in networking, language support, and tooling</p>
</li>
</ul>
<p>If you're building for serverless, edge, or multi-tenant environments, it's worth experimenting with WASM. Start small—convert a stateless API endpoint or a CLI tool—and see how it fits your stack.</p>
<p>And if you're already running Kubernetes, check out the WASM runtime shims. You might be surprised how easily WASM slots into your existing orchestration layer.</p>
]]></content:encoded></item><item><title><![CDATA[MicroVMs: the security-first alternative to containers]]></title><description><![CDATA[If you've ever deployed untrusted workloads—think CI/CD runners, serverless functions, or multi-tenant SaaS—you've probably lost sleep over container escape vulnerabilities. Containers share the kernel with the host, and that shared boundary is a con...]]></description><link>https://blog.vivekdhami.com/microvms-the-security-first-alternative-to-containers</link><guid isPermaLink="true">https://blog.vivekdhami.com/microvms-the-security-first-alternative-to-containers</guid><category><![CDATA[Security]]></category><category><![CDATA[containers]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[virtualization]]></category><category><![CDATA[DevSecOps]]></category><dc:creator><![CDATA[Vivek Dhami]]></dc:creator><pubDate>Mon, 05 May 2025 22:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/UMgsYLI3if4/upload/32fea17fd860b6d1038678919ea6105f.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>If you've ever deployed untrusted workloads—think CI/CD runners, serverless functions, or multi-tenant SaaS—you've probably lost sleep over container escape vulnerabilities. Containers share the kernel with the host, and that shared boundary is a constant security concern. Virtual machines solve the isolation problem but come with heavyweight overhead that makes them impractical for modern workload densities.</p>
<p>MicroVMs sit right in that sweet spot: near-container speed with VM-level isolation. Let's dig into what they are, how they compare to traditional containers and VMs, and when you should actually use them.</p>
<h2 id="heading-the-isolation-problem-with-containers">THE ISOLATION PROBLEM WITH CONTAINERS</h2>
<p>Containers revolutionized deployment, but they weren't designed with strong isolation in mind. When you run a container, you're really just running a process on the host with some Linux kernel features (namespaces, cgroups, seccomp) creating the illusion of isolation.</p>
<p>The kernel is the shared attack surface. A kernel exploit in one container can potentially compromise the entire host and every other container running on it. We've seen this play out with real CVEs like the runC vulnerability (CVE-2019-5736) that allowed container escapes through malicious images.</p>
<p>For internal workloads where you trust your code, this risk is manageable. But if you're running untrusted code—customer-provided functions, arbitrary CI jobs, or hostile multi-tenant workloads—the shared kernel becomes a serious liability.</p>
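<p>You can see the "just a process" reality from any Linux shell. A quick sketch using only standard <code>/proc</code> paths:</p>

```bash
# Every Linux process, containerized or not, runs inside a set of kernel
# namespaces, visible as symlinks under /proc/<pid>/ns.
ls -l /proc/self/ns/

# Each symlink resolves to a namespace ID. Two processes "in the same
# container" simply share these IDs; the kernel underneath is one and the same.
readlink /proc/self/ns/pid
readlink /proc/self/ns/net
```

<p>Run the same commands inside a container and you'll see different IDs but the exact same mechanism: the boundary is kernel bookkeeping, not hardware.</p>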
<h2 id="heading-what-exactly-is-a-microvm">WHAT EXACTLY IS A MICROVM?</h2>
<p>A microVM is a lightweight virtual machine optimized for speed and density. It provides the strong isolation boundary of a full VM (separate kernel, virtualized hardware) but strips away everything unnecessary: legacy device support, BIOS emulation, firmware bloat.</p>
<p>The result? Boot times measured in milliseconds, memory overhead in single-digit megabytes, and the ability to pack thousands of microVMs on a single host—all while maintaining hardware-enforced isolation through the hypervisor.</p>
<p>Here's how the three approaches stack up:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1770333454262/d82b6e84-7678-4f8f-ae75-44069820a488.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-comparing-containers-vms-and-microvms">COMPARING CONTAINERS, VMS, AND MICROVMS</h2>
<p>Let's get specific about the tradeoffs:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Metric</strong></td><td><strong>Containers</strong></td><td><strong>MicroVMs</strong></td><td><strong>Traditional VMs</strong></td></tr>
</thead>
<tbody>
<tr>
<td><strong>Boot time</strong></td><td>&lt;100ms</td><td>100-300ms</td><td>5-30s</td></tr>
<tr>
<td><strong>Memory overhead</strong></td><td>~1-5 MB</td><td>~5-20 MB</td><td>~500+ MB</td></tr>
<tr>
<td><strong>Density</strong></td><td>1000s per host</td><td>1000s per host</td><td>10s-100s per host</td></tr>
<tr>
<td><strong>Isolation</strong></td><td>Process-level (shared kernel)</td><td>Hardware virtualization</td><td>Hardware virtualization</td></tr>
<tr>
<td><strong>Attack surface</strong></td><td>Large (entire kernel)</td><td>Minimal (hypervisor + minimal kernel)</td><td>Moderate (hypervisor + full OS)</td></tr>
<tr>
<td><strong>Image size</strong></td><td>Small (OCI layers)</td><td>Small (kernel + rootfs)</td><td>Large (full OS)</td></tr>
<tr>
<td><strong>Ecosystem</strong></td><td>Docker, containerd, CRI-O</td><td>Firecracker, Cloud Hypervisor, Kata Containers</td><td>KVM, Xen, VMware</td></tr>
</tbody>
</table>
</div><p>The key insight: microVMs give you VM security at near-container performance. You're trading a slight increase in boot time and memory overhead for a massive reduction in blast radius.</p>
<h2 id="heading-real-world-microvm-implementations">REAL-WORLD MICROVM IMPLEMENTATIONS</h2>
<h3 id="heading-firecracker">Firecracker</h3>
<p>Developed by AWS for Lambda and Fargate, Firecracker is the most mature microVM implementation. It's built on KVM but strips out everything except what's needed to run a single workload.</p>
<p>Here's how you'd launch a simple Firecracker microVM:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Start the Firecracker process (it exposes a REST API on the socket)</span>
firecracker --api-sock /tmp/firecracker.sock

<span class="hljs-comment"># Point it at a kernel</span>
curl --unix-socket /tmp/firecracker.sock -X PUT \
  -H <span class="hljs-string">"Content-Type: application/json"</span> \
  -d <span class="hljs-string">'{
    "kernel_image_path": "/path/to/vmlinux",
    "boot_args": "console=ttyS0 reboot=k panic=1"
  }'</span> \
  http://localhost/boot-source

<span class="hljs-comment"># Attach the root filesystem</span>
curl --unix-socket /tmp/firecracker.sock -X PUT \
  -H <span class="hljs-string">"Content-Type: application/json"</span> \
  -d <span class="hljs-string">'{
    "drive_id": "rootfs",
    "path_on_host": "/path/to/rootfs.ext4",
    "is_root_device": true,
    "is_read_only": false
  }'</span> \
  http://localhost/drives/rootfs

<span class="hljs-comment"># Size the VM</span>
curl --unix-socket /tmp/firecracker.sock -X PUT \
  -H <span class="hljs-string">"Content-Type: application/json"</span> \
  -d <span class="hljs-string">'{ "vcpu_count": 2, "mem_size_mib": 512 }'</span> \
  http://localhost/machine-config

<span class="hljs-comment"># Boot it</span>
curl --unix-socket /tmp/firecracker.sock -X PUT \
  -H <span class="hljs-string">"Content-Type: application/json"</span> \
  -d <span class="hljs-string">'{ "action_type": "InstanceStart" }'</span> \
  http://localhost/actions
</code></pre>
<p>Each Firecracker microVM runs its own kernel, completely isolated from the host. A kernel vulnerability in your workload's guest kernel doesn't give an attacker access to the host kernel.</p>
<h3 id="heading-gvisor">gVisor</h3>
<p>Google's gVisor takes a different approach: it's a user-space kernel that intercepts system calls before they reach the host kernel. It's not technically a microVM, but it solves the same problem—reducing the kernel attack surface.</p>
<pre><code class="lang-yaml"><span class="hljs-comment"># Using gVisor (runsc) as a Docker runtime</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">node.k8s.io/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">RuntimeClass</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">gvisor</span>
<span class="hljs-attr">handler:</span> <span class="hljs-string">runsc</span>
<span class="hljs-meta">---</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Pod</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">secure-workload</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">runtimeClassName:</span> <span class="hljs-string">gvisor</span>  <span class="hljs-comment"># Use gVisor for this pod</span>
  <span class="hljs-attr">containers:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">app</span>
    <span class="hljs-attr">image:</span> <span class="hljs-string">untrusted-image:latest</span>
</code></pre>
<p>gVisor implements a substantial portion of the Linux kernel in Go, running in user space. System calls from your container go through gVisor's kernel implementation instead of directly to the host kernel. It's faster than Firecracker for some workloads but provides a slightly different security model.</p>
<h3 id="heading-kata-containers">Kata Containers</h3>
<p>Kata combines the OCI container workflow with lightweight VMs. Each pod runs in its own microVM, but you still use standard container images and orchestrators.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Install the Kata runtime components</span>
kubectl apply -f https://raw.githubusercontent.com/kata-containers/kata-containers/main/tools/packaging/kata-deploy/kata-deploy.yaml
</code></pre>
<pre><code class="lang-yaml"><span class="hljs-comment"># Use Kata for untrusted workloads</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Pod</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">kata-secured</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">runtimeClassName:</span> <span class="hljs-string">kata</span>  <span class="hljs-comment"># Runs in a microVM</span>
  <span class="hljs-attr">containers:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">workload</span>
    <span class="hljs-attr">image:</span> <span class="hljs-string">potentially-malicious:latest</span>
</code></pre>
<p>Kata is particularly useful in Kubernetes environments where you want to mix trusted and untrusted workloads on the same cluster—use the default runtime for trusted workloads, Kata for untrusted ones.</p>
<h2 id="heading-security-benefits-for-application-deployment">SECURITY BENEFITS FOR APPLICATION DEPLOYMENT</h2>
<p>MicroVMs shine in scenarios where the cost of compromise is high:</p>
<h3 id="heading-multi-tenant-saas-platforms">Multi-tenant SaaS platforms</h3>
<p>If you're running customer code (think Shopify apps, Salesforce functions, or CI/CD platforms like GitHub Actions), you need defense-in-depth. MicroVMs ensure that a malicious tenant can't escape their sandbox to access other tenants' data or the underlying infrastructure.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1770333504391/90a6cccc-dd8d-4f81-b7a5-26444032f86c.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-serverless-and-function-as-a-service">Serverless and function-as-a-service</h3>
<p>AWS Lambda executes trillions of function invocations per month, often from completely untrusted sources. Firecracker enables this scale by booting a fresh microVM in roughly 125ms and giving each function instance its own kernel and complete isolation.</p>
<h3 id="heading-cicd-runners">CI/CD runners</h3>
<p>Running arbitrary build and test jobs is inherently risky. Projects like GitLab Runner and GitHub Actions use microVMs to ensure a compromised build can't poison the runner or access other builds' secrets.</p>
<pre><code class="lang-yaml"><span class="hljs-comment"># Conceptual sketch: a CI job routed to a Firecracker-backed executor</span>
<span class="hljs-attr">job:</span>
  <span class="hljs-attr">script:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">npm</span> <span class="hljs-string">install</span>  <span class="hljs-comment"># Could be malicious dependencies</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">npm</span> <span class="hljs-string">test</span>
  <span class="hljs-attr">runner:</span>
    <span class="hljs-attr">executor:</span> <span class="hljs-string">firecracker</span>
    <span class="hljs-attr">isolation:</span> <span class="hljs-string">kernel-level</span>
    <span class="hljs-attr">memory:</span> <span class="hljs-string">2048MB</span>
    <span class="hljs-attr">timeout:</span> <span class="hljs-string">30m</span>
</code></pre>
<h3 id="heading-zero-trust-architecture">Zero-trust architecture</h3>
<p>MicroVMs fit naturally into zero-trust models. Even if an attacker compromises your application, they're still trapped inside a VM with no access to the host network, storage, or other workloads. You can further restrict with seccomp-bpf filters and minimal guest kernels that only include the syscalls your workload needs.</p>
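<p>In Kubernetes terms, stacking those layers can look like the sketch below. Pod and image names are hypothetical; <code>seccompProfile</code> and the security context fields are standard Kubernetes API:</p>

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: locked-down        # hypothetical name
spec:
  runtimeClassName: kata   # hardware-enforced isolation via a microVM runtime
  securityContext:
    seccompProfile:
      type: RuntimeDefault # filter syscalls even inside the guest
  containers:
  - name: app
    image: untrusted-image:latest
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]
```

<p>Each layer assumes the one below it can fail: even a seccomp bypass still lands the attacker inside a guest kernel behind the hypervisor.</p>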
<h2 id="heading-when-not-to-use-microvms">WHEN NOT TO USE MICROVMS</h2>
<p>They're not a silver bullet. Here's when containers are still the right choice:</p>
<ul>
<li><p><strong>Trusted internal workloads</strong>: If you control all the code and trust your developers, the isolation overhead isn't justified</p>
</li>
<li><p><strong>Cost-sensitive deployments</strong>: MicroVMs consume slightly more resources; at massive scale, this adds up</p>
</li>
<li><p><strong>Extremely latency-sensitive applications</strong>: That extra 100-200ms boot time matters for some real-time workloads</p>
</li>
<li><p><strong>Windows workloads</strong>: Most microVM implementations are Linux-focused</p>
</li>
</ul>
<h2 id="heading-getting-started-with-microvms">GETTING STARTED WITH MICROVMS</h2>
<p>If you're convinced microVMs are worth exploring:</p>
<ol>
<li><p><strong>Start with gVisor</strong> if you're already on Kubernetes—it's the lowest-friction option for testing isolation without changing your orchestration</p>
</li>
<li><p><strong>Use Firecracker directly</strong> if you're building a custom platform (FaaS, CI/CD) and need maximum control</p>
</li>
<li><p><strong>Try Kata Containers</strong> for a drop-in OCI-compatible solution that works with existing container tooling</p>
</li>
</ol>
<p>The ecosystem is still maturing, but major cloud providers (AWS, Google, Azure) are betting heavily on microVM technology. As the tooling improves and performance gaps close, we'll likely see microVMs become the default for any workload where isolation matters.</p>
<h2 id="heading-key-takeaways">KEY TAKEAWAYS</h2>
<p>MicroVMs bridge the gap between containers and VMs: you get near-container speed and density with VM-level security isolation. The hypervisor enforces a hard boundary that protects against kernel exploits and container escapes.</p>
<p>Use them when running untrusted code—multi-tenant platforms, serverless functions, CI/CD jobs—or anywhere the blast radius of a compromise would be catastrophic. Stick with containers when you trust your code and need maximum efficiency.</p>
<p>The security model is fundamentally different: instead of trusting the kernel and a stack of user-space isolation primitives, you're trusting the hypervisor—a much smaller, more auditable attack surface. For high-stakes deployments, that trade is worth making.</p>
]]></content:encoded></item><item><title><![CDATA[Terraform state: The good, the bad, and the ugly]]></title><description><![CDATA[If you've worked with Terraform for more than a day, you've encountered the state file. Maybe you've cursed at it when it locked during a critical deployment. Maybe you've panicked when someone accidentally deleted it. Or maybe—just maybe—you've comm...]]></description><link>https://blog.vivekdhami.com/terraform-state-the-good-the-bad-and-the-ugly</link><guid isPermaLink="true">https://blog.vivekdhami.com/terraform-state-the-good-the-bad-and-the-ugly</guid><category><![CDATA[Terraform]]></category><dc:creator><![CDATA[Vivek Dhami]]></dc:creator><pubDate>Tue, 03 Sep 2024 22:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/YCgpoP7BP1Q/upload/d1a2f62d76fe4eaae15bcd73ccc5f4b0.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>If you've worked with Terraform for more than a day, you've encountered the state file. Maybe you've cursed at it when it locked during a critical deployment. Maybe you've panicked when someone accidentally deleted it. Or maybe—just maybe—you've committed one with Azure credentials baked in and had a very awkward conversation with your security team.</p>
<p>Terraform state is simultaneously one of the most elegant solutions and one of the biggest operational headaches in infrastructure-as-code. Let's break down why it exists, where it hurts, and how it can spectacularly blow up in your face.</p>
<h2 id="heading-the-good-why-state-exists-in-the-first-place">THE GOOD: WHY STATE EXISTS IN THE FIRST PLACE</h2>
<p>Terraform's state file is basically a snapshot of your infrastructure at a given point in time. It maps your HCL configuration to real-world resources and stores metadata that Terraform needs to manage those resources effectively.</p>
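<p>Stripped to its essentials, that snapshot is just JSON. A hypothetical, heavily trimmed excerpt (real state files carry much more metadata, including provider details and dependency lists):</p>

```json
{
  "version": 4,
  "terraform_version": "1.7.0",
  "serial": 12,
  "resources": [
    {
      "mode": "managed",
      "type": "azurerm_linux_virtual_machine",
      "name": "web",
      "instances": [
        {
          "attributes": {
            "id": "/subscriptions/.../virtualMachines/web-vm",
            "size": "Standard_B1s"
          }
        }
      ]
    }
  ]
}
```

<p>The <code>id</code> is the mapping that matters: it ties the HCL block named <code>web</code> to one concrete Azure resource.</p>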
<h3 id="heading-it-solves-the-whats-actually-deployed-problem">It solves the "what's actually deployed" problem</h3>
<p>Without state, Terraform would have to query your cloud provider every single time to figure out what exists. That sounds reasonable until you're managing 500+ resources across multiple regions. State makes <code>terraform plan</code> fast by maintaining a local cache of your infrastructure.</p>
<pre><code class="lang-hcl"><span class="hljs-comment"># Your config says this:</span>
<span class="hljs-string">resource</span> <span class="hljs-string">"azurerm_linux_virtual_machine"</span> <span class="hljs-string">"web"</span> {
  <span class="hljs-string">name</span>                <span class="hljs-string">=</span> <span class="hljs-string">"web-vm"</span>
  <span class="hljs-string">resource_group_name</span> <span class="hljs-string">=</span> <span class="hljs-string">azurerm_resource_group.main.name</span>
  <span class="hljs-string">location</span>            <span class="hljs-string">=</span> <span class="hljs-string">"East US"</span>
  <span class="hljs-string">size</span>                <span class="hljs-string">=</span> <span class="hljs-string">"Standard_B1s"</span>
  <span class="hljs-comment"># ... other config</span>
}
<span class="hljs-string">​</span>
<span class="hljs-comment"># State file knows this exists with its Azure resource ID</span>
<span class="hljs-comment"># and can detect when you change size or other attributes</span>
</code></pre>
<h3 id="heading-it-tracks-resource-dependencies">It tracks resource dependencies</h3>
<p>Terraform builds a dependency graph from your configuration and state. This lets it know that your VM's network interface depends on your subnet, which depends on your VNet. When you destroy resources, it tears them down in the correct order. Try doing that manually at 2 AM during an incident—I'll wait.</p>
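<p>Those graph edges come from references between resources; Terraform infers them automatically, and <code>depends_on</code> exists for the rare cases it can't. A minimal sketch with hypothetical names:</p>

```hcl
resource "azurerm_virtual_network" "main" {
  name                = "vnet-main"
  address_space       = ["10.0.0.0/16"]
  location            = "East US"
  resource_group_name = azurerm_resource_group.main.name
}

resource "azurerm_subnet" "app" {
  name                 = "snet-app"
  resource_group_name  = azurerm_resource_group.main.name
  # Referencing the VNet by expression creates an implicit graph edge:
  # the subnet is created after, and destroyed before, the VNet.
  virtual_network_name = azurerm_virtual_network.main.name
  address_prefixes     = ["10.0.1.0/24"]
}
```

<p><code>terraform graph</code> will dump the whole graph in DOT format if you want to see the edges yourself.</p>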
<h3 id="heading-it-enables-collaboration-sort-of">It enables collaboration (sort of)</h3>
<p>When multiple engineers work on the same infrastructure, state acts as the source of truth. Combined with remote backends and locking, it prevents the classic "we both applied changes at the same time and now everything's broken" scenario.</p>
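<p>Wiring up that remote backend is only a few lines. With the <code>azurerm</code> backend, locking comes for free via blob leases (storage names below are hypothetical):</p>

```hcl
terraform {
  backend "azurerm" {
    resource_group_name  = "rg-tfstate"
    storage_account_name = "companytfstate"
    container_name       = "tfstate"
    key                  = "prod.terraform.tfstate"
    # Locking uses Azure Blob leases automatically; no extra lock
    # table or service is required.
  }
}
```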
<p>The state file also stores output values, which other Terraform configurations can reference:</p>
<pre><code class="lang-hcl"><span class="hljs-comment"># In your networking module</span>
<span class="hljs-string">output</span> <span class="hljs-string">"vnet_id"</span> {
  <span class="hljs-string">value</span> <span class="hljs-string">=</span> <span class="hljs-string">azurerm_virtual_network.main.id</span>
}
<span class="hljs-string">​</span>
<span class="hljs-comment"># Another team's config can use this</span>
<span class="hljs-string">data</span> <span class="hljs-string">"terraform_remote_state"</span> <span class="hljs-string">"network"</span> {
  <span class="hljs-string">backend</span> <span class="hljs-string">=</span> <span class="hljs-string">"azurerm"</span>
  <span class="hljs-string">config</span> <span class="hljs-string">=</span> {
    <span class="hljs-string">storage_account_name</span> <span class="hljs-string">=</span> <span class="hljs-string">"companytfstate"</span>
    <span class="hljs-string">container_name</span>       <span class="hljs-string">=</span> <span class="hljs-string">"tfstate"</span>
    <span class="hljs-string">key</span>                  <span class="hljs-string">=</span> <span class="hljs-string">"network.terraform.tfstate"</span>
  }
}
<span class="hljs-string">​</span>
<span class="hljs-string">resource</span> <span class="hljs-string">"azurerm_linux_virtual_machine"</span> <span class="hljs-string">"app"</span> {
  <span class="hljs-comment"># Reference the VNet from remote state</span>
  <span class="hljs-string">subnet_id</span> <span class="hljs-string">=</span> <span class="hljs-string">data.terraform_remote_state.network.outputs.subnet_id</span>
  <span class="hljs-comment"># ... other config</span>
}
</code></pre>
<p>This is actually pretty slick when it works. The problem is all the ways it doesn't work.</p>
<h2 id="heading-the-bad-common-pain-points-thatll-make-you-question-your-life-choices">THE BAD: COMMON PAIN POINTS THAT'LL MAKE YOU QUESTION YOUR LIFE CHOICES</h2>
<h3 id="heading-state-locking-conflicts">State locking conflicts</h3>
<p>You know that feeling when you run <code>terraform apply</code> and get hit with "Error acquiring the state lock"? That's state locking doing its job—preventing concurrent modifications. But when Jenkins crashes mid-apply or someone's laptop dies, that lock stays in place.</p>
<pre><code class="lang-plaintext"># The dreaded error
Error: Error acquiring the state lock
​
Error message: blob is already locked
Lock Info:
  ID:        a1b2c3d4-5678-90ef-ghij-klmnopqrstuv
  Path:      companytfstate/tfstate/prod.terraform.tfstate
  Operation: OperationTypeApply
  Who:       jenkins@ci-runner-42
  Created:   2026-02-04 14:23:17.123456789 +0000 UTC
</code></pre>
<p>Now you're stuck deciding: is that lock legitimate, or is it a zombie from a failed run? Force-unlock and you might corrupt state if someone's actually using it. Don't unlock and your deployment is blocked.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># The nuclear option</span>
terraform force-unlock a1b2c3d4-5678-90ef-ghij-klmnopqrstuv
​
<span class="hljs-comment"># Proceed with extreme caution</span>
</code></pre>
<h3 id="heading-state-drift-when-reality-diverges-from-code">State drift: When reality diverges from code</h3>
<p>Someone made a "quick fix" in the Azure Portal. Someone else manually scaled a VMSS. Now your state file thinks you have 3 instances, but you actually have 5. Terraform has no idea what happened.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Running refresh shows the horror</span>
$ terraform refresh

azurerm_linux_virtual_machine_scale_set.web: Refreshing state... [id=/subscriptions/.../virtualMachineScaleSets/web-vmss]

<span class="hljs-comment"># State now records 5 instances; your config still says 3.</span>
<span class="hljs-comment"># The next apply wants to scale back down. Keep them? Revert? Cry?</span>
</code></pre>
<p>Drift detection is possible with <code>terraform plan -refresh-only</code>, but it only tells you about resources Terraform knows about. If someone created resources outside of Terraform, you're blind to them until they cause an outage.</p>
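<p>The refresh-only workflow lets you review drift and then accept it into state without touching any infrastructure. A minimal sketch of the sequence:</p>
<pre><code class="lang-bash"># Show what changed outside Terraform, without updating state
terraform plan -refresh-only

# Accept the real-world values into state (no infrastructure changes)
terraform apply -refresh-only
</code></pre>
<p>After that, a normal <code>terraform plan</code> shows only the delta between your config and the now-accurate state.</p>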
<h3 id="heading-team-collaboration-nightmares">Team collaboration nightmares</h3>
<p>Even with remote state and locking, you'll hit these scenarios:</p>
<p><strong>Scenario 1: The race condition</strong></p>
<ul>
<li><p>Alice runs <code>terraform plan</code> at 3:00 PM</p>
</li>
<li><p>Bob runs <code>terraform apply</code> at 3:05 PM</p>
</li>
<li><p>Alice runs <code>terraform apply</code> at 3:10 PM using her stale plan</p>
</li>
<li><p>State is now inconsistent with both their changes</p>
</li>
</ul>
<p><strong>Scenario 2: The branch problem</strong></p>
<ul>
<li><p>Feature branches create competing versions of state</p>
</li>
<li><p>Merging code is easy; merging state files is... not a thing</p>
</li>
<li><p>You end up with orphaned resources nobody remembers creating</p>
</li>
</ul>
<p><strong>Scenario 3: The "who changed what" mystery</strong> State files don't have audit logs. Someone deleted your load balancer and you have no idea who or when. Good luck with that post-mortem.</p>
<h3 id="heading-performance-degradation-at-scale">Performance degradation at scale</h3>
<p>Once you hit thousands of resources, <code>terraform plan</code> slows to a crawl even with state caching. You'll start breaking up monolithic state files into smaller modules just to keep things manageable:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># This used to take 10 seconds, now takes 5 minutes</span>
$ terraform plan
Refreshing Terraform state in-memory prior to plan...
azurerm_virtual_network.main: Refreshing state... [id=/subscriptions/.../virtualNetworks/main-vnet]
azurerm_subnet.public[0]: Refreshing state... [id=/subscriptions/.../subnets/public-subnet-0]
<span class="hljs-comment"># ... 2,437 more resources to go</span>
</code></pre>
<h2 id="heading-the-ugly-security-disasters-and-state-corruption">THE UGLY: SECURITY DISASTERS AND STATE CORRUPTION</h2>
<p>This is where things get scary. State files are ticking time bombs if you don't handle them properly.</p>
<h3 id="heading-sensitive-data-exposure">Sensitive data exposure</h3>
<p>Here's the thing nobody tells you: <strong>Terraform state files store everything in plaintext</strong>. Passwords, API keys, database connection strings—all sitting there unencrypted.</p>
<pre><code class="lang-json">{
  <span class="hljs-attr">"version"</span>: <span class="hljs-number">4</span>,
  <span class="hljs-attr">"terraform_version"</span>: <span class="hljs-string">"1.6.0"</span>,
  <span class="hljs-attr">"resources"</span>: [
    {
      <span class="hljs-attr">"type"</span>: <span class="hljs-string">"azurerm_mssql_database"</span>,
      <span class="hljs-attr">"name"</span>: <span class="hljs-string">"main"</span>,
      <span class="hljs-attr">"instances"</span>: [
        {
          <span class="hljs-attr">"attributes"</span>: {
            <span class="hljs-attr">"administrator_login"</span>: <span class="hljs-string">"sqladmin"</span>,
            <span class="hljs-attr">"administrator_login_password"</span>: <span class="hljs-string">"SuperSecretPassword123!"</span>,
            <span class="hljs-attr">"fully_qualified_domain_name"</span>: <span class="hljs-string">"mydb.database.windows.net"</span>
          }
        }
      ]
    }
  ]
}
</code></pre>
<p>If you commit this to Git (don't laugh, it happens constantly), you've just handed attackers your database credentials. Even if you remove it in the next commit, it's in Git history forever.</p>
<p>If your Azure Storage Account permissions are misconfigured, anyone with read access can download your state blob and extract secrets. The same goes for Terraform Cloud: if everyone in your organization has broad workspace access, everyone can read your state.</p>
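<p>A cheap guardrail is a check that greps anything that looks like a state file for attribute names that commonly hold secrets. A heuristic sketch (the pattern list is illustrative, not exhaustive):</p>
<pre><code class="lang-bash"># scan_state: print lines in a state file whose attribute names
# suggest secrets (heuristic; extend the patterns for your providers)
scan_state() {
  grep -nE '"[a-z_]*(password|secret|access_key|connection_string)[a-z_]*"' "$1"
}
</code></pre>
<p>Wire it into a pre-commit hook or CI so a <code>*.tfstate</code> file showing up in a commit fails the build before it ever reaches Git history.</p>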
<h3 id="heading-state-corruption-and-the-recovery-nightmare">State corruption and the recovery nightmare</h3>
<p>State corruption usually happens when:</p>
<ul>
<li><p>A Terraform run gets interrupted mid-apply</p>
</li>
<li><p>Two applies run simultaneously (locking failed)</p>
</li>
<li><p>Manual edits go wrong</p>
</li>
<li><p>The state backend has issues (Azure Blob Storage transient errors, network timeouts)</p>
</li>
</ul>
<p>When state corrupts, you're in for a bad time:</p>
<pre><code class="lang-bash">$ terraform plan
Error: Failed to load state: state snapshot was created by Terraform v1.6.0,
<span class="hljs-built_in">which</span> is newer than current v1.5.0; upgrade to Terraform v1.6.0 or greater to work with this state
​
<span class="hljs-comment"># Or worse...</span>
Error: Failed to decode state: invalid character <span class="hljs-string">'x'</span> looking <span class="hljs-keyword">for</span> beginning of value
</code></pre>
<p>Now you're trying to restore from backups (you have backups, right?), manually editing JSON to fix corruption, or using <code>terraform import</code> to rebuild state from scratch. For every. Single. Resource.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># The tedious recovery process</span>
$ terraform import azurerm_linux_virtual_machine.web /subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/mygroup/providers/Microsoft.Compute/virtualMachines/web-vm
$ terraform import azurerm_network_security_group.web /subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/mygroup/providers/Microsoft.Network/networkSecurityGroups/web-nsg
<span class="hljs-comment"># ... repeat 500 more times</span>
</code></pre>
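<p>If you're on Terraform 1.5 or later, <code>import</code> blocks make this less painful: declare the imports in configuration and apply them in one plan instead of running the CLI once per resource. A sketch (resource names and IDs are illustrative):</p>
<pre><code class="lang-hcl">import {
  to = azurerm_linux_virtual_machine.web
  id = "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/mygroup/providers/Microsoft.Compute/virtualMachines/web-vm"
}

import {
  to = azurerm_network_security_group.web
  id = "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/mygroup/providers/Microsoft.Network/networkSecurityGroups/web-nsg"
}
</code></pre>
<p><code>terraform plan -generate-config-out=generated.tf</code> can even scaffold the matching resource blocks for you.</p>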
<h3 id="heading-the-someone-deleted-production-disaster">The "someone deleted production" disaster</h3>
<p>The ultimate nightmare: someone runs <code>terraform destroy</code> on production state by accident. Or they delete the Azure Storage Account container holding state. Or they push broken state that marks everything for deletion.</p>
<pre><code class="lang-bash"><span class="hljs-comment"># This should require two-factor authentication and a blood oath</span>
$ terraform destroy
<span class="hljs-comment"># ...</span>
Destroy complete! Resources: 437 destroyed.
​
<span class="hljs-comment"># Congratulations, you've just deleted production</span>
</code></pre>
<p>Without backups and versioning enabled on your state backend, recovery is somewhere between "extremely difficult" and "update your resume."</p>
<h2 id="heading-best-practices-making-peace-with-state">BEST PRACTICES: MAKING PEACE WITH STATE</h2>
<p>Alright, enough doom and gloom. Here's how to avoid these disasters.</p>
<h3 id="heading-always-use-remote-backends">Always use remote backends</h3>
<p>Never, ever keep state files local. Use Azure Storage, Terraform Cloud, or another remote backend with versioning and encryption.</p>
<pre><code class="lang-hcl">terraform {
  backend "azurerm" {
    resource_group_name  = "terraform-state-rg"
    storage_account_name = "companytfstate"
    container_name       = "tfstate"
    key                  = "prod.terraform.tfstate"

    # State locking is built-in with blob leases
    # No separate lock table needed like DynamoDB
  }
}
</code></pre>
<p>Make sure your Azure Storage Account has:</p>
<ul>
<li><p>Versioning enabled (recover from accidental deletes/corruption)</p>
</li>
<li><p>Encryption at rest (enabled by default with Microsoft-managed keys or use CMK)</p>
</li>
<li><p>Strict RBAC policies (principle of least privilege)</p>
</li>
<li><p>Soft delete enabled (provides undelete capability)</p>
</li>
<li><p>Diagnostic logging enabled (audit trail)</p>
</li>
</ul>
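<p>Since the state backend is itself infrastructure, you can codify most of that checklist. A hedged sketch for the azurerm provider (names, location, and retention periods are illustrative):</p>
<pre><code class="lang-hcl">resource "azurerm_storage_account" "tfstate" {
  name                     = "companytfstate"
  resource_group_name      = "terraform-state-rg"
  location                 = "westeurope"
  account_tier             = "Standard"
  account_replication_type = "GRS"
  min_tls_version          = "TLS1_2"

  blob_properties {
    versioning_enabled = true

    delete_retention_policy {
      days = 30 # soft delete window for blobs
    }
  }
}
</code></pre>
<p>Manage this bootstrap stack in its own small, rarely-changing state so a bad apply elsewhere can't take out the backend it lives in.</p>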
<h3 id="heading-state-locking-with-azure-blob-leases">State locking with Azure blob leases</h3>
<p>State locking prevents concurrent modifications. The good news? Azure Storage handles this automatically using blob leases—no separate infrastructure required.</p>
<p>When Terraform acquires a lock, it creates a lease on the state blob. If another process tries to modify the same state, it fails immediately:</p>
<pre><code class="lang-plaintext"># Azure handles locking automatically via blob leases
# No additional resources to provision
# The lease is held on the state blob for the duration of the operation
# and released when the run completes (or left dangling if it crashes)
</code></pre>
<p>You can verify locking behavior by checking blob properties in the Azure Portal or via CLI:</p>
<pre><code class="lang-bash">az storage blob show \
  --account-name companytfstate \
  --container-name tfstate \
  --name prod.terraform.tfstate \
  --query <span class="hljs-string">"properties.lease"</span> -o table
</code></pre>
<p>The lease status will show "locked" when Terraform is actively modifying state.</p>
<h3 id="heading-never-commit-secrets-to-terraform-config">Never commit secrets to Terraform config</h3>
<p>Use secret management tools and reference them dynamically:</p>
<pre><code class="lang-hcl"># BAD: Hardcoded secrets
resource "azurerm_mssql_server" "main" {
  administrator_login_password = "SuperSecretPassword123!"  # Don't do this
}

# GOOD: Reference from Azure Key Vault
data "azurerm_key_vault" "main" {
  name                = "company-keyvault"
  resource_group_name = "shared-services-rg"
}

data "azurerm_key_vault_secret" "db_password" {
  name         = "sql-admin-password"
  key_vault_id = data.azurerm_key_vault.main.id
}

resource "azurerm_mssql_server" "main" {
  administrator_login_password = data.azurerm_key_vault_secret.db_password.value
}
</code></pre>
<p>The secret still ends up in state, but at least it's not in your Git repo. For extra paranoia, use customer-managed keys (CMK) from Azure Key Vault to encrypt the storage account that holds state, and configure RBAC to restrict access.</p>
<h3 id="heading-master-state-manipulation-commands">Master state manipulation commands</h3>
<p>Sometimes you need to manually fix state. Learn these commands:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># Import existing resources into state</span>
terraform import azurerm_linux_virtual_machine.example /subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/mygroup/providers/Microsoft.Compute/virtualMachines/example-vm
​
<span class="hljs-comment"># Remove a resource from state (without deleting the actual resource)</span>
terraform state rm azurerm_linux_virtual_machine.old_server
​
<span class="hljs-comment"># Move a resource to a different address</span>
terraform state mv azurerm_linux_virtual_machine.old azurerm_linux_virtual_machine.new
​
<span class="hljs-comment"># List all resources in state</span>
terraform state list
​
<span class="hljs-comment"># Show details about a specific resource</span>
terraform state show azurerm_linux_virtual_machine.web
</code></pre>
<p>Use these carefully—they directly modify state. Always back up state before manual operations:</p>
<pre><code class="lang-bash">terraform state pull &gt; backup.tfstate
</code></pre>
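<p>Restoring goes the other way with <code>terraform state push</code>, which refuses a mismatched lineage or an older serial number unless you force it. A cautious sketch:</p>
<pre><code class="lang-bash"># Take a local backup before any manual surgery
terraform state pull &gt; backup.tfstate

# ...if things go wrong, restore it (verify the file contents first!)
terraform state push backup.tfstate
</code></pre>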
<h3 id="heading-implement-proper-cicd-workflows">Implement proper CI/CD workflows</h3>
<p>Don't let humans run <code>terraform apply</code> from their laptops. Use CI/CD pipelines with:</p>
<ul>
<li><p>Automatic <code>terraform plan</code> on pull requests</p>
</li>
<li><p>Required approvals before apply</p>
</li>
<li><p>State locking handled automatically</p>
</li>
<li><p>Audit logs of who applied what</p>
</li>
<li><p>Separate state files per environment</p>
</li>
</ul>
<pre><code class="lang-yaml"><span class="hljs-comment"># Example GitHub Actions workflow</span>
<span class="hljs-attr">name:</span> <span class="hljs-string">Terraform</span>
<span class="hljs-attr">on:</span>
  <span class="hljs-attr">pull_request:</span>
    <span class="hljs-attr">paths:</span> [<span class="hljs-string">'terraform/**'</span>]
<span class="hljs-string">​</span>
<span class="hljs-attr">jobs:</span>
  <span class="hljs-attr">plan:</span>
    <span class="hljs-attr">runs-on:</span> <span class="hljs-string">ubuntu-latest</span>
    <span class="hljs-attr">steps:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">uses:</span> <span class="hljs-string">actions/checkout@v3</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">uses:</span> <span class="hljs-string">hashicorp/setup-terraform@v2</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">run:</span> <span class="hljs-string">terraform</span> <span class="hljs-string">init</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">run:</span> <span class="hljs-string">terraform</span> <span class="hljs-string">plan</span> <span class="hljs-string">-out=tfplan</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">uses:</span> <span class="hljs-string">actions/upload-artifact@v3</span>
        <span class="hljs-attr">with:</span>
          <span class="hljs-attr">name:</span> <span class="hljs-string">tfplan</span>
          <span class="hljs-attr">path:</span> <span class="hljs-string">tfplan</span>
</code></pre>
<h3 id="heading-split-state-into-manageable-pieces">Split state into manageable pieces</h3>
<p>Don't put everything in one giant state file. Break it up by:</p>
<ul>
<li><p>Environment (dev, staging, prod)</p>
</li>
<li><p>Layer (networking, compute, data)</p>
</li>
<li><p>Team ownership (platform, security, apps)</p>
</li>
</ul>
<pre><code class="lang-plaintext">terraform/
├── networking/
│   ├── backend.tf
│   └── main.tf
├── compute/
│   ├── backend.tf
│   └── main.tf
└── data/
    ├── backend.tf
    └── main.tf
</code></pre>
<p>This limits blast radius and improves performance. Just be careful with cross-stack dependencies.</p>
<h2 id="heading-wrapping-up">WRAPPING UP</h2>
<p>Terraform state is a necessary evil. It's the price you pay for declarative infrastructure management. The good news is that the benefits—dependency tracking, collaboration, drift detection—outweigh the operational overhead once you've got proper workflows in place.</p>
<p>The bad news is that you'll definitely learn these lessons the hard way at least once. When you do, remember:</p>
<ol>
<li><p>Remote state with versioning and locking is non-negotiable</p>
</li>
<li><p>Secrets in state files are a security risk—treat state like sensitive data</p>
</li>
<li><p>State corruption happens—have backups and know how to restore</p>
</li>
<li><p>Manual state operations are powerful and dangerous—measure twice, cut once</p>
</li>
</ol>
<p>And if you ever find yourself running <code>terraform destroy</code> in production, maybe take a coffee break first and verify you're in the right directory. Your future self will thank you.</p>
]]></content:encoded></item><item><title><![CDATA[Managing Azure DevOps pipeline templates: patterns and best practices]]></title><description><![CDATA[If you've worked with Azure DevOps for any length of time, you've probably noticed that pipeline YAML files have a tendency to sprawl. What starts as a simple build-and-deploy pipeline quickly turns into hundreds of lines of duplicated logic scattere...]]></description><link>https://blog.vivekdhami.com/managing-azure-devops-pipeline-templates-patterns-and-best-practices</link><guid isPermaLink="true">https://blog.vivekdhami.com/managing-azure-devops-pipeline-templates-patterns-and-best-practices</guid><category><![CDATA[azure-devops]]></category><category><![CDATA[Devops]]></category><category><![CDATA[ci-cd]]></category><category><![CDATA[CI/CD pipelines]]></category><dc:creator><![CDATA[Vivek Dhami]]></dc:creator><pubDate>Sun, 14 Apr 2024 22:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/9AxFJaNySB8/upload/2cc72cd8dc8e19cdac111c9821014623.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>If you've worked with Azure DevOps for any length of time, you've probably noticed that pipeline YAML files have a tendency to sprawl. What starts as a simple build-and-deploy pipeline quickly turns into hundreds of lines of duplicated logic scattered across dozens of repositories. Enter pipeline templates—ADO's answer to the DRY principle for CI/CD workflows.</p>
<p>In this post, we'll walk through the fundamentals of Azure DevOps pipelines, explore why centralized templates matter, and dive into practical patterns for structuring, versioning, and managing templates at scale.</p>
<h2 id="heading-azure-devops-pipelines-the-building-blocks">AZURE DEVOPS PIPELINES: THE BUILDING BLOCKS</h2>
<p>Before we get into templates, let's establish the core concepts. Azure DevOps pipelines are built on three main abstractions:</p>
<p><strong>Pipelines</strong> are the top-level containers. They define when and how your CI/CD process runs. A pipeline can be triggered by code commits, pull requests, scheduled runs, or manual invocations.</p>
<p><strong>Jobs</strong> represent units of work that run on an agent. Each job executes independently and can run in parallel with other jobs. Jobs are where you specify the agent pool, define dependencies, and set conditions.</p>
<p><strong>Tasks</strong> are the individual steps within a job—think "run this shell script," "build this Docker image," or "deploy to Kubernetes." Microsoft provides a bunch of built-in tasks, and you can also create custom ones.</p>
<p>Here's a minimal pipeline to illustrate:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">trigger:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-string">main</span>
<span class="hljs-string">​</span>
<span class="hljs-attr">pool:</span>
  <span class="hljs-attr">vmImage:</span> <span class="hljs-string">'ubuntu-latest'</span>
<span class="hljs-string">​</span>
<span class="hljs-attr">jobs:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">job:</span> <span class="hljs-string">Build</span>
    <span class="hljs-attr">steps:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">task:</span> <span class="hljs-string">UseDotNet@2</span>
        <span class="hljs-attr">inputs:</span>
          <span class="hljs-attr">version:</span> <span class="hljs-string">'8.x'</span>
<span class="hljs-string">​</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">script:</span> <span class="hljs-string">dotnet</span> <span class="hljs-string">build</span>
        <span class="hljs-attr">displayName:</span> <span class="hljs-string">'Build application'</span>
<span class="hljs-string">​</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">script:</span> <span class="hljs-string">dotnet</span> <span class="hljs-string">test</span>
        <span class="hljs-attr">displayName:</span> <span class="hljs-string">'Run tests'</span>
</code></pre>
<p>Simple enough. But once you've got ten repos with similar build processes, you'll be copy-pasting this YAML everywhere. And when you need to update the .NET version or add a security scan? Good luck hunting down every instance.</p>
<h2 id="heading-why-centralize-with-templates">WHY CENTRALIZE WITH TEMPLATES</h2>
<p>Pipeline templates let you extract common logic into reusable files that multiple pipelines can reference. This gives you a few critical benefits:</p>
<p><strong>Consistency across teams.</strong> When everyone uses the same template for Docker builds or Kubernetes deployments, you eliminate configuration drift. Security policies, compliance checks, and best practices get baked in automatically.</p>
<p><strong>Easier maintenance.</strong> Need to add container scanning to every build? Update the template once, and it propagates to all consumers. No PR marathons across dozens of repos.</p>
<p><strong>Reduced cognitive load.</strong> Developers shouldn't need to be YAML experts or remember the incantations for artifact publishing. Templates abstract away complexity and let teams focus on their application logic.</p>
<p><strong>Governance and guardrails.</strong> Templates can enforce organizational standards—like requiring approval stages for production deploys or mandating specific test coverage thresholds.</p>
<h3 id="heading-a-simple-template-example">A Simple Template Example</h3>
<p>Let's say you want to standardize how your teams build and push Docker images. You'd create a template file in a central repository:</p>
<pre><code class="lang-yaml"><span class="hljs-comment"># templates/docker-build.yml</span>
<span class="hljs-attr">parameters:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">dockerfilePath</span>
    <span class="hljs-attr">type:</span> <span class="hljs-string">string</span>
    <span class="hljs-attr">default:</span> <span class="hljs-string">'./Dockerfile'</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">imageName</span>
    <span class="hljs-attr">type:</span> <span class="hljs-string">string</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">tags</span>
    <span class="hljs-attr">type:</span> <span class="hljs-string">object</span>
    <span class="hljs-attr">default:</span> [<span class="hljs-string">'latest'</span>]
<span class="hljs-string">​</span>
<span class="hljs-attr">steps:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">task:</span> <span class="hljs-string">Docker@2</span>
    <span class="hljs-attr">displayName:</span> <span class="hljs-string">'Build Docker image'</span>
    <span class="hljs-attr">inputs:</span>
      <span class="hljs-attr">command:</span> <span class="hljs-string">'build'</span>
      <span class="hljs-attr">Dockerfile:</span> <span class="hljs-string">${{</span> <span class="hljs-string">parameters.dockerfilePath</span> <span class="hljs-string">}}</span>
      <span class="hljs-attr">tags:</span> <span class="hljs-string">${{</span> <span class="hljs-string">join(',',</span> <span class="hljs-string">parameters.tags)</span> <span class="hljs-string">}}</span>
<span class="hljs-string">​</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">task:</span> <span class="hljs-string">Docker@2</span>
    <span class="hljs-attr">displayName:</span> <span class="hljs-string">'Push to registry'</span>
    <span class="hljs-attr">inputs:</span>
      <span class="hljs-attr">command:</span> <span class="hljs-string">'push'</span>
      <span class="hljs-attr">repository:</span> <span class="hljs-string">${{</span> <span class="hljs-string">parameters.imageName</span> <span class="hljs-string">}}</span>
      <span class="hljs-attr">tags:</span> <span class="hljs-string">${{</span> <span class="hljs-string">join(',',</span> <span class="hljs-string">parameters.tags)</span> <span class="hljs-string">}}</span>
<span class="hljs-string">​</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">script:</span> <span class="hljs-string">|
      docker scan ${{ parameters.imageName }}:latest
</span>    <span class="hljs-attr">displayName:</span> <span class="hljs-string">'Scan image for vulnerabilities'</span>
</code></pre>
<p>Now any pipeline can consume this template:</p>
<pre><code class="lang-yaml"><span class="hljs-comment"># azure-pipelines.yml</span>
<span class="hljs-attr">resources:</span>
  <span class="hljs-attr">repositories:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">repository:</span> <span class="hljs-string">templates</span>
      <span class="hljs-attr">type:</span> <span class="hljs-string">git</span>
      <span class="hljs-attr">name:</span> <span class="hljs-string">DevOps/pipeline-templates</span>
      <span class="hljs-attr">ref:</span> <span class="hljs-string">refs/tags/v1.2.0</span>
<span class="hljs-string">​</span>
<span class="hljs-attr">trigger:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-string">main</span>
<span class="hljs-string">​</span>
<span class="hljs-attr">pool:</span>
  <span class="hljs-attr">vmImage:</span> <span class="hljs-string">'ubuntu-latest'</span>
<span class="hljs-string">​</span>
<span class="hljs-attr">jobs:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">job:</span> <span class="hljs-string">BuildAndPush</span>
    <span class="hljs-attr">steps:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">template:</span> <span class="hljs-string">templates/docker-build.yml@templates</span>
        <span class="hljs-attr">parameters:</span>
          <span class="hljs-attr">imageName:</span> <span class="hljs-string">'myapp'</span>
          <span class="hljs-attr">tags:</span> [<span class="hljs-string">'latest'</span>, <span class="hljs-string">'$(Build.BuildId)'</span>]
</code></pre>
<p>Notice the <code>@templates</code> reference—that tells ADO to pull the template from the external repository we defined in <code>resources.repositories</code>. This pattern decouples template definitions from consuming pipelines, which is key for centralized management.</p>
<h2 id="heading-versioning-and-update-management">VERSIONING AND UPDATE MANAGEMENT</h2>
<p>One of the trickiest parts of managing templates is handling updates without breaking existing pipelines. Here's how to approach it:</p>
<h3 id="heading-use-git-tags-for-template-versions">Use Git Tags for Template Versions</h3>
<p>Store your templates in a dedicated Git repository and tag releases using semantic versioning (e.g., <code>v1.0.0</code>, <code>v1.1.0</code>, <code>v2.0.0</code>). Consuming pipelines reference specific tags:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">resources:</span>
  <span class="hljs-attr">repositories:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">repository:</span> <span class="hljs-string">templates</span>
      <span class="hljs-attr">type:</span> <span class="hljs-string">git</span>
      <span class="hljs-attr">name:</span> <span class="hljs-string">DevOps/pipeline-templates</span>
      <span class="hljs-attr">ref:</span> <span class="hljs-string">refs/tags/v1.2.0</span>  <span class="hljs-comment"># Pin to a specific version</span>
</code></pre>
<p>This gives teams control over when they adopt new template versions. You can test changes in a staging pipeline before rolling them out broadly.</p>
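<p>Cutting a release is then just an annotated tag from <code>main</code> (version number and message are illustrative):</p>
<pre><code class="lang-bash">git checkout main
git pull
git tag -a v1.3.0 -m "docker-build: add image scan step"
git push origin v1.3.0
</code></pre>
<p>Consumers upgrade by bumping the <code>ref</code> in their <code>resources.repositories</code> block on their own schedule.</p>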
<h3 id="heading-branch-strategies-for-development">Branch Strategies for Development</h3>
<p>For active template development, maintain separate branches:</p>
<ul>
<li><p><code>main</code>: Production-ready, stable templates</p>
</li>
<li><p><code>develop</code>: Integration branch for new features</p>
</li>
<li><p><code>hotfix/*</code>: Quick fixes for critical bugs</p>
</li>
</ul>
<p>Tag releases from <code>main</code> and document breaking changes in your release notes. Use conventional commits to make changelog generation easier.</p>
<h3 id="heading-communicating-breaking-changes">Communicating Breaking Changes</h3>
<p>When you need to introduce breaking changes (like renaming a parameter or changing default behavior), follow this pattern:</p>
<ol>
<li><p><strong>Deprecate, don't delete.</strong> Mark old parameters as deprecated and support them for at least one major version.</p>
</li>
<li><p><strong>Provide migration guides.</strong> Document what teams need to change in their pipelines.</p>
</li>
<li><p><strong>Use major version bumps.</strong> Go from <code>v1.x.x</code> to <code>v2.0.0</code> to signal incompatibility.</p>
</li>
</ol>
<p>Example deprecation:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">parameters:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">buildConfiguration</span>  <span class="hljs-comment"># New parameter</span>
    <span class="hljs-attr">type:</span> <span class="hljs-string">string</span>
    <span class="hljs-attr">default:</span> <span class="hljs-string">'Release'</span>
<span class="hljs-string">​</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">configuration</span>  <span class="hljs-comment"># Deprecated</span>
    <span class="hljs-attr">type:</span> <span class="hljs-string">string</span>
    <span class="hljs-attr">default:</span> <span class="hljs-string">''</span>
<span class="hljs-string">​</span>
<span class="hljs-attr">steps:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">script:</span> <span class="hljs-string">|
      if [ -n "${{ parameters.configuration }}" ]; then
        echo "##vso[task.logissue type=warning]Parameter 'configuration' is deprecated. Use 'buildConfiguration' instead."
      fi
</span>    <span class="hljs-attr">displayName:</span> <span class="hljs-string">'Check for deprecated parameters'</span>
</code></pre>
<h3 id="heading-tracking-template-usage">Tracking Template Usage</h3>
<p>Azure DevOps doesn't have built-in analytics for template usage, so you'll need to get creative. Some options:</p>
<ul>
<li><p><strong>Search across repos</strong> using Git grep or ADO's code search to find references to specific template versions.</p>
</li>
<li><p><strong>Pipeline analytics dashboard</strong> that parses pipeline definitions and surfaces which templates are in use.</p>
</li>
<li><p><strong>Automated dependency scanning</strong> as part of your template repository's CI to flag pipelines still using old versions.</p>
</li>
</ul>
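<p>For the first option, a scheduled audit pipeline can do the searching for you. This is a rough sketch, not a drop-in solution: it assumes the <code>azure-devops</code> CLI extension is available on the agent, that the build identity can read every repository, and that consumers pin templates with <code>refs/tags/vX.Y.Z</code> references:</p>
<pre><code class="lang-yaml"># Sketch of a weekly template-usage audit (repo access and grep pattern are assumptions)
schedules:
  - cron: '0 6 * * 1'
    displayName: 'Weekly template usage audit'
    branches:
      include:
        - main
    always: true

steps:
  - script: |
      az extension add --name azure-devops --only-show-errors
      export AZURE_DEVOPS_EXT_PAT=$(System.AccessToken)
      # Shallow-clone each repo in the project and report pinned template versions
      for url in $(az repos list \
          --organization "$(System.CollectionUri)" \
          --project "$(System.TeamProject)" \
          --query '[].remoteUrl' -o tsv); do
        repo=$(basename "$url")
        git -c http.extraHeader="Authorization: Bearer $(System.AccessToken)" \
          clone --quiet --depth 1 "$url" "$repo" || continue
        echo "== $repo =="
        grep -rhoE 'refs/tags/v[0-9]+\.[0-9]+\.[0-9]+' "$repo" --include='*.yml' | sort | uniq -c
      done
    displayName: 'Report template versions in use'
</code></pre>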
<h3 id="heading-automating-version-management-with-git">Automating Version Management with Git</h3>
<p>Manual version bumping is tedious and error-prone. Instead, leverage Git and conventional commits to automate versioning and release note generation. Here's how to set it up:</p>
<p><strong>Use Conventional Commits</strong></p>
<p>Adopt the <a target="_blank" href="https://www.conventionalcommits.org/">Conventional Commits</a> specification for your template repository. This standard formats commit messages in a machine-readable way:</p>
<pre><code class="lang-plaintext">feat: add terraform-apply job template
fix: correct parameter validation in docker-build template
docs: update README with usage examples
feat!: rename buildConfig parameter to buildConfiguration
</code></pre>
<p>The commit type (<code>feat</code>, <code>fix</code>, <code>docs</code>, <code>chore</code>, etc.) determines how the version number gets bumped:</p>
<ul>
<li><p><code>feat</code> → minor version bump (1.2.0 → 1.3.0)</p>
</li>
<li><p><code>fix</code> → patch version bump (1.2.0 → 1.2.1)</p>
</li>
<li><p><code>BREAKING CHANGE</code> footer or <code>!</code> suffix → major version bump (1.2.0 → 2.0.0)</p>
</li>
</ul>
<p><strong>Automate Versioning with GitVersion</strong></p>
<p><a target="_blank" href="https://gitversion.net/">GitVersion</a> is a tool that derives semantic version numbers from your Git history. It works great with Azure DevOps pipelines:</p>
<pre><code class="lang-yaml"><span class="hljs-comment"># azure-pipelines.yml (in your template repo)</span>
<span class="hljs-attr">trigger:</span>
  <span class="hljs-attr">branches:</span>
    <span class="hljs-attr">include:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">main</span>
<span class="hljs-string">​</span>
<span class="hljs-attr">pool:</span>
  <span class="hljs-attr">vmImage:</span> <span class="hljs-string">'ubuntu-latest'</span>
<span class="hljs-string">​</span>
<span class="hljs-attr">steps:</span>
  <span class="hljs-comment"># GitVersion needs full history; the tag push needs persisted credentials</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">checkout:</span> <span class="hljs-string">self</span>
    <span class="hljs-attr">fetchDepth:</span> <span class="hljs-number">0</span>
    <span class="hljs-attr">persistCredentials:</span> <span class="hljs-literal">true</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">task:</span> <span class="hljs-string">gitversion/setup@0</span>
    <span class="hljs-attr">displayName:</span> <span class="hljs-string">'Install GitVersion'</span>
    <span class="hljs-attr">inputs:</span>
      <span class="hljs-attr">versionSpec:</span> <span class="hljs-string">'5.x'</span>
<span class="hljs-string">​</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">task:</span> <span class="hljs-string">gitversion/execute@0</span>
    <span class="hljs-attr">displayName:</span> <span class="hljs-string">'Calculate version'</span>
    <span class="hljs-attr">inputs:</span>
      <span class="hljs-attr">useConfigFile:</span> <span class="hljs-literal">true</span>
      <span class="hljs-attr">configFilePath:</span> <span class="hljs-string">'GitVersion.yml'</span>
<span class="hljs-string">​</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">script:</span> <span class="hljs-string">|
      echo "Version: $(GitVersion.SemVer)"
      echo "##vso[build.updatebuildnumber]$(GitVersion.SemVer)"
</span>    <span class="hljs-attr">displayName:</span> <span class="hljs-string">'Display version'</span>
<span class="hljs-string">​</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">script:</span> <span class="hljs-string">|
      git tag v$(GitVersion.SemVer)
      git push origin v$(GitVersion.SemVer)
</span>    <span class="hljs-attr">displayName:</span> <span class="hljs-string">'Create and push Git tag'</span>
    <span class="hljs-attr">condition:</span> <span class="hljs-string">and(succeeded(),</span> <span class="hljs-string">eq(variables['Build.SourceBranch'],</span> <span class="hljs-string">'refs/heads/main'</span><span class="hljs-string">))</span>
</code></pre>
<p>Configure GitVersion with a <code>GitVersion.yml</code> file:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">mode:</span> <span class="hljs-string">ContinuousDeployment</span>
<span class="hljs-attr">branches:</span>
  <span class="hljs-attr">main:</span>
    <span class="hljs-attr">tag:</span> <span class="hljs-string">''</span>
  <span class="hljs-attr">develop:</span>
    <span class="hljs-attr">tag:</span> <span class="hljs-string">'beta'</span>
  <span class="hljs-attr">feature:</span>
    <span class="hljs-attr">tag:</span> <span class="hljs-string">'alpha'</span>
<span class="hljs-attr">ignore:</span>
  <span class="hljs-attr">sha:</span> []
</code></pre>
<p>Now every merge to <code>main</code> automatically calculates the next version and creates a Git tag. No manual intervention required.</p>
<p><strong>Generate Changelogs Automatically</strong></p>
<p>Once you're using conventional commits, generating release notes is straightforward. You can use tools like <a target="_blank" href="https://github.com/conventional-changelog/conventional-changelog">conventional-changelog</a> or <a target="_blank" href="https://git-cliff.org/">git-cliff</a>.</p>
<p>Here's a pipeline task that generates a changelog:</p>
<pre><code class="lang-yaml"><span class="hljs-bullet">-</span> <span class="hljs-attr">task:</span> <span class="hljs-string">NodeTool@0</span>
  <span class="hljs-attr">inputs:</span>
    <span class="hljs-attr">versionSpec:</span> <span class="hljs-string">'18.x'</span>
  <span class="hljs-attr">displayName:</span> <span class="hljs-string">'Install Node.js'</span>
<span class="hljs-string">​</span>
<span class="hljs-bullet">-</span> <span class="hljs-attr">script:</span> <span class="hljs-string">|
    npm install -g conventional-changelog-cli
    conventional-changelog -p angular -i CHANGELOG.md -s -r 0
</span>  <span class="hljs-attr">displayName:</span> <span class="hljs-string">'Generate changelog'</span>
<span class="hljs-string">​</span>
<span class="hljs-bullet">-</span> <span class="hljs-attr">script:</span> <span class="hljs-string">|
    git config user.name "Pipeline Bot"
    git config user.email "pipeline@company.com"
    git add CHANGELOG.md
    git commit -m "docs: update changelog for $(GitVersion.SemVer)"
    git push origin HEAD:main
</span>  <span class="hljs-attr">displayName:</span> <span class="hljs-string">'Commit changelog'</span>
  <span class="hljs-attr">condition:</span> <span class="hljs-string">and(succeeded(),</span> <span class="hljs-string">eq(variables['Build.SourceBranch'],</span> <span class="hljs-string">'refs/heads/main'</span><span class="hljs-string">))</span>
</code></pre>
<p>This generates a nicely formatted changelog grouped by commit type:</p>
<pre><code class="lang-plaintext">## [1.3.0] - 2026-02-05
​
### Features
* add terraform-apply job template
* support for multi-cloud deployments
​
### Bug Fixes
* correct parameter validation in docker-build template
* fix kubectl installation on Windows agents
​
### BREAKING CHANGES
* rename buildConfig parameter to buildConfiguration
  Migration: Update all pipeline references from `buildConfig` to `buildConfiguration`
</code></pre>
<p><strong>Create Azure DevOps Releases</strong></p>
<p>Take it a step further and automatically create releases with the generated changelog using the Azure DevOps REST API:</p>
<pre><code class="lang-yaml"><span class="hljs-bullet">-</span> <span class="hljs-attr">task:</span> <span class="hljs-string">PowerShell@2</span>
  <span class="hljs-attr">displayName:</span> <span class="hljs-string">'Create Azure DevOps Release'</span>
  <span class="hljs-attr">inputs:</span>
    <span class="hljs-attr">targetType:</span> <span class="hljs-string">'inline'</span>
    <span class="hljs-attr">script:</span> <span class="hljs-string">|
      $changelog = Get-Content CHANGELOG.md -Raw
      $body = @{
        name = "v$(GitVersion.SemVer)"
        description = $changelog
        tagName = "v$(GitVersion.SemVer)"
      } | ConvertTo-Json
</span><span class="hljs-string">​</span>
      <span class="hljs-string">$headers</span> <span class="hljs-string">=</span> <span class="hljs-string">@{</span>
        <span class="hljs-string">Authorization</span> <span class="hljs-string">=</span> <span class="hljs-string">"Bearer $(System.AccessToken)"</span>
        <span class="hljs-string">"Content-Type"</span> <span class="hljs-string">=</span> <span class="hljs-string">"application/json"</span>
      <span class="hljs-string">}</span>
<span class="hljs-string">​</span>
      <span class="hljs-string">$url</span> <span class="hljs-string">=</span> <span class="hljs-string">"$(System.CollectionUri)$(System.TeamProject)/_apis/git/repositories/$(Build.Repository.ID)/releases?api-version=7.1-preview.1"</span>
      <span class="hljs-string">Invoke-RestMethod</span> <span class="hljs-string">-Uri</span> <span class="hljs-string">$url</span> <span class="hljs-string">-Method</span> <span class="hljs-string">Post</span> <span class="hljs-string">-Headers</span> <span class="hljs-string">$headers</span> <span class="hljs-string">-Body</span> <span class="hljs-string">$body</span>
</code></pre>
<p><strong>Enforce Commit Message Standards</strong></p>
<p>To ensure everyone follows conventional commits, add validation to your PR pipeline:</p>
<pre><code class="lang-yaml"><span class="hljs-bullet">-</span> <span class="hljs-attr">task:</span> <span class="hljs-string">NodeTool@0</span>
  <span class="hljs-attr">inputs:</span>
    <span class="hljs-attr">versionSpec:</span> <span class="hljs-string">'18.x'</span>
<span class="hljs-string">​</span>
<span class="hljs-bullet">-</span> <span class="hljs-attr">script:</span> <span class="hljs-string">|
    npm install -g @commitlint/cli @commitlint/config-conventional
    npx commitlint --from HEAD~1 --to HEAD --verbose
</span>  <span class="hljs-attr">displayName:</span> <span class="hljs-string">'Validate commit messages'</span>
</code></pre>
<p>Create a <code>commitlint.config.js</code> file:</p>
<pre><code class="lang-javascript"><span class="hljs-built_in">module</span>.exports = {
  <span class="hljs-attr">extends</span>: [<span class="hljs-string">'@commitlint/config-conventional'</span>],
  <span class="hljs-attr">rules</span>: {
    <span class="hljs-string">'type-enum'</span>: [<span class="hljs-number">2</span>, <span class="hljs-string">'always'</span>, [
      <span class="hljs-string">'feat'</span>, <span class="hljs-string">'fix'</span>, <span class="hljs-string">'docs'</span>, <span class="hljs-string">'style'</span>, <span class="hljs-string">'refactor'</span>,
      <span class="hljs-string">'test'</span>, <span class="hljs-string">'chore'</span>, <span class="hljs-string">'revert'</span>
    ]]
  }
};
</code></pre>
<p>This fails the PR if commit messages don't follow the standard, keeping your Git history clean and machine-readable.</p>
<p><strong>The Complete Template Repository Pipeline</strong></p>
<p>Putting it all together, here's a full CI/CD pipeline for your template repository:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">trigger:</span>
  <span class="hljs-attr">branches:</span>
    <span class="hljs-attr">include:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">main</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">develop</span>
<span class="hljs-string">​</span>
<span class="hljs-attr">pr:</span>
  <span class="hljs-attr">branches:</span>
    <span class="hljs-attr">include:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">main</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">develop</span>
<span class="hljs-string">​</span>
<span class="hljs-attr">pool:</span>
  <span class="hljs-attr">vmImage:</span> <span class="hljs-string">'ubuntu-latest'</span>
<span class="hljs-string">​</span>
<span class="hljs-attr">stages:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">stage:</span> <span class="hljs-string">Validate</span>
    <span class="hljs-attr">jobs:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">job:</span> <span class="hljs-string">ValidateTemplates</span>
        <span class="hljs-attr">steps:</span>
          <span class="hljs-bullet">-</span> <span class="hljs-attr">task:</span> <span class="hljs-string">NodeTool@0</span>
            <span class="hljs-attr">inputs:</span>
              <span class="hljs-attr">versionSpec:</span> <span class="hljs-string">'18.x'</span>
<span class="hljs-string">​</span>
          <span class="hljs-bullet">-</span> <span class="hljs-attr">script:</span> <span class="hljs-string">|
              npm install -g @commitlint/cli @commitlint/config-conventional
              npx commitlint --from origin/main --verbose
</span>            <span class="hljs-attr">displayName:</span> <span class="hljs-string">'Validate commit messages'</span>
            <span class="hljs-attr">condition:</span> <span class="hljs-string">eq(variables['Build.Reason'],</span> <span class="hljs-string">'PullRequest'</span><span class="hljs-string">)</span>
<span class="hljs-string">​</span>
          <span class="hljs-comment"># Run your template tests here</span>
          <span class="hljs-bullet">-</span> <span class="hljs-attr">script:</span> <span class="hljs-string">./scripts/test-templates.sh</span>
            <span class="hljs-attr">displayName:</span> <span class="hljs-string">'Test templates'</span>
<span class="hljs-string">​</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">stage:</span> <span class="hljs-string">Release</span>
    <span class="hljs-attr">condition:</span> <span class="hljs-string">and(succeeded(),</span> <span class="hljs-string">eq(variables['Build.SourceBranch'],</span> <span class="hljs-string">'refs/heads/main'</span><span class="hljs-string">))</span>
    <span class="hljs-attr">jobs:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">job:</span> <span class="hljs-string">CreateRelease</span>
        <span class="hljs-attr">steps:</span>
          <span class="hljs-comment"># Full history for GitVersion; persisted credentials for the tag push</span>
          <span class="hljs-bullet">-</span> <span class="hljs-attr">checkout:</span> <span class="hljs-string">self</span>
            <span class="hljs-attr">fetchDepth:</span> <span class="hljs-number">0</span>
            <span class="hljs-attr">persistCredentials:</span> <span class="hljs-literal">true</span>
          <span class="hljs-bullet">-</span> <span class="hljs-attr">task:</span> <span class="hljs-string">gitversion/setup@0</span>
            <span class="hljs-attr">inputs:</span>
              <span class="hljs-attr">versionSpec:</span> <span class="hljs-string">'5.x'</span>
<span class="hljs-string">​</span>
          <span class="hljs-bullet">-</span> <span class="hljs-attr">task:</span> <span class="hljs-string">gitversion/execute@0</span>
<span class="hljs-string">​</span>
          <span class="hljs-bullet">-</span> <span class="hljs-attr">script:</span> <span class="hljs-string">|
              npm install -g conventional-changelog-cli
              conventional-changelog -p angular -i CHANGELOG.md -s
</span>            <span class="hljs-attr">displayName:</span> <span class="hljs-string">'Generate changelog'</span>
<span class="hljs-string">​</span>
          <span class="hljs-bullet">-</span> <span class="hljs-attr">script:</span> <span class="hljs-string">|
              git config user.name "Pipeline Bot"
              git config user.email "pipeline@company.com"
              git tag v$(GitVersion.SemVer)
              git push origin v$(GitVersion.SemVer)
</span>            <span class="hljs-attr">displayName:</span> <span class="hljs-string">'Create version tag'</span>
<span class="hljs-string">​</span>
          <span class="hljs-bullet">-</span> <span class="hljs-attr">task:</span> <span class="hljs-string">PowerShell@2</span>
            <span class="hljs-attr">displayName:</span> <span class="hljs-string">'Create Azure DevOps Release'</span>
            <span class="hljs-attr">inputs:</span>
              <span class="hljs-attr">targetType:</span> <span class="hljs-string">'inline'</span>
              <span class="hljs-attr">script:</span> <span class="hljs-string">|
                $changelog = Get-Content CHANGELOG.md -Raw
                $body = @{
                  name = "v$(GitVersion.SemVer)"
                  description = $changelog
                  tagName = "v$(GitVersion.SemVer)"
                } | ConvertTo-Json
</span><span class="hljs-string">​</span>
                <span class="hljs-string">$headers</span> <span class="hljs-string">=</span> <span class="hljs-string">@{</span>
                  <span class="hljs-string">Authorization</span> <span class="hljs-string">=</span> <span class="hljs-string">"Bearer $(System.AccessToken)"</span>
                  <span class="hljs-string">"Content-Type"</span> <span class="hljs-string">=</span> <span class="hljs-string">"application/json"</span>
                <span class="hljs-string">}</span>
<span class="hljs-string">​</span>
                <span class="hljs-string">$url</span> <span class="hljs-string">=</span> <span class="hljs-string">"$(System.CollectionUri)$(System.TeamProject)/_apis/git/repositories/$(Build.Repository.ID)/releases?api-version=7.1-preview.1"</span>
                <span class="hljs-string">Invoke-RestMethod</span> <span class="hljs-string">-Uri</span> <span class="hljs-string">$url</span> <span class="hljs-string">-Method</span> <span class="hljs-string">Post</span> <span class="hljs-string">-Headers</span> <span class="hljs-string">$headers</span> <span class="hljs-string">-Body</span> <span class="hljs-string">$body</span>
</code></pre>
<p>With this setup, your versioning and release process becomes completely hands-off. Developers just write good commit messages, and the pipeline handles the rest—calculating versions, creating tags, generating changelogs, and publishing releases.</p>
<h2 id="heading-template-granularity-and-structure">Template Granularity and Structure</h2>
<p>Deciding how to break up your templates is part art, part science. Here are some patterns that work well:</p>
<h3 id="heading-start-with-task-level-templates">Start with Task-Level Templates</h3>
<p>Task templates are the smallest reusable unit—think of them as functions in your CI/CD code. They encapsulate a single responsibility:</p>
<pre><code class="lang-yaml"><span class="hljs-comment"># templates/tasks/install-kubectl.yml</span>
<span class="hljs-attr">parameters:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">version</span>
    <span class="hljs-attr">type:</span> <span class="hljs-string">string</span>
    <span class="hljs-attr">default:</span> <span class="hljs-string">'1.28.0'</span>
<span class="hljs-string">​</span>
<span class="hljs-attr">steps:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">script:</span> <span class="hljs-string">|
      curl -LO "https://dl.k8s.io/release/v${{ parameters.version }}/bin/linux/amd64/kubectl"
      chmod +x kubectl
      sudo mv kubectl /usr/local/bin/
</span>    <span class="hljs-attr">displayName:</span> <span class="hljs-string">'Install kubectl $<span class="hljs-template-variable">{{ parameters.version }}</span>'</span>
</code></pre>
<p>Task templates are great for operations you need to repeat across different jobs or stages.</p>
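<p>As an illustration, here's how a consumer pipeline might reuse that installer across two jobs. The <code>@templates</code> alias follows the repository resource shown earlier; the job names and namespace are placeholders:</p>
<pre><code class="lang-yaml"># Hypothetical consumer reusing the task template in two jobs
jobs:
  - job: Deploy
    steps:
      - template: templates/tasks/install-kubectl.yml@templates
        parameters:
          version: '1.28.0'
      - script: kubectl apply -f ./k8s/ --namespace myapp-dev
        displayName: 'Deploy manifests'

  - job: SmokeTest
    dependsOn: Deploy
    steps:
      # Default kubectl version from the template
      - template: templates/tasks/install-kubectl.yml@templates
      - script: kubectl rollout status deployment/myapp --namespace myapp-dev
        displayName: 'Verify rollout'
</code></pre>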
<h3 id="heading-job-templates-for-reusable-workflows">Job Templates for Reusable Workflows</h3>
<p>Job templates bundle multiple tasks into a cohesive unit of work:</p>
<pre><code class="lang-yaml"><span class="hljs-comment"># templates/jobs/dotnet-build.yml</span>
<span class="hljs-attr">parameters:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">dotnetVersion</span>
    <span class="hljs-attr">type:</span> <span class="hljs-string">string</span>
    <span class="hljs-attr">default:</span> <span class="hljs-string">'8.x'</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">buildConfiguration</span>
    <span class="hljs-attr">type:</span> <span class="hljs-string">string</span>
    <span class="hljs-attr">default:</span> <span class="hljs-string">'Release'</span>
<span class="hljs-string">​</span>
<span class="hljs-attr">jobs:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">job:</span> <span class="hljs-string">Build</span>
    <span class="hljs-attr">pool:</span>
      <span class="hljs-attr">vmImage:</span> <span class="hljs-string">'ubuntu-latest'</span>
    <span class="hljs-attr">steps:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">task:</span> <span class="hljs-string">UseDotNet@2</span>
        <span class="hljs-attr">inputs:</span>
          <span class="hljs-attr">version:</span> <span class="hljs-string">${{</span> <span class="hljs-string">parameters.dotnetVersion</span> <span class="hljs-string">}}</span>
<span class="hljs-string">​</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">script:</span> <span class="hljs-string">dotnet</span> <span class="hljs-string">restore</span>
        <span class="hljs-attr">displayName:</span> <span class="hljs-string">'Restore dependencies'</span>
<span class="hljs-string">​</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">script:</span> <span class="hljs-string">dotnet</span> <span class="hljs-string">build</span> <span class="hljs-string">--configuration</span> <span class="hljs-string">${{</span> <span class="hljs-string">parameters.buildConfiguration</span> <span class="hljs-string">}}</span>
        <span class="hljs-attr">displayName:</span> <span class="hljs-string">'Build'</span>
<span class="hljs-string">​</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">script:</span> <span class="hljs-string">dotnet</span> <span class="hljs-string">test</span>
        <span class="hljs-attr">displayName:</span> <span class="hljs-string">'Run tests'</span>
<span class="hljs-string">​</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">task:</span> <span class="hljs-string">PublishTestResults@2</span>
        <span class="hljs-attr">inputs:</span>
          <span class="hljs-attr">testResultsFormat:</span> <span class="hljs-string">'VSTest'</span>
          <span class="hljs-attr">testResultsFiles:</span> <span class="hljs-string">'**/*.trx'</span>
</code></pre>
<p>Use job templates when you want to standardize the entire execution environment, including pool selection and dependency management.</p>
<h3 id="heading-stage-templates-for-multi-environment-deployments">Stage Templates for Multi-Environment Deployments</h3>
<p>Stage templates are ideal for deployment workflows that need to run across multiple environments:</p>
<pre><code class="lang-yaml"><span class="hljs-comment"># templates/stages/deploy-to-k8s.yml</span>
<span class="hljs-attr">parameters:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">environment</span>
    <span class="hljs-attr">type:</span> <span class="hljs-string">string</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">namespace</span>
    <span class="hljs-attr">type:</span> <span class="hljs-string">string</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">manifests</span>
    <span class="hljs-attr">type:</span> <span class="hljs-string">string</span>
    <span class="hljs-attr">default:</span> <span class="hljs-string">'./k8s/*.yaml'</span>
<span class="hljs-string">​</span>
<span class="hljs-attr">stages:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">stage:</span> <span class="hljs-string">Deploy_${{</span> <span class="hljs-string">parameters.environment</span> <span class="hljs-string">}}</span>
    <span class="hljs-attr">displayName:</span> <span class="hljs-string">'Deploy to $<span class="hljs-template-variable">{{ parameters.environment }}</span>'</span>
    <span class="hljs-attr">jobs:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">deployment:</span> <span class="hljs-string">DeployToK8s</span>
        <span class="hljs-attr">environment:</span> <span class="hljs-string">${{</span> <span class="hljs-string">parameters.environment</span> <span class="hljs-string">}}</span>
        <span class="hljs-attr">pool:</span>
          <span class="hljs-attr">vmImage:</span> <span class="hljs-string">'ubuntu-latest'</span>
        <span class="hljs-attr">strategy:</span>
          <span class="hljs-attr">runOnce:</span>
            <span class="hljs-attr">deploy:</span>
              <span class="hljs-attr">steps:</span>
                <span class="hljs-bullet">-</span> <span class="hljs-attr">template:</span> <span class="hljs-string">../tasks/install-kubectl.yml</span>
<span class="hljs-string">​</span>
                <span class="hljs-bullet">-</span> <span class="hljs-attr">task:</span> <span class="hljs-string">KubernetesManifest@0</span>
                  <span class="hljs-attr">inputs:</span>
                    <span class="hljs-attr">action:</span> <span class="hljs-string">'deploy'</span>
                    <span class="hljs-attr">namespace:</span> <span class="hljs-string">${{</span> <span class="hljs-string">parameters.namespace</span> <span class="hljs-string">}}</span>
                    <span class="hljs-attr">manifests:</span> <span class="hljs-string">${{</span> <span class="hljs-string">parameters.manifests</span> <span class="hljs-string">}}</span>
</code></pre>
<p>Then consume it in your pipeline:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">stages:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">template:</span> <span class="hljs-string">templates/stages/deploy-to-k8s.yml@templates</span>
    <span class="hljs-attr">parameters:</span>
      <span class="hljs-attr">environment:</span> <span class="hljs-string">'dev'</span>
      <span class="hljs-attr">namespace:</span> <span class="hljs-string">'myapp-dev'</span>
<span class="hljs-string">​</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">template:</span> <span class="hljs-string">templates/stages/deploy-to-k8s.yml@templates</span>
    <span class="hljs-attr">parameters:</span>
      <span class="hljs-attr">environment:</span> <span class="hljs-string">'prod'</span>
      <span class="hljs-attr">namespace:</span> <span class="hljs-string">'myapp-prod'</span>
</code></pre>
<h3 id="heading-organizing-your-template-repository">Organizing Your Template Repository</h3>
<p>A well-structured template repository looks something like this:</p>
<pre><code class="lang-plaintext">pipeline-templates/
├── README.md
├── CHANGELOG.md
├── templates/
│   ├── tasks/
│   │   ├── install-kubectl.yml
│   │   ├── docker-build.yml
│   │   └── security-scan.yml
│   ├── jobs/
│   │   ├── dotnet-build.yml
│   │   ├── node-build.yml
│   │   └── terraform-plan.yml
│   ├── stages/
│   │   ├── deploy-to-k8s.yml
│   │   └── deploy-to-azure-app-service.yml
│   └── pipelines/
│       ├── microservice-standard.yml
│       └── infrastructure-deploy.yml
└── examples/
    ├── dotnet-api-pipeline.yml
    └── terraform-pipeline.yml
</code></pre>
<p><strong>Tasks</strong> are atomic operations. <strong>Jobs</strong> group tasks into a single unit of work on one agent. <strong>Stages</strong> orchestrate jobs into phases like build, test, and deploy. <strong>Pipelines</strong> are full end-to-end templates that teams can use with minimal customization.</p>
<p>Include examples that show how to use your templates—this dramatically reduces onboarding friction.</p>
<h2 id="heading-best-practices-and-patterns">Best Practices and Patterns</h2>
<p>Here are some patterns that'll save you headaches:</p>
<h3 id="heading-use-parameters-with-defaults">Use Parameters with Defaults</h3>
<p>Always provide sensible defaults so teams can use templates without needing to specify every parameter:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">parameters:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">buildConfiguration</span>
    <span class="hljs-attr">type:</span> <span class="hljs-string">string</span>
    <span class="hljs-attr">default:</span> <span class="hljs-string">'Release'</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">runTests</span>
    <span class="hljs-attr">type:</span> <span class="hljs-string">boolean</span>
    <span class="hljs-attr">default:</span> <span class="hljs-literal">true</span>
</code></pre>
<h3 id="heading-validate-inputs">Validate Inputs</h3>
<p>Use parameter validation to catch errors early:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">parameters:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">environment</span>
    <span class="hljs-attr">type:</span> <span class="hljs-string">string</span>
    <span class="hljs-attr">values:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">dev</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">staging</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">prod</span>
</code></pre>
<h3 id="heading-keep-templates-focused">Keep Templates Focused</h3>
<p>A template should do one thing well. If your template is hundreds of lines long and takes 20 parameters, it's probably trying to do too much. Break it into smaller, composable pieces.</p>
<h3 id="heading-use-conditional-logic-sparingly">Use Conditional Logic Sparingly</h3>
<p>Templates can include conditions, but overusing them makes templates harder to understand and maintain:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">steps:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-string">${{</span> <span class="hljs-string">if</span> <span class="hljs-string">eq(parameters.environment,</span> <span class="hljs-string">'prod'</span><span class="hljs-string">)</span> <span class="hljs-string">}}:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">script:</span> <span class="hljs-string">echo</span> <span class="hljs-string">"Running production-only checks"</span>
      <span class="hljs-attr">displayName:</span> <span class="hljs-string">'Production validation'</span>
</code></pre>
<p>This is fine for simple cases, but if you find yourself writing complex conditional trees, consider creating separate templates instead.</p>
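<p>For example, rather than one deploy template branching internally on every environment difference, the call site can select a focused template. A minimal sketch, assuming this lives in a template that takes an <code>environment</code> parameter and that the two stage templates exist:</p>
<pre><code class="lang-yaml"># Pick a focused template at the call site instead of branching inside one template
stages:
  - ${{ if eq(parameters.environment, 'prod') }}:
      - template: templates/stages/deploy-prod.yml@templates
        parameters:
          namespace: 'myapp-prod'
  - ${{ if ne(parameters.environment, 'prod') }}:
      - template: templates/stages/deploy-nonprod.yml@templates
        parameters:
          namespace: 'myapp-${{ parameters.environment }}'
</code></pre>
<p>Each template stays small and readable, and the single condition at the boundary is the only branching left.</p>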
<h3 id="heading-document-your-templates">Document Your Templates</h3>
<p>Every template should have a header comment explaining:</p>
<ul>
<li><p>What it does</p>
</li>
<li><p>Required and optional parameters</p>
</li>
<li><p>Example usage</p>
</li>
<li><p>Any prerequisites (service connections, variable groups, etc.)</p>
</li>
</ul>
<pre><code class="lang-yaml"><span class="hljs-comment"># Docker Build and Push Template</span>
<span class="hljs-comment">#</span>
<span class="hljs-comment"># Builds a Docker image and pushes it to a container registry.</span>
<span class="hljs-comment"># Includes vulnerability scanning via docker scan.</span>
<span class="hljs-comment">#</span>
<span class="hljs-comment"># Parameters:</span>
<span class="hljs-comment">#   - imageName (required): Name of the image to build</span>
<span class="hljs-comment">#   - dockerfilePath (optional): Path to Dockerfile (default: './Dockerfile')</span>
<span class="hljs-comment">#   - tags (optional): List of tags to apply (default: ['latest'])</span>
<span class="hljs-comment">#</span>
<span class="hljs-comment"># Example:</span>
<span class="hljs-comment">#   - template: templates/docker-build.yml@templates</span>
<span class="hljs-comment">#     parameters:</span>
<span class="hljs-comment">#       imageName: 'myapp'</span>
<span class="hljs-comment">#       tags: ['latest', '$(Build.BuildId)']</span>
</code></pre>
<h3 id="heading-leverage-variable-templates">Leverage Variable Templates</h3>
<p>You can also use templates for variables, which is handy for environment-specific configuration:</p>
<pre><code class="lang-yaml"><span class="hljs-comment"># templates/variables/prod-vars.yml</span>
<span class="hljs-attr">variables:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">resourceGroup</span>
    <span class="hljs-attr">value:</span> <span class="hljs-string">'prod-rg'</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">storageAccount</span>
    <span class="hljs-attr">value:</span> <span class="hljs-string">'prodstorageacct'</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">logLevel</span>
    <span class="hljs-attr">value:</span> <span class="hljs-string">'Warning'</span>
</code></pre>
<p>Then reference it:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">variables:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">template:</span> <span class="hljs-string">templates/variables/prod-vars.yml@templates</span>
</code></pre>
<h3 id="heading-test-your-templates">Test Your Templates</h3>
<p>Treat templates like code—they need testing too. Set up a test pipeline that consumes your templates with various parameter combinations. Run it on every PR to catch regressions before they hit production.</p>
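<p>A minimal sketch of such a test pipeline (the file paths are hypothetical, and this assumes the stage template derives a unique stage name from its environment parameter):</p>
<pre><code class="lang-yaml"># test-templates.yml — runs in the template repository on every PR
trigger: none

stages:
  - template: templates/deploy-stage.yml
    parameters:
      environment: dev
  - template: templates/deploy-stage.yml
    parameters:
      environment: staging
  - template: templates/deploy-stage.yml
    parameters:
      environment: prod
</code></pre>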
<h2 id="heading-wrapping-up">WRAPPING UP</h2>
<p>Pipeline templates are one of Azure DevOps' most powerful features for scaling CI/CD practices across an organization. By centralizing common workflows, you gain consistency, reduce maintenance burden, and enforce standards without turning the platform team into a bottleneck.</p>
<p>Key takeaways:</p>
<ul>
<li><p><strong>Structure templates by scope</strong>: tasks for single operations, jobs for workflows, stages for multi-environment processes.</p>
</li>
<li><p><strong>Version templates with Git tags</strong> and use semantic versioning to communicate breaking changes.</p>
</li>
<li><p><strong>Keep templates focused and composable</strong>—smaller, single-purpose templates are easier to maintain and test.</p>
</li>
<li><p><strong>Document thoroughly</strong> and provide examples to reduce onboarding friction.</p>
</li>
</ul>
<p>A few gotchas to watch out for: Azure DevOps caches template repositories, so sometimes you'll need to manually refresh or re-trigger pipelines to pick up changes. Also, template expressions (<code>${{ }}</code>) are evaluated at queue time, which can trip you up if you're used to runtime variables.</p>
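<p>A quick illustration of the difference — the first step is resolved when the run is compiled, the second only when the job executes on an agent:</p>
<pre><code class="lang-yaml">steps:
  # Template expression: expanded at queue/compile time
  - script: echo "Deploying to ${{ parameters.environment }}"
  # Macro syntax: resolved at runtime on the agent
  - script: echo "Build number is $(Build.BuildId)"
</code></pre>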
<p>If you're managing templates across multiple teams, consider setting up a governance model—define who owns the template repo, establish a review process for changes, and create a feedback loop so teams can request new templates or improvements.</p>
<p>Done right, pipeline templates become the foundation of a self-service CI/CD platform where teams can ship fast without sacrificing consistency or security.</p>
]]></content:encoded></item><item><title><![CDATA[Powershell: Custom prompt using oh-my-posh]]></title><description><![CDATA[Install a nerd font from https://www.nerdfonts.com/font-downloads
Install oh-my-posh using winget

winget install oh-my-posh

// Close & reopen powershell window

// Check oh-my-posh version
oh-my-posh version

// Upgrade oh-my-posh using winget
wing...]]></description><link>https://blog.vivekdhami.com/ps-custom-prompt</link><guid isPermaLink="true">https://blog.vivekdhami.com/ps-custom-prompt</guid><category><![CDATA[Powershell]]></category><dc:creator><![CDATA[Vivek Dhami]]></dc:creator><pubDate>Wed, 15 Nov 2023 09:09:13 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/cckf4TsHAuw/upload/45fb5883dc7227030fccf90b0795b030.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<ol>
<li>Install a nerd font from <a target="_blank" href="https://www.nerdfonts.com/font-downloads">https://www.nerdfonts.com/font-downloads</a></li>
<li>Install oh-my-posh using winget</li>
</ol>
<pre><code class="lang-powershell">winget install oh-my-posh

<span class="hljs-comment"># Close &amp; reopen the PowerShell window</span>

<span class="hljs-comment"># Check oh-my-posh version</span>
oh-my-posh version

<span class="hljs-comment"># Upgrade oh-my-posh using winget</span>
winget upgrade oh-my-posh
</code></pre>
<ol start="3">
<li>Install Terminal-Icons and import them for displaying.</li>
</ol>
<pre><code class="lang-powershell">
<span class="hljs-comment"># Install the module once:</span>
<span class="hljs-built_in">Install-Module</span> <span class="hljs-literal">-Name</span> Terminal<span class="hljs-literal">-Icons</span> <span class="hljs-literal">-Repository</span> PSGallery <span class="hljs-literal">-Scope</span> CurrentUser

<span class="hljs-comment"># Open the current user profile:</span>
code <span class="hljs-variable">$PROFILE</span>

<span class="hljs-comment"># Add the following line to the profile file:</span>
<span class="hljs-built_in">Import-Module</span> <span class="hljs-literal">-Name</span> Terminal<span class="hljs-literal">-Icons</span>
</code></pre>
<p>This will install Terminal-Icons and import it for displaying common file and folder icons in the PowerShell terminal.</p>
<ol start="4">
<li>Initialize oh-my-posh with a theme (amro in this case).</li>
</ol>
<pre><code class="lang-powershell"><span class="hljs-built_in">oh</span><span class="hljs-literal">-my</span><span class="hljs-literal">-posh</span> init pwsh -<span class="hljs-literal">-config</span> <span class="hljs-string">"<span class="hljs-variable">$env:POSH_THEMES_PATH</span>\amro.omp.json"</span> | <span class="hljs-built_in">Invoke-Expression</span>
</code></pre>
<p>Themes are available at: <a target="_blank" href="https://ohmyposh.dev/docs/themes">https://ohmyposh.dev/docs/themes</a></p>
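<p>Note that running the init line only affects the current session. To make the custom prompt persistent, add it to your PowerShell profile as well — a sketch of the resulting <code>$PROFILE</code> file:</p>
<pre><code class="lang-powershell"># $PROFILE — persists the custom prompt across sessions
Import-Module -Name Terminal-Icons
oh-my-posh init pwsh --config "$env:POSH_THEMES_PATH\amro.omp.json" | Invoke-Expression
</code></pre>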
]]></content:encoded></item><item><title><![CDATA[Share WSL rootless Podman instance with Windows]]></title><description><![CDATA[Background
Podman Desktop is the easiest way to start using Podman with windows. However currently Podman Desktop creates a new WSL instance with podman and connects windows to it. While it is pretty convenient for most of the users, this still requi...]]></description><link>https://blog.vivekdhami.com/share-wsl-rootless-podman-instance-with-windows</link><guid isPermaLink="true">https://blog.vivekdhami.com/share-wsl-rootless-podman-instance-with-windows</guid><category><![CDATA[podman]]></category><category><![CDATA[containers]]></category><dc:creator><![CDATA[Vivek Dhami]]></dc:creator><pubDate>Sat, 15 Jul 2023 10:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/HjBOmBPbi9k/upload/08f6293d54acc6e68ab6b72f34773925.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h4 id="heading-background">Background</h4>
<p><a target="_blank" href="https://podman-desktop.io/">Podman Desktop</a> is the easiest way to start using Podman on Windows. However, Podman Desktop currently creates a new WSL instance with podman and connects Windows to it. While that is convenient for most users, it still requires another WSL instance, which may not suit you if you already have a preconfigured WSL dev environment.</p>
<p>So in this post I will share the setup I have been using to share the rootless podman running in a WSL instance with Windows.</p>
<h4 id="heading-prerequisites">Prerequisites</h4>
<p>We will be using systemd for setting up rootless podman in WSL. So the only prerequisite for this setup is that the WSL version installed on your Windows system supports systemd.</p>
<p>Systemd support was added to WSL in version <a target="_blank" href="https://devblogs.microsoft.com/commandline/systemd-support-is-now-available-in-wsl/">0.67.6</a>. You can check the WSL version via the following command:</p>
<pre><code class="lang-bash">
wsl --version <span class="hljs-comment"># Should be 0.67.6 or above.</span>
</code></pre>
<p>If your WSL version is below that, you can check for WSL updates via the <code>wsl --update</code> command.</p>
<p>Furthermore, systemd is not enabled in WSL by default and you can enable it at boot by creating a WSL configuration file <strong>/etc/wsl.conf</strong> within your WSL instance.</p>
<pre><code class="lang-yml">
<span class="hljs-comment"># /etc/wsl.conf file content</span>

[<span class="hljs-string">boot</span>]
<span class="hljs-string">systemd=true</span>
</code></pre>
<p>Restart your WSL instance and you can check systemd status after reboot using this command:</p>
<pre><code class="lang-bash">systemctl list-unit-files --<span class="hljs-built_in">type</span>=service
</code></pre>
<h4 id="heading-install-podman-on-wsl">Install Podman on WSL</h4>
<p>You can follow the podman installation steps for installing podman on your WSL instance.</p>
<p>For Ubuntu/Debian, installation via apt currently only supports v3.4.4 of podman and it is pretty outdated. However there is an updated version available via an unofficial source. You can read more about it <a target="_blank" href="https://www.reddit.com/r/podman/comments/10wvjjp/how_i_backported_podman_4_on_ubuntu_2204/">here</a> and <a target="_blank" href="https://launchpad.net/~quarckster/+archive/ubuntu/containers">here</a>.</p>
<pre><code class="lang-bash">sudo apt install podman <span class="hljs-comment"># installs podman 3.4.4</span>

<span class="hljs-comment"># For a more updated version: https://www.reddit.com/r/podman/comments/10wvjjp/how_i_backported_podman_4_on_ubuntu_2204/</span>
<span class="hljs-comment"># https://launchpad.net/~quarckster/+archive/ubuntu/containers</span>

sudo add-apt-repository ppa:quarckster/containers
sudo apt update
sudo apt install podman <span class="hljs-comment"># now installs the backported podman 4.x</span>
</code></pre>
<p>After installation, you can confirm the podman installation by running the following commands:</p>
<pre><code class="lang-bash">
podman version <span class="hljs-comment"># shows podman version</span>

podman info <span class="hljs-comment"># shows information about podman instance. Of particular interest are remoteSocket.Path and security.rootless=true.</span>
</code></pre>
<h4 id="heading-connect-to-podman-service-in-wsl-using-ssh">Connect to podman service in WSL using SSH</h4>
<p>We will be using ssh to connect to the podman service running in WSL from Windows.</p>
<p>Run the following commands to set up ssh in WSL if it is not already set up:</p>
<pre><code class="lang-bash">
<span class="hljs-comment"># Install and enable ssh in Ubuntu/Debian. Please search online for other Linux distributions. </span>
sudo apt install openssh-server
sudo systemctl <span class="hljs-built_in">enable</span> --now ssh.service <span class="hljs-comment"># --now also starts the service immediately</span>
</code></pre>
<p>Generate ssh keys and add them to the WSL authorized keys list:</p>
<pre><code class="lang-bash">
<span class="hljs-comment"># Generate and export ssh keys</span>
<span class="hljs-built_in">export</span> WINDOWS_HOME=/mnt/c/Users/&lt;username&gt;
ssh-keygen -b 2048 -t rsa -f <span class="hljs-variable">$WINDOWS_HOME</span>/.ssh/id_rsa_podman -q -N <span class="hljs-string">""</span>

<span class="hljs-comment"># Add to authorized list for ssh connection</span>
cat <span class="hljs-variable">$WINDOWS_HOME</span>/.ssh/id_rsa_podman.pub &gt;&gt; ~/.ssh/authorized_keys
</code></pre>
<p>Enable podman rootless service:</p>
<pre><code class="lang-bash">
<span class="hljs-comment"># Enable podman rootless service</span>
systemctl --user <span class="hljs-built_in">enable</span> --now podman.socket <span class="hljs-comment"># --now also starts the socket immediately</span>

<span class="hljs-comment"># Enable systemd services to continue to work even after user log offs</span>
sudo loginctl enable-linger <span class="hljs-variable">$USER</span>

<span class="hljs-comment"># Check podman remote information</span>
podman --remote info

<span class="hljs-comment"># Check podman socket fullpath. Needed while adding connection in windows.</span>
ls <span class="hljs-variable">$XDG_RUNTIME_DIR</span>/podman/podman.sock
</code></pre>
<p>Finally, from a Windows terminal, add the podman connection and set it as default. You will need the podman CLI for this step, and the easiest way to install it on Windows is using <a target="_blank" href="https://podman-desktop.io/">Podman Desktop</a>.</p>
<pre><code class="lang-bash">
<span class="hljs-comment"># Add podman connection using ssh key and podman socket path</span>
podman system connection add wsl-podman --identity C:\Users\&lt;username&gt;\.ssh\id_rsa_podman ssh://&lt;username&gt;@localhost/run/user/&lt;userid&gt;/podman/podman.sock
<span class="hljs-comment"># Replace /run/user/&lt;userid&gt;/podman/podman.sock above with the path printed by: ls $XDG_RUNTIME_DIR/podman/podman.sock</span>

<span class="hljs-comment"># set the new connection as default if not set already</span>
podman system connection default wsl-podman

<span class="hljs-comment"># Check if everything works</span>
podman info
podman images
</code></pre>
<p>This should connect Windows with the WSL podman instance. If you face connection issues, check whether the WSL instance is running.</p>
<h4 id="heading-known-issue-ubuntu-wsl">Known issue: Ubuntu WSL</h4>
<p>Due to a <a target="_blank" href="https://github.com/microsoft/WSL/issues/10205">bug</a> in WSL2 for Ubuntu (and possibly other distributions as well), after restarting the WSL Ubuntu instance the file <strong>/run/user/&lt;userid&gt;/bus</strong> is nuked, and as a result <strong>systemctl --user</strong> fails with a <strong>file not found</strong> error.</p>
<p>The fix is to run the following command every time after restarting the WSL VM:</p>
<pre><code class="lang-bash">
sudo systemctl restart user@&lt;userid&gt;
</code></pre>
<p>A better fix is to disable <strong>wslg</strong> within WSL if you are not using graphical Linux apps, as it is this particular mount that causes the issue. <strong>wslg</strong> can be disabled globally by adding a <strong>.wslconfig</strong> file at the <strong>%USERPROFILE%</strong> location in Windows:</p>
<pre><code class="lang-yaml">[<span class="hljs-string">wsl2</span>]
<span class="hljs-string">guiApplications=false</span>
</code></pre>
<p>That's all, folks. I hope you were able to make your podman instance in WSL work with Windows.</p>
]]></content:encoded></item><item><title><![CDATA[How Linux networking configuration evolved from ifconfig to systemd]]></title><description><![CDATA[If you've been working with Linux for more than a few years, you've probably noticed that networking configuration has gone through some serious changes. What started as simple text files and a handful of commands has evolved into a complex ecosystem...]]></description><link>https://blog.vivekdhami.com/how-linux-networking-configuration-evolved-from-ifconfig-to-systemd</link><guid isPermaLink="true">https://blog.vivekdhami.com/how-linux-networking-configuration-evolved-from-ifconfig-to-systemd</guid><category><![CDATA[Linux]]></category><category><![CDATA[networking]]></category><dc:creator><![CDATA[Vivek Dhami]]></dc:creator><pubDate>Tue, 16 May 2023 22:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/SwVkmowt7qA/upload/67f02f45ded9ac0c0d0570df159561ab.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>If you've been working with Linux for more than a few years, you've probably noticed that networking configuration has gone through some serious changes. What started as simple text files and a handful of commands has evolved into a complex ecosystem of tools, daemons, and abstractions. And if you're like me, you've probably typed <code>ifconfig</code> only to be met with "command not found" on a fresh Ubuntu install at least once.</p>
<p>Let's walk through how we got here—from the early days of net-tools to the systemd-dominated present—and understand why these changes happened in the first place.</p>
<h2 id="heading-the-old-guard-net-tools-and-friends">THE OLD GUARD: NET-TOOLS AND FRIENDS</h2>
<p>Back in the day (we're talking late 90s through early 2010s), Linux networking was straightforward. You had a small set of tools that did specific jobs:</p>
<ul>
<li><p><code>ifconfig</code> - configure network interfaces</p>
</li>
<li><p><code>route</code> - manage routing tables</p>
</li>
<li><p><code>netstat</code> - network statistics and connections</p>
</li>
<li><p><code>arp</code> - ARP cache manipulation</p>
</li>
<li><p><code>/etc/resolv.conf</code> - DNS configuration (literally just a text file)</p>
</li>
</ul>
<p>This was the <strong>net-tools</strong> package, and it worked. Sort of. The problem was that these tools were designed for a simpler time when Linux networking meant "configure eth0 with a static IP and you're done." As networking grew more complex—VLANs, bridges, tunnels, multiple routing tables, policy-based routing—net-tools started showing its age.</p>
<p>Here's what a typical interface configuration looked like:</p>
<pre><code class="lang-plaintext"># Bring up eth0 with a static IP
ifconfig eth0 192.168.1.100 netmask 255.255.255.0 up
​
# Add a default route
route add default gw 192.168.1.1
​
# Check what you just did
ifconfig eth0
route -n
</code></pre>
<p>Simple enough for basic use cases, but try doing anything advanced and you'd quickly hit walls. Want to create a VLAN? You'd need <code>vconfig</code>. Bridge interfaces? That's <code>brctl</code>. The tooling was fragmented, and the underlying kernel interfaces (ioctls) these tools used were showing their limitations.</p>
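<p>For flavor, here is roughly what that fragmentation looked like — every task had its own tool (these commands come from the legacy <code>vlan</code> and <code>bridge-utils</code> packages, shown purely as a historical illustration):</p>
<pre><code class="lang-plaintext"># VLANs needed vconfig (from the old "vlan" package)
vconfig add eth0 100
ifconfig eth0.100 10.0.100.1 netmask 255.255.255.0 up

# Bridges needed brctl (from bridge-utils)
brctl addbr br0
brctl addif br0 eth0
ifconfig br0 up
</code></pre>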
<h2 id="heading-enter-iproute2-the-modernization">ENTER IPROUTE2: THE MODERNIZATION</h2>
<p>Around 2003, the <code>iproute2</code> suite started gaining traction. The crown jewel was the <code>ip</code> command—a single unified tool that could handle interfaces, routing, tunnels, and more. It was built on top of the newer netlink socket interface, which gave it more power and flexibility than the old ioctl-based tools.</p>
<p>The same configuration from above now looks like this:</p>
<pre><code class="lang-plaintext"># Bring up eth0 with a static IP
ip addr add 192.168.1.100/24 dev eth0
ip link set eth0 up
​
# Add a default route
ip route add default via 192.168.1.1
​
# Check what you just did
ip addr show eth0
ip route show
</code></pre>
<p>Notice the CIDR notation (<code>/24</code> instead of <code>netmask 255.255.255.0</code>)—this was one of many improvements. But more importantly, <code>ip</code> could do things that net-tools simply couldn't:</p>
<pre><code class="lang-plaintext"># Create a VLAN
ip link add link eth0 name eth0.100 type vlan id 100
​
# Set up policy-based routing with multiple routing tables
ip rule add from 10.0.0.0/8 table 100
ip route add default via 192.168.2.1 table 100
​
# Configure network namespaces (crucial for containers)
ip netns add production
ip link set eth1 netns production
</code></pre>
<p>This last feature—network namespaces—became absolutely critical for containerization. Docker, Kubernetes, and basically every modern container runtime relies on network namespaces to isolate network stacks between containers.</p>
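<p>A sketch of the kind of isolation containers rely on — a namespace connected to the host through a veth pair (addresses are arbitrary examples; requires root):</p>
<pre><code class="lang-plaintext"># Create a namespace and a veth pair, one end in each stack
ip netns add demo
ip link add veth-host type veth peer name veth-demo
ip link set veth-demo netns demo

# Address both ends and bring them up
ip addr add 10.200.0.1/24 dev veth-host
ip link set veth-host up
ip netns exec demo ip addr add 10.200.0.2/24 dev veth-demo
ip netns exec demo ip link set veth-demo up
ip netns exec demo ip link set lo up

# The namespace now has its own isolated network stack
ip netns exec demo ping -c 1 10.200.0.1
</code></pre>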
<p>By the mid-2010s, major distributions started deprecating net-tools. Debian and Ubuntu stopped installing it by default. Red Hat marked it as deprecated. The message was clear: learn <code>ip</code> or get left behind.</p>
<h2 id="heading-interface-management-gets-complicated">INTERFACE MANAGEMENT GETS COMPLICATED</h2>
<p>So now we had better low-level tools, but who was actually calling them? In the early days, you'd have distribution-specific scripts:</p>
<ul>
<li><p><strong>Debian/Ubuntu</strong>: <code>/etc/network/interfaces</code> file, managed by <code>ifupdown</code></p>
</li>
<li><p><strong>Red Hat/CentOS</strong>: <code>/etc/sysconfig/network-scripts/ifcfg-*</code> files, managed by <code>network</code> service</p>
</li>
<li><p><strong>SUSE</strong>: YaST with its own configuration format</p>
</li>
</ul>
<p>Each distribution did things differently, and migrating configurations between distros was painful. Worse, these systems were designed for static configurations—they struggled with dynamic scenarios like laptops moving between networks or cloud instances with changing metadata.</p>
<p>Two major players emerged to solve this:</p>
<h3 id="heading-networkmanager">NetworkManager</h3>
<p>NetworkManager arrived to handle dynamic networking scenarios. It was perfect for desktops and laptops that needed to switch between WiFi networks, handle VPNs, and deal with frequent network changes. It introduced D-Bus APIs for managing connections and provided both GUI and CLI tools:</p>
<pre><code class="lang-plaintext"># List connections
nmcli connection show
​
# Connect to WiFi
nmcli device wifi connect "MyNetwork" password "MyPassword"
​
# Create a new static connection
nmcli connection add con-name "static-eth0" \
  ifname eth0 type ethernet \
  ip4 192.168.1.100/24 gw4 192.168.1.1
</code></pre>
<p>NetworkManager became the default on Fedora, RHEL, and Ubuntu Desktop. It's great for interactive systems but can be overkill for servers with static configurations.</p>
<h3 id="heading-systemd-networkd">systemd-networkd</h3>
<p>When systemd took over init, it brought its own networking daemon: <code>systemd-networkd</code>. Unlike NetworkManager, it's designed for servers and embedded systems with relatively static configurations. Configuration lives in <code>.network</code> files under <code>/etc/systemd/network/</code>:</p>
<pre><code class="lang-plaintext"># /etc/systemd/network/10-static-eth0.network
[Match]
Name=eth0
​
[Network]
Address=192.168.1.100/24
Gateway=192.168.1.1
DNS=8.8.8.8
DNS=8.8.4.4
</code></pre>
<p>Then enable and start it:</p>
<pre><code class="lang-plaintext">systemctl enable --now systemd-networkd
</code></pre>
<p>The systemd approach is declarative and integrates cleanly with the rest of the systemd ecosystem. Cloud images (especially for AWS, GCP, Azure) often use systemd-networkd because it's lightweight and predictable.</p>
<p>Here's where it gets interesting: on a modern Linux system, you might have multiple things trying to manage networking. You could have <code>systemd-networkd</code> managing physical interfaces while <code>docker</code> manages bridge interfaces and <code>NetworkManager</code> handles WiFi. They usually stay out of each other's way, but conflicts can happen.</p>
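<p>When managers do collide, the usual fix is to explicitly exclude an interface from one of them. Both knobs shown here exist, though the interface name is just an example:</p>
<pre><code class="lang-plaintext"># Tell NetworkManager to leave eth0 alone
nmcli device set eth0 managed no

# Or tell systemd-networkd to ignore it:
# /etc/systemd/network/05-ignore-eth0.network
[Match]
Name=eth0

[Link]
Unmanaged=yes
</code></pre>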
<h2 id="heading-dns-resolution-from-simple-to-sophisticated">DNS RESOLUTION: FROM SIMPLE TO... SOPHISTICATED?</h2>
<p>Now let's talk about DNS, because this has gotten wild.</p>
<h3 id="heading-the-old-way-etcresolvconf">The old way: /etc/resolv.conf</h3>
<p>For decades, DNS configuration was dead simple. You edited <code>/etc/resolv.conf</code>:</p>
<pre><code class="lang-plaintext">nameserver 8.8.8.8
nameserver 8.8.4.4
search example.com
</code></pre>
<p>Done. Every application used the standard resolver library (libc), which read this file and made DNS queries. Simple, predictable, and easy to debug.</p>
<h3 id="heading-the-problem-with-simple">The problem with simple</h3>
<p>But simple broke down in modern scenarios:</p>
<ul>
<li><p><strong>Split DNS</strong>: VPN connections might need different DNS servers for internal domains</p>
</li>
<li><p><strong>DNSSEC</strong>: validating DNS responses cryptographically</p>
</li>
<li><p><strong>DNS over TLS/HTTPS</strong>: encrypting DNS queries for privacy</p>
</li>
<li><p><strong>mDNS/LLMNR</strong>: local network name resolution (think <code>.local</code> domains)</p>
</li>
<li><p><strong>Per-interface DNS</strong>: different DNS servers for different network interfaces</p>
</li>
</ul>
<p>Applications started implementing their own DNS logic. Browsers added DNS-over-HTTPS. VPN clients fought over <code>/etc/resolv.conf</code>. It was chaos.</p>
<h3 id="heading-systemd-resolved-to-the-rescue-sort-of">systemd-resolved to the rescue (sort of)</h3>
<p>systemd-resolved came along to centralize all this complexity. It's a local DNS resolver that sits between applications and actual DNS servers.</p>
<p>On a systemd-resolved system, <code>/etc/resolv.conf</code> becomes a symlink to a stub file:</p>
<pre><code class="lang-plaintext">$ ls -l /etc/resolv.conf
lrwxrwxrwx 1 root root 39 Jan 15 10:23 /etc/resolv.conf -&gt; ../run/systemd/resolve/stub-resolv.conf
​
$ cat /etc/resolv.conf
nameserver 127.0.0.53  # Points to systemd-resolved
</code></pre>
<p>All DNS queries go to <code>127.0.0.53</code>, where systemd-resolved handles them intelligently. The <code>resolvectl</code> command (which replaced the older <code>systemd-resolve</code>) is your main interface for managing and debugging DNS:</p>
<pre><code class="lang-plaintext"># Check overall DNS status and per-interface DNS servers
resolvectl status
​
# Query a domain and see which server answered
resolvectl query example.com
​
# Query a specific record type and class explicitly
resolvectl query --type=A --class=IN example.com
​
# Flush DNS cache
resolvectl flush-caches
​
# Get statistics (cache hits, transaction counts)
resolvectl statistics
</code></pre>
<h4 id="heading-configuring-dns-with-systemd-networkd">Configuring DNS with systemd-networkd</h4>
<p>The most common way to configure DNS is through systemd-networkd's <code>.network</code> files. You can set DNS servers per-interface:</p>
<pre><code class="lang-plaintext"># /etc/systemd/network/10-eth0.network
[Match]
Name=eth0
​
[Network]
DHCP=no
Address=192.168.1.100/24
Gateway=192.168.1.1
​
# Primary and fallback DNS servers
DNS=1.1.1.1
DNS=1.0.0.1
​
# Domains to search (for short hostnames)
Domains=internal.company.com
​
# Or: route only specific domains through these DNS servers (split DNS)
Domains=~internal.company.com ~vpn.company.com
</code></pre>
<p>The <code>~</code> prefix on domains means "route DNS queries for this domain through this interface's DNS servers." This is crucial for VPN scenarios where you want corporate domains resolved by corporate DNS, but everything else goes to your regular DNS.</p>
<p>After editing, restart systemd-networkd:</p>
<pre><code class="lang-plaintext">systemctl restart systemd-networkd
</code></pre>
<h4 id="heading-global-dns-configuration-with-resolvedconf">Global DNS configuration with resolved.conf</h4>
<p>For system-wide DNS settings, edit <code>/etc/systemd/resolved.conf</code>:</p>
<pre><code class="lang-plaintext">[Resolve]
# Fallback DNS if no interface provides DNS servers
DNS=8.8.8.8 1.1.1.1
​
# Route all lookups to the global DNS servers above (~. matches every domain)
Domains=~.
​
# Enable DNSSEC validation (yes/allow-downgrade/no)
DNSSEC=allow-downgrade
​
# Enable DNS over TLS (yes/opportunistic/no)
DNSOverTLS=opportunistic
​
# Enable mDNS for .local domains
MulticastDNS=yes
​
# Enable LLMNR for local network name resolution
LLMNR=yes
​
# DNS stub listener address (default: 127.0.0.53)
DNSStubListener=yes
​
# Cache entries
Cache=yes
CacheFromLocalhost=no
</code></pre>
<p>After changing <code>resolved.conf</code>, restart the service:</p>
<pre><code class="lang-plaintext">systemctl restart systemd-resolved
</code></pre>
<h4 id="heading-split-dns-for-vpn-scenarios">Split DNS for VPN scenarios</h4>
<p>Here's a real-world example: you're connected to a corporate VPN via <code>tun0</code> and want corporate domains (<code>*.corp.example.com</code>) resolved through the VPN's DNS, but everything else through Cloudflare:</p>
<pre><code class="lang-plaintext"># /etc/systemd/network/10-vpn.network
[Match]
Name=tun0
​
[Network]
DNS=10.10.10.10  # Corporate DNS server
Domains=~corp.example.com ~internal.example.com
</code></pre>
<p>The routing domains (prefixed with <code>~</code>) tell systemd-resolved: "queries for these domains go to this interface's DNS servers only." Check if it's working:</p>
<pre><code class="lang-plaintext">$ resolvectl status tun0
Link 4 (tun0)
      Current Scopes: DNS
       LLMNR setting: yes
MulticastDNS setting: no
      DNSSEC setting: allow-downgrade
    DNSSEC supported: yes
         DNS Servers: 10.10.10.10
          DNS Domain: ~corp.example.com
                      ~internal.example.com
​
# Test it
$ resolvectl query jenkins.corp.example.com
jenkins.corp.example.com: 10.20.30.40       -- link: tun0
</code></pre>
<p>That <code>-- link: tun0</code> confirms the query went through the VPN interface.</p>
<h4 id="heading-enabling-dnssec">Enabling DNSSEC</h4>
<p>DNSSEC validation adds cryptographic verification to DNS responses. In <code>resolved.conf</code>:</p>
<pre><code class="lang-plaintext">[Resolve]
DNSSEC=yes  # Strict validation; queries fail if DNSSEC validation fails
# or
DNSSEC=allow-downgrade  # Validate when available, but allow unsigned responses
</code></pre>
<p>Check DNSSEC status for a domain:</p>
<pre><code class="lang-plaintext">$ resolvectl query --legend=yes cloudflare.com
cloudflare.com: 104.16.132.229
                104.16.133.229
​
-- Information acquired via protocol DNS in 23.4ms.
-- Data is authenticated: yes
</code></pre>
<p>That "authenticated: yes" means DNSSEC validation succeeded.</p>
<h4 id="heading-dns-over-tls-dot">DNS over TLS (DoT)</h4>
<p>To encrypt DNS queries in transit, enable DNS-over-TLS in <code>resolved.conf</code>:</p>
<pre><code class="lang-plaintext">[Resolve]
DNS=1.1.1.1#cloudflare-dns.com 1.0.0.1#cloudflare-dns.com
DNSOverTLS=yes
</code></pre>
<p>The <code>#cloudflare-dns.com</code> syntax specifies the TLS server name. Restart and verify:</p>
<pre><code class="lang-plaintext">systemctl restart systemd-resolved
resolvectl status | grep "DNS over TLS"
</code></pre>
<p>Note that DoT uses port 853, so your firewall needs to allow it. Also, this is different from DNS-over-HTTPS (DoH), which systemd-resolved doesn't support natively—you'd need something like dnscrypt-proxy for that.</p>
<h4 id="heading-runtime-dns-changes-with-resolvectl">Runtime DNS changes with resolvectl</h4>
<p>You can temporarily override DNS settings without editing config files:</p>
<pre><code class="lang-plaintext"># Set DNS servers for eth0 until next reboot
resolvectl dns eth0 8.8.8.8 8.8.4.4
​
# Set routing domains for eth0
resolvectl domain eth0 ~internal.example.com
​
# Revert to defaults
resolvectl revert eth0
</code></pre>
<p>These changes don't persist across reboots, which is useful for testing.</p>
<h4 id="heading-common-troubleshooting-scenarios">Common troubleshooting scenarios</h4>
<p><strong>DNS queries timing out?</strong> Check if systemd-resolved is actually running:</p>
<pre><code class="lang-plaintext">systemctl status systemd-resolved
journalctl -u systemd-resolved -f  # Follow logs in real-time
</code></pre>
<p><strong>VPN DNS not working?</strong> Verify your VPN client is setting DNS through systemd-resolved. Some VPN clients (especially older ones) still try to directly edit <code>/etc/resolv.conf</code>, which doesn't work when it's a symlink. You might need to configure the VPN to use resolvectl:</p>
<pre><code class="lang-plaintext"># In VPN up script
resolvectl dns $INTERFACE 10.10.10.10
resolvectl domain $INTERFACE ~corp.example.com
</code></pre>
<p><strong>DNS leaking when on VPN?</strong> Check which DNS servers are being used:</p>
<pre><code class="lang-plaintext">resolvectl status
# Look for your VPN interface and verify routing domains are set correctly
</code></pre>
<p><strong>Want to bypass systemd-resolved entirely?</strong> Replace the symlink with a real file:</p>
<pre><code class="lang-plaintext">rm /etc/resolv.conf
echo "nameserver 8.8.8.8" &gt; /etc/resolv.conf
echo "nameserver 8.8.4.4" &gt;&gt; /etc/resolv.conf
</code></pre>
<p>But you'll lose all the advanced features—split DNS, DNSSEC, mDNS, etc.</p>
<p><strong>Cache causing issues?</strong> Flush it:</p>
<pre><code class="lang-plaintext">resolvectl flush-caches
# Or restart the service entirely
systemctl restart systemd-resolved
</code></pre>
<p>This works great when it works. But when it breaks—say, DNS leaks after connecting to a VPN—debugging becomes harder because you now have this abstraction layer between your app and the actual DNS queries. The key is understanding <code>resolvectl status</code> output and watching the journal logs when issues occur.</p>
<h3 id="heading-the-alternatives-still-exist">The alternatives still exist</h3>
<p>Not everyone loves systemd-resolved. You'll still find systems using:</p>
<ul>
<li><p><strong>dnsmasq</strong>: lightweight caching DNS proxy, popular in network appliances</p>
</li>
<li><p><strong>resolvconf</strong>: the older dynamic <code>/etc/resolv.conf</code> updater (not to be confused with resolvectl)</p>
</li>
<li><p><strong>plain old /etc/resolv.conf</strong>: some sysadmins just want simplicity and manage it manually or via configuration management</p>
</li>
</ul>
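<p>For a sense of the contrast, here is what a minimal caching-only dnsmasq setup looks like (a sketch with illustrative upstream servers, not a production config):</p>
<pre><code class="lang-plaintext"># /etc/dnsmasq.conf: minimal caching proxy (illustrative values)
listen-address=127.0.0.1   # only answer local queries
cache-size=1000            # number of cached names
no-resolv                  # don't read /etc/resolv.conf for upstreams
server=1.1.1.1             # forward cache misses here
server=9.9.9.9
</code></pre>
<p>Point <code>/etc/resolv.conf</code> at <code>127.0.0.1</code> and dnsmasq caches and forwards everything else. Simple, but no split DNS, DNSSEC validation, or per-interface routing.</p>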
<h2 id="heading-where-we-are-today">Where we are today</h2>
<p>Modern Linux networking is powerful but complex. On a typical cloud VM running Ubuntu 22.04 or Rocky Linux 9, you might have:</p>
<ul>
<li><p><code>iproute2</code> tools (<code>ip</code>, <code>ss</code>, <code>bridge</code>) for low-level interface management</p>
</li>
<li><p><code>systemd-networkd</code> managing your primary network interface</p>
</li>
<li><p><code>systemd-resolved</code> handling DNS with DNSSEC validation</p>
</li>
<li><p>Docker creating its own bridge networks and iptables rules</p>
</li>
<li><p>NetworkManager installed but disabled</p>
</li>
</ul>
<p>The key insight is that Linux networking evolved from a monolithic model ("one tool per task, everything global") to a layered, namespaced, and delegated model ("multiple domains of control, isolation by default").</p>
<p>This makes sense given how we use Linux today. We're running containers, connecting to VPNs, managing split-horizon DNS, and dealing with cloud networking that changes dynamically. The tooling had to evolve to support these use cases.</p>
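<p>The "isolation by default" part is easy to see first-hand. Here is a quick sketch (requires root; the names and addresses are arbitrary) that builds a second, fully isolated network stack and wires it to the host:</p>
<pre><code class="lang-plaintext"># Create an isolated namespace and a veth pair bridging it to the host
ip netns add demo
ip link add veth-host type veth peer name veth-demo
ip link set veth-demo netns demo

# Address each end and bring the links up
ip addr add 10.200.0.1/24 dev veth-host
ip link set veth-host up
ip netns exec demo ip addr add 10.200.0.2/24 dev veth-demo
ip netns exec demo ip link set veth-demo up

# The namespace has its own interfaces, routes, and firewall state
ip netns exec demo ping -c 1 10.200.0.1

# Clean up
ip netns del demo
</code></pre>
<p>This is essentially the mechanism container runtimes build on, just with more plumbing around it.</p>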
<h2 id="heading-things-to-keep-in-mind">Things to keep in mind</h2>
<p><strong>If you're managing servers</strong>, learn the <code>ip</code> command inside and out. Know how to inspect routing tables, check interface statistics, and understand network namespaces. And pick either NetworkManager or systemd-networkd—running both is usually asking for trouble.</p>
<p><strong>If you're debugging DNS issues</strong>, understand whether systemd-resolved is in the picture. Check <code>resolvectl status</code> before you start editing <code>/etc/resolv.conf</code> (which might be a symlink anyway).</p>
<p><strong>If you're writing automation</strong>, use declarative configurations (systemd-networkd <code>.network</code> files or NetworkManager connection files) rather than calling <code>ip</code> commands in scripts. Your future self will thank you.</p>
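<p>As a concrete example, a declarative systemd-networkd static configuration lives in a single file (interface name and addresses here are illustrative):</p>
<pre><code class="lang-plaintext"># /etc/systemd/network/10-eth0.network (illustrative)
[Match]
Name=eth0

[Network]
Address=192.168.1.50/24
Gateway=192.168.1.1
DNS=192.168.1.1
</code></pre>
<p>Drop it in place, restart <code>systemd-networkd</code>, and the intent is captured in a version-controllable file instead of a script of imperative <code>ip</code> calls.</p>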
<p>The Linux networking stack has gotten more complex, no question. But it's also far more capable than it was 20 years ago. We're now running thousands of isolated network stacks on a single host, dynamically reconfiguring interfaces in response to cloud metadata, and validating DNS responses cryptographically. The old tools couldn't handle that—and now we have tools that can.</p>
<p>Just don't forget to update your muscle memory from <code>ifconfig</code> to <code>ip</code>. That command-not-found error gets old fast.</p>
]]></content:encoded></item><item><title><![CDATA[Git: Move files from one repo to another with history]]></title><description><![CDATA[Background
A lot of times as developers and code maintainers we need to move files/folders between code repos and most of the time we avoid exporting history because of the complexity and issues often faced during the export. However fret not anymore...]]></description><link>https://blog.vivekdhami.com/git-move-repo-files-with-history</link><guid isPermaLink="true">https://blog.vivekdhami.com/git-move-repo-files-with-history</guid><category><![CDATA[Git]]></category><dc:creator><![CDATA[Vivek Dhami]]></dc:creator><pubDate>Thu, 02 Dec 2021 11:00:00 GMT</pubDate><content:encoded><![CDATA[<p><strong>Background</strong></p>
<p>As developers and code maintainers we often need to move files or folders between repositories, and most of the time we skip exporting history because of the complexity and issues that come with it. Fret not: in this post I will show how to easily export files with their history using the git command line and the <strong>filter-repo</strong> command.</p>
<p>In the past, git <em>filter-branch</em> was one of the ways to achieve this. However, due to the issues associated with its history rewriting, it is no longer the preferred option, and the git CLI even prints a warning when you use it.</p>
<p>Some git users use patching as an alternative, but it does not provide a seamless experience either and can get complex. It may suit git experts, but I would not suggest it when a better alternative is available, i.e. <strong>filter-repo</strong>. </p>
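<p>For reference, the patch-based route mentioned above looks roughly like this (the paths are illustrative):</p>
<pre><code class="lang-bash"># In the source repo: export the history touching one directory as patches
git format-patch --root -o /tmp/patches -- path/to/dir

# In the target repo: replay those patches
git am /tmp/patches/*.patch
</code></pre>
<p>This works for simple cases, but renames, merge commits, and binary files quickly make it painful, which is why filter-repo is the better tool here.</p>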
<p><strong>Installation</strong></p>
<p>Unlike <em>filter-branch</em>, <em>filter-repo</em> is not part of the git CLI and needs to be installed separately. </p>
<p>On systems with a package manager, installation is a matter of installing the appropriate package; on Windows it can be a little trickier. The <a target="_blank" href="https://github.com/newren/git-filter-repo/blob/main/INSTALL.md">installation instructions</a> in the project's GitHub repo cover all the peculiarities of the installation.</p>
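<p>For example, one of the documented installation routes is via pip, since filter-repo is a single Python script published on PyPI:</p>
<pre><code class="lang-bash"># Install git-filter-repo from PyPI
python3 -m pip install git-filter-repo
</code></pre>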
<p>After installation, filter-repo is available as a subcommand of the git CLI.</p>
<pre><code class="lang-bash">    git filter-repo &lt;options&gt;
</code></pre>
<p>Among all the available command options, the following are noteworthy for this post:</p>
<ul>
<li><p><code>--path</code>: specifies a folder/file to include in the filter-repo operation.</p>
</li>
<li><p><code>--invert-paths</code>: inverts the selection, i.e. the files/folders specified via <code>--path</code> flags are excluded from the filter-repo operation instead.</p>
</li>
</ul>
<p><strong>Walkthrough</strong></p>
<p>To move files from one repo to another, proceed as follows:</p>
<ol>
<li><p>Clean the source repo</p>
<p> In this step we include/exclude the files/folders and their history using the filter-repo command.</p>
<p> Before starting with the cleanup step, it's good practice to <strong>perform the cleanup in a fresh local clone of the source repo</strong> so the changes cannot impact the original repo (filter-repo itself refuses to run on anything but a fresh clone unless forced).</p>
<pre><code class="lang-bash"> <span class="hljs-comment"># Include file and folder from the source repo</span>
 git filter-repo --path &lt;include-folder&gt; --path &lt;include-file&gt;

 <span class="hljs-comment"># Exclude file and folder from the source repo</span>
 git filter-repo --path &lt;folder&gt; --path &lt;file&gt; --invert-paths
</code></pre>
</li>
<li><p>Add the source repo as a remote source for the target repo</p>
<p> In this step, we move to the target repo directory and add the source repo folder as a remote origin.</p>
<pre><code class="lang-bash"> git remote add &lt;origin-name&gt; &lt;source-repo-folder-path&gt;

 <span class="hljs-comment">#e.g.</span>
 git remote add move ../repo-to-move
</code></pre>
</li>
<li><p>Fetch and merge the changes and history from source repo
 In this step, we fetch changes from the remote source and merge them into the target branch.</p>
<pre><code class="lang-bash"> git fetch &lt;origin-name&gt;

 git branch &lt;branch-name&gt; remotes/&lt;origin-name&gt;/&lt;remote-branch-name&gt;

 git merge &lt;branch-name&gt; --allow-unrelated-histories

 <span class="hljs-comment">#e.g.</span>
 git fetch move

 git branch merge-remote remotes/move/migrate-remote

 git merge merge-remote --allow-unrelated-histories
</code></pre>
</li>
<li><p>Cleanup</p>
<p> In this step, we clean up by removing the remote source and the branch.</p>
<pre><code class="lang-bash"> git remote rm &lt;origin-name&gt;

 git branch -d &lt;branch-name&gt;

 <span class="hljs-comment">#e.g.</span>
 git remote rm move

 git branch -d merge-remote
</code></pre>
</li>
</ol>
<p>That's all folks!!</p>
]]></content:encoded></item><item><title><![CDATA[Azure Bicep: Pump up your azure deployments]]></title><description><![CDATA[Azure Bicep: Pump up your azure deployments
Even though Azure Bicep was announced at Ignite 2020 I was not so keen to try it out because of the preview nature of the cli and my personal preference and past experience with beta phase tools and technol...]]></description><link>https://blog.vivekdhami.com/az-bicep-intro</link><guid isPermaLink="true">https://blog.vivekdhami.com/az-bicep-intro</guid><category><![CDATA[Azure]]></category><category><![CDATA[azure-bicep]]></category><dc:creator><![CDATA[Vivek Dhami]]></dc:creator><pubDate>Fri, 15 Jan 2021 11:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1754469648578/73732bf1-275f-4e5f-ad2a-d7dcd540d395.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h4 id="heading-azure-bicep-pump-up-your-azure-deployments">Azure Bicep: Pump up your azure deployments</h4>
<p>Even though Azure Bicep was announced at Ignite 2020, I was not keen to try it out, given the preview nature of the CLI and my past experience with beta-phase tools and technology. Setting that preference aside, I decided to start learning and using Bicep for managing Azure resource deployments. This post is an accumulation of my learnings so far with Bicep.</p>
<h4 id="heading-overview">Overview</h4>
<p>Bicep is an abstraction over ARM templates, which have been the bread and butter of Azure deployments so far. It is a DSL (Domain Specific Language) for deploying resources to Azure. It is similar to Terraform's HCL (HashiCorp Configuration Language); however, it currently supports only Azure and does not support state management. It is continuously updated to support newer ARM API versions as they become available. You can either generate ARM templates from Bicep files for deployment or deploy Bicep files directly, as Azure Resource Manager now supports deployments from Bicep files as well.</p>
<p>Bicep lets you define your Azure resources in far fewer lines of code than ARM templates, which can become humongous and unmanageable. Being a DSL, Bicep also supports language features that are difficult to accomplish with ARM templates.</p>
<h4 id="heading-installation-bicep-cli">Installation: Bicep cli</h4>
<p>Installation is quick and easy if you already have the Azure CLI installed. Otherwise, install the Azure CLI first, then proceed with the Bicep installation.</p>
<p>There are also <a target="_blank" href="https://github.com/Azure/bicep/blob/main/docs/installing.md">other</a> installation options available for bicep cli. However this post will follow the azure cli based installation procedure and use the set of commands available via azure cli for working with bicep.</p>
<pre><code class="lang-bash"># Upgrade azure cli
az upgrade

# Install bicep command group
az bicep install
</code></pre>
<p>After installing bicep command group as part of azure cli, you can use the following commands: </p>
<pre><code class="lang-bash"># View all the available bicep sub-commands
az bicep --help

Group
    az bicep : Bicep CLI command group.

Commands:
    build           : Build a Bicep file.
    decompile       : Attempt to decompile an ARM template file to a Bicep file.
    format          : Format a Bicep file.
    generate-params : Generate parameters file for a Bicep file.
    install         : Install Bicep CLI.
    list-versions   : List out all available versions of Bicep CLI.
    publish         : Publish a bicep file to a remote module registry.
    restore         : Restore external modules for a bicep file.
    uninstall       : Uninstall Bicep CLI.
    upgrade         : Upgrade Bicep CLI to the latest version.
    version         : Show the installed version of Bicep CLI.

# Show the installed version of Bicep CLI
az bicep version

# Upgrade Bicep CLI to the latest version
az bicep upgrade
</code></pre>
<h4 id="heading-example-usage-decompile-and-deploy">Example usage: Decompile and deploy</h4>
<p>The easiest way to compare bicep and ARM would be via decompiling an existing ARM template to bicep and deploying it to azure.</p>
<p>For this example we will be using the simplest ARM template i.e. <a target="_blank" href="https://github.com/Azure/azure-quickstart-templates/blob/master/quickstarts/microsoft.network/vnet-two-subnets/azuredeploy.json">vNet with two subnets</a> from azure quickstarts samples. To follow along download the file locally and use the following command to decompile and deploy to azure.</p>
<pre><code class="lang-bash"> <span class="hljs-comment"># decompile arm template to bicep</span>
 az bicep decompile -f .\azuredeploy.json

 <span class="hljs-comment"># login to azure</span>
 az login --use-device-code

 <span class="hljs-comment"># validate bicep template</span>
 az deployment group validate -g &lt;rg-name&gt; -f .\azuredeploy.bicep

 <span class="hljs-comment"># deploy bicep template</span>
 az deployment group create -g &lt;rg-name&gt; -f .\azuredeploy.bicep
</code></pre>
<p>If you open the decompiled Bicep template and compare it with the ARM original, you will find the Bicep version more readable, even for this simple example.</p>
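<p>To give a flavor, the vNet from that quickstart boils down to a Bicep resource roughly like the following (trimmed to one subnet for brevity; this is a sketch, not a verbatim copy of the decompiled output):</p>
<pre><code class="lang-plaintext">param location string = resourceGroup().location

resource vnet 'Microsoft.Network/virtualNetworks@2021-02-01' = {
  name: 'VNet1'
  location: location
  properties: {
    addressSpace: {
      addressPrefixes: [
        '10.0.0.0/16'
      ]
    }
    subnets: [
      {
        name: 'Subnet1'
        properties: {
          addressPrefix: '10.0.0.0/24'
        }
      }
    ]
  }
}
</code></pre>
<p>Compare that with the equivalent ARM JSON, with its nested quoting and boilerplate schema fields, and the readability gain is obvious.</p>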
<h4 id="heading-visual-studio-code-bicep-extension">Visual Studio Code: Bicep extension</h4>
<p>One tool that I can highly recommend for bicep is the <a target="_blank" href="https://marketplace.visualstudio.com/items?itemName=ms-azuretools.vscode-bicep">bicep extension</a> for VS Code. It will help you in writing bicep code from scratch using intellisense and autocomplete as well as in finding/highlighting issues within your bicep files.</p>
<h4 id="heading-learning-bicep">Learning bicep</h4>
<p>For learning bicep I recommend the following:</p>
<ol>
<li><a target="_blank" href="https://learn.microsoft.com/en-us/training/paths/fundamentals-bicep/">MS Learn: Bicep fundamentals course</a></li>
<li><a target="_blank" href="https://www.manning.com/books/azure-infrastructure-as-code">Azure IaC Book</a></li>
</ol>
]]></content:encoded></item><item><title><![CDATA[Setup a kubernetes cluster on raspberry pi/s using k3s]]></title><description><![CDATA[This blog post will detail the steps for setting up a kubernetes cluster on your raspberry pi/s using k3s.
k3s: A brief introduction
While it is completely possible to install, manage and run a complete kubernetes distribution (pronounced k8s) on you...]]></description><link>https://blog.vivekdhami.com/install-k3s-cluster-on-rpi</link><guid isPermaLink="true">https://blog.vivekdhami.com/install-k3s-cluster-on-rpi</guid><category><![CDATA[Kubernetes]]></category><category><![CDATA[k3s]]></category><category><![CDATA[Raspberry Pi]]></category><dc:creator><![CDATA[Vivek Dhami]]></dc:creator><pubDate>Wed, 11 Nov 2020 11:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/CkbnIzH_YNo/upload/5208b583eedf4ad52ca07b32c9d1ff2a.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This blog post will detail the steps for setting up a kubernetes cluster on your raspberry pi/s using k3s.</p>
<h3 id="heading-k3s-a-brief-introduction"><strong>k3s: A brief introduction</strong></h3>
<p>While it is completely possible to install, manage and run a complete kubernetes (k8s) distribution on your raspberry pi cluster (or on similar devices), for better performance and resource utilization on a resource-constrained device like the raspberry pi it's advisable to install a lightweight kubernetes distribution like <a target="_blank" href="https://k3s.io">k3s</a> by Rancher.</p>
<p>From k3s website:</p>
<blockquote>
<p>K3s is a highly available, certified Kubernetes distribution designed for production workloads in unattended, resource-constrained, remote locations or inside IoT appliances.</p>
<p>K3s is packaged as a single &lt;40MB binary that reduces the dependencies and steps needed to install, run and auto-update a production Kubernetes cluster.</p>
<p>Both ARM64 and ARMv7 are supported with binaries and multiarch images available for both. K3s works great from something as small as a Raspberry Pi to an AWS a1.4xlarge 32GiB server.</p>
</blockquote>
<p><img src="https://k3s.io/img/how-it-works-k3s-revised.svg" alt class="image--center mx-auto" /></p>
<h3 id="heading-prepare-raspberry-pi-for-k3s-installation"><strong>Prepare raspberry pi for k3s installation</strong></h3>
<p>Before we start installing k3s on raspberry pi we need to prepare it for the installation as follows:</p>
<ol>
<li><p><strong>Download Raspberry Pi Imager</strong></p>
<p> Download the Raspberry Pi Imager from https://www.raspberrypi.org/software/, choosing the version most appropriate for you. Raspberry Pi Imager can be installed as a desktop client or as a package for use with your linux cli.</p>
<p> For this article we will be using the desktop client version.</p>
</li>
<li><p><strong>Format SD card</strong></p>
<p> Insert the SD card in your computer and format the card using the erase option available with Raspberry Pi Imager.</p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754469325663/db1bba58-6982-40fc-bd72-d240420f50c0.png" alt class="image--center mx-auto" /></p>
</li>
<li><p><strong>Install Raspberry Pi OS</strong></p>
<p> With the formatted SD card still inserted in your computer, install the Raspberry Pi OS using the available option in the Raspberry Pi Imager. For k3s setup it is recommended to use the lite (headless) version of the Raspberry Pi OS instead of the full version.</p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754469088571/02612ca5-d324-4041-97b5-6546fa2d3db2.png" alt class="image--center mx-auto" /></p>
<p> It will take some time for the Raspberry Pi OS to be installed on your SD card.</p>
<p> After the installation is completed you can remove and reinsert SD card on your computer.</p>
<p> After reinserting the SD card you will see two partitions (or drives) in your computer: boot and rootfs (or SDHC in windows).</p>
</li>
<li><p><strong>Enable SSH</strong></p>
<p> In order to enable SSH on your raspberry pi create a file named ssh in the boot partition of your SD card.</p>
<pre><code class="lang-bash"> sudo touch &lt;path-to-boot-partition&gt;/ssh
</code></pre>
</li>
<li><p><strong>Power up and customize raspberry pi</strong></p>
<p> Insert the SD card in your raspberry pi and power it up.</p>
<p> After the raspberry pi is up and running find the IP address assigned to the raspberry pi using the admin UI on your router and SSH into it.</p>
<pre><code class="lang-bash"> ssh pi@&lt;ip-address&gt;
</code></pre>
<p> The default password for <strong><em>pi</em></strong> user is <strong><em>raspberry</em></strong>.</p>
<p> Once you are logged in, you can change the various raspberry pi settings like password, hostname, localization, time zone, and wifi using the raspi-config utility:</p>
<pre><code class="lang-bash"> sudo raspi-config
</code></pre>
<p> Set up a static IP for your pi by updating the entry in the file <strong><em>/etc/dhcpcd.conf</em></strong> as follows:</p>
<pre><code class="lang-plaintext"> interface eth0
 static ip_address=192.168.1.50/24
 static routers=192.168.1.1
 static domain_name_servers=192.168.1.1
</code></pre>
<p> Reboot and ssh into the pi using the static IP address.</p>
<p> Enable container features in the kernel by editing the file <strong><em>/boot/cmdline.txt</em></strong> and adding the following to the end of the line:</p>
<pre><code class="lang-plaintext"> cgroup_enable=cpuset cgroup_memory=1 cgroup_enable=memory
</code></pre>
<p> Reboot the pi.</p>
<p> As an additional option you can use your existing ssh keys to login to rPi by copying over the ssh key using the following command:</p>
<pre><code class="lang-bash"> ssh-copy-id pi@&lt;ip-address&gt;
</code></pre>
<p> You can also generate a new ssh key using ssh-keygen utility. After copying over your ssh key you will no longer be prompted for a password while using ssh with your rPi.</p>
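<p> For instance, a dedicated key for the cluster can be generated as follows (the file name and comment are arbitrary):</p>
<pre><code class="lang-bash"> <span class="hljs-comment"># generate an ed25519 key pair for the pi cluster</span>
 ssh-keygen -t ed25519 -f ~/.ssh/rpi-cluster -C "rpi cluster key"
</code></pre>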
<p> By default the ssh-copy-id copies over the id_rsa.pub file. However you can specify a different ssh key using the -i flag as shown below:</p>
<pre><code class="lang-bash"> ssh-copy-id -i .ssh/&lt;other-key&gt;.pub pi@192.168.1.50
</code></pre>
<p> After the ssh key has been copied over to the rPi, you can login as follows using the specific key:</p>
<pre><code class="lang-bash"> ssh pi@192.168.1.50 -i .ssh/&lt;other-key&gt;
</code></pre>
</li>
</ol>
<h3 id="heading-install-k3s-on-rpi"><strong>Install k3s on rPi</strong></h3>
<p>Now that the raspberry pi has been prepared for k3s installation, we will proceed with the installation using one of the following options:</p>
<ol>
<li><p><strong>Using latest stable k3s release</strong></p>
<p> Login to your master node rPi using ssh and install the latest stable version of k3s using the following command:</p>
<pre><code class="lang-bash"> curl -sfL https://get.k3s.io | sh -
</code></pre>
<p> After the installation is complete you can check the status of k3s using the following command:</p>
<pre><code class="lang-bash"> sudo systemctl status k3s
</code></pre>
<p> If the status shows up as active (running) then k3s has been installed successfully.</p>
<p> From the master node copy the node-token from its location:</p>
<pre><code class="lang-bash"> sudo cat /var/lib/rancher/k3s/server/node-token
</code></pre>
<p> This token will be used by the other nodes to join the k3s cluster.</p>
<p> Login to the other nodes and install k3s as follows:</p>
<pre><code class="lang-bash"> <span class="hljs-built_in">export</span> K3S_URL=<span class="hljs-string">"https://&lt;MASTER-NODE-IP&gt;:6443"</span>

 <span class="hljs-built_in">export</span> K3S_TOKEN=<span class="hljs-string">"&lt;TOKEN-COPIED_FROM-MASTER-NODE&gt;"</span>

 curl -sfL https://get.k3s.io | sh -
</code></pre>
<p> You can check the status of k3s installation on the worker/agent nodes using the following command:</p>
<pre><code class="lang-bash"> sudo systemctl status k3s-agent
</code></pre>
<p> If the status shows as active (running) then k3s is up and running on the worker nodes.</p>
<p> Login to the master node and run the following command to check the status of the nodes:</p>
<pre><code class="lang-bash"> sudo kubectl get nodes
</code></pre>
<p> The command should output the master node along with all the worker nodes that have been added to the cluster.</p>
<p> If the above command returns an error or fails to run, then restart the k3s service as follows:</p>
<pre><code class="lang-bash"> sudo systemctl restart k3s
</code></pre>
<p> Also, for troubleshooting purposes it's always helpful to analyze the output of the following commands:</p>
<pre><code class="lang-bash"> sudo systemctl status k3s

 journalctl -xe
</code></pre>
<p> For uninstalling k3s use the following commands:</p>
<pre><code class="lang-bash"> # on server
 /usr/local/bin/k3s-uninstall.sh

 # on agent
 /usr/local/bin/k3s-agent-uninstall.sh
</code></pre>
</li>
<li><p><strong>Using k3sup</strong></p>
<p> k3sup (pronounced ketchup) is a utility that simplifies the k3s installation process on both server and agent nodes. Additionally, k3sup lets you export the k3s cluster configuration to your system so you can easily access your new cluster.</p>
<p> k3sup also helps with installing apps in your cluster using yaml files or helm charts; however, that is not covered in this article.</p>
<p> k3sup can be installed on your system via the following commands:</p>
<pre><code class="lang-bash"> # install the k3sup binary
 curl -sLS https://get.k3sup.dev | sh
 sudo install k3sup /usr/local/bin/

 # test the k3sup installation
 k3sup --help
</code></pre>
<p> You can install k3s on a rPi master node using the following commands:</p>
<pre><code class="lang-bash"> # check the available options
 k3sup install --help

 # install using ssh key and merge to local context at the mentioned path
 k3sup install --ip &lt;MASTER-NODE-IP&gt; --ssh-key ~/.ssh/&lt;key-name&gt; --user &lt;user&gt; --merge --local-path $HOME/.kube/config --context k3s-cluster
</code></pre>
<p> After k3s has been installed on the master node you can check the status of k3s using the following command:</p>
<pre><code class="lang-bash"> kubectl get nodes
</code></pre>
<p> The above command should show the master node as available and Ready. Before running it, ensure that your new k3s cluster is set as the current context, which you can check with the following command:</p>
<pre><code class="lang-bash"> kubectl config current-context
</code></pre>
<p> Now we can proceed to install k3s on the worker (or agent) nodes using k3sup with the following commands:</p>
<pre><code class="lang-bash"> k3sup join --ip &lt;WORKER-NODE-IP&gt; --server-ip &lt;MASTER-NODE-IP&gt; --ssh-key ~/.ssh/&lt;key-name&gt; --user &lt;user&gt;
</code></pre>
<p> After k3s has been installed on the agent nodes, you can check the installation using the following command:</p>
<pre><code class="lang-bash"> kubectl get nodes
</code></pre>
<p> It should list all the nodes (both master and agent) that are part of the cluster and show them as available and Ready.</p>
</li>
</ol>
<p><strong><em>That's all folks, k3s is installed and running on your raspberry pi and you can now install applications on your cluster.</em></strong></p>
]]></content:encoded></item><item><title><![CDATA[Upgrade WSL/WSL2 Ubuntu version to 20.04 LTS]]></title><description><![CDATA[Its the time of the year again to upgrade Ubuntu version in WSL/WSL2 since the Ubuntu 20.04 LTS came out last week.
Please follow along the following steps in your WSL console to upgrade to the new version:

Check installed Ubuntu version
  Take a no...]]></description><link>https://blog.vivekdhami.com/upgrade-wslwsl2-ubuntu-version-to-2004-lts</link><guid isPermaLink="true">https://blog.vivekdhami.com/upgrade-wslwsl2-ubuntu-version-to-2004-lts</guid><category><![CDATA[WSL]]></category><dc:creator><![CDATA[Vivek Dhami]]></dc:creator><pubDate>Fri, 01 May 2020 00:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/xbEVM6oJ1Fs/upload/24ffd88e321fb5206ac99c383fc096c3.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Its the time of the year again to upgrade Ubuntu version in WSL/WSL2 since the Ubuntu 20.04 LTS came out last week.</p>
<p>Please follow along the following steps in your WSL console to upgrade to the new version:</p>
<ul>
<li><p>Check installed Ubuntu version</p>
<p>  Take a note of your current Ubuntu version by running the following command:</p>
<pre><code class="lang-bash">  lsb_release -a
</code></pre>
<pre><code class="lang-bash">  No LSB modules are available.
  Distributor ID: Ubuntu
  Description:    Ubuntu 18.04.4 LTS
  Release:        18.04
  Codename:       bionic
</code></pre>
<p>  <em>Note: Upgrading from 18.04 to 20.04 should work seamlessly. However, if you are on an older version, the suggested path is to upgrade in steps, for example first from 16.04 to 18.04.</em></p>
</li>
<li><p>Upgrade installed packages</p>
<pre><code class="lang-bash">  sudo apt update
  sudo apt list --upgradable
  sudo apt upgrade
</code></pre>
</li>
<li><p>Remove unused packages</p>
<pre><code class="lang-bash">  sudo apt --purge autoremove
</code></pre>
</li>
<li><p>Install update-manager-core package if not already installed</p>
<pre><code class="lang-bash">  sudo apt install update-manager-core
</code></pre>
</li>
<li><p>Upgrade to 20.04</p>
<pre><code class="lang-bash">  sudo do-release-upgrade
</code></pre>
<p>  If you receive the following message:</p>
<pre><code class="lang-bash">  Checking <span class="hljs-keyword">for</span> a new Ubuntu release
  There is no development version of an LTS available.
  To upgrade to the latest non-LTS development release 
  <span class="hljs-built_in">set</span> Prompt=normal <span class="hljs-keyword">in</span> /etc/update-manager/release-upgrades.
</code></pre>
<p>  Then force the upgrade using the following command:</p>
<pre><code class="lang-bash">  sudo do-release-upgrade -d
</code></pre>
</li>
<li><p>Finally check the version after upgrade is done:</p>
<pre><code class="lang-bash">  lsb_release -a
</code></pre>
<p>  and you should see output similar to the following:</p>
<pre><code class="lang-bash">  No LSB modules are available.
  Distributor ID: Ubuntu
  Description:    Ubuntu 20.04 LTS
  Release:        20.04
  Codename:       focal
</code></pre>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Kubernetes: Working with multiple contexts]]></title><description><![CDATA[A context in kubernetes defines the current scope under which all the kubectl commands will run. In simple words, it defines the cluster against which all the kubectl commands will execute.
The contexts are easy to work with as long as you have a fix...]]></description><link>https://blog.vivekdhami.com/kubernetes-working-with-multiple-contexts</link><guid isPermaLink="true">https://blog.vivekdhami.com/kubernetes-working-with-multiple-contexts</guid><category><![CDATA[Kubernetes]]></category><dc:creator><![CDATA[Vivek Dhami]]></dc:creator><pubDate>Sat, 02 Feb 2019 00:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/jOqJbvo1P9g/upload/91fb19809da08212d54c2863f6f8e2f7.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>A context in kubernetes defines the current scope under which all the kubectl commands will run. In simple words, it defines the cluster against which all the kubectl commands will execute.</p>
<p>Contexts are easy to work with as long as you have a fixed number of contexts and configuration files. However, as the number of contexts grows, they become harder to manage. In this post I will try to demystify working with and managing multiple contexts.</p>
<p>A list of all available contexts can be fetched via:</p>
<pre><code class="lang-bash">kubectl config get-contexts
</code></pre>
<p>Also the current context in use can be checked via:</p>
<pre><code class="lang-bash">kubectl config current-context
</code></pre>
<p>Similarly the current context can be switched to another one via:</p>
<pre><code class="lang-bash">kubectl config use-context &lt;context-name&gt;
</code></pre>
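<p>If you only need a different context for a single command, you can also pass it inline via kubectl's global <code>--context</code> flag instead of switching, for example:</p>
<pre><code class="lang-bash"># Run one command against another context without changing the current one
kubectl --context=&lt;context-name&gt; get pods
</code></pre>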
<p>All the commands above work as long as all contexts are present in the same configuration file, which is usually located at <strong>$HOME/.kube/config</strong>. Whether that is the case depends on how each context was downloaded to your machine. When contexts are downloaded to different locations, the following approaches are preferred for using them:</p>
<ul>
<li><p>Temporary Context Switching</p>
<p> You can temporarily switch contexts by setting the <strong>KUBECONFIG</strong> environment variable to the desired kubeconfig location. The switch lasts only for the current shell session and is lost once you close your terminal/console.</p>
<pre><code class="lang-bash"> <span class="hljs-built_in">export</span> KUBECONFIG=<span class="hljs-variable">$HOME</span>/.newkube/config
 kubectl config view
</code></pre>
</li>
</ul>
<ul>
<li><p>Merging Context Configuration Files</p>
<p> Merging context configuration files is preferred if you would like to keep all your contexts in a single configuration file. The only downside of this approach is that the file can grow very long as the number of contexts increases. When contexts are downloaded from clusters in GKE, AKS, or EKS using their respective CLIs, the context configuration is usually merged into the current kubeconfig automatically.</p>
<p> However to manually merge the context files, please use the following commands:</p>
<pre><code class="lang-bash"> <span class="hljs-comment"># Backup the current kube configuration</span>
 cp <span class="hljs-variable">$HOME</span>/.kube/config <span class="hljs-variable">$HOME</span>/.kube/config.backup.$(date +%Y-%m-%d.%H:%M:%S)

 <span class="hljs-comment"># Include all the configuration files to merge in KUBECONFIG</span>
 <span class="hljs-built_in">export</span> KUBECONFIG=<span class="hljs-variable">$HOME</span>/.kube/config:File1:File2

 <span class="hljs-comment"># Merge the configurations and move it to the default configuration location</span>
 kubectl config view --merge --flatten &gt; ~/.kube/merged_kubeconfig &amp;&amp; mv ~/.kube/merged_kubeconfig ~/.kube/config

 <span class="hljs-comment"># View all available contexts</span>
 kubectl config get-contexts
</code></pre>
</li>
</ul>
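<p>After merging, stale entries can be removed to keep the configuration file manageable. These are standard <code>kubectl config</code> subcommands; substitute your own context, cluster, and user names:</p>
<pre><code class="lang-bash"># Remove a context that is no longer needed
kubectl config delete-context &lt;context-name&gt;

# Remove the associated cluster and user entries as well
kubectl config delete-cluster &lt;cluster-name&gt;
kubectl config unset users.&lt;user-name&gt;
</code></pre>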
]]></content:encoded></item><item><title><![CDATA[Git: Sync fork with upstream changes]]></title><description><![CDATA[A forked repository can be synced with the upstream one as follows:

Clone the forked repository locally if not already done.

git clone https://github.com/OWNER/FORKED-REPOSITORY.git


List the remote repository configured for your fork:

git remote...]]></description><link>https://blog.vivekdhami.com/git-sync-fork-with-upstream-changes</link><guid isPermaLink="true">https://blog.vivekdhami.com/git-sync-fork-with-upstream-changes</guid><category><![CDATA[Git]]></category><dc:creator><![CDATA[Vivek Dhami]]></dc:creator><pubDate>Thu, 07 Mar 2013 00:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/842ofHC6MaI/upload/146b0e09bbee634e093dc093b5c5b150.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>A forked repository can be synced with the upstream one as follows:</p>
<ol>
<li>Clone the forked repository locally if not already done.</li>
</ol>
<pre><code class="lang-bash">git <span class="hljs-built_in">clone</span> https://github.com/OWNER/FORKED-REPOSITORY.git
</code></pre>
<ol start="2">
<li>List the remote repository configured for your fork:</li>
</ol>
<pre><code class="lang-bash">git remote -v
</code></pre>
<ol start="3">
<li>Add a new remote upstream repository that will be synced with the local fork:</li>
</ol>
<pre><code class="lang-bash">git remote add upstream https://github.com/OWNER/REPOSITORY.git
</code></pre>
<ol start="4">
<li>Verify the configured remote repository for your fork:</li>
</ol>
<pre><code class="lang-bash">git remote -v
</code></pre>
<ol start="5">
<li>Pull the latest changes from upstream remote:</li>
</ol>
<pre><code class="lang-bash">git fetch upstream
</code></pre>
<ol start="6">
<li>Checkout to your local master branch of the forked repository:</li>
</ol>
<pre><code class="lang-bash">git checkout master
</code></pre>
<ol start="7">
<li>Merge upstream/master to local master branch:</li>
</ol>
<pre><code class="lang-bash">git merge upstream/master
</code></pre>
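<p>Alternatively, if you prefer a linear history, the merge in step 7 can be replaced with a rebase onto the upstream branch. Note that rebasing rewrites your local commits, so a force push may be required afterwards:</p>
<pre><code class="lang-bash"># Replay local commits on top of the upstream branch
git rebase upstream/master
</code></pre>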
<ol start="8">
<li>Push the merged changes to your remote fork (the merge in step 7 already creates a commit when one is needed):</li>
</ol>
<pre><code class="lang-bash">git push origin master
</code></pre>
]]></content:encoded></item><item><title><![CDATA[Git: Submodule Init and Update]]></title><description><![CDATA[For a repository with submodules all the submodules can be pulled down locally for the first time using:
git submodule update --init --recursive

Subsequently, submodules can be updated with remote changes using:
git submodule update --recursive --re...]]></description><link>https://blog.vivekdhami.com/git-submodule-init-and-update</link><guid isPermaLink="true">https://blog.vivekdhami.com/git-submodule-init-and-update</guid><category><![CDATA[Git]]></category><dc:creator><![CDATA[Vivek Dhami]]></dc:creator><pubDate>Sat, 02 Feb 2013 00:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/EUzk9BIEq6M/upload/3e582eb75099cf6d221735536c0ff0a8.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>For a repository with submodules all the submodules can be pulled down locally for the first time using:</p>
<pre><code class="lang-bash">git submodule update --init --recursive
</code></pre>
<p>Subsequently, submodules can be updated with remote changes using:</p>
<pre><code class="lang-bash">git submodule update --recursive --remote
</code></pre>
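<p>As a shortcut, a repository and all of its submodules can also be cloned in a single step:</p>
<pre><code class="lang-bash"># Clone and initialise all submodules recursively in one command
git clone --recurse-submodules https://github.com/OWNER/REPOSITORY.git
</code></pre>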
]]></content:encoded></item><item><title><![CDATA[MySQL–Restoring large database on windows]]></title><description><![CDATA[Recently while trying to restore database from a large database backup file I encountered few issues with MySQL.
So for reference I am posting the solution steps for resolving these issues on windows :

Initiate the restoration process from command p...]]></description><link>https://blog.vivekdhami.com/mysqlrestoring-large-database-on-windows</link><guid isPermaLink="true">https://blog.vivekdhami.com/mysqlrestoring-large-database-on-windows</guid><category><![CDATA[MySQL]]></category><dc:creator><![CDATA[Vivek Dhami]]></dc:creator><pubDate>Fri, 17 Aug 2012 00:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/Y9kOsyoWyaU/upload/b1a4093b1b58cb256a7029d4b3675a98.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Recently, while trying to restore a database from a large backup file, I ran into a few issues with MySQL.</p>
<p>For reference, here are the steps I used to resolve these issues on Windows:</p>
<ol>
<li>Initiate the restoration process from command prompt as follows:</li>
</ol>
<pre><code class="lang-bash">mysql -u root -p &lt; [sql file path]
</code></pre>
<p>You will then be prompted for the root account’s password, after which the server will begin restoring the database from the backup. However, if the MySQL server’s configuration has not been adjusted to handle large backups, the restoration will fail with the following message:</p>
<pre><code class="lang-bash">Error 2006: MySQL server has gone away.
</code></pre>
<ol start="2">
<li><p>Stop the MySQL service. On Windows, MySQL is generally installed as a service, so open the Services window (run <strong>services.msc</strong> from the Run dialog) and stop the relevant service.</p>
</li>
<li><p>Locate <strong>my.ini</strong> file on your system. On <strong>Windows XP</strong> the default location is:</p>
</li>
</ol>
<pre><code class="lang-bash">&lt;Drive&gt;:\Documents and Settings\All Users\Application Data\MySQL\MySQL Server &lt;version&gt;
</code></pre>
<p>For <strong>Windows 7</strong> the default location is:</p>
<pre><code class="lang-bash">&lt;Drive&gt;:\ProgramData\MySQL\MySQL Server &lt;version&gt;
</code></pre>
<ol start="4">
<li>Open <strong>my.ini</strong> file and add the following:</li>
</ol>
<pre><code class="lang-bash">max_allowed_packet = &lt;Size&gt;M
wait_timeout = &lt;time <span class="hljs-keyword">in</span> seconds&gt;
</code></pre>
<p>Use a scaled-down value for <strong>max_allowed_packet</strong>; a value around half the size of the backup file is generally a good starting point. Save the file and restart the MySQL server by starting the stopped service. Then restart the database restoration process as described in step 1. If the error still persists, reduce the max_allowed_packet size further in step 4 and repeat from step 1.</p>
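<p>For example, for a backup file of roughly 1 GB, a starting point in <strong>my.ini</strong> might look like the following (both values are illustrative, not recommendations):</p>
<pre><code class="lang-bash">max_allowed_packet = 512M
wait_timeout = 28800
</code></pre>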
<p>Following are some links for reference:</p>
<p>http://dev.mysql.com/doc/refman/5.0/en/server-system-variables.html</p>
<p>http://bogdan.org.ua/2008/12/25/how-to-fix-mysql-server-has-gone-away-error-2006.html</p>
<p>http://wpmu.org/how-to-backup-and-import-a-very-large-wordpress-mysql-database/</p>
]]></content:encoded></item></channel></rss>