
etcd-manager fails to start after kOps upgrade 1.33.1 → 1.34.1 #17780

@gustav-b

Description


/kind bug

1. What kops version are you running? The command kops version will display
this information.

Before upgrade:

Client version: 1.34.1
Last applied server version: 1.33.1

After failed upgrade:

Client version: 1.34.1
Last applied server version: 1.34.1

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

Component    Version
Kubernetes   1.34.2
OS           Flatcar Stable 4230.2.4
containerd   v1.7.23
etcd         3.5.21

3. What cloud provider are you using?

AWS

4. What commands did you run? What is the simplest way to reproduce this issue?

kops update cluster --yes
kops rolling-update cluster --instance-group master --control-plane-interval 1s --cloudonly --yes
kops validate cluster --wait 15m

5. What happened after the commands executed?

The control plane never comes up because etcd-manager fails to start.
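
The failure is visible directly on the control-plane node. A minimal check (assuming SSM or SSH access to the node):

# confirm that no etcd-manager container was ever created
crictl ps -a | grep etcd-manager
# inspect recent kubelet and containerd activity
journalctl -u kubelet -u containerd --no-pager | tail -n 200

No etcd-manager container ever shows up in crictl ps -a; the relevant kubelet/containerd log lines are in section 8 below.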

6. What did you expect to happen?

The control plane should come up again.

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

Cluster manifest
apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  generation: 1
  name: dev.k8s.local
spec:
  DisableSubnetTags: true
  api:
    loadBalancer:
      class: Network
      idleTimeoutSeconds: 3600
      subnets:
      - name: dev-internal
      type: Internal
  authentication:
    aws:
      image: public.ecr.aws/eks-distro/kubernetes-sigs/aws-iam-authenticator:v0.7.7-eks-1-32-28
  authorization:
    rbac: {}
  certManager:
    enabled: true
    hostedZoneIDs:
    - <redacted>
    - <redacted>
  channel: stable
  cloudProvider: aws
  clusterAutoscaler:
    cpuRequest: 100m
    enabled: true
    expander: least-waste
    memoryRequest: 300Mi
  configBase: s3://<redacted>-dev-kops-state/dev.k8s.local
  etcdClusters:
  - etcdMembers:
    - instanceGroup: master
      name: a
    name: main
    version: 3.5.21
  - etcdMembers:
    - instanceGroup: master
      name: a
    name: events
    version: 3.5.21
  externalPolicies:
    master:
    - arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
    node:
    - arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
  fileAssets:
  - content: |
      [Socket]
      ListenStream=
      ListenStream=127.0.0.1:22
      FreeBind=true
    mode: "0644"
    name: sshd-restrict
    path: /etc/systemd/system/sshd.socket.d/10-sshd-restrict.conf
  - content: |
      apiVersion: v1
      kind: Config
      clusters:
        - name: audit.dev.k8s.local
          cluster:
            server: http://:30009/k8s-audit
      contexts:
        - name: webhook
          context:
            cluster: audit.dev.k8s.local
            user: ""
      current-context: webhook
      preferences: {}
      users: []
    name: audit-webhook-config
    path: /etc/kubernetes/audit/webhook-config.yaml
    roles:
    - ControlPlane
  - content: |
      apiVersion: audit.k8s.io/v1 # This is required.
      kind: Policy
      # Don't generate audit events for all requests in RequestReceived stage.
      omitStages:
        - "RequestReceived"
      rules:
        # Log pod changes at RequestResponse level
        - level: RequestResponse
          resources:
          - group: ""
            # Resource "pods" doesn't match requests to any subresource of pods,
            # which is consistent with the RBAC policy.
            resources: ["pods", "deployments"]
        - level: RequestResponse
          resources:
          - group: "rbac.authorization.k8s.io"
            # Resource "pods" doesn't match requests to any subresource of pods,
            # which is consistent with the RBAC policy.
            resources: ["clusterroles", "clusterrolebindings"]
        # Log "pods/log", "pods/status" at Metadata level
        - level: Metadata
          resources:
          - group: ""
            resources: ["pods/log", "pods/status"]
        # Don't log requests to a configmap called "controller-leader"
        - level: None
          resources:
          - group: ""
            resources: ["configmaps"]
            resourceNames: ["controller-leader"]
        # Don't log watch requests by the "system:kube-proxy" on endpoints or services
        - level: None
          users: ["system:kube-proxy"]
          verbs: ["watch"]
          resources:
          - group: "" # core API group
            resources: ["endpoints", "services"]
        # Don't log authenticated requests to certain non-resource URL paths.
        - level: None
          userGroups: ["system:authenticated"]
          nonResourceURLs:
          - "/api*" # Wildcard matching.
          - "/version"
        # Log the request body of configmap changes in kube-system.
        - level: Request
          resources:
          - group: "" # core API group
            resources: ["configmaps"]
          # This rule only applies to resources in the "kube-system" namespace.
          # The empty string "" can be used to select non-namespaced resources.
          namespaces: ["kube-system"]
        # Log configmap changes in all other namespaces at the RequestResponse level.
        - level: RequestResponse
          resources:
          - group: "" # core API group
            resources: ["configmaps"]
        # Log secret changes in all other namespaces at the Metadata level.
        - level: Metadata
          resources:
          - group: "" # core API group
            resources: ["secrets"]
        # Log all other resources in core and extensions at the Request level.
        - level: Request
          resources:
          - group: "" # core API group
          - group: "extensions" # Version of group should NOT be included.
        # A catch-all rule to log all other requests at the Metadata level.
        - level: Metadata
          # Long-running requests like watches that fall under this rule will not
          # generate an audit event in RequestReceived.
          omitStages:
            - "RequestReceived"
    name: audit-policy-config
    path: /etc/kubernetes/audit/policy-config.yaml
    roles:
    - ControlPlane
  hooks:
  - before:
    - update-engine.service
    manifest: |
      Type=oneshot
      ExecStartPre=/usr/bin/systemctl mask --now update-engine.service
      ExecStartPre=/usr/bin/systemctl mask --now locksmithd.service
      ExecStart=/usr/bin/systemctl reset-failed update-engine.service
    name: disable-automatic-updates.service
  - manifest: |
      Type=oneshot
      # Prune all unused docker images older than 7 days
      ExecStart=/usr/bin/docker system prune -af --filter "until=168h"
    name: docker-prune.service
    requires:
    - docker.service
  - manifest: |
      [Unit]
      Description=Prune docker daily

      [Timer]
      OnCalendar=daily
      Persistent=true

      [Install]
      WantedBy=timers.target
    name: docker-prune.timer
    useRawManifest: true
  - before:
    - protokube.service
    manifest: |-
      Type=oneshot
      ExecStart=/usr/bin/systemctl restart sshd.socket
    name: sshd-socket-restart.service
  iam:
    allowContainerRegistry: true
    legacy: false
    serviceAccountExternalPermissions:
      <redacted>
    useServiceAccountExternalPermissions: true
  kubeAPIServer:
    auditPolicyFile: /etc/kubernetes/audit/policy-config.yaml
    auditWebhookBatchMaxWait: 5s
    auditWebhookConfigFile: /etc/kubernetes/audit/webhook-config.yaml
  kubelet:
    anonymousAuth: false
    authenticationTokenWebhook: true
    authorizationMode: Webhook
    hostnameOverride: '@hostname'
  kubernetesApiAccess:
  - <redacted>
  kubernetesVersion: 1.34.2
  masterPublicName: api.dev.k8s.local
  networkCIDR: <redacted>
  networkID: vpc-3d03f75b
  networking:
    cilium:
      enableL7Proxy: true
      hubble:
        enabled: true
  nonMasqueradeCIDR: 100.64.0.0/10
  podIdentityWebhook:
    enabled: true
  rollingUpdate:
    maxSurge: 2
  serviceAccountIssuerDiscovery:
    discoveryStore: <redacted>
    enableAWSOIDCProvider: true
  snapshotController:
    enabled: true
  sshKeyName: dev
  subnets:
  - egress: <redacted>
    id: <redacted>
    name: dev-internal
    type: Private
    zone: eu-west-1a
  - egress: <redacted>
    id: <redacted>
    name: dev-private
    type: Private
    zone: eu-west-1a
  - id: <redacted>
    name: dev-public
    type: Utility
    zone: eu-west-1a
  topology:
    dns:
      type: None

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2025-11-22T21:21:09Z"
  labels:
    kops.k8s.io/cluster: dev.k8s.local
  name: master
spec:
  autoscale: false
  image: ami-02d94ae5d4360b407
  instanceMetadata:
    httpPutResponseHopLimit: 1
    httpTokens: required
  machineType: t4g.medium
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master
    tier: master
  role: Master
  rootVolumeSize: 16
  rootVolumeType: gp3
  subnets:
  - dev-internal

---

<redacted>

8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.

Here are the logs of kubelet and containerd on the master node (journalctl -u kubelet -u containerd > fail.log): fail.log

I notice that the container never starts (it does not even appear in crictl ps -a). Here are the logs from the attempt to start the container:

Nov 28 22:01:40 ip-10-1-45-152 containerd[2429]: time="2025-11-28T22:01:40.740709278Z" level=info msg="ImageCreate event name:\"registry.k8s.io/etcd-manager/etcd-manager-slim@sha256:3b8eeeef21b10f32ccd5ca58394d5f48fd3cacc40b6db7160e52fd7f0af5290d\" labels:{key:\"io.cri-containerd.image\" value:\"managed\"}"
Nov 28 22:01:40 ip-10-1-45-152 containerd[2429]: time="2025-11-28T22:01:40.789435941Z" level=info msg="stop pulling image registry.k8s.io/etcd-manager/etcd-manager-slim@sha256:3b8eeeef21b10f32ccd5ca58394d5f48fd3cacc40b6db7160e52fd7f0af5290d: active requests=0, bytes read=85783858"
Nov 28 22:01:40 ip-10-1-45-152 containerd[2429]: time="2025-11-28T22:01:40.833518870Z" level=info msg="ImageCreate event name:\"sha256:d3799cccd18379e55a8bd60d26c78eef5112fa3aba4dd4d5660dbf3b4e853e83\" labels:{key:\"io.cri-containerd.image\" value:\"managed\"}"
Nov 28 22:01:40 ip-10-1-45-152 containerd[2429]: time="2025-11-28T22:01:40.835087313Z" level=info msg="Pulled image \"registry.k8s.io/etcd-manager/etcd-manager-slim:v3.0.20250917@sha256:3b8eeeef21b10f32ccd5ca58394d5f48fd3cacc40b6db7160e52fd7f0af5290d\" with image id \"sha256:d3799cccd18379e55a8bd60d26c78eef5112fa3aba4dd4d5660dbf3b4e853e83\", repo tag \"\", repo digest \"registry.k8s.io/etcd-manager/etcd-manager-slim@sha256:3b8eeeef21b10f32ccd5ca58394d5f48fd3cacc40b6db7160e52fd7f0af5290d\", size \"85782882\" in 4.969761125s"
Nov 28 22:01:40 ip-10-1-45-152 containerd[2429]: time="2025-11-28T22:01:40.835133016Z" level=info msg="PullImage \"registry.k8s.io/etcd-manager/etcd-manager-slim:v3.0.20250917@sha256:3b8eeeef21b10f32ccd5ca58394d5f48fd3cacc40b6db7160e52fd7f0af5290d\" returns image reference \"sha256:d3799cccd18379e55a8bd60d26c78eef5112fa3aba4dd4d5660dbf3b4e853e83\""
Nov 28 22:01:40 ip-10-1-45-152 containerd[2429]: time="2025-11-28T22:01:40.836774296Z" level=info msg="PullImage \"registry.k8s.io/etcd-manager/etcd-manager-slim:v3.0.20250917@sha256:3b8eeeef21b10f32ccd5ca58394d5f48fd3cacc40b6db7160e52fd7f0af5290d\""
Nov 28 22:01:40 ip-10-1-45-152 containerd[2429]: time="2025-11-28T22:01:40.873277155Z" level=info msg="CreateContainer within sandbox \"d73a73fd99645a1880cfdf879e0d36f1059f479dc45a5afc441c5790a81ba775\" for container &ContainerMetadata{Name:etcd-manager,Attempt:0,}"
Nov 28 22:01:40 ip-10-1-45-152 containerd[2429]: time="2025-11-28T22:01:40.874584322Z" level=error msg="CreateContainer within sandbox \"d73a73fd99645a1880cfdf879e0d36f1059f479dc45a5afc441c5790a81ba775\" for &ContainerMetadata{Name:etcd-manager,Attempt:0,} failed" error="failed to generate container \"023161db502985850e15feb58f6a56eeeee2876ee57cb593eaee43d2a3583f8b\" spec: failed to generate spec: failed to mkdir \"\": mkdir : no such file or directory"
Nov 28 22:01:40 ip-10-1-45-152 kubelet[3936]: E1128 22:01:40.875156    3936 log.go:32] "CreateContainer in sandbox from runtime service failed" err="rpc error: code = Unknown desc = failed to generate container \"023161db502985850e15feb58f6a56eeeee2876ee57cb593eaee43d2a3583f8b\" spec: failed to generate spec: failed to mkdir \"\": mkdir : no such file or directory" podSandboxID="d73a73fd99645a1880cfdf879e0d36f1059f479dc45a5afc441c5790a81ba775"
Nov 28 22:01:40 ip-10-1-45-152 kubelet[3936]: E1128 22:01:40.875294    3936 kuberuntime_manager.go:1449] "Unhandled Error" err="container etcd-manager start failed in pod etcd-manager-events-i-06f1a3baa9bed8ccd_kube-system(abe485e9c356c6883d5536e2e3788153): CreateContainerError: failed to generate container \"023161db502985850e15feb58f6a56eeeee2876ee57cb593eaee43d2a3583f8b\" spec: failed to generate spec: failed to mkdir \"\": mkdir : no such file or directory" logger="UnhandledError"
Nov 28 22:01:40 ip-10-1-45-152 kubelet[3936]: E1128 22:01:40.875335    3936 pod_workers.go:1324] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"etcd-manager\" with CreateContainerError: \"failed to generate container \\\"023161db502985850e15feb58f6a56eeeee2876ee57cb593eaee43d2a3583f8b\\\" spec: failed to generate spec: failed to mkdir \\\"\\\": mkdir : no such file or directory\"" pod="kube-system/etcd-manager-events-i-06f1a3baa9bed8ccd" podUID="abe485e9c356c6883d5536e2e3788153"

This is the error from the first containerd error line above:

error="failed to generate container \"023161db502985850e15feb58f6a56eeeee2876ee57cb593eaee43d2a3583f8b\"
spec: failed to generate spec: failed to mkdir \"\": mkdir : no such file or directory"

I've compared this with a log of a successful startup under kOps 1.33.1; that log contains no such error lines.
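
If it helps with triage: failed to mkdir "" looks like containerd being asked to create a mount source that is an empty string, i.e. the generated container spec may contain a hostPath-style volume whose path is empty. A quick way to check on the node, assuming the standard kubelet static pod directory, is something like:

# look for a volume or mount with an empty host path in the static pod manifests
grep -n -B2 -A3 'hostPath' /etc/kubernetes/manifests/*.manifest

I have not confirmed which volume (if any) has the empty path, so treat this as a guess based on the error message.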
