/kind bug
1. What kops version are you running? The command kops version will display this information.
Before upgrade:
Client version: 1.34.1
Last applied server version: 1.33.1
After failed upgrade:
Client version: 1.34.1
Last applied server version: 1.34.1
2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.
| Component | Version |
|---|---|
| Kubernetes | 1.34.2 |
| OS (Flatcar Stable) | 4230.2.4 |
| containerd | v1.7.23 |
| etcd | 3.5.21 |
3. What cloud provider are you using?
AWS
4. What commands did you run? What is the simplest way to reproduce this issue?
kops update cluster --yes
kops rolling-update cluster --instance-group master --control-plane-interval 1s --cloudonly --yes
kops validate cluster --wait 15m
5. What happened after the commands executed?
The control plane never comes up because etcd-manager fails to start.
6. What did you expect to happen?
The control plane should come up again.
7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.
Cluster manifest
apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
generation: 1
name: dev.k8s.local
spec:
DisableSubnetTags: true
api:
loadBalancer:
class: Network
idleTimeoutSeconds: 3600
subnets:
- name: dev-internal
type: Internal
authentication:
aws:
image: public.ecr.aws/eks-distro/kubernetes-sigs/aws-iam-authenticator:v0.7.7-eks-1-32-28
authorization:
rbac: {}
certManager:
enabled: true
hostedZoneIDs:
- <redacted>
- <redacted>
channel: stable
cloudProvider: aws
clusterAutoscaler:
cpuRequest: 100m
enabled: true
expander: least-waste
memoryRequest: 300Mi
configBase: s3://<redacted>-dev-kops-state/dev.k8s.local
etcdClusters:
- etcdMembers:
- instanceGroup: master
name: a
name: main
version: 3.5.21
- etcdMembers:
- instanceGroup: master
name: a
name: events
version: 3.5.21
externalPolicies:
master:
- arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
node:
- arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
fileAssets:
- content: |
[Socket]
ListenStream=
ListenStream=127.0.0.1:22
FreeBind=true
mode: "0644"
name: sshd-restrict
path: /etc/systemd/system/sshd.socket.d/10-sshd-restrict.conf
- content: |
apiVersion: v1
kind: Config
clusters:
- name: audit.dev.k8s.local
cluster:
server: http://:30009/k8s-audit
contexts:
- name: webhook
context:
cluster: audit.dev.k8s.local
user: ""
current-context: webhook
preferences: {}
users: []
name: audit-webhook-config
path: /etc/kubernetes/audit/webhook-config.yaml
roles:
- ControlPlane
- content: |
apiVersion: audit.k8s.io/v1 # This is required.
kind: Policy
# Don't generate audit events for all requests in RequestReceived stage.
omitStages:
- "RequestReceived"
rules:
# Log pod changes at RequestResponse level
- level: RequestResponse
resources:
- group: ""
# Resource "pods" doesn't match requests to any subresource of pods,
# which is consistent with the RBAC policy.
resources: ["pods", "deployments"]
- level: RequestResponse
resources:
- group: "rbac.authorization.k8s.io"
# Resource "pods" doesn't match requests to any subresource of pods,
# which is consistent with the RBAC policy.
resources: ["clusterroles", "clusterrolebindings"]
# Log "pods/log", "pods/status" at Metadata level
- level: Metadata
resources:
- group: ""
resources: ["pods/log", "pods/status"]
# Don't log requests to a configmap called "controller-leader"
- level: None
resources:
- group: ""
resources: ["configmaps"]
resourceNames: ["controller-leader"]
# Don't log watch requests by the "system:kube-proxy" on endpoints or services
- level: None
users: ["system:kube-proxy"]
verbs: ["watch"]
resources:
- group: "" # core API group
resources: ["endpoints", "services"]
# Don't log authenticated requests to certain non-resource URL paths.
- level: None
userGroups: ["system:authenticated"]
nonResourceURLs:
- "/api*" # Wildcard matching.
- "/version"
# Log the request body of configmap changes in kube-system.
- level: Request
resources:
- group: "" # core API group
resources: ["configmaps"]
# This rule only applies to resources in the "kube-system" namespace.
# The empty string "" can be used to select non-namespaced resources.
namespaces: ["kube-system"]
# Log configmap changes in all other namespaces at the RequestResponse level.
- level: RequestResponse
resources:
- group: "" # core API group
resources: ["configmaps"]
# Log secret changes in all other namespaces at the Metadata level.
- level: Metadata
resources:
- group: "" # core API group
resources: ["secrets"]
# Log all other resources in core and extensions at the Request level.
- level: Request
resources:
- group: "" # core API group
- group: "extensions" # Version of group should NOT be included.
# A catch-all rule to log all other requests at the Metadata level.
- level: Metadata
# Long-running requests like watches that fall under this rule will not
# generate an audit event in RequestReceived.
omitStages:
- "RequestReceived"
name: audit-policy-config
path: /etc/kubernetes/audit/policy-config.yaml
roles:
- ControlPlane
hooks:
- before:
- update-engine.service
manifest: |
Type=oneshot
ExecStartPre=/usr/bin/systemctl mask --now update-engine.service
ExecStartPre=/usr/bin/systemctl mask --now locksmithd.service
ExecStart=/usr/bin/systemctl reset-failed update-engine.service
name: disable-automatic-updates.service
- manifest: |
Type=oneshot
# Prune all unused docker images older than 7 days
ExecStart=/usr/bin/docker system prune -af --filter "until=168h"
name: docker-prune.service
requires:
- docker.service
- manifest: |
[Unit]
Description=Prune docker daily
[Timer]
OnCalendar=daily
Persistent=true
[Install]
WantedBy=timers.target
name: docker-prune.timer
useRawManifest: true
- before:
- protokube.service
manifest: |-
Type=oneshot
ExecStart=/usr/bin/systemctl restart sshd.socket
name: sshd-socket-restart.service
iam:
allowContainerRegistry: true
legacy: false
serviceAccountExternalPermissions:
<redacted>>
useServiceAccountExternalPermissions: true
kubeAPIServer:
auditPolicyFile: /etc/kubernetes/audit/policy-config.yaml
auditWebhookBatchMaxWait: 5s
auditWebhookConfigFile: /etc/kubernetes/audit/webhook-config.yaml
kubelet:
anonymousAuth: false
authenticationTokenWebhook: true
authorizationMode: Webhook
hostnameOverride: '@hostname'
kubernetesApiAccess:
- <redacted>>
kubernetesVersion: 1.34.2
masterPublicName: api.dev.k8s.local
networkCIDR: <redacted>>
networkID: vpc-3d03f75b
networking:
cilium:
enableL7Proxy: true
hubble:
enabled: true
nonMasqueradeCIDR: 100.64.0.0/10
podIdentityWebhook:
enabled: true
rollingUpdate:
maxSurge: 2
serviceAccountIssuerDiscovery:
discoveryStore: <redacted>
enableAWSOIDCProvider: true
snapshotController:
enabled: true
sshKeyName: dev
subnets:
- egress: <redacted>
id: <redacted>
name: dev-internal
type: Private
zone: eu-west-1a
- egress: <redacted>
id: <redacted>
name: dev-private
type: Private
zone: eu-west-1a
- id: <redacted>
name: dev-public
type: Utility
zone: eu-west-1a
topology:
dns:
type: None
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
creationTimestamp: "2025-11-22T21:21:09Z"
labels:
kops.k8s.io/cluster: dev.k8s.local
name: master
spec:
autoscale: false
image: ami-02d94ae5d4360b407
instanceMetadata:
httpPutResponseHopLimit: 1
httpTokens: required
machineType: t4g.medium
maxSize: 1
minSize: 1
nodeLabels:
kops.k8s.io/instancegroup: master
tier: master
role: Master
rootVolumeSize: 16
rootVolumeType: gp3
subnets:
- dev-internal
---
<redacted>
8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.
Here are the logs of kubelet and containerd on the master node (journalctl -u kubelet -u containerd > fail.log): fail.log
I notice that the container never starts (it does not even appear in crictl ps -a). Here are the logs from when it tries to start the container:
Nov 28 22:01:40 ip-10-1-45-152 containerd[2429]: time="2025-11-28T22:01:40.740709278Z" level=info msg="ImageCreate event name:\"registry.k8s.io/etcd-manager/etcd-manager-slim@sha256:3b8eeeef21b10f32ccd5ca58394d5f48fd3cacc40b6db7160e52fd7f0af5290d\" labels:{key:\"io.cri-containerd.image\" value:\"managed\"}"
Nov 28 22:01:40 ip-10-1-45-152 containerd[2429]: time="2025-11-28T22:01:40.789435941Z" level=info msg="stop pulling image registry.k8s.io/etcd-manager/etcd-manager-slim@sha256:3b8eeeef21b10f32ccd5ca58394d5f48fd3cacc40b6db7160e52fd7f0af5290d: active requests=0, bytes read=85783858"
Nov 28 22:01:40 ip-10-1-45-152 containerd[2429]: time="2025-11-28T22:01:40.833518870Z" level=info msg="ImageCreate event name:\"sha256:d3799cccd18379e55a8bd60d26c78eef5112fa3aba4dd4d5660dbf3b4e853e83\" labels:{key:\"io.cri-containerd.image\" value:\"managed\"}"
Nov 28 22:01:40 ip-10-1-45-152 containerd[2429]: time="2025-11-28T22:01:40.835087313Z" level=info msg="Pulled image \"registry.k8s.io/etcd-manager/etcd-manager-slim:v3.0.20250917@sha256:3b8eeeef21b10f32ccd5ca58394d5f48fd3cacc40b6db7160e52fd7f0af5290d\" with image id \"sha256:d3799cccd18379e55a8bd60d26c78eef5112fa3aba4dd4d5660dbf3b4e853e83\", repo tag \"\", repo digest \"registry.k8s.io/etcd-manager/etcd-manager-slim@sha256:3b8eeeef21b10f32ccd5ca58394d5f48fd3cacc40b6db7160e52fd7f0af5290d\", size \"85782882\" in 4.969761125s"
Nov 28 22:01:40 ip-10-1-45-152 containerd[2429]: time="2025-11-28T22:01:40.835133016Z" level=info msg="PullImage \"registry.k8s.io/etcd-manager/etcd-manager-slim:v3.0.20250917@sha256:3b8eeeef21b10f32ccd5ca58394d5f48fd3cacc40b6db7160e52fd7f0af5290d\" returns image reference \"sha256:d3799cccd18379e55a8bd60d26c78eef5112fa3aba4dd4d5660dbf3b4e853e83\""
Nov 28 22:01:40 ip-10-1-45-152 containerd[2429]: time="2025-11-28T22:01:40.836774296Z" level=info msg="PullImage \"registry.k8s.io/etcd-manager/etcd-manager-slim:v3.0.20250917@sha256:3b8eeeef21b10f32ccd5ca58394d5f48fd3cacc40b6db7160e52fd7f0af5290d\""
Nov 28 22:01:40 ip-10-1-45-152 containerd[2429]: time="2025-11-28T22:01:40.873277155Z" level=info msg="CreateContainer within sandbox \"d73a73fd99645a1880cfdf879e0d36f1059f479dc45a5afc441c5790a81ba775\" for container &ContainerMetadata{Name:etcd-manager,Attempt:0,}"
Nov 28 22:01:40 ip-10-1-45-152 containerd[2429]: time="2025-11-28T22:01:40.874584322Z" level=error msg="CreateContainer within sandbox \"d73a73fd99645a1880cfdf879e0d36f1059f479dc45a5afc441c5790a81ba775\" for &ContainerMetadata{Name:etcd-manager,Attempt:0,} failed" error="failed to generate container \"023161db502985850e15feb58f6a56eeeee2876ee57cb593eaee43d2a3583f8b\" spec: failed to generate spec: failed to mkdir \"\": mkdir : no such file or directory"
Nov 28 22:01:40 ip-10-1-45-152 kubelet[3936]: E1128 22:01:40.875156 3936 log.go:32] "CreateContainer in sandbox from runtime service failed" err="rpc error: code = Unknown desc = failed to generate container \"023161db502985850e15feb58f6a56eeeee2876ee57cb593eaee43d2a3583f8b\" spec: failed to generate spec: failed to mkdir \"\": mkdir : no such file or directory" podSandboxID="d73a73fd99645a1880cfdf879e0d36f1059f479dc45a5afc441c5790a81ba775"
Nov 28 22:01:40 ip-10-1-45-152 kubelet[3936]: E1128 22:01:40.875294 3936 kuberuntime_manager.go:1449] "Unhandled Error" err="container etcd-manager start failed in pod etcd-manager-events-i-06f1a3baa9bed8ccd_kube-system(abe485e9c356c6883d5536e2e3788153): CreateContainerError: failed to generate container \"023161db502985850e15feb58f6a56eeeee2876ee57cb593eaee43d2a3583f8b\" spec: failed to generate spec: failed to mkdir \"\": mkdir : no such file or directory" logger="UnhandledError"
Nov 28 22:01:40 ip-10-1-45-152 kubelet[3936]: E1128 22:01:40.875335 3936 pod_workers.go:1324] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"etcd-manager\" with CreateContainerError: \"failed to generate container \\\"023161db502985850e15feb58f6a56eeeee2876ee57cb593eaee43d2a3583f8b\\\" spec: failed to generate spec: failed to mkdir \\\"\\\": mkdir : no such file or directory\"" pod="kube-system/etcd-manager-events-i-06f1a3baa9bed8ccd" podUID="abe485e9c356c6883d5536e2e3788153"
And this is the error from the first containerd error line above:
error="failed to generate container \"023161db502985850e15feb58f6a56eeeee2876ee57cb593eaee43d2a3583f8b\"
spec: failed to generate spec: failed to mkdir \"\": mkdir : no such file or directory"
I've compared this with a log of a successful startup (kOps 1.33.1), and that log contains no such error lines.
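The empty path in the failed to mkdir "" error suggests that one of the volume mounts in the container spec kubelet hands to containerd ends up with an empty host path or mount path. As a rough way to narrow this down, one could grep the etcd-manager static pod manifests on the affected node for an empty path value. This is only a sketch: the file locations below are the usual kops paths for the etcd-manager static pods and may differ on your nodes.

# List all volume and mount path values in the etcd-manager static pod manifests.
grep -nE 'hostPath|mountPath|path:' /etc/kubernetes/manifests/etcd.manifest
grep -nE 'hostPath|mountPath|path:' /etc/kubernetes/manifests/etcd-events.manifest

# Flag any path value that is empty, which would match the failed to mkdir "" error.
grep -nE 'mountPath: *$|path: *$' /etc/kubernetes/manifests/etcd.manifest /etc/kubernetes/manifests/etcd-events.manifest

If one of those values is empty, the problem would be in the manifest that kOps 1.34.1 renders for etcd-manager rather than in containerd itself, and comparing it against the same manifest from a node still on the kOps 1.33.1 configuration should show which mount changed.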