Debugging windows nodes in EKS
When the node doesn't see the service network
Debugging windows nodes in EKS#
After a couple of days dealing with windows nodes in Kubernetes AWS EKS, I thought I would provide a quick write-up of what I’ve seen. For two reasons
- In case in a week time I need to remember what I did, I know where to look at.
- In case somebody faces the same issues and Google offers them this page (BTW, I don’t have any analytics on this page, so shout on Twitter or LinkedIn if you found this useful)
Anyway, we have some nodegroups with windows machines. The machines would join the cluster correctly and kubectl get nodes
would show the nodes in ready status. But it seemed that pods were having some connectivity problem. We were trying to schedule a jenkins slave pod using the Jenkins Kubernetes plugin but the jnlp container was dying unable to connect to the jenkins service.
So I RDP into the node, and do a curl to the Jenkins service by IP (the service network is a different “ficticious network”). Unable to connect. Obviously, trying to do nslookup using the DNS service for the cluster (CoreDNS) would have the same issue.
Next, I tried to connect the pod directly, not the service. In AWS, pods have a routable IP. Connecting the pod directly would work.
OK. At that point it was obvious the nodes were having some issue trying to connect to the service network. What could that be?
First, I tried to see if the node had joined correctly the cluster. So time to see the logs for the bootstrap process.
more C:\ProgramData\Amazon\EC2-Windows\Launch\Log\UserdataExecution.log
didn’t show anything strange. All seemed to have gone correctly.
Part of the boot process generates a configuration file for CNI (networking). Time to see that
more C:\ProgramData\Amazon\EKS\cni\config\vpc-shared-eni.conf
{
"cniVersion": "0.3.1",
"name": "vpc",
"type": "vpc-shared-eni",
"eniMACAddress": "<macaddress>",
"eniIPAddresses": ["<CIDR self>"],
"gatewayIPAddress": "<ip gateway>",
"vpcCIDRs": [
"<CIDR vpc>"
],
"serviceCIDR": "<CIDR service>",
"dns": {
"nameservers": ["<ip dns>"],
"search": [
"{%namespace%}.svc.cluster.local",
"svc.cluster.local",
"cluster.local"
]
}
}
Again. Everything correct.
Service network. It should be the kube-proxy. How to to debug kube-proxy in Windows? How to see logs in Windows?
It turns out there is no such thing as journalctl or so. You use get-eventlog
In particular, this is the command to get a list of the logs in tabular format
> get-eventlog eks|more
In that list, something caught my eye inmediately, because it was of type error, not information nor warning.
So I laser focussed on that event
> get-eventlog EKS |where index -eq 8556|format-list *
EventID : 0
MachineName : <name machine>
Data : {}
Index : 8556
Category : (0)
CategoryNumber : 0
EntryType : Error
Message : E0624 12:04:06.002944 3084 reflector.go:138]
k8s.io/client-go/informers/factory.go:134: Failed
to watch *v1.EndpointSlice: failed to list *v1.EndpointSlice:
endpointslices.discovery.k8s.io is forbidden: User
"system:node:ip-<ip>.ec2.internal" cannot list resource
"endpointslices" in API group "discovery.k8s.io" at the cluster scope
Source : kube-proxy
ReplacementStrings : {E0624 12:04:06.002944 3084 reflector.go:138]
k8s.io/client-go/informers/factory.go:134: Failed
to watch *v1.EndpointSlice : failed to list *v1.EndpointSlice:
endpointslices.discovery.k8s.io is forbidden: User
"system:node:ip-<ip>.ec2.internal" cannot list resource
"endpointslices" in API group "discovery.k8s.io" at the cluster scope}
InstanceId : 0
TimeGenerated : 6/24/2022 12:04:06 PM
TimeWritten : 6/24/2022 12:04:06 PM
UserName :
Site :
Container :
Something about kube-proxy not being able to access a specific endpoint.
"system:node:ip-<ip>.ec2.internal" cannot list resource "endpointslices" in API group "discovery.k8s.io"
Hmm. Immediately I knew it was going to be related to aws-auth. Remember that aws-auth is the way EKS gives permissions to nodes (and actual people) to connect to the cluster.
There was some red-herring with a bug on CoreDNS. But I dismissed it quickly.
It had to be aws-auth.
I checked the configuration, and yes, aws-auth had indeed two groups of nodes. Linux nodes, have applied two clusterrolebindings (I thought it was roles, but not). And windows, have a third clusterrolebinding.
$ kubectl get cm aws-auth -n kube-system -o yaml
apiVersion: v1
data:
mapRoles: |
- rolearn: <arn linux nodes>
username: system:node:{{EC2PrivateDNSName}}
groups:
- system:bootstrappers
- system:nodes
- rolearn: <arn windows nodes>
username: system:node:{{EC2PrivateDNSName}}
groups:
- system:bootstrappers
- system:nodes
- eks:kube-proxy-windows
So far, so good. I checked that the cluster role binding was correct
$ kubectl get clusterrolebinding eks:kube-proxy-windows -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
[...]
labels:
eks.amazonaws.com/component: kube-proxy
k8s-app: kube-proxy
name: eks:kube-proxy-windows
[...]
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: system:node-proxier
subjects:
- apiGroup: rbac.authorization.k8s.io
kind: Group
name: eks:kube-proxy-windows
There we are, system:node-proxier
. Everything looks correct. Next, check the Cluster role
$ kubectl get clusterrole system:node-proxier -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
annotations:
rbac.authorization.kubernetes.io/autoupdate: "true"
labels:
kubernetes.io/bootstrapping: rbac-defaults
name: system:node-proxier
[...]
rules:
[...]
- apiGroups:
- discovery.k8s.io
resources:
- endpointslices
verbs:
- list
- watch
and lo and behold, discovery.k8s.io
apigroups, endpointslices
resources, and list
.
Hmm. Everything looks correct.
Rubber duck moment.
The k8s cluster role should have permissions, the cluster role gets applied to nodes that have the AWS role I have passed for windows machines.
But … our window machine doesn’t have permissions?
Let’s check the last piece. It cannot be so stupid…
Does the window machine have the windows role?
Yes, in my case, the issue was so stupid. The node had an incorrect AWS role applied.
I hope that the whole debugging process was more interesting than actually the issue.