Kubernetes cluster nodes become unresponsive

I have created two different clusters on Azure, and I am facing a different problem on each of them.

1) After running for some time (7-8 hours), a node becomes unresponsive. I checked the log on the node and found the error below, which I suspect is the root cause of the node becoming unresponsive:

    May 06 22:00:01 kube-01 kernel: VFS: file-max limit 709091 reached
    

    However, I am unable to figure out the root cause of the problem.
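One way to narrow this down would be to watch how close the node gets to the system-wide file handle limit over time. On Linux, `/proc/sys/fs/file-nr` holds three integers: allocated handles, allocated-but-unused handles, and the maximum (`fs.file-max`). The following is only a minimal sketch; the names `FileNr`, `parse_file_nr`, `usage_ratio` and `read_file_nr` are made up for illustration and are not part of any Kubernetes or CoreOS tooling.

```python
from typing import NamedTuple


class FileNr(NamedTuple):
    allocated: int  # file handles currently allocated
    unused: int     # allocated but unused handles
    maximum: int    # system-wide limit (fs.file-max)


def parse_file_nr(text: str) -> FileNr:
    """Parse the three whitespace-separated integers from /proc/sys/fs/file-nr."""
    allocated, unused, maximum = (int(field) for field in text.split())
    return FileNr(allocated, unused, maximum)


def usage_ratio(stats: FileNr) -> float:
    """Fraction of the system-wide file handle limit that is allocated."""
    return stats.allocated / stats.maximum


def read_file_nr(path: str = "/proc/sys/fs/file-nr") -> FileNr:
    """Read and parse the kernel's file handle counters (Linux only)."""
    with open(path) as handle:
        return parse_file_nr(handle.read())
```

Calling `usage_ratio(read_file_nr())` periodically on the affected node would show whether usage climbs steadily toward the limit, which would suggest a process leaking file descriptors.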

    I also checked the master's log from the same time; it is continuously throwing the errors below:

    May 06 22:01:48 kube-00 bash[1188]: Error from server: error when creating "/etc/kubernetes/addons/skydns-rc.yaml": replicationControllers "kub-dns-v9" already exists
    
    May 06 22:01:48 kube-00 bash[1188]: Error from server: error when creating "/etc/kubernetes/addons/skydns-svc.yaml": Service "kube-dns" is invalid: spec.clusterIP: invalid value '10.16.0.3', Details: provided IP is already allocated
    

    I restarted the SkyDNS pods.

    The above errors from the master are now gone, but I am not sure why this happened.

    2) On the other cluster, I am not able to set the cluster up at all. etcd continuously reports timeout errors, as below:

    May 08 21:05:27 etcd-00 etcd2[569]: publish error: etcdserver: request timed out
    May 08 21:05:28 etcd-00 etcd2[569]: 8910a392b5f210d1 is starting a new election at term 2095
    May 08 21:05:28 etcd-00 etcd2[569]: 8910a392b5f210d1 became candidate at term 2096
    May 08 21:05:28 etcd-00 etcd2[569]: 8910a392b5f210d1 received vote from 8910a392b5f210d1 at term 2096
    May 08 21:05:28 etcd-00 etcd2[569]: 8910a392b5f210d1 [logterm: 90, index: 572430] sent vote request to 20ba77ebc61a2b91 at term 2096
    May 08 21:05:29 etcd-00 etcd2[569]: 8910a392b5f210d1 is starting a new election at term 2096
    May 08 21:05:29 etcd-00 etcd2[569]: 8910a392b5f210d1 became candidate at term 2097
    May 08 21:05:29 etcd-00 etcd2[569]: 8910a392b5f210d1 received vote from 8910a392b5f210d1 at term 2097
    May 08 21:05:29 etcd-00 etcd2[569]: 8910a392b5f210d1 [logterm: 90, index: 572430] sent vote request to 20ba77ebc61a2b91 at term 2097
    May 08 21:05:31 etcd-00 etcd2[569]: 8910a392b5f210d1 is starting a new election at term 2097
    May 08 21:05:31 etcd-00 etcd2[569]: 8910a392b5f210d1 became candidate at term 2098
    May 08 21:05:31 etcd-00 etcd2[569]: 8910a392b5f210d1 received vote from 8910a392b5f210d1 at term 2098
    May 08 21:05:31 etcd-00 etcd2[569]: 8910a392b5f210d1 [logterm: 90, index: 572430] sent vote request to 20ba77ebc61a2b91 at term 2098
    May 08 21:05:32 etcd-00 etcd2[569]: 8910a392b5f210d1 is starting a new election at term 2098
    May 08 21:05:32 etcd-00 etcd2[569]: 8910a392b5f210d1 became candidate at term 2099
    May 08 21:05:32 etcd-00 etcd2[569]: 8910a392b5f210d1 received vote from 8910a392b5f210d1 at term 2099
    May 08 21:05:32 etcd-00 etcd2[569]: 8910a392b5f210d1 [logterm: 90, index: 572430] sent vote request to 20ba77ebc61a2b91 at term 2099
    May 08 21:05:34 etcd-00 etcd2[569]: 8910a392b5f210d1 is starting a new election at term 2099
    May 08 21:05:34 etcd-00 etcd2[569]: 8910a392b5f210d1 became candidate at term 2100
    May 08 21:05:34 etcd-00 etcd2[569]: 8910a392b5f210d1 received vote from 8910a392b5f210d1 at term 2100
    May 08 21:05:34 etcd-00 etcd2[569]: 8910a392b5f210d1 [logterm: 90, index: 572430] sent vote request to 20ba77ebc61a2b91 at term 2100
    May 08 21:05:34 etcd-00 etcd2[569]: publish error: etcdserver: request timed out
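
In the excerpt above, member 8910a392b5f210d1 starts a new election every second or two and only ever records a vote from itself, which would be consistent with the peer 20ba77ebc61a2b91 not answering. To check this over a longer stretch of the journal, a small script can count elections and collect the member IDs that actually voted. This is a rough sketch that assumes the log lines keep exactly the wording shown above; `summarize_elections` is a hypothetical helper, not an etcd tool.

```python
import re

# Patterns match the etcd2 log wording shown in the excerpt above.
ELECTION = re.compile(r"(\S+) is starting a new election at term (\d+)")
VOTE = re.compile(r"(\S+) received vote from (\S+) at term (\d+)")


def summarize_elections(lines):
    """Return (number of elections started, set of member IDs that voted)."""
    elections = 0
    voters = set()
    for line in lines:
        if ELECTION.search(line):
            elections += 1
        match = VOTE.search(line)
        if match:
            voters.add(match.group(2))  # ID of the member that cast the vote
    return elections, voters
```

If the set of voters never contains anything but the local member's own ID, the member is not getting responses from its peers, and the next thing to look at would be connectivity between the etcd nodes.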
    

    On the master I get the following error:

    error: couldn't read version from server: Get http://localhost:8080/api: dial tcp 127.0.0.1:8080: connection refused
    

    This is because the API server is not running; it reports the error below because it is unable to connect to etcd:

    May 08 21:06:36 kube-00 kube-apiserver[8051]: E0508 21:06:36.189684    8051 cacher.go:149] unexpected ListAndWatch error: pkg/storage/cacher.go:115: Failed to list *api.Node: 501: All the given peers are not reachable (failed to propose on members [http://etcd-00:4001 http://etcd-01:4001] twice [last error: Get http://etcd-00:4001/v2/keys/registry/minions?quorum=false&recursive=true&sorted=true: dial tcp 172.18.0.4:4001: connection refused]) [0]
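
Since the error is a plain "connection refused" on the etcd client port, a quick TCP check from the master can help tell a network or firewall problem apart from etcd simply not listening. The sketch below uses only the Python standard library; `can_connect` is a name chosen for illustration, and the host names and port in the comment are the ones from the error message above.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and name resolution failures.
        return False


# Example, to be run from the master (endpoints taken from the error above):
# for host in ("etcd-00", "etcd-01"):
#     print(host, can_connect(host, 4001))
```

A `False` result for both endpoints would match the API server error; it does not by itself say whether etcd is down or the port is blocked, so the etcd service status on each node still needs to be checked.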
    

    I am using the CoreOS and Docker versions below:

    CoreOS Version: 835.12.0
    Docker Version: 1.8.3
    

    I have another cluster that runs without any issues on the versions below:

    CoreOS Version: 835.13.0
    Docker Version: 1.8.3
    
