Kubernetes cluster nodes become unresponsive

I have created two different clusters on Azure, and I am facing a different problem on each of them.

1) After running for some time (7-8 hours), a node becomes unresponsive. I checked the log on the node and found the error below, which I suspect is the root cause of the node becoming unresponsive:

    May 06 22:00:01 kube-01 kernel: VFS: file-max limit 709091 reached
    

However, I am unable to figure out what is causing this error.
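
One way to narrow this down is to watch how close the node is to the system-wide file handle limit and which processes hold the most descriptors. The following is only a minimal sketch, assuming a Linux node with Python 3 and a standard procfs; the function names are illustrative, and reading other processes' `fd` directories generally requires root.

```python
import os

FILE_NR = "/proc/sys/fs/file-nr"


def parse_file_nr(text):
    """Parse file-nr content: allocated handles, unused handles, maximum."""
    allocated, unused, maximum = (int(field) for field in text.split())
    return {"allocated": allocated, "unused": unused, "max": maximum}


def usage_ratio(stats):
    """Fraction of the system-wide file handle limit currently allocated."""
    return stats["allocated"] / stats["max"]


def top_fd_holders(proc_root="/proc", limit=5):
    """Return (fd_count, pid, name) tuples for the heaviest descriptor users."""
    holders = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            count = len(os.listdir(os.path.join(proc_root, entry, "fd")))
            with open(os.path.join(proc_root, entry, "comm")) as handle:
                name = handle.read().strip()
        except OSError:
            continue  # process exited or permission denied
        holders.append((count, int(entry), name))
    return sorted(holders, reverse=True)[:limit]


def main():
    if not os.path.exists(FILE_NR):
        print("no procfs file-nr found; this sketch only works on Linux")
        return
    with open(FILE_NR) as handle:
        stats = parse_file_nr(handle.read())
    print(stats, "usage: {:.1%}".format(usage_ratio(stats)))
    for count, pid, name in top_fd_holders():
        print(count, pid, name)


if __name__ == "__main__":
    main()
```

Running this periodically (for example once an hour) on the affected node should show whether one process keeps growing until the limit is hit.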

I also checked the master's log for the same time period; it is continuously throwing the errors below:

    May 06 22:01:48 kube-00 bash[1188]: Error from server: error when creating "/etc/kubernetes/addons/skydns-rc.yaml": replicationControllers "kub-dns-v9" already exists
    
    May 06 22:01:48 kube-00 bash[1188]: Error from server: error when creating "/etc/kubernetes/addons/skydns-svc.yaml": Service "kube-dns" is invalid: spec.clusterIP: invalid value '10.16.0.3', Details: provided IP is already allocated
    

I restarted the SkyDNS pods, and the errors above are now gone from the master. But I am not sure why this happened.

2) On the other cluster, I am not able to set up the cluster at all. I am getting continuous timeout errors in etcd, as below:

    May 08 21:05:27 etcd-00 etcd2[569]: publish error: etcdserver: request timed out
    May 08 21:05:28 etcd-00 etcd2[569]: 8910a392b5f210d1 is starting a new election at term 2095
    May 08 21:05:28 etcd-00 etcd2[569]: 8910a392b5f210d1 became candidate at term 2096
    May 08 21:05:28 etcd-00 etcd2[569]: 8910a392b5f210d1 received vote from 8910a392b5f210d1 at term 2096
    May 08 21:05:28 etcd-00 etcd2[569]: 8910a392b5f210d1 [logterm: 90, index: 572430] sent vote request to 20ba77ebc61a2b91 at term 2096
    May 08 21:05:29 etcd-00 etcd2[569]: 8910a392b5f210d1 is starting a new election at term 2096
    May 08 21:05:29 etcd-00 etcd2[569]: 8910a392b5f210d1 became candidate at term 2097
    May 08 21:05:29 etcd-00 etcd2[569]: 8910a392b5f210d1 received vote from 8910a392b5f210d1 at term 2097
    May 08 21:05:29 etcd-00 etcd2[569]: 8910a392b5f210d1 [logterm: 90, index: 572430] sent vote request to 20ba77ebc61a2b91 at term 2097
    May 08 21:05:31 etcd-00 etcd2[569]: 8910a392b5f210d1 is starting a new election at term 2097
    May 08 21:05:31 etcd-00 etcd2[569]: 8910a392b5f210d1 became candidate at term 2098
    May 08 21:05:31 etcd-00 etcd2[569]: 8910a392b5f210d1 received vote from 8910a392b5f210d1 at term 2098
    May 08 21:05:31 etcd-00 etcd2[569]: 8910a392b5f210d1 [logterm: 90, index: 572430] sent vote request to 20ba77ebc61a2b91 at term 2098
    May 08 21:05:32 etcd-00 etcd2[569]: 8910a392b5f210d1 is starting a new election at term 2098
    May 08 21:05:32 etcd-00 etcd2[569]: 8910a392b5f210d1 became candidate at term 2099
    May 08 21:05:32 etcd-00 etcd2[569]: 8910a392b5f210d1 received vote from 8910a392b5f210d1 at term 2099
    May 08 21:05:32 etcd-00 etcd2[569]: 8910a392b5f210d1 [logterm: 90, index: 572430] sent vote request to 20ba77ebc61a2b91 at term 2099
    May 08 21:05:34 etcd-00 etcd2[569]: 8910a392b5f210d1 is starting a new election at term 2099
    May 08 21:05:34 etcd-00 etcd2[569]: 8910a392b5f210d1 became candidate at term 2100
    May 08 21:05:34 etcd-00 etcd2[569]: 8910a392b5f210d1 received vote from 8910a392b5f210d1 at term 2100
    May 08 21:05:34 etcd-00 etcd2[569]: 8910a392b5f210d1 [logterm: 90, index: 572430] sent vote request to 20ba77ebc61a2b91 at term 2100
    May 08 21:05:34 etcd-00 etcd2[569]: publish error: etcdserver: request timed out
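
In the log above, the member only ever receives its own vote, and its vote requests to the peer go unanswered. A small script can make that pattern explicit. This is a rough sketch, assuming Python 3 and log lines in exactly the format shown; `summarize` and `quorum` are illustrative names, not part of etcd. The quorum formula is the usual Raft majority, so a two-member cluster (which the single peer ID in the log suggests, though that is an assumption) needs both members to elect a leader.

```python
import re
from collections import Counter

ELECTION = re.compile(r"(\w+) is starting a new election at term (\d+)")
VOTE_RECEIVED = re.compile(r"(\w+) received vote from (\w+) at term (\d+)")
VOTE_REQUEST = re.compile(
    r"(\w+) \[logterm: \d+, index: \d+\] sent vote request to (\w+) at term (\d+)"
)


def quorum(members):
    """Votes needed for a Raft majority in a cluster of the given size."""
    return members // 2 + 1


def summarize(lines):
    """Count elections, votes granted per voter, and peers that never voted."""
    elections = 0
    votes = Counter()      # voter id -> number of votes granted
    requested = Counter()  # peer id -> number of vote requests sent to it
    for line in lines:
        if ELECTION.search(line):
            elections += 1
            continue
        match = VOTE_RECEIVED.search(line)
        if match:
            votes[match.group(2)] += 1
            continue
        match = VOTE_REQUEST.search(line)
        if match:
            requested[match.group(2)] += 1
    silent = sorted(peer for peer in requested if votes[peer] == 0)
    return {"elections": elections, "votes": dict(votes), "silent_peers": silent}


if __name__ == "__main__":
    import sys

    # Usage: journalctl -u etcd2 | python this_script.py
    print(summarize(sys.stdin))
    print("quorum for 2 members:", quorum(2))
```

Any ID listed under `silent_peers` is a member that was asked for a vote but never granted one in the captured window, which points at that member (or the network path to it) rather than the local one.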
    

On the master I am getting this error:

    error: couldn't read version from server: Get http://localhost:8080/api: dial tcp 127.0.0.1:8080: connection refused
    

This is because the API server is not running; it gives the error below because it is unable to connect to etcd:

    May 08 21:06:36 kube-00 kube-apiserver[8051]: E0508 21:06:36.189684    8051 cacher.go:149] unexpected ListAndWatch error: pkg/storage/cacher.go:115: Failed to list *api.Node: 501: All the given peers are not reachable (failed to propose on members [http://etcd-00:4001 http://etcd-01:4001] twice [last error: Get http://etcd-00:4001/v2/keys/registry/minions?quorum=false&recursive=true&sorted=true: dial tcp 172.18.0.4:4001: connection refused]) [0]
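
Since both errors are plain "connection refused" failures, a quick TCP check from the master can confirm which endpoints are actually listening. This is a minimal sketch, assuming Python 3 is available on the master and that the host names resolve there as they do for the API server; the endpoint list is taken from the errors above, and `can_connect` is an illustrative helper.

```python
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


# Endpoints mentioned in the kubectl and kube-apiserver errors.
ENDPOINTS = [("etcd-00", 4001), ("etcd-01", 4001), ("localhost", 8080)]


def main():
    for host, port in ENDPOINTS:
        state = "open" if can_connect(host, port) else "unreachable"
        print("{}:{} {}".format(host, port, state))


if __name__ == "__main__":
    main()
```

If both etcd endpoints show as unreachable, the API server failure is a consequence of the etcd problem rather than a separate issue.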
    

I am using the CoreOS and Docker versions below:

    CoreOS Version: 835.12.0
    Docker Version: 1.8.3
    

I have another cluster that is running without any issues on the versions below:

    CoreOS Version: 835.13.0
    Docker Version: 1.8.3
    
