Distributed wide-and-deep example with the tf.contrib.learn API gets stuck on k8s

I am new to distributed TensorFlow. I tried to run the distributed wide-and-deep example on a one-node k8s cluster, but all of the worker tasks get stuck at INFO:tensorflow:Create CheckpointSaverHook.

The same tests run fine on localhost and in Docker.

Here is my code: https://github.com/zhoudongyan/wide-and-deep
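For context, here is a minimal sketch of the cluster description that each task in a distributed tf.contrib.learn job reads from the TF_CONFIG environment variable, plus a check that every address in it resolves from inside a pod. It is not taken from the repository above. The helper names (`build_tf_config`, `unresolved_hosts`), the host names, the port, and the ps/worker/master job layout are all assumptions for illustration. An unresolvable address is only one possible reason for a task to wait during startup, not a confirmed diagnosis.

```python
import json
import socket


def build_tf_config(ps_hosts, worker_hosts, master_hosts, task_type, task_index):
    """Return a TF_CONFIG JSON string for one task (hypothetical helper)."""
    return json.dumps({
        "cluster": {
            "ps": ps_hosts,
            "worker": worker_hosts,
            "master": master_hosts,
        },
        "task": {"type": task_type, "index": task_index},
        "environment": "cloud",
    })


def unresolved_hosts(tf_config):
    """Return the host:port entries whose host name cannot be resolved."""
    cluster = json.loads(tf_config)["cluster"]
    bad = []
    for addresses in cluster.values():
        for address in addresses:
            host = address.rsplit(":", 1)[0]
            try:
                socket.gethostbyname(host)
            except OSError:
                bad.append(address)
    return bad


if __name__ == "__main__":
    # Hypothetical k8s Service names; replace with the names in your manifests.
    config = build_tf_config(
        ps_hosts=["wide-deep-ps-0:2222"],
        worker_hosts=["wide-deep-worker-0:2222", "wide-deep-worker-1:2222"],
        master_hosts=["wide-deep-master-0:2222"],
        task_type="worker",
        task_index=0,
    )
    print(config)
```

Running `unresolved_hosts(config)` inside each pod would list any addresses that the pod cannot resolve, which helps separate a networking problem from a TensorFlow one.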

    • docker version: 17.03.1-ce
    • k8s version: v1.6.3
    • tensorflow version: 1.1.0, python3
    • os: ubuntu 14.04 64bit

Does anyone know how to run it correctly? Thanks a lot!
