How to expose Spark Driver behind dockerized Apache Zeppelin?

I am currently building a custom Docker container from a plain distribution, with Apache Zeppelin + Spark 2.x inside.

My Spark jobs will run in a remote cluster and I am using yarn-client as master.

    When I run a notebook and try to print sc.version, the program gets stuck. If I go to the remote resource manager, an application has been created and accepted, but in the logs I can read:

    INFO yarn.ApplicationMaster: Waiting for Spark driver to be reachable

    My understanding of the situation is that the cluster is unable to talk to the driver in the container but I don’t know how to solve this issue.

    I am currently using the following configuration:

    • spark.driver.port set to PORT1, and option -p PORT1:PORT1 passed to the container
    • spark.driver.host set to 172.17.0.2 (the container's IP)
    • SPARK_LOCAL_IP set to 172.17.0.2 (the container's IP)
    • spark.ui.port set to PORT2, and option -p PORT2:PORT2 passed to the container

    I have the feeling I should change SPARK_LOCAL_IP to the host IP, but if I do so, the Spark UI is unable to start, which blocks the process one step earlier.
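    For reference, here is the setup above spelled out as a minimal sketch. PORT1/PORT2 stay as placeholders, the image and container names are hypothetical, and the Spark properties can be set either in Zeppelin's Spark interpreter settings or in spark-defaults.conf:

        # Spark driver/UI settings as currently configured (placeholders kept)
        spark.driver.port   PORT1
        spark.driver.host   172.17.0.2
        spark.ui.port       PORT2

        # Environment inside the container
        export SPARK_LOCAL_IP=172.17.0.2

        # Container started with matching port mappings (hypothetical image name)
        docker run -d -p PORT1:PORT1 -p PORT2:PORT2 --name zeppelin my-zeppelin-spark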

    Thanks in advance for any ideas / advice!

One solution, collected from the web, for “How to expose Spark Driver behind dockerized Apache Zeppelin?”

    Good question! First of all, as you know, Apache Zeppelin runs interpreters in separate processes.

    [Apache Zeppelin architecture diagram]

    In your case, the Spark interpreter JVM process hosts the SparkContext and serves as the Spark driver instance for the yarn-client deployment mode. According to the Apache Spark documentation, this process inside the container needs to be able to communicate back and forth with the YARN ApplicationMaster and all the Spark worker machines of the cluster.

    [Apache Spark architecture diagram]
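    One way to see the problem concretely: the address the driver advertises (172.17.0.2 above) is the container's bridge-network IP, which the YARN nodes generally cannot route to. A quick check with the standard Docker CLI (container name is hypothetical):

        # Print the container's IP on the default bridge network
        docker inspect -f '{{.NetworkSettings.IPAddress}}' zeppelin
        # -> 172.17.0.2, reachable from the Docker host but usually not from the YARN cluster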

    This implies that you have to have a number of ports open and manually forwarded between the container and the host machine. Here is an example of a project at ZEPL doing a similar job, where it took us 7 ports to get the job done.
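    As an illustration only, a sketch of what such a setup could look like. The port numbers and the image name are arbitrary placeholders, the property names are standard Spark configuration settings (spark.driver.bindAddress and spark.driver.blockManager.port are available from Spark 2.1 on), and the exact list of ports you need depends on your cluster:

        # Pin the ports Spark would otherwise pick at random, so they can be forwarded
        # (set in Zeppelin's Spark interpreter settings or in spark-defaults.conf)
        spark.driver.port               40001
        spark.driver.blockManager.port  40002
        spark.ui.port                   4040
        spark.driver.host               <host-machine-ip>   # address the YARN cluster can reach
        spark.driver.bindAddress        0.0.0.0             # bind inside the container (Spark 2.1+)

        # Forward the same ports when starting the container (hypothetical image name)
        # 8080 is Zeppelin's web UI, 4040 the Spark UI
        docker run -d \
          -p 8080:8080 \
          -p 4040:4040 \
          -p 40001:40001 \
          -p 40002:40002 \
          my-zeppelin-spark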

    Another approach would be to run Docker networking in host mode (though it apparently does not work on OS X, due to a recent bug).
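    A minimal sketch of that alternative (hypothetical image name again). With --net=host the container shares the host's network stack, so the driver binds directly to the host's interfaces and no -p mappings are needed; this only works as described on Linux hosts:

        # Host networking: YARN and the workers reach the driver via the host's own IP
        docker run -d --net=host my-zeppelin-spark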
