Strategy to persist the node's data for dynamic Elasticsearch clusters

I’m sorry, this is probably a rather broad question, but I haven’t found a solution to this problem yet.

I’m trying to run an Elasticsearch cluster on Mesos through Marathon with Docker containers. To that end, I built a Docker image that can be started on Marathon and scaled dynamically via either the frontend or the API.
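For reference, deploying and scaling the app boils down to calls against Marathon’s REST API, roughly like the sketch below (the Marathon address, image name, and resource values are placeholders, not my actual setup):

    import requests

    MARATHON_URL = "http://marathon:8080"  # placeholder Marathon master address

    # Placeholder app definition for the Elasticsearch Docker image.
    app = {
        "id": "/elasticsearch",
        "cpus": 1.0,
        "mem": 2048,
        "instances": 3,
        "container": {
            "type": "DOCKER",
            "docker": {
                "image": "my-registry/elasticsearch:latest",  # placeholder image
                "network": "BRIDGE",
                "portMappings": [
                    {"containerPort": 9200, "hostPort": 0},  # HTTP port
                    {"containerPort": 9300, "hostPort": 0},  # transport port
                ],
            },
        },
    }

    # Create the app ...
    requests.post(f"{MARATHON_URL}/v2/apps", json=app).raise_for_status()

    # ... and scale it up or down later by updating the instance count.
    requests.put(f"{MARATHON_URL}/v2/apps/elasticsearch", json={"instances": 5}).raise_for_status()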

This works great for test setups, but the question remains how to persist the data so that if the cluster is either scaled down (I know this is also about the index configuration itself) or stopped, I can restart it later (or scale it up) with the same data.

The thing is that Marathon decides where (on which Mesos slave) the nodes run, so from my point of view it’s not predictable whether all the data will be available to the “new” nodes upon restart if I persist the data to the Docker hosts via Docker volumes.

The only things that come to my mind are:

  • Using a distributed file system like HDFS or NFS, with volumes mounted either on the Docker hosts or in the Docker images themselves (see the sketch after this list). Still, that leaves the question of how to load all the data during the new cluster’s startup if the “old” cluster had, for example, 8 nodes and the new one only has 4.

  • Using the Snapshot API of Elasticsearch to save to a common drive somewhere in the network. I assume that this will have performance penalties…
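
To make the first option more concrete, this is roughly what I have in mind as the Marathon container definition with a mounted volume. It’s only a sketch: it assumes every Docker host mounts the shared storage (e.g. an NFS or HDFS FUSE mount) at the same path, and the image name and paths are placeholders.

    # Sketch of option 1: mount a shared filesystem path from the Docker host
    # into the container via Marathon volumes (placeholder image name and paths).
    container = {
        "type": "DOCKER",
        "docker": {"image": "my-registry/elasticsearch:latest", "network": "BRIDGE"},
        "volumes": [
            {
                "containerPath": "/usr/share/elasticsearch/data",  # ES data dir in the container
                "hostPath": "/mnt/shared/es-data",                 # shared mount on the Mesos slave
                "mode": "RW",
            }
        ],
    }

This would go under the "container" key of the Marathon app definition shown above.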

Are there any other ways to approach this? Are there any recommendations? Unfortunately, I didn’t find a good resource on this topic. Thanks a lot in advance.

2 Answers

Elasticsearch and NFS are not the best of pals ;-). You don’t want to run your cluster on NFS: it’s much too slow, and Elasticsearch works best when storage is fast. If you introduce the network into this equation you’ll get into trouble. I have no idea about Docker or Mesos, but I definitely recommend against NFS. Use snapshot/restore instead.

The first snapshot will take some time, but subsequent snapshots should take less space and less time. Also, note that “incremental” means incremental at the file level, not at the document level.

The snapshot itself needs all the nodes that hold the primaries of the indices you want snapshotted, and all of those nodes need access to a common location (the repository) that they can write to. This shared access to the same location is usually not that obvious, which is why I’m mentioning it.
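
As an illustration, here is a minimal sketch of registering a shared-filesystem repository and taking/restoring a snapshot through the REST API. The repository name, snapshot name, and location are placeholders, and the location has to be listed under path.repo in elasticsearch.yml on every node.

    import requests

    ES = "http://localhost:9200"  # placeholder: any node of the cluster

    # Register a shared-filesystem snapshot repository; every node must be able
    # to write to this location and have it whitelisted via path.repo.
    requests.put(f"{ES}/_snapshot/my_backup", json={
        "type": "fs",
        "settings": {"location": "/mnt/backups/es", "compress": True},
    }).raise_for_status()

    # Take a snapshot of all indices; later snapshots are incremental at the file level.
    requests.put(
        f"{ES}/_snapshot/my_backup/snapshot_1",
        params={"wait_for_completion": "true"},
    ).raise_for_status()

    # Restore the snapshot later, e.g. after the cluster has been recreated.
    requests.post(f"{ES}/_snapshot/my_backup/snapshot_1/_restore").raise_for_status()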

The best way to run Elasticsearch on Mesos is to use a specialized Mesos framework. The first effort in this area is https://github.com/mesosphere/elasticsearch-mesos. There is a more recent project which is, AFAIK, currently under development: https://github.com/mesos/elasticsearch. I don’t know what its status is, but you may want to give it a try.
