100 docker containers vs 100 small machines
so we have an application which is not thread-safe.
One of the libraries it uses takes a lock at the file-system level. Unfortunately, the locking doesn't work correctly: the library crashes and throws an error whenever there is concurrent usage, and we can't switch away from it. To achieve concurrency, which option is better: running 100 containers on one powerful machine, or splitting the work across 100 small machines?
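To illustrate the failure mode (using flock(1) as a stand-in, not our actual library): a second process that tries to take an exclusive file lock in non-blocking mode fails immediately instead of waiting, which is analogous to the library throwing an error on concurrent use.

```shell
LOCKFILE=$(mktemp)

# Holder: take the exclusive lock and keep it for 2 seconds in the background.
flock -x "$LOCKFILE" sleep 2 &
HOLDER=$!
sleep 0.5   # give the holder time to acquire the lock

# Contender: -n = non-blocking; exits non-zero because the lock is held.
if flock -n -x "$LOCKFILE" true; then
    RESULT="lock acquired"
else
    RESULT="lock contention"
fi
echo "$RESULT"

wait "$HOLDER"
rm -f "$LOCKFILE"
```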
Since we are using Amazon, I am thinking about 100x t2.micro instances, each running one container, vs. one c4.8xlarge machine with 100 Docker containers. We don’t have any problem with memory. The tasks are CPU-bound, but not so heavy that a t2.micro can’t handle one, as long as it only processes one at a time.
I got into a discussion with a colleague about which one is better. I prefer the 100 instances because I think the Docker isolation will be a significant overhead: it’s like having a single resource that 100 people need to share. On the other side, my colleague makes a point which I think might be valid: creating a Linux namespace is lighter than starting a whole OS. So with 100 machines we have 100 OSes, while with one big machine we have only 1 OS.
The thing is, I don’t know which one is correct. Could someone with knowledge of this explain which one would be better and give me a concrete reason?
Since I realized I have just asked a bad question, I will try to add more information here. To make the question more precise: I am not really asking which one is better in my specific use case, or which is cheaper. It’s just curiosity about which one performs better in terms of CPU. Imagine we have a very big computational problem, and we have to do 100 of them. We want to parallelize them, but they are not thread-safe. Is it better to run them on 100 small machines or on 1 powerful machine with 100 containers? Which one will complete faster, and why?
If we have only 1 powerful machine, won’t all these 100 containers fight over resources and slow down the overall process? And with 100 small machines, maybe the overall performance will be slower because of the 100 OSes or other factors? In any case, I don’t have any experience with this. Of course I could test it, but since it wouldn’t be an ideal environment (there are a lot of confounding factors), the result wouldn’t be authoritative anyway. I was looking for an answer from people who know how both setups work at a low level and can argue which environment will complete the task faster.
3 Solutions for “100 docker containers vs 100 small machines”
The only “appropriate” answer to your question is: you have to test both options and find out which one is better. The reason for this is: you are running a very specific application, with a very specific workload and very specific requirements. Any recommendation without actual testing is a guess. Maybe an “educated guess”, but not more than that.
That said, let me show you what I would consider when doing my analysis for such a scenario.
The Docker overhead should be absolutely minimal. The tool “docker” itself is not doing much of anything: it’s just using regular Linux kernel features to create an isolated environment for your application.
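This is easy to verify: every Linux process already runs inside a set of kernel namespaces, visible under /proc, and a container is just a process whose namespace membership differs from the host’s. No hypervisor is involved.

```shell
# List the namespaces the current process belongs to (pid, mnt, net, ...).
ls -l /proc/self/ns/

# A containerized process would show different namespace IDs here, but the
# mechanism (and its cost) is the same as for any ordinary process.
NS_COUNT=$(ls /proc/self/ns/ | wc -l)
echo "this process is in $NS_COUNT namespaces"
```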
After the OS has booted up, it will consume some memory, true. But the CPU consumption by the OS itself should be negligible (even for very small instances). Since you mentioned that you don’t have any problems with memory, it seems like we can assume that this “OS overhead” that your colleague mentioned would also probably be negligible.
If you consider the route of “a lot of very small instances”, you could also consider the recently released t2.nano instance type. You would need to test whether it has enough resources to actually run your application.
If you consider the route of “one single very large instance”, you should also consider the c4.8xl instance. It should give you significantly more CPU power than the c3.8xl.
Cost analysis (prices in us-east-1):
- 1x c3.8xlarge: 1.68 $/h
- 1x c4.8xlarge: 1.675 $/h (roughly the same as c3.8xl)
- 100x t2.micro: 1.30 $/h (that is, 100x 0.013$/h = 1.30$/h)
- 100x t2.nano: 0.65 $/h (that is, 100x 0.0065$/h = 0.65$/h) (winner)
Now let’s analyze the amount of resources that you have on each setup. I’m focusing only on CPU and ignoring memory, since you mentioned that your application isn’t memory hungry:
- 1x c3.8xlarge: 32 vCPU
- 1x c4.8xlarge: 36 vCPU (each one typically with better performance than the c3.8xl; the chip for c4 was specifically designed by Intel for EC2) (winner)
- 100x t2.micro: 100x 10% of a vCPU ~= “10 vCPU” aggregate
- 100x t2.nano: 100x 5% of a vCPU ~= “5 vCPU” aggregate
Finally, let’s analyze the cost per resource:
- 1x c3.8xlarge: 1.68$/h / 32 vCPU = 0.0525 $ / (vCPU-h)
- 1x c4.8xlarge: 0.0465 $ / (vCPU-h) (winner)
- 100x t2.micro: 0.13 $ / (vCPU-h)
- 100x t2.nano: 0.13 $ / (vCPU-h)
As you can see, larger instances typically provide higher compute density, and that compute is typically less expensive.
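The cost-per-vCPU figures above are easy to reproduce; this snippet just recomputes them from the quoted us-east-1 on-demand prices (current at the time of writing; they will have changed since):

```shell
# Recompute cost per vCPU-hour from the price and vCPU figures above.
awk 'BEGIN {
    printf "1x   c3.8xlarge: %.4f $/vCPU-h\n", 1.68  / 32
    printf "1x   c4.8xlarge: %.4f $/vCPU-h\n", 1.675 / 36
    printf "100x t2.micro:   %.2f $/vCPU-h\n", 1.30  / 10   # ~10 aggregate vCPU
    printf "100x t2.nano:    %.2f $/vCPU-h\n", 0.65  / 5    # ~5 aggregate vCPU
}'
```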
You should also consider that the T2 instances are “burstable”: they can exceed their baseline performance (the 10% and 5% above) for some time, depending on how many “CPU credits” they have. However, although they start with some credit balance, it’s typically enough for booting the OS and not much more (you accrue more CPU credits over time if you stay below your baseline, but that won’t be the case here, since we are optimizing for performance…). And as we have seen, the “cost per resource” is nearly 3x that of the 8xl instances, so the short burst you’d get probably wouldn’t change this rough estimate.
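As a rough illustration of the burst mechanics (my own simplified model, not AWS-published behavior): assuming t2.micro-era figures of ~6 credits earned per hour, 1 credit = 1 vCPU-minute at full speed, an assumed initial balance of 30 credits, and a 10% baseline once the balance is exhausted, a constant CPU-bound load plays out like this:

```shell
# Simplified model of a t2.micro under constant 100% CPU-bound load.
CREDIT_MODEL=$(awk 'BEGIN {
    credits = 30                          # assumed initial credit balance
    for (h = 1; h <= 3; h++) {
        credits += 6                      # credits earned during this hour
        if (credits >= 60) {              # enough for a full hour at 100%
            speed = 1.0; credits -= 60
        } else {                          # burst until empty, then baseline
            speed = 0.10 + (credits / 60) * 0.90; credits = 0
        }
        printf "hour %d: effective speed %.0f%%\n", h, speed * 100
    }
}')
echo "$CREDIT_MODEL"
```

The instance runs well for a short while, then settles at a fraction of a vCPU, which is why the aggregate-vCPU estimates above treat the burst as a rounding error.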
You might also want to consider network utilization. Is the application network intensive? Either in latency requirements, or bandwidth requirements, or in the amount of packets per second?
- The smaller instances have less network performance available; the larger instances have much more. But each small instance’s network serves a single application, while the powerful network of the large instance is shared among all the containers. The 8xl instances come with a 10 Gbps network card; the t2 instances have very low network performance by comparison.
Now, what about resiliency? How time-sensitive are these jobs? What would be the “cost of not finishing them in a timely manner”? You might also want to consider some failure modes:
- What happens if “one single instance dies”? In the case of 1x c3.8xl or 1x c4.8xl, your whole fleet would be down, and your workers would stop. Would they be able to “recover”? Would they need to “restart” their work? In the case of “a lot of small instances”, well, the effect of “one single instance dying” might be much less impactful.
To reduce the impact of “one single instance dying” on your workload, while still getting some benefit from higher-density compute (i.e., large c3 or c4 instances), you could consider intermediate options such as 2x c4.4xl or 4x c4.2xl. Notice that the c4.8xl costs twice as much as the c4.4xl but contains more than twice the number of vCPUs, so the analysis above isn’t linear; you’d need to recalculate some costs.
Assuming that you are OK with instances “failing” and your application can somehow deal with that, another interesting option is Spot instances. With Spot instances, you name your price. If the “market price” (set by supply and demand) is below your bid, you get the instance and pay only the market price. If the price fluctuates above your bid, your instances are terminated. It’s not uncommon to see discounts of up to 90% compared to On Demand. As of right now, the c3.8xl is approximately 0.28$/h in one AZ (83% less than On Demand), and the c4.8xl is about the same in one AZ (83% less as well). Spot pricing is not available for t2 instances.
You can also consider Spot Blocks, in which you specify the number of hours you want to run your instances. You typically pay 30% – 45% less than On Demand, and there’s no risk of being outbid during the period you specified. After the period, your instances are terminated.
I would also try to size my fleet of servers so that they are needed for nearly a “full number of hours” (but not exceeding it), unless I’m required to finish execution ASAP. That is, it’s much better to have a smaller fleet that finishes the jobs in 50 minutes than a larger one capable of finishing them in 10 minutes, because you pay by the hour, at the beginning of the hour. Likewise, it’s typically much better to have a larger fleet that finishes the job in 50 minutes than a smaller one that would need 1h05, again because you pay by the hour, at the beginning of the hour.
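The pay-per-started-hour arithmetic can be made concrete with hypothetical numbers (a job totaling 500 instance-minutes of work and a $0.10/h instance price; note that EC2 billed per started hour at the time):

```shell
# Cost of the same job on three hypothetical fleet sizes under
# per-started-hour billing.
BILLING=$(awk 'BEGIN {
    price = 0.10; work = 500               # instance-minutes of work
    n = split("10 50 8", sizes, " ")       # candidate fleet sizes
    for (i = 1; i <= n; i++) {
        fleet   = sizes[i]
        minutes = work / fleet             # wall-clock runtime
        hours   = int((minutes + 59) / 60) # billed: every started hour
        printf "%2d instances: finishes in %g min, billed $%.2f\n",
               fleet, minutes, fleet * hours * price
    }
}')
echo "$BILLING"
```

The 10-instance fleet (50 minutes, $1.00) beats both the 50-instance fleet (10 minutes, $5.00) and the 8-instance fleet that spills into a second hour (62.5 minutes, $1.60).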
Finally, you mention that you are looking for “the best performance”. What exactly do you mean by that? What is your key performance indicator? Do you want to minimize the total time spent? The time “per unit”/“per job”? Are you trying to reduce costs? To be more energy-efficient and reduce your carbon footprint? To minimize the maintenance required? Or to keep things simple, reducing the ramp-up period that less knowledgeable colleagues would need before they could maintain the solution?
Maybe a combination of many of the performance indicators above? How would they combine? There’s always a trade off…
As you can see, there isn’t a clear winner, at least not without more specific information about your application and your optimization goals. That’s why, most of the time, the best option for any kind of performance optimization is really to test it. Testing is also typically inexpensive: it might take a couple of hours of work to set up the environments, and then likely less than $2/h of actual testing.
So, that should give you enough information to start your investigation.
Based on EC2 instance CPU alone:
- t2.micro’s have less than 1/3 of the CPU power of an m3.medium
- t2.small’s have less than 2/3
- t2.large’s might be getting close
It’s fairly likely the c4.8xl would be a lot quicker, but there’s no way anyone can say that authoritatively. As Bruno starts with, you need to run your workload through your app on both instance types to see. There are thousands of variables that could affect the results. There’s no inherent reason you can’t run 100 containers/processes on a Linux host; Linux has been doing multi-core, multi-process work for a long time. There may be some simple system limits that need to be tuned as you go (see ulimit -a and sysctl -a). On the other hand, there might be something in your application that performs really badly in this setup.
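As a starting point, these are the limits I would check first on a host packing ~100 containers (the right values depend entirely on the application, so these commands only inspect, they don’t change anything):

```shell
# Per-shell limits that commonly bite with many processes/containers:
echo "max user processes:  $(ulimit -u 2>/dev/null || echo unknown)"
echo "max open files:      $(ulimit -n)"

# System-wide kernel limits, read straight from /proc:
echo "system-wide max fds: $(cat /proc/sys/fs/file-max)"
echo "max PIDs:            $(cat /proc/sys/kernel/pid_max)"
```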
Any Docker overhead pretty much cancels out, since you run a container in both the single-instance and the multi-instance setup. Anything you can improve in this area will help limit the issues you run into on the shared host. IBM published a great report, An Updated Performance Comparison of Virtual Machines and Linux Containers, detailing Docker overheads.
- Container overhead will be negligible for CPU/memory.
- NAT networking adds some overhead, so use host networking (--net=host) for anything network-intensive.
- The AUFS storage layer will incur overhead. For anything write-intensive, mount your host’s EBS/SSD storage directly into the container.
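A hedged sketch of applying the networking and disk points above (`myapp` and `/data` are placeholder names; the command is echoed as a dry run unless you set `DOCKER=docker`):

```shell
# --net=host skips the NAT layer; the bind mount (-v) keeps writes on the
# host's EBS/SSD volume instead of the AUFS copy-on-write layer.
DOCKER=${DOCKER:-echo}   # dry run by default; set DOCKER=docker to launch
CMD=$($DOCKER run -d --net=host -v /data:/data myapp)
echo "$CMD"
```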
One big difference with many Docker containers is that the workload on the VM’s process scheduler will be completely different. It won’t necessarily be worse with more containers, but you are heading into “less tested” territory. You rely less on the Xen hypervisor scheduling on hardware and more on the Linux kernel scheduling inside the VM, and the two use completely different algorithms. Again, Linux is capable of running many processes and dealing with contention, but you’ll only find the answer for your app by testing it.
So, at a complete guess, not taking your application into account in any way, purely on general machine specs and a CPU-heavy workload:
- 1x c4.8xl should be quicker than 100x t2.micro’s.
- 100x t2.small’s might get close.
- 50x t2.large’s with 2 containers each might be on par.
- 100x m3.medium’s would blitz it.
- 3x c4.8xl’s would give you the quickest: a processor hyper-thread dedicated to each process/thread of your app.
tl;dr Test them all. Use the quickest.
In the general case, I would use Docker, simply because it’s easier and faster to add/change/remove nodes and do load balancing. I am assuming here that you do not need exactly 100 nodes, but rather a changing number of them.
It is also much faster to set up. There is a thing called Docker Swarm (https://docs.docker.com/swarm/overview/), and I’m sure it’s worth looking into. Additionally, if you can’t fit 100 or 1000 or 10000 containers on one physical machine (or VM), you can also distribute them with an overlay network: https://docs.docker.com/engine/userguide/networking/dockernetworks/#an-overlay-network
EDIT: after re-reading the question, it seems you want more of a performance-wise answer. That’s really hard to say (if not impossible) without testing. With performance there are always caveats and things you can’t anticipate (on simple account of being human :) ).
Again, setting up 100 or 3 or 999999 machines takes much more time than setting up the same number of Docker containers. I know that container/image terminology still gets mixed up, so just to clarify: you would create one Docker image, which is some work, and then run it in N instances (which are the containers). Someone please correct me on the terminology if I am wrong; this is something I often discuss with colleagues 🙂
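To illustrate the one-image/N-containers split, a minimal sketch (`myapp` is a placeholder image name; commands are echoed as a dry run unless you set `DOCKER=docker`):

```shell
DOCKER=${DOCKER:-echo}   # dry run by default; set DOCKER=docker to launch
N=5                      # would be 100 in the scenario above

# $DOCKER build -t myapp .    # the one-time image build (one image...)
STARTED=0
for i in $(seq 1 "$N"); do   # ...started as N containers
    $DOCKER run -d --name "myapp-$i" myapp && STARTED=$((STARTED + 1))
done
echo "launched $STARTED containers"
```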