Why I don’t benchmark HTTP static object serving

Usually I pick a solution based on its features and architecture. That doesn’t mean that I’d pick a dog-slow application to push the network traffic. But here’s the kicker: in a distributed system, you play by the lowest common denominator, aka the real bottleneck. One thing I learned from experience is that, given the appropriate software, the hardware usually isn’t the limit. The network pipe, on the other hand, is.

A few months ago I had an interesting task: place something in front of Amazon S3 in order to cut down the file distribution costs. While S3 is great for storage, for the actual distribution you’d see that the traffic counter is spinning fast and the USD counter even faster. Certain web hosting providers have good deals for dedicated servers, therefore something like the S3 traffic could be turned into a fixed cost, given by the number of machines and the bandwidth pool specified in the contract.

This is the part where the technology kicks in. As I said earlier, the hardware is rarely the limit. Remember the C10K problem? Some servers deal with this stuff better than others. This means that I wouldn’t pick Apache, or any other web server that handles connections using the same model. The network has a high latency compared to the rest of the hardware, even the local disk. People should stop thinking in blocking mode.

For the above task, nginx did the load balancing, fail-over, and URL protection part really well. squid did the other part, aka the proxy cache job. I think nginx doesn’t need an introduction, but you may wonder: why the good old squid?

I tried Varnish as the proxy cache, but the people behind it thought it was funny not to have an unbuffered transfer mode. This might work well sometimes, with “most of the web traffic”, as they say, but when you try to push a large file over HTTP, which is the real-world usage in our case, the conversation goes like this:

HAI, I ARE TEH CLIENT. PLZ GIVE ME /large-file.zip
LOL, OK, BUT WAI TILL HAZ IT IN ME CACHE

I considered trying Varnish because of all the hype surrounding it. I kept myself from going straight to squid because of the so-called StackOverflow type of specialist who doesn’t recommend it since Varnish is “way faster”. Got a little news flash for you, brother: it doesn’t matter. The fact that a client needs to wait till the proxy cache buffers all the stuff from the origin server (S3) won’t count as “speed”. It may work great for CDN-like traffic where certain patterns emerge, but not here, where we have to deal with hundreds of megabytes for a single object.

The next bus stop was Apache Traffic Server. While it looks good on paper, it takes a rocket scientist to configure a simple use case. While I like the pictures with the baseball bat explaining the concept of a cache hit or miss, it still lacks proper, understandable documentation that lets a newcomer get something up and running in a timely manner. I am not a sysadmin newb, but some things are harder to pick up from scratch than others. Work against a short deadline and you’ll get my point. My squid configuration file has exactly 18 lines. All human readable and easy to understand. The default Traffic Server installation had as many configuration files. How am I supposed to figure out this mess?

Things I also considered: nginx – it is dreadful as a proxy cache due to the lack of HTTP/1.1 for the proxy backend, therefore no If-Modified-Since or If-None-Match support for revalidating a cached object. Amazon S3 sends both Last-Modified and ETag. A half-baked product from this side of the story. We have HTTP/1.1 and the 304 status. Igor, are you there?
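To make the complaint concrete, here is a minimal Python sketch of what revalidation looks like from the cache’s side. It is not how nginx, squid, or any real proxy implements it; the function names and the returned action strings are made up for illustration. The only things taken from HTTP itself are the header names and the 304 status.

```python
from typing import Dict, Optional


def revalidation_headers(etag: Optional[str],
                         last_modified: Optional[str]) -> Dict[str, str]:
    """Build the conditional headers a cache sends to the origin.

    The validators are the ETag and Last-Modified values stored with the
    cached object (S3 sends both). Function name is hypothetical.
    """
    headers: Dict[str, str] = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def action_for_status(status: int) -> str:
    """Map the origin's answer to a cache action (labels are hypothetical)."""
    if status == 304:
        # Not Modified: the cached copy is still good, no body is transferred.
        return "serve-cached"
    if status == 200:
        # The object changed: store and serve the new body.
        return "replace-cached"
    return "pass-through"


headers = revalidation_headers('"abc123"', "Wed, 21 Oct 2015 07:28:00 GMT")
```

Without those two request headers the cache has no cheap way to ask “did it change?”, so an expired object means pulling the whole thing from the origin again.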

lighttpd – also annoyed me with some proxy cache limitations that didn’t play nice. Don’t get me wrong, its security record aside, lighttpd is actually a great web server, performance wise. It pushed our file distribution traffic for almost 4 years from a couple of FreeBSD machines. We changed the hardware every 2 years though. I never saw those machines going over 200 MiB of RAM even at “rush hour”, while the CPU used to sip power from the socket instead of going like crazy. I didn’t even install a monitoring system since there was nothing to alert on. The hardware load balancer did the fail-over stuff back then. Believe it or not, I actually felt sorry for turning off a web server with almost 600 days of uptime, after moving to EC2. But I still hate to use lighttpd as a proxy cache.

I thought about Apache + mod_proxy for 3.14 seconds, but I moved on. Even with the Event MPM it falls behind the competition. While I have a great deal of respect for the Apache Foundation, I still don’t get how they manage to screw things up with the Apache HTTP Server. I mean, come on, I expect more from the so-called market share leader.

I gave a thought to node.js, which I think is a terrific piece of technology, if you can ignore some of the silliness of JavaScript. But I wanted a piece of software that’s maintained for my kind of need by a bunch of people who have been doing this stuff for quite a while, not a home-brewed proxy cache.

Which brought me down to squid. A lot of “expert opinions” kept me from picking it in the first place, which cost me the pain of installing, configuring, and testing a bunch of alternatives, even though I’ve been using it as a forward proxy for years. It does everything that it needs to do as a proxy cache with more or less configuration. It is an asynchronous server. It isn’t as slow as advertised, even though it’s written in that “slow, modern language” called C++. The configuration is straightforward. In fact, since the cluster went into production, it hasn’t come anywhere near the hardware limits.

Something bad happened in the first couple of days because the real production traffic didn’t match the theoretical testing: clients who send invalidation headers along with ranged requests, which squid doesn’t handle very well. This sent the whole cluster into a vicious circle. The solution was simple at the nginx layer: strip down all the headers, pass just the relevant stuff. nginx does the header management way better than squid, which tries to implement some sort of “standards compliant” stuff that makes it very stiff. But even then, even though the cluster was working like crazy, the CPU didn’t even come close to 20% while the RAM was mostly doing nothing. The network pipe, on the other hand, was working at full capacity, therefore some high I/O waits started to build up. Same goes for nginx.
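The “strip down all the headers, pass just the relevant stuff” idea is just an allow-list. Here is a rough Python sketch of the logic, not the actual nginx configuration from the cluster. The function name is made up, and which headers count as “relevant” is my assumption for the example (Host and Range only).

```python
from typing import Dict, Iterable

# Assumption for illustration: only these request headers reach the cache.
ALLOWED_REQUEST_HEADERS = ("host", "range")


def strip_headers(headers: Dict[str, str],
                  allowed: Iterable[str] = ALLOWED_REQUEST_HEADERS) -> Dict[str, str]:
    """Return only the allow-listed headers; header names are case-insensitive."""
    allowed_lower = {name.lower() for name in allowed}
    return {name: value
            for name, value in headers.items()
            if name.lower() in allowed_lower}


# A ranged request that also carries client-side invalidation headers.
incoming = {
    "Host": "files.example.com",
    "Range": "bytes=0-1023",
    "Cache-Control": "no-cache",
    "Pragma": "no-cache",
}
forwarded = strip_headers(incoming)
```

With the invalidation headers gone, the cache behind the front end only ever sees a plain ranged request.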

Therefore no, I won’t make the same mistake twice. I stopped doing benchmarks. I don’t care about faster, just about fast enough and well enough engineered. Business wise, we don’t pay less for using less CPU or RAM. Or even for drawing less power from the socket. Even Amazon doesn’t do that with its pay-as-you-go-only-what-you-use model. We have a bunch of dedicated hardware that we can use at 1% or at 100% load. The bandwidth pool is where the actual money saving is. I encourage fellow sysadmins to do the same: stop caring about irrelevant stuff like “this web server is 10% faster”. Even if it’s 200% faster (made up number), if the network pipe is the limit, it doesn’t matter overall. That stuff matters in the application server market where you have to shuffle data around, instead of taking a bunch of bytes off the disk and pushing them over the network interface. Today’s hardware is stupid fast. When 10 Gbps is going to be a standard, then yes, some of the great debates may actually have some ground. Currently they don’t. People like to argue about anything. Even toilet paper positioning. I fail to see where the productive part is.
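The “200% faster doesn’t matter” argument fits in a few lines of Python. This is a back-of-the-envelope sketch: the request rates and the 1 MiB object size are made up, and protocol overhead is ignored on purpose.

```python
GIGABIT = 1_000_000_000  # 1 Gbps link, in bits per second


def pipe_limit_rps(link_bits_per_second: float, object_bytes: int) -> float:
    """Requests per second the link can carry, counting payload only."""
    return link_bits_per_second / (object_bytes * 8)


def delivered_rps(server_rps: float, link_bits_per_second: float,
                  object_bytes: int) -> float:
    """What the clients actually get: the smaller of the two limits."""
    return min(server_rps, pipe_limit_rps(link_bits_per_second, object_bytes))


OBJECT_BYTES = 1024 * 1024  # hypothetical 1 MiB object

# Hypothetical benchmark numbers: the second server is "200% faster".
baseline = delivered_rps(10_000, GIGABIT, OBJECT_BYTES)
faster = delivered_rps(30_000, GIGABIT, OBJECT_BYTES)
```

Both values come out the same, a bit over 119 requests per second, because the pipe caps them long before either server breaks a sweat.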

Don’t get me wrong. I hate bloatware. I love the raw speed. I love the machines that return a fast answer. But sometimes good enough is enough. Don’t let “the religion” stand in the way of doing your job properly. People should focus on finding solutions instead of fighting religious wars with colorful graphs.

For example, serving 55,000 requests per second for a 2 KiB object over the loopback interface takes about 0.9 Gbps for the payload alone; add the HTTP response headers and the TCP/IP framing and it requires more bandwidth than a 1 Gbps network pipe can provide. 2 KiB is a very small file. Most JavaScript libs and style-sheets go over that limit, therefore the bandwidth limit is easier to hit. I don’t have to mention the images or plain large objects. Yes, some tests may be great for the ego of a developer behind a certain product. But most of the time some sort of apples-to-oranges comparison is involved. Certain features cost CPU time. But most of the time, the lack of specific features costs us time. Time is money. The CPU time, in this case, isn’t.
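The arithmetic behind the 55,000 requests per second figure, as a small Python sketch. The 250 bytes of response headers is my assumption for a typical static response, and TCP/IP and Ethernet framing is left out entirely, so the real number on the wire is higher still.

```python
def required_bits_per_second(requests_per_second: int, body_bytes: int,
                             header_bytes: int = 0) -> int:
    """Bandwidth needed to carry the responses, ignoring TCP/IP framing."""
    return requests_per_second * (body_bytes + header_bytes) * 8


REQUESTS_PER_SECOND = 55_000
BODY_BYTES = 2 * 1024   # the 2 KiB object
HEADER_BYTES = 250      # assumed size of the HTTP response headers

payload_only = required_bits_per_second(REQUESTS_PER_SECOND, BODY_BYTES)
# 901_120_000 bits/s: about 0.9 Gbps before a single header byte is sent

with_headers = required_bits_per_second(REQUESTS_PER_SECOND, BODY_BYTES,
                                        HEADER_BYTES)
# 1_011_120_000 bits/s: already past what a 1 Gbps pipe can carry
```

So a benchmark number like that only exists over loopback; a 1 Gbps network card would have capped it first.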

PS: we’re caching hundreds of GiB. We’re serving at least one TiB of data per day. Not bad for an old, slow, proxy cache running on a couple of machines that nobody wants to use since we have this new cool technology. Do we?