Wednesday, September 2, 2009

Latency: Solving 80% of the Problem

There’s a great post describing the “solve 80% of the problem” mantra that’s worth paying attention to, especially in IT, where idealism yields to much tougher budget, risk-appetite, time, and resource constraints; you can really drown in the last 20%, even on problems far easier than real-time search.

We’ve had a couple of interesting discussions on a very old and widely-covered topic: latency. As it relates to web response times, studies (by Google, Amazon and others) suggest that delays as small as 250-500ms matter to your audience and impact revenue [6].

Two completely separate threads, involving two completely different teams (with participants up to senior management), debated issues that ultimately distilled down to latency:

  1. How to reduce excessive round-trips required to render the homepage;
  2. How to keep users pegged to a particular Data Center without introducing routing delays;

Before looking at solutions, let’s make sure we have a good sense for what latency is and why it matters so much. What may have once been dismissed as an ‘implementation detail’ is increasingly in the spotlight as businesses build layers of redundancy, cater to wider audiences, and rely on distributed solutions. Side-stepping text-book definitions [1], a basic description that resonates well is offered by Stuart Cheshire in his now famous “It’s the Latency, Stupid” post:

If you have a network link with low bandwidth then it's an easy matter of putting several in parallel to make a combined link with higher bandwidth, but if you have a network link with bad latency then no amount of money can turn any number of them into a link with good latency.
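
A quick back-of-the-envelope model makes the point concrete. The sketch below (Python; the link figures are illustrative assumptions, not measurements) treats fetch time as roughly the round trip plus transmission time: for small payloads, a ten-fold bandwidth increase barely moves the needle, while halving latency does.

    # Rough model: fetch time = round trip + payload / bandwidth.
    # The link figures below are illustrative assumptions, not measurements.

    def transfer_time_ms(payload_bytes, rtt_ms, bandwidth_mbps):
        """Approximate time to fetch one payload over an established connection."""
        transmit_ms = (payload_bytes * 8) / (bandwidth_mbps * 1_000_000) * 1000
        return rtt_ms + transmit_ms

    payload = 20 * 1024  # a small web asset, ~20 KB

    print(transfer_time_ms(payload, rtt_ms=80, bandwidth_mbps=10))   # ~96 ms
    print(transfer_time_ms(payload, rtt_ms=80, bandwidth_mbps=100))  # ~82 ms (10x the bandwidth, barely faster)
    print(transfer_time_ms(payload, rtt_ms=40, bandwidth_mbps=10))   # ~56 ms (half the latency, much faster)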

To see how important this is to the internet elite, look no further than recent headlines [2]:

Miller: The choice of rural North Carolina suggests that the bottom line for Apple is cost, rather than connectivity. The site in Maiden, NC is not far from a large data center by Google, which usually chases cheap power and tax incentives. Power from Duke Energy is about 4 to 5 cents per kilowatt hour, compared to 7 to 12 cents in California. The company also maximized its incentives by pitting Virginia and North Carolina against one another in trying to wring the best tax incentives out of both states (a popular strategy in data center site location).

Some large companies use distributed data centers to manage their latency and content delivery costs. That may be part of Apple’s thinking, since they’re a major customer for CDNs (I believe they use both Akamai and Limelight Networks). Facebook cited latency to Europe as a key factor in its decision to add data centers in Virginia. Before that, MySpace added a data center in Los Angeles to reduce its reliance on CDNs. But in both cases, those companies sought out Internet hubs where they could connect with dozens of other networks to manage their Internet traffic. You don’t get that in rural North Carolina, so Apple seems more focused on cost and scale than on connectivity – which again would suggest a cloud focus.

It’s a big problem that’s only going to get bigger; this much is accepted [4]:

But, as Todd Hoff notes in "Latency is Everywhere and it Costs You Sales - How to Crush it," latency concerns are still very much with us. In fact, the nine sources of latency that he lists suggest that latency is actually a much thornier problem in a world where applications are broken into pieces and often distributed around the world.

Just a few days before these debates flared up, I stumbled across a Google tech talk on “Multihoming” (operating out of multiple Data Centers) [5]. It was particularly relevant at the time, as we had just encountered some of the challenges – even after ‘cheating’, in a sense, by dodging the toughest aspect of multihoming (distributing data).

In his talk, Ryan highlights the fact that CDNs [3] are indeed masters at “edge” caching (removing latency); we’ve covered this as well, but to recap: it’s one of the most sensible, cost-effective things you can do when catering to distributed audiences; regardless of what Jeff A. would have you believe [6], a well-selected CDN improves performance and reduces bandwidth expenses at the same time.

He also goes on to highlight that geolocality, “putting stuff near your users so their requests get to it and back, FAST”, is a primary driver for them too: “in the US, we are somewhat spoiled… and it’s easy to ignore the benefits of geolocality. Lots of stuff is here and we have good backbone connectivity to everywhere else. If you’ve ever spent time in India, China… there is a mixed bag of infrastructure at best. Even in Australia, there is good infrastructure but it’s far away.”

(We’ve seen this first-hand: imagine trying to serve up Microsoft’s SharePoint to remote mine sites over satellite links – it’s not pretty.)

Even here in North America, the round trip from east coast to west coast is 30 ms, without any queuing or router delays. “Purely speed of light,” Ryan adds. “If you go through some of the big peering points (PAC-East, PAC-West, MAE-East, MAE-West), God help you; they’re always overloaded. They’re going to add another 30-50ms without breaking a sweat.”

When you add this up, it takes no more than ten sequential round trips – often far fewer – to create a delay that users notice.
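
To make the arithmetic explicit, here’s a rough tally (Python; it reuses the figures quoted above and assumes the peering penalty applies to every trip):

    # Illustrative tally of sequential round trips, using the figures quoted above.
    COAST_TO_COAST_RTT_MS = 30   # speed-of-light round trip, no queuing
    PEERING_PENALTY_MS = 40      # assumed midpoint of the "another 30-50ms" at a congested peering point

    def page_delay_ms(round_trips, through_peering=True):
        per_trip = COAST_TO_COAST_RTT_MS + (PEERING_PENALTY_MS if through_peering else 0)
        return round_trips * per_trip

    for n in (3, 5, 10):
        print(n, "round trips:", page_delay_ms(n), "ms")
    # 3 -> 210 ms, 5 -> 350 ms, 10 -> 700 ms: well past the 250-500ms users notice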

It’s not that we didn’t appreciate this, it's just that it's not always easy to fix, and in this case, the cost/benefit wasn't particularly attractive; latency could be fought more efficiently and more easily on other fronts.

With the two initial problems, we accepted the complexities and agreed that:

  1. Instead of all-out browser caching (for static requests), we can offer most of the benefit (80%?) without custom development simply by using hourly or daily expirations that coincide with release schedules, as sketched after this list (otherwise, you need to start thinking about filenames with embedded release info to invalidate client copies – had this been baked into the initial design, it'd be a different story);
  2. We can keep more than 80% of users pegged to a DC without material changes to the application; instead of working on a treatment for the remaining users, we can spend the time reducing more obvious latency issues in a manner that benefits everyone.
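
As a sketch of the first point (hypothetical; it assumes a generic Python web tier rather than our actual stack, and a daily 03:00 UTC release window), static responses can carry cache headers that expire at the next scheduled release, so client copies invalidate themselves right after each deploy:

    # Hypothetical sketch: cache headers that expire at the next scheduled release
    # window (assumed here to be daily at 03:00 UTC).
    from datetime import datetime, timedelta, timezone
    from email.utils import format_datetime

    RELEASE_HOUR_UTC = 3  # assumed daily release window

    def static_cache_headers(now=None):
        now = now or datetime.now(timezone.utc)
        next_release = now.replace(hour=RELEASE_HOUR_UTC, minute=0, second=0, microsecond=0)
        if next_release <= now:
            next_release += timedelta(days=1)
        max_age = int((next_release - now).total_seconds())
        return {
            "Cache-Control": f"public, max-age={max_age}",
            "Expires": format_datetime(next_release, usegmt=True),
        }

    # Example: headers for a static asset served some hours before the release window.
    print(static_cache_headers())

Filename fingerprinting (embedding release info in the URL) invalidates client copies instantly, but as noted above it really needs to be baked into the design from the start.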

Technical Notes:

On keeping users pegged to a particular DC – because the application requires web-server affinity (it relies on in-process memory), it also requires data-center affinity. This is normally provided by the BigIP F5 load balancer (GTM, to be specific) through stickiness based on the IP that queried the DNS record. It turns out that some ISPs proxy out these queries, so a particular user, mid-session (behavior varies across browsers), may suddenly see a new IP for the next DNS query and potentially land at a new DC. And this effect is only amplified by short TTLs.
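
A toy illustration of the failure mode (hypothetical data-center names and a deliberately trivial mapping; GTM's real persistence logic is more sophisticated): if the data center is chosen from the IP of the resolver that issued the DNS query, a user whose ISP proxies lookups through different resolver farms can be handed a different answer mid-session.

    # Toy model (not GTM's actual algorithm): choose a data center from the
    # IP of the resolver that issued the DNS query.
    DATA_CENTERS = ["dc-east", "dc-west"]   # hypothetical names

    def dc_for_resolver(resolver_ip: str) -> str:
        # Deliberately trivial mapping: hash on the resolver's last octet.
        last_octet = int(resolver_ip.rsplit(".", 1)[1])
        return DATA_CENTERS[last_octet % len(DATA_CENTERS)]

    # One user, but the ISP proxies consecutive lookups through different resolver
    # farms; short TTLs force more lookups, so the flip happens sooner.
    print(dc_for_resolver("203.0.113.10"))   # dc-east
    print(dc_for_resolver("198.51.100.7"))   # dc-west -- mid-session, the user lands at a new DC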

Though there are a couple of alternative solutions, the choices really boil down to: re-routing users back to the proper DC behind the scenes (introducing latency via a 100KM link that all requests for these users must funnel through), or exposing DCs under unique host-names (messy from an SEO and usability perspective).
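
For a rough sense of scale on the first option (a back-of-the-envelope Python sketch; every figure in it is an assumption): the raw propagation cost of the extra 100KM hop is modest, so the penalty that matters is that all requests for these users funnel through one shared path, picking up its extra hops and whatever queuing builds on it.

    # Back-of-the-envelope cost of funnelling a request over the extra 100KM link.
    # Every figure here is an assumption, not a measurement.
    FIBRE_KM_PER_MS = 200          # light in fibre travels at roughly 2/3 c
    LINK_KM = 100
    PER_DEVICE_MS = 0.5            # assumed per-hop switching/routing overhead
    EXTRA_DEVICES = 4              # assumed extra devices on the detour

    one_way_ms = LINK_KM / FIBRE_KM_PER_MS + EXTRA_DEVICES * PER_DEVICE_MS
    print("extra round-trip cost ~", 2 * one_way_ms, "ms")  # ~5 ms before any queuing on the shared link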

Which would you choose?

[1] - Latency: The Silent Killer of Application Performance http://www3.villanova.edu/gartner/research/123400/123455/123455.pdf

Latency refers to the time required for a packet to traverse the network from source to destination. When dealing with the WAN, latency is measured as "round trip time" (RTT). In LAN environments, where applications are usually designed and tested, RTT is less than 10 milliseconds (ms). When dealing with cross-continental WANs, RTT increases from 50 ms to 75 ms; in global networks, it will reach 250 ms or more. The bandwidth provisioned is not a significant factor (especially once connections reach T1/E1 rates) in the overall latency because the predominant contributor to RTT (and hence, application delay) is the time "on the wire," which is gated by the speed of light. Therefore, latency (and application performance) is directly proportional to distance from the data center.

[2] - Interview: Apple’s Gigantic New Data Center Hints at Cloud Computing http://www.cultofmac.com/interview-apples-gigantic-new-data-center-hints-at-cloud-computing/14680#more-14680

[3] - What CDN would you recommend? http://highscalability.com/what-cdn-would-you-recommend-0

[4] - Latency (still) matters http://news.cnet.com/8301-13556_3-10024650-61.html

[5] - Multihoming: How Google Serves Data from Multiple Datacenters http://code.google.com/events/io/sessions/TransactionsAcrossDatacenters.html http://highscalability.com/how-google-serves-data-multiple-datacenters

[6] - YSlow: Yahoo's Problems Are Not Your Problems http://www.codinghorror.com/blog/archives/000932.html

[7] - When the Speed of Light is Too Slow http://t1rex.blogspot.com/2005/03/when-speed-of-light-is-too-slow.html http://highscalability.com/latency-everywhere-and-it-costs-you-sales-how-crush-it
