Tuesday, July 14, 2009

Hosting Your Site on a Content Distribution Network (CDN)

What if you were able to deploy a completely updatable static HTML site in minutes and have it distributed to a global audience for pennies a GB?

A thought occurred to me the other day: offerings like Google Sites and SharePoint Online, though not branded as CDNs per se, likely enjoy reliable access times globally, and both allow arbitrary files to be addressable at user-defined URLs using CNAME records. However, with Sites at least, *.HTML, *.CSS, and *.JS files are explicitly prohibited, leaving users struggling to achieve anything not permitted in the default templates and relegating the products to internal-use scenarios.

But imagine if you had free rein over static files that leverage the same infrastructure. Imagine if a CDN could host your entire website and not just its static elements. Sure, Azure (pricing just announced [14]), EC2 and AppEngine carry promise for offloading applications to the cloud, but the serving of static sites still seems better suited to well-established CDNs.

Welcome to the world that I’ve been immersed in for the last few days: evaluating CDNs as a cost-effective approach for hosting a static site with a gradual ramp-up to ~2M visitors a day. Solution requirements: high-availability, high-scalability and cost-predictability.

Traditional players (Limelight, Akamai and Level 3 – who collectively control 80% of the market) are slowly being challenged by a new breed of pay-as-you-go ‘budget’ CDNs courtesy of pioneers like Amazon, SimpleCDN and the like. Even Limelight re-sellers like Rackspace are undercutting aggressively while offering users a cheap alternative to the same network. In terms of high-level feature set, they’re all roughly comparable; key differences lie mostly in the quality and reach of the distribution networks via so-called “edge” servers around the world.

But there are massive price differences [3] – as large as 10-20x between the least expensive (SimpleCDN) and the most expensive (Akamai). It’s not clear what the noteworthy differentiator between the leading networks is (or whether there is one) – I think the jury’s still out and, in practice, it likely depends largely on the account and the provider’s desire to accommodate you. Attempts to quantify performance differences between CDNs are a seriously challenged science [10]. The conclusion of “..if you want to really test a CDN’s performance, and see what it will do for your content and your users, you’ve got to run a trial” seems to be the only take-away. And the challengers aren’t without their share of criticism either [6] – both SimpleCDN and Amazon are relatively new (Amazon’s CloudFront is still in beta and doesn’t offer an SLA; it has had some outages) and both have relatively limited distribution networks (SimpleCDN has 10 points of presence in US/Europe and Amazon has 14 that cover Asia as well [9]).

Another consideration: separation of storage from delivery. Normally, the CDN can either host your content and charge you a storage fee ($0.15/GB, which is pretty standard across the lot, including Azure) or it can access content that’s already hosted elsewhere and distribute it. With Amazon, you must use their storage service (S3) to be eligible for the CDN, and they will charge you the transfer rates for propagating content from S3 to the CloudFront CDN. SimpleCDN has a product in beta (Lightening) that supports already-hosted content, but their mature offering, like Amazon’s, also relies on a built-in storage service (also at $0.15/GB).

But hosting your own content defeats the purpose of freeing yourself from infrastructure and high-availability concerns; you would have to ensure that the source content is always available or risk requests from the CDN (upon cache expirations) resulting in 404s (this behaviour was confirmed for at least Limelight). And this brings me to my next point…

Cache-Control Headers – Why they matter

Cache-control headers on source content (whether hosted by you or a storage service) are respected by the CDN distribution network. This is the only mechanism you have to expire content located on edge servers; otherwise it defaults to something like 24 hours. Normally, there is no programmatic API to invalidate cached files on edge servers [5], which creates a tricky balance: to be able to update something, you must define the TTL when you create the file, before it propagates to edge servers. This is particularly relevant for URLs that can’t be versioned. You can’t selectively update something: either you pay the price of updating it regularly or give up the ability entirely. In our case, we’re looking to employ the CDN in a unique way to host the entire site. In order to do so, we need to refresh HTML every few minutes, and cache-controls on the source files allow us to do this. At the same time, we want to employ standard caching practices to ensure that images and CSS files are held as far out as the clients and that only HTML requests older than, say, x minutes reach the edge servers and propagate back to the source server.
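
As a rough illustration of that balance, here is a minimal Python sketch of how a publisher might choose a Cache-Control value per file type before uploading. The TTL values (5 minutes for HTML, 30 days for everything else) and the function name are assumptions made up for this example, not values prescribed by any of the CDNs discussed.

```python
import posixpath

# Assumed TTLs, for illustration only: HTML is refreshed every few minutes,
# while images, CSS and scripts are cached for a long time.
HTML_TTL_SECONDS = 5 * 60
ASSET_TTL_SECONDS = 30 * 24 * 60 * 60


def cache_control_for(path):
    """Return a Cache-Control header value for the given file path."""
    extension = posixpath.splitext(path)[1].lower()
    if extension in (".html", ".htm"):
        ttl = HTML_TTL_SECONDS
    else:
        ttl = ASSET_TTL_SECONDS
    return "public, max-age=%d" % ttl


print(cache_control_for("index.html"))
print(cache_control_for("css/site.css"))
```

Because the header travels with the file at upload time, the short TTL on HTML is what keeps the pages updatable, while the long TTL on other assets keeps most requests away from the source.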

Can you run the entire website on a CDN?

I’d like to think so, and we’re still trying to see if there are any hidden catches in this scheme. One concern that we’ve run into: how do we ensure that domain.com and www.domain.com both redirect to the index page www.domain.com/index.html? Remember this isn’t a web server where you can just go into IIS and configure the default page. These are CNAME records that point to the custom CDN sub-domain that’s been provisioned for your particular account. In my discussions with Limelight, they mentioned that their engineer can apply this as a custom one-time configuration, which is a nice touch. For SimpleCDN and Amazon, I think your only option is to set up DNS WebForward records so that both domain.com and www.domain.com point to content.domain.com/index.html, with the CNAME record for content.domain.com pointing to your CDN. Under this workaround, trying to access content.domain.com directly will generate a 403 [11]. The lack of an “index file” option in the pay-as-you-go CDNs could be a bit of a hindrance if you’re sensitive about your URLs.
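
The workaround can be pictured as a small record table. The Python sketch below only models the behaviour described above; it does not talk to any DNS provider. "WEBFORWARD" is shorthand for a registrar's HTTP forwarding feature rather than a real DNS record type, and the CNAME target is a made-up placeholder.

```python
# Hypothetical record table mirroring the workaround described above.
RECORDS = {
    "domain.com": ("WEBFORWARD", "http://content.domain.com/index.html"),
    "www.domain.com": ("WEBFORWARD", "http://content.domain.com/index.html"),
    # Placeholder target; the real value is whatever the CDN provisions.
    "content.domain.com": ("CNAME", "example-account.cdn-provider.example"),
}


def landing_url(host):
    """Return the URL a visitor ends up on after typing a bare host name."""
    kind, target = RECORDS[host]
    if kind == "WEBFORWARD":
        # The forwarding service redirects the browser to the full URL.
        return target
    # A CNAME only aliases the host name. The request path stays "/",
    # and without an index-file option the CDN has no page to serve there.
    return "http://%s/" % host


for host in ("domain.com", "www.domain.com", "content.domain.com"):
    print(host, "->", landing_url(host))
```

The last case is the one that produces the error response mentioned above: the host resolves, but the root path maps to no file.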

And where does the HTML come from?

An internal ASP.NET site generates the desired HTML; a scheduled service consumes this HTML and publishes to the storage service (via FTP or SOAP APIs) on a regular interval, with cache headers to expire the content as needed. This publication scheme itself needs some level of redundancy for a solution demanding high-availability, but instead of failures propagating to your end-users they’ll simply result in stale content.
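
A single run of such a publishing job might look like the following Python sketch. It is not our actual service: `fetch_html` and `upload` are hypothetical callables standing in for the internal site and the storage API, and the 5-minute max-age is an assumed value.

```python
def publish_once(fetch_html, upload, max_age_seconds=300):
    """Fetch freshly generated HTML and push it to the storage service.

    fetch_html and upload are hypothetical callables standing in for the
    internal site and the storage API (FTP, SOAP, or otherwise).
    Returns True if the new copy was published, False if the run failed
    and the previously published (now stale) copy was left in place.
    """
    try:
        html = fetch_html()
        headers = {
            "Content-Type": "text/html",
            "Cache-Control": "public, max-age=%d" % max_age_seconds,
        }
        upload("index.html", html, headers)
        return True
    except Exception:
        # A failed run is skipped; visitors keep seeing the last good copy.
        return False


# Usage with an in-memory dictionary playing the role of the storage service.
store = {}


def fake_upload(key, body, headers):
    store[key] = (body, headers)


def broken_fetch():
    raise IOError("internal site unavailable")


print(publish_once(lambda: "<html>v1</html>", fake_upload))
print(publish_once(broken_fetch, fake_upload))
print(store["index.html"][0])
```

The second run fails, yet the stored page is still the first version, which is the "stale rather than broken" property described above. A scheduler would simply call `publish_once` again at the next interval.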

Pricing

Traditional hosting doesn’t stand a chance of competing with these prices and is more than 2x the cost even at the Limelight tier. Managed hosting providers will likely charge bandwidth on the 95th percentile of the transfer rate as opposed to the cumulative transfer amount. This means they take the highest 5% of hourly/daily Mbps readings, drop them, and use the next highest reading to calculate your monthly invoice. Say you had a commitment of 3-5Mbps burstable to 100Mbps at an overage charge of $350 per 1Mbps. This carries tremendous cost risk for a solution that may spike for more than 5% of the billing period but otherwise remain near the initial commitment level. And if you need a gigabit burstable network, then costs effectively double from there. By their own admission, managed hosting providers currently lack support for on-demand scaling and are inherently unequipped to compete in these scenarios; not to mention they would require a minimum 12-month commitment that would discourage event-based sites and campaigns that are active for only a portion of the year. (Limelight also has a minimum 12-month commitment, but they offer so-called ‘bucket pricing’ that gives you a fixed $/GB rate that can be used for shorter durations.)
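
To see why 95th-percentile billing is risky for spiky traffic, here is a small Python sketch of the calculation as described above. The sample data is invented for illustration; the 5Mbps commitment and $350 per Mbps overage rate are taken from the example in the previous paragraph, and real providers may differ in how they sample and round.

```python
def billable_mbps(samples):
    """Drop the highest 5% of readings and return the next highest one."""
    if not samples:
        raise ValueError("at least one reading is required")
    ordered = sorted(samples)
    # Nearest-rank 95th percentile, computed with integer arithmetic.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def overage_charge(samples, committed_mbps, overage_per_mbps):
    """Charge for each Mbps of the billable rate above the commitment."""
    excess = max(0, billable_mbps(samples) - committed_mbps)
    return excess * overage_per_mbps


# 100 invented readings: mostly quiet at 4Mbps with a few spikes to 60Mbps.
five_spikes = [4] * 95 + [60] * 5
six_spikes = [4] * 94 + [60] * 6

print(overage_charge(five_spikes, 5, 350))
print(overage_charge(six_spikes, 5, 350))
```

With five spikes out of a hundred readings the spikes are all dropped and there is no overage. One more spike pushes the billable rate to 60Mbps, and the overage becomes 55 x $350 = $19,250 for the month.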

Rackspace Cloud (Formerly Mosso)

We looked briefly at Rackspace because of their strategic partnership with Limelight Networks. While access to the full range of the Limelight reach (60 points-of-presence globally) at $0.22/GB is compelling, we had no choice but to dismiss this option for two reasons: lack of CNAME support and lack of access for direct uploads to Limelight. Having to upload to Limelight via Rackspace would introduce a new point of failure in a solution that relies on a frequent update schedule. Others have also highlighted the lack of SSL support with Rackspace [12].

Limelight Networks

I’d also like to take this opportunity to acknowledge the unusually strong first impression that Limelight customer service made on me; these guys are a class act all the way, from quick turnarounds to their willingness to accommodate our specifics and not being afraid to refer us to resellers if the scenario warranted it.

Takeaway

Performance-wise, Limelight is head-and-shoulders above the field surveyed here [12] [13] (by one report ~4.2x faster than CloudFront in US/Europe, ~1.9x faster in Asia and ~3.1x faster globally). If CNAME and SSL aren’t necessary, then Cloud Files is a compelling access point to this network. I think all CDNs could support read-only (i.e. HTTP GET) sites just as easily; if you need to support POSTs, you have to figure out a way to pass those directly to the source server. Released 3-4 months ago, LimelightSITE seems to be specifically geared towards this scenario. Akamai also offers a white paper that describes their take on the same concept (scaling dynamic applications):

[image: figure from the Akamai white paper, not preserved]

Aside from the J2EE “EdgeComputing” support, I think these options are both geared towards scaling the serving of content (even if it’s personalized) and not scaling the processing of data (like an Azure or EC2). 

More Info:

If you’re at all interested in keeping up with these subjects you can also subscribe to www.cdnevangelist.com or www.businessofvideo.com for excellent coverage.

References:

[1] – Tools for Amazon’s CloudFront (CloudBerry S3 Explorer & S3 Fox, a FireFox Plug-in)
http://paulstamatiou.com/how-to-getting-started-with-amazon-cloudfront
http://troytolle.blogspot.com/2009/04/tips-on-using-amazon-cloudfront.html
http://www.labnol.org/internet/setup-content-delivery-network-with-amazon-s3-cloudfront/5446/
http://cloudberrylab.com/
http://www.s3fox.net/

[3] – Simple CDN
http://www.simplecdn.com/savings
http://www.simplecdn.com/solutions

Lightening will syndicate content that you host vs. StormFront charging for $149/GB or $.15/MB.

[4] – Amazon CloudFront vs. CacheFly - Performance Benchmark at Sinopop.net
http://sinopop.net/2008/11/20/quick-review-amazon-cloudfront-vs-cachefly-vs-amazon-s3/
http://74.125.95.132/search?q=cache:IVP0f4NsZKAJ:sinopop.net/2008/11/20/quick-review-amazon-cloudfront-vs-cachefly-vs-amazon-s3/+quick-review-amazon-cloudfront-vs-cachefly-vs-amazon-s3&cd=1&hl=en&ct=clnk&gl=ca

[5] – Cache-Control Headers
http://developer.amazonwebservices.com/connect/thread.jspa?threadID=30969&tstart=60
http://blog.bigcurl.de/2008/11/amazon-s3-save-money-by-setting-cache.html
http://www.redmonk.com/jgovernor/2008/11/28/amazon-cloudfront-simple-caching-and-naming/
http://developer.amazonwebservices.com/connect/thread.jspa?messageID=123129&#123129

[6] – CloudFront vs. SimpleCDN (Community Commentary)
http://news.ycombinator.com/item?id=369366

[7] – CDN Pricing Pressure
http://blog.streamingmedia.com/the_business_of_online_vi/2007/11/pricing-pressur.html

[9] – Q: Where are the edge locations used by Amazon CloudFront?
http://aws.amazon.com/cloudfront/faqs/#Where_are_the_edge_locations_used_by_Amazon_CloudFront

[10] – The Microsoft CDN Case Study
http://cloudpundit.com/2008/10/16/the-microsoft-cdn-study/

[11] – ListBucketResult: reply for a CDN request to null
http://developer.amazonwebservices.com/connect/thread.jspa?messageID=108789&tstart=0

[12] – Cloud Files (Rackspace Cloud)
http://stupidsucks.com/2009/05/13/amazon-cloudfront-s3-vs-mosso-cloud-files/
http://techhui.ning.com/profiles/blogs/cloudbased-content-delivery
http://blog.mosso.com/2009/02/a-quantitative-comparison-of-rackspace-and-amazon-cloud-storage-solutions/

[13] – Cloud Files vs. Cloud Front (Performance Reports from Pingdom)
http://www.pingdom.com/reports/nju8qlu8micn/

[14] – Microsoft announces Azure pricing, details
http://news.cnet.com/8301-13860_3-10285904-56.html

Comments:

Rupert said...

Looking at the same question myself. One issue I have noticed: with ASP.NET controls such as the MenuControl, different content can be served up to different browsers. Also mobile content might be quite different, although sometimes this is done using a redirect.

In your static HTML model, are you relying on the pages being browser-neutral? This would be a nightmare for IE6 compatibility, but most modern browsers are fairly similar.
