Wednesday, September 02, 2009

 

Engineering large digital infrastructures is not trivial

Gmail was down for a while. Google describes how it happened.

In essence, a mechanism designed to throttle load on heavily used parts of the infrastructure reduced total capacity. If demand is not reduced at the same time, this leads to congestion, similar to what happens in a traffic jam.
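To make the traffic jam concrete, here is a toy model of the effect. It is only a sketch under my own assumptions, not a description of Google's routing: the function name, the even spreading of demand, and all the numbers are made up for illustration.

```python
def simulate_cascade(num_routers, capacity_per_router, total_demand, taken_offline):
    """Return how many routers are still in service once the system settles.

    Assumptions (hypothetical): demand stays constant, it is spread evenly
    over the routers that are still live, and an overloaded router drops out
    and pushes its share onto the remaining ones.
    """
    live = num_routers - taken_offline
    while live > 0:
        load_per_router = total_demand / live
        if load_per_router <= capacity_per_router:
            return live  # the remaining routers can carry the load
        live -= 1  # one overloaded router drops out, the rest get more load
    return 0  # every router ended up overloaded


# 10 routers of 100 units each, 850 units of demand.
print(simulate_cascade(10, 100, 850, taken_offline=1))  # 9
print(simulate_cascade(10, 100, 850, taken_offline=2))  # 0
```

With one router out there is still headroom. With two out, each remaining router is slightly overloaded, and because demand never drops, every further failure makes things worse until nothing is left.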

One way for Gmail to reduce demand is to signal to the web browser that it should poll the Gmail servers for new mail less frequently (I do not know if they already do this).
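A minimal sketch of what such a client-side rule could look like. I do not know how the Gmail client actually behaves, so everything here is hypothetical: the function name, the idea of a server-supplied hint in seconds, and the default intervals.

```python
def next_poll_interval(current, server_hint=None, overloaded=False,
                       base=30.0, maximum=960.0):
    """Pick the number of seconds to wait before polling again.

    current     -- the interval used for the previous poll
    server_hint -- hypothetical value sent by the server asking for a
                   specific interval; it wins when present
    overloaded  -- True if the last poll failed or the server signalled load
    """
    if server_hint is not None:
        # Follow the server's request, but stay within sane bounds.
        return min(max(server_hint, base), maximum)
    if overloaded:
        # No explicit hint: back off exponentially up to the maximum.
        return min(current * 2, maximum)
    # Healthy response: return to the normal polling rate.
    return base


interval = 30.0
for _ in range(3):
    interval = next_poll_interval(interval, overloaded=True)
print(interval)  # 240.0
```

The point is that the server, which knows how congested it is, gets a say in how often clients come back, rather than every browser hammering away at a fixed rate.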

On a side note: can anyone think of a way to beta test systems of this size?

Comments:
Beta test what? It was an upgrade taking servers out of service. The process is probably well known, well documented and otherwise reliable. There just wasn't enough headroom, and it led to a series of cascading failures, much like the August 2003 blackout in the northeastern United States.

Paul Kedrosky has spoken about these tightly coupled systems and cascading failures before too. Commonly, actually.

It sounds like Google has plans to address that failure mode, and the solutions they outline in the last paragraph sound very reasonable. Over-provisioning sucks but having that headroom is useful.

So I guess I don't get what you mean by beta testing. Can you explain?
 
Admittedly, I think it's a bad idea to perform this kind of maintenance in the middle of the day in North America (which I imagine has more Gmail users than most other regions, on a traffic basis). But hey, who am I? Besides, if they've done it before successfully, and I imagine they do it a lot, it's only a problem when something goes wrong.
 
Yz, you are quick to respond! Where did you pick it up?

Beta testing is of course impossible on this scale. I meant it ironically. I work with a number of projects that seem to take forever to engineer out a risk that you can only see in production. You might as well jump in.
 
Well, I bring it up because it sounds like a process failure, which you can't really beta test for.

(Ok, not entirely true, but a lot more difficult.)

When I read your question, I was thinking about a side conversation I had with Jonathan Heiliger @ GigaOm's Structure 09 conference back in June. I asked Jonathan about how they test new features, and he basically said you have to test them live. You can't generate enough traffic in test environments to simulate production. How they (Facebook) do it is dark launching, letting the feature code run on pages w/o the associated UI code, to see if it can handle the stress of being loaded (even though users never see the feature).
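Roughly, as I understood the idea (this is my own sketch, not Facebook's code; the function names, the percentage knob and the metrics counter are all made up):

```python
import hashlib
from collections import Counter


def in_dark_launch(user_id, percent):
    """Put each user in a stable bucket 0-99 and compare it to the rollout percentage."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


def render_page(user_id, percent, new_feature, metrics):
    """Render the normal page; exercise the new backend code without showing it."""
    page = f"inbox for {user_id}"
    if in_dark_launch(user_id, percent):
        try:
            new_feature(user_id)  # result is thrown away, there is no UI for it yet
            metrics["dark_ok"] += 1
        except Exception:
            # A failure in the hidden feature must never break the real page.
            metrics["dark_failed"] += 1
    return page


metrics = Counter()
for n in range(1000):
    render_page(f"user{n}", 100, lambda uid: len(uid), metrics)
print(metrics["dark_ok"])  # 1000
```

You would start with a small percentage and raise it while watching the counters and the load on the backend, so the feature sees real production traffic before anyone can see the feature.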

Again, based on the Google blog entry, it sounded like an honest mistake. A small bit of over-provisioning, some recalculations on the number of servers, and a few software releases for the routers, and they'll have new problems to fix. Which is what you want at scale. Fix the problems you find, because as you grow, there will always be new ones.
 
Oh, and I saw it linked back from the Google blog post. Remember who owns Blogger...
 