Web 2.0 Expo: Capacity Planning for Web Operations

Schill suggested I go to this one as well. Again, mostly because Schill knows the guy, but given that this is a perennial problem for us as well, it couldn’t hurt to see how Flickr handles it.
Presenter: John Allspaw, Flickr

  • Traditional capacity planning: queuing theory
  • Some sites (e.g. Flickr) put out updates 20 times a day; timelines are much, much shorter
    • Release early, release often (fail early, fail often)
  • Why capacity planning is important
    • Hardware costs
    • Network and data storage costs money
    • Cloudware costs $$$ too
    • Too little is bad, too much is waste
  • Normal growth is planned, expected, projected, hoped for
  • Instantaneous growth is unexpected: spikes, external events, the Digg effect
    • Slam and/or destroy your performance
  • Instantaneous coping
    • Disable heavier features on the site (Flickr builds features with config files for quick disabling; see the sketch after this list)
    • Aggressive caching or serve stale data
    • Bake dynamic pages into static ones
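
A minimal sketch of the config-flag idea (not Flickr’s actual implementation; the file path and flag names are made up): heavyweight features consult a small config file that operators can flip and reload without a code deploy.

    # Toy feature-flag check driven by a config file (hypothetical path and
    # flag names). Flipping a flag to false sheds that feature's load without
    # a deploy; a missing or broken file fails safe to "everything off".
    import json

    FLAG_FILE = "/etc/myapp/feature_flags.json"

    def load_flags(path=FLAG_FILE):
        try:
            with open(path) as f:
                return json.load(f)
        except (OSError, ValueError):
            return {}

    def feature_enabled(name, default=False):
        return load_flags().get(name, default)

    def render_related(photo_id):
        return "<div>related photos for %s</div>" % photo_id

    def render_photo_page(photo_id):
        page = ["<h1>Photo %s</h1>" % photo_id]
        if feature_enabled("related_photos"):  # expensive query; first thing to shed
            page.append(render_related(photo_id))
        return "\n".join(page)
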
  • Capacity != performance
    • Making something fast doesn’t make it last
    • Tuning is good, just don’t count on it
    • Accept what performance you have, not what you want
  • Good capacity measurement tools
    • Measure and record any number you give it over time (metric collection tools, aka trending; see the toy sketch after this list)
    • Easily compare metrics to any other metrics
    • Import/export
    • Examples: cacti.net, munin.projects.linpro.no, ganglia.info, hyperic.com
    • Flickr uses ganglia
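
Those tools do the heavy lifting, but the core idea (“record any number over time”) is simple enough to sketch: this toy collector samples the 1-minute load average on an interval and appends timestamped rows to a CSV. The interval, sample count, and file name are arbitrary.

    # Toy trending collector: sample a metric on a fixed interval and append
    # timestamped rows. Real tools (Ganglia, Cacti, Munin) add RRD-style
    # storage, aggregation, and graphing on top of this kind of time series.
    import csv
    import os
    import time

    def load_1min():
        # 1-minute load average; os.getloadavg() is available on Linux/macOS
        return os.getloadavg()[0]

    def collect(metric_fn, out_path="load_1min.csv", interval_s=60, samples=60):
        with open(out_path, "a", newline="") as f:
            writer = csv.writer(f)
            for _ in range(samples):
                writer.writerow([int(time.time()), metric_fn()])
                f.flush()
                time.sleep(interval_s)

    if __name__ == "__main__":
        collect(load_1min)
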
  • Related questions
    • How much can a server handle before it dies?
    • How many can we lose before we’re screwed?
    • How quickly can we get another server?
  • Need to relate the network/CPU performance to your application performance
    • Only real way to establish how much a given server can handle, and how many servers you might need
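
One concrete way to do that: record requests/sec alongside CPU on the same box, fit the roughly linear region, and extrapolate to the utilization you’re willing to run at. The sample data and the 70% CPU target below are made-up assumptions, not numbers from the talk.

    # Sketch of relating an application metric (requests/sec) to a system
    # metric (CPU %) to estimate a per-server ceiling. The sample data and
    # the 70% CPU target are illustrative assumptions.
    samples = [(15, 120), (30, 255), (45, 390), (60, 510), (75, 640)]

    def per_server_ceiling(samples, cpu_target=70.0):
        """Fit req/sec ~ slope * cpu (least squares through the origin),
        then extrapolate to the CPU level we're comfortable running at."""
        num = sum(cpu * rps for cpu, rps in samples)
        den = sum(cpu * cpu for cpu, rps in samples)
        slope = num / den                  # requests/sec per CPU percent
        return slope * cpu_target

    ceiling = per_server_ceiling(samples)
    print("Estimated ceiling: %.0f req/sec per server at 70%% CPU" % ceiling)
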
  • Benchmarking is a bit of a red herring, but can be used if you’ve no other choice
  • “Diagonal” scaling: vertical scaling (big, powerful) + horizontal scaling (lots of the same thing)
  • Flickr went from old, slower machines to new, faster ones: fewer machines that each did more
  • Use Common Sense (TM)
    • Pay attention to the right metrics: many are irrelevant or misleading, and the obvious one may not be the one that shows where/how a server died
    • Review graphs constantly (weekly, hourly, seasonally)
  • Complex simulation/modelling rarely worth the time and effort
    • Better to put it into production and see what happens
    • “I’ve got a stack of napkins and a pen, and I’m not afraid to use ’em!”
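
In that napkin-math spirit, the “how many servers” question is just arithmetic once you have a measured per-server ceiling. All of the numbers below are invented for illustration.

    # Back-of-the-napkin server count: peak demand divided by usable
    # per-server capacity, plus spares for the "how many can we lose?"
    # question. Every number here is made up for illustration.
    import math

    peak_demand_rps = 3000      # projected peak requests/sec
    ceiling_rps = 600           # measured per-server ceiling (see sketch above)
    headroom = 0.8              # only plan to use 80% of that ceiling
    spares = 2                  # boxes we can afford to lose

    needed = math.ceil(peak_demand_rps / (ceiling_rps * headroom)) + spares
    print("Plan for %d servers" % needed)   # ceil(3000/480) + 2 = 9
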
  • Tuning and tweaking will never gain you excess capacity
