Schill suggested I go to this one as well. Again, mostly because Schill knows the guy, but given that this is a perennial problem for us as well, it couldn’t hurt to see how Flickr handles it.
Presenter: John Allspaw, Flickr
Traditional capacity planning: queuing theory
Some sites (e.g. Flickr) put out updates 20 times a day; timelines are much, much shorter - Release early, release often (fail early, fail often)
Why capacity planning is important - Hardware costs
Network and data storage cost money
Cloudware costs $$$ too
Too little is bad, too much is waste
Normal growth is planned, expected, projected, hoped for
Instantaneous growth is unexpected: spikes, external events, the Digg effect - Can slam and/or destroy your performance
Instantaneous coping - Disable heavier features on the site (Flickr builds features with config files for quick disabling)
Aggressive caching or serve stale data
Bake dynamic pages into static ones
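To make the "config files for quick disabling" idea concrete, here is a minimal sketch of a feature-flag check. The flag names, the JSON format, and the helper functions are all assumptions for illustration; the talk did not show Flickr's actual configuration.

```python
import json

# Hypothetical flag names; the talk did not list Flickr's actual config keys.
DEFAULT_FLAGS = {"search": True, "recent_activity": True, "photo_stats": True}


def load_flags(config_text, defaults=DEFAULT_FLAGS):
    """Merge flags from a JSON config string over the defaults."""
    flags = dict(defaults)
    flags.update(json.loads(config_text))
    return flags


def render_page(flags):
    """Return the page sections to render, skipping disabled features."""
    sections = ["photos"]  # core content is always served
    for name in ("search", "recent_activity", "photo_stats"):
        if flags.get(name, False):
            sections.append(name)
    return sections


# During a spike, an operator edits the config to switch off heavy features.
spike_config = '{"recent_activity": false, "photo_stats": false}'
print(render_page(load_flags(spike_config)))
```

The point of the design is that shedding load is a config change rather than a code deploy, so it can happen in the minutes a spike gives you.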
Capacity != performance - Making something last doesn’t make it fast
Tuning is good, just don’t count on it
Accept what performance you have, not what you want
Good capacity measurement tools - Measure and record any number you give them over time (metric collection tools; aka trending)
Easily compare metrics to any other metrics
Import/export
Examples: cacti.net, munin.projects.linpro.no, ganglia.info, hyperic.com
Flickr uses ganglia
Related questions - How much can a server handle before it dies?
How many can we lose before we’re screwed?
How quickly can we get another server?
Need to relate the network/CPU performance to your application performance - Only real way to establish how much a given server can handle, and how many servers you might need
Benchmarking is a bit of a red herring, but can be used if you’ve no other choice
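The questions above (how much one server can handle, how many we can lose) reduce to simple arithmetic once a per-server ceiling has been observed in production. A rough sketch follows; the function names and every number are made up for illustration and are not figures from the talk.

```python
import math


def servers_needed(peak_load, per_server_ceiling, tolerated_failures=1):
    """Servers required to carry peak_load, plus spares for failures.

    per_server_ceiling is the load one server handled in production
    before it degraded (same unit as peak_load, e.g. requests/sec).
    """
    base = math.ceil(peak_load / per_server_ceiling)
    return base + tolerated_failures


def spare_servers(current_servers, peak_load, per_server_ceiling):
    """How many servers can be lost before peak load no longer fits."""
    return current_servers - math.ceil(peak_load / per_server_ceiling)


# Hypothetical figures: 12,000 req/s at peak, 900 req/s per server.
print(servers_needed(12000, 900, tolerated_failures=2))
print(spare_servers(20, 12000, 900))
```

The ceiling is the one input that cannot come from a napkin: it has to be measured against real application traffic, which is why benchmarking is only a fallback.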
“Diagonal” scaling: vertical scaling (big, powerful) + horizontal scaling (lots of the same thing)
Flickr went from old, slower machines to new, faster machines: fewer of them, each doing more
Use Common Sense (TM) - pay attention to the right metrics (many of them are irrelevant or misleading, and the obvious one might not be the one that shows where/how a server died)
Review graphs constantly (weekly, hourly, seasonally)
Complex simulation/modelling rarely worth the time and effort - Better to put it into production and see what happens
“I’ve got a stack of napkins and a pen, and I’m not afraid to use ’em!”
Tuning and tweaking will never gain you excess capacity
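As an illustration of the napkin-and-pen approach to reviewing graphs, the sketch below fits a straight line to a recorded metric and estimates how many more periods remain until it crosses a known ceiling. The linear model, the function names, and the sample data are assumptions of mine, not something shown in the talk.

```python
def linear_fit(ys):
    """Least-squares slope and intercept for ys sampled at x = 0, 1, 2, ..."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def periods_until(ys, ceiling):
    """Periods after the last sample until the trend line reaches ceiling.

    Returns None when the metric is flat or shrinking.
    """
    slope, intercept = linear_fit(ys)
    if slope <= 0:
        return None
    crossing = (ceiling - intercept) / slope
    return max(0.0, crossing - (len(ys) - 1))


# Hypothetical weekly disk usage in GB, with a 200 GB ceiling.
weekly_usage = [100, 110, 120, 130]
print(periods_until(weekly_usage, ceiling=200))
```

A straight line only covers the planned, expected growth; it says nothing about spikes, which is what the quick-disable switches are for.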