Twitter’s geolocation guru Raffi Krikorian recently gave an interview to O’Reilly ahead of next month’s Where 2.0 conference in San Jose. The interview is obviously heavy on how Twitter is working with geo, but the last question, or more specifically the last answer, struck a chord: it goes straight to the problem of handling the massive amounts of real-time data that a lot of web players face today.
James Turner: What do you see as the technical side of geolocation, in terms of what’s going to be the new interesting technologies coming along, and how they’re going to be used?
Raffi Krikorian: From Twitter’s standpoint, it’s how do you accept all of this real-time data, index and analyze it and spread it throughout our system in almost real-time. People have traditionally built a bunch of GIS-like systems on top of PostgreSQL or on top of MySQL, and that’s fine, but it doesn’t scale after a while. After you throw a couple million or a couple hundred million entries at it, the amount it takes for one of those databases to process that, to insert it, all I have to do is select against it, and you can understand it’s untenable for real-time operation. And by real-time, I mean sub-second operation.
So the stuff that we’re doing is more geared towards how can you accept tweets that are coming in at what you can imagine to be an incredibly fast rate. Tweets are coming in, figure out their location, attach appropriate metadata to it. Store it in our database. Span it out to anyone who wants to look at it. Run research and analytics on it and index it in their search index, and do this all within a couple of seconds on the way through the system. I think there’s a lot of interesting stuff being done out there on how things are being stored, how things are being indexed. But I think our personal contribution will be how do you do it at that kind of speed?
Of course, so far Twitter has been doing an admirable job (Fail Whales excluded) of providing uptime on a service with (forget a couple million) 9 billion rows in its statuses table — a table that’s growing by 50 million rows per day.
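To put that growth figure in perspective, here is a quick back-of-the-envelope calculation. It is only a sketch: the two input numbers come from the figures above, while the function names and the assumption that inserts are spread evenly across the day are mine. Real traffic is bursty, so peak rates would be higher than this average.

```python
# Figures quoted in the post; the even-spread assumption is illustrative only.
ROWS_PER_DAY = 50_000_000       # reported daily growth of the statuses table
TOTAL_ROWS = 9_000_000_000      # reported current size of the statuses table
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_rate_per_second(rows_per_day: int) -> float:
    """Average inserts per second if the load were spread evenly over a day."""
    return rows_per_day / SECONDS_PER_DAY


def days_to_add(total_rows: int, rows_per_day: int) -> float:
    """Days needed to add `total_rows` more rows at a constant daily rate."""
    return total_rows / rows_per_day


if __name__ == "__main__":
    rate = average_rate_per_second(ROWS_PER_DAY)
    print(f"Average insert rate: {rate:.0f} rows per second")
    days = days_to_add(TOTAL_ROWS, ROWS_PER_DAY)
    print(f"Days to add another 9 billion rows: {days:.0f}")
```

The average comes out just under 600 rows per second, and at a constant rate the table would double in size in about six months.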
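Fifty million rows a day averages out to roughly 600 tweets per second, and that load is likely what is pushing Twitter away from its MySQL cluster and toward Cassandra, the distributed database originally developed at rival Facebook. On Tuesday, Twitter’s Ryan King explained why Twitter is turning to Cassandra: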
We have a system in place based on shared mysql + memcache but its quickly becoming prohibitively costly (in terms of manpower) to operate. We need a system that can grow in a more automated fashion and be highly available.
My day job involves a lot of media monitoring with products such as Radian 6. Speaking last week with one of R6’s competitors, I was told that no one in the media monitoring space can really do real-time monitoring: there’s just too much data. I think that’s overstating the challenge a little bit, but it will surely be some time before a company can say it’s “real-time” without employing an army of engineers.