Anders Brownworth

Technology and Disruption

Distributed Data - Hadoop, Hbase and Hive

At work, we run across quite a bit of data. For example, just keeping a record of each phone call that goes across our network creates gigabytes of data on a daily basis. Summarizing all of this data by throwing it into a relational database like PostgreSQL or MySQL requires ever more expensive computers with scads of RAM and disk space. Eventually you start resorting to tricks like using a new table for each day's worth of data so your working set doesn't become unrealistically large. This works well for applications that only need to see one day's data, like a pricing engine, but requires a view or some other overarching mechanism for summarizing multi-day data. And then as data becomes old, it needs to be migrated off to other servers because there is never enough disk space, complicating large date-range queries even more. Your only option is to buy an even beefier computer and keep the data less fragmented.

This is simply unsustainable when your data gets into the terabyte and petabyte range. So rather than stay in this vicious cycle, I decided to take a look at the distributed approach. Rather than build a monolithic system, we should distribute the problem across a large number of cheap computers. If we spread the data across many nodes, we can ask all of them for the data that matches a specific query, and only the nodes that have it will reply. Of course this takes more compute cycles overall, but because they run in parallel, the cost in time is far lower than it would be on a monolithic setup. And because every node has the smarts to process data, we can process the data right where it resides rather than migrating it to a central processing node.
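That scatter-gather idea can be sketched in a few lines. This is an illustrative Python toy, not Hadoop code: each "node" holds a shard of records and answers a query only with its own matches, with all shards scanning in parallel.

```python
# Toy scatter-gather query: each node scans only its local shard, in
# parallel, and the client merges whatever the nodes send back.
from concurrent.futures import ThreadPoolExecutor

class Node:
    def __init__(self, records):
        self.records = records  # this node's local shard of the data

    def query(self, predicate):
        # Work happens where the data lives; only matches cross the network.
        return [r for r in self.records if predicate(r)]

def scatter_gather(nodes, predicate):
    # Fan the query out to every node at once, then flatten the replies.
    with ThreadPoolExecutor() as pool:
        partials = pool.map(lambda n: n.query(predicate), nodes)
    return [r for partial in partials for r in partial]

# Three "nodes", each holding a shard of (call id, duration in seconds)
nodes = [Node([("call1", 30), ("call2", 610)]),
         Node([("call3", 45)]),
         Node([("call4", 700), ("call5", 12)])]

# Find all calls longer than ten minutes, cluster-wide
long_calls = scatter_gather(nodes, lambda r: r[1] > 600)
```

Each node does a fraction of the scan, which is exactly why adding nodes keeps query time flat as the dataset grows.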

Where we would be buying the most expensive computer we could reasonably afford in the monolithic paradigm, those last few chunks of performance probably aren't worth it in a distributed context. Rather, we want to purchase commodity machines where we get a reasonably fast computer for a reasonably cheap price. We don't want to pay the premium for the latest hardware when we can get more performance by buying more lower-cost machines. At the same time, however, we don't want something underpowered for the amount of electricity each machine will consume. After all, power is the single biggest constraint in a server farm.

Now of course you have to worry about many more disks, power supplies and network connections that could fail. But rather than try to keep everything going all the time, let's just assume that failure is expected and design the system to handle it. We'll copy data to more than one node so if we lose a few of them, we'll still have access to all of our data. As nodes die, we'll need to redistribute their data onto other nodes, so we'll need some sort of balancing process to take care of that, but that shouldn't get in the way of the system as a whole functioning. Being forced to solve the redundancy question right up front is a really great side perk of this design as well!

As you read this, you're probably thinking to yourself, "Google"! This is essentially the architecture of their network, and I would say they have it right. As it happens, Google released whitepapers on their distributed filesystem called GFS and a parallel computing architecture called MapReduce. Since then, a number of open-source projects have started that implement these ideas. Of note, though, is the Apache project Hadoop which implements a distributed filesystem and the MapReduce framework in Java. So I decided to build a cluster of 10 nodes and try it out.
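The MapReduce model itself is easy to sketch. The following is an illustrative Python toy, not the Hadoop API: a mapper emits key/value pairs, the pairs are grouped by key (the "shuffle"), and a reducer folds each group into a result.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    # Shuffle phase: group every mapped (key, value) pair by key.
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Reduce phase: fold each group down to a single result.
    return {key: reducer(key, values) for key, values in groups.items()}

# Example: total call minutes per destination area code.
calls = [("919", 5), ("617", 12), ("919", 3)]
totals = map_reduce(calls,
                    mapper=lambda call: [(call[0], call[1])],
                    reducer=lambda code, minutes: sum(minutes))
```

In Hadoop the map and reduce phases run on many machines, with the framework handling the shuffle over the network, but the programming model is exactly this small.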

I decided to start out with fairly weak machines, knowing that we would build up in production. I began with a cluster of 10 machines, each sporting a single 2.8GHz Pentium 4 chip (not even HyperThreaded!) and 1.25 gigs of RAM with an 80 gig IDE hard drive and gigabit networking. With Linux and Hadoop installed I had about 67 gigs of free space for the distributed filesystem and just under a gig of free RAM. In production, the clustered machines are roughly ten times as powerful as these test boxes, so I could count on generally "much better"(tm) results when we reached production.

With a running system, I could copy the data into the distributed filesystem and write MapReduce jobs to analyze it directly, but as it is fairly structured data, I decided to use the Hbase Apache project as well. I wrote one program to create a table in Hbase called "cdrs" and another to import some data into that table. As mentioned before, the commodity machines I used were very basic, but I was conservatively able to insert about 500 records per second with this setup. I also kept blowing the circuit breaker at the office, forcing me to spread the machines across several power circuits, but that at least proved the system was fault tolerant!

Inserting the 10 millionth row was just as quick as inserting the first. Of course, indexes are not being maintained across the data, so there is no climbing cost to adding data. In our case, however, the cluster machines are mostly CPU bound during the insert process. If we had at least dual-core chips in there, I'd guess we would turn the corner to being IO bound instead.

Facebook is a company in a similar (although much more drastic) position with regard to data. They were early adopters of Hadoop as well, but rather than using Hbase to store structured data in the distributed filesystem, they created their own data warehouse tool called Hive, which has recently been open-sourced as an Apache project. Hive looks quite a bit more like a traditional relational database because it allows ad-hoc queries across any column via a SQL-like language. While Hbase has "HQL", which is also similar to SQL, its "where" clause can only look at row keys. In my example above, that means we could only restrict queries by time. We would use a MapReduce routine on the output of the Hbase query to further constrain our rows. If the query can't be restricted by time (for example: show me all calls from this company across all of time), the MapReduce instances would end up reading every key on the system. While not entirely bad, because the task is distributed, this isn't as flexible as being able to do ad-hoc queries across other "columns", and we don't want to have to keep multiple copies of the same data in the filesystem. Hive allows ad-hoc queries, so it is probably a better match for tasks that don't constrain the data by time.
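To see why the row-key restriction matters, here is a toy Python model (not the real Hbase client) of a table kept sorted by row key: restricting by the key is a cheap range scan, while restricting by any other "column" forces a read of every row.

```python
import bisect

# Toy model of an Hbase-style table: rows are stored sorted by row key
# (here, a timestamp), so a "where" on the key is a cheap range scan.
class KeyedTable:
    def __init__(self):
        self.keys, self.rows = [], []

    def put(self, key, row):
        i = bisect.bisect_left(self.keys, key)
        self.keys.insert(i, key)
        self.rows.insert(i, row)

    def scan_range(self, start, stop):
        # Restricting by row key touches only the matching slice.
        i = bisect.bisect_left(self.keys, start)
        j = bisect.bisect_left(self.keys, stop)
        return self.rows[i:j]

    def full_scan_filter(self, predicate):
        # Restricting by any other field reads every row in the table.
        return [r for r in self.rows if predicate(r)]

table = KeyedTable()
table.put(100, {"time": 100, "company": "acme"})
table.put(200, {"time": 200, "company": "globex"})
table.put(300, {"time": 300, "company": "acme"})

by_time = table.scan_range(100, 250)                              # range scan
acme = table.full_scan_filter(lambda r: r["company"] == "acme")   # reads all rows
```

The company names and times here are made up; the point is only that a query like "all of acme's calls, ever" has no choice but the full scan, which is exactly the case where Hive's ad-hoc columns win.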

Ideally, we can query all of our data live and never have to "retire" old information out of the database. A distributed filesystem allows the resource to grow as needed without becoming unrealistically expensive. As nodes go bad, we visit the farm once per month to collect them and replace them with a new node that joins the cluster and picks up where the dead node left off. The maximum filesystem size just shrinks and grows with time as nodes die and new nodes come online. Compare this to paying an exponentially growing pricetag on a monolithic setup and you will see how eventually there is no other choice but to distribute your way out of a growing dataset problem!

Clearly monolithic systems do take some aspects of distribution into account. Multiple CPUs and CPU cores, as well as multiple drives in a drive array, are clear examples of distributing your way out of a bottleneck. RAID is interesting because it starts to seriously treat failure as the norm and not the exception. What is different here is that we are distributing the entire system rather than just components of it. As we scale into the thousands of nodes, failure is obviously expected: the more nodes we have, the more likely we are to see one fail. The system bus is now the network in the distributed case, making the network that much more critical. Rather than having "dumb" component parts with purpose-built hardware to accomplish distribution, we get a more intelligent distribution. In short, we distribute computation, even special-purpose queries, and that makes what we are doing here significantly different.

Of course you have to look at the problem from a total cost of ownership standpoint. There is a point where a distributed system starts to make sense, but your dataset may not be big enough to justify it. Generally, when you can't fit all of your indexes in RAM, you're probably there, though you could be bound by CPU processing power well before that point. When the cost of a monolithic machine approaches 10 or more "commodity" machines, you should probably at least have spent some time researching distributed systems. If your dataset is growing beyond that, you probably want to make the move. In our case, a loaded Dell blade costs about $10,000 including the slice of the chassis where it is installed. That figure doesn't count the slice of the NetApp filesystem it uses, which depends heavily on how much space we take up on the NetApp. Our commodity machines are priced around $1200 including 1.5T of filesystem for each node. Ignoring how notoriously expensive NetApps are, we probably easily beat the "hefty" blade architecture. The remaining question is how much power each solution consumes. Clearly the blade architecture is going to be more efficient, so there is a price you pay in power for a distributed cluster. I'm still looking at how significant a difference this is; that will likely be the subject of future posts.
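The back-of-the-envelope math on the figures quoted above is easy to run (this deliberately leaves the NetApp and power costs out, since both vary):

```python
blade_cost = 10_000          # one loaded Dell blade, excluding NetApp storage
commodity_cost = 1_200       # one commodity node
commodity_storage_tb = 1.5   # filesystem shipped with each commodity node

# Dollars per terabyte on the commodity side (compute included)
commodity_per_tb = commodity_cost / commodity_storage_tb   # $800/TB

# How many commodity nodes one blade's price buys, before any storage cost
nodes_per_blade = blade_cost // commodity_cost             # 8 nodes
```

So one blade's price buys eight storage-included commodity nodes, which is why the comparison tilts so heavily before the NetApp is even priced in.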

There are other interesting distributed data warehousing applications that we are looking at. Instead of using Hbase to store structured data, we could just write the data directly to the distributed filesystem and analyze it with Pig, although this doesn't offer a solution for mutating the data (adding rating information to calls, for example). For our Ruby on Rails projects needing MapReduce functionality, Skynet is very interesting: you can "gem install" it, and there is no single dedicated name node or job tracker like there is with Hadoop. Other solutions exist, such as clustering MySQL or PostgreSQL and fanning queries across them, but all of these technologies are very young at this point and the future is far from clear. All I know from what I've seen is that you have to be distributing your way out of problems these days rather than hoping monolithic hardware remains cheap enough for your problem!

Chime in with your experiences with distributed data warehousing by leaving a comment.

Comments (8)

Frank from


Great writeup. Unfortunately, I can't chime in with an actual distributed data warehousing project of my own, but I can say this concept has been rolling around in my head for some time. So it's nice to read when someone actually tackles things like this with a practical case.

As someone who's a bit of a "jack of all trades" in computers, whose job currently is networking but who was a database person in a former life, I'm always looking at the right tools for the job. And with various projects, the notion of dataset growth over time--from desktop apps like calendaring or GTD apps to higher end solutions like university course management systems with ever-growing collections of course materials--and how to deal with that has come up again and again.

I mean, the typical, old-school solution is some kind of database-centric solution, where the dataset continues to grow, eventually bogging down the DBMS just doing inserts/updates/indexing. To counter this slowdown, typically some kind of maintenance/pruning process is run, either deleting outdated records or archiving them to some other database. But then over time that archive continues to grow and IT now bogs down the solution.

One simple example even on the desktop is a GTD app like OmniFocus for Mac OS X. Over time the database of actions grows, slowing down how fast the app starts/etc. So recently an update came out that added the ability to archive older records so the app would stay nimble. Sounds reasonable enough. If the user needs to search for an old action, if that query takes a little longer no big deal.

But invariably you'll be archiving again. As the app designer, do you add new archived data to old (as in keep a single, large archive, eventually suffering the same slowdown since the archived data will continue to grow just as the primary did before), or do you store multiple, separate archives? And if the latter, how does that impact the user's ability to search for information? Clearly as a programmer it makes things a little harder.

That's just a simple, desktop app, but the concept remains. So what about larger solutions?

One example (though admittedly not a good one) if I may. I work for a higher education/state gov't agency in a rural state. About 6 years ago we purchased via a grant what today is termed an SIM (Security Information Manager) or SEIM (Security Event Information Management). The product, from a company called Network Intelligence (now owned by RSA who, in turn, is owned by EMC), was called enVision, and it was basically a "black box" appliance solution built on an HP/Compaq ProLiant server running Windows 2000. It was, for lack of a better explanation, a syslog server on steroids.

Six (6) 73GB UltraSCSI HDs configured in a typical RAID5 config provided ~300GB of redundant storage (keep in mind how long ago this was). The point of such a box was to suck in syslog messages from your various equipment, parse it, slice it, dice it, and put it into a database, from which you could then query for various information. The power of the enVision box was that it provided the end user a pretty Web UI from which to perform all tasks, and it knew how to deal with syslogs from a variety of types of equipment (routers, servers, firewalls, you name it).

Now we tapped maybe 1% of what this product could do. Our only interest was in collecting syslogs from our Cisco PIX firewalls which protected a private, Class A state gov't network. And truthfully, all we cared about were the 'build' and 'teardown' messages which let us know who from inside the network was connecting via which translation IP address to some external IP and port. Reason? If we were contacted about some abuse which originated from our firewalls, we needed to trace it back inside to the source.

Not having the budget of larger states/corporations, and at a cost of ~$50K, this was not a cheap solution. But it served its purpose... to an extent. With as restricted a setup as we had, this still overwhelmed this server over time with ~3900 EPS (Events/messages Per Second), our licensing limit.

Thus, our usage became the traditional approach. Due to the rate of data collection, we eventually had to rotate the logs at 30 days as there simply wasn't enough storage for more, and queries often had to be done in very small timeframes or the server would never come back with an answer. While for the purpose for which this box was intended (forensics) this worked fine, it did necessarily restrict its use. There were times when state/federal law enforcement agencies might contact us about some ongoing investigation and request information, but due to the nature of such investigations and how long it takes for things to be done, by the time they asked us, the relevant data had long been deleted. (Luckily we do not fall under SOX or HIPAA regs. as those would make our existence impossible.)

Now I mention all this because that particular box was End-Of-Lifed on us with little warning a few years ago, and the replacement configuration proposed would cost $150K. Note the original purchase was done via a grant (i.e., we don't have the budget for such an expense), so the tripling of costs (due I believe mostly to the fact the company was bought by ever-larger corporations not exactly known for cost-effective products) put this out of our reach. So we ran the box "naked" until it died several months ago. Without maintenance, we no longer had access to the software, let alone license keys, to rebuild the system, so we had to look at alternatives. Our current solution is what one might expect in such situations: an extremely low-cost Linux config running syslog-ng and at least collecting the logs. As for having all the nice features like having the syslogs parsed and stored in a database or the pretty Web UI for searching... well, grep will have to do for now.

While this particular use doesn't fit directly with your ever-growing dataset scenario (since such syslogs are a bit pointless beyond a reasonable timeframe), the problems of cost, lack of redundancy, storage growth, etc., all still play a part. The general notion of a distributed, "hive" solution is extremely appealing, even in cases where data warehousing isn't necessarily the case. That is, you may not need perpetual storage of ever-growing data but you do want to be able to grow/shrink your computing/storage/redundancy by simply adding/removing nodes in some highly automated fashion. Solutions like

* the IPAM(IP Address Management)/DNS/DHCP boxes by Infoblox which use a "grid" model as they call it, or
* the distributed Mnesia database that is part of Erlang, or
* Beowulf clusters in Linux,
* Xgrid in Mac OS X Server, or
* the various distributed computing solutions like the SETI@home project, etc.

All these, in some fashion, take solutions beyond the single point of failure model.

I have long wondered if/when we'll see this approach taken in the design of all kinds of software. I just imagine an SEIM solution where you download an ISO and build a node similar to doing something like a Trixbox install. If you want redundancy, you simply grab another PC, setup another node, and point it at the first to get configuration/data in some highly automated fashion. If either node fails, all you need do is light up a new one and hook it into the cluster. Need more performance/storage/etc.? Just add more nodes.

Even VoIP solutions like Cisco's CUCM or Asterisk/FreeSWITCH solutions could benefit from this. Currently the CUCM allows for clustering, but its setup is non-trivial. And in the open-source telephony world, imagine the possibilities if even the basic bundled solutions (e.g., Trixbox, Elastix, etc.) offered this.

The problem, just as with data warehousing, is how to minimize the manual configuration element. In your scenario, how is it determined what data lives on which nodes? In the case of a VoIP PBX, how do the nodes determine which node handles which incoming connections? And in cases like a web farm with some kind of round-robin dispatch at the front end, how not to have a single point of failure in the system is still always going to be an issue.

Ok, I'm rambling a bit now. But your post got me thinking once again how incredible it would be if more solutions had redundancy and an automated, distributed model built in from the get-go.

Anders from RTP

Frank: Having distribution in all apps from the get-go would be ideal, no argument on that point! To address some of the situations you mention at the end, with FreeSWITCH, there are two ways to do this. One is a media-stateful migration typically using OpenVZ and the other is to front a FreeSWITCH cluster with an OpenSIPs (OpenSER) "SIP distribution" cluster. You would distribute the traffic to the various nodes, and if you lost a FreeSWITCH instance, you would basically lose all calls on it. Then some external process would figure out that FreeSWITCH had died on that node and update the database that OpenSIPs uses for distribution. At the moment, we do this manually, but we should get it automated shortly.

The task of figuring out which node handles which connection in the above example is actually quite simple. OpenSIPs doesn't even have to remember any state. As Call-IDs are unique to a particular call and therefore a particular SIP server, you can just assign some arbitrary numbers, say 1-100, to each of your SIP servers. Then OpenSIPs hashes the Call-ID and takes a modulus against 100 to see which server to send the traffic to. As servers die, other servers take over the numbers the dead servers used, thereby "redistributing" the job. Of course, unless you somehow migrate the call (not currently possible) you will still lose all in-progress calls, but the whole thing works quite well.
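In code, the scheme looks something like this (an illustrative Python sketch, not our actual OpenSIPs config; the server names and hash choice are made up): hash the Call-ID, take it modulo 100, and look up which live server currently owns that bucket.

```python
import hashlib

NUM_BUCKETS = 100

def bucket(call_id):
    # Hash the Call-ID and take it modulo the bucket count.
    digest = hashlib.md5(call_id.encode()).hexdigest()
    return int(digest, 16) % NUM_BUCKETS

class Dispatcher:
    def __init__(self, servers):
        self.assign(servers)

    def assign(self, servers):
        # Deal the 100 buckets out round-robin across the live servers.
        self.servers = list(servers)
        self.bucket_owner = {b: self.servers[b % len(self.servers)]
                             for b in range(NUM_BUCKETS)}

    def route(self, call_id):
        # Stateless: the same Call-ID always lands on the same server.
        return self.bucket_owner[bucket(call_id)]

    def server_died(self, dead):
        # Survivors take over the dead server's buckets.
        self.assign([s for s in self.servers if s != dead])

d = Dispatcher(["fs1", "fs2", "fs3"])
owner_before = d.route("abc123@example.com")
d.server_died("fs2")
owner_after = d.route("abc123@example.com")  # now always a survivor
```

Note that routing needs no per-call state at all: any dispatcher instance with the same server list makes the same decision, which is what lets the front end itself be replicated.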

This likely doesn't make much sense unless you are very into FreeSWITCH and OpenSIPs, so maybe it's worth another more significant blog post at some point.

Doug from Boston

very interesting read which prompted several more tabs to open. thanks for sharing!

Adrian from Oradea / BH / Romania

I am trying to build something similar but using native C code instead of Java Hadoop and Hbase. The alternatives I found are Cloudstore (formerly kosmosfs) and Hypertable. They work very well, but due to no knowledge of Java I was never able to set up Hadoop/Hbase to compare performance. I was just curious if you ever met Cloudstore and Hypertable and know anything about their performance. Thanks.

Anders from RTP

Adrian: I haven't tested Cloudstore but Hypertable is very similar to Hbase in terms of architecture. Last I checked though it didn't implement a distributed filesystem. In terms of speed, I don't have any direct measurements. I'd be happy to work with you on that though.

Anders from NY

There is a new Hadoop / PostgreSQL hybrid project out of Yale that might also be interesting.

Tahreem from PK

Hey, I want to ask: how did you distribute Hbase to all the slave machines?

Anders from Cambridge, MA

Tahreem: Initially I rsynced hbase to all slaves and started it on each of them.
