Sun 4 Oct 2009
Yesterday I attended the first-ever Hadoop World, sponsored by Cloudera and held in The Roosevelt Hotel in New York City. I took an early Amtrak train up to the big city and a late train back that same night. The conference was well attended: over 500 big-data heads were there, and the organizers did a fantastic job.
Some of the best stuff was just hearing about how other folks are using Hadoop. I also enjoyed hearing about the sizes of other people’s big-data problems. There were three tracks, so I only heard 1/3 of what took place, but here are some notes on what I did hear after the break.
It was a great day, a long day, glad I went.
Yahoo and Hadoop – Eric14
Yahoo search assist is now built with Hadoop based on 3 years of log data. It is a 20-step map-reduce job and it runs in about 20 minutes on their production cluster. This is down from a 26-day run (did I get that right?) in the pre-Hadoop days. Wow. Y! has several Hadoop clusters and the biggest is 4K nodes, 16 PB disk, 64 TB RAM, 32K cores running Hadoop 0.18.3. Wow. I’m drooling over that.
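To put those numbers in perspective, here is a quick back-of-the-envelope sketch in Python. It uses only the figures quoted above; treating 4K and 32K as round thousands and using decimal units (1 PB = 1,000 TB) are my assumptions, and the speedup only holds if I heard the 26-day figure correctly.

```python
# Figures as quoted in the talk; round thousands and decimal units are assumed.
NODES = 4_000
DISK_TB = 16_000      # 16 PB expressed in TB
RAM_GB = 64_000       # 64 TB expressed in GB
CORES = 32_000

MINUTES_PER_DAY = 24 * 60


def per_node(total, nodes):
    """Split a cluster-wide total evenly across the nodes."""
    return total / nodes


def speedup(before_minutes, after_minutes):
    """How many times faster the new run is than the old one."""
    return before_minutes / after_minutes


print(per_node(DISK_TB, NODES), "TB disk per node")
print(per_node(RAM_GB, NODES), "GB RAM per node")
print(per_node(CORES, NODES), "cores per node")
print(speedup(26 * MINUTES_PER_DAY, 20), "x faster than the pre-Hadoop run")
```

That works out to 4 TB of disk, 16 GB of RAM and 8 cores per node, and a speedup of roughly 1,900x.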
Yahoo! folks are the biggest Hadoop committers and Y! sounds solidly behind the thing just like they have been since its inception. That was great to hear. Just about every zone of the Yahoo! home page has some contribution from a map-reduce job at some point.
Some stuff to keep an eye out for
- Zebra – a column oriented data store on HDFS
- Oozie – a workflow and scheduling piece for map-reduce
- Mumak simulator – need to look that one up
Facebook – Ashish Thusoo
Fb gets 4 TB of compressed data per day with a 6-7x compression factor. That’s a lot of data. They separate production and ad-hoc jobs onto different clusters and use hive-replication to get the data in both places. This is something that is not publicly available yet but they have plans to open-source it when it is ready. Their big cluster is 4800 cores, 5.5 PB disk, 12 TB disk per node. They run 7500 Hive jobs per day, 95% of all jobs are Hive jobs, and this cluster runs 80,000 compute hours per day.
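A rough sanity check on those figures, as a small Python sketch. It assumes "compute hours" means core-hours, which is my reading rather than something stated in the talk, and the function names are just for illustration.

```python
def raw_tb_per_day(compressed_tb, compression_factor):
    """Uncompressed daily volume implied by a compressed size and a ratio."""
    return compressed_tb * compression_factor


def core_utilization(compute_hours_per_day, cores):
    """Fraction of the available core-hours in a day that are actually used."""
    return compute_hours_per_day / (cores * 24)


print(raw_tb_per_day(4, 6), "to", raw_tb_per_day(4, 7), "TB of raw data per day")
print(f"{core_utilization(80_000, 4_800):.0%} of available core-hours in use")
```

So the 4 TB of compressed data corresponds to 24 to 28 TB of raw data per day, and under that core-hours assumption the cluster is busy about 69% of the time.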
Visa – Joe Cunningham
Visa has had Hadoop in their research department for about 9 months and is doing some interesting things with it. They certainly have a big-data problem: 28 million merchants, 1.4 million ATMs, 500 million accounts, 100-130 million transactions per day (8k per second). The VisaNet portion of a transaction approval happens in 50 ms and VisaNet has 2 seconds of downtime per year.
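A quick check of those numbers, as a minimal Python sketch using only the figures quoted above. Reading the 8k per second as a peak rate rather than an average is my assumption, not something from the talk.

```python
SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY


def average_tps(transactions_per_day):
    """Average transactions per second for a given daily volume."""
    return transactions_per_day / SECONDS_PER_DAY


def availability(downtime_seconds_per_year):
    """Fraction of the year the system is up."""
    return 1 - downtime_seconds_per_year / SECONDS_PER_YEAR


low = average_tps(100_000_000)
high = average_tps(130_000_000)
print(f"average rate: {low:,.0f} to {high:,.0f} transactions per second")
print(f"availability: {availability(2):.7%}")
```

The daily volume averages out to roughly 1,200 to 1,500 transactions per second, so the 8k figure would have to be a peak. Two seconds of downtime per year is better than seven nines of availability.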
Visa uses lots of FOSS in the enterprise so their interest in Hadoop came about in a perfectly natural way. One of the things the current research does is to produce fraud models that can be used in the real-time system. These models are produced off-line and it currently takes about 1 month to generate a fully-functional model. This involves sampling data, moving data, compressing data, etc. With Hadoop a fully-functional model can be generated in 13 minutes. Lots of the research happens on synthetic data and Visa has created map jobs to generate synthetic transaction data at the rate of 2 years of synthetic transactions in 6 hours.
Rackspace – Stu Hood
Stu talked mostly about the email and app hosting division of Rackspace, but there are a lot of interesting things happening in other divisions too. The mail hosting division generates a lot of logs and needs to be very proactive in finding and surfacing delivery problems and dealing with spam and malware. They generate 300 GB/day of logs and have a 6 month window for analysis. There was some interesting talk of using map-reduce to generate a Lucene intermediate index format and then using Solr to merge it up into a searchable index. The time window target is to be able to query on data within 15 minutes of the event and to query on any dimension.
They either are or will be using Scribe to ingest data (syslog data?) directly into HDFS. I need to look into Scribe more as I had sort of forgotten about it. It could help solve some pressing issues.
JP Morgan Chase
One very interesting point the speakers brought out is that once your transactional data gets to be over 1 day old it is usually entirely static. But still it sits there provisioned and protected in your enterprise RDBMS transactional data store — read only. We can do better than that!
HDFS Security – Owen O’Malley, Yahoo!
There is lots of work going on to secure Hadoop Namenode, Jobtracker and Datanode. This is super important to our clients too and so we are very interested in seeing this succeed. Some of the major points:
- Namenode will give the user a Kerberos token to access blocks on a data node
- Only the user of a map-reduce job will be able to modify or kill it
- Map-reduce jobs run as the user
- Task working directory visibility limited to the user
- Only the right reduce tasks can see the map outputs (secure shuffle)
- Encryption optional
- SPNEGO = permissions everywhere
In addition to the security work Owen talked some about how to make backward compatible APIs and how to mark them as such. The need for this grows as the project’s ecosystem grows larger. File system compatibility is also important. We know this right well as we got stuck on the bogus Hadoop 0.19 version that no one even mentions in public. We couldn’t go back to 0.18 and 0.20 wasn’t stable yet. Stuck stuck stuck.
VAIDYA
There are 165+ tunable parameters in Hadoop. Changing one usually has impact on the others. Vaidya uses a set of rules to analyze map-reduce history (after a job has run) using the user history and the job conf. It can measure things like reducer balance, check for compressed intermediate data, determine whether a combiner would be likely to help, and many other things. This is a lot of the stuff we do by hand when evaluating a map-reduce job. You could see something like this being used as a part of the certification that a job is ready to run on your production cluster. JIRA 4179.
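To make the idea of post-run rules concrete, here is a small Python sketch of two checks in that spirit. This is not Vaidya's code or API: the function names, the input shapes and the 2.0 threshold are all hypothetical. The only real Hadoop name is the `mapred.compress.map.output` property from the 0.18 to 0.20 line.

```python
from statistics import mean


def reducer_skew(records_per_reducer):
    """Ratio of the busiest reducer's input to the average reducer's input."""
    if not records_per_reducer:
        return 1.0
    avg = mean(records_per_reducer)
    if avg == 0:
        return 1.0
    return max(records_per_reducer) / avg


def check_reducer_balance(records_per_reducer, threshold=2.0):
    """Return a warning if one reducer got far more input than the average."""
    skew = reducer_skew(records_per_reducer)
    if skew > threshold:
        return f"unbalanced reducers: busiest task got {skew:.1f}x the average input"
    return None


def check_map_output_compression(job_conf):
    """Return a warning if intermediate map output is not compressed."""
    if job_conf.get("mapred.compress.map.output", "false").lower() != "true":
        return "intermediate map output is not compressed"
    return None


def diagnose(records_per_reducer, job_conf):
    """Run every rule and keep only the warnings that fired."""
    findings = [
        check_reducer_balance(records_per_reducer),
        check_map_output_compression(job_conf),
    ]
    return [finding for finding in findings if finding is not None]


# Made-up job history: one reducer received almost all of the records.
for warning in diagnose([10, 10, 10, 970], {"mapred.reduce.tasks": "4"}):
    print(warning)
```

Each rule takes what the job history and job conf already record and returns either a warning or nothing, which is what would make a tool like this easy to slot into a pre-production checklist.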
eBay Streaming Architecture – Neel Sundaresan
eBay users spend $1400 per second. A vehicle is sold every 2 minutes. There are about 10 million new listings per day and 250 million users. The typical item on eBay is not catalogable (is that a word?) as most items are one-of-a-kind. So as compared to a Netflix type recommender system where you might need a 100k x 100k matrix, the eBay recommender would be an extremely large, sparse array. Neel called this a “long-tail” type system. There are terabytes of new transaction and user session data per day. eBay Research Labs studies this data to determine behavior in a number of dimensions, figures out how to perform A/B testing experiments, and does trending analysis.
The streaming architecture (Mobius) has a query language (MQL) that gives an SQL-like interface but adds a start and end section to the query to give the window of data to operate on.
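I didn't write down any actual MQL syntax, so the following is only a Python sketch of the underlying idea: an ordinary group-and-count query that also takes a start and an end, and only looks at events inside that window. The event fields and the function name are made up for illustration.

```python
from datetime import datetime


def windowed_count(events, start, end, key):
    """Count events per value of `key` inside the half-open window [start, end)."""
    counts = {}
    for event in events:
        if start <= event["time"] < end:
            value = event[key]
            counts[value] = counts.get(value, 0) + 1
    return counts


# Made-up session events, not eBay data.
EVENTS = [
    {"time": datetime(2009, 10, 2, 9, 15), "category": "cameras"},
    {"time": datetime(2009, 10, 2, 9, 40), "category": "cameras"},
    {"time": datetime(2009, 10, 2, 10, 5), "category": "guitars"},
    {"time": datetime(2009, 10, 2, 11, 30), "category": "cameras"},
]

counts = windowed_count(
    EVENTS,
    start=datetime(2009, 10, 2, 9, 0),
    end=datetime(2009, 10, 2, 11, 0),
    key="category",
)
print(counts)
```

The 11:30 event falls outside the window and is ignored, so the result counts two camera events and one guitar event.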
High Availability Hadoop – webContext
I had some problems understanding this talk at the beginning, but warmed up to it as it went along. There were lots of interesting products and tools mentioned that I should look into more.
- Hotslice to monitor job exit status
- Network bonding with LACP
- DRBD for disk mirroring
- Linux HA from linux-ha.org
- Spacewalk – an open source system like RedHat Satellite server
These guys have two machines running the NameNode using Linux HA and DRBD on a virtual IP address. In addition to that they use the REST interface to the NameNode to /getimage and /getedit once per hour and make additional backups. They have had 6 NN failures in the last 18 months, 3 planned, 3 unplanned and have about a 15 second rollover to the backup. That is intense. Read more about this configuration on the Cloudera blog.
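The hourly backup step could look something like the Python sketch below. The speakers did not show their script, so this is my guess at the shape of it. The query strings (`getimage=1`, `getedit=1`) follow how I understand the secondary NameNode fetches these files in the 0.18 to 0.20 line, and the host name, port 50070 and backup directory are assumptions to check against your own cluster.

```python
import shutil
import time
import urllib.request
from pathlib import Path


def backup_urls(namenode, port=50070):
    """Build the URLs for the fsimage and edits files (assumed URL layout)."""
    base = f"http://{namenode}:{port}/getimage"
    return {"fsimage": f"{base}?getimage=1", "edits": f"{base}?getedit=1"}


def backup_namenode(namenode, backup_dir, port=50070, timeout=60):
    """Download both files into a fresh timestamped directory."""
    target = Path(backup_dir) / time.strftime("%Y%m%d-%H%M%S")
    target.mkdir(parents=True, exist_ok=True)
    saved = []
    for name, url in backup_urls(namenode, port).items():
        destination = target / name
        # Stream to disk so a large fsimage is never held fully in memory.
        with urllib.request.urlopen(url, timeout=timeout) as response:
            with open(destination, "wb") as out:
                shutil.copyfileobj(response, out)
        saved.append(destination)
    return saved


if __name__ == "__main__":
    # Hypothetical host and path; run this from cron once per hour.
    for path in backup_namenode("namenode.example.com", "/backup/namenode"):
        print("saved", path)
```

Writing each run into its own timestamped directory keeps a history of images, so a corrupted download never overwrites the last good copy.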
RADFS
This is a set of patches in progress to make a low-latency HDFS. This is important work for things like HBase. There are some huge inefficiencies in sockets, file handles, and network usage for small positioned reads. There is still a lot of work to do on these patches before they could reasonably take over for the current HDFS approach. Interesting things to think about. JIRA HDFS-516.
Automatic Problem Diagnosis – CMU Team
The CMU team is using the M45 cluster that Y! has made available to several universities. They have some really nice visualizations of problems occurring in the cluster that come from looking at 64 different metrics in collected log files. One metric I thought especially interesting was heartbeat date skew between the master and slave. This work is being tracked in JIRA Chukwa – 94.