Simon Willison’s Weblog

9 items tagged “hadoop”

2009

Today, Facebook counts 29% of its employees (and growing!) as Hive users. More than half (51%) of those users are outside of Engineering. They come from distinct groups like User Operations, Sales, Human Resources, and Finance. Many of them had never used a database before working here. Thanks to Hive, they are now all data ninjas who are able to move fast and make great decisions with data.

Facebook Data Team

# 30th November 2009, 11:30 am / facebook, hadoop, hive

Introducing Cloudera Desktop. It’s a GUI for Hadoop, and under the hood is a whole stack of open source software, including Python, Django, MooTools, Twisted, lxml, CherryPy, Mako, Java and AspectJ.

# 21st October 2009, 6:48 pm / hadoop, open-source, cloudera, python, django, mootools, twisted, lxml, cherrypy, mako, java, aspectj

Finding similar items with Amazon Elastic MapReduce, Python, and Hadoop streaming. Tutorial for running Hadoop jobs on Elastic MapReduce using Python and the 2005 Audioscrobbler dataset.

# 7th April 2009, 9:19 am / audioscrobbler, amazon, amazon-web-services, hadoop, mapreduce, elasticmapreduce, python

Amazon Elastic MapReduce (via) Hadoop as a service. Basically a web-based GUI around Hadoop: you could roll this yourself on EC2, but for a small markup on regular EC2 prices you get to avoid the extra work of setting everything up. Data processing scripts can be written in Java, Ruby, Perl, Python, PHP, R, or C++ and are loaded into S3 before firing off the job.

# 2nd April 2009, 10:25 am / cloud-computing, hadoop, amazon-web-services, amazon, mapreduce, ec2, s3

2008

Cascading. A Java API abstraction layer over Hadoop that lets developers think in terms of pipes and filters rather than map/reduce. The Cascading developers claim that this model is easier to understand and less error-prone.

# 1st October 2008, 1:22 pm / mapreduce, cascading, java, hadoop, pipesfilters

3 and 1/2 minutes to sort a Terabyte, and a look at Hadoop’s code structure. Bill de hÓra uses some clever static analysis tools to explore Hadoop’s 100,000+ lines of code.

# 7th July 2008, 2:15 pm / hadoop, bill-de-hora, staticanalysis, java

Python + Hadoop = Flying Circus Elephant. Last.fm have released Dumbo, a Python module that lets you easily write Hadoop map/reduce tasks using Python and generators.
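To give a feel for what writing map/reduce tasks as generators looks like, here is a rough word-count sketch in plain Python. It does not import Dumbo and should not be read as its exact API: the mapper and reducer signatures are an assumption about the general style, and `run_locally` is a hypothetical in-memory stand-in for the grouping Hadoop would do between the two steps.

```python
from collections import defaultdict


def mapper(key, value):
    """Map step: yield a (word, 1) pair for every word in the input value."""
    for word in value.split():
        yield word, 1


def reducer(key, values):
    """Reduce step: yield the key with the sum of all its values."""
    yield key, sum(values)


def run_locally(records, mapper, reducer):
    """Hypothetical helper: group mapper output by key, then reduce each group.

    This only imitates the shuffle phase in memory for small inputs.
    """
    grouped = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            grouped[out_key].append(out_value)

    results = []
    for key in sorted(grouped):
        results.extend(reducer(key, grouped[key]))
    return results
```

Because both steps are generators, neither has to build its full output in memory; the runner decides how the pairs are collected.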

# 31st May 2008, 2:14 pm / hadoop, python, generators, lastfm, dumbo, mapreduce

2007

Writing An Hadoop MapReduce Program In Python. Hadoop (the open source map/reduce framework) can interact with any program that reads from stdin and outputs on stdout—so it’s trivial to drop in Python scripts for the map and reduce steps.
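A minimal sketch of the stdin/stdout idea, assuming a word-count job and tab-separated key/value lines. This is not the code from the linked tutorial; the function names and the convention of passing `map` or `reduce` as a command-line argument are made up for illustration.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Map step: emit one 'word<TAB>1' line per word read."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(lines):
    """Reduce step: sum the counts for each run of identical keys.

    Assumes the input is already sorted by key, which is what the
    framework does between the map and reduce steps.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Hypothetical convention: the first argument selects the step to run.
if __name__ == "__main__" and len(sys.argv) > 1:
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for output_line in step(sys.stdin):
        print(output_line)
```

Saved as, say, `wc.py` (a hypothetical filename), the same script can be checked locally without Hadoop by piping `cat input.txt | python wc.py map | sort | python wc.py reduce`, with `sort` standing in for the framework's shuffle.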

# 9th October 2007, 11:33 am / hadoop, mapreduce, python

2006

Hadoop. Open-source Google File System / map-reduce equivalent. Apparently scales amazingly well.

# 23rd August 2006, 8:36 am / hadoop