Showing posts with label MapReduce. Show all posts

Friday, February 22, 2013

SQL coming to Hadoop

SQL is what’s next for Hadoop: Here’s who’s doing it — Tech News and Analysis

From GigaOm.

Interesting. SQL is powerful and well understood by many developers, probably many more than understand MapReduce (the programming model on which Hadoop is based). So this may help more people start using Hadoop for Big Data projects, and Big Data is an emerging trend.

Monday, August 13, 2012

Inferno on Disco, Python MapReduce library / daemon for structured text

By Vasudev Ram


Inferno is an open-source Python MapReduce library. It has (from the site):

[ A query language for large amounts of structured text (CSV, JSON, etc).

A continuous and scheduled MapReduce daemon with an HTTP interface that automatically launches MapReduce jobs to handle a constant stream of incoming data. ]

Overview of Inferno.

This overview page has a nice step-by-step example: starting with a small set of test data, it shows how to query for a certain result in SQL and then in AWK (both are easy one-liners), and then goes on to show how to achieve the same result using Inferno.

The interesting point is that the Inferno version is also small: a "rule" of ~10 lines (presumably stored in a config file) plus a one-line command. The difference from the SQL and AWK examples is that it runs a Disco MapReduce job to distribute the work across the nodes of a cluster. There is almost nothing in the Inferno code to indicate that this is a distributed MapReduce job.
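To make concrete what such a job does under the hood, here is a minimal single-process sketch of the map, shuffle and reduce steps in plain Python. It is not Inferno's or Disco's API, and it is not the example from the overview page: the records and the function names (map_phase, shuffle, reduce_phase, run_job) are made up for illustration. It computes roughly what a SQL "SELECT last, COUNT(*) ... GROUP BY last" would.

```python
from collections import defaultdict

# Hypothetical test data, invented for this sketch.
RECORDS = [
    {"first": "Ada", "last": "Lovelace"},
    {"first": "Alan", "last": "Turing"},
    {"first": "Dermot", "last": "Turing"},
]


def map_phase(record):
    """Emit one (key, value) pair per record: the last name and a count of 1."""
    yield record["last"], 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse the values for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence, in a single process."""
    pairs = [pair for record in records for pair in map_phase(record)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in sorted(grouped.items()))


if __name__ == "__main__":
    for last_name, count in run_job(RECORDS).items():
        print(last_name, count)
```

In a real Disco job, the map and reduce steps would run on different nodes and the framework would handle the grouping and data movement. That is the part Inferno hides behind its short rule.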

Inferno uses Disco.

Disco is "a distributed computing framework based on the MapReduce paradigm. Disco is open-source; developed by Nokia Research Center to solve real problems in handling massive amounts of data."

Some users of Disco: Chango, Nokia, Zemanta. Chango staff seem to be the developers of Inferno.

- Vasudev Ram - Dancing Bison Enterprises