Nutch in a Nutshell
Presented by
Liew Guo Min
Zhao Jin
Outline
Recap
Special features
Running Nutch in a distributed environment
(with demo)
Q&A
Discussion
Recap
Complete web search engine
Nutch = Crawler + Indexer/Searcher (Lucene) + GUI
+ Plugins
+ MapReduce & Distributed FS (Hadoop)
Java-based, open source
Features:
Customizable
Extensible
Distributed
Nutch as a crawler
[Diagram: the crawl cycle. The Injector seeds the CrawlDB with the initial URLs; the Generator reads the CrawlDB and generates a fetch list in a new Segment; the Fetcher gets webpages/files from the Web and writes them into the Segment; the Parser reads and writes the Segment; finally, the CrawlDBTool updates the CrawlDB with the results.]
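For reference, the same cycle can be driven step by step from the command line, as in the sketch below; the urls/ and crawl/ paths are placeholders, and the exact commands and options vary between Nutch versions.

    # Seed the CrawlDB with the initial URLs (listed in files under urls/)
    bin/nutch inject crawl/crawldb urls
    # Generate a fetch list in a new segment under crawl/segments
    bin/nutch generate crawl/crawldb crawl/segments
    # The newest directory under crawl/segments is the segment just generated
    s=`ls -d crawl/segments/* | tail -1`
    # Fetch the listed pages, then parse the fetched content
    bin/nutch fetch $s
    bin/nutch parse $s
    # Update the CrawlDB with the outcome of this round
    bin/nutch updatedb crawl/crawldb $s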
Special Features
Extensible (Plugin system)
Most of the essential functionality of Nutch is implemented as plugins
Three layers
Extension points -- what can be extended: Protocol, Parser, ScoringFilter, etc.
Extensions -- the interfaces to be implemented for the extension points
Plugins -- the actual implementations
Special Features
Extensible (Plugin system)
Anyone can write a plugin
Write the code
Prepare the metadata files
plugin.xml -- what has been extended by what
build.xml -- how ant can build your source code
Ask Nutch to include your plugin in conf/nutch-site.xml
Tell ant to build your plugin in src/plugin/build.xml
More details @ http://wiki.apache.org/nutch/PluginCentral
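As an illustration, the plugin.xml of a hypothetical parser plugin might look roughly like this; the plugin id, jar name, and class names are made up, and PluginCentral has the authoritative format.

    <plugin id="parse-myformat" name="My Format Parser"
            version="1.0.0" provider-name="example.org">
       <runtime>
          <!-- the jar that build.xml produces from your code -->
          <library name="parse-myformat.jar">
             <export name="*"/>
          </library>
       </runtime>
       <requires>
          <import plugin="nutch-extensionpoints"/>
       </requires>
       <!-- what has been extended (the Parser extension point) by what -->
       <extension id="org.example.parse.myformat"
                  name="MyFormatParser"
                  point="org.apache.nutch.parse.Parser">
          <implementation id="MyFormatParser"
                          class="org.example.parse.myformat.MyFormatParser"/>
       </extension>
    </plugin>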
Special Features
Extensible (Plugin system)
To use a plugin
Make sure you have modified conf/nutch-site.xml to include the plugin
Then, either
Nutch will call it automatically when needed, or
you can write code that loads it by its class name and then use it
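Concretely, inclusion is controlled by the plugin.includes property in conf/nutch-site.xml, a regular expression matched against plugin ids. The value below is only a sketch; start from the default in conf/nutch-default.xml and add your plugin's id to it.

    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-regex|parse-(text|html|myformat)|index-basic|query-(basic|site|url)</value>
    </property>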
Special Features
Distributed (Hadoop)
MapReduce (diagram: see the Excursion slides at the end)
A framework for distributed programming
Map -- processes the splits of the input data into intermediate results, with keys that indicate what should be grouped together later
Reduce -- processes the intermediate results that share a key and outputs the final result
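To make the two steps concrete, below is a minimal word-count sketch against the Hadoop API of that era (org.apache.hadoop.mapred); it counts every word rather than just "cat", and the JobConf-based driver that wires the classes to input/output paths is omitted.

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class WordCount {

      // Map: emit <word, 1> for every word in the split;
      // the word is the key that decides what is grouped together later
      public static class Map extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable offset, Text line,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
          StringTokenizer tokens = new StringTokenizer(line.toString());
          while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            output.collect(word, ONE);
          }
        }
      }

      // Reduce: sum the counts collected under the same word
      public static class Reduce extends MapReduceBase
          implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text word, Iterator<IntWritable> counts,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
          int sum = 0;
          while (counts.hasNext()) {
            sum += counts.next().get();
          }
          output.collect(word, new IntWritable(sum));
        }
      }
    }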
Special Features
Distributed (Hadoop)
MapReduce in Nutch
Example 1: Parsing
Input: the <url, content> files from the fetch
Map(url, content) -> <url, parse>, by calling the parser plugins
Reduce is the identity
Example 2: Dumping a segment
Input: the <url, CrawlDatum>, <url, ParseText>, etc. files from the segment
Map is the identity
Reduce(url, value*) -> <url, ConcatenatedValue>, by simply concatenating the text representations of the values
Special Features
Distributed (Hadoop)
Distributed File System
Write-once-read-many coherence model
High throughput
Master/slave -- a simple architecture, but the master is a single point of failure
Transparent -- accessed via the Java API
More info @ http://lucene.apache.org/hadoop/hdfs_design.html
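As a small illustration of the Java API, the sketch below writes a file to DFS once and reads it back; the path is a placeholder, and the namenode location is taken from hadoop-site.xml.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class DfsHello {
      public static void main(String[] args) throws IOException {
        // fs.default.name from hadoop-site.xml decides which DFS we talk to
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path p = new Path("/user/nutch/hello.txt");

        // Write once...
        FSDataOutputStream out = fs.create(p);
        out.writeBytes("hello, dfs\n");
        out.close();

        // ...read many
        FSDataInputStream in = fs.open(p);
        byte[] buf = new byte[64];
        int n = in.read(buf);
        in.close();
        System.out.println(new String(buf, 0, n));
      }
    }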
Running Nutch in a distributed environment
MapReduce
In hadoop-site.xml
Specify job tracker host & port
mapred.job.tracker
Specify task numbers
mapred.map.tasks
mapred.reduce.tasks
Specify the location for temporary files
mapred.local.dir
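Put together, the MapReduce side of hadoop-site.xml might look like the sketch below (inside the <configuration> element); the host, port, task counts, and path are placeholders to adapt to your cluster.

    <property>
      <name>mapred.job.tracker</name>
      <value>master.example.org:9001</value>
    </property>
    <property>
      <name>mapred.map.tasks</name>
      <value>4</value>
    </property>
    <property>
      <name>mapred.reduce.tasks</name>
      <value>2</value>
    </property>
    <property>
      <name>mapred.local.dir</name>
      <value>/tmp/hadoop/mapred/local</value>
    </property>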
Running Nutch in a distributed environment
DFS
In hadoop-site.xml
Specify namenode host, port & directory
fs.default.name
dfs.name.dir
Specify location for files on each datanode
dfs.data.dir
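And the DFS side, again with placeholder values: fs.default.name names the namenode, while dfs.name.dir and dfs.data.dir are local paths on the namenode and on each datanode respectively.

    <property>
      <name>fs.default.name</name>
      <value>master.example.org:9000</value>
    </property>
    <property>
      <name>dfs.name.dir</name>
      <value>/home/nutch/dfs/name</value>
    </property>
    <property>
      <name>dfs.data.dir</name>
      <value>/home/nutch/dfs/data</value>
    </property>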
Demo time!
Q&A
Discussion
Exercises
Hands-on exercises
Install Nutch, crawl a few webpages using the crawl command, and perform a search over them using the GUI
Repeat the crawling process without using the crawl command
Modify your configuration to perform each of the following crawl jobs, and think about when each would be useful:
To crawl only webpages and PDFs, but nothing else
To crawl the files on your hard disk
To crawl but not to parse
(Challenging) Modify Nutch so that you can unpack the crawled files in the segments back into their original state
Reference
http://wiki.apache.org/nutch/PluginCentral -- Information on Nutch plugins
http://lucene.apache.org/hadoop/ -- Hadoop homepage
http://wiki.apache.org/lucene-hadoop/ -- Hadoop Wiki
http://wiki.apache.org/nutch-data/attachments/Presentations/attachments/map -- "MapReduce in Nutch"
http://wiki.apache.org/nutch-data/attachments/Presentations/attachments/oscon05.pdf -- "Scalable Computing with MapReduce"
http://www.mail-archive.com/[email protected]/msg01951.html -- Updated tutorial on setting up Nutch, Hadoop and Lucene together
Excursion: MapReduce
Problem
Find the number of occurrences of "cat" in a file
What if the file is 20 GB?
Why not do it with more computers?
Solution
[Diagram: the file is split in two; PC1 counts 200 occurrences in split 1 and PC2 counts 300 in split 2, and the partial counts are combined on PC1 into the total of 500.]
Excursion: MapReduce
Problem
Find the number of occurrences of both "cat" and "dog" in a very large file
Solution
[Diagram: Map -- PC1 processes split 1 and emits cat: 200, dog: 250, while PC2 processes split 2 and emits cat: 300, dog: 250. Sort/Group -- the intermediate pairs are grouped by key: cat: 200, 300 and dog: 250, 250. Reduce -- PC1 sums the cat values to cat: 500 and PC2 sums the dog values to dog: 500. Columns: input files, intermediate files, output files.]
Excursion: MapReduce
Generalized Framework
[Diagram: a Master coordinates the Workers. Splits 1-4 of the input files go to map Workers, which emit intermediate key-value pairs such as k1:v1, k3:v2, k2:v4, k2:v5, k4:v6. The Sort/Group step collects the pairs with the same key into intermediate files (e.g. k2:v4,v5), and reduce Workers turn each group into the output files Output 1-3.]