Nutch in a Nutshell
Presented by
Liew Guo Min
Zhao Jin
Outline
Recap
Special features
Running Nutch in a distributed environment
(with demo)
Q&A
Discussion
Recap
Complete web search engine
Nutch = Crawler + Indexer/Searcher (Lucene) + GUI
+ Plugins
+ MapReduce & Distributed FS (Hadoop)
Java-based, open source
Features:
Customizable
Extensible
Distributed
Nutch as a crawler
[Diagram: the crawl cycle. The Injector seeds the CrawlDB with the initial URLs; the Generator reads the CrawlDB and generates a fetch list in a new Segment; the Fetcher gets webpages/files from the Web and writes them into the Segment; the Parser reads and writes the Segment; finally, the CrawlDBTool updates the CrawlDB with the results.]
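For reference, the same cycle can be driven step by step from the command line, as in the sketch below; the urls/ and crawl/ paths are placeholders, and the exact commands and options vary between Nutch versions.

    # Seed the CrawlDB with the initial URLs (listed in files under urls/)
    bin/nutch inject crawl/crawldb urls
    # Generate a fetch list in a new segment under crawl/segments
    bin/nutch generate crawl/crawldb crawl/segments
    # The newest directory under crawl/segments is the segment just generated
    s=`ls -d crawl/segments/* | tail -1`
    # Fetch the listed pages, then parse the fetched content
    bin/nutch fetch $s
    bin/nutch parse $s
    # Update the CrawlDB with the outcome of this round
    bin/nutch updatedb crawl/crawldb $s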
Special Features
Extensible (Plugin system)
Most of the essential functionality of Nutch is implemented as plugins
Three layers
Extension points -- what can be extended: Protocol, Parser, ScoringFilter, etc.
Extensions -- the interfaces to be implemented for the extension points
Plugins -- the actual implementations
Special Features
Extensible (Plugin system)
Anyone can write a plugin
Write the code
Prepare the metadata files
plugin.xml -- what has been extended by what
build.xml -- how ant can build your source code
Ask Nutch to include your plugin in conf/nutch-site.xml
Tell ant to build your plugin in src/plugin/build.xml
More details @ http://wiki.apache.org/nutch/PluginCentral
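As an illustration, the plugin.xml of a hypothetical parser plugin might look roughly like this; the plugin id, jar name, and class names are made up, and PluginCentral has the authoritative format.

    <plugin id="parse-myformat" name="My Format Parser"
            version="1.0.0" provider-name="example.org">
       <runtime>
          <!-- the jar that build.xml produces from your code -->
          <library name="parse-myformat.jar">
             <export name="*"/>
          </library>
       </runtime>
       <requires>
          <import plugin="nutch-extensionpoints"/>
       </requires>
       <!-- what has been extended (the Parser extension point) by what -->
       <extension id="org.example.parse.myformat"
                  name="MyFormatParser"
                  point="org.apache.nutch.parse.Parser">
          <implementation id="MyFormatParser"
                          class="org.example.parse.myformat.MyFormatParser"/>
       </extension>
    </plugin>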
Special Features
Extensible (Plugin system)
To use a plugin
Make sure you have modified conf/nutch-site.xml to include the plugin
Then, either
Nutch will call it automatically when needed, or
you can write code that loads it by its class name and then use it
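Concretely, inclusion is controlled by the plugin.includes property in conf/nutch-site.xml, a regular expression matched against plugin ids. The value below is only a sketch; start from the default in conf/nutch-default.xml and add your plugin's id to it.

    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-regex|parse-(text|html|myformat)|index-basic|query-(basic|site|url)</value>
    </property>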
Special Features
Distributed (Hadoop)
MapReduce (diagram: see the Excursion slides at the end)
A framework for distributed programming
Map -- processes the splits of the input data into intermediate results, with keys that indicate what should be grouped together later
Reduce -- processes the intermediate results that share a key and outputs the final result
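To make the two steps concrete, below is a minimal word-count sketch against the Hadoop API of that era (org.apache.hadoop.mapred); it counts every word rather than just "cat", and the JobConf-based driver that wires the classes to input/output paths is omitted.

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class WordCount {

      // Map: emit <word, 1> for every word in the split;
      // the word is the key that decides what is grouped together later
      public static class Map extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable offset, Text line,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
          StringTokenizer tokens = new StringTokenizer(line.toString());
          while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            output.collect(word, ONE);
          }
        }
      }

      // Reduce: sum the counts collected under the same word
      public static class Reduce extends MapReduceBase
          implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text word, Iterator<IntWritable> counts,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
          int sum = 0;
          while (counts.hasNext()) {
            sum += counts.next().get();
          }
          output.collect(word, new IntWritable(sum));
        }
      }
    }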
Special Features
Distributed (Hadoop)
MapReduce in Nutch
Example 1: Parsing
Input: the <url, content> files from the fetch
Map(url, content) -> <url, parse>, by calling the parser plugins
Reduce is the identity
Example 2: Dumping a segment
Input: the <url, CrawlDatum>, <url, ParseText>, etc. files from the segment
Map is the identity
Reduce(url, value*) -> <url, ConcatenatedValue>, by simply concatenating the text representations of the values
Special Features
Distributed (Hadoop)
Distributed File System
Write-once-read-many coherence model
High throughput
Master/slave -- a simple architecture, but the master is a single point of failure
Transparent -- accessed via the Java API
More info @ http://lucene.apache.org/hadoop/hdfs_design.html
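As a small illustration of the Java API, the sketch below writes a file to DFS once and reads it back; the path is a placeholder, and the namenode location is taken from hadoop-site.xml.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class DfsHello {
      public static void main(String[] args) throws IOException {
        // fs.default.name from hadoop-site.xml decides which DFS we talk to
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path p = new Path("/user/nutch/hello.txt");

        // Write once...
        FSDataOutputStream out = fs.create(p);
        out.writeBytes("hello, dfs\n");
        out.close();

        // ...read many
        FSDataInputStream in = fs.open(p);
        byte[] buf = new byte[64];
        int n = in.read(buf);
        in.close();
        System.out.println(new String(buf, 0, n));
      }
    }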
Running Nutch in a distributed environment
MapReduce
In hadoop-site.xml
Specify job tracker host & port
mapred.job.tracker
Specify task numbers
mapred.map.tasks
mapred.reduce.tasks
Specify the location for temporary files
mapred.local.dir
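Put together, the MapReduce side of hadoop-site.xml might look like the sketch below (inside the <configuration> element); the host, port, task counts, and path are placeholders to adapt to your cluster.

    <property>
      <name>mapred.job.tracker</name>
      <value>master.example.org:9001</value>
    </property>
    <property>
      <name>mapred.map.tasks</name>
      <value>4</value>
    </property>
    <property>
      <name>mapred.reduce.tasks</name>
      <value>2</value>
    </property>
    <property>
      <name>mapred.local.dir</name>
      <value>/tmp/hadoop/mapred/local</value>
    </property>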
Running Nutch in a distributed environment
DFS
In hadoop-site.xml
Specify namenode host, port & directory
fs.default.name
dfs.name.dir
Specify location for files on each datanode
dfs.data.dir
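And the DFS side, again with placeholder values: fs.default.name names the namenode, while dfs.name.dir and dfs.data.dir are local paths on the namenode and on each datanode respectively.

    <property>
      <name>fs.default.name</name>
      <value>master.example.org:9000</value>
    </property>
    <property>
      <name>dfs.name.dir</name>
      <value>/home/nutch/dfs/name</value>
    </property>
    <property>
      <name>dfs.data.dir</name>
      <value>/home/nutch/dfs/data</value>
    </property>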
Demo time!
Q&A
Discussion
Exercises
Hands-on exercises
Install Nutch, crawl a few webpages using the crawl command, and perform a search over them using the GUI
Repeat the crawling process without using the crawl command
Modify your configuration to perform each of the following crawl jobs, and think about when each would be useful:
To crawl only webpages and PDFs, but nothing else
To crawl the files on your hard disk
To crawl but not to parse
(Challenging) Modify Nutch so that you can unpack the crawled files in the segments back into their original state
Reference
http://wiki.apache.org/nutch/PluginCentral -- Information on Nutch plugins
http://lucene.apache.org/hadoop/ -- Hadoop homepage
http://wiki.apache.org/lucene-hadoop/ -- Hadoop Wiki
http://wiki.apache.org/nutch-data/attachments/Presentations/attachments/map -- "MapReduce in Nutch"
http://wiki.apache.org/nutch-data/attachments/Presentations/attachments/oscon05.pdf -- "Scalable Computing with MapReduce"
http://www.mail-archive.com/[email protected]/msg01951.html -- Updated tutorial on setting up Nutch, Hadoop and Lucene together
Excursion: MapReduce
Problem
Find the number of occurrences of "cat" in a file
What if the file is 20 GB?
Why not do it with more computers?
Solution
[Diagram: the file is split in two; PC1 counts 200 occurrences in split 1 and PC2 counts 300 in split 2, and the partial counts are combined on PC1 into the total of 500.]
Excursion: MapReduce
Problem
Find the number of occurrences of both "cat" and "dog" in a very large file
Solution
[Diagram: Map -- PC1 processes split 1 and emits cat: 200, dog: 250, while PC2 processes split 2 and emits cat: 300, dog: 250. Sort/Group -- the intermediate pairs are grouped by key: cat: 200, 300 and dog: 250, 250. Reduce -- PC1 sums the cat values to cat: 500 and PC2 sums the dog values to dog: 500. Columns: input files, intermediate files, output files.]
Excursion: MapReduce
Generalized Framework
[Diagram: a Master coordinates the Workers. Splits 1-4 of the input files go to map Workers, which emit intermediate key-value pairs such as k1:v1, k3:v2, k2:v4, k2:v5, k4:v6. The Sort/Group step collects the pairs with the same key into intermediate files (e.g. k2:v4,v5), and reduce Workers turn each group into the output files Output 1-3.]