Lucene & Solr for Java Developers

Apache Lucene is a free and open-source information retrieval software library. It allows for indexing and searching of text-based documents and includes features like powerful query syntax, fast indexing and searching, and relevance ranking. Solr builds on Lucene to provide a search server with web-based administration and an HTTP API. Nutch is an open-source web crawler that uses Lucene for indexing and searching and can be used to build an internet search engine.

Uploaded by

Gökay Arpacı


Apache Lucene

Searching the Web and Everything Else

Daniel Naber Mindquarry GmbH ID 380

AGENDA
> What's a search engine
> Lucene Java
    Features
    Code example
> Solr
    Features
    Integration
> Nutch
    Features
    Usage example
> Conclusion and alternative solutions

About the Speaker

> Studied computational linguistics
> Java developer
> Worked 3.5 years for an Enterprise Search company (using Lucene Java)
> Now at Mindquarry, creators of an Open Source Collaboration Software (Mindquarry uses Solr)

Question: What is a Search Engine?

> Answer: Software that
    builds an index on text
    answers queries using that index
> "But we have a database already"
    A search engine offers
        Scalability
        Relevance ranking
        Integration of different data sources (email, web pages, files, database, ...)

What is a search engine? (cont.)

> Works on words, not on substrings
    auto != automatic, automobile
> Indexing process:
    Convert document
    Extract text and meta data
    Normalize text
    Write (inverted) index
> Example:
    Document 1: Apache Lucene at Jazoon
    Document 2: Jazoon conference
    Index:
        apache     -> 1
        conference -> 2
        jazoon     -> 1, 2
        lucene     -> 1
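The indexing example above can be reproduced with a few lines of plain Java. This is only a minimal sketch of the idea of an inverted index, not Lucene's implementation: the class name `InvertedIndexSketch` is hypothetical, and treating "at" as a stop word is an assumption made so that the result matches the index shown on the slide.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class InvertedIndexSketch {
    // Assumption: "at" is a stop word, since it is absent from the slide's index.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("at"));

    // Maps each normalized term to the list of document numbers that contain it.
    static Map<String, List<Integer>> build(String[] docs) {
        Map<String, List<Integer>> index = new TreeMap<>();
        for (int i = 0; i < docs.length; i++) {
            int docId = i + 1; // documents are numbered from 1, as on the slide
            for (String token : docs[i].toLowerCase(Locale.ROOT).split("\\W+")) {
                if (token.isEmpty() || STOP_WORDS.contains(token)) {
                    continue;
                }
                List<Integer> postings = index.computeIfAbsent(token, k -> new ArrayList<>());
                if (!postings.contains(docId)) {
                    postings.add(docId);
                }
            }
        }
        return index;
    }

    public static void main(String[] args) {
        String[] docs = { "Apache Lucene at Jazoon", "Jazoon conference" };
        build(docs).forEach((term, ids) -> System.out.println(term + " -> " + ids));
    }
}
```

A query for "jazoon" is then a single map lookup that returns documents 1 and 2, and no document text has to be scanned at search time.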

Apache Lucene Overview

> Lucene Java 2.2
    Java library
> Solr 1.2
    HTTP-based index and search server
> Nutch 0.9
    Internet search engine software
> http://lucene.apache.org

Lucene Java
> Java library for indexing and searching
> No dependencies (not even a logging framework)
> Works with Java 1.4 or later
> Input for indexing: Document objects
    Each document: set of Fields, field name: field content (plain text)
> Input for searching: query strings or Query objects
> Stores its index as files on disk
> No document converters
> No web crawler

Lucene Java Users

> IBM OmniFind Yahoo! Edition
> technorati.com
> Eclipse
> Furl
> Nuxeo ECM
> Monster.com
> ...

Lucene Java Features

> Powerful query syntax
> Create queries from user input or programmatically
> Fast indexing
> Fast searching
> Sorting by relevance or other fields
> Large and active community
> Apache License 2.0

Lucene Query Syntax

> Query examples:
    jazoon
    jazoon AND java  <=>  +jazoon +java
    jazoon OR java
    jazoon NOT php  <=>  jazoon -php
    conference AND (java OR j2ee)
    Java conference
    title:jazoon
    j?zoon
    jaz*
    schmidt~  (matches schmidt, schmit, schmitt)
    price:[000 TO 050]
    + more
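The `schmidt~` example is a fuzzy query, and Lucene's fuzzy matching is based on edit distance (Levenshtein distance). The following plain-Java sketch shows why schmit and schmitt are close to schmidt. It is an illustration of the distance measure only, not Lucene's code; the class name `EditDistanceSketch` is hypothetical.

```java
class EditDistanceSketch {
    // Dynamic-programming Levenshtein distance: the minimum number of
    // single-character insertions, deletions, and substitutions.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        for (String candidate : new String[] { "schmidt", "schmit", "schmitt", "jazoon" }) {
            System.out.println("schmidt vs " + candidate + ": " + distance("schmidt", candidate));
        }
    }
}
```

Terms within a small distance of the query term count as matches; the exact threshold Lucene applies is configurable and is not modelled here.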

Lucene Code Example: Indexing

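    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    Analyzer analyzer = new StandardAnalyzer();
    // true = create a new index, overwriting any existing one
    IndexWriter iw = new IndexWriter("/tmp/testindex", analyzer, true);

    // Repeat this block in a loop, once per document to be indexed
    Document doc = new Document();
    doc.add(new Field("body", "This is my TEST document",
            Field.Store.YES, Field.Index.TOKENIZED));
    iw.addDocument(doc);

    iw.optimize();  // merge index segments for faster searching
    iw.close();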

StandardAnalyzer turns the field text into the tokens my, test, document (lowercased, with the stop words "This" and "is" removed).

Lucene Code Example: Searching

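    import java.util.Iterator;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hit;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    // Use the same analyzer as at indexing time
    Analyzer analyzer = new StandardAnalyzer();
    IndexSearcher is = new IndexSearcher("/tmp/testindex");

    QueryParser qp = new QueryParser("body", analyzer);  // "body" is the default field
    String userInput = "document AND test";
    Query q = qp.parse(userInput);
    Hits hits = is.search(q);
    for (Iterator iter = hits.iterator(); iter.hasNext();) {
        Hit hit = (Hit) iter.next();
        System.out.println(hit.getScore() + " " + hit.get("body"));
    }

    is.close();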

Lucene Hints
> Tools:
    Luke - Lucene index browser, http://www.getopt.org/luke/
    Lucli
> Common pitfalls and misconceptions
    Limit of 10,000 tokens per field by default, see IndexWriter.setMaxFieldLength()
    There's no error if a field doesn't exist
    You cannot update single fields
    You cannot join tables (Lucene is based on documents, not tables)
    Lucene works on strings only -> compared as strings, 42 is between 1 and 9
        Use zero-padded values such as 0042
    Do not misuse Lucene as a database
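The "42 is between 1 and 9" pitfall follows directly from lexicographic string comparison, which plain Java shows without any Lucene code. The sketch below assumes non-negative integers and a fixed field width; the class and method names are hypothetical.

```java
class ZeroPaddingSketch {
    // True if value lies in [low, high] under plain string (lexicographic) ordering.
    static boolean inStringRange(String value, String low, String high) {
        return value.compareTo(low) >= 0 && value.compareTo(high) <= 0;
    }

    // Left-pads a non-negative number with zeros to a fixed width, e.g. 42 -> "0042".
    static String pad(int n, int width) {
        return String.format("%0" + width + "d", n);
    }

    public static void main(String[] args) {
        // Unpadded: "42" sorts between "1" and "9" because '4' lies between '1' and '9'.
        System.out.println(inStringRange("42", "1", "9"));
        // Padded: string order now agrees with numeric order.
        System.out.println(inStringRange(pad(42, 4), pad(1, 4), pad(9, 4)));
    }
}
```

The same padding has to be applied when indexing a value and when building a range query such as `price:[000 TO 050]`, otherwise the two sides no longer compare consistently.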

Advanced Lucene Java

> Text normalization (Analyzer)
    Tokenize: "foo-bar: text" -> foo, bar, text
    Lowercase
    Linguistic normalization (children -> child)
    Stopword removal (the, a, ...)
    You can create your own Analyzer (search + index)
> Ranking algorithm
    TF-IDF (term frequency - inverse document frequency)
    You can add your own algorithm
    Difficult to evaluate
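The normalization steps listed above can be sketched as one small function in plain Java. This is not a Lucene Analyzer and does not use the Lucene API; the class name, the two-word stop list, and the one-entry lemma table are hypothetical stand-ins chosen to mirror the slide's examples.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class NormalizationSketch {
    // Assumption: a tiny stop list taken from the slide's "(the, a, ...)".
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("the", "a"));

    // Assumption: a lookup table stands in for real linguistic normalization.
    static final Map<String, String> LEMMAS = new HashMap<>();
    static {
        LEMMAS.put("children", "child");
    }

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        // Tokenize: split on anything that is not a letter or a digit.
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            if (raw.isEmpty()) {
                continue;
            }
            String token = raw.toLowerCase(Locale.ROOT);   // lowercase
            if (STOP_WORDS.contains(token)) {              // stopword removal
                continue;
            }
            tokens.add(LEMMAS.getOrDefault(token, token)); // linguistic normalization
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("foo-bar: text"));
        System.out.println(analyze("The Children"));
    }
}
```

Applying the same function to documents and to queries is what the slide's "(search + index)" note is about: a query term can only match if it was normalized the same way as the indexed term.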

Lucene Java: How to get Started

> API docs
    http://lucene.zones.apache.org:8080/hudson/job/LuceneNightly/javadoc/overview-summary.html#overview_description
> FAQ
    http://wiki.apache.org/lucene-java/LuceneFAQ

Lucene Java Summary

> Java Library for indexing and searching
> Lightweight / no dependencies
> Powerful and fast
> No document conversion
> No end-user front-end

Solr
> An index and search server (jetty)
> A web application
> Requires Java 5.0 or later
> Builds on Lucene Java
> Programming only to build and parse XML
    No programming at all using Cocoon
> Communicates via HTTP
    Index: use HTTP POST to index XML
    Search: use GET request, Solr returns XML
        Parameters e.g. q = query, start, rows
    Future versions will make use without HTTP easier (Java API)

Solr Indexing Example

> HTTP POST to http://localhost:8983/solr/update
    <add>
      <doc>
        <field name="url">http://www.myhost.org/solr-rocks.html</field>
        <field name="title">Solr is great</field>
        <field name="creationDate">2007-06-25T12:04:00.000Z</field>
        <field name="content">Solr is a great open source search server. It scales, it's easy to configure....</field>
      </doc>
    </add>
> Delete a document: POST this XML:
    <delete><query>myID:12345</query></delete>
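From Java, the indexing request above amounts to building the XML and sending it with an HTTP POST. The sketch below is one possible way to do that with the JDK's own HTTP client, which needs Java 11 or later (newer than the Java 5 the slides mention). The class and method names are hypothetical, and `main` only prints the XML so that it runs without a Solr server.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.LinkedHashMap;
import java.util.Map;

class SolrUpdateSketch {
    // Escapes the characters that are special in XML text and attribute values.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    // Builds an <add><doc>...</doc></add> message from field name/value pairs.
    static String addXml(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder("<add><doc>");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            sb.append("<field name=\"").append(escape(e.getKey())).append("\">")
              .append(escape(e.getValue())).append("</field>");
        }
        return sb.append("</doc></add>").toString();
    }

    // Sends the XML to the update URL and returns the HTTP status code.
    // Not called from main, because it needs a running server.
    static int post(String updateUrl, String xml) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(updateUrl))
                .header("Content-Type", "text/xml; charset=utf-8")
                .POST(HttpRequest.BodyPublishers.ofString(xml))
                .build();
        return HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString())
                .statusCode();
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("url", "http://www.myhost.org/solr-rocks.html");
        fields.put("title", "Solr is great");
        System.out.println(addXml(fields));
    }
}
```

With a server running, `post("http://localhost:8983/solr/update", xml)` would send the document; the same method could send the `<delete>` message shown above.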

Solr Search Example

GET this URL:
    http://localhost:8983/solr/select/?indent=on&q=solr
Response (simplified!):
    <response>
      <result name="response" numFound="1" start="0" maxScore="1.0">
        <doc>
          <float name="score">1.0</float>
          <str name="title">Solr is Great</str>
          <str name="url">http://www.myhost.org/solr-rocks.html</str>
        </doc>
      </result>
    </response>
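When the query comes from user input, it has to be URL-encoded before it goes into the select URL. A minimal sketch, assuming the `q`, `start` and `rows` parameters mentioned earlier in the slides; the class and method names are hypothetical, and `URLEncoder.encode(String, Charset)` needs Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class SolrSelectUrlSketch {
    // Builds a select URL; baseUrl is e.g. "http://localhost:8983/solr".
    static String selectUrl(String baseUrl, String query, int start, int rows) {
        return baseUrl + "/select/?indent=on"
                + "&q=" + URLEncoder.encode(query, StandardCharsets.UTF_8)
                + "&start=" + start
                + "&rows=" + rows;
    }

    public static void main(String[] args) {
        System.out.println(selectUrl("http://localhost:8983/solr", "solr", 0, 10));
        System.out.println(selectUrl("http://localhost:8983/solr", "solr AND title:great", 10, 10));
    }
}
```

The second call shows why the encoding matters: spaces and the colon in `title:great` would otherwise produce an invalid or misparsed URL.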

Solr Faceted Browsing

> Makes it easy to browse large search results

Solr Faceted Browsing (cont.)

schema.xml:
    <field name="topic" type="string" indexed="true" stored="true"/>
Query URL:
    http://.../select?facet=true&facet.field=topic
Output from Solr:
    <lst name="topic">
      <int name="Genetic algorithms">6</int>
      <int name="Artificial intelligence">3</int>
      ...
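Conceptually, the facet output is a count of how many matching documents carry each value of the `topic` field. The plain-Java sketch below illustrates that counting on an in-memory list. It says nothing about how Solr computes facets internally, and the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class FacetCountSketch {
    // Counts how often each topic occurs among the hits, highest count first.
    static Map<String, Integer> facetCounts(List<String> topicsOfHits) {
        Map<String, Integer> counts = new HashMap<>();
        for (String topic : topicsOfHits) {
            counts.merge(topic, 1, Integer::sum);
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((x, y) -> {
            int byCount = Integer.compare(y.getValue(), x.getValue());
            return byCount != 0 ? byCount : x.getKey().compareTo(y.getKey());
        });
        Map<String, Integer> sorted = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : entries) {
            sorted.put(e.getKey(), e.getValue());
        }
        return sorted;
    }

    public static void main(String[] args) {
        List<String> topics = new ArrayList<>();
        topics.addAll(Collections.nCopies(3, "Artificial intelligence"));
        topics.addAll(Collections.nCopies(6, "Genetic algorithms"));
        facetCounts(topics).forEach((topic, n) -> System.out.println(topic + ": " + n));
    }
}
```

A user interface can then show each topic with its count as a clickable filter, which is what makes large result sets easy to browse.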

Solr: How to get Started

> Download Solr 1.2
> Install the WAR
> Use the post.jar from the exampledocs directory to index some documents
> Browse to the admin panel at http://localhost:8080/solr/admin/ and make some searches
> Configure schema.xml and solrconfig.xml in WEB-INF/classes
> Details at "Search smarter with Apache Solr"
    http://www.ibm.com/developerworks/java/library/j-solr1/
    http://www.ibm.com/developerworks/java/library/j-solr2/
> FAQ
    http://wiki.apache.org/solr/FAQ

Solr Summary
> A search server
> Access via XML sent over HTTP
    Client doesn't need to be Java
> Web-based administration panel
> Like Lucene Java, it does no document conversion
> Security: make sure your Solr server cannot be accessed from outside!

Nutch
> Internet search engine software (software only, not the search service)
> Builds on Lucene Java for indexing and search
> Command line for indexing
> Web application for searching
> Contains a web crawler
> Adds document converters
> Issues:
    Scalability
    Crawler Politeness
    Crawler Management
    Web Spam

Nutch Users
> Internet Archive
    www.archive.org
> Krugle
    krugle.com
> Several vertical search engines, see
    http://wiki.apache.org/nutch/PublicServers

Getting started with Nutch

> Download Nutch 0.9 (try SVN in case of problems)
> Indexing:
    add start URLs to a text file
    configure conf/crawl-urlfilter.txt
    configure conf/nutch-site.xml
    command line call: bin/nutch crawl urls -dir crawl -depth 3 -topN 50
> Searching:
    install the WAR
    search at e.g. http://localhost:8080/

Nutch Summary
> Powerful for vertical search engines
> Meant for indexing Intranet/Internet via HTTP; indexing local files is possible with some configuration
> Not as mature as Lucene and Solr yet
> You will need to invest some time

Other Lucene Features

> "Did you mean..."
    Spell checker based on the terms in the index
    See contrib/spellchecker in Lucene Java
> Find similar documents
    Selects documents similar to a given document, based on the document's significant terms
    See contrib/queries MoreLikeThis.java in Lucene Java
> NON-features: security
    Lucene doesn't care about security!
    You need to filter results yourself
    For Solr, you need to secure HTTP access

Other Projects at Apache Lucene

> Hadoop - a distributed computing platform
    Map/Reduce
    Used by Nutch
> Lucene.Net - C# port of Lucene, compatible on any level (API, index, ...)
    Used by Beagle, Wikipedia, ...

Lucene project: The big Picture

> Lucene: Java fulltext search library

> Solr = Lucene Java
    + Web administration frontend
    + HTTP frontend
    + Typed fields (schema)
    + Faceted Browsing
    + Configurable Caching
    + XML configuration, no Java needed
    + Document IDs
    + Replication
> Nutch = Lucene Java
    + Hadoop
    + Web crawler
    + Document converters
    + Web search frontend
    + Link analysis
    + Distributed search

Alternative Solutions for Search

> Commercial vendors (FAST, Autonomy, Google, ...)
    Enterprise search
> Commercial search engines based on Lucene and Lucene support (see Wiki)
    IBM OmniFind Yahoo! Edition
> RDBMS with integrated search features
    Lucene has more powerful syntax and can be easily adapted and integrated
> Egothor
    Lucene has a much bigger community

Conclusion
> - no Enterprise Search (but: Intranet indexing using Nutch)
> + can be embedded or integrated in almost any situation
> + fast
> + powerful
> + large, helpful community
> + the quasi-standard in Open Source search

Daniel Naber www.danielnaber.de Mindquarry GmbH www.mindquarry.com

Presentation license: http://creativecommons.org/licenses/by/3.0/
