This is the documentation for Cloudera 5.4.x. Documentation for other versions is available at Cloudera Documentation.

Load and Index Data in Search


Execute the script found in a subdirectory of the following locations. The path for the script often includes the product version, such as Cloudera Manager 5.4.x, so path details vary. To address this issue, use wildcards.

Packages: /usr/share/doc. If Search for CDH 5.4.2 is installed to the default location using packages, the Quick Start script is found in /usr/share/doc/search-*/quickstart.
Parcels: /opt/cloudera/parcels/CDH/share/doc. If Search for CDH 5.4.2 is installed to the default location using parcels, the Quick Start script is found in /opt/cloudera/parcels/CDH/share/doc/search-*/quickstart.
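For example, assuming a default parcel-based installation, you can confirm the script's location with a wildcard (the exact versioned directory name varies by release):

$ ls /opt/cloudera/parcels/CDH/share/doc/search-*/quickstart/quickstart.sh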

The script uses several defaults that you might want to modify:

Table 1. Script Parameters and Defaults

Parameter          Default                                Notes
NAMENODE_CONNECT   `hostname`:8020                        For use on an HDFS HA cluster. If you use NAMENODE_CONNECT, do not use NAMENODE_HOST or NAMENODE_PORT.
NAMENODE_HOST      `hostname`                             If you use NAMENODE_HOST and NAMENODE_PORT, do not use NAMENODE_CONNECT.
NAMENODE_PORT      8020                                   If you use NAMENODE_HOST and NAMENODE_PORT, do not use NAMENODE_CONNECT.
ZOOKEEPER_HOST     `hostname`
ZOOKEEPER_PORT     2181
ZOOKEEPER_ROOT     /solr
HDFS_USER          ${HDFS_USER:="${USER}"}
SOLR_HOME          /opt/cloudera/parcels/SOLR/lib/solr

By default, the script is configured to run on the NameNode host, which is also running ZooKeeper. Override these defaults with custom values when you start
quickstart.sh. For example, to use an alternate NameNode and HDFS user ID, you could start the script as follows:

$ NAMENODE_HOST=nnhost HDFS_USER=jsmith ./quickstart.sh
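Similarly, on an HDFS HA cluster you could set NAMENODE_CONNECT instead. This is a sketch, assuming the script accepts the value verbatim as the HDFS authority; nameservice1 is a hypothetical HA nameservice ID, so substitute your own:

$ NAMENODE_CONNECT=nameservice1 ./quickstart.sh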

The first time the script runs, it downloads required files such as the Enron data and configuration files. If you run the script again, it uses the Enron data already downloaded rather than downloading it again; on such subsequent runs, the existing data is used to re-create the enron-email-collection SolrCloud collection.

Note: Downloading the data from its server, expanding the data, and uploading the data can be time-consuming. Although your connection and CPU speed determine the time these processes require, fifteen minutes is typical and longer is not uncommon.

The script also generates a Solr configuration and creates a collection in SolrCloud. The following sections describe what the script does and how you can complete these steps manually, if desired (an illustrative sketch of the equivalent manual commands follows the list). The script completes the following tasks:

1. Set variables such as hostnames and directories.
2. Create a directory to which to copy the Enron data, and then copy that data to this location. This data is about 422 MB; in some tests it took about five minutes to download and two minutes to untar.
3. Create a directory for the current user in HDFS, change ownership of that directory to the current user, create a directory for the Enron data, and load the Enron data to that directory. In some tests, it took about a minute to copy approximately 3 GB of untarred data.
4. Use solrctl to create a template of the instance directory.
5. Use solrctl to create a new Solr collection for the Enron mail collection.
6. Create a directory to which the MapReduceBatchIndexer can write results, and ensure that the directory is empty.
7. Use the MapReduceIndexerTool to index the Enron data and push the result live to enron-email-collection. In some tests, it took about seven minutes to complete this task.
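As a rough illustration of steps 3 through 7, the manual equivalents might look like the following. This is a minimal sketch, not the script's exact logic: the HDFS paths, the local directory name enron_mail, the instance directory $HOME/emailSearchConfig, the shard count of 2, the jar location, and the hosts nnhost and zkhost are all assumptions for illustration.

# Step 3: stage the Enron data in HDFS (paths and the local
# directory name "enron_mail" are illustrative assumptions)
$ sudo -u hdfs hdfs dfs -mkdir -p /user/$USER
$ sudo -u hdfs hdfs dfs -chown $USER /user/$USER
$ hdfs dfs -mkdir -p /user/$USER/enron
$ hdfs dfs -put enron_mail /user/$USER/enron

# Steps 4 and 5: generate an instance directory template, upload
# it to ZooKeeper, and create the collection
$ solrctl instancedir --generate $HOME/emailSearchConfig
$ solrctl instancedir --create enron-email-collection $HOME/emailSearchConfig
$ solrctl collection --create enron-email-collection -s 2

# Step 6: create an empty HDFS directory for the indexer output
$ hdfs dfs -mkdir -p /user/$USER/outdir

# Step 7: batch-index the data and push the result to the live
# collection; morphline.conf is a hypothetical local morphline
# configuration, and nnhost/zkhost stand in for your own hosts
$ hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar \
    org.apache.solr.hadoop.MapReduceIndexerTool \
    --morphline-file morphline.conf \
    --output-dir hdfs://nnhost:8020/user/$USER/outdir \
    --go-live \
    --zk-host zkhost:2181/solr \
    --collection enron-email-collection \
    hdfs://nnhost:8020/user/$USER/enron

The --go-live option merges the freshly built index into the running SolrCloud collection, which is what "push the result live" refers to in step 7.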
