{"id":92691,"date":"2020-07-27T11:00:00","date_gmt":"2020-07-27T08:00:00","guid":{"rendered":"https:\/\/examples.javacodegeeks.com\/?p=92691"},"modified":"2020-07-20T18:46:24","modified_gmt":"2020-07-20T15:46:24","slug":"apache-solr-and-apache-tika-integration-tutorial","status":"publish","type":"post","link":"https:\/\/examples.javacodegeeks.com\/apache-solr-and-apache-tika-integration-tutorial\/","title":{"rendered":"Apache Solr and Apache Tika Integration Tutorial"},"content":{"rendered":"<p>This article is a tutorial about Apache Solr and Apache Tika Integration.<\/p>\n<h2 class=\"wp-block-heading\"><a name=\"introduction\"><\/a>1. Introduction<\/h2>\n<p>A Solr index can accept data from many different sources, such as CSV, XML, databases and common binary files. If the data to be indexed is in binary format, such as WORD, PPT, XLS, and PDF, the Solr Content Extraction Library (the <a aria-label=\"undefined (opens in a new tab)\" href=\"https:\/\/lucene.apache.org\/solr\/guide\/8_5\/uploading-data-with-solr-cell-using-apache-tika.html\" target=\"_blank\" rel=\"noreferrer noopener\">Solr Cell<\/a> framework) built upon <a aria-label=\"undefined (opens in a new tab)\" href=\"https:\/\/tika.apache.org\/\" target=\"_blank\" rel=\"noreferrer noopener\">Apache Tika<\/a> is used for ingesting binary files or structured files. In this example we are going to show you how Apache Solr and Apache Tika integration works.<\/p>\n<div class=\"toc\">\n<h3>Table Of Contents<\/h3>\n<dl>\n<dt><a href=\"#introduction\">1. Introduction<\/a><\/dt>\n<dt><a href=\"#technologies_used\">2. Technologies Used<\/a><\/dt>\n<dt><a href=\"#apache_solr_apache_tika_integration\">3. Apache Solr And Apache Tika Integration<\/a><\/dt>\n<dd>\n<dl>\n<dt><a href=\"#the_basics\">3.1 The Basics<\/a><\/dt>\n<dt><a href=\"#setting_up_the_integration\">3.2 Setting Up The Integration<\/a><\/dt>\n<dt><a href=\"#examples\">3.3 Examples<\/a><\/dt>\n<\/dl>\n<\/dd>\n<dt><a href=\"#download\">4. 
Download The Sample Data File<\/a><\/dt>\n<\/dl>\n<\/div>\n<p>&nbsp;<\/p>\n<h2 class=\"wp-block-heading\"><a name=\"technologies_used\"><\/a>2. Technologies Used<\/h2>\n<p>The steps and commands described in this example are for <a href=\"https:\/\/lucene.apache.org\/solr\/downloads.html#solr-852\" target=\"_blank\" aria-label=\"undefined (opens in a new tab)\" rel=\"noreferrer noopener\">Apache Solr 8.5<\/a> on Windows 10. The JDK version we use to run Solr in this example is <a href=\"https:\/\/jdk.java.net\/java-se-ri\/13\" target=\"_blank\" aria-label=\"undefined (opens in a new tab)\" rel=\"noreferrer noopener\">OpenJDK 13<\/a>. Before we start, please make sure your computer meets the <a href=\"https:\/\/lucene.apache.org\/solr\/8_5_0\/SYSTEM_REQUIREMENTS.html\" target=\"_blank\" aria-label=\"undefined (opens in a new tab)\" rel=\"noreferrer noopener\">system requirements<\/a>. Also, please download the binary release of <a href=\"https:\/\/lucene.apache.org\/solr\/downloads.html#solr-852\" target=\"_blank\" aria-label=\"undefined (opens in a new tab)\" rel=\"noreferrer noopener\">Apache Solr 8.5<\/a>.<\/p>\n<h2 class=\"wp-block-heading\"><a name=\"apache_solr_apache_tika_integration\"><\/a>3. Apache Solr And Apache Tika Integration<\/h2>\n<h3 class=\"wp-block-heading\"><a name=\"the_basics\"><\/a>3.1 The Basics<\/h3>\n<p><a aria-label=\"undefined (opens in a new tab)\" href=\"https:\/\/tika.apache.org\/\" target=\"_blank\" rel=\"noreferrer noopener\">Apache Tika<\/a> is a content analysis toolkit which detects and extracts metadata and text from over a thousand different file types (such as WORD, PPT, XLS, and PDF). This makes Tika very useful for indexing binary data in Solr. The Solr Cell framework uses code from the Tika project internally to support uploading binary files for data extraction and indexing. 
Let&#8217;s see how to set up the integration in the next section.<\/p>\n<h3 class=\"wp-block-heading\"><a name=\"setting_up_the_integration\"><\/a>3.2 Setting Up The Integration<\/h3>\n<p>We don&#8217;t need to download Apache Tika for the integration. Solr Cell, as a contrib, contains all dependencies required to run Tika. It is not included in the configSet automatically and needs to be configured.<\/p>\n<h4 class=\"wp-block-heading\">3.2.1 Putting Jars On Classpath<\/h4>\n<p>To use Solr Cell, we must add additional jars to Solr\u2019s classpath. There are a few options to make other plugins available to Solr as described in <a href=\"https:\/\/lucene.apache.org\/solr\/guide\/8_5\/solr-plugins.html#installing-plugins\" target=\"_blank\" aria-label=\"undefined (opens in a new tab)\" rel=\"noreferrer noopener\">Solr Plugins<\/a>. We use the standard approach, the <code>&lt;lib&gt;<\/code> directive in <code>solrconfig.xml<\/code>, as shown below:<\/p>\n<pre class=\"brush:xml\">&lt;lib dir=\"${solr.install.dir:..\/..\/..\/..\/..}\/contrib\/extraction\/lib\" regex=\".*\\.jar\" \/&gt;\n&lt;lib dir=\"${solr.install.dir:..\/..\/..\/..\/..}\/dist\/\" regex=\"solr-cell-\\d.*\\.jar\" \/&gt;<\/pre>\n<h4 class=\"wp-block-heading\">3.2.2 ExtractingRequestHandler Parameters And Configuration<\/h4>\n<p>A <code>SolrRequestHandler<\/code> defines the logic executed for any request sent to Solr. When working with the Solr Cell framework, Solr\u2019s <code>ExtractingRequestHandler<\/code>, which implements the <code>SolrRequestHandler<\/code> interface, uses Tika internally to support uploading binary files for data extraction and indexing. The parameters listed in the table below are accepted by the <code>ExtractingRequestHandler<\/code>. 
We can specify them as request parameters for each indexing request or add them to <code>ExtractingRequestHandler<\/code> configured in <code>solrconfig.xml<\/code> for all requests.<\/p>\n<figure class=\"wp-block-table\">\n<table>\n<tbody>\n<tr>\n<th>Parameter<\/th>\n<th>Description<\/th>\n<th>Example of Request Parameter<\/th>\n<\/tr>\n<tr>\n<td>capture<\/td>\n<td>Captures XHTML elements with the specified name.<\/td>\n<td><code>capture=p<\/code><\/td>\n<\/tr>\n<tr>\n<td>captureAttr<\/td>\n<td>Indexes attributes of the Tika XHTML elements into separate fields.<\/td>\n<td><code>captureAttr=true<\/code><\/td>\n<\/tr>\n<tr>\n<td>commitWithin<\/td>\n<td>Add the document within the specified number of milliseconds.<\/td>\n<td><code>commitWithin=5000<\/code><\/td>\n<\/tr>\n<tr>\n<td>defaultField<\/td>\n<td>A default field to use if the uprefix parameter is not specified and a field is not defined in the schema.<\/td>\n<td><code>defaultField=_text_<\/code><\/td>\n<\/tr>\n<tr>\n<td>extractOnly<\/td>\n<td>If true, returns the extracted content from Tika without indexing the document. 
Default is false.<\/td>\n<td><code>extractOnly=true<\/code><\/td>\n<\/tr>\n<tr>\n<td>extractFormat<\/td>\n<td>The serialization format of the extracted content: xml (default) or text.<\/td>\n<td><code>extractFormat=text<\/code><\/td>\n<\/tr>\n<tr>\n<td>fmap.source_field<\/td>\n<td>Maps a source field in the incoming document to another field.<\/td>\n<td><code>fmap.content=_text_<\/code><\/td>\n<\/tr>\n<tr>\n<td>ignoreTikaException<\/td>\n<td>Skips exceptions during processing when set to true.<\/td>\n<td><code>ignoreTikaException=true<\/code><\/td>\n<\/tr>\n<tr>\n<td>literal.fieldname<\/td>\n<td>Populates a field with the specified value for each document.<\/td>\n<td><code>literal.id=word-doc-1<\/code><\/td>\n<\/tr>\n<tr>\n<td>literalsOverride<\/td>\n<td>If true (default), overrides field values with literal values; otherwise appends them to the same field, which must be multivalued.<\/td>\n<td><code>literalsOverride=false<\/code><\/td>\n<\/tr>\n<tr>\n<td>lowernames<\/td>\n<td>Maps all field names to lowercase with underscores when set to true.<\/td>\n<td><code>lowernames=true<\/code><\/td>\n<\/tr>\n<tr>\n<td>multipartUploadLimitInKB<\/td>\n<td>Maximum upload document size allowed. 
Default is 2048KB.<\/td>\n<td><code>multipartUploadLimitInKB=1024000<\/code><\/td>\n<\/tr>\n<tr>\n<td>parseContext.config<\/td>\n<td>Specifies a Tika parser config file.<\/td>\n<td><code>parseContext.config=doc-config.xml<\/code><\/td>\n<\/tr>\n<tr>\n<td>passwordsFile<\/td>\n<td>Specifies a filename-password mapping file when indexing encrypted documents.<\/td>\n<td><code>passwordsFile=\/path\/to\/passwords.txt<\/code><\/td>\n<\/tr>\n<tr>\n<td>resource.name<\/td>\n<td>Specifies the name of the file to index.<\/td>\n<td><code>resource.name=jcg_examples.doc<\/code><\/td>\n<\/tr>\n<tr>\n<td>resource.password<\/td>\n<td>Defines the password for an encrypted document.<\/td>\n<td><code>resource.password=secret123<\/code><\/td>\n<\/tr>\n<tr>\n<td>tika.config<\/td>\n<td>Specifies a custom Tika config file.<\/td>\n<td><code>tika.config=\/path\/to\/tika.config<\/code><\/td>\n<\/tr>\n<tr>\n<td>uprefix<\/td>\n<td>Prefixes all fields that are not defined in the schema with the given prefix.<\/td>\n<td><code>uprefix=ignored_<\/code><\/td>\n<\/tr>\n<tr>\n<td>xpath<\/td>\n<td>Defines an XPath expression to restrict the XHTML returned by Tika.<\/td>\n<td><code>xpath=\/xhtml:html\/xhtml:body\/xhtml:div\/\/node()<\/code><\/td>\n<\/tr>\n<\/tbody>\n<\/table><figcaption>Table 1. 
ExtractingRequestHandler Parameters<\/figcaption><\/figure>\n<\/p>\n<p>An example of the <code>ExtractingRequestHandler<\/code> configuration in <code>solrconfig.xml<\/code> is below:<\/p>\n<pre class=\"brush:xml\">&lt;requestHandler name=\"\/update\/extract\"\n                startup=\"lazy\"\n                class=\"solr.extraction.ExtractingRequestHandler\" &gt;\n  &lt;lst name=\"defaults\"&gt;\n    &lt;str name=\"lowernames\"&gt;true&lt;\/str&gt;\n    &lt;str name=\"fmap.content\"&gt;_text_&lt;\/str&gt;\n    &lt;!--&lt;str name=\"uprefix\"&gt;ignored_&lt;\/str&gt;--&gt;\n    &lt;!-- capture link hrefs but ignore div attributes --&gt;\n    &lt;str name=\"captureAttr\"&gt;true&lt;\/str&gt;\n    &lt;str name=\"fmap.a\"&gt;links&lt;\/str&gt;\n    &lt;str name=\"fmap.div\"&gt;ignored_div&lt;\/str&gt;\n  &lt;\/lst&gt;\n&lt;\/requestHandler&gt;<\/pre>\n<p>In the example configuration above, we map all field names to lowercase with underscores and map the <code>content<\/code> field in incoming documents to the <code>_text_<\/code> field. As the sample Word document we are going to index contains several links, we set <code>captureAttr<\/code> to <code>true<\/code> to capture them and map the captured <code>hrefs<\/code> to the <code>links<\/code> field. In addition, the <code>uprefix<\/code> parameter is commented out for now; we will see an example later which sets <code>uprefix<\/code> to <code>ignored_<\/code> to ignore all fields extracted by Tika but not defined in the schema.<\/p>\n<h4 class=\"wp-block-heading\">3.2.3 Defining Schema<\/h4>\n<p>Open the <code>managed-schema<\/code> file of the <code>jcg_example_configs<\/code> configSet with any text editor. It is located under the directory <code>${solr.install.dir}\\server\\solr\\configsets\\jcg_example_configs\\conf<\/code>. 
Make sure the following fields have been defined:<\/p>\n<pre class=\"brush:xml\">&lt;field name=\"id\" type=\"string\" multiValued=\"false\" indexed=\"true\" required=\"true\" stored=\"true\"\/&gt;\n&lt;field name=\"author\" type=\"string\" indexed=\"true\" stored=\"true\"\/&gt;\n&lt;field name=\"links\" type=\"strings\" indexed=\"true\" stored=\"true\"\/&gt;\n&lt;field name=\"last_modified\" type=\"pdate\" indexed=\"true\" stored=\"true\"\/&gt;\n&lt;field name=\"_text_\" type=\"text_general\" multiValued=\"true\" indexed=\"true\" stored=\"false\"\/&gt;<\/pre>\n<p>For your convenience, a <code>jcg_example_configs.zip<\/code> file containing all the configurations and the schema is attached to the article. You can simply download and extract it to the directory <code>${solr.install.dir}\\server\\solr\\configsets\\jcg_example_configs\\conf\\<\/code>.<\/p>\n<h4 class=\"wp-block-heading\">3.2.4 Starting Solr Instance<\/h4>\n<p>For simplicity, instead of setting up a SolrCloud on your local machine as demonstrated in <a aria-label=\"undefined (opens in a new tab)\" href=\"https:\/\/examples.javacodegeeks.com\/apache-solr-clustering-example\/\" target=\"_blank\" rel=\"noreferrer noopener\">Apache Solr Clustering Example<\/a>, we run a single Solr instance on our local machine with the command below:<\/p>\n<pre class=\"brush:bash\">bin\\solr.cmd start<\/pre>\n<p>The output would be:<\/p>\n<pre class=\"brush:bash\">Waiting up to 30 to see Solr running on port 8983\nStarted Solr server on port 8983. Happy searching!<\/pre>\n<h4 class=\"wp-block-heading\">3.2.5 Creating A New Core<\/h4>\n<p>As we are running Solr in standalone mode, we need to create a new core named <code>jcg_example_core<\/code> with the <code>jcg_example_configs<\/code> configSet on the local machine. 
For example, we can do it via the CoreAdmin API:<\/p>\n<pre class=\"brush:bash\">curl -G http:\/\/localhost:8983\/solr\/admin\/cores --data-urlencode action=CREATE --data-urlencode name=jcg_example_core --data-urlencode configSet=jcg_example_configs<\/pre>\n<p>The output would be:<\/p>\n<pre class=\"brush:bash\">{\n  \"responseHeader\":{\n    \"status\":0,\n    \"QTime\":641},\n  \"core\":\"jcg_example_core\"}<\/pre>\n<p>If the <code>jcg_example_core<\/code> exists, you can remove it via the CoreAdmin API as below:<\/p>\n<pre class=\"brush:bash\">curl -G http:\/\/localhost:8983\/solr\/admin\/cores --data-urlencode action=UNLOAD --data-urlencode core=jcg_example_core --data-urlencode deleteInstanceDir=true<\/pre>\n<p>The output would be:<\/p>\n<pre class=\"brush:bash\">{\n  \"responseHeader\":{\n    \"status\":0,\n    \"QTime\":37\n  }\n}<\/pre>\n<h3 class=\"wp-block-heading\"><a name=\"examples\"><\/a>3.3 Examples<\/h3>\n<p>Apache Tika supports several document formats and is able to extract metadata and\/or textual content from the <a href=\"https:\/\/tika.apache.org\/1.24.1\/formats.html\" target=\"_blank\" aria-label=\"undefined (opens in a new tab)\" rel=\"noreferrer noopener\">Supported Document Formats<\/a>. 
Time to see some examples of how Solr Cell works.<\/p>\n<h4 class=\"wp-block-heading\"><a name=\"indexing_data\"><\/a>3.3.1 Indexing Data<\/h4>\n<p>Download and extract the sample data file attached to this article and index the <code>jcg_example_articles.docx<\/code> with the following command:<\/p>\n<pre class=\"brush:bash\">curl \"http:\/\/localhost:8983\/solr\/jcg_example_core\/update\/extract?literal.id=word-doc-1&amp;commit=true\" -F \"myfile=@jcg_example_articles.docx\"<\/pre>\n<p>The output would be:<\/p>\n<pre class=\"brush:bash\">{\n  \"responseHeader\":{\n    \"status\":0,\n    \"QTime\":1789\n  }\n}<\/pre>\n<p>Based on the configuration we have for the <code>ExtractingRequestHandler<\/code>, the URL above calls the <code>ExtractingRequestHandler<\/code>, uploads the file <code>jcg_example_articles.docx<\/code>, and assigns it the unique ID <code>word-doc-1<\/code>. Note that specifying a unique ID for the document being indexed is very important in our example. Without it, indexing the same document again with the command above creates a new document in the index with a new unique ID, because we have the <code>uuid<\/code> update processor defined in <code>solrconfig.xml<\/code>. In other use cases, we may choose to map a metadata field to the ID, generate a new UUID, or generate an ID from a signature (hash) of the content. The <code>commit=true<\/code> parameter makes Solr commit the changes after indexing the document so that we can find it immediately by query. For optimum performance when loading many documents, don\u2019t call the commit command until you are done. The <code>-F<\/code> flag allows us to specify HTTP multipart POST data for curl to upload a binary file.<\/p>\n<p>Another useful parameter is <code>extractOnly<\/code>. 
We can set it to <code>true<\/code> to extract data without indexing it, which is useful for testing.<\/p>\n<p>The example below sets <code>extractOnly=true<\/code> and returns the extracted content instead of indexing it:<\/p>\n<pre class=\"brush:bash\">curl \"http:\/\/localhost:8983\/solr\/jcg_example_core\/update\/extract?extractOnly=true\" -F \"myfile=@jcg_example_articles.docx\"<\/pre>\n<p>The output would be:<\/p>\n<pre class=\"brush:json\">{\n  \"responseHeader\":{\n    \"status\":0,\n    \"QTime\":59},\n  \"jcg_example_articles.docx\":\"&lt;?xml version=\\\"1.0\\\" encoding=\\\"UTF-8\\\"?&gt;\\n&lt;html xmlns=\\\"http:\/\/www.w3.org\/1999\/xhtml\\\"&gt;\\n&lt;head&gt;\\n&lt;meta name=\\\"date\\\"\\ncontent=\\\"2020-07-18T09:49:00Z\\\"\/&gt;\\n&lt;meta name=\\\"Total-Time\\\"\\ncontent=\\\"8\\\"\/&gt;\\n&lt;meta name=\\\"extended-properties:AppVersion\\\"\\ncontent=\\\"12.0000\\\"\/&gt;\\n&lt;meta name=\\\"stream_content_type\\\"\\n            content=\\\"application\/octet-stream\\\"\/&gt;\\n&lt;meta\\nname=\\\"meta:paragraph-count\\\" content=\\\"1\\\"\/&gt;\\n&lt;meta name=\\\"subject\\\"\\n            content=\\\"articles; kevin yang; examples\\\"\/&gt;\\n&lt;meta\\nname=\\\"Word-Count\\\" content=\\\"103\\\"\/&gt;\\n&lt;meta name=\\\"meta:line-count\\\"\\ncontent=\\\"4\\\"\/&gt;\\n&lt;meta name=\\\"Template\\\" content=\\\"Normal.dotm\\\"\/&gt;\\n&lt;meta\\nname=\\\"Paragraph-Count\\\" content=\\\"1\\\"\/&gt;\\n&lt;meta name=\\\"stream_name\\\"\\n            content=\\\"jcg_example_articles.docx\\\"\/&gt;\\n&lt;meta\\nname=\\\"meta:character-count-with-spaces\\\" content=\\\"694\\\"\/&gt;\\n&lt;meta\\nname=\\\"dc:title\\\" content=\\\"Articles Written By Kevin Yang\\\"\/&gt;\\n&lt;meta\\nname=\\\"modified\\\" content=\\\"2020-07-18T09:49:00Z\\\"\/&gt;\\n&lt;meta\\nname=\\\"meta:author\\\" content=\\\"Kevin Yang\\\"\/&gt;\\n&lt;meta\\nname=\\\"meta:creation-date\\\" content=\\\"2020-07-18T09:41:00Z\\\"\/&gt;\\n&lt;meta\\nname=\\\"extended-properties:Application\\\"\\n   
         content=\\\"Microsoft Office Word\\\"\/&gt;\\n&lt;meta\\nname=\\\"stream_source_info\\\" content=\\\"myfile\\\"\/&gt;\\n&lt;meta name=\\\"Creation-Date\\\"\\n            content=\\\"2020-07-18T09:41:00Z\\\"\/&gt;\\n&lt;meta\\nname=\\\"Character-Count-With-Spaces\\\" content=\\\"694\\\"\/&gt;\\n&lt;meta\\nname=\\\"Last-Author\\\" content=\\\"Kevin Yang\\\"\/&gt;\\n&lt;meta name=\\\"Character Count\\\"\\ncontent=\\\"592\\\"\/&gt;\\n&lt;meta name=\\\"Page-Count\\\" content=\\\"1\\\"\/&gt;\\n&lt;meta\\nname=\\\"Application-Version\\\" content=\\\"12.0000\\\"\/&gt;\\n&lt;meta\\nname=\\\"extended-properties:Template\\\" content=\\\"Normal.dotm\\\"\/&gt;\\n&lt;meta\\nname=\\\"Author\\\" content=\\\"Kevin Yang\\\"\/&gt;\\n&lt;meta name=\\\"publisher\\\"\\ncontent=\\\"Java Code Geeks\\\"\/&gt;\\n&lt;meta name=\\\"meta:page-count\\\"\\ncontent=\\\"1\\\"\/&gt;\\n&lt;meta name=\\\"cp:revision\\\" content=\\\"3\\\"\/&gt;\\n&lt;meta\\nname=\\\"Keywords\\\" content=\\\"articles; kevin yang; examples\\\"\/&gt;\\n&lt;meta\\nname=\\\"Category\\\" content=\\\"example\\\"\/&gt;\\n&lt;meta name=\\\"meta:word-count\\\"\\ncontent=\\\"103\\\"\/&gt;\\n&lt;meta name=\\\"dc:creator\\\" content=\\\"Kevin Yang\\\"\/&gt;\\n&lt;meta\\nname=\\\"extended-properties:Company\\\" content=\\\"Java Code Geeks\\\"\/&gt;\\n&lt;meta\\nname=\\\"dcterms:created\\\" content=\\\"2020-07-18T09:41:00Z\\\"\/&gt;\\n&lt;meta\\nname=\\\"dcterms:modified\\\" content=\\\"2020-07-18T09:49:00Z\\\"\/&gt;\\n&lt;meta\\nname=\\\"Last-Modified\\\" content=\\\"2020-07-18T09:49:00Z\\\"\/&gt;\\n&lt;meta\\nname=\\\"Last-Save-Date\\\" content=\\\"2020-07-18T09:49:00Z\\\"\/&gt;\\n&lt;meta\\nname=\\\"meta:character-count\\\" content=\\\"592\\\"\/&gt;\\n&lt;meta name=\\\"Line-Count\\\"\\ncontent=\\\"4\\\"\/&gt;\\n&lt;meta name=\\\"meta:save-date\\\"\\n            content=\\\"2020-07-18T09:49:00Z\\\"\/&gt;\\n&lt;meta\\nname=\\\"Application-Name\\\" content=\\\"Microsoft Office 
Word\\\"\/&gt;\\n&lt;meta\\nname=\\\"extended-properties:TotalTime\\\" content=\\\"8\\\"\/&gt;\\n&lt;meta\\nname=\\\"Content-Type\\\"\\n            content=\\\"application\/vnd.openxmlformats-officedocument.wordprocessingml.document\\\"\/&gt;\\n&lt;meta\\nname=\\\"stream_size\\\" content=\\\"11162\\\"\/&gt;\\n&lt;meta name=\\\"X-Parsed-By\\\"\\n            content=\\\"org.apache.tika.parser.DefaultParser\\\"\/&gt;\\n&lt;meta\\nname=\\\"X-Parsed-By\\\"\\n            content=\\\"org.apache.tika.parser.microsoft.ooxml.OOXMLParser\\\"\/&gt;\\n&lt;meta\\nname=\\\"creator\\\" content=\\\"Kevin Yang\\\"\/&gt;\\n&lt;meta name=\\\"dc:subject\\\"\\n            content=\\\"articles; kevin yang; examples\\\"\/&gt;\\n&lt;meta\\nname=\\\"meta:last-author\\\" content=\\\"Kevin Yang\\\"\/&gt;\\n&lt;meta\\nname=\\\"xmpTPg:NPages\\\" content=\\\"1\\\"\/&gt;\\n&lt;meta name=\\\"Revision-Number\\\"\\ncontent=\\\"3\\\"\/&gt;\\n&lt;meta name=\\\"meta:keyword\\\"\\n            content=\\\"articles; kevin yang; examples\\\"\/&gt;\\n&lt;meta\\nname=\\\"cp:category\\\" content=\\\"example\\\"\/&gt;\\n&lt;meta name=\\\"dc:publisher\\\" content=\\\"Java Code Geeks\\\"\/&gt;\\n&lt;title&gt;Articles Written By Kevin Yang&lt;\/title&gt;\\n&lt;\/head&gt;\\n&lt;body&gt;\\n&lt;h1 class=\\\"title\\\"&gt;Articles written by Kevin Yang&lt;\/h1&gt;\\n&lt;h1&gt;Apache Solr&lt;\/h1&gt;\\n&lt;p\/&gt;\\n&lt;p&gt;Examples of Apache Solr.&lt;\/p&gt;\\n&lt;p&gt;\\n            &lt;a href=\\\"https:\/\/examples.javacodegeeks.com\/apache-solr-function-query-example\/\\\"&gt;Apache Solr Function Query Example&lt;\/a&gt;\\n&lt;\/p&gt;\\n&lt;p&gt;\\n            &lt;a href=\\\"https:\/\/examples.javacodegeeks.com\/apache-solr-standard-query-parser-example\/\\\"&gt;Apache Solr Standard Query Parser Example&lt;\/a&gt;\\n&lt;\/p&gt;\\n&lt;p&gt;\\n            &lt;a href=\\\"https:\/\/examples.javacodegeeks.com\/apache-solr-fuzzy-search-example\/\\\"&gt;Apache Solr Fuzzy Search 
Example&lt;\/a&gt;\\n&lt;\/p&gt;\\n&lt;p&gt;\\n            &lt;a href=\\\"https:\/\/examples.javacodegeeks.com\/apache-solr-opennlp-tutorial\/\\\"&gt;Apache Solr OpenNLP Tutorial \u9225?Part 1&lt;\/a&gt;\\n&lt;\/p&gt;\\n&lt;p&gt;\\n            &lt;a href=\\\"https:\/\/examples.javacodegeeks.com\/apache-solr-opennlp-tutorial-part-2\/\\\"&gt;Apache Solr OpenNLP Tutorial \u9225?Part 2&lt;\/a&gt;\\n&lt;\/p&gt;\\n&lt;\/body&gt;\\n&lt;\/html&gt;\\n\",\n  \"jcg_example_articles.docx_metadata\":[\n    \"date\",[\"2020-07-18T09:49:00Z\"],\n    \"Total-Time\",[\"8\"],\n    \"extended-properties:AppVersion\",[\"12.0000\"],\n    \"stream_content_type\",[\"application\/octet-stream\"],\n    \"meta:paragraph-count\",[\"1\"],\n    \"subject\",[\"articles; kevin yang; examples\"],\n    \"Word-Count\",[\"103\"],\n    \"meta:line-count\",[\"4\"],\n    \"Template\",[\"Normal.dotm\"],\n    \"Paragraph-Count\",[\"1\"],\n    \"stream_name\",[\"jcg_example_articles.docx\"],\n    \"meta:character-count-with-spaces\",[\"694\"],\n    \"dc:title\",[\"Articles Written By Kevin Yang\"],\n    \"modified\",[\"2020-07-18T09:49:00Z\"],\n    \"meta:author\",[\"Kevin Yang\"],\n    \"meta:creation-date\",[\"2020-07-18T09:41:00Z\"],\n    \"extended-properties:Application\",[\"Microsoft Office Word\"],\n    \"stream_source_info\",[\"myfile\"],\n    \"Creation-Date\",[\"2020-07-18T09:41:00Z\"],\n    \"Character-Count-With-Spaces\",[\"694\"],\n    \"Last-Author\",[\"Kevin Yang\"],\n    \"Character Count\",[\"592\"],\n    \"Page-Count\",[\"1\"],\n    \"Application-Version\",[\"12.0000\"],\n    \"extended-properties:Template\",[\"Normal.dotm\"],\n    \"Author\",[\"Kevin Yang\"],\n    \"publisher\",[\"Java Code Geeks\"],\n    \"meta:page-count\",[\"1\"],\n    \"cp:revision\",[\"3\"],\n    \"Keywords\",[\"articles; kevin yang; examples\"],\n    \"Category\",[\"example\"],\n    \"meta:word-count\",[\"103\"],\n    \"dc:creator\",[\"Kevin Yang\"],\n    \"extended-properties:Company\",[\"Java Code Geeks\"],\n    
\"dcterms:created\",[\"2020-07-18T09:41:00Z\"],\n    \"dcterms:modified\",[\"2020-07-18T09:49:00Z\"],\n    \"Last-Modified\",[\"2020-07-18T09:49:00Z\"],\n    \"title\",[\"Articles Written By Kevin Yang\"],\n    \"Last-Save-Date\",[\"2020-07-18T09:49:00Z\"],\n    \"meta:character-count\",[\"592\"],\n    \"Line-Count\",[\"4\"],\n    \"meta:save-date\",[\"2020-07-18T09:49:00Z\"],\n    \"Application-Name\",[\"Microsoft Office Word\"],\n    \"extended-properties:TotalTime\",[\"8\"],\n    \"Content-Type\",[\"application\/vnd.openxmlformats-officedocument.wordprocessingml.document\"],\n    \"stream_size\",[\"11162\"],\n    \"X-Parsed-By\",[\"org.apache.tika.parser.DefaultParser\",\n      \"org.apache.tika.parser.microsoft.ooxml.OOXMLParser\"],\n    \"creator\",[\"Kevin Yang\"],\n    \"dc:subject\",[\"articles; kevin yang; examples\"],\n    \"meta:last-author\",[\"Kevin Yang\"],\n    \"xmpTPg:NPages\",[\"1\"],\n    \"Revision-Number\",[\"3\"],\n    \"meta:keyword\",[\"articles; kevin yang; examples\"],\n    \"cp:category\",[\"example\"],\n    \"dc:publisher\",[\"Java Code Geeks\"]]}<\/pre>\n<h4 class=\"wp-block-heading\">3.3.2 Verifying The Results<\/h4>\n<p>Now we can execute a query and find that document with a request below.<\/p>\n<pre class=\"brush:bash\">curl -G http:\/\/localhost:8983\/solr\/jcg_example_core\/select --data-urlencode \"q=kevin\"<\/pre>\n<p>The output would be:<\/p>\n<pre class=\"brush:json\">{\n  \"responseHeader\":{\n    \"status\":0,\n    \"QTime\":0,\n    \"params\":{\n      \"q\":\"kevin\"}},\n  \"response\":{\"numFound\":1,\"start\":0,\"docs\":[\n      {\n        \"meta\":[\"date\",\n          \"2020-07-18T09:49:00Z\",\n          \"Total-Time\",\n          \"8\",\n          \"extended-properties:AppVersion\",\n          \"12.0000\",\n          \"stream_content_type\",\n          \"application\/octet-stream\",\n          \"meta:paragraph-count\",\n          \"1\",\n          \"subject\",\n          \"articles; kevin yang; examples\",\n          
\"Word-Count\",\n          \"103\",\n          \"meta:line-count\",\n          \"4\",\n          \"Template\",\n          \"Normal.dotm\",\n          \"Paragraph-Count\",\n          \"1\",\n          \"stream_name\",\n          \"jcg_example_articles.docx\",\n          \"meta:character-count-with-spaces\",\n          \"694\",\n          \"dc:title\",\n          \"Articles Written By Kevin Yang\",\n          \"modified\",\n          \"2020-07-18T09:49:00Z\",\n          \"meta:author\",\n          \"Kevin Yang\",\n          \"meta:creation-date\",\n          \"2020-07-18T09:41:00Z\",\n          \"extended-properties:Application\",\n          \"Microsoft Office Word\",\n          \"stream_source_info\",\n          \"myfile\",\n          \"Creation-Date\",\n          \"2020-07-18T09:41:00Z\",\n          \"Character-Count-With-Spaces\",\n          \"694\",\n          \"Last-Author\",\n          \"Kevin Yang\",\n          \"Character Count\",\n          \"592\",\n          \"Page-Count\",\n          \"1\",\n          \"Application-Version\",\n          \"12.0000\",\n          \"extended-properties:Template\",\n          \"Normal.dotm\",\n          \"Author\",\n          \"Kevin Yang\",\n          \"publisher\",\n          \"Java Code Geeks\",\n          \"meta:page-count\",\n          \"1\",\n          \"cp:revision\",\n          \"3\",\n          \"Keywords\",\n          \"articles; kevin yang; examples\",\n          \"Category\",\n          \"example\",\n          \"meta:word-count\",\n          \"103\",\n          \"dc:creator\",\n          \"Kevin Yang\",\n          \"extended-properties:Company\",\n          \"Java Code Geeks\",\n          \"dcterms:created\",\n          \"2020-07-18T09:41:00Z\",\n          \"dcterms:modified\",\n          \"2020-07-18T09:49:00Z\",\n          \"Last-Modified\",\n          \"2020-07-18T09:49:00Z\",\n          \"Last-Save-Date\",\n          \"2020-07-18T09:49:00Z\",\n          \"meta:character-count\",\n          \"592\",\n          
\"Line-Count\",\n          \"4\",\n          \"meta:save-date\",\n          \"2020-07-18T09:49:00Z\",\n          \"Application-Name\",\n          \"Microsoft Office Word\",\n          \"extended-properties:TotalTime\",\n          \"8\",\n          \"Content-Type\",\n          \"application\/vnd.openxmlformats-officedocument.wordprocessingml.document\",\n          \"stream_size\",\n          \"11162\",\n          \"X-Parsed-By\",\n          \"org.apache.tika.parser.DefaultParser\",\n          \"X-Parsed-By\",\n          \"org.apache.tika.parser.microsoft.ooxml.OOXMLParser\",\n          \"creator\",\n          \"Kevin Yang\",\n          \"dc:subject\",\n          \"articles; kevin yang; examples\",\n          \"meta:last-author\",\n          \"Kevin Yang\",\n          \"xmpTPg:NPages\",\n          \"1\",\n          \"Revision-Number\",\n          \"3\",\n          \"meta:keyword\",\n          \"articles; kevin yang; examples\",\n          \"cp:category\",\n          \"example\",\n          \"dc:publisher\",\n          \"Java Code Geeks\"],\n        \"h1\":[\"title\"],\n        \"links\":[\"https:\/\/examples.javacodegeeks.com\/apache-solr-function-query-example\/\",\n          \"https:\/\/examples.javacodegeeks.com\/apache-solr-standard-query-parser-example\/\",\n          \"https:\/\/examples.javacodegeeks.com\/apache-solr-fuzzy-search-example\/\",\n          \"https:\/\/examples.javacodegeeks.com\/apache-solr-opennlp-tutorial\/\",\n          \"https:\/\/examples.javacodegeeks.com\/apache-solr-opennlp-tutorial-part-2\/\"],\n        \"id\":\"word-doc-1\",\n        \"date\":\"2020-07-18T09:49:00Z\",\n        \"total_time\":8,\n        \"extended_properties_appversion\":12.0,\n        \"stream_content_type\":[\"application\/octet-stream\"],\n        \"meta_paragraph_count\":1,\n        \"subject\":[\"articles; kevin yang; examples\"],\n        \"word_count\":103,\n        \"meta_line_count\":4,\n        \"template\":[\"Normal.dotm\"],\n        \"paragraph_count\":1,\n  
      \"stream_name\":[\"jcg_example_articles.docx\"],\n        \"meta_character_count_with_spaces\":694,\n        \"dc_title\":[\"Articles Written By Kevin Yang\"],\n        \"modified\":\"2020-07-18T09:49:00Z\",\n        \"meta_author\":[\"Kevin Yang\"],\n        \"meta_creation_date\":\"2020-07-18T09:41:00Z\",\n        \"extended_properties_application\":[\"Microsoft Office Word\"],\n        \"stream_source_info\":[\"myfile\"],\n        \"creation_date\":\"2020-07-18T09:41:00Z\",\n        \"character_count_with_spaces\":694,\n        \"last_author\":[\"Kevin Yang\"],\n        \"character_count\":592,\n        \"page_count\":1,\n        \"application_version\":12.0,\n        \"extended_properties_template\":[\"Normal.dotm\"],\n        \"author\":[\"Kevin Yang\"],\n        \"publisher\":[\"Java Code Geeks\"],\n        \"meta_page_count\":1,\n        \"cp_revision\":3,\n        \"keywords\":[\"articles; kevin yang; examples\"],\n        \"category\":[\"example\"],\n        \"meta_word_count\":103,\n        \"dc_creator\":[\"Kevin Yang\"],\n        \"extended_properties_company\":[\"Java Code Geeks\"],\n        \"dcterms_created\":\"2020-07-18T09:41:00Z\",\n        \"dcterms_modified\":\"2020-07-18T09:49:00Z\",\n        \"last_modified\":\"2020-07-18T09:49:00Z\",\n        \"title\":[\"Articles Written By Kevin Yang\"],\n        \"last_save_date\":\"2020-07-18T09:49:00Z\",\n        \"meta_character_count\":592,\n        \"line_count\":4,\n        \"meta_save_date\":\"2020-07-18T09:49:00Z\",\n        \"application_name\":[\"Microsoft Office Word\"],\n        \"extended_properties_totaltime\":8,\n        \"content_type\":[\"application\/vnd.openxmlformats-officedocument.wordprocessingml.document\"],\n        \"stream_size\":11162,\n        \"x_parsed_by\":[\"org.apache.tika.parser.DefaultParser\",\n          \"org.apache.tika.parser.microsoft.ooxml.OOXMLParser\"],\n        \"creator\":[\"Kevin Yang\"],\n        \"dc_subject\":[\"articles; kevin yang; examples\"],\n     
   \"meta_last_author\":[\"Kevin Yang\"],\n        \"xmptpg_npages\":1,\n        \"revision_number\":3,\n        \"meta_keyword\":[\"articles; kevin yang; examples\"],\n        \"cp_category\":[\"example\"],\n        \"dc_publisher\":[\"Java Code Geeks\"],\n        \"_version_\":1672550496610549760}]\n  }}<\/pre>\n<p>We can see that several metadata fields associated with the example document have been extracted. Each of them has its own field created because we are running in <code>schemaless<\/code> mode, which is configured in <code>solrconfig.xml<\/code> by having the <code>add-unknown-fields-to-the-schema<\/code> update request processor chain enabled.<\/p>\n<h4 class=\"wp-block-heading\">3.3.3 A Simplified Example<\/h4>\n<p>Adding a new field for every piece of extracted metadata, as above, may not be desired in your use case; you may only care about a few specific fields that you have defined in your schema. How can we deal with the other extracted fields we don&#8217;t care about? The <code>uprefix<\/code> parameter and the <code>ignored<\/code> field type can be used for this.<\/p>\n<p>Firstly, we can uncomment the following line within the <code>ExtractingRequestHandler<\/code> in <code>solrconfig.xml<\/code>:<\/p>\n<pre class=\"brush:xml\">&lt;str name=\"uprefix\"&gt;ignored_&lt;\/str&gt;<\/pre>\n<p>Then, make sure the <code>ignored<\/code> field type and the <code>ignored_*<\/code> dynamic field are defined in <code>managed-schema<\/code>:<\/p>\n<pre class=\"brush:xml\">&lt;fieldType name=\"ignored\" class=\"solr.StrField\" indexed=\"false\" stored=\"false\" multiValued=\"true\"\/&gt;\n&lt;dynamicField name=\"ignored_*\" type=\"ignored\"\/&gt;<\/pre>\n<p>By doing this, we instruct Solr not to index any unknown fields extracted by Solr Cell. 
To see how these settings work, we need to restart Solr and recreate <code>jcg_example_core<\/code> with the attached configSet <code>jcg_example_configs.zip<\/code>, or with a copy of the <code>_default<\/code> configSet that includes the configuration changes described above. Otherwise, the fields auto-generated in the previous example will remain in the schema. Once finished, we can run the command in section <a href=\"#indexing_data\">3.3.1 Indexing Data<\/a> again to index the example document.<\/p>\n<p>Lastly, run the query below to see the indexed document:<\/p>\n<pre class=\"brush:bash\">curl -G http:\/\/localhost:8983\/solr\/jcg_example_core\/select --data-urlencode \"q=kevin\"<\/pre>\n<p>The output would be:<\/p>\n<pre class=\"brush:json\">{\n  \"responseHeader\":{\n    \"status\":0,\n    \"QTime\":1,\n    \"params\":{\n      \"q\":\"kevin\"}},\n  \"response\":{\"numFound\":1,\"start\":0,\"docs\":[\n      {\n        \"links\":[\"https:\/\/examples.javacodegeeks.com\/apache-solr-function-query-example\/\",\n          \"https:\/\/examples.javacodegeeks.com\/apache-solr-standard-query-parser-example\/\",\n          \"https:\/\/examples.javacodegeeks.com\/apache-solr-fuzzy-search-example\/\",\n          \"https:\/\/examples.javacodegeeks.com\/apache-solr-opennlp-tutorial\/\",\n          \"https:\/\/examples.javacodegeeks.com\/apache-solr-opennlp-tutorial-part-2\/\"],\n        \"id\":\"word-doc-1\",\n        \"author\":\"Kevin Yang\",\n        \"last_modified\":\"2020-07-18T09:49:00Z\",\n        \"_version_\":1672565163665915904}]\n  }}<\/pre>\n<p>We can see from the output above that all link addresses in <code>jcg_example_articles.docx<\/code> have been extracted successfully and added to the <code>links<\/code> field. In addition, both the <code>author<\/code> field and the <code>last_modified<\/code> field have been extracted and added to the index correctly. 
All unknown fields in the indexing document have been ignored and no corresponding field is created.<\/p>\n<h2 class=\"wp-block-heading\"><a name=\"download\"><\/a>4. Download the Sample Data File<\/h2>\n<div class=\"download\"><strong>Download<\/strong><br \/>\nYou can download the sample data file of this example here: <a href=\"https:\/\/examples.javacodegeeks.com\/wp-content\/uploads\/2020\/07\/apache-solr-apache-tika-integration-tutorial.zip\"><strong>Apache Solr and Apache Tika Integration Tutorial<\/strong><\/a><\/div>\n","protected":false},"excerpt":{"rendered":"<p>This article is a tutorial about Apache Solr and Apache Tika Integration. 1. Introduction A Solr index can accept data from many different sources, such as CSV, XML, databases and common binary files. If the data to be indexed is in binary format, such as WORD, PPT, XLS, and PDF, the Solr Content Extraction Library &hellip;<\/p>\n","protected":false},"author":223,"featured_media":25294,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[949],"tags":[946,46169,46170,1226],"class_list":["post-92691","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-apache-solr","tag-apache-solr","tag-apache-tika","tag-solr-cell","tag-tutorial"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v26.5 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Apache Solr &amp; Apache Tika Integration - Examples Java Code Geeks<\/title>\n<meta name=\"description\" content=\"This article is a tutorial about Apache Solr and Apache Tika Integration. 1. 
Introduction A Solr index can accept data from many different sources, such\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/examples.javacodegeeks.com\/apache-solr-and-apache-tika-integration-tutorial\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Apache Solr &amp; Apache Tika Integration - Examples Java Code Geeks\" \/>\n<meta property=\"og:description\" content=\"This article is a tutorial about Apache Solr and Apache Tika Integration. 1. Introduction A Solr index can accept data from many different sources, such\" \/>\n<meta property=\"og:url\" content=\"https:\/\/examples.javacodegeeks.com\/apache-solr-and-apache-tika-integration-tutorial\/\" \/>\n<meta property=\"og:site_name\" content=\"Examples Java Code Geeks\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/javacodegeeks\" \/>\n<meta property=\"article:published_time\" content=\"2020-07-27T08:00:00+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/examples.javacodegeeks.com\/wp-content\/uploads\/2015\/07\/apache-solr-logo.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"150\" \/>\n\t<meta property=\"og:image:height\" content=\"150\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"Kevin Yang\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@javacodegeeks\" \/>\n<meta name=\"twitter:site\" content=\"@javacodegeeks\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Kevin Yang\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"17 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/examples.javacodegeeks.com\/apache-solr-and-apache-tika-integration-tutorial\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/examples.javacodegeeks.com\/apache-solr-and-apache-tika-integration-tutorial\/\"},\"author\":{\"name\":\"Kevin Yang\",\"@id\":\"https:\/\/examples.javacodegeeks.com\/#\/schema\/person\/3f6ff013b8204dc7f5e6d2660fbc9f8f\"},\"headline\":\"Apache Solr and Apache Tika Integration Tutorial\",\"datePublished\":\"2020-07-27T08:00:00+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/examples.javacodegeeks.com\/apache-solr-and-apache-tika-integration-tutorial\/\"},\"wordCount\":1435,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/examples.javacodegeeks.com\/#organization\"},\"image\":{\"@id\":\"https:\/\/examples.javacodegeeks.com\/apache-solr-and-apache-tika-integration-tutorial\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/examples.javacodegeeks.com\/wp-content\/uploads\/2015\/07\/apache-solr-logo.jpg\",\"keywords\":[\"Apache Solr\",\"Apache Tika\",\"Solr Cell\",\"tutorial\"],\"articleSection\":[\"Apache Solr\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/examples.javacodegeeks.com\/apache-solr-and-apache-tika-integration-tutorial\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/examples.javacodegeeks.com\/apache-solr-and-apache-tika-integration-tutorial\/\",\"url\":\"https:\/\/examples.javacodegeeks.com\/apache-solr-and-apache-tika-integration-tutorial\/\",\"name\":\"Apache Solr & Apache Tika Integration - Examples Java Code 
Geeks\",\"isPartOf\":{\"@id\":\"https:\/\/examples.javacodegeeks.com\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/examples.javacodegeeks.com\/apache-solr-and-apache-tika-integration-tutorial\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/examples.javacodegeeks.com\/apache-solr-and-apache-tika-integration-tutorial\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/examples.javacodegeeks.com\/wp-content\/uploads\/2015\/07\/apache-solr-logo.jpg\",\"datePublished\":\"2020-07-27T08:00:00+00:00\",\"description\":\"This article is a tutorial about Apache Solr and Apache Tika Integration. 1. Introduction A Solr index can accept data from many different sources, such\",\"breadcrumb\":{\"@id\":\"https:\/\/examples.javacodegeeks.com\/apache-solr-and-apache-tika-integration-tutorial\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/examples.javacodegeeks.com\/apache-solr-and-apache-tika-integration-tutorial\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/examples.javacodegeeks.com\/apache-solr-and-apache-tika-integration-tutorial\/#primaryimage\",\"url\":\"https:\/\/examples.javacodegeeks.com\/wp-content\/uploads\/2015\/07\/apache-solr-logo.jpg\",\"contentUrl\":\"https:\/\/examples.javacodegeeks.com\/wp-content\/uploads\/2015\/07\/apache-solr-logo.jpg\",\"width\":150,\"height\":150},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/examples.javacodegeeks.com\/apache-solr-and-apache-tika-integration-tutorial\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/examples.javacodegeeks.com\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Java Development\",\"item\":\"https:\/\/examples.javacodegeeks.com\/category\/java-development\/\"},{\"@type\":\"ListItem\",\"position\":3,\"name\":\"Enterprise 
Java\",\"item\":\"https:\/\/examples.javacodegeeks.com\/category\/java-development\/enterprise-java\/\"},{\"@type\":\"ListItem\",\"position\":4,\"name\":\"Apache Solr\",\"item\":\"https:\/\/examples.javacodegeeks.com\/category\/java-development\/enterprise-java\/apache-solr\/\"},{\"@type\":\"ListItem\",\"position\":5,\"name\":\"Apache Solr and Apache Tika Integration Tutorial\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/examples.javacodegeeks.com\/#website\",\"url\":\"https:\/\/examples.javacodegeeks.com\/\",\"name\":\"Java Code Geeks\",\"description\":\"Java Examples and Code Snippets\",\"publisher\":{\"@id\":\"https:\/\/examples.javacodegeeks.com\/#organization\"},\"alternateName\":\"JCG\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/examples.javacodegeeks.com\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/examples.javacodegeeks.com\/#organization\",\"name\":\"Exelixis Media P.C.\",\"url\":\"https:\/\/examples.javacodegeeks.com\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/examples.javacodegeeks.com\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/examples.javacodegeeks.com\/wp-content\/uploads\/2022\/06\/exelixis-logo.png\",\"contentUrl\":\"https:\/\/examples.javacodegeeks.com\/wp-content\/uploads\/2022\/06\/exelixis-logo.png\",\"width\":864,\"height\":246,\"caption\":\"Exelixis Media P.C.\"},\"image\":{\"@id\":\"https:\/\/examples.javacodegeeks.com\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/javacodegeeks\",\"https:\/\/x.com\/javacodegeeks\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/examples.javacodegeeks.com\/#\/schema\/person\/3f6ff013b8204dc7f5e6d2660fbc9f8f\",\"name\":\"Kevin 
Yang\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/examples.javacodegeeks.com\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/2efb55f26af9d8752be93a78f2cdd9b2529df1f087c7b8901b68dbe11b7cf5ee?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/2efb55f26af9d8752be93a78f2cdd9b2529df1f087c7b8901b68dbe11b7cf5ee?s=96&d=mm&r=g\",\"caption\":\"Kevin Yang\"},\"description\":\"A software design and development professional with seventeen years\u2019 experience in the IT industry, especially with Java EE and .NET, I have worked for software companies, scientific research institutes and websites.\",\"sameAs\":[\"https:\/\/www.linkedin.com\/in\/kevinyang2050\/\"],\"url\":\"https:\/\/examples.javacodegeeks.com\/author\/kevin-yang\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Apache Solr & Apache Tika Integration - Examples Java Code Geeks","description":"This article is a tutorial about Apache Solr and Apache Tika Integration. 1. Introduction A Solr index can accept data from many different sources, such","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/examples.javacodegeeks.com\/apache-solr-and-apache-tika-integration-tutorial\/","og_locale":"en_US","og_type":"article","og_title":"Apache Solr & Apache Tika Integration - Examples Java Code Geeks","og_description":"This article is a tutorial about Apache Solr and Apache Tika Integration. 1. 
Introduction A Solr index can accept data from many different sources, such","og_url":"https:\/\/examples.javacodegeeks.com\/apache-solr-and-apache-tika-integration-tutorial\/","og_site_name":"Examples Java Code Geeks","article_publisher":"https:\/\/www.facebook.com\/javacodegeeks","article_published_time":"2020-07-27T08:00:00+00:00","og_image":[{"width":150,"height":150,"url":"https:\/\/examples.javacodegeeks.com\/wp-content\/uploads\/2015\/07\/apache-solr-logo.jpg","type":"image\/jpeg"}],"author":"Kevin Yang","twitter_card":"summary_large_image","twitter_creator":"@javacodegeeks","twitter_site":"@javacodegeeks","twitter_misc":{"Written by":"Kevin Yang","Est. reading time":"17 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/examples.javacodegeeks.com\/apache-solr-and-apache-tika-integration-tutorial\/#article","isPartOf":{"@id":"https:\/\/examples.javacodegeeks.com\/apache-solr-and-apache-tika-integration-tutorial\/"},"author":{"name":"Kevin Yang","@id":"https:\/\/examples.javacodegeeks.com\/#\/schema\/person\/3f6ff013b8204dc7f5e6d2660fbc9f8f"},"headline":"Apache Solr and Apache Tika Integration Tutorial","datePublished":"2020-07-27T08:00:00+00:00","mainEntityOfPage":{"@id":"https:\/\/examples.javacodegeeks.com\/apache-solr-and-apache-tika-integration-tutorial\/"},"wordCount":1435,"commentCount":0,"publisher":{"@id":"https:\/\/examples.javacodegeeks.com\/#organization"},"image":{"@id":"https:\/\/examples.javacodegeeks.com\/apache-solr-and-apache-tika-integration-tutorial\/#primaryimage"},"thumbnailUrl":"https:\/\/examples.javacodegeeks.com\/wp-content\/uploads\/2015\/07\/apache-solr-logo.jpg","keywords":["Apache Solr","Apache Tika","Solr Cell","tutorial"],"articleSection":["Apache 
Solr"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/examples.javacodegeeks.com\/apache-solr-and-apache-tika-integration-tutorial\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/examples.javacodegeeks.com\/apache-solr-and-apache-tika-integration-tutorial\/","url":"https:\/\/examples.javacodegeeks.com\/apache-solr-and-apache-tika-integration-tutorial\/","name":"Apache Solr & Apache Tika Integration - Examples Java Code Geeks","isPartOf":{"@id":"https:\/\/examples.javacodegeeks.com\/#website"},"primaryImageOfPage":{"@id":"https:\/\/examples.javacodegeeks.com\/apache-solr-and-apache-tika-integration-tutorial\/#primaryimage"},"image":{"@id":"https:\/\/examples.javacodegeeks.com\/apache-solr-and-apache-tika-integration-tutorial\/#primaryimage"},"thumbnailUrl":"https:\/\/examples.javacodegeeks.com\/wp-content\/uploads\/2015\/07\/apache-solr-logo.jpg","datePublished":"2020-07-27T08:00:00+00:00","description":"This article is a tutorial about Apache Solr and Apache Tika Integration. 1. 
Introduction A Solr index can accept data from many different sources, such","breadcrumb":{"@id":"https:\/\/examples.javacodegeeks.com\/apache-solr-and-apache-tika-integration-tutorial\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/examples.javacodegeeks.com\/apache-solr-and-apache-tika-integration-tutorial\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/examples.javacodegeeks.com\/apache-solr-and-apache-tika-integration-tutorial\/#primaryimage","url":"https:\/\/examples.javacodegeeks.com\/wp-content\/uploads\/2015\/07\/apache-solr-logo.jpg","contentUrl":"https:\/\/examples.javacodegeeks.com\/wp-content\/uploads\/2015\/07\/apache-solr-logo.jpg","width":150,"height":150},{"@type":"BreadcrumbList","@id":"https:\/\/examples.javacodegeeks.com\/apache-solr-and-apache-tika-integration-tutorial\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/examples.javacodegeeks.com\/"},{"@type":"ListItem","position":2,"name":"Java Development","item":"https:\/\/examples.javacodegeeks.com\/category\/java-development\/"},{"@type":"ListItem","position":3,"name":"Enterprise Java","item":"https:\/\/examples.javacodegeeks.com\/category\/java-development\/enterprise-java\/"},{"@type":"ListItem","position":4,"name":"Apache Solr","item":"https:\/\/examples.javacodegeeks.com\/category\/java-development\/enterprise-java\/apache-solr\/"},{"@type":"ListItem","position":5,"name":"Apache Solr and Apache Tika Integration Tutorial"}]},{"@type":"WebSite","@id":"https:\/\/examples.javacodegeeks.com\/#website","url":"https:\/\/examples.javacodegeeks.com\/","name":"Java Code Geeks","description":"Java Examples and Code 
Snippets","publisher":{"@id":"https:\/\/examples.javacodegeeks.com\/#organization"},"alternateName":"JCG","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/examples.javacodegeeks.com\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/examples.javacodegeeks.com\/#organization","name":"Exelixis Media P.C.","url":"https:\/\/examples.javacodegeeks.com\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/examples.javacodegeeks.com\/#\/schema\/logo\/image\/","url":"https:\/\/examples.javacodegeeks.com\/wp-content\/uploads\/2022\/06\/exelixis-logo.png","contentUrl":"https:\/\/examples.javacodegeeks.com\/wp-content\/uploads\/2022\/06\/exelixis-logo.png","width":864,"height":246,"caption":"Exelixis Media P.C."},"image":{"@id":"https:\/\/examples.javacodegeeks.com\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/javacodegeeks","https:\/\/x.com\/javacodegeeks"]},{"@type":"Person","@id":"https:\/\/examples.javacodegeeks.com\/#\/schema\/person\/3f6ff013b8204dc7f5e6d2660fbc9f8f","name":"Kevin Yang","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/examples.javacodegeeks.com\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/2efb55f26af9d8752be93a78f2cdd9b2529df1f087c7b8901b68dbe11b7cf5ee?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/2efb55f26af9d8752be93a78f2cdd9b2529df1f087c7b8901b68dbe11b7cf5ee?s=96&d=mm&r=g","caption":"Kevin Yang"},"description":"A software design and development professional with seventeen years\u2019 experience in the IT industry, especially with Java EE and .NET, I have worked for software companies, scientific research institutes and 
websites.","sameAs":["https:\/\/www.linkedin.com\/in\/kevinyang2050\/"],"url":"https:\/\/examples.javacodegeeks.com\/author\/kevin-yang\/"}]}},"_links":{"self":[{"href":"https:\/\/examples.javacodegeeks.com\/wp-json\/wp\/v2\/posts\/92691","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/examples.javacodegeeks.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/examples.javacodegeeks.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/examples.javacodegeeks.com\/wp-json\/wp\/v2\/users\/223"}],"replies":[{"embeddable":true,"href":"https:\/\/examples.javacodegeeks.com\/wp-json\/wp\/v2\/comments?post=92691"}],"version-history":[{"count":0,"href":"https:\/\/examples.javacodegeeks.com\/wp-json\/wp\/v2\/posts\/92691\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/examples.javacodegeeks.com\/wp-json\/wp\/v2\/media\/25294"}],"wp:attachment":[{"href":"https:\/\/examples.javacodegeeks.com\/wp-json\/wp\/v2\/media?parent=92691"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/examples.javacodegeeks.com\/wp-json\/wp\/v2\/categories?post=92691"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/examples.javacodegeeks.com\/wp-json\/wp\/v2\/tags?post=92691"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}