Solr for Newbies
Tutorial / workshop
Hector Correa
[email protected]
http://hectorcorrea.com/solr-for-newbies
This work is licensed under the Creative Commons Attribution 4.0 International
License. To view a copy of this license, visit
http://creativecommons.org/licenses/by/4.0/.
What is Solr
Solr is an open source search engine developed by the Apache Software
Foundation. On its home page Solr advertises itself as a blazing-fast, open
source enterprise search platform built on Apache Lucene.
The fact that Solr is a search engine means that there is a strong focus on
speed, large volumes of text data, and the ability to sort the results by
relevance.
What is Lucene
The core functionality that Solr makes available is provided by a Java library
called Lucene. Lucene is the brain behind the “indexing and search technology,
as well as spellchecking, hit highlighting and advanced analysis/tokenization
capabilities” that we will see in this tutorial.
But Lucene is a Java library that can only be used from other Java programs.
Solr on the other hand is a wrapper around Lucene that allows us to use the
Lucene functionality from any programming language that can submit HTTP
requests.
NOTE: You can also download and install the Solr binaries directly on your
machine without using Docker. You’ll need to have the Java Development Kit
(JDK) installed for this method to work. If you are interested in this approach take a
look at these instructions instead.
Once installed, run the following command from the terminal to make sure Docker is
running:
$ docker ps
#
# You'll see something like this
# CONTAINER ID   IMAGE   COMMAND   CREATED   STATUS   PORTS   NAMES
If Docker is not running we’ll see an error along the lines of “Cannot connect to
the Docker daemon. Is the docker daemon running?”
Once Docker has been installed and it’s up and running we can create a
container to host Solr 9.1.0 with the following command:
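The exact command is not reproduced here, but based on the container name, port
mapping, and image referenced below it should look something like this:

$ docker run -d -p 8983:8983 --name solr-container solr:9.1.0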
#
# You'll see something like this...
#
# Unable to find image 'solr:9.1.0' locally
# 9.1.0: Pulling from library/solr
# 846c0b181fff: Pull complete
# ...
# fc8f2125142b: Pull complete
# Digest: sha256:971cd7a5c682390f8b1541ef74a8fd64d56c6a36e5c0849f6b48210a47b16fa2
# Status: Downloaded newer image for solr:9.1.0
#
47e8cd4d281db5a19e7bfc98ee02ca73e19af66e392e5d8d3532938af5a76e96
The parameter -d in the previous command tells Docker to run the container in
the background (i.e. detached) and the parameter -p 8983:8983 tells Docker to
forward calls to our local port 8983 to port 8983 on the container.
We can check that the new container is running with the following command:
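$ docker ps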
#
# You'll see something like this...
#
# CONTAINER ID   IMAGE        COMMAND                  CREATED         STATUS         PORTS                                       NAMES
# 47e8cd4d281d   solr:9.1.0   "docker-entrypoint.s…"   2 minutes ago   Up 2 minutes   0.0.0.0:8983->8983/tcp, :::8983->8983/tcp   solr-container
Notice that now we have a container NAMED solr-container using the IMAGE
solr:9.1.0. We can check the status of Solr with the following command:
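The original command is not shown in this copy; something along these lines should
work, since the Solr scripts are on the PATH inside the container:

$ docker exec -it solr-container solr status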
For our purposes, let’s create a core named bibdata with the following
command:
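The exact command is not included here; with the Docker setup above it should be
something like:

$ docker exec -it solr-container solr create -c bibdata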
#
# WARNING: Using _default configset with data driven schema functionality. NOT RECOMMENDED for production use.
# To turn off: bin/solr config -c bibdata -p 8983 -action set-user-property -property update.autoCreateFields -value false
#
# Created new core 'bibdata'
Now that our core has been created we can query it with the following
command:
$ curl 'http://localhost:8983/solr/bibdata/select?q=*'
#
# {
# "responseHeader":{
# "status":0,
# "QTime":0,
# "params":{
# "q":"*"}},
# "response":
{"numFound":0,"start":0,"numFoundExact":true,"docs":[]
# }}
and we’ll see "numFound":0 indicating that there are no documents in it. We can
also point our browser to http://localhost:8983/solr/#/bibdata/query and click
the “Execute Query” button at the bottom of the page to see the same result.
Now let’s add a few documents to our bibdata core. First, download this sample
data file:
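The download URL is not reproduced in this copy; the step looks something like the
following, with the URL pointing to the books.json file in the tutorial's repository
(the one below is only a placeholder):

$ curl -O 'http://example.com/path/to/books.json'   # placeholder URL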
#
# You'll see something like this...
#   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
#                                  Dload  Upload   Total   Spent    Left  Speed
# 100  1998  100  1998    0     0   5561      0 --:--:-- --:--:-- --:--:--  5581
#
#
File books.json contains a small sample data set with information about a few
thousand books. We can take a look at it with something like head books.json or
using the text editor of our choice. Below is an example of one of the books in
this file:
{
  "id": "00008027",
  "author_txt_en": "Patent, Dorothy Hinshaw.",
  "authors_other_txts_en": [
    "Muñoz, William,"
  ],
  "title_txt_en": "Horses /",
  "responsibility_txt_en": "by Dorothy Hinshaw Patent ; photographs by William Muñoz.",
  "publisher_place_s": "Minneapolis, Minn. :",
  "publisher_name_s": "Lerner Publications,",
  "publisher_date_s": "c2001.",
  "subjects_ss": [
    "Horses",
    "Horses"
  ],
  "subjects_form_ss": [
    "Juvenile literature"
  ]
},
To import this data into Solr we’ll first copy the file to the Docker container and
then post it to our bibdata core.
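The exact commands are not shown here; with the container name used throughout this
tutorial they should look something like this:

$ docker cp books.json solr-container:/opt/solr/books.json
$ docker exec -it solr-container post -c bibdata books.json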
#
# /opt/java/openjdk/bin/java -classpath /opt/solr/server/solr-webapp/webapp/WEB-INF/lib/solr-core-9.1.0.jar ...
# SimplePostTool version 5.0.0
# Posting files to [base] url http://localhost:8983/solr/bibdata/update...
# POSTing file books.json (application/json) to [base]/json/docs
# 1 files indexed.
# COMMITting Solr index changes to http://localhost:8983/solr/bibdata/update...
# Time spent: 0:00:01.951
$ curl 'http://localhost:8983/solr/bibdata/select?q=*'
#
# {
# "responseHeader":{
# "status":0,
# "QTime":0,
# "params":{
# "q":"*"}},
# "response":
{"numFound":30424,"start":0,"numFoundExact":true,"docs":[
# {
# ...the information for the first 10 documents will be
displayed here..
#
If you look at the content of the books.json file that we imported into our
bibdata core you’ll notice that the documents have the following fields: id,
author_txt_en, authors_other_txts_en, title_txt_en, responsibility_txt_en,
publisher_place_s, publisher_name_s, publisher_date_s, subjects_ss, and
subjects_form_ss.
The suffix added to each field (e.g. _txt_en) is a hint for Solr to pick the
appropriate field type for each field as it ingests the data. We will look closely
into this in a later section.
Fetching data
To fetch data from Solr we make an HTTP request to the select handler. For
example:
$ curl 'http://localhost:8983/solr/bibdata/select?q=*'
There are many parameters that we can pass to this handler to define which
documents we want to fetch and which of their fields to return.
We can use the fl parameter to indicate what fields we want to fetch. For
example to request the id and the title_txt_en of the documents we would use
fl=id,title_txt_en as in the following example:
$ curl 'http://localhost:8983/solr/bibdata/select?q=*&fl=id,title_txt_en'
Note: When issuing the commands via cURL (as in the previous example) make
sure that the fields are separated by a comma without any spaces in between
them. In other words make sure the URL says fl=id,title_txt_en and not fl=id,
title_txt_en. If the parameter includes spaces Solr will not return any results
and give you a cryptic error message instead.
In the previous examples you might have seen an inconspicuous q=* parameter
in the URL. The q (query) parameter tells Solr what documents to retrieve. This
is somewhat similar to the WHERE clause in a SQL SELECT query.
If we want to retrieve all the documents we can just pass q=*. But if we want to
filter the results we can use the following syntax: q=field:value to filter
documents where a specific field has a particular value. For example, to include
only documents where the title_txt_en has the word “teachers” we would use
q=title_txt_en:teachers:
$ curl 'http://localhost:8983/solr/bibdata/select?q=title_txt_en:teachers'
We can also query more than one field at once, for example title_txt_en:teachers
and author_txt_en:Alice:
$ curl 'http://localhost:8983/solr/bibdata/select?q=title_txt_en:teachers+author_txt_en:Alice'
As we saw in the previous example, by default, Solr searches for either of the
terms. If we want to force that both conditions are matched we must explicitly
use the AND operator in the q value as in q=title_txt_en:teachers AND
author_txt_en:Alice Notice that the AND operator must be in uppercase.
$ curl 'http://localhost:8983/solr/bibdata/select?q=title_txt_en:teachers+AND+author_txt_en:Alice'
Now let’s try something else. Let’s issue a search for books where the title says
“art history” using q=title_txt_en:"art history" (make sure the text “art
history” is in quotes):
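The command is not reproduced in this copy; it follows the same pattern as the
proximity example shown a little further down:

$ curl 'http://localhost:8983/solr/bibdata/select?fl=id,title_txt_en&q=title_txt_en:"art+history"'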
Notice how all three results have the term “art history” somewhere on the title.
Now let’s issue a slightly different query using q=title_txt_en:"art history"~3 to
indicate that we want the words “art” and “history” to be present in the
title_txt_en but they can be a few words apart (notice the ~3):
$ curl 'http://localhost:8983/solr/bibdata/select?fl=id,title_txt_en&q=title_txt_en:"art+history"~3'
The result for this query will include a few more books (notice that numFound is
now 10 instead of 6). These new books include the words “art” and “history” but
they don’t have to be exactly next to each other; as long as they are close to
each other they are considered a match (the ~3 in our query asks for an “edit
distance of 3”).
When searching for multi-word keywords in a given field make sure the keywords
are surrounded by quotes, for example make sure to use q=title_txt_en:"art
history" and not q=title_txt_en:art history. The latter will execute a search for
“art” in the title_txt_en field and “history” in the _text_ field.
You can validate this by running the query and passing the debug flag and
seeing the parsedquery value. For example in the following command we
surround both search terms in quotes:
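A sketch of that command, mirroring the unquoted version shown further below:

$ curl -s 'http://localhost:8983/solr/bibdata/select?debugQuery=on&q=title_txt_en:"art+history"' | grep parsedquery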
#
# "parsedquery":"PhraseQuery(title_txt_en:\"art histori\")",
#
notice that the parsedQuery shows that Solr is searching for, as we would expect,
both words in the title_txt_en field.
Now let’s look at the parsedQuery when we don’t surround the search terms in
quotes:
$ curl -s 'http://localhost:8983/solr/bibdata/select?debugQuery=on&q=title_txt_en:art+history' | grep parsedquery
#
# "parsedquery":"title_txt_en:art _text_:history",
#
notice that Solr searched for the word “art” in the title_txt_en field but
searched for the word “history” on the _text_ field. Certainly not what we were
expecting. We’ll elaborate in a later section on the significance of the _text_
field but for now make sure to surround in quotes the search terms when
issuing multi word searches.
One last thing to notice is that Solr returns results paginated; by default it
returns the first 10 documents that match the query. We’ll see later in this
tutorial how we can request a larger page size (via the rows parameter) or
another page (via the start parameter). But for now just notice that at the top
of the results Solr always tells us the total number of results found:
$ curl 'http://localhost:8983/solr/bibdata/select?q=title_txt_en:education&fl=id,title_txt_en'
#
# response will include
# "response":{"numFound":340,"start":0,"docs":[
#
Getting facets
When we issue a search, Solr is able to return facet information about the data
in our core. This is a built-in feature of Solr and easy to use; we just need to
pass the facet=on and facet.field parameters to our query:
$ curl 'http://localhost:8983/solr/bibdata/select?q=title_txt_en:education&facet=on&facet.field=subjects_ss'
Updating documents
To update a document in Solr we have two options. The most common option is
to post the data for that document again to Solr and let Solr overwrite the old
document with the new data. The key for this to work is to provide the same ID
in the new data as the ID of an existing document.
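For example, let's look at the current data for the document with ID 00007345:

$ curl 'http://localhost:8983/solr/bibdata/select?q=id:00007345'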
# "response":
{"numFound":1,"start":0,"numFoundExact":true,"docs":[
# {
# "id":"00007345",
# "authors_other_txts_en":["Giannakis, Georgios B."],
# "title_txt_en":"Signal processing advances in wireless
and mobile communications /",
# "responsibility_txt_en":"edited by G.B. Giannakis ...
[et al.].",
# "publisher_place_s":"Upper Saddle River, NJ :",
# "publisher_name_s":"Prentice Hall PTR,",
# "publisher_date_s":"c2001.",
# "subjects_ss":["Signal processing", "Wireless
communication systems"],
# "_version_":1755414312334131200
# }
#
If we post to Solr a new document with the same ID Solr will overwrite the
existing document with the new data. Below is an example of how to update
this document with new JSON data using curl to post the data to Solr. Notice
that the command is issued against the update endpoint rather than the select
endpoint we used in our previous commands.
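A sketch of that update (the new document carries only the id and a new title,
matching the result shown below):

$ curl -X POST 'http://localhost:8983/solr/bibdata/update?commit=true' \
  -H 'Content-type: application/json' \
  --data-binary '[{"id":"00007345","title_txt_en":"the new title"}]'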
Out of the box Solr supports multiple input formats (JSON, XML, CSV); the section
Uploading Data with Index Handlers in the Solr guide provides more details about
this.
If we query for the document with ID 00007345 again we will see the new data
and notice that the fields that we did not provide during the update are now
gone from the document. That’s because Solr overwrote the old document with
ID 00007345 with our new data that included only two fields (id and
title_txt_en).
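$ curl 'http://localhost:8983/solr/bibdata/select?q=id:00007345'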
# "response":{"numFound":1,"start":0,"docs":[
# {
# "id":"00007345",
# "title_txt_en":"the new title",
# }]}
#
The second option is an “atomic update”, in which we post only the fields that we
want to change and Solr keeps the rest of the document intact. For example, let’s
look at the current title of the document with ID 00007450:
$ curl 'http://localhost:8983/solr/bibdata/select?q=id:00007450'
#
# "title_txt_en":"Principles of fluid mechanics /",
#
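A hedged sketch of the atomic update (Solr's atomic-update syntax uses a {"set": ...}
modifier; the exact values used in the original tutorial may differ). If we fetch the
document again afterwards we'll see the new title while the other fields remain:

$ curl -X POST 'http://localhost:8983/solr/bibdata/update?commit=true' \
  -H 'Content-type: application/json' \
  --data-binary '[{"id":"00007450","title_txt_en":{"set":"the new title for 00007450"}}]'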
$ curl 'http://localhost:8983/solr/bibdata/select?q=id:00007450'
#
# title will say "the new title for 00007450"
# and the rest of the fields will remain unchanged
#
...
Deleting documents
To delete documents from the bibdata core we also use the update endpoint but
the structure of the command is as follows:
$ curl 'http://localhost:8983/solr/bibdata/update?commit=true' --data '<delete><query>id:00008056</query></delete>'
We can also pass a less specific query like title_txt_en:teachers to delete all
documents where the title includes the word “teachers” (or a variation of it). Or
we can delete all documents with a query like *:*.
Be aware that even if you delete all documents from a Solr core the schema
and the core’s configuration will remain intact. For example, the fields that were
defined are still available in the schema even if no documents exist in the core
anymore.
If you want to delete the entire core (documents, schema, and other
configuration associated with it) you can use the Solr delete command instead:
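With our Docker setup, something along these lines should work:

$ docker exec -it solr-container solr delete -c bibdata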
You will need to re-create the core if you want to re-import data to it.
In earlier versions of Solr documents were self-contained and did not support
nested documents. Starting with version 8 Solr provides support for nested
documents. This tutorial does not cover nested documents.
Inverted index
Search engines like Solr use a data structure called inverted index to support
fast retrieval of documents even with complex query expression on large
datasets. The basic idea of an inverted index is to use the terms inside a
document as the key of the index rather than the document’s ID as the key.
Let’s illustrate this with an example. Suppose we have three books that we want
to index. With a traditional index we would create something like this:
ID TITLE
-- ------------------------------
1 Princeton guide for dog owners
2 Princeton tour guide
3 Cats and dogs
With an inverted index Solr would take each of the words in the title of our
books and use those words as the index key:
KEY DOCUMENT ID
--------- -----------
princeton 1, 2
owners 1
dogs 1, 3
guide 1, 2
tour 2
cats 3
Field Types are the building blocks to define fields in our schema. Examples of
field types are: binary, boolean, pfloat, string, text_general, and text_en. These
are similar to the field types that are supported in a relational database like
MySQL but, as we will see later, they are far more configurable than what you
can do in a relational database.
There are three kinds of fields that can be defined in a Solr schema:
Fields are the specific fields that you define for your particular core. Fields
are based on a field type, for example, we might define field title based on
the string field type, description based on the text field type, and price
based on the pfloat field type.
dynamicFields are field definitions that apply to any field whose name matches
a given pattern (e.g. *_txt_en). They let Solr handle fields that are not
explicitly defined in the schema, as we will see later in this tutorial.
copyFields are instructions to tell Solr how to automatically copy the value
given for one field to another field. This is useful if we want to perform
different transformations to the values as we ingest them. For example, we
might want to remove punctuation characters for searching but preserve
them for display purposes.
Our newly created bibdata core already has a schema and you can view the
definition through the Solr Admin web page via the Schema Browser Screen at
http://localhost:8983/solr/#/bibdata/schema or by exploring the managed-schema
file via the Files Screen.
You can also view this information with the Schema API as shown in the
following example. The (rather long) response will be organized in four
categories: fieldTypes, fields, dynamicFields, and copyFields as shown below:
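A sketch of that request (the Schema API is exposed under the core's /schema
endpoint):

$ curl localhost:8983/solr/bibdata/schema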
# {
# "responseHeader": {"status": 0, "QTime": 2},
# "schema": {
# "fieldTypes": [lots of field types defined],
#
# "fields": [lots of fields defined],
#
# "dynamicFields":[lots of dynamic fields defined],
#
# "copyFields": [a few copy fields defined]
# }
# }
#
The advantage of the Schema API is that it allows you to view and update the
information programmatically, which is useful if you need to recreate identical Solr
cores without manually configuring each field definition (e.g. development vs
production).
You can request information about each of these categories individually in the
Schema API with the following commands (notice that combined words like
fieldTypes and dynamicFields are not capitalized in the URLs below):
$ curl localhost:8983/solr/bibdata/schema/fieldtypes
$ curl localhost:8983/solr/bibdata/schema/fields
$ curl localhost:8983/solr/bibdata/schema/dynamicfields
$ curl localhost:8983/solr/bibdata/schema/copyfields
Notice that unlike a relational database, where only a handful of field types are
available to choose from (e.g. integer, date, boolean, char, and varchar), in Solr
there are lots of predefined field types available out of the box, each of them
with its own configuration.
Note for Solr 4.x users: In Solr 4 the default mechanism to update the
schema was by editing the file schema.xml. Starting in Solr 5 the default
mechanism is through the “Managed Schema Definition” which uses the
Schema API to add, edit, and remove fields. There is now a managed-schema file
with the same information as schema.xml but you are not supposed to edit this
file. See section “Managed Schema Definition in SolrConfig” in the Solr
Reference Guide 5.0 (PDF) for more information about this.
Solr automatically created most of these fields when we imported the data from
the books.json file. If you look at a few of the elements in the books.json file
you’ll recognize that they match most of the fields defined in our schema. Below
is the data for one of the records in our sample data:
{
"id":"00007345",
"authors_other_txts_en":["Giannakis, Georgios B."],
"title_txt_en":"Signal processing advances in wireless and
mobile communications /",
"responsibility_txt_en":"edited by G.B. Giannakis ... [et
al.].",
"publisher_place_s":"Upper Saddle River, NJ :",
"publisher_name_s":"Prentice Hall PTR,",
"publisher_date_s":"c2001.",
"subjects_ss":["Signal processing", "Wireless communication
systems"],
}
The process that Solr follows when a new document is ingested into Solr is
more or less as follows:
1. If there is an exact match for a field being ingested and the fields defined in
the schema then Solr will use the definition in the schema to ingest the
data. This is what happened for the id field. Our JSON data has an id field
and so does the schema, therefore Solr stored the id value in the id field as
indicated in the schema (i.e. as single-value string.)
2. If there is no exact match in the schema then Solr will look at the
dynamicFields definitions to see if the field can be handled with some
predefined settings. This is what happened with the title_txt_en field.
Because there is no title_txt_en definition in the schema, Solr used the
dynamic field definition for *_txt_en, which indicates that the value should be
indexed using the English text (text_en) field type.
3. If no match is found in the dynamic fields either, Solr will guess the
best type to use based on the data for this field in the first document. This
is what happened with the authors_other_txts_en field (notice that this field
ends with _txts_en rather than _txt_en). In this case, since there is no
dynamic field definition to handle this ending, Solr guessed and created field
authors_other_txts_en as text_general. For production use Solr recommends
disabling this automatic guessing; this is what the “WARNING: Using
_default configset with data driven schema functionality. NOT
RECOMMENDED for production use” message was about when we first created our
core.
fields and dynamic field definitions that are configured in our Solr core.
Field: id
$ curl localhost:8983/solr/bibdata/schema/fields/id
#
# Will return something like this
# {
# "responseHeader":{...},
# "field":{
# "name":"id",
# "type":"string",
# "multiValued":false,
# "indexed":true,
# "required":true,
# "stored":true
# }
# }
#
Notice how the field is of type string and that it is marked as not multi-valued,
indexed, required, and stored.
The string type also has its own definition, which we can view via:
$ curl localhost:8983/solr/bibdata/schema/fieldtypes/string
# {
# "responseHeader":{...},
# "fieldType":{
# "name":"string",
# "class":"solr.StrField",
# "sortMissingLast":true,
# "docValues":true
# }
# }
#
Field: title_txt_en
Now let’s look at a more complex field and field type. If we look for a definition
for the title_txt_en Solr will report that we don’t have one:
$ curl localhost:8983/solr/bibdata/schema/fields/title_txt_en
# {
# "responseHeader":{...
# "error":{
# "metadata":[
# "error-class","org.apache.solr.common.SolrException",
# "root-error-
class","org.apache.solr.common.SolrException"],
# "msg":"No such path /schema/fields/title_txt_en",
# "code":404}}
#
However, if we look at the dynamic field definitions we’ll notice that there is one
for fields that end in _txt_en:
$ curl localhost:8983/solr/bibdata/schema/dynamicfields/*_txt_en
# {
# "responseHeader":{...
# "dynamicField":{
# "name":"*_txt_en",
# "type":"text_en",
# "indexed":true,
# "stored":true}}
#
This tells Solr that any field name in the source data that does not already exist
in the schema and that ends in _txt_en should be created as a field of type
text_en. That looks innocent enough, so let’s look closer to see what field type
text_en means:
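$ curl localhost:8983/solr/bibdata/schema/fieldtypes/text_en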
# {
# "responseHeader":{...}
# "fieldType":{
# "name":"text_en",
# "class":"solr.TextField",
# "positionIncrementGap":"100",
# "indexAnalyzer":{
# "tokenizer":{
# "class":"solr.StandardTokenizerFactory"
# },
# "filters":[
# { "class":"solr.StopFilterFactory" ... },
# { "class":"solr.LowerCaseFilterFactory" },
# { "class":"solr.EnglishPossessiveFilterFactory" },
# { "class":"solr.KeywordMarkerFilterFactory" ... },
# { "class":"solr.PorterStemFilterFactory" }
# ]
# },
# "queryAnalyzer":{
# "tokenizer":{
# "class":"solr.StandardTokenizerFactory"
# },
# "filters":[
# { "class":"solr.SynonymGraphFilterFactory" ... },
# { "class":"solr.StopFilterFactory" ... },
# { "class":"solr.LowerCaseFilterFactory" },
# { "class":"solr.EnglishPossessiveFilterFactory" },
# { "class":"solr.KeywordMarkerFilterFactory" ... },
# { "class":"solr.PorterStemFilterFactory" }
# ]
# }
# }
# }
This is obviously a much more complex definition than the ones we saw before.
Although the basics are the same (e.g. the field type points to class
solr.TextField) notice that there are two new sections indexAnalyzer and
queryAnalyzer for this field type. We will explore those in the next section.
Note: The fact that the Solr schema API does not show dynamically created
fields (like title_txt_en) is baffling, particularly since they do show in the
Schema Browser Screen of the Solr Admin tool. This has been a known issue for a
long time.
When a value is indexed for a particular field the value is first passed to a
tokenizer and then to the filters defined in the indexAnalyzer section for that
field type. Similarly, when we query for a value in a given field the value our
query is first processed by a tokenizer and then by the filters defined in the
queryAnalyzer section for that field.
If we look again at the definition for the text_en field type we’ll notice that “stop
words” (i.e. words to be ignored) are handled at index and query time (notice
the StopFilterFactory filter appears in the indexAnalyzer and the queryAnalyzer
sections.) However, notice that “synonyms” will only be applied at query time
since the filter SynonymGraphFilterFactory only appears on the queryAnalyzer
section.
We can customize field type definitions to use different filters and tokenizers via
the Schema API, which we will discuss later in this tutorial.
Tokenizers
For most purposes we can think of a tokenizer as something that splits a given
text into individual tokens or words. The Solr Reference Guide defines
Tokenizers as follows:
For example if we give the text “hello world” to a tokenizer it might split the
text into two tokens: “hello” and “world”.
Solr comes with several built-in tokenizers that handle a variety of data. For
example if we expect a field to have information about a person’s name the
Standard Tokenizer might be appropriate for it. However, for a field that
contains e-mail addresses the UAX29 URL Email Tokenizer might be a better
option.
Filters
Notice that unlike tokenizers, whose job is to split text into tokens, the job of
filters is a bit more complex since they might replace the token with a new one
or discard it altogether.
Solr comes with many built-in Filters that we can use to perform useful
transformations. For example the ASCII Folding Filter converts non-ASCII
characters to their ASCII equivalent (e.g. “México” is converted to “Mexico”).
Likewise the English Possessive Filter removes singular possessives (trailing ’s)
from words. Another useful filter is the Porter Stem Filter that calculates word
stems using English language rules (e.g. both “jumping” and “jumped” will be
reduced to “jump”.)
When we looked at the definition for the text_en field type we noticed that at
index time several filters were applied ( StopFilterFactory,
LowerCaseFilterFactory, EnglishPossessiveFilterFactory,
KeywordMarkerFilterFactory, and PorterStemFilterFactory.)
That means that if we index the text “The Television is Broken!” in a text_en
field the filters defined in the indexAnalyzer will transform this text into two
tokens: “televis” and “broken”. Notice how the tokens were lowercased, the stop
words (“the”, “is”) were dropped, and “television” was stemmed to “televis”.
The Analysis Screen in the Solr Admin tool is a great way to see how a
particular text is either indexed or queried by Solr depending on the field type.
Point your browser to http://localhost:8983/solr/#/bibdata/analysis and try the
following examples:
Enter “The quick brown fox jumps over the lazy dog” in the “Field Value
(index)”, select string as the field type and see how it is indexed. Then select
text_general and click “Analyze Values” to see how it’s indexed. Lastly,
select text_en and see how it’s indexed. You might want to uncheck the
“Verbose output” to see the differences more clearly.
With the text still on the “Field Value (index)” text box, enter “The quick
brown fox jumps over the LAZY dog” on the “Field Value (query)” and try
the different field types (string/text_general/text_en) again to see how each
of them shows different matches.
Try changing the text on the “Field Value (query)” text box to “The quick
brown foxes jumped over the LAZY dogs”. Compare the results using
text_general versus text_en.
Now enter “The TV is broken!” on the “Field Value (index)” text box, clear
the “Field Value (query)” text box, select text_en, and see how the value is
indexed. Then do the reverse, clear the indexed value and enter “The TV is
broken!” on the “Field Value (query)” text box and notice synonyms being
applied.
Now enter “The TV is broken!” on the “Field Value (index)” text box and “the
television is broken” on the “Field Value (query)”. Notice how they are
matched because the use of synonyms applied for text_en fields.
Now enter “The TV is broken!” on the “Field Value (index)” text box and
clear the “Field Value (query)” text box, select text_general and notice how
the stop words were not removed because we are not using English specific
rules.
You can see the definition of the text_cjk field type (used for Chinese, Japanese,
and Korean text) with the following command. Notice how there are two new filters
(CJKWidthFilterFactory and CJKBigramFilterFactory) that are different from what we
saw in the text_en definition.
$ curl localhost:8983/solr/bibdata/schema/fieldtypes/text_cjk
# ...
# "fieldType":{
# "name":"text_cjk",
# "class":"solr.TextField",
# "positionIncrementGap":"100",
# "analyzer":{
# "tokenizer":{
# "class":"solr.StandardTokenizerFactory"},
# "filters":[
# {"class":"solr.CJKWidthFilterFactory"},
# {"class":"solr.LowerCaseFilterFactory"},
# {"class":"solr.CJKBigramFilterFactory"}]}}}
#
If you go to the Analysis Screen again and enter “胡志明” (Ho Chi Minh) as the
“Field Value (index)”, select text_general as the FieldType and analyse the
values you’ll notice how Solr calculated three tokens (“胡”, “志”, and “明”) which
is incorrect in Chinese. However, if you select text_cjk and analyze the values
again you’ll notice that you’ll end with two tokens (“胡志” and “志明”) thanks to
the CJKBigramFilterFactory and that is the expected behavior for text in Chinese.
The data for this section was taken from this blog post. Although the
technology referenced in the blog post is a bit dated, the basic concepts
explained are still relevant, particularly if you, like me, are not a CJK speaker.
Naomi Dushay’s CJK with Solr for Libraries is a great resource on this topic.
Fields that are stored but not indexed can be fetched once a document has been
found, but they cannot be searched on.
Technically it’s also possible to add a field that is neither stored nor indexed but
that’s beyond the scope of this tutorial.
There are many reasons to toggle the stored and indexed properties of a field.
For example, perhaps we want to store a complex object as string in Solr so
that we can display it to the user but we really don’t want to index its values
because we don’t expect to ever search by this field. Conversely, perhaps we
want to create a field with a combination of values to optimize a particular kind
of search, but we don’t want to display it to the users (the default _text_ field in
our schema is such an example).
Also, although it’s nice that we can do sophisticated searches by title (because
it is a text_en field), we cannot sort the results by this field because it is a
tokenized field (technically we can sort by it but the results will not be what we
would expect).
Let’s customize our schema a little bit to get the most out of Solr.
Let’s begin by recreating our Solr core so that we have a clean slate.
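A sketch of the delete step (assuming the same Docker setup as before):

$ docker exec -it solr-container solr delete -c bibdata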
Then re-create it
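$ docker exec -it solr-container solr create -c bibdata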
And finally query it (you should have zero documents since we have not re-
imported the data)
$ curl 'http://localhost:8983/solr/bibdata/select?q=*:*'
#
# "response":{"numFound":0,"start":0,"docs":[]
#
This time, before we import the data in the books.json file, we are going to add a
few field definitions to the schema to make sure the data is indexed and stored
in the way that we want.
The first thing we’ll do is add a new dynamicField definition to account for multi-
value text fields in English, i.e. fields that end with _txts_en in our JSON data:
this will make sure Solr indexes these fields as text_en rather than the default
text_general that it used when we did not have a dynamicField to account for
them.
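A hedged sketch of that definition via the Schema API (the exact properties in the
original may differ):

$ curl -X POST 'http://localhost:8983/solr/bibdata/schema' \
  -H 'Content-type: application/json' \
  --data-binary '{"add-dynamic-field": {"name": "*_txts_en", "type": "text_en", "multiValued": true, "indexed": true, "stored": true}}'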
Secondly we’ll ask Solr to store a string version of the title (in addition to the
text version) so we can sort results by title. To do this we’ll add a copy-field
directive to our schema to copy the value of title_txt_en to another field called
title_s (the _s suffix maps to the built-in string dynamic field).
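A sketch of that copy-field directive:

$ curl -X POST 'http://localhost:8983/solr/bibdata/schema' \
  -H 'Content-type: application/json' \
  --data-binary '{"add-copy-field": {"source": "title_txt_en", "dest": "title_s"}}'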
Right now we have two separate fields for author information (author_txt_en for
the main author and authors_other_txts_en for additional authors) which means
that if we want to find books by a particular author we have to issue a query
against two separate fields: author_txt_en:"Sarah" OR
authors_other_txts_en:"Sarah"
Let’s use a copy-field directive to have Solr automatically combine the main
author and additional authors into a new field. Notice that the new field
authors_all_txts_en matches the dynamicField directive that we just created,
meaning that it will be indexed as text_en multi-valued.
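A sketch of those copy-field directives, followed by re-importing the data so the new
definitions take effect (this reuses the books.json file we copied to the container
earlier):

$ curl -X POST 'http://localhost:8983/solr/bibdata/schema' \
  -H 'Content-type: application/json' \
  --data-binary '{"add-copy-field": {"source": "author_txt_en", "dest": "authors_all_txts_en"}}'
$ curl -X POST 'http://localhost:8983/solr/bibdata/schema' \
  -H 'Content-type: application/json' \
  --data-binary '{"add-copy-field": {"source": "authors_other_txts_en", "dest": "authors_all_txts_en"}}'
$ docker exec -it solr-container post -c bibdata books.json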
Now that we have a string version of the title field it is possible for us to sort our
search results by this field. For example, let’s search for books that have the
word “water” in the title (q=title_txt_en:water) and sort them by title
(sort=title_s+asc):
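$ curl 'http://localhost:8983/solr/bibdata/select?fl=id,title_txt_en&q=title_txt_en:water&sort=title_s+asc'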
#
# response will include
# ...
# "title_txt_en":"A practical guide to creating and maintaining water quality /",
# "title_txt_en":"A practical guide to particle counting for drinking water treatment /",
# "title_txt_en":"Applied ground-water hydrology and well hydraulics /",
# "title_txt_en":"Assessment of blue-green algal toxins in raw and finished drinking water /",
# "title_txt_en":"Bureau of Reclamation..."
# "title_txt_en":"Carry me across the water : a novel /",
# "title_txt_en":"Clean Water Act : proposed revisions to EPA regulations to clean up polluted waters /",
# "title_txt_en":"Cold water burning /"
# ...
#
Notice that the results are sorted alphabetically by title because we are using the
string version of the field (title_s) for sorting. Try and see what the results look
like if you sort by the text version of the title (title_txt_en):
$ curl 'http://localhost:8983/solr/bibdata/select?fl=id,title_txt_en&q=title_txt_en:water&sort=title_txt_en+asc'
The results in this case will not look correct because Solr will be using the
tokenized value of the title_txt_en field to sort rather than the string version.
Take a look at the data for this particular book that has many authors and
notice how the authors_all_txts_en field has the combination of author_txt_en
and authors_other_txts_en even though our source data didn’t have an
authors_all_txts_en field:
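A sketch of that lookup (the fl list below is only there to keep the output small):

$ curl 'http://localhost:8983/solr/bibdata/select?q=id:00009214&fl=id,author_txt_en,authors_other_txts_en,authors_all_txts_en'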
#
# {
#   "id":"00009214",
#   "author_txt_en":"Everett, Barbara,",
#   "authors_other_txts_en":["Gallop, Ruth,"],
#   "authors_all_txts_en":["Everett, Barbara,", "Gallop, Ruth,"],
# }
#
Likewise, let’s search for books authored by “Gallop” using our new
authors_all_txts_en field (q=authors_all_txts_en:Gallop ) and notice how this
document will be on the results regardless of whether Ruth Gallop is the main
author or an additional author.
$ curl 'http://localhost:8983/solr/bibdata/select?q=authors_all_txts_en:Gallop'
Let’s run a query without specifying what field to search on, for example
q=biology:
$ curl 'http://localhost:8983/solr/bibdata/select?q=biology&debug=all'
The result will include all documents where the word “biology” is found in the
_text_ field and since we are now populating this field with a copy of every
value in our documents this means that we’ll get back any document that has
the word “biology” in the title, the author, or the subject.
We can confirm that Solr is searching on the _text_ field by looking at the
information in the parsed query; it will look like this:
"debug":{
"rawquerystring":"biology",
"querystring":"biology",
"parsedquery":"_text_:biology",
So far we have been passing parameters such as q and fl to the select handler,
for example:
$ curl 'http://localhost:8983/solr/bibdata/select?q=*&fl=id,title_txt_en'
The components in Solr that parse these parameters are called Query Parsers.
Their job is to extract the parameters and create a query that Lucene can
understand. Remember that Lucene is the search engine underneath Solr.
Query Parsers
Out of the box Solr comes with three query parsers: Standard, DisMax, and
Extended DisMax (eDisMax). Each of them has its own advantages and
disadvantages.
The Standard query parser (aka the Lucene Parser) “supports a robust and
fairly intuitive syntax allowing you to create a variety of structured queries.
The largest disadvantage is that it’s very intolerant of syntax errors, as
compared with something like the DisMax Query Parser which is designed to
throw as few errors as possible.”
The DisMax query parser (DisMax) interface “is more like that of Google
than the interface of the ‘lucene’ Solr query parser. This similarity makes
DisMax the appropriate query parser for many consumer applications. It
accepts a simple syntax, and it rarely produces error messages.”
The Extended DisMax (eDisMax) query parser builds on DisMax: it keeps the
forgiving, Google-like behavior but also supports the full Lucene query
syntax and a few extra parameters.
One key difference among these parsers is that they recognize different
parameters. For example, the DisMax and eDisMax parsers support a qf
parameter to specify what fields should be searched, but this parameter is
not supported by the Standard parser.
The rest of the examples in this section are going to use the eDisMax parser;
notice the defType=edismax in our queries to Solr to make this selection.
To see a comprehensive list of the parameters that apply to all parsers
take a look at the Common Query Parameters and the Standard Query Parser
sections in the Solr Reference Guide.
Below are some of the parameters that are supported by all parsers:
defType: Query parser to use (default is lucene, other possible values are
dismax and edismax)
q: Search query, the basic syntax is field:"value".
sort: Sorting of the results (default is score desc, i.e. highest ranked
document first)
rows: Number of documents to return (default is 10)
start: Index of the first document to return (default is 0)
fl: List of fields to return in the result.
fq: Filters results without calculating a score.
Below are a few sample queries to show these parameters in action. Notice that
spaces are URL encoded as + in the curl commands below, you do not need to
encode them if you are submitting these queries via the Solr Admin interface in
your browser.
Retrieve the first 10 documents where the title_txt_en includes the word
“washington” (q=title_txt_en:washington)
$ curl 'http://localhost:8983/solr/bibdata/select?q=title_txt_en:washington'
The next 15 documents for the same query (notice the start=10 and rows=15
parameters)
$ curl 'http://localhost:8983/solr/bibdata/select?q=title_txt_en:washington&start=10&rows=15'
Documents that have a value in the authors_other_txts_en field
(q=authors_other_txts_en:*)
$ curl 'http://localhost:8983/solr/bibdata/select?fl=id,title_txt_en,author_txt_en,authors_other_txts_en&q=authors_other_txts_en:*'
Documents that do not have a value in the authors_other_txts_en field
(q=NOT authors_other_txts_en:*)
$ curl 'http://localhost:8983/solr/bibdata/select?fl=*&q=NOT+authors_other_txts_en:*'
Documents where at least one of the subjects includes the word “communication”
(q=subjects_all_txts_en:communication)
$ curl 'http://localhost:8983/solr/bibdata/select?fl=id,title_txt_en,subjects_all_txts_en&q=subjects_all_txts_en:communication&debug=all'
Documents where title include “science” and at least one of the subjects is
“women” (q=title_txt_en:science AND subjects_all_txts_en:women notice that
both search conditions are indicated in the q parameter) Again, notice that
the AND operator must be in uppercase.
$ curl 'http://localhost:8983/solr/bibdata/select?fl=id,title_txt_en,subjects_all_txts_en&q=title_txt_en:science+AND+subjects_all_txts_en:women'
Documents where title includes the word “history” but does not include the
word “art” (q=title_txt_en:history AND NOT title_txt_en:art)
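$ curl 'http://localhost:8983/solr/bibdata/select?fl=id,title_txt_en&q=title_txt_en:history+AND+NOT+title_txt_en:art'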
The Solr Reference Guide and this Lucene tutorial are good places to check for
quick reference on the query syntax.
the qf parameter
The DisMax and eDisMax query parsers provide another parameter, Query
Fields qf, that should not be confused with the q or fq parameters. The qf
parameter is used to indicate the list of fields that the search should be
executed on along with their boost values.
If we want to search for the same value in multiple fields at once (e.g. if we
want to find all books where the title or the author includes the text
“Washington”) we must indicate each field/value pair individually:
q=title_txt_en:"Washington" authors_all_txts_en:"Washington".
The qf parameter allows us to specify the fields separately from the terms so that
we can instead use: q="Washington" and qf=title_txt_en authors_all_txts_en.
This is really handy if we want to customize what fields are searched in an
application in which the user enters a single text (say “Washington”) and the
application automatically searches multiple fields.
$ curl 'http://localhost:8983/solr/bibdata/select?q="washington"&qf=title_txt_en+authors_all_txts_en&defType=edismax'
debugQuery
Solr provides an extra parameter debug=all that we can use to get debug
information about a query. This is particularly useful if the results that we get
are not what we were expecting. For example, let’s run the same query again
but this time passing the debug=all parameter:
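A sketch of that query (the same query as above with debug=all added):

$ curl 'http://localhost:8983/solr/bibdata/select?q="washington"&qf=title_txt_en+authors_all_txts_en&defType=edismax&debug=all'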
Notice the debug property in the output, inside this property there is information
about:
what value the server received for the search (querystring) which is useful
to detect if you are not URL encoding properly the value sent to the server
how the server parsed the query (parsedquery) which is useful to detect if
the syntax on the q parameter was parsed as we expected (e.g. remember
the example earlier when we passed two words art history without
surrounding them in quotes and the parsed query showed that it was
querying two different fields title_txt_en for “art” and _text_ for “history”)
you can also see that some of the search terms were stemmed (e.g. if you
query for “running” you’ll notice that the parsed query will show “run”)
how each document was ranked ( explain)
what query parser (QParser) was used
Check out this blog post for more information about debugQuery.
Ranking of documents
When Solr finds documents that match the query it ranks them so that the
most relevant documents show up first. You can provide Solr guidance on what
Let’s say that we want documents where the word “Washington” (q=washington)
is found in the title or in the author (qf=title_txt_en authors_all_txts_en)
$ curl 'http://localhost:8983/solr/bibdata/select?&q=washington&qf=title_txt_en+authors_all_txts_en&defType=edismax'
Now let’s say that we want to boost the documents where the author has the
word “Washington” ahead of the documents where “Washington” was found in
the title. To do this we update the qf parameter as follows: qf=title_txt_en
authors_all_txts_en^5 (notice the ^5 to boost the authors_all_txts_en field)
$ curl 'http://localhost:8983/solr/bibdata/select?&q=washington&qf=title_txt_en+authors_all_txts_en^5&defType=edismax'
Notice how documents where the author is named “Washington” come first, but
we still get documents where the title includes the word “Washington”.
Boost values are arbitrary, you can use 1, 20, 789, 76.2, 1000, or whatever
number you like, you can even use negative numbers (qf=title_txt_en
authors_all_txts_en^-10). They are just a way for us to hint Solr which fields we
consider more important in a particular search.
If we want to see why Solr ranked a result higher than another we can pass an
additional parameter debug.explain.structured=true to see the explanation of
how Solr ranked each of the documents in the result:
$ curl 'http://localhost:8983/solr/bibdata/select?q=title_txt_en:west+authors_all_txts_en:washington&debug=all&debug.explain.structured=true'
The result will include an explain node with a ton of information for each of the
documents ranked. This information is rather complex but it has a wealth of
details that could help us figure out why a particular document is ranked higher
or lower than what we would expect. Take a look at this blog post to get an idea
on how to interpret this information.
You can also filter a field to be within a range by using the bracket operator with
the following syntax: field:[firstValue TO lastValue]. For example, to request
documents with an id between 00010500 and 00012050:
$ curl 'http://localhost:8983/solr/bibdata/select?q=id:\[00010500+TO+00012050\]'
Be aware that range filtering with string fields works as you would expect it
to, but with text_general and text_en fields it will filter on the terms indexed,
not on the value of the field.
Searching is a large and complex topic. I’ve found the book “Relevant
Search: with applications for Solr and Elasticsearch” (see references) to be a
good conceptual reference with specifics on how to understand and configure
Solr to improve search results. Chapter 3 of this book goes into great detail on
how to read and understand the ranking of results.
Facets
One of the most popular features of Solr is the concept of facets. The Solr
Reference Guide defines it as:
You can easily get facet information from a query by selecting what field (or
fields) you want to use to generate the categories and the counts. The basic
syntax is facet=on followed by facet.field=name-of-field. For example to facet
our dataset by subjects we would use the following syntax:
facet.field=subjects_ss as in the following example:
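$ curl 'http://localhost:8983/solr/bibdata/select?q=*&facet=on&facet.field=subjects_ss'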
IMPORTANT: You might have noticed that we are using the string
representation of the subjects (subjects_ss) to generate the facets rather than
the text_en version stored in the subjects_all_txts_en field. This is because, as
the Solr Reference Guide indicates facets are calculated “based on indexed
terms”. The indexed version of the subjects_all_txts_en field is tokenized
whereas the indexed version of subjects_ss is the entire string.
There are several extra parameters that you can pass to Solr to customize how
many facets are returned in the result set. For example, if you want to list only the
top 20 subjects in the facets rather than all of them you can indicate this with
the following syntax: f.subjects_ss.facet.limit=20. You can also request only
facets that have at least a certain number of matches, for example only subjects
that have at least 50 books (f.subjects_ss.facet.mincount=50) as shown in the
following example:
$ curl 'http://localhost:8983/solr/bibdata/select?q=*&facet=on&facet.field=subjects_ss&f.subjects_ss.facet.limit=20&f.subjects_ss.facet.mincount=50'
You can also facet by multiple fields at once; this is called Pivot Faceting. The
way to do this is via the facet.pivot parameter.
Note: Unfortunately the facet.pivot parameter is not available via the Solr
Admin web page; if you want to try this example you will have to do it from the
command line (e.g. with curl).
This parameter allows you to list the fields that should be used to facet the
data, for example to facet the information by subject and then by publisher
(facet.pivot=subjects_ss,publisher_name_s) you could issue the following
command:
$ curl 'http://localhost:8983/solr/bibdata/select?q=*&facet=on&facet.pivot=subjects_ss,publisher_name_s&facet.limit=5'
#
# response will include facets organized as follows:
#
# "facet_counts":{
# "facet_pivot":{
# "subjects_ss,publisher_name_s":[{
# "field":"subjects_ss",
# "value":"Women",
# "count":435,
# "pivot":[{
# "field":"publisher_name_s",
# "value":"Chelsea House Publishers,",
# "count":22},
# {
# "field":"publisher_name_s",
# "value":"Enslow Publishers,",
# "count":13},
# ...
# ]
# }
# ]
# ...
#
Notice how the results for the subject “Women” (435 results) are broken down
by publisher under the “pivot” section.
Hit highlighting
Another Solr feature is the ability to return a fragment of the document where
the match was found for a given search term. This is called highlighting.
Let’s say that we search for books where one of the authors or the title includes
the word “washington” and we turn highlighting on with the hl=on parameter:
$ curl 'http://localhost:8983/solr/bibdata/select?defType=edismax&q=washington&qf=title_txt_en+authors_all_txts_en&hl=on'
#
# response will include a highlight section like this
#
# "highlighting":{
#   "00065343":{
#     "title_txt_en":["<em>Washington</em> Irving's The legend of Sleepy Hollow.."],
#     "authors_all_txts_en":["Irving, <em>Washington</em>,"]},
#   "00107795":{
#     "authors_all_txts_en":["<em>Washington</em>, Durthy."]},
#   "00044606":{
#     "title_txt_en":["University of <em>Washington</em> /"]},
#
Notice how the highlighting property includes the id of each document in the
result (e.g. 00065343), the field where the match was found
(e.g. authors_all_txts_en and/or title_txt_en) and the text that matched within
the field (e.g. University of <em>Washington</em> /). You can display this
information along with your search results to allow the user to “preview” why
each result was rendered.
Open a separate terminal window and execute the following commands to log into
the container and see the files inside it:
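A sketch of those commands (the directory listing below is from Solr's home
directory inside the container):

$ docker exec -it solr-container /bin/bash
$ ls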
#
# You'll see something like this
#
# bin         CHANGES.txt  docker   lib       LICENSE.txt  NOTICE.txt           README.txt
# books.json  contrib      example  licenses  modules      prometheus-exporter  server
#
While still in the Docker container issue a command as follows to see the files
with the configuration for our bibdata core:
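Something along these lines (this is the conf path used later in this tutorial):

$ ls -l /var/solr/data/bibdata/conf/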
#
# You'll see something like this
#
# drwxr-xr-x 2 solr solr     4096 Nov 11 07:31 lang
# -rw-r--r-- 1 solr solr    26665 Jan 15 18:07 managed-schema.xml
# -rw-r--r-- 1 solr solr      873 Nov 11 07:31 protwords.txt
# -rw-r--r-- 1 503  dialout 48192 Jan 15 19:45 solrconfig.xml
# -rw-r--r-- 1 solr solr      781 Nov 11 07:31 stopwords.txt
# -rw-r--r-- 1 solr solr     1124 Nov 11 07:31 synonyms.txt
Before we continue let’s exit from the Docker container with the exit command
(don’t worry the Docker container is still up and running in the background):
$ exit
Synonyms
In a previous section, when we looked at the text_general and text_en field
types, we noticed that a filter is used to handle synonyms at query time.
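Looking at the text_en definition again:

$ curl localhost:8983/solr/bibdata/schema/fieldtypes/text_en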
#
# "queryAnalyzer":{
# "tokenizer":{
# ...
# },
# "filters":[
# ...
# a few filters go here
# ...
# {
# "class":"solr.SynonymGraphFilterFactory",
# "expand":"true",
# "ignoreCase":"true",
# "synonyms":"synonyms.txt"
# },
# ...
#
You can view the contents of the synonyms.txt file for our bibdata core through
the Files option in the Solr Admin web page:
http://localhost:8983/solr/#/bibdata/files?file=synonyms.txt
For example, right now a search for books with “twentieth century” in the title
returns a different set of results than a search for “20th century”, even though
both phrases mean the same thing:
$ curl 'http://localhost:8983/solr/bibdata/select?fl=id,title_txt_en&q=title_txt_en:"twentieth+century"'
#
# result will include 84 results
#
$ curl 'http://localhost:8983/solr/bibdata/select?fl=id,title_txt_en&q=title_txt_en:"20th+century"'
#
# result will include 22 results
#
Adding synonyms
We can tell Solr that “twentieth” and “20th” are synonyms by updating the
synonyms.txt file, adding a line as follows:
20th,twentieth
Because our Solr is running inside a Docker container we need to update the
synonyms.txt file inside the container. We are going to do this in four steps:
1. First we’ll copy synonyms.txt from the Docker container to our machine
2. Then we’ll update the file in our machine (with whatever editor we are
comfortable with)
3. Next we’ll copy our updated local copy back to the container
4. And lastly, we’ll tell Solr to reload the core’s configuration so the changes
take effect.
To copy the synonyms.txt from the container to our machine we’ll issue the
following command:
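Something along these lines should work (using the conf path we saw earlier):

$ docker cp solr-container:/var/solr/data/bibdata/conf/synonyms.txt synonyms.txt
$ ls -la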
#
# drwxr-xr-x   3 user-id  staff    96 Jan 16 18:02 .
# drwxr-xr-x  51 user-id  staff  1632 Jan 12 20:10 ..
# -rw-r--r--@  1 user-id  staff  1124 Nov 11 02:31 synonyms.txt
#
$ cat synonyms.txt
#
# will include a few lines including
#
# GB,gib,gigabyte,gigabytes
# Television, Televisions, TV, TVs
#
Let’s edit this file with whatever editor we are comfortable with. Our goal is to add
a new line to make 20th and twentieth synonyms; we can do it by adding the
20th,twentieth line shown above to the end of the file.
Now that we have updated our local copy of the synonyms file we need to copy
this new version back to the Docker container, we can do this with a command
like this:
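A sketch of the copy-back and reload steps (the reload uses Solr's Core Admin API):

$ docker cp synonyms.txt solr-container:/var/solr/data/bibdata/conf/synonyms.txt
$ curl 'http://localhost:8983/solr/admin/cores?action=RELOAD&core=bibdata'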
You can also reload the core via the Solr Admin page. Select “Core Admin”, then
“bibdata”, and click “Reload”.
If you run the queries again they will both report “106 results found” regardless
of whether you search for q=title_txt_en:"twentieth century" or
q=title_txt_en:"20th century":
$ curl 'http://localhost:8983/solr/bibdata/select?fl=id,title_txt_en&q=title_txt_en:"twentieth+century"'
#
# result will include 106 results
# 84 with "twentieth century" plus 22 with "20th century"
#
To find more about synonyms take a look at this blog post where I talk about
the different ways of adding synonyms, how to test them in the Solr Admin
tool, and the differences between applying synonyms at index time versus
query time.
Core-specific configuration
One of the most important configuration files for a Solr core is solrconfig.xml,
located in the configuration folder for the core. We can view the content of this
file for our bibdata core at http://localhost:8983/solr/#/bibdata/files?file=solrconfig.xml
Note: Despite its name, file solrconfig.xml controls the configuration for our
core, not for the entire Solr installation. Each core has its own solrconfig.xml
file.
$ docker cp solr-container:/var/solr/data/bibdata/conf/solrconfig.xml solrconfig.xml
$ docker cp solr-container:/var/solr/data/bibdata/conf/solrconfig.xml solrconfig.bak
$ ls
#
# drwxr-xr-x   4 user-id  staff    128 Jan 16 18:19 .
# drwxr-xr-x  51 user-id  staff   1632 Jan 12 20:10 ..
# -rw-r--r--@  1 user-id  staff  47746 Jan 16 18:36 solrconfig.bak
# -rw-r--r--@  1 user-id  staff  47746 Jan 16 18:36 solrconfig.xml
# -rw-r--r--@  1 user-id  staff   1151 Jan 16 18:12 synonyms.txt
#
solrconfig.xml is the file that we will be working on. Like with the synonyms.txt
file before, the general workflow will be to make changes to this local version of
the file, copy the updated file to the Docker container, and reload the Solr core
to pick up the changes.
Request Handlers
When we issue a query like the one below, Solr handles it with the request handler
configured for the /select endpoint in solrconfig.xml:
$ curl 'http://localhost:8983/solr/bibdata/select?q=*'
We can make changes to this section to indicate that we want to use the
eDisMax query parser (defType) by default and set the default query fields (qf)
to title and author. To do so we could update the “defaults” section as follows:
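A hedged sketch of what the updated requestHandler definition could look like (the
other defaults present in the original file are omitted here):

<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <str name="defType">edismax</str>
    <str name="qf">title_txt_en authors_all_txts_en</str>
  </lst>
</requestHandler>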
We need to copy our updated file back to the Docker container and reload the
core for the changes to take effect, let’s do this with the following commands:
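Something along these lines (copy the file back, then reload the core):

$ docker cp solrconfig.xml solr-container:/var/solr/data/bibdata/conf/solrconfig.xml
$ curl 'http://localhost:8983/solr/admin/cores?action=RELOAD&core=bibdata'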
Notice that now we can issue a much simpler query since we don’t have to
specify the qf or defType parameters in the URL:
$ curl 'http://localhost:8983/solr/bibdata/select?q=west'
Be careful, an incorrect setting on the solrconfig.xml file can take our core
down or cause queries to give unexpected results. For example, entering the qf
value as title_txt_en, authors_all_txts_en (notice the comma to separate the
fields) will cause Solr to ignore this parameter.
The Solr Reference Guide has excellent documentation on the values that a
request handler accepts.
Search Components
You can find the definition of the search components in the solrconfig.xml by
looking at the searchComponent elements defined in this file. For example, in our
solrconfig.xml there is a section like this for the highlighting feature that we
used before:
<searchComponent class="solr.HighlightComponent" name="highlight">
  <highlighting>
    ... lots of other properties are defined here...
    <formatter name="html"
               default="true"
               class="solr.highlight.HtmlFormatter">
      <lst name="defaults">
        <str name="hl.simple.pre"><![CDATA[<em>]]></str>
        <str name="hl.simple.post"><![CDATA[</em>]]></str>
      </lst>
    </formatter>
    ... lots of other properties are defined here...
Notice that the HTML tokens (<em> and </em>) that we saw in the highlighting
results in previous section are defined here.
Spellchecker
Solr provides spellcheck functionality out of the box that we can use to help
users when they misspell a word in their queries. For example, if a user
searches for “Washingon” (notice the missing “t”) most likely Solr will return
zero results, but with the spellcheck turned on Solr is able to suggest the
correct spelling for the query (i.e. “Washington”).
In our current bibdata core a search for “Washingon” will return zero results:
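A sketch of that query, mirroring the parameters echoed back in the response below:

$ curl 'http://localhost:8983/solr/bibdata/select?q=title:washingon&fl=id,title'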
#
# response will indicate
# {
# "responseHeader":{
# "status":0,
# "params":{
# "q":"title:washingon",
# "fl":"id,title"}},
# "response":{"numFound":0,"start":0,"docs":[]
# }}
#
To do this let’s edit our local solrconfig.xml and replace the <requestHandler
name="/select" class="solr.SearchHandler"> node again but now with the
following content:
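A hedged sketch of the first part of that node (the exact defaults may differ from
the original file), followed by the spellcheck wiring shown below:

<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">title_txt_en authors_all_txts_en</str>
    <str name="spellcheck">on</str>
  </lst>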
<arr name="last-components">
<str>spellcheck</str>
</arr>
</requestHandler>
<searchComponent name="spellcheck"
class="solr.SpellCheckComponent">
<str name="queryAnalyzerFieldType">text_general</str>
<lst name="spellchecker">
<str name="name">default</str>
<str name="field">_text_</str>
<str name="classname">solr.DirectSolrSpellChecker</str>
...
</lst>
</searchComponent>
Notice how by default it will use the _text_ field for spellcheck.
Now that our bibdata core has been configured to use spellcheck let’s try our
misspelled “washingon” query again:
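A sketch of that query:

$ curl 'http://localhost:8983/solr/bibdata/select?q=title_txt_en:washingon'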
#
# response will still indicate zero documents found,
# but the spellcheck node will be populated
#
# "response":
{"numFound":0,"start":0,"numFoundExact":true,"docs":[]},
# "spellcheck":{
# "suggestions":[
# "washingon",{
# "numFound":3,
# "startOffset":13,
# "endOffset":22,
# "suggestion":["washington",
# "washigton",
# "washing"]}],
# "collations":[
# "collation",{
# "collationQuery":"title_txt_en:washington",
# "hits":41,
# "misspellingsAndCorrections":[
# "washingon","washington"]},
Notice that even though we still got zero results back (numFound:0), the response
now includes a spellcheck section with the words that were misspelled and the
suggested spellings for them. We can use this information to alert the user that
perhaps they misspelled a word, or to automatically re-submit the query with the
corrected spelling.
REFERENCES
The steps to create the books.json file from the original MARC data are as
follows:
marcli is a small utility program that I wrote in Go to parse MARC files. If you
are interested in the part that generates the JSON out of the MARC record take
a look at the processorSolr.go file.
Acknowledgements
I would like to thank my former team at the Brown University Library for their
support and recommendations as I prepared the initial version of this tutorial
back in 2017 as well as those that attended the workshop at the Code4Lib
conference in Washington, DC in 2018 and San Jose, CA in 2019. A special
thanks goes to Birkin Diana for helping me run the workshop in 2018 and 2019
and for taking the time to review the materials (multiple times!) and
painstakingly testing each of the examples.
Likewise, a big thanks to Bess Sadler, Carolyn Cole, Francis Kayiwa, and James
Griffin from the Princeton University Library for helping me run the workshop at
Code4Lib 2023.