# _analyze API
An analyzer in Elasticsearch processes text into tokens (words or terms) by
applying a chain of operations such as lowercasing, stop-word removal, and
stemming. The `_analyze` API lets you see exactly how a given analyzer breaks
down and transforms a piece of text, which makes it a handy tool for debugging
mappings and search behavior.
## STEPS:
Doc ---> analyzer ---> stored

analyzer = character filters ---> tokenizer ---> token filters
1) Character filtering: Before tokenization, character filters modify the raw
text by removing or replacing characters. For example, they can strip HTML tags
or replace certain symbols. By default, no character filters are applied.
2) Tokenizing: The tokenizer then breaks the filtered text into individual
tokens (words or terms). This step defines the boundaries of each token,
typically at whitespace or punctuation.
3) Token filtering: After tokenization, token filters can modify or drop
tokens. For example, they might lowercase tokens, remove stop words, or apply
stemming to reduce words to their root form. (A request that exercises all
three stages is sketched right after these steps.)
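As a quick illustration of all three stages in one request, here is a sketch
using the built-in html_strip character filter, the standard tokenizer, and the
lowercase and stop token filters; the HTML input is my own example, not from
the original notes:

# html_strip removes the tags before tokenization; the result should be
# just the tokens "hello" and "world" (neither is an English stop word)
POST _analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase", "stop"],
  "text" : "<p>Hello <b>World</b>!</p>"
}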
###########################################################
--> To see how a built-in analyzer processes text (standard is the default, so
naming it explicitly is optional; swap in any other analyzer to compare):
POST _analyze
{
  "text" : "Hello, How are you ? What's up ? This is so high-end!",
  "analyzer" : "standard"
}
{
  "tokens": [
    { "token": "hello",  "start_offset": 0,  "end_offset": 5,  "type": "<ALPHANUM>", "position": 0 },
    { "token": "how",    "start_offset": 7,  "end_offset": 10, "type": "<ALPHANUM>", "position": 1 },
    { "token": "are",    "start_offset": 11, "end_offset": 14, "type": "<ALPHANUM>", "position": 2 },
    { "token": "you",    "start_offset": 15, "end_offset": 18, "type": "<ALPHANUM>", "position": 3 },
    { "token": "what's", "start_offset": 21, "end_offset": 27, "type": "<ALPHANUM>", "position": 4 },
    { "token": "up",     "start_offset": 28, "end_offset": 30, "type": "<ALPHANUM>", "position": 5 },
    { "token": "this",   "start_offset": 33, "end_offset": 37, "type": "<ALPHANUM>", "position": 6 },
    { "token": "is",     "start_offset": 38, "end_offset": 40, "type": "<ALPHANUM>", "position": 7 },
    { "token": "so",     "start_offset": 41, "end_offset": 43, "type": "<ALPHANUM>", "position": 8 },
    { "token": "high",   "start_offset": 44, "end_offset": 48, "type": "<ALPHANUM>", "position": 9 },
    { "token": "end",    "start_offset": 49, "end_offset": 52, "type": "<ALPHANUM>", "position": 10 }
  ]
}
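To see which analysis component produced each token, the `_analyze` API also
accepts an "explain" flag; a minimal sketch (the verbose response is omitted
here):

# adds a per-component breakdown ("detail") to the response
POST _analyze
{
  "text" : "Hello, How are you ?",
  "analyzer" : "standard",
  "explain": true
}

The response then reports the output of the tokenizer and of each token filter
separately, with extra attributes for every token.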
--> The whitespace analyzer splits only on whitespace, so case and punctuation
survive (note "Hello,", the stand-alone "?", and "high-end!"):
POST _analyze
{
  "text" : "Hello, How are you ? What's up ? This is so high-end!",
  "analyzer" : "whitespace"
}
{
  "tokens": [
    { "token": "Hello,",    "start_offset": 0,  "end_offset": 6,  "type": "word", "position": 0 },
    { "token": "How",       "start_offset": 7,  "end_offset": 10, "type": "word", "position": 1 },
    { "token": "are",       "start_offset": 11, "end_offset": 14, "type": "word", "position": 2 },
    { "token": "you",       "start_offset": 15, "end_offset": 18, "type": "word", "position": 3 },
    { "token": "?",         "start_offset": 19, "end_offset": 20, "type": "word", "position": 4 },
    { "token": "What's",    "start_offset": 21, "end_offset": 27, "type": "word", "position": 5 },
    { "token": "up",        "start_offset": 28, "end_offset": 30, "type": "word", "position": 6 },
    { "token": "?",         "start_offset": 31, "end_offset": 32, "type": "word", "position": 7 },
    { "token": "This",      "start_offset": 33, "end_offset": 37, "type": "word", "position": 8 },
    { "token": "is",        "start_offset": 38, "end_offset": 40, "type": "word", "position": 9 },
    { "token": "so",        "start_offset": 41, "end_offset": 43, "type": "word", "position": 10 },
    { "token": "high-end!", "start_offset": 44, "end_offset": 53, "type": "word", "position": 11 }
  ]
}
--> The stop analyzer lowercases, tokenizes on non-letter characters, and
removes English stop words (note that "are", "is", and "this" are gone, and
"what's" is split at the apostrophe):
POST _analyze
{
  "text" : "Hello, How are you ? What's up ? This is so high-end!",
  "analyzer" : "stop"
}
{
  "tokens": [
    { "token": "hello", "start_offset": 0,  "end_offset": 5,  "type": "word", "position": 0 },
    { "token": "how",   "start_offset": 7,  "end_offset": 10, "type": "word", "position": 1 },
    { "token": "you",   "start_offset": 15, "end_offset": 18, "type": "word", "position": 3 },
    { "token": "what",  "start_offset": 21, "end_offset": 25, "type": "word", "position": 4 },
    { "token": "s",     "start_offset": 26, "end_offset": 27, "type": "word", "position": 5 },
    { "token": "up",    "start_offset": 28, "end_offset": 30, "type": "word", "position": 6 },
    { "token": "so",    "start_offset": 41, "end_offset": 43, "type": "word", "position": 9 },
    { "token": "high",  "start_offset": 44, "end_offset": 48, "type": "word", "position": 10 },
    { "token": "end",   "start_offset": 49, "end_offset": 52, "type": "word", "position": 11 }
  ]
}
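You can also point `_analyze` at an existing index to test the analyzer
configured for a specific field; `my-index` and `title` below are hypothetical
names, and the field must already exist in the mapping:

# "my-index" and "title" are placeholder names for this sketch
GET my-index/_analyze
{
  "field": "title",
  "text" : "Hello, How are you ? What's up ? This is so high-end!"
}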
--> Instead of naming a prebuilt analyzer, you can also spell out the three
stages explicitly:
POST _analyze
{
  "text" : "Hello, How are you ? What's up ? This is so high-end!",
  "char_filter": [],
  "tokenizer": "standard",
  "filter": ["lowercase"]
}
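Once a combination works, the usual next step is to register it as a custom
analyzer in the index settings; a minimal sketch, with `my-index` and
`my_custom_analyzer` as hypothetical names:

# "my-index" and "my_custom_analyzer" are placeholder names
PUT my-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": [],
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  }
}

It can then be tested with the same API:

POST my-index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text" : "Hello, How are you ? What's up ? This is so high-end!"
}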