Analyze API

The _analyze API in Elasticsearch processes text into tokens using character filters, tokenization, and token filters. It allows users to see how text is broken down by different analyzers, such as 'standard', 'whitespace', and 'stop'. Users can also analyze text with a custom combination of character filters, tokenizer, and token filters.

_analyze API

An analyzer in Elasticsearch is a tool that processes text into tokens (words or
terms) by applying a series of operations like lowercasing, removing stop words,
and stemming. The `_analyze` API lets you see how text is broken down and processed
by a specific analyzer.

##STEPS:

Doc ---> analyzer ---> stored tokens

analyzer = character filters ---> tokenizer ---> token filters

1) Character Filtering: Before tokenization, character filters modify the text by
removing or replacing certain characters. For example, they can strip HTML tags or
replace certain symbols. By default, there is no character filtering.
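
For instance, the built-in html_strip character filter removes markup before the
tokenizer ever sees the text. A minimal sketch (the sample text is just an
illustration):

POST _analyze
{
  "text": "<p>Hello <b>World</b></p>",
  "char_filter": ["html_strip"],
  "tokenizer": "standard"
}

This yields only the tokens "Hello" and "World"; the tags never reach the
tokenizer.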

2) Tokenizing: The tokenizer then breaks the modified text into individual tokens
(words or terms). This step defines the boundaries of each token, often using
spaces or punctuation.
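
To watch tokenization in isolation, pass just a tokenizer and no filters. A
minimal sketch using the built-in letter tokenizer, which splits on any
non-letter character:

POST _analyze
{
  "text": "high-end",
  "tokenizer": "letter"
}

This produces the two tokens "high" and "end", with casing untouched because no
token filters run.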

3) Token Filters: After tokenization, token filters can modify or filter out
tokens. For example, they might lowercase tokens, remove stop words, or apply
stemming to reduce words to their root form.
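
For example, the built-in lowercase and stop token filters can run after the
standard tokenizer. A minimal sketch:

POST _analyze
{
  "text": "This is SO high-end",
  "tokenizer": "standard",
  "filter": ["lowercase", "stop"]
}

Here "This" and "is" are lowercased and then dropped as English stop words,
leaving "so", "high", and "end".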

###########################################################
--> To see how an analyzer processes your text (the "analyzer" parameter defaults
to "standard" if omitted; you can name any other built-in or custom analyzer):

POST _analyze
{
  "text": "Hello, How are you ? What's up ? This is so high-end!",
  "analyzer": "standard"
}
{
  "tokens": [
    {
      "token": "hello",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "how",
      "start_offset": 7,
      "end_offset": 10,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "are",
      "start_offset": 11,
      "end_offset": 14,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "you",
      "start_offset": 15,
      "end_offset": 18,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "what's",
      "start_offset": 21,
      "end_offset": 27,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "up",
      "start_offset": 28,
      "end_offset": 30,
      "type": "<ALPHANUM>",
      "position": 5
    },
    {
      "token": "this",
      "start_offset": 33,
      "end_offset": 37,
      "type": "<ALPHANUM>",
      "position": 6
    },
    {
      "token": "is",
      "start_offset": 38,
      "end_offset": 40,
      "type": "<ALPHANUM>",
      "position": 7
    },
    {
      "token": "so",
      "start_offset": 41,
      "end_offset": 43,
      "type": "<ALPHANUM>",
      "position": 8
    },
    {
      "token": "high",
      "start_offset": 44,
      "end_offset": 48,
      "type": "<ALPHANUM>",
      "position": 9
    },
    {
      "token": "end",
      "start_offset": 49,
      "end_offset": 52,
      "type": "<ALPHANUM>",
      "position": 10
    }
  ]
}

--> The whitespace analyzer splits only on whitespace, keeping case and
punctuation intact:

POST _analyze
{
  "text": "Hello, How are you ? What's up ? This is so high-end!",
  "analyzer": "whitespace"
}
{
  "tokens": [
    {
      "token": "Hello,",
      "start_offset": 0,
      "end_offset": 6,
      "type": "word",
      "position": 0
    },
    {
      "token": "How",
      "start_offset": 7,
      "end_offset": 10,
      "type": "word",
      "position": 1
    },
    {
      "token": "are",
      "start_offset": 11,
      "end_offset": 14,
      "type": "word",
      "position": 2
    },
    {
      "token": "you",
      "start_offset": 15,
      "end_offset": 18,
      "type": "word",
      "position": 3
    },
    {
      "token": "?",
      "start_offset": 19,
      "end_offset": 20,
      "type": "word",
      "position": 4
    },
    {
      "token": "What's",
      "start_offset": 21,
      "end_offset": 27,
      "type": "word",
      "position": 5
    },
    {
      "token": "up",
      "start_offset": 28,
      "end_offset": 30,
      "type": "word",
      "position": 6
    },
    {
      "token": "?",
      "start_offset": 31,
      "end_offset": 32,
      "type": "word",
      "position": 7
    },
    {
      "token": "This",
      "start_offset": 33,
      "end_offset": 37,
      "type": "word",
      "position": 8
    },
    {
      "token": "is",
      "start_offset": 38,
      "end_offset": 40,
      "type": "word",
      "position": 9
    },
    {
      "token": "so",
      "start_offset": 41,
      "end_offset": 43,
      "type": "word",
      "position": 10
    },
    {
      "token": "high-end!",
      "start_offset": 44,
      "end_offset": 53,
      "type": "word",
      "position": 11
    }
  ]
}

--> The stop analyzer lowercases, splits on non-letter characters, and removes
English stop words (note that "are", "is", and "this" are gone, and "what's" is
split into "what" and "s"):

POST _analyze
{
  "text": "Hello, How are you ? What's up ? This is so high-end!",
  "analyzer": "stop"
}
{
  "tokens": [
    {
      "token": "hello",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0
    },
    {
      "token": "how",
      "start_offset": 7,
      "end_offset": 10,
      "type": "word",
      "position": 1
    },
    {
      "token": "you",
      "start_offset": 15,
      "end_offset": 18,
      "type": "word",
      "position": 3
    },
    {
      "token": "what",
      "start_offset": 21,
      "end_offset": 25,
      "type": "word",
      "position": 4
    },
    {
      "token": "s",
      "start_offset": 26,
      "end_offset": 27,
      "type": "word",
      "position": 5
    },
    {
      "token": "up",
      "start_offset": 28,
      "end_offset": 30,
      "type": "word",
      "position": 6
    },
    {
      "token": "so",
      "start_offset": 41,
      "end_offset": 43,
      "type": "word",
      "position": 9
    },
    {
      "token": "high",
      "start_offset": 44,
      "end_offset": 48,
      "type": "word",
      "position": 10
    },
    {
      "token": "end",
      "start_offset": 49,
      "end_offset": 52,
      "type": "word",
      "position": 11
    }
  ]
}

---> You can also spell out the three steps yourself instead of naming an analyzer:

POST _analyze
{
  "text": "Hello, How are you ? What's up ? This is so high-end!",
  "char_filter": [],
  "tokenizer": "standard",
  "filter": ["lowercase"]
}

This combination (no character filters, standard tokenizer, lowercase token
filter) reproduces the default standard analyzer.
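
As a fuller sketch, all three stages can be populated at once. Here the built-in
html_strip character filter, an inline custom mapping character filter, the
standard tokenizer, and two token filters run in order (the sample text and the
":) => happy" mapping are just illustrative choices):

POST _analyze
{
  "text": "I :) love <b>Elasticsearch</b>",
  "char_filter": [
    "html_strip",
    {
      "type": "mapping",
      "mappings": [":) => happy"]
    }
  ],
  "tokenizer": "standard",
  "filter": ["lowercase", "stop"]
}

The character filters strip the tags and rewrite the emoticon, the tokenizer
splits the result into words, and the token filters lowercase them and drop any
English stop words, leaving "i", "happy", "love", "elasticsearch".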
