Utilize cutoff_frequency parameter in all queries by orangejulius · Pull Request #1213 · pelias/api

orangejulius · 2018-10-17T00:53:58Z

This PR adds use of the Elasticsearch cutoff_frequency feature across nearly all of our query clauses.

This feature helps performance for queries that could potentially match lots of documents by first looking only at documents that match uncommon terms.

It does this in a way that has no effect on the end result, but massively reduces the number of documents that are scored.

My understanding of the meaning of the parameter is as follows: what fraction of the documents in a shard must contain this term before it is treated as a filtering term, rather than a scoring term? The value most commonly seen in documentation is 0.01, meaning any term that is in more than 1% of documents will be used as a filter. It's worth testing different values to see which one gives the best performance, but my guess is there aren't huge wins to be had from extensive tuning of this variable (I'd love to be proven wrong).

Essentially, cutoff_frequency performs the "pseudo-stopword" handling that I suggested in pelias/schema#310 (comment), except it does it automatically, without any work on our part such as crafting extra query clauses or coming up with a list of common tokens.
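To make the threshold concrete, here is a minimal sketch of the common/uncommon split described above, together with the shape of a match query body that sets cutoff_frequency. The document counts, the helper names, and the field name are all made up for illustration; this is not the code in this PR, and Elasticsearch does the real classification internally per shard.

```python
from typing import Dict, List, Tuple


def split_by_cutoff(
    doc_freqs: Dict[str, int], total_docs: int, cutoff_frequency: float
) -> Tuple[List[str], List[str]]:
    """Split terms into (uncommon, common) by the fraction of documents containing them."""
    uncommon, common = [], []
    for term, df in doc_freqs.items():
        # A term counts as "common" once it appears in more than the cutoff fraction of documents.
        if df / total_docs > cutoff_frequency:
            common.append(term)
        else:
            uncommon.append(term)
    return uncommon, common


def match_with_cutoff(field: str, text: str, cutoff_frequency: float = 0.01) -> dict:
    """Build a match query body that sets cutoff_frequency (the field name is a placeholder)."""
    return {
        "query": {
            "match": {
                field: {"query": text, "cutoff_frequency": cutoff_frequency}
            }
        }
    }


# Hypothetical per-term document counts in a 1,000,000-document shard.
doc_freqs = {"washington": 8_000, "university": 15_000, "st": 120_000, "louis": 3_000}
uncommon, common = split_by_cutoff(doc_freqs, total_docs=1_000_000, cutoff_frequency=0.01)
body = match_with_cutoff("name.default", "washington university st louis")
```

With these made-up counts and a cutoff of 0.01, "university" and "st" land in the common group, while "washington" and "louis" stay in the uncommon group that drives which documents get looked at first.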

In my testing of the difficult Washington University in St Louis query from pelias/schema#310, these changes have the following effect:

| | master | cutoff_frequency | change |
|---|---|---|---|
| Document Hits | 67603219 | 4336420 | -93.59% |
| Lowest Latency Seen | 294 | 82 | -72.11% |
| Highest Latency Seen | 420 | 136 | -67.62% |

This was using a full planet build for testing with 560 million documents. Before this change, a full 12% of the documents in the index were being searched.
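For anyone who wants to double-check the numbers, the percentages above can be reproduced with a few lines of arithmetic. This sketch assumes the index holds exactly 560,000,000 documents (the description only says "560 million"), so the share figures are approximate.

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent, rounded to two decimals."""
    return round((after - before) / before * 100, 2)


hits_master, hits_branch = 67_603_219, 4_336_420
total_docs = 560_000_000  # assumption: "560 million" taken as an exact count

hits_change = percent_change(hits_master, hits_branch)
low_latency_change = percent_change(294, 82)
high_latency_change = percent_change(420, 136)

# Share of the whole index that each branch ends up searching, in percent.
share_searched_master = round(hits_master / total_docs * 100, 2)
share_searched_branch = round(hits_branch / total_docs * 100, 2)
```

Under that assumption, master searches roughly 12% of the index for this query and the branch searches under 1%.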

Additionally, running a full set of acceptance tests showed exactly zero differences using this branch compared to master.

In addition to immediate performance improvements, this PR will allow us to make changes in the future to tokenize on whitespace, hyphens, or whatever we want, without fearing that extremely common terms (like street, avenue, etc.) will cause unacceptable query performance. Thus, it's basically a prerequisite for merging pelias/schema#310.

Early feedback on this PR, as well as on pelias/query#88, is appreciated.

References

https://www.elastic.co/guide/en/elasticsearch/reference/2.4/query-dsl-match-query.html#query-dsl-match-query-cutoff
https://www.elastic.co/guide/en/elasticsearch/guide/current/common-terms.html#common-terms
http://kempe.net/blog/2015/02/25/elasticsearch-query-full-text-search.html

orangejulius · 2018-10-24T16:58:16Z

In a review, @missinglink points out that using cutoff_frequency in combination with an "and" query doesn't accomplish much. This is a good point, and many of our existing queries use the "and" operator.

Many don't, however, such as some autocomplete queries. It seems that we're very inconsistent about its usage. One advantage of this PR is that we might be able to forgo the "and" operator in more cases where performance would previously have been prohibitive.

missinglink · 2018-10-26T09:18:03Z

I'm a little worried about this feature, just due to personal ignorance on the topic. I'd need to study it a bit more before I can give some considered feedback.

missinglink · 2019-01-14T08:08:36Z

I just rebased master and added some new commits which coincide with the changes I made in pelias/query (per-field settings for cutoff_frequency).

I also had to update the code in this branch to be compatible with the recent changes merged for boundary_gid.

Should be good to go now, pending editing package.json so it doesn't point to a git repo.

orangejulius · 2019-01-18T01:37:35Z

With those changes this branch has no acceptance-test regressions and appears to be ready to go. I am gathering some performance stats now before merging.

orangejulius · 2019-01-19T01:35:59Z

After a day of testing with real-world queries I can confirm this PR cuts off the long tail of slow search and structured search queries quite a bit!

In this chart, the master branch is the grey lines, and this branch is the white lines. There's a huge decrease in p99 latency for both endpoints. Autocomplete stays about the same, and reverse is of course not affected. So this is a good change :)

orangejulius force-pushed the cutoff_frequency branch 3 times, most recently from 567713a to 7ee62c3 Compare October 17, 2018 01:59

orangejulius requested a review from missinglink October 17, 2018 02:13

This was referenced Oct 17, 2018

Improved synonyms system pelias/schema#310

Merged

Synonyms for french addresses pelias/schema#301

Closed

orangejulius assigned missinglink Nov 5, 2018

orangejulius force-pushed the cutoff_frequency branch 2 times, most recently from c9c457b to 8955e36 Compare November 26, 2018 10:25

missinglink approved these changes Jan 14, 2019

missinglink force-pushed the cutoff_frequency branch from 23cdc25 to 8a3d675 Compare January 14, 2019 08:06

orangejulius mentioned this pull request Jan 14, 2019

Use cutoff_frequency everywhere pelias/query#88

Merged

orangejulius force-pushed the cutoff_frequency branch from 8a3d675 to eccef05 Compare January 18, 2019 01:14

orangejulius changed the title ~~WIP: cutoff_frequency~~ Utilize cutoff_frequency parameter in all queries Jan 18, 2019

feat(query) Set cutoff_frequency in all queries

05fe727

orangejulius force-pushed the cutoff_frequency branch from eccef05 to 05fe727 Compare January 18, 2019 03:15

orangejulius merged commit 5946ea5 into master Jan 19, 2019

orangejulius deleted the cutoff_frequency branch January 19, 2019 01:36

missinglink mentioned this pull request Apr 18, 2024

Deprecate cutoff_frequency for ES7 pelias/query#118

Closed
