Skip to content

Utilize cutoff_frequency parameter in all queries#1213

Merged
orangejulius merged 1 commit intomasterfrom
cutoff_frequency
Jan 19, 2019
Merged

Utilize cutoff_frequency parameter in all queries#1213
orangejulius merged 1 commit intomasterfrom
cutoff_frequency

Conversation

@orangejulius
Copy link
Copy Markdown
Member

@orangejulius orangejulius commented Oct 17, 2018

This PR adds use of the Elasticsearch cutoff_frequency feature across nearly all of our query clauses.

This feature helps with performance for queries that could potentially match lots of documents by only looking at documents that match uncommon terms first.

It does this in a way that has no effect on the end result, but massively reduces the number of documents that are scored.

My understanding of the meaning of the parameter is as follows: what fraction of the documents in a shard must contain this term before it is treated as a filtering term, rather than a scoring term? The value most commonly seen in documentation is 0.01, meaning any term that is in more than 1% of documents will be used as a filter. It's worth testing different values to see which one gives the best performance, but my guess is there aren't huge wins to be had from extensive tuning of this variable (I'd love to be proven wrong).

Essentially, cutoff_frequency performs the "pseudo-stopword" handling that I suggested in pelias/schema#310 (comment), except it does it automatically, without any work on our part such as crafting extra query clauses or coming up with a list of common tokens.

In my testing of the difficult Washington University in St Louis query from pelias/schema#310, these changes have the following effect:

  master cutoff_frequency change
Document Hits 67603219 4336420 -93.59%
Lowest Latency Seen 294 82 -72.11%
Highest Latency Seen 420 136 -67.62%

This was using a full planet build for testing with 560 million documents. Before this change, a full 12% of the documents in the index were being searched.

Additionally, running a full set of acceptance tests showed exactly zero differences using this branch compared to master.

In addition to immediate performance improvements, this PR will allow us to make changes in the future to tokenize on whitespace, hyphens, or whatever we want, without fearing that extremely common terms (like Street, avenue, etc) will cause unacceptable query performance. Thus, it's basically a pre-requisite for merging pelias/schema#310.

Early feedback on this PR as well as pelias/query#88 appreciated

References

https://www.elastic.co/guide/en/elasticsearch/reference/2.4/query-dsl-match-query.html#query-dsl-match-query-cutoff
https://www.elastic.co/guide/en/elasticsearch/guide/current/common-terms.html#common-terms
http://kempe.net/blog/2015/02/25/elasticsearch-query-full-text-search.html

@orangejulius
Copy link
Copy Markdown
Member Author

In a review @missinglink points out that using cutoff_frequency in combination with an and query doesn't accomplish much. This is a good point, and many of our existing queries use the and operator.

Many don't however, such as some autocomplete queries. It seems that we're very inconsistent about it's usage. One advantage of this PR is that we might be able to forgo usage of the and operator in more cases, where performance previously would have been prohibitive.

@missinglink
Copy link
Copy Markdown
Member

I'm a little worried about this feature just due to personal ignorance on the topic, I'd need to study it a bit more before I can give some considered feedback.

@orangejulius orangejulius force-pushed the cutoff_frequency branch 2 times, most recently from c9c457b to 8955e36 Compare November 26, 2018 10:25
@missinglink
Copy link
Copy Markdown
Member

missinglink commented Jan 14, 2019

I just rebased master and added some new commits which coincide with the changes I made in pelias/query (per-field settings for cutoff_frequency).

I also had to update the code in this branch to be compatible with the recent changes merged for boundary_gid.

Should be good to go now, pending editing package.json so it doesn't point to a git repo.

@orangejulius orangejulius changed the title WIP: cutoff_frequency Utilize cutoff_frequency parameter in all queries Jan 18, 2019
@orangejulius
Copy link
Copy Markdown
Member Author

With those changes this branch has no acceptance-test regressions and appears to be ready to go. I am gathering some performance stats now before merging.

@orangejulius
Copy link
Copy Markdown
Member Author

After a day of testing with real-world queries I can confirm this PR cuts off the long tail of slow search and structured search queries quite a bit!

In this chart, the master branch is the grey lines, and this branch is the white lines. There's a huge decrease in p99 latency for both endpoints. Autocomplete stays about the same, and reverse is of course not affected. So this is a good change :)
screenshot_2019-01-18_17-34-24

@orangejulius orangejulius merged commit 5946ea5 into master Jan 19, 2019
@orangejulius orangejulius deleted the cutoff_frequency branch January 19, 2019 01:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants