Utilize cutoff_frequency parameter in all queries #1213
In a review @missinglink points out that using Many don't however, such as some autocomplete queries. It seems that we're very inconsistent about its usage. One advantage of this PR is that we might be able to forgo usage of the
I'm a little worried about this feature, purely because I don't know the topic well. I'd need to study it a bit more before I can give considered feedback.
I just rebased on master and added some new commits which coincide with the changes I made in I also had to update the code in this branch to be compatible with the recent changes merged for Should be good to go now, pending editing
With those changes, this branch has no acceptance-test regressions and appears to be ready to go. I am gathering some performance stats now before merging.

This PR adds use of the Elasticsearch cutoff_frequency feature across nearly all of our query clauses.
This feature helps with performance for queries that could potentially match lots of documents by first looking only at documents that match the uncommon terms in the query.
It does this in a way that has no effect on the end result, but massively reduces the number of documents that are scored.
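To make the idea concrete, here is a toy model in Python. It is not how Elasticsearch implements the feature: the in-memory index, the sample documents, the helper names and the 0.5 cutoff are all made up for illustration. It only counts how many documents would be candidates for scoring, and it does not model scoring itself or the case where every term in the query is common.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercased token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def candidate_docs(index, query, total_docs, cutoff):
    """Return (docs matching any term, docs matching at least one rare term).

    A term counts as rare when the fraction of documents containing it is
    at or below `cutoff`. Both names are hypothetical, for this sketch only.
    """
    terms = query.lower().split()
    rare_terms = [
        t for t in terms if len(index.get(t, ())) / total_docs <= cutoff
    ]
    all_hits = set().union(*(index.get(t, set()) for t in terms))
    rare_hits = set().union(*(index.get(t, set()) for t in rare_terms))
    return all_hits, rare_hits


# Made-up sample data: "street" appears in 5 of 6 documents.
docs = {
    1: "Main Street",
    2: "Oak Street",
    3: "Elm Street",
    4: "Washington Street",
    5: "Washington University",
    6: "Pine Street",
}
index = build_index(docs)
all_hits, rare_hits = candidate_docs(index, "Washington Street", len(docs), 0.5)
print(len(all_hits), len(rare_hits))  # 6 2
```

In this toy data, matching on every term touches all six documents, while starting from the rare term alone narrows the candidates to two.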
My understanding of the meaning of the parameter is as follows: what fraction of the documents in a shard must contain this term before it is treated as a filtering term, rather than a scoring term? The value most commonly seen in documentation is 0.01, meaning any term that is in more than 1% of documents will be used as a filter. It's worth testing different values to see which one gives the best performance, but my guess is there aren't huge wins to be had from extensive tuning of this variable (I'd love to be proven wrong).
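For reference, this is roughly where the parameter sits in a match clause, following the shape shown in the Elasticsearch 2.4 match query documentation linked under References. It is only a sketch: the helper name and the `name.default` field are assumptions for illustration, not necessarily what this PR generates.

```python
import json


def match_with_cutoff(field, text, cutoff_frequency=0.01):
    """Build a match clause for one field with cutoff_frequency set.

    `match_with_cutoff` is a hypothetical helper; the 0.01 default is the
    value most often seen in documentation, not a tuned setting.
    """
    return {
        "match": {
            field: {
                "query": text,
                "cutoff_frequency": cutoff_frequency,
            }
        }
    }


# "name.default" is an assumed field name used only for this example.
body = {
    "query": match_with_cutoff("name.default", "Washington University in St Louis")
}
print(json.dumps(body, indent=2))
```

Trying a different value is then a one-argument change, which makes it easy to compare a few cutoffs while gathering performance numbers.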
Essentially, cutoff_frequency performs the "pseudo-stopword" handling that I suggested in pelias/schema#310 (comment), except it does it automatically, without any work on our part such as crafting extra query clauses or coming up with a list of common tokens.
In my testing of the difficult Washington University in St Louis query from pelias/schema#310, these changes have the following effect:
This was using a full planet build for testing with 560 million documents. Before this change, a full 12% of the documents in the index were being searched.
Additionally, running a full set of acceptance tests showed exactly zero differences using this branch compared to master.
In addition to immediate performance improvements, this PR will allow us to make changes in the future to tokenize on whitespace, hyphens, or whatever we want, without fearing that extremely common terms (like street, avenue, etc.) will cause unacceptable query performance. Thus, it's basically a prerequisite for merging pelias/schema#310.
Early feedback on this PR as well as pelias/query#88 appreciated.
References
https://www.elastic.co/guide/en/elasticsearch/reference/2.4/query-dsl-match-query.html#query-dsl-match-query-cutoff
https://www.elastic.co/guide/en/elasticsearch/guide/current/common-terms.html#common-terms
http://kempe.net/blog/2015/02/25/elasticsearch-query-full-text-search.html