---
@missinglink it might be useful to expose this as a per-request parameter.
---
Hi @mihneadb, I think it's better to leave it as configurable on startup. Having it variable per-request could cause quality-of-service issues for cloud providers like ourselves, as there would be huge variability in the CPU usage of each request. I think it's important that operations engineers can control this setting. You can adjust the settings for your Docker installation without generating a new Docker image by copying and editing the configuration file. If you're not familiar with docker-compose, you can look in the documentation. You'll then need to bring the API service down and up again to apply that configuration.
---
Makes sense, thanks!
---
Hi @missinglink, I'm testing this PR, and I found something regarding the parameters. We may need some benchmarks to weigh the pros and cons 🤔
---
Thanks @Joxit, interesting feedback; that's certainly unexpected behaviour.
Force-pushed from f0bbd61 to 2ff49d4.
---
So I tested this out a bit yesterday, and it's quite good! I do think your example makes sense, @Joxit. With `/v1/autocomplete?text=40 Rue De l arsenal, Bordeaux, France` … So I think we should try increasing the …
Connects pelias/schema#301
Connects pelias/api#1268
Connects pelias/api#1279
Force-pushed from 2ff49d4 to 6e528a1.
Some new query types were added since the fuzziness PR was created; this updates them with proper test expectations.
---
I just rebased this branch against the latest upstream changes.
---
Hi there, here are the changes I've made:

```diff
- 'fuzzy:prefix_length': 1,
- 'fuzzy:max_expansions': 10,
+ 'fuzzy:prefix_length': 0,
+ 'fuzzy:max_expansions': 25,
```

Source: Joxit/pelias-api jawg/v3.63.0-fuzzy. I will see if we have some performance issues or incorrect responses 😄.
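To get a rough feel for why dropping `prefix_length` from 1 to 0 is more expensive, here's a back-of-the-envelope sketch (not Pelias code, just an illustration) that counts the distinct terms within Levenshtein distance 1 of a query term, optionally holding a fixed prefix the way `prefix_length` does:

```javascript
// Rough model of the candidate terms Lucene could have to consider for a
// fuzziness-1 query. With prefixLength > 0, edits before that position are
// excluded, shrinking the candidate set considerably.
const ALPHABET = 'abcdefghijklmnopqrstuvwxyz'.split('');

function distanceOneVariants(word, prefixLength = 0) {
  const out = new Set();
  for (let i = prefixLength; i <= word.length; i++) {
    // insertions at position i
    for (const c of ALPHABET) out.add(word.slice(0, i) + c + word.slice(i));
    if (i < word.length) {
      // deletion of character i
      out.add(word.slice(0, i) + word.slice(i + 1));
      // substitution of character i
      for (const c of ALPHABET) out.add(word.slice(0, i) + c + word.slice(i + 1));
    }
  }
  out.delete(word); // distance 0 is not an edit
  return out;
}

console.log(distanceOneVariants('vannes', 0).size);
console.log(distanceOneVariants('vannes', 1).size);
```

The count with `prefixLength = 0` is strictly larger, since every variant allowed with a fixed first character is also allowed without one, plus all the first-character edits. This is only a model of the term space, not of Lucene's automaton, but it shows the direction of the tradeoff.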
---
I just found something that can be weird when we use fuzzy: perfect matches may not be in first positions. (This is completely random)

```json
{
  "function_score": {
    "query": {
      "match": {
        "name.default": {
          "analyzer": "peliasQueryPartialToken",
          "query": "vannes",
          "operator": "and"
        }
      }
    },
    "max_boost": 20,
    "functions": [
      { "weight": 1 }
    ],
    "score_mode": "first",
    "boost_mode": "replace"
  }
}
```
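One common way to address exact matches losing to fuzzy ones (a sketch only, not something this PR implements) is to pair the fuzzy clause with a non-fuzzy `should` clause, so documents matching the exact terms pick up extra score instead of being tied with fuzzy matches. Field and analyzer names mirror the example query above; the helper and boost value are hypothetical:

```javascript
// Hypothetical helper (not part of pelias/api): wrap a fuzzy match in a
// bool query whose non-fuzzy "should" clause adds score only for exact hits.
function fuzzyWithExactBoost(text) {
  return {
    bool: {
      must: [{
        match: {
          'name.default': {
            analyzer: 'peliasQueryPartialToken',
            query: text,
            operator: 'and',
            fuzziness: 1
          }
        }
      }],
      should: [{
        match: {
          'name.default': {
            analyzer: 'peliasQueryPartialToken',
            query: text,
            operator: 'and',
            boost: 5 // hypothetical value, needs tuning
          }
        }
      }]
    }
  };
}

console.log(JSON.stringify(fuzzyWithExactBoost('vannes'), null, 2));
```

Note that in the `function_score` above, `boost_mode: "replace"` discards the query score entirely, so a boost like this only helps where the surrounding query actually combines scores.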
---
Hi @missinglink, is there a way I could help with this PR?
---
Yes, please build a docker project from pelias/docker and compare the differences in result quality between the two branches. Let us know what's better/worse.
---
This branch also needs a rebase.
---
The issue is that I already migrated to the version with Elastic 7, so testing it out won't be a quick one either.
---
I'm not sure I follow; why would that be an issue? Are you saying the queries are not compatible with ES7?
---
I get this error when running a query:
---
Hi, any progress with this PR? I believe it's a very interesting piece of functionality.
---
Any news on this? Would love to see this implemented. Any issues preventing it from being merged?
---
The technical answer is that fuzziness increases recall at the cost of precision, and this alone does not increase relevance.

The main issue is the way Lucene implements edit-distance queries (what it calls fuzziness): the alternative terms are given the same score as exact matches. This is obviously not ideal, since you can spell a word perfectly and still have spelling mistakes ranked higher due to other properties influencing the score.

This is further complicated by the complexity of our queries; often a single query contains multiple sub-queries, each of which targets a different field of the document. So if you set an edit-distance of 2 (for instance) then it's not clear how many total edits will be made to the input text. For instance, if the parser detected 4 different classifications (eg. number, street, locality, country) in the input, then the text may be mutated up to 8 times!

I believe that the result of these two issues will only serve to slow down existing queries while also reducing precision, so it's lose-lose. I'd be happy to find a solution which provides the increased recall without the ranking issues.
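For intuition on what "edit distance" means here, a standard Levenshtein implementation (a plain illustration, not Pelias code) shows how a fuzziness of 1 covers a single added, removed or replaced character:

```javascript
// Classic dynamic-programming Levenshtein distance between two strings.
function levenshtein(a, b) {
  // dp[i][j] = edits to turn the first i chars of a into the first j chars of b
  const dp = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,                                  // deletion
        dp[i][j - 1] + 1,                                  // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1) // substitution
      );
    }
  }
  return dp[a.length][b.length];
}

console.log(levenshtein('vannes', 'vanes'));  // one deletion
console.log(levenshtein('vannes', 'cannes')); // one substitution
console.log(levenshtein('vannes', 'nantes'));
```

With four parsed classifications each allowed an edit distance of 2, the worst case is 4 × 2 = 8 character edits across the whole input, which is where the "mutated up to 8 times" figure comes from.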
---
Hello.
This PR is WIP and I wouldn't suggest merging it as-is; it's open for testing, discussion and feedback.

One confusing and under-documented feature of the elasticsearch `match` query is that it supports `type: phrase` and also `fuzziness: *`, but not both at the same time 🤷‍♀️ There are unfortunately no warnings about this, which is what's contributing to the confusion; it might be the same for `cutoff_frequency` and `phrase`.

So this PR simply rewrites the top `MUST` query for most queries generated for autocomplete to remove `phrase` and enable `fuzziness`.

I've chosen conservative values for `fuzziness` in order to avoid increasing the CPU usage significantly. A `fuzziness` of 1 means that the Levenshtein edit distance is 1, so a single character can be added, removed or replaced.

The `prefix_length` is set to 1, meaning the first character is not considered for edits. The default is 0, which would obviously generate a lot more permutations and require a lot more CPU; the tradeoff is that if someone mistypes the first letter then they're doomed from the get-go.

Lastly, `max_expansions` is set to 10. I'm not 100% sure what the correct setting for this should be, so I again chose something conservative; my understanding is that only a maximum of 10 tokens in the index will be used to generate an 'OR' type condition.

From what I've read online, the vast majority of typos are Levenshtein 1 and rarely on the first character, so this should catch a wide range of typing errors without increasing the CPU requirements too much.
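Putting those three settings together, the rewritten `MUST` clause looks roughly like this (the field, analyzer and query text are illustrative, not copied from the PR):

```json
{
  "match": {
    "name.default": {
      "analyzer": "peliasQueryPartialToken",
      "query": "exmple",
      "operator": "and",
      "fuzziness": 1,
      "prefix_length": 1,
      "max_expansions": 10
    }
  }
}
```

Here `"exmple"` could still match documents containing `example` (one insertion), but a typo in the first character, e.g. `wxample`, would not match, because of `prefix_length: 1`.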
Things I would like to change before merging:
- The other `phrase` queries now have `'fuzziness': 1` which, as I explained above, will do nothing and only serves to confuse.
- Consider the implications of not using `phrase` and whether it and `slop` are even required here.
- Move the `autocomplete_defaults` values out of the `phrase:*` namespace, maybe into a `fuzzy:*` namespace?