---
@missinglink it might be useful to expose this as a per-request parameter.
---
Hi @mihneadb, I think it's better to leave it as configurable on startup. Having it variable per-request could cause quality-of-service issues for cloud providers like ourselves, as there would be huge variability in the CPU usage of each request. I think it's important that operations engineers can control this setting. You can adjust the settings for your Docker installation without generating a new Docker image by copying and editing the configuration file. If you're not familiar with docker-compose, you can look in the documentation. You'll then need to bring the API service down and up again to apply that configuration.
---
Makes sense, thanks!
---
Hi @missinglink, I'm testing this PR, and I found something regarding the parameters. We may need some benchmarks to weigh the pros and cons 🤔
---
Thanks @Joxit, interesting feedback; that's certainly unexpected behaviour.
Force-pushed from f0bbd61 to 2ff49d4.
---
So I tested this out a bit yesterday, and it's quite good! I do think your example makes sense, @Joxit. With `/v1/autocomplete?text=40 Rue De l arsenal, Bordeaux, France` … So I think we should try increasing the …
Connects pelias/schema#301
Connects pelias/api#1268
Connects pelias/api#1279
Force-pushed from 2ff49d4 to 6e528a1.
Some new query types were added since the fuzziness PR was created; this updates them with proper test expectations.
---
I just rebased this branch against the latest upstream changes.
---
Hi there, here are the changes I've made:

```diff
- 'fuzzy:prefix_length': 1,
- 'fuzzy:max_expansions': 10,
+ 'fuzzy:prefix_length': 0,
+ 'fuzzy:max_expansions': 25,
```

Source: Joxit/pelias-api jawg/v3.63.0-fuzzy. I will see if we have some performance issues or incorrect responses 😄.
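To get a rough feel for why dropping `prefix_length` from 1 to 0 is more expensive, here's a back-of-the-envelope sketch (not Pelias code, just an illustration) that counts the distinct terms within Levenshtein distance 1 of a query term, optionally holding a fixed prefix the way `prefix_length` does:

```javascript
// Rough model of the candidate terms Lucene could have to consider for a
// fuzziness-1 query. With prefixLength > 0, edits before that position are
// excluded, shrinking the candidate set considerably.
const ALPHABET = 'abcdefghijklmnopqrstuvwxyz'.split('');

function distanceOneVariants(word, prefixLength = 0) {
  const out = new Set();
  for (let i = prefixLength; i <= word.length; i++) {
    // insertions at position i
    for (const c of ALPHABET) out.add(word.slice(0, i) + c + word.slice(i));
    if (i < word.length) {
      // deletion of character i
      out.add(word.slice(0, i) + word.slice(i + 1));
      // substitution of character i
      for (const c of ALPHABET) out.add(word.slice(0, i) + c + word.slice(i + 1));
    }
  }
  out.delete(word); // distance 0 is not an edit
  return out;
}

console.log(distanceOneVariants('vannes', 0).size);
console.log(distanceOneVariants('vannes', 1).size);
```

The count with `prefixLength = 0` is strictly larger, since every variant allowed with a fixed first character is also allowed without one, plus all the first-character edits. This is only a model of the term space, not of Lucene's automaton, but it shows the direction of the tradeoff.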
---
I just found something that can be weird when we use fuzzy: perfect matches may not be in first positions. (This is completely random)

```json
{
  "function_score": {
    "query": {
      "match": {
        "name.default": {
          "analyzer": "peliasQueryPartialToken",
          "query": "vannes",
          "operator": "and"
        }
      }
    },
    "max_boost": 20,
    "functions": [
      { "weight": 1 }
    ],
    "score_mode": "first",
    "boost_mode": "replace"
  }
}
```
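One common way to address exact matches losing to fuzzy ones (a sketch only, not something this PR implements) is to pair the fuzzy clause with a non-fuzzy `should` clause, so documents matching the exact terms pick up extra score instead of being tied with fuzzy matches. Field and analyzer names mirror the example query above; the helper and boost value are hypothetical:

```javascript
// Hypothetical helper (not part of pelias/api): wrap a fuzzy match in a
// bool query whose non-fuzzy "should" clause adds score only for exact hits.
function fuzzyWithExactBoost(text) {
  return {
    bool: {
      must: [{
        match: {
          'name.default': {
            analyzer: 'peliasQueryPartialToken',
            query: text,
            operator: 'and',
            fuzziness: 1
          }
        }
      }],
      should: [{
        match: {
          'name.default': {
            analyzer: 'peliasQueryPartialToken',
            query: text,
            operator: 'and',
            boost: 5 // hypothetical value, needs tuning
          }
        }
      }]
    }
  };
}

console.log(JSON.stringify(fuzzyWithExactBoost('vannes'), null, 2));
```

Note that in the `function_score` above, `boost_mode: "replace"` discards the query score entirely, so a boost like this only helps where the surrounding query actually combines scores.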
---
Hi @missinglink, is there a way I could help with this PR?
---
Yes, please build a docker project from pelias/docker and compare the differences in result quality between the two branches. Let us know what's better/worse.
---
This branch also needs a rebase.
---
The issue is that I already migrated to the version with Elastic 7, so testing it out won't be a quick one either.
---
I'm not sure I follow; why would that be an issue? Are you saying the queries are not compatible with ES7?
---
I get this error when running a query:
---
Hi, any progress with this PR? I believe it's a very interesting piece of functionality.
---
Any news on this? Would love to see this implemented. Any issues preventing it from being merged?
---
The technical answer is that fuzziness increases recall at the cost of precision, and this alone does not increase relevance.

The main issue is the way Lucene implements edit-distance queries (what it calls fuzziness): the alternative terms are given the same score as exact matches. This is obviously not ideal, since you can spell a word perfectly and still have spelling mistakes ranked higher due to other properties influencing the score.

This is further complicated by the complexity of our queries; often a single query contains multiple sub-queries, each of which targets a different field of the document. So if you set an edit-distance of 2 (for instance) then it's not clear how many total edits will be made to the input text. For instance, if the parser detected 4 different classifications (eg. number, street, locality, country) in the input, then the text may be mutated up to 8 times!

I believe that the result of these two issues will only serve to slow down existing queries while also reducing precision, so it's lose-lose. I'd be happy to find a solution which provides the increased recall without the ranking issues.
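For intuition on what "edit distance" means here, a standard Levenshtein implementation (a plain illustration, not Pelias code) shows how a fuzziness of 1 covers a single added, removed or replaced character:

```javascript
// Classic dynamic-programming Levenshtein distance between two strings.
function levenshtein(a, b) {
  // dp[i][j] = edits to turn the first i chars of a into the first j chars of b
  const dp = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,                                  // deletion
        dp[i][j - 1] + 1,                                  // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1) // substitution
      );
    }
  }
  return dp[a.length][b.length];
}

console.log(levenshtein('vannes', 'vanes'));  // one deletion
console.log(levenshtein('vannes', 'cannes')); // one substitution
console.log(levenshtein('vannes', 'nantes'));
```

With four parsed classifications each allowed an edit distance of 2, the worst case is 4 × 2 = 8 character edits across the whole input, which is where the "mutated up to 8 times" figure comes from.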
---
Hello.
This PR is WIP and I wouldn't suggest merging it as-is; it's open for testing, discussion and feedback.

One confusing and under-documented feature of the elasticsearch `match` query is that it supports `type: phrase` and also `fuzziness: *`, but not both at the same time 🤷‍♀️ There are unfortunately no warnings about this, which is what's contributing to the confusion; it might be the same for `cutoff_frequency` and `phrase`.

So this PR simply rewrites the top `MUST` query for most queries generated for autocomplete to remove `phrase` and enable `fuzziness`.

I've chosen conservative values for `fuzziness` in order to avoid increasing the CPU usage significantly. A `fuzziness` of 1 means that the Levenshtein edit distance is 1, so a single character can be added, removed or replaced.

The `prefix_length` is set to 1, meaning the first character is not considered for edits. The default is 0, which would obviously generate a lot more permutations and require a lot more CPU; the tradeoff is that if someone mistypes the first letter then they're doomed from the get-go.

Lastly, `max_expansions` is set to 10. I'm not 100% sure what the correct setting for this should be, so I again chose something conservative; my understanding is that only a maximum of 10 tokens in the index will be used to generate an 'OR' type condition.

From what I've read online, the vast majority of typos are Levenshtein 1 and rarely on the first character, so this should catch a wide range of typing errors without increasing the CPU requirements too much.
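Putting those three settings together, the rewritten `MUST` clause looks roughly like this (the field, analyzer and query text are illustrative, not copied from the PR):

```json
{
  "match": {
    "name.default": {
      "analyzer": "peliasQueryPartialToken",
      "query": "exmple",
      "operator": "and",
      "fuzziness": 1,
      "prefix_length": 1,
      "max_expansions": 10
    }
  }
}
```

Here `"exmple"` could still match documents containing `example` (one insertion), but a typo in the first character, e.g. `wxample`, would not match, because of `prefix_length: 1`.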
Things I would like to change before merging:
- The other `phrase` queries now have `'fuzziness': 1` which, as I explained above, will do nothing and only serves to confuse.
- Consider the implications of not using `phrase` and whether it and `slop` are even required here.
- Move the `autocomplete_defaults` values out of the `phrase:*` namespace, maybe into a `fuzzy:*` namespace?