Improved synonyms system #310

Merged
orangejulius merged 5 commits into master from synonyms on Jan 19, 2019

Conversation

@orangejulius
Member

This PR includes all the changes in #291 except for those which are Portland-specific and not generally applicable.

It makes the following changes:

  • Use whitespace to split tokens in the peliasStreet analyzer
  • Move synonym definitions out of dynamically created filters and into synonym files, greatly reducing the size/complexity of the schema itself
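For context, Elasticsearch synonym token filters can load their rules from a file in Solr synonym format, one rule per line. A street-suffix synonym file might contain entries like these (illustrative examples only, not the actual Pelias synonym list):

```text
# comma-separated tokens are treated as equivalent
st, street
ave, avenue
rd, road
```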

@missinglink
Member

I came across a relevant example today: 17217 SW STARBUCK LN in Portland. In this example, SW STARBUCK LN is a valid street, but STARBUCK LN is not a valid street anywhere in the world (as far as I could tell).

Maybe it's not an issue, because SW STARBUCK LN is potentially a valid match for STARBUCK LN.

@orangejulius
Member Author

It's been a while since I've thought about this issue. We really do need to decide whether or not to merge this soon. Can you expand on how your example is relevant? And is it an example of how this PR breaks things, how this PR helps, or something else?

orangejulius added a commit to pelias/docker that referenced this pull request Sep 26, 2018
has disable_fielddata and improved synonyms
(pelias/schema#310)
@orangejulius
Member Author

orangejulius commented Oct 12, 2018

I took a look at what sort of impact this branch would have by running a North America build.

TLDR: the change in query results is not too bad, but the impact on performance is noticeably bad.

Query result changes

Because we now tokenize on whitespace in the peliasStreet analyzer (used to index into and search against the address_parts.street field), we can now return matches that contain all of the tokens in the query, plus extras. Two examples of this, which are now failing acceptance tests, are:

/v1/search?text=49 kay street


This query now returns results from Kay Street as well as De Kay Street. This is listed as an "unexpected" result in our acceptance tests, but it does score lower than the exact matches, so I believe it's not a huge deal.

/v1/search?text=450 37th st, new york, ny 11232


This query now returns results for both 450 37th st and 450 West 37th St. Again, not a huge deal, and the more exact matches are shown first.
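To illustrate the mechanics (a toy model, not the real analyzer code): with whitespace tokenization and OR semantics, any shared token produces a match, so "De Kay Street" matches a query for "kay street" while scoring below the exact match.

```python
# Toy model of whitespace tokenization with OR ("should") match semantics.
# This is NOT the real peliasStreet analyzer -- just a sketch of why
# "De Kay Street" now matches a query for "kay street".

def tokenize(text):
    """Lowercase and split on whitespace, like a whitespace tokenizer."""
    return text.lower().split()

def matches(query, street):
    """With OR semantics, any shared token produces a match."""
    return bool(set(tokenize(query)) & set(tokenize(street)))

def score(query, street):
    """Very rough scoring: fraction of the street's tokens covered by the query."""
    q, s = set(tokenize(query)), set(tokenize(street))
    return len(q & s) / len(s)

# Both streets match, but the exact match covers a larger fraction of its
# tokens, so it sorts first -- mirroring the behavior described above.
print(matches("kay street", "Kay Street"))     # True
print(matches("kay street", "De Kay Street"))  # True
```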

Performance

However, as we feared, there is some obvious performance degradation from searching the index for common query terms like street. This only happens when there aren't enough other, uncommon terms to lower the total number of matched documents. A particularly slow query from the acceptance tests is actually a venue query:

/v1/search?text=Washington University in St Louis

This query takes over 3 seconds to complete on a c5.xlarge (4 core) instance with just North America data. (It takes 15-20 seconds with the Elasticsearch profile feature enabled!)

Looking at the Elasticsearch profiler output, it's fairly clear where the problem lies: the total count of matched documents for the query is 42,912,211, or about 1/4 of all the documents in a North America build!

Digging further, most parts of the query are very fast. For example, querying the name.phrase field for washington university in (st street) louis takes only 1.2 milliseconds. However, searching address_parts.street for just st and street (both are searched due to synonyms) took 6767ms on a single shard:

                           "query_type": "BooleanQuery",
                           "lucene": "address_parts.street:st address_parts.street:street",
                           "time": "6767.610199ms",

Here's the full profile output: https://gist.github.com/orangejulius/ecd43607a059fa383c499a7b49c185e6

Curiously, a query that was fairly fast is 5 avenue street. I think this is because there are actually very few records that match both avenue and street.

@orangejulius
Member Author

orangejulius commented Oct 12, 2018

Regarding performance, the "Washington University in St Louis" query really is a uniquely bad case.

Exploring other queries, even a query for "main st" is quite fast, matching 1.1M documents and taking around 180ms, or 300-400ms with profiling enabled (profiler output).

Furthermore, it's possible we can improve query performance significantly, without too much work, by detecting common tokens and writing queries in a way that helps Elasticsearch match fewer documents.

For example, in the university query, we ended up with a should query like the following:

        {
          "match": {
            "address_parts.street": {
              "analyzer": "peliasStreet",
              "boost": 5,
              "query": "Washington University in St"
            }
          }
        },

With an extra filter clause as follows, we can get much better results:

                  {
                    "match": {
                      "address_parts.street": {
                        "analyzer": "peliasStreet",
                        "boost": 5,
                        "query": "Washington University"
                      }
                    }
                  },

This basically tells Elasticsearch to only consider documents that match the "uncommon" tokens in the query, excluding a massive number of documents that only match the common tokens "in" and "st".

This reduces the number of matched documents to only 590,000 (down from 42 million!), and the query then takes only about 200ms.
Here's the full query and profile output:
university-street-with-common-token-filter-profile-output.txt

This sort of approach could give us the following benefits:

  • it allows us to detect expensive query terms and mitigate their cost, while still using them for matching
  • when the query consists entirely of common terms (for example, "avenue street"), we can keep the query as-is instead of returning no results, as would happen if we used Elasticsearch stop words or removed the expensive terms entirely
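The approach can be sketched as follows. The document frequencies, the cutoff value, and the helper names below are all invented for illustration; a real implementation would fetch term statistics from Elasticsearch.

```python
# Sketch of the "common token" idea: split query tokens by document frequency,
# then add a filter clause on the uncommon tokens so Elasticsearch never has to
# consider documents that only match common tokens like "in" or "st".
# The frequencies and the cutoff below are invented for illustration.

DOC_FREQ = {
    "washington": 0.002,
    "university": 0.001,
    "in": 0.30,
    "st": 0.25,
    "avenue": 0.12,
    "street": 0.28,
}
CUTOFF = 0.01  # hypothetical document-frequency threshold

def split_tokens(tokens):
    """Partition tokens into (uncommon, common) by document frequency."""
    uncommon = [t for t in tokens if DOC_FREQ.get(t, 0.0) < CUTOFF]
    common = [t for t in tokens if t not in uncommon]
    return uncommon, common

def build_street_clause(tokens):
    """Match on all tokens, but filter on the uncommon tokens when any exist."""
    clause = {"match": {"address_parts.street": {"query": " ".join(tokens)}}}
    uncommon, _ = split_tokens(tokens)
    if uncommon:
        return {"bool": {
            "must": clause,
            "filter": {"match": {"address_parts.street": {
                "query": " ".join(uncommon)}}},
        }}
    # all tokens are common (e.g. "avenue street"): keep the query as-is
    # rather than returning no results
    return clause

q = build_street_clause(["washington", "university", "in", "st"])
```

With the hypothetical frequencies above, the "Washington University in St" query gets a filter on just "washington university", while an all-common query like "avenue street" falls through to the original match clause.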

@orangejulius
Member Author

orangejulius commented Oct 16, 2018

I tested setting operator: and with the above query, and it didn't work! The query returned no results.

I also tried using what sounded like the perfect match query option, cutoff_frequency. However, it also didn't have any effect, even though it should have "automatically" handled high-frequency terms. It's very likely I didn't use it correctly. Here's a subquery I tried (it was in a should block):

        {
          "match": {
            "address_parts.street": {
              "analyzer": "peliasStreet",
              "cutoff_frequency": 0.01,
              "boost": 5,
              "operator": "and",
              "query": "Washington University in St"
            }
          }
        },

Update: It turns out I was using cutoff_frequency correctly, but in fact had to use it in the main name.default query clause as well to actually reduce the number of documents hit.

pelias/api#1213 is a PR to use cutoff_frequency everywhere since it looks very promising.
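Per the update above, the name.default clause would need the same treatment for the document count to actually drop; roughly like this (the analyzer is omitted, and the 0.01 cutoff is the same illustrative value used earlier):

```json
{
  "match": {
    "name.default": {
      "cutoff_frequency": 0.01,
      "query": "Washington University in St Louis"
    }
  }
}
```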

@orangejulius
Member Author

I've run a full planet build with these changes, and did not notice any new acceptance test failures. The only thing left before being able to merge this is to do a little bit of performance comparison, with and without pelias/api#1213 involved.

@Joxit
Member

Joxit commented Jan 12, 2019

Hi there,

We are using this PR (customised) on a full planet build, and it seems that we have no performance issues.
Here is the customised branch we use: joxit/pelias-schema/build-2018-11.

I ran the full acceptance tests on both the old prod (91.4%) and the new one (89.51%), but the errors are not relevant to our use-case.

BTW 👍 for ES5, works like a charm 😄

Member

@missinglink missinglink left a comment


I think it was good that we were originally apprehensive about merging this, but now is the time to merge it.

@Joxit has also confirmed that it's a positive change, so let's go ahead and pull the trigger on this.

Are there any final actions required in terms of unit test coverage, etc?

👍

@orangejulius
Member Author

Nothing left to do :)

This PR looks good, and it's time for it to be merged. Looking forward to building on top of it in the future!

orangejulius added a commit to pelias/docker that referenced this pull request Jan 23, 2019
Now that pelias/schema#310 has been merged,
using the `portland-synonyms` branch is no longer required.
orangejulius deleted the synonyms branch January 30, 2019 01:19
orangejulius added a commit to pelias/acceptance-tests that referenced this pull request Feb 21, 2019
Since pelias/schema#310, we do not require exact
matches on street names: extra tokens are now allowed.

This means that a search for 'Kay St' will return 'De Kay St', but 'Kay
St' should be returned first.
orangejulius added a commit to pelias/acceptance-tests that referenced this pull request Feb 21, 2019
Since pelias/schema#310, non-exact street
matches, where there are additional tokens in the street field, are now
allowed.

This means it's necessary to expand this test into two tests, to check that:
1.) the result with the correct zipcode comes first
2.) the result with an incorrect zipcode comes later
orangejulius added a commit to pelias/api that referenced this pull request Jun 19, 2019
Combined with pelias/query#101, this change
allows for improved matching of streets when using the
addressSearchUsingIDs query (the first query executed by the search
endpoint).

Essentially, by allowing a `slop` of 1 in the `match_phrase` clauses of
the Elasticsearch query, it's now possible to omit a single word _in the
middle_ of a street name and still match it. Omitting words at the start
or end of the street name has been supported since
pelias/schema#310.

Some examples of where this helps:

| Actual street name | Newly matching input text |
| ---                | ---                       |
| SE Martin Luther King Jr Blvd | Martin Luther King Blvd |
| Carrer de Balmes | Carrer Balmes |
orangejulius added a commit to pelias/api that referenced this pull request Jun 19, 2019
Combined with pelias/query#101, this change
allows for improved matching of streets when using the
addressSearchUsingIDs query (the first query executed by the search
endpoint).

Essentially, by allowing a `slop` of 1 in the `match_phrase` clauses of
the Elasticsearch query, its now possible to omit a single word _in the
middle_ of a street name and still match it. Omitting words at the start
or end of the street name has been supported since
pelias/schema#310.

Some examples of where this helps:

| Actual street name | Newly matching input text |
| ---                | ---                       |
| SE Martin Luther King Jr Blvd | Martin Luther King Blvd |
| Carrer de Balmes | Carrer Balmes |
orangejulius added a commit to pelias/api that referenced this pull request Oct 3, 2019
Combined with pelias/query#101, this change
allows for improved matching of streets when using the
addressSearchUsingIDs query (the first query executed by the search
endpoint).

Essentially, by allowing a `slop` of 1 in the `match_phrase` clauses of
the Elasticsearch query, its now possible to omit a single word _in the
middle_ of a street name and still match it. Omitting words at the start
or end of the street name has been supported since
pelias/schema#310.

Some examples of where this helps:

| Actual street name | Newly matching input text |
| ---                | ---                       |
| SE Martin Luther King Jr Blvd | Martin Luther King Blvd |
| Carrer de Balmes | Carrer Balmes |
orangejulius added a commit to pelias/api that referenced this pull request Oct 3, 2019
Combined with pelias/query#101, this change
allows for improved matching of streets when using the
addressSearchUsingIDs query (the first query executed by the search
endpoint).

Essentially, by allowing a `slop` of 1 in the `match_phrase` clauses of
the Elasticsearch query, it's now possible to omit a single word _in the
middle_ of a street name and still match it. Omitting words at the start
or end of the street name has been supported since
pelias/schema#310.

Some examples of where this helps:

| Actual street name | Newly matching input text |
| ---                | ---                       |
| SE Martin Luther King Jr Blvd | Martin Luther King Blvd |
| Carrer de Balmes | Carrer Balmes |

Since pelias/query#113 has been merged, this
commit will result in changes to how queries behave.
calpb pushed a commit to sorelle/docker that referenced this pull request Mar 29, 2021
Now that pelias/schema#310 has been merged,
using the `portland-synonyms` branch is no longer required.