Improved synonyms system #310

Merged
orangejulius merged 5 commits into master from synonyms on Jan 19, 2019

Conversation

@orangejulius
Member

This PR includes all the changes in #291 except for those which are Portland-specific and not generally applicable.

It makes the following changes:

  • Use whitespace to split tokens in the peliasStreet analyzer
  • Move synonym definitions out of dynamically created filters and into synonym files, greatly reducing the size/complexity of the schema itself
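For context, Elasticsearch synonym token filters can load their rules from a file in Solr synonym format, one rule per line. A street-suffix synonym file might contain entries like these (illustrative examples only, not the actual Pelias synonym list):

```text
# comma-separated tokens are treated as equivalent
st, street
ave, avenue
rd, road
```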

@missinglink
Member

I came across a relevant example today: 17217 SW STARBUCK LN in Portland. In this example, SW STARBUCK LN is a valid street, but STARBUCK LN is not a valid street anywhere in the world (as far as I could tell).

Maybe it's not an issue, because SW STARBUCK LN is potentially a valid match for STARBUCK LN.

@orangejulius
Member Author

It's been a while since I've thought about this issue. We really do need to decide whether or not to merge this soon. Can you expand on how your example is relevant? And is it an example of how this PR breaks things, how this PR helps, or something else?

orangejulius added a commit to pelias/docker that referenced this pull request Sep 26, 2018
has disable_fielddata and improved synonyms
(pelias/schema#310)
@orangejulius
Member Author

orangejulius commented Oct 12, 2018

I took a look at what sort of impact this branch would have by running a North America build.

TLDR: the change in query results is not too bad, but the impact on performance is noticeably bad.

Query result changes

Because we now tokenize on whitespace in the peliasStreet analyzer (used to index into and search against the address_parts.street field), we can now return matches that contain all of the tokens in the query, plus extras. Two examples of this, which are now failing acceptance tests, are:

/v1/search?text=49 kay street


This query now returns results from Kay Street as well as De Kay Street. This is listed as an "unexpected" result in our acceptance tests, but it does score lower than the exact matches, so I believe it's not a huge deal.

/v1/search?text=450 37th st, new york, ny 11232


This query now returns results for both 450 37th st and 450 West 37th St. Again, not a huge deal, and the more exact matches are shown first.
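To illustrate the mechanics (a toy model, not the real analyzer code): with whitespace tokenization and OR semantics, any shared token produces a match, so "De Kay Street" matches a query for "kay street" while scoring below the exact match.

```python
# Toy model of whitespace tokenization with OR ("should") match semantics.
# This is NOT the real peliasStreet analyzer -- just a sketch of why
# "De Kay Street" now matches a query for "kay street".

def tokenize(text):
    """Lowercase and split on whitespace, like a whitespace tokenizer."""
    return text.lower().split()

def matches(query, street):
    """With OR semantics, any shared token produces a match."""
    return bool(set(tokenize(query)) & set(tokenize(street)))

def score(query, street):
    """Very rough scoring: fraction of the street's tokens covered by the query."""
    q, s = set(tokenize(query)), set(tokenize(street))
    return len(q & s) / len(s)

# Both streets match, but the exact match covers a larger fraction of its
# tokens, so it sorts first -- mirroring the behavior described above.
print(matches("kay street", "Kay Street"))     # True
print(matches("kay street", "De Kay Street"))  # True
```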

Performance

However, as we feared, there is some obvious performance degradation from searching the index for common query terms like street. This only happens when there aren't enough other, uncommon terms to lower the total number of matched documents. A particularly slow query from the acceptance tests is actually a venue query:

/v1/search?text=Washington University in St Louis

This query takes over 3 seconds to complete on a c5.xlarge (4 core) instance with just North America data. (It takes 15-20 seconds with the Elasticsearch profile feature enabled!)

Looking at the Elasticsearch profiler output, it's fairly clear where the problem lies: the total count of matched documents for the query is 42,912,211, or about 1/4 of all the documents in a North America build!

Digging further, most parts of the query are very fast. For example, querying the name.phrase field for washington university in (st street) louis takes only 1.2 milliseconds. However, searching address_parts.street for just st and street (both are searched due to synonyms) took 6767ms on a single shard:

                           "query_type": "BooleanQuery",
                           "lucene": "address_parts.street:st address_parts.street:street",
                           "time": "6767.610199ms",

Here's the full profile output: https://gist.github.com/orangejulius/ecd43607a059fa383c499a7b49c185e6

Curiously, a query that was fairly fast is 5 avenue street. I think this is because there are actually very few records that match both avenue and street.

@orangejulius
Member Author

orangejulius commented Oct 12, 2018

Regarding performance, the "Washington University in St Louis" query really is a uniquely bad case.

Exploring other queries, even a query for "main st" is quite fast, matching 1.1M documents and taking around 180ms, or 300-400ms with profiling enabled (profiler output).

Furthermore, it's possible we can improve query performance significantly, without too much work, by detecting common tokens and writing queries in a way that helps Elasticsearch match fewer documents.

For example, in the university query, we ended up with a should query like the following:

        {
          "match": {
            "address_parts.street": {
              "analyzer": "peliasStreet",
              "boost": 5,
              "query": "Washington University in St"
            }
          }
        },

With an extra filter clause as follows, we can get much better results:

                  {
                    "match": {
                      "address_parts.street": {
                        "analyzer": "peliasStreet",
                        "boost": 5,
                        "query": "Washington University"
                      }
                    }
                  },

This basically tells Elasticsearch to only consider documents that match the "uncommon" tokens in the query, excluding a massive number of documents that only match the common tokens "in" and "st".

This reduces the number of matched documents to only 590,000 (down from 42 million!), and the query then takes only about 200ms.
Here's the full query and profile output:
university-street-with-common-token-filter-profile-output.txt

This sort of approach could give us the following benefits:

  • it allows us to detect expensive query terms and mitigate their cost, while still using them for matching
  • when the query consists entirely of common terms (for example, "avenue street"), we can keep the query as-is instead of returning no results, as would happen if we used Elasticsearch stop words or removed the expensive terms entirely
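The approach can be sketched as follows. The document frequencies, the cutoff value, and the helper names below are all invented for illustration; a real implementation would fetch term statistics from Elasticsearch.

```python
# Sketch of the "common token" idea: split query tokens by document frequency,
# then add a filter clause on the uncommon tokens so Elasticsearch never has to
# consider documents that only match common tokens like "in" or "st".
# The frequencies and the cutoff below are invented for illustration.

DOC_FREQ = {
    "washington": 0.002,
    "university": 0.001,
    "in": 0.30,
    "st": 0.25,
    "avenue": 0.12,
    "street": 0.28,
}
CUTOFF = 0.01  # hypothetical document-frequency threshold

def split_tokens(tokens):
    """Partition tokens into (uncommon, common) by document frequency."""
    uncommon = [t for t in tokens if DOC_FREQ.get(t, 0.0) < CUTOFF]
    common = [t for t in tokens if t not in uncommon]
    return uncommon, common

def build_street_clause(tokens):
    """Match on all tokens, but filter on the uncommon tokens when any exist."""
    clause = {"match": {"address_parts.street": {"query": " ".join(tokens)}}}
    uncommon, _ = split_tokens(tokens)
    if uncommon:
        return {"bool": {
            "must": clause,
            "filter": {"match": {"address_parts.street": {
                "query": " ".join(uncommon)}}},
        }}
    # all tokens are common (e.g. "avenue street"): keep the query as-is
    # rather than returning no results
    return clause

q = build_street_clause(["washington", "university", "in", "st"])
```

With the hypothetical frequencies above, the "Washington University in St" query gets a filter on just "washington university", while an all-common query like "avenue street" falls through to the original match clause.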

@orangejulius
Member Author

orangejulius commented Oct 16, 2018

I tested setting operator: and with the above query, and it didn't work! The query returned no results.

I also tried using what sounded like the perfect match query option, cutoff_frequency. However, it also didn't have any effect, even though it should have "automatically" handled high-frequency terms. It's very likely I didn't use it correctly. Here's a subquery I tried (it was in a should block):

        {
          "match": {
            "address_parts.street": {
              "analyzer": "peliasStreet",
              "cutoff_frequency": 0.01,
              "boost": 5,
              "operator": "and",
              "query": "Washington University in St"
            }
          }
        },

Update: It turns out I was using cutoff_frequency correctly, but in fact had to use it in the main name.default query clause as well to actually reduce the number of documents hit.

pelias/api#1213 is a PR to use cutoff_frequency everywhere since it looks very promising.
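Per the update above, the name.default clause would need the same treatment for the document count to actually drop; roughly like this (the analyzer is omitted, and the 0.01 cutoff is the same illustrative value used earlier):

```json
{
  "match": {
    "name.default": {
      "cutoff_frequency": 0.01,
      "query": "Washington University in St Louis"
    }
  }
}
```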

@orangejulius
Member Author

I've run a full planet build with these changes, and did not notice any new acceptance test failures. The only thing left before being able to merge this is to do a little bit of performance comparison, with and without pelias/api#1213 involved.

@Joxit
Member

Joxit commented Jan 12, 2019

Hi there,

We are using this PR (customised) on a full planet build, and it seems that we have no performance issues.
Here is the customised branch we use: joxit/pelias-schema/build-2018-11.

I ran the full acceptance tests on both the old prod (91.4%) and the new one (89.51%), but the errors are not relevant to our use-case.

BTW 👍 for ES5, works like a charm 😄

Member

@missinglink missinglink left a comment


I think it was good that we were originally apprehensive about merging this, but now is the time to merge it.

@Joxit has also confirmed that it's a positive change, so let's go ahead and pull the trigger on this.

Are there any final actions required in terms of unit test coverage, etc?

👍

@orangejulius
Member Author

Nothing left to do :)

This PR looks good, and it's time for it to be merged. Looking forward to building on top of it in the future!

orangejulius added a commit to pelias/docker that referenced this pull request Jan 23, 2019
Now that pelias/schema#310 has been merged,
using the `portland-synonyms` branch is no longer required.
orangejulius deleted the synonyms branch January 30, 2019 01:19
orangejulius added a commit to pelias/acceptance-tests that referenced this pull request Feb 21, 2019
Since pelias/schema#310, we do not require exact
matches on street names: extra tokens are now allowed.

This means that a search for 'Kay St' will return 'De Kay St', but 'Kay
St' should be returned first.
orangejulius added a commit to pelias/acceptance-tests that referenced this pull request Feb 21, 2019
Since pelias/schema#310, non-exact street
matches, where there are additional tokens in the street field, are now
allowed.

This means it's necessary to expand this test into two tests, to check that:
1.) the result with the correct zipcode comes first
2.) the result with an incorrect zipcode comes later
orangejulius added a commit to pelias/api that referenced this pull request Jun 19, 2019
Combined with pelias/query#101, this change
allows for improved matching of streets when using the
addressSearchUsingIDs query (the first query executed by the search
endpoint).

Essentially, by allowing a `slop` of 1 in the `match_phrase` clauses of
the Elasticsearch query, it's now possible to omit a single word _in the
middle_ of a street name and still match it. Omitting words at the start
or end of the street name has been supported since
pelias/schema#310.

Some examples of where this helps:

| Actual street name | Newly matching input text |
| ---                | ---                       |
| SE Martin Luther King Jr Blvd | Martin Luther King Blvd |
| Carrer de Balmes | Carrer Balmes |
orangejulius added a commit to pelias/api that referenced this pull request Jun 19, 2019
Combined with pelias/query#101, this change
allows for improved matching of streets when using the
addressSearchUsingIDs query (the first query executed by the search
endpoint).

Essentially, by allowing a `slop` of 1 in the `match_phrase` clauses of
the Elasticsearch query, its now possible to omit a single word _in the
middle_ of a street name and still match it. Omitting words at the start
or end of the street name has been supported since
pelias/schema#310.

Some examples of where this helps:

| Actual street name | Newly matching input text |
| ---                | ---                       |
| SE Martin Luther King Jr Blvd | Martin Luther King Blvd |
| Carrer de Balmes | Carrer Balmes |
orangejulius added a commit to pelias/api that referenced this pull request Oct 3, 2019
Combined with pelias/query#101, this change
allows for improved matching of streets when using the
addressSearchUsingIDs query (the first query executed by the search
endpoint).

Essentially, by allowing a `slop` of 1 in the `match_phrase` clauses of
the Elasticsearch query, its now possible to omit a single word _in the
middle_ of a street name and still match it. Omitting words at the start
or end of the street name has been supported since
pelias/schema#310.

Some examples of where this helps:

| Actual street name | Newly matching input text |
| ---                | ---                       |
| SE Martin Luther King Jr Blvd | Martin Luther King Blvd |
| Carrer de Balmes | Carrer Balmes |
orangejulius added a commit to pelias/api that referenced this pull request Oct 3, 2019
Combined with pelias/query#101, this change
allows for improved matching of streets when using the
addressSearchUsingIDs query (the first query executed by the search
endpoint).

Essentially, by allowing a `slop` of 1 in the `match_phrase` clauses of
the Elasticsearch query, it's now possible to omit a single word _in the
middle_ of a street name and still match it. Omitting words at the start
or end of the street name has been supported since
pelias/schema#310.

Some examples of where this helps:

| Actual street name | Newly matching input text |
| ---                | ---                       |
| SE Martin Luther King Jr Blvd | Martin Luther King Blvd |
| Carrer de Balmes | Carrer Balmes |

Since pelias/query#113 has been merged, this
commit will result in changes to how queries behave.
calpb pushed a commit to sorelle/docker that referenced this pull request Mar 29, 2021
Now that pelias/schema#310 has been merged,
using the `portland-synonyms` branch is no longer required.