Conversation
|
I came across a relevant example today Maybe not an issue because |
|
It's been a while since I've thought about this issue. We really do need to decide whether or not to merge this soon. Can you expand on how your example is relevant? And is it an example of how this PR breaks things, how this PR helps, or something else? |
has disable_fielddata and improved synonyms (pelias/schema#310)
has disable_fielddata and improved synonyms (pelias/schema#310)
has disable_fielddata and improved synonyms (pelias/schema#310)
has disable_fielddata and improved synonyms (pelias/schema#310)
has disable_fielddata and improved synonyms (pelias/schema#310)
|
I took a look at what sort of impact this branch would have by running a North America build. TLDR: the change in query results is not too bad, but the impact on performance is noticeably bad. Query result changesBecause we now tokenize on whitespace on /v1/search?text=49 kay streetThis query now returns results from /v1/search?text=450 37th st, new york, ny 11232This query now returns results for both PerformanceHowever, as we feared, there a some obvious performance degradation from searching the index for common query terms like /v1/search?text=Washington University in St Louis This query takes over 3 secondsto complete on a Looking at the Elasticsearch profiler output, it's fairly clear where the problem lies: the total count of matched documents for the query is 42,912,211, or about 1/4 of all the documents in a North America build! Digging further, most parts of the query are very fast. For example, querying the name.phrase field for Here's the full profile output: https://gist.github.com/orangejulius/ecd43607a059fa383c499a7b49c185e6 Curiously, a query that was fairly fast is 5 avenue street. I think this is because there are actually very few records that match both |
|
Regarding performance, the "Washington University in St Louis" really is a uniquely bad case. Exploring other queries, even a query for "main st" is quite fast, matching 1.1M documents and taking around 180ms, or 300-400ms with profiling enabled (profiler output). Furthermore, its possible we can improve query performance significantly without too much work by detecting common tokens, and writing queries in a way that helps Elasticsearch match fewer documents. For example, in the university query, we ended up with a With an extra filter clause as follows we can get much better results: This basically tells Elasticsearch to only consider documents that match the "uncommon" tokens in the query, excluding a massive number of documents that only match the common tokens "in" and "st". This reduces the number of matched documents to only 590,000 (down from 42 million!), and then the query takes only about 200ms. This sort of approach could give us the following benefits:
|
176448e to
e1f5dde
Compare
|
I tested setting I also tried using what sounded like the perfect Update: It turns out I was using pelias/api#1213 is a PR to use |
69bdac7 to
77f678b
Compare
e53e1ed to
e22bc76
Compare
|
I've run a full planet build with these changes, and did not notice any new acceptance test failures. The only thing left before being able to merge this is to do a little bit of performance comparison, with and without pelias/api#1213 involved. |
|
Hi there, We are using this PR (customised) on a full planet build and it seems that we have no performance issues. I ran a full acceptance tests on both old prod (91.4%) and new one (89.51%), but errors are not relevant in our use-case. BTW 👍 for ES5, works like a charm 😄 |
missinglink
left a comment
There was a problem hiding this comment.
I think it was good that we were originally apprehensive to merge this but now is the time to merge this.
@Joxit has also confirmed that it's a positive change, so let's go ahead and pull the trigger on this.
Are there any final actions required in terms of unit test coverage, etc?
👍
e22bc76 to
d3c82e4
Compare
d3c82e4 to
b59ae7e
Compare
|
Nothing left to do :) This PR looks good, and it's time for it to be merged. Looking forward to building on top of it in the future! |
Now that pelias/schema#310 has been merged, using the `portland-synonyms` branch is no longer required.
Since pelias/schema#310, we do not require exact matches on street names: extra tokens are now allowed. This means that a search for 'Kay St' will return 'De Kay St', but 'Kay St' should be returned first.
Since pelias/schema#310 non-exact street matches, where there are additional tokens in the street field, are now allowed. This means its required to expand this test to two tests to check that: 1.) the result with the correct zipcode comes first 2.) the result with an incorrect zipcode comes later
Combined with pelias/query#101, this change allows for improved matching of streets when using the addressSearchUsingIDs query (the first query executed by the search endpoint). Essentially, by allowing a `slop` of 1 in the `match_phrase` clauses of the Elasticsearch query, its now possible to omit a single word _in the middle_ of a street name and still match it. Omitting words at the start or end of the street name has been supported since pelias/schema#310. Some examples of where this helps: | Actual street name | Newly matching input text | | --- | --- | | SE Martin Luther King Jr Blvd | Martin Luther King Blvd | | Carrer de Balmes | Carrer Balmes |
Combined with pelias/query#101, this change allows for improved matching of streets when using the addressSearchUsingIDs query (the first query executed by the search endpoint). Essentially, by allowing a `slop` of 1 in the `match_phrase` clauses of the Elasticsearch query, its now possible to omit a single word _in the middle_ of a street name and still match it. Omitting words at the start or end of the street name has been supported since pelias/schema#310. Some examples of where this helps: | Actual street name | Newly matching input text | | --- | --- | | SE Martin Luther King Jr Blvd | Martin Luther King Blvd | | Carrer de Balmes | Carrer Balmes |
Combined with pelias/query#101, this change allows for improved matching of streets when using the addressSearchUsingIDs query (the first query executed by the search endpoint). Essentially, by allowing a `slop` of 1 in the `match_phrase` clauses of the Elasticsearch query, its now possible to omit a single word _in the middle_ of a street name and still match it. Omitting words at the start or end of the street name has been supported since pelias/schema#310. Some examples of where this helps: | Actual street name | Newly matching input text | | --- | --- | | SE Martin Luther King Jr Blvd | Martin Luther King Blvd | | Carrer de Balmes | Carrer Balmes |
Combined with pelias/query#101, this change allows for improved matching of streets when using the addressSearchUsingIDs query (the first query executed by the search endpoint). Essentially, by allowing a `slop` of 1 in the `match_phrase` clauses of the Elasticsearch query, its now possible to omit a single word _in the middle_ of a street name and still match it. Omitting words at the start or end of the street name has been supported since pelias/schema#310. Some examples of where this helps: | Actual street name | Newly matching input text | | --- | --- | | SE Martin Luther King Jr Blvd | Martin Luther King Blvd | | Carrer de Balmes | Carrer Balmes | Since pelias/query#113 has been merged, this commit will result in changes to how queries behave.
Combined with pelias/query#101, this change allows for improved matching of streets when using the addressSearchUsingIDs query (the first query executed by the search endpoint). Essentially, by allowing a `slop` of 1 in the `match_phrase` clauses of the Elasticsearch query, its now possible to omit a single word _in the middle_ of a street name and still match it. Omitting words at the start or end of the street name has been supported since pelias/schema#310. Some examples of where this helps: | Actual street name | Newly matching input text | | --- | --- | | SE Martin Luther King Jr Blvd | Martin Luther King Blvd | | Carrer de Balmes | Carrer Balmes | Since pelias/query#113 has been merged, this commit will result in changes to how queries behave.
Combined with pelias/query#101, this change allows for improved matching of streets when using the addressSearchUsingIDs query (the first query executed by the search endpoint). Essentially, by allowing a `slop` of 1 in the `match_phrase` clauses of the Elasticsearch query, its now possible to omit a single word _in the middle_ of a street name and still match it. Omitting words at the start or end of the street name has been supported since pelias/schema#310. Some examples of where this helps: | Actual street name | Newly matching input text | | --- | --- | | SE Martin Luther King Jr Blvd | Martin Luther King Blvd | | Carrer de Balmes | Carrer Balmes | Since pelias/query#113 has been merged, this commit will result in changes to how queries behave.
Combined with pelias/query#101, this change allows for improved matching of streets when using the addressSearchUsingIDs query (the first query executed by the search endpoint). Essentially, by allowing a `slop` of 1 in the `match_phrase` clauses of the Elasticsearch query, its now possible to omit a single word _in the middle_ of a street name and still match it. Omitting words at the start or end of the street name has been supported since pelias/schema#310. Some examples of where this helps: | Actual street name | Newly matching input text | | --- | --- | | SE Martin Luther King Jr Blvd | Martin Luther King Blvd | | Carrer de Balmes | Carrer Balmes | Since pelias/query#113 has been merged, this commit will result in changes to how queries behave.
Now that pelias/schema#310 has been merged, using the `portland-synonyms` branch is no longer required.


This PR includes all the changes in #291 except for those which are Portland-specific, and not generally applicable.
It makes the following changes:
peliasStreetanalyzer