Today I was searching SharePoint using PnP PowerShell (Invoke-PnPSearchQuery) and ran into duplicated results on large result sets.
I got around 100,000 files as result rows, but only a third of those results were unique (unique based on the URL of the file). Which is an odd thing. It appeared I was finding all files, but what I actually got back was duplicated rows. (Not due to TrimDuplicates:False)
In this scenario I had implemented paging myself (for various reasons). I discovered that when I used the -All parameter on the cmdlet (instead of paging myself), I got the same number of result rows, but now they were all unique files. Very odd discovery.
When looking into the PnP.PowerShell codebase (and searching online), it turned out that paging with the StartRow property is unreliable. PnP.PowerShell therefore implements a different approach, which is also documented by MS. Source: https://learn.microsoft.com/en-us/sharepoint/dev/general-development/pagination-for-large-result-sets
Although this source doesn't mention the duplicate results I was struggling with, it does show that for paging on large result sets you should sort by the DocId and page on that value. And that's precisely what PnP.PowerShell is doing if you look into its codebase.
I looked into our codebase and it seems we are paging by StartRow when --allResults is used.
I propose that we update the way we page in this command to the DocId approach, to make our command reliable on large result sets as well.
Implementation
In short, this is what we need to do:
- Query everything where IndexDocId>${lastDocId}
- Sort by DocId: [DocId]:ascending
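A minimal sketch of what that loop could look like. Note that `SearchRow` and `fetchPage` are hypothetical names for illustration; `fetchPage` stands in for an actual call to the SharePoint search API with the `IndexDocId>${lastDocId}` query refinement and the `[DocId]:ascending` sort list applied.

```typescript
interface SearchRow {
  DocId: number;
  // ...other requested managed properties
}

// Pages through all results by DocId instead of StartRow. Each request
// asks for rows with a DocId greater than the last one seen, sorted by
// DocId ascending, so no row can appear twice across pages.
async function getAllResults(
  fetchPage: (lastDocId: number, rowLimit: number) => Promise<SearchRow[]>,
  rowLimit: number = 500
): Promise<SearchRow[]> {
  const all: SearchRow[] = [];
  let lastDocId = 0;

  while (true) {
    const rows = await fetchPage(lastDocId, rowLimit);
    all.push(...rows);

    // A short page means we've reached the end of the result set.
    if (rows.length < rowLimit) {
      break;
    }

    // Results are sorted by DocId ascending, so the last row of this
    // page carries the highest DocId retrieved so far.
    lastDocId = rows[rows.length - 1].DocId;
  }

  return all;
}
```

Because the continuation token is the DocId itself rather than a row offset, the index can shuffle rows between requests without causing duplicates or gaps.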
I wrote down a slightly more extensive explanation in my blog:
https://www.blimped.nl/correctly-paging-when-searching-sharepoint/
Additional
While we're at it, let's add an extra verbose message while looping through all results.
Something like:
Processing search query, retrieved 1000 of 1000000 items
Processing search query, retrieved 1500 of 1000000 items
Processing search query, retrieved 2000 of 1000000 items