Added -Top/-Bottom params to Sort-Object for Top/Bottom N sort and many Pester tests#2518
Conversation
…d many Pester tests
|
Hi @KirkMunro, I'm your friendly neighborhood Microsoft Pull Request Bot (You can call me MSBOT). Thanks for your contribution! The agreement was validated by Microsoft and real humans are currently evaluating your PR. TTYL, MSBOT; |
|
I'm not entirely sure what to do with the failures in AppVeyor/Travis CI. Saw on another PR thread that @vors indicated there were some problems due to changes in .NET Core that he was looking into. These changes build fine on my machine, so pointers if I need to do anything would be helpful. Thanks. |
|
Hi @KirkMunro, I'm your friendly neighborhood Microsoft Pull Request Bot (You can call me MSBOT). Thanks for your contribution! The agreement was validated by Microsoft and real humans are currently evaluating your PR. TTYL, MSBOT; |
|
Reopened PR to re-trigger the execution on the fixed master |
| } | ||
| } | ||
|
|
||
| } No newline at end of file |
| /// <summary> | ||
| /// | ||
| /// </summary> | ||
| [Cmdlet("Sort", "Object", HelpUri = "https://go.microsoft.com/fwlink/?LinkID=113403", RemotingCapability = RemotingCapability.None)] |
There was a problem hiding this comment.
Avoid unnecessary formatting
There was a problem hiding this comment.
The formatting is not unnecessary. I added DefaultParameterSetName to the Cmdlet attribute. The line becomes far too long and hard to read. Formatting makes the content readable without unnecessary side-scrolling. Unless the PS Team requires this change for some reason, this code stays as is.
| } | ||
| } | ||
|
|
||
| Describe 'Sort-Object Top and Bottom Unit Tests' -Tags 'CI' { |
There was a problem hiding this comment.
Consider grouping similar tests below in Context
There was a problem hiding this comment.
If I need to go into this PR to address issues with the logic I'll consider adding context to the Pester tests. Otherwise they are grouped as other tests in the file were grouped, and I'm comfortable with that.
| Describe 'Sort-Object Top and Bottom Unit Tests' -Tags 'CI' { | ||
|
|
||
| It 'Return the top N integers sorted in ascending order' { | ||
| $unsortedData = 973474993,271612178,-1258909473,659770354,1829227828,-1709391247,-10835210,-1477737798,1125017828,813732193 |
There was a problem hiding this comment.
- Move duplicate initializers to BeforeAll {} (and throughout all code below too)
- Why do you use so complex data? I would prefer 1,2,-3,... a,b,z,... Enough 5 or 6 values.
There was a problem hiding this comment.
BeforeAll isn't documented (I didn't even know it existed), and I simply followed the model in the file already.
The numeric data used was intentionally generated using Get-Random with the range of all int values so that the test would result in a random outcome that covered a broad base of numeric input. With sorting and -Top/-Bottom, 5 or 6 values to start with is not always a sufficient test, so I'll leave it with 10 values.
There was a problem hiding this comment.
Many tests on this repository contain BeforeAll. See these examples.
Simple source data easier to understand and easier to maintain. I would say that it is better to have the test which return results you can evaluate your own eyes.
There was a problem hiding this comment.
As per our test guidelines, please do use BeforeAll.
| for ($i = 0; $i -lt $topNSortResults.Count; $i++) { | ||
| $topNSortResults[$i] | Should Be $fullSortResults[$i] | ||
| } | ||
| } |
There was a problem hiding this comment.
It would be better to move repeated code to functions.
There was a problem hiding this comment.
I had written a function at first, with the tests being simply one or two lines of code; this resulted in a complicated function which would have to be tested by itself. I really didn't like this model, and preferred a short test with a half-dozen lines of code or so, which allows anyone to review a failing test in place and see more easily why it may be failing, plus enable testing outside of Pester by simply copying and pasting the test contents.
|
@KirkMunro Thanks for Sort-Object improvements!
Feel free to open the Issue now! I believe that performance improving is the right direction. |
|
@KirkMunro Could you please rebase your branch (and drop unnecessary commits) ? Your PR contains some merge commits. |
| /// <summary> | ||
| /// This is a single entry in the min-/max-heap used for top N/bottom N sorts | ||
| /// </summary> | ||
| public sealed class IndexedMinMaxHeapEntry |
There was a problem hiding this comment.
This doesn't need to be public, right?
Also consider making this a struct, it seems like that would be a win performance wise.
There was a problem hiding this comment.
Changed locally, will commit with other changes soon. On the struct comment, should this be a struct as well then, for the performance win?
internal sealed class OrderByPropertyEntry
{
internal PSObject inputObject = null;
internal List<ObjectCommandPropertyValue> orderValues = new List<ObjectCommandPropertyValue>();
}| } | ||
| } | ||
|
|
||
| internal OrderByPropertyEntry Entry { get; } = null; |
There was a problem hiding this comment.
Avoid initializing properties with their default values. Old code may be doing this, but there is no reason to do so and we'll fix the old code eventually.
There was a problem hiding this comment.
Fixed locally, will commit soon.
|
|
||
| internal OrderByPropertyEntry Entry { get; } = null; | ||
| internal int Index { get; } = -1; | ||
| internal bool Sortable { get; } = false; |
There was a problem hiding this comment.
Again, avoid initializing properties with their default values.
There was a problem hiding this comment.
Fixed locally, will commit soon.
| else | ||
| { | ||
| // The heap entry comparer will sort the items in the heap | ||
| var comparer = new IndexedMinMaxHeapComparer(orderByProperty.Comparer); |
There was a problem hiding this comment.
The actual sort should be in it's own method.
There was a problem hiding this comment.
Moved entire body of else block except for WriteObject to new Heapify method, will commit soon.
| { | ||
| // There may be a faster algorithm to sort while eliminating duplicates and retaining only | ||
| // the top N records; for now though, this gets the job done | ||
| listToOutput = listToOutput.GetRange(0, Top); |
There was a problem hiding this comment.
This needlessly creates a new list and copies the elements.
There was a problem hiding this comment.
Replaced with RemoveRange, will commit soon.
There was a problem hiding this comment.
RemoveRange might be worse - it copies and clears.
I was thinking of iterating over the range explicitly with integers instead of the enumerator of the list.
| } | ||
|
|
||
| Describe "Sort-Object DRT Unit Tests" -Tags "CI" { | ||
| It "Sort-Object with object array should work"{ |
There was a problem hiding this comment.
As per our guidelines, avoid formatting changes. I realize test files are inconsistent w.r.t. tabs vs. spaces, but we want to keep meaningful additions in different commits from formatting changes.
There was a problem hiding this comment.
@lzybkr Understood. So should I convert that file back to use tabs instead of spaces?
| Describe 'Sort-Object Top and Bottom Unit Tests' -Tags 'CI' { | ||
|
|
||
| It 'Return the top N integers sorted in ascending order' { | ||
| $unsortedData = 973474993,271612178,-1258909473,659770354,1829227828,-1709391247,-10835210,-1477737798,1125017828,813732193 |
There was a problem hiding this comment.
As per our test guidelines, please do use BeforeAll.
|
|
||
| It 'Return the top N integers sorted in ascending order' { | ||
| $unsortedData = 973474993,271612178,-1258909473,659770354,1829227828,-1709391247,-10835210,-1477737798,1125017828,813732193 | ||
| $baseSortParameters = @{} |
There was a problem hiding this comment.
It seems silly to splat an empty hashtable.
|
|
||
| It 'Return the top N integers sorted in descending order' { | ||
| $unsortedData = 973474993,271612178,-1258909473,659770354,1829227828,-1709391247,-10835210,-1477737798,1125017828,813732193 | ||
| $baseSortParameters = @{Descending=$true} |
There was a problem hiding this comment.
It seems a little silly to splat a single parameter here.
| } | ||
|
|
||
| It 'Return the top N objects from a collection of different types of objects, sorted in descending order' { | ||
| $unsortedData = @( |
There was a problem hiding this comment.
Why not create $unsortedData just once in BeforeAll?
There was a problem hiding this comment.
Actually, after a closer look, these tests are a good example of why Pester has a parameter -TestCases to It. Please consider changing the tests to use -TestCases, I think it will simplify these test quite a bit.
|
@lzybkr: All changes requested on this discussion have been addressed and committed to this PR. I'm not sure what I need to do to respond to your requested changes other than send this note. One very noteworthy change came out of this discussion: since you highlighted that List.Remove had a strong negative impact on performance vs in-place sorting of the items in the list, I was able to get a 5x performance improvement out of -Unique -Top/-Bottom sorts, which is great! On a collection of 250K random values between 1 and 175K, Sort -Unique performs 5x faster than before, it performs slightly faster than that with -Bottom N, and it performs faster than that with -Top N. Non-unique -Top/-Bottom N sorts perform more than 2x faster than a base sort (and even more than that with sort | select -First/-Last). As suggested by @iSazonov, I will blog about this change set, probably in an article for PowerShell Magazine. I'm just waiting for results from appveyor/travis CI and will respond to any issues they have if they come up. |
|
@KirkMunro Please rebase your branch. |
|
@iSazonov I did rebase my branch. |
|
Something is clearly wrong because this PR now shows 26 files changed, yet when I compare my branch here (master...KirkMunro:sort-top-or-bottom) it properly shows 4 files have changed. sigh I didn't have this issue before I rebased my branch. |
|
Rebase is also my headache 😊 |
|
I'm glad the improvements were notable. Please do fix the commits in the PR and I'll finish reviewing. For future reference, when we merge, we decide if we'll squash, rebase, or commit with a merge commit, so it's not normally necessary for most folks to rebase themselves. That said, it is useful to be comfortable with rebase. |
7cd9e22 to
5bee3fd
Compare
|
@lzybkr Commit cleanup finished and checks passed, awaiting your review. I should add one note: I didn't benchmark using struct vs class because I moved my additional properties into the existing class and changing that to a struct was non-trivial due to other dependencies (Compare-Object, Group-Object) and outside of the scope of this work. |
|
@KirkMunro Great work!!! |
lzybkr
left a comment
There was a problem hiding this comment.
OK - this is looking pretty good.
I think it's easy to make -Unique much faster with -Top and -Bottom - can you take a look?
Other than that, I just noticed a couple little things that would be nice to clean up.
| /// <summary> | ||
| /// Sort unsorted OrderByPropertyEntry data using an indexed min-/max-heap sort | ||
| /// </summary> | ||
| private List<OrderByPropertyEntry> Heapify(List<OrderByPropertyEntry> dataToSort, OrderByPropertyComparer orderByPropertyComparer, out int heapCount) |
There was a problem hiding this comment.
Returning the heap suggests it's not done in place even though it is.
Maybe return heapCount instead?
There was a problem hiding this comment.
Yeah, I should have made that change when I switched to in-place sorting. It's done locally now.
| { | ||
| RemoveDuplicates(orderByProperty); | ||
| } | ||
| // Note: It may be worth compariing List.Sort with SortedSet.Sort when handling unique records in |
There was a problem hiding this comment.
Fixed local, thanks.
|
|
||
| // Min-heap: if the child item is larger than its parent, break | ||
| // Max-heap: if the child item is smaller than its parent, break | ||
| if (comparer.Compare(dataToSort[childIndex], dataToSort[parentIndex]) == comparator) |
There was a problem hiding this comment.
Can you handle -Unique here too? If they compare equal, you just skip the item, right?
There was a problem hiding this comment.
I don't think it's that simple. When I looked at supporting -Unique in heapify, my initial reaction was that it wouldn't work, at least not the way you might think. Since the data in the heap is not sorted, the challenge when checking for uniqueness of each value is which values you compare against. If the next value could go into the heap (as determined by comparing with the root), then I'd have to look at children of the current item (root) to see if I either find the same item or determine whether or not it would be possible to find the same item in a child of the item (in which case I'd have to recurse into that child item and repeat with its children). Worst case would result in N comparisons, where N is the size of the heap. Best case would be 1 comparison. Anyhow, that's roughly what I thought would be required. I could try it and see how well (or not) that performs vs the current method.
| param($nSortEntry, $fullSortEntry) | ||
| if ($nSortEntry -is [System.Array]) { | ||
| # Arrays must be compared differently than everything else | ||
| $nSortEntry.Equals($fullSortEntry) | Should Be $true |
There was a problem hiding this comment.
Array.Equals is reference equality - so this can't be what you meant.
Array does implement IStructuralEquatable, so if you can create an EqualityComparer, you could use something like:
([System.Collections.IStructuralEquatable]$nSortEntry).Equals($fullSortEntry,
[System.Collections.Generic.EqualityComparer[int]]::Default) | Should Be $trueThere was a problem hiding this comment.
Or if you really meant to use reference equality, use [object]::ReferenceEquality and add a comment why.
There was a problem hiding this comment.
I'll add the comment and update to the other reference equality test you suggested. I use reference equality because I'm verifying arrays are sorted into the proper position.
|
|
||
| // If -Unique was used, if -Top & -Bottom were not used, or if -Top or -Bottom would return all objects, sort | ||
| // all objects, remove duplicates if necessary, and identify the list of items to return | ||
| if (_unique || (Top == 0 && Bottom == 0) || Top >= dataToProcess.Count || Bottom >= dataToProcess.Count) |
There was a problem hiding this comment.
Am I mistaken in assuming you can easily handle -Unique in heapify?
There was a problem hiding this comment.
Same comment about -Unique and heapify applies here. I'm testing something with it though to see what impact it might have.
|
|
||
| It 'Return the <nSortType> N sorted in <orderType> order' -TestCases $topBottomAscendingDescending { | ||
| param([string]$nSortType, [string]$orderType) | ||
| $baseSortParameters = if ($orderType -eq 'Descending') {@{Descending = $true}} else {@{}} |
There was a problem hiding this comment.
As long as you're splatting, you can simplify with something like:
... = @{Descending = $orderType -eq 'descending'}And you decide to change this, you might as well fix the case to be consistent.
There was a problem hiding this comment.
As is, if you're doing an ascending sort, no parameter is passed. If I change this then an ascending sort passes -Descending:$false. I suppose for my tests that doesn't matter much since I'm verifying output order and not explicitly testing the switch parameter, so I changed it to simplify the logic (that's what I had before anyway).
| Context 'Heterogeneous n-sort' { | ||
|
|
||
| $unsortedData = @( | ||
| ($item0 = 'x') |
There was a problem hiding this comment.
You aren't using any of the assignments, so can they be removed?
There was a problem hiding this comment.
Sure, I removed them. I would use the assignments when testing various things outside of Pester, but it's good now so I took them out.
| # Arrays must be compared differently than everything else | ||
| $nSortEntry.Equals($fullSortEntry) | Should Be $true | ||
| # Arrays are compared using reference equality to ensure that the original array was | ||
| # moved to the correct position in both sorts; value equality doesn't verify this |
There was a problem hiding this comment.
My first thought was about Compare-Object, because many tests using it, but Sort-Object is specific. So maybe for the future explicitly add: 'Compare-Object is not applicable here!'
| $topBottom = @( | ||
| @{nSortType='Top' } | ||
| @{nSortType='Bottom'} | ||
| ) |
There was a problem hiding this comment.
Previously I had in mind, that these variables ($topBottom and $topBottomAscendingDescending) are good candidates to move in BeforeAll.
There was a problem hiding this comment.
That doesn't buy anything though -- those variables don't change.
There was a problem hiding this comment.
The specificity is that this is not an ordinary script - Pester makes parsing and execution operators. BeforeAll is the result of the fact that it does not guarantee the order of execution. Thus all the common code (including function) within Describe must be inside BeforeAll.
You can see the most correct sample here
There was a problem hiding this comment.
it does not guarantee the order of execution
Where are you getting that from? Pester is not documented that way, and based on how PowerShell works I don't believe that's the case. And if it is changed to work that way, then plenty of Pester tests could start breaking.
You've shared one example where BeforeAll is being used to define functions, etc. however there are many other examples where BeforeAll is not being used in the test harness for PowerShell Core yet variables are assigned, functions are loaded, etc. There's nothing in the Pester documentation about it running tests asynchronously, and even if it did, it would still have to load those tests, which would execute the code for those tests. Changing that behaviour in Pester would be a breaking change.
Since they are not documented, and since I don't have time to go read the code to figure out what they do, I can only ascertain that BeforeAll/AfterAll are much like begin/end blocks, where you can do resource management, make sure certain things are loaded before tests and unloaded afterwards. Other than that, I don't believe there is any advantage whatsoever to using BeforeAll over defining the functions/variables inside of a Context block instead.
|
I just committed an update that improves -unique -top/-bottom sorts performance substantially. On a collection of 500K integers that were generated randomly with values from 1 to 175K, sort -unique took 18 seconds to complete, and with the previous implementation a -Top/-Bottom N -Unique sort would finish slightly faster than that, in around 17 seconds. With these changes, the -Top/-Bottom N -Unique sort finishes in a little over 6 seconds. I couldn't make this work with the min-/max-heap sort without either walking the heap to see if an item was already added, since an item could be anywhere in the heap. Walking the heap could result in comparing a new item with every item in the heap in the worst case. Instead I decided to add another collection to track the unique items. I use SortedSet to store anything unique that is added to the heap, and then skip items that are already in that set instead of adding them to the heap. This seems pretty efficient, and definitely improves the performance. @lzybkr Let me know what you think of this commit when you have time to review the most recent changes. Thanks! [Update] I committed a follow-up change that removes a few lines that weren't necessary due to the changes I just made, and cleaned up the code in sort-object.cs based on my changes so that it's easier to follow. |
|
Great work @KirkMunro. A Set should be fine for typical uses of Top/Bottom w/ If that assumption is wrong, we could look at insertion sort. |
This changeset addresses issue #2462, along with all feedback received in the comments for that issue.
The original issue was identified as an enhancement to add -Top support to Sort-Object, so that (a) you would get results back faster when you only wanted the top N, and (b) you would get code elegance by not having to pipe to Select-Object. Several commenters wanted -Bottom N support added for completeness, so an additional parameter set was created with a -Bottom <ìnt> parameter. Over 40 tests were written for Pester to compare the sort results using Sort -Top or Sort -Bottom with sort results when -Top and -Bottom are not used to ensure that we always get back the data we want according with how Sort-Object sorts data without -Top/-Bottom. That means that if Sort-Object is broken (if it returns results in a different order), these tests will pass; however, the intent of these tests is to compare the sort results with those of a basic Sort-Object invocation, and other tests exist to verify the order of the items returned from Sort-Object, so this is all good.
Additionally, performance testing was done and -Top/-Bottom sorts on very large data sets (1M integers) returned results about twice as fast as when using sort | select -First/-Last, as long as -Unique is not used. This performance improvement is partly because of the reduced pipeline and partly because of a faster sort algorithm (an indexed min-/max-heap sort is used to process Top N/Bottom N results more quickly).
Worth noting: Using -Top/-Bottom sorts using -Unique are really, really slow with very large data sets, just as sort | select -First/-Last are slow with very large data sets. This code is old, and while looking at it I felt it would be worth looking at SortedSet for -Unique sorts (SortedSet did not exist when sort-object was first written) to see if it would return unique sorted results more quickly. That investigation was outside of the scope of this PR though, so I'm simply mentioning it here so that I don't forget about the item. A separate issue should probably be created to investigate SortedSet with Sort-Object -Unique to see if it improves performance like I hope it will.