Improve descriptive statistics class #3071

Merged
swharden merged 10 commits into ScottPlot:main from arthurits:DescriptiveStatistics
Dec 17, 2023
Conversation

@arthurits
Contributor

Purpose:

  • Implements SP5: Improve descriptive statistics class #3055.
  • Input parameters are defined as IEnumerable<T> instead of double.
  • Uses IEnumerable<T>.Average() to compute the data average.
  • Functions StDev are renamed to StdDev.
  • Code for StdDev is modified:
    • Boolean parameter asSample is added to compute both the population and the sample standard deviations.
    • Code checks for and avoids division by 0 runtime exceptions.
    • Function is overloaded with respect to parameter mean.
  • Function StdErr (standard error of the mean) is added.
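
To make the behaviour listed above concrete, here is a minimal Python sketch of the same ideas: a sample/population switch, a guard against dividing by zero, an optional precomputed mean, and the standard error of the mean. It is not the PR's C# code. The names `std_dev`, `std_err`, `as_sample` and `mean` are hypothetical stand-ins, and returning NaN when the divisor would be zero is an assumption, since the thread does not say what the merged code returns in that case.

```python
import math
from typing import Iterable, Optional


def std_dev(values: Iterable[float], as_sample: bool = True,
            mean: Optional[float] = None) -> float:
    """Standard deviation of `values` (hypothetical helper, not ScottPlot's API)."""
    data = list(values)  # materialize so any iterable can be traversed twice
    n = len(data)
    # Sample standard deviation divides by n - 1, population by n.
    divisor = n - 1 if as_sample else n
    if divisor <= 0:
        # Assumption: report NaN instead of raising on division by zero.
        return float("nan")
    if mean is None:
        mean = sum(data) / n
    sum_of_squares = sum((x - mean) ** 2 for x in data)
    return math.sqrt(sum_of_squares / divisor)


def std_err(values: Iterable[float]) -> float:
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    data = list(values)
    if not data:
        return float("nan")  # same assumed fallback as std_dev
    return std_dev(data, as_sample=True) / math.sqrt(len(data))
```

For `[2, 4, 4, 4, 5, 5, 7, 9]` the mean is 5 and the sum of squared deviations is 32, so the population standard deviation is exactly 2, while the sample standard deviation is the square root of 32/7.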

@swharden
Member

swharden commented Dec 17, 2023

Hi @arthurits, thanks so much for this PR! This is fantastic, and I look forward to merging it in a few minutes!

> Functions StDev are renamed to StdDev.

Note that I'll keep it StDev to be consistent with naming found in Python's standard library. I try to mimic the API of Python (and the numpy and matplotlib packages) where possible.
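
For reference, the Python standard-library naming referred to here lives in the `statistics` module: `stdev` is the sample standard deviation and `pstdev` is the population one. A short sketch with illustrative data values:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values only

sample_sd = statistics.stdev(data)       # divides by n - 1
population_sd = statistics.pstdev(data)  # divides by n

print(f"stdev:  {sample_sd:.4f}")
print(f"pstdev: {population_sd:.4f}")
```

The sample value is always at least as large as the population value for the same data, because it divides the same sum of squares by a smaller number.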

@swharden swharden linked an issue Dec 17, 2023 that may be closed by this pull request
@swharden
Member

swharden commented Dec 17, 2023

Thinking out loud about performance...

I'm tempted to make this an IReadOnlyList which will allow it to accept both List<double> and double[] and I think offer improved performance because .Count can be used instead of Linq's .Count() ... Probably the decision should be made with proper benchmarks to prove it's meaningfully faster, but I'm not really motivated to micro-optimize these methods at this time.

Similarly, I'll probably make private the overloads that accept the mean "for performance". Maybe it saves a few cycles by not calculating the mean twice, but calculating the mean isn't very costly considering that the function is loaded with Sqrt() calls, which are a lot more significant.

Also, I'm going to remove the parallel code here because parallel processing can be significantly slower for small datasets. If we benchmark it and I'm incorrect, we can add it back.

@swharden
Member

swharden commented Dec 17, 2023

[sorry about the message bomb here lol] ... thinking more out loud,

In researching Microsoft's naming of this stuff, it looks like they favor StandardDeviation() and StandardDeviationP(), which is a few more keystrokes but may be more readable and a better reference than Python APIs.

@swharden
Member

> I look forward to merging it in a few minutes!

A few minutes turned into a few hours - what a rabbit hole! I think this ended in a fantastic place though and I'm merging now. Thanks again for your help getting this started @arthurits!

There's room for benchmark-driven micro-optimization in the future, but I'm happy where this is.

@swharden swharden enabled auto-merge December 17, 2023 01:55
@swharden swharden merged commit 40810ea into ScottPlot:main Dec 17, 2023
@arthurits
Contributor Author

Hi @swharden. Although it looked quite straightforward, it turns out there were many points to address.
The final code looks great!
