Skip to content

ScottPlot.Statistics.Histogram#2403

Merged
swharden merged 25 commits intoScottPlot:mainfrom
bukkideme:main
Feb 22, 2023
Merged

ScottPlot.Statistics.Histogram#2403
swharden merged 25 commits intoScottPlot:mainfrom
bukkideme:main

Conversation

@bukkideme
Copy link
Contributor

Hi! This was in my mind as a first implementation. Please have a look.
Thanks!

@swharden swharden changed the title First implementation of the ScottPlot.Statistics.Histogram class ScottPlot.Statistics.Histogram Feb 20, 2023
@swharden
Copy link
Member

This new histogram class is so much simpler to work with! I'm going to go all the way and update all Cookbook recipes to use this new class. The old code will still work, but I'd like to highlight this newer and simpler way to create and maintain histograms.

Soon I will put this PR down for tonight, but I'll keep working on it tomorrow and hopefully merge it in when all the Cookbook recipes and demo apps are using it 🚀

@bukkideme
Copy link
Contributor Author

Hi! Good to see the progress! :) I think I will look into my old PhD thesis, and check some logarithmic binning in my old C codes. Logarithmic binning is very useful when for example you need to deal with incoming data following power law distribution (shows as a line in a log-log graph). One of the many examples is earthquake magnitude distribution: lots of small earthquakes, but very few large ones. Simple linear binning is not useful in such cases (the tail will look terrible), so in scientific papers it is common to use logarithmically growing X axis bins. I hope I can put together some example.

@swharden swharden linked an issue Feb 20, 2023 that may be closed by this pull request
@swharden
Copy link
Member

I hope I can put together some example.

Examples would be great! I have to be careful not to put too much stuff in this library though because it's a lot to maintain, and I want to keep ScottPlot focused as much as possible on plotting (there are already good statistics libraries out there). Perhaps copy/pasting useful code in an Issue or Discussion would be a good place to put it so it's searchable and can be found by those with similar needs, but doesn't have to be deployed and maintained as part of this core library.

Good to see the progress! :)

I'm very happy with this new histogram class because

  • It's way simpler to use, and the cookbook examples are now much better
  • It supports live data (I still need to create a demo)

@bukkideme
Copy link
Contributor Author

I totally understand your point. Log binning is not really a usual stuff, so i agree not to put into the main code. But we could show a use case maybe. Anyway, ScottPlot does not really support log scales yet, which is another point why not include this log binning feature :)

@Xerxes004
Copy link
Contributor

Is it possible to address #2299 as part of this PR?

@bukkideme
Copy link
Contributor Author

bukkideme commented Feb 20, 2023

Is it possible to address #2299 as part of this PR?

That is not an issue anymore. The new Histogram method takes inputs min, max and binCount. Binsize is calculated, and not an input parameter.

edit: sorry i think i misunderstood you. Probably you want to fix the bug in the original static method in this PR?

@swharden
Copy link
Member

I may have been too quick to say that 43d0664 fully resolved issue #2299

It's pretty close! The min is inclusive but the max is exclusive (values equal to it roll over into the next bin) so I think the returned data should have one extra bin that currently is not being accounted for... will fix in a follow-up commit

@swharden
Copy link
Member

swharden commented Feb 21, 2023

Mimicking numpy, I'll make all these methods return one more bin than asked for. This is a weird result of what happens when you need a bin to contain the max value, but bins are min inclusive but max exclusive 🙄

https://numpy.org/doc/stable/reference/generated/numpy.histogram.html

image

EDIT: Although numpy does this, I don't think we should. We generate histograms with the intent of plotting them, and to plot them we need the length of Xs (bins) to be equal to the length of Ys (counts). Therefore, I'll leave the existing static methods as they are returning bins with a length of n+1 (but still fix #2299), and the new histogram will return bins equal to the number of counts.

static methods return bins with one more element than the number of counts
@swharden
Copy link
Member

swharden commented Feb 21, 2023

I'm now wondering if the default behavior should be changed. Should the bin size be automatically increased by 1 bin so that the max can get counted? This is the current behavior... which may be best... but it's not clear to me if this is the behavior most people would expect 🤔

var hist1 = ScottPlot.Statistics.Histogram.WithFixedSizeBins(min: 0, max: 10, binSize: 1);

hist1.BinEdges.Should().BeEquivalentTo(new double[] { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 });
hist1.Counts.Should().BeEquivalentTo(new double[] { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 });

hist1.Add(10); // since bins are max-exclusive, this counts as an outlier
hist1.Counts.Should().BeEquivalentTo(new double[] { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 });

Assert that the new Histogram class always returns a number of bin edges equal to the number of counts.

Note that this is *NOT* the behavior of other numerical libraries like numpy, but doing it this way makes it easier to create X/Y and bar plots since the number of Xs will match the number of Ys.
constructor defines whether or not an extra bin will be added to catch counts equal to the maximum value
@bukkideme
Copy link
Contributor Author

bukkideme commented Feb 21, 2023

I think either options are ok, if the behaviour is clearly explained to the user.
To my personal taste i would keep the new, non static method in the new class max exclusive. I think asking for N bins, and getting N+1 is a bit strange, no? :)

@swharden
Copy link
Member

I think either options are ok, if the behaviour is clearly explained to the user.

I agree!

I think asking for N bins, and getting N+1 is a bit strange, no? :)

True, but that's been the existing behavior in this library for a long time and changing it now could break others' code running in production 😅

I think a good compromise is to not change the existing static method behavior, but to provide more logical (and better documented) behavior for the new Histogram class.

I'm putting together a live demo now and it's feeling very natural to work with! I think this is landing in a great spot 🚀

@swharden
Copy link
Member

swharden commented Feb 21, 2023

I think this PR may be complete! I'll leave this PR open for another day to give @bukkideme or @Xerxes004 time to look it over in case they have any suggestions. Otherwise I'll merge it tomorrow evening.

Thanks again everyone for your hard work and great input on this!

hist-live

public partial class HistogramDemo : Form
{
readonly Random Rand = new();
readonly ScottPlot.Statistics.Histogram Hist = new(min: 0, max: 1, binCount: 100);
public HistogramDemo()
{
InitializeComponent();
var bar = formsPlot1.Plot.AddBar(values: Hist.Counts, positions: Hist.Bins);
bar.BarWidth = Hist.BinSize;
formsPlot1.Plot.XLabel("Value");
formsPlot1.Plot.YLabel("Count");
formsPlot1.Refresh();
}
private void timer1_Tick(object sender, EventArgs e)
{
if (!cbRun.Checked)
return;
int newValueCount = Rand.Next(123, 1234);
double[] newData = ScottPlot.DataGen.RandomNormal(Rand, newValueCount, stdDev: .15);
Hist.Add(newData);
formsPlot1.Plot.AxisAuto();
formsPlot1.Plot.SetAxisLimits(xMin: -.1, xMax: 1.1, yMin: 0);
formsPlot1.Plot.Title($"Histogram with {Hist.ValuesCounted:N0} Values");
formsPlot1.Refresh();
}
private void btnReset_Click(object sender, EventArgs e)
{
Hist.Clear();
formsPlot1.Plot.Title($"Histogram with 0 Values");
formsPlot1.Refresh();
}
}

makes it easier to center-align bar plots with bins
@bukkideme
Copy link
Contributor Author

bukkideme commented Feb 21, 2023

There is this:
//Upper edge of the last bin (exclusive)
public readonly double Max;

Maybe the help could be changed to:
"Upper edge of the last bin (exclusive if addFinalBin false, otherwise inclusive)" ?

Another thing, why the Add and AddRange methods are same? As i see, the Add method accepts an IEnumerable, and it's summary is the same as of the AddRange methods'.
I see this was an intermediate step version only. Now the Add method has an overload accepting IEnumerable types, and no more AddRange method.

edited: besides these, i think the merge is good to go for me :)

edit2: maybe the Histogram Demo could have a checkbox "Show normalized" too, it would look even better to have option to see normalized values on Y axis?

@Xerxes004
Copy link
Contributor

Xerxes004 commented Feb 21, 2023

True, but that's been the existing behavior in this library for a long time and changing it now could break others' code running in production 😅

Ironically, the current static function caused a bug in our production code. I was requesting buckets of size n, which I assumed (without checking 🙃) would make the edges exactly n apart from each other. We caught the bug when we noticed that the bucket size I displayed on screen was different than our actual exported data's bucket sizes. We were able to rejigger the math to make it work, but it just felt weird that the bucket size I requested was being ignored. We are doing some math so that there are x buckets per standard deviation, which gives us very nice-looking histograms regardless of the input data.

For the record, the way that feels the most correct for bucket size is to completely honor the user's bucket_size as the stride at which data is collected. This means that the number of buckets would be:

int NumBuckets(double min, double max, double bucket_size)
{
    return (int) Math.Ceiling((max - min) / bucket_size);
}

(Edit: Note that my function above doesn't check the sanity of the input params. The real function should.)

@swharden
Copy link
Member

Ironically, the current static function caused a bug in our production code

Thanks for sharing your story @Xerxes004! I'm in a bind because on one hand the "surprising" behavior could cause production bugs, but changing behavior of the existing methods to produce the expected output will change behavior of code which may already be running in production, so the fix risks introducing new bugs 😝

Connecting some dots, this PR updates those static methods to use the ceiling strategy your comment demonstrates (resolving #2299)

I think I'll update this PR to mark those methods obsolete and direct people to the new Histogram class.

public static (double[] hist, double[] binEdges) Histogram(double[] values, double min, double max, double binSize, bool density = false)
{
int binCount = (int)Math.Ceiling((max - min) / binSize);
max = min + binCount * binSize;
return Histogram(values, binCount, density, min, max);
}

public static (double[] hist, double[] binEdges, int minOutliers, int maxOutliers) HistogramWithOutliers(double[] values, double min, double max, double binSize, bool density = false)
{
int binCount = (int)Math.Ceiling((max - min) / binSize);
max = min + binCount * binSize;
return HistogramWithOutliers(values, binCount, density, min, max);
}

Note that my function above doesn't check the sanity of the input params. The real function should.

Good point! I'll add that in before merging.


Add() vs. AddRange()

@bukkideme I'm still deciding which API I prefer... an overload seems simplest, but I recognize IList uses AddRange() so I'll probably put it back to that

Maybe the help could be changed to: "Upper edge of the last bin (exclusive if addFinalBin false, otherwise inclusive)" ?

/// <summary>
/// Upper edge of the last bin (exclusive)
/// </summary>
public double Max { get; }

I think this is confusing too, and for simplicity I think I can improve it by removing thee word "exclusive" from the existing XML comment

maybe the Histogram Demo could have a checkbox "Show normalized" too, it would look even better to have option to see normalized values on Y axis?

I thought about it, but then we might as well add CPH options and outlier options ... then in the effort to demonstrate all the cool things it can do we accidentally make it difficult to learn from 😅

I think I'll favor ultimate simplicity for these demo apps, and users can dig deeper by poking around the classes they instantiate (and also advanced features demonstrated in the cookbook)

I think the merge is good to go for me :)

woohoo! I'll merge it and make a new release very soon. Thanks again for getting this PR started! 🚀

@swharden swharden enabled auto-merge February 22, 2023 02:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Histogram: class for accumulating binned value counts and outliers Histogram function ignores requested bucket size

3 participants