
Conversation

@melotic (Member) commented Jul 6, 2022

To double check:

Epic: dotnet/dnceng#2657

Unfortunately, GitHub doesn't support inline comments on Jupyter notebooks. If you could quote the section of the notebook that you have feedback on for context, that would be much appreciated!

@melotic requested a review from a team July 6, 2022 22:32
@ChadNedzlek (Contributor)

Uh... this doesn't appear to be a one pager... or anything I can read. :-)

@ChadNedzlek (Contributor)

Weird. GitHub can render the file more or less fine if you go to "view file". It's just the diff that's basically worthless. :-)

@melotic (Member Author) commented Jul 6, 2022

> Uh... this doesn't appear to be a one pager... or anything I can read. :-)

GitHub isn't good with PRs and notebooks. You can click on Files Changed -> three dots on the right -> View File to render it in the browser :/

@riarenas (Contributor) commented Jul 6, 2022

> three dots on the right -> View File to render it in the browser :/

That makes it very unfriendly to reviewers when they want to leave a comment. Let's make a proper one pager doc, and you can link your notebook to it.

@ChadNedzlek (Contributor)

This doc is really, really long... Long enough that I'm not sure it should be checked into arcade. There's maybe a sentence or two on every page saying something important... but otherwise I'm not sure what I'm supposed to be getting out of this other than "runtime is doing something weird and we should tell them not to do that" and "averages and stddev are stable, so we can just use those". I'm not sure what the "machine learning" part of this is... it's basically just saying that for a given definition, the build times form a more or less predictable distribution, but at that point it's just basic statistics.

@melotic (Member Author) commented Jul 6, 2022

> This doc is really, really long... Long enough that I'm not sure it should be checked into arcade. There's maybe a sentence or two on every page saying something important... but otherwise I'm not sure what I'm supposed to be getting out of this other than "runtime is doing something weird and we should tell them not to do that" and "averages and stddev are stable, so we can just use those". I'm not sure what the "machine learning" part of this is... it's basically just saying that for a given definition, the build times form a more or less predictable distribution, but at that point it's just basic statistics.

The one-pager outlines the data science process used to arrive at the conclusion that regressing a distribution over the duration of pipelines is a suitable and effective way to predict the duration of a given pipeline. Without this document, how do we answer the questions "How are you giving us predictions?" and "Is your prediction accurate?" The one-pager (I should rename it something else, and write a separate markdown one-pager) walks any user through this process and provides suitable statistical analysis and answers to those questions.

While yes, this is basic statistics, the point was that I proved that using basic statistics is effective in solving this problem. The machine learning part comes from the regression on the distribution.
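
For readers following along, here is a minimal sketch of the kind of "basic statistics" approach being discussed: fit a normal distribution to one pipeline's historical run durations and report a predicted range. This is illustrative only; the sample durations and the 95% band are assumptions for the example, not the notebook's actual code.

```python
# Illustrative sketch only (not the notebook's code): fit a normal distribution
# to one pipeline's historical run durations and turn it into a predicted range.
# The sample durations and the 95% band are assumptions made for this example.
import numpy as np
from scipy import stats

durations_min = np.array([52.1, 48.7, 55.3, 61.0, 49.9, 53.4, 57.8, 50.2])

mu, sigma = stats.norm.fit(durations_min)                    # MLE mean/std
low, high = stats.norm.interval(0.95, loc=mu, scale=sigma)   # central 95% band

print(f"predicted duration: {mu:.1f} min (range {low:.1f}-{high:.1f} min)")
```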

@ChadNedzlek (Contributor) commented Jul 6, 2022

True... but I'm not sure we need a 10-page doc to answer the question "yes, we did the science, and the science says OK" saved (and fetched by everyone that clones arcade) into our public repository for all time. A quick summary of the findings in an MD file is likely sufficient, along the lines of "we did analysis of all the repositories, and most of them form a stable normal distribution".

I trust that you did all the correct analysis if you say you did. I don't really need proof. :-)

@riarenas (Contributor) commented Jul 7, 2022

Initially Justin came to me with the idea of making his one-pager as a Jupyter notebook, and I thought it was worth a try as long as it was possible to review it.

We now see that the approach is not correct. This file is too big to check in, and this PR is difficult to give feedback on.

@ChadNedzlek's suggestion of just summarizing your findings in the one-pager, along with a link to the full notebook for those who want to go through your process sounds right to me.

@markwilkie (Member) commented Jul 7, 2022

I love the idea of the Jupyter notebook being there because I suspect there's a good amount of learning documented. However, I guess a giant doc isn't really a "one pager"... lol. So yeah, a high-level summary of the findings and approach, with the Jupyter doc as source material, seems great.

Oh, and I'm well aware I was wrong when I thought that using the jupyter file as the one pager was "fine" @melotic ..... :)

@melotic (Member Author) commented Jul 8, 2022

Switched it to markdown.

> * Is there a goal to have this work completed by, and what is the risk of not hitting that date? (e.g. missed OKRs, increased pain-points for consumers, functionality is required for the next product release, et cetera)
>   * Aug 12, the end of the internship.
> * Does anything the new feature depend on consume a limited/throttled API resource?
>   * No. Kusto will be queried once a week to retrain the models.
Member:

How many models? How much data is needed per model?

Member Author:

I've arbitrarily set the minimum data to be 50 runs for a pipeline to have a model. Each pipeline has its own model.
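
As an illustration of that rule (a sketch under assumed column names, not the service's actual code), training one simple model per pipeline while skipping definitions with fewer than 50 recorded runs could look like:

```python
# Sketch only: one simple model (mean/std of duration) per pipeline definition,
# skipping any pipeline with fewer than 50 recorded runs. The column names
# (DefinitionId, DurationMinutes) are assumptions for the example.
import pandas as pd
from scipy import stats

MIN_RUNS = 50

def train_models(runs: pd.DataFrame) -> dict:
    models = {}
    for definition_id, group in runs.groupby("DefinitionId"):
        if len(group) < MIN_RUNS:
            continue  # not enough history to model this pipeline
        mu, sigma = stats.norm.fit(group["DurationMinutes"])
        models[definition_id] = (mu, sigma)
    return models
```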

Contributor:

Should this be removed as well since we're no longer using this approach?

> * Are you utilizing any response data that allows intelligent back-off from the service?
>   * We only query Kusto, so there is no need for back-off.
> * What is the plan for getting more capacity if the feature both must exist and needs more capacity than available?
>   * Azure Functions should auto-scale our function if our ML endpoint is being queried hard.
Member:

This implies no cache of already-calculated data. Is that correct?

Member Author:

If by already-calculated you mean the trained models, they are cached in a blob container.
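
Purely for illustration (the connection string, container name, and blob name below are placeholders, not the service's real configuration), caching the trained per-pipeline parameters in a blob could look roughly like:

```python
# Hypothetical sketch of caching trained per-pipeline parameters in a blob
# container; connection string, container name, and blob name are placeholders.
import json
from azure.storage.blob import BlobClient

def cache_models(models: dict, connection_string: str) -> None:
    # models: {definition_id: (mean, std)} as produced by the training step
    blob = BlobClient.from_connection_string(
        connection_string,
        container_name="pipeline-duration-models",  # placeholder name
        blob_name="models.json",                     # placeholder name
    )
    payload = json.dumps(
        {str(k): {"mean": m, "std": s} for k, (m, s) in models.items()}
    )
    blob.upload_blob(payload, overwrite=True)
```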

Member Author:

No longer using Azure Functions.

Contributor:

Let's remove it from the doc, then.

@melotic (Member Author) commented Jul 11, 2022

Testing only fitting normal distributions:
[attached plot]

Accuracy results:

| Statistic | Accuracy |
| --- | --- |
| count | 29 |
| mean | 94.9138 |
| std | 3.14995 |
| min | 84.7935 |
| 25% | 93.7422 |
| 50% | 96.0576 |
| 75% | 96.9298 |
| max | 98.3808 |

Statistics on ranges given over backtesting period, in minutes:

| Statistic | Ranges (min) |
| --- | --- |
| count | 534 |
| mean | 26.7273 |
| std | 24.2594 |
| min | 4.05823 |
| 25% | 11.7814 |
| 50% | 18.9974 |
| 75% | 26.6553 |
| max | 107.596 |

@ChadNedzlek seems like this is a good solution.
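
For context, here is a rough sketch of the kind of backtest that produces numbers like these. It assumes accuracy means the share of a week's actual durations that land inside a mu ± k·sigma band fit on all earlier data, and "range" means that band's width in minutes; the notebook may define these differently.

```python
# Rough backtest sketch (not the notebook's exact code). For each week, fit on
# all earlier runs, predict a mu +/- k*sigma band, then score that week's runs.
import pandas as pd

def backtest(runs: pd.DataFrame, k: float = 2.0) -> pd.DataFrame:
    # runs: columns FinishTime (datetime64) and DurationMinutes, one pipeline
    runs = runs.sort_values("FinishTime")
    weeks = runs["FinishTime"].dt.to_period("W")
    rows = []
    for week in weeks.unique()[1:]:                  # need some history first
        train = runs.loc[weeks < week, "DurationMinutes"]
        test = runs.loc[weeks == week, "DurationMinutes"]
        if len(train) < 50 or test.empty:
            continue
        mu, sigma = train.mean(), train.std()
        low, high = mu - k * sigma, mu + k * sigma
        rows.append({
            "week": str(week),
            "accuracy": 100 * test.between(low, high).mean(),
            "range_min": high - low,
        })
    return pd.DataFrame(rows)
```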

@mathaholic

@melotic you mention overall being concerned about the distributions being different and that being a limitation of your pipeline. Consider using Chebyshev's inequality theorem for a model that is more generalizable: https://statisticsbyjim.com/basics/chebyshevs-theorem-in-statistics/

@melotic (Member Author) commented Jul 12, 2022

> @melotic you mention overall being concerned about the distributions being different and that being a limitation of your pipeline. Consider using Chebyshev's inequality theorem for a model that is more generalizable: https://statisticsbyjim.com/basics/chebyshevs-theorem-in-statistics/

This solves all of our problems! Thanks Nikki :)

I've updated the one-pager with the new method of doing this. We'll use Chebyshev's inequality, and we can do this all in Kusto.
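
For anyone skimming: Chebyshev's inequality guarantees that at least 1 - 1/k² of any distribution lies within k standard deviations of the mean, with no normality assumption. A minimal sketch of turning that into a duration range (in practice this would just be avg()/stdev() in a Kusto query; the values are illustrative):

```python
# Sketch only: a distribution-free duration range via Chebyshev's inequality.
# At least `coverage` of runs fall inside the returned range, whatever the
# shape of the pipeline's duration distribution.
import math

def chebyshev_range(mean_min: float, std_min: float, coverage: float = 0.9):
    # coverage = 1 - 1/k^2  =>  k = 1 / sqrt(1 - coverage)
    k = 1.0 / math.sqrt(1.0 - coverage)
    return mean_min - k * std_min, mean_min + k * std_min

# Example: mean 50 min, std 8 min, 90% coverage => k ~ 3.16, range ~ 24.7-75.3 min
print(chebyshev_range(50.0, 8.0, 0.9))
```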

@melotic (Member Author) commented Jul 14, 2022

I think this is ready for a second review. Let me know your thoughts.

Justin Perez added 2 commits July 19, 2022 14:13
@melotic (Member Author) commented Jul 19, 2022

Addressed feedback. Would love to get thoughts & feedback on the possible remediations I listed for detecting when Helix or AzDo is on the floor so we can hide our predictions:

  1. Use Grafana's Alert API and check a handful of bad alerts.
  2. A test pipeline that periodically runs and sends stuff to Helix that we can monitor.

There's more detail in the one pager (under Caveats -> Possible Solutions)

@ChadNedzlek (Contributor)

@melotic Why not use the "known issues" stuff that @ulisesh has worked on? Trying to decide which alerts mean "don't show things" is going to be really difficult, and so rare that it doesn't seem worth coding up.

Honestly, I say just show it, even in weird broken times. Odds are users will have some inclination that things are horribly broken... either because of a mail, an FR contact, or a "known issue" in the build analysis. Trying to make this feature smart enough to know that it's not smart enough sounds like a crazy twisted bit of logic that hurts my brain. :-) Just... let it be wrong sometimes and let other mechanisms handle telling users that "stuff is weird right now".

@melotic (Member Author) commented Jul 26, 2022

Known issues is definitely the way to go. I think our best bet is to simply hide the duration if there is a Critical known issue.


> In addition, there is the issue of AzDo, Helix, or builds being on the floor, and we still give customers an estimate, blissfully unaware of any infrastructure errors. In the Jupyter notebook, I dive into an anomaly detection model, based on Helix work item wait times, trying to predict this, but the model only improves accuracy by $0.3\%$.
>
> #### Possible Solutions
Contributor:

I think I would just leave this section out, and instead put the solution to each caveat directly underneath.

First, some pipelines, like runtime's...

We will handle this by...


> We backtested the model by training on all previous data before a point, and then testing on 1 week ahead, on data the model has not seen before. Here is a graph of the accuracy over time.
>
> <img src="./PipelineMachineLearning/back-tested-accuracy-vs-time.svg" width="600" height="600">
Contributor:

Can you add a small paragraph that interprets the results in the graph?

@riarenas (Contributor)

We should move this doc to live under the other queue insights docs now that we've decided to integrate it in the same place. This LGTM with only some minor feedback.

@melotic enabled auto-merge (squash) August 2, 2022 17:34
@melotic merged commit f7e668c into dotnet:main Aug 2, 2022