Skip to content

Plugged in Lin. Reg. and PCA from Deedle.Extras#496

Merged
zyzhu merged 4 commits intofslaborg:masterfrom
Ildhesten:master
Apr 18, 2020
Merged

Plugged in Lin. Reg. and PCA from Deedle.Extras#496
zyzhu merged 4 commits intofslaborg:masterfrom
Ildhesten:master

Conversation

@Ildhesten
Copy link
Copy Markdown

Added the linear regression stuff from my own package for PCA and Lin. Reg. Let me know if we need to change/update/fix some stuff, (style, any bugs I have overlooked, etc.) before merging in completely.

@zyzhu
Copy link
Copy Markdown
Contributor

zyzhu commented Apr 5, 2020

That's awesome. Could you add some use cases as unit test? Load some csv files similar to other test cases and cover your functions.

@Ildhesten
Copy link
Copy Markdown
Author

Sure, that makes a lot of sense. Will probably get around to do it sometime this week.

@Ildhesten
Copy link
Copy Markdown
Author

Status update: Just added a few tests for the linear regression logic, that verifies that the coefficients matches with what is done in Math.net, will get around to PCA soon.

… sorted by largest eigenvalues first, and added a unit test for PCA logic.
@Ildhesten
Copy link
Copy Markdown
Author

Added another unit tests for PCA now. Reckon this will suffice?

@zyzhu
Copy link
Copy Markdown
Contributor

zyzhu commented Apr 14, 2020

In principle, this already looks pretty good. Thanks a lot for contributing your extensions directly here!

As for design of the regression function and result, I'd like to follow the experience of lm in R as it's always the first regression function used among statisticians.
https://www.datacamp.com/community/tutorials/linear-regression-R

The thing that bugs me is FitIntercept as option of b. User has to give intercept a name to get the result. But in reality, "Intercept" shall always be the name and user shouldn't worry about it, like lm in R

We could change FitIntercept to bool. But the regression has to work on Frame<'b, string> with string as the only supported column type, because you have to add "Intercept" as the default name when FitIntercept is true. I have no problem with that as regression factors shall be meaningful to have variable names in string. Let me know what you think as you have used this library already in your environment.

We shall also add more extension functions. For instance, a Summary function could be added to extend Fit so that it's easier to be discovered by users, instead of having to pipe it through LinearRegression.Fit.summary. We also need to make it more C# friendly.

F-Stats can also be added later.

Let me know what you think. We can address it in this pull, or I can merge it and then work it out in another pull.

@Ildhesten
Copy link
Copy Markdown
Author

Thank you very much for your comments.

Regarding design being close to lm in R, I agree very much, this was also the experience I tried to recreate, when I originally coded this.

Regarding FitIntercept, I have had similar thoughts as you. My choice to go for the option was:

  1. Not to restrict the column type to string (even though I must admit I cannot personally recall ever having worked with column types that are not string).
  2. Allowing the user to choose an appropriate name for the intercept if he already has a column named so.
    Whether or not there are actual users who will see locking column type to string and assuming no columns are named “Intercept”, as an issue, I really don’t know.

Another design choice could be sub-modules, so rather than doing
LinearRegression.simple, we could have:
LinearRegression.NoIntercept.simple : ‘b -> ‘b -> Frame<‘a,’b> -> Fit.t<‘a,’b>
And for the majority (string columns not named intercept):
LinearRegression.WithIntercept.simple string -> string -> Frame<‘a,string> -> Fit<‘a,string>
And for the minority (non string columns, or columns named intercept).
LinearRegression.WithIntercept.simpleCustom ‘b -> ‘b -> ‘b -> Frame<‘a, ‘b> -> Fit<‘a,’b>

So here we are essentially trading off the option type for overloads. However I don’t know much I like that approach.

A third solution could be that:

  1. We move the current implementations into LinearRegression.Advanced (yes, that probably needs a better name).
  2. In the plain LinearRegression.module we make two overloads on the form
    Let simple xCol yCol fitIntercept df = // string -> string -> bool -> Frame<‘a, string>
    If fitIntercept then
    LinearRegression.Advanced.simple xCol yCol (Some “Intercept”) df
    Else
    LinearRegression.Advanced.simple xCol yCol None df
    And similar with the multiDim. That allows the expected majority access to the simpler version, and people with non-string columns or columns named intercept will have to use the LinearRegression.Advanced functions

Regarding summary, then certainly it needs a bit of improvement and prettifications, including the F-statistic.

I’d say lets agree on the interface for LinearRegression and I can fix it in this pull request, and then leave summary improvements and F-statistics for later ones. Which one of my the design proposals do you think is the most suitable, or do you have another idea?

@zyzhu
Copy link
Copy Markdown
Contributor

zyzhu commented Apr 16, 2020

I've tried a use case of Frame<DateTime, int> using the unit test

let df = [1, stockReturns?MSFT; 2, stockReturns?WMT; 3, stockReturns?AES] |> frame
LinearRegression.multiDim [1; 2] 3 (Some 0) df
|> LinearRegression.Fit.coefficients

Output is like following

0 -> -0.897557711512795 
1 -> 0.0760246168058124 
2 -> 0.0389123651688682 

In the above case, I have to name the intercept a number, zero in this case. Though calculation is correct, the use case looks very bizarre to me. I cannot imagine stats user would ever run a regression without naming their variable for interpretation. As you confirmed, you have never encountered a case like that either.

Your solution 3 can solve it. But I feel leaving another layer of Advanced namespace will only confuse library users.

Another idea for your consideration. LinearRegression.simple is only a special case of LinearRegression.multiDim right now.
The following two are equivalent, as both are OLS regression,

LinearRegression.multiDim ["MSFT"] "AES" (Some "yIntersect") stockReturns
LinearRegression.simple "MSFT" "AES" (Some "yIntersect") stockReturns

Why don't we just name LinearRegression.multiDim as LinearRegression.ols and leave out the simple version? In the extension function, we can create an overload parameter for simple vs multidim using string seq or string. We can add other things like LinearRegression.stepwise, LinearRegression.lasso, LinearRegression.ridge in the future once they are added to Math.Net or when we find other OSS libraries out there. Or we can implement the math here first.

@Ildhesten
Copy link
Copy Markdown
Author

Changed design to this,

  1. Locking type of supported frames to Frame<'a, string>,
  2. fitIntercept is now a bool
  3. simple regression is gone, it is done by passing the single column in as e.g. a list.
  4. Assumption that no columns are named Intercept.
    So a call will now look like:
    LinearRegression.ols ["xCol"] "yCol" false dataFrame

@zyzhu
Copy link
Copy Markdown
Contributor

zyzhu commented Apr 18, 2020

Looks great to me. Feel free to send other pulls to improve the output. I'll try to fix another issue and then bump up the version to release it.

@zyzhu zyzhu merged commit c77b1a8 into fslaborg:master Apr 18, 2020
@zyzhu
Copy link
Copy Markdown
Contributor

zyzhu commented Jun 15, 2020

I've released 2.2.0 on nuget that includes this feature. Thanks again!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants