
Conversation

@niallrobinson
Contributor

Some code to calculate the correlation between cubes over any dimensions (including all of them). I'm not sure the tests are up to scratch, but at least there are some. Maybe we should make the coord list optional and just calculate the correlation over all dimensions in that case?

@esc24
Member

esc24 commented Jun 14, 2013

@niallrobinson - I haven't had time to look closely at what you've done yet, but have you considered taking advantage of the built-in stats functionality in scipy/numpy, e.g. numpy.correlate(), scipy.stats.pearsonr(), scipy.signal.correlate2d(), etc.? These may be of use.
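For reference, pearsonr operates on a single pair of 1-D arrays and returns the coefficient alongside a p-value; a minimal sketch with made-up data:

```python
import numpy as np
from scipy.stats import pearsonr

# Two made-up 1-D series; pearsonr only accepts a pair of 1-D arrays.
x = np.linspace(0.0, 1.0, 100)
y = x + np.random.normal(scale=0.1, size=100)

r, p_value = pearsonr(x, y)  # correlation coefficient and two-sided p-value
print(r, p_value)
```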

@niallrobinson
Contributor Author

Hi Ed - I initially wrote this code iterating over the cube using pearsonr(), but it was very slow. I don't know how I could use those functions without iterating, but I might have missed something. (As far as I can see, correlate() and pearsonr() both only accept two 1-D arrays, and correlate2d() is for cross-correlation, but there might be something smart you can do.)

@ajdawson
Member

@esc24 I actually suggested @niallrobinson do a manual calculation instead of using scipy.stats.pearsonr because of the vectorization opportunity it presents. I've written similar code in the past; expressing it as a matrix multiplication lets one compute all the correlations in one go rather than iterating, and it's fast, since matrix-matrix multiplication is one of the most optimized array operations going.
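A minimal sketch of the kind of vectorised calculation described here, assuming two equally shaped arrays correlated along their first axis (the function name and shapes are illustrative, not the PR's actual code):

```python
import numpy as np


def pearson_along_axis0(a, b):
    """Pearson r between matching series of a and b along axis 0.

    a and b have equal shape, e.g. (time, lat, lon); the result has
    shape a.shape[1:], one coefficient per point, with no Python loop.
    """
    a_anom = a - a.mean(axis=0)
    b_anom = b - b.mean(axis=0)
    covariance = (a_anom * b_anom).sum(axis=0)
    norm = np.sqrt((a_anom ** 2).sum(axis=0) * (b_anom ** 2).sum(axis=0))
    return covariance / norm


# The all-pairs variant is where the matrix multiplication comes in:
# with x_anom of shape (n_samples, n_series) holding unit-length anomaly
# columns, x_anom.T.dot(x_anom) gives the full correlation matrix in one
# product.
```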

@niallrobinson
Contributor Author

I'm wondering now if it might be best to separate this into its own module in analysis. That way we can separate out and generalise the function that manages the reshaping. Then we can pass a function for the maths operation to the general function, allowing this approach to be used for operations other than Pearson's r: Spearman's, covariance, etc.

@ghost assigned bblay on Jun 24, 2013
Contributor

(trivial) Please start with a capital letter.

Contributor

As you suggest, making the list of coords optional sounds like a good idea.

@bblay
Contributor

bblay commented Jun 24, 2013

The comment about separating / generalising could optionally be addressed in another ticket.
However, I must ask (forgive my ignorance), is this type of calculation generally considered to be "calculus"?

Contributor

Please consider making the docstring clearer that the cubes are "assumed to be" comparable in all other dimensions, or adding a check.

@bblay
Contributor

bblay commented Jun 24, 2013

Thanks Niall,
I've reviewed this PR.
No major issues.

Member

Currently if I provide a coordinate that doesn't exist then it goes unnoticed. It would be better to iterate over the coordinates that are passed in and extract the coordinate from the cube using cube.coord which will raise an exception if the coordinate does not exist. Then you can use cube.coord_dims to find the dimension number(s) it corresponds to.

Contributor

👍 You might need to sort the resulting list of dims too, if you do that.
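A minimal sketch combining both suggestions (the helper name is hypothetical; cube.coord and cube.coord_dims are the real iris API):

```python
def resolve_corr_dims(cube, corr_coords):
    """Map user-supplied coordinate names to sorted cube dimensions.

    cube.coord raises iris.exceptions.CoordinateNotFoundError for any
    name the cube lacks, so bad input fails fast.
    """
    dims = set()
    for name in corr_coords:
        coord = cube.coord(name)  # raises if the coordinate is absent
        dims.update(cube.coord_dims(coord))
    return sorted(dims)  # ordering isn't guaranteed across coordinates
```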

Contributor Author

The problem with that is that I also want access to the coordinates that are not named, for slice_ind and slice_shape. I'll add a check to make sure non-existent coords are caught, though.

@niallrobinson
Contributor Author

@bblay

is this type of calculation generally considered to be "calculus"?

I wouldn't consider it calculus, no. If I implied that, I didn't mean to; I was momentarily confused!

@cpelley

cpelley commented Jul 2, 2013

Rebase required to fix issues with the latest standard name table; see #586.

Member

The convention in iris is for multi-line docstrings to start with a blank line. The first line should also have a full stop at the end.
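For illustration, a docstring following that convention (the signature is hypothetical, not necessarily the PR's):

```python
def pearsonr(cube_a, cube_b, corr_coords):
    """
    Calculate the Pearson correlation coefficient over the given coords.

    The docstring opens with a blank line, and the summary line above
    ends with a full stop, per the iris convention.

    """
```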

@bblay
Contributor

bblay commented Jul 24, 2013

Humble apologies for the slow response time on this, @niallrobinson.
Travis fell asleep during the last test.

Member

These should be views, not copies, i.e. data_a = cube_a.data.view().
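For context, a .view() is a new array object over the same underlying buffer, so no data is duplicated; a small demonstration:

```python
import numpy as np

data = np.arange(6).reshape(2, 3)
view = data.view()  # same underlying buffer, no duplication
copy = data.copy()  # independent buffer, full duplication

view[0, 0] = 99
assert data[0, 0] == 99  # the view aliases the original data
assert copy[0, 0] == 0   # the copy is unaffected
```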

@rhattersley
Member

I'm wondering now if it might be best to separate this into its own module in analysis. That way we can separate out and generalise the function that manages the reshaping. Then we can pass a function for the maths operation to the general function, allowing this approach to be used for operations other than Pearson's r: Spearman's, covariance, etc.

This sounds similar to the Aggregator objects used by Cube.collapsed() and friends. Are you suggesting a user API which is something like:

result = iris.analysis.stats.do_stuff(cube_a, cube_b, iris.analysis.stats.PEARSON)
result = iris.analysis.stats.do_stuff(cube_a, cube_b, iris.analysis.stats.SPEARMAN)
result = iris.analysis.stats.do_stuff(cube_a, cube_b, iris.analysis.stats.COVARIANCE)
...

Or are you talking about an internal helper function used by a variety of user API functions, such as:

result = iris.analysis.stats.pearson(cube_a, cube_b)
result = iris.analysis.stats.spearman(cube_a, cube_b)
result = iris.analysis.stats.covariance(cube_a, cube_b)

Or have I got the wrong end of the stick altogether? 😉

@niallrobinson
Contributor Author

Thanks guys, I'll get on sorting that stuff out.

I wrote this a long time ago, but I guess I was envisaging the latter (internal helper functions); however, the former may be desirable? I don't think using the existing aggregators would be possible, as I've done the calculation explicitly here in order to take advantage of the vectorisation.

However, as @bblay said previously

The comment about separating / generalising could optionally be addressed in another ticket.

Maybe we should add that to the to-do list for another ticket and get this working first? What do you think?

@bblay
Contributor

bblay commented Jul 29, 2013

another ticket

Seems reasonable, and more agile, to worry about refactoring for these future functions in their future PRs, but if you're going to pave the way, I agree with you on the latter API.

@niallrobinson
Contributor Author

Sorry to litter this. I was so busy convincing myself I had mastered GitHub commands that I forgot to run the unit tests!

Contributor

These are not dimensions; they are names. I suggest corr_names, corr_coords, or an alternative.

@niallrobinson
Contributor Author

Thanks for the rigorous review of my refactoring. They all look sensible to me, so I'll sort them out this morning.

@bblay
Contributor

bblay commented Aug 6, 2013

Thanks @niallrobinson, please squash and we'll give the others a chance to comment before merging.

Member

I'm late to the party here, but I find the docstring quite confusing with respect to the coordinates and what output I might expect. Would it be possible to include a short but verbose example in the docstring?

Member

I agree with @ajdawson. The world only makes sense if I keep the concepts of coordinates and dimensions clearly separated. The docstring speaks of dimensions, but the signature refers to coords.

@niallrobinson
Contributor Author

@esc24 @ajdawson Hopefully that's slightly clearer now. It's a bit tortured to explain, but the example should make it clearer.

@niallrobinson
Contributor Author

Commit in response to niallrobinson@f4af695#commitcomment-3798884, which isn't shown on this feed for whatever reason.

@niallrobinson
Contributor Author

Calling @bblay - I think the Travis build has fallen over erroneously.

@bblay
Contributor

bblay commented Aug 14, 2013

I just restarted it (there's an option in the settings drop-down to do that).

@niallrobinson
Contributor Author

@bblay That option is greyed out for me, even after logging in with GitHub etc. Also, I think it failed again!

@bblay
Contributor

bblay commented Aug 14, 2013

That happens to me sometimes too. Re-restarted.

@bblay
Contributor

bblay commented Aug 15, 2013

Ok, it's not timed out this time. Those new test failures are not your fault, @niallrobinson. I believe @esc24 is looking into this problem, which affects all Iris tests.

@esc24
Member

esc24 commented Aug 16, 2013

I've implemented a temporary workaround on upstream/master (pinned netCDF4 to version 1.0.2 in .travis.yml), so you could rebase.

@niallrobinson
Contributor Author

🎉

Member

Needs an iris.tests.skip_data decorator to indicate that this test needs real data.
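A minimal sketch of the decorator in use; the test class, method, and data path are hypothetical:

```python
import iris
import iris.tests as tests


class TestPearsonRealData(tests.IrisTest):
    @tests.skip_data  # skipped when the external iris test data is absent
    def test_with_real_data(self):
        # Hypothetical data path, purely for illustration.
        path = tests.get_data_path(('PP', 'globClim1', 'theta.pp'))
        cube = iris.load_cube(path)
        self.assertIsNotNone(cube)


if __name__ == '__main__':
    tests.main()
```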

@bblay
Contributor

bblay commented Aug 22, 2013

This is looking great.
Just added a minor request for a comment.
If you then squash I'll merge it!
Thanks for all your hard work and patience on this.

@niallrobinson
Contributor Author

@bblay done...? I believe we've realised we need to comment to get attention after a commit.

@niallrobinson
Contributor Author

@bblay done and passed

@bblay
Contributor

bblay commented Aug 22, 2013

Lovely.
I'm a little vague on the statistical process itself, but happy that several experts have seen it.

bblay added a commit that referenced this pull request Aug 22, 2013
Calculates correlation between cubes
@bblay merged commit a909184 into SciTools:master on Aug 22, 2013
@bblay
Contributor

bblay commented Aug 22, 2013

Phew, good stuff @niallrobinson!

@niallrobinson
Contributor Author

Hurray! Thanks for sticking with me.

I'm a little vague on the statistical process itself, but happy that several experts have seen it.

The "statistics" themselves are basically three lines, and I'm perfectly happy with them. Its all the housekeeping - the shape manipulation etc - that I need to think carefully about.

@niallrobinson deleted the cube_correlation branch on August 23, 2013 at 08:19