Fixes Indexed assignment of extra columns in SolutionArray #838
speth merged 15 commits into Cantera:main
Conversation
| "used by SolutionArray objects.".format(name)) | ||
| if not np.shape(v): | ||
| self._extra[name] = [v]*self._shape[0] | ||
| self._extra[name] = np.array([v]*self._shape[0]) |
I know this is a pre-existing choice, but it may make sense to allow for more dimensions? I.e., this would be a case where we have something like
>>> states = ct.SolutionArray(gas, (6, 10), extra=['foo', 'bar'])
I can't quite understand what you mean. I guess you want foo and bar to be extra arrays of size (6, 10)?
Like
>>> states = ct.SolutionArray(gas, (6, 10), extra={'foo': sd, 'bar': sd})
where sd is any (6, 10) list/np array?
@sin-ha ... what I wrote was deliberate - no typo there. It currently throws a ValueError, but you could argue that it should initialize np.ndarrays with the full _shape, not just the first dimension. From my perspective, the shape of the _extra entries should be consistent with the other data.
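The shape-consistency idea can be sketched with plain NumPy (this is a hypothetical illustration outside the Cantera code; the names shape and v are not part of the PR):

```python
import numpy as np

shape = (6, 10)   # the SolutionArray shape from the example above
v = 0.0           # scalar initial value for an extra column

# Current behavior: only the first dimension is used.
first_dim_only = np.array([v] * shape[0])
print(first_dim_only.shape)  # (6,)

# Suggested behavior: the extra entry matches the full shape of the array,
# consistent with the other SolutionArray data.
full_shape = np.full(shape, v)
print(full_shape.shape)      # (6, 10)
```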
@ischoegl I'm also not sure what you're suggesting, but it seems like it would be a new feature? In that case, I think it seems outside the scope of this fix.
@ischoegl As of now, extra expects dicts unless the shape of the SolutionArray is trivial.
- If the user passes a list, we should be able to initialize the extra arrays with the same shape.
- Else, the user will pass it as a dict; in that case, we should validate extra against the complete SolutionArray shape instead of self._shape[0].
I also think this is more of an additional feature; it seems a bit outside the scope of the issue, so should it be handled here or should another issue be opened?
I think the point I raised is a trivial addition, and the application was stated in my first post: it’s a reasonable assumption that the arrays are instantiated with correct dimensions based on the information already provided. My guess is that the former single-dimensioned approach was based on _extra formerly being list-based. Now that this is no longer the case, putting in a more consistent approach is trivial.
What I am talking about affects some of the same lines that are changed already and won’t take more than a couple of additional lines. Hence, I’d like to see it considered as a friendly amendment of the original issue 😉
| if isinstance(extra,str): | ||
| extra = [extra] | ||
| if isinstance(extra, list) and all(isinstance(name, str) for name in extra): |
That captures almost everything! However, extra could also be an instance of tuple or ndarray of strings? I believe all of these are supported when no shape is used for the instantiation of SolutionArray (i.e. I am talking about the former elif extra and self._shape == (0,): case, in which case any of the above can be iterated over).
That'd be a quick fix, but I don't think np.arrays are important here?
Besides, I believe the other part, where self._shape is trivial, should be removed: after this commit it would only be a desperate attempt to map something in extra, and is probably not useful (is there a particular use case?). It lacks checks, so something weird like integers, for instance, can pass, which can't be used back, e.g.
>>> states3 = ct.SolutionArray(gas, shape=(0,0,0), extra=[1])
I don’t think that the alternative case can be executed for useful content either (you already caught all useful cases), i.e. I believe it can be deleted. If I’m not mistaken, the unit tests should still pass.
On a formatting note, there are a couple of spaces missing after commas (e.g. it should be shape=(0, 0, 0) in the previous post; same for some other lines you edited). It’s minor, but the code should remain in compliance with PEP 8.
PS: regarding ndarrays ... again, it’s minor, but they should be included for consistency.
Yes, the unit tests pass.
Sorry for the formatting, though.
bryanwweber
left a comment
Hi @sin-ha! Thanks for taking this on. I've got several comments below, I think we need to change how this is implemented. Let me know your thoughts, or if you have any questions/concerns!
| for name, value in self._extra.items(): | ||
| value.append(kwargs.pop(name)) | ||
| np.append(value,kwargs.pop(name)) |
| np.append(value,kwargs.pop(name)) | |
| np.append(value, kwargs.pop(name)) |
But this isn't correct, np.append does not work in-place. See the documentation: https://numpy.org/doc/1.18/reference/generated/numpy.append.html
This will most likely be very slow, since NumPy allocates a new array every time np.append() is called. I wonder if it would be better to cast the values in _extra to lists and use the list append method, and then cast them back to arrays? Maybe the complexity isn't worth it, though.
I also can't find any other way around this, see
What do you mean, any other way around? You'll have to find some way, since this won't work as expected. I think the simplest method is just to reassign value: `value = np.append(value, kwargs.pop(name))`. That might be slow, since NumPy has to create a new array every time, but hopefully not too slow.
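A quick demonstration of why the reassignment matters; this is plain NumPy, independent of the Cantera code under review (the extra dict and new_value name are illustrative):

```python
import numpy as np

# np.append returns a NEW array; it never modifies its argument in place.
value = np.array([1.0, 2.0])
result = np.append(value, 3.0)

print(value.tolist())   # [1.0, 2.0] -- the original array is unchanged
print(result.tolist())  # [1.0, 2.0, 3.0]

# So the result must be stored somewhere, e.g. reassigned into the dict
# being iterated over (mirroring the shape of the snippet in the diff):
extra = {"prop": np.array([1.0, 2.0])}
new_value = 3.0
for name in extra:
    extra[name] = np.append(extra[name], new_value)
print(extra["prop"].tolist())  # [1.0, 2.0, 3.0]
```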
I forgot to review the test file as well! Sorry, my main comment there was that I think each time you have a comment that delimits a different set of tests, that should be a new test function.

@sin-ha Are you going to be able to address my comments? Thanks 😄

Yes @bryanwweber, actually I was a bit busy with my never-ending university exams and forgot about it.

Thanks @sin-ha, I certainly understand having lots of exams 😊
Force-pushed edebe9b to d9ec808.
bryanwweber
left a comment
Thanks for the changes @sin-ha, just a few things left!
Done, @bryanwweber.
bryanwweber
left a comment
Thanks @sin-ha. One change still to do, and please feel free to add yourself to AUTHORS!
| for name, value in self._extra.items(): | ||
| value.append(kwargs.pop(name)) | ||
| np.append(value, kwargs.pop(name)) |
This line still isn't fixed.
I didn't think this through before. Anyways, regarding your suggestion to just reassign value (`value = np.append(value, kwargs.pop(name))`): appending to np.ndarrays is not as fast. It turns out that converting them to lists and appending (as you said before) does a better job!
See https://stackoverflow.com/a/22394181 and https://stackoverflow.com/questions/7133885/fastest-way-to-grow-a-numpy-numeric-array
In [2]: %%timeit
...: a = np.empty((0,3), int)
...: l = list(a)
...: for i in range(1000):
...: l.append([3*i+1,3*i+2,3*i+3])
...: l = np.array(l)
467 µs ± 6.32 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [3]: %%timeit
...: a = np.array([3,4,3])
...: for i in range(1000):
...: l = list(a)
...: l.append([3*i+1,3*i+2,3*i+3])
...: l = np.array(l)
...:
2.27 ms ± 531 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [4]: %%timeit
...: a = np.empty((0,3), int)
...: for i in range(1000):
...: a = np.append(a, 3*i+np.array([[1,2,3]]), 0)
...:
3.88 ms ± 21.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
@sin-ha I pushed a few more commits to this branch to fix a few edge cases I found in some testing. For your reference, the changes are in commits b86786b, 1cc53b4, and 8786b51. The other commits fix a few other things I found while working on this code. The main things to watch out for are:
The test failures are due to using NumPy 1.11.
Ubuntu 16.04 support ends April 2021, which is less than a year out. Citing a conservative (albeit current) distro, CentOS 8 uses NumPy 1.14.
PS: It appears that NumPy requirements are actually not strictly enforced.
Thanks for the thoughts on NumPy @ischoegl. It is really good to know that a conservative distribution like CentOS includes 1.14. Until this PR, there has not been a required minimum version of NumPy; we have been able to be flexible, since we hadn't used very many of the user-interface features (basically, we just used the C headers to provide arrays in Cython). Hence, it was only necessary to provide a warning for versions of NumPy that we hadn't tried. Going forward, it seems it may be necessary to enforce an actual minimum version.
The minimum NumPy version is 1.12 to support the use of numpy.full() in the SolutionArray interface introduced in Cantera#838.
speth
left a comment
From reading the discussion, I think all of the previously-mentioned issues have been hashed out at this point, right? I just had one case where I think the behavior should change.
| raise ValueError("Initial values for extra properties must be " | ||
| "supplied in a dict if the SolutionArray is not " | ||
| "initially empty") | ||
| self._extra[name] = np.empty(self._shape) |
I don't think the 'extra' arrays should be allowed to contain uninitialized values. The existing behavior of requiring initial values if the SolutionArray is not empty should be preserved.
@speth ... I’d ask to consider retaining this behavior (while potentially avoiding np.empty).
Think of a scenario where extra is calculated based on the content of the SolutionArray itself (which typically receives the actual value using some setter only after it is instantiated, so the content of extra is not known a priori). For this case, you’d have to provide dummy values to create extra columns, which would be subsequently overwritten. I believe that automatically initializing values as zeros would be a lot more convenient (and avoid np.empty); columns that contain strings are presumably rare, so having to provide a value there would be ok.
@ischoegl I originally had the same objection as you, but I considered it for a while and I ended up agreeing with @speth. There are two main reasons for this, as I see it:

- It is trivially easy to specify a dummy value, since you do not need to specify the shape. For instance:

>>> arr = ct.SolutionArray(gas, (3, 10), extra={"prop": 0, "prop2": 0.0})
>>> arr.prop
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
>>> arr.prop2
array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])
>>> # This is not allowed right now, but would be almost the same, except the dtype
>>> arr = ct.SolutionArray(gas, (3, 10), extra=["prop", "prop2"])

Since I've used np.full() to fill the initial value, there's no need to give an array as the initial value.

- It isn't clear what dtype should be used - it is not only strings, but whether one wants to use floats vs. ints or any other dtype as well. I suppose floats could be used in place of ints, but there are legitimate use cases for the int dtype, for instance, as a categorical or level variable. I think it would also not be clear why one would have to specify an initial value for some dtypes but not for others.
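The dtype point can be checked directly with plain NumPy, outside the Cantera code: np.full infers the array dtype from the scalar fill value, so the dummy scalar decides between int and float columns without spelling out a shape.

```python
import numpy as np

# np.full infers the dtype from the fill value, so a dummy scalar is enough
# to choose the dtype of an extra column.
ints = np.full((3, 10), 0)      # integer column
floats = np.full((3, 10), 0.0)  # float column

# The exact integer width is platform-dependent, but the kinds differ:
print(ints.dtype.kind)    # 'i' (signed integer)
print(floats.dtype.kind)  # 'f' (floating point)
```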
@bryanwweber ... I didn’t think of the scalar route, so your argument is convincing.
The availability of the scalar initializer is a nice compromise for the case where the value is going to be calculated anyway.
If I'm reading this block of code correctly now, the only time that this line can be called is if self._shape is (0,), right? It might be more clear if the value was just set to self._extra[name] = np.empty(0) then.
Yes, that's correct, it only runs if self._shape == (0,). I'm not sure it's clearer to just pass 0 though. Shapes of NumPy arrays are always returned as tuples, and just writing np.empty(0) implies to me that 0 will be the value in the array (notwithstanding the name empty), not that there's a shape of 0. So I think it would be better to use np.empty(shape=(0,)), which is then just self._shape, and putting in (0,) I thought would be more confusing than using self._shape, because the question arises, if these are equal why not just use the existing variable name?
I think explicitly showing the size as zero (with whatever notation you like) is preferable because it makes it clear that this doesn't result in an array with uninitialized values. Tracing the logic back to see that self._shape is (0,) if we're in this branch is not really that simple.
OK, this is fixed now with shape=(0,) 👍
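For reference, a zero-length array created this way has no uninitialized values to expose, which is what the shape=(0,) spelling makes explicit (plain NumPy, outside the Cantera code):

```python
import numpy as np

# An "empty" array with a zero-length shape holds no values at all, so
# there is nothing uninitialized that a user could read back.
a = np.empty(shape=(0,))
print(a.shape)  # (0,)
print(a.size)   # 0

# Appending the first real value works as usual (np.append returns a copy):
a = np.append(a, 1.5)
print(a.tolist())  # [1.5]
```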
Fix the formatting of the two KeyErrors in SolutionArray.append(). The first one was missing the format method on the string. The second was raising a double-KeyError due to the way the except is handled.
When appending to a SolutionArray, any extra values must be specified as kwargs. If they aren't present, the array lengths would become out of sync, or there would be a KeyError on the kwargs dictionary. Also adds a test for the failure modes of SolutionArray.append().
Move append of extra values to the end of SolutionArray.append(). If the append is done at the beginning of the function, the append can happen even if the state is invalid. This would cause the length of the arrays to become out of sync.
If a single value (e.g., integer or float) is passed as the value for an extra column, the array with the single value cannot be reshaped. Use np.full() instead. Add/rename tests for creating extra items by dicts and iterables, along with tests of the failure conditions.
Make sure that ndarrays are one-dimensional. Move check for bare string. Clarify error messages. Reduce indentation. Add tests for creating extra items from bare strings and ndarrays.
This change bumps the builder that uses the system Python 2 to run SCons and the system Python 3 for the Python interface to Ubuntu 18.04 from 16.04. The reason for this change is that Ubuntu 16.04 provides NumPy 1.11, which does not support flexible dtypes when creating arrays using .full(). We have decided to drop support for NumPy older than 1.12 for this reason. Numpy 1.12 was released in January 2017.
If the extra columns are passed to the SolutionArray, they must either have initial values or the SolutionArray must be empty.
Checklist
- (scons build & scons test) and unit tests address code coverage

If applicable, fill in the issue number this pull request is fixing
Fixes #829
Changes proposed in this pull request
- Earlier, self._extra[name] was initialized as a list, but when the set-item method was called, it returned an np.array() copy of it.
- So I propose to initialize it as an np.array in the init method and amend the other methods to work with it.
- I think that this would be a better implementation.