Skip to content

Overload to Phase::speciesIndex() to take an atomic composition map of the species#635

Closed
ThanasisMattas wants to merge 1 commit intoCantera:mainfrom
ThanasisMattas:issue#603
Closed

Overload to Phase::speciesIndex() to take an atomic composition map of the species#635
ThanasisMattas wants to merge 1 commit intoCantera:mainfrom
ThanasisMattas:issue#603

Conversation

@ThanasisMattas
Copy link
Copy Markdown
Contributor

@ThanasisMattas ThanasisMattas commented May 3, 2019

Fixes #603

Changes proposed in this pull request:

As noticed at the issue #603, in some cases, querying a species by name is not a well-defined process, because there is not a requirement that species be named in this way. Thus, an overload of Phase::speciesIndex() is provided, in order to find the index of the species, by passing its atomic composition map (compositionMap Species::composiotion).

@codecov
Copy link
Copy Markdown

codecov bot commented May 3, 2019

Codecov Report

Merging #635 into master will increase coverage by 0.18%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #635      +/-   ##
==========================================
+ Coverage   71.40%   71.58%   +0.18%     
==========================================
  Files         372      372              
  Lines       43605    44520     +915     
==========================================
+ Hits        31134    31870     +736     
- Misses      12471    12650     +179     
Impacted Files Coverage Δ
include/cantera/thermo/Phase.h 86.04% <ø> (+36.04%) ⬆️
src/thermo/Phase.cpp 83.55% <100.00%> (+0.18%) ⬆️
test/thermo/phaseConstructors.cpp 100.00% <100.00%> (ø)
include/cantera/zeroD/IdealGasReactor.h 50.00% <0.00%> (-50.00%) ⬇️
include/cantera/zeroD/ConstPressureReactor.h 50.00% <0.00%> (-50.00%) ⬇️
...clude/cantera/zeroD/IdealGasConstPressureReactor.h 50.00% <0.00%> (-50.00%) ⬇️
include/cantera/kinetics/Falloff.h 52.50% <0.00%> (-47.50%) ⬇️
include/cantera/kinetics/Kinetics.h 56.75% <0.00%> (-33.49%) ⬇️
include/cantera/numerics/ctlapack.h 77.04% <0.00%> (-22.96%) ⬇️
include/cantera/thermo/MixtureFugacityTP.h 20.00% <0.00%> (-13.34%) ⬇️
... and 58 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b23f184...a106e76. Read the comment docs.

@decaluwe
Copy link
Copy Markdown
Member

decaluwe commented May 3, 2019

Thanks for this contribution! Two questions:

  1. How does this deal with isomers? I'm guessing it just returns the first encountered and that this is probably the best we can do, here. But curious if you have any other thoughts.

  2. Can you add something to the test suite to cover this and demonstrate its operation? If one doesn't already exist, it would be good to add a test which includes both the usual (string) approach and this new approach, and verifies that they return the same result.

Thanks again.

@ThanasisMattas
Copy link
Copy Markdown
Contributor Author

Hello! Thank you for the quick response.

  1. Good observation. Searching through the files, unfortunately I didn't find any attribute to hang on, so that it can be used as a 2nd passing argument at another overload. In order to overcome this, as far as I can tell, one has to carefully pass the name, or add another variable or flag to the header, to determine that a species is an isomer. But, I think the last one is too much.

  2. I tested it locally before the pr, to check if it is working. I am writing the unit test now.

@ischoegl
Copy link
Copy Markdown
Member

ischoegl commented May 3, 2019

I think the issue of isomers is quite important, and a case where multiple species match a composition should not pass quietly. Intuitively a warning should be logged, so a user is aware of what is going on. Based on that, additional arguments can resolve this.

@ThanasisMattas
Copy link
Copy Markdown
Contributor Author

Ok, this is quite reasonable. So, what about:

a search whether there are other species with the input atomic composition map and, if true:

  1. create a vector with the isomers
  2. log a message printing the vector
  3. log a message prompting the user to repeat, in case they want another isomer, passing as second parameter the index of the isomer in the vector (and the method takes ellipsis as second parameter).

@ischoegl
Copy link
Copy Markdown
Member

ischoegl commented May 3, 2019

I did write a function looking for isomers some time ago in Python, and decided to returned a list. The C++ core is very different, and having alternative return types for overloaded functions may not be as straight-forward.

@ThanasisMattas
Copy link
Copy Markdown
Contributor Author

ThanasisMattas commented May 4, 2019

The intented return value of the method is the index of the species at:

std::vector<std::string> Phase::m_speciesNames; // and
std::map<std::string, size_t> Phase::m_speciesIndices; // this is exactly for fast searching

With the current implementation, I think that the only way to be sure that one is accessing the right isomer is by the correct name. A different characteristic attribute to use to access is case specific, I guess. The approach that I can think of is a simple log message that prints the isomers (if the specific species has an isomer at the phase) and the exact names that one can access them by name and a prompt to do so, if the first encountered is not the intented. Any ideas are wellcome! I haven't use cantera that much, so is isomers in a single phase a common situation?

@ischoegl
Copy link
Copy Markdown
Member

ischoegl commented May 4, 2019

From an application point of view, isomers are not uncommon, especially with large mechanisms. I agree that accessing species via composition is not uniquely defined (as in the one-to-one mapping of species names and indices). I am not as familiar with the cantera core, but believe that there may be an argument for introducing a separate function rather than overloading (I'd be interested in @speth's opinion). On _cantera.pxd, species are handled via

        size_t speciesIndex(string) except +translate_exception
        string speciesName(size_t) except +translate_exception

which on Python is handled via ThermoPhase.species_index and ThermoPhase.species_name. From a user perspective, it would be great if there would be a simple solution that is is exposed to the Python interface and does not require looking for warnings and/or exceptions (which will complicate automated processing: I rarely process data interactively). PS: these are my 2 cents, so I'm not necessarily speaking for the community. Also, thanks for your contribution(s)!

@speth
Copy link
Copy Markdown
Member

speth commented May 11, 2019

While isomers are not uncommon, using the current code base as a guide, the most common use case is to find the index of a single species that probably does not have any isomers (at least in the mechanism), so a function that just returns a single index would probably be the most useful. To provide some flexibility, I would suggest adding a second argument to the function which determines whether to just return the first match found, or to ensure uniqueness and raise an exception if multiple matches are found.

I agree that issuing a warning is the least useful option, since it will likely go unnoticed in the case where the code is being run non-interactively.

There is then a second question of whether adding a second function to return a list of all isomers is useful enough to include. This can be done already in the Python interface with a fairly straightforward list comprehension, e.g.

   [S for S in gas.species() if S.composition == {'C': n, 'H', m}]

@ThanasisMattas
Copy link
Copy Markdown
Contributor Author

Thanks for the feadback. I have 2 questions:

  1. What would the 2nd argument be? Is something like "bool checkForIsomer" reasonable? And at the method description stated that if the user wants the 1st encountered, they may pass false, or if they want an exception thrown in case of isomers, they may pass true. (And what could they do then? Pass the name?)

  2. If a user is searching for a non-isomer in the phase, is it wise to still have to use the two-argument method, or should there be a single-argument like that in the 1st commit, too. I think that only the two-argument should exist, in order to avoid a possible bug, but I would like to hear you opinions.

@skyreflectedinmirrors
Copy link
Copy Markdown
Contributor

skyreflectedinmirrors commented May 12, 2019 via email

@ischoegl
Copy link
Copy Markdown
Member

I routinely download large mechanisms (hundreds of species) where the first problem is to identify names used for specific fuel species (various research groups have non-uniform conventions for large hydrocarbons): the simplest way to do that is by their formula, which is where the issue of isomers (or missing species) comes up very frequently.

Following up on @speth's convenient comprehension suggestion, my go-to python solution would be

{i: spc for i, spc in enumerate(gas.species()) if spc.composition == {'C': n, 'H', m}}

which however does lead to more fundamental questions:

  1. should there be a core function where the intended outcome (identification of a unique species) will fail in a non-trivial number of cases? Compared to the simpler python approach, the required exception handling will make the core function less powerful.
  2. where does cantera draw the line between implementations in C++ and Python? (Anything implemented in Python cannot be easily ported to Matlab/C++/Fortran.) My understanding had been that much of the actual work should be done in the C++ core, with Python/Matlab/etc. ideally being light-weight wrappers.

@ThanasisMattas
Copy link
Copy Markdown
Contributor Author

ThanasisMattas commented May 13, 2019

@arghdos I will try that.

@ischoegl
By my understunding, the particular function searches through the species in a user defined phase, where a user includes the species from a eg .cti file. Is a user defined phase composed of a whole mechanism (hundreds of species)? If yes, then there is a problem. This far I can go with my experience. Namely, I think that the particular function does not address the problem of non-uniform conventions in a mechanism, because it does not search the .cti / .xml / YAML / whatever input file. It searches through these vector and map, which refer to a particular phase. In the examples that I saw, phases are composed of not that many species drawn upon a .cti file. Then again, I don't have much hands-on experience to tell.

Sorry if there is a misunderstanding. If indeed isomers in a single phase is a common problem, then a user has to search the correct name in the mechanism and then query the name. Consequently, a more strict approach with names should be evaluated.

As a probably wrong idea, another approach may be to pass, along with the composition map, the name of a reaction that uniquely identifies a species (?).

I agree that it is best for maintaining and performance to keep the core in C++, but this is not me to say!

Again, I will try the default second argument, rasing an exception, if the user chooses so, to see how it works.

@decaluwe
Copy link
Copy Markdown
Member

decaluwe commented May 13, 2019 via email

@ThanasisMattas
Copy link
Copy Markdown
Contributor Author

So, by default, Phase::speciesIndex() searches for isomers in the phase object. If at least one isomer to the query exists, an error is thrown, printing a list of the isomers. There is at least 50% chance that the first encountered is the wrong one, thus an error seems a reasonable responce. Then, the user can use the existing oveload and pass the correct name.

Providing some flexibility, as suggested, a user can pass as a second argument (which defaults to true) a false expression, in order to get the first encountered, ignoring the case of isomers.

@decaluwe
Copy link
Copy Markdown
Member

decaluwe commented Jun 7, 2019 via email

@speth
Copy link
Copy Markdown
Member

speth commented Jun 8, 2019

Providing some flexibility, as suggested, a user can pass as a second argument (which defaults to true) a false expression, in order to get the first encountered, ignoring the case of isomers.

Since the iteration is done over the m_species map, this means that the first isomer encountered is the one whose name comes first alphabetically. I think it would be slightly less surprising to return the species with the lowest index, i.e. the first one appearing first in the list of species in the phase.

@ThanasisMattas
Copy link
Copy Markdown
Contributor Author

ThanasisMattas commented Jun 11, 2019

@speth I edited the description, stating that if "bool checkForIsomers == false", the first "lexicographically" encountered isomer is returned.

@decaluwe

  1. I edited the error message, so that it prints the indices, too.
  2. I added a unit-test so that it checks if:
    • the default 'npos' value is returned, in case of unindexed species
    • the correct index is returned, in case of indexed species
    • the same value is returned with both overloads
    • the first lexicographically encountered species is returned, in case a false expression is passed as a second argument
    • a CanteraError is thrown, in case of isomers exist and checkForIsomers is true (the default behavior)

Locally, I ran the following test:

#include "cantera/thermo.h"
#include "cantera/IdealGasMix.h"
#include <iostream>

using namespace Cantera;

void checkSpeciesIndex()
{
    IdealGasPhase gas("gri30_ion.cti","gas");

    compositionMap HCNOcomp = gas.species("HCNO")->composition;
    std::cout << gas.speciesIndex(HCNOcomp) << std::endl;
}

int main()
{
    try {
        checkSpeciesIndex();
    } catch(std::exception& err) {
        std::cout << err.what() << std::endl;
    }
}

Output:


CanteraError thrown by Phase::speciesIndex():
Isomers detected in phase 'gas':
hcno with index: 44
hnco with index: 46
hocn with index: 45
You should use Phase::speciesIndex(), passing the name of one of the species listed, or use the corresponding index directly.


@ischoegl
Copy link
Copy Markdown
Member

ischoegl commented Aug 10, 2019

@ThanasisMattas ... thanks for your contribution so far. I have two suggestions that would imho improve this PR:

  1. Introduce an intermediate function Phase::findIsomers that generates a map of isomer names and indices based on a compositionMap. I.e. signature and operation of your suggested Phase::speciesIndex overload would remain the same, but the logic would operate on the map that is generated internally by findIsomers. I believe this would both improve code readability and add additional benefits.

  2. Break out both functions to Python via the cython interface, i.e. introduce ThermoPhase.find_isomers, as well as add some logic to ThermoPhase.species_index so it interfaces to the overloaded function in the C++ core.

PS: @speth ... yes, it is possible to filter out isomers using

{i: spc for i, spc in enumerate(gas.species()) if spc.composition == {'C': n, 'H', m}}

but using

gas.find_isomers({'C': n, 'H', m})

to get the same dictionary would be quite a bit more convenient. As it is already necessary to track down isomers in the C++ core, why shouldn't it be broken out?

@ThanasisMattas
Copy link
Copy Markdown
Contributor Author

@ischoegl Thanks for the suggestions. They seem reasonable. The 1st one is quite straightforward. For the 2nd one, I 'll get on it when I find a little time.

@ischoegl ischoegl mentioned this pull request Sep 24, 2019
@ischoegl
Copy link
Copy Markdown
Member

ischoegl commented Nov 2, 2019

@ThanasisMattas ... I ended up implementing the findIsomers routine in #714 (C++ and Python interface; which at this point is merged). I didn't touch the speciesIndex overload though.

@bryanwweber bryanwweber force-pushed the issue#603 branch 2 times, most recently from 277291c to 7eb4cc2 Compare May 31, 2020 03:19
@bryanwweber
Copy link
Copy Markdown
Member

@speth @ischoegl @decaluwe I've updated the implementation of speciesIndex() here a little bit, to use the findIsomers() method and to clarify the error message and documentation. I'm not sure what a good Python implementation would look like. Since this version of speciesIndex() requires another argument if you want the index of the first isomer that's found, I couldn't see a good way to do that in Python. The simplest way:

    cpdef int species_index(self, species, raise_if_isomers=True) except *:
        """
        The index of species *species*, which may be specified as a string or
        an integer. In the latter case, the index is checked for validity and
        returned. If no such species is present, an exception is thrown.
        """
        if isinstance(species, (str, bytes)):
            index = self.thermo.speciesIndex(stringify(species))
        elif isinstance(species, dict):
            index = self.thermo.speciesIndex(
                comp_map(species), bool(raise_if_isomers)
            )
       elif ....  # There's more code here, but not relevant

Doesn't seem good because that optional argument is used in only one of the branches. On top of that, I couldn't actually get it to work, the True was always passed to the underlying function. On the other hand, having a separate function to handle this case also seems less than optimal. Thoughts would be appreciated. Thanks!

@ischoegl
Copy link
Copy Markdown
Member

@bryanwweber ... intuitively, I would suggest to eliminate the raise_if_isomer argument, as there's no way to guarantee a desired outcome (the order of isomers is specific to a mechanism, so it's impossible to preempt a 'best' option), i.e. an error should be thrown regardless.

Any other option would potentially lead to unexpected outcomes where root causes may be extremely hard to track down (e.g. set the override when using mechanism A as this has the desired outcome; when using the same code for mechanism B things get dicey).

@bryanwweber
Copy link
Copy Markdown
Member

@ischoegl I disagree that the option should be removed, assuming we can come up with a decent interface. I think there are many use cases where returning a value would be appropriate. Since the default is to raise, anyone not aware would get the error and would have to opt into the other behavior. Moreover, a user would have the same problem with any kinds of operations involving species indices when moving between mechanisms, so I don't see how this is all that different.

@ischoegl
Copy link
Copy Markdown
Member

@bryanwweber ... no problem agreeing to disagree here. I believe the cleanest solution in case of an error raised by non-unique isomers would be to specify a species by its unique name. Picking a species based on a predefined position in a list appears somewhat arbitrary. I’ll rest my case at this point 😉 ...

@decaluwe
Copy link
Copy Markdown
Member

decaluwe commented Jun 1, 2020

I'm trying to think of the use case (or I guess the user's ultimate intent) where getting just the first index found is acceptable. It essentially says "the species properties don't matter, I just want something with this composition," no? What would be the value to the user, here? Something related to species balance?

Not to stir the pot, but if that were their intent, I would think they'd want all indices returned, in an array. I guess that would be a next step and doesn't impact this change, necessarily.

@bryanwweber in terms of only one branch using the optional argument... is this sort of a "code style" issue, rather than anything related to the functionality? Just trying to be clear on the problem with it. I agree though, that the only way around that would be a completely separate user function for finding id by map vs. via string. If so, I'd agree--sub-optimal, to be kind.

Of the two problems, I'd prefer the former, but I feel pretty sympathetic to @ischoegl's take, here: I'm having a hard time IDing the case where the user doesn't care which isomer gets returned.

@bryanwweber
Copy link
Copy Markdown
Member

@decaluwe @ischoegl I had written a long post justifying my position, and then I had a light bulb and I think I get what you're both saying now. In #714, @ischoegl added the find_isomers function, which returns a list of isomer names for a given composition. I guess you're suggesting that species_index/Phase::speciesIndex returns a list/array/vector of indices for the species names identified by find_isomers/Phase::findIsomers? That makes much more sense to me, and will also satisfy the original issue that prompted this PR (#603) which was to make it easier/more robust for code like the radiation model in the 1-D solver to identify species that are needed (e.g., H2O) which may have different names in different mechanisms.

Let me see whether it makes sense to continue modifying the code here or open a new PR.

@decaluwe
Copy link
Copy Markdown
Member

decaluwe commented Jun 1, 2020

@bryanwweber I think this works. If they think they do want just a single species, returning an array will probably error out in some way that they will catch. I need to think on that, a little more. Definitely don't want a case where someone thinks they're getting a single index, but unknowingly gets an array, undetected. But in general, I think what you suggest above is the best approach.

Phase::speciesIndex() can take an atomic composition map to find the
index of species with that composition. Due to the possibility of
isomers, the overloaded function always returns a vector<size_t>, even
if no isomers are found. Similarly, in the Python interface, passing a
dictionary to species_index returns a list even if only a single
species is found.

Resolves Cantera#603
@bryanwweber
Copy link
Copy Markdown
Member

@ischoegl @speth @decaluwe I made another round of updates here. It feels a little inconsistent that a single Python function can give you either an integer or a list of integers, but I couldn't see a better way to provide this interface in Python. Maybe it makes sense just to drop this from Python, but I think it would be generally useful so 🤷‍♂️

@ischoegl
Copy link
Copy Markdown
Member

ischoegl commented Jun 3, 2020

@bryanwweber ... the only way to avoid a list would be to raise errors for species_index whenever a composition is not unique, while directing users to the alternative find_isomers route. I still find a rigorous treatment somewhat more consistent, which imho would be preferable to dropping this from Python entirely. Overall, it’s not a major issue ...

@bryanwweber
Copy link
Copy Markdown
Member

@ischoegl I considered that, but find_isomers returns species names, not indices, resulting in an additional step for users. I suppose users would have to look up the species name for the indices returned by species_index anyways, since the index by itself is not very meaningful... I think that a function which will raise/throw an error some large fraction of the time during normal use is not all that nice of an interface either.

@speth
Copy link
Copy Markdown
Member

speth commented Jun 30, 2020

Per @bryanwweber's concluding suggestion on #603, I'm going to close this. Thanks to all for the discussion -- at least we avoided adding potentially confusing behavior to a fairly basic function 😕.

@speth speth changed the base branch from master to main June 30, 2020 23:12
@bryanwweber
Copy link
Copy Markdown
Member

Seems like this never got closed, so I'm doing that now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add a function to find species by chemical formula

6 participants