MMCif behavior when auth_seq_id is missing
A discussion came up in PR #774 regarding the correct behavior when parsing an mmCif file without auth_seq_id.
BioJava 4.2.11 requires the auth_seq_id column. This is a problem because it is optional according to the spec and omitted by PyMOL.
In b207d34 I added code to use the label_seq_id column for creating the ResidueNumber for each group if auth_seq_id is missing. There was some concern that this could lead to inconsistent residue numbers if some residues used '?' (defaulting to label_) while the rest used the auth_ values specified. This worry is actually not justified due to another bug, which causes a NumberFormatException if '?' is used in that column.
@josemduarte suggested only doing the label_ fallback if ALL groups have null ResidueNumbers. This is probably the right solution, but it seems like such an edge case it might not be worth the hour it will take to fix it.
683132d contains tests showing that the current code (with #774) handles a missing auth_seq_id column and throws an exception for a partially populated column. Good enough?
Thanks @sbliven , very nice tests! There's one detail I was forgetting: HETATMs have a '.' for their label_seq_id in deposited PDB files.
I've added some HETATMs to your tests and they failed. I have made a fix that should handle that: https://github.com/josemduarte/biojava/commit/699af8a8c2ed39ff71f3d2a29036c106f299d33f All tests pass with that so I guess we can keep it. I still worry that the HETATMs won't be handled correctly in the edge case where there's no auth_seq_id, but it is an edge case.
Please feel free to pull request all that together if you are happy with that.
I think this can't be properly fixed in 4.* because we need access to the seq_id. This is stored in AminoAcidImpl.getId() and similar, but not available for Group.
I'm going to rebase these tests onto the master branch and do any fixes there. 4.* will continue to throw errors if auth_seq_id is missing or not numeric.
A related comment is that auth_seq_id is an arbitrary string, according to the spec. I think that might break the current data model.
What is the status of this issue? I still do get an Error when parsing mmCIF files without auth_seq_id produced by PyMOL with BioJava 6.0.5 (Error: org.rcsb.cif.EmptyColumnException: column auth_seq_id is undefined).
Is there a conclusion about the correct expected behaviour of BioJava in such case? Raising an exception or fallback to sequential numbering?
I'd be in favour of falling back to a sequential numbering if auth_seq_id is not provided.
But of course there are some subtleties on how to implement it (see above):
- for polymers: all
auth_seq_idshould be null in order to consider it empty and fall back to sequential (i.e.label_seq_id) - non-polymers are expected to have no
label_seq_id. If they have also noauth_seq_idthen it'd be complicated to assign an id for them. Perhaps files with noauth_seq_idfor non-polymers should be considered invalid?
We have already a very good start with this unit test: https://github.com/sbliven/biojava-sbliven/commit/683132da29516eb2bfac1dfd3eb56a16e37ce527
Note also that we are now (since 6.0.0) relying on a much more flexible parser from the https://github.com/rcsb/ciftools-java library which should make things easier.
@rdk would you be willing to take it from there to provide a PR?
@josemduarte sorry for the delay. Yes I would like to do it. Should I try to do it on the 6.x branch or just for 7.x? Has your thinking about how to do it correctly (especially with regards to the mentioned subtleties) developed in the meantime?
I think we want the fix in 7.x. The one big difference between 6.x and 7.x is the move to java 11. I think at this point we can assume that's a safe choice for everyone so I think there's no need to maintain 6.x.
See my last comment above for what I think the implementation should be. Of course it may be that some details are missing, but I hope that gives the general direction to follow.
Note that PR #1083 has solved this issue: mmCIF files with atom_site section not specifying author fields (chain id, residue number, insertion code) will fall back to use the corresponding label fields.
@rdk if you have a file to test it, that would be a nice confirmation. I've tested with files coming from ESMAtlas.
I've just confirmed that a pymol (2.5) exported cif works with latest master (and didn't work before #1083 ). I'll close.
@josemduarte thank you very much! I'm going to try it right away.