biojava MMCif behavior when auth_seq

A discussion came up in PR #774 regarding the correct behavior when parsing an mmCif file without auth_seq_id.

BioJava 4.2.11 requires the auth_seq_id column. This is a problem because it is optional according to the spec and omitted by PyMOL.

In b207d34 I added code to use the label_seq_id column for creating the ResidueNumber for each group if auth_seq_id is missing. There was some concern that this could lead to inconsistent residue numbers if some residues used '?' (defaulting to label_) while the rest used the auth_ values specified. This worry is actually not justified due to another bug, which causes a NumberFormatException if '?' is used in that column.

@josemduarte suggested only doing the label_ fallback if ALL groups have null ResidueNumbers. This is probably the right solution, but it seems like such an edge case it might not be worth the hour it will take to fix it.

Jun 15 '18 10:06 sbliven

683132d contains tests showing that the current code (with #774) handles a missing auth_seq_id column and throws an exception for a partially populated column. Good enough?

Jun 15 '18 10:06 sbliven

Thanks @sbliven , very nice tests! There's one detail I was forgetting: HETATMs have a '.' for their label_seq_id in deposited PDB files.

I've added some HETATMs to your tests and they failed. I have made a fix that should handle that: https://github.com/josemduarte/biojava/commit/699af8a8c2ed39ff71f3d2a29036c106f299d33f All tests pass with that so I guess we can keep it. I still worry that the HETATMs won't be handled correctly in the edge case where there's no auth_seq_id, but it is an edge case.

Please feel free to pull request all that together if you are happy with that.

Jun 15 '18 18:06 josemduarte

I think this can't be properly fixed in 4.* because we need access to the seq_id. This is stored in AminoAcidImpl.getId() and similar, but not available for Group.

I'm going to rebase these tests onto the master branch and do any fixes there. 4.* will continue to throw errors if auth_seq_id is missing or not numeric.

Jun 18 '18 10:06 sbliven

A related comment is that auth_seq_id is an arbitrary string, according to the spec. I think that might break the current data model.

Jun 18 '18 10:06 sbliven

What is the status of this issue? I still do get an Error when parsing mmCIF files without auth_seq_id produced by PyMOL with BioJava 6.0.5 (Error: org.rcsb.cif.EmptyColumnException: column auth_seq_id is undefined).

Is there a conclusion about the correct expected behaviour of BioJava in such case? Raising an exception or fallback to sequential numbering?

Apr 19 '22 07:04 rdk

I'd be in favour of falling back to a sequential numbering if auth_seq_id is not provided.

But of course there are some subtleties on how to implement it (see above):

for polymers: all auth_seq_id should be null in order to consider it empty and fall back to sequential (i.e. label_seq_id)
non-polymers are expected to have no label_seq_id. If they have also no auth_seq_id then it'd be complicated to assign an id for them. Perhaps files with no auth_seq_id for non-polymers should be considered invalid?

We have already a very good start with this unit test: https://github.com/sbliven/biojava-sbliven/commit/683132da29516eb2bfac1dfd3eb56a16e37ce527

Note also that we are now (since 6.0.0) relying on a much more flexible parser from the https://github.com/rcsb/ciftools-java library which should make things easier.

@rdk would you be willing to take it from there to provide a PR?

Apr 26 '22 15:04 josemduarte

@josemduarte sorry for the delay. Yes I would like to do it. Should I try to do it on the 6.x branch or just for 7.x? Has your thinking about how to do it correctly (especially with regards to the mentioned subtleties) developed in the meantime?

Aug 01 '23 11:08 rdk

I think we want the fix in 7.x. The one big difference between 6.x and 7.x is the move to java 11. I think at this point we can assume that's a safe choice for everyone so I think there's no need to maintain 6.x.

See my last comment above for what I think the implementation should be. Of course it may be that some details are missing, but I hope that gives the general direction to follow.

Aug 02 '23 14:08 josemduarte

Note that PR #1083 has solved this issue: mmCIF files with atom_site section not specifying author fields (chain id, residue number, insertion code) will fall back to use the corresponding label fields.

@rdk if you have a file to test it, that would be a nice confirmation. I've tested with files coming from ESMAtlas.

Mar 05 '24 01:03 josemduarte

I've just confirmed that a pymol (2.5) exported cif works with latest master (and didn't work before #1083 ). I'll close.

Mar 06 '24 00:03 josemduarte

@josemduarte thank you very much! I'm going to try it right away.

Mar 06 '24 09:03 rdk

MMCif behavior when auth_seq_id is missing