Generate reference citation keys for ingested references#114
Conversation
| yarn check | ||
| yarn test | ||
|
|
||
| yarn pylint:check |
There was a problem hiding this comment.
Noticed that someone added this to yarn check (thank you!), so we don't need to call it twice here
| def update(self, target: typing.Reference): | ||
| """ | ||
| Update a Reference in storage with the target reference. | ||
| This is used when the client has updated the reference in the UI. | ||
| """ | ||
| for i, source in enumerate(self.references): | ||
| if source.filename_md5 == target.filename_md5: | ||
| logger.info(f"Updating reference {source.source_filename} ({source.filename_md5})") | ||
| self.references[i] = target | ||
| break | ||
| else: | ||
| logger.info(f"Unable to find reference {target.source_filename} ({target.filename_md5})") | ||
| return | ||
| self.save() |
There was a problem hiding this comment.
alternatively, we could pass in the filename hash (or we should probably use a UUID) and dict, making the function signature something like:
def update(reference_id: UUID, updates: dict[string, Any])where dict[string, Any] => field_name_to_update: value_to_set. would need to think through how to do this for subtypes like Author though. maybe individual methods for update_reference and update_author to make things clear and specific.
There was a problem hiding this comment.
I think it is ok, also bc this is JsonStorage file. Maybe we need to expose to the frontend more granular operations (ex: rename citation). So this is ok for now.
| given_name: str | None = None | ||
| surname: str | None = None | ||
| email: str | None = None |
There was a problem hiding this comment.
Grobid won't always have these fields, but from what I can tell full_name will always be present if it can parse any authors.
We'll be able to get a lot more consistent here once we start work on further augmenting our references with #140
| id: reference.filename_md5, | ||
| title: reference.title ?? reference.source_filename.replace('.pdf', ''), | ||
| authors: (reference.authors ?? []).map((author) => ({ fullName: author.full_name, surname: author.surname })), | ||
| authors: (reference.authors ?? []).map((author) => ({ fullName: author.full_name })), |
| def update(self, target: typing.Reference): | ||
| """ | ||
| Update a Reference in storage with the target reference. | ||
| This is used when the client has updated the reference in the UI. | ||
| """ | ||
| for i, source in enumerate(self.references): | ||
| if source.filename_md5 == target.filename_md5: | ||
| logger.info(f"Updating reference {source.source_filename} ({source.filename_md5})") | ||
| self.references[i] = target | ||
| break | ||
| else: | ||
| logger.info(f"Unable to find reference {target.source_filename} ({target.filename_md5})") | ||
| return | ||
| self.save() |
There was a problem hiding this comment.
I think it is ok, also bc this is JsonStorage file. Maybe we need to expose to the frontend more granular operations (ex: rename citation). So this is ok for now.
|
@gjreda just found an error (date parsing) in the ingestion for the Machine Learning at Scale.pdf.
|
cguedes
left a comment
There was a problem hiding this comment.
There is a pending error when parsing some PDF files (check my latest comment).
Codecov Report
@@ Coverage Diff @@
## main #114 +/- ##
=======================================
Coverage 36.73% 36.73%
=======================================
Files 52 52
Lines 2488 2488
Branches 114 114
=======================================
Hits 914 914
Misses 1551 1551
Partials 23 23
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |

fixes #112
Adds
citation_keyandpublished_dateto the Reference response.Depending on whether the "uniqueness" postfix of
a, b, c ... 1, 2, 3should be handled by the frontend or sidecar, this PR might require more work. See discussion here.