fix: PDF Ingestion bug when Grobid is unable to parse the reference PDF by gjreda · Pull Request #103 · refstudio/refstudio

gjreda · 2023-06-06T19:39:08Z

PDF ingest does not work properly with non-scholarly documents #79

This fixes two underlying bugs:

When Grobid is unable to parse a PDF, it prints an error message to stdout rather than raising an exception or returning an HTTP error status code. Because the sidecar relies on stdout for data passing, that stray output breaks sidecar communication.
If a PDF could not be parsed, it was not included in the References response from the sidecar. This PR adds Reference objects to the response even when the PDF could not be parsed.
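
To illustrate the first bug, here is a minimal sketch of keeping a noisy call away from stdout so that only the data payload is written there. It is not the code from this PR: `noisy_parse` and `ingest` are hypothetical stand-ins, and the sketch uses the standard library's `contextlib.redirect_stdout` rather than the project's `HiddenPrints` helper.

```python
import contextlib
import io
import json
import sys
from typing import List


def noisy_parse(path: str) -> None:
    """Hypothetical stand-in for a client call that reports failure by printing."""
    print(f"Processing of {path} failed")


def ingest(paths: List[str]) -> str:
    """Run the noisy call with stdout captured, then return one JSON payload."""
    captured = io.StringIO()
    # Anything printed inside this block goes to `captured`, not the real stdout.
    with contextlib.redirect_stdout(captured):
        for path in paths:
            noisy_parse(path)
    return json.dumps({"status": "ok", "processed": len(paths)})


if __name__ == "__main__":
    # Only the JSON payload reaches stdout, so a consumer can parse it.
    sys.stdout.write(ingest(["a.pdf", "b.pdf"]) + "\n")
```

Note that `redirect_stdout` only affects Python-level writes to `sys.stdout`; output written by a subprocess or by native code would need a different approach.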

cguedes · 2023-06-07T08:31:58Z

+        The TXT file is named as {pdf_filename}_{error_code}.txt.
+        """
+        txt_files = list(self.grobid_output_dir.glob("*.txt"))
+        logger.info(f"Found {len(txt_files)} txt files from Grobid parsing errors")


Where does this logger write to?

This PR has it disabled (I had to do that a while back so the sidecar could work). #104 re-enables it and sets it up as a file logger rather than logging to stdout.
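
For context, a file-only logger of the kind described above could look roughly like the sketch below. This is an assumption-laden illustration, not the setup from #104: the function name `get_file_logger`, the logger name, and the log format are all hypothetical.

```python
import logging
from pathlib import Path


def get_file_logger(log_path: Path, name: str = "sidecar") -> logging.Logger:
    """Return a logger that writes to a file and never to stdout/stderr."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # Do not pass records up to the root logger, which may have stream handlers.
    logger.propagate = False
    if not logger.handlers:  # avoid attaching duplicate handlers on repeated calls
        handler = logging.FileHandler(log_path, encoding="utf-8")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

With a logger like this, a call such as `logger.info(...)` in the diff above would append to the log file and leave stdout free for the sidecar's response.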

cguedes · 2023-06-07T08:34:39Z

-    title: str
-    abstract: str
-    contents: str
+    title: Optional[str] = None


@sehyod we will need to change the TS type after this merge.

The title was already optional (cf. https://github.com/refstudio/refstudio/pull/95/files#diff-68572da928b5651e3f8dda9d2b291815dc0cdf0e0ff5dc40ab3dc81215b760feL9) because the sidecar was already returning a null field for some PDFs. @gjreda I don't know if that was expected behaviour; if it's not, I can send you an example PDF file for which the returned title is null.
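
As a rough illustration of the second fix, the sketch below builds one reference per input PDF and leaves the parsed fields as `None` when parsing failed. It uses a standard-library dataclass for self-containment; the `source_filename` field, the `build_references` helper, and the assumption that `abstract` and `contents` are optional in the same way as `title` are hypothetical and not taken from the diff.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Reference:
    source_filename: str              # hypothetical field identifying the PDF
    title: Optional[str] = None       # None when the PDF could not be parsed
    abstract: Optional[str] = None    # assumed optional, like title
    contents: Optional[str] = None    # assumed optional, like title


def build_references(all_pdfs: List[str], parsed: Dict[str, dict]) -> List[Reference]:
    """Return one Reference per PDF, including those with no parsed output."""
    references = []
    for filename in all_pdfs:
        # PDFs missing from `parsed` still get a Reference with empty fields.
        fields = parsed.get(filename, {})
        references.append(Reference(source_filename=filename, **fields))
    return references
```

A consumer of this response, such as the TS type mentioned above, would then need to treat these fields as nullable.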

hammer · 2023-06-07T13:56:13Z

I'm still seeing brittle behavior after this fix...

gjreda added 4 commits June 6, 2023 10:56

This should have been included in #83

04a0eda

Merge branch 'main' into 79-pdf-ingest-does-not-work-properly-with-no…

cafd307

…n-scholarly-documents

Move client.process inside of HiddenPrints

84b2aa5

Add unparsed PDFs to final Reference response

0512749

gjreda marked this pull request as ready for review June 6, 2023 19:44

gjreda requested review from cguedes and sehyod June 6, 2023 19:44

gjreda linked an issue Jun 6, 2023 that may be closed by this pull request

PDF ingest does not work properly with non-scholarly documents #79

Closed

cguedes reviewed Jun 7, 2023

cguedes approved these changes Jun 7, 2023

cguedes reviewed Jun 7, 2023

sergioramos changed the title ~~Fix PDF Ingestion bug when Grobid is unable to parse the reference PDF~~ fix: PDF Ingestion bug when Grobid is unable to parse the reference PDF Jun 7, 2023

sergioramos merged commit f909337 into main Jun 7, 2023

sergioramos deleted the 79-pdf-ingest-does-not-work-properly-with-non-scholarly-documents branch June 7, 2023 09:49
