fix: PDF Ingestion bug when Grobid is unable to parse the reference PDF #103
    The TXT file is named as {pdf_filename}_{error_code}.txt.
    """
    txt_files = list(self.grobid_output_dir.glob("*.txt"))
    logger.info(f"Found {len(txt_files)} txt files from Grobid parsing errors")
Where does this logger write to?
This PR has it disabled (I had to do that a while back so the sidecar could work). #104 re-enables it and sets it up as a file logger rather than logging to stdout.
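For illustration only, here is a minimal sketch of what a file-based logger could look like using the standard `logging` module. The helper name `build_file_logger`, the logger name `sidecar`, and the log path are all hypothetical and are not taken from #104; the sketch only shows one way to keep log records out of stdout.

```python
import logging
from pathlib import Path


def build_file_logger(log_path: Path, name: str = "sidecar") -> logging.Logger:
    """Return a logger that writes to log_path instead of stdout.

    Hypothetical helper: the names and log format are assumptions.
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # Do not hand records to the root logger, which may have stream handlers.
    logger.propagate = False
    # Attach a file handler only once, even if this function is called again.
    if not any(isinstance(h, logging.FileHandler) for h in logger.handlers):
        handler = logging.FileHandler(log_path, encoding="utf-8")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

With something like this in place, a call such as `logger.info(f"Found {len(txt_files)} txt files from Grobid parsing errors")` would append a line to the log file and print nothing to stdout.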
    title: str
    abstract: str
    contents: str
    title: Optional[str] = None
@sehyod we will need to change the TS type after this merge.
The title was already optional (cf. https://github.com/refstudio/refstudio/pull/95/files#diff-68572da928b5651e3f8dda9d2b291815dc0cdf0e0ff5dc40ab3dc81215b760feL9) because the sidecar was already returning a null field for some PDFs. @gjreda I don't know whether that was expected behaviour. If it's not, I can send you an example PDF file for which the returned title is null.
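As a rough illustration of why the consuming type has to tolerate a missing title, the sketch below serializes a record whose `title` defaults to `None`. The `Reference` dataclass and its `source_filename` field are hypothetical stand-ins, not the sidecar's actual model; only the `title: Optional[str] = None` field comes from the diff.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class Reference:
    """Hypothetical stand-in for the sidecar's reference model."""

    source_filename: str  # assumed field, for illustration only
    title: Optional[str] = None  # mirrors the field shown in the diff


def to_json(ref: Reference) -> str:
    """Serialize a reference; a title of None becomes JSON null."""
    return json.dumps(asdict(ref))


print(to_json(Reference("paper.pdf")))
# prints {"source_filename": "paper.pdf", "title": null}
```

Under these assumptions the field is present with an explicit `null` rather than omitted, so a consumer would need to accept `null` for `title`.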
I'm still seeing brittle behavior after this fix...
This fixes two underlying bugs: