Closed
Labels
bug (Something isn't working)
Description
I am trying to use spacy-layout to pre-process a bunch of PDFs contained in a given directory then save them for later use. Below is my code:
import os

import spacy
from spacy.tokens import DocBin
from spacy_layout import spaCyLayout
from tqdm import tqdm

nlp = spacy.blank("en")
layout = spaCyLayout(nlp)
# Get paths for all PDF files in the input directory
paths = [os.path.join(directory, file) for file in os.listdir(directory) if file.lower().endswith('.pdf')]
doc_bin = DocBin(attrs=["LEMMA", "ENT_IOB", "ENT_TYPE"], store_user_data=True)
# Read all PDFs into spaCy Doc objects, serialize and save
for doc in tqdm(layout.pipe(paths), desc="Processing documents", unit="doc"):
    doc_bin.add(doc)
doc_bin.to_disk(out_file)
And the error I get:
TypeError: can not serialize 'SpanLayout' object -> Cannot close object, library is destroyed. This may cause a memory leak!
Should I add SpanLayout to the DocBin attributes somehow? As far as I can tell, it is not in the list of possible attributes. Or is there a different way to save the DocBin / pre-processed documents?
I am running this on an Ubuntu 20.04 x86_64 VM, with spacy==3.7.5 and spacy_layout==0.0.7
Thanks in advance!
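One possible workaround, sketched under assumptions: with store_user_data=True, DocBin serializes doc.user_data, which is also where spaCy keeps custom extension values such as span._.layout, so a SpanLayout object stored there could be what triggers the TypeError. The sketch below converts dataclass values in such a dict to plain dicts. It assumes SpanLayout is a dataclass; FakeSpanLayout, its fields, and plain_user_data are hypothetical names for illustration, not part of spacy-layout, and this has not been run against spacy_layout==0.0.7.

```python
from dataclasses import asdict, dataclass, is_dataclass
from typing import Any, Dict


@dataclass
class FakeSpanLayout:
    """Hypothetical stand-in for spacy-layout's SpanLayout (fields are assumed)."""
    x: float
    y: float
    width: float
    height: float
    page_no: int


def plain_user_data(user_data: Dict[Any, Any]) -> Dict[Any, Any]:
    """Return a copy of user_data with dataclass instances replaced by plain dicts.

    Values that are not dataclass instances are passed through unchanged.
    """
    return {
        key: asdict(value)
        if is_dataclass(value) and not isinstance(value, type)
        else value
        for key, value in user_data.items()
    }


# Keys mimic how spaCy stores extension values: ("._.", name, start, end).
user_data = {
    ("._.", "layout", 0, 12): FakeSpanLayout(
        x=10.0, y=20.0, width=100.0, height=12.0, page_no=1
    ),
    ("._.", "heading", 0, 12): None,
}

cleaned = plain_user_data(user_data)
print(cleaned[("._.", "layout", 0, 12)])
```

In the loop above, this might be applied as doc.user_data = plain_user_data(doc.user_data) just before doc_bin.add(doc). After loading, span._.layout would then come back as a plain dict rather than a SpanLayout, and any other non-serializable value in user_data would still need separate handling.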