can not serialize 'SpanLayout' object #11

@MarieCo

I am trying to use spacy-layout to pre-process all the PDFs in a given directory and then save the results for later use. Below is my code:

import os

import spacy
from spacy.tokens import DocBin
from spacy_layout import spaCyLayout
from tqdm import tqdm

# `directory` (input folder) and `out_file` (output path) are defined elsewhere
nlp = spacy.blank("en")
layout = spaCyLayout(nlp)

# Get paths for all PDF files in the input directory
paths = [os.path.join(directory, file) for file in os.listdir(directory) if file.lower().endswith('.pdf')]
doc_bin = DocBin(attrs=["LEMMA", "ENT_IOB", "ENT_TYPE"], store_user_data=True)

# Read all PDFs into spaCy Doc objects, serialize and save
for doc in tqdm(layout.pipe(paths), desc="Processing documents", unit="doc"):
    doc_bin.add(doc)
doc_bin.to_disk(out_file)

And the error I get:
TypeError: can not serialize 'SpanLayout' object -> Cannot close object, library is destroyed. This may cause a memory leak!

Should I add SpanLayout to the DocBin attributes somehow (unless I'm mistaken, it is not in the list of supported attributes)? Or is there a different way to save the DocBin / pre-processed documents?

I am running this on an Ubuntu 20.04 x86_64 VM, with spacy==3.7.5 and spacy_layout==0.0.7.
Thanks in advance!
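
For context, here is a minimal sketch of the likely cause, not a confirmed fix. With `store_user_data=True`, `DocBin` serializes `doc.user_data` (where custom extension values are stored) using msgpack, and msgpack raises this kind of `TypeError` for any object it has no encoder for. The sketch assumes the layout object behaves like a plain dataclass. `LayoutBox`, `to_plain`, and `can_serialize` are hypothetical names, and `json` stands in for msgpack so that the example needs only the standard library.

```python
import json
from dataclasses import asdict, dataclass, is_dataclass


@dataclass
class LayoutBox:
    """Hypothetical stand-in for a layout object such as SpanLayout."""
    x: float
    y: float
    width: float
    height: float
    page_no: int


def to_plain(value):
    """Recursively turn dataclass instances into plain dicts and lists."""
    if is_dataclass(value) and not isinstance(value, type):
        return asdict(value)
    if isinstance(value, dict):
        return {key: to_plain(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_plain(item) for item in value]
    return value


def can_serialize(value):
    """Return True if json (standing in for msgpack here) accepts the value."""
    try:
        json.dumps(value)
        return True
    except TypeError:
        return False


# Assumption: user data holding a custom object, similar in spirit to doc.user_data
user_data = {"layout": LayoutBox(x=10.0, y=20.0, width=100.0, height=12.5, page_no=1)}

raw_ok = can_serialize(user_data)              # the custom object is rejected
plain_ok = can_serialize(to_plain(user_data))  # plain dicts are accepted
```

Whether values converted this way can be restored to layout objects after loading depends on spacy-layout itself, so treat this only as an illustration of why the error appears.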

Labels: bug (Something isn't working)
