Convert spacy binary data to jsonl

tristan19954 · April 8, 2021, 8:48am

Hello,

I just recently got prodigy, and before that I was using another free program to annotate data for my spacy model. From this program, I could download my annotated data in iob format which I then converted to spacy binary format with the spacy convert command.

I would like to use this data that I already annotated in prodigy, partially to try to run the "train-curve" command to see how much improvement I could still get by adding more data.
I can't seem to find an easy way to convert a ".spacy" file, a spacy json file or an ".iob" file to jsonl. I found the deprecated "ner.iob-to-gold", so I guess there might be a new recipe for that purpose, but I can't find it. I also found "data-to-spacy", which does the reverse.

Is there a way to do this?

I'm using the latest version of spacy (3.0.5 I believe) and prodigy nightly.

ines · April 12, 2021, 12:43am

Hi! Under the hood, the binary .spacy file is a serialized DocBin, which you can always load back as spaCy Doc objects (see https://spacy.io/api/docbin). This gives you access to the annotations, e.g. the doc.ents.

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
doc_bin = DocBin().from_disk("./data.spacy")  # your file here
examples = []  # examples in Prodigy's format
for doc in doc_bin.get_docs(nlp.vocab):
    spans = [{"start": ent.start_char, "end": ent.end_char, "label": ent.label_} for ent in doc.ents]
    examples.append({"text": doc.text, "spans": spans})

In your case, this is probably the most straightforward solution because you already have the .spacy files and you need them for training anyways.
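To get from the `examples` list to an actual .jsonl file, each dict has to be written out as one JSON object per line. A minimal sketch using only the standard library; the `write_jsonl` helper name, the sample data and the output path are made up for illustration:

```python
import json
from pathlib import Path


def write_jsonl(examples, path):
    """Write an iterable of dicts to `path`, one JSON object per line."""
    with Path(path).open("w", encoding="utf-8") as f:
        for eg in examples:
            # ensure_ascii=False keeps non-ASCII text readable in the file
            f.write(json.dumps(eg, ensure_ascii=False) + "\n")


# Hypothetical example in the same shape the loop above produces
examples = [
    {
        "text": "Apple is based in Cupertino",
        "spans": [
            {"start": 0, "end": 5, "label": "ORG"},
            {"start": 18, "end": 27, "label": "GPE"},
        ],
    },
]
write_jsonl(examples, "annotations.jsonl")
```

The resulting file should then be importable into a Prodigy dataset with the `db-in` command, assuming the labels match what your recipe expects.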

For completeness (and for others who come across this thread later): of course, you generally don't need the .spacy conversion to use IOB data with Prodigy. Prodigy's format expects entities and other spans to be defined as character offsets. So if you have token-based annotations like IOB or BILUO, you can convert them to offsets – spaCy provides handy utility functions for that: https://prodi.gy/docs/named-entity-recognition#tip-biluo-offsets
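To make the token-to-offset idea concrete, here is a hand-rolled sketch in plain Python. It is not spaCy's own utility (the linked docs cover those), and it assumes tokens are joined by single spaces, which real text often violates. The function name `iob_to_spans` and the sample sentence are illustrative only.

```python
def iob_to_spans(tokens, tags):
    """Convert IOB tags over space-joined tokens to character-offset spans."""
    spans = []
    current = None  # the span currently being built, if any
    offset = 0      # character offset of the current token
    for token, tag in zip(tokens, tags):
        start, end = offset, offset + len(token)
        label = tag[2:] if tag[:2] in ("B-", "I-") else None
        # An I- tag continues the open span only if the label matches
        if tag.startswith("I-") and current and current["label"] == label:
            current["end"] = end
        else:
            if current:
                spans.append(current)
            # B- opens a span; a stray I- is treated like B-; O leaves none open
            current = {"start": start, "end": end, "label": label} if label else None
        offset = end + 1  # +1 for the single space assumed between tokens
    if current:
        spans.append(current)
    return spans


tokens = ["Apple", "is", "based", "in", "New", "York"]
tags = ["B-ORG", "O", "O", "O", "B-GPE", "I-GPE"]
text = " ".join(tokens)
spans = iob_to_spans(tokens, tags)
example = {"text": text, "spans": spans}  # same shape as Prodigy's span format
```

For real data, prefer the spaCy helpers from the linked docs, since they align tags against an actual tokenized Doc instead of guessing at whitespace.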

tristan19954 · April 12, 2021, 9:12am

Thank you Ines, this worked like a charm!

As a little note to anyone who plans to reuse this script: just replace ent.label with ent.label_ to get the label string instead of the label ID.

ines · April 13, 2021, 1:28am

Ah sorry, that was a typo. Just edited my previous post!

milos-cuculovic · April 27, 2022, 2:36pm

Thanks for the hint @ines . I have a slightly different problem but very much related to your reply.

I have .spacy train and valid files and am trying to reuse my code that used to work with JSON files (spaCy v2). So what I was trying:

    nlp = spacy.load("my-model")

    doc_bin = DocBin().from_disk(path_test_data)
    examples = []
    for doc in doc_bin.get_docs(nlp.vocab):
        spans = [{"start": ent.start_char, "end": ent.end_char, "label": ent.label_} for ent in doc.ents]
        examples.append(Example.from_dict(nlp.make_doc(doc.text), spans))

However, this part is failing due to:

TypeError: Argument 'example_dict' has incorrect type (expected dict, got list)

Initially, my code was working as:

    examples = []
    for text, annotations in TEST_DATA:
        examples.append(Example.from_dict(nlp.make_doc(text), annotations))

With the TEST_DATA being a JSON

Thank you in advance.

ljvmiranda921 · April 28, 2022, 6:07am

Hi @milos-cuculovic !

I think you've already posted in the spaCy discussions forum. For posterity, here's the link with the answer.
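For readers landing here first: the TypeError above arises because `Example.from_dict` takes a dict of annotations as its second argument, while `spans` is a list. One possible fix, sketched here under the assumption that only entity annotations matter (and not necessarily what the linked answer proposes), is to wrap the offsets in a dict under the `"entities"` key. The helper name `spans_to_annotations` is made up.

```python
def spans_to_annotations(spans):
    """Turn Prodigy-style span dicts into a spaCy v3 annotations dict."""
    return {"entities": [(s["start"], s["end"], s["label"]) for s in spans]}


# Hypothetical span in the shape built by the list comprehension above
spans = [{"start": 0, "end": 5, "label": "ORG"}]
annotations = spans_to_annotations(spans)

# Inside the loop over doc_bin.get_docs(nlp.vocab), this would be used as:
#     examples.append(Example.from_dict(nlp.make_doc(doc.text), annotations))
# Since the docs from the DocBin already carry the gold annotations,
# Example(nlp.make_doc(doc.text), doc) may be the simpler option.
```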
