I recently got Prodigy. Before that, I was using another free program to annotate data for my spaCy model. From that program, I could download my annotated data in IOB format, which I then converted to spaCy's binary format with the spacy convert command.
I would like to use this already-annotated data in Prodigy, partly so I can run the "train-curve" command and see how much improvement I could still get by adding more data.
I can't seem to find an easy way to convert a ".spacy" file, a spaCy JSON file or an ".iob" file to JSONL. I found the deprecated "ner.iob-to-gold", so I guess there might be a new recipe for that purpose, but I can't find it. I also found "data-to-spacy", which does the reverse.
Is there a way to do this?
I'm using the latest version of spaCy (3.0.5, I believe) and Prodigy nightly.
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
doc_bin = DocBin().from_disk("./data.spacy")  # your file here
examples = []  # examples in Prodigy's format
for doc in doc_bin.get_docs(nlp.vocab):
    spans = [{"start": ent.start_char, "end": ent.end_char, "label": ent.label_} for ent in doc.ents]
    examples.append({"text": doc.text, "spans": spans})
In your case, this is probably the most straightforward solution, because you already have the .spacy files and you need them for training anyway.
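To get from the examples list to a JSONL file, one option is to write one JSON object per line. The following is a minimal sketch using only the standard library; the helper name write_jsonl, the output file name annotations.jsonl, and the sample example are assumptions made for illustration, not part of Prodigy or spaCy.

```python
import json
from pathlib import Path


def write_jsonl(path, examples):
    """Write one JSON object per line (hypothetical helper, not a Prodigy API)."""
    with Path(path).open("w", encoding="utf8") as f:
        for eg in examples:
            f.write(json.dumps(eg, ensure_ascii=False) + "\n")
    return len(examples)


# Stand-in for the `examples` list built from the DocBin above (made-up data).
examples = [
    {
        "text": "Apple is based in Cupertino.",
        "spans": [
            {"start": 0, "end": 5, "label": "ORG"},
            {"start": 18, "end": 27, "label": "GPE"},
        ],
    }
]

write_jsonl("annotations.jsonl", examples)
```

The resulting file should then be importable with Prodigy's db-in command, for example prodigy db-in my_ner_data annotations.jsonl, where my_ner_data is a placeholder dataset name.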