Feature request
Question answering is an important problem for both text and documents. The question-answering pipeline makes it very easy to work with plain text and includes helpful utilities (like post-processing start/end candidates). It'd be amazing for question answering on documents to be that easy.
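For reference, here is the text-only experience this request aims to match, using the existing `question-answering` pipeline (the question and context below are made up for illustration):

```python
from transformers import pipeline

# The existing text QA experience: one call, with start/end post-processing handled for you.
qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
result = qa(
    question="Who founded the company?",
    context="The company was founded in 2020 by Jane Doe in San Francisco.",
)
print(result)  # {'score': ..., 'start': ..., 'end': ..., 'answer': 'Jane Doe'}
```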
The primary goal of this feature request is to extend either the question answering or visual question answering pipeline so that document question answering is as easy as using, for example, the distilbert-base-cased-distilled-squad model for text. LayoutLM is a great model architecture for solving this problem, and @NielsRogge's notebook example even shows you how to fine-tune the model for this use case. I think it'd be very powerful for a number of use cases if using LayoutLM for document question answering were as easy as using BERT-like models for text question answering.
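For contrast, running a fine-tuned LayoutLMv2 QA model by hand today looks roughly like the sketch below. The checkpoint name and file path are placeholders, and the naive argmax decoding omits exactly the start/end post-processing that a pipeline would provide:

```python
import torch
from PIL import Image
from transformers import LayoutLMv2Processor, LayoutLMv2ForQuestionAnswering

# Placeholder checkpoint: substitute a LayoutLMv2 model actually fine-tuned for QA.
checkpoint = "microsoft/layoutlmv2-base-uncased"
processor = LayoutLMv2Processor.from_pretrained(checkpoint)  # applies Tesseract OCR by default
model = LayoutLMv2ForQuestionAnswering.from_pretrained(checkpoint)

image = Image.open("document.png").convert("RGB")  # placeholder path
encoding = processor(image, "What is the invoice total?", return_tensors="pt")

with torch.no_grad():
    outputs = model(**encoding)

# Naive decoding: take the best start/end logits and detokenize the span between them.
start = int(outputs.start_logits.argmax(-1))
end = int(outputs.end_logits.argmax(-1))
answer = processor.tokenizer.decode(encoding.input_ids[0, start : end + 1])
print(answer)
```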
This will require a few additions, all of which I have working code for that I'd be happy to contribute:
- Extend `QuestionAnsweringPipeline` or `VisualQuestionAnsweringPipeline` to support document inputs. I think the latter is the right pipeline, since it already takes an image as input, but ideally it could also accept a list of words + bounding boxes (in case users want to run their own OCR). See the usage sketch after this list.
- Hook up `LayoutLMv2ForQuestionAnswering` and `LayoutLMv3ForQuestionAnswering` to the pipeline. Ideally there would also be a `LayoutLMForQuestionAnswering`, since v2 and v3 are not licensed for commercial use.
- Publish pre-trained model weights with an easy-to-follow model card. I found a few examples of LayoutLM models fine-tuned for QA (e.g. this), but could not get them to run easily. For example, the "hosted inference API" UI throws an error when you try to run them. I think the visual question answering UI (which lets you load an image) might be a better fit, but I am very open to discussion on what the best experience would be.
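As a rough sketch of the experience these changes could enable: the task string below reuses the existing visual question answering pipeline, while the model id and the `words`/`boxes` keyword arguments are hypothetical, not an existing API:

```python
from transformers import pipeline

# Hypothetical: the model id and the words/boxes arguments below are illustrative only.
doc_qa = pipeline("visual-question-answering", model="some-org/layoutlm-finetuned-docvqa")

# Simplest path: hand the pipeline an image and let it run OCR internally.
doc_qa(image="invoice.png", question="What is the invoice total?")

# Proposed alternative for users who run their own OCR: pass words plus bounding boxes.
doc_qa(
    question="What is the invoice total?",
    words=["Invoice", "Total:", "$42.00"],
    boxes=[[10, 10, 80, 30], [10, 40, 70, 60], [75, 40, 140, 60]],
)
```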
Motivation
When we started using transformers, we saw the question-answering pipeline and were blown away by how easy it was to use for text-based extractive QA. We were hoping it'd be "that easy" for document QA, but couldn't find pre-trained weights or a pipeline implementation. Thanks to this tutorial, however, we were able to fine-tune our own model and get it running. That inspired us to wonder: could we make it that easy for document QA too?
Your contribution
We have working code for all of the proposed features and would be happy to contribute it. We also have a pre-trained model that we're happy to upload, along with an easy-to-follow model card. Since there are a few changes proposed here, it might be worthwhile to break this into multiple issues/PRs, or we can do it all at once (whichever works best with your processes).