LayoutLM-based visual question answering model, weights, and pipeline #18380

@ankrgyl

Description

Feature request

Question answering is an important problem for both text and documents. The question-answering pipeline makes it very easy to work with plain text and includes helpful utilities (like post-processing start/end candidates). It'd be amazing for question answering on documents to be that easy.

The primary goal of this feature request is to extend either the question answering or visual question answering pipeline to be as easy to use as, for example, the distilbert-base-cased-distilled-squad model. LayoutLM is a great model architecture for solving this problem, and @NielsRogge's notebook example even shows you how to fine-tune the model for this use case. I think it'd be very powerful for a number of use cases if it were as easy to use LayoutLM for document question answering as it is to use BERT-like models for text question answering.
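For comparison, this is roughly what that experience looks like for plain text today, using the existing question-answering pipeline (the question/context strings below are just an illustration):

```python
from transformers import pipeline

# Extractive QA on plain text today: one line to load, one call to run.
qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

result = qa(
    question="Where is Hugging Face based?",
    context="Hugging Face is a company based in New York City.",
)
# The pipeline handles the start/end post-processing and returns a dict
# like {"score": ..., "start": ..., "end": ..., "answer": "New York City"}.
print(result["answer"])
```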

This will require a few additions, all of which I have working code for and would be happy to contribute:

  1. Extend the QuestionAnsweringPipeline or the VisualQuestionAnsweringPipeline to support document inputs. I think the latter would be the right pipeline, since it already takes an image as input, but ideally it could also take a list of words and bounding boxes as input (in case users want to run their own OCR). See the first sketch after this list for what that could look like.
  2. Hook up LayoutLMv2ForQuestionAnswering and LayoutLMv3ForQuestionAnswering to the pipeline. Ideally, there would also be a LayoutLMForQuestionAnswering, since v2 and v3 are not licensed for commercial use. A minimal hookup sketch follows as the second example after this list.
  3. Publish pre-trained model weights with an easy-to-follow model card. I found a few examples of LayoutLM models fine-tuned for QA (e.g. this), but could not get them to run easily. For example, the "hosted inference API" UI throws an error when you try to run it. I think the visual question answering UI (which lets you load an image) might be a better fit, but I am very open to discussion on what the best experience would be.
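To make item 1 concrete, here is a sketch of what the extended pipeline could look like. Everything here is hypothetical: the task alias, the checkpoint name, and the `word_boxes` argument are illustrative proposals, not an existing API:

```python
from transformers import pipeline

# Hypothetical usage: "some-org/layoutlm-finetuned-docvqa" is a placeholder
# checkpoint name and `word_boxes` is a proposed argument, not an existing one.
doc_qa = pipeline(
    "visual-question-answering",
    model="some-org/layoutlm-finetuned-docvqa",
)

# Image-only input: the pipeline would run OCR internally.
doc_qa(image="invoice.png", question="What is the invoice number?")

# Bring-your-own-OCR input: words with (x0, y0, x1, y1) bounding boxes,
# for users who want to run their own OCR engine.
word_boxes = [
    ("Invoice", [57, 24, 108, 33]),
    ("#1234", [112, 24, 147, 33]),
    ("Total:", [57, 110, 86, 119]),
    ("$42.00", [90, 110, 128, 119]),
]
doc_qa(image="invoice.png", question="What is the total?", word_boxes=word_boxes)
```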
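For item 2, the modeling pieces already exist in transformers; what's missing is the pipeline glue around them. Below is a minimal sketch of that hookup with LayoutLMv2, assuming pytesseract and detectron2 are installed. Note that the base checkpoint's QA head is randomly initialized, so real answers would require a fine-tuned checkpoint:

```python
import torch
from PIL import Image
from transformers import LayoutLMv2Processor, LayoutLMv2ForQuestionAnswering

processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased")
model = LayoutLMv2ForQuestionAnswering.from_pretrained("microsoft/layoutlmv2-base-uncased")

image = Image.open("invoice.png").convert("RGB")  # placeholder document image
# The processor runs OCR (via pytesseract) and builds the token bounding boxes.
encoding = processor(image, "What is the invoice number?", return_tensors="pt")

with torch.no_grad():
    outputs = model(**encoding)

# This is the span post-processing a pipeline would wrap: pick the best
# start/end logits and decode the tokens in between.
start = outputs.start_logits.argmax(-1).item()
end = outputs.end_logits.argmax(-1).item()
answer = processor.tokenizer.decode(encoding.input_ids[0, start : end + 1])
print(answer)
```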

Motivation

When we started using transformers, we saw the question-answering pipeline and were blown away by how easy it was to use for text-based extractive QA. We were hoping it'd be "that easy" for document QA, but couldn't find pre-trained weights or a pipeline implementation. Thanks to this tutorial, however, we were able to fine-tune our own model and get it running. That inspired us to wonder: could we make it that easy for document QA too?

Your contribution

We have working code for all of the proposed features that we'd be happy to contribute. We also have a pre-trained model that we're happy to upload along with an easy-to-follow model card. Since there are a few changes proposed here, it might be worthwhile to break this into multiple issues/PRs, or we can do it all at once (whichever works best with your processes).
