LayoutLM-based visual question answering model, weights, and pipeline #18380

@ankrgyl

Description

Feature request

Question answering is an important problem for both text and documents. The question-answering pipeline makes it very easy to work with plain text and includes helpful utilities (like post-processing start/end candidates). It'd be amazing for question answering on documents to be that easy.

The primary goal of this feature request is to extend either the question answering or visual question answering pipeline to be as easy to use as, for example, the distilbert-base-cased-distilled-squad model. LayoutLM is a great model architecture for solving this problem, and @NielsRogge's notebook example even shows you how to fine-tune the model for this use case. I think it'd be very powerful for a number of use cases if it were as easy to use LayoutLM for document question answering as it is to use BERT-like models for text question answering.
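For comparison, this is roughly what that experience looks like for plain text today, using the existing question-answering pipeline (the question/context strings below are just an illustration):

```python
from transformers import pipeline

# Extractive QA on plain text today: one line to load, one call to run.
qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

result = qa(
    question="Where is Hugging Face based?",
    context="Hugging Face is a company based in New York City.",
)
# The pipeline handles the start/end post-processing and returns a dict
# like {"score": ..., "start": ..., "end": ..., "answer": "New York City"}.
print(result["answer"])
```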

This will require a few additions, all of which I have working code for and would be happy to contribute:

  1. Extend the QuestionAnsweringPipeline or the VisualQuestionAnsweringPipeline to support document inputs. I think the latter would be the right pipeline, since it already takes an image as input, but ideally it could also take a list of words and bounding boxes as input (in case users want to run their own OCR). See the first sketch after this list for what that could look like.
  2. Hook up LayoutLMv2ForQuestionAnswering and LayoutLMv3ForQuestionAnswering to the pipeline. Ideally, there would also be a LayoutLMForQuestionAnswering, since v2 and v3 are not licensed for commercial use. A minimal hookup sketch follows as the second example after this list.
  3. Publish pre-trained model weights with an easy-to-follow model card. I found a few examples of LayoutLM models fine-tuned for QA (e.g. this), but could not get them to run easily. For example, the "hosted inference API" UI throws an error when you try to run it. I think the visual question answering UI (which lets you load an image) might be a better fit, but I am very open to discussion on what the best experience would be.
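To make item 1 concrete, here is a sketch of what the extended pipeline could look like. Everything here is hypothetical: the task alias, the checkpoint name, and the `word_boxes` argument are illustrative proposals, not an existing API:

```python
from transformers import pipeline

# Hypothetical usage: "some-org/layoutlm-finetuned-docvqa" is a placeholder
# checkpoint name and `word_boxes` is a proposed argument, not an existing one.
doc_qa = pipeline(
    "visual-question-answering",
    model="some-org/layoutlm-finetuned-docvqa",
)

# Image-only input: the pipeline would run OCR internally.
doc_qa(image="invoice.png", question="What is the invoice number?")

# Bring-your-own-OCR input: words with (x0, y0, x1, y1) bounding boxes,
# for users who want to run their own OCR engine.
word_boxes = [
    ("Invoice", [57, 24, 108, 33]),
    ("#1234", [112, 24, 147, 33]),
    ("Total:", [57, 110, 86, 119]),
    ("$42.00", [90, 110, 128, 119]),
]
doc_qa(image="invoice.png", question="What is the total?", word_boxes=word_boxes)
```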
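For item 2, the modeling pieces already exist in transformers; what's missing is the pipeline glue around them. Below is a minimal sketch of that hookup with LayoutLMv2, assuming pytesseract and detectron2 are installed. Note that the base checkpoint's QA head is randomly initialized, so real answers would require a fine-tuned checkpoint:

```python
import torch
from PIL import Image
from transformers import LayoutLMv2Processor, LayoutLMv2ForQuestionAnswering

processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased")
model = LayoutLMv2ForQuestionAnswering.from_pretrained("microsoft/layoutlmv2-base-uncased")

image = Image.open("invoice.png").convert("RGB")  # placeholder document image
# The processor runs OCR (via pytesseract) and builds the token bounding boxes.
encoding = processor(image, "What is the invoice number?", return_tensors="pt")

with torch.no_grad():
    outputs = model(**encoding)

# This is the span post-processing a pipeline would wrap: pick the best
# start/end logits and decode the tokens in between.
start = outputs.start_logits.argmax(-1).item()
end = outputs.end_logits.argmax(-1).item()
answer = processor.tokenizer.decode(encoding.input_ids[0, start : end + 1])
print(answer)
```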

Motivation

When we started using transformers, we saw the question-answering pipeline and were blown away by how easy it was to use for text-based extractive QA. We were hoping it'd be "that easy" for document QA, but couldn't find pre-trained weights or a pipeline implementation. Thanks to this tutorial, however, we were able to fine-tune our own model and get it running. That inspired us to wonder: could we make it that easy for document QA too?

Your contribution

We have working code for all of the proposed features that we'd be happy to contribute. We also have a pre-trained model that we're happy to upload along with an easy-to-follow model card. Since there are a few changes proposed here, it might be worthwhile to break this into multiple issues/PRs, or we can do it all at once (whichever works best with your processes).
