Elucidator-V/NovaChart

NovaChart: A Large-scale Dataset towards Chart Understanding and Generation of Multimodal Large Language Models


📚Data Overview

NovaChart is a large-scale chart dataset for the chart understanding and generation capabilities of multimodal large language models (MLLMs). It offers extensive coverage of chart types, a variety of chart-related tasks, and good scalability. We propose a full data generation engine, comprising Raw Data Acquisition, Data Curation, Image Styling and Visualization, and Instruction Formulation, that supports building large-scale chart metadata and chart instruction data from scratch. We also release several tools for extending NovaChart, to support the development of customized MLLMs with specific chart understanding and generation capabilities.


⚙️Data Engine

The framework of the NovaChart data generation engine is illustrated in the following figure. It comprises four main steps: Raw Data Acquisition, Data Curation, Image Styling and Visualization, and Instruction Formulation. A detailed introduction can be found in our paper.


📋Chart Metadata

For every chart instance, we provide four kinds of annotations: 1) data points, the statistical units of information represented by numerical values on the chart's axes; 2) visual elements, which charts use to convey information and enhance expressiveness, such as colors; 3) source data, the raw, unprocessed data samples from which the statistical chart is derived; 4) visualization code, which renders the chart image from the given data points and visual elements.
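As a rough illustration of how the four annotation kinds could sit together in one record, here is a minimal sketch. The class name `ChartMetadata`, its field names, and the example values are assumptions made for this example; they are not the schema used in the released dataset.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class ChartMetadata:
    """Hypothetical container for the four annotation kinds (names are illustrative)."""
    data_points: dict        # values plotted on the chart's axes
    visual_elements: dict    # styling choices such as chart type and colors
    source_data: list        # raw rows the chart was derived from
    visualization_code: str  # Python code that renders the chart image


# Made-up example values, used only to show the shape of a record.
example = ChartMetadata(
    data_points={"x": ["2021", "2022", "2023"], "y": [12.0, 15.5, 11.2]},
    visual_elements={"chart_type": "bar", "color": "#1f77b4"},
    source_data=[
        {"year": "2021", "sales": 12.0},
        {"year": "2022", "sales": 15.5},
        {"year": "2023", "sales": 11.2},
    ],
    visualization_code="plt.bar(x, y, color='#1f77b4')",
)

# Round-trip through JSON, a common storage format for such records.
serialized = json.dumps(asdict(example), indent=2)
restored = ChartMetadata(**json.loads(serialized))
```

Keeping the source data next to the derived data points makes it possible to regenerate or re-style a chart without going back to the raw data acquisition step.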



🧑‍🦽Chart Instruction Data

We design a comprehensive set of 15 unique tasks, grouped into 4 categories:

  1. Chart Data Understanding, which aims to precisely understand the statistical data points within charts;

  2. Chart Visual Understanding, which focuses on identifying particular visual elements in charts;

  3. Chart Summarization and Analysis, which aims to summarize and analyze the phenomena behind the data;

  4. Chart Generation, which focuses on generating executable visualization code (in Python) to help users create charts.
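To make the link between chart metadata and instruction data concrete, the sketch below derives one instruction sample from a chart's data points. Only the four category names come from the list above; the function `make_max_value_sample`, the question wording, the record keys, and the image path are hypothetical and do not correspond to any of the 15 released tasks.

```python
from enum import Enum


class TaskCategory(Enum):
    """The four task categories listed above."""
    DATA_UNDERSTANDING = "Chart Data Understanding"
    VISUAL_UNDERSTANDING = "Chart Visual Understanding"
    SUMMARIZATION_AND_ANALYSIS = "Chart Summarization and Analysis"
    GENERATION = "Chart Generation"


def make_max_value_sample(image_path, data_points):
    """Build one hypothetical data-understanding sample from a chart's data points."""
    labels, values = data_points["x"], data_points["y"]
    # Index of the largest value; the answer is read directly from the metadata.
    best = max(range(len(values)), key=values.__getitem__)
    return {
        "category": TaskCategory.DATA_UNDERSTANDING.value,
        "image": image_path,
        "instruction": "Which category has the highest value in this chart?",
        "response": f"{labels[best]} ({values[best]})",
    }


sample = make_max_value_sample(
    "charts/example_0001.png",  # hypothetical path
    {"x": ["2021", "2022", "2023"], "y": [12.0, 15.5, 11.2]},
)
print(sample["response"])
```

Because the answer is computed from the annotated data points rather than read off the image, samples of this kind stay consistent with the chart by construction.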

🛠NovaChart Toolkit

In addition to the data resources, we provide three tools for fellow researchers to facilitate utilization and extension of NovaChart.

The data curation tool lets users rerun the process of obtaining chart metadata, so that more chart instances can be generated on different topics.

Chart visualization tool allows users to freely adjust relevant visualization parameters to generate chart images with more diversified visual styles.

The instruction generation tool helps researchers leverage LLMs to create chart instruction data from chart metadata, covering a wider range of tasks according to their own requirements.
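The idea behind the chart visualization tool, varying visualization parameters over the same data points, can be sketched as follows. This is not the toolkit's actual interface: the template, the function `style_variants`, the chosen parameters, and the use of matplotlib in the generated code are all assumptions made for illustration. The sketch only produces code strings and checks their syntax, so it runs without matplotlib installed.

```python
import itertools

# Template for generated visualization code; matplotlib is an assumed backend.
TEMPLATE = '''import matplotlib.pyplot as plt

x = {x!r}
y = {y!r}
fig, ax = plt.subplots(figsize={figsize!r})
ax.bar(x, y, color={color!r})
ax.set_title({title!r}, fontsize={fontsize!r})
fig.savefig({out!r})
'''


def style_variants(data_points, title, colors, fontsizes, figsize=(6, 4)):
    """Yield (output_path, code) pairs, one per combination of style parameters."""
    combos = itertools.product(colors, fontsizes)
    for i, (color, fontsize) in enumerate(combos):
        out = f"chart_style_{i}.png"  # hypothetical output file name
        code = TEMPLATE.format(
            x=data_points["x"], y=data_points["y"], figsize=figsize,
            color=color, title=title, fontsize=fontsize, out=out,
        )
        yield out, code


variants = list(style_variants(
    {"x": ["2021", "2022", "2023"], "y": [12.0, 15.5, 11.2]},
    title="Annual sales",
    colors=["#1f77b4", "#d62728"],
    fontsizes=[10, 14],
))

# Syntax check only; executing the code would require matplotlib.
for _, code in variants:
    compile(code, "<generated>", "exec")
```

Each generated string can be stored as the visualization code annotation of a new chart instance, so every styled image keeps a matching, reproducible rendering script.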

With these tools, we aim to make NovaChart convenient to use and to help researchers generate high-quality chart data for training their own customized models. We sincerely hope that our efforts can pave the way for intelligent assistants with powerful chart comprehension and generation capabilities.

🤖Model Capabilities



👊Using NovaChart

The full NovaChart dataset can be downloaded from the following link:

Chart Instruction Data: Huggingface

🤓TO-DOs

  • Open source the evaluation scripts.
  • Open source checkpoints.
  • Open source the dataset.
  • Create the git repository.

💡Citation

@inproceedings{hu2024novachart,
  title={NovaChart: A Large-scale Dataset towards Chart Understanding and Generation of Multimodal Large Language Models},
  author={Hu, Linmei and Wang, Duokang and Pan, Yiming and Yu, Jifan and Shao, Yingxia and Feng, Chong and Nie, Liqiang},
  booktitle={ACM Multimedia 2024},
  year={2024}
}
