dadelani/AfriHG

AfriHG: News Headline Generation for African Languages

paper | dataset

Introduction

We introduce AfriHG, a news headline generation dataset created by combining the XL-Sum and MasakhaNEWS datasets, focusing on 16 widely spoken African languages. We experimented with two sequence-to-sequence (seq2seq) models, mT5-base and AfriTeVa V2, as well as the Aya-101 large language model (LLM). Our results show that Africa-centric seq2seq models such as AfriTeVa V2 outperform the massively multilingual mT5-base model. Additionally, fine-tuning AfriTeVa V2, which has 313M parameters, yields performance competitive with prompting the Aya-101 LLM, which has over 13B parameters. Evaluating with the Multilingual-ROUGE metric gave accurate scores for non-Latin-script languages, addressing issues identified in the AfriHG findings presented at AfricaNLP'24.

Data Gathering

We obtained the data by combining the XL-Sum corpus with the MasakhaNEWS corpus (most of which is scraped from BBC News, with the rest coming from a few other websites listed in MasakhaNEWS). The full corpus can also be downloaded from the AfriHG Google Drive link.

Data Splits

For all languages, we kept the splits used in XL-Sum and added all MasakhaNEWS data to the training split. Languages that are not available in XL-Sum, such as Shona (sna), Xhosa (xho), and Zulu (zul), were split into train/dev/test sets using the 80/10/10 ratio used in XL-Sum.

The original splits were named train, dev and test and they correspond to the train, validation and test splits.
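The 80/10/10 split described above can be sketched as follows. This is a minimal illustration, not the script used to build AfriHG: the function name `split_80_10_10`, the shuffling, the seed, and the rounding rule (dev and test each get 10% rounded down, train gets the remainder) are all assumptions. That rounding rule does happen to reproduce the Xhosa and Zulu counts listed in the table.

```python
import random


def split_80_10_10(records, seed=0):
    """Shuffle records and split them into train/dev/test at roughly 80/10/10.

    Hypothetical helper: the shuffle, seed, and rounding are assumptions.
    """
    items = list(records)
    random.Random(seed).shuffle(items)  # deterministic shuffle for a given seed

    n_eval = len(items) // 10            # dev and test each get 10%, rounded down
    n_train = len(items) - 2 * n_eval    # train takes everything that is left

    return {
        "train": items[:n_train],
        "dev": items[n_train:n_train + n_eval],
        "test": items[n_train + n_eval:],
    }


if __name__ == "__main__":
    # 13,050 is the total of the Xhosa row (10,440 + 1,305 + 1,305).
    splits = split_80_10_10(range(13050))
    print({name: len(part) for name, part in splits.items()})
```

Using a seeded `random.Random` instance keeps the split reproducible without touching the global random state.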

The table below summarizes the data splits and the model performance on AfriHG, alongside the XL-Sum abstractive summarization results for comparison.

| Language | Script | Train / Dev / Test | mT5-base (Summary) | mT5-base (Headline) | AfriTeVa V2-base (Headline) | Aya (Headline) |
|---|---|---|---|---|---|---|
| Amharic (amh) | Ge'ez | 16,915 / 719 / 719 | 20.0 / 7.4 / 18.1 | 17.9 / 6.0 / 16.9 | 19.4 / 6.8 / 18.6 | 22.6 / 8.5 / 21.4 |
| Arabic (arb) | Arabic | 37,519 / 4,689 / 4,689 | 35.0 / 14.8 / 29.2 | 25.3 / 9.0 / 23.2 | 24.1 / 8.2 / 24.1 | 28.1 / 10.8 / 25.4 |
| English (eng) | Latin | 311,694 / 11,535 / 11,535 | 37.6 / 15.2 / 29.9 | 36.1 / 13.5 / 31.9 | 37.9 / 14.5 / 33.4 | 32.3 / 11.7 / 28.2 |
| French (fra) | Latin | 15,377 / 1,086 / 1,086 | 35.3 / 16.2 / 28.2 | 30.6 / 13.7 / 27.2 | 33.8 / 15.7 / 29.8 | 35.9 / 17.5 / 31.0 |
| Hausa (hau) | Latin | 17,174 / 802 / 802 | 39.4 / 17.7 / 31.7 | 30.2 / 11.1 / 26.9 | 35.0 / 14.9 / 31.5 | 34.5 / 14.8 / 30.1 |
| Igbo (ibo) | Latin | 8,877 / 522 / 522 | 31.6 / 10.2 / 24.5 | 28.5 / 11.2 / 24.6 | 31.0 / 12.7 / 27.2 | 33.4 / 14.9 / 28.7 |
| Oromo (orm) | Latin | 16,417 / 757 / 757 | 18.7 / 6.2 / 16.2 | 15.7 / 4.7 / 14.8 | 18.8 / 6.5 / 17.6 | 19.6 / 6.7 / 18.3 |
| Pidgin (pcm) | Latin | 18,214 / 1,151 / 1,151 | 39.0 / 15.1 / 29.9 | 31.5 / 10.9 / 27.0 | 33.8 / 12.2 / 29.1 | 30.5 / 10.5 / 26.2 |
| Kirundi (run) | Latin | 9,930 / 718 / 718 | 32.0 / 14.4 / 25.8 | 25.2 / 8.9 / 22.5 | 29.2 / 11.0 / 25.7 | 27.9 / 10.5 / 25.1 |
| Shona (sna) | Latin | 9,573 / 1,064 / 1,064 | - / - / - | 22.6 / 8.1 / 22.1 | 25.5 / 9.6 / 24.7 | 23.5 / 8.2 / 22.7 |
| Somali (som) | Latin | 10,508 / 745 / 745 | 31.6 / 11.6 / 24.2 | 24.1 / 7.0 / 21.2 | 28.2 / 9.6 / 24.7 | 28.6 / 10.5 / 24.9 |
| Swahili (swa) | Latin | 18,914 / 987 / 987 | 37.7 / 17.9 / 30.9 | 33.0 / 13.4 / 29.1 | 37.6 / 15.8 / 33.1 | 38.9 / 16.7 / 33.9 |
| Tigrinya (tir) | Ge'ez | 12,351 / 681 / 681 | 25.3 / 8.0 / 21.2 | 21.9 / 7.0 / 19.8 | 26.3 / 8.7 / 23.7 | 25.7 / 8.6 / 22.8 |
| Xhosa (xho) | Latin | 10,440 / 1,305 / 1,305 | - / - / - | 13.0 / 4.0 / 12.7 | 15.2 / 5.5 / 14.9 | 16.1 / 5.3 / 15.2 |
| Yoruba (yor) | Latin | 15,172 / 793 / 793 | 31.7 / 11.7 / 25.1 | 40.0 / 14.8 / 31.4 | 42.0 / 16.2 / 33.1 | 36.0 / 13.5 / 28.3 |
| Zulu (zul) | Latin | 14,209 / 1,776 / 1,776 | - / - / - | 16.0 / 5.0 / 15.5 | 17.8 / 5.5 / 17.4 | 17.0 / 4.9 / 16.2 |
| Average | -- | - | - / - / - | 25.7 / 9.3 / 22.9 | 28.5 / 10.84 / 25.5 | 28.2 / 10.85 / 24.9 |

Table 1 shows the data splits for all languages in AfriHG and the ROUGE scores (R1 / R2 / RL), comparing the news summarization results from the XL-Sum paper with AfriHG headline generation. "- / - / -" marks evaluation values that are not available in XL-Sum.
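To make the R1 / R2 / RL columns concrete, here is a minimal sketch of ROUGE-N and ROUGE-L F1 scores computed over whitespace tokens. It is not the Multilingual-ROUGE implementation used for the table: the function names are hypothetical, and the whitespace tokenization and the absence of stemming are simplifying assumptions. Splitting on whitespace rather than on ASCII alphanumerics is what keeps text in scripts such as Ge'ez or Arabic from being discarded before scoring.

```python
from collections import Counter


def _ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def _f1(overlap, ref_total, hyp_total):
    """F1 from an overlap count and the reference/hypothesis totals."""
    if overlap == 0 or ref_total == 0 or hyp_total == 0:
        return 0.0
    precision = overlap / hyp_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(reference, hypothesis, n=1):
    """ROUGE-N F1 using clipped n-gram overlap (whitespace tokens: an assumption)."""
    ref = _ngrams(reference.split(), n)
    hyp = _ngrams(hypothesis.split(), n)
    overlap = sum((ref & hyp).values())  # Counter & keeps the minimum counts
    return _f1(overlap, sum(ref.values()), sum(hyp.values()))


def _lcs_length(a, b):
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, hypothesis):
    """ROUGE-L F1 based on the longest common subsequence of tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    return _f1(_lcs_length(ref, hyp), len(ref), len(hyp))


if __name__ == "__main__":
    ref = "the cat sat on the mat"
    hyp = "the cat lay on the mat"
    print(rouge_n(ref, hyp, 1), rouge_n(ref, hyp, 2), rouge_l(ref, hyp))
```

These functions return values between 0 and 1; the table's figures appear to be on a 0 to 100 scale, so a comparable number would be the score multiplied by 100.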

BibTeX entry and citation info

@misc{ogunremi2024afrihgnewsheadlinegeneration,
      title={AfriHG: News headline generation for African Languages}, 
      author={Toyib Ogunremi and Serah Akojenu and Anthony Soronnadi and Olubayo Adekanmbi and David Ifeoluwa Adelani},
      year={2024},
      eprint={2412.20223},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.20223}, 
}
