Discoverability of francophone scientific content: release of the FrenchScienceCommons corpus

During the Week of the French Language and Francophonie (March 17-20), we are delighted to announce the publication of the FrenchScienceCommons corpus on Hugging Face. This open, structured corpus, entirely dedicated to scientific production in French, was developed by Pleias in collaboration with OPERAS and the Quebec Research Chair on the Discoverability of Scientific Content in French, with the support of the General Delegation for the French Language and the Languages of France (DGLFLF). The resource brings together 1.25 million scientific documents, including theses and articles, published between 2007 and 2026 under permissive licences and referenced in OpenAlex as well as in the French aggregators HAL and theses.fr.
The corpus was built as part of an initiative aimed at fostering the discoverability [1] of francophone research in a context of scientific overproduction dominated by the English language. Among other uses, it can serve to train RAG [2] and agentic search systems or specialised language models, or to conduct thematic searches through an interactive semantic map.
The documents were processed and structured to enable a variety of use cases for the benefit of the scientific community and beyond. The balanced multidisciplinarity of the corpus – which includes publications from natural sciences, engineering and technology, medical and health sciences, agricultural and veterinary sciences, social sciences, humanities and arts – makes it relevant across all disciplinary fields.
The corpus is also an initial result of the wider ambition to develop digital commons within the Francophonie. With a view to supporting linguistic and cultural sovereignty, as well as traceability, transparency and scientific integrity, the objective is to build shared specialist resources, curated by language and disciplinary experts. These resources open new avenues for the discoverability of francophone scientific content across multiple use cases: indexing and classification, writing and translation, training, public dissemination of research findings, etc.
Find out more about the corpus, its composition and its technical specifications here (website in French).
The corpus is released on Hugging Face at this link.
Contacts
For any technical questions, please send inquiries to [email protected]
For more information about the project, please write to [email protected]
Footnotes
- For a definition of the concept of discoverability, refer to: Grenier, J., Francoeur, J., Paquin, É., Trépanier, S., Larivière, V. (2025). Découvrabilité des contenus en français : de la culture à la science (in French). The authors identify two fundamental dimensions of discoverability: “findability – which refers to the ability of content to be discovered by a user actively searching for it – and serendipity – which refers to the potential for content to be discovered by chance in a digital environment.”
- Retrieval-Augmented Generation (RAG) is a technique that allows AI systems (like chatbots) to retrieve information from external knowledge sources before answering a question. This approach helps improve the accuracy and contextual relevance of outputs, while reducing the risk of incorrect or “hallucinated” information.
Only the text may be used under the Creative Commons Attribution-NonCommercial 4.0 International licence. All other elements (illustrations, imported files) are “All rights reserved”, unless otherwise stated.
OpenEdition suggests that you cite this post as follows:
OPERAS Editorial Team (March 19, 2026). Discoverability of francophone scientific content: release of the FrenchScienceCommons corpus. OPERAS. Retrieved April 4, 2026 from https://doi.org/10.58079/15whx
