Corpus: CreoleVal

We are proud to announce the release of CreoleVal, a collection of benchmarks for 28 Creole languages. The datasets span tasks such as relation classification, machine comprehension, machine translation, and named entity recognition, as well as use cases such as language modeling. The languages covered include Haitian Creole, Bislama, Chavacano, Pitkern, Singlish, Tok Pisin, and Papiamento, among others.

We hope the NLP community will include this collection of datasets in ongoing and future evaluations of methods for low-resource languages. We also hypothesise that CreoleVal will open the door to controlled experimentation with transfer learning methods.

This resource has been long in the making and was made possible by the contributions of many collaborators.

For a pre-print, see: https://arxiv.org/abs/2310.19567

For code and data, see: https://github.com/hclent/CreoleVal
(Repository under construction)

From: Johannes Bjerva <jbjerva@cs.aau.dk>

