Sub-Saharan Africa is home to more than 2,000 languages, yet fewer than 5% currently have sufficient resources for natural language processing. This underrepresentation significantly reduces the effectiveness of existing voice technologies for African users.
In response to this long-standing gap, Google has officially launched WAXAL, an open-source voice database designed to accelerate the development of artificial intelligence systems capable of understanding and reproducing African languages. Developed over three years in partnership with institutions across the continent, the project aims to address the structural shortage of linguistic data that has held back voice-based AI in sub-Saharan Africa.
Now freely available on the Hugging Face platform, WAXAL includes more than 11,000 hours of recorded speech, drawn from nearly two million audio files. The dataset covers 21 African languages, including Hausa, Yoruba, Luganda, Acholi, Swahili, Igbo, and Fulfulde.
Data collection was led by several African partners, notably Makerere University in Uganda and the University of Ghana, which oversaw work on 13 languages. Rwanda’s Digital Umuganda initiative contributed five additional languages. Regional studios also supported the production of high-quality recordings, while the African Institute for Mathematical Sciences (AIMS) helped build multilingual corpora for future expansions.
Designed as foundational infrastructure, WAXAL provides approximately 1,250 hours of transcribed speech for automatic speech recognition, along with more than 20 hours of studio recordings intended for text-to-speech synthesis. The goal is to enable the creation of voice-driven applications such as virtual assistants, dictation tools, and public services accessible to populations with limited literacy, particularly in sectors like healthcare, education, and agriculture.
“This database provides an essential foundation for researchers and entrepreneurs to build technologies adapted to their languages and contexts,” said Aisha Walcott-Bryant, Head of Google Research Africa.
The launch comes amid growing momentum around African language technologies. In 2025, Nigeria introduced N-ATLAS, an open-source language model capable of transcribing Yoruba, Hausa, Igbo, and Nigerian English. Meanwhile, African start-ups are developing speech recognition and translation solutions tailored to local needs.
The stakes are high. With so few of the region's languages supported by AI-driven language processing, millions of people remain cut off from voice technologies that have become commonplace elsewhere.
Under the partnership model adopted for WAXAL, African institutions involved in data collection retain ownership of their corpora while making them accessible under an open license. As Joyce Nakatumba-Nabende, a researcher at Makerere University, noted: “For artificial intelligence to have real impact in Africa, it must be able to speak our languages and reflect our realities.”