In a context where a growing number of languages are in danger of extinction and linguists are in dire need of efficient language documentation tools, Breaking the Unwritten Language Barrier (BULB) aims to support the documentation of unwritten languages with the help of modern natural language processing technologies, in particular automatic speech recognition (ASR) and machine translation (MT).
This ANR/DFG project relies on a strong German-French cooperation between linguists and computer scientists from ZAS (F. Hamlaoui), the KIT (S. Stüker) and the University of Stuttgart (S. Zerbian) on the German side, as well as the LPP (M. Adda-Decker, A. Rialland), the LLACAN (M. van de Velde, D. Idiatov), the LIMSI (L. Lamel and F. Yvon), the LIG (L. Besacier) and the IMMI-CNRS (G. Adda) on the French side. These researchers and their local teams are bringing together their expertise to address the documentation of three mostly unwritten and generally under-resourced African languages of the Bantu family: Basaa (Cameroon), Myene (Gabon) and Embosi (Republic of Congo).
The first phase of the project consists in collecting large speech corpora (at least 100 hours per language) using a three-step, resource-economical methodology designed by S. Bird and M. Liberman.
This phase is coordinated by F. Hamlaoui and primarily involves the linguist partners at ZAS (E.-M. Makasso, J. Engelmann, C. Ngo Sohna and H. Salfner), LLACAN, LPP, LIG and the University of Stuttgart.
The LIMSI and KIT teams will work on the development of language-independent phonetic recognition systems to automatically produce accurate transcriptions in the source (Basaa/Embosi/Myene) and target (French) languages. Alignments between source and target languages will subsequently be performed by the IMMI-CNRS and KIT teams, using and improving existing statistical machine translation techniques. These alignments will be highly valuable to linguists and phoneticians for large-scale acoustic-phonetic studies, phonological and prosodic data mining, and studies of dialectal variation, as well as for morphological studies and dictionary compilation.
Beyond the positive outcomes for the documentary linguistics community, BULB more generally aims to contribute to the preservation of linguistic and cultural diversity by providing communities with tools (e.g. writing systems, dictionaries, grammars) that will heighten the perceived value of their unwritten languages, facilitate the use of these languages in a wider array of settings, and thus help prevent them from disappearing.
An important part of the BULB project has been the development, by our partners at LIG (CNRS), of an application to collect data in the field. Based on the Aikuma app by S. Bird et al., LIG-Aikuma is a tool to record, respeak and translate speech through a clean and easy-to-use interface. It also enables the elicitation of speech from text, images or videos. It uses clear file naming conventions and extended metadata, and the whole application is optimized for tablets with 10-inch screens. More features are being developed, and the application is already being used on field trips to collect data in Africa.
It is available for free here.