Leibniz-Zentrum Allgemeine Sprachwissenschaft Leibniz-Gemeinschaft

Big data on small languages: Release of the DoReCo online database


On July 29, linguists working all over the world will gather in Berlin at the Leibniz-Zentrum Allgemeine Sprachwissenschaft (ZAS) to celebrate the online release of the DoReCo database, which provides access to audio recordings from more than 50 languages, along with their transcriptions, translations, and detailed linguistic analyses. Admission to the hybrid event is free, but registration is required. Journalists are invited to meet the DoReCo Principal Investigators beforehand, between 2 and 3 pm. More details can be found at the end of this press release.

The DoReCo (Language Documentation Reference Corpus) online database will offer an unmatched panorama of the diversity of the world's languages through oral narratives from Northern Siberia to South Africa, and from Europe to Australia. DoReCo features a selection of top results of meticulous work by linguists who have spent years analyzing small and endangered languages in collaboration with native speakers. With DoReCo, scientists from all over the world will now have access to these recordings and analyses. Looking through this kaleidoscope of languages will help them unravel the mysteries of linguistic diversity and also highlight the characteristics that languages share, despite the distances between them.

Frank Seifart, senior researcher at ZAS and Principal Investigator in the DoReCo project, comments: "I am really excited about being able to now study the diversity of human languages not just through abstract statements taken from grammar books, but through the rich expressive power of spontaneously produced speech."

DoReCo built a network of nearly a hundred linguists who have collected and analysed primary linguistic data. Processing these data was made possible by a French-German project led by ZAS in Berlin and the Dynamics of Language Laboratory in Lyon, and jointly funded by the German DFG and the French ANR (Agence Nationale de la Recherche). Over three years, the project team has worked on enriching and homogenizing the miscellaneous transcriptions, translations, and linguistic analyses into the unified DoReCo data format, which facilitates comparative analyses. DoReCo data also comply with the ethical and scientific 'FAIR' principles (Findability, Accessibility, Interoperability, Re-usability), and are accessible under Creative Commons licenses.

The DoReCo collection of 50 languages is the closest linguistics has come so far to a representative sample, in terms of audio-recorded, expertly analyzed texts, of the 7,000 languages still spoken today. It represents a substantial breakthrough for scientists compared to the resources previously available: most of the DoReCo languages were almost invisible online, and the release of oral recordings paired with rich linguistic analyses opens new avenues for research on the uniquely human ability to develop, maintain, and use astonishingly diverse linguistic systems. However, for the DoReCo languages Dalabon (spoken in Australia), Resígaro (spoken in South America), and Kamas (spoken in Siberia), the numbers of speakers have already declined dramatically, with barely a handful, only one, or even none left today. This decline is inexorably closing the windows these languages opened on human cultural and linguistic diversity, and it adds to the importance of the documentary materials provided by DoReCo.

More information about the event:
Friday, July 29th, 2022, 3 pm - 5:30 pm, hybrid event at the Leibniz-Zentrum Allgemeine Sprachwissenschaft, Schützenstr. 18, 10117 Berlin. Admission is free; please register by July 22nd at https://www.leibniz-zas.de/de/das-zas/veranstaltungen/details/events/new-doreco-...

For Journalists:
Journalists are invited to meet the principal investigators beforehand, between 2 and 3 pm. To participate, please email Frank Seifart (seifart@leibniz-zas.de) or François Pellegrino (francois.pellegrino@univ-lyon2.fr) by July 27. It is possible to participate in person or via Zoom.

About the DoReCo project:
The DoReCo database was created within the DoReCo project from 2019 to 2022. This project was funded by an ANR-DFG grant (ANR-18-FRAL-0010-01, KR951/17-1) awarded to Frank Seifart, Manfred Krifka (March-July 2019), and François Pellegrino (August 2019-August 2022). The project was housed at the Leibniz-Zentrum Allgemeine Sprachwissenschaft in Berlin and the laboratoire Dynamique Du Langage (UMR5596, CNRS & Université Lyon 2) in Lyon and cooperated with the Bavarian Archive for Speech Signals, Munich. The aim of the DoReCo project was to carry out research on local variations of speech rate based on a broader sample of the world’s languages than has previously been available in a single linguistic database.


Contact for scientific information:

Frank Seifart
Phone: +49 30 20192 407

François Pellegrino

More information:
https://doreco.info Project website