Leibniz-Zentrum Allgemeine Sprachwissenschaft Leibniz-Gemeinschaft

Exploring data from language documentation

Organizer(s): Felix Rau & Kilu von Prince
Event start: 2013-05-10, 10:00
Event end: 2013-05-11, 18:00
Venue: ZAS

Description: Language documentation has produced a large number of extensive spoken language corpora. These corpora consist of time-aligned and annotated audio and video recordings of endangered and often lesser-known languages. The typological diversity and the variety of these data pose new and interesting technological and methodological challenges. Moreover, in the last ten years, a considerable infrastructure has been developed to create and archive larger corpora of time-aligned and annotated primary data. This infrastructure involves digital archives such as the TLA at the MPI in Nijmegen and tools such as ELAN, Toolbox, FLEx, Praat and Transcriber.

To unlock the full potential of spoken language corpora, however, researchers often face unique challenges. Depending on the properties of the documented language, the primary research questions, and the nature of the workflow, the tools listed above might not fully meet the researchers' needs. In addition, studies that draw on data from different documentation projects may find it difficult to integrate a variety of formats and standards. This workshop, which is funded by the CLARIN-D project (F-AG3), invites experts from language documentation and linguistic typology as well as language technology and corpus linguistics to present and discuss the problems posed by the analysis of typologically diverse spoken language corpora, possible solutions to them, and relevant practices and technologies from related fields.

Program

Day 1, 2013-05-10

1:45-2:00

Welcome 

2:00-2:45

Nick Thieberger: Pathways to reusability for fieldwork records

2:45-3:00

Short coffee break

3:00-3:30 

Christian Chanard, Amina Mettouchi: Cross-linguistic comparison in lesser-described languages: from homogenized sub-corpora to integrated meta-corpus

3:30-4:00 

Emily M. Bender, Fei Xia, Joshua Crowgey, Michael Wayne Goodman: Towards automatic detection of morphosyntactic systems from IGT

4:00-4:30 

Coffee break

4:30-5:00

Alexander König, Menzo Windhouwer, Paul Trilsbeek, Sebastian Drude: Curation of large diverse data collections – the DoBeS annotations

5:00-5:30

Nikolaus Himmelmann: Some small things that would be a big help in processing fieldwork data

Day 2, 2013-05-11

9:00-9:45

Taras Zakharko: ToolboxSearch – an R package for working with Toolbox corpora 

9:45-10:15 

Frank Seifart, Jan Strunk, Florian Schiel: Word- and Phoneme-Level Time Alignment for Language Documentation Corpora

10:15-10:45 

Coffee break

10:45-11:30 

Ciprian Gerstenberger: Why not tba?

11:30-12:00 

Peter Bouda: Annotation graphs and distant reading with GrAF and Poio API

12:00-12:30 

Coffee break

12:30-1:00 

Seunghun J. Lee, Emily Elfner: Building a database for phonology-syntax interface research

1:00-1:30 

Kilu von Prince: Problems with and solutions to consolidating corpus data from fieldwork