Using CLDF and CLLD for language documentation and description

Matter, Florian (25 October 2019). Using CLDF and CLLD for language documentation and description (Unpublished). In: Workshop on language documentation: multilingual settings and technological advances. Uppsala University. 24. - 25.10.2019.

Within the humanities, linguistics can rightly be seen as one of the more data-heavy fields. Especially when undertaking a comprehensive description of a language, all sorts of linguistic data and metadata are needed for successful analysis. What used to be a more-or-less well organized compilation of handwritten fieldwork notes is nowadays often a more-or-less well organized collection of digital notes. Digital literacy, as advocated by the digital humanities movement, is something linguistics could benefit from immensely. However, much descriptive linguistic work conducted with digital tools does not take full advantage of their possibilities. There are ways of converting between the formats used by some of these tools, but the potential for rich interoperability that stems from language structure, or rather from the analysis thereof, often goes unexplored.
The CLLD (cross-linguistic linked data) project (Forkel, Bank, et al. 2019) focuses on interoperable data and provides an easily extendable web framework for providing access to these data. The CLDF (cross-linguistic data formats) project (Forkel, J.-M. List, et al. 2017), which grew out of CLLD, aims to create a database-independent framework for storing and manipulating linguistic data, based on simple CSV (comma-separated values) files. The advantages of CLDF are its software-independent data format and its encouragement of consistency and completeness. The simplicity and ease of editing can be enormously beneficial for collaborations with native speakers, and consistent, complete data lead to more satisfying analyses.
I present an approach that combines more traditional linguistic software solutions and the philosophy championed by these projects. It is based on two staple software solutions of many field linguists, ELAN (The Language Archive 2018) and FLEx (Summer Institute of Linguistics 2019), and the CLLD framework. Language data (in the form of texts) is first annotated in ELAN, then exported to FLEx (see Gaved & Salffner 2014), where morphosyntactic annotation is conducted. From the resulting FLEx export, CLDF files containing the lexicon and interlinear examples are created. These files in turn are then used to create a rich and interactive online grammar / dictionary / text collection built with the CLLD framework. The result is accessible for both laypeople and linguists, providing a thorough description of the language as well as rich illustrations of language in use, all with an accessible and interactive interface.
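The step from annotated examples to CSV files can be illustrated with a minimal sketch. It does not reproduce the actual conversion used in this work, and it does not parse a FLEx export: the input records, their keys, and the function name are hypothetical. The column names are modelled on those of the CLDF example table and should be checked against the CLDF specification; joining morphemes and glosses with a tab is likewise an assumption about how list-valued cells are serialized.

```python
import csv
import io

# Hypothetical in-memory records standing in for parsed FLEx output;
# the keys are illustrative and not taken from any FLEx export format.
EXAMPLES = [
    {
        "id": "ex1",
        "language": "lang1",
        "text": "word-a word-b",
        "morphemes": ["word-a", "word-b"],
        "glosses": ["GLOSS-A", "GLOSS-B"],
        "translation": "free translation",
    },
]

# Column names modelled on the CLDF example table (an assumption to verify).
COLUMNS = [
    "ID",
    "Language_ID",
    "Primary_Text",
    "Analyzed_Word",
    "Gloss",
    "Translated_Text",
]


def examples_to_csv(examples):
    """Serialize interlinear examples to a CSV string, one row per example."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for ex in examples:
        # Every object-line unit needs exactly one gloss; refuse to write otherwise.
        if len(ex["morphemes"]) != len(ex["glosses"]):
            raise ValueError(f"misaligned glosses in example {ex['id']}")
        writer.writerow(
            {
                "ID": ex["id"],
                "Language_ID": ex["language"],
                "Primary_Text": ex["text"],
                "Analyzed_Word": "\t".join(ex["morphemes"]),
                "Gloss": "\t".join(ex["glosses"]),
                "Translated_Text": ex["translation"],
            }
        )
    return buffer.getvalue()


if __name__ == "__main__":
    print(examples_to_csv(EXAMPLES))
```

The alignment check reflects the consistency that the format encourages: an example whose object line and gloss line differ in length is rejected rather than silently written.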
Every example sentence is linked to an audio file and one or more speakers. In addition, morphemes in the object line of sentences are links to dictionary entries (Figure 1a). Dictionary entries for a morpheme list all sentences which exemplify it, again with audio included and with links to the other morphemes that appear in them (Figure 1c). Meanings can be linked to the Concepticon (J.-M. List et al. 2019). In the grammatical description part, illustrative examples can be inserted directly from the database. Links to individual entries, sources, and other parts of the grammatical description are also possible (Figure 1b). Planned features include the inclusion of dialectal and other sociolinguistic variation based on information about the included speakers.
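The cross-linking between dictionary entries and example sentences amounts to an inverted index from morphemes to the examples that contain them. The following is a minimal sketch of that idea, not the CLLD implementation; the record layout, the identifiers, and the function name are hypothetical.

```python
from collections import defaultdict

# Hypothetical example records: each lists the IDs of the dictionary
# entries (morphemes) that occur in its object line.
EXAMPLES = [
    {"id": "ex1", "morpheme_ids": ["m1", "m2"]},
    {"id": "ex2", "morpheme_ids": ["m2", "m3"]},
]


def build_morpheme_index(examples):
    """Map each morpheme ID to the IDs of the examples that contain it."""
    index = defaultdict(list)
    for ex in examples:
        for morpheme_id in ex["morpheme_ids"]:
            # A morpheme may occur twice in one sentence; list the example once.
            if ex["id"] not in index[morpheme_id]:
                index[morpheme_id].append(ex["id"])
    return dict(index)


if __name__ == "__main__":
    for morpheme_id, example_ids in build_morpheme_index(EXAMPLES).items():
        print(morpheme_id, example_ids)
```

A dictionary page for a given morpheme could then look up its ID in the index to list the exemplifying sentences, while each sentence page links back through its own list of morpheme IDs.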

Item Type:

Conference or Workshop Item (Speech)


06 Faculty of Humanities > Department of Linguistics and Literary Studies > Institute of Linguistics

UniBE Contributor:

Matter, Florian


400 Language > 410 Linguistics




Florian Emmanuel Matter

Date Deposited:

21 Apr 2020 15:23

Last Modified:

21 Apr 2020 15:23



