Conference Presentation: CARE, FAIR and the DGS Corpus: Implementing Ethical Open Data Practices in a Large Sign Language Dataset

Start: 2022-10-07T14:30:00+02:00
End: 2022-10-07T15:00:00+02:00
Location: Berlin, Germany

Marc Schulder, Thomas Hanke

Date

7 October 2022 14:30 — 15:00

Event

Language Documentation Archiving Conference

Location

Berlin, Germany

Presentation

International Sign Interpreter: Razaq Fakir

Abstract

The creation and publication of resources for minority languages require a balance between making data open and accessible and respecting the rights and needs of the language community. The DGS Corpus, a large collection of conversations in German Sign Language (DGS), seeks to strike that balance. Its entire data life cycle follows the CARE Principles for Indigenous Data Governance (Carroll et al., 2020), putting the deaf community of Germany, the primary stakeholders of DGS, front and centre. Deaf people have been involved not only as participants, but also as project members and advisors. Feedback from the deaf community is regularly used to adjust the output and practices of the project. All data was collected based on informed consent and license conditions that empower participants and give them control over their own data.

Of the 560-hour corpus, 50 hours have been released publicly, following particularly stringent quality control, including anonymisation of personally identifiable information. To meet the needs of different audiences and fulfil its goal of being a record of deaf culture, the public corpus is not only available through a research portal (https://ling.meine-dgs.de) that provides full annotations, translations and metadata, but also through the community portal MY DGS (https://meine-dgs.de), which focuses on providing easy access to interesting stories for the deaf community, language teachers and others interested in DGS and deaf culture.

In addition to CARE, the public corpus also follows the FAIR Principles (Wilkinson et al., 2016) of good open data practices. Unique persistent identifiers are provided not only for the dataset as a whole, but also for each individual transcript and each distinct sign type, supporting best citation practices. Metadata is human- and machine-readable, and there are over 30 reports documenting every aspect of the project. Taken together, these measures make the DGS Corpus both CAREful and FAIR.

Marc Schulder

Research Associate in Computational Linguistics

My research interests include sign languages, natural language processing, and open science.