The Sign Language Dataset Compendium: Creating an Overview of Digital Linguistic Resources

Maria Kopf, Marc Schulder, Thomas Hanke

June 2022

Publication

In Proceedings of the 10th Workshop on the Representation and Processing of Sign Languages: Multilingual Sign Language Resources (sign-lang@LREC 2022)

Abstract

One of the challenges that sign language researchers face is the identification of suitable language datasets, particularly for cross-lingual studies. There is no single source of information on what sign language corpora and lexical resources exist or how they compare. Instead, such resources have to be found through extensive literature review or by word of mouth. The amount of information available on individual datasets can also vary widely and may be distributed across different publications, data repositories and (potentially defunct) project websites. This article introduces the Sign Language Dataset Compendium, an extensive overview of linguistic resources for sign languages. It covers existing corpora and lexical resources, as well as commonly used data collection tasks. Special attention is paid to covering resources for many different languages from around the globe. All information is provided in a standardised format to make entries comparable, but kept flexible enough to allow for differences in content. The compendium is intended as a growing resource that will be updated regularly.

Marc Schulder

Research Associate in Computational Linguistics

My research interests include sign languages, natural language processing, and open science.