WEBVTT 00:00:00.800 --> 00:00:03.233 Hi, I’m Marc and I’d like to present 00:00:03.233 --> 00:00:07.099 how the DGS Corpus implements ethical open data practices 00:00:07.099 --> 00:00:09.533 such as the CARE and FAIR principles. 00:00:09.766 --> 00:00:12.666 My co-author for this presentation is Thomas Hanke, 00:00:12.666 --> 00:00:17.066 but the work I describe has been contributed to by many people over the years, 00:00:17.066 --> 00:00:21.300 so credit should go to the entire DGS-Korpus project team. 00:00:26.600 --> 00:00:30.233 A quick note for those who are not familiar with signed languages, 00:00:30.233 --> 00:00:33.766 DGS stands for “Deutsche Gebärdensprache”, 00:00:33.766 --> 00:00:35.633 meaning German Sign Language, 00:00:35.633 --> 00:00:39.600 which is a signed language that is primarily used in Germany and Luxembourg. 00:00:45.833 --> 00:00:48.600 As is the case for basically all signed languages, 00:00:48.600 --> 00:00:51.066 DGS is severely under-resourced 00:00:51.066 --> 00:00:54.233 and very little data for corpus-driven research exists. 00:00:55.200 --> 00:00:56.700 To address this lack of data, 00:00:56.700 --> 00:01:00.866 the DGS Corpus was created as part of a 15-year project 00:01:00.866 --> 00:01:02.833 that was started in 2009. 00:01:03.299 --> 00:01:06.700 It is an annotated reference corpus of German Sign Language, 00:01:06.700 --> 00:01:10.266 consisting of 560 hours of conversations. 00:01:10.666 --> 00:01:13.466 It includes both sign-by-sign annotations 00:01:13.466 --> 00:01:16.233 and translations into German and English. 00:01:17.066 --> 00:01:20.766 50 hours of that reference corpus are also publicly available 00:01:20.766 --> 00:01:22.966 as the Public DGS Corpus.
00:01:26.066 --> 00:01:30.900 While small when compared to many corpora of well-resourced spoken languages, 00:01:30.900 --> 00:01:33.533 the DGS Corpus and its public subset 00:01:33.533 --> 00:01:37.233 are both amongst the largest sign language corpora of their kind. 00:01:37.833 --> 00:01:40.133 One reason for this is that signed languages 00:01:40.133 --> 00:01:43.099 have no commonly used written forms 00:01:43.099 --> 00:01:45.866 and phonetic transcription is very complex. 00:01:46.866 --> 00:01:51.666 Annotating signed utterances properly is a very time-consuming process. 00:01:52.633 --> 00:01:54.466 One hour of recording needs about 00:01:54.466 --> 00:01:58.299 800 to 1000 hours of work before publication, 00:01:58.299 --> 00:02:01.833 setting hard constraints for how much of the reference corpus 00:02:01.833 --> 00:02:04.333 could be included in the public corpus. 00:02:05.700 --> 00:02:07.766 Publishing 50 hours of data 00:02:07.766 --> 00:02:11.933 allows us to present a cross-section of the overall corpus, 00:02:11.933 --> 00:02:16.099 giving a good impression of the different kinds of contents that it covers. 00:02:18.166 --> 00:02:21.133 During the creation and publication of the corpus, 00:02:21.133 --> 00:02:23.433 the team followed established best practices 00:02:23.433 --> 00:02:26.599 for ethical research relating to deaf populations. 00:02:27.599 --> 00:02:30.733 These practices are in line with the recently introduced 00:02:30.733 --> 00:02:33.233 CARE principles of open data ethics. 00:02:36.800 --> 00:02:40.300 To see how exactly the DGS Corpus implements CARE, 00:02:40.300 --> 00:02:41.933 let’s start at the beginning. 00:02:45.566 --> 00:02:48.633 The primary stakeholders in the DGS language community 00:02:48.633 --> 00:02:51.233 are members of the deaf community in Germany.
00:02:51.699 --> 00:02:55.400 Following the principle “nothing about us without us”, 00:02:55.400 --> 00:02:58.699 the corpus project has always included deaf team members. 00:02:59.400 --> 00:03:02.633 In addition, a focus group of deaf users was formed 00:03:02.633 --> 00:03:04.599 to guide project decisions 00:03:04.599 --> 00:03:07.566 and assist us in connecting with the deaf community 00:03:07.566 --> 00:03:10.233 as well as keeping it informed about the project. 00:03:12.933 --> 00:03:15.300 To ensure collective benefit, 00:03:15.300 --> 00:03:17.699 the corpus was designed so that it would be 00:03:17.699 --> 00:03:22.566 both a source for linguistic research and a record of deaf culture, 00:03:22.566 --> 00:03:24.966 covering general life experience, 00:03:24.966 --> 00:03:27.066 deaf-specific experiences, 00:03:27.066 --> 00:03:29.266 perception of historical events, 00:03:29.266 --> 00:03:31.933 but also things like telling jokes. 00:03:33.633 --> 00:03:35.866 The goal was to create a resource 00:03:35.866 --> 00:03:38.433 that would be entertaining, informative, 00:03:38.433 --> 00:03:41.633 and that would support the identity of the community. 00:03:48.199 --> 00:03:52.933 The project recorded 330 participants from all across Germany 00:03:52.933 --> 00:03:56.199 whose primary language of daily life was DGS. 00:03:57.300 --> 00:04:00.166 Following the authority to control principle, 00:04:00.166 --> 00:04:03.533 informed consent was requested from all participants. 00:04:04.133 --> 00:04:08.699 This involved providing information in both DGS and German 00:04:08.699 --> 00:04:10.766 regarding the goals of the project, 00:04:10.766 --> 00:04:12.166 uses of the data, 00:04:12.166 --> 00:04:14.466 and what the rights of the participants are. 
00:04:15.633 --> 00:04:20.100 These rights include restricting for what purposes the data may be shared 00:04:20.100 --> 00:04:24.566 and also reviewing the recordings to give or withhold their approval 00:04:24.566 --> 00:04:27.800 for either entire recordings or individual moments. 00:04:33.433 --> 00:04:35.466 Let’s fast forward a few years. 00:04:35.899 --> 00:04:39.433 After a lot of work annotating and translating recordings, 00:04:39.433 --> 00:04:44.333 the first full release of the public corpus was published in 2018. 00:04:45.866 --> 00:04:48.899 We also release updated versions on a regular basis 00:04:48.899 --> 00:04:50.966 to add more data, make corrections, 00:04:50.966 --> 00:04:52.633 and to react to feedback. 00:04:55.300 --> 00:04:56.699 As I mentioned before, 00:04:56.699 --> 00:04:59.566 the DGS Corpus is both a linguistic resource 00:04:59.566 --> 00:05:01.399 and a record of deaf culture. 00:05:01.800 --> 00:05:04.033 So to maximise its collective benefit 00:05:04.033 --> 00:05:07.000 we released its data on two separate portals. 00:05:14.733 --> 00:05:17.266 The first one is My DGS, 00:05:17.266 --> 00:05:19.333 a community portal for deaf people 00:05:19.333 --> 00:05:22.500 and others interested in DGS and deaf culture. 00:05:22.933 --> 00:05:26.399 It provides all recordings with optional German subtitles 00:05:26.399 --> 00:05:28.800 and its design focusses on making it easy 00:05:28.800 --> 00:05:31.566 to find and watch interesting content. 00:05:32.033 --> 00:05:33.600 Here is a little example. 00:05:53.666 --> 00:05:56.866 The second portal is My DGS – annotated, 00:05:56.866 --> 00:05:59.633 a research portal that provides the same recordings 00:05:59.633 --> 00:06:03.766 with full sign annotations and translations in German and English. 00:06:04.199 --> 00:06:06.399 All data is available to download, 00:06:06.399 --> 00:06:09.533 but can also be viewed in an online transcript viewer. 
00:06:09.833 --> 00:06:11.500 Here is the video from before 00:06:11.500 --> 00:06:13.833 as seen through the research portal. 00:06:33.566 --> 00:06:36.199 The research portal also provides a type index 00:06:36.199 --> 00:06:39.466 of all unique signs occurring in the corpus. 00:06:47.466 --> 00:06:51.333 For each sign you receive an overview of its corpus occurrences, 00:06:51.333 --> 00:06:53.100 grouped by sign sense, 00:07:02.566 --> 00:07:07.633 and where possible a studio recording and phonetic transcription of its citation form 00:07:07.633 --> 00:07:10.533 as well as links to other lexical resources. 00:07:29.366 --> 00:07:31.666 The publication of the portals is also 00:07:31.666 --> 00:07:34.800 where FAIR joins CARE in our considerations. 00:07:35.399 --> 00:07:37.733 To make them reliably findable, 00:07:37.733 --> 00:07:41.733 each portal is treated as a separate but related dataset 00:07:41.733 --> 00:07:43.800 and given separate DOIs. 00:07:44.399 --> 00:07:47.466 For a simpler dataset, a single DOI would be sufficient, 00:07:47.466 --> 00:07:51.300 but for a complex dataset like the Public DGS Corpus 00:07:51.300 --> 00:07:56.000 we find it advisable to also have identifiers for individual parts. 00:07:59.699 --> 00:08:03.633 So we create DOIs for each individual transcript 00:08:03.633 --> 00:08:06.866 as well as for each type in the type index. 00:08:07.233 --> 00:08:10.266 That way researchers can clearly specify 00:08:10.266 --> 00:08:13.833 which transcripts or signs they refer to in their research. 00:08:15.800 --> 00:08:19.966 On top of that, whenever we release an updated version of the corpus, 00:08:19.966 --> 00:08:23.733 every component that has changed also receives a new DOI, 00:08:23.733 --> 00:08:28.100 so it is always clear which version of the corpus is being referred to.
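The per-component versioning just described can be illustrated with a minimal sketch. It is not the project's actual tooling: the DOI prefix `10.0000/example`, the `Record` class, the `release` function and the use of a content checksum to detect changes are all assumptions made for illustration. The relation name `IsNewVersionOf` is borrowed from the DataCite metadata vocabulary; the talk does not say which relation types the corpus uses.

```python
import hashlib
from dataclasses import dataclass

# Hypothetical DOI prefix; a placeholder, not the project's real prefix.
DOI_PREFIX = "10.0000/example"


@dataclass(frozen=True)
class Record:
    doi: str          # identifier for this version of the component
    checksum: str     # fingerprint of the content, used to detect changes
    relations: tuple  # (relation_type, target_doi) pairs


def checksum(content: str) -> str:
    """Fingerprint a component's content."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def release(previous: dict, contents: dict, release_no: int) -> dict:
    """Build the records for a new release.

    Unchanged components keep their existing record and DOI.
    New or changed components get a fresh DOI, and changed ones
    are linked back to the DOI they supersede.
    """
    records = {}
    for name, content in contents.items():
        digest = checksum(content)
        old = previous.get(name)
        if old is not None and old.checksum == digest:
            records[name] = old  # unchanged: the DOI stays the same
            continue
        relations = []
        if old is not None:
            # Assumed relation type, named after the DataCite vocabulary.
            relations.append(("IsNewVersionOf", old.doi))
        records[name] = Record(
            doi=f"{DOI_PREFIX}.{name}.r{release_no}",
            checksum=digest,
            relations=tuple(relations),
        )
    return records


# Release 1 publishes two transcripts; release 2 corrects only one of them.
r1 = release({}, {"transcript-a": "first text", "transcript-b": "first text"}, 1)
r2 = release(r1, {"transcript-a": "first text", "transcript-b": "corrected text"}, 2)
for name, record in r2.items():
    print(name, record.doi, record.relations)
```

In this sketch a citation of `transcript-a` resolves to the same identifier across both releases, while `transcript-b` has one identifier per version, with the newer one pointing back at the older.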
00:08:31.300 --> 00:08:34.433 All of these DOIs are then given qualified references 00:08:34.433 --> 00:08:37.666 to clarify how they are related to each other. 00:08:45.799 --> 00:08:49.899 Each DOI also comes with a set of machine-readable metadata 00:08:49.899 --> 00:08:52.233 covering general dataset information 00:08:52.233 --> 00:08:54.933 like its name, authors, release date, 00:08:54.933 --> 00:08:57.733 and all those qualified references I just mentioned. 00:09:02.566 --> 00:09:05.633 For information that is more specific to language data 00:09:05.633 --> 00:09:08.633 we also provide a CMDI file for each transcript. 00:09:08.933 --> 00:09:12.366 In there we specify metadata about the participants, 00:09:12.366 --> 00:09:13.966 elicitation tasks, 00:09:13.966 --> 00:09:16.633 the languages of the primary and secondary data, 00:09:16.633 --> 00:09:17.966 and so on. 00:09:23.766 --> 00:09:27.666 Apart from this metadata, the corpus has a wealth of documentation. 00:09:27.866 --> 00:09:29.733 In addition to peer-reviewed articles, 00:09:29.733 --> 00:09:34.399 there are over thirty project notes documenting various aspects of the project. 00:09:42.799 --> 00:09:44.600 This includes a Data Statement, 00:09:44.600 --> 00:09:47.866 a document type specifically designed to help researchers 00:09:47.866 --> 00:09:50.333 understand the background of a dataset 00:09:50.333 --> 00:09:52.666 and anticipate its inherent biases. 00:09:53.233 --> 00:09:57.366 Of course, all project notes have their own version-controlled DOIs. 00:10:03.899 --> 00:10:05.566 Another challenge for the project 00:10:05.566 --> 00:10:08.933 was the long-term archival of its original recordings.
00:10:09.899 --> 00:10:13.033 At over one thousand hours of raw recording time, 00:10:13.033 --> 00:10:14.700 captured from several perspectives 00:10:14.700 --> 00:10:18.566 by 7 different high definition or stereoscopic cameras, 00:10:18.566 --> 00:10:24.333 the original data collection resulted in 730,000 gigabytes of video files. 00:10:27.100 --> 00:10:30.100 An additional 250,000 gigabytes of data 00:10:30.100 --> 00:10:33.466 have so far been produced through secondary data collections, 00:10:33.466 --> 00:10:36.700 studio recordings made for the presentation of the corpus, 00:10:36.700 --> 00:10:39.433 and recordings made for a corpus-based dictionary 00:10:39.433 --> 00:10:41.466 that is also part of the project. 00:10:43.933 --> 00:10:46.766 That leaves us with a whole petabyte of data 00:10:46.766 --> 00:10:49.566 that needs to be securely backed up and archived, 00:10:49.566 --> 00:10:51.933 ideally following FAIR principles. 00:10:53.399 --> 00:10:57.066 Luckily, the Research Data Repository of Hamburg University 00:10:57.066 --> 00:11:00.233 provides a decentralised redundant server structure 00:11:00.233 --> 00:11:03.366 capable of handling large data collections like ours. 00:11:04.899 --> 00:11:08.200 The archive is run using Invenio RDM, 00:11:08.200 --> 00:11:10.799 the same repository software as Zenodo, 00:11:10.799 --> 00:11:13.933 which enables persistent storage, fast access, 00:11:13.933 --> 00:11:16.899 associated metadata, and clear versioning. 00:11:18.833 --> 00:11:22.533 While the original recordings themselves are not publicly accessible, 00:11:22.533 --> 00:11:24.600 their identity and metadata are, 00:11:24.600 --> 00:11:27.533 allowing us to also link each of them explicitly 00:11:27.533 --> 00:11:31.399 with the transcripts of the public corpus that originate from them. 00:11:37.533 --> 00:11:40.233 And that brings me to the end of my presentation. 
00:11:40.600 --> 00:11:42.600 Of course, I only skimmed the surface 00:11:42.600 --> 00:11:44.033 and even skipped a few aspects, 00:11:44.033 --> 00:11:47.633 such as usage licences and anonymisation, 00:11:47.633 --> 00:11:51.600 so if you have any questions now or later, I’d be happy to answer them. 00:11:51.933 --> 00:11:55.033 And of course please go and check out the corpus itself 00:11:55.033 --> 00:11:57.000 and the many stories it contains. 00:11:57.466 --> 00:11:58.366 Thank you!