WEBVTT

00:00:02.833 --> 00:00:04.700
- Hi, I’m Marc
- And I’m Maria

00:00:04.700 --> 00:00:08.533
And we’d like to introduce you to the Sign Language Dataset Compendium.

00:00:08.533 --> 00:00:11.199
A new overview of sign language resources.

00:00:17.933 --> 00:00:21.333
[Maria:] Basically all sign languages are under-resourced languages.

00:00:21.733 --> 00:00:25.100
Both corpora and lexical resources are rare

00:00:25.100 --> 00:00:28.500
and usually quite limited in size, if they exist at all.

00:00:29.666 --> 00:00:32.666
So, when we started working on project EASIER,

00:00:32.666 --> 00:00:38.366
an EU project working on automatic translation between the signed and spoken languages of Europe,

00:00:38.366 --> 00:00:44.399
one of our first tasks was to compile information on what data might be available to us.

00:00:45.333 --> 00:00:50.100
What we realised was that it was quite difficult to identify many datasets

00:00:50.100 --> 00:00:53.633
and even more difficult to compile sufficient information

00:00:53.633 --> 00:00:58.733
about things like their size, participant demographics or licence conditions.

00:01:00.033 --> 00:01:03.433
A few datasets could be found in language repositories

00:01:03.433 --> 00:01:06.566
and some were mentioned in old literature surveys,

00:01:06.566 --> 00:01:09.066
but mostly we had to rely on web searches

00:01:09.066 --> 00:01:12.466
and our own literature review of over 300 papers.

00:01:13.666 --> 00:01:16.400
From this we created a hundred-page report

00:01:16.400 --> 00:01:20.033
providing a structured outline of European sign language data.

00:01:20.533 --> 00:01:24.333
The responses to this report were so enthusiastic

00:01:24.333 --> 00:01:27.733
that we decided that we wanted to extend our work even further

00:01:27.733 --> 00:01:32.000
and make it a global overview of datasets for all sign languages.

00:01:32.733 --> 00:01:35.266
This meant not only adding more information,

00:01:35.266 --> 00:01:38.566
but also rethinking how we make this information available.

00:01:39.200 --> 00:01:42.400
The result is the Sign Language Dataset Compendium,

00:01:42.400 --> 00:01:46.400
a new website that already lists over a hundred different resources,

00:01:46.400 --> 00:01:47.766
with more to come.

00:01:51.033 --> 00:01:53.066
Let’s have a look at the compendium!

00:01:53.900 --> 00:01:56.366
The website has four main entry types:

00:01:56.633 --> 00:02:02.033
corpora, lexical resources (such as dictionaries and vocabulary indices),

00:02:02.033 --> 00:02:04.966
data collection tasks, and languages.

00:02:06.033 --> 00:02:10.633
Let’s start by looking at one of the 41 corpora that we have so far.

00:02:27.033 --> 00:02:29.766
Each entry has a table of structured information

00:02:29.766 --> 00:02:33.533
and a free-form text section to introduce the dataset

00:02:33.533 --> 00:02:36.966
and include any information that may not have fit in the table.

00:02:37.333 --> 00:02:41.066
The standardised table makes it easy to compare different resources

00:02:41.066 --> 00:02:45.166
as the structure is the same for all corpora listed in the compendium.

00:02:45.933 --> 00:02:51.099
It covers factors like size, languages and participant demographics,

00:02:51.099 --> 00:02:55.066
but also information like file formats, usage licence

00:02:55.066 --> 00:02:58.233
and where to find both data and relevant literature.

00:02:59.099 --> 00:03:02.400
The table structure for lexical resources is similar,

00:03:02.400 --> 00:03:05.599
although some of the table rows are different, of course.

00:03:07.066 --> 00:03:10.400
Further below you can find another section that comes in handy

00:03:10.400 --> 00:03:12.966
when you are looking for comparable data.

00:03:19.900 --> 00:03:23.500
Here we list all the data collection tasks of the corpus

00:03:23.500 --> 00:03:25.866
which have also been used in other corpora.

00:03:26.766 --> 00:03:31.599
These tasks can range from retellings of specific materials, like the Frog Story,

00:03:31.599 --> 00:03:36.933
to general prompts, like asking participants to relate their deaf life experiences.

00:03:37.666 --> 00:03:41.166
You will find information on which task is used by the corpus,

00:03:41.166 --> 00:03:43.466
which language it was performed in,

00:03:43.466 --> 00:03:48.400
how many recordings you can find in the open or restricted access parts of the corpus

00:03:48.400 --> 00:03:51.599
and also where you can find those specific recordings.

00:03:55.800 --> 00:04:00.333
To get more information on the task itself you can click on the name of the task.

00:04:00.666 --> 00:04:04.466
This will lead you directly to the task entry in the compendium.

00:04:12.766 --> 00:04:17.600
As you can see, entries for data collection tasks follow a very similar structure,

00:04:17.600 --> 00:04:19.933
giving a general description of the task,

00:04:19.933 --> 00:04:22.266
followed by a table providing information

00:04:22.266 --> 00:04:25.666
like a general classification of the stimulus type

00:04:25.666 --> 00:04:28.533
and where the stimulus materials may be found.

00:04:29.666 --> 00:04:33.933
Again, this is followed by a section linking tasks and corpora,

00:04:33.933 --> 00:04:38.466
except this time, it identifies all the corpora that include this task.

00:04:44.266 --> 00:04:46.933
Another way to look for relevant resources

00:04:46.933 --> 00:04:50.199
is to start by looking up the language you are interested in.

00:04:50.566 --> 00:04:55.100
The language index lists them by their most common English name and acronym,

00:04:55.100 --> 00:04:58.866
but you can also look them up using other names via the search filter.

00:05:15.600 --> 00:05:20.666
Apart from linking to all the corpora and lexical resources that cover the language,

00:05:20.666 --> 00:05:25.066
every entry also lists the ISO and Glottolog codes of the language

00:05:25.066 --> 00:05:29.366
as well as the established names and acronyms that we could determine.

00:05:34.133 --> 00:05:36.500
Now that we have had a glance at the compendium

00:05:36.500 --> 00:05:41.566
we would like to explain what criteria we use to curate our selection of resources.

00:05:42.433 --> 00:05:46.866
First of all, every dataset has to fulfill some qualitative conditions,

00:05:46.866 --> 00:05:50.833
such as representing natural language use by L1 signers,

00:05:50.833 --> 00:05:55.466
rather than interpreted spoken language content or language learner recordings.

00:05:56.600 --> 00:06:01.366
On the quantitative side, each resource must meet certain size requirements,

00:06:01.366 --> 00:06:05.866
although those requirements have changed from those of the original report.

00:06:07.199 --> 00:06:12.699
As mentioned before, we started the Compendium as an overview of European sign language resources

00:06:12.699 --> 00:06:17.333
with a focus on finding datasets that would be useful for machine learning.

00:06:17.699 --> 00:06:20.433
This made us disregard smaller resources

00:06:20.433 --> 00:06:24.933
in favour of larger ones that provide significant amounts of annotations.

00:06:26.166 --> 00:06:28.766
It became apparent that even within Europe

00:06:28.766 --> 00:06:33.199
there were large differences in how much the creation of sign language resources

00:06:33.199 --> 00:06:35.233
was supported in different countries.

00:06:35.566 --> 00:06:38.333
This imbalance became even more pressing

00:06:38.333 --> 00:06:41.166
when we expanded our goals to a global search.

00:06:41.166 --> 00:06:46.300
Sticking with our original criteria would mean not covering many languages at all.

00:06:46.899 --> 00:06:50.333
At the same time, simply lowering our requirements would mean

00:06:50.333 --> 00:06:54.833
that those languages with comparatively strong support and funding structures

00:06:54.833 --> 00:06:56.699
might dominate the compendium

00:06:56.699 --> 00:06:59.633
through the inclusion of many of their smaller datasets.

00:06:59.966 --> 00:07:03.866
So we decided to work with two sets of curation criteria:

00:07:03.866 --> 00:07:06.666
strict criteria and minimal criteria.

00:07:07.366 --> 00:07:10.766
All datasets have to meet the minimal criteria,

00:07:10.766 --> 00:07:15.100
but if some datasets of a given language also meet the strict criteria,

00:07:15.100 --> 00:07:19.500
we prioritise those and then move on to other languages first.

00:07:19.833 --> 00:07:23.399
We may, of course, come back to the smaller resources in future,

00:07:23.399 --> 00:07:25.233
once we have covered all languages

00:07:25.233 --> 00:07:29.666
and have developed some filter functions to avoid mini-dataset overload.

00:07:34.433 --> 00:07:38.166
[Marc:] As you have seen, we try to make the entries of the compendium comparable

00:07:38.166 --> 00:07:41.333
by structuring and standardising the information we provide.

00:07:42.033 --> 00:07:44.533
At the same time, we had to be realistic

00:07:44.533 --> 00:07:48.966
and see what information would actually be available for enough resources.

00:07:49.800 --> 00:07:53.000
We also wanted to stay flexible enough to include resources

00:07:53.000 --> 00:07:56.033
with varying kinds and granularities of information.

00:07:56.399 --> 00:07:58.333
So we defined a general structure,

00:07:58.333 --> 00:08:01.733
but allowed the individual sections to mostly be free-form text.

00:08:02.833 --> 00:08:06.899
Internally this free-form text is enhanced with XML markup

00:08:06.899 --> 00:08:12.300
to allow us to interconnect entries, include citations, or format them otherwise.

00:08:13.633 --> 00:08:16.300
We are currently working on extending this markup

00:08:16.300 --> 00:08:20.333
to help us with extracting machine-readable metadata from our descriptions.

00:08:20.333 --> 00:08:22.766
Such metadata will, on the one hand,

00:08:22.766 --> 00:08:27.433
help us in introducing new features like search filters and sorting functions,

00:08:27.433 --> 00:08:30.899
but it will also make us more compatible with other services.

00:08:32.466 --> 00:08:35.366
As one example, we are currently working on being registered

00:08:35.366 --> 00:08:40.399
as a data provider for the Open Language Archives Community, OLAC for short.

00:08:41.333 --> 00:08:46.833
This helps both our visibility and the visibility of the resources we describe.

00:08:50.833 --> 00:08:52.866
During our search for datasets,

00:08:52.866 --> 00:08:57.000
we did also come across a small number of other resource overviews.

00:08:57.500 --> 00:09:02.333
These were usually one-off reports that had by now become somewhat outdated.

00:09:03.799 --> 00:09:07.166
Our goal is to treat the compendium as a growing resource

00:09:07.166 --> 00:09:09.733
that will be extended and revised over time.

00:09:10.633 --> 00:09:13.933
We plan to maintain this effort for at least the next five years.

00:09:14.200 --> 00:09:18.766
In that time, we also hope to develop a sustainable long-term strategy.

00:09:19.833 --> 00:09:23.233
To make the compendium more independent from factors like web hosting,

00:09:23.233 --> 00:09:26.500
we plan to also generate a PDF version of its contents,

00:09:26.500 --> 00:09:30.100
which will be easier to archive and distribute independently.

00:09:35.066 --> 00:09:38.733
With the Sign Language Dataset Compendium we present a resource

00:09:38.733 --> 00:09:42.266
that we hope will help researchers and other interested parties

00:09:42.266 --> 00:09:44.866
in finding suitable sign language resources.

00:09:46.100 --> 00:09:50.533
We also hope that our efforts in structuring information about sign language datasets

00:09:50.533 --> 00:09:56.399
will lead to realistic standards that creators will find both usable and worth adopting.

00:09:57.133 --> 00:10:01.600
As the compendium grows over time, we will not only add more resources,

00:10:01.600 --> 00:10:05.066
but also revise our website features and curation criteria.

00:10:06.033 --> 00:10:10.066
If you have any suggestions for how we can make the compendium even better,

00:10:10.066 --> 00:10:12.100
we would be more than happy to discuss them.

00:10:12.633 --> 00:10:13.466
Thank you.