[Open-to-the-community] One week team-effort to reach v2.0 of HF datasets library

Hi all,

We are planning to do one of the biggest team efforts we have ever done next week (Nov 30th to Dec 4th) to reach v2.0 of the datasets library (Edit: final day extended to next Wednesday, Dec 9th!).

The effort will involve more than half of HuggingFace (!), with about 15 people, including members who’ve helped define the library like @lhoestq, @yjernite, @joeddav, @jplu, @patrickvonplaten, members of the research team like @teven and @VictorSanh, the OSS team like @Narsil, newcomers like @abhishek, awesome part-time members like @aymm and @canwenxu, and many others including @madlag or yours truly. (Edit: And now over 200 external participants as well :exploding_head:)

It will be targeted toward adding and tagging a large number of NLP datasets in the :hugs: datasets library, with the goal of reaching 500+ datasets and covering and organizing as much of the NLP dataset ecosystem as we possibly can.

We are taking the opportunity to develop some tools to more easily add and tag datasets in the library, as well as to create dataset cards for them.
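To give a concrete picture of what the sprint produces: once a dataset script and its dataset card are merged, the dataset becomes loadable by name through the library. Here is a minimal sketch of that end-user side (the dataset name "squad" is just an illustrative, already-available example):

```python
# Minimal sketch: load a dataset that has been added to the datasets library.
from datasets import load_dataset

# "squad" stands in for any dataset name merged during the sprint.
squad = load_dataset("squad", split="train")

print(squad[0])        # one example, as a plain Python dict
print(squad.features)  # the typed schema declared in the dataset script
```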

After internal discussion, we have decided to open this time-limited project to external contributors, in case you want to get a little taste of what it is like to participate in an internal HuggingFace team effort.

Basically, you can ping me or any of us and I will add you to the Slack channel and give you access to the tools we use, as well as detailed information on the workflow and a list of datasets that we think are worth adding.

There might be (Edit: “will definitely be”) a small reward in the form of HuggingFace swag, and of course the credit of having contributed to this project, but keep in mind that this is an open-source effort: join if you want to make an open contribution and enjoy a bit of the HuggingFace vibe. This is not an internship or job offer (for that, you should check out and apply to our profile on AngelList!). We expect most of the work to be done by the full-time members of HuggingFace, but we are always happy to share how we work and to collaborate with external contributors, which is why we are opening this project.

what is it about:

  • we are adding a lot of new datasets to the library (covering many NLP tasks, and we would in particular like more datasets in low-resource languages) with the aim of covering as much ground as possible

how you can join:

  • post here to say that you want to participate and I will add you to our Slack => That’s it :slight_smile:

what you’ll get:

  • enjoy a bit of HuggingFace vibe by joining the team sprint
  • receive a special event gift (actually 2 gifts, see this post further down the thread for details!), because it’s really amazing to see the community so involved, and we wanted to commemorate this event!

BIG UPDATE
We have just extended the deadline to next Wednesday (Dec 9th), so latecomers can still participate!

SECOND BIG UPDATE
A lot of people are still joining (on the way to 300 participants :exploding_head:), so we are extending the deadline a bit once more, though it will be a limited extension because we have to end the project at some point :sweat_smile:

:question: More precisely:
All participants who have opened at least 1 PR before the end of Wednesday (Dec 9th) can continue adding datasets until the end of Sunday (Dec 13th), and those datasets will still count toward the sprint.

:shirt::coffee: In other words:
If you have opened 1 PR before Wednesday (and are thus eligible for the special event tee-shirt goody :wink:), you will have until the end of Sunday to add 2 more datasets if you want, and to join the main-contributors channel of the Slack (+ get the special event mug).

Open-sourcely yours,

Thom

47 Likes

Love this! I’ll be too preoccupied over the coming weeks, but I’ll definitely join in if such an event is held again in the future!

3 Likes

@thomwolf @lhoestq I want to join and work on adding larger datasets used for model pre-training. I’d start by preparing the datasets used to train Alexa’s Bort. We already have Wikipedia, BookCorpus, and OpenWebText; I want to add Wiktionary, UrbanDictionary, One Billion Words, and the news subset of Common Crawl (Nagel, 2016).

I’d like to contribute to the creation of datasets tooling so that any researcher working on a next-gen LM can quickly and easily make arbitrary “brews” of these large datasets and use them in pre-training, for example something along the lines of the sketch below.
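A rough, hypothetical sketch of what such a “brew” could look like with what the library already offers (the dataset names here are just examples):

```python
# Hypothetical sketch: mix several large corpora from the datasets library
# into one pre-training "brew".
from datasets import load_dataset, concatenate_datasets

# Both of these corpora expose a single "text" column, so their schemas match.
books = load_dataset("bookcorpus", split="train")
web = load_dataset("openwebtext", split="train")

brew = concatenate_datasets([books, web])
print(brew)  # one Dataset containing both corpora, ready for tokenization
```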

Cheers,
Vladimir

4 Likes

Count me in please.

1 Like

Hey, how do you decide which datasets to add? Any priorities for low-resource languages? I am interested in working with Arabic datasets.

3 Likes

@Zaid yes, we are aiming to improve coverage of low-resource languages! Would you mind posting some of the Arabic datasets you’d like to see added? (And if you can add links to the data location and paper if available, that would be fantastic!)

3 Likes

Hi, I’m happy to contribute. Do you have a backlog?
Regarding low-resource languages, I’m interested in working with Portuguese datasets.

1 Like

I’ve added a note to the main post. Basically, if you know of datasets you would like to see added to the library (e.g. in Portuguese), feel free to post them here :slight_smile: with a link to their location, for instance.

3 Likes

I’m interested in contributing to this effort.

1 Like

This is an initial list for Arabic

| Dataset | Paper | Link |
| --- | --- | --- |
| SOQAL | Neural Arabic Question Answering | https://github.com/husseinmozannar/SOQAL |
| HARD | Hotel Arabic-Reviews Dataset Construction for Sentiment Analysis Applications | https://github.com/elnagara/HARD-Arabic-Dataset |
| ArsentD-LEV | A Multi-Topic Corpus for Target-based Sentiment Analysis in Arabic Levantine Tweets | http://oma-project.com/ArSenL/ArSenTD_Lev_Intro |
| ANERcorp | ANERsys: An Arabic Named Entity Recognition System Based on Maximum Entropy | http://curtis.ml.cmu.edu/w/courses/index.php/ANERcorp |
| LABR | A Large-Scale Arabic Book Reviews Dataset | https://github.com/mohamedadaly/LABR |
| AJGT | N/A | https://github.com/komari6/Arabic-twitter-corpus-AJGT |
| Multi-datasets | Building Large Arabic Multi-domain Resources for Sentiment Analysis | https://github.com/hadyelsahar/large-arabic-sentiment-analysis-resouces |
| TEAD | Using Tweets and Emojis to Build TEAD: an Arabic Dataset for Sentiment Analysis | https://github.com/HSMAabdellaoui/TEAD |
| COVID-19 dataset | Large Arabic Twitter Dataset on COVID-19 | https://github.com/SarahAlqurashi/COVID-19-Arabic-Tweets-Dataset |
4 Likes

Nice initiative!

It would be cool to add (more to come as I think of them):

| Dataset | Paper | Link | Comments |
| --- | --- | --- | --- |
| PAN-X / Wikiann | Massively Multilingual Transfer for NER | https://github.com/afshinrahimi/mmner | Although a subset of this dataset is available in the XTREME dataset, XTREME doesn’t have all the languages and forces you to do a clunky manual download. |
| NOAH’s Corpus of Swiss German Dialects | Compilation of a Swiss German Dialect Corpus and its Application to PoS Tagging | https://noe-eva.github.io/NOAH-Corpus/ | PoS tagged |
| The ArchiMob corpus | ArchiMob - A Corpus of Spoken Swiss German | https://drive.switch.ch/index.php/s/vYZv9sNKetuPYTn | PoS tagged, download possible via curl |
4 Likes

Adding an initial list for datasets in Portuguese:

| Dataset | Paper | Link |
| --- | --- | --- |
| b5 corpus | Building a Corpus for Personality-dependent Natural Language Understanding and Generation | https://drive.google.com/file/d/0B-KyU7T8S8bLTHpaMnh2U2NWZzQ/view |
| BlogSet-BR | BlogSet-BR: A Brazilian Portuguese Blog Corpus | https://www.inf.pucrs.br/linatural/wordpress/recursos-e-ferramentas/blogset-br/ |
| MilkQA Dataset | MilkQA: a Dataset of Consumer Questions for the Task of Answer Selection | nilc.icmc.usp.br/nilc/index.php/milkqa/ |
3 Likes

@thomwolf that’s a cool initiative! Would love to be a part of it. Can probably help you with Russian.

1 Like

This is a nice initiative. I would like to contribute Swahili data: http://opus.nlpl.eu/download.php?f=GoURMET/v1/xml/sw.zip

2 Likes

@thomwolf I am very excited about this project and thrilled to join and contribute.

1 Like

I would love to contribute to development.

1 Like

I would like to contribute!

1 Like

Would be fantastic to contribute, +1 to be added :smiley:

1 Like

@thomwolf, amazing. Please count me in too :slight_smile:. I would love to do my part for a Sanskrit dataset.
Dataset: https://zenodo.org/record/803508#
Paper: https://www.aclweb.org/anthology/W17-2214.pdf

3 Likes

Eager to help out!

1 Like