1.0.0 Release: New name, Speed-ups, Multimodal, Serialization

@lhoestq lhoestq released this 11 Sep 10:19

Package Changes

  • Rename: nlp -> datasets

Update now with

pip install datasets
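
During the transition, some environments may still have only the old nlp package installed. The following is a minimal sketch, not part of the library, of how a script could detect which package name is available before importing it. The helper name available_package_name is hypothetical.

```python
import importlib.util


def available_package_name(candidates=("datasets", "nlp")):
    """Return the first importable package name from `candidates`, or None.

    Hypothetical helper: it only checks whether a top-level package can be
    found, it does not import it.
    """
    for name in candidates:
        if importlib.util.find_spec(name) is not None:
            return name
    return None


# Prefers the new name and falls back to the old one if only that is installed.
package_name = available_package_name()
```

A caller could then pass the returned name to importlib.import_module, or report that neither package is installed when the result is None.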

Dataset Features

  • Keep the dataset format after dataset transforms (#607)
  • Pickle support (#536)
  • Save and load datasets to/from disk (#571)
  • Multiprocessing in map and filter (#552)
  • Multi-dimensional arrays support for multi-modal datasets (#533, #363)
  • Speed up Tokenization by optimizing casting to python objects (#523)
  • Speed up shuffle/shard/select methods - use indices mappings (#513)
  • Add input_column parameter in map and filter (#475)
  • Speed up download and processing (#563)
  • Indexed datasets for hybrid models (REALM/RAG/MARGE) (#500)
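
To illustrate why an indices mapping can speed up shuffle, shard and select, here is a conceptual sketch in plain Python. It is not the library's implementation: the IndexedView class and its methods are hypothetical stand-ins. The idea shown is that each operation only builds a new list of integer positions while the underlying rows stay shared and uncopied.

```python
import random


class IndexedView:
    """Read-only view over `rows`, addressed through an indices mapping."""

    def __init__(self, rows, indices=None):
        self._rows = rows  # shared between views, never copied
        self._indices = list(range(len(rows))) if indices is None else list(indices)

    def __len__(self):
        return len(self._indices)

    def __getitem__(self, position):
        return self._rows[self._indices[position]]

    def to_list(self):
        return [self._rows[i] for i in self._indices]

    def select(self, positions):
        # Compose the requested positions with the current mapping.
        return IndexedView(self._rows, [self._indices[p] for p in positions])

    def shuffle(self, seed):
        permuted = list(self._indices)
        random.Random(seed).shuffle(permuted)
        return IndexedView(self._rows, permuted)

    def shard(self, num_shards, index):
        # Every num_shards-th element, starting at `index`.
        return IndexedView(self._rows, self._indices[index::num_shards])


rows = ["a", "b", "c", "d"]
view = IndexedView(rows)
picked = view.select([2, 0])
second_shard = view.shard(num_shards=2, index=1)
shuffled = view.shuffle(seed=0)
```

Because only the integer mapping changes, chaining several of these operations costs memory proportional to the number of selected rows, not to the size of the stored data.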

Dataset Changes

  • New: IWSLT 2017 (#470)
  • New: CommonGen Dataset (#578)
  • New: CLUE Benchmark (11 datasets) (#572)
  • New: the KILT knowledge source and tasks (#559)
  • New: DailyDialog (#556)
  • New: DoQA dataset (ACL 2020) (#473)
  • New: reuters21578 (#570)
  • New: HANS (#551)
  • New: MLSUM (#529)
  • New: Guardian authorship (#452)
  • New: web_questions (#401)
  • New: MS MARCO (#364)
  • Update: Germeval14 - update download url (#594)
  • Update: LinCE - update download url (#550)
  • Update: Hyperpartisan news detection - update download url, manual download no longer required (#504)
  • Update: Rotten Tomatoes - update download url (#484)
  • Update: Wiki DPR - Use HNSW faiss index (#500)
  • Update: Text - Speed up using multi-threaded PyArrow loading (#548)
  • Fix: GLUE, PAWS-X - skip header (#497)

[Breaking] Update Dataset and DatasetDict API (#459)

  • Rename the flatten, drop and dictionary_encode_column methods to flatten_, drop_ and dictionary_encode_column_ to indicate that these methods operate in place
  • Remove the dataset.columns property and dataset.nbytes
  • Add a few more properties and methods to DatasetDict
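
Code that has to run against both the old and the new method names could use a small dispatch helper. The sketch below is an assumption about how one might bridge the rename, not something the library provides. call_in_place, OldStyleDataset and NewStyleDataset are hypothetical names, and the two classes are stand-ins used only to make the example self-contained.

```python
def call_in_place(dataset, method_name, *args, **kwargs):
    """Call `method_name_` if it exists, otherwise fall back to `method_name`.

    Hypothetical helper for code that must support both naming schemes.
    """
    for candidate in (method_name + "_", method_name):
        method = getattr(dataset, candidate, None)
        if callable(method):
            return method(*args, **kwargs)
    raise AttributeError(
        f"{type(dataset).__name__} has neither {method_name}_ nor {method_name}"
    )


class OldStyleDataset:
    """Stand-in for an object exposing the pre-rename method."""

    def flatten(self):
        return "called flatten"


class NewStyleDataset:
    """Stand-in for an object exposing the renamed in-place method."""

    def flatten_(self):
        return "called flatten_"


old_result = call_in_place(OldStyleDataset(), "flatten")
new_result = call_in_place(NewStyleDataset(), "flatten")
```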

Metric Features

  • Disallow the use of positional arguments to avoid predictions vs references mistakes (#466)
  • Allow feeding numpy/pytorch/tensorflow/pandas objects directly to metrics (#466)
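
The reason for disallowing positional arguments can be shown with a small keyword-only function. This is an illustrative sketch in plain Python, not the library's metric code. The precision function below is a hypothetical stand-in chosen because its result changes when predictions and references are swapped.

```python
def precision(*, predictions, references):
    """Fraction of predicted positives (label 1) that are truly positive.

    The bare `*` makes both parameters keyword-only, so a call such as
    precision(refs, preds) fails instead of silently swapping the roles.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    predicted_positive = sum(1 for p in predictions if p == 1)
    if predicted_positive == 0:
        return 0.0
    true_positive = sum(
        1 for p, r in zip(predictions, references) if p == 1 and r == 1
    )
    return true_positive / predicted_positive


correct_order = precision(predictions=[1, 1, 0], references=[1, 0, 0])
swapped_order = precision(predictions=[1, 0, 0], references=[1, 1, 0])
```

The two calls return different values, which is exactly the kind of mistake that goes unnoticed when arguments are passed by position.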

Metric Changes

  • New: METEOR metric (#479)
  • Fix: Sacrebleu - fix inputs format (#520)

Loading script Features

  • Pin the version of the scripts (reproducibility) (#603, #584)
  • Specify default script_version with the env variable HF_SCRIPTS_VERSION (#584)
  • Save scripts in a modules cache directory that can be controlled with HF_MODULES_CACHE (#574)
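
As a minimal sketch of how these two environment variables might be set from Python: the values below are assumptions chosen for illustration (a version string and a project-local cache path), and setting them before importing the library is a cautious ordering in case they are read at import time.

```python
import os
from pathlib import Path

# Assumed example value: pin loading scripts to a specific version.
os.environ["HF_SCRIPTS_VERSION"] = "1.0.0"

# Assumed example path: keep the modules cache inside a project-specific folder.
modules_cache = Path.home() / ".cache" / "my_project" / "hf_modules"
os.environ["HF_MODULES_CACHE"] = str(modules_cache)

# Import the library only after the variables are set, for example:
# import datasets
```

The same variables can also be exported in the shell that launches the script, which avoids depending on import order.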

Caching

  • Better support for tokenizers when caching map results (#601)
  • Faster caching for text dataset (#573, #502)
  • Use dataset fingerprints, updated after each transform (#536)
  • Refactor caching behavior, pickle/cloudpickle metrics and dataset, add tests on metrics (#518)
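
The idea of a fingerprint that is updated after each transform can be sketched as a hash chain. This is a conceptual illustration only, not the library's algorithm: update_fingerprint is a hypothetical function, and the choice of SHA-256 over a JSON payload is an assumption made to keep the example self-contained.

```python
import hashlib
import json


def update_fingerprint(fingerprint, transform_name, transform_args):
    """Derive a new fingerprint from the previous one and the applied transform.

    Identical histories of transforms give identical fingerprints, so a cached
    result can be looked up by fingerprint instead of being recomputed.
    """
    payload = json.dumps(
        {"previous": fingerprint, "transform": transform_name, "args": transform_args},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


initial = "0" * 16
after_shuffle = update_fingerprint(initial, "shuffle", {"seed": 42})
after_select = update_fingerprint(after_shuffle, "select", {"indices": [0, 1]})
```

Changing any argument, or the order of the transforms, produces a different final fingerprint and therefore a different cache entry.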

Documentation

  • Metrics documentation (#579)

Miscellaneous

  • Add centralized logging; cache loads are now logged at warning level (#538)

Bug fixes

  • Datasets: [Breaking] fixed typo in "formated_as" method: rename formated to formatted (#516)
  • Datasets: fixed the error message when loading text/csv/json without providing data files (#586)
  • Datasets: fixed select method for pyarrow < 1.0.0 (#585)
  • Datasets: fixed elasticsearch result ids returning as strings (#487)
  • Datasets: fixed config used for slow test on real dataset (#527)
  • Datasets: fixed tensorflow-formatted dataset outputs by using ragged tensors by default (#530)
  • Datasets: fixed batched map for formatted dataset (#515)
  • Datasets: fixed encoding issues on Windows - apply utf-8 encoding to all datasets (#481)
  • Datasets: fixed dataset.map for function without outputs (#506)
  • Datasets: fixed bad type in overflow check (#496)
  • Datasets: fixed dataset info save - don't use beam fs to save info for local cache dir (#498)
  • Datasets: fixed array outputs - stack vectors in numpy, pytorch and tensorflow (#495, #494)
  • Metrics: fixed locking in distributed settings if one process finished before the other started writing (#564, #547)