Transformers v4.0.0 announcement

We are working on a new major release that should come out at the end of next week, with cool new features that will unfortunately result in some breaking changes. There will be one last release for v3 before we start introducing those breaking changes on master, so if you’re using a source installation, be prepared or revert to v3.5.0 for a bit :hugs:

  • AutoTokenizers and pipeline will switch to Fast tokenizers by default
    => Resulting breaking change: the slow and fast tokenizers have roughly the same API, but they handle overflowing tokens differently (see the tokenizer sketch after this list).
    => Why are we doing this: this will greatly speed up tokenization in pipelines and enable clearer, simpler example scripts leveraging the fast tokenizers. The overflow handling of Fast tokenizers is also a lot more powerful than its counterpart in slow tokenizers.

  • sentencepiece will be removed as a required dependency. (It will still need to be installed to use the slow SentencePiece-based tokenizers.)
    => Resulting breaking change: some people will have to install sentencepiece explicitly, which they didn’t have to before, with the command pip install transformers[sentencepiece].
    => Why are we doing this? This, in turn, will allow us to create and maintain a conda channel offering the full Hugging Face suite on conda.

  • Reorganizing the library internals into subfolders (either one per model, or one for all models, one for all tokenizers, one for pipelines, one for the trainer, etc.). With the number of models growing, the source folder is getting too hard to navigate.
    => Resulting breaking change: people importing directly from the internal modules will have to update the paths they use (see the import sketch after this list). If you only use imports from transformers directly, nothing will break.
    => Why are we doing this? The library will scale more robustly as more models are added.

  • Switching the default of the return_dict argument to True. This argument, which makes model outputs self-documenting, was introduced a few months ago with a default of False for backward compatibility.
    => Resulting breaking change: unpacking the output of a model with commands like loss, logits = model(**inputs) won’t work anymore (see the unpacking sketch after this list). The to_tuple method can convert a model output to a tuple.
    => Why are we doing this? Outputs of the model are easier to understand when they are ModelOutput objects: you can index them like a dict, or use auto-complete in an IDE to find all fields. This will also allow us to optimize the TensorFlow models further (tuples of varying size are incompatible with graph mode).

  • Deprecated arguments or functions will be removed on a case-by-case basis.
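
To make the overflow difference concrete, here is a minimal sketch (the bert-base-uncased checkpoint and the parameter values are just an example, not from the announcement) of how the two tokenizer families return overflowing tokens:

```python
from transformers import AutoTokenizer

text = "some fairly long text " * 50

# Fast tokenizer (the new default): overflowing pieces come back as
# additional encodings, plus a mapping from each chunk to its sample.
fast_tok = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
fast_enc = fast_tok(
    text,
    max_length=32,
    truncation=True,
    stride=8,
    return_overflowing_tokens=True,
)
print(len(fast_enc["input_ids"]))              # several chunks of <= 32 tokens
print(fast_enc["overflow_to_sample_mapping"])  # all chunks map back to sample 0

# Slow tokenizer (the old default): a single truncated encoding, with the
# tokens that didn't fit returned as one flat "overflowing_tokens" list.
slow_tok = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=False)
slow_enc = slow_tok(
    text,
    max_length=32,
    truncation=True,
    stride=8,
    return_overflowing_tokens=True,
)
print(len(slow_enc["input_ids"]))           # one truncated sequence
print(len(slow_enc["overflowing_tokens"]))  # the leftover token ids
```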
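
As an illustration of the import change: public top-level imports stay stable, and only deep internal paths move. The v4 path below is an assumption for illustration, since the final layout is not decided yet:

```python
# Public, top-level imports are unaffected by the reorganization:
from transformers import BertModel, BertTokenizer

# Direct imports of internal modules will need new paths. For example
# (the v4 path is hypothetical; the final layout is still being decided):
#   v3:  from transformers.modeling_bert import BertSelfAttention
#   v4:  from transformers.models.bert.modeling_bert import BertSelfAttention
```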
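
And a minimal before/after sketch of the return_dict change (the model class and checkpoint are just an example):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

inputs = tokenizer("Hello, world!", return_tensors="pt")
inputs["labels"] = torch.tensor([1])  # so the model also returns a loss

# v3 default (return_dict=False) -- this tuple unpacking breaks in v4:
# loss, logits = model(**inputs)

outputs = model(**inputs)

# v4 default: a self-documented ModelOutput with named fields
loss, logits = outputs.loss, outputs.logits

# If you need the old tuple behavior, convert explicitly...
loss, logits = outputs.to_tuple()[:2]
# ...or ask for it on a per-call basis:
loss, logits = model(**inputs, return_dict=False)[:2]
```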


Reorganizing the whole repo was bound to be necessary at some point. If anything, it is a testament to how great a library it is, with ever-increasing features and models.

Good luck with the release!


Edit up there: since there is nothing really new and we know 3.5.0 is stable (no bug reports), we decided not to do a v3.6.0. Breaking changes on master will start tomorrow.
