
v4.6.0: ViT, DeiT, CLIP, LUKE, BigBirdPegasus, MegatronBERT

@LysandreJik released this 12 May 15:07

Transformers aren't just for text - they can handle a huge range of input types, and there's been a flurry of papers and new models in the last few months applying them to vision tasks that had traditionally been dominated by convolutional networks. With this release, we're delighted to announce that several state-of-the-art pretrained vision and multimodal text+vision transformer models are now accessible in the huggingface/transformers repo. Give them a try!

ViT (@NielsRogge)

Two new models are released as part of the ViT implementation: ViTModel and ViTForImageClassification, in PyTorch.

ViT is a Transformer-based model obtaining state-of-the-art results on image classification tasks. It was the first work to successfully train a Transformer encoder on ImageNet, attaining very good results compared to familiar convolutional architectures.

The Vision Transformer (ViT) model was proposed in An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.

Compatible checkpoints can be found on the Hub: https://huggingface.co/models?filter=vit
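
As a quick illustration, here is a minimal sketch of image classification with the new classes; the checkpoint name and image path below are illustrative choices, not prescriptions from this release.

```python
# Minimal ViT image-classification sketch, assuming the
# google/vit-base-patch16-224 checkpoint and a local RGB image.
from PIL import Image
import torch
from transformers import ViTFeatureExtractor, ViTForImageClassification

feature_extractor = ViTFeatureExtractor.from_pretrained("google/vit-base-patch16-224")
model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

image = Image.open("cat.png")  # any RGB image
inputs = feature_extractor(images=image, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Map the highest-scoring logit to its ImageNet label.
predicted_class = logits.argmax(-1).item()
print(model.config.id2label[predicted_class])
```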

DeiT (@NielsRogge)

Three new models are released as part of the DeiT implementation: DeiTModel, DeiTForImageClassification and DeiTForImageClassificationWithTeacher, in PyTorch.

DeiT is an image transformer model similar to the ViT model. DeiT (data-efficient image transformers) models are more efficiently trained transformers for image classification, requiring far less data and far less computing resources compared to the original ViT models.

The DeiT model was proposed in Training data-efficient image transformers & distillation through attention by Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou.

Compatible checkpoints can be found on the Hub: https://huggingface.co/models?filter=deit
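
A minimal sketch of the distilled classifier is shown below; the facebook/deit-base-distilled-patch16-224 checkpoint and the image path are assumptions for illustration.

```python
# Minimal DeiT sketch using the distillation head.
from PIL import Image
import torch
from transformers import DeiTFeatureExtractor, DeiTForImageClassificationWithTeacher

checkpoint = "facebook/deit-base-distilled-patch16-224"
feature_extractor = DeiTFeatureExtractor.from_pretrained(checkpoint)
model = DeiTForImageClassificationWithTeacher.from_pretrained(checkpoint)

inputs = feature_extractor(images=Image.open("dog.png"), return_tensors="pt")
with torch.no_grad():
    # The "with teacher" head averages the class-token and distillation-token logits.
    logits = model(**inputs).logits

print(model.config.id2label[logits.argmax(-1).item()])
```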

CLIP (@patil-suraj)

Three new models are released as part of the CLIP implementation: CLIPModel, CLIPVisionModel and CLIPTextModel, in PyTorch.

CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs. It can be instructed in natural language to predict the most relevant text snippet given an image, without directly optimizing for the task, similarly to the zero-shot capabilities of GPT-2 and GPT-3.

The CLIP model was proposed in Learning Transferable Visual Models From Natural Language Supervision by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever.

Compatible checkpoints can be found on the Hub: https://huggingface.co/models?filter=clip
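
Here is a minimal zero-shot classification sketch; the openai/clip-vit-base-patch32 checkpoint, the image path and the candidate captions are illustrative assumptions.

```python
# Minimal CLIP zero-shot image classification sketch.
from PIL import Image
import torch
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.png")
texts = ["a photo of a cat", "a photo of a dog"]
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; a softmax turns them
# into probabilities over the candidate captions.
probs = outputs.logits_per_image.softmax(dim=1)
print(dict(zip(texts, probs[0].tolist())))
```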

BigBirdPegasus (@vasudevgupta7)

BigBird is a sparse-attention-based transformer that extends Transformer-based models such as BERT to much longer sequences. In addition to sparse attention, BigBird also applies global attention as well as random attention to the input sequence. Theoretically, it has been shown that applying sparse, global, and random attention approximates full attention while being computationally much more efficient for longer sequences. As a consequence of its capability to handle longer context, BigBird shows improved performance on various long-document NLP tasks, such as question answering and summarization, compared to BERT or RoBERTa.

The BigBird model was proposed in Big Bird: Transformers for Longer Sequences by Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang and others.

  • Add BigBirdPegasus #10991 (@vasudevgupta7)

Compatible checkpoints can be found on the Hub: https://huggingface.co/models?filter=bigbird_pegasus
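
Below is a minimal long-document summarization sketch; the google/bigbird-pegasus-large-arxiv checkpoint and the generation settings are illustrative assumptions.

```python
# Minimal BigBirdPegasus summarization sketch for long inputs.
from transformers import AutoTokenizer, BigBirdPegasusForConditionalGeneration

checkpoint = "google/bigbird-pegasus-large-arxiv"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = BigBirdPegasusForConditionalGeneration.from_pretrained(checkpoint)

long_document = "Replace this with the full text of a long scientific article..."
# Sparse/global/random attention lets the encoder handle sequences far longer
# than BERT's 512-token limit.
inputs = tokenizer(long_document, return_tensors="pt", truncation=True, max_length=4096)

summary_ids = model.generate(**inputs, num_beams=4, max_length=256)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```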

LUKE (@NielsRogge, @ikuyamada)

LUKE is based on RoBERTa and adds entity embeddings as well as an entity-aware self-attention mechanism, which helps improve performance on various downstream tasks involving reasoning about entities such as named entity recognition, extractive and cloze-style question answering, entity typing, and relation classification.

The LUKE model was proposed in LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention by Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda, and Yuji Matsumoto.

Compatible checkpoints can be found on the Hub: https://huggingface.co/models?filter=luke
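
The sketch below shows how entity spans are passed to the tokenizer; the studio-ousia/luke-base checkpoint and the example sentence are illustrative assumptions.

```python
# Minimal LUKE sketch: contextualized word and entity representations.
from transformers import LukeTokenizer, LukeModel

tokenizer = LukeTokenizer.from_pretrained("studio-ousia/luke-base")
model = LukeModel.from_pretrained("studio-ousia/luke-base")

text = "Beyoncé lives in Los Angeles."
entity_spans = [(0, 7)]  # character span of "Beyoncé"
inputs = tokenizer(text, entity_spans=entity_spans, return_tensors="pt")
outputs = model(**inputs)

# In addition to token representations, LUKE returns contextualized entity embeddings.
word_embeddings = outputs.last_hidden_state
entity_embeddings = outputs.entity_last_hidden_state
```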

Megatron (@jdemouth)

The MegatronBERT model is added to the library, giving access to the 345M-parameter variants.

The implementation comes with nine different models: MegatronBertModel, MegatronBertForMaskedLM, MegatronBertForCausalLM, MegatronBertForNextSentencePrediction, MegatronBertForPreTraining, MegatronBertForSequenceClassification, MegatronBertForMultipleChoice, MegatronBertForTokenClassification and MegatronBertForQuestionAnswering, in PyTorch.

The MegatronBERT model was proposed in Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism by Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro.
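
A minimal masked-LM sketch follows; the local directory name is a hypothetical path to a converted 345M checkpoint, not a checkpoint shipped with this release.

```python
# Minimal MegatronBERT masked-LM sketch, assuming a locally converted checkpoint.
import torch
from transformers import BertTokenizer, MegatronBertForMaskedLM

checkpoint_dir = "./megatron-bert-cased-345m"  # hypothetical converted checkpoint
tokenizer = BertTokenizer.from_pretrained(checkpoint_dir)
model = MegatronBertForMaskedLM.from_pretrained(checkpoint_dir)

inputs = tokenizer("Paris is the [MASK] of France.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Find the [MASK] position and read out the highest-scoring token.
mask_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
predicted_id = logits[0, mask_index].argmax(-1).item()
print(tokenizer.convert_ids_to_tokens(predicted_id))
```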

Hub integration in Transformers

The Hugging Face Hub is now more tightly integrated with transformers, through two new features:

  • Models, configurations and tokenizers now have a push_to_hub method to automatically push their state to the hub (a short sketch follows this list).

  • The Trainer can now automatically push its underlying model, configuration and tokenizer in a similar fashion. Additionally, it can create a draft model card on the fly with the training hyperparameters and evaluation results.

  • Auto modelcard #11599 (@sgugger)

  • Trainer push to hub #11328 (@sgugger)
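
A minimal sketch of push_to_hub is shown below; the base checkpoint and repository name are illustrative, and the exact signature may vary slightly across versions.

```python
# Minimal push_to_hub sketch; requires being logged in (e.g. via `transformers-cli login`).
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# ... fine-tune the model ...

model.push_to_hub("my-finetuned-model")      # pushes weights and config
tokenizer.push_to_hub("my-finetuned-model")  # pushes tokenizer files
```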

DeepSpeed ZeRO Stage 3 & ZeRO-Infinity

The Trainer now integrates two additional stages of ZeRO: ZeRO stage 3 for parameter partitioning, and ZeRO-Infinity, which extends CPU offload with NVMe offload.
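
The sketch below shows one way to wire a ZeRO stage 3 config into the Trainer; the config values are illustrative fragments, not tuned recommendations, and DeepSpeed must be installed.

```python
# Hedged sketch: point TrainingArguments at a DeepSpeed ZeRO-3 config file.
import json
from transformers import TrainingArguments

ds_config = {
    "zero_optimization": {
        "stage": 3,                               # ZeRO stage 3: partition parameters across GPUs
        "offload_param": {"device": "cpu"},       # CPU offload; ZeRO-Infinity adds NVMe as a target
        "offload_optimizer": {"device": "cpu"},
    },
}
with open("ds_config_zero3.json", "w") as f:
    json.dump(ds_config, f)

# Launch the script with the DeepSpeed launcher, e.g. `deepspeed your_script.py ...`.
args = TrainingArguments(output_dir="output", deepspeed="ds_config_zero3.json")
```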

Flax

Flax support is getting more robust, with model code stabilizing and new models being added to the library.

TensorFlow

We welcome @Rocketknight1 as a TensorFlow contributor. This version includes a brand new TensorFlow example based on Keras, which will be followed by examples covering most tasks.
Additionally, more TensorFlow setups are now covered, with added support for AMD-based GPUs and M1 Macs.

Pipelines

Two new pipelines are added:

Notebooks

  • [Community notebooks] Add Wav2Vec notebook for creating captions for YT Clips #11142 (@Muennighoff)
  • add bigbird-pegasus evaluation notebook #11654 (@vasudevgupta7)
  • Vit notebooks + vit/deit fixes #11309 (@NielsRogge)

General improvements and bugfixes