
v4.6.0: ViT, DeiT, CLIP, LUKE, BigBirdPegasus, MegatronBERT

@LysandreJik released this 12 May 15:07

Transformers aren't just for text - they can handle a huge range of input types, and there's been a flurry of papers and new models in the last few months applying them to vision tasks that had traditionally been dominated by convolutional networks. With this release, we're delighted to announce that several state-of-the-art pretrained vision and multimodal text+vision transformer models are now accessible in the huggingface/transformers repo. Give them a try!

ViT (@NielsRogge)

Two new models are released as part of the ViT implementation: ViTModel and ViTForImageClassification, in PyTorch.

ViT is a Transformer-based model obtaining state-of-the-art results on image classification tasks. It was the first work to successfully train a Transformer encoder on ImageNet, attaining very good results compared to familiar convolutional architectures.

The Vision Transformer (ViT) model was proposed in An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.

Compatible checkpoints can be found on the Hub: https://huggingface.co/models?filter=vit
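
As a quick illustration, here is a minimal sketch of image classification with the new classes; the checkpoint name and image path below are illustrative choices, not prescriptions from this release.

```python
# Minimal ViT image-classification sketch, assuming the
# google/vit-base-patch16-224 checkpoint and a local RGB image.
from PIL import Image
import torch
from transformers import ViTFeatureExtractor, ViTForImageClassification

feature_extractor = ViTFeatureExtractor.from_pretrained("google/vit-base-patch16-224")
model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

image = Image.open("cat.png")  # any RGB image
inputs = feature_extractor(images=image, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Map the highest-scoring logit to its ImageNet label.
predicted_class = logits.argmax(-1).item()
print(model.config.id2label[predicted_class])
```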

DeiT (@NielsRogge)

Three new models are released as part of the DeiT implementation: DeiTModel, DeiTForImageClassification and DeiTForImageClassificationWithTeacher, in PyTorch.

DeiT is an image transformer model similar to the ViT model. DeiT (data-efficient image transformers) models are more efficiently trained transformers for image classification, requiring far less data and far less computing resources compared to the original ViT models.

The DeiT model was proposed in Training data-efficient image transformers & distillation through attention by Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou.

Compatible checkpoints can be found on the Hub: https://huggingface.co/models?filter=deit
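
A minimal sketch of the distilled classifier is shown below; the facebook/deit-base-distilled-patch16-224 checkpoint and the image path are assumptions for illustration.

```python
# Minimal DeiT sketch using the distillation head.
from PIL import Image
import torch
from transformers import DeiTFeatureExtractor, DeiTForImageClassificationWithTeacher

checkpoint = "facebook/deit-base-distilled-patch16-224"
feature_extractor = DeiTFeatureExtractor.from_pretrained(checkpoint)
model = DeiTForImageClassificationWithTeacher.from_pretrained(checkpoint)

inputs = feature_extractor(images=Image.open("dog.png"), return_tensors="pt")
with torch.no_grad():
    # The "with teacher" head averages the class-token and distillation-token logits.
    logits = model(**inputs).logits

print(model.config.id2label[logits.argmax(-1).item()])
```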

CLIP (@patil-suraj)

Three new models are released as part of the CLIP implementation: CLIPModel, CLIPVisionModel and CLIPTextModel, in PyTorch.

CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs. It can be instructed in natural language to predict the most relevant text snippet given an image, without directly optimizing for the task, similarly to the zero-shot capabilities of GPT-2 and GPT-3.

The CLIP model was proposed in Learning Transferable Visual Models From Natural Language Supervision by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever.

Compatible checkpoints can be found on the Hub: https://huggingface.co/models?filter=clip
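
Here is a minimal zero-shot classification sketch; the openai/clip-vit-base-patch32 checkpoint, the image path and the candidate captions are illustrative assumptions.

```python
# Minimal CLIP zero-shot image classification sketch.
from PIL import Image
import torch
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.png")
texts = ["a photo of a cat", "a photo of a dog"]
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; a softmax turns them
# into probabilities over the candidate captions.
probs = outputs.logits_per_image.softmax(dim=1)
print(dict(zip(texts, probs[0].tolist())))
```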

BigBirdPegasus (@vasudevgupta7)

BigBird is a sparse-attention-based transformer that extends Transformer-based models such as BERT to much longer sequences. In addition to sparse attention, BigBird also applies global attention as well as random attention to the input sequence. Theoretically, it has been shown that applying sparse, global, and random attention approximates full attention while being computationally much more efficient for longer sequences. As a consequence of its capability to handle longer context, BigBird shows improved performance on various long-document NLP tasks, such as question answering and summarization, compared to BERT or RoBERTa.

The BigBird model was proposed in Big Bird: Transformers for Longer Sequences by Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang and others.

  • Add BigBirdPegasus #10991 (@vasudevgupta7)

Compatible checkpoints can be found on the Hub: https://huggingface.co/models?filter=bigbird_pegasus
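
Below is a minimal long-document summarization sketch; the google/bigbird-pegasus-large-arxiv checkpoint and the generation settings are illustrative assumptions.

```python
# Minimal BigBirdPegasus summarization sketch for long inputs.
from transformers import AutoTokenizer, BigBirdPegasusForConditionalGeneration

checkpoint = "google/bigbird-pegasus-large-arxiv"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = BigBirdPegasusForConditionalGeneration.from_pretrained(checkpoint)

long_document = "Replace this with the full text of a long scientific article..."
# Sparse/global/random attention lets the encoder handle sequences far longer
# than BERT's 512-token limit.
inputs = tokenizer(long_document, return_tensors="pt", truncation=True, max_length=4096)

summary_ids = model.generate(**inputs, num_beams=4, max_length=256)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```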

LUKE (@NielsRogge, @ikuyamada)

LUKE is based on RoBERTa and adds entity embeddings as well as an entity-aware self-attention mechanism, which helps improve performance on various downstream tasks involving reasoning about entities such as named entity recognition, extractive and cloze-style question answering, entity typing, and relation classification.

The LUKE model was proposed in LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention by Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda, and Yuji Matsumoto.

Compatible checkpoints can be found on the Hub: https://huggingface.co/models?filter=luke
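
The sketch below shows how entity spans are passed to the tokenizer; the studio-ousia/luke-base checkpoint and the example sentence are illustrative assumptions.

```python
# Minimal LUKE sketch: contextualized word and entity representations.
from transformers import LukeTokenizer, LukeModel

tokenizer = LukeTokenizer.from_pretrained("studio-ousia/luke-base")
model = LukeModel.from_pretrained("studio-ousia/luke-base")

text = "Beyoncé lives in Los Angeles."
entity_spans = [(0, 7)]  # character span of "Beyoncé"
inputs = tokenizer(text, entity_spans=entity_spans, return_tensors="pt")
outputs = model(**inputs)

# In addition to token representations, LUKE returns contextualized entity embeddings.
word_embeddings = outputs.last_hidden_state
entity_embeddings = outputs.entity_last_hidden_state
```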

Megatron (@jdemouth)

The MegatronBERT model is added to the library, giving access to the 345M-parameter variants.

The implementation comes with nine different models: MegatronBertModel, MegatronBertForMaskedLM, MegatronBertForCausalLM, MegatronBertForNextSentencePrediction, MegatronBertForPreTraining, MegatronBertForSequenceClassification, MegatronBertForMultipleChoice, MegatronBertForTokenClassification and MegatronBertForQuestionAnswering, in PyTorch.

The MegatronBERT model was proposed in Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism by Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro.
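
A minimal masked-LM sketch follows; the local directory name is a hypothetical path to a converted 345M checkpoint, not a checkpoint shipped with this release.

```python
# Minimal MegatronBERT masked-LM sketch, assuming a locally converted checkpoint.
import torch
from transformers import BertTokenizer, MegatronBertForMaskedLM

checkpoint_dir = "./megatron-bert-cased-345m"  # hypothetical converted checkpoint
tokenizer = BertTokenizer.from_pretrained(checkpoint_dir)
model = MegatronBertForMaskedLM.from_pretrained(checkpoint_dir)

inputs = tokenizer("Paris is the [MASK] of France.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Find the [MASK] position and read out the highest-scoring token.
mask_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
predicted_id = logits[0, mask_index].argmax(-1).item()
print(tokenizer.convert_ids_to_tokens(predicted_id))
```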

Hub integration in Transformers

The Hugging Face Hub is now more tightly integrated with transformers, through two new features:

  • Models, configurations and tokenizers now have a push_to_hub method to automatically push their state to the hub (a short sketch follows this list).

  • The Trainer can now automatically push its underlying model, configuration and tokenizer in a similar fashion. Additionally, it can create a draft model card on the fly with the training hyperparameters and evaluation results.

  • Auto modelcard #11599 (@sgugger)

  • Trainer push to hub #11328 (@sgugger)
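
A minimal sketch of push_to_hub is shown below; the base checkpoint and repository name are illustrative, and the exact signature may vary slightly across versions.

```python
# Minimal push_to_hub sketch; requires being logged in (e.g. via `transformers-cli login`).
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# ... fine-tune the model ...

model.push_to_hub("my-finetuned-model")      # pushes weights and config
tokenizer.push_to_hub("my-finetuned-model")  # pushes tokenizer files
```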

DeepSpeed ZeRO Stage 3 & ZeRO-Infinity

The Trainer now integrates two additional stages of ZeRO: ZeRO stage 3 for parameter partitioning, and ZeRO-Infinity, which extends CPU offload with NVMe offload.
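
The sketch below shows one way to wire a ZeRO stage 3 config into the Trainer; the config values are illustrative fragments, not tuned recommendations, and DeepSpeed must be installed.

```python
# Hedged sketch: point TrainingArguments at a DeepSpeed ZeRO-3 config file.
import json
from transformers import TrainingArguments

ds_config = {
    "zero_optimization": {
        "stage": 3,                               # ZeRO stage 3: partition parameters across GPUs
        "offload_param": {"device": "cpu"},       # CPU offload; ZeRO-Infinity adds NVMe as a target
        "offload_optimizer": {"device": "cpu"},
    },
}
with open("ds_config_zero3.json", "w") as f:
    json.dump(ds_config, f)

# Launch the script with the DeepSpeed launcher, e.g. `deepspeed your_script.py ...`.
args = TrainingArguments(output_dir="output", deepspeed="ds_config_zero3.json")
```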

Flax

Flax support is getting more robust, with model code stabilizing and new models being added to the library.

TensorFlow

We welcome @Rocketknight1 as a TensorFlow contributor. This version includes a brand new TensorFlow example based on Keras, which will be followed by examples covering most tasks.
Additionally, more TensorFlow setups are now covered, with added support for AMD-based GPUs and M1 Macs.

Pipelines

Two new pipelines are added:

Notebooks

  • [Community notebooks] Add Wav2Vec notebook for creating captions for YT Clips #11142 (@Muennighoff)
  • add bigbird-pegasus evaluation notebook #11654 (@vasudevgupta7)
  • Vit notebooks + vit/deit fixes #11309 (@NielsRogge)

General improvements and bugfixes