Attention Is All You Need

  • Google Brain - Ashish Vaswani, Noam Shazeer
  • Google Research - Niki Parmar, Jakob Uszkoreit, Llion Jones
  • University of Toronto - Aidan N. Gomez (et al.)

Summary

  • Existing models for sequence transduction rely on recurrent or convolutional neural networks.
  • Transformer is a new network architecture based solely on attention mechanisms.
  • Advantages of the Transformer model:
    • Superior quality
    • Highly parallelizable
    • Reduced training time
  • Achievements:
    • Achieves 28.4 BLEU on the WMT 2014 English-to-German translation task
    • Sets a new single-model state-of-the-art BLEU score of 41.8 on the WMT 2014 English-to-French translation task
  • Generalizes well to other tasks, such as English constituency parsing.

Transformer Model Proposal

  • Recurrent neural networks (RNNs) are commonly used in sequence modeling and transduction.
  • Challenges of RNNs include their inherently sequential computation, which limits parallelization within training examples, and difficulty modeling long-range dependencies.
  • While attention mechanisms have shown promise, they are predominantly used in conjunction with RNNs.
  • The Transformer model is proposed to be based entirely on attention, eliminating the need for recurrence or convolution.

Overview

  • Extended Neural GPU, ByteNet, and ConvS2S reduce sequential computation using convolutions, but the number of operations needed to relate two distant positions grows with the distance between them.
  • Self-attention relates different positions of a single sequence to compute a representation of that sequence and has been used successfully in a variety of tasks.
  • The Transformer is the first transduction model relying entirely on self-attention.
  • Self-attention layers are compared with recurrent and convolutional layers on computational complexity per layer, amount of parallelizable computation, and maximum path length between long-range dependencies.

Model Architecture

  • Encoder-decoder structure with stacked self-attention and point-wise fully connected layers
  • Encoder: stack of N=6 identical layers with multi-head self-attention and feed-forward sub-layers
  • Decoder: stack of N=6 identical layers with self-attention, encoder-decoder attention, and feed-forward sub-layers
  • Position-wise feed-forward networks applied to each position separately
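
A minimal NumPy sketch of the sub-layer pattern described above: each sub-layer is wrapped as LayerNorm(x + Sublayer(x)), and the position-wise feed-forward network applies the same two linear transformations with a ReLU in between at every position. The function names, the omission of learned layer-norm gain/bias, and the absence of dropout are simplifications of this sketch, not details from the paper.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize over the feature dimension (learned gain/bias omitted in this sketch).
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def position_wise_ffn(X, W1, b1, W2, b2):
    # FFN(x) = max(0, x W1 + b1) W2 + b2, applied to each position independently.
    # Assumed shapes: X (seq, d_model), W1 (d_model, d_ff), W2 (d_ff, d_model);
    # the base model uses d_model = 512 and d_ff = 2048.
    return np.maximum(0.0, X @ W1 + b1) @ W2 + b2

def residual_sublayer(x, sublayer_fn):
    # Each encoder/decoder sub-layer is wrapped as LayerNorm(x + Sublayer(x)).
    return layer_norm(x + sublayer_fn(x))
```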

Attention Mechanism Overview

  • An attention function maps a query and a set of key-value pairs to an output, computed as a weighted sum of the values
  • Scaled Dot-Product Attention:
    • Computes dot products of the query with all keys
    • Divides each by the square root of the key dimension (√d_k)
    • Applies the softmax function to obtain weights on the values
  • Multi-Head Attention:
    • Linearly projects queries, keys, and values h times with different learned projections
    • Performs attention in parallel on each projected version (head)
    • Concatenates the heads' outputs
    • Projects the concatenated output once more
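
A minimal NumPy sketch of both attention variants for the unmasked self-attention case; the weight matrices here are illustrative placeholders rather than the paper's trained parameters, and the decoder's causal mask is omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)   # (..., seq_q, seq_k)
    return softmax(scores, axis=-1) @ V              # (..., seq_q, d_v)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    # Project into per-head subspaces, attend in parallel, concatenate, project again.
    # Assumed shapes: X (seq, d_model); each W (d_model, d_model); d_model % num_heads == 0.
    seq, d_model = X.shape
    d_head = d_model // num_heads

    def split_heads(M):
        # (seq, d_model) -> (num_heads, seq, d_head)
        return M.reshape(seq, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
    heads = scaled_dot_product_attention(Q, K, V)     # (num_heads, seq, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq, d_model)
    return concat @ W_o
```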

Positional Encoding

  • Injects information about the position of tokens in the sequence
  • Sinusoidal positional encoding based on sine and cosine functions
  • Hypothesis: allows the model to easily learn to attend by relative positions, since PE(pos+k) is a linear function of PE(pos) for any fixed offset k
  • Learned positional embeddings were also tested and produced nearly identical results
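
A small NumPy sketch of the sinusoidal encoding; the max_len and d_model values below are illustrative.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    positions = np.arange(max_len)[:, None]                            # (max_len, 1)
    div_terms = np.power(10000.0, np.arange(0, d_model, 2) / d_model)  # (d_model / 2,)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(positions / div_terms)
    pe[:, 1::2] = np.cos(positions / div_terms)
    return pe

# Added to the scaled token embeddings at the bottom of the encoder and decoder stacks.
pe = sinusoidal_positional_encoding(max_len=100, d_model=512)
```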

Training Details

  • Trained on the WMT 2014 English-German dataset of about 4.5 million sentence pairs
  • Used the Adam optimizer with a learning rate schedule of linear warmup followed by inverse-square-root decay (see the sketch after this list)
  • Applied dropout and label smoothing as regularization techniques
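
A sketch of that learning rate schedule; d_model = 512 and warmup_steps = 4000 are the base-model values reported in the paper.

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    # lrate = d_model^(-0.5) * min(step^(-0.5), step * warmup_steps^(-1.5))
    # Linear warmup for the first warmup_steps steps, then inverse-square-root decay.
    step = max(step, 1)  # guard against step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The peak learning rate is reached at step == warmup_steps.
peak = transformer_lr(4000)   # = 512**-0.5 * 4000**-0.5 ≈ 7.0e-4
```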

Results

  • Transformer achieves state-of-the-art BLEU scores on English-to-German and English-to-French translation tasks, surpassing previous models.
  • It trains faster and at a lower computational cost than comparable recurrent or convolutional models.

Model Variations

  • Experiments vary components of the base model to measure how each affects performance
  • Varying:
    • Attention heads
    • Key dimensions
    • Dropout rate
    • Model size

English Constituency Parsing

  • The Transformer performs surprisingly well on the English constituency parsing task, outperforming most previously reported models.
  • The model has demonstrated the ability to generalize to other tasks, even without task-specific tuning.

Conclusion

  • Transformer model based solely on attention mechanisms
  • Superior in quality, more parallelizable, and faster to train than recurrent or convolutional architectures
  • Achieves state-of-the-art results on translation tasks and performs well on other tasks
  • Extensive opportunities for future research and application of attention-based models