Attention Is All You Need
- Google Brain - Ashish Vaswani, Noam Shazeer
- Google Research - Niki Parmar, Jakob Uszkoreit, Llion Jones
- University of Toronto - Aidan N. Gomez (et al.)
Summary
- Existing models for sequence transduction rely on recurrent or convolutional neural networks.
- The Transformer is a new network architecture based solely on attention mechanisms.
- Advantages of the Transformer model:
  - Superior translation quality
  - Highly parallelizable
  - Reduced training time
- Achievements:
  - 28.4 BLEU on the WMT 2014 English-to-German translation task
  - A new single-model state-of-the-art BLEU score of 41.8 on the WMT 2014 English-to-French translation task
  - Generalizes well to other tasks, such as English constituency parsing
Transformer Model Proposal
- Recurrent neural networks (RNNs) are commonly used in sequence modeling and transduction.
- Challenges of RNNs include their inherently sequential computation, limited parallelization, and difficulty with long-range dependencies.
- While attention mechanisms have shown promise, they are predominantly used in conjunction with RNNs.
- The proposed Transformer model is based entirely on attention, eliminating the need for recurrence and convolution.
Overview
- The Extended Neural GPU, ByteNet, and ConvS2S reduce sequential computation, but the number of operations needed to relate two distant positions still grows with the distance between them.
- Self-attention has been used in a variety of tasks to compute a representation of a sequence.
- The Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output, without sequence-aligned RNNs or convolution.
- Self-attention layers are compared with recurrent and convolutional layers on computational complexity, parallelization, and path lengths between long-range dependencies.
Model Architecture
- Encoder-decoder structure built from stacked self-attention and point-wise fully connected layers
- Encoder: a stack of N = 6 identical layers, each with a multi-head self-attention sub-layer and a feed-forward sub-layer
- Decoder: a stack of N = 6 identical layers, each with masked self-attention, encoder-decoder attention, and feed-forward sub-layers
- Position-wise feed-forward networks applied to each position separately and identically (see the sketch below)
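A minimal NumPy sketch (not the authors' code) of the position-wise feed-forward sub-layer, which the paper defines as FFN(x) = max(0, xW1 + b1)W2 + b2 with d_model = 512 and inner dimension d_ff = 2048; the function name and the caller-supplied weight arrays are assumptions of this summary:

```python
import numpy as np

def position_wise_ffn(x, w1, b1, w2, b2):
    """Apply FFN(x) = max(0, x W1 + b1) W2 + b2 to every position independently.

    x: (seq_len, d_model); w1: (d_model, d_ff); w2: (d_ff, d_model).
    """
    return np.maximum(0.0, x @ w1 + b1) @ w2 + b2
```

Because the same two linear transformations are applied at every position, the sub-layer can also be described as two convolutions with kernel size 1.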
Attention Mechanism Overview
- An attention function maps a query and a set of key-value pairs to an output, computed as a weighted sum of the values.
- Scaled Dot-Product Attention (sketched in code after this list):
  - Computes dot products of the query with all keys
  - Divides each by the square root of the key dimension, sqrt(d_k)
  - Applies the softmax function to obtain the weights on the values
- Multi-Head Attention:
  - Linearly projects queries, keys, and values h times to lower dimensions
  - Computes attention in parallel on each projection
  - Concatenates the results
  - Projects the concatenated output once more
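A minimal NumPy sketch of both attention variants, assuming single sequences (no batch dimension), no masking, and caller-supplied projection matrices wq, wk, wv, wo of shape (d_model, d_model); this is an illustration, not the authors' implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.shape[-1]
    scores = q @ k.swapaxes(-2, -1) / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ v

def multi_head_attention(q, k, v, wq, wk, wv, wo, h=8):
    """Project Q, K, V h times, attend in parallel, concatenate, project once more."""
    def split_heads(x, w):
        seq, d_model = x.shape
        return (x @ w).reshape(seq, h, d_model // h).transpose(1, 0, 2)  # (h, seq, d_k)

    heads = scaled_dot_product_attention(
        split_heads(q, wq), split_heads(k, wk), split_heads(v, wv))   # (h, seq, d_k)
    concat = heads.transpose(1, 0, 2).reshape(q.shape[0], -1)         # (seq, d_model)
    return concat @ wo
```

The 1/sqrt(d_k) scaling keeps the dot products from growing large in magnitude, which would otherwise push the softmax into regions with extremely small gradients.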
Positional Encoding
- Injects information about the position of tokens in the sequence, since the model contains no recurrence or convolution
- Sinusoidal positional encoding based on sine and cosine functions of different frequencies (see the sketch below)
- Hypothesis: allows the model to easily learn to attend by relative positions
- Learned positional embeddings were also tested and produced nearly identical results
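A short NumPy sketch of the sinusoidal encoding, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)); the function name and the (max_len, d_model) return shape are this summary's choices:

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) array of sinusoidal position encodings (even d_model assumed)."""
    pos = np.arange(max_len)[:, None]              # positions 0 .. max_len-1
    two_i = np.arange(0, d_model, 2)[None, :]      # even dimension indices 2i
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions use cosine
    return pe

# The encoding is added to the token embeddings, e.g. embeddings + pe[:seq_len]
```

The wavelengths form a geometric progression from 2*pi to 10000*2*pi, so for any fixed offset k, PE(pos + k) is a linear function of PE(pos).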
Training Details
- Trained on the WMT 2014 English-German dataset of about 4.5 million sentence pairs
- Used the Adam optimizer with a learning rate that increases linearly over the first warmup steps and then decays proportionally to the inverse square root of the step number (see the sketch below)
- Applied residual dropout and label smoothing as regularization techniques
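The paper's learning-rate schedule, lrate = d_model^(-0.5) * min(step^(-0.5), step * warmup_steps^(-1.5)), written as a small Python helper; the defaults d_model = 512 and warmup_steps = 4000 are the base-model settings, and the function name is this summary's own:

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Linear warmup for warmup_steps steps, then inverse-square-root decay."""
    step = max(step, 1)  # guard against step = 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The rate peaks at step == warmup_steps; with these defaults the peak is roughly 7e-4.
```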
Results
- The Transformer achieves state-of-the-art BLEU scores on the WMT 2014 English-to-German and English-to-French translation tasks, surpassing previously published models.
- It trains faster and at a lower training cost than recurrent or convolutional models.
Model Variations
- Experiments on variations of the base model to measure the change in performance
- Varied:
  - Number of attention heads
  - Attention key dimension
  - Dropout rate
  - Model size
English Constituency Parsing
- The Transformer performs surprisingly well on English constituency parsing, outperforming most previously reported models.
- The model generalizes to other tasks even without task-specific tuning.
Conclusion
- The Transformer is a model architecture based solely on attention mechanisms
- It is superior in quality, more parallelizable, and requires less time to train than recurrent or convolutional architectures
- Achieves state-of-the-art results on translation tasks and performs well on other tasks
- Extensive opportunities remain for future research on and application of attention-based models