Transformers
- Feed forward, whole input processed in parallel
- map time into space using shared weights (see the sketch after this list)
- use parallel computation at every level of the encoder, fast to train
- every location in the input is processed by the same network, communicating via attention
- replicate the same network for every input position
- achieve high performance w/o recurrence
- lower sequential complexity than an RNN: O(1) sequential operations per layer instead of O(n)
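To make the shared-weights / parallel-over-time point concrete, here is a minimal NumPy sketch (the toy sizes and the name W are illustrative assumptions, not from any particular implementation):

```python
import numpy as np

# Toy input: a sequence of n time steps, each a d-dimensional vector,
# laid out as the rows of a matrix ("time mapped into space").
n, d = 6, 4
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))

# One shared weight matrix is applied to every time step at once:
# a single matrix multiply replaces an RNN's step-by-step loop.
W = rng.normal(size=(d, d))
parallel_out = np.maximum(0, X @ W)          # all positions processed in parallel

# The equivalent sequential view (ignoring recurrent state):
sequential_out = np.stack([np.maximum(0, X[t] @ W) for t in range(n)])
assert np.allclose(parallel_out, sequential_out)
```

Because there is no recurrent dependency between steps, the whole sequence can be pushed through each layer in one batched operation, which is where the training-speed advantage over RNNs comes from.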
Purpose of Self-attention mechanisms
Encoder
- Utilizes self-attention to allow the model to focus on different parts of the input sequence.
- Generates a representation of the input sequence (sketched below)
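As a rough illustration of what the encoder's self-attention computes, here is a minimal scaled dot-product self-attention sketch in NumPy (the function name self_attention, the weight names, and the toy sizes are assumptions for illustration):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over one sequence X of shape (n, d)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))   # (n, n): how much each position
                                                # focuses on every other position
    return weights @ V, weights                 # contextual representation + weights

n, d = 5, 8
rng = np.random.default_rng(1)
X = rng.normal(size=(n, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

H, A = self_attention(X, W_q, W_k, W_v)
print(H.shape)            # (5, 8): one encoder representation per input position
print(A.sum(axis=-1))     # each row of attention weights sums to 1
```

Each row of the weight matrix says how much that position focuses on every other part of the input, and the output H is one context-mixed representation per input token.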
Attention-Based Methods
- a way for NNs to "point" at parts of the input
- use softmax + multiplicative connections (addressing in NTMs)
- during decoding, the decoder can pay attention to any state of the encoder (see the sketch after this list)
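A minimal sketch of the softmax + multiplicative addressing idea, with a single decoder state attending over precomputed encoder states (all names and sizes here are illustrative assumptions):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Encoder states: one vector per source position (assumed already computed).
n_src, d = 7, 8
rng = np.random.default_rng(2)
encoder_states = rng.normal(size=(n_src, d))

# One decoder state acts as a query; dot products ("multiplicative" scores)
# against every encoder state are turned into a soft address by the softmax.
decoder_state = rng.normal(size=(d,))
scores = encoder_states @ decoder_state      # (n_src,) similarity to each state
address = softmax(scores)                    # soft pointer over encoder positions
context = address @ encoder_states           # weighted read of the encoder states

print(address.round(3), address.sum())       # weights over all encoder states, sum to 1
print(context.shape)                         # (8,)
```

The softmax turns the raw multiplicative scores into a normalized soft address over all encoder positions, analogous to the content-based addressing used in NTMs.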

