Transformers
- Feed forward, whole input processed in parallel
- map time into space using shared weights (see the sketch after this list)
- use parallel computation at every level of the encoder, fast to train
- every location in the input is processed by the same network, communicating via attention
- replicate the same network for every input position
- achieve high performance w/o recurrence
- lower sequential complexity than an RNN: O(1) sequential operations per layer instead of O(n)
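To make the shared-weights / parallel-over-time point concrete, here is a minimal NumPy sketch (the toy sizes and the name W are illustrative assumptions, not from any particular implementation):

```python
import numpy as np

# Toy input: a sequence of n time steps, each a d-dimensional vector,
# laid out as the rows of a matrix ("time mapped into space").
n, d = 6, 4
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))

# One shared weight matrix is applied to every time step at once:
# a single matrix multiply replaces an RNN's step-by-step loop.
W = rng.normal(size=(d, d))
parallel_out = np.maximum(0, X @ W)          # all positions processed in parallel

# The equivalent sequential view (ignoring recurrent state):
sequential_out = np.stack([np.maximum(0, X[t] @ W) for t in range(n)])
assert np.allclose(parallel_out, sequential_out)
```

Because there is no recurrent dependency between steps, the whole sequence can be pushed through each layer in one batched operation, which is where the training-speed advantage over RNNs comes from.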
Purpose of Self-attention mechanisms
Encoder
- Utilizes self-attention to allow the model to focus on different parts of the input sequence.
- Generates a representation of the input sequence (sketched below)
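As a rough illustration of what the encoder's self-attention computes, here is a minimal scaled dot-product self-attention sketch in NumPy (the function name self_attention, the weight names, and the toy sizes are assumptions for illustration):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over one sequence X of shape (n, d)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))   # (n, n): how much each position
                                                # focuses on every other position
    return weights @ V, weights                 # contextual representation + weights

n, d = 5, 8
rng = np.random.default_rng(1)
X = rng.normal(size=(n, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

H, A = self_attention(X, W_q, W_k, W_v)
print(H.shape)            # (5, 8): one encoder representation per input position
print(A.sum(axis=-1))     # each row of attention weights sums to 1
```

Each row of the weight matrix says how much that position focuses on every other part of the input, and the output H is one context-mixed representation per input token.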
Attention-Based Methods
- a way for NNs to "point" at parts of the input
- use softmax + multiplicative connections (addressing in NTMs)
- during decoding, the decoder can pay attention to any state of the encoder (see the sketch after this list)
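A minimal sketch of the softmax + multiplicative addressing idea, with a single decoder state attending over precomputed encoder states (all names and sizes here are illustrative assumptions):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Encoder states: one vector per source position (assumed already computed).
n_src, d = 7, 8
rng = np.random.default_rng(2)
encoder_states = rng.normal(size=(n_src, d))

# One decoder state acts as a query; dot products ("multiplicative" scores)
# against every encoder state are turned into a soft address by the softmax.
decoder_state = rng.normal(size=(d,))
scores = encoder_states @ decoder_state      # (n_src,) similarity to each state
address = softmax(scores)                    # soft pointer over encoder positions
context = address @ encoder_states           # weighted read of the encoder states

print(address.round(3), address.sum())       # weights over all encoder states, sum to 1
print(context.shape)                         # (8,)
```

The softmax turns the raw multiplicative scores into a normalized soft address over all encoder positions, analogous to the content-based addressing used in NTMs.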

