Transformers are really popular at the moment - with good reason, it must be said - but their memory depth is not arbitrarily long.
A limitation of existing Transformer models and their derivatives, however, is that the full self-attention mechanism has computational and memory requirements that are quadratic with the input sequence length. With commonly available current hardware and model sizes, this typically limits the input sequence to roughly 512 tokens, and prevents Transformers from being directly applicable to tasks that require larger context, like question answering, document summarization or genome fragment classification.
Citation: Google AI Blog. "Constructing Transformers For Longer Sequences with Sparse Attention Methods". Accessed 15 August 2021. http://ai.googleblog.com/2021/03/constructing-transformers-for-longer.html.
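(To make the "quadratic" point concrete, here is a rough numpy sketch with made-up sizes - not any particular library's implementation - showing that the (n, n) attention score matrix is what blows up with sequence length:)

```python
import numpy as np

n, d = 512, 64                    # sequence length and head dimension (illustrative values)
Q = np.random.randn(n, d)         # queries
K = np.random.randn(n, d)         # keys
V = np.random.randn(n, d)         # values

scores = Q @ K.T                  # shape (n, n): every token attends to every other token
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
out = weights @ V                 # shape (n, d)

# The (n, n) score/weight matrices are the quadratic part: doubling the
# sequence length quadruples their memory and compute cost.
```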
NB Long documents can now be summarised by "recursive summarization" (summarise by parts, then summarise the summaries, repeat); I suggested this in another context, so it's nice to see others had the same (fairly obvious) idea and ran with it! A rough sketch of the scheme is below.
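(Roughly, I mean something like the following toy sketch - `summarize` is a hypothetical callable standing in for any fixed-context summariser, and the token limits are made up:)

```python
def recursive_summarize(text, summarize, max_tokens=512, chunk_tokens=400):
    """Summarise arbitrarily long text with a fixed-context summariser.

    `summarize` is a hypothetical function mapping a short string to a
    shorter summary; `max_tokens` / `chunk_tokens` are illustrative limits.
    """
    words = text.split()                       # crude stand-in for tokenisation
    if len(words) <= max_tokens:
        return summarize(text)
    # Summarise fixed-size parts, then summarise the concatenated summaries.
    chunks = [" ".join(words[i:i + chunk_tokens])
              for i in range(0, len(words), chunk_tokens)]
    partial = " ".join(summarize(c) for c in chunks)
    return recursive_summarize(partial, summarize, max_tokens, chunk_tokens)
```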
The memory length of e.g. an LSTM is, however, arbitrary. The limitation for LSTMs (IF I have understood their modus operandi correctly) is that only a finite amount of information can be remembered over that arbitrary period - but the number of distinguishable memory states grows roughly exponentially with the size of the cell state (an n-bit cell state can in principle encode 2^n configurations).
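(What I mean by "arbitrary memory length but finite memory" - a minimal PyTorch sketch, with illustrative sizes: the same fixed-size state is threaded through however many steps you like.)

```python
import torch
import torch.nn as nn

hidden_size = 256                      # fixed-size memory, regardless of sequence length
cell = nn.LSTMCell(input_size=32, hidden_size=hidden_size)

h = torch.zeros(1, hidden_size)        # hidden state
c = torch.zeros(1, hidden_size)        # cell state: the "memory"

# The sequence can be arbitrarily long and the memory cost stays constant,
# but whatever is remembered has to fit into these hidden_size numbers.
for t in range(10_000):                # arbitrary length
    x_t = torch.randn(1, 32)           # dummy input at step t
    h, c = cell(x_t, (h, c))
```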
The key advantage of transformers, as far as I can tell, is that they can be trained in parallel across the whole sequence, whereas RNNs are limited to sequential training, one step at a time.
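(A quick sketch of the contrast, again with made-up sizes: the transformer layer handles all positions in one call, while the LSTM has to loop because each step depends on the previous hidden state.)

```python
import torch
import torch.nn as nn

seq = torch.randn(1, 1024, 512)                 # (batch, time, features), illustrative sizes

# Transformer layer: every position is processed in one parallel pass.
attn_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
parallel_out = attn_layer(seq)                  # one call covering all 1024 positions

# RNN: each step depends on the previous hidden state, so the sequence
# has to be unrolled one position at a time during training.
rnn_cell = nn.LSTMCell(input_size=512, hidden_size=512)
h = torch.zeros(1, 512)
c = torch.zeros(1, 512)
for t in range(seq.shape[1]):
    h, c = rnn_cell(seq[:, t, :], (h, c))       # 1024 sequential steps
```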
Maybe the next thing will be to use LSTMs on the queries of the transformer (to give arbitrary depth on relevance lookups), but the general question of this post is…
Apart from the training advantages/disadvantages of Transformers vs RNNs, what are the pros and cons of each?