Abstract: The computational complexity of the Transformer's self-attention grows quadratically with the input sequence length. This causes a sharp increase in computational cost and memory consumption for ...