We also heard Yoshua Bengio's story: "My own insight really became strong in the context of the machine translation task. Prior to our introduction of attention, we were using a recurrent network that read the whole input source language sequence and then generated the translated target language sequence. However, this is not at all how humans translate. Humans pay very particular attention to just one word or a few input words at a time, in their context, to decide on the next word (or few words) to generate to form the sequence of words in the translation."
This is the paper he credits as the one that influenced them: https://www.cs.toronto.edu/~hinton/absps/nips_eyebm.pdf
You can find his full reply here: https://www.turingpost.com/p/attention
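To make the intuition Bengio describes concrete, here is a minimal sketch of soft attention in a translation decoder, written in plain NumPy with illustrative names: at each step the decoder scores every source-word representation against its current state, normalizes the scores into weights, and mixes the encoder states into a single context vector, so most of the weight typically lands on just a few input words. For brevity the sketch uses simple dot-product scoring; the original neural machine translation attention of Bahdanau, Cho, and Bengio scores with a small additive network, but the idea is the same.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend(decoder_state, encoder_states):
    """Score each source position against the current decoder state,
    turn the scores into weights, and mix the encoder states into a
    context vector used to predict the next target word."""
    scores = encoder_states @ decoder_state    # one score per source word
    weights = softmax(scores)                  # attention distribution over the source
    context = weights @ encoder_states         # weighted mix of source representations
    return weights, context

# Toy example: 5 source words, hidden size 8 (all values random, for illustration only).
rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(5, 8))   # one vector per source word
decoder_state = rng.normal(size=(8,))      # decoder state while generating the next target word

weights, context = attend(decoder_state, encoder_states)
print(weights.round(3))   # most of the mass lands on a few source positions
```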