Next, we'll talk about a variant, or extension, of the original memory network called the end-to-end memory network. The motivation is that the original memory network cannot be trained end-to-end, as we illustrated in the previous module. The reason is the argmax operation that tries to find the single memory slot most similar to the query: argmax is not differentiable, so gradients cannot flow through it. The main idea here is to replace that argmax with something whose gradient is easy to compute. In this paper, the replacement is a softmax operation with attention. Instead of selecting a single memory slot, you look at all of the memory slots, but you put different attention weights on them. The slot with the largest weight is potentially the most relevant, but every slot still receives some weight. That's the main idea, and we'll walk through it now.

The general idea, again, is to store the input, or maybe a subset of the input, into memories. Then, when a query comes in, compute the attention weights between the query and everything you have stored, and do it in a way that keeps the whole computation differentiable end-to-end.

Let's look at the architecture. You have a sequence of inputs x_i. In this particular model there are three different embeddings of the input. The first embeds each x_i using an embedding matrix A to produce what we call the keys, m_i. When I say embedding, I really mean A times the original x_i, so m_i = A x_i. The second uses another embedding matrix C: c_i = C x_i gives you another set of embeddings, which we call the values. The third is for the query. Keep in mind that the query is just one of the inputs, and it gets its own embedding matrix B, so u = B q is the query embedding.

Once you have the keys, values, and query, you compute the attention between the query and the keys. For example, taking one memory slot m_i, you compute the inner product of the query embedding u with m_i. You do this for every key, then pass the scores through a softmax to normalize them into weights between zero and one: p_i = softmax(u^T m_i). So p_i is the attention weight for memory slot i. Then, to produce the output embedding, you take a weighted sum of all the values: o = sum_i p_i c_i. You loop over all the memory slots' values, but weight each one by its corresponding attention p_i.

Once you have the output embedding, for the final prediction you sum o with the query embedding u, pass the result through another transformation W, and apply a softmax: a_hat = softmax(W(o + u)), treating the task as classification, of course, in this case. The model parameters are all of the embedding matrices A, B, and C, plus W. That's the end-to-end memory network. The good thing about this is that you can train it end-to-end, because every operation is differentiable and there is no discrete selection in the way, so it can be done very efficiently.
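To make the single-hop computation concrete, here is a minimal NumPy sketch of one forward pass. The dimensions, the random initialization, and the bag-of-words-style inputs are illustrative assumptions, not details from the paper; the variable names follow the notation above.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())        # shift for numerical stability
    return e / e.sum()

# Illustrative sizes (assumptions, not from the paper)
V, d = 50, 20                      # vocabulary size, embedding dimension
n = 10                             # number of memory slots

rng = np.random.default_rng(0)
A = rng.normal(scale=0.1, size=(d, V))   # key embedding matrix
B = rng.normal(scale=0.1, size=(d, V))   # query embedding matrix
C = rng.normal(scale=0.1, size=(d, V))   # value embedding matrix
W = rng.normal(scale=0.1, size=(V, d))   # final output transformation

x = rng.random((n, V))             # inputs x_i (e.g., bag-of-words rows)
q = rng.random(V)                  # the query, itself one of the inputs

m = x @ A.T                        # keys   m_i = A x_i, shape (n, d)
c = x @ C.T                        # values c_i = C x_i, shape (n, d)
u = B @ q                          # query embedding u = B q, shape (d,)

p = softmax(m @ u)                 # attention weights p_i = softmax(u^T m_i)
o = p @ c                          # output embedding o = sum_i p_i c_i

a_hat = softmax(W @ (o + u))       # final prediction over the answer classes
```

Every step here is a differentiable matrix operation, so gradients with respect to A, B, C, and W flow straight through the softmax; that is exactly what the original argmax prevented.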
Maybe you want to do this in multiple layers, and that's what they do as well. Instead of projecting the input x into just one set of keys and values, you can imagine doing this multiple times, which gives you a multi-layer end-to-end memory network.

Then you may wonder, "Okay, maybe that's too many parameters," since all of these matrices A_1, A_2, A_3 and C_1, C_2, C_3 can add up. So they also introduce some parameter-sharing strategies. One is adjacent tying: you tie the value embedding of the previous layer to the key embedding of the next layer, so A_{k+1} = C_k. Here, C_1 = A_2, C_2 = A_3, and so on. Or you can do layer-wise tying: you say all the value embeddings should be the same, C_1 = C_2 = C_3 and so on, and all the key embeddings should be the same, A_1 = A_2 = A_3. That's the multi-layer end-to-end memory network; a sketch of the multi-hop computation with adjacent tying follows at the end of this section.

Here's a summary of the end-to-end memory network. It's another variant of the memory network. It can be trained end-to-end, and the key idea is to use softmax with attention to replace the original argmax operation, so that you can still compute gradients and train by backpropagation end-to-end. Their application is, again, question answering, and it follows the same question-and-answer template: you take a query, match it against the keys, and then retrieve the values. This key-value framework has been used again and again, and it allows end-to-end training, as we said. That's the end-to-end memory network.
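To make the multi-hop stacking and the adjacent tying concrete, here is a self-contained sketch, again in NumPy with illustrative dimensions and random initialization (assumptions, not details from the paper). Each hop attends over the memory and adds its output to the running query embedding, u_{k+1} = u_k + o_k, which is how the layers stack.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

V, d, n, K = 50, 20, 10, 3     # vocab size, embedding dim, memory slots, hops

rng = np.random.default_rng(0)
B = rng.normal(scale=0.1, size=(d, V))   # query embedding matrix
W = rng.normal(scale=0.1, size=(V, d))   # final output transformation

# Adjacent tying: reuse hop k's value embedding as hop k+1's key
# embedding, i.e. A_{k+1} = C_k, so fewer matrices must be learned.
# (Layer-wise tying would instead share one A and one C across all hops.)
C_list = [rng.normal(scale=0.1, size=(d, V)) for _ in range(K)]
A_list = [rng.normal(scale=0.1, size=(d, V))] + C_list[:-1]

x = rng.random((n, V))         # inputs x_i
q = rng.random(V)              # the query

u = B @ q                      # initial query embedding u_1 = B q
for A_k, C_k in zip(A_list, C_list):
    m = x @ A_k.T              # keys for this hop
    c = x @ C_k.T              # values for this hop
    p = softmax(m @ u)         # attention weights over the memory
    o = p @ c                  # this hop's output, o = sum_i p_i c_i
    u = u + o                  # u_{k+1} = u_k + o_k feeds the next hop

a_hat = softmax(W @ u)         # final prediction after K hops
```

The loop body is exactly the single-hop computation from before; stacking hops just lets the model refine what it attends to, while the tying keeps the parameter count from growing linearly with depth.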