Today we'll talk about memory networks. The memory network is another powerful extension of the attention model. It was initially applied in natural language processing, for example to question answering, and in healthcare, memory networks can be valuable because of their ability to memorize medical knowledge and patient history.

Here's the outline of this lecture. We'll first talk about the original memory network, then the end-to-end version of memory networks. Then we'll talk about some powerful applications of memory networks and self-attention: a model called the transformer, and an application of the transformer to processing text called BERT. After that, we'll discuss some use cases in healthcare, namely Doctor2Vec, which is for clinical trial recruitment, and then medication recommendation. There we'll cover two methods: one called GAMENet, which uses a memory network together with a graph neural network, and another work that uses BERT-like transformers for medication recommendation.

The original memory network uses an external memory component to assist the deep neural network. The idea is to remember and store information that can be useful later. This is analogous to the human brain: you have a memory component as well as some capability to compute, and the computation part corresponds to the deep neural network. Different memory network models have been proposed; we'll talk about two of them in this lecture.

Let's start with the original memory network paper. Here's an overview of the memory network architecture; it has a few components. First, there is an input component, or input feature map, that maps the raw input feature vector into some internal feature representation, I(x). Then there is a generalization component, G. This is the main memory component: it tries to remember important information, in this case the feature maps that have been computed over a big dataset; a subset of them will be recorded in memory. When a new feature map comes in, the G component updates the existing memory.

Once we have the memory and the current input feature map, the next component is the output feature map, O. To produce the output feature map, instead of just passing the input feature map through some feedforward neural network, memory networks use a query-and-retrieval mechanism: the input feature map serves as a query to find the most relevant memory slot in the external memory, and the output feature map is produced from the input feature map I(x) and that most relevant memory slot, m_i. Then we have the final response component, R, which is just a feedforward neural network that produces the final prediction with the output feature map as input.

Next, let's look at each of these components in more detail. The first component is the input feature map, I. It converts the input feature vector x into an internal feature representation, I(x). This representation can be quite general: it can be a simple one-hot encoding, or it can come from a neural network, for example a recurrent neural network if the input is a sequence of words.
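To make the I component concrete, here is a minimal PyTorch sketch of two possible instantiations just mentioned: a one-hot encoding, and a recurrent encoder for a sequence of word ids. The vocabulary size, dimensions, and the choice of a GRU are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of the input feature map I. All sizes and the GRU choice
# are illustrative assumptions; the paper leaves I(x) quite general.
import torch
import torch.nn as nn

VOCAB, DIM = 5000, 64

def I_onehot(token_id: int) -> torch.Tensor:
    """Simplest I: a one-hot encoding of a single categorical input."""
    v = torch.zeros(VOCAB)
    v[token_id] = 1.0
    return v

class IRecurrent(nn.Module):
    """I as an RNN encoder: a sequence of word ids -> one vector I(x)."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.rnn = nn.GRU(DIM, DIM, batch_first=True)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        _, h = self.rnn(self.embed(token_ids.unsqueeze(0)))
        return h.squeeze()  # final hidden state serves as I(x)

feat = IRecurrent()(torch.tensor([12, 87, 403]))  # I(x), shape (DIM,)
```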
The next component is the generalization component, G. This is the core innovation of the memory network: it introduces the memory component. The idea is that when we have a new input feature map I(x), we want to store it in memory, or update the existing memory with some generalization, or compression, of it. The paper describes this strategy at a very high level: mathematically, we want to learn a function G that updates each memory slot as m_i = G(m_i, I(x), m), for all slots i, say i from 1 to 1,000,000. Essentially, we take the input feature map and try to update each of those memory slots. That's the general, abstract formulation of the generalization component.

The simplest version you can imagine just stores the new input feature map I(x) into one memory slot. The idea is to first find the location to store the information using a hash function H(x), and then store the new input feature map in that slot: m_H(x) = I(x). Of course, the information originally stored in that location will be replaced. Beyond this simple version, you can consider more advanced ways to define the hash function, instead of just randomly mapping to some location in memory. For example, you could hash based on clusters of the input, so that similar input feature maps are mapped to the same or similar locations. Or you can imagine a strategy that generalizes or updates the existing memory rather than just replacing it with the new input, so that the previous memory is not forgotten.

The next component is the output feature map, O. It takes the input feature map, I(x), and uses it as a query against the entire memory bank to find the most relevant memory slot, m_i, then combines them to compute an output feature map. Mathematically, we want to compute a function whose output is the output feature map o, and which takes the input feature map I(x) and the memory bank m together as input. The idea is that maybe you want to find the single most relevant memory slot, so you return just one memory slot; that is, k = 1. In that case the function becomes essentially an argmax over a similarity score s_O computed between I(x) and memory slot m_i for all slots i: o1 = argmax_i s_O(I(x), m_i). You essentially go through all the memory slots, compute the similarity score for each, and the argmax outputs the particular index i with the largest similarity score; that is the most relevant memory slot. When k = 2, you can repeat this operation to find two memory slots, but in this example we're just looking for one, the index with the largest similarity score. So the final output o may be just the concatenation of the input feature map I(x) and the most relevant memory slot, m_o1.
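Putting the G and O descriptions together, here is a toy sketch in the same PyTorch setup. The hash function, cosine similarity as the scoring function s_O, and the slot count are my illustrative assumptions; the paper learns the scoring function and leaves H(x) open.

```python
# Toy sketch of the simplest G (overwrite slot H(x)) and the k = 1
# retrieval in O. Hash choice, cosine scoring, and sizes are assumptions.
import torch
import torch.nn.functional as F

NUM_SLOTS, DIM = 1000, 64
memory = torch.zeros(NUM_SLOTS, DIM)          # the memory bank m

def H(feature: torch.Tensor) -> int:
    """Hash H(x): pick the slot where I(x) will be stored. A cluster-based
    hash would instead map similar inputs to the same or nearby slots."""
    return int(feature.abs().sum().item() * 1000) % NUM_SLOTS

def G(feature: torch.Tensor) -> None:
    """Simplest generalization: replace the content of slot H(x) with I(x)."""
    memory[H(feature)] = feature.detach()     # old content is overwritten

def O(feature: torch.Tensor) -> torch.Tensor:
    """Output feature map for k = 1: score every slot against the query
    I(x), take the argmax, and concatenate the winner m_o1 with I(x).
    For k = 2 you would repeat the argmax over the remaining slots."""
    scores = F.cosine_similarity(feature.unsqueeze(0), memory, dim=1)
    o1 = int(scores.argmax())                 # index with the largest score
    return torch.cat([feature, memory[o1]])   # output o, shape (2 * DIM,)

feat = torch.randn(DIM)                       # stands in for I(x)
G(feat)
out_feat = O(feat)
```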
The final component is the response component, R. Now that we have the output feature map o, we take it and map it through another neural network, for example a feedforward neural network, to produce the final output. The form of this neural network can be quite general: if the prediction task is multi-class classification, you can have a softmax as the final layer; if you want to generate a sequence, this neural network may instead be an RNN.

To summarize, we talked about the original memory network. The key innovation is to bring a memory component into the deep neural network. It has an input feature map; a generalization component, which is the core component that introduces the external memory bank; an output feature map that uses the input as a query to find the most relevant memory slots in the memory bank and combines them as the output feature map; and a response component that takes the output feature map and produces the final prediction.

A limitation of this particular architecture is that it cannot be trained end to end, because of the argmax operation in the output feature map step. That step turns out to be non-differentiable: you cannot compute its gradient, so you cannot backpropagate through the whole model end to end. That's one of the major limitations of this original memory network paper.
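To see the limitation concretely, here is a minimal PyTorch sketch (my illustration, not the paper's code) contrasting the hard argmax readout with the softmax-weighted readout that the end-to-end memory network, covered next, uses to make the whole model differentiable.

```python
# Why argmax blocks end-to-end training: no gradient reaches the query
# through hard slot selection, while a soft readout is differentiable.
import torch

query = torch.randn(64, requires_grad=True)   # plays the role of I(x)
memory = torch.randn(1000, 64)                # the memory bank m
scores = memory @ query                       # s_O(I(x), m_i) for every slot

# Hard retrieval: argmax returns an integer index, which cuts the graph.
hard_read = memory[scores.argmax()]
print(hard_read.requires_grad)                # False: no path back to query

# Soft retrieval: a softmax-weighted sum over all slots stays differentiable,
# so gradients flow from the loss all the way back to the query.
soft_read = torch.softmax(scores, dim=0) @ memory
soft_read.sum().backward()
print(query.grad.norm())                      # nonzero gradient
```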