In this week, you'll hear about sequence-to-sequence models, which are useful for everything from machine translation to speech recognition. Let's start with the basic models, and then later this week, you'll hear about beam search, the attention model, and we will wrap up the discussion with models for audio data like speech. Let's get started.

Let's say you want to input a French sentence like "Jane visite l'Afrique en septembre," and you want to translate it to the English sentence, "Jane is visiting Africa in September." As usual, let's use x<1> through, in this case, x<5> to represent the words in the input sequence, and we'll use y<1> through y<6> to represent the words in the output sequence. How can you train a neural network to input the sequence x and output the sequence y? Well, here's something you could do. The ideas I'm about to present are mainly from these two papers, one due to [inaudible] and the other by [inaudible]. First, let's have a network, which we're going to call the encoder network, be built as an RNN, and this could be a [inaudible], feeding in the input French words one word at a time. After ingesting the input sequence, the RNN then outputs a vector that represents the input sentence. After that, you can build a decoder network, which you might draw here, which takes as input the encoding output by the encoder network, shown in black on the left, and can then be trained to output the translation one word at a time, eventually outputting, let's say, the end-of-sequence token, upon which the decoder stops. As usual, we could take the generated tokens and feed them to the next unit in the sequence, just as we were doing before when synthesizing text using a language model. One of the most remarkable recent results in deep learning is that this model works. Given enough pairs of French and English sentences, if you train a model to input a French sentence and output the corresponding English translation, this will actually work decently well.
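To make the encoder-decoder idea concrete, here is a minimal numpy sketch of that architecture. All weights, dimensions, and the toy vocabulary are made-up placeholders (no training happens here); the point is only the data flow: the encoder RNN reads the input words and hands its final hidden state to the decoder, which emits one token at a time, feeding each emitted token back in until it produces the end-of-sequence token or hits a length limit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: hidden state, input word embedding, output vocabulary
h_dim, x_dim, vocab = 8, 4, 6

# Randomly initialized weights standing in for a trained model
W_enc = rng.normal(0, 0.1, (h_dim, h_dim + x_dim))
W_dec = rng.normal(0, 0.1, (h_dim, h_dim + vocab))
W_out = rng.normal(0, 0.1, (vocab, h_dim))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def encode(xs):
    """Run the encoder RNN over the input word vectors; return the final hidden state."""
    h = np.zeros(h_dim)
    for x in xs:
        h = np.tanh(W_enc @ np.concatenate([h, x]))
    return h

def decode(h, max_len=10, eos=0):
    """Greedily emit one token at a time, feeding each token back in, until <EOS>."""
    tokens, prev = [], np.zeros(vocab)
    for _ in range(max_len):
        h = np.tanh(W_dec @ np.concatenate([h, prev]))
        t = int(softmax(W_out @ h).argmax())
        tokens.append(t)
        if t == eos:           # token 0 plays the role of <EOS> in this sketch
            break
        prev = np.eye(vocab)[t]  # one-hot of the emitted token, fed to the next step
    return tokens

# Five random embeddings standing in for the words x<1>..x<5> of the French sentence
sentence = [rng.normal(size=x_dim) for _ in range(5)]
print(decode(encode(sentence)))
```

With trained weights, the same loop would spell out "Jane is visiting Africa in September" token by token; here the output is just whatever the random weights produce.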
This model simply uses an encoder network whose job it is to find an encoding of the input French sentence, and then uses a decoder network to generate the corresponding English translation. An architecture very similar to this also works for image captioning. Given an image like the one shown here, maybe you want it to be captioned automatically as "a cat sitting on a chair." How do you train a neural network to input an image and output a caption like the phrase up there? Here's what you can do. From the earlier course on ConvNets, you've seen how you can input an image into a convolutional network, maybe a pre-trained AlexNet, and have that learn an encoding, a learned featurization, of the input image. This is actually the AlexNet architecture, and if we get rid of this final softmax unit, the pre-trained AlexNet can give you a 4,096-dimensional feature vector with which to represent this picture of a cat. This pre-trained network can be the encoder network for the image, and you now have a 4,096-dimensional vector that represents the image. You can then take this and feed it to an RNN whose job it is to generate the caption one word at a time. Similar to what we saw with machine translation, translating from French to English, you can now input a feature vector describing the input and then have it generate an output set of words, one word at a time. This actually works pretty well for image captioning, especially if the caption you want to generate is not too long. As far as I know, this type of model was first proposed by [inaudible], although it turns out there were multiple groups coming up with very similar models independently and at about the same time. Two of the groups that had done very similar work at about the same time, and I think independently, were [inaudible] as well as Andrej Karpathy and Fei-Fei Li. You've now seen how a basic sequence-to-sequence model works, and how a basic image-to-sequence, or image captioning, model works.
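The captioning variant swaps the RNN encoder for a CNN feature vector. Here is a hedged sketch of just that wiring: a made-up 4,096-dimensional vector stands in for the pre-trained AlexNet's output, a hypothetical projection matrix maps it into the decoder RNN's initial hidden state, and the decoder then emits the caption one token at a time; the weights are random placeholders, not a trained model.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sizes: AlexNet-style feature, RNN hidden state, toy caption vocabulary
feat_dim, h_dim, vocab = 4096, 8, 6

W_img = rng.normal(0, 0.01, (h_dim, feat_dim))     # image feature -> initial state
W_dec = rng.normal(0, 0.1, (h_dim, h_dim + vocab)) # decoder RNN step
W_out = rng.normal(0, 0.1, (vocab, h_dim))         # hidden state -> vocab logits

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def caption(image_feat, max_len=10, eos=0):
    """Treat the 4,096-d CNN feature as the encoding; decode a caption token by token."""
    h = np.tanh(W_img @ image_feat)  # the image encoding seeds the decoder
    tokens, prev = [], np.zeros(vocab)
    for _ in range(max_len):
        h = np.tanh(W_dec @ np.concatenate([h, prev]))
        t = int(softmax(W_out @ h).argmax())
        tokens.append(t)
        if t == eos:
            break
        prev = np.eye(vocab)[t]  # feed the emitted token back in
    return tokens

# A random stand-in for the feature vector of the cat picture
print(caption(rng.normal(size=feat_dim)))
```

Compared to translation, the only structural change is where the encoding comes from; the decoder loop is identical.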
But there are some differences between how you'll use a model like this to generate a sequence, compared to how you were synthesizing novel text using a language model. One of the key differences is you don't want a randomly chosen translation; you maybe want the most likely translation. And you don't want a randomly chosen caption, maybe not; you might want the best caption, the most likely caption. Let's see in the next video how you go about generating that.
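The distinction can be shown with a toy next-word distribution (the probabilities here are made up): synthesizing novel text means sampling from the distribution, so different runs give different words, whereas translation or captioning wants the most likely word, which greedy argmax always picks the same way.

```python
import numpy as np

rng = np.random.default_rng(2)
probs = np.array([0.5, 0.3, 0.15, 0.05])  # toy P(next word | context)

# Language-model synthesis: sample a word at random according to the distribution
sampled = [int(rng.choice(len(probs), p=probs)) for _ in range(5)]

# Translation/captioning: take the single most likely word instead
greedy = int(probs.argmax())

print(sampled, greedy)  # the samples vary across draws; greedy is always index 0
```

Greedy word-by-word choice is still not quite the same as finding the most likely whole sentence, which is what beam search, coming up next, addresses.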