In this lecture, we will be talking about HDFS, learning how to interact with HDFS, and also the basics of the MapReduce paradigm. Okay, what is HDFS? Hadoop Distributed File System, that's what it stands for. It is basically a mechanism for storing data that is accessed within Hadoop. It is designed to scale, meaning to be able to use a large number of computers easily and effectively, and it is designed to be deployed on commodity hardware, or low-cost hardware. Anything you might get off the shelf, you can install the Hadoop Distributed File System on.

What is the advantage, and what comes with it? Because it is a distributed file system, the data you want to store is not stored on any single node. Part of the data might be stored on one node, part of the data might be stored on another node, and there is usually a replica of it as well.

What are the components of the Hadoop Distributed File System? First, you have something called the NameNode, or the master, which manages the distributed file system and the namespace associated with it. You can have multiple copies of the master, so that if the primary NameNode fails, a secondary NameNode can take over. That is how you build redundancy and avoid a single point of failure. So you could have one or more NameNodes. You can think of NameNodes as lookup places: the NameNode knows where the data is. It doesn't actually store the data itself; it is an index to where the data is. Then you have the DataNodes, where the individual pieces of data are actually stored. You have DataNode 1, 2, 3, 4 in this picture.

Whenever a user wants to access the data and types an HDFS command, the request actually goes to the NameNode. The NameNode knows what you are looking for and knows how to contact the appropriate DataNodes. Then the DataNodes send that information directly back to the client running on your machine. That's how it operates.

This is the architecture of HDFS, the same thing we saw on the last slide, expanded a little bit. You have the NameNode, which stores the metadata: the file names, the replicas and their locations within the file system, which rack each DataNode is on (its rack ID), and so on. Then you have the DataNodes, which actually do the reads and writes. The client node can read and write data directly to the DataNodes, but the NameNode keeps track of where the data is sent and takes care of the metadata associated with it. The NameNode is the index, and the DataNodes hold the actual data. Both of them together make up the Hadoop Distributed File System, and both of them are important. Without the NameNode, your data is distributed everywhere: if you had only a single NameNode and you lost it, even though your data is still there on the DataNodes, there is no way to recover it without a lot of effort, because you don't know where the data is; you have lost the index of where things are. And if you lose a DataNode, of course, you lose the data itself. That's why you need replicas of the data. Usually, you have a standby copy of the NameNode and multiple replicas of the data itself.

Let's see a couple of simple examples of commands that you can run on HDFS, the Hadoop Distributed File System.
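To make that read path a bit more concrete, here is a tiny, purely illustrative Python sketch; it is not Hadoop code, and all of the class names, paths, and block IDs are made up. It only shows the division of labor described above: the NameNode is an index from file names to block locations, and the client then reads the blocks directly from the DataNodes.

```python
# Toy illustration of the HDFS read path; NOT real Hadoop code.
# The NameNode holds only an index (metadata); the DataNodes hold the data.

class DataNode:
    def __init__(self, node_id):
        self.node_id = node_id
        self.blocks = {}                 # block_id -> bytes held on this node

    def store(self, block_id, data):
        self.blocks[block_id] = data

    def read(self, block_id):
        return self.blocks[block_id]


class NameNode:
    def __init__(self):
        self.index = {}                  # file name -> list of (block_id, DataNode)

    def add_file(self, name, placements):
        self.index[name] = placements

    def locate(self, name):
        return self.index[name]          # returns metadata only, never file contents


# A "client" read: first ask the NameNode where the blocks live,
# then fetch each block directly from the DataNode that stores it.
d1, d2 = DataNode("datanode-1"), DataNode("datanode-2")
d1.store("blk_0", b"hello ")
d2.store("blk_1", b"world")

nn = NameNode()
nn.add_file("/user/demo/file.txt", [("blk_0", d1), ("blk_1", d2)])

content = b"".join(dn.read(bid) for bid, dn in nn.locate("/user/demo/file.txt"))
print(content.decode())                  # -> hello world
```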
If you know something about how to interact with the Linux environment, commands like ls and so on, you will see many of them repeated in similar form in the HDFS commands. It borrows a lot from the Linux operating system. Everything is prefixed by hdfs dfs, which just tells the client that this is an HDFS command to be run against the distributed file system. So if you want to check a particular folder, you say hdfs dfs -ls, meaning list, and then the path to the folder or file that you want to look at. It will give you all the properties associated with the file, just like a normal ls command. For making a directory, it is hdfs dfs -mkdir and then the folder. There is an alternative as well: instead of hdfs dfs, you can also say hadoop fs. Usually both of them work the same way, but for this class we'll stick with the hdfs dfs format.

Copying data into HDFS: let's say you have data in your local file system, not in HDFS, and you want to put it into the distributed file system. You can just say hdfs dfs -copyFromLocal (or -put), the path to the local file, and the path where you want to store it on HDFS. Once you store it on HDFS, it might not be on the same computer; it might be on a different computer. Especially if it's a huge dataset, the data itself might get chunked: part of the data might be on one node, and part of the data might be on another node, on another disk. That's how it does the distributed part of it.

Copying data from HDFS to local: you can do a -copyToLocal (or -get), which copies a particular file, or you can do a -getmerge on a path in HDFS, which will merge all the sub-files within that path and give you a single file. We'll see a couple of examples of using these commands when we work through some exercises. Delete is similar: hdfs dfs -rm, optionally with -f, and the file you want to delete. So, simple commands to check a directory or file, create a directory, copy data into and out of HDFS, and delete data. Some very simple HDFS commands.

Again, some terminology to remind you: the NameNode manages the Hadoop file system data, or the HDFS metadata, and the DataNode is where the data actually gets stored. You also have something called the JobTracker and the TaskTracker. These are for running a Hadoop or MapReduce job. The JobTracker is basically for scheduling jobs: each job is converted into multiple tasks, and the tasks are executed on individual DataNodes. The NameNode and the JobTracker are the masters and usually run on the master node, while the TaskTrackers usually run on the DataNodes. Here's an example. They can be on a single box or on separate boxes. Your command comes in here; that's a job, what you want to do with the data. You don't really know where the data is, so the job might get split into multiple tasks, which go to one or more of the worker nodes where the data lives. They execute there, and then the results are returned and come back to you. In some cases, if a TaskTracker or a node fails, then the master, the JobTracker, re-executes the task on a different node and gives you the results. That's how it operates.

Now we will look at the MapReduce computing paradigm. How does that work? We have now set up all the basics: we know what the HDFS file system is, we know the NameNode, DataNode, JobTracker, TaskTracker, and so on. What exactly is this MapReduce computing paradigm? You have your data distributed throughout, meaning across all these DataNodes.
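Putting those commands together, a session might look something like this; the paths and file names here are placeholders I am using for illustration, not anything specific to your cluster.

```sh
# List a folder and make a new directory (paths are placeholder examples)
hdfs dfs -ls /user/student
hdfs dfs -mkdir /user/student/demo

# Copy a local file into HDFS (-put and -copyFromLocal behave the same here)
hdfs dfs -put localdata.txt /user/student/demo/

# Copy a file back out of HDFS, or merge a directory of part files into one local file
hdfs dfs -get /user/student/demo/localdata.txt ./
hdfs dfs -getmerge /user/student/demo/output ./merged.txt

# Delete a file (-f suppresses the error if the file does not exist)
hdfs dfs -rm -f /user/student/demo/localdata.txt
```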
How does this MapReduce concept work? Here's a simple illustration of the paradigm. For example, you have these different datasets; all these cylinders you can think of as different disks on different nodes, so they are different computers. Now when you run a Map task, it runs on each of them. In this case, let's say they're counting which items are blue, which are green, and which are yellow, a very simple task, so each of them will say, "Hey, I have so many yellows, so many greens, so many blues." Then they shuffle after that: all the blues are put together and might go to a single node, and similarly all the greens and all the yellows. Then maybe you are doing a sum, counting them all together, and that's your result. That's the MapReduce computing paradigm. The first step, Map, runs on every individual node; then you have a shuffle to move the intermediate data to the nodes where the next computation will run, which might be one or a few selected nodes; and then you run a Reduce and you get the result. Essentially these can run in parallel: the Map step you can think of as running on all the nodes, while Reduce might run on a few. And you can have multiple rounds of Reduce in order to coalesce the data so that you get a single result.

Let's see a very simple and very common example of MapReduce, and what it was designed for, which is basically word count. As I said when introducing MapReduce, it was designed initially by Google. They do a lot of text processing in order to respond to your search queries and things like that, so it was designed to do word and text processing on large-scale data. Here's an example: let's say you have a corpus of text, and you're doing a simple word count. If this was the input text, your output should look something like this: the word "research" appeared four times, the word "student" two times, the word "GIS" three times, "geospatial" two times, "CGU" twice, and so on. That's how it should work.

Let's see how this actually works. You have the mapper; let's say you have two nodes where this data is. This is fictitious, of course. Let's say on node one you had this part of the data, and on node two you had this part. The first step is just counting: whenever the mapper sees a word, it attaches a one to it: research 1, student 1, GIS 1, and so on. Any word it sees, it says "that occurred one time"; that's the whole mapping task. The keys are the words I am seeing, and the value is just "one" every time I see them. That's what I output. That's a very simple map task, and we'll write a simple program to do that. Then there is a shuffle stage; this is done internally by Hadoop and the MapReduce framework, so you don't need to worry about it. It shuffles the data so that everything with the same key ends up on a single node, or on a few nodes if there are large amounts of data. Then you have the reduce stage, where you just sum them together. You had CGU 1 and CGU 1, so 1 plus 1, and you say 2. The reduce operation is simple: if you have the same key, it's just a summation, and that's it. A very simple example of MapReduce. You can also add a combiner before the reducer.
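As a sketch of what that mapper and reducer might look like in code, here is a minimal word-count pair written for Hadoop Streaming in Python. The lecture does not prescribe a language, and the file names mapper.py and reducer.py are just my own choices; the mapper emits a key-value pair of word and 1, and the reducer sums the values for each key after the shuffle has grouped them.

```python
#!/usr/bin/env python3
# mapper.py -- the Map step: emit "word<TAB>1" for every word seen.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- after the shuffle, input lines arrive sorted by key,
# so we can sum the counts for each word as we stream through them.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    word, count = line.rsplit("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

A job like this would typically be launched through the Hadoop Streaming jar (the exact jar path depends on your installation), along the lines of: hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input <hdfs input> -output <hdfs output>. Because the reduce here is a plain sum, the same script can also be passed as -combiner, which is exactly the combiner idea described next.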
In the first stage, you run a mapper on each node, and it just emits "one" every time it sees a word. You can add another step that says, "Hey, we have too many keys; if they share a common key, just add them together on the node itself." Those get added, and then you go to the reduce step. In the end you get the same result, because it's the same operation: you're doing addition, and it doesn't matter whether you compute 1 plus 1 plus 1 all at once or add 1 and 1 first and then add 1 more. Essentially you get the same result.

Now, what have we seen in this lecture? We saw what the Hadoop Distributed File System is and what its components are. We saw the basic MapReduce paradigm and a simple example of applying this MapReduce paradigm to word counting. Thank you.