[MUSIC] Today we will be discussing how to do big data geospatial processing with Hadoop and Spark. The outline of this module has four lectures, and we will start with a brief introduction to Hadoop and why we need it.

Today we have a large amount of geospatial data coming in constantly. Anything with GPS, even your cell phone and all the devices you carry, is generating geospatial data. With this huge volume of data being generated, it is a real challenge to deal with this big data. The Internet of Things, your cell phone, your iPad, your tablet, all of these generate geospatial data whenever you use them.

So why is geospatial or spatial data important? For example, when you tweet, you have a location. Using the large volume of people who tweet, we can get a good idea about the movement and flows of people in aggregate. Similarly, New York City released its taxi cab data, and you can see the movements of taxis. We will see some examples of how this data can be processed and the results visualized.

Some of the challenges are that, because of its large volume, velocity, and variety, this data is messy and noisy, and traditional GIS software has not really been able to handle these kinds of datasets. Some packages are moving in that direction, but most of them cannot handle datasets of this volume. So there is a need for cyberGIS in order to efficiently manage and process such data.

We will see a couple of examples. I mentioned Twitter data. One of the things we did was take Twitter data and create an aggregate picture of the movement of people throughout the United States. You could do keyword-based scenarios, but you could also look at longer-range patterns, and we can use Apache Hadoop for processing such volumes of data. Here is an example of what the output of this movement analysis might look like. At the US scale, you can see large volumes of people moving between the big cities. As you zoom in, you will see different movement patterns, with regional as well as local variations. And within a city, you might be able to see how the different areas are interconnected, where people work and where people stay, based on when they are tweeting. So we could do this kind of analysis on the data, and this analysis used Apache Hadoop.

So what is Apache Hadoop? Apache Hadoop is essentially a scalable, fault-tolerant, distributed data processing framework. The data is distributed among many nodes, and the processing is done on these distributed datasets. The data in Hadoop is typically stored in what is called HDFS, the Hadoop Distributed File System. There is a software stack which helps handle failure, which makes it fault-tolerant: if one of the data nodes goes offline, there is usually a replica of the data, and the Hadoop system knows where the other copy is and is able to use it. So fault tolerance is built in. Hadoop has also been designed to run on commodity clusters; people install it on commodity hardware, or even repurpose old, unused desktops, to stand up a Hadoop cluster. But the big advantage of Hadoop is that it is a distributed file system.
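To make the Twitter movement example above a little more concrete, here is a hypothetical, simplified sketch of the kind of aggregation such an analysis performs. It is not the actual pipeline from the lecture; it assumes each geotagged tweet has already been reduced to a (user, timestamp, city) record, and it simply counts city-to-city transitions per user, which is the raw material for a flow map.

```python
from collections import defaultdict
from itertools import groupby

# Hypothetical, simplified tweet records: (user_id, timestamp, city).
# In a real analysis these would come from billions of geotagged tweets.
tweets = [
    ("u1", 1, "Chicago"), ("u1", 2, "Champaign"), ("u1", 3, "Chicago"),
    ("u2", 1, "New York"), ("u2", 5, "Boston"),
    ("u3", 2, "Chicago"), ("u3", 4, "New York"),
]

def city_flows(records):
    """Count origin -> destination transitions in each user's tweet sequence."""
    flows = defaultdict(int)
    # Group tweets by user, then walk each user's tweets in time order.
    for _, group in groupby(sorted(records), key=lambda r: r[0]):
        cities = [city for _, _, city in sorted(group, key=lambda r: r[1])]
        for origin, destination in zip(cities, cities[1:]):
            if origin != destination:
                flows[(origin, destination)] += 1
    return flows

for (origin, destination), count in city_flows(tweets).items():
    print(f"{origin} -> {destination}: {count}")
```

At Twitter scale, this same counting would not run as a single in-memory loop on one machine; it would be expressed as distributed jobs on Hadoop, which is exactly the shift described next.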
Generally, when you do computation and data processing, you have a powerful machine: you bring all the data to that machine, you do the computation on that machine, and then you take the results back. That machine might be a laptop or desktop, or it might be a supercomputer. With Hadoop, this paradigm is reversed. The data is distributed; it sits in various places. Since the volume of data is huge, it is impractical, and in some cases even impossible, to bring the data to a central location, do the computation on it, and then get the results out. So what Hadoop does is bring the computation to the data: you have small functions, small processes, which run locally on the data without needing to know the global picture, and their outputs are then put together to give you the result. We will see examples of how this works, which should make it clearer.

So when do you need to use Hadoop? Typically, when the data is too big to fit in the memory or on the hard disk of a particular machine, it is probably time to use a distributed file system or a distributed model, and Hadoop is one of them. You should also think about whether the task can be decomposed into batches, so that the same computation can be run on one part of the data to give a valid partial result, run again on a different chunk of data on a different computer to give another valid partial result, and then you have some way of combining these partial results to come up with the final result. That kind of paradigm is referred to as the MapReduce computing paradigm, and if a problem can be modeled that way, it is generally suitable for Hadoop. The other consideration is that scalability should be your major concern, not interactivity. If you have people sitting at a keyboard trying to do some processing and you want to give them immediate results, you probably have to think about a different part of the ecosystem, such as in-memory Spark, which we will look at later. Traditional batch-processing Hadoop is not designed for interactivity; it is designed for scalability.

Looking at the origins of MapReduce and Hadoop: MapReduce was developed at Google, and there is a famous paper which introduced the concept. Hadoop is an open-source distributed framework that is part of the Apache Software Foundation; you can think of it as an implementation of the MapReduce concept which is open source and heavily used.

So what is this distributed framework I keep talking about? You have a big data source, too big to fit on a single hard disk or a single machine. What you do is distribute the data among, say, n nodes. Sometimes this data is replicated, so that if one of the nodes crashes, you still have other nodes on which the data is stored and available.
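Before turning to the challenges of such a setup, here is a minimal sketch of what the MapReduce paradigm just described looks like in practice, written in the style of the Hadoop Streaming API that we will use later in the module. The example is the standard word count, and the script names are just placeholders: the mapper emits a key and a partial value for every word it sees, and the reducer combines the partial results for each key.

```python
#!/usr/bin/env python3
# mapper.py -- a minimal Hadoop Streaming mapper for word count.
# Hadoop Streaming feeds each input split to this script on stdin and
# expects tab-separated "key<TAB>value" lines on stdout.
import sys

for line in sys.stdin:
    for word in line.split():
        # Emit a partial result: this word was seen once.
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- a minimal Hadoop Streaming reducer for word count.
# The framework sorts the mapper output by key, so all counts for the
# same word arrive on consecutive lines; we just sum them up.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, 0
    current_count += int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

The framework takes care of splitting the input, sorting the mapper output by key, and feeding each reducer its share; the exact command for submitting the two scripts as a streaming job depends on the cluster's Hadoop installation.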
So what are some of the challenges? Since you have a distributed infrastructure, you have communication and coordination between the nodes. You have to make sure you can communicate between the nodes and coordinate between them: you know where the replicas are, you know how many replicas are stored, and you have a good understanding of where the data is when you are looking for it. You also have to be ready for failure and recovery. None of these nodes is especially resistant to failure; they are commodity hardware, so they can fail. Disks have some mean time between failures, so some disks will fail and go away, and your system has to be designed to recover from those kinds of failures.

Other challenges of working in this environment include debugging. When you write programs, it is really difficult to know where they are actually being run. How do you validate your results? How do you know the job was run on the right datasets? So debugging is more challenging than in traditional programming. Monitoring, and finding out what is happening to a particular task, is also difficult. For example, if you schedule a task on a node and that node goes down, Hadoop has built-in mechanisms to move the task to a different node for the computation, but as a user you might be thinking, "Hey, what happened? Why is my computation taking so long?" These are some of the challenges associated with the system.

As for its characteristics, we have already seen some of them. Hadoop has a distributed file system, the Hadoop Distributed File System or HDFS, with data replicated across multiple nodes; the nodes themselves are independent, and the data is typically replicated two or three times, though that is a setting an administrator can change. It has fairly easy-to-use MapReduce interfaces, such as Pig and the Hadoop Streaming API, which let you directly use the MapReduce paradigm within a Hadoop installation. It is scalable, so even with large amounts of data you are able to use it easily. And it gives you, in some ways, built-in parallel execution of mappers and reducers. What do I mean by that? Since your data itself is distributed and your computation goes to the data, the computation on each of these chunks or nodes is done independently of the others, in parallel. So you get some inherent parallelism: mappers and reducers can be spread across multiple nodes, and you get fault-tolerance characteristics as well. The good thing is that, as a programmer, you do not need to worry about all the details of how to handle faults, how to handle scalability, or how to handle replication; that is all taken care of by the Hadoop infrastructure.
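Hadoop handles this distribution and fault handling for you, but a small local simulation, using nothing beyond the Python standard library, shows why independent mappers plus a merge-style reducer are enough: each mapper produces a valid partial count from its own chunk, and the partial counts can be merged in any order to give the same final answer. The chunks and the word counting here are just a toy stand-in for real HDFS blocks and real map tasks.

```python
from collections import Counter
from functools import reduce
from multiprocessing import Pool

# Toy stand-in for data blocks stored on different HDFS nodes.
chunks = [
    "hadoop brings computation to data",
    "data is replicated across nodes",
    "mappers run on data in parallel",
]

def mapper(chunk: str) -> Counter:
    """Produce a valid partial result from one chunk, with no global view."""
    return Counter(chunk.split())

def reducer(left: Counter, right: Counter) -> Counter:
    """Merge two partial results; the merge order does not matter."""
    return left + right

if __name__ == "__main__":
    # Each mapper runs independently (here, in separate local processes),
    # just as Hadoop runs mappers on the nodes that already hold the data.
    with Pool() as pool:
        partial_counts = pool.map(mapper, chunks)
    totals = reduce(reducer, partial_counts, Counter())
    print(totals.most_common(3))
```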
So what are some of the main components? You have the common libraries, Hadoop Common, the common utilities that support the other Hadoop modules; this is the set of shared pieces Hadoop is built on. Then you have, of course, HDFS, the Hadoop Distributed File System, which allows high-throughput access to data: your data is spread across multiple locations, and you are able to access it in a high-throughput, inherently parallel, you could even say embarrassingly parallel, fashion. And you have something called YARN, which you can think of as the resource manager: it does the job scheduling and the cluster resource management. Think of Hadoop as managing this huge cluster, which has a lot of nodes with data distributed across them. When you run a computation, how do you know where to run your jobs and where the data is? YARN takes care of scheduling these computational tasks on the right nodes and managing the resources of the cluster. So when you have multiple people submitting Hadoop jobs, YARN is the component directing them to the right resources, or the right nodes, and making sure the nodes are not overwhelmed by tasks. Finally, Hadoop MapReduce is essentially the YARN-based parallel processing of large datasets.

You can think of the architecture of Hadoop like this: HDFS, the file system, provides redundant, reliable storage; it keeps multiple copies, which is why it is redundant. On top of that you have the cluster management system, YARN, which schedules the jobs. And on top of that you have various engines such as MapReduce, which is the batch-processing model, or in-memory Spark, and other tools built on the same foundation.

The Hadoop ecosystem is really huge, with a large number of components; here is a small sample. You have something called ZooKeeper, which handles configuration management and coordination for the system. You have HDFS, of course, and log management tools. You have the MapReduce-style frameworks such as Pig. And you have Ambari on top, which you can think of as cluster management, giving administrators an easy way to build and manage the cluster. So think of Hadoop not as just one piece of software, but as an ecosystem with various components.

In this class, we will mainly look at HDFS, the Hadoop Distributed File System; we will look at MapReduce and how to write a MapReduce task; we will look at Pig and at writing streaming jobs that run on a Hadoop cluster; and we will also look at in-memory Spark. Those are the things we will be covering in this class.

To wrap up this first lecture, what did we look at? We had an introduction to Hadoop: what Hadoop is and what its characteristics are; the distributed file system it provides, which is the Hadoop Distributed File System or HDFS; a simple MapReduce paradigm; and some of the components which form the Hadoop ecosystem. Thank you.