Welcome to “Introduction to Hadoop.” After watching this video, you will be able to: define Hadoop, explain the history of Hadoop, list reasons why Hadoop was the answer to Big Data processing, and outline some of the challenges of Hadoop.

Hadoop is an open-source framework used to process enormous datasets. Imagine this scenario: you have one gigabyte of data that you need to process, perhaps customer sales records or survey responses. The data is stored in a relational database on your desktop computer or laptop. Neither device has a problem handling this data set, and you can query the data with ease.

You have a great year, and you pick up a few more clients. Your company starts growing very quickly, and that data grows to ten gigabytes, then 100 gigabytes. You pick up a big client, a retailer perhaps, with multiple databases. Maybe you are a mine manager, and a new executive decision means that you must now start processing data from equipment: front-end loaders, excavators, and haul trucks. You also have data about the environment, data about soil and weather, and data about safety and rock samples. Or perhaps your company is a delivery service, and you attach sensors to your entire fleet. Overnight, your clients, drivers, and their vehicles all start gathering data that you now need to process!

When your data grows from 1 terabyte to 10 terabytes, and then 100 terabytes, you are again quickly approaching the limits of your local computers. And maybe you need to process unstructured data coming from sources like Facebook, Twitter, sensors, and so on. The senior leadership team at your company wants to derive insights from both the relational data and the unstructured data in order to make informed decisions, and they want this information as soon as possible. How can you accomplish this task? Hadoop might be the answer.

Now let’s rewind! Before we use Hadoop to solve these problems, let’s understand what Hadoop is. Hadoop is a set of open-source programs and procedures that can be used as the framework for Big Data operations. It is used for processing massive amounts of data in distributed file systems that are linked together, and it allows applications to run on clusters. A cluster is a collection of computers working together at the same time to perform tasks. It should be noted that Hadoop is not a database but an ecosystem that can handle processes and jobs in parallel or concurrently. Hadoop is optimized to handle massive quantities of data, which could be structured, tabular data; unstructured data, such as images and videos; or semi-structured data, using relatively inexpensive computers.

In the 1990s, coming into the new millennium, the web grew significantly to millions of pages with different structures of data. Automation was needed to handle the differences in data structure and the web searches simultaneously. In 1999, the Apache Software Foundation was established as a non-profit. In 2002, the Nutch web search engine was created by Doug Cutting and Mike Cafarella. Nutch was created on the premise that it could handle multiple tasks across different computers at the same time, while storing and processing the data in a distributed way, so that the most relevant search results would be returned faster. In 2006, Cutting joined Yahoo, bringing the Nutch project with him, and the project was divided into a web crawler and a distributed processing segment. The distributed processing segment was called Hadoop, and in 2008, Yahoo released Hadoop to the Apache Software Foundation.
Data is now measured in petabytes and exabytes, and Big Data is the term used to describe the complexity of that data. Hadoop has individual components for storing and processing data, and the term Hadoop is often used to refer to both the core components of Hadoop and the ecosystem of related projects.

The core components of Hadoop include: Hadoop Common, an essential part of the Apache Hadoop framework that refers to the collection of common utilities and libraries that support the other Hadoop modules. There is a storage component called the Hadoop Distributed File System, or HDFS, which handles large data sets running on commodity hardware. Commodity hardware is low-specification, industry-grade hardware, and HDFS can scale a single Hadoop cluster to hundreds, and even thousands, of such machines. The next component is MapReduce, the processing unit of Hadoop and an important core component of the framework. MapReduce processes data by splitting large amounts of data into smaller units and processing them simultaneously. For a while, MapReduce was the only way to access the data stored in HDFS; there are now other systems, like Hive and Pig. And the last component is YARN, which is short for “Yet Another Resource Negotiator.” YARN is a very important component because it allocates the RAM and CPU for Hadoop to run data in batch processing, stream processing, interactive processing, and graph processing, all over data stored in HDFS. (Short code sketches illustrating HDFS, MapReduce, YARN, and Hive follow the summary at the end of this video.)

The drawbacks of Hadoop could not go unnoticed by developers. Hadoop contains many smaller components and, although efficient at first glance, it struggles with certain simple tasks. Hadoop is not good for processing transactions, due to its lack of random access. Hadoop is not good when the work cannot be done in parallel or when there are dependencies within the data; dependencies arise when record one must be processed before record two. Hadoop is also not good for low-latency data access. “Low latency” allows only small delays, unnoticeable to humans, between an input being processed and the corresponding output, providing real-time characteristics. This can be especially important for Internet services such as trading, online gaming, and Voice over IP. Hadoop is also not good for processing lots of small files, although there is work being done in this area, such as IBM’s Adaptive MapReduce. Lastly, Hadoop is not good for intensive calculations on little data.

To deal with the shortcomings of Hadoop, new tools like Hive were built on top of it. Hive provided an SQL-like query language and gave users strong statistical functions. Pig was popular for its multi-query approach, which cuts down the number of times the data must be scanned.

In this video you learned that: Hadoop is an open-source framework for Big Data; the core components of Hadoop are HDFS, MapReduce, YARN, and Hadoop Common; and, for some workloads, the drawbacks of Hadoop outweighed the benefits.
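To make HDFS concrete, here is a minimal Java sketch that lists a directory and reads a file through Hadoop’s standard org.apache.hadoop.fs API. The NameNode address, directory, and file name are hypothetical placeholders, not values from this video; in a real deployment the address would normally come from core-site.xml:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical NameNode address for illustration only.
            conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");
            FileSystem fs = FileSystem.get(conf);

            // List the contents of a (hypothetical) directory.
            for (FileStatus status : fs.listStatus(new Path("/data/sales"))) {
                System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
            }

            // Read the first line of a file. HDFS splits the file into blocks
            // replicated across commodity machines, but the API hides that detail.
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(fs.open(new Path("/data/sales/records.csv"))))) {
                System.out.println(reader.readLine());
            }
        }
    }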
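To make MapReduce’s split-and-combine model concrete, here is the classic word-count job, closely following the canonical example from the Apache Hadoop documentation. The map phase tokenizes input splits in parallel and emits (word, 1) pairs; the reduce phase sums the counts for each word:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        // Map phase: each input split is processed in parallel;
        // every word in a line is emitted as (word, 1).
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable one = new IntWritable(1);
            private final Text word = new Text();
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, one);
                }
            }
        }

        // Reduce phase: all counts for the same word arrive together and are summed.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) sum += val.get();
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

In practice you would package this class as a JAR and submit it to the cluster with the hadoop jar command, passing the HDFS input and output paths as arguments.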
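YARN itself is usually invisible to application code, but its client API can be queried directly. As a minimal sketch, assuming a reachable ResourceManager configured in the cluster’s yarn-site.xml, the following Java program asks YARN for the list of currently running applications:

    import java.util.EnumSet;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.yarn.api.records.ApplicationReport;
    import org.apache.hadoop.yarn.api.records.YarnApplicationState;
    import org.apache.hadoop.yarn.client.api.YarnClient;

    public class YarnStatus {
        public static void main(String[] args) throws Exception {
            YarnClient yarnClient = YarnClient.createYarnClient();
            yarnClient.init(new Configuration());
            yarnClient.start();
            // Ask the ResourceManager, which negotiates RAM and CPU across the
            // cluster, for every application currently in the RUNNING state.
            for (ApplicationReport app :
                    yarnClient.getApplications(EnumSet.of(YarnApplicationState.RUNNING))) {
                System.out.println(app.getApplicationId() + "  " + app.getName());
            }
            yarnClient.stop();
        }
    }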
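Finally, as a sketch of how Hive exposes an SQL-like interface on top of Hadoop, here is a minimal Java program that runs a query through the Hive JDBC driver. The HiveServer2 address, credentials, and the sales table are hypothetical placeholders:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQueryExample {
        public static void main(String[] args) throws Exception {
            // Requires the hive-jdbc driver (org.apache.hive.jdbc.HiveDriver)
            // on the classpath; the endpoint and table below are hypothetical.
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver.example.com:10000/default", "user", "");
                 Statement stmt = conn.createStatement();
                 // Hive compiles this SQL-like query into jobs that run over
                 // data stored in HDFS.
                 ResultSet rs = stmt.executeQuery(
                     "SELECT region, AVG(amount) FROM sales GROUP BY region")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "  " + rs.getDouble(2));
                }
            }
        }
    }

Because Hive does the translation to cluster jobs behind the scenes, users can write familiar SQL instead of MapReduce code, which is exactly the shortcoming it was built to address.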