Welcome to “Open Source and Big Data.” After watching this video, you will be able to: explain the role of open source in Big Data, describe platforms for coordinating open source projects, and describe the most popular open source frameworks.

What is open source software? A simple definition is that it is free. Not only is the runnable version of the code free, but the source code is also completely open, meaning that every line of code is available for people to view, use, and reuse as needed. However, a project becomes truly open source once it embraces the open-governance model. Open governance ensures that any contributor from any organization can propose changes to the code and that the overall project can be steered to serve the community’s needs. It is important to note that not all open source software is the same. The type of license associated with the software prescribes how you can use it. Before using any open source software, check which license it is distributed under and understand the permissions that the specific license type grants.

So why is the open source model used for Big Data? Open source projects are massive efforts. While many projects are started within one organization, truly open source projects persist beyond the efforts of any single organization, and these projects form the foundation of all modern Big Data infrastructure. Consider the Linux kernel. When Linus Torvalds first began developing the kernel as open source, there were many players in the operating system space, and many offered only proprietary versions of their code. It was not clear at the time who would ultimately emerge as the operating system standard. Nearly 30 years later, Linux is the standard for servers: this is true for almost every data server in every part of the world, regardless of hardware, software, or service provider. This happened because the project took on a life of its own, beyond any single company’s interest, and it persists beyond any one organization’s cycles of interest.

The open source development model is to software development what democracy is to government. It is the most transparent way to conduct a project, and it ultimately serves the will of the people participating in it. In some sense it is surprising that a completely open model ultimately proves to be the more profitable solution, but this has been shown time and time again, and the model is now embraced without controversy.

Most open source projects have formal processes for contributing code, and they define various levels of influence and obligation to the project: committer, contributor, user, and user group. Typically, committers can modify the code directly, while contributors submit their code for review by a committer before it is merged. Many more people are simply users of the code. Most, but not all, major open source projects belong to one of the major open source foundations and follow a similar governance procedure. Open source means that the code is freely visible, which is an important distinction from proprietary software. Open governance, however, is what makes a project truly open and democratic. Open source foundations prescribe best practices for open source development as well as open governance.

The biggest component of Big Data is, by far, the Hadoop project and its three main components: MapReduce, the Hadoop Distributed File System (HDFS), and the resource manager, YARN.
MapReduce is a framework that allows code to be written to run at scale on a Hadoop cluster. It is still used, but not as much as more modern Big Data computation frameworks like Apache Spark.

HDFS is the file system that stores and manages Big Data files. It handles the issues that come with large, distributed datasets, including resilience and partitioning, and it is still a mainstay of the industry: 70% of the world’s Big Data resides on HDFS. More modern approaches to distributed storage, such as object storage services like Amazon S3, are coming into use, but they are based on the design principles of HDFS.

YARN is the resource manager that ships with Hadoop, and it is the default resource manager for many Big Data applications, including Hive and Spark. It is one of the most robust resource managers in use today, but more modern container-based resource managers, such as Kubernetes, are slowly becoming the new de facto standard.

In summary, these are the main components of the Hadoop ecosystem, and most Big Data applications are built on top of them. The array of Big Data applications available to the user is dizzying; however, they all build upon the basic Hadoop framework or interact with it in some way. Frameworks like Hive and Spark support many ETL (Extract, Transform, Load) and computation tasks on Hadoop systems. One system that integrates tightly with the Hadoop ecosystem is Apache HBase, a large NoSQL datastore; it manages its own storage and computation resources outside of the core Hadoop components but often resides on the same cluster. Open source distributions like the Hortonworks Data Platform (HDP) provide a set of Big Data tools that are already configured to work together and include most of the important open source packages (Hadoop, Spark, Hive, HBase, and others).

In this video, you learned that open source runs the world of Big Data; that open source projects are free and completely transparent; that the biggest component of Big Data is, by far, the Hadoop project, including MapReduce, HDFS, and YARN; and that the open source ecosystem includes Big Data tools such as Apache Hive and Apache Spark.
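To make the MapReduce model described above concrete, here is a minimal word-count sketch in the style of Hadoop Streaming, where the mapper and reducer are plain Python scripts that read standard input and write standard output. The file names (mapper.py, reducer.py) and the idea of running them over a text file are illustrative assumptions, not part of the video.

```python
#!/usr/bin/env python3
# mapper.py -- emits "word<TAB>1" for every word it sees.
# With Hadoop Streaming, copies of this script run in parallel,
# one per input split stored on HDFS.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- sums the counts for each word.
# Hadoop Streaming sorts mapper output by key before it reaches the reducer,
# so all lines for the same word arrive consecutively.
import sys

current_word = None
current_count = 0

for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word = word
        current_count = int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

The same pipeline can be tested locally with a command of the form "cat input.txt | ./mapper.py | sort | ./reducer.py"; on a cluster, Hadoop distributes exactly this map, shuffle-and-sort, and reduce sequence across many machines.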
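Similarly, here is a hedged sketch of the kind of ETL step that a modern framework like Apache Spark runs over files stored in HDFS. The HDFS paths and column names ("hdfs:///data/raw/sales.csv", "amount") are hypothetical placeholders chosen for illustration.

```python
# A minimal PySpark ETL sketch: read raw CSV data from HDFS, filter and
# reshape it, and write the result back to HDFS as Parquet.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("open-source-etl-sketch")
    .getOrCreate()
)

# Extract: HDFS handles the distributed storage; Spark simply reads the path.
raw = spark.read.csv("hdfs:///data/raw/sales.csv", header=True, inferSchema=True)

# Transform: keep valid rows and add a derived column.
cleaned = (
    raw.filter(F.col("amount") > 0)
       .withColumn("amount_doubled", F.col("amount") * F.lit(2))
)

# Load: write the result as Parquet, a columnar format commonly queried with Hive.
cleaned.write.mode("overwrite").parquet("hdfs:///data/curated/sales_parquet")

spark.stop()
```

Submitted with spark-submit --master yarn, this script runs in containers that YARN allocates across the cluster; run with a local master, the same code works unchanged on a laptop, which is part of what makes these open source frameworks so widely adopted.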