Welcome to “What is Big Data?” After watching this video, you will be able to explain Big Data, identify the characteristics of Big Data, and explain the five V’s of Big Data. Bernard Marr, the Analytics, KPI, AI, and Big Data Guru, defines Big Data as the digital trace (or data) that we are generating in this digital era. To understand Big Data, we need to compare it with Small Data. Small Data is available in limited quantities that can be easily interpreted by humans with little or no digital processing. Sports scores and employee shift schedules are some examples of Small Data. Small Data is accumulated slowly and may or may not get updated continuously. Storing Small Data is relatively easy as it is mostly available in a structured format such as First and Last Name, Address, Gender, and so on, and it can be maintained within an enterprise’s own infrastructure or in a data center. Big Data, on the other hand, is generated in massive volumes and arrives with little or no structure. Examples of semi-structured data include social media posts that could be images accompanied by hashtags, while unstructured data could include medical records from millions of patients. Big Data is complex and requires specialized programs to interpret it and make it available for human consumption. Not only is Big Data massive in size, it is also collected continuously and grows exponentially in a short amount of time. Big Data could be in any form including but not limited to text, images, audio, and videos. Finally, Big Data is so voluminous that it is stored in the Cloud or on server farms set up specifically for this purpose. It is a common misconception that Big Data refers to just large volumes of data. In reality, Big Data is the entire life cycle of working with large volumes of data. Let’s take a look at each phase in the Big Data life cycle. Big Data collection is initiated as a result of a business problem or requirement. 
As data is collected, it gets stored using a framework for distributed storage such as Hadoop HDFS. To make sense of all the data collected, Map and Reduce tasks and scripts create a data model to store it in a database. This data model includes the various data entities (or objects), and the relationships and rules between these entities. After modeling, data is ready to be processed. Tools such as Apache Spark are used to produce meaningful information from the modeled data. Finally, the processed data is visualized and presented in a graphical format such as charts and graphs. This visualized data is then used for making meaningful business decisions and can lead to new business cases, thereby creating a continuous life cycle. The research firm Gartner defines Big Data as a high-volume, high-velocity, and/or high-variety information asset that demands cost-effective and innovative tools for processing. Let’s take a moment to appreciate just how huge Big Data volumes really are. You have probably heard of megabytes, gigabytes, terabytes, and possibly even petabytes. But Big Data can be even bigger. Did you know that you can store over 11 million movies in 4K resolution in just one exabyte of space? Now visualize this… modern computers generally ship with 1- to 5-terabyte hard drives. One zettabyte contains a billion terabytes, while a yottabyte contains one trillion terabytes. Doesn’t that boggle the mind? When we talk about Big Data, we traditionally talk about the four V’s of Big Data. These are: Velocity, Volume, Variety, and Veracity. Velocity is the speed at which data arrives. Volume is the increase in the amount of data stored over time. Variety is the diversity of data. Many forms of data exist and need to be stored. Veracity is the certainty of data. With a large amount of data available, how will we know if the data collected is accurate or inaccurate? These four main components are used to describe the dimensions of Big Data. 
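To make the storage figures mentioned earlier concrete, here is a minimal sketch of the unit arithmetic. It assumes decimal (SI) units, where each unit is 1,000 times the previous one; the video does not say whether decimal or binary units are meant. The names `UNITS`, `BYTES_PER`, `ratio`, and `implied_movie_size_gb` are hypothetical and chosen only for this illustration. The 11 million figure is taken from the video as given; the code only shows what average movie size that figure implies.

```python
# Decimal (SI) storage units; each is assumed to be 1,000 times the previous one.
UNITS = ["kilobyte", "megabyte", "gigabyte", "terabyte",
         "petabyte", "exabyte", "zettabyte", "yottabyte"]

# Size of each unit in bytes: kilobyte = 10**3, megabyte = 10**6, and so on.
BYTES_PER = {name: 10 ** (3 * (i + 1)) for i, name in enumerate(UNITS)}


def ratio(larger: str, smaller: str) -> int:
    """Return how many `smaller` units fit in one `larger` unit."""
    return BYTES_PER[larger] // BYTES_PER[smaller]


def implied_movie_size_gb(movies_per_exabyte: int) -> float:
    """Return the average movie size in gigabytes implied by a movies-per-exabyte figure."""
    return BYTES_PER["exabyte"] / movies_per_exabyte / BYTES_PER["gigabyte"]


if __name__ == "__main__":
    # Terabytes in one zettabyte (a billion) and in one yottabyte (a trillion).
    print(ratio("zettabyte", "terabyte"))
    print(ratio("yottabyte", "terabyte"))
    # Average 4K movie size implied by "over 11 million movies per exabyte".
    print(round(implied_movie_size_gb(11_000_000), 1))
```

Under these assumptions, the billion and trillion ratios follow directly from the powers of 1,000, and the movie figure works out to roughly 90 gigabytes per movie.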
Velocity signifies that data is being generated extremely fast, and the process never stops. Data must be processed quickly so that decisions can be made at the speed with which the data arrives. Velocity’s main attributes are: batch, close to real-time, and streaming. What are the drivers? Definitely improving connectivity and hardware. Just think about all the devices that are connected through the Internet today and all the super-fast response times. Big Data also supports upscaling of pre-calculated analysis. Volume is the increase in the amount of data stored. The amount of Big Data generated is vast compared to traditional data sources. Volume attributes of Big Data are: petabytes, exabytes, and zettabytes, to name just a few. Typical drivers of volume in Big Data are: the increase in data sources, higher resolution sensors, and scalable hardware infrastructure. Variety is the diversity of the data. Data is generated by people and processes through the use of machines, from both inside and outside an organization. Some of the data is structured and semi-structured, but most is unstructured. The main attributes are structure, complexity, and origin. Drivers of Variety in Big Data can be: mobile technologies, scalable hardware infrastructure, resilience, fault recovery, and efficient storage and retrieval. Veracity is the quality and origin of the data, and its conformity to facts and accuracy. Veracity matters because data comes from both within and outside an organization. Attributes include consistency and completeness, integrity, and ambiguity. Drivers of Veracity in Big Data are: cost and the need for traceability, robust ingestion, and extract, transform, load (ETL) mechanisms. Big Data has another V that must be considered. The fifth V of Big Data is Value. It is the outcome of making intelligent business decisions from leveraging the previous four V’s. 
The ultimate goal of an organization is to: produce value in the form of faster and smarter business decisions, increase the efficient use of resources, and discover new market opportunities. Big Data supports innovation and thus creates value. In this video, you learned that: Big Data is the digital trace that gets generated in this digital era. Big Data is a high-volume, high-velocity, and/or high-variety information asset that demands cost-effective and innovative tools for processing. The core features of Big Data are the four V’s: Velocity, Volume, Variety, and Veracity. Big Data creates a fifth V, Value, when collected, stored, and processed correctly.