Hello, and welcome! In this video, we’re going to show you how to create a box plot in the R programming language.

A box plot summarizes the distribution of numerical data once the values have been sorted. The first quartile is the point 25% of the way through the sorted data. In other words, a quarter of the data points are less than this value. Similarly, 75% of the points are less than the third quartile value. The interquartile range is simply the difference between the third and first quartiles. The median is effectively the second quartile, so 50% of the data falls below the median. The lower and upper whiskers extend from the box to cover values outside the interquartile range. And then we have the mean, or the average, of all the data points.

Let’s look at an example to see how this works. The dataset has 8 values, so to find the median, we’ll add the fourth and fifth elements of the sorted data, and divide by 2. Notice how half the elements are less than the median, and half are greater. We can perform a similar kind of calculation to get the first and third quartiles. Notice that 2 out of the 8 elements are less than the first quartile, and 6 out of the 8 elements are less than the third quartile. And in this example, the lower and upper whiskers simply extend to the minimum and maximum values.

For our box plot, we’re going to create a set of pseudo-random data points from the normal distribution. In order to reproduce the results, we’re going to fix the seed value for the random number generator. So, the data will appear random, but it will be the same every time the code is run. We’ll then create two sets of data, A and B, each with 200 samples. Set A is sampled from the normal distribution with mean 1 and standard deviation 2. Set B has mean 0 and standard deviation 1. We’ll place these sets into a data frame, making sure to separate them by label. You can take a look at the data frame here. The head function shows the first six rows, and the tail function shows the last six.
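
The data setup described above could be sketched roughly as follows. This is only an illustration: the seed value (1234) and the column names `label` and `value` are assumptions, since the narration does not state them.

```r
# Fix the seed so the pseudo-random draws are identical on every run.
# The specific seed value is an arbitrary choice for this sketch.
set.seed(1234)

n <- 200  # samples per set

# Set A: normal distribution with mean 1, standard deviation 2.
set_a <- rnorm(n, mean = 1, sd = 2)
# Set B: normal distribution with mean 0, standard deviation 1.
set_b <- rnorm(n, mean = 0, sd = 1)

# Stack both sets into one data frame, recording which set each value
# came from. The column names "label" and "value" are hypothetical.
df <- data.frame(
  label = rep(c("A", "B"), each = n),
  value = c(set_a, set_b)
)

print(head(df))  # first six rows, all from set A
print(tail(df))  # last six rows, all from set B
```

Because set A is stacked before set B, the first 200 rows carry the label A and the last 200 carry the label B.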
Notice how the numbers are grouped together by their original set.

We’re going to be using the ggplot2 and Plotly packages. ggplot2 allows you to create highly customizable, aesthetically pleasing data visualizations. Plotly provides online graphing, analytics, and statistics tools for individuals and teams, as well as a scientific graphing library. Since these are external packages, you must install them if you haven’t already done so, and then run the “library” command to load them in your code.

In order to create a box plot with the data sets, we’re going to use the code here. We’ll call the ggplot function with our data frame as input, setting the x axis to display the labels, and the y axis to display the range of numbers. To get a box plot as output, we need to add the “geom_boxplot” function at the end. And finally, the ggplotly function will convert the result into an interactive Plotly graph and display it. Values that lie more than 1.5 times the interquartile range beyond the edges of the box are considered outliers. These fall past the ends of the whiskers and are represented as dots.

Let’s practice with the real-world mtcars dataset, which holds data about automobile models from 1973 to 1974. Since the dataset is built into R, you can start referencing it in your code without any import statements. We’re going to work with the first two variables in the dataset: miles per gallon (mpg), and number of cylinders (cyl).

Let’s first create a box plot using qplot. qplot is a basic function in the ggplot2 package that’s simple to use, but still capable of creating expressive plots. The x axis will hold the number of cylinders. Since the number of cylinders is more of a category than a numerical attribute, we’ll apply the “factor” function. The cars are either 4-cylinder, 6-cylinder, or 8-cylinder. For the y axis, we’ll use the miles per gallon data. The dataset we’re using is mtcars, and for the geometry, we need to specify that we’re creating a box plot. And you can see we get a box plot for each cylinder category.
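
The two plotting steps described above might look roughly like the sketch below. It assumes ggplot2 is installed. The seed and the column names `label` and `value` are hypothetical, and so are the object names `p_sets`, `p_interactive`, and `p_quick`. The Plotly step is guarded so the sketch still runs when that package is missing. Recent ggplot2 releases mark qplot as deprecated, so that call may print a warning.

```r
library(ggplot2)

# Rebuild the two sample sets so this block runs on its own.
# Seed and column names are assumptions made for this sketch.
set.seed(1234)
df <- data.frame(
  label = rep(c("A", "B"), each = 200),
  value = c(rnorm(200, mean = 1, sd = 2), rnorm(200, mean = 0, sd = 1))
)

# Labels on the x axis, values on the y axis, drawn as box plots.
p_sets <- ggplot(df, aes(x = label, y = value)) +
  geom_boxplot()
print(p_sets)

# ggplotly() converts the ggplot object into an interactive Plotly graph.
# Only attempted when plotly is installed; shown only in interactive use.
if (requireNamespace("plotly", quietly = TRUE)) {
  p_interactive <- plotly::ggplotly(p_sets)
  if (interactive()) print(p_interactive)
}

# Quick version for mtcars: cylinder count (as a category) against mpg.
p_quick <- qplot(factor(cyl), mpg, data = mtcars, geom = "boxplot")
print(p_quick)
```

Wrapping `cyl` in `factor()` is what produces one box per cylinder count. Without it, the numeric column would be treated as a continuous x axis.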
You can even start to see the relationship between cylinder count and miles per gallon: cars with more cylinders tend to get fewer miles per gallon.

We can also use the ggplot function, which is more customizable. The second argument, built with aes, specifies the aesthetic mappings between variables in the data and the visual properties of the plot. Once again, we’re going to see how “cylinder count” relates to the “miles per gallon” value. After adding the “geom_boxplot” function, we can take a look at the box plot in the output.

By now, you should understand the structure of a box plot, as well as how to create one in the R programming language. Thank you for watching this video.
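
For reference, the ggplot version of the mtcars box plot described above might be written along these lines. It is a sketch that assumes ggplot2 is installed. The axis labels added with `labs` and the median summary at the end are not part of the narration, and the object names `p_cyl` and `medians` are hypothetical.

```r
library(ggplot2)

# mtcars ships with R, so no import step is needed.
# First argument: the data. Second argument: the aesthetic mappings,
# here cylinder count (as a category) on x and miles per gallon on y.
p_cyl <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot() +
  labs(x = "Number of cylinders", y = "Miles per gallon")  # optional labels
print(p_cyl)

# A numeric check of the trend the plot shows:
# the median miles per gallon for each cylinder count.
medians <- aggregate(mpg ~ cyl, data = mtcars, FUN = median)
print(medians)
```

The median line inside each box corresponds to one row of `medians`, which makes it easy to compare the plot against the underlying numbers.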