The queries in real life are little more complex than what we have seen before. So let's consider a more complex query. Let’s find bars where the price of Miller is the same as or less than what the great American bar called TGAB here charges for Bud. You may say, but we do not really know what TGAB charges for Bud. That's correct, so we can break up the query into two parts. First, we will find this unknown price, and then we'll use that price to find the bars that would sell Miller for the same price or better price. Now this is a classic situation where the result from the first part of the query should be fed as a parameter to the second query. Now this situation is called a subquery. We write this in SQL as shown in the slide here. What makes this query different is that the part where price is less than equal to, instead of specifying a constant like $8, we actually place another query which computes the price of a Bud at TGAB. The shaded part is called the inner query, or the subquery. In this case, both the outer query and the inner query use the same relation, which is Sells. Now in terms of evaluation, the inner query is evaluated first, and the outer query uses its output. Now while it may not be obvious at this time, notice that the inner query is independent of the outer query. In other words, even if we did not have the outer query, we can still evaluate the inner query. We say in this case that the subquery is uncorrelated. Let's look at another example of a subquery. In this example, we want to find the name and manufacturer of each beer that Fred didn't like. So how do we know what Fred didn't like? We do however know that the set of beers that Fred likes because they are listed in the Likes relation. So we need to subtract this set from the total set of beers that the company has recorded. This subtraction of sets can be performed in several ways. One of them is to use the NOT IN construct. So the query class's job is to take every name from the Beers table and output it only if it does not appear in the set produced by the inner query. Similar to the previous query, the subquery here is also uncorrelated. Now this is a more sophisticated query. The intention is to find beers that are more expensive than the average price of beer. But since beers have different prices in different bars, we have to find the average for every bar. Therefore the idea is to find the average price of beer for every bar and then compare the price of each beer with respect to this average. Now look at the query and the table. Let's assume we are processing the first table. The beer is Bud, and the price is $5. Now we need to know if $5 is greater than the average price of beer sold at HGAT. To do this, we need to compute the inner query, okay? So now let's look at the fourth row. The price of Guinness needs to be compared to that average again for HGAT. In fact for every table processed by the outer query, one needs to compute the inner query for that bar. This makes the inner subquery correlated with the outer query. Now a smart query processor will store the average once it's computed and then reuse the stored value instead of computing over and over again. What's an aggregate query? Let's use a simple example of finding the average price of Bud. This is like a simple select project query with the additional aspect that it takes a list of price values of Bud from different bars and then computes an average. In the example shown, the average of the five prices is 4.2. In other words the average function, the AVG function, takes a list of values and produces a single value. Now there are many functions that have this behavior. The SUM function takes a list of values and adds them up to produce a single value. The COUNT function takes a list of list of values and counts the number of items in that list and so on. These are called aggregate functions. Now if we wanted to count only the price values that are different, that is 3, 4, and 5 just once, we can write the SELECT clause a little differently. We would say that the average is over distinct values of price which in this case will result in 4. You should recognize that most analytical operations need to use statistical functions which are aggregates. So another important analytical requirement is computing the statistical aggregate by groups. For example, we often compute the average salaries of employees per department. Now back to our example here, we want to find the average price paid for Bud per drinker, where we know that a drinker visits many bars. So the grouping variable here is drinker. So we have three attributes at play, price which we need to aggregate, drinker which we need to group by, and bar which is a join attribute. The fourth attribute, namely beer, is used for selection and does not participate in grouping. So after the selection we will get an intermediate relation containing drinker, bar, and price. With this, the GROUP BY operation will create one result row for each drinker and place the average price over all such rows. Now how does GROUP BY and aggregate computation work when the data is partitioned? Let's take the same query. We are looking for the average price of Bud grouped by drinker. But this time the result of the selection are in two different machines. Imagine that this time they are range partitioned by row numbers, which we have not shown to maintain clarity. Now with the GROUP BY operation the data will get repartitioned by the grouping attribute, that's drinker. And then the aggregate function is computed locally. To accomplish this repartitioning task, each machine groups its own data locally, determines which portions of data should be transmitted to a different machine, and accordingly ships it to that machine. Now there are several variants of this general scheme which are even more efficient. Now if this reminds you of the map operation you saw in your previous course, you are exactly right. This fundamental process of grouping, partitioning, and redistribution of data is inherent in data-parallel computing and implemented inside database systems.