Next we'll consider queries where two tables are used. Let's consider the query: find the beers liked by drinkers who frequent The Great American Bar. For this query we need the relations Frequents and Likes. Now look at the schemas of these relations in the light blue box. They have a common attribute called drinker, so whenever we use the attribute drinker, we need to tell the system which relation we are referring to.

Now look at the SQL query (it is written out in full after this walkthrough). The FROM clause of the query has these two relations. To handle the common-attribute-name issue, we give nicknames, or aliases, to the relations: in the FROM clause we say that Likes has the alias L and Frequents has the alias F. Since we want to find beers, we use SELECT DISTINCT on beer as before; as we saw earlier, SELECT DISTINCT avoids duplicates in the result. The WHERE clause has two kinds of conditions. The first kind is a single-table condition, in this case bar = 'The Great American Bar' on the Frequents relation. The second kind is a join condition, which says that the drinker attribute of the Frequents relation must equal the drinker attribute of the Likes relation. We encode this in SQL in the last line of the query using the aliases. Why did we not say L.beer in the SELECT clause, or F.bar in the first condition? We could have, and the query would have been equally correct. But we are using a shortcut, because we know that these attribute names are already unambiguous in the query.

Now let's look at the query again, this time from the viewpoint of evaluating it. There are many ways to evaluate the query, but the way it is most likely to be evaluated is this. The system first looks at the tables that have single-table conditions, so it performs a select operation on the Frequents table to find the records where the bar attribute equals 'The Great American Bar'. Why is this strategy good? Because the selection operator reduces the number of tuples to consider: if there are a thousand tuples in the Frequents relation and maybe 60 of them match the desired bar, then in the next step we deal with far fewer than a thousand records.

All right, the next step is a join with the Likes relation. A join requires two relations and a join condition, and the join condition comes from the query. The first relation, shown with an underscore symbol here, is the result of the previous operation. Another way of saying this is that the result of the selection is piped into the join operation: we do not create an intermediate table from the result of the selection; its results are supplied directly to the next operator, which in this case is the join. The result of the join operator is an intermediate structure with the beer column from the Likes relation and the drinker column from the Frequents tuples we have processed. This intermediate set of tuples is piped to the project operation, which picks out the beer column. Finally we process the DISTINCT clause for duplicate elimination, and the result goes to the output.

We have already seen how select-project queries on single tables are evaluated when the tables are partitioned across several machines. We'll now see how join queries are processed in the same setting. For our case, consider that the Likes and Frequents tables are on two different machines. In the first part of the query, the selection happens on the machine with the Frequents table.
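Here is the query we have been describing, written out in full. This is a reconstruction from the description above; the schemas Frequents(drinker, bar) and Likes(drinker, beer) are assumed from the slide.

    -- Beers liked by drinkers who frequent The Great American Bar.
    -- Assumed schemas: Frequents(drinker, bar), Likes(drinker, beer).
    SELECT DISTINCT beer                    -- beer is unambiguous, so no alias needed
    FROM   Likes L, Frequents F             -- aliases for the two relations
    WHERE  bar = 'The Great American Bar'   -- single-table condition on Frequents
      AND  F.drinker = L.drinker;           -- join condition, written with the aliases

The likely evaluation order, as described above: select the matching Frequents tuples first, pipe them into the join with Likes, project the beer column, and finally eliminate duplicates.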
The output of this selection is a smaller table with the same schema as Frequents, that is, with drinkers and bars. Now we define an operation called a semijoin, for the situation where we need to move data from one machine to another. The goal of the semijoin operation is to reduce the cost of data movement, in our case the cost of moving data from the machine that has the Frequents data to the machine with the Likes data. The cost is reduced if we ship less data. The way to do that is to first find which data the join operation actually needs. Clearly, it needs only the drinker column and not the bar column. So the drinker column is projected out, and just this column is transmitted to the second machine. Finally, the join is performed by keeping only the Likes tuples whose drinker values match the values in the shipped data. These matching tuples are the join results, which go to the output of the operation. (A SQL sketch of these steps appears at the end of this section.)

Now here you can see the semijoin operation graphically. The red table on the left is the output of the selection operation. The white table on the right is the table to be joined to. Since we need only the drinker column, it is projected to create a one-column relation. Notice that the red table has two entries for Pete, who frequented two bars, but the output of the projection, the yellow table, is condensed to just the drinkers, so Pete appears only once. For those of you with a background in computer science, this can be done using a hash-map-like data structure. This one-column table is now shipped to Site 2, which has the Likes relation. At Site 2, the shipped relation is used to find matches in the drinker column, and it finds only one match, Sally. So the corresponding result tuples, in this case only one tuple, are produced at the end of this operation. Now the original table and the matching tuples are shipped to the site where the join operation is finished.

A more efficient version of this is shown on the next slide. In this version, the first two steps are the same. Then the result of the reduction is also shipped back to Site 1 to find the matches in the red relation, and another reduce operation is performed on Site 1 to get the matching records of the red relation. Finally, these two reduced relations are shipped to the site where the final join happens.

All of this may seem like a lot of detail, so let me repeat something I've said before. If we have a system like DB2 or Spark SQL that implements multi-site joins, it will perform these operations under the hood, and you don't have to know about them. However, if you were to implement a similar operation and all you have is Hadoop, you may end up implementing this kind of algorithm yourself.
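To close, here is the semijoin idea from above expressed as a sketch in ordinary SQL. These statements are illustrative rather than what a distributed engine literally executes, and the IN subquery in the second statement stands in for the one-column table shipped from the Frequents site.

    -- Step 1, at the site holding Frequents: select the matching tuples and
    -- project the join column; only this one-column table is shipped.
    SELECT DISTINCT drinker
    FROM   Frequents
    WHERE  bar = 'The Great American Bar';

    -- Step 2, at the site holding Likes: keep only the Likes tuples whose
    -- drinker value appears in the shipped column; this filter is the semijoin.
    SELECT DISTINCT beer
    FROM   Likes
    WHERE  drinker IN (SELECT drinker
                       FROM   Frequents
                       WHERE  bar = 'The Great American Bar');

In the example on the slide, step 1 ships Pete only once even though he frequents two bars, and at the Likes site only one match, Sally's tuple, survives the filter.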