Now, if the table of beers were large and had millions of entries, the table might need to be split over many machines. Another way of saying that is that the table will be partitioned across a number of machines. Since a query here simply performs a selection and a projection, it can be evaluated in parallel. Remember that name is the primary key of the table. One standard way of partitioning the data is called range partitioning by the primary key. This simply means that the rows of the table are put into groups depending on the alphabetical order of the name value. So beers with names starting with A and B are placed in Machine 1, those starting with C and D are in Machine 2, and if there are too many rows where the name starts with H, maybe H is split across Machines 5 and 6. This is shown in the sketch here.

Next, we will show how queries are performed over partitioned tables. But before we do that, you should know that all the database management companies, like IBM, Teradata, Microsoft, and others, have a solution like this for large volumes of data, where data partitioning is used. Newer systems, like Spark SQL, are naturally distributed and therefore offer data partitioning.

So, we show the same partitioned tables as we saw before. Now we'll ask two queries. The first query asks for all tuples, or records, from the beers table where the name of the beer starts with Am. The second query is exactly what we asked before. The first query in SQL looks like this. We write SELECT * FROM Beers to mean all attributes from the table Beers. The WHERE clause shows the syntax for a partial-match query. In this query, there are two new syntax elements. The first is a predicate called LIKE. When we use LIKE, we are telling the query engine that we only have partial information about the string we want to match. This partly specified string is called a string pattern: there is a part of the string we know and a part that we do not know. The second new element is the wildcard %. In this case, we know that our desired string starts with Am, so we write Am and then put % to stand for the part of the string we do not know. Putting them together, we get Am%. If we wanted to find, say, Am somewhere in the middle of the string, we would write the pattern as %Am%. The second query is not new; we saw it on the last slide. However, as we'll see next, evaluating the second query will be a little trickier in a partitioned database, which is the setting we usually see for big data.

Let's talk about the first query in this data-partitioned setting. The question to ask is: do we need to touch all partitions to answer the query? Of course not. We know that name is the primary key of the beers table, and we also know that the system used range partitioning on the name attribute. This means the evaluation process should only access Machine 1, because no other machine will have records for names starting with A. This is exactly what we, as humans, do when we look up an entry in a multivolume encyclopedia: we look at the starting words, figure out which specific volume would have that entry, and pick up just that volume. Thus, so long as the system knows the partitioning strategy, it can make its job much more efficient. When a system processes thousands of queries per second, this kind of efficiency really matters. Now, range partitioning is only one of many partitioning schemes used in database systems. The sketch below shows this kind of partition-aware routing in code.
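Here is a minimal sketch of range partitioning and prefix routing in Python, under stated assumptions: the machine boundaries, the sample rows, and all names (machine_for, insert, query_name_prefix) are hypothetical stand-ins for illustration, not the API of any particular DBMS, and each in-memory list plays the role of one machine.

```python
import bisect

# Lower bound of each machine's alphabetical range: list index i plays
# the role of Machine i+1.  Machine 1 holds names in ["A", "C"), i.e.
# starting with A or B; Machine 2 holds C-D; and the heavy H range is
# split across Machines 5 and 6 (the last range is open-ended).
BOUNDS = ["A", "C", "E", "G", "H", "Hm"]
machines = [[] for _ in BOUNDS]          # each sublist stands in for one machine

def machine_for(name):
    """Range partitioning: pick the machine whose range covers `name`."""
    return max(0, bisect.bisect_right(BOUNDS, name) - 1)

def insert(name, manf):
    machines[machine_for(name)].append((name, manf))

def query_name_prefix(prefix):
    """SELECT * FROM Beers WHERE name LIKE '<prefix>%' -- because the
    table is partitioned on name, only the one machine whose range can
    contain the prefix is touched (assuming the matching names do not
    straddle a range boundary)."""
    rows = machines[machine_for(prefix)]
    return [r for r in rows if r[0].startswith(prefix)]

insert("Amstel", "Heineken")
insert("Corona", "Grupo Modelo")
print(query_name_prefix("Am"))           # only Machine 1 is accessed
```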
Let's try to answer the second query in the same partitioned setting. Now the query condition is on the second attribute, manf. In one sense, it's a simpler query: there is no LIKE pattern here, and we know exactly the string we are looking for, namely Heineken. However, this time we really cannot get away with using the partitioning information, because the partitioning attribute (name) is different from the attribute on which the query condition is applied (manf). So this query will need to go to all partitions. Technically speaking, the query needs to be broadcast from the primary machine to all machines, as shown here. Next, each machine will independently, and in parallel, execute the broadcast query on its local data. Then these local results need to be brought back to the primary machine and unioned together, and only then can the final result be formed and returned to the client. Now, this might seem like a lot of extra work. However, remember that the shaded part of the query is executed in parallel, which is the essence of dealing with large data.

Now, at this point you might be thinking: wait a minute, what if I had 100 machines and the desired data were only in 20 of them? Should we needlessly go to all 100 machines, find nothing in 80 of them, and get back zero results from those machines? Why do the extra work? Can it not be avoided? Well, to do this, we would need one more piece in the solution, called an index structure. Very simply, an index can be thought of as a reverse table: given the value in a column, it returns the records in which that value appears, as shown in the figure here. Using an index speeds up query processing significantly.

With indexes, we can solve this problem in several different ways. The top table shows the case where each machine has its own index on the manf column. This is called a local index, because the index sits on every machine that holds data for the table, and it covers only the data on that machine. In this case, looking up Heineken in the index tells us which local records have the data. Since the index is local, the query will still go to all machines, but the lookup will be nearly instant, and the empty results will come back very quickly. In the second case, we adopt a different solution. Here, there is an index on the primary machine, or on a separate index server. Now, when we place a data record on a machine, this index keeps a record of which machine contains the record with that value. Look at the second table on the right: given the value Heineken, we know that it is in only three machines, and therefore we can avoid going to the other machines. Clearly, we can always use both indexing schemes together. This will use more space, but queries will be faster. Now, this gives you a taste of the choices you may need to make with big data, whether you use a parallel DBMS or a distributed data solution. The two sketches below illustrate the broadcast-and-union plan and the global-index shortcut.
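First, a minimal sketch of the broadcast-and-union evaluation, assuming an in-memory stand-in for the cluster: the sample rows are made up, and a thread pool merely imitates the genuinely parallel per-machine scans of a real system.

```python
from concurrent.futures import ThreadPoolExecutor

# Each inner list stands in for the Beers rows stored on one machine.
machines = [
    [("Amstel", "Heineken"), ("Bud Light", "Anheuser-Busch")],
    [("Corona", "Grupo Modelo")],
    [("Edelweiss", "Heineken")],
]

def local_scan(rows, manf):
    """Per-machine work: scan the local partition for one manufacturer."""
    return [r for r in rows if r[1] == manf]

def query_manf(manf):
    """SELECT * FROM Beers WHERE manf = '<manf>' -- the condition is not
    on the partitioning attribute (name), so the query is broadcast to
    every machine; the per-machine scans run in parallel, and the
    partial results are unioned at the primary."""
    with ThreadPoolExecutor() as pool:
        partials = pool.map(local_scan, machines, [manf] * len(machines))
    result = []
    for p in partials:               # union the partial results
        result.extend(p)
    return result

print(query_manf("Heineken"))        # all machines are visited
```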
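Second, a minimal sketch of the global-index variant, again with made-up data: the dictionary global_index plays the role of the index kept on the primary machine (or on a separate index server), mapping each manf value to the set of machines that hold matching records, so machines that cannot contribute results are never visited.

```python
# Each inner list stands in for the Beers rows stored on one machine.
machines = [
    [("Amstel", "Heineken")],
    [("Corona", "Grupo Modelo")],
    [("Edelweiss", "Heineken")],
    [("Guinness", "Diageo")],
]

# Built as records are placed: manf value -> set of machines holding it.
global_index = {}
for mid, rows in enumerate(machines):
    for _, manf in rows:
        global_index.setdefault(manf, set()).add(mid)

def query_manf_indexed(manf):
    """Look up the global index first, then visit only the machines it
    names, instead of broadcasting to the whole cluster."""
    hits = global_index.get(manf, set())       # e.g. {0, 2} for Heineken
    return [r for mid in sorted(hits)
              for r in machines[mid] if r[1] == manf]

print(query_manf_indexed("Heineken"))  # machines 1 and 3 are never touched
```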