So in terms of network debugging, first we need to understand some of the metrics that are used for analyzing and quantifying networks. Latency and throughput are terms I'm sure all of you have heard again and again in several different courses, including my own. If there is a point A and a point B, latency is the time it takes to go from A to B. Throughput, on the other hand, is events per unit time. At first glance it might seem like throughput is just the inverse of latency, but unfortunately it's not. The reason is that between two points you may have a single path, like a highway with a single lane, or you may have a highway with multiple lanes. With multiple lanes, the throughput is how many vehicles pass per unit time on the highway as a whole. So in that sense throughput is not necessarily the inverse of latency; it is really a measure of the parallelism available between two endpoints. The same holds in networks: between a source and a destination there may be multiple paths, which means multiple sources in different places can be sending to the destination at the same time, and that is where throughput goes up. Latency, on the other hand, is a speed-of-light consideration: the distance, and the time it takes to send the bits from A to B, is what constitutes latency. Utilization has to do with how well the networking infrastructure is used. For instance, if you have a data center with lots of different paths in it and servers at the bottom, maybe one portion of the network is very busy while another portion is not busy at all. In that case utilization is not very high.
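To make the highway analogy concrete, here is a small sketch of my own (not from the lecture; the names `lanes` and `throughput` are illustrative): adding lanes multiplies throughput while leaving per-vehicle latency untouched, which is exactly why throughput is not simply the inverse of latency.

```python
# Toy model: throughput vs. latency on a path with parallel "lanes".
# Per-item latency is fixed by distance, but total throughput scales
# with the parallelism (number of lanes) between the two endpoints.

def throughput(lanes: int, latency_s: float) -> float:
    """Events delivered per second across all lanes.

    Each lane delivers one item every `latency_s` seconds, so total
    throughput grows with the number of lanes even though the latency
    experienced by any single item is unchanged.
    """
    return lanes / latency_s

latency = 0.010                     # 10 ms from A to B
single = throughput(1, latency)     # one lane: throughput is 1/latency
multi = throughput(4, latency)      # four lanes: 4x throughput, same latency
print(single, multi)
```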
So utilization has to do with how well the resources in the system as a whole are being used to cater to the demands of the applications that may be running on it. Then there is the question of how scalable a particular system is. One way of thinking about this: if I have an application running on these two nodes, and I want to run some other application on some other nodes as well, does the network still deliver the same performance to each application as when it ran by itself? That is one way of thinking about scalability in the context of data center networks. In other words, as you increase the offered load on the system, does the performance experienced by the applications remain unchanged? If my application runs by itself, I get a certain performance. If multiple applications run concurrently with mine, do I experience the same performance, or does my performance suffer? This all comes down to the network, because you have lots of servers connected by a network where there might be constriction, and that is where scalability becomes very important in data center networks. Now, when you want to evaluate the goodness of any system, whether it is a computer system, an operating system, a network, or whatever, there are, generally speaking, three ways you can go about it. The first is modeling. If you think of building architecture, you can build a clay model just to visualize the building. In a computer system, if there is some mathematical way of representing what is going on in the system, then I can analyze that mathematical model to derive performance metrics: the metrics we talked about, latency, scalability, utilization, and so on. We can do that if we have a good mathematical model.
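As one concrete illustration of the modeling approach (my example, not something the lecture derives), the classic M/M/1 queueing model gives a closed-form average latency of 1/(mu - lambda) for a single server with service rate mu and arrival rate lambda, and it immediately shows how latency blows up as utilization approaches 1:

```python
# Analytical model (M/M/1 queue): average time in system W = 1/(mu - lambda).
# This kind of closed-form result is what a mathematical model buys you:
# latency as a function of utilization rho = lambda/mu, with no simulation.

def mm1_latency(arrival_rate: float, service_rate: float) -> float:
    """Average time a request spends in an M/M/1 queue (waiting + service)."""
    if arrival_rate >= service_rate:
        raise ValueError("queue is unstable when arrival rate >= service rate")
    return 1.0 / (service_rate - arrival_rate)

mu = 1000.0  # requests/s the server can handle
for lam in (100.0, 500.0, 900.0, 990.0):
    rho = lam / mu  # utilization
    print(f"utilization {rho:.2f}: latency {mm1_latency(lam, mu) * 1000:.2f} ms")
```

Note the modeling caveat from the lecture: this formula assumes Poisson arrivals and exponential service times, a distribution that may not match what real workloads do.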
It may not always be possible to do that, because computer system design is mostly heuristics; you don't have Maxwell's equations driving the design of computer systems. So modeling may not be the easiest way to do it. Another approach is simulation, which we are all familiar with. Here we take the real system and ask: can I write a computer program that simulates what will happen in the real system? With that simulator we can then analyze the same metrics: latency, bandwidth, throughput, utilization, and so on. Because there is no easy way to have an exact mathematical model, what is usually done in modeling is approximation. For instance, we talked about the workload going into the system. How is the workload going to be represented in a mathematical model? You have to assume a distribution, and that mathematical distribution may not be exactly what is going on in the real world. So modeling may have limitations in terms of the kind of results you get out of it and how believable those results are. Simulation, on the other hand, is a programmatic representation of the real system, and you can drive the simulator with actual traces from a real execution. In that sense it need not be an approximation of the workload; it can be the real workload. But there are pros and cons to both. Modeling may not be very robust in terms of the results it can produce, but it can be tractable, meaning you can model fairly large systems with mathematical models, even though they may be approximate. Simulation, on the other hand, is a computer program, so how well is that program going to scale? You may have to model a data center with thousands of nodes, for instance.
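A minimal sketch of what trace-driven simulation means (a toy of my own, not any real data center simulator): replay a recorded trace of request arrival times through one simulated FIFO server and measure each request's latency, instead of assuming a workload distribution.

```python
# Toy trace-driven simulator: one FIFO server replaying a recorded trace.
# The trace plays the role of "actual traces from a real execution" --
# the workload is not approximated by a distribution, it is replayed.

def simulate(trace, service_time):
    """Return per-request latency for a single FIFO server.

    trace: arrival timestamps in seconds, sorted ascending.
    service_time: seconds the server needs per request.
    """
    latencies = []
    server_free_at = 0.0
    for arrival in trace:
        start = max(arrival, server_free_at)   # wait if the server is busy
        finish = start + service_time
        server_free_at = finish
        latencies.append(finish - arrival)     # queueing delay + service
    return latencies

# Arrival times as if captured from a real execution (made-up numbers here).
trace = [0.00, 0.01, 0.02, 0.50]
print(simulate(trace, service_time=0.05))
```

This also hints at the scaling concern from the lecture: a real simulator must do this event-by-event bookkeeping for thousands of nodes and links, which is where simulation time becomes the bottleneck.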
So the time it takes to get results becomes an issue for a simulator as opposed to a mathematical model. There are always these pros and cons: with something simple you get results quickly, but they may be more approximate; with something more intricate and well-defined, it may take more time, but you may get more robust results. And finally, implementation and measurement. As the name suggests, here we really say: let's implement it and then study its performance. Obviously this is going to be the best, because you are measuring the true system after implementing it, and you know the performance results you get out of it are completely believable. The design path always tries to go through the earlier stages as much as possible, and the reason is that implementation is expensive. You are committing the design to actual hardware and software, and what if the design decisions are not good? Then you have to go back to the drawing board. So you do want these intermediate ways of evaluating the system, which give you partway results and the confidence that you are going in the right direction before you implement and measure. And you will see that data center networks have done all three of these things: modeling, simulation, and implementation and measurement. Now that we have real data centers, you can actually see what kinds of workloads are being run there, and once you have real workloads running on real data center resources, those studies tell you whether the design decisions that were taken are the right ones, and if some design decisions have to be revised or redone, which way to go. Those are some of the things happening as we speak in most data centers.
So in this lecture, which I am not going to cover completely today, the things we are going to look at relate to data center networks: in particular, what can go wrong in data center networks; second, what tools and techniques are available for testing and debugging data center networks, as well as tools for measuring their performance; and we will also look at case studies of measurements that folks have done on data center networks, both an academic study and studies that have been done on the Yahoo and Google data centers.