One of the things that we often need to do in cryptography is find large prime numbers. How large? Currently, RSA encryption commonly uses 1024-bit public keys, but it is increasingly recommended to use 2048-bit keys, and some applications use 4096-bit keys. These are numbers that have about 300, 600, and 1200 decimal digits, respectively. These keys are semi-primes, meaning numbers that are the product of two primes, and the two primes used are roughly the same size. This means that we are talking about prime numbers that have approximately 150, 300, and 600 digits, respectively.

So the good guys, such as Alice and Bob, have to be able to find large prime numbers to generate their public keys, while the bad guys, Eve and Mal, want to be able to figure out which prime numbers Alice and Bob chose based on that public key. Although both the good guys and the bad guys are interested in finding prime numbers of about the same size, their problems are actually very different. In some respects, the differences are roughly analogous to those between a second pre-image attack against a hash function and a hash collision attack, except this time the tables are turned in favor of the good guys. Eve and Mal have to find a specific prime number that factors a particular semi-prime, while Alice and Bob only care about finding a random prime number of a certain size.

Alice and Bob can either use an algorithm that generates prime numbers or take the approach of just guessing random numbers and checking whether they are prime until they find two that are. In that case, what they need is an algorithm that tests whether a number is prime without actually factoring it. It turns out there are a number of efficient ways to do this, and those will be the focus of the next several lessons. Eve and Mal, on the other hand, need to solve what is known as the Integer Factorization Problem. Like the discrete log problem, this is a problem for which it is suspected that there may not be any efficient algorithm, where, as before, the term efficient is shorthand for a polynomial time algorithm in terms of the number of bits in the number. But again, like the discrete log problem, it has never been proven that an efficient solution does not exist, so any cryptosystem that relies on this being a hard problem may be relying on a house of cards.

In this lesson, we want to get a feel for the scope of the Integer Factorization Problem by looking at how the simplest factoring algorithm, known as trial division, scales, and by briefly talking about the performance of the General Number Field Sieve, which is presently the fastest known algorithm for factoring multi-hundred-digit semi-primes.

Let's start small and first narrow the scope of the problem to something that we can wrap our minds around. Let's say that Alice wants to generate a public key that is just 32 bits. This means that she wants prime numbers that are about 16 bits in length. Since she doesn't want the two primes to be too close to each other, she decides to make one 14 bits and the other 18 bits. This means that she wants a prime number close to 16,000 and another one close to 260,000. It turns out that 16001 and 260003 are prime, so Alice's public key becomes 4160308003, which is indeed a 32-bit number. How did I find these numbers? Simple. I cheated. I looked up a list of prime numbers and scanned down it. These happen to be the 1863rd and the 22838th primes on the list.
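To make the toy example concrete, here is a minimal Python sketch, my own illustration rather than anything a real implementation would run at real key sizes, that checks the two small primes by trial division and multiplies them into Alice's 32-bit public modulus.

```python
# A minimal sketch recreating the toy example: verify that 16001 and 260003
# are prime by trial division (fine for numbers this small) and multiply
# them to form Alice's 32-bit public modulus.

def is_prime(n: int) -> bool:
    """Trial division: only divisors up to sqrt(n) need to be checked."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

p, q = 16001, 260003                  # the 14-bit and 18-bit primes from the lesson
n = p * q

print(is_prime(p), is_prime(q))       # True True
print(p.bit_length(), q.bit_length()) # 14 18 -- the sizes Alice chose
print(n)                              # 4160308003
print(n.bit_length())                 # 32 -- indeed a 32-bit number
```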
Of course, looking them up in a table is not how Alice would do it when she wants to find even a 150-digit prime number, let alone a 600-digit prime. Such a table has to include all of the prime numbers up to a certain limit. While the largest prime number we know of to date has over 22 million digits, it is estimated that the limit below which all of the prime numbers have been found is somewhere between 10 to the 18th and 10 to the 19th, basically a 19-digit number. But no actual table exists, because even that table would require about two exabytes of storage, and there is presently somewhere in the rough neighborhood of a thousand exabytes of storage in the entire world.

So how does Alice do it? The most common way is to pick random numbers and test whether they are prime. It turns out that the probability that a randomly chosen number is prime is roughly the reciprocal of the natural log of the number. So in the case of our toy problem, a randomly chosen number around 16,000 has about a 10% chance of being prime, while a number around 260,000 has about an 8% chance. But what about a 150-digit prime or a 600-digit prime? Certainly the probabilities go down, but we are still talking about roughly one in 350 for a 150-digit prime and about one in 1400 for a 600-digit prime. So yes, she may have to make several hundred or even a few thousand guesses before she finds two primes, but she only has to do that once, and as long as the primality test is sufficiently efficient, this is an acceptable burden. That's all I will say about Alice's task here, since we'll look at efficient primality tests in the next few lessons.

Turning our attention to Eve, what does she do when she gets handed this 32-bit toy semi-prime and is told to factor it? Her first approach might be what I call true brute force trial division: see if 2 will divide it; if not, try 3, then 4, then 5, all the way up to N minus 1. This is absolutely guaranteed to find any divisor of the number, and the number of trial divisions she has to perform is basically N. So we have an order-N algorithm, but since N is exponential in the number of bits, so is this algorithm. But we can do a lot better; brute force merely gives us a benchmark for comparison.

The first thing to note is that if N has any nontrivial factors, they obviously have to be no larger than N over 2, which cuts our effort in half, but this is still an order-N algorithm. We really only need to try potential divisors up to the largest value that the smallest factor could possibly be. Let's say that the smallest factor of N is p. We know that p is prime, since if it wasn't, there would be a factor smaller than p. The other factor, q, which may or may not be prime, is no smaller than p; again, if it was, p wouldn't be the smallest factor. So we have N equals p times q, which is at least as large as p squared, which means that p, the smallest factor of N, is no larger than the square root of N. This might not seem like that big of an improvement, but even our 32-bit toy semi-prime puts the lie to that impression. If we tried every number up to half of N, we would be making up to 2 billion trial divisions. But knowing that we can stop at the square root of N means that we have to do at most about 2 to the 16th, or roughly 65,000, which is a reduction of more than a factor of 30,000 in the time required.
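Here is a small sketch of that improved trial division, again in Python and purely for illustration under the lesson's assumptions: it tries divisors only up to the integer square root of N and, on the toy semi-prime, recovers Alice's two primes after roughly 16,000 divisions instead of billions.

```python
# A sketch of Eve's improved trial division: try divisors 2, 3, 4, ... but
# stop at sqrt(N). This only succeeds quickly because the toy N is tiny --
# the lesson's point is that this approach cannot scale to hundreds of digits.

import math

def factor_by_trial_division(n: int) -> tuple[int, int]:
    """Return (p, q) with p * q == n, or (n, 1) if n is prime."""
    limit = math.isqrt(n)             # no factor pair exists beyond sqrt(n)
    for d in range(2, limit + 1):
        if n % d == 0:
            return d, n // d
    return n, 1

print(factor_by_trial_division(4160308003))   # (16001, 260003)
```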
So if it took us a day to perform 2 billion trial divisions, which is admittedly an exaggeration given today's computers but serves for illustration, stopping at the square root of N would take only a few seconds.

Now, I realize that you might already be screaming at me that Eve doesn't even need to check every number less than the square root of N. Once she knows it isn't divisible by 2, she doesn't need to check any other even number; that alone cuts her work in half. In fact, she only needs to try the prime numbers less than the square root of N. In the case of our toy N, the square root of N is just over 64,500, and there are 6493 prime numbers less than this. How do I know this? I cheated again; I used the same table I did before. But even with this list available, Eve's burden is only reduced by about an order of magnitude, and that isn't going to save her. For starters, the list she uses would have to contain all of the prime numbers below the square root of N, and we've already seen that the current limit on such a list is less than 10 to the 19th, which means she won't be able to use this method to factor any number of more than about 38 digits. Plus, this table would require a non-trivial fraction of the entire world's storage capacity.

Now, consider how many prime numbers would be in this table. There is actually a function called the prime counting function, or pi of N, that is defined as the number of prime numbers less than or equal to N. Unfortunately, unlike the totient function, we have yet to find a formula that gives the exact value of pi of N; that's another of math's great unsolved problems. But we do have some bounds on it, and one commonly used approximation that is often good enough is that pi of N is about N divided by the natural log of N. So for 10 to the 19th, that's about 2.3 times 10 to the 17th. So even if Eve could perform a billion trial divisions a second, it would still take her about seven years to factor a 38-digit number. Clearly, this approach has no chance of scaling to 200-digit, let alone 1000-digit, numbers. Our old friend astronomical pays a visit long before we get there.

But trial division is not the only way to factor large numbers. Most fast methods, and here fast does not mean polynomial time, it just means faster than anything else anyone has come up with yet, use some form of sieve approach. You might want to look up the Sieve of Eratosthenes, which gives you an idea of the concept and which, like the Euclidean Algorithm, is one of the oldest algorithms known; I'll show a tiny sketch of it in a moment. For numbers with more than about 100 digits, the current record holder is known as the General Number Field Sieve.

The RSA Factoring Challenge was created by RSA Laboratories in 1991. And yes, that's the same RSA as the encryption algorithm we use for so much of our e-commerce and other public key cryptographic protocols. It contains a list of semi-prime numbers ranging from 100 decimal digits up to 617 digits, which is a 2048-bit number. As of 2017, the largest RSA challenge number that has been factored is RSA-768, a 768-bit number with 232 decimal digits. Factored back in 2009, it took a couple of years and involved hundreds of processors; it is estimated that the effort would have taken about 2000 years on a single-core 2.2 Gigahertz processor. A couple of smaller RSA numbers still haven't been factored.
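Before moving on, here is the promised sketch of the Sieve of Eratosthenes, used here purely as an illustration: it counts the primes below the square root of our toy N and lets us compare that exact count against the rough N divided by the natural log of N approximation of the prime counting function.

```python
# A quick sketch of the Sieve of Eratosthenes, used to count the primes up to
# just over sqrt(4160308003) and to compare that exact count against the
# N / ln(N) approximation of the prime counting function.

import math

def sieve(limit: int) -> list[int]:
    """Return all primes <= limit using the Sieve of Eratosthenes."""
    flags = [True] * (limit + 1)
    flags[0] = flags[1] = False
    for d in range(2, math.isqrt(limit) + 1):
        if flags[d]:
            for multiple in range(d * d, limit + 1, d):
                flags[multiple] = False
    return [i for i, is_p in enumerate(flags) if is_p]

limit = 64501                        # just over the square root of the toy N
primes = sieve(limit)
print(len(primes))                   # the exact count of primes up to the limit
print(limit / math.log(limit))       # about 5800 -- the rough N / ln(N) estimate
```

Notice that the N over natural-log-of-N estimate undershoots the true count somewhat, which is why it is described above as an approximation that is often good enough rather than exact.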
To give you some idea of how processing capabilities have progressed, the original key length for RSA in the late 1970s was 512 bits. As is often the case in cryptography, the desired key length was longer than this, but the choice of 512 bits reflected the practical implementation constraints of the time. The first 512-bit key to be factored, RSA-155, involved a dedicated research effort, a close-to-state-of-the-art Cray supercomputer, and about 8000 MIPS-years of processing effort. A decade later, the 512-bit RSA key that Texas Instruments used to sign firmware for the TI-83 Plus graphing calculator was factored by a single person, Benjamin Moody, in 73 days using a single 1.9 Gigahertz dual-core processor.

Some experts suspect that current 1024-bit keys can already be factored in meaningful timeframes by well-funded state actors, while others dispute this claim, but few doubt that it will be possible in the not-too-distant future. For this reason, the current recommendation is to use those 2048-bit keys. It is believed that, short of someone developing an efficient factoring algorithm or a workable quantum computer, 4096-bit keys move the scale of the factoring problem well into the astronomical realm that cryptographers like to play in.

Assuming that we keep the key size ahead of advances in computing technology and the development of factoring algorithms, what are Eve's other options? Just because we may have found a way to keep Eve and Mal from cracking our keys mathematically, it would be pretty unreasonable to think that they're just going to give up and go away. Nope, they're going to explore every facet of our cryptosystems trying to find other ways to break into them. In the case of Diffie-Hellman, they might be able to make headway against the discrete log problem. More likely, they might discover exploitable weaknesses not in the basic math but in subtleties of how we use that math, such as how we generate our keys, how we pad our messages, or how we use our keys. In fact, all of these have been the source of exploitable vulnerabilities that have been exposed and patched over the years. Then, of course, there are all the side channel attacks, ranging from monitoring the various ways our physical cryptosystems leak information to going after human weaknesses such as storing keys improperly or being susceptible to bribery or coercion.

We are not going to explore any of these any further; our focus is on the central part of the underlying math. But engineers are notorious for putting on blinders and thinking that if we solve the narrow problem we're focused on, the bigger problem is somehow solved. So I want to keep us grounded in that bigger picture by at least pointing it out from time to time.