Chapter 22 How to Generate a Truly Random Number
What is a Random Number?
The traditional mathematical definition of randomness is based on Kolmogorov-Chaitin complexity, which defines a random string as one, which has no shorter description than the string itself. Now, is “7” random? Well, can you find a shorter description? Is Archimedes’ constant (our old friend π from Chapter 19) a random number? The description in Chapter 19 is certainly shorter than the infinitude of seemingly random digits of π. Many statistical tests exist for the purpose of seeing if a pattern exists in the purported randomness. “7” may fail some of these tests, while π seems to pass. I guess that part of the answer depends on for what purpose you require randomness. We’ll offer the following definition of a random number: it is a number that I can compute but that you cannot. Such a number has surprisingly practical uses.
Use of Randomness
Sequences that seem random (and generally in reality are pseudo random) are needed for a wide variety of purposes. They are used for unbiased sampling in the Monte Carlo method, and to imitate stochastic natural processes. They are used in implementing randomized algorithms that require arbitrary choices. And their perceived unpredictability is used in games of chance, and in data encryption and secure communications.
To generate a random sequence on a digital computer, one starts with a certain seed, then iteratively applies some transformation to it, progressively extracting as long as possible a random sequence. In general, one considers a sequence “random” if no patterns can be recognized in it, no predictions can be made about it, and no simple description of it can be found by an adversary. But, purely by nature of the fact that the sequence can be generated by iteration of a definite transformation, then a simple description of it certainly does exist.
Most current practical random sequence generation computer programs are based on linear congruence relations (like the one we used in Chapter 10), or linear feedback shift registers (analogous to linear cellular automata). The linearity and simplicity of these systems lead to efficient algebraic algorithms for predicting the sequences (or deducing their seeds), and limits their degree of randomness unless they are re-seeded often. Their usefulness comes down to the generation of good random seeds.
Bad Seeds
As the World Wide Web was gaining broad public appeal, the need for secure transmittal of payment information (such as credit-card numbers) became evident. Netscape’s browser began to use the Secure Sockets Layer (SSL) for such transactions. Basically, SSL protects communications by encrypting messages with a secret key - a large, random number. The security of SSL depends crucially on the unpredictability of this number. In 1995, Goldberg and Wagner showed that the algorithm that Netscape used to compute this “unknowable” seed was deeply flawed.
Because Netscape would not release detailed information about how the seed was derived (an important point that we shall return to), Goldberg and Wagner resorted to the (honorable) task of reverse-engineering Netscape’s algorithm by manually decompiling the executable program. Here is what they found:
global variable seed;
RNG_CreateContext() {
(seconds, microseconds) = time of day’ /* time elapsed since 1970 */
pid = process ID;
ppid = parent process ID;
a = mklcpr (microseconds);
b = mklcpr (pid + seconds + (ppid < 12));
seed = MD5 (a, b);
}
mklcpr (x) { return ((0XDEECE66D * x + 0x2BBB62DC) > 1); }
The mklcpr function just scrambles the input a bit, and MD5 is the well-known hashing function. The seed generated depends only on the values of a and b, which in turn depend on just the time of day, the process ID, and the parent process ID. If an adversary can predict these three values, the seed can be computed and the whole scheme is compromised. The time of day can be known to within a certain precision, often better than a second, which means that there are only one million possible choices for the microsecond part (and often a lot less because the clocks on most computer systems do not have true microsecond resolution). If you are running on the same machine that is generating the seed, the process IDs will be known, and, if you are not, it is usually possible to guess their values. Firstly, ppid is often 1, or, if not, then it is usually just a bit smaller than the pid. The pid s are normally not considered secret and are often present in message packets from the system. Goldberg and Wagner showed that one could deduce the seed in less than a minute.
Shortly afterwards, it was discovered that Kerberos V4 (another secure protocol) suffered from the same problem, maybe even worse than Netscape since it used the standard Unix random() function instead of the much better MD5. At about the same time, it was announced that although the MIT-MAGIC-COOKIE-1 key had 56 bits, only 256 seed values were possible because of poor use of the rand() function:
key = rand() % 256; /* take the remainder after dividing by 256 */
Similar examples of flawed thinking and sloppy implementation abound, e.g. Sun derived NFS file handles (which serve as tokens to control access to a file and therefore need to be unpredictable) from the traditional pid and time-of-day, but forgot to load the time-of-day variable, and on and on.
The conclusion is that generating good random numbers is hard and that in spite of that (or perhaps, because of that!) it is often done poorly. As Donald Knuth once quipped: “A random number generator should not be chosen at random”. A good random number process is not just a complicated or (as many would think sufficient) an obscure (“secret”) procedure. Netscape, IBM and many, many others have fallen into this trap that the process must be kept secret. But we must heed Kerckhoffs’ second principle: “compromise of the system should not inconvenience the correspondents”. That is: the system should remain secure even if all the details about its inner workings, down to the actual source code, are known to all, and even if the adversaries have access to the very system producing the randomness.
The Entropy Pool
One solution to the randomness problem is to maintain a pool of information pertaining to physical parameters, properties, and activity of the system. Anything that is determined by external factors can be – and should be - used as input to the pool, such as the time between keystrokes, the timing of disk interrupts, number of network packages arrived, and the like. On a multi-user system, such as the AS/400, the number of page faults, the number of disk read/writes, the time to wait until eligible to get the next time slice, and other such information depends on the overall activity in a manner that is hard to predict or precisely control. These things can go into the pool too.
The word entropy is often used as synonymous with randomness and the pool of randomness information is often called the entropy pool. One could have several entropy pools: a fast pool that contains “distilled” values from the other, slower, pools, and a number of ever slower pools containing further data about usage statistics, event occurrences, etc, being continuously collected and percolating up to the faster pools. When you need a random number or a seed for a random number generator, you “consult” the fast pool. Typically a cryptographically sound “message digest”, such as MD5 or SHA (the Secure Hash Algorithm) is computed over the pool and used as your next random number or seed. In this chapter we shall describe the construction of an entropy pool for the AS/400.
Sources of Entropy
We shall utilize the following sources of randomness (or entropy):
● User input (e.g. pass phrases)
● Timing information, such as time of day and processor time used
● Resource Management Data collected by the system
● Variation of when the pool collector runs
Some of this information is also available to an adversary, but not at the same time, so the data will be somewhat different, that is: an adversary will have to try a range of values instead of knowing exactly what the value is. By collecting a wide variety of data, all of which can only be known to the adversary as a range rather than as definite values, we enlarge the search space from which our values can be found. If this process goes on for long enough time, it begins to become unfeasible to guess which values within the compounded sets of ranges we are actually using. This “shotgun” approach tends to make life miserable for a would-be adversary.
Timing Information
The Materialize Process Attributes instruction, MATPRATR, has an option to return the amount of processor time (“CPU-time”) used until now by the current thread. An adversary could be monitoring my invocation stack and ask for the same information at the time the entropy collector runs. Because of the dynamic nature of the situation he is likely to get only an approximation to the result I would get. This is all we need: to force him to consider a range of values. All those ranges multiply together to form a large space to search. Here is how to retrieve the processor time:
DCL SPCPTR .P24-ATTR INIT(P24-ATTR);
DCL DD P24-ATTR CHAR(128) BDRY(16);
DCL DD P24-MAX-SIZE BIN(4) DEF(P24-ATTR) POS( 1);
DCL DD P24-ACT-SIZE BIN(4) DEF(P24-ATTR) POS( 5);
DCL DD P24-CPU-TIME-USED CHAR(8) DEF(P24-ATTR) POS(19);
ENTRY GET-THREAD-ACTIVITY INT;
CPYNV P24-MAX-SIZE, 128;
MATPRATR .P24-ATTR, *, X'24'; /* THREAD ACTIVITY */
CPYBLA TIME-VALUE, P24-CPU-TIME-USED;
CALLI ADD-TIME-TO-ENTROPY, *, .ADD-TIME-TO-ENTROPY;
The TIME-VALUE is in the standard 64-bit timestamp format that we explored in Chapter 4:
DCL DD TIME-VALUE CHAR(8);
DCL DD TIME-VALUE-HI BIN(4) UNSGND DEF(TIME-VALUE) POS(1);
DCL DD TIME-VALUE-LO BIN(4) UNSGND DEF(TIME-VALUE) POS(5);
We convert the timestamp to a number as we did in Chapter 4:
DCL DD TIMESTAMP PKD(21,0); /* CAN HOLD UNSIGNED 64-BIT INTEGER */
DCL DD TWO**32 PKD(11,0) INIT(P'4294967296');
DCL DD TWO**15 PKD(11,0) INIT(P'32768');
DCL INSPTR .ADD-TIME-TO-ENTROPY;
ENTRY ADD-TIME-TO-ENTROPY INT;
MULT TIMESTAMP, TIME-VALUE-HI, TWO**32;
ADDN(S) TIMESTAMP, TIME-VALUE-LO;
DIV ENTROPY-VALUE, TIMESTAMP, TWO**15;
We divide by 215 because the lower 15 bits are all zeroes anyway. The result is what I call the ENTROPY-VALUE for that piece of information. We finally divide by 231 and use the remainder (stored as an unsigned 4-byte value) as the bits to add to the entropy pool:
DCL DD ENTROPY-BITS BIN(4) UNSGND;
DCL DD ENTROPY-VALUE PKD(31,0);
DCL INSPTR .ADD-TO-ENTROPY-POOL;
ENTRY ADD-TO-ENTROPY-POOL INT;
REM ENTROPY-BITS, ENTROPY-VALUE, TWO**31;
CPYBLA ENTROPY-POOL(POOL-POSITION:4), ENTROPY-BITS;
Entropy Pool Format
The entropy pool holds a number (currently eight) of entropy pool entries. Each entry consists of seven pairs of 4-byte values. The first value of the pair is a particular measurement and the second value is the real-time duration between measurements.
Here is a (pseudo) declaration of the structure:
DCL DD ENTROPY-POOL;
DCL DD ENTROPY-POOL-ENTRY (8);
DCL DD ENTROPY-POOL-ENTRY-PAIR (7);
DCL DD ENTROPY-POOL-ENTRY-PAIR-MEASUREMENT BIN(4) UNSGND;
DCL DD ENTROPY-POOL-ENTRY-PAIR-DURATION BIN(4) UNSGND;
To make it harder for the adversary to replicate the entropy pool, the seven measurements in an entry are not made at the same time. They are staggered in time by a variable amount (the yield time) as explained below. Each time the pool is consulted, a new entry is added to the pool with a new set of measurements. When the pool overflows (it only holds eight entries, so that happens quickly), the oldest entry is discarded to make room for the new entry. In this way the pool is constantly renewed. We include an option to flush the pool and start with a clean slate.
The Yield Time
The YIELD MI-instruction allows us to introduce considerable variability in when we execute a particular piece of code. It works like this: upon execution of YIELD, the dispatching algorithm is run. If another thread of equal or higher priority is eligible to run, then one of these threads is chosen and dispatched to run, and our thread will wait until it’s its turn again. Otherwise, the current thread resumes execution.
We could execute YIELD several times in a row, hoping that some other thread will run so that it will take some time before we are given control again. We could even compute a hash of the current contents of the entropy pool, form an integer from some of the middle bits of the hash-value, divide that integer by a parameter MAX-YIELDS and use the remainder to determine how many times to execute the YIELD instruction. We measure the total amount of real time that this whole process takes (the yield time) and use that as the duration part of the entropy pair. The yield time will depend on other activities going on at the same time, and is hard to predict on a busy system. In all fairness, we should note that the yield time could be manipulated somewhat by an adversary because he could cause some of that “other activity”.
The code reads:
REM TIMES-TO-YIELD, SHA-HASH-MIDDLE, CUR-MAX-YIELDS;
MATMATR .MACHINE-ATTR, X'0100';
CPYBLA TIME-BEFORE, MAT-TIMESTAMP; /* clock value before the loop */
CPYNV LAST-YIELDS, TIMES-TO-YIELD;
: YIELD; SUBN(SB) TIMES-TO-YIELD, 1/HI(=-1);
MATMATR .MACHINE-ATTR, X'0100';
CPYBLA TIME-AFTER, MAT-TIMESTAMP; /* clock value after the loop */