Consider testing a computer's CPU by running a long computation and verifying that the result is correct. We restrict ourselves to interesting computations, an admittedly subjective quality, because if one is going to use a lot of CPU time, one might as well compute something interesting.
Our interesting computation will be to compute 2^n mod p the Slow Way by repeating (x=2*x mod p) n times.
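For concreteness, here is a minimal sketch of the Slow Way in C; the modulus and iteration count are placeholders chosen just for illustration, and the actual integer type and prime are discussed below.

#include <stdint.h>
#include <stdio.h>

int main(void) {
    const uint64_t p = 1000000007ULL;  /* placeholder modulus, for illustration only */
    const uint64_t n = 1000000ULL;     /* placeholder number of doublings */
    uint64_t x = 1;
    for (uint64_t i = 0; i < n; i++)
        x = (2 * x) % p;               /* one Slow Way step: double, then reduce */
    printf("2^%llu mod %llu = %llu\n",
           (unsigned long long)n, (unsigned long long)p, (unsigned long long)x);
    return 0;
}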
Because of the nature of the computation, an arithmetic error at any point during computation will likely become magnified and visible as an obviously incorrect final result. It is also extremely unlikely for multiple arithmetic errors to cancel each other out and yield a correct final result.
The kinds of errors we might see through this testing are sporadic arithmetic errors caused by the heat of pushing a CPU to the max for an extended period of time. Or maybe a cosmic ray flipping a bit.
The result can be verified the Fast Way: modular exponentiation by repeated squaring, the standard method, usually implemented with recursion, which implicitly uses the binary representation of the exponent.
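For reference, here is a hedged sketch of the Fast Way, written iteratively rather than recursively, for a modulus small enough that the intermediate products fit in GCC's unsigned __int128; the 127-bit modulus used below needs a bignum tool instead (see Pari/GP and GNU MP later in the post).

#include <stdint.h>
#include <stdio.h>

typedef unsigned __int128 u128;

/* Compute base^exp mod m by repeated squaring, walking the bits of exp. */
uint64_t powmod(uint64_t base, uint64_t exp, uint64_t m) {
    uint64_t result = 1 % m;
    base %= m;
    while (exp > 0) {
        if (exp & 1)                                  /* this exponent bit is set */
            result = (uint64_t)((u128)result * base % m);
        base = (uint64_t)((u128)base * base % m);     /* square for the next bit */
        exp >>= 1;
    }
    return result;
}

int main(void) {
    /* small sanity check: 2^10 mod 1000003 = 1024 */
    printf("%llu\n", (unsigned long long)powmod(2, 10, 1000003));
    return 0;
}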
p should be a prime number which has 2 as a primitive root (generator). We'll use the 128-bit unsigned integer type provided by GCC on 64-bit platforms, because the wider the integer type, the more interesting the computation: it has a longer period. If p is less than 2^127, for example 2^127-309, then it is straightforward to implement the Slow Way without needing clever tricks or an arbitrary-precision arithmetic library (e.g., GMP).

Is it quicker to always compute mod p with the % operator, or to subtract p whenever the result after doubling is greater than or equal to p? The former requires division; the latter seemingly has a branch that would stymie any branch predictor, because whether the branch is taken is essentially random, though a compiler can emit it as predicated (branchless) code instead. A small amount of experimentation found that the predicated subtraction was faster.
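Here is a sketch of the two reduction strategies, assuming GCC's unsigned __int128 and p = 2^127-309; the small loop in main just checks that they agree.

#include <assert.h>

typedef unsigned __int128 u128;

#define P ((((u128)1) << 127) - 309)   /* p = 2^127 - 309 */

/* Always reduce with the % operator (costs a 128-bit division). */
static u128 step_div(u128 x) {
    return (2 * x) % P;
}

/* Double, then subtract p if needed; valid because x < p implies 2x < 2p < 2^128. */
static u128 step_sub(u128 x) {
    x += x;
    if (x >= P)
        x -= P;
    return x;
}

int main(void) {
    u128 a = 1, b = 1;
    for (int i = 0; i < 1000; i++) {
        a = step_div(a);
        b = step_sub(b);
        assert(a == b);                /* the two strategies must agree */
    }
    return 0;
}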
We are only testing left shift (multiplication by 2) and subtraction in the ALU. It might be fun to test multiplication, division, and remainder circuitry, but that is a project for another day.
(Doing all this modulo a prime slightly less than 2^128 instead of 2^127, e.g., 2^128-275, seems not too difficult, but we did not pursue this. It might affect the ILP discussed below.)
Verification the Fast Way can easily be done using software like Pari/GP or using a library like GNU MP that provides modular exponentiation. Incidentally, the security of time-lock puzzles like LCS35 rests on the assumption that there are no ways to compute modular exponentiation significantly faster than the Fast Way.
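As a sketch, verification with GNU MP might look like the following (link with -lgmp); the exponent here is a placeholder for however many doublings the slow computation actually performed.

#include <gmp.h>
#include <stdio.h>

int main(void) {
    mpz_t p, base, exp, result;
    mpz_inits(p, base, exp, result, NULL);

    /* p = 2^127 - 309 */
    mpz_ui_pow_ui(p, 2, 127);
    mpz_sub_ui(p, p, 309);

    mpz_set_ui(base, 2);
    mpz_set_str(exp, "1000000000000", 10);   /* placeholder exponent */

    mpz_powm(result, base, exp, p);          /* the Fast Way, done by the library */
    gmp_printf("2^%Zd mod (2^127-309) = %Zd\n", exp, result);

    mpz_clears(p, base, exp, result, NULL);
    return 0;
}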
We can make the computer do more work in the same amount of time by taking advantage of instruction-level parallelism (ILP). Do several independent computations (with different initial values) in the same thread, interleaved. ILP will, for example, cause 2 interleaved computations to take less than twice as long as just 1. This source code demonstrates doing multiple parallel computations in the same thread. We found that 5-way parallelization had the most throughput. Less than that probably stalled due to dependency on results from the previous iteration. More than that probably ran out of registers. It's interesting how one can learn rather arcane details of the underlying hardware through this kind of timing.
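The linked source code is authoritative; as an illustration, 5-way interleaving within one thread might look roughly like this, with the initial values 1, 3, 5, 7, 9 discussed below and a placeholder iteration count.

#include <stdint.h>
#include <stdio.h>

typedef unsigned __int128 u128;

int main(void) {
    const u128 p = (((u128)1) << 127) - 309;
    u128 x0 = 1, x1 = 3, x2 = 5, x3 = 7, x4 = 9;   /* five independent chains */
    const uint64_t n = 1000000000ULL;              /* placeholder iteration count */

    for (uint64_t i = 0; i < n; i++) {
        /* No chain depends on another, so the CPU can overlap these updates. */
        x0 += x0; if (x0 >= p) x0 -= p;
        x1 += x1; if (x1 >= p) x1 -= p;
        x2 += x2; if (x2 >= p) x2 -= p;
        x3 += x3; if (x3 >= p) x3 -= p;
        x4 += x4; if (x4 >= p) x4 -= p;
    }
    /* Print the low 64 bits of each result as a quick fingerprint. */
    printf("%016llx %016llx %016llx %016llx %016llx\n",
           (unsigned long long)x0, (unsigned long long)x1,
           (unsigned long long)x2, (unsigned long long)x3,
           (unsigned long long)x4);
    return 0;
}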
Optimization trying to take advantage of ILP was delicate. We found that seemingly harmless changes to the code caused large decreases in performance, probably by affecting loop unrolling. In particular, our first attempt at adding checkpoint output every 2^k iterations, by checking that the lower k bits of the iteration counter were zero, hugely decreased performance.
We use a different initial value for the different independent computations within a thread; that is, we computed INITIAL*2^n mod p for different values of INITIAL but the same value of p. In contrast to all computations starting with the same initial value, this provides variability between the computations, preventing a compiler from (hypothetically) optimizing away redundant computation if it reasons that the parallel computations are all exactly the same. Such a complaint about hypothetical optimization could be made of most benchmark and system verification programs which repeat an identical computation many times: a compiler could optimize away all but the first iteration.
When using different initial values, we used odd numbers starting at 1, namely 1, 3, 5, 7, 9 as the initial values for 5-way ILP. If we had (way) more parallelism, we would need to avoid 309. Starting at 309 is the same as starting at 1, just shifted by 127 iterations: 2^x mod (2^127-309) = 309*(2^(x-127)) mod (2^127-309). Substitute x=127 for a simpler equation: 2^127 mod (2^127-309) = 309*(2^0) mod (2^127-309) = 309.
We should also avoid odd multiples of 309. Starting at 3*309 = 927 is the same as starting at 3, shifted by 127 iterations: 3*2^x mod (2^127-309) = 927*(2^(x-127)) mod (2^127-309). We should omit these initial values, multiples of 309, because doing two instances of the same computation, shifted only slightly in time, is not interesting.
Incidentally, Pari/GP can compute a discrete logarithm modulo this prime in about 2 minutes, requiring a stack size of 64 megabytes. For example, we can find that
3 = 2^121183092480708561560055562363468055686 (mod 2^127-309)
(the exponent is about 2^126.51045603311334562760769429562182365479, if you want to know its magnitude in bits.)
For computations spread across multiple cores, sharing one modulus does not conserve register-file space the way it does within a single thread (above, we shared the modulus and varied only the initial values). Therefore, use a different modulus per core in order to be interesting. Here are more moduli and source code for finding them.
Future: try this on a GPU over thousands of cores. We might have to drop down to (less interesting) 64-bit or even 32-bit arithmetic because GPUs are natively 32-bit only (or less). But rolling one's own 128-bit arithmetic library, needing only subtraction, left shift, and comparison, does not seem too difficult, and such a library likely already exists.
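As a rough sketch (assuming nothing about any particular GPU toolkit), hand-rolled 128-bit "double and reduce" built from 64-bit halves, using only shifts, subtraction, and comparison, might look like this.

#include <stdint.h>
#include <stdio.h>

typedef struct { uint64_t hi, lo; } u128_t;    /* value = hi*2^64 + lo */

static int geq(u128_t a, u128_t b) {           /* a >= b ? */
    return a.hi > b.hi || (a.hi == b.hi && a.lo >= b.lo);
}

static u128_t sub(u128_t a, u128_t b) {        /* a - b, assuming a >= b */
    u128_t r;
    r.lo = a.lo - b.lo;
    r.hi = a.hi - b.hi - (a.lo < b.lo);        /* propagate the borrow */
    return r;
}

static u128_t double_mod(u128_t x, u128_t p) { /* (2*x) mod p, for x < p < 2^127 */
    u128_t d;
    d.hi = (x.hi << 1) | (x.lo >> 63);         /* shift left by one bit */
    d.lo = x.lo << 1;
    return geq(d, p) ? sub(d, p) : d;
}

int main(void) {
    /* p = 2^127 - 309 expressed as two 64-bit words */
    const u128_t p = { 0x7FFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFECBULL };
    u128_t x = { 0, 1 };
    for (int i = 0; i < 200; i++)
        x = double_mod(x, p);                  /* x = 2^200 mod p */
    printf("%016llx%016llx\n", (unsigned long long)x.hi, (unsigned long long)x.lo);
    return 0;
}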
The largest computations I've ever done which I'm confident had no errors are probably this verification of the compositeness of the 22nd Fermat number, and a similar-length partial computation of LCS35, abandoned after someone else found the solution first. For LCS35, we multiplied the modulus by a medium-sized prime to have high confidence in intermediate results, as described in the original paper. Both of these computations also happened to be modular exponentiation, albeit the Fast Way, and not with a prime modulus. These computations did have several restarts from checkpoints after the computer went down (e.g., OS updates requiring reboots, perhaps some system crashes). Hypothetically, if a system makes no arithmetic errors but catastrophically crashes, has it passed a test of correctness of its arithmetic calculations?