In most of my posts about random number generation, we have mainly considered the properties of various generation schemes. This may be unexpected, but the performance of a randomized algorithm can depend not on the chosen generation scheme but on other factors. In this post (inspired by an excellent article by Daniel Lemire), we explore the main causes of slowdown in random number generation, which often outweigh the performance of the PRNG engine.

Imagine this situation:

As a homework assignment, Juan and Sasha implement the same randomized algorithm in C++, to be run on the same university computer with the same data set. Their code is almost identical and differs only in how it generates random numbers. Juan is in a hurry to get to his music lessons, so he simply chose the Mersenne Twister. Sasha, on the other hand, spent several extra hours on research: he benchmarked several of the fastest PRNGs that he had recently learned about from social networks and chose the fastest one. At their meeting, Sasha couldn't wait to boast, and he asked Juan: "What PRNG did you use?"

"Personally, I just took the Mersenne Twister. It's built into the language and seems to work quite well."

"Ha!" answered Sasha. "I used `jsf32`. It's much faster than the old and slow Mersenne Twister! My program runs in 3 minutes and 15 seconds!"

"Hmm, not bad, but mine does it in less than a minute," says Juan and shrugs. "Well, okay, I have to go to the concert. Coming with me?"

“No,” Sasha replies. “I ... uh ... need to look at my code again.”

This awkward fictional situation is not as contrived as it may seem; we will return to it, with real numbers, at the end of the article.

Most modern high-quality random number generators produce machine words filled with random bits, that is, they usually generate numbers in an interval such as [0..2^32) or [0..2^64).

We will look at many different methods. To simplify the discussion, instead of generating numbers in an arbitrary interval [i..j), we will generate numbers in the interval [0..k): given such a number, we can simply add i to obtain the desired range.

Many languages have built-in facilities for getting a random number in a specified interval. For example, to draw a card from a 52-card deck, in scripting languages like Perl and Python we can write `int(rand(52))` and `random.randint(0, 51)`, respectively.

In C++, we can similarly use `uniform_int_distribution`. The C++ code to implement this approach is simple:

```
uint32_t bounded_rand(rng_t& rng, uint32_t range) {
    std::uniform_int_distribution<uint32_t> dist(0, range - 1);
    return dist(rng);
}
```
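For illustration, here is how this wrapper might be used to deal a card. This is my own sketch: the article leaves the engine type `rng_t` open, so I assume it is an alias for a standard engine such as `std::mt19937`.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

using rng_t = std::mt19937;  // assumed engine type for this example

uint32_t bounded_rand(rng_t& rng, uint32_t range) {
    std::uniform_int_distribution<uint32_t> dist(0, range - 1);
    return dist(rng);
}

// Deal one card (0..51) from a 52-card deck.
uint32_t deal_card(rng_t& rng) {
    return bounded_rand(rng, 52);
}
```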

Usually the built-in facilities use one of the techniques described below, but most users simply use them without thinking about what happens "under the hood", assuming that they are properly designed and reasonably efficient. In C++, the built-in facilities are more complex because they must work with fairly arbitrary generation engines: a generator producing values in the interval from -3 to 17 is perfectly acceptable and can be used with `std::uniform_int_distribution` to create numbers in any interval, for example [0..1000). That is, the built-in C++ facilities are overcomplicated for most of the cases in which they are applied. Let's move from an overcomplicated approach to an oversimplified one.

When I was learning to program, we generated numbers in an interval (for example, to select a card from a 52-card deck) using the remainder operator. To get a number in the interval [0..52), we wrote `rand() % 52`. In C++, this approach can be implemented as follows:

```
uint32_t bounded_rand(rng_t& rng, uint32_t range) {
    return rng() % range;
}
```

Despite the simplicity of this approach, it demonstrates the reason that getting numbers in the right interval is usually a slow task: it requires division (to calculate the remainder produced by the `%` operator). Division is usually at least an order of magnitude slower than other arithmetic operations, so this single arithmetic operation takes longer than all the work performed by a fast PRNG. But besides being slow, this approach is also biased. Suppose that `rand()` produces numbers in the interval [0..2^32). When computing `rand() % 52`, there will be 82,595,525 ways to select each of the first 48 cards from the deck and only 82,595,524 ways to select each of the last four cards. In other words, there is a bias of 0.00000121% against these last four cards (perhaps they are kings!). When I was a student and wrote homework about throwing dice or drawing cards, nobody usually worried about such tiny distortions, but as the interval increases, the bias grows linearly: for a 32-bit PRNG, a bounded interval that is a sizable fraction of 2^32 makes the bias significant. In this article, we will mainly consider techniques that eliminate this systematic error, but it should probably be said that for 64-bit PRNGs, the amount of bias in most typical applications will most likely be negligible.
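These counts follow directly from integer division; the small sketch below (mine, not code from the original article) confirms the arithmetic for the card example.

```cpp
#include <cassert>
#include <cstdint>

// Number of 32-bit inputs mapping to card c under rand() % 52:
// every card gets floor(2^32 / 52) ways, and the first
// 2^32 mod 52 cards get one extra way.
uint64_t ways_for_card(uint32_t c) {
    const uint64_t period = uint64_t(1) << 32;  // 2^32 possible PRNG outputs
    return period / 52 + (c < period % 52 ? 1 : 0);
}
```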

Another problem is that some generators have weak low-order bits. For example, the Xoroshiro+ and Xoshiro+ PRNG families have low-order bits that do not pass statistical tests. When we execute `% 52` (and 52 is even), we pass the lowest bit straight through to the output.

Another common technique is to use a PRNG that generates floating-point numbers in the interval [0..1) and then convert those numbers to the desired interval. This approach is used in Perl, where it is recommended to use `int(rand(10))` to generate an integer in the interval [0..10) by generating a floating-point number and rounding it down.

In C++, this approach is written as:

```
static uint32_t bounded_rand(rng_t& rng, uint32_t range) {
    double zeroone = 0x1.0p-32 * rng();
    return range * zeroone;
}
```

(Note that `0x1.0p-32` is a binary floating-point constant for 2^-32, which we use to convert a random integer in [0..2^32) to a double in the unit interval; we could instead have performed this conversion with `ldexp(rng(), -32)`, but when I benchmarked that approach, it turned out to be much slower.)

This approach is just as biased as the classical remainder method, but the bias shows up differently. For example, if we chose numbers in the interval [0..52), the numbers 0, 13, 26, and 39 would occur one time less often than the others.
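The pattern is easiest to see on a scaled-down analogue. The sketch below (my illustration, not code from the article) uses a hypothetical 8-bit generator, so the under-represented outputs differ from the 32-bit case, but the mechanism is the same: 256 inputs cannot map evenly onto 52 outputs, so a few outputs occur one time less often.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Exhaustively count how often each output of floor(52 * x / 2^8)
// occurs when x runs over all outputs of an 8-bit "generator".
std::array<uint32_t, 52> float_mult_counts_8bit() {
    std::array<uint32_t, 52> counts{};
    for (uint32_t x = 0; x < 256; ++x) {
        double zeroone = x * 0x1.0p-8;            // in [0..1)
        counts[uint32_t(52 * zeroone)] += 1;      // always < 52
    }
    return counts;
}
```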

When generalized to 64 bits, this version is even more unpleasant, because it requires a floating-point type whose mantissa is at least 64 bits. On x86 machines with Linux and macOS, we can use `long double` to access x86 extended-precision floating point, which has a 64-bit mantissa, but `long double` is not universally portable: on some systems `long double` is equivalent to `double`. On the plus side, this approach is faster than modulo-based solutions for PRNGs with weak low-order bits.

The multiplication method can be adapted to fixed-point rather than floating-point arithmetic. In effect, we multiply everything through by 2^32, treating the random word as a 32-bit binary fraction:

```
uint32_t bounded_rand(rng_t& rng, uint32_t range) {
    uint32_t x = rng();
    uint64_t m = uint64_t(x) * uint64_t(range);
    return m >> 32;
}
```
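As a quick sanity check (my sketch, with hypothetical function names): for a 32-bit word and a small range like 52, the double computation in the floating-point method is exact, so this integer version produces bit-identical results.

```cpp
#include <cassert>
#include <cstdint>

// Integer fixed-point version: floor(x * range / 2^32).
uint32_t mult_shift(uint32_t x, uint32_t range) {
    return uint64_t(x) * uint64_t(range) >> 32;
}

// Floating-point version from the previous section.
// For range = 52, x * range fits in well under 53 bits,
// so the double arithmetic is exact.
uint32_t mult_double(uint32_t x, uint32_t range) {
    return uint32_t(range * (0x1.0p-32 * x));
}
```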

It may seem that this version requires 64-bit arithmetic, but on x86 processors a good compiler will compile it into a 32-bit `mult` instruction (which gives us two 32-bit output values, one of which is the returned value). This version can be expected to be fast, but it is biased in exactly the same way as the floating-point multiplication method.

We can turn the multiplication scheme into one based on division: instead of multiplying `x * range/2 ** 32`, we calculate `x/(2 ** 32/range)`. Since we are working with integer arithmetic, rounding in this version happens differently and sometimes generates values outside the desired interval. If we discard such values (that is, throw them away and generate new ones), the result is a method without bias. For example, in the case of drawing a card with a 32-bit PRNG, we can generate a 32-bit number and divide it by 2^32/52 = 82,595,524 to choose a card, retrying whenever the result falls outside [0..52).

The code for this version uses a trick to divide 2^32 by `range` without resorting to 64-bit mathematics. To compute `2 ** 32/range` directly, we would need to represent the number 2^32, which does not fit in a 32-bit integer. Instead, we note that with unsigned arithmetic, unary negation of `range` computes the positive value 2^32 − `range`; dividing this value by `range` gives an answer one less than `2 ** 32/range`. Consequently, the C++ code for generating a number using division with rejection is as follows:

```
uint32_t bounded_rand(rng_t& rng, uint32_t range) {
    // calculates divisor = 2**32 / range
    uint32_t divisor = ((-range) / range) + 1;
    if (divisor == 0) // overflow, it's really 2**32
        return 0;
    for (;;) {
        uint32_t val = rng() / divisor;
        if (val < range)
            return val;
    }
}
```

Of course, this approach requires two slow division-based operations, which are usually slower than other arithmetic operations, so you should not expect it to be fast.

We can also use the rejection approach to eliminate bias in the classical remainder method. In the playing-card example, we again need to drop 48 values; in this version, instead of dropping the last 48 values, we drop the first 48 values.

Here is the implementation of this approach in C ++:

```
uint32_t bounded_rand(rng_t& rng, uint32_t range) {
    // calculates 2**32 % range
    uint32_t t = (-range) % range;
    for (;;) {
        uint32_t r = rng();
        if (r >= t)
            return r % range;
    }
}
```

This technique eliminates the bias, but it requires two time-consuming remainder operations for each output value (and the internal generator may need to produce more than one number). Consequently, we should expect this method to be roughly twice as slow as the classical biased approach.

OpenBSD's `arc4random_uniform` code (also used on OS X and iOS) applies this strategy.

Java takes a different approach to generating a number in an interval, one that uses only a single remainder operation except in the fairly rare cases where the result is discarded. The code:

```
static uint32_t bounded_rand(rng_t& rng, uint32_t range) {
    uint32_t x, r;
    do {
        x = rng();
        r = x % range;
    } while (x - r > (-range));
    return r;
}
```

To understand why this variant works, you need to think a little. Unlike the previous remainder-based version, which eliminates bias by removing some of the lowest values produced by the internal generation engine, this version filters out values from the top of the engine's interval.
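To see that this top-filtering is exact, we can exhaustively run an 8-bit analogue of the same logic (my illustration, with the constant 2^32 rescaled to 2^8): every output in [0..range) ends up produced by the same number of raw engine values.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// For a hypothetical 8-bit engine, count how many raw values x are
// accepted for each output r = x % range under Java's condition,
// which (rescaled to 8 bits) rejects x when x - r > 2^8 - range.
std::vector<uint32_t> java_accept_counts(uint32_t range) {
    std::vector<uint32_t> counts(range, 0);
    for (uint32_t x = 0; x < 256; ++x) {
        uint32_t r = x % range;
        if (x - r <= 256 - range)   // accept
            counts[r] += 1;
    }
    return counts;
}
```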

In almost the same way as we eliminated the bias from the remainder technique, we can eliminate the bias from the integer multiplication technique. This technique was devised by Lemire.

```
uint32_t bounded_rand(rng_t& rng, uint32_t range) {
    uint32_t t = (-range) % range;
    uint64_t m;
    uint32_t l;
    do {
        uint32_t x = rng();
        m = uint64_t(x) * uint64_t(range);
        l = uint32_t(m);
    } while (l < t);
    return m >> 32;
}
```
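An 8-bit analogue again shows that the rejection makes the method exact (my sketch, with 2^32 rescaled to 2^8): after discarding the inputs whose low byte falls below the threshold, every output occurs equally often.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// 8-bit analogue of Lemire's method: reject x when the low byte of
// x * range is below t = 2^8 mod range; output the high byte.
std::vector<uint32_t> lemire_accept_counts(uint32_t range) {
    std::vector<uint32_t> counts(range, 0);
    uint32_t t = 256 % range;
    for (uint32_t x = 0; x < 256; ++x) {
        uint32_t m = x * range;      // fits easily in 32 bits
        uint32_t l = m & 0xff;       // low "word"
        if (l >= t)
            counts[m >> 8] += 1;     // high "word" is the output
    }
    return counts;
}
```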

Our last approach excludes division and remainder operations entirely. Instead, it uses a simple masking operation to obtain a random number in the interval [0..2^k), where 2^k is the smallest power of two greater than or equal to the bound, and rejects any result that falls outside [0..range):

```
uint32_t bounded_rand(rng_t& rng, uint32_t range) {
    uint32_t mask = ~uint32_t(0);
    --range;
    mask >>= __builtin_clz(range | 1);
    uint32_t x;
    do {
        x = rng() & mask;
    } while (x > range);
    return x;
}
```
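For example (a sketch of mine using GCC/Clang's `__builtin_clz`, as in the code above, with `std::mt19937` assumed as the engine): for a bound of 52, the mask is 63, so on average 52 out of every 64 candidates are accepted.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

using rng_t = std::mt19937;  // assumed engine for this example

// Smallest all-ones mask covering range - 1.
uint32_t mask_for(uint32_t range) {
    uint32_t mask = ~uint32_t(0);
    mask >>= __builtin_clz((range - 1) | 1);
    return mask;
}

// Same algorithm as above, written without mutating range.
uint32_t bounded_rand_mask(rng_t& rng, uint32_t range) {
    uint32_t mask = mask_for(range);
    uint32_t x;
    do {
        x = rng() & mask;
    } while (x > range - 1);
    return x;
}
```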

This approach was adopted by Apple when (in the macOS Sierra release) it made its own revision of the `arc4random_uniform` code.

Now we have several approaches to evaluate. Unfortunately, when we are concerned about the cost of a single division, benchmarking becomes non-trivial. No benchmark can take into account all the factors affecting an application, and there is no guarantee that the best option for your application will be the best for mine.

We use three benchmarks and test techniques with many different PRNGs.

Probably the most obvious benchmark is shuffling. In this benchmark we imitate a large-scale shuffle: to shuffle an array of size N, we need numbers in the intervals [0..N), [0..N-1), and so on down to [0..1). For a `uint32_t` array of about 2^32 entries, this amounts to the following loop:

```
for (uint32_t i = 0xffffffff; i > 0; --i) {
    uint32_t bval = bounded_rand(rng, i);
    assert(bval < i);
    sum += bval;
}
```

Notice that we "use" each number by adding it to `sum` (so that the optimizer does not discard the work), but do no actual shuffling, in order to focus on number generation. To test 64-bit generation, we have a similar test; it would be impractical to mirror the shuffling of an array of 2^64 entries, so the 64-bit variant performs the same number of iterations with correspondingly larger bounds:

```
for (uint32_t i = 0xffffffff; i > 0; --i) {
    uint64_t bound = (uint64_t(i) << 32) | i;
    uint64_t bval = bounded_rand(rng, bound);
    assert(bval < bound);
    sum += bval;
}
```
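A minimal harness for timing one method might look like the sketch below; this is my own illustration (the article's actual harness lives in its GitHub repository), using `std::chrono`, a shortened iteration count, and `std::mt19937` as an assumed engine.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <random>
#include <utility>

using rng_t = std::mt19937;

uint32_t bounded_rand(rng_t& rng, uint32_t range) {
    return rng() % range;  // stand-in: any of the methods under test
}

// Returns (elapsed seconds, checksum); the checksum keeps the
// compiler from optimizing the whole loop away.
std::pair<double, uint64_t> time_small_shuffle(rng_t& rng) {
    auto start = std::chrono::steady_clock::now();
    uint64_t sum = 0;
    for (uint32_t j = 0; j < 0xff; ++j)          // shortened from 0xffff
        for (uint32_t i = 0xffff; i > 0; --i)
            sum += bounded_rand(rng, i);
    std::chrono::duration<double> dt = std::chrono::steady_clock::now() - start;
    return {dt.count(), sum};
}
```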

The results below show the performance of this benchmark for each of the methods discussed in the article, using the Mersenne Twister and testing both 32-bit code (with `std::mt19937` from `libstdc++`) and similar 64-bit code (with `std::mt19937_64` from `libstdc++`). The results are the geometric mean of 15 runs with different seed values, normalized so that the classical remainder method has unit run time.

It may seem that we now have clear answers about performance: we could rank the techniques by apparent merit and wonder what the `libstdc++` developers were thinking when they wrote such a terrible implementation for 32-bit numbers. But, as is often the case with benchmarking, the situation is more complicated than these results suggest. First, there is a risk that the results are specific to the Mersenne Twister, so we will expand the set of PRNGs tested. Second, there may be a subtle problem with the benchmark itself. Let's deal with the first question first.

We will test 32-bit PRNGs using

`arc4_rand32`, `chacha8r`, `gjrand32`, `jsf32`, `mt19937`, `pcg32`, `pcg32_fast`, `sfc32`, `splitmix32`, `xoroshiro64+`, `xorshift*64/32`, `xoshiro128+`, and `xoshiro128**`, and 64-bit PRNGs using `gjrand64`, `jsf64`, `mcg128`, `mcg128_fast`, `mt19937_64`, `pcg64`, `pcg64_fast`, `sfc64`, `splitmix64`, `xoroshiro128+`, `xorshift*128/64`, `xoshiro256+`, and `xoshiro256**`. These sets give us a few slow PRNGs and a large number of very fast ones. Here are the results:

`libstdc++` ceases to seem so terrible. In this benchmark, the biased multiplication-based approach wins by a significant margin. There are many situations in which the bound will be small relative to the size of the PRNG output and performance is absolutely critical. In such situations, a slight bias is unlikely to have a noticeable effect, but PRNG speed will. One such example is Quicksort with a random pivot. Of the unbiased methods, the bitmask technique looks promising.

But before drawing serious conclusions, we need to point out a huge problem with this benchmark: most of the time is spent on very high bounds, which most likely gives excessive weight to large intervals. Therefore we move on to a second benchmark.

This benchmark is similar to the previous one, but performs a much smaller "array shuffle" (many times over). Code:

```
for (uint32_t j = 0; j < 0xffff; ++j) {
    for (uint32_t i = 0xffff; i > 0; --i) {
        uint32_t bval = bounded_rand(rng, i);
        assert(bval < i);
        sum += bval;
    }
}
```

This benchmark avoids placing too much emphasis on large bounds and better reflects real use cases, but now it ignores large bounds entirely.

This benchmark aims to avoid the drawbacks of the previous two: it performs testing at every power-of-two size, so every size is represented but none is overweighted.

```
for (uint32_t bit = 1; bit != 0; bit <<= 1) {
    for (uint32_t i = 0; i < 0x1000000; ++i) {
        uint32_t bound = bit | (i & (bit - 1));
        uint32_t bval = bounded_rand(rng, bound);
        assert(bval < bound);
        sum += bval;
    }
}
```

Many of our findings remain unchanged: the biased multiplication method is fast if we can live with the error, and the bitmask scheme looks like a good middle-ground choice.

We could stop here if we didn't want to go back, take a critical look at our code, and make changes to it.

Up to this point, all the bias-eliminating methods have required an extra remainder operation, which is why they run considerably slower than the biased methods. It would be helpful if we could reduce this overhead.

Some of our algorithms have code using a threshold value, for example:

```
uint32_t bounded_rand(rng_t& rng, uint32_t range) {
    // calculates 2**32 % range
    uint32_t t = (-range) % range;
    for (;;) {
        uint32_t r = rng();
        if (r >= t)
            return r % range;
    }
}
```

When `range` is small compared to the PRNG's output interval, the raw number will most often be far above the threshold, so the expensive computation of the threshold itself is usually wasted work. Since the threshold is always less than `range`, we need to compute it only when the raw value is itself below `range`. The following code handles this:

```
uint32_t bounded_rand(rng_t& rng, uint32_t range) {
    uint32_t r = rng();
    if (r < range) {
        uint32_t t = (-range) % range;
        while (r < t)
            r = rng();
    }
    return r % range;
}
```

This change can be applied both to the "debiased modulo (twice)" method (see above) and to the "debiased integer multiply" method. The idea is due to Lemire, who applied it to the latter (but not to the former).

This optimization leads to a significant improvement in the 64-bit large-shuffle benchmark (where the modulo is even more expensive), but in fact slightly degrades performance in the 32-bit benchmark. Despite the improvements, the bitmask method still wins.

On the other hand, this change significantly speeds up the small-shuffle benchmark both for the integer multiplication method and for the double remainder method. In both cases, the performance of the unbiased variants shifts closer to that of their biased counterparts. The performance of the double remainder method (OpenBSD) is now almost equal to that of the single remainder method (Java).

We see a similar improvement in the all-intervals benchmark.

It seems we can declare a new universal winner: Lemire's optimized unbiased integer multiplication method.

Usually, division is required to calculate `a % b`, but in situations where `a < b`, the result is simply `a`, and no division is required. And when `a/2 < b`, the result is simply `a - b`. Consequently, instead of computing `a %= b;` we can perform:

```
if (a >= b) {
    a -= b;
    if (a >= b)
        a %= b;
}
```

The cost of division is so significant that the extra cost of this more complex code can pay for itself through the time saved by avoiding division.

Adding this optimization significantly improves the large-shuffle benchmark results. Again, this is more noticeable in the 64-bit code, where the remainder operation is more expensive. For the double remainder method (OpenBSD style), results are shown both for the version optimizing only one of the remainder operations and for the version optimizing both.

In this benchmark the bitmask is still the winner, but the gap between it and Lemire's approach has narrowed considerably.

Adding this optimization does not improve the small-shuffle benchmark results, so the question is whether it adds significant cost. In some cases it does not; in others the cost increases slightly.

In the all-intervals benchmark the changes are likewise small.

The main reason for using a variety of PRNGs to test the range-reduction schemes was to avoid unintentionally distorting the results through the peculiarities of particular PRNG schemes. But we can use the same internal test results to compare the generation schemes themselves.

The graph below shows the performance of different 32-bit generation schemes, averaged over all methods and fifteen runs and normalized to the performance of the 32-bit Mersenne Twister:

On the one hand, I am glad to see that `pcg32_fast` is really fast: it is beaten only by a small version of Xoroshiro (which does not pass statistical tests). But this also shows why I rarely get worked up about the performance of modern high-performance general-purpose PRNGs: the difference between the methods is very small. In particular, the four fastest schemes differ in performance by less than 5%, which I believe is simply "noise".

The next graph shows the performance of various 64-bit generation schemes, averaged over all techniques and fifteen runs and normalized to the performance of the 32-bit Mersenne Twister. It may seem strange to normalize against the 32-bit Mersenne Twister, but doing so lets us see the extra cost of using 64-bit generation in cases where 32-bit generation would be enough.

These results confirm that `mcg128_fast` is incredibly fast, but the last four techniques again differ by only about 5%, so it is hard to choose among the fastest methods. `pcg64` and `pcg64_fast` must be slower than `mcg128_fast` because their underlying generators are 128-bit linear congruential generators (LCGs) and 128-bit multiplicative congruential generators (MCGs). Despite not being the fastest techniques in this set, `pcg64` is still 20% faster than the 64-bit Mersenne Twister.

But perhaps more importantly, these results also show that if you do not need 64-bit output, a 64-bit PRNG is usually slower than a 32-bit one.

From our benchmarks we can see that switching from a commonly used PRNG (for example, the 32-bit Mersenne Twister) to a faster PRNG reduced benchmark execution time by 45%. But switching from the standard method of finding numbers in an interval to our fastest method reduced benchmark time by about 66%; in other words, to one third of the original time.

The fastest (unbiased) method is Lemire's method, with my additional optimization. Here it is:

```
uint32_t bounded_rand(rng_t& rng, uint32_t range) {
    uint32_t x = rng();
    uint64_t m = uint64_t(x) * uint64_t(range);
    uint32_t l = uint32_t(m);
    if (l < range) {
        uint32_t t = -range;
        if (t >= range) {
            t -= range;
            if (t >= range)
                t %= range;
        }
        while (l < t) {
            x = rng();
            m = uint64_t(x) * uint64_t(range);
            l = uint32_t(m);
        }
    }
    return m >> 32;
}
```
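Used for the card-dealing example from the beginning of the article (assuming, as in the earlier examples, that `rng_t` is an alias for `std::mt19937`; that alias is my assumption, not the article's):

```cpp
#include <cassert>
#include <cstdint>
#include <random>

using rng_t = std::mt19937;  // assumed engine for this example

// Lemire's debiased integer multiply with the extra mod optimization.
uint32_t bounded_rand(rng_t& rng, uint32_t range) {
    uint32_t x = rng();
    uint64_t m = uint64_t(x) * uint64_t(range);
    uint32_t l = uint32_t(m);
    if (l < range) {
        uint32_t t = -range;           // 2^32 - range
        if (t >= range) {              // cheap computation of t % range
            t -= range;
            if (t >= range)
                t %= range;
        }
        while (l < t) {                // reject and retry
            x = rng();
            m = uint64_t(x) * uint64_t(range);
            l = uint32_t(m);
        }
    }
    return m >> 32;
}
```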

Using Lemire's method will improve the performance of most randomized algorithms more than switching from a fast generation engine to a slightly faster one.

The code for all the tests is posted on GitHub. In total, I tested 23 methods for `bounded_rand` using 26 different PRNGs (13 32-bit and 13 64-bit), with two compilers (GCC 8 and LLVM 6), which gave me 26 * 23 * 2 = 1196 executables, each of which was run with the same 15 seeds, giving 1196 * 15 = 17,940 unique test runs, each combining the three benchmarks. I mostly ran the tests on a 48-core machine with four Xeon E7-4830v3 processors at 2.1 GHz. Running the full set of tests required a little less than a month of CPU time.

Finally, let's return to the situation from the introduction. Imagine that Sasha used `jsf32.STD-libc++` and Juan used `mt19937.BIASED_FP_MULT_SCALE`. In benchmark 3, the latter takes 69.6% less time. In other words, the times in that fictional situation are based on real-world data.

Source text: [Translation] Efficient generation of numbers in a given interval