Efficiently Generating a Random Number in a Given Interval


In most of my posts about random number generation, we have mainly considered the properties of various generation schemes. It may come as a surprise, but the performance of a randomized algorithm can hinge not on the chosen generation scheme but on other factors. In this post (inspired by an excellent article by Daniel Lemire), we explore the main causes of slowdown when generating random numbers in an interval, causes that often outweigh the performance of the PRNG engine itself.

Imagine this situation:

As a homework assignment, Juan and Sasha implement the same randomized algorithm in C++, to be run on the same university computer with the same data set. Their code is almost identical and differs only in how random numbers are generated. Juan is in a hurry to get to his music lessons, so he simply chose the Mersenne Twister. Sasha, on the other hand, spent several extra hours on research: he benchmarked several of the fastest PRNGs he had recently learned about on social media, and picked the fastest one. When they met, Sasha could not wait to boast, and asked Juan: "Which PRNG did you use?"

"Personally, I just took the Mersenne Twister; it's built into the language and seems to work quite well."

"Ha!" Sasha answered. "I used jsf32. It's much faster than the old and slow Mersenne Twister! My program runs in 3 minutes and 15 seconds!"

"Hmm, not bad, but mine does it in under a minute," Juan says with a shrug. "Well, okay, I have to go to the concert. Want to come along?"

"No," Sasha replies. "I... uh... need to take another look at my code."

This awkward situation is fictional, but it is based on real results. If your randomized algorithm is not as fast as you would like and the bottleneck seems to be random number generation, then, oddly enough, the problem may not lie in the random number generator!

Introduction: Random numbers in practice


Most modern high-quality random number generators produce machine words filled with random bits; that is, they usually generate numbers in the interval [0..2^32) or [0..2^64). But in many use cases we need numbers in a particular interval; for example, to roll a die or pick a random playing card, we need numbers from small, fixed intervals. And many algorithms, from shuffling and reservoir sampling to randomized binary search trees, require numbers drawn from other intervals.

Methods


We will look at many different methods. To simplify the discussion, instead of generating numbers in the interval [i..j) or [i..j] we will generate numbers in the interval [0..k). With such a scheme we can, for example, generate numbers in [i..j) by setting k = j - i, generating a number in [0..k), and then adding i to it.

Built-in C++ tools


Many languages have built-in tools for getting a random number in a given interval. For example, to draw a card from a 52-card deck, in scripting languages like Perl and Python we can write int(rand(52)) and random.randint(0, 51) respectively. In C++, we can similarly use uniform_int_distribution.

The C++ code to implement this approach is simple:

  uint32_t bounded_rand(rng_t& rng, uint32_t range) {
      std::uniform_int_distribution<uint32_t> dist(0, range - 1);
      return dist(rng);
  }

Usually, the built-in tools use one of the techniques described below, but most users simply use them without thinking about what happens "under the hood", assuming that these tools are properly designed and reasonably efficient. In C++, the built-in tools are more complex, because they must work with fairly arbitrary generation engines: a generator producing values in the interval from -3 to 17 would be perfectly acceptable and could be used with std::uniform_int_distribution to create numbers in any interval, such as [0..1000). That is, the built-in C++ tools are over-engineered for most of the cases in which they are applied.

Classic remainder of division (biased)


Let's move from an overcomplicated approach to an oversimplified one.

When I was learning to program, we generated numbers in an interval (for example, to pick a card from a 52-card deck) using the remainder operator. To get a number in [0..52), we wrote rand() % 52.

In C ++, this approach can be implemented as follows:

  uint32_t bounded_rand(rng_t& rng, uint32_t range) {
      return rng() % range;
  }

Despite its simplicity, this approach illustrates why getting numbers in the right interval is usually a slow task: it requires division (to compute the remainder via the % operator). Division is usually at least an order of magnitude slower than other arithmetic operations, so this single arithmetic operation takes longer than all the work done by a fast PRNG.

But besides being slow, this approach is also biased. To understand why rand() % 52 returns biased numbers, suppose that rand() produces numbers in the interval [0..2^32), and note that 52 does not divide 2^32 evenly: it divides it 82,595,524 times with remainder 48. That means if we use rand() % 52, there will be 82,595,525 ways to get each of the first 48 cards of the deck but only 82,595,524 ways to get each of the last four. In other words, there is a bias of about 0.00000121% against those last four cards (perhaps they are the kings!). When I was a student writing homework about rolling dice or drawing cards, nobody usually worried about such tiny distortions, but as the interval grows, the bias grows linearly. For a 32-bit PRNG, a bounded interval smaller than 2^24 has a bias below 0.5%, but above 2^31 the bias is 50%: some numbers will come up twice as often as others.

In this article we will mainly consider techniques that eliminate this bias, but it is probably worth saying that for a 64-bit PRNG, the amount of bias in most typical applications will most likely be negligible.

Another concern is that some generators have weak low-order bits. For example, the Xoroshiro+ and Xoshiro+ PRNG families have low-order bits that fail statistical tests. When we compute % 52 (an even number), we pass the lowest bit straight through to the output.

Floating-point multiplication (biased)


Another common technique is to use a PRNG that generates floating-point numbers in the interval [0..1) and then to scale those numbers into the desired interval. This approach is used in Perl, where it is recommended to use int(rand(10)) to generate an integer in [0..10) by generating a floating-point number and then rounding it down.
In C ++, this approach is written as:

  static uint32_t bounded_rand(rng_t& rng, uint32_t range) {
      double zeroone = 0x1.0p-32 * rng();
      return range * zeroone;
  }

(Note that 0x1.0p-32 is a hexadecimal floating-point constant for 2^-32, which we use to convert a random integer in [0..2^32) to a double in the unit interval; we could instead perform this conversion with ldexp(rng(), -32), but when I benchmarked that approach it was much slower.)

This approach is just as biased as the classic remainder method, but the bias manifests differently. For example, if we were choosing numbers in the interval [0..52), four of the 52 values, evenly spaced across the interval, would come up one time less often than the others.

Generalizing this version to 64 bits is even more unpleasant, because it requires a floating-point type whose mantissa is at least 64 bits wide. On x86 machines running Linux and macOS we can use long double to get x87 extended-precision floating point, which has a 64-bit mantissa, but long double is not universally portable: on some systems long double is equivalent to double.

On the plus side, this approach is faster than the remainder-based solutions for PRNGs with weak low-order bits.

Integer multiplication (biased)


The multiplication method can be adapted to fixed-point rather than floating-point arithmetic. In effect, we simply multiply throughout by 2^32:

  uint32_t bounded_rand(rng_t& rng, uint32_t range) {
      uint32_t x = rng();
      uint64_t m = uint64_t(x) * uint64_t(range);
      return m >> 32;
  }

It might seem that this version requires 64-bit arithmetic, but on x86 processors a good compiler will compile this code into a 32-bit mul instruction (which yields two 32-bit output values, one of which is the return value). This version can be expected to be fast, but it is biased in exactly the same way as the floating-point multiplication method.

Division with rejection (debiased)


We can turn the floating-point multiplication scheme into a division-based scheme: instead of computing x * range / 2^32, we compute x / (2^32 / range). Since we are working in integer arithmetic, the rounding works out differently and will sometimes generate values outside the desired interval. If we reject those values (that is, throw them away and generate new ones), the result is a bias-free method.

For example, when drawing a card with a 32-bit PRNG, we can generate a 32-bit number and divide it by 2^32 / 52 = 82,595,524 to choose a card. This works whenever the random value from the PRNG is less than 52 * 82,595,524 = 2^32 - 48. If the random value is one of the last 48 values at the top of the generator's interval, it must be rejected and another one generated.

Our code for this version uses a trick to divide 2^32 by range without 64-bit math. Computing 2^32 / range directly would require representing 2^32, which is too large (by one!) to fit in a 32-bit integer. Instead, we note that for unsigned integers, unary negation of range computes the positive value 2^32 - range; dividing that by range produces an answer exactly one less than 2^32 / range, so adding one gives the divisor we need.

Consequently, the C++ code for generating a number via division with rejection is as follows:

  uint32_t bounded_rand(rng_t& rng, uint32_t range) {
      // calculates divisor = 2**32 / range
      uint32_t divisor = ((-range) / range) + 1;
      if (divisor == 0) // overflow, it's really 2**32
          return 0;
      for (;;) {
          uint32_t val = rng() / divisor;
          if (val < range)
              return val;
      }
  }

Of course, this approach requires two slow division-based operations, which are usually slower than other arithmetic operations, so you should not expect it to be fast.

Debiased remainder of division (twice) - the OpenBSD method


We can also use rejection to eliminate the bias in the classic remainder method. In the playing-card example, we again need to reject 48 values; in this version, instead of rejecting the last 48 values, we (equivalently) reject the first 48 values.

Here is the implementation of this approach in C ++:

  uint32_t bounded_rand(rng_t& rng, uint32_t range) {
      // calculates 2**32 % range
      uint32_t t = (-range) % range;
      for (;;) {
          uint32_t r = rng();
          if (r >= t)
              return r % range;
      }
  }

This technique eliminates the bias, but it requires two time-consuming remainder operations per output value (and the internal generator may have to produce more than one number). Consequently, the method should be expected to be roughly twice as slow as the classic biased approach.

OpenBSD's arc4random_uniform code (also used in OS X and iOS) applies this strategy.

Debiased remainder of division (once) - the Java method


Java uses a different approach to generating a number in an interval, one that needs only a single remainder operation except in the fairly rare cases when the result is rejected. The code:

  static uint32_t bounded_rand(rng_t& rng, uint32_t range) {
      uint32_t x, r;
      do {
          x = rng();
          r = x % range;
      } while (x - r > (-range));
      return r;
  }

Understanding why this variant works takes a little thought. Unlike the previous remainder-based version, which removes bias by rejecting some of the lowest values from the internal generation engine, this version filters out values from the top of the engine's interval.

Debiased integer multiplication - Lemire's method


In much the same way that we removed the bias from the remainder technique, we can remove the bias from the integer multiplication technique. This technique was devised by Lemire.

  uint32_t bounded_rand(rng_t& rng, uint32_t range) {
      uint32_t t = (-range) % range;
      uint64_t m;
      uint32_t l;
      do {
          uint32_t x = rng();
          m = uint64_t(x) * uint64_t(range);
          l = uint32_t(m);
      } while (l < t);
      return m >> 32;
  }

Bitmask with rejection (debiased) - Apple's method


Our last approach avoids division and remainder operations entirely. Instead, it uses a simple masking operation to obtain a random number in the interval [0..2^k), where k is the smallest value such that 2^k is at least as large as the interval. If the value is too large for our interval, we reject it and try another. The code is shown below:

  uint32_t bounded_rand(rng_t& rng, uint32_t range) {
      uint32_t mask = ~uint32_t(0);
      --range;
      mask >>= __builtin_clz(range | 1);
      uint32_t x;
      do {
          x = rng() & mask;
      } while (x > range);
      return x;
  }

Apple adopted this approach when it revised its arc4random_uniform code in the macOS Sierra release.

Benchmarking Basic Techniques


We now have several approaches to evaluate. Unfortunately, when the cost of a single division is a concern, benchmarking becomes nontrivial. No benchmark can account for all the factors affecting an application, and there is no guarantee that the best option for your application will be the best for mine.

We use three benchmarks and test techniques with many different PRNGs.

Large-Shuffle Benchmark


Probably the most obvious thing to benchmark is shuffling. In this benchmark we simulate a large-scale shuffle. To shuffle an array of size N, we must generate numbers in the intervals [0..N), [0..(N-1)), ..., [0..1). In this benchmark we will assume that N is the largest possible value (for uint32_t, that is 2^32 - 1). Code:

  for (uint32_t i = 0xffffffff; i > 0; --i) {
      uint32_t bval = bounded_rand(rng, i);
      assert(bval < i);
      sum += bval;
  }

Note that we "use" each number by adding it to sum (so that the optimizer does not discard it), but we do no actual shuffling, in order to stay focused on number generation.

To test 64-bit generation we have a similar test, but running a test corresponding to shuffling an array of 2^64 - 1 entries would be impractical (this larger benchmark would take many thousands of years to complete). Instead we traverse the entire 64-bit interval but generate the same number of output values as in the 32-bit test. Code:

  for (uint32_t i = 0xffffffff; i > 0; --i) {
      uint64_t bound = (uint64_t(i) << 32) | i;
      uint64_t bval = bounded_rand(rng, bound);
      assert(bval < bound);
      sum += bval;
  }

Mersenne Twister results

The results below show the performance of this benchmark for each of the methods considered, using the Mersenne Twister and testing both the 32-bit code discussed in the article (with std::mt19937 from libstdc++) and analogous 64-bit code (with std::mt19937_64 from libstdc++). Each result is the geometric mean of 15 runs with different seed values, normalized so that the classic remainder method takes unit run time.


It might look as though we have clear answers about performance: we could rank the techniques from best to worst and wonder what the libstdc++ developers were thinking when they wrote such a terrible implementation for 32-bit numbers. But, as is often the case with benchmarking, the situation is more complicated than these results suggest. First, there is a risk that the results are specific to the Mersenne Twister, so we will broaden the set of PRNGs tested. Second, there may be a subtle problem with the benchmark itself. Let's deal with the first question first.

Results for different PRNGs

We will test the 32-bit PRNGs arc4_rand32, chacha8r, gjrand32, jsf32, mt19937, pcg32, pcg32_fast, sfc32, splitmix32, xoroshiro64+, xorshift*64/32, xoshiro128+, and xoshiro128**, and the 64-bit PRNGs gjrand64, jsf64, mcg128, mcg128_fast, mt19937_64, pcg64, pcg64_fast, sfc64, splitmix64, xoroshiro128+, xorshift*128/64, xoshiro256+, and xoshiro256*. These sets give us a few slow PRNGs and a large number of very fast ones.

Here are the results:

We can see key differences from the Mersenne Twister results. Faster PRNGs shift the balance toward the bounding code, so the difference between the approaches becomes more pronounced, especially for 64-bit PRNGs. With this wider set of PRNGs, the libstdc++ implementation stops looking so terrible.

Conclusions

In this benchmark, the biased multiplication approach wins on speed by a significant margin. There are many situations where the bound will be small relative to the size of the PRNG and performance is absolutely critical. In such situations a slight bias is unlikely to have a noticeable effect, but the PRNG speed will. One such example is Quicksort with a random pivot. Of the bias-free methods, the bitmask technique looks promising.

But before drawing serious conclusions, we must point out a huge problem with this benchmark: most of the time is spent on very large bounds, which most likely gives undue weight to large intervals. So we move on to a second benchmark.

Small-Shuffle Benchmark


This benchmark is similar to the previous one, but performs a much smaller "array shuffle" many times over. Code:

  for (uint32_t j = 0; j < 0xffff; ++j) {
      for (uint32_t i = 0xffff; i > 0; --i) {
          uint32_t bval = bounded_rand(rng, i);
          assert(bval < i);
          sum += bval;
      }
  }

Mersenne Twister results


Results for different PRNGs


Conclusions

This benchmark avoids placing too much weight on large bounds and more accurately reflects real-world use cases, but now it discards large bounds entirely.

Benchmark for all intervals


This benchmark aims to avoid the drawbacks of the previous two; it tests at every power-of-two size, so every size is represented but none is given undue weight.

  for (uint32_t bit = 1; bit != 0; bit <<= 1) {
      for (uint32_t i = 0; i < 0x1000000; ++i) {
          uint32_t bound = bit | (i & (bit - 1));
          uint32_t bval = bounded_rand(rng, bound);
          assert(bval < bound);
          sum += bval;
      }
  }

Mersenne Twister results


Results for different PRNGs


Conclusions

Many of our findings remain unchanged: the biased multiplication method is fast if we can live with the bias, and the bitmask scheme looks like a good middle-ground choice.

We could stop here, were we not willing to go back, take a critical look at our code, and change it.

Making improvements


So far, all the debiasing methods have required an extra remainder operation, which is why they run much slower than the biased methods. It would be helpful if we could narrow that gap.

Faster threshold-based rejection


Some of our algorithms use a threshold value, for example:

  uint32_t bounded_rand(rng_t& rng, uint32_t range) {
      // calculates 2**32 % range
      uint32_t t = (-range) % range;
      for (;;) {
          uint32_t r = rng();
          if (r >= t)
              return r % range;
      }
  }

When range is small compared to the PRNG's output interval, the generated number will usually be far above the threshold. So if we first compare against a cheap overestimate of the threshold (range itself is always at least as large as the true threshold), we can usually avoid the costly remainder operation needed to compute the threshold exactly.

This task is handled by the following code:

  uint32_t bounded_rand(rng_t& rng, uint32_t range) {
      uint32_t r = rng();
      if (r < range) {
          uint32_t t = (-range) % range;
          while (r < t)
              r = rng();
      }
      return r % range;
  }

This change can be applied both to the "debiased double remainder" method (see above) and to the "debiased integer multiplication" method. The idea is Lemire's, who applied it to the latter (but not the former).

Large-Shuffle Benchmark Results

This optimization leads to a significant improvement in the 64-bit benchmark results (where mod is even slower), but actually slightly degrades performance in the 32-bit benchmark. Despite the improvement, the bitmask method still wins.


Small-Shuffle Benchmark Results

On the other hand, this change significantly speeds up the small-shuffle benchmark both for the integer multiplication method and for the double remainder method. In both cases, performance moves close to that of the biased variants. The double remainder method (OpenBSD) now performs almost identically to the single remainder method (Java).


Benchmark results for all intervals

We see a similar improvement in the benchmark for all intervals.


It seems we can declare a new overall winner: Lemire's optimized debiased integer multiplication method.

Optimizing the remainder of division


Normally, computing a % b requires division, but when a < b the result is simply a and no division is needed. And when a/2 < b, the result is simply a - b. Consequently, instead of computing

  a %= b;

we can perform

  if (a >= b) {
      a -= b;
      if (a >= b)
          a %= b;
  }

The cost of division is so significant that the extra cost of this more complex code can pay for itself through the time saved by avoiding division.

Large-Shuffle Benchmark Results

Adding this optimization significantly improves the large-shuffle benchmark results, again more noticeably in the 64-bit code, where the remainder operation is more expensive. For the double remainder (OpenBSD-style) method, versions are shown with the optimization applied to just one remainder operation and to both.


In this benchmark, the bitmask is still the winner, but the gap between it and Lemire's approach has narrowed considerably.

Small-Shuffle Benchmark Results

Adding this optimization does not improve the small-shuffle benchmark results, so the question remains whether it adds significant cost. In some cases it does not; in others the cost rises slightly.


Benchmark results for all intervals

In the benchmark for all intervals, the changes are also small.


Bonus: PRNG comparison results


The main reason for using a range of PRNGs to test the interval-generation schemes was to avoid unintentionally distorting the results through the quirks of particular PRNG schemes. But we can use the same internal test results to compare the generation schemes themselves.

PRNGs with 32-bit output

The graph below shows the performance of the different 32-bit generation schemes, averaged over all techniques and fifteen runs, normalized to the performance of the 32-bit Mersenne Twister:


On the one hand, I am glad to see that pcg32_fast is really fast: it is beaten only by a small Xoroshiro variant (which fails statistical tests). But this also shows why I rarely get worked up about the performance of modern high-performance general-purpose PRNGs: the differences between the methods are very small. In particular, the four fastest schemes differ in performance by less than 5%, which I believe is simply "noise".

PRNGs with 64-bit output

The graph shows the performance of the various 64-bit generation schemes, averaged over all techniques and fifteen runs, normalized to the performance of the 32-bit Mersenne Twister. It may seem strange to normalize against the 32-bit Mersenne Twister, but doing so lets us see the extra cost of using 64-bit generation in cases where 32-bit generation would suffice.


These results confirm that mcg128_fast is incredibly fast, but the next four techniques again differ by only about 5%, so it is hard to choose among the fastest methods. pcg64 and pcg64_fast are necessarily slower than mcg128_fast because their underlying generators are a 128-bit linear congruential generator (LCG) and a 128-bit multiplicative congruential generator (MCG). Although they are not the fastest techniques in this set, pcg64 is still 20% faster than the 64-bit Mersenne Twister.

But perhaps more importantly, these results also show that if you do not need 64-bit output, a 64-bit PRNG is usually slower than a 32-bit one.

Conclusions


From our benchmarks we can see that switching from a commonly used PRNG (such as the 32-bit Mersenne Twister) to a faster PRNG reduced benchmark execution time by 45%. But switching from the common method of generating a number in an interval to our fastest method reduced benchmark time by about 66%; in other words, to one third of the original time.

The fastest (bias-free) method is Lemire's (with my additional optimization). Here it is:

  uint32_t bounded_rand(rng_t& rng, uint32_t range) {
      uint32_t x = rng();
      uint64_t m = uint64_t(x) * uint64_t(range);
      uint32_t l = uint32_t(m);
      if (l < range) {
          uint32_t t = -range;
          if (t >= range) {
              t -= range;
              if (t >= range)
                  t %= range;
          }
          while (l < t) {
              x = rng();
              m = uint64_t(x) * uint64_t(range);
              l = uint32_t(m);
          }
      }
      return m >> 32;
  }

Using Lemire's method will improve the performance of most randomized algorithms more than switching from a fast generation engine to a slightly faster one.

Appendix: Testing notes


The code for all the tests is posted on GitHub. In total I tested 23 methods for bounded_rand using 26 different PRNGs (13 32-bit and 13 64-bit), with two compilers (GCC 8 and LLVM 6), giving me 26 * 23 * 2 = 1196 executables, each of which was run with the same 15 seeds, for 1196 * 15 = 17,940 unique test runs, each combining the three benchmarks. I mostly ran the tests on a 48-core machine with four Xeon E7-4830v3 processors at 2.1 GHz. Running the full set of tests required a little less than a month of CPU time.

Finally, let's return to the situation from the introduction. Imagine that Sasha used jsf32.STD-libc++ and Juan used mt19937.BIASED_FP_MULT_SCALE. In benchmark 3, the latter takes 69.6% less time. So the times in this fictional scenario are grounded in real data.
