About bit counting, unsigned types in Kotlin, and when saving on matches is justified




This article was prompted by this comment, or rather by one phrase from it:
... it is not good to spend memory or processor cycles on elements that number in the billions ...
It so happened that lately I have had to do exactly that. And although the case I will consider in this article is quite specific, the conclusions and the solutions applied may be useful to someone.

A bit of context


The iFunny application deals with a huge amount of image and video content, and fuzzy duplicate search is one of its very important tasks. That is a big topic in itself, deserving a separate article, but today I will just talk a little about some approaches to crunching very large arrays of numbers as applied to this search. Of course, everyone has their own idea of what "very large arrays" means, and it would be silly to compete with the Hadron Collider, but still. :)

Very briefly, the algorithm is as follows: for each image a digital signature of 968 integers is built, and the comparison is done by computing the "distance" between two signatures. Given that the content volume in the last two months alone came to about 10 million images, an attentive reader will easily estimate in their head that these are exactly those "elements that number in the billions". Those who are interested, welcome under the cut.

First there will be a somewhat dull story about saving time and memory, and at the end a small instructive tale about how looking into the sources is sometimes very useful. The headline picture is directly related to that instructive tale.

I have to admit I was being slightly sly. A preliminary analysis of the algorithm made it possible to reduce the number of values in each signature from 968 to 420. Already twice as good, but the order of magnitude stayed the same.

Although, on reflection, I am not being that sly after all, and this will be the first takeaway: before diving into a task, it is worth asking whether it can be simplified up front.

The comparison algorithm is quite simple: take the square root of the sum of squared differences of the two signatures, divide it by the sum of their precomputed norms (the norms are calculated once per signature, so the denominator is a constant for each pair and does not need to be summed on every comparison), and compare the result with a threshold. Note that the signature elements are limited to values from -2 to +2, so the absolute value of a difference ranges from 0 to 4.
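Written out as a formula (in my own notation, reconstructed from the description above and the code below, where norma is the precomputed per-signature value), the comparison looks roughly like this:

 \frac{\sqrt{\sum_i (a_i - b_i)^2}}{norma_a + norma_b} < d, \qquad d = 0.3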


Nothing complicated, but the sheer quantity is what matters.

The first approach: naive


 // Threshold
 const val d = 0.3

 // 10,000,000 objects. Let's keep this number in mind
 // so as not to repeat it in every other sentence
 val collection: MutableList<Signature> = mutableListOf()

 // signature is an array of 420 values of type Byte
 class Signature(val signature: Array<Byte>, val norma: Double)

 fun getSimilar(signature: Signature) = collection
     .filter { calculateDistance(it, signature) < d }

 fun calculateDistance(first: Signature, second: Signature): Double =
     Math.sqrt(first.signature.mapIndexed { index, value ->
         Math.pow((value - second.signature[index]).toDouble(), 2.0)
     }.sum()) / (first.norma + second.norma)


Let's count the number of operations we have here (remember, the collection holds 10M signatures of 420 elements each):

 10M square roots (Math.sqrt)
 10M additions and divisions (/ (first.norma + second.norma))
 4,200M subtractions and squarings (Math.pow((value - second.signature[index]).toDouble(), 2.0), i.e. 420 x 10M)
 4,200M additions (.sum())

What options do we have?

  1. At these volumes, spending a whole Byte (or, God forbid, an Int, if someone were to think of that) to store three significant bits is an unforgivable waste.
  2. Maybe the amount of math can somehow be reduced?
  3. Is some kind of pre-filtering possible that is cheaper to compute?

The second approach: packed


By the way, if anyone can suggest how to simplify the calculations with this kind of packing, they will get many thanks and a plus in karma. Just one, but from the bottom of my heart. :)

One Long is 64 bits, which in our case allows storing 21 values in it (with 1 bit left over and unused).
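The article does not show the packing step itself, so here is a minimal sketch of my own of how 21 three-bit fields could be packed, assuming the values are stored with a +2 offset (any common offset works, since the code below only ever subtracts one unpacked field from another); the helper name is hypothetical:

 // Hypothetical helper (not from the article): packs up to 21 values
 // in the range -2..2 into one Long, 3 bits per value, offset by +2
 // so that each stored field is a non-negative 0..4.
 fun packChunk(values: List<Byte>): Long {
     require(values.size <= 21) { "at most 21 three-bit fields fit into a Long" }
     var packed = 0L
     values.forEachIndexed { index, value ->
         packed = packed or ((value + 2).toLong() shl (index * 3))
     }
     return packed
 }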

 // an array of 20 Long values
 class Signature(val signature: Array<Long>, val norma: Double)

 fun calculateDistance(first: Signature, second: Signature): Double =
     Math.sqrt(first.signature.mapIndexed { index, value ->
         calculatePartial(value, second.signature[index])
     }.sum()) / (first.norma + second.norma)

 fun calculatePartial(first: Long, second: Long): Double {
     var sum = 0L
     (0..60 step 3).forEach {
         val current = (first.ushr(it) and 0x7) - (second.ushr(it) and 0x7)
         sum += current * current
     }
     return sum.toDouble()
 }

  

Memory-wise things got better (roughly 1,600M bytes instead of 4,200M: 20 Longs of 8 bytes versus 420 single bytes, times 10M), but what about the calculations?

I think it is obvious that the number of operations stayed the same, while 8,400M shifts and 8,400M logical ANDs were added. Maybe this could be optimized somehow, but the trend is not encouraging; this is not what we wanted.

The third approach: repacked, with a twist


My gut tells me bit magic can be used here!

Let's store the values in four bits instead of three, like this:

 -2 -> 0b1100
 -1 -> 0b0100
  0 -> 0b0000
  1 -> 0b0010
  2 -> 0b0011

Yes, we lose some packing density compared with the previous option, but in return a single XOR gives us a Long holding the (not quite) differences of 16 elements at once. Moreover, only 11 bit patterns are possible in each resulting nibble, which lets us use precomputed squares of the differences.
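As a sanity check, here is a sketch of my own (not from the article) that derives the precomputed table below directly from the encoding: XOR the codes of every pair of values and record the squared difference.

 // Sketch: table of squared differences indexed by the XOR of two
 // 4-bit codes; -1 marks nibble values that no pair can produce.
 val codes = mapOf(-2 to 0b1100, -1 to 0b0100, 0 to 0b0000, 1 to 0b0010, 2 to 0b0011)

 fun buildPrecomputed(): IntArray {
     val table = IntArray(16) { -1 }
     for ((a, codeA) in codes) {
         for ((b, codeB) in codes) {
             table[codeA xor codeB] = (a - b) * (a - b)
         }
     }
     return table   // matches arrayOf(0, 1, 1, 4, 1, -1, 4, 9, 1, -1, -1, -1, 4, -1, 9, 16)
 }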

 // an array of 27 Long values
 class Signature(val signature: Array<Long>, val norma: Double)

 // -1 for impossible nibble values
 val precomputed = arrayOf(0, 1, 1, 4, 1, -1, 4, 9, 1, -1, -1, -1, 4, -1, 9, 16)

 fun calculatePartial(first: Long, second: Long): Double {
     var sum = 0L
     val difference = first xor second
     (0..60 step 4).forEach {
         sum += precomputed[(difference.ushr(it) and 0xF).toInt()]
     }
     return sum.toDouble()
 }

  

Memory-wise we are now at 2,160M bytes (27 Longs x 8 bytes x 10M) versus 1,600M, which is a pity, but still better than the original 4,200M.

Let's count the operations:

 10M square roots, additions and divisions (these have not gone anywhere)
 270M XORs (27 Longs x 10M)
 4,320M additions, shifts, logical ANDs and array lookups (27 x 16 x 10M)

This looks more interesting, but there are still too many calculations. Unfortunately, it seems the 20% of effort that gives 80% of the result has already been spent here, and it is time to think about where else we can gain. The first thing that comes to mind is not to let obviously unsuitable signatures reach the calculation at all, filtering them out with some kind of lightweight filter.

The fourth approach: a coarse sieve


If we slightly transform the calculation formula, we get the following inequality (the smaller the calculated distance, the better):

 \sum_i x_i^2 < \bigl(d \cdot (norma_1 + norma_2)\bigr)^2

where x_i is the difference of the i-th signature elements.
That is, we now need to work out how, knowing only the number of bits set in each Long, to compute the minimum possible value of the left-hand side of the inequality, and then simply discard all signatures that cannot satisfy it.
Let me remind you that x_i can take values from 0 to 4 (the sign is not important and, I think, it is clear why). Given that each term is squared, it is easy to derive a general pattern:

 x_i^2 = k_i^2, where k_i is the number of bits set in the corresponding nibble of the XOR
The final formula looks like this (we will not actually need it, but it took me quite a while to derive and it would be a shame to just forget it and never show it to anyone):

 E(B) = 16 \cdot \lfloor B/16 \rfloor^2 + (2 \cdot \lfloor B/16 \rfloor + 1) \cdot (B \bmod 16)

where B is the number of set bits and E(B) is the minimum possible value of the sum.

In fact, a Long has only 64 bits, and hence only 65 possible bit counts (from 0 to 64). These are easy to calculate in advance and put into an array, by analogy with the previous section.
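For illustration, here is a sketch of my own (not from the article) that generates such a table from the reasoning above: with B set bits, the sum of squares is smallest when the bits are spread as evenly as possible over the 16 nibbles, each nibble with k bits contributing k^2.

 // Sketch: minimum possible sum of squared differences for a Long
 // whose XOR has `bits` set bits, assuming the "spread the bits
 // evenly over 16 nibbles" argument above.
 fun buildEstimateValues(): IntArray =
     IntArray(65) { bits ->
         val q = bits / 16          // every nibble gets at least q bits
         val r = bits % 16          // r nibbles get one extra bit
         (16 - r) * q * q + r * (q + 1) * (q + 1)
     }   // reproduces the estimateValues array used below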

Moreover, it is not necessary to process all 27 Longs: as soon as the threshold is exceeded on one of them, the check can be cut short and false returned. The same approach, by the way, can be applied to the main calculation as well (see the sketch after the code below).

 fun getSimilar(signature: Signature) = collection
     .asSequence()   // why would we need an extra intermediate collection!?
     .filter { estimate(it, signature) }
     .filter { calculateDistance(it, signature) < d }

 val estimateValues = arrayOf(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
     19, 22, 25, 28, 31, 34, 37, 40, 43, 46, 49, 52, 55, 58, 61, 64,
     69, 74, 79, 84, 89, 94, 99, 104, 109, 114, 119, 124, 129, 134, 139, 144,
     151, 158, 165, 172, 179, 186, 193, 200, 207, 214, 221, 228, 235, 242, 249, 256)

 fun estimate(first: Signature, second: Signature): Boolean {
     var bitThreshold = Math.pow(d * (first.norma + second.norma), 2.0).toLong()
     first.signature.forEachIndexed { index, value ->
         bitThreshold -= estimateValues[java.lang.Long.bitCount(value xor second.signature[index])]
         if (bitThreshold <= 0) return false
     }

     return true
 }
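To illustrate the remark above, here is a minimal sketch of my own (not from the article) of the main calculation with the same kind of cutoff; it compares against the squared threshold and reuses calculatePartial from the third approach:

 // Sketch: full comparison with an early exit. Returns true if the
 // distance is below d; bails out as soon as the partial sum already
 // guarantees the threshold cannot be met.
 fun isSimilar(first: Signature, second: Signature): Boolean {
     val threshold = Math.pow(d * (first.norma + second.norma), 2.0)
     var sum = 0.0
     first.signature.forEachIndexed { index, value ->
         sum += calculatePartial(value, second.signature[index])
         if (sum >= threshold) return false
     }
     return true
 }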
  

It must be understood here that the effectiveness of this filter (which can even turn negative) depends strongly on the chosen threshold and, somewhat less strongly, on the input data. Fortunately, for the threshold we need, d = 0.3, relatively few objects make it through the filter, and the contribution of their exact calculation to the total response time is so small that it can be neglected. And if so, we can save a little more.

The fifth approach: getting rid of the Sequence


When working with large collections, a Sequence is a good defence against the extremely unpleasant OutOfMemoryError. However, if we know that the first filter shrinks the collection to a sane size, plain iteration with an intermediate collection is a much more sensible choice, because a Sequence is not just fashionable and trendy: it is also an iterator, with hasNext and friends that are not free at all.

 fun getSimilar(signature: Signature) = collection
     .filter { estimate(it, signature) }
     .filter { calculateDistance(it, signature) < d }
  

It would seem that this is happiness, but I wanted to "make it beautiful". And here we come to the promised instructive story.

The sixth approach: supposedly the best


We write in Kotlin, and yet here is some alien java.lang.Long.bitCount! Besides, unsigned types were recently added to the language. Charge!

No sooner said than done. Every Long is replaced with ULong, java.lang.Long.bitCount is ripped out of the Java sources by the roots and rewritten as an extension function on ULong:

 fun ULong.bitCount(): Int {
     var i = this
     i = i - (i.shr(1) and 0x5555555555555555uL)
     i = (i and 0x3333333333333333uL) + (i.shr(2) and 0x3333333333333333uL)
     i = (i + i.shr(4)) and 0x0f0f0f0f0f0f0f0fuL
     i = i + i.shr(8)
     i = i + i.shr(16)
     i = i + i.shr(32)
     return i.toInt() and 0x7f
 }
  

We run it and... something is wrong. The code got noticeably slower. We fire up the profiler and see something strange (see the headline illustration of the article): for slightly less than a million bitCount() calls there are almost 16 million Kotlin.ULong.constructor-impl calls. WAT!?

The riddle has a simple explanation: just look at the code of the ULong class.

 public inline class ULong @PublishedApi internal constructor(@PublishedApi internal val data: Long) : Comparable<ULong> {
     @kotlin.internal.InlineOnly
     public inline operator fun plus(other: ULong): ULong = ULong(this.data.plus(other.data))
     @kotlin.internal.InlineOnly
     public inline operator fun minus(other: ULong): ULong = ULong(this.data.minus(other.data))
     @kotlin.internal.InlineOnly
     public inline infix fun shl(bitCount: Int): ULong = ULong(data shl bitCount)
     @kotlin.internal.InlineOnly
     public inline operator fun inc(): ULong = ULong(data.inc())
     // and so on
 }
  

No, I understand, ULong is experimental for now, but how can that be!?
In general, we have to declare this approach a failure, which is a pity.

Well then, can anything else be improved?

Actually, yes. The original java.lang.Long.bitCount is not the most optimal implementation. It works well in the general case, but if we know in advance which processors the application will run on, we can pick a better method: there is a very good article about this on Habr, "Carefully counting single bits", which I highly recommend reading.

I used the "combined method":

 fun Long.bitCount(): Int {
     var n = this
     n -= (n.shr(1)) and 0x5555555555555555L
     n = ((n.shr(2)) and 0x3333333333333333L) + (n and 0x3333333333333333L)
     n = ((((n.shr(4)) + n) and 0x0F0F0F0F0F0F0F0FL) * 0x0101010101010101L).shr(56)
     return n.toInt() and 0x7F
 }
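As a quick sanity check (my own sketch, not part of the article), the extension can be compared against java.lang.Long.bitCount on a few values:

 // Sketch: verify the hand-rolled bitCount() against the JDK implementation.
 fun main() {
     val samples = longArrayOf(0L, 1L, -1L, 0x5555555555555555L, 0x0123456789ABCDEFL, Long.MIN_VALUE)
     for (n in samples) {
         check(n.bitCount() == java.lang.Long.bitCount(n)) { "mismatch for $n" }
     }
     println("bitCount() agrees with java.lang.Long.bitCount on the samples")
 }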
  

Counting the parrots


All measurements were made haphazardly on a local machine in the course of development and are reproduced from memory, so it is hard to claim any accuracy, but the approximate contribution of each approach can be estimated.
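Roughly speaking, the kind of measurement meant here is nothing more than timing a single getSimilar() call; a minimal sketch of my own (not the author's actual harness, and assuming the collection is already populated):

 // Rough timing sketch: measures one getSimilar() call in wall-clock seconds.
 fun main() {
     val probe = collection.first()
     val start = System.nanoTime()
     val similar = getSimilar(probe)
     val seconds = (System.nanoTime() - start) / 1e9
     println("found ${similar.size} similar items in $seconds s")
 }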

 What was done                                        Parrots (seconds)
 The first approach, naive                            25±
 The second approach, packed                          -
 The third approach, repacked with a twist            11-14
 The fourth approach, a coarse sieve                  2-3
 The fifth approach, getting rid of the Sequence      1.8-2.2
 The sixth approach, supposedly the best              3-4
 The "combined method" of counting set bits           1.5-1.7

Conclusions


  • When you have to process a large amount of data, it is worth spending time on preliminary analysis. Perhaps not all of it needs to be processed.
  • If you can apply coarse but cheap pre-filtering, it can help a lot.
  • Bit magic is our everything. Where it is applicable, of course.
  • Looking into the source code of standard classes and functions is sometimes very useful.

Thanks for your attention! :)

And yes, to be continued.
