I recently had to solve the problem of optimizing the search for duplicate images.
The existing solution was built on a fairly well-known Python library, Image Match, based on the paper "An Image Signature for Any Kind of Image" by H. Chi Wong, Marshall Bern and David Goldberg.
For several reasons, it was decided to rewrite everything in Kotlin, and at the same time to drop storing and searching in ElasticSearch, which requires far more resources, both hardware and human, for support and administration, in favor of searching a local in-memory cache.
To understand how it all works, I had to dig into the "reference" Python code, since the original paper is not always obvious and in a couple of places brings to mind the "how to draw an owl" meme. Actually, I want to share the results of this study, along with some optimizations both in data size and in search speed. Maybe it will be useful to someone.
Disclaimer: there is a chance that I messed up somewhere, did something wrong or sub-optimally. Well then, I will be glad to hear any criticism and comments. :)
How it works:
- The image is converted to 8-bit grayscale (one pixel is one byte with a value of 0-255).
- The image is cropped so that 5% of the accumulated difference (more on this below) is discarded from each of the four sides. This removes, for example, a black frame around the image.
- A working grid is selected (9x9 by default), whose nodes will serve as the reference points of the image signature.
- For each reference point, its brightness is computed based on a certain area around it.
- For each reference point, it is computed how much brighter or darker it is than each of its eight neighbors. Five gradations are used, from -2 (much darker) to 2 (much brighter).
- All this "joy" is flattened into a one-dimensional array (vector), which is called the image signature.
The similarity of two images is checked via the normalized distance between their signature vectors:

d = ||s1 - s2|| / (||s1|| + ||s2||)

The smaller d is, the better. In practice, d < 0.3 is almost a guarantee of identity.
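In Kotlin this check can be sketched roughly like this (the names are mine; signatures are plain IntArrays here):

```kotlin
import kotlin.math.sqrt

// Euclidean norm of a signature vector.
fun norm(v: IntArray): Double = sqrt(v.sumOf { x -> (x * x).toDouble() })

// Normalized distance between two signatures:
// d = ||a - b|| / (||a|| + ||b||).
fun distance(a: IntArray, b: IntArray): Double {
    require(a.size == b.size) { "signatures must have the same length" }
    val diff = IntArray(a.size) { i -> a[i] - b[i] }
    val denom = norm(a) + norm(b)
    return if (denom == 0.0) 0.0 else norm(diff) / denom
}
```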
Now in more detail.
I don't think there is much point in describing the grayscale conversion itself.
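Still, for completeness, here is a minimal sketch of such a conversion via BufferedImage. The Rec. 601 luminance weights are my assumption; any reasonable grayscale conversion fits the algorithm:

```kotlin
import java.awt.image.BufferedImage

// Convert an image to an 8-bit grayscale matrix: one value per pixel, 0..255.
fun toGrayscale(img: BufferedImage): Array<IntArray> =
    Array(img.height) { y ->
        IntArray(img.width) { x ->
            val rgb = img.getRGB(x, y)
            val r = (rgb shr 16) and 0xFF
            val g = (rgb shr 8) and 0xFF
            val b = rgb and 0xFF
            Math.round(0.299 * r + 0.587 * g + 0.114 * b).toInt()
        }
    }
```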
Let's say we have this picture:
You can see a black frame 3 pixels wide on the left and right and 2 pixels on the top and bottom, which we do not need for the comparison.
The cropping boundary is determined by the following algorithm:
- For each column, we calculate the sum of the absolute differences of vertically adjacent elements.
- Trim from the left and from the right the number of pixels whose contribution to the total accumulated difference is less than 5%.
We do the same for rows.
An important clarification: if the size of the resulting image is insufficient for step 4, we do not crop at all!
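The steps above can be sketched in Kotlin like this (function names are mine; for rows you accumulate differences along the other axis and pass the per-row sums to the same function):

```kotlin
import kotlin.math.abs

// Accumulated difference per column: sum of |gray[y][x] - gray[y+1][x]|.
fun columnDiffs(gray: Array<IntArray>): DoubleArray {
    val out = DoubleArray(gray[0].size)
    for (y in 0 until gray.size - 1)
        for (x in out.indices)
            out[x] += abs(gray[y][x] - gray[y + 1][x]).toDouble()
    return out
}

// Given per-column (or per-row) accumulated differences, return the index
// range that survives cropping: trim from each side while the cumulative
// contribution stays below 5% of the total.
fun cropRange(diffs: DoubleArray, threshold: Double = 0.05): IntRange {
    val total = diffs.sum()
    if (total == 0.0) return 0..diffs.size - 1
    var lo = 0
    var acc = 0.0
    while (lo < diffs.size && (acc + diffs[lo]) / total < threshold) {
        acc += diffs[lo]; lo++
    }
    var hi = diffs.size - 1
    acc = 0.0
    while (hi > lo && (acc + diffs[hi]) / total < threshold) {
        acc += diffs[hi]; hi--
    }
    return lo..hi
}
```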
Here everything is quite simple: we divide the image into equal parts and select the nodal points.
The brightness of a reference point is computed as the average brightness of all pixels in a square area around it. For the side of this square, the original paper uses the formula P = max(2, floor(0.5 + min(n, m) / 20)), where n x m is the size of the cropped image.
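A sketch of both pieces in Kotlin; clamping the square to the image borders is my own handling of nodes near the edges:

```kotlin
import kotlin.math.max
import kotlin.math.min

// Side of the averaging square: P = max(2, floor(0.5 + min(n, m) / 20)).
fun squareSide(n: Int, m: Int): Int =
    max(2, Math.floor(0.5 + min(n, m) / 20.0).toInt())

// Brightness of a grid node at (cy, cx): the mean over a P x P
// neighborhood, clamped to the image borders.
fun nodeBrightness(gray: Array<IntArray>, cy: Int, cx: Int, p: Int): Int {
    val half = p / 2
    var sum = 0L
    var count = 0
    for (y in max(0, cy - half)..min(gray.size - 1, cy + half))
        for (x in max(0, cx - half)..min(gray[0].size - 1, cx + half)) {
            sum += gray[y][x]
            count++
        }
    return (sum / count).toInt()
}
```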
For each reference point, the difference between its brightness and the brightness of each of its eight neighbors is calculated. For points that lack some neighbors (the top and bottom rows, the left and right columns), the missing differences are taken as 0.
These differences are then mapped to five gradations:
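A minimal illustration in Kotlin. Note that the cutoffs here (an "equal" band of +-2 and a darker/lighter split at 50) are purely illustrative; the actual darker/lighter split in the paper and in Image Match is derived from the distribution of the differences:

```kotlin
// Map a raw brightness difference to one of five gradations, -2..2.
// Cutoffs are illustrative, not the reference implementation's values.
fun gradation(diff: Int): Int = when {
    diff <= -50 -> -2
    diff <= -3 -> -1
    diff < 3 -> 0
    diff < 50 -> 1
    else -> 2
}
```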
The result is roughly the following matrix:
I think it needs no explanation.
And now, about the optimizations
The original paper rightly notes that the zero values along the edges of the matrix can safely be thrown out of the resulting signature, since they are identical for all images.
However, if you look closely, you can see that for any two neighbors, their mutual gradations are equal in magnitude and opposite in sign. So in fact you can safely drop four duplicate values for each reference point, halving the signature size (not counting the edge zeros).
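As an illustration, one possible choice of the four neighbor offsets to keep (the concrete choice is mine; any set that covers each neighbor pair exactly once works):

```kotlin
// diff(a -> b) == -diff(b -> a), so storing the differences to just four
// of the eight neighbours covers every neighbour pair exactly once.
// One possible choice of the four (dy, dx) offsets to keep:
val keptOffsets = listOf(
    0 to 1,  // right
    1 to -1, // down-left
    1 to 0,  // down
    1 to 1   // down-right
)

// Halved signature length for a square grid, e.g. 9 * 9 * 4 = 324 values
// (edge zeros not yet removed).
fun signatureLength(gridSize: Int): Int = gridSize * gridSize * keptOffsets.size
```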
Obviously, when computing the sum of squares, for every x there is necessarily an x' equal to it in absolute value, so the norm can be written like this (without going into re-indexing):

||s|| = sqrt(x1² + x1'² + x2² + x2'² + ...) = sqrt(2 * (x1² + x2² + ...)) = sqrt(2) * sqrt(x1² + x2² + ...)

This holds for the numerator and for both terms of the denominator, so the factor sqrt(2) cancels out and the distance can be computed over the halved signatures directly.
Further, the original paper notes that three bits are enough to store each gradation; that way, for example, 21 gradations fit into an unsigned 64-bit integer. This is quite a big saving in size but, on the other hand, it complicates computing the sum of squared differences of two signatures. I must admit this idea really hooked me and would not let go for a couple of days, until one evening it dawned on me how to have my cake and eat it too: save space and simplify the computation at the same time. Watch my hands.
For storage we use the following scheme, 4 bits per gradation:
Yes, only 16 gradations fit into one unsigned 64-bit integer, versus 21, but with it the array of differences of two signatures is computed with a single xor and 15 shifts to the right, counting the number of set bits at each iteration of the shift. That is:
The sign does not matter to us, since all the values will be squared.
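Since the original encoding table was an image, here is my reconstruction of a scheme with the required property: a 4-bit "thermometer" code, where popcount(code(a) xor code(b)) equals |a - b|, so the per-nibble popcounts of a single xor give the absolute gradation differences directly:

```kotlin
// 4-bit thermometer codes for the five gradations:
// popcount(code(a) xor code(b)) == |a - b|.
fun code(g: Int): Long = when (g) {
    -2 -> 0b0000L
    -1 -> 0b0001L
    0 -> 0b0011L
    1 -> 0b0111L
    else -> 0b1111L
}

// Pack up to 16 gradations into one Long, 4 bits each.
fun pack(grads: IntArray): Long {
    require(grads.size <= 16)
    var word = 0L
    for (g in grads) word = (word shl 4) or code(g)
    return word
}

// Sum of squared gradation differences for two packed words:
// xor the words, then walk the 16 nibbles, shifting right by 4 each time;
// the popcount of a nibble is |a_i - b_i|, which we square and accumulate.
fun squaredDiffSum(a: Long, b: Long): Int {
    var x = a xor b
    var sum = 0
    repeat(16) {
        val k = java.lang.Long.bitCount(x and 0xFL)
        sum += k * k
        x = x ushr 4
    }
    return sum
}
```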
In addition, if we fix the threshold distance in advance, beyond which values are of no interest to us, then, after dancing a little around the calculation formula, we can derive a rather lightweight filtering condition based on a simple count of set bits.
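One possible form of such a filter (my own sketch; the exact derivation is left out here): with the 4-bit packing, every squared nibble difference k*k is at least k, so the plain popcount of the xor is a lower bound on the sum of squares. If even that bound reaches t² * (||a|| + ||b||)², the pair cannot pass the threshold t and the exact computation can be skipped:

```kotlin
// Cheap rejection test for a fixed distance threshold.
// squaredLimit = t^2 * (||a|| + ||b||)^2, precomputed per pair from
// the stored norms. The total popcount of the xor lower-bounds the
// sum of squared gradation differences.
fun cannotBeSimilar(a: LongArray, b: LongArray, squaredLimit: Double): Boolean {
    var bits = 0
    for (i in a.indices) bits += java.lang.Long.bitCount(a[i] xor b[i])
    return bits >= squaredLimit
}
```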
You can read more about optimizing this algorithm, with code examples, in a previous article. I especially recommend the comments to it: Habr user masyaman suggested some rather interesting calculation approaches, including packing the gradations into three bits using bit magic.
That's all, actually. Thanks for your attention. :)