Bitmap indexes in Go: searching at wild speed


A video recording of the conference talk is available in English; below is the text transcript.
We will look at how a bitmap index works, when it is better or worse than other indexes, and in which cases it is dramatically faster; we will see which popular DBMSs already have bitmap indexes; and we will try to write our own in Go. And, for dessert, we will use ready-made libraries to build our own super-fast specialized database.

I really hope this material will be useful and interesting for you. Let's go!

Introduction



http://bit.ly/bitmapindexes
https://github.com/mkevac/gopherconrussia2019

Hello, everyone! It's six in the evening and we're all doing great. A wonderful time to talk about the boring theory of database indexes, right? Don't worry, there will be a couple of lines of source code here and there. :-)

Jokes aside, the talk is packed with information and we don't have much time, so let's get started.

Today I will talk about the following:

  • what indexes are;
  • what a bitmap index is;
  • where it is used, where it is NOT used, and why;
  • a simple implementation in Go and a bit of a fight with the compiler;
  • a slightly less simple but much more performant implementation in Go assembler;
  • the "problems" of bitmap indexes;
  • existing implementations.


So what are indexes?




An index is a separate data structure that we maintain and update in addition to the main data. It is used to speed up search. Without indexes, a search would require a full pass over the data (a process called a full scan), and that process has linear algorithmic complexity. But databases usually hold huge amounts of data, and linear complexity is too slow. Ideally, we want logarithmic or constant complexity.

This is a huge and complex topic, full of subtleties and trade-offs, but after decades of development and research into various databases I can say that there are only a few widely used approaches to building database indexes.



The first approach is to hierarchically narrow the search space by dividing it into smaller parts.

We usually do this with various kinds of trees. An example would be a large box of materials in your closet that contains smaller boxes of materials sorted by topic. If you need materials, you will probably look for them in the box labelled "Materials", not in the one that says "Cookies", right?



The second approach is to jump straight to the desired element or group of elements. We do this with hash maps or inverted indexes. Using a hash map is very similar to the previous example, except that instead of a box of boxes your closet holds a bunch of small boxes with the final items.



The third approach is to get rid of the need to search at all. We do this with Bloom filters or cuckoo filters. They give an answer instantly and save you from having to search.



The last approach is to make full use of all the power that modern hardware gives us. That is what we do with bitmap indexes. Yes, when using them we sometimes have to scan the whole index, but we do it extremely efficiently.

As I said, the topic of database indexes is vast and full of trade-offs. That means we can sometimes use several approaches at once: when we need to speed the search up even further, or when we have to cover all possible types of queries.

Today I will talk about the least known of these approaches: bitmap indexes.

Who am I to talk about this topic?




I work as a team lead at Badoo (perhaps you know our other product better: Bumble). We have more than 400 million users worldwide and many features dedicated to finding the best match for them. We do this with the help of custom services, which include bitmap indexes.

So what is a bitmap index?



Bitmap indexes, as the name suggests, use bitmaps, or bitsets, to implement a search index. From a bird's-eye view, such an index consists of one or more bitmaps that represent entities (such as people) and their properties or parameters (age, eye color, etc.), and of an algorithm that uses bit operations (AND, OR, NOT) to answer a search query.

We are told that bitmap indexes are best suited for, and most effective in, cases where a search combines queries over many columns with low cardinality (think "eye color" or "marital status" versus something like "distance from the city center"). But later I will show that they also work fine for columns with high cardinality.

Consider the simplest example of a bitmap index.

Imagine that we have a list of Moscow restaurants with binary properties like these:

  • near a metro station;
  • has private parking;
  • has a terrace;
  • accepts reservations;
  • vegan friendly;
  • expensive.


Let's give each restaurant a sequential number starting from 0 and allocate memory for 6 bitmaps (one per characteristic). Then we fill these bitmaps in, depending on whether a restaurant has the given property or not. If restaurant 4 has a terrace, bit 4 in the "has terrace" bitmap is set to 1 (and if it does not, to 0).
Now we have the simplest bitmap index possible, and we can use it to answer queries like:

  • "Show me restaurants suitable for vegetarians";
  • "Show me cheap restaurants with a veranda where you can book a table."




How? Let's take a look. The first query is very simple. All we need to do is take the "vegan friendly" bitmap and turn it into the list of restaurants whose bits are set.


The second query is a bit more complicated. We need to apply the bitwise NOT operation to the "expensive" bitmap to get a list of inexpensive restaurants, then AND it with the "accepts reservations" bitmap, and then AND the result with the "has terrace" bitmap. The resulting bitmap contains the list of establishments that satisfy all our criteria. In this example, that is only the restaurant "Youth".


There's a lot of theory here, but don't worry, we'll see the code very soon.

Where are bitmap indexes used?



If you google bitmap indexes, 90% of the results will be connected with Oracle DB in one way or another. But surely the other DBMSs support such a cool thing too, right? Not really.

Let's go through the list of the usual suspects.

MySQL does not support bitmap indexes yet, but there is a proposal suggesting that this option be added (https://dev.mysql.com/worklog/task/?id=1524).

PostgreSQL does not support bitmap indexes either, but it uses simple bitmaps and bit operations to combine search results from several other indexes.

Tarantool has bitset indexes and supports simple searches over them.

Redis has simple bit fields (https://redis.io/commands/bitfield), but without the ability to search over them.

MongoDB does not support bitmap indexes yet either, but there is also a proposal suggesting that this option be added (https://jira.mongodb.org/browse/SERVER-1723).

Elasticsearch uses bitmaps internally (https://www.elastic.co/blog/frame-of-reference-and-roaring-bitmaps).




But a new neighbor has appeared on the block: Pilosa. It is a new non-relational database written in Go that consists of nothing but bitmap indexes and bases everything on them. We will talk about it a little later.


Go implementation


But why are bitmap indexes used so rarely? Before answering this question, I would like to show you the implementation of a very simple bitmap index in Go.

Bitmaps are essentially just chunks of memory. In Go, we will use byte slices.

We have one bitmap per restaurant characteristic, and every bit in a bitmap tells us whether a particular restaurant has that property or not.

We need two auxiliary functions. One fills our bitmaps with random data: random, but with a certain probability of a restaurant having each property. For example, I believe there are very few restaurants in Moscow where a table cannot be booked, and it seems to me that about 20% of establishments are suitable for vegetarians.
The second function converts a bitmap into a list of restaurants.
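
A minimal sketch of these helpers (the names, sizes and probabilities here are illustrative, not the exact code from the talk, which lives in the repository linked at the beginning):

    package bitmap

    import "math/rand"

    // newRandomBitmap returns a bitmap for n restaurants in which each
    // restaurant has the property with probability prob.
    func newRandomBitmap(n int, prob float64) []byte {
        bm := make([]byte, (n+7)/8)
        for i := 0; i < n; i++ {
            if rand.Float64() < prob {
                bm[i/8] |= 1 << uint(i%8)
            }
        }
        return bm
    }

    // toList converts a bitmap into the list of restaurant numbers whose bits are set.
    func toList(bm []byte) []uint32 {
        var list []uint32
        for i, b := range bm {
            for j := uint(0); j < 8; j++ {
                if b&(1<<j) != 0 {
                    list = append(list, uint32(i*8)+uint32(j))
                }
            }
        }
        return list
    }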


To answer the query “Show me cheap restaurants that have a veranda and where you can book a table”, we will need two bit operations: NOT and AND.

We can simplify our code a bit by using the more complex AND NOT operation.

We have a function for each of these operations. Both iterate over the slices, take the corresponding elements from each, combine them with a bit operation, and put the result into the result slice.
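
Continuing the same sketch, the two functions might look like this (they write into a result slice provided by the caller; Go conveniently has a built-in AND NOT operator, &^):

    // and writes a AND b into result, element by element.
    func and(result, a, b []byte) {
        for i := 0; i < len(a); i++ {
            result[i] = a[i] & b[i]
        }
    }

    // andNot writes a AND NOT b into result, element by element.
    func andNot(result, a, b []byte) {
        for i := 0; i < len(a); i++ {
            result[i] = a[i] &^ b[i]
        }
    }

The query itself is then just two calls: andNot(tmp, reservations, expensive) to get cheap restaurants where you can book a table, and and(result, tmp, terrace) to keep only those with a terrace.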

And now we can use our bitmaps and functions to respond to a search query.

The performance is not that great, even though the functions are very simple and we saved a lot by not allocating and returning a new result slice on every call.

After profiling a bit with pprof, I noticed that the Go compiler missed one very simple, but very important optimization: function inlining.

The thing is, the Go compiler is terribly afraid of loops that iterate over slices, and it categorically refuses to inline functions containing such loops.

But I'm not afraid, and I can fool the compiler by using goto instead of a loop, like in the good old days.
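
A sketch of the trick (whether the compiler actually inlines this depends on its inlining budget and may change between Go versions):

    // andNot is the same operation written with a label and goto instead of a
    // for loop, which lets the compiler inline it.
    func andNot(result, a, b []byte) {
        i := 0
    loop:
        if i < len(a) {
            result[i] = a[i] &^ b[i]
            i++
            goto loop
        }
    }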




And, as you can see, now the compiler is happy to inline our function! As a result, we manage to save about 2 microseconds. Not bad!



The second bottleneck is easy to spot if you look closely at the assembler output. The compiler inserted a bounds check on the slices right inside our hottest loop. The thing is, Go is a safe language, and the compiler is afraid that my three arguments (three slices) might have different lengths. After all, that would open up the theoretical possibility of a buffer overflow.

Let's calm the compiler down by showing it that all the slices are the same size. We can do this with a simple check at the beginning of the function.

Seeing this, the compiler happily skips the check, and we end up saving another 500 nanoseconds.
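
The function with the length check added might look like this (a common bounds-check-elimination hint; whether every check disappears depends on the Go version):

    func andNot(result, a, b []byte) {
        // Prove to the compiler that all three slices are the same length,
        // so it can drop the bounds checks inside the loop.
        if len(a) != len(b) || len(a) != len(result) {
            return
        }
        i := 0
    loop:
        if i < len(a) {
            result[i] = a[i] &^ b[i]
            i++
            goto loop
        }
    }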

Bigger batches


OK, we managed to squeeze some performance out of our simple implementation, but this result is, in fact, much worse than what the current hardware is capable of.

All we do is basic bit operations, and our processors execute them very efficiently. But, unfortunately, we "feed" the processor very small pieces of work: our functions operate one byte at a time. We can very easily tweak the code to work with 8-byte chunks by using uint64 slices.
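
The uint64 version is nearly identical (a sketch):

    // andNot64 processes the bitmaps in 8-byte chunks.
    func andNot64(result, a, b []uint64) {
        if len(a) != len(b) || len(a) != len(result) {
            return
        }
        i := 0
    loop:
        if i < len(a) {
            result[i] = a[i] &^ b[i]
            i++
            goto loop
        }
    }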



As you can see, this small change made the program eight times faster by making the batch eight times bigger. The win is, one could say, linear.


Implementation in assembly language



But this is not the end. Our processors can work with chunks of 16, 32, or even 64 bytes. Such "wide" operations are called SIMD (single instruction, multiple data), and the process of converting code to use them is called vectorization.

Unfortunately, the Go compiler is far from great at vectorization. At the moment, the only way to vectorize Go code is to write these operations out by hand using the Go assembler.



The Go assembler is a strange beast. You probably know that assembly is normally tied tightly to the architecture of the machine you are writing for, but not in Go. The Go assembler is more like an IR (intermediate representation): it is practically platform independent. Rob Pike gave a great talk on this topic a few years ago at GopherCon in Denver.

In addition, Go uses the unusual Plan 9 syntax, which differs from the generally accepted AT&T and Intel formats.

It's safe to say that writing Go assembler by hand is not the most fun activity.

But, fortunately, there are already two high-level tools that help us write Go assembler: PeachPy and avo. Both tools generate Go assembler from higher-level code written in Python and Go, respectively.

These tools simplify things like register allocation and writing loops, and in general they lower the barrier to entering the world of assembly programming in Go.

We will use avo, so our programs will be almost ordinary Go programs.

Here is the simplest example of an avo program. We have a main() function that defines an Add() function whose purpose is to add two numbers. There are helper functions for getting parameters by name and for obtaining a free, suitable processor register. Each processor instruction has a corresponding function in avo, as you can see with ADDQ. Finally, there is a helper function for storing the resulting value.
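
The program looks roughly like this (it closely follows avo's own introductory example):

    // +build ignore

    package main

    import . "github.com/mmcloughlin/avo/build"

    func main() {
        TEXT("Add", NOSPLIT, "func(x, y uint64) uint64")
        Doc("Add adds x and y.")
        x := Load(Param("x"), GP64()) // load x into a free 64-bit register
        y := Load(Param("y"), GP64()) // load y into another free register
        ADDQ(x, y)                    // y += x
        Store(y, ReturnIndex(0))      // store the sum into the return value
        RET()
        Generate()
    }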

By calling go generate, we run the avo program, and two files are generated as a result:

  • add.s with the resulting Go assembler code;
  • stub.go with the function headers that connect the two worlds: Go and assembler.


Now that we have seen how and what makes avo, let's look at our functions. I implemented both scalar and vector (SIMD) versions of functions.

First look at the scalar versions.

As in the previous example, we ask for a free, suitable general-purpose register; we do not have to calculate offsets and sizes for the arguments ourselves. avo does all of that for us.

Earlier we used labels and goto (i.e. jumps) to improve performance and to trick the Go compiler; now we do it from the start. The thing is, loops are a higher-level concept. In assembly, all we have is labels and jumps.

The rest of the code should be familiar and understandable. We emulate a loop with labels and jumps, take a small piece of data from each of our two slices, combine them with a bit operation (AND NOT in this case) and put the result into the result slice. That's it.
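
A sketch of such a scalar AND NOT in avo (working 8 bytes at a time and ignoring the tail for brevity; the real code from the talk is in the repository linked at the beginning):

    // +build ignore

    package main

    import (
        . "github.com/mmcloughlin/avo/build"
        . "github.com/mmcloughlin/avo/operand"
    )

    func main() {
        TEXT("AndNot", NOSPLIT, "func(a, b, result []uint64)")
        aPtr := Load(Param("a").Base(), GP64())
        bPtr := Load(Param("b").Base(), GP64())
        rPtr := Load(Param("result").Base(), GP64())
        n := Load(Param("a").Len(), GP64())

        Label("loop")
        CMPQ(n, Imm(0)) // while n != 0
        JE(LabelRef("done"))

        x := GP64()
        MOVQ(Mem{Base: bPtr}, x) // x = *b
        NOTQ(x)                  // x = ^x
        ANDQ(Mem{Base: aPtr}, x) // x = *a AND NOT *b
        MOVQ(x, Mem{Base: rPtr}) // *result = x

        ADDQ(Imm(8), aPtr) // advance all three pointers by one uint64
        ADDQ(Imm(8), bPtr)
        ADDQ(Imm(8), rPtr)
        DECQ(n)
        JMP(LabelRef("loop"))

        Label("done")
        RET()
        Generate()
    }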

This is what the final assembly code looks like. We did not need to calculate the offsets and sizes (highlighted in green) or keep track of the registers used (highlighted in red).

If we compare the performance of the assembler implementation with that of the best Go implementation, we will see that they are the same. And that is expected: we didn't do anything special, we only reproduced what the Go compiler does.

Unfortunately, we cannot make the compiler inline functions written in assembler. The Go compiler has no such ability today, although the feature request has been open for quite a long time.

That is why small assembler functions bring no benefit at all. We need to either write large functions, use the new math/bits package, or avoid assembler altogether.

Let's now look at the vector versions of our functions.

For this example I decided to use AVX2, so we will use operations that work with 32-byte chunks. The structure of the code is very similar to the scalar version: load the parameters, ask for a free general-purpose register, and so on.

One difference is that wide vector operations use special wide registers. For 32-byte chunks these are the registers with the Y prefix, which is why you see the YMM() function in the code. If I were using AVX-512 with 64-byte chunks, the prefix would be Z.

The second difference is that I decided to apply an optimization called loop unrolling: doing eight loop iterations' worth of work manually before jumping back to the start of the loop. This optimization reduces the number of branches in the code, and how far you can unroll is limited by the number of free registers available.
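
A vectorized sketch with AVX2 (for brevity I show a single VPANDN per iteration rather than the eightfold unrolled loop, and the tail handling is omitted):

    // +build ignore

    package main

    import (
        . "github.com/mmcloughlin/avo/build"
        . "github.com/mmcloughlin/avo/operand"
    )

    func main() {
        TEXT("AndNotAVX2", NOSPLIT, "func(a, b, result []byte)")
        aPtr := Load(Param("a").Base(), GP64())
        bPtr := Load(Param("b").Base(), GP64())
        rPtr := Load(Param("result").Base(), GP64())
        n := Load(Param("a").Len(), GP64())

        Label("loop")
        CMPQ(n, Imm(32))
        JB(LabelRef("done")) // fewer than 32 bytes left: stop

        x := YMM()
        y := YMM()
        VMOVDQU(Mem{Base: bPtr}, x) // x = 32 bytes of b
        VMOVDQU(Mem{Base: aPtr}, y) // y = 32 bytes of a
        VPANDN(y, x, y)             // y = (NOT x) AND y, i.e. a AND NOT b
        VMOVDQU(y, Mem{Base: rPtr})

        ADDQ(Imm(32), aPtr)
        ADDQ(Imm(32), bPtr)
        ADDQ(Imm(32), rPtr)
        SUBQ(Imm(32), n)
        JMP(LabelRef("loop"))

        Label("done")
        RET()
        Generate()
    }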

Well, what about performance? It is beautiful! We got a speedup of about seven times over the best Go solution. Impressive, right?

Even this implementation could potentially be accelerated further with AVX-512, prefetching, or a JIT (just-in-time) compiler for the query planner. But that is certainly a topic for a separate talk.

Bitmap Index Issues


Now that we have looked at a simple bitmap index implementation in Go and a much more performant one in assembler, let's finally talk about why bitmap indexes are used so rarely.

Older scientific papers mention three problems with bitmap indexes, but newer papers (and I) argue that these problems are no longer relevant. We will not dive deep into each of them, but we will look at them briefly.

The high-cardinality problem


We are told that bitmap indexes are suitable only for fields with low cardinality, that is, fields with few distinct values (for example, gender or eye color). The reason is that the usual representation of such fields (one bitmap per value) would take up too much space at high cardinality and, moreover, those bitmaps would be sparsely (rarely) filled.


Sometimes we can use a different representation, such as the standard binary one that we use to represent numbers. But it was the arrival of compression algorithms that changed everything. Over the past decades scientists and researchers have come up with a large number of compression algorithms for bitmaps. Their main advantage is that you do not need to decompress a bitmap to perform bit operations: we can run bit operations directly on compressed bitmaps.

Recently, hybrid approaches have started to appear, such as roaring bitmaps. They use three different representations at once (plain bitmaps, arrays, and so-called bit runs) and balance between them to maximize performance and minimize memory consumption.
You can find roaring bitmaps in the most popular applications. There are implementations for a large number of programming languages, including more than three for Go.

Another approach that can help us cope with high cardinality is binning. Imagine a field that stores a person's height. Height is a floating-point number, but we humans do not think about it that way: for us there is no difference between a height of 185.2 cm and 185.3 cm.

So we can group similar values into bins 1 cm wide.

And if we also know that very few people are shorter than 50 cm or taller than 250 cm, we can, in effect, turn a field with practically infinite cardinality into a field with a cardinality of about 200 values.

Of course, if necessary, we can do additional filtering afterwards.
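
A sketch of such binning in Go (the exact bin boundaries are, of course, up to you):

    // heightBin maps a height in centimeters to one of about 200 bins:
    // bin 0 is "shorter than 50 cm", bin 201 is "250 cm or taller",
    // and everything in between gets one bin per centimeter.
    func heightBin(heightCM float64) int {
        switch {
        case heightCM < 50:
            return 0
        case heightCM >= 250:
            return 201
        default:
            return int(heightCM) - 50 + 1 // 50.0..249.9 -> bins 1..200
        }
    }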

The high throughput problem


The next problem with bitmap indexes is that updating them can be very expensive.

A database must allow data to be updated while hundreds of other queries are potentially searching that same data. We need locks to avoid problems with concurrent data access. And where there is one big lock, there is lock contention: the lock becomes a bottleneck.

This problem can be solved or circumvented using sharding or using versioned indexes.

Sharding is a simple and well-known thing. You can shard the bitmap index as you would shard any other data. Instead of one big lock, you will get a bunch of small locks and thus get rid of lock contention.

The second way to solve the problem is to use versioned indexes. You keep one copy of the index for searching (reading) and one for writing (updating). And every so often (for example, every 100 ms or 500 ms) you copy them and swap them. Of course, this approach only works if your application can tolerate a slightly stale search index.

These two approaches can even be combined: you can have a sharded, versioned index.
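
A very rough sketch of the versioned-index idea (the synchronization between the updater and the swap, and the real contents of the index, are omitted; the clone method here is just an illustrative deep copy):

    package bitmap

    import (
        "sync/atomic"
        "time"
    )

    type index struct {
        vegan, expensive, terrace []uint64 // bitmaps, as before
    }

    // clone makes a deep copy that the writer can keep mutating.
    func (idx *index) clone() *index {
        return &index{
            vegan:     append([]uint64(nil), idx.vegan...),
            expensive: append([]uint64(nil), idx.expensive...),
            terrace:   append([]uint64(nil), idx.terrace...),
        }
    }

    type versionedIndex struct {
        read  atomic.Value // holds the *index used by searches
        write *index       // the *index mutated by updates
    }

    // swapLoop periodically publishes the write copy to readers and
    // continues applying updates to a fresh copy.
    func (v *versionedIndex) swapLoop() {
        for range time.Tick(500 * time.Millisecond) {
            current := v.write
            v.write = current.clone()
            v.read.Store(current)
        }
    }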

More complex queries



The last problem with bitmap indexes is that, as we are told, they are poorly suited for more complex types of queries, such as range queries.

And indeed, if you think about it, bit operations like AND, OR, and so on do not seem well suited to queries like "Show me hotels with room rates from $200 to $300 per night."

A naive and very unwise solution would be to take the result for each dollar value and combine them with the bitwise OR operation.

A slightly better solution would be to use grouping, for example into $50 buckets. This would make the process 50 times faster.

But the problem is easily solved with a representation created specifically for this type of query. In the scientific literature it is called range-encoded bitmaps.

In this representation, we do not set a bit only for a given value (say, 200); instead we set it for that value and everything above it: 200 and higher. The same for 300: 300 and above. And so on.

Using this representation, we can answer this kind of query by scanning the index only twice: first we get the list of hotels where a room costs $300 or less, then we remove from it those where a room costs $199 or less. Done.
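
A sketch of that query, reusing the andNot64 helper from earlier (here rangeLE[v] is assumed to be the range-encoded bitmap of hotels whose price is v dollars or less, and a bitmap is assumed to exist for every dollar value):

    // hotelsBetween returns the bitmap of hotels whose price is in [low, high].
    func hotelsBetween(rangeLE map[int][]uint64, low, high int) []uint64 {
        result := make([]uint64, len(rangeLE[high]))
        // (price <= high) AND NOT (price <= low-1)
        andNot64(result, rangeLE[high], rangeLE[low-1])
        return result
    }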

You might be surprised, but even geo queries are possible with bitmap indexes. The trick is to use a geometric representation that surrounds your coordinate with a shape, for example S2 from Google. The shape must be representable as three or more intersecting lines that can be numbered. That way we can turn our geo query into several range queries (along these numbered lines).

Ready Solutions


I hope I have got you at least a little interested and that you now have another useful tool in your arsenal. If you ever need to do something like this, you will know where to look.

However, not everyone has the time, patience and resources to build bitmap indexes from scratch, especially more advanced ones that use SIMD.

Fortunately, there are several ready-made solutions that will help you.

roaring bitmaps


First, there is the roaring bitmaps library that I have already mentioned. It contains all the containers and bit operations you need to build a full-fledged bitmap index.

Unfortunately, at the moment none of the Go implementations use SIMD, which means the Go implementations are slower than, for example, the C ones.
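
Still, even without SIMD they are very usable. With the github.com/RoaringBitmap/roaring package, the familiar restaurant query might look like this (a sketch with hand-picked restaurant numbers):

    package main

    import (
        "fmt"

        "github.com/RoaringBitmap/roaring"
    )

    func main() {
        expensive := roaring.BitmapOf(1, 7, 9)
        terrace := roaring.BitmapOf(2, 7, 8, 9)
        reservations := roaring.BitmapOf(0, 2, 3, 7, 8, 9)

        // Cheap restaurants with a terrace where you can book a table:
        // (terrace AND reservations) AND NOT expensive.
        result := roaring.And(terrace, reservations)
        result.AndNot(expensive)

        fmt.Println(result.ToArray()) // [2 8]
    }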

Pilosa


Another product that can help you is Pilosa, which essentially consists of nothing but bitmap indexes. It is a relatively new solution, but it is winning hearts at great speed.

Pilosa uses roaring bitmaps internally and lets you take advantage of them; it simplifies and wraps up everything I mentioned above: grouping, range-encoded bitmaps, the concept of a field, and so on.

Let's take a quick look at an example of using Pilosa to answer a question already familiar to you.

The example is very similar to what you saw before. We create a client for the Pilosa server, create an index and the required fields, fill the fields with random data (with the same probabilities) and, finally, run the familiar query.

After that we apply NOT to the "expensive" field, then intersect (AND) the result with the "terrace" field and with the "reservations" field, and finally obtain the result.
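
A rough sketch with the go-pilosa client (the call names are from memory and may not match the current client API exactly; filling the fields with data and error handling are omitted, so treat this only as an outline and check the library's documentation):

    package main

    import (
        "fmt"

        pilosa "github.com/pilosa/go-pilosa"
    )

    func main() {
        client := pilosa.DefaultClient() // talks to a Pilosa server on localhost
        schema, _ := client.Schema()

        restaurants := schema.Index("restaurants")
        expensive := restaurants.Field("expensive")
        terrace := restaurants.Field("terrace")
        reservations := restaurants.Field("reservations")
        _ = client.SyncSchema(schema)

        // Row 1 of each field means "the restaurant has this property".
        // (terrace AND reservations) AND NOT expensive, expressed as a Difference.
        response, _ := client.Query(restaurants.Difference(
            restaurants.Intersect(terrace.Row(1), reservations.Row(1)),
            expensive.Row(1),
        ))
        fmt.Println(response.Result().Row().Columns)
    }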

I very much hope that in the foreseeable future this new type of index, the bitmap index, will also appear in DBMSs like MySQL and PostgreSQL.

Conclusion



If you are still awake, thank you. Because of the limited time I had to touch on many topics only briefly, but I hope the talk was useful and maybe even motivating.

It’s good to know about bitmap indexes, even if you don’t need them right now. Let them be another tool in your drawer.

We have looked at various Go performance tricks and at the things the Go compiler does not yet do very well. And that is definitely useful for every Go programmer to know.

That's all I wanted to tell. Thank you!
