Indexers in C# under the hood: we index better than Dow Jones



Good day. In this article I suggest we get acquainted with the indexers of various types. We will look at the assembly code generated for these indexers and at the speed characteristics of each instruction. I will also offer a few fairly obvious conclusions. But what to use in your particular situation is up to you: whether it is worth sacrificing convenience for speed, or vice versa.

Metrics


The assembly code is given for 64-bit systems. For each instruction, the chosen metrics are the number of fused micro-operations, the total number of micro-operations, latency, reciprocal throughput and, of course, the number of instructions. I do not give overall figures for each indexer, since the picture can vary depending on how the indexed type is used and how it interacts with the cache.

Below is a brief summary of the terminology, without digging deep: just the core concepts. I tried to describe everything as simply as possible, for a common understanding.

A micro-operation (uop) is an operation into which each instruction is decomposed. The concept of micro-operations is used for optimizations such as fusion, caching and reordering. For example, the MOV instruction consists of 1 micro-operation, the XCHG instruction on two registers consists of 3 micro-operations (it uses the "temporary variable" approach, i.e. an internal register; thanks to leotsarev for the correction), and the XCHG instruction on a register and memory consists of 8 micro-operations.

Fused micro-operations (fused uops): as mentioned above, fusion of micro-operations is one of these optimizations. It consists in replacing two micro-operations with a single, more complex one.

Latency is the number of clock cycles after which the data produced by the instruction becomes available for use by another instruction.

Reciprocal throughput is the average number of clock cycles per instruction when a sequence of identical instructions operating on independent data is executed.

Based on these indicators you can estimate the performance of a particular set of instructions. Note that we can only "estimate": actual performance depends on many factors, such as cache hits or misses, data dependencies and so on.

The instruction figures below are given for Intel's Skylake-X processors; this matches my Intel Core i7-6700.

It is also worth remembering that the fastcall convention on 64-bit systems passes not 2 but 4 parameters in registers (rcx, rdx, r8, r9).
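
Before diving into the per-instruction numbers: if you want actual timings rather than estimates, a minimal BenchmarkDotNet sketch along these lines can be run on your own machine (the class and member names here are mine, purely illustrative):

  using System.Collections.Generic;
  using BenchmarkDotNet.Attributes;
  using BenchmarkDotNet.Running;

  public class IndexerBenchmarks
  {
      private int[] _array;
      private List<int> _list;

      [GlobalSetup]
      public void Setup()
      {
          _array = new int[1024];          // plain array
          _list = new List<int>(_array);   // list wrapping a copy of the same data
      }

      [Benchmark(Baseline = true)]
      public int ArrayIndexer() => _array[512];

      [Benchmark]
      public int ListIndexer() => _list[512];
  }

  public static class Program
  {
      public static void Main() => BenchmarkRunner.Run<IndexerBenchmarks>();
  }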

Indexers in numbers


1. Array Indexer


We will consider the following method as an example:

  public int IndexerArray(int index, int[] array)
  {
      return array[index];
  }
  

Consider the assembly language code for this fragment.

  1.  cmp edx, dword ptr [r8 + 8]
 2. jae 00007ff9`07288c78
 3. movsxd rax, edx
 4. mov eax, dword ptr [r8 + rax * 4 + 10h]
  

The first line checks whether the index is out of the array's bounds; the second line jumps to code that throws an exception if it is. Then the element's position in the array is calculated. The first fields of the array object hold service information (the method table pointer and the length), so we need to skip them (an additional 10h = 16 bytes).
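
For clarity, here is a rough C# equivalent of what the JIT emits (purely illustrative; in reality the check is inlined by the JIT, and the array access on the last line would of course carry its own check):

  public static int IndexerArrayExpanded(int index, int[] array)
  {
      // Instructions 1-2: compare the index against the length field stored in
      // the object header and bail out if it is out of range; the unsigned
      // comparison also catches negative indices in a single check.
      if ((uint)index >= (uint)array.Length)
          throw new IndexOutOfRangeException();

      // Instruction 4: read the 4-byte element at base + index * 4 + 10h,
      // i.e. past the method table pointer and the length field.
      return array[index];
  }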

Information on instructions:
No. | Fused uops | Total uops | Latency | Reciprocal throughput
1   | 1          | 2          | 1       | 0.5
2   | 1          | 1          | n/a     | 1-2
3   | 1          | 1          | 1       | 0.25
4   | 1          | 1          | 2       | 0.5



2. Everyone's favorite List<>


The code:

  public int IndexerList(int index, List<int> list)
  {
      return list[index];
  }
  


Assembly Language Code:

  1.  cmp edx, dword ptr [r8 + 10h]
 2. jae M00_L00
 3. mov rax, qword ptr [r8 + 8]
 4. cmp edx, dword ptr [rax + 8]
 5. jae 00007ff9`07268f56
 6. movsxd rdx, edx
 7. mov eax, dword ptr [rax + rdx * 4 + 10h]
 ret
 M00_L00 call System.ThrowHelper.ThrowArgumentOutOfRange_IndexException ()
  

There are noticeably more instructions here. It is clearly visible that the list's indexer wraps the array's indexer. An interesting point: the bounds check is performed twice here. The first instruction checks whether the index is out of the list's bounds; if it is, we jump (instruction 2) to a quite expected call that throws an exception. This bounds check uses an internal field of the list, the second one in order (offset 10h (16) bytes from the start of the object: 8 for the method table pointer and 8 for the reference to the internal array, which is the first field). In the third line we place the address of the internal array, the first field, into the rax register (by analogy, the 8-byte offset skips the method table pointer). Then comes the already familiar part: indexing the array (lines 4-7). Here the array's own internal field is used for the bounds check.
I tried to remove everything not directly related to indexing, but it is worth leaving ret here so it does not look as if every access to a list element ends with an exception :D

By the way, so as not to torment you with speculation, take a look at the List<T> implementation in the reference source. The conversion to unsigned int is used to reduce the number of comparisons.
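
To make that concrete, here is a simplified re-creation of the list indexer (not the real class; the field names just mirror the reference source):

  public class MyList<T>
  {
      private T[] _items = new T[4];  // internal storage, grows on Add
      private int _size;              // number of elements actually in use

      public T this[int index]
      {
          get
          {
              // The cast to uint folds "index < 0 || index >= _size" into one
              // unsigned comparison (instructions 1-2 above).
              if ((uint)index >= (uint)_size)
                  throw new ArgumentOutOfRangeException(nameof(index));

              // Indexing the internal array adds its own bounds check
              // (instructions 4-5 above).
              return _items[index];
          }
      }
  }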

As a result, we get 7 instructions on the successful indexing path, which is 3 more than for the array.

Information on instructions:
No. | Fused uops | Total uops | Latency | Reciprocal throughput
1   | 1          | 2          | 1       | 0.5
2   | 1          | 1          | n/a     | 1-2
3   | 1          | 1          | 2       | 0.5
4   | 1          | 2          | 1       | 0.5
5   | 1          | 1          | n/a     | 1-2
6   | 1          | 1          | 1       | 0.25
7   | 1          | 1          | 2       | 0.5


3. The new Span<>


Let's check:

  public int IndexerSpan(int index, Span<int> span)
  {
      return span[index];
  }
  

And in assembly language:

  1.  cmp edx, dword ptr [r8 + 8]
 2. jae 00007ff9`07278f69
 3. mov rax, qword ptr [r8]
 4. movsxd rdx, edx
 5. mov eax, dword ptr [rax + rdx * 4]
  

When spans were announced, we were promised that they were done properly, with runtime support. And they did not lie, it must be said. In fact, this differs from the classic array by only one instruction: an extra dereference to get the address. Judging by this code, the address of the memory region holding the elements is stored inside the span itself, and we read it in line 3. It can be the address of a specific place in an array, in a string, or of a fragment of memory on the stack.
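
As a small illustration of that last point, the very same indexer works no matter what the span wraps (a hypothetical snippet, not from the original article):

  int[] numbers = { 1, 2, 3, 4, 5 };

  Span<int> fromArray = numbers.AsSpan(1, 3);        // points into the array
  Span<int> onStack = stackalloc int[4];             // points into stack memory
  ReadOnlySpan<char> fromString = "hello".AsSpan();  // points into the string

  // All three accesses go through essentially the same five-instruction
  // indexing path shown above.
  int a = fromArray[0];
  int b = onStack[2];
  char c = fromString[1];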
If you are interested, you can also study the Span implementation in the corefx sources. You can see that there are two different implementations there, selected by the PROJECTN symbol; PROJECTN was the codename of the first version of .NET Native for UWP. So we are more interested in the second version of the indexer. It is marked [Intrinsic]. Moreover, if you look at the static class Unsafe (https://github.com/dotnet/corefx/blob/b86783ef9525f22b0576278939264897464f9bbd/src/Common/src/CoreLib/Internal/Runtime/CompilerServices/Unsafe.cs) used in this indexer's implementation, you will find a note that most of the methods in that file are implemented as intrinsics.

Calls to methods or references to fields marked with the [Intrinsic] attribute have runtime support.

In CoreCLR, the bodies of such methods are replaced by the EE (Execution Engine) with unsafe code. If you want more details, you can start digging from the getILIntrinsicImplementationForUnsafe method.
If you are curious how this works in CoreRT (which interests me a little), you can start looking in its sources.

Information on instructions:
No. | Fused uops | Total uops | Latency | Reciprocal throughput
1   | 1          | 2          | 1       | 0.5
2   | 1          | 1          | n/a     | 1-2
3   | 1          | 1          | 2       | 0.5
4   | 1          | 1          | 1       | 0.25
5   | 1          | 1          | 2       | 0.5


All indexers have strong data dependencies: each instruction uses the results of previous ones. There are no unusual results here, and there should not be. But now the overhead that appears in each case is clear. A few obvious conclusions. If the algorithm involves very frequent indexing, it makes sense to consider replacing the list with an array. If indexing is not that frequent, it may be more convenient to use a list, which provides a very convenient API and does not carry that much overhead (keeping in mind that you should watch out for resizes of its internal array).
Now let's look at the two ways to define a two-dimensional array: an array of arrays (jagged array) and a multidimensional array.

4. Multidimensional array


C# code:

  public int IndexerDimensionalArray(int index1, int index2, int[,] demensionalArray)
  {
      return demensionalArray[index1, index2];
  }
  

Assembly Language:

  1.  mov eax, edx
 2. sub eax, dword ptr [r9 + 18h]
 3. cmp eax, dword ptr [r9 + 10h]
 4. jae 00007ff9`00098fe6
 5. mov edx, r8d
 6. sub edx, dword ptr [r9 + 1Ch]
 7. cmp edx, dword ptr [r9 + 14h]
 8. jae 00007ff9`00098fe6
 9. mov ecx, dword ptr [r9 + 14h]
 10. imul rcx, rax
 11. mov rax, rdx
 12. add rax, rcx
 13. mov eax, dword ptr [r9 + rax * 4 + 20h]
  

Everything here is clear in principle: two bounds checks for the two dimensions, then the computation of the flat index and the element access. Such an array is stored in memory as one contiguous block.
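
In C# terms, the generated code is roughly equivalent to the following (illustrative only; the lower-bound and length fields read at offsets 18h/1Ch and 10h/14h correspond to GetLowerBound and GetLength):

  public static int IndexerDimensionalExpanded(int index1, int index2, int[,] a)
  {
      // Instructions 1-4: normalize the first index by the lower bound of
      // dimension 0 and compare it with that dimension's length.
      int row = index1 - a.GetLowerBound(0);
      if ((uint)row >= (uint)a.GetLength(0))
          throw new IndexOutOfRangeException();

      // Instructions 5-8: the same for the second dimension.
      int col = index2 - a.GetLowerBound(1);
      if ((uint)col >= (uint)a.GetLength(1))
          throw new IndexOutOfRangeException();

      // Instructions 9-13: flat offset = row * length(dim 1) + col,
      // with the elements starting at offset 20h from the object base.
      return a[index1, index2];
  }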

Information on instructions:
No. | Fused uops | Total uops | Latency | Reciprocal throughput
1   | 1          | 1          | 0-1     | 0.25
2   | 1          | 2          | n/a     | 0.5
3   | 1          | 2          | 1       | 0.5
4   | 1          | 1          | n/a     | 1-2
5   | 1          | 1          | 0-1     | 0.25
6   | 1          | 2          | n/a     | 0.5
7   | 1          | 2          | 1       | 0.5
8   | 1          | 1          | n/a     | 1-2
9   | 1          | 1          | 2       | 0.5
10  | 1          | 1          | 3       | 1
11  | 1          | 1          | 0-1     | 0.25
12  | 1          | 1          | 1       | 0.25
13  | 1          | 1          | 2       | 0.5



5. Array of arrays (jagged array)


  public int IndexerJaggedArray(int index, int index2, int[][] jaggedArray)
  {
      return jaggedArray[index][index2];
  }
  

Assembler:

  1.  cmp edx, dword ptr [r9 + 8]
 2. jae 00007ff9`00098f95
 3. movsxd rax, edx
 4. mov rax, qword ptr [r9 + rax * 8 + 10h]
 5. cmp r8d, dword ptr [rax + 8]
 6. jae 00007ff9`00098f95
 7. movsxd rdx, r8d
 8. mov eax, dword ptr [rax + rdx * 4 + 10h]
  

And the most interesting thing is that there are fewer instructions here than with the type made specifically for multidimensionality.

Information on instructions:
No. | Fused uops | Total uops | Latency | Reciprocal throughput
1   | 1          | 2          | 1       | 0.5
2   | 1          | 1          | n/a     | 1-2
3   | 1          | 1          | 1       | 0.25
4   | 1          | 1          | 2       | 0.5
5   | 1          | 2          | 1       | 0.5
6   | 1          | 1          | n/a     | 1-2
7   | 1          | 1          | 1       | 0.25
8   | 1          | 1          | 2       | 0.5


But I advise you not to rush to conclusions about the last two examples. Because a two-dimensional array is a single object initialized all at once, its memory is allocated as one large block. This gives the best cache behavior, which can change the picture radically. In an array of arrays, memory for each inner array is allocated separately, so the rows are likely to end up scattered across the heap wherever the allocator finds room for them.

However, the jagged layout may suit someone better. Perhaps in some situations it is known that the object's lifetime will be short, and we do not want it to end up on the large object heap (which is effectively treated like the second generation by the garbage collector), where it could linger much longer than we would like. Or after some time we may want to keep only certain rows and let everything else be collected. Or the type will be accessed at random, non-sequential elements, where the cache cannot help much anyway.
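
To make the trade-off concrete, here is a hypothetical allocation comparison (the sizes are mine, chosen just for illustration):

  // One contiguous block of about 4 MB: the best locality, but it lands on
  // the large object heap and lives or dies as a whole.
  int[,] multi = new int[1024, 1024];

  // 1024 small independent arrays: the rows may end up scattered across the
  // heap, but each stays well below the LOH threshold and can be dropped
  // (jagged[i] = null) and collected on its own.
  int[][] jagged = new int[1024][];
  for (int i = 0; i < jagged.Length; i++)
      jagged[i] = new int[1024];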

Also, with an array of arrays it is more likely that the garbage collector can get by with a sweep instead of compacting. A reminder: when memory is fragmented, the total amount of free space may be sufficient for a new object while there is no contiguous free region of the required size. In that case compacting is performed: objects are moved to defragment the heap. If a contiguous free region can be found for the new object, it is simply placed into that free space; this is a sweep.

I hope this information helps you draw the right conclusions and back up your position in the discussion about what to use.
