Deep Learning in Optical Flow Calculation

With the advent of many different neural network architectures, many classic computer vision methods have become a thing of the past. Fewer and fewer people use SIFT and HOG for object detection or MBH for action recognition, and when they do, it is mostly as handcrafted features fed into the corresponding networks. Today we look at one of the classic CV problems in which classical methods still hold the lead, while deep learning architectures are only slowly breathing down their necks.

Optical flow estimation

The task of calculating the optical flow between two images (usually adjacent video frames) is to build a vector field $ O $ of the same size as the images, where $ O (i, j) $ is the apparent displacement vector of the pixel $ (i, j) $ from the first frame to the second. By constructing such a vector field between all adjacent video frames, we get a complete picture of how objects moved in the video. In other words, this is the task of tracking all the pixels in a video. Optical flow is used extremely widely: in action recognition, for example, such a vector field lets a model concentrate on the motion in a video and abstract away from its context [8]. Even more common applications are visual odometry, video compression, post-processing (for example, adding a slow-motion effect), and much more.

There is room for ambiguity here: what exactly counts as apparent displacement in mathematical terms? Usually it is assumed that pixel values pass from one frame to the next unchanged, in other words:

$ I (i, j, t) = I (i +  u_i, j + u_j, t + 1), $

where $ I (i, j, t) $ is the pixel intensity at coordinates $ (i, j) $ at time $ t $. The optical flow $ (u_i, u_j) $ then shows where this pixel has moved by the next moment in time (i.e., in the next frame).
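The brightness constancy assumption can be checked on a toy example. The sketch below uses a synthetic random frame and a uniform, hypothetical displacement of (1, 2) pixels; neither comes from the article's data.

```python
import numpy as np

# Synthetic first frame with random intensities (illustrative data only).
rng = np.random.default_rng(0)
frame_t = rng.random((6, 8))

# Assume every pixel moves by the same displacement (u_i, u_j) = (1, 2).
u_i, u_j = 1, 2

# Build the next frame by shifting the first one. np.roll wraps around the
# borders, so the relation below holds exactly away from the wrapped edges.
frame_t1 = np.roll(frame_t, shift=(u_i, u_j), axis=(0, 1))

# Brightness constancy: I(i, j, t) == I(i + u_i, j + u_j, t + 1).
i, j = 2, 3
assert frame_t[i, j] == frame_t1[i + u_i, j + u_j]
```

In real video the equality holds only approximately, which is exactly why the later sections care about robustness to brightness changes.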

In the picture it looks like this:

Visualizing the vector field directly with vectors is the obvious choice, but it is not always convenient, so the second common way is to render it with color:

Each color in this picture encodes a particular vector. For simplicity, vectors longer than 20 are clipped, and the vector itself can be recovered from its color using the following image:

More cat flow!

Classical methods have achieved quite good accuracy, sometimes at the cost of speed. We will look at the progress that neural networks have made on this problem over the past 4 years.

Data and Metrics

A few words about which datasets were available and popular at the start of our story (i.e., in 2015), and about how the quality of the resulting algorithm is measured.


Middlebury: a tiny dataset of 8 image pairs with small displacements, which is nevertheless still sometimes used to validate optical flow algorithms.


KITTI: a dataset annotated for self-driving car applications and collected using LIDAR. It is widely used to validate optical flow algorithms and contains many rather difficult cases with abrupt changes between frames.


MPI Sintel: another very common benchmark, created from the open animated film Sintel and rendered in Blender in two versions, designated clean and final. The second is much harder because it contains many atmospheric effects, noise, blur, and other troubles for optical flow algorithms.


The standard error measure for optical flow is End Point Error, or EPE. It is simply the Euclidean distance between the flow vector computed by the algorithm and the true flow vector, averaged over all pixels.
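As a minimal sketch, EPE can be computed in a few lines of NumPy. The function name and the `(H, W, 2)` array layout are assumptions made for illustration.

```python
import numpy as np

def end_point_error(flow_pred, flow_gt):
    """Average Euclidean distance between predicted and true flow vectors.

    Both arrays are assumed to have shape (H, W, 2).
    """
    diff = flow_pred - flow_gt
    return float(np.mean(np.sqrt(np.sum(diff ** 2, axis=-1))))

# Toy check: every predicted vector is off by (3, 4), so the error is 5.
flow_gt = np.zeros((4, 5, 2))
flow_pred = np.zeros((4, 5, 2))
flow_pred[..., 0] = 3.0
flow_pred[..., 1] = 4.0
epe = end_point_error(flow_pred, flow_gt)
```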

FlowNet (2015)

When they took on the design of a neural network architecture for optical flow back in 2015, the authors (from the universities of Munich and Freiburg) faced two problems. First, there was no large labeled dataset for this task, and labeling one by hand is hardly feasible (try marking where every pixel of an image moved to in the next frame). Second, the task was quite different from everything that had been solved with CNN architectures before. In essence, this is pixel-wise regression, which makes it similar to segmentation (pixel-wise classification), but instead of one image we have two inputs, and intuitively the features should somehow capture the difference between the two images. As a first iteration, the authors decided to simply stack the two RGB frames between which we want to calculate the optical flow (obtaining, in essence, a 6-channel image) and to take U-Net with a number of changes as the architecture. This network was called FlowNetS (S stands for Simple):

As can be seen from the diagram, the encoder is unremarkable, while the decoder differs from the classical versions in several ways:

  1. The optical flow is predicted not only at the last level but at all the others as well. To get the ground truth for the i-th level of the decoder, the original target (i.e., the optical flow) is simply downsampled (almost like an image) to the required resolution, and the prediction obtained at the i-th level is passed on, i.e., concatenated with the feature map coming out of that level. The overall training loss is a weighted sum of the losses from all decoder levels, with larger weights for levels closer to the network output. The authors do not explain why this is done, but the most likely reason is that sharp movements are easier to detect at the early levels, where the resolution is lower and the optical flow vectors are therefore not as large.
  2. The diagram shows that the input image resolution is 384x512, while the output is four times smaller. The authors noticed that upsampling this output to 384x512 with simple bilinear interpolation gives the same quality as attaching two more decoder levels. One can also use the variational approach [2], which improves quality further (+v in the quality table).
  3. As in U-Net, the feature maps from the encoder are forwarded to the decoder and concatenated, as shown in the diagram.
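The multi-level loss from the first item can be sketched as follows. The block-average downsampling, the per-level EPE, and the weight values are all assumptions made for illustration; the paper's exact choices may differ.

```python
import numpy as np

def downsample_flow(flow, factor):
    """Reduce an (H, W, 2) flow field by averaging factor x factor blocks."""
    h, w, _ = flow.shape
    blocks = flow.reshape(h // factor, factor, w // factor, factor, 2)
    return blocks.mean(axis=(1, 3))

def epe(pred, target):
    """Mean Euclidean distance between two (H, W, 2) flow fields."""
    return float(np.mean(np.linalg.norm(pred - target, axis=-1)))

def multiscale_loss(predictions, flow_gt, weights):
    """Weighted sum of per-level errors; predictions go from coarse to fine."""
    total = 0.0
    for pred, weight in zip(predictions, weights):
        factor = flow_gt.shape[0] // pred.shape[0]
        total += weight * epe(pred, downsample_flow(flow_gt, factor))
    return total

# Toy data: three decoder levels predicting zeros against a constant flow.
flow_gt = np.ones((16, 32, 2))
predictions = [np.zeros((4, 8, 2)), np.zeros((8, 16, 2)), np.zeros((16, 32, 2))]
weights = [0.1, 0.3, 0.6]  # hypothetical: larger weight closer to the output
loss = multiscale_loss(predictions, flow_gt, weights)
```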

To understand how the authors tried to improve on their baseline, you need to know what correlation between images is and why it can be useful for calculating optical flow. Given two images, and knowing that the second is the next video frame after the first, we can compare the area around a point in the first frame (the point whose shift to the second frame we want to find) with areas of the same size in the second image. Assuming that the shift per unit of time cannot be too large, the comparison can be restricted to a certain neighborhood of the initial point. Cross-correlation is used for this. Let us explain with an example.

Take two adjacent video frames; we want to determine where a certain point has moved from the first frame to the second. Suppose that some area around this point has shifted in the same way. Indeed, neighboring pixels in a video usually move together, because they are most likely part of a single object. This assumption is actively used, for example, in differential approaches, which are covered in more detail in [5], [6].

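  import cv2
  import matplotlib.pyplot as plt
  # frame1 and frame2 are two adjacent video frames, loaded beforehand as image arrays
  fig, ax = plt.subplots(1, 2, figsize=(20, 10))
  ax[0].imshow(frame1)
  ax[1].imshow(frame2);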

Let's try to take a point in the center of the kitten's paw and find it in the second frame. Take some area around it.

  patch1 = frame1[90:190, 140:250]
  plt.imshow(patch1);

Let us calculate the correlation between this area (in the English-language literature it is usually called a template or patch from the first image) and the second image. The template simply “slides” over the second image, and the following value is computed between it and each piece of the same size in the second image:
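For the normalized cross-correlation mode `cv2.TM_CCORR_NORMED`, the OpenCV documentation [7] gives this value as follows, with $ T $ denoting the template and $ I $ the second image:

$ R (x, y) = \frac{\sum_{x', y'} T (x', y') \, I (x + x', y + y')}{\sqrt{\sum_{x', y'} T (x', y')^2 \cdot \sum_{x', y'} I (x + x', y + y')^2}}, $

where the sums run over all pixel positions $ (x', y') $ of the template.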

The larger this value, the more the template resembles the corresponding piece of the second image. With OpenCV you can do it like this:

  corr = cv2.matchTemplate(frame2, patch1, cv2.TM_CCORR_NORMED)
  plt.imshow(corr, cmap='gray');

Read more details in [7].

The result is as follows:

We see a clear peak, marked in white. Find it on the second frame:

  # Location of the correlation peak: top-left corner (x, y) of the best match
  min_val, max_val, min_loc, max_loc = cv2.minMaxLoc(corr)
  h, w, _ = patch1.shape
  top_left = max_loc
  bottom_right = (top_left[0] + w, top_left[1] + h)
  # Draw the matched region on a copy of the second frame
  frame2_copy = frame2.copy()
  cv2.rectangle(frame2_copy, top_left, bottom_right, 255, 2)
  plt.imshow(frame2_copy);

We see that the paw was found correctly. From this data we can tell in which direction it moved between the first frame and the second and calculate the corresponding optical flow. In addition, it turns out that this operation is fairly robust to photometric distortion: if the brightness of the second frame rises sharply, the peak of the cross-correlation between the images stays in place.
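To make the last step explicit: the displacement is the position of the correlation peak minus the position where the patch was cut from the first frame. In the example above that would be `max_loc` minus `(140, 90)` in OpenCV's (x, y) order. The sketch below reproduces the idea in pure NumPy on synthetic grayscale data; the helper function, the random frames, and the shift are all hypothetical, and `sliding_window_view` needs NumPy 1.20 or newer.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def match_template_ccorr_normed(image, template):
    """Normalized cross-correlation of a 2D template with every image window."""
    windows = sliding_window_view(image, template.shape)  # (H-h+1, W-w+1, h, w)
    num = np.einsum("ijkl,kl->ij", windows, template)
    den = np.sqrt(np.sum(windows ** 2, axis=(2, 3)) * np.sum(template ** 2))
    return num / den

# Synthetic pair: the second frame is the first one shifted by a known amount.
rng = np.random.default_rng(0)
frame1 = rng.random((40, 50))
true_shift = (3, -5)  # (rows, columns), chosen arbitrarily for the demo
frame2 = np.roll(frame1, shift=true_shift, axis=(0, 1))

# Cut a patch from the first frame and look for it in the second one.
top, left = 10, 20
patch1 = frame1[top:top + 8, left:left + 8]
corr = match_template_ccorr_normed(frame2, patch1)

# The peak gives the new top-left corner; subtracting the old one gives the flow.
peak_row, peak_col = np.unravel_index(np.argmax(corr), corr.shape)
flow = (int(peak_row - top), int(peak_col - left))
```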

Considering all of the above, the authors decided to introduce a so-called correlation layer into their architecture, computing the correlation not on the input images but after several encoder layers, on the feature maps. For obvious reasons, such a layer has no trainable parameters. In essence it is similar to a convolution, except that instead of learned filter weights it uses a region of the second image:

Strangely enough, this trick did not give the authors of the paper a significant improvement in quality. It was, however, applied more successfully in later works, and in [9] the authors showed that by slightly changing the training parameters one can make FlowNetC (the variant with the correlation layer) work much better.
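A simplified sketch of such a layer is given below. It compares single feature vectors rather than patches and searches a small square neighborhood. The function name, the shapes, and the zero padding at the borders are assumptions of this example, not the paper's exact layer.

```python
import numpy as np

def correlation_layer(feat1, feat2, max_disp):
    """Dot products between feat1[i, j] and feat2[i + di, j + dj].

    feat1, feat2: (H, W, C) feature maps; |di|, |dj| <= max_disp.
    Returns an (H, W, (2 * max_disp + 1) ** 2) array. Positions outside the
    second feature map are treated as zeros. There are no trainable weights.
    """
    h, w, _ = feat1.shape
    d = max_disp
    padded = np.pad(feat2, ((d, d), (d, d), (0, 0)))
    out = []
    for di in range(-d, d + 1):
        for dj in range(-d, d + 1):
            shifted = padded[d + di:d + di + h, d + dj:d + dj + w]
            out.append(np.sum(feat1 * shifted, axis=-1))
    return np.stack(out, axis=-1)

# Toy check: the second map is the first one shifted by (1, 2).
rng = np.random.default_rng(0)
feat1 = rng.standard_normal((10, 12, 64))
feat2 = np.roll(feat1, shift=(1, 2), axis=(0, 1))
cost = correlation_layer(feat1, feat2, max_disp=2)

# Decode the strongest response at one interior pixel back into (di, dj).
best = int(np.argmax(cost[5, 6]))
di, dj = divmod(best, 5)
di, dj = di - 2, dj - 2
```

Each output channel corresponds to one candidate displacement, so the strongest channel at a pixel already hints at where that pixel went.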

The authors solved the lack of a dataset in a rather elegant way: they scraped 964 images from Flickr on the themes “city”, “landscape”, and “mountain” at a resolution of 1024 × 768 and used 512 × 384 crops of them as backgrounds, onto which several chairs from an open set of rendered 3D models were placed. Then various affine transformations were applied to the chairs and to the background independently, which yielded the second image of the pair and the optical flow between the two. The result looks as follows:

An interesting finding was that such a synthetic dataset made it possible to achieve relatively good quality on data from a different domain. Fine-tuning on the relevant data, of course, improved quality further (+ft in the table below):

The result on real videos can be viewed here:

SpyNet (2016)

In many subsequent papers, the authors tried to improve quality by addressing the poor recognition of sharp movements. Intuitively, a motion will not be caught by the network if its vector extends well beyond the receptive field of an activation. The proposed solution relies on three things: larger convolutions, pyramids, and “warping” one image of the pair according to the optical flow. Let us take them in order.

So, if we have a pair of images in which an object has shifted sharply (by 10+ pixels), we can simply shrink the images (by a factor of 6 or more). The absolute value of the displacement decreases significantly, and the network is more likely to catch it, especially if its convolutions are larger than the displacement itself (here, 7x7 convolutions are used).

However, by shrinking the image we have lost many important details, so we should move to the next level of the pyramid, where the image is larger, while somehow taking into account what we learned when computing the optical flow at the smaller size. This is done with the warping operator, which resamples one image of the pair according to the existing approximation of the optical flow (obtained at the previous level). The gain is that the image “pushed” along the approximate flow is closer to the other image of the pair than the original was, i.e., we again reduce the absolute value of the optical flow that remains to be predicted (remember that small motions are detected much better, since they fit entirely within one convolution). Mathematically, given an image $ I $ and an approximation of the optical flow $ V $, the warping operator can be described as:

$ w (I, V) = I_w, \quad I_w (x) = I (x + V (x)), $

where $ x = (i, j) $ is a point in the image, $ I $ is the image itself, $ V $ is the optical flow, and $ I_w $ is the resulting image, warped according to the optical flow.
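A minimal NumPy sketch of this operator is shown below. It uses nearest-neighbor sampling and clamps coordinates at the border; both are simplifying assumptions, since real implementations typically interpolate bilinearly.

```python
import numpy as np

def warp(image, flow):
    """Backward warping I_w(x) = I(x + V(x)) with nearest-neighbor sampling.

    image: (H, W) array; flow: (H, W, 2) with (row, column) displacements.
    Coordinates that fall outside the image are clamped to the border.
    """
    h, w = image.shape
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_rows = np.clip(np.rint(rows + flow[..., 0]).astype(int), 0, h - 1)
    src_cols = np.clip(np.rint(cols + flow[..., 1]).astype(int), 0, w - 1)
    return image[src_rows, src_cols]

# Toy pair: the second frame is the first one moved one pixel to the right.
frame1 = np.arange(30, dtype=float).reshape(5, 6)
frame2 = np.roll(frame1, shift=1, axis=1)

# With the true flow (0, 1), warping the second frame recovers the first one
# everywhere except the last column, where the lookup is clamped.
flow = np.zeros((5, 6, 2))
flow[..., 1] = 1.0
warped = warp(frame2, flow)
```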

How do we apply all this in a CNN architecture? Let us fix the number of pyramid levels $ k $ and the factor $ n $ by which the image is reduced from one level to the next, counting from the last level. Denote by $ d (\cdot) $ and $ u (\cdot) $ the functions that downsample and upsample an image or an optical flow by this factor.

We also take a set of CNNs { $ G_0, \dots, G_k $ }, one for each level of the pyramid. The network $ G_i $ takes as input the pair of images at pyramid level $ i $ and the optical flow computed at the previous level by $ G_{i-1} $ ( $ G_0 $ simply takes a tensor of zeros instead). One of the images is passed through the warping layer to reduce the difference between them, and what we predict is not the optical flow at this level but the residual that must be added to the upsampled optical flow from the previous level to obtain the flow at this level. As a formula, it looks like this:

$ v_k = G_k (I_k^1, w (I_k^2, u (V_{k-1})), u (V_{k-1})). $

To get the optical flow itself, we simply add the network prediction and the upsampled flow from the previous level:

$ V_k = u (V_{k-1}) + v_k. $
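The recursion can be sketched as a simple loop over pyramid levels. Everything below is illustrative: the "networks" are dummy callables standing in for the trained CNNs, upsampling is nearest-neighbor with factor 2, and the displacements are rescaled by that factor on the assumption that they are measured in pixels of the finer grid.

```python
import numpy as np

def upsample(flow, factor=2):
    """u(.): nearest-neighbor upsampling of an (H, W, 2) flow field."""
    return np.repeat(np.repeat(flow, factor, axis=0), factor, axis=1) * factor

def warp(image, flow):
    """w(I, V): I_w(x) = I(x + V(x)), nearest-neighbor, borders clamped."""
    h, w = image.shape
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_rows = np.clip(np.rint(rows + flow[..., 0]).astype(int), 0, h - 1)
    src_cols = np.clip(np.rint(cols + flow[..., 1]).astype(int), 0, w - 1)
    return image[src_rows, src_cols]

def coarse_to_fine(pyramid1, pyramid2, networks):
    """Run the levels from coarsest to finest and accumulate the flow."""
    flow = np.zeros(pyramid1[0].shape + (2,))  # the first level starts from zeros
    for level, (img1, img2, net) in enumerate(zip(pyramid1, pyramid2, networks)):
        if level > 0:
            flow = upsample(flow)            # u(V_{k-1})
        warped2 = warp(img2, flow)           # w(I_k^2, u(V_{k-1}))
        residual = net(img1, warped2, flow)  # v_k
        flow = flow + residual               # V_k = u(V_{k-1}) + v_k
    return flow

def dummy_net(img1, warped2, flow):
    """Hypothetical stand-in for a trained G_k: a constant residual of 0.5."""
    return np.full(flow.shape, 0.5)

pyramid1 = [np.zeros((4 * 2 ** k, 4 * 2 ** k)) for k in range(3)]
pyramid2 = [np.zeros_like(img) for img in pyramid1]
final_flow = coarse_to_fine(pyramid1, pyramid2, [dummy_net] * 3)
```

With the constant residual, the flow grows as 0.5, then 2 · 0.5 + 0.5 = 1.5, then 2 · 1.5 + 0.5 = 3.5, which shows how each level only refines the upsampled estimate from the level before it.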

To obtain the ground truth for the network at this level, we need to do the reverse operation: subtract the upsampled prediction of the previous pyramid level from the target (downsampled to the required level). Schematically, it looks like this:

The advantage of this approach is that each level can be trained independently. The authors started training from level 0, and each subsequent network was initialized with the parameters of the previous one. Since each network $ G_i $ solves a much simpler problem than computing the full optical flow on a large image, it can have far fewer parameters. So many fewer, in fact, that the whole ensemble now fits on mobile devices:

The ensemble itself is as follows (an example of a pyramid of 3 levels):

It remains to describe the architecture of the network $ G_i $ itself and take stock. Each network $ G_i $ consists of 5 convolutional layers, each followed by a ReLU activation except the last one (which predicts the optical flow). The numbers of filters in the layers are { $ 32, 64, 32, 16, 2 $ }, respectively. The inputs of the network (the first image, the second image warped according to the optical flow, and the optical flow itself) are simply concatenated along the channel dimension, so the input tensor has 8 channels. The results are impressive:
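As a quick sanity check on how small each $ G_i $ is, we can count its parameters from the layer sizes listed above. The calculation assumes that all five layers use the 7x7 kernels mentioned earlier and that every filter has a bias term.

```python
# Channels per layer: 8 input channels (3 + 3 + 2), then the five filter counts.
kernel = 7
channels = [8, 32, 64, 32, 16, 2]

# Each convolution has kernel * kernel * c_in * c_out weights plus c_out biases.
params_per_layer = [
    kernel * kernel * c_in * c_out + c_out
    for c_in, c_out in zip(channels, channels[1:])
]
total_params = sum(params_per_layer)
print(total_params)  # 240050
```

Under these assumptions, one level has about 240 thousand parameters.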

PWC-Net (2018)

Inspired by the success of their German colleagues, the folks at NVIDIA decided to apply their experience (and their GPUs) to improve the result further. Their work builds in many ways on the ideas of the previous model (SpyNet), so PWC-Net also deals with pyramids, but pyramids of convolutional features rather than of the original images. But, again, first things first.

Using raw pixel intensities to calculate optical flow is not always reasonable: a sharp change in brightness or contrast breaks our assumption that pixels pass from one frame to the next unchanged, and the algorithm will not be robust to such changes. Classical optical flow algorithms use various transformations to alleviate this, but here the authors decided to let the model learn such transformations itself. Therefore, instead of an image pyramid, PWC-Net uses a feature pyramid (hence the first letter in Pwc-Net), i.e., simply feature maps from different layers of a CNN, here called the feature pyramid extractor.

Then everything is almost as in SpyNet, except that before the CNN, here called the optical flow estimator, is fed everything it needs, namely:

  • the image (in this case, the feature map from the feature pyramid extractor),
  • the upsampled optical flow calculated at the previous level,
  • the second image, “warped” (remember the warping layer, hence the second letter in pWc-Net) according to this optical flow,

a cost volume is computed between the warped second frame and the ordinary first one (again, feature maps from the feature pyramid extractor are used here instead of raw images). The cost volume (hence the third letter in pwC-Net) is essentially the previously discussed correlation between the two images.

The final touch is the context network, which is added right after the optical flow estimator and acts as a learned post-processing step for the calculated optical flow. Details of the architecture are given below or in the original article.

Intimate details
The feature pyramid extractor shares its weights between the two images, and leaky ReLU is used as the nonlinearity after each convolution. To reduce the resolution of the feature maps at each subsequent level, convolutions with stride 2 are used; $ c_t^l $ denotes the feature map of image $ t $ at level $ l $.

Optical flow estimator at the 2nd level of the pyramid (for example). Nothing unusual here, each convolution still ends with a leaky ReLU, except the last one, which predicts the optical flow.

The context network, again at level 2 of the pyramid. This network uses dilated convolutions with the same leaky ReLU activations, except after the last layer. It takes as input the optical flow computed by the optical flow estimator and the features from the second-to-last layer of that same estimator. The last number in each block is the dilation constant.

The results are even more impressive:

Compared to other CNN methods for calculating optical flow, PWC-Net achieves a balance between quality and number of parameters:

There is also an excellent presentation by the authors themselves, in which they talk about the model itself and their experiments:


The evolution of architectures for optical flow is a wonderful example of how progress in CNN architectures, combined with classical methods, yields better and better results. And although classical CV methods still win on quality, recent results give hope that this is fixable...

Sources and References

1. FlowNet: Learning Optical Flow with Convolutional Networks: article, code.
2. Large displacement optical flow: article.
3. Optical Flow Estimation using a Spatial Pyramid Network: article, code.
4. PWC-Net: CNNs for Optical Flow Using Pyramid, Warping, and Cost Volume: article, code.
5. What you wanted to know about optical flow but hesitated to ask: article.
6. Calculating optical flow with the Lucas-Kanade method. Theory: article.
7. Template matching with OpenCV: docs.
8. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset: article.
9. FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks: article, code.
