How and why we built landmark recognition in Cloud

With the advent of high-quality cameras in mobile phones, we increasingly photograph and film the bright and important moments of our lives. Many of us have photo archives spanning decades and thousands of photographs, which are becoming ever harder to navigate. Remember how long it took to find the right photo just a few years ago.

One of the goals of Cloud is to provide the most convenient access to and search over your photo and video archive. To do this, our computer vision team created and deployed "smart" photo-processing systems: search by objects, by scenes, by people, and so on. Another such striking technology is landmark recognition. Today I will talk about how we solved this problem with deep learning.

Imagine the situation: you went on vacation and brought back a pile of photos. In a conversation with friends, you are asked to show how you visited a palace, castle, pyramid, temple, lake, waterfall, mountain, and so on. You start frantically scrolling through the folder with photos, trying to find the right one. Most likely, you will not find it among the hundreds of images and will say you will show it later.

We solve this problem by grouping user photos into albums. This makes it easy to find the right pictures in just a few clicks. We now have albums by people, by objects and scenes, and by landmarks.

Photos with landmarks are important because they often capture significant moments of our lives (travel, for example). These can be photographs against the background of an architectural structure or of a corner of nature untouched by man. Therefore, we need to find these photos and give users easy and quick access to them.

Features of landmark recognition

But there is a nuance: you cannot simply take some model and train it to recognize landmarks; there are many difficulties.

  • First, we cannot clearly describe what a "landmark" is. We cannot say why one building is a landmark while the one standing next to it is not. It is not a formalized concept, which complicates the formulation of the recognition problem.
  • Second, landmarks are extremely diverse. They can be historical or cultural buildings: temples, palaces, castles. They can be all kinds of monuments. They can be natural objects: lakes, canyons, waterfalls. And a single model must be able to find all of them.
  • Third, images of landmarks are very rare; by our estimates they appear in only 1-3% of user photos. Therefore, we cannot afford recognition errors: if we show a person a photo without a landmark, it will be immediately noticeable and cause confusion and a negative reaction. Or, conversely, we show someone a photo with a landmark in New York when he has never been to America. So the recognition model must have a low false positive rate (FPR).
  • Fourth, about 50% of users, or even more, disable saving geo-information when taking photos. We need to take this into account and determine the place from the image alone. Most services that can work with landmarks today do so thanks to geodata; our initial requirements were stricter.

Let me show you some examples.

Here are similar objects, three French Gothic cathedrals. On the left is Amiens Cathedral, in the middle Reims Cathedral, and on the right Notre-Dame de Paris.

Even a person needs some time to look at them and realize that these are different cathedrals; the machine must cope with this too, and faster than a human.

And here is an example of another difficulty: these three photos show Notre-Dame de Paris taken from different angles. The photos turned out very different, yet they all need to be recognized and found.

Natural objects are completely different from architectural ones. On the left is Caesarea in Israel, on the right the English Garden in Munich.

These photos have very few characteristic details for the model to "catch on" to.

Our method

Our method is based entirely on deep convolutional neural networks. As the training approach we chose so-called curriculum learning: training in several stages. To work efficiently both with and without geodata, we built a special inference procedure. I will describe each of the stages in more detail.


The fuel of machine learning is data, so first of all we needed to collect a dataset to train the model.

We divided the world into 4 regions, each of which is used at a different stage of training. Then, within each region, we took countries; for each country we made a list of cities and collected a base of photos of their landmarks. Examples of the data are shown below.

First we tried to train our model on this base. The results were bad. We started analyzing, and it turned out that the data was very "dirty": each landmark came with a large amount of garbage. What to do? Manually reviewing the entire huge volume of data is expensive, tedious, and not very clever. So we built an automatic base-cleaning procedure in which only one step uses manual work: for each landmark we hand-picked 3-5 reference photographs that definitely contain the desired landmark from a more or less correct angle. This goes fairly quickly, because the volume of such reference data is small relative to the entire database. Then automatic cleaning is performed using deep convolutional neural networks.

Further on I will use the term "embedding," by which I mean the following: we take a convolutional neural network trained for classification, cut off the last classifying layer, run an image through the network, and obtain a numerical vector at the output. That vector I will call an embedding.
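To make the definition concrete, here is a minimal numpy sketch. It is a toy stand-in for the real convolutional network: the weights, layer sizes, and function names are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a trained classification network:
# a "backbone" followed by a final classifying layer (sizes are invented).
W_backbone = rng.standard_normal((512, 2048)) * 0.01
W_classifier = rng.standard_normal((2048, 11000)) * 0.01

def embed(x):
    # Run the input through the network but stop before the classifier:
    # the penultimate activation vector is the embedding.
    return np.maximum(x @ W_backbone, 0.0)  # ReLU

def classify(x):
    # The full network, for comparison: embedding -> class scores.
    return embed(x) @ W_classifier

x = rng.standard_normal(512)  # stand-in for a preprocessed image
e = embed(x)
print(e.shape)  # (2048,)
```

The point is simply that the embedding is whatever vector the network produces just before its classifying layer.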

As I said, training was conducted in several stages corresponding to parts of our database. Therefore, at each stage we start either from the neural network of the previous stage or from the initializing network.

We run the reference photographs of a landmark through the network and obtain several embeddings. Now the base can be cleaned. We take all the pictures from the dataset for this landmark and run each of them through the network as well. We get a bunch of embeddings, and for each of them we compute the distances to the embeddings of the references. Then we compute the average distance, and if it exceeds a certain threshold, which is a parameter of the algorithm, we decide that the photo does not show the landmark. If the average distance is below the threshold, we keep the photo.
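A minimal numpy sketch of this cleaning step (function names and the threshold value are mine, not taken from the real system):

```python
import numpy as np

def clean_landmark_set(photo_embeddings, reference_embeddings, threshold):
    """Keep a photo only if the mean Euclidean distance from its embedding
    to the embeddings of the 3-5 manually chosen reference photos is below
    the threshold (an algorithm parameter)."""
    kept = []
    for i, emb in enumerate(photo_embeddings):
        dists = np.linalg.norm(reference_embeddings - emb, axis=1)
        if dists.mean() < threshold:
            kept.append(i)
    return kept

# Toy example: references near the origin, one photo close, one far away.
refs = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]])
photos = np.array([[0.05, 0.05],   # close to the references -> kept
                   [5.0, 5.0]])    # garbage -> dropped
print(clean_landmark_set(photos, refs, threshold=1.0))  # [0]
```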

As a result, we obtained a base containing more than 11 thousand landmarks from more than 500 cities in 70 countries, more than 2.3 million photographs in total. Now is the time to remember that most photos contain no landmarks at all. This information had to be conveyed to our model somehow, so we added 900 thousand photos without landmarks to the base and trained the model on the resulting dataset.

To measure training quality, we introduced an offline test. Keeping in mind that landmarks appear in only about 1-3% of photos, we manually compiled a set of 290 photographs that do contain landmarks. These are varied, rather complex photos with many objects, taken from different angles, so that the test would be as hard as possible for the model. Following the same principle, we selected 11 thousand photographs without landmarks, also quite complex, in which we deliberately looked for objects very similar to the landmarks in our database.

To assess training quality, we measure the accuracy of our model on photographs with landmarks and on photographs without them. These are our two main metrics.
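These two metrics can be sketched as follows (names and the toy data are mine):

```python
import numpy as np

def offline_accuracy(predicted_has_landmark, actually_has_landmark):
    """Two offline metrics, sketched: accuracy on photos with landmarks
    (share found) and accuracy on photos without them (share rejected)."""
    pred = np.asarray(predicted_has_landmark, dtype=bool)
    truth = np.asarray(actually_has_landmark, dtype=bool)
    acc_with = pred[truth].mean()         # landmark photos recognized
    acc_without = (~pred[~truth]).mean()  # non-landmark photos rejected
    return float(acc_with), float(acc_without)

# Toy example: 3 landmark photos (2 found), 4 non-landmark photos (3 rejected).
pred  = [True, True, False, False, False, False, True]
truth = [True, True, True,  False, False, False, False]
print(offline_accuracy(pred, truth))
```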

Existing approaches

The scientific literature contains relatively little on landmark recognition. Most solutions are based on local features. The idea is that we have a query picture and a picture from the database. In both pictures we find local features (key points) and match them. If the number of matches is large enough, we conclude that we have found the landmark.

To date, the best is the method proposed by Google, DELF (deep local features), in which local-feature matching is combined with deep learning. Running the input image through a convolutional network, we obtain DELF features.

How does landmark recognition work with DELF? We have a base of photos and an input image, and we want to understand whether it contains a landmark. All images are run through DELF to obtain the corresponding features for the database and for the input image. Then we perform a nearest-neighbor search and get candidate images with their features at the output. We match these features using geometric verification: if it passes, we conclude that the picture contains the landmark.

Convolutional Neural Network

For deep learning, pre-training is crucial. So we took a database of scenes and pre-trained our neural network on it. Why scenes? A scene is a complex object that includes many other objects, and a landmark is a special case of a scene. By pre-training the model on this database, we give it a notion of low-level features that can then be generalized for successful landmark recognition.

As the model we used a network from the Residual Network (ResNet) family. Their key feature is the residual block with a skip connection, which lets the signal pass through freely without going through the weighted layers. With this architecture it is possible to train deep networks to high quality and to cope with the vanishing-gradient effect, which is very important in training.

Our model is Wide ResNet-50-2, a modification of ResNet-50 in which the number of channels in the inner bottleneck block is doubled.

The network works very efficiently. We ran tests on our scene database, and here is what we got:

(Table: Top-1 and Top-5 error of WRN-50-2 and comparison networks on the scene test; WRN-50-2 is marked as the fast one.)

Wide ResNet turned out almost twice as fast as the much larger ResNet-200 network. And inference speed is very important in production. Given all this, we took Wide ResNet-50-2 as our main neural network.


To train the network we need a loss function. To choose one, we decided to use the metric learning approach: the network is trained so that members of the same class are pulled together into one cluster, while clusters of different classes stay as far apart as possible. For landmarks we used Center loss, which pulls points of one class toward a common center. An important property of this approach is that it requires no negative sampling, which becomes quite a difficult procedure in the later stages of training.
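For reference, the Center loss over a batch can be written as a small numpy sketch. This is a simplified illustration: in practice the centers are learned parameters updated during training, not fixed arrays.

```python
import numpy as np

def center_loss(embeddings, labels, centers):
    """L_C = 1/2 * mean ||x_i - c_{y_i}||^2: each embedding is pulled
    toward the center of its own class. No negative sampling is needed."""
    diffs = embeddings - centers[labels]
    return 0.5 * float(np.mean(np.sum(diffs ** 2, axis=1)))

centers = np.array([[0.0, 0.0], [10.0, 10.0]])  # one center per class
x = np.array([[1.0, 0.0], [10.0, 11.0]])        # batch of embeddings
y = np.array([0, 1])                            # class labels
print(center_loss(x, y, centers))  # 0.5 * mean(1, 1) = 0.5
```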

Let me remind you that we have n landmark classes plus one more class, "not a landmark," for which Center loss is not used. We assume that a landmark is one and the same object with internal structure, so it makes sense to compute a center for it. The "not a landmark" class can be anything at all, so computing a center for it makes no sense.

Then we put it all together and got a model for training. It consists of three main parts:

  • a Wide ResNet-50-2 convolutional neural network, pre-trained on scenes;
  • an embedding part consisting of a fully connected layer and a batch norm layer;
  • a classifier: a fully connected layer followed by a pair of losses, Softmax loss and Center loss.

As you remember, our base is divided into 4 parts by world region. We use these 4 parts within the curriculum learning paradigm: at each stage we take the current dataset, add the next part of the world to it, and obtain a new training dataset.

The model consists of three parts, and during training we use a separate learning rate for each of them. This is needed so that the network can learn the landmarks from the newly added part of the dataset without forgetting the data it has already learned. After many experiments, this approach proved the most effective.

So, we trained the model; now we need to understand how it works. Let's use class activation maps to see which parts of the image our neural network responds to most. In the picture below, the first row shows the input images, and the second row shows the same images with the class activation map of the network trained in the previous step superimposed on them.

The heat map shows which parts of the image the network pays most attention to. The class activation maps make it clear that our neural network has successfully learned the concept of a landmark.


Now we need to use this knowledge to get results. Since we trained with Center loss, it seems logical to compute centroids for landmarks at inference time as well.

To do this, we take some of the images from the training set for a given landmark, for example the Bronze Horseman, run them through the network, obtain the embeddings, average them, and get the centroid.

But how many centroids per landmark does it make sense to compute? At first the answer seems clear and logical: one. It turned out otherwise. At first we did compute one centroid per landmark and got a reasonably good result. So why take several?

First, our data is not entirely clean. Although we cleaned the dataset, we removed only the obvious garbage, and images could remain that do not look like rubbish but still worsen the result.

For example, take the landmark class Winter Palace, for which I want to compute a centroid. But the set happens to contain a number of photos of Palace Square and the arch of the General Staff Building. If the centroid is computed over all the images, it will not be very stable. We need to somehow cluster the embeddings obtained from an ordinary network, take only the cluster responsible for the Winter Palace, and compute the average from that data.

Second, photos can be taken from different angles.

As an illustration, take the Belfort bell tower in Bruges, for which two centroids were computed. The top row shows the photos that are closer to the first centroid, and the bottom row those that are closer to the second:

The first centroid is responsible for more "ceremonial" close-up photos taken from the Bruges market square, while the second is responsible for photographs taken from afar, from adjacent streets.

It turns out that by computing several centroids per landmark, we can represent at inference time the different angles of that landmark.

So how do we find the sets over which to compute the centroids? We apply complete-link clustering to the dataset of each landmark. It gives us valid clusters over which we compute the centroids. By valid clusters we mean those that contain at least 50 photos after clustering; the remaining clusters are discarded. As a result, about 20% of landmarks ended up with more than one centroid.
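A naive sketch of this step in numpy. The real pipeline presumably uses a library implementation of complete-link agglomerative clustering; the merge threshold is a parameter, and the 50-photo minimum is shrunk here for the toy example.

```python
import numpy as np

def complete_link_clusters(X, merge_threshold):
    """Agglomerative clustering with complete linkage: repeatedly merge the
    two clusters whose farthest pair of members is closest, while that
    distance stays below merge_threshold. Naive O(n^3) version for clarity."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > 1:
        best_d, best_pair = None, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = max(D[i, j] for i in clusters[a] for j in clusters[b])
                if best_d is None or d < best_d:
                    best_d, best_pair = d, (a, b)
        if best_d > merge_threshold:
            break
        a, b = best_pair
        clusters[a].extend(clusters[b])
        del clusters[b]
    return clusters

def landmark_centroids(embeddings, merge_threshold, min_size=50):
    """One centroid per valid cluster (at least min_size photos)."""
    return [embeddings[c].mean(axis=0)
            for c in complete_link_clusters(embeddings, merge_threshold)
            if len(c) >= min_size]

# Toy example: two viewpoints of one landmark, min_size shrunk to 2.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],   # "close-up" photos
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])  # "from afar" photos
cents = landmark_centroids(X, merge_threshold=1.0, min_size=2)
print(len(cents))  # 2 centroids, one per viewpoint
```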

Now for inference itself. It runs in two stages: first we pass the input image through our convolutional neural network and obtain its embedding, then we compare that embedding with the centroids using the dot product. If the image carries geodata, we restrict the search to centroids of landmarks located within a 1 x 1 km square around the shooting location. This makes the search more precise and allows a lower threshold for the subsequent comparison. If the resulting score exceeds the threshold, which is a parameter of the algorithm, we report that the photo contains the landmark with the maximum dot-product value; if not, we say it is not a landmark.
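The matching stage can be sketched as follows. Names and the mask-based geo filter are illustrative; a real implementation would derive the mask from the 1 x 1 km box around the shooting coordinates.

```python
import numpy as np

def recognize(query_embedding, centroids, landmark_ids, threshold,
              geo_mask=None):
    """Compare the query embedding with every landmark centroid by dot
    product. geo_mask, if given, restricts the search to centroids of
    landmarks near the shooting location. Returns the best landmark id,
    or None if no score clears the threshold."""
    scores = centroids @ query_embedding
    if geo_mask is not None:
        scores = np.where(geo_mask, scores, -np.inf)
    best = int(np.argmax(scores))
    return landmark_ids[best] if scores[best] > threshold else None

centroids = np.array([[1.0, 0.0], [0.0, 1.0]])
ids = ["Zwinger", "Bronze Horseman"]
q = np.array([0.9, 0.1])
print(recognize(q, centroids, ids, threshold=0.5))  # Zwinger
print(recognize(q, centroids, ids, threshold=0.5,
                geo_mask=np.array([False, True])))  # None: 0.1 < 0.5
```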

Suppose the photo does contain a landmark. If we have geodata, we use it and return the answer. If there is no geodata, we run an additional check. While cleaning the datasets, we collected a set of reference images for each landmark class. We can compute embeddings for them and then compute the average score between them and the query embedding. If it exceeds a certain threshold, verification passes, we attach the metadata, and return the result. Importantly, we can run this procedure for several landmarks found in the image.
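A sketch of this geodata-free verification step against the reference embeddings (again with made-up names; the threshold is a parameter, and the dot-product reading of "score" follows the matching stage above):

```python
import numpy as np

def verify(query_embedding, reference_embeddings, threshold):
    """Without geodata, a candidate landmark passes verification only if
    the average dot-product score between the query embedding and the
    landmark's reference embeddings clears the threshold."""
    scores = reference_embeddings @ query_embedding
    return float(scores.mean()) > threshold

refs = np.array([[1.0, 0.0], [0.9, 0.1]])  # reference embeddings of a landmark
print(verify(np.array([1.0, 0.0]), refs, threshold=0.8))  # True (mean 0.95)
print(verify(np.array([0.0, 1.0]), refs, threshold=0.8))  # False (mean 0.05)
```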

Test Results

We compared our model with DELF, taking the parameters with which DELF showed the best results on our test. The results were almost identical.

(Table: accuracy of our model and DELF on photos with landmarks and without them.)

Then we split the landmarks into two types: frequent (more than 100 photos), which make up 87% of all landmarks in the test, and rare. On the frequent ones our model works well: accuracy is 85.3%. On rare landmarks we got 46%, which is also very good: even with a small amount of data, our approach shows decent results.

(Table: accuracy on frequent and rare landmarks and their proportion of the total.)

Then we ran an A/B test on user photos. As a result, the conversion to buying extra space in Cloud grew by 10%, mobile app uninstalls fell by 3%, and the number of album views grew by 13%.

Let's compare the speed of our approach and DELF. On the GPU, DELF requires 7 passes through the network because it uses 7 image scales, while our approach uses just one. On the CPU, DELF additionally needs a long nearest-neighbor search and a very long geometric verification. As a result, our method was 15 times faster on the CPU. In both cases our approach wins in speed, which is extremely important in production.

Results: vacation impressions

At the beginning of the article I mentioned the problem of endlessly scrolling to find the right pictures with landmarks. Here is the solution.

This is my Cloud, with all the photos divided into albums: "People," "Objects," and "Attractions." Inside the latter, landmarks are divided into albums grouped by city. If you tap the Zwinger in Dresden, an album with photos of that landmark opens.

A very useful feature: you went on vacation, took a bunch of photos, and uploaded them to Cloud. When you want to post them to Instagram or share them with friends and family, you no longer have to search and select for ages: a few clicks and you have found the photos you need.


Let me recap the main points of our solution.

  1. Semi-automatic base cleaning. A little manual work for the initial labeling, and then the neural network copes on its own. This lets us quickly clean new data and re-train the model on it.
  2. We use deep convolutional neural networks and deep metric learning, which lets us effectively learn the structure within classes.
  3. As the training paradigm we used curriculum learning: training in stages. This approach helped us a lot. At inference we use several centroids, which let us rely on cleaner data and capture different angles of the landmarks.

Object recognition may seem a well-studied area, but by exploring the needs of real users we keep finding new and interesting tasks, such as landmark recognition. It lets us use machine learning to tell people something new about the world. That is very inspiring and motivating!
