Big data - big responsibility, big stress and big money


The term “big data” has been spoiled by the modern habit of wildly exaggerating anything new. Just as AI will supposedly enslave humanity and blockchain will build an ideal economy, big data will supposedly let you know everything about everyone and see the future.

But the reality, as always, is more boring and pragmatic. There is no magic in big data, just as there is none anywhere else. There is simply so much information, and so many connections between different data, that processing and analyzing it all with the old methods takes too long.

So new methods appear, and with them new professions.

What is “big data”

The question of what counts as big data is rather confusing. Even publications in scientific journals disagree. In some places millions of observations are considered “ordinary” data, while in others hundreds of thousands already count as big, because each observation has a thousand features. So data was conditionally divided into three groups (small, medium and big) by the simplest criterion: the volume it occupies.

Small data is a few gigabytes. Medium data is around a terabyte. Big data is around a petabyte. But this did not remove the confusion, so there is an even simpler criterion: whatever does not fit on one server is big data.

Small, medium and big data are handled on different principles. Big data is usually stored in a cluster, on several servers at once. Because of this, even simple operations become harder.

Take a simple task: finding the average value. With small data we simply add everything up and divide by the count. With big data we cannot gather all the information from all the servers in one place; that is too complicated. Often, instead of pulling the data to yourself, you have to send a separate program to each server. When these programs finish, they produce intermediate results, and the average is computed from those.
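The idea of computing an average from intermediate results can be illustrated with a minimal sketch. It assumes each server can return the sum and the count of its local values; the function names and the sample data are hypothetical, and the "servers" here are just Python lists.

```python
from typing import Iterable, Tuple


def local_summary(values: Iterable[float]) -> Tuple[float, int]:
    """Runs on one server: return (sum, count) of its local values."""
    total = 0.0
    count = 0
    for v in values:
        total += v
        count += 1
    return total, count


def combine(summaries: Iterable[Tuple[float, int]]) -> float:
    """Runs on the coordinator: merge per-server (sum, count) pairs."""
    total = 0.0
    count = 0
    for s, n in summaries:
        total += s
        count += n
    if count == 0:
        raise ValueError("no data on any server")
    return total / count


# Hypothetical data, split unevenly across three "servers".
servers = [[1.0, 2.0, 3.0], [10.0], [4.0, 6.0]]

summaries = [local_summary(chunk) for chunk in servers]
global_mean = combine(summaries)

# Averaging the per-server averages would be wrong when servers
# hold different amounts of data.
mean_of_means = sum(sum(c) / len(c) for c in servers) / len(servers)
```

Only two numbers per server travel over the network, regardless of how much data each server holds. The sketch also shows why the intermediate result has to be a sum and a count rather than a local average.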

Sergey Shirkin

Which companies are involved in big data
Mobile operators and search engines were the first to work with big data. Search engines received more and more queries, and text is heavier than numbers: processing a paragraph of text takes longer than processing a financial transaction. The user expects the search engine to answer in a fraction of a second; it is unacceptable for a query to take even half a minute. So search engines were the first to parallelize their data processing.

A little later, financial organizations and retailers joined in. Their transactions are not large individually, but big data arises because there are so many of them.

The amount of data is growing everywhere. Banks, for example, had a lot of data before, but it did not always require the working principles used for big data. Then banks began to work more with customer data. They started designing more flexible deposits, loans and tariffs, and analyzing transactions more closely. This required fast ways of working.

Now banks want to analyze not only internal information but also third-party information. They want data from those same retailers, to know what a person spends money on. Based on this information they try to make commercial offers.

Now all information is interconnected. Retailers, banks, telecom operators and even search engines are all interested in each other’s data.

What a big data specialist should be

Since the data is spread across a cluster of servers, working with it requires more complex infrastructure. This places a heavy burden on the person who works with it: the system must be very reliable.

Making a single server reliable is easy. But when there are several, the probability of a failure grows in proportion to their number, and so does the responsibility of the data engineer who works with this data.
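A rough way to see this growth is the sketch below. It assumes, purely for illustration, that servers fail independently and with the same probability over some period; the function name and the numbers are hypothetical. For small probabilities the result is close to the number of servers multiplied by the single-server probability, which matches the "in proportion to the number" intuition.

```python
def p_any_failure(p_single: float, n_servers: int) -> float:
    """Probability that at least one of n independent servers fails,
    given that each fails with probability p_single."""
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if n_servers < 0:
        raise ValueError("n_servers must be non-negative")
    return 1.0 - (1.0 - p_single) ** n_servers


# Assumed failure probability of one server over some period.
p = 0.001

for n in (1, 10, 100, 1000):
    print(n, round(p_any_failure(p, n), 4), "linear estimate:", round(n * p, 4))
```

With these assumed numbers, a failure that is rare on one server becomes more likely than not somewhere in a cluster of a thousand, which is why the linear estimate stops being accurate for large clusters.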

The analyst must understand that the data he receives may always be incomplete or even incorrect. He writes a program and trusts its results, and then finds out that because one server out of a thousand crashed, part of the data was missing, and all the conclusions were wrong.

Take text search, for example. Suppose all words are distributed alphabetically across several servers (to put it very simply and loosely). One of the servers goes offline, and all the words starting with the letter “C” disappear. Search stops returning the word “cinema”. All movie news vanishes, and the analyst draws the false conclusion that people are no longer interested in movie theaters.
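One hedged way to guard against this kind of mistake is to return, together with the result, a flag saying whether the data behind it was complete. The sketch below assumes a toy index sharded by the first letter of a word; the shard layout, the names and the document IDs are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical shard: word -> list of document IDs containing it.
Shard = Dict[str, List[int]]


def search(word: str, shards: Dict[str, Optional[Shard]]) -> Tuple[List[int], bool]:
    """Return (document_ids, complete).

    `complete` is False when the shard responsible for the word is
    missing or marked as unavailable (None), so an empty result can be
    told apart from "no data was reachable".
    """
    key = word[0].lower()
    shard = shards.get(key)
    if shard is None:
        return [], False
    return shard.get(word.lower(), []), True


# The shard for words starting with "c" is down.
shards: Dict[str, Optional[Shard]] = {
    "c": None,
    "m": {"music": [1, 2]},
}

cinema_hits, cinema_complete = search("cinema", shards)
music_hits, music_complete = search("music", shards)
```

Here the search for "cinema" returns no documents, but the flag shows that the answer is incomplete, so the absence of results should not be read as a lack of interest.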

Therefore, a big data specialist should know how things work from the lowest levels (servers, ecosystems, task schedulers) to the highest-level programs (machine learning libraries, statistical analysis and others). He must understand how the hardware works and everything that is configured on top of it.

Beyond that, you need to know everything you would need for small data. You need mathematics, you need to be able to program, and you especially need a good knowledge of distributed computing algorithms and the ability to apply them to the usual principles of data processing and machine learning.

What tools are used

Since data is stored on a cluster, special infrastructure is needed to work with it. The most popular ecosystem is Hadoop. Many different systems can run within it: special libraries, schedulers, machine learning tools and much more. But first of all, it is needed to cope with large volumes of data through distributed computing.

For example, we are looking for the most popular tweet in data split across a thousand servers. On one server we would simply build a table and be done. Here we could pull all the data to ourselves and count. But that is the wrong approach, because it takes far too long.

That is why there is Hadoop with the MapReduce paradigm, and the Spark framework. Instead of pulling the data to themselves, they send pieces of the program to the data. The work runs in parallel, in a thousand threads. You then get partial results from the thousand servers, from which the most popular tweet can be chosen.
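The map and reduce steps for the tweet example can be imitated in plain Python. This is only a sketch of the paradigm, not Hadoop or Spark code: it assumes that each server stores a log of tweet IDs (one entry per like or retweet), and the data and function names are hypothetical.

```python
from collections import Counter
from functools import reduce
from typing import Iterable, List


def map_count(local_events: Iterable[str]) -> Counter:
    """Map step, runs next to the data: count events per tweet ID."""
    return Counter(local_events)


def reduce_counts(a: Counter, b: Counter) -> Counter:
    """Reduce step: merge two partial counts into one."""
    return a + b


# Hypothetical event logs on three "servers".
servers: List[List[str]] = [
    ["t1", "t2", "t2"],
    ["t2", "t3", "t3", "t3"],
    ["t3", "t1"],
]

partials = [map_count(events) for events in servers]  # parallel in a real cluster
totals = reduce(reduce_counts, partials, Counter())
top = totals.most_common(1)[0]  # (tweet_id, total_count)
```

Each server sends back partial counts rather than only its local winner. A tweet can be second on every server and still be first overall, so the final choice has to be made on merged counts.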

MapReduce is the older paradigm, Spark is newer. Spark is used to retrieve data from clusters and to build machine learning models.

What professions are in big data?

The two main professions are analyst and data engineer.

The analyst works primarily with information. He is interested in tabular data and works with models. His responsibilities include aggregating, cleaning, enriching and visualizing data. In other words, the analyst is the link between raw information and the business.

The analyst's work has two main directions. In the first, he transforms the information he receives, draws conclusions and presents them in an understandable form.

In the second, analysts develop applications that run and produce results automatically, for example a daily forecast for the securities market.

The data engineer is a lower-level specialty. This is the person who must ensure the storage, processing and delivery of information to the analyst. Where delivery and cleaning are concerned, their duties may overlap.

The data engineer gets all the dirty work. If the system fails, or one of the servers drops out of the cluster, he steps in. This is a very responsible and stressful job. The system can go down on weekends and outside working hours, and the engineer must act immediately.

These are the two main professions, but there are others. They appear when parallel computing algorithms are added to tasks related to artificial intelligence. One example is the NLP engineer: a programmer who works on natural language processing, especially in cases where it is necessary not just to find words but to grasp the meaning of a text. Such engineers write programs for chatbots and dialogue systems, voice assistants and automated call centers.

There are situations where billions of images must be classified and moderated, the unnecessary ones weeded out and similar ones found. These professions overlap more with computer vision.

You can watch the most
According to the My Circle salary calculator, the average salary of specialists whose professions are associated with big data is 139,400 rubles. A quarter of specialists earn more than 176,000 rubles, and one tenth earn more than 200,000 rubles.

How to prepare for interviews

There is no need to go deep into just one area. Interviews include questions on statistics, machine learning and programming. You may be asked about data structures, algorithms and real-life cases: a server went down, an incident occurred, how do you fix it? There may also be questions about the subject area, something closer to the business.

And if a person has gone too deep into mathematics alone and fails a simple programming task at the interview, the chances of being hired drop. It is better to have an average level in every area than to shine in one and fail completely in another.

There is a list of questions that are asked at 80 percent of interviews. If the topic is machine learning, you will certainly be asked about gradient descent. If it is statistics, you will need to talk about correlation and hypothesis testing. In programming you will most likely get a small problem of medium difficulty. And problems are easy to get practiced at: just solve more of them.
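As a reminder of what a gradient descent question usually boils down to, here is a minimal sketch for a function of one variable. The function being minimized, the learning rate and the number of steps are arbitrary choices made for illustration, not something prescribed by interviewers.

```python
from typing import Callable


def gradient_descent(
    grad: Callable[[float], float],
    x0: float,
    learning_rate: float = 0.1,
    steps: int = 100,
) -> float:
    """Repeatedly step against the gradient, starting from x0."""
    x = x0
    for _ in range(steps):
        x -= learning_rate * grad(x)
    return x


# Example: f(x) = (x - 3)^2 has gradient 2 * (x - 3) and its minimum at x = 3.
x_min = gradient_descent(lambda x: 2.0 * (x - 3.0), x0=0.0)
```

On this function each step shrinks the distance to the minimum by a constant factor. With a learning rate that is too large, the same loop would overshoot and diverge, which is a typical follow-up question.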

Where to gain experience yourself

You can brush up on Python at Python, and on working with databases at SQL-EX. It offers problems on which you learn to write queries in practice.

For higher mathematics there is Mathprofi. There you can find clear material on mathematical analysis, statistics and linear algebra. And if your school-level mathematics is weak, there is a site for that too.

Distributed computing can only be learned in practice. First, it requires infrastructure, and second, algorithms can quickly become outdated. Something new appears all the time.

What are the trends the community is discussing

One more direction is gradually gaining momentum and may lead to rapid growth in the amount of data: the Internet of Things (IoT). This kind of data comes from the sensors of devices connected to the network, and the number of sensors is expected to reach tens of billions by the beginning of the next decade.

The devices vary widely, from household appliances to vehicles and industrial machines. The continuous flow of information from them will require additional infrastructure and a large number of highly qualified specialists. This means that in the near future there will be an acute shortage of data engineers and big data analysts.
