Developers go crazy over the strangest things. We all like to think of ourselves as hyper-rational creatures, but when it comes to choosing a technology, we fall into a kind of madness: we jump from a HackerNews comment to some blog post until, as if in a trance, we float helplessly toward the brightest light source and bow down before it, completely forgetting what we were looking for in the first place.
That is not how rational people make decisions. But it is exactly how developers decide to use, for example, MapReduce.
As Joe Hellerstein noted in his undergraduate databases lecture (at the 54-minute mark):
The thing is, there are maybe five companies in the world that run jobs at that scale. As for everyone else... they are spending incredible resources on a level of fault tolerance that they really do not need. People caught a kind of Google mania in the 2000s: "we will do everything exactly the way Google does, because we also run the world's largest data processing service..." [ironically shakes his head and waits for laughter from the audience]
How many floors does your data center have? Google decided to stop at four, at least in this particular data center in Mayes County, Oklahoma.
Yes, your system may be more fault-tolerant than you need, but think about what that costs you. It is not just about processing large volumes of data: you are probably trading a complete system, with transactions, indexes, and query optimization, for something comparatively weak. That is a significant step backwards. How many Hadoop users take that step consciously? How many of them make a genuinely informed decision?
MapReduce/Hadoop is an easy target. Even the cargo cultists have realized by now that the planes will not solve all their problems. Still, MapReduce illustrates an important generalization: if you use a technology created for a large corporation while solving small problems, you may be acting rashly. More likely, you are guided by the mystical belief that by imitating giants like Google and Amazon you will reach the same heights.
Yes, this article is yet another polemic against the cargo cult. But wait: I have a useful checklist for you, one you can use to make better decisions.
Next time you catch yourself googling some cool new technology to (re)architect your system around, I urge you to stop and apply the UNPHAT framework:
1. Don't even begin considering solutions until you Understand the problem.
2. eNumerate multiple candidate solutions. Don't just start prodding at your favorite!
3. Consider a candidate solution, then read the Paper if there is one.
4. Determine the Historical context in which the candidate solution was designed.
5. Weigh Advantages against disadvantages. Determine what was de-prioritized to achieve what was prioritized.
6. Think! Soberly consider how well this solution fits your problem.
Applying UNPHAT is easy. Consider my recent conversation with a company that had rushed to adopt Cassandra for a read-heavy workflow over data loaded nightly.
Since I had already read the Dynamo paper and knew that Cassandra is a derivative system, I understood that these databases prioritize write availability (Amazon needed the "add to cart" action to never fail). I also appreciated that the designers had sacrificed consistency, and indeed basically every feature of a traditional RDBMS, to achieve this. But for the company I was talking to, write availability was not a priority at all. In fact, their usage pattern amounted to one big write per day.
Amazon sells a lot of things. If "add to cart" suddenly stopped working, they would lose a LOT of money. Is your problem of the same order?
This company was considering Cassandra because the PostgreSQL query in question took a few minutes to run, and they assumed they were hitting a hardware limit. After a few clarifying questions, we determined that the table held about 50 million rows of roughly 80 bytes each. Reading it in full from an SSD would take about 5 seconds, even if the query had to scan the whole thing. That is slow, but still two orders of magnitude faster than the actual query.
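That estimate is easy to reproduce as a back-of-envelope calculation. A minimal sketch, assuming a sequential read speed of about 800 MB/s (an illustrative figure for a SATA-era SSD; the article only gives the ~5-second result):

```python
# Back-of-envelope check of the full-table-scan estimate.
# The ~800 MB/s throughput is an assumed figure, not from the article.
rows = 50_000_000
row_bytes = 80
ssd_mb_per_s = 800  # assumed sequential SSD read speed

table_mb = rows * row_bytes / 1_000_000   # ~4,000 MB, i.e. about 4 GB
scan_seconds = table_mb / ssd_mb_per_s    # ~5 seconds

print(f"table: {table_mb:.0f} MB, full scan: ~{scan_seconds:.0f} s")
```

A few minutes for a query against 4 GB of data is a schema or tuning problem, not a capacity problem.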
At this point I had plenty of questions (U = Understand the problem!) and had begun weighing about five different strategies for the underlying problem (N = eNumerate several candidate solutions!), but in any case it was already quite clear that Cassandra was completely the wrong choice. All they needed was some patient tuning, probably a redesign of the schema, and perhaps (though unlikely) a different technology... but definitely not the write-heavy key-value store that Amazon built for its shopping cart!
I was quite surprised to discover that one student startup had decided to build its architecture around Kafka. It was astonishing. As far as I could tell, their business handled only a few dozen very large transactions per day, perhaps a few hundred on the best days. At that throughput, the primary datastore could have been handwritten entries in an ordinary notebook.
For comparison, recall that Kafka was created to handle all the analytics events at LinkedIn, which is a staggering volume of data: even a couple of years ago it was about 1 trillion events daily, with peak loads of 10 million messages per second. Of course, I understand that Kafka works fine at lower volumes too, but 10 orders of magnitude lower?
The sun, a genuinely massive object, is only 6 orders of magnitude heavier than the Earth.
Maybe the developers made a deliberate decision based on expected needs and a sound understanding of what Kafka is for. But I suspect they were feeding off the community's (generally justified) enthusiasm for Kafka and hardly stopped to ask whether it was really the tool they needed. Just think... 10 orders of magnitude!
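The "10 orders of magnitude" figure is a quick logarithm away. A small sketch, using the trillion-events-per-day figure from the article and the startup's stated "a few hundred on the best days":

```python
import math

# Throughput gap between LinkedIn's Kafka deployment and the startup.
linkedin_events_per_day = 1_000_000_000_000  # ~1 trillion (from the article)
startup_ops_per_day = 100                    # "a few hundred on the best days"

gap = math.log10(linkedin_events_per_day / startup_ops_per_day)
print(f"throughput gap: {gap:.0f} orders of magnitude")    # 10

# For scale: the sun/earth mass ratio (~333,000) spans far fewer orders.
print(f"sun vs. earth: {math.log10(333_000):.1f} orders")  # ~5.5
```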
Even more popular than Amazon's distributed datastore is the architectural approach that gave Amazon its scalability: service-oriented architecture. As Werner Vogels noted in a 2006 interview with Jim Gray, Amazon realized back in 2001 that it was struggling to scale its front end, and that a service-oriented architecture could help. The idea spread from one developer to another, until startups with only a couple of engineers and almost no customers began splitting their software into nanoscale services.
By the time Amazon decided to move to SOA (service-oriented architecture), it had about 7,800 employees and over $3 billion in sales.
The Bill Graham Civic Auditorium in San Francisco seats 7,000 people. Amazon had about 7,800 employees when it moved to SOA.
That does not mean you should wait until your company reaches 7,800 employees before adopting SOA... just always think for yourself. Is it really the best solution for your problem? What exactly is your problem, and what other ways are there to solve it?
But if you tell me that your 50-developer organization would simply grind to a halt without SOA, then I have to wonder why so many larger companies function perfectly well with a single, well-organized application.
The use of systems for processing high-volume data streams (Hadoop or Spark) can be genuinely puzzling. Very often a traditional DBMS is better suited to the actual load, and sometimes the data is so small that it would fit in memory. Did you know that you can buy 1 TB of RAM for around $10,000? Even if you had a billion users, that would give each of them 1 KB of RAM.
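The per-user figure above is simple division, worth sanity-checking:

```python
# The claim above: split 1 TB of RAM across a billion users.
ram_bytes = 10**12          # 1 TB (decimal)
users = 1_000_000_000

per_user = ram_bytes // users
print(f"{per_user} bytes per user")  # 1000 bytes, i.e. about 1 KB each
```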
Perhaps that will not be enough for your workload, and you will need to read from and write to disk. But do you really need several thousand disks? How much data do you actually have? GFS and MapReduce were created to solve Internet-scale computing problems, such as rebuilding the search index for the entire web.
Hard drive prices are now much lower than in 2003, when the GFS paper was published.
Maybe you have read the GFS and MapReduce papers and noticed that for Google the problem was not capacity but throughput: they distributed storage because shuttling bytes off disks took too long. But what will be the throughput of the devices you will be using this year? Given that you need far fewer of them than Google did, maybe you can just buy better drives? What would it cost you to use SSDs?
Maybe you expect to scale. But have you done the math? Will you accumulate data faster than SSD prices fall? How much would your business have to grow before your data no longer fits on a single machine? As of 2016, Stack Exchange handled 200 million requests per day with just four SQL servers: a primary for Stack Overflow, a primary for everything else, and two replicas.
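To put the Stack Exchange number into a more familiar unit, here is the implied average rate (peaks would of course be higher):

```python
# Average load implied by the 2016 Stack Exchange figure above.
requests_per_day = 200_000_000
seconds_per_day = 24 * 60 * 60

avg_rps = requests_per_day / seconds_per_day
print(f"~{avg_rps:.0f} requests/second across 4 SQL servers")
```

Roughly 2,300 requests per second on average, handled by four well-tuned SQL servers, not a fleet.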
Again, you can go through UNPHAT and still decide to use Hadoop or Spark, and the decision may even be right. What matters is that you actually use the appropriate technology for your problem. Google knows this well: when they decided that MapReduce was no longer suitable for building the index, they stopped using it.
I admit that my message is nothing new, but perhaps this is the version that will resonate with you, or perhaps UNPHAT is just memorable enough for you to apply it in real life. If not, you can watch Rich Hickey's talk "Hammock Driven Development", read Polya's book "How to Solve It", or take Hamming's course "The Art of Doing Science and Engineering". Because the main thing we are all asking of you is to think!
And really understand the problem you are trying to solve. In the inspiring words of Polya:
"It is foolish to answer a question that you do not understand. It is sad to work for an end that you do not desire."
Source: "You Are Not Google" (translation)