Data lakes: how data lakes work and why they are needed

Lakes, marts and warehouses

Imagine that a company has access to an inexhaustible information resource: diving into it, analysts regularly surface valuable business insights and launch new, better products. Data lakes work roughly on this principle. A data lake is a relatively new type of data architecture that lets you collect raw, disparate information from different sources in one place and then put it to effective use. Giants such as Oracle, Amazon and Microsoft were the first to experiment with the technology, and they also developed convenient services for building lakes.

The term data lake was coined by James Dixon, founder of the Pentaho platform. He contrasted data marts with data lakes: the former are like bottled water that has been purified, filtered and packaged, while lakes are open bodies of water fed from different sources. You can dive into them, or you can take samples from the surface. Similarly, data warehouses perform specific tasks and serve specific interests, whereas a lake can benefit many players if used wisely.

It would seem that such a flow of information only complicates analysts' work: the information is unstructured, and there is too much of it. But if the company knows how to work with data and extract value from it, the lake does not turn into a swamp.

Extracting data from the "silo"

Still, what do data lakes give companies? Their main advantage is abundance. The repository receives information from different teams and departments that are usually unrelated to each other. Take an online school, for example. Each department keeps its own statistics and pursues its own goals: one team monitors user retention metrics, a second studies the customer journey of new customers, and a third collects information about graduates. No one has access to the full picture. But if you accumulate this disparate information in a single repository, you can find interesting patterns. For example, it may turn out that users who sign up for design courses and watch at least two webinars are more likely than others to reach the end of the program and build a successful career in the market. This information helps the company retain students and create a more compelling product.
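The online school example above boils down to a simple join of two departments' datasets. The user IDs, webinar counts and the two-webinar threshold below are all hypothetical, made up purely for illustration:

```python
# Two departments' datasets, joined by user id to surface a pattern.
# All names and numbers here are illustrative, not real school data.

webinars = {1: 0, 2: 3, 3: 2, 4: 1, 5: 4}   # marketing: webinars watched per user
completed = {1: False, 2: True, 3: True,    # education: did the user finish?
             4: False, 5: True}

def completion_rate(user_ids):
    """Share of the given users who completed the program."""
    users = list(user_ids)
    return sum(completed[u] for u in users) / len(users)

# Split users by engagement and compare completion rates.
engaged = [u for u in webinars if webinars[u] >= 2]
casual = [u for u in webinars if webinars[u] < 2]

print(completion_rate(engaged))  # 1.0 in this toy dataset
print(completion_rate(casual))   # 0.0 in this toy dataset
```

In a real lake the same join would run over raw exports from each department's systems, but the principle is identical: the pattern emerges only once the datasets sit side by side.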

Often such unexpected patterns are found by accident: a data lake lets analysts experimentally "cross" different streams of information and find parallels they would hardly have spotted in other circumstances.

Data sources can be anything: an online school will have statistics from different promotion channels; a factory will have IoT sensor readings, machine-tool usage schedules and equipment wear rates; a marketplace will have stock availability, sales statistics and data on the most popular payment methods. Lakes simply help to collect and study arrays of information that usually never intersect and attract the attention of different departments.

Another plus of data lakes is that they extract data from disparate repositories and closed subsystems. Information is often stored in a kind of information "silo" that only one department can access, and transferring materials out of it is difficult or impossible because of the many restrictions. Lakes solve this problem.

So, there are at least eight advantages of data lakes:

  • They help data analysts gain valuable insights.
  • They allow the company to make quick decisions based on statistics and facts.
  • They allow you to experiment with different types of data from different sources.
  • They make the analytics process more democratic and remove barriers between departments.
  • They provide a high level of data centralization and granularity, which helps you find a "needle in a haystack".
  • They suit companies of all sizes: at an early stage, you can start with mini-lakes and gradually build up volume.
  • They simplify business processes, for example by allowing cross-domain queries and complex product reporting.
  • They are cheaper than warehouses because the data does not need to be pre-processed.
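The last point, storing data without pre-processing, is usually called schema-on-read: the lake keeps raw files as-is, and structure is imposed only when a query runs. Here is a minimal sketch of the idea, with an in-memory dict standing in for object storage and invented paths and field names:

```python
import json

# The "lake": raw payloads keyed by path, standing in for object storage.
raw_lake = {}

def ingest(path, payload):
    # Ingestion does no cleaning and enforces no schema.
    raw_lake[path] = payload

def total_amount(path):
    # Schema-on-read: structure is imposed only at query time.
    total = 0
    for line in raw_lake[path].splitlines():
        record = json.loads(line)
        total += record.get("amount", 0)  # tolerate heterogeneous records
    return total

ingest("sales/2024.jsonl", '{"amount": 10}\n{"amount": 5, "region": "TX"}')
print(total_amount("sales/2024.jsonl"))  # 15
```

A warehouse, by contrast, would validate and transform each record on the way in (schema-on-write), which is exactly the pre-processing cost the last bullet refers to.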

Lakes are needed primarily by large, distributed teams. Amazon is a classic example. The corporation had accumulated data from thousands of different sources: financial transactions alone were stored in 25 databases, each arranged and organized differently. This created confusion and inconvenience. A lake helped collect all the materials in one place and establish a unified data protection system. Now professionals (data and business analysts, developers and CTOs) could take the components they needed and process them with different tools and technologies. And machine learning has helped Amazon's analysts make highly accurate predictions: they now know how many boxes of a certain size will be required for parcels in Texas in November.

Four steps to data lakes

But data lakes also have disadvantages. First of all, they require additional resources and a high level of expertise: only highly qualified analysts can truly benefit from them. You will also need additional Business Intelligence tools to help turn your insights into a coherent strategy.

Another problem is relying on third-party systems to maintain data lakes: the company then depends on the provider, and a system crash or data leak can lead to large financial losses. The main problem with lakes, however, is the hype around the technology. Companies often adopt the format to follow fashion without knowing why they actually need it. As a result, they spend large sums but see no return on investment. Experts therefore advise determining, already at the preparation stage, what business tasks the lake will solve.

McKinsey experts identify four stages of creating data lakes:

  1. Creating a platform for collecting raw data. At this stage, it is important to learn how to retrieve and store information.
  2. Developing the platform and running the first experiments. Data analysts start analyzing the data and building analytical prototypes.
  3. Tight integration with data warehouses. More and more data sets flow into the lake, and navigation becomes simpler.
  4. The data lake becomes the key architecture. New application scenarios develop, add-ons and services with user-friendly interfaces appear, and the company starts using the Data-as-a-Service business model.

Analytical algorithms

There is nothing fundamentally new in accumulating data as such, but thanks to the development of cloud systems and open-source platforms, and to the general growth of computing power, even startups can work with lake architecture today.

Another industry driver has been machine learning: the technology simplifies analysts' work somewhat and gives them more tools for post-processing. Where a specialist would previously have drowned in files, summaries and tables, they can now "feed" them to an algorithm and build an analytical model faster.

Using data lakes in combination with AI helps not just analyze statistics centrally, but also track trends across the company's entire history. For example, one American college collected information about applicants over the past 60 years: the number of new students, employment figures and the general economic situation in the country. As a result, the university adjusted its curriculum so that students graduate rather than drop out halfway through.
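At its core, the college example is a trend fit over historical data. A hand-rolled least-squares sketch, with invented years and enrollment figures:

```python
# Toy trend model over historical data; all years and counts are made up.
years = [2019, 2020, 2021, 2022, 2023]
enrollment = [400, 420, 445, 460, 480]

# Ordinary least squares: enrollment ~ slope * year + intercept.
n = len(years)
mean_x = sum(years) / n
mean_y = sum(enrollment) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(years, enrollment))
         / sum((x - mean_x) ** 2 for x in years))
intercept = mean_y - slope * mean_x

# Extend the fitted trend one year ahead.
forecast_2024 = slope * 2024 + intercept
print(round(forecast_2024))  # 501
```

Real lake-backed models are of course richer, folding in seasonality and external indicators like the employment data the college used, but the workflow is the same: long histories collected in one place make such fits possible.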

What other business tasks can data lakes solve:

  • Allocate resources efficiently to avoid stockouts during periods of peak demand.
  • Build more accurate forecasts and predict trends, and launch innovative products ahead of competitors.
  • Segment your audience and identify the interests of even the most niche groups.
  • Build more detailed and accurate reports that will help improve metrics and increase productivity.
  • More efficiently customize promotion algorithms and recommendation systems.
  • Save resources in production or in the laboratory, even in an operation as complex as CERN.

However, lakes are used not only in business. For example, at the beginning of the pandemic, AWS collected information about COVID-19 in a single repository: research data, articles and statistical summaries. The information was regularly updated, and access to it was free; you paid only for analytics tools.

Data lakes cannot be considered a universal tool or a panacea, but in an era when data is called the new oil, it is important for companies to look for different ways to research and apply big data. The main task is to centralize and consolidate disparate information. In the era of microservices and distributed teams, one department often does not know what another is working on. Because of this, the business wastes resources, and different specialists perform the same tasks, often without realizing it. This ultimately reduces efficiency and overloads the company's "operating system". Surveys show that most companies invest in data lakes to improve operational efficiency, but the results exceed expectations: early adopters grow revenue and profit faster than laggards and, most importantly, bring new products and services to market faster.
