Some Thoughts on Building an Enterprise Data Platform

2022-07-28 14:23:00 InfoQ

(This article was first published at https://brightliao.com/2021/01/21/some-thoughts-about-data-platform/)

I have recently worked with many customers in traditional industries, and they all want to build their own data platform. Their goal is mostly to draw on the mature technical experience accumulated by Internet companies, build their own data capabilities, and ultimately become data-driven enterprises.
Data platforms are no longer new; now that everyone is talking about the data middle platform, the term even seems a little dated. In my view, however, the industry's understanding of the data middle platform is still exploratory, and no convincing consensus has emerged. When it comes to data platforms, by contrast, we are relatively clear about what one should contain. This article therefore uses “data platform”, a term that may be slightly dated but is relatively pragmatic, to organize its content. I do not want to dwell on concepts here: whatever the thing is called, the data problems it is meant to solve in the enterprise are clear and specific.

What is an enterprise data platform

What data problems do enterprises want to solve when they build a data platform? Let's start by looking at how data delivers its value.

How to apply data in enterprises

Data inside an enterprise is generally used in two ways.
The first is to support BI analysis, that is, the data reporting applications we usually talk about, including roll-up and drill-down analysis. The second is to support self-service exploratory data analysis, which usually includes statistical analysis and modeling.
Either way, the first requirement is to have the data. The usual approach is to collect data from the business systems into a data store dedicated to analysis and then run the computations in that system.
For reporting applications, we usually need to compute data indicators periodically and store them as database tables so that the BI system can query them quickly.
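The indicator calculation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the standard-library sqlite3 module stands in for the analytical data store, and the table names `orders` and `daily_order_totals`, their columns, and the sample rows are hypothetical rather than taken from any real system.

```python
import sqlite3


def build_demo_connection() -> sqlite3.Connection:
    """Create an in-memory store with a hypothetical raw `orders` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (order_id INTEGER, order_date TEXT, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, "2021-01-20", 10.0), (2, "2021-01-20", 15.5), (3, "2021-01-21", 7.0)],
    )
    return conn


def compute_daily_order_totals(conn: sqlite3.Connection) -> None:
    """Rebuild the indicator table from the raw rows.

    A scheduler would call this periodically; the BI tool then reads the
    small precomputed table instead of scanning the raw data.
    """
    conn.execute("DROP TABLE IF EXISTS daily_order_totals")
    conn.execute(
        """
        CREATE TABLE daily_order_totals AS
        SELECT order_date,
               COUNT(*)    AS order_count,
               SUM(amount) AS total_amount
        FROM orders
        GROUP BY order_date
        """
    )
    conn.commit()


conn = build_demo_connection()
compute_daily_order_totals(conn)
rows = conn.execute(
    "SELECT order_date, order_count, total_amount "
    "FROM daily_order_totals ORDER BY order_date"
).fetchall()
print(rows)
```

Rebuilding the whole table on each run keeps the sketch simple; a real platform would more likely recompute only the affected dates.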
For exploratory data analysis, we usually need to provide a programmable interface that data analysts can use for data processing and analysis. Given the technical background of data analysts, this interface is usually SQL and Python.
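To make the SQL-plus-Python style of exploratory analysis concrete, here is a small sketch: SQL selects the data, and Python computes summary statistics over it. It assumes an in-memory sqlite3 database in place of the platform's query interface; the `orders` table, the `min_amount` filter, and the sample values are hypothetical.

```python
import sqlite3
import statistics
from typing import Dict, List


def load_amounts(conn: sqlite3.Connection, min_amount: float) -> List[float]:
    """Use SQL to pull the order amounts at or above a threshold."""
    cursor = conn.execute(
        "SELECT amount FROM orders WHERE amount >= ? ORDER BY amount",
        (min_amount,),
    )
    return [row[0] for row in cursor.fetchall()]


def summarize(values: List[float]) -> Dict[str, float]:
    """Use Python to compute simple descriptive statistics."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }


# Hypothetical sample data standing in for a table on the data platform.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
demo_conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 5.0), (2, 10.0), (3, 20.0), (4, 30.0), (5, 100.0)],
)

summary = summarize(load_amounts(demo_conn, min_amount=10.0))
print(summary)
```

The split mirrors how analysts tend to work: filtering and joining stay in SQL, while statistics and modeling happen in Python.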

Data platform functions

Having understood how data is applied, we can sort out the functions of a data platform.
  • Data ingestion support: an interface that lets business teams quickly bring their data onto the data platform.
  • Data development support: an interface for data computation, so that indicators can be calculated and provided to BI tools.
  • Task scheduling support: the computation programs above must be scheduled periodically so that indicators are recalculated on a regular basis.
  • Exploratory data analysis support: SQL and Python interfaces for data analysts.
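The task scheduling function in the list above can be illustrated with a minimal sketch: each job has an interval, and the scheduler runs whichever jobs are due at a given time. The `ScheduledJob` class and `run_due_jobs` function are hypothetical names invented for this example; a real platform would normally rely on an existing workflow scheduler rather than code like this.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, List, Optional


@dataclass
class ScheduledJob:
    """A computation that should run once per `interval`."""
    name: str
    interval: timedelta
    action: Callable[[], None]
    last_run: Optional[datetime] = None

    def is_due(self, now: datetime) -> bool:
        # A job that has never run is always due.
        return self.last_run is None or now - self.last_run >= self.interval


def run_due_jobs(jobs: List[ScheduledJob], now: datetime) -> List[str]:
    """Run every job that is due at `now` and return the names that ran."""
    executed = []
    for job in jobs:
        if job.is_due(now):
            job.action()
            job.last_run = now
            executed.append(job.name)
    return executed


# Hypothetical jobs: a daily indicator calculation and an hourly data sync.
demo_jobs = [
    ScheduledJob("daily_indicators", timedelta(days=1), lambda: None),
    ScheduledJob("hourly_sync", timedelta(hours=1), lambda: None),
]
start = datetime(2021, 1, 21)
print(run_due_jobs(demo_jobs, start))
print(run_due_jobs(demo_jobs, start + timedelta(hours=2)))
```

Passing `now` in explicitly, instead of reading the clock inside the function, keeps the scheduling logic easy to test.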
To make data easier to organize and manage, the data platform also needs the following capabilities:
  • Data security management. One example is data permission control: data cannot be read or written without permission, and the process of applying for permission is simple. Another example is data masking.
  • Data quality management. For example, data standards can be looked up easily, and data can be checked against those standards.
  • Data discovery support, so that platform users can quickly find and understand data. This includes a data catalog, metadata management, data lineage management, and other data management functions.
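As a rough illustration of the data security item in the list above, the following sketch combines a permission check with masking of a sensitive column. Everything in it is an assumption made for the example: the grant table layout, the `read` and `read_raw` action names, the `customers` table, and the masking rule (keep the first three and last four characters) are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical grant table: (user, table) -> set of allowed actions.
Grants = Dict[Tuple[str, str], Set[str]]


def is_allowed(grants: Grants, user: str, table: str, action: str) -> bool:
    """Return True if `user` holds `action` on `table`."""
    return action in grants.get((user, table), set())


def mask_phone(phone: str) -> str:
    """Keep the first 3 and last 4 characters and replace the rest with '*'."""
    if len(phone) <= 7:
        return "*" * len(phone)
    return phone[:3] + "*" * (len(phone) - 7) + phone[-4:]


def read_customers(grants: Grants, user: str, rows: List[dict]) -> List[dict]:
    """Return customer rows, masking phone numbers unless raw access is granted."""
    if not is_allowed(grants, user, "customers", "read"):
        raise PermissionError(f"{user} may not read customers")
    see_raw = is_allowed(grants, user, "customers", "read_raw")
    return [
        {**row, "phone": row["phone"] if see_raw else mask_phone(row["phone"])}
        for row in rows
    ]


GRANTS: Grants = {
    ("analyst", "customers"): {"read"},
    ("auditor", "customers"): {"read", "read_raw"},
}
CUSTOMERS = [{"name": "Li", "phone": "13812345678"}]

print(read_customers(GRANTS, "analyst", CUSTOMERS))
```

Applying the mask at read time means the stored data stays intact while each user sees only what their grants allow.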
From a software architecture perspective, we generally design a data model to make data easier to understand and to make indicator development and data analysis more efficient. To support efficient computation over complex data, a general data platform also accumulates a large number of reusable base indicators. So, from the perspective of improving software development, the data platform also has the following functions:
  • Data modeling support
  • Layered data architecture support
  • Maintenance of guiding documents such as data development specifications, design recommendations, and best practices
Summarizing the data platform described above in a diagram yields the familiar data platform architecture diagram.


In addition, if the enterprise has a strong demand for modeling capabilities, the functions related to machine learning models are usually split out further to form a machine learning platform.

Data platform construction ideas

Having understood what a data platform is, let's look at how to approach building one for an enterprise.

Centralization or decentralization

First, let's look at how to build an enterprise data platform from an organizational perspective. Generally speaking, a data platform will be centralized and integrated to some degree.
Mention centralization and someone is sure to object, feeling that such a centralized system and organization cannot succeed. I have heard people flatly deny the value of centralization, citing counterexamples such as small project-based agile teams, decentralized blockchains, the DevOps idea of decentralizing the operations team, and agile practices that decentralize the test team.
Centralization does have its shortcomings, but we should also see its many advantages, such as:
  • It avoids the waste of resources caused by teams building the same thing repeatedly
  • Unified data management makes it easier and faster to roll out internal data strategies such as data standards and data security
  • It allows computing and storage resources to be shared, which saves money
  • It makes data integration across business lines more convenient
In fact, whether centralization or decentralization is better depends on the size of the business, its culture, and its organizational structure.
For example, some large business lines have ample staff and strong technical capabilities and are fully capable of building a data platform on their own. The internal business of such a line is usually particularly complex, the data volume is very large, and the demand for customization is high. In this case, the cost advantage of an enterprise-level centralized data platform is not attractive; on the contrary, it may reduce efficiency because of the extensive cross-team communication and cooperation it requires.
However, if we view such a business line as a small, independent, autonomous organization, it is usually divided into smaller business teams, and building a centralized data platform across those teams will still bring value.
As another example, some business lines are also large, but their internal organization takes project teams with strong business isolation as the basic unit. Each project team forms a small, full-function agile team that explores and advances its own business. In this situation there is little data sharing between teams, the requirements for data management are modest, and the first priority is to move the business forward quickly, so the value of a centralized data platform is less obvious. These teams may not even need a business-line-level data platform and may instead build small data platforms according to each team's needs.
There are many real examples of these situations. At Huawei, for instance, the consumer business line has built an independent data platform that provides centralized data services to a large number of internal project teams, whereas the carrier business line, which serves enterprises and has low business similarity and high isolation between projects, is more inclined to let each team build a small data platform.
As yet another example, in many banks and retail enterprises the internal business is usually relatively mature, and the common practice is to build a unified, centralized data platform.
So, returning to the original question of how to approach data platform construction: generally speaking, a data platform will be centralized and integrated to some degree, but how centralized it should be depends on the specific situation of the enterprise.
As for the traditional industries mentioned at the beginning of this article, most of these enterprises are centered not on software services but on their existing production or information business (such as automobiles, retail, or insurance). Within them, software is often only a supporting function, so the software development team's capabilities tend not to be particularly strong. Based on the analysis above, building a centralized, integrated data platform is probably more appropriate for these enterprises.

Adopt lean thinking to gradually build a data platform

Second, let's look at the question from the perspective of the data platform team's responsibilities. Generally speaking, the construction of a data platform should be driven by a specific high-value data analysis need (a business indicator or a machine learning model), improving the platform gradually while delivering business value, and eventually arriving at an enterprise-level data platform.
In contrast, many enterprises set out to build a powerful, comprehensive data platform right away, and many of these efforts have ended in failure. The typical approach is to first benchmark against the data platform of an industry leader and then set up a large team to start development. During development, too little attention is paid to who will use the platform and what value it has produced. In the end, the project's investors spend a lot of money without seeing any value, gradually cut their investment, and the project is abandoned.
Adopting a value-driven model means using lean thinking to guide the construction of the data platform. The aim is to deliver value: each platform function that is implemented corresponds to value that is immediately visible. Over the long run, through continuous refactoring and architecture evolution, the platform gradually takes shape.
Some may say that data platforms built this way will all be different and will lack a unified industry standard. But enterprises exist to generate economic benefit, so why must a data platform be the industry's standard implementation? In my view, because a data platform is so closely tied to technology, its final form will naturally vary by enterprise and team. Just imagine: if the same feature is handed to different developers, will the resulting designs and code be identical? In practice we do not much care whether the code is the same; as long as the feature works, the goal is achieved. The final form of a data platform is therefore likely to differ completely from one enterprise to another.
The process of building a successful data platform usually looks like this:
  • Set up a data platform team
  • Build a data platform with basic functions on top of open source technology
  • To compute a particular business indicator, ingest data from one system, completing a first version of the platform's data ingestion function along the way
  • To implement a machine learning model, ingest data from another system, thereby enhancing the ingestion function built earlier and, along the way, computing some general reusable indicators
  • To support more exploratory data analysis, add a self-service SQL interface for data analysis as needed
  • To support more exploratory data analysis, add a self-service Python interface for data analysis as needed
  • Driven by data security needs, improve the platform's data permission management and add support for data encryption and masking
  • As platform functions mature, business teams become increasingly self-sufficient in data ingestion and data analysis, and the data platform team focuses on continuously enhancing platform functions and maintaining platform stability

Summary

To sum up, this article first discussed what a data platform is, attempting to answer its definition and functional scope. Then, drawing on the author's own experience with data projects, it laid out an approach to building an enterprise data platform.
I hope this article offers some inspiration to colleagues engaged in data work and helps those outside the field gain some understanding of it.