Data Lake Governance: Benefits, Challenges, and Getting Started
2022-07-04 15:19:00 【Software testing network】
Successful data governance programs use policies, standards, and processes to create high-quality data and ensure that it is used correctly throughout the organization. Data governance initially focused on structured data in relational databases and traditional data warehouses, but that has changed. If your enterprise has a data lake environment and wants accurate analytics results from it, you also need to put appropriate data lake governance in place as part of the overall governance program.

But data lakes pose a variety of challenges for every area of enterprise data management, including data governance. Below we explore some of the major governance challenges, along with the benefits of governing a data lake effectively. First, though, let's define what a data lake is: a data platform that holds a large amount of raw data, usually a mix of structured, unstructured, and semi-structured data types. It is typically built on Hadoop, Spark, and other big data technologies.
While most data warehouses store data in relational tables, a data lake uses a flat architecture. Each data element is assigned a unique identifier and tagged with a set of metadata tags. As a result, a data lake is less structured than a data warehouse. Data is usually kept in its original format and is classified, organized, and filtered when it is needed for a specific analytics use, rather than when it is loaded into the data lake.
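As a rough illustration of the flat architecture described above, here is a minimal Python sketch. The `LakeObject` and `FlatLake` names, the in-memory dictionary, and the example tags are all assumptions made for illustration; a real data lake would sit on distributed storage rather than a Python dict.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LakeObject:
    """One raw data element, kept in its original format (hypothetical type)."""
    payload: bytes
    tags: Dict[str, str]
    # Every element gets its own unique identifier.
    object_id: str = field(default_factory=lambda: uuid.uuid4().hex)


class FlatLake:
    """Hypothetical in-memory stand-in for a flat data lake namespace."""

    def __init__(self) -> None:
        # No tables or hierarchy: a single flat mapping from ID to object.
        self._objects: Dict[str, LakeObject] = {}

    def put(self, payload: bytes, **tags: str) -> str:
        """Store raw bytes unchanged and attach metadata tags."""
        obj = LakeObject(payload=payload, tags=dict(tags))
        self._objects[obj.object_id] = obj
        return obj.object_id

    def find(self, **tags: str) -> List[LakeObject]:
        """Filter at read time: return objects whose tags match all criteria."""
        return [
            obj for obj in self._objects.values()
            if all(obj.tags.get(key) == value for key, value in tags.items())
        ]


lake = FlatLake()
csv_id = lake.put(b"id,amount\n1,9.99\n", source="billing", format="csv")
lake.put(b'{"event": "click"}', source="web", format="json")

# Classification happens when the data is needed, not when it is loaded.
billing_objects = lake.find(source="billing")
```

The point of the sketch is that nothing is transformed on the way in; the tags are the only structure, which is why missing or inconsistent tags quickly make the contents hard to find.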
Data lake and data swamp
If a data lake is not well managed and governed, it can turn into a swamp instead of a lake. Data is dumped into the platform without proper oversight and documentation, which makes it hard for data management and governance teams to keep track of what the data lake contains. This can lead to problems with data quality, consistency, reliability, and accessibility.
As a result, data scientists, data engineers, and other end users may be unable to find relevant data for analytics applications. Worse, a data swamp can cause analytics errors that ultimately lead to bad business decisions. Data security and privacy protections may not be applied correctly, putting the enterprise's data assets and business reputation at risk. To avoid ending up with a swamp, enterprises must govern the data lake environment.
Benefits of data lake governance
Effective data governance enables enterprises to improve data quality and make the most of data in business decisions, which can lead to operational improvements, stronger business strategies, and better financial performance. This applies to governing data lakes just as it does to other types of systems. The specific benefits of data lake governance include:
- Increased access to relevant data for advanced analytics. In a well-governed data lake, it is easier for data scientists and other members of the analytics team to find the data they need for machine learning, predictive analytics, and other data science applications.
- Less time spent preparing data for analytics uses. Although data in a data lake is usually left in its raw form until specific applications need it, the data preparation process can be shorter in a governed environment. For example, cleaning data early reduces the time needed to fix data errors and other problems later.
- Lower IT and data management costs. Preventing the data lake from growing out of control reduces the data processing and storage resources required. Improving data accuracy and consistency can also reduce overall data management needs.
- Better security and regulatory compliance for sensitive data. A common use case for data lakes is supporting marketing and sales, so they often contain sensitive information about customers. Strong data lake governance helps ensure that such data is properly protected and not misused.
Data lake governance challenges
The data management disciplines that support data governance include data quality, metadata management, and data security, all of which affect data lake governance and its challenges. Here are five common data governance challenges encountered in data lake deployments.
(1) Identifying and maintaining the correct data sources. In many data lake implementations, source metadata is not captured or is not available at all, which calls the validity of the data lake's contents into question. For example, the system of record or the business owner of a dataset may not be listed, or obviously redundant data may cause problems for data analysts. At a minimum, source metadata should be recorded for all data in the data lake so that users can see where it came from.
(2) Metadata management issues. Metadata provides context about the content of a dataset and is a key part of making data understandable and usable in applications. However, many data lake deployments do not apply correct data definitions to the data they collect. In addition, because raw data is usually loaded into the data lake as is, many enterprises have no step in place to validate the data or apply organizational data standards. Without proper metadata management, the data in a data lake is far less useful for analytics.
(3) Lack of coordination between data governance and data quality. Uncoordinated data lake governance and data quality efforts can let low-quality data into the data lake. When that data is used for analytics and to drive business decisions, it can produce inaccurate results, which leads to a loss of confidence in the data lake and a general mistrust of data throughout the organization. An effective data lake implementation requires data quality analysts and engineers to work closely with the data governance team and business data stewards to apply data quality policies, profile the data, and take the steps needed to improve its quality.
(4) Lack of coordination between data governance and data security. In this case, data security standards and policies that are not properly applied as part of the governance process can cause problems with access to personal data and other types of sensitive data protected by privacy regulations. Although a data lake is meant to be a fairly open data source, security and access control measures are still needed, and the data governance and data security teams should work together on the data lake design and loading processes as well as on ongoing data governance.
(5) Conflicts between business units that use the same data lake. Different departments may have different business rules for similar data, which can make it impossible to reconcile data differences for accurate analytics. A strong data governance program with an enterprise view of data policies, standards, procedures, and definitions, including an enterprise business glossary, can reduce the problems that arise when multiple business units share a data lake. If an enterprise has multiple data lakes, each one should be included in the data lake governance process and have a business data steward assigned to it.
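Challenges (1) and (2) both come down to data entering the lake without the metadata that explains it. One way to picture a guardrail is the Python sketch below, which refuses to register a dataset unless its source metadata is complete. The field names in `REQUIRED_SOURCE_FIELDS`, the `admit_dataset` function, and the dictionary-based registry are hypothetical; which fields are mandatory is a policy decision for the governance team.

```python
from datetime import datetime, timezone
from typing import Dict, List

# Assumed minimum provenance fields; adjust to your own governance policy.
REQUIRED_SOURCE_FIELDS = ("source_system", "business_owner", "loaded_at")


def missing_source_metadata(metadata: Dict[str, str]) -> List[str]:
    """Return the required provenance fields that are absent or blank."""
    return [
        name for name in REQUIRED_SOURCE_FIELDS
        if not str(metadata.get(name) or "").strip()
    ]


def admit_dataset(name: str, metadata: Dict[str, str],
                  registry: Dict[str, Dict[str, str]]) -> None:
    """Register a dataset only if its source metadata is complete."""
    missing = missing_source_metadata(metadata)
    if missing:
        raise ValueError(f"{name}: missing source metadata {missing}")
    registry[name] = dict(metadata)


registry: Dict[str, Dict[str, str]] = {}

# Complete provenance: accepted.
admit_dataset(
    "crm_contacts",
    {
        "source_system": "crm",
        "business_owner": "sales-ops",
        "loaded_at": datetime.now(timezone.utc).isoformat(),
    },
    registry,
)

# No owner or load timestamp: rejected instead of silently dumped into the lake.
try:
    admit_dataset("clickstream_dump", {"source_system": "web"}, registry)
    rejected = False
except ValueError:
    rejected = True
```

A check like this does not validate the data itself; it only makes sure that someone can later answer where a dataset came from and who owns it.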
How to get started with data lake governance
As with data governance in other types of systems, common initial steps for data lake governance include:
- Document the business case for governing the data lake, including data quality metrics and other ways to measure the benefits of governance.
- Find an executive or business sponsor to help win approval and funding for the governance effort.
- If you do not already have a data governance structure in place, create one. It should include a governance team, data stewards, and a data governance committee made up of business executives and other relevant data owners.
- Work with the governance committee to develop data standards and governance policies for the data lake environment.
Another good initial step is to build a data catalog to help end users locate and understand the data stored in the data lake. Alternatively, if you already have a catalog for other data assets, it can be extended to include the data lake. A data catalog captures metadata and creates an inventory of available data that users can search to find what they need. You can also embed information about your organization's data governance policies in the catalog, along with mechanisms to enforce rules and restrictions.
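To make the catalog idea concrete, the following Python sketch keeps an inventory of datasets, supports keyword search over their metadata, and attaches one simple access rule. Everything here is an assumption for illustration: the `CatalogEntry` fields, the `privacy_approved` role name, and the rule itself are invented, and commercial or open source catalog tools expose their own, much richer models.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class CatalogEntry:
    """Metadata describing one dataset in the lake (hypothetical schema)."""
    name: str
    description: str
    owner: str
    tags: List[str] = field(default_factory=list)
    contains_personal_data: bool = False


class DataCatalog:
    """Minimal searchable inventory with one embedded governance rule."""

    def __init__(self) -> None:
        self._entries: Dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        self._entries[entry.name] = entry

    def search(self, keyword: str) -> List[str]:
        """Case-insensitive match against name, description, and tags."""
        needle = keyword.lower()
        return sorted(
            e.name for e in self._entries.values()
            if needle in " ".join([e.name, e.description, *e.tags]).lower()
        )

    def can_read(self, name: str, roles: Set[str]) -> bool:
        """Assumed policy: personal data requires the 'privacy_approved' role."""
        entry = self._entries[name]
        return (not entry.contains_personal_data) or ("privacy_approved" in roles)


catalog = DataCatalog()
catalog.register(CatalogEntry(
    "web_clicks", "Raw clickstream events", "marketing", ["web", "events"]))
catalog.register(CatalogEntry(
    "customer_profiles", "Customer contact details", "sales",
    ["crm", "customer"], contains_personal_data=True))

matches = catalog.search("customer")
analyst_allowed = catalog.can_read("customer_profiles", {"analyst"})
```

Keeping the policy flag next to the descriptive metadata is the design choice worth noting: the same record that helps a user find a dataset also tells the platform whether that user may read it.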
In short, building strong data governance, together with metadata management, data quality, and data security processes, into the design, loading, and maintenance of the data environment can significantly increase the value of a data lake. The active participation of experienced professionals in all of these areas is also crucial. Otherwise, your data lake may indeed become more of a data swamp.