Data Lake Governance: Benefits, Challenges and Getting Started
2022-07-04 15:19:00 【Software testing network】
Successful data governance programs use policies, standards and processes to create high-quality data and to ensure that it is used correctly throughout the organization. Data governance initially focused on structured data in relational databases and traditional data warehouses, but that has changed. If your enterprise has a data lake environment and wants accurate analytics results from it, you also need to put appropriate data lake governance in place as part of the overall governance program.
But data lakes pose a variety of challenges for every area of enterprise data management, including data governance. Below we explore some of the major governance challenges and the benefits of governing a data lake effectively. First, though, let's define what a data lake is: a data platform that holds a large amount of raw data, usually a mix of structured, unstructured and semi-structured data types. It is typically built on Hadoop, Spark and other big data technologies.
While most data warehouses store data in relational tables, a data lake uses a flat architecture. Each data element is assigned a unique identifier and tagged with a set of metadata tags. As a result, a data lake is less structured than a data warehouse. Data is usually kept in its original format and is classified, organized and filtered as needed for specific analytics uses, rather than at the time it is loaded into the data lake.
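The flat architecture described above can be illustrated with a small sketch. It is not how Hadoop or Spark store data; `LakeObject`, `FlatLake`, `put` and `find` are hypothetical names invented for this example, and the only assumptions taken from the text are that each element gets a unique identifier, carries metadata tags, and is filtered at read time rather than at load time.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class LakeObject:
    """One raw data element in a flat namespace (hypothetical model)."""
    payload: bytes            # raw data, kept in its original format
    tags: dict[str, str]      # metadata tags attached when the data is loaded
    object_id: str = field(default_factory=lambda: uuid.uuid4().hex)


class FlatLake:
    """In-memory stand-in for a data lake: no tables, only tagged objects."""

    def __init__(self) -> None:
        self._objects: dict[str, LakeObject] = {}

    def put(self, payload: bytes, **tags: str) -> str:
        # Nothing is parsed or reshaped at load time; the data is only tagged.
        obj = LakeObject(payload=payload, tags=dict(tags))
        self._objects[obj.object_id] = obj
        return obj.object_id

    def find(self, **wanted: str) -> list[LakeObject]:
        # Classification and filtering happen here, when an analysis needs data.
        return [
            obj for obj in self._objects.values()
            if all(obj.tags.get(key) == value for key, value in wanted.items())
        ]


lake = FlatLake()
lake.put(b'{"order": 1}', source="crm", format="json")
lake.put(b"id,amount\n1,9.5", source="billing", format="csv")
json_objects = lake.find(format="json")
```

The point of the sketch is that `find` can only work as well as the tags allow, which is why the governance sections below put so much weight on metadata.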
Data lake and data swamp
If a data lake is not well managed and governed, it can become a swamp instead of a lake. Data is dumped into the platform without proper oversight and documentation, which makes it difficult for data management and governance teams to track what the data lake contains. This can lead to problems with data quality, consistency, reliability and accessibility.
As a result, data scientists, data engineers and other end users may be unable to find relevant data for their analytics applications. Worse, a data swamp can lead to analysis errors and, ultimately, to bad business decisions. Data security and privacy protections may also not be applied correctly, putting the enterprise's data assets and business reputation at risk. To avoid ending up with a swamp, enterprises must govern the data lake environment.
Benefits of data lake governance
Effective data governance enables enterprises to improve data quality and make the most of their data in business decisions, which can lead to operational improvements, stronger business strategies and better financial performance. The same holds for data lakes as for other types of systems. The specific benefits of data lake governance include:
- Increased access to relevant data for advanced analytics. In a well-governed data lake, it is easier for data scientists and other members of the analytics team to find the data they need for machine learning, predictive analytics and other data science applications.
- Less time spent preparing data for analytics uses. Although data in a data lake is usually kept in its original form until specific applications need it, the data preparation process can be shortened in a governed environment. For example, cleaning data early reduces the time spent later fixing data errors and other problems.
- Lower IT and data management costs. Keeping the data lake from growing out of control reduces the data processing and storage resources it requires. Improving data accuracy, uniformity and consistency also reduces overall data management effort.
- Better security for sensitive data and stronger regulatory compliance. A common use case for data lakes is supporting marketing and sales, so they often contain sensitive information about customers. Strong governance of the data lake helps ensure that such data is properly protected and is not misused.
Data lake governance challenges
The data management disciplines that support data governance include data quality, metadata management and data security, all of which affect data lake governance and the challenges it faces. Here are five common data governance challenges encountered in data lake deployments.
(1) Identifying and maintaining the correct data sources. In many data lake implementations, source metadata is not captured or is not available at all, which calls the validity of the data lake's contents into question. For example, the system of record or the business owner of a dataset may not be listed, or obviously redundant data may cause problems for data analysts. At a minimum, source metadata should be recorded for all data in the data lake and made available to users so they can understand where the data came from.
(2) Metadata management issues. Metadata provides context about the contents of a dataset and is a key part of making data understandable and usable in applications. However, many data lake deployments do not apply proper data definitions to the data they collect. In addition, because raw data is usually loaded into the data lake as is, many enterprises have no step in place to validate the data or apply organizational data standards. Without proper metadata management, the data in a data lake is much less useful for analysis.
(3) Lack of coordination between data governance and data quality. Uncoordinated data lake governance and data quality work can let low-quality data into the data lake. When that data is used for analysis and to drive business decisions, it can produce inaccurate results, which in turn leads to a loss of confidence in the data lake and a general mistrust of data throughout the organization. An effective data lake implementation requires data quality analysts and engineers to work closely with the data governance team and business data stewards to apply data quality policies, profile the data and take the measures needed to improve its quality.
(4) Lack of coordination between data governance and data security. In this case, data security standards and policies that are not properly applied in the governance process can cause problems with access to personal data and other types of sensitive data protected by privacy regulations. Although a data lake is designed to be a fairly open data source, security and access control measures are still needed, and the data governance and data security teams should work together on the design and loading of the data lake as well as on ongoing governance.
(5) Conflicts between business units that use the same data lake. Different departments may have different business rules for similar data, which can make it impossible to reconcile data differences for accurate analysis. A strong data governance program with an enterprise view of data policies, standards, procedures and definitions, including an enterprise business glossary, can reduce the problems that arise when multiple business units use one data lake. If an enterprise has multiple data lakes, each one should be included in the data lake governance process and assigned a business data steward.
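Challenges (1) and (2) both come down to data entering the lake without source metadata. One way to address that is a check at load time, sketched below. The required field names in `REQUIRED_SOURCE_FIELDS` and the functions `missing_source_metadata` and `admit` are hypothetical and not taken from any particular product; a real governance program would define its own list.

```python
# Hypothetical governance policy: every dataset must declare where it came
# from, who owns it on the business side, and when it was loaded.
REQUIRED_SOURCE_FIELDS = ("source_system", "business_owner", "ingested_at")


def missing_source_metadata(metadata: dict) -> list[str]:
    """Return the required fields that are absent or blank, in policy order."""
    return [
        name for name in REQUIRED_SOURCE_FIELDS
        if not str(metadata.get(name) or "").strip()
    ]


def admit(dataset_name: str, metadata: dict) -> bool:
    """Refuse to load a dataset whose source metadata is incomplete."""
    missing = missing_source_metadata(metadata)
    if missing:
        raise ValueError(f"{dataset_name}: missing source metadata {missing}")
    return True


complete = {
    "source_system": "crm",
    "business_owner": "sales-ops",
    "ingested_at": "2022-07-04",
}
admit("customer_profiles", complete)
```

Rejecting the load outright is the strictest option. A softer variant could accept the data but flag it as unverified, which keeps the lake open while still making the gap visible to analysts.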
How to get started with data lake governance
As with data governance in other types of systems, common first steps for data lake governance include:
- Document the business case for governing the data lake, including data quality metrics and other ways to measure the benefits of governance.
- Find an executive or business sponsor to help secure approval and funding for the governance work.
- If you don't already have an appropriate data governance structure, create one that includes a governance team, data stewards and a data governance committee made up of business executives and other relevant data owners.
- Work with the governance committee to develop data standards and governance policies for the data lake environment.
Another good first step is to build a data catalog to help end users locate and understand the data stored in the data lake. Alternatively, if you already have a catalog for other data assets, it can be extended to include the data lake. A data catalog captures metadata and creates an inventory of the available data, which users can search to find the data they need. You can also embed information about your organization's data governance policies in the catalog, along with mechanisms that enforce rules and restrictions.
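To make the catalog idea concrete, here is a minimal sketch of an inventory that users can search and that carries a simple policy label. `CatalogEntry`, `DataCatalog`, the `sensitivity` labels and the example locations are all hypothetical; commercial and open source catalogs have far richer models.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    name: str
    description: str
    owner: str
    location: str
    sensitivity: str = "public"   # hypothetical policy label, e.g. "restricted"


class DataCatalog:
    """Searchable inventory of datasets with a basic access rule."""

    def __init__(self) -> None:
        self._entries: dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        self._entries[entry.name] = entry

    def search(self, keyword: str) -> list[str]:
        # Case-insensitive match against dataset names and descriptions.
        needle = keyword.lower()
        return sorted(
            entry.name for entry in self._entries.values()
            if needle in entry.name.lower() or needle in entry.description.lower()
        )

    def can_read(self, name: str, clearances: set[str]) -> bool:
        # Public data is open to everyone; anything else needs a matching clearance.
        entry = self._entries[name]
        return entry.sensitivity == "public" or entry.sensitivity in clearances


catalog = DataCatalog()
catalog.register(CatalogEntry(
    "web_clicks", "Raw clickstream events", "marketing", "lake/raw/web_clicks"))
catalog.register(CatalogEntry(
    "customer_profiles", "Customer contact details", "sales",
    "lake/raw/customers", sensitivity="restricted"))
```

Here `catalog.search("customer")` finds the profile dataset, while `catalog.can_read("customer_profiles", set())` is false for a user without the "restricted" clearance, which is the kind of rule enforcement the paragraph above refers to.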
All in all, building strong data governance, along with metadata management, data quality and data security processes, into the design, loading and maintenance of the data lake environment can significantly increase the value of the data lake. The active participation of experienced professionals in all of these areas is also crucial. Otherwise, your data lake may well turn into more of a data swamp.