Data Lake Governance: Benefits, Challenges and Getting Started

2022-07-04 15:19:00 Software testing network

Successful data governance programs use policies, standards and processes to create high-quality data and to ensure that it is used correctly throughout the organization. Data governance initially focused on structured data in relational databases and traditional data warehouses, but that has changed. If your enterprise has a data lake environment and wants accurate analytics results, you also need to put appropriate data lake governance in place as part of the overall governance program.

But data lakes pose a variety of challenges for all areas of enterprise data management, including data governance. Below we explore some of the major governance challenges and the benefits of governing a data lake effectively. First, however, let's define what a data lake is: a data platform that holds a large amount of raw data, usually including a mix of structured, unstructured and semi-structured data types. It is usually built on Hadoop, Spark and other big data technologies.

While most data warehouses store data in relational tables, a data lake uses a flat architecture. Each data element is assigned a unique identifier and tagged with a set of metadata tags. As a result, a data lake is less structured than a data warehouse. Data is usually kept in its original format and is categorized, organized and filtered as needed for specific analytics purposes, rather than when it is loaded into the data lake.
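
To make the flat architecture concrete, here is a minimal in-memory sketch in Python. It is only an illustration under stated assumptions: the names `LakeObject` and `FlatLake`, and the tag keys `format` and `domain`, are hypothetical and do not come from Hadoop, Spark or any real data lake product. Each element gets a unique identifier and a set of metadata tags, the payload is stored untouched, and filtering happens at read time.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class LakeObject:
    """One raw data element, kept in its original format."""
    payload: bytes  # raw content, not transformed at load time
    tags: dict = field(default_factory=dict)  # descriptive metadata tags
    object_id: str = field(default_factory=lambda: str(uuid.uuid4()))


class FlatLake:
    """Hypothetical flat store: no tables or hierarchy, only IDs and tags."""

    def __init__(self):
        self._objects = {}

    def put(self, payload, **tags):
        """Store the payload as is and return its generated identifier."""
        obj = LakeObject(payload=payload, tags=dict(tags))
        self._objects[obj.object_id] = obj
        return obj.object_id

    def get(self, object_id):
        return self._objects[object_id]

    def find(self, **wanted):
        """Filter at read time: objects whose tags match every wanted pair."""
        return [
            obj for obj in self._objects.values()
            if all(obj.tags.get(key) == value for key, value in wanted.items())
        ]


lake = FlatLake()
csv_id = lake.put(b"id,amount\n1,9.99\n", format="csv", domain="sales")
json_id = lake.put(b'{"event": "click"}', format="json", domain="marketing")
sales_objects = lake.find(domain="sales")
```

The design choice worth noting is that `put` never inspects or reshapes the payload; all structure lives in the tags, which is also why missing or wrong tags make the contents hard to find later.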

Data lake and data swamp

If a data lake is not properly managed and governed, it can become a swamp instead of a lake. Data is dumped into the platform without proper monitoring and documentation, making it difficult for data management and governance teams to track what the data lake contains. This can lead to problems with data quality, consistency, reliability and accessibility.

As a result, data scientists, data engineers and other end users may be unable to find relevant data for analytics applications. Worse, a data swamp can lead to analysis errors and ultimately to bad business decisions. Data security and privacy protections may also not be applied correctly, putting the enterprise's data assets and business reputation at risk. To avoid this swamp scenario, enterprises must govern the data lake environment.

Benefits of data lake governance

Effective data governance enables enterprises to improve data quality and make the most of data for business decision-making, which can lead to operational improvements, stronger business strategies and better financial performance. The same principle applies to governing data lakes as to other types of systems. The specific benefits of data lake governance include:

  • Increased access to relevant data for advanced analytics. In a well-governed data lake, it is easier for data scientists and other members of the analytics team to find the data they need for machine learning, predictive analytics and other data science applications.
  • Less time spent preparing data for analytics use. Although data in a data lake is usually kept in its original form until specific applications need it, the data preparation process can be shortened in a governed environment. For example, cleaning data early reduces the time needed to fix data errors and other problems later.
  • Lower IT and data management costs. Preventing the data lake from growing out of control reduces the data processing and storage resources required. Improving data accuracy and consistency also reduces overall data management requirements.
  • Better security for sensitive data and stronger regulatory compliance. A common use case for data lakes is supporting marketing and sales, so they often contain sensitive information about customers. Strong governance of the data lake helps ensure that such data is properly protected and not misused.

Data lake governance challenges

The data management disciplines that support data governance include data quality, metadata management and data security, all of which affect data lake governance and its challenges. Here are five common data governance challenges encountered in data lake deployments.

(1) Identifying and maintaining the correct data sources. In many data lake implementations, source metadata is either not captured or not available at all, which calls the validity of the data lake's contents into question. For example, the system of record or the business owner of a dataset may not be listed, or apparently redundant data may cause problems for data analysts. At a minimum, source metadata should be recorded for all data in the data lake so that users can understand where it came from.

(2) Metadata management issues. Metadata provides context about the contents of a dataset and is an essential part of making data understandable and usable in applications. However, many data lake deployments do not apply correct data definitions to the data they collect. In addition, because raw data is usually loaded into the data lake as is, many enterprises have no step for validating the data or applying organizational data standards. Without proper metadata management, the data in the data lake is less useful for analytics.

(3) Lack of coordination between data governance and data quality. Uncoordinated data lake governance and data quality efforts can allow low-quality data into the data lake. When that data is used for analytics and to drive business decisions, it can produce inaccurate results, which leads to a loss of confidence in the data lake and a general mistrust of data throughout the organization. An effective data lake implementation requires data quality analysts and engineers to work closely with the data governance team and business data stewards to apply data quality policies, analyze the data and take the steps needed to improve its quality.

(4) Lack of coordination between data governance and data security. In this case, data security standards and policies that are not properly applied as part of the governance process can cause problems with access to personal data and other types of sensitive data protected by privacy regulations. Although a data lake is meant to be a fairly open data source, security and access control measures are still needed, and the data governance and data security teams should work together on the data lake's design and loading processes as well as on ongoing data governance.

(5) Conflicts between business units that use the same data lake. Different departments may have different business rules for similar data, which can make it impossible to reconcile data differences for accurate analytics. A strong data governance program with an enterprise view of data policies, standards, procedures and definitions, including an enterprise business glossary, can reduce the problems that arise when multiple business units share one data lake. If an enterprise has multiple data lakes, each one should be included in the data lake governance process and have a business data steward assigned to it.
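
Several of the challenges above come down to checks that are skipped when data is loaded. The following Python sketch shows one possible shape for an ingestion gate that checks source metadata (challenges 1 and 2) and runs a basic completeness check (challenge 3). Everything here is an assumption made for illustration: the `Dataset` class, the `governance_issues` function and the field names in `REQUIRED_SOURCE_FIELDS` are hypothetical, and a real program would take its required fields and quality rules from its own governance policies.

```python
from dataclasses import dataclass

# Hypothetical minimum set of source metadata every dataset must carry.
REQUIRED_SOURCE_FIELDS = ("system_of_record", "business_owner", "description")


@dataclass
class Dataset:
    name: str
    source_metadata: dict
    rows: list  # list of dicts, one per record


def governance_issues(dataset, required_columns):
    """Return human-readable problems; an empty list means the dataset passes."""
    issues = []
    # Source metadata must be present and non-empty.
    for field_name in REQUIRED_SOURCE_FIELDS:
        if not dataset.source_metadata.get(field_name):
            issues.append(f"missing source metadata: {field_name}")
    # Basic data quality check: required columns must not be empty.
    for index, row in enumerate(dataset.rows):
        for column in required_columns:
            if row.get(column) in (None, ""):
                issues.append(f"row {index}: empty value in '{column}'")
    return issues


good = Dataset(
    "orders",
    {"system_of_record": "erp", "business_owner": "sales-ops",
     "description": "Daily order export"},
    [{"order_id": 1, "amount": 9.99}],
)
bad = Dataset(
    "clicks",
    {"system_of_record": "web"},
    [{"order_id": None, "amount": 5}],
)
good_issues = governance_issues(good, ["order_id", "amount"])
bad_issues = governance_issues(bad, ["order_id", "amount"])
```

Returning a list of issues instead of raising on the first failure lets the governance and data quality teams see every problem with a dataset at once, and leaves the decision to reject, quarantine or accept it to policy.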

How to start managing data lakes

As with data governance in other types of systems, common initial steps for data lake governance include:

  • Document the business case for governing the data lake, including data quality metrics and other ways to measure the benefits of governance.
  • Find an executive or business sponsor to help secure approval and funding for the governance effort.
  • If you don't already have a data governance structure in place, create one that includes a governance team, data stewards and a data governance committee made up of business executives and other relevant data owners.
  • Work with the governance committee to develop data standards and governance policies for the data lake environment.

Another good initial step is to build a data catalog to help end users locate and understand the data stored in the data lake. Alternatively, if you already have a catalog of other data assets, it can be extended to include the data lake. A data catalog captures metadata and creates an inventory of available data that users can search to find what they need. You can also embed information about your organization's data governance policies in the catalog, along with mechanisms for enforcing rules and restrictions.
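
As a rough illustration of what such a catalog does, the Python sketch below keeps an inventory of entries, supports keyword search and embeds one simple governance rule. It is a toy model, not the API of any real catalog product: `CatalogEntry`, `DataCatalog`, the `sensitive` flag, the role names and the `lake/raw/...` locations are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    name: str
    description: str
    location: str  # where the data sits in the lake (hypothetical path)
    sensitive: bool = False  # governance flag embedded in the catalog
    allowed_roles: set = field(default_factory=set)


class DataCatalog:
    """Hypothetical searchable inventory of the datasets in a data lake."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.name] = entry

    def search(self, keyword):
        """Case-insensitive match on name or description; returns sorted names."""
        needle = keyword.lower()
        return sorted(
            entry.name for entry in self._entries.values()
            if needle in entry.name.lower() or needle in entry.description.lower()
        )

    def can_read(self, name, role):
        """Policy check: sensitive entries are readable only by listed roles."""
        entry = self._entries[name]
        return (not entry.sensitive) or role in entry.allowed_roles


catalog = DataCatalog()
catalog.register(
    CatalogEntry("web_clicks", "Raw clickstream events", "lake/raw/web_clicks")
)
catalog.register(
    CatalogEntry(
        "customer_profiles",
        "Customer contact details",
        "lake/raw/customers",
        sensitive=True,
        allowed_roles={"marketing_analyst"},
    )
)
```

Keeping the sensitivity flag and allowed roles on the catalog entry itself means that the same record users search is also the one that policy checks consult, so the two cannot drift apart.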

In summary, incorporating strong data governance, along with metadata management, data quality and data security processes, into the design, loading and maintenance of the data environment can significantly increase the value of a data lake. The active participation of experienced professionals in all of these areas is also crucial. Otherwise, your data lake may indeed become more of a data swamp.

Copyright notice
This article was created by [Software testing network]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/185/202207041421366136.html