当前位置:网站首页>Common concepts and terms in offline warehouse modeling

Common concepts and terms in offline warehouse modeling

2022-06-23 04:42:00 Big data Xiao Wang

Analysis of common concepts and terms
One 、 Analysis of common concepts in data warehouse
1、 Entity
2、 dimension
3、 Measure
4、 Particle size
5、 CAL
6、 indicators
7、 label
8、 Natural bond
9、 Persistent key
10、 Surrogate key
11、 Degenerate dimension
12、 Run in
13、 Scroll up
Two 、 Relationship between data warehouse concepts
1、 Entity table , Fact table , Relationship between dimension tables
2、 Difference between indicators and labels
3、 Differences and relationships between dimensions and indicators
4、 The difference between the use of natural keys and proxy keys in the data warehouse


One 、 Analysis of common concepts in data warehouse
1、 Entity
Entity refers to the subject attached , Is an object we analyze , For example, we analyze the sales of goods , For example, what is the sales volume of Huawei's mobile phones in the past six months , Huawei mobile phone is an entity ; We analyze user activity , A user is an entity . Of course, entities can also exist in reality , For example, virtual business objects , Activities , Members, etc. can be regarded as an entity .

 

Entities exist for business analysis , As a filtered dimension of analysis , Have properties that describe yourself , It has analyzable value .

2、 dimension
Dimension is the angle of view , Analyze business data , From what angle , What kind of dimension is established . Therefore, dimension is a quantity used to analyze data , For example, you need to analyze the sales of products , You can choose to analyze by commodity category , This constitutes a dimension , Put all commodity categories together , It constitutes a dimension table .

3、 Measure
A metric is a numeric value on a business process node . Like sales , Price , Cost, etc .

The measures in the fact table can be divided into three categories : Completely additive , Half can add , Do not add .

Completely additive metrics are the most flexible , The most useful , For example, sales , Sales, etc , You can summarize any dimension ;

Semi additive measures can summarize some dimensions , But you can't aggregate all dimensions , Difference is a common semi additive measure , In addition to the time dimension , You can add across all dimensions ;

The other is completely non additive , for example : ratio . For this kind of non additive metric , A good way is , Store as many fully additive components as possible that are not additive , And before calculating the final non additive fact , Summarize these components into the final result set .

4、 Particle size
Granularity is the unit of measurement in a business process , For example, goods are measured by piece record , Or is it measured by batch records .

In the construction of digital warehouse , We say this is a fact table of user granularity , Then each row of data in the table is a user , No duplicate users ; For example, there is a table of sales granularity , Then each row in the table is a sales record .

Choosing an appropriate granularity level is an important key content for the construction of data warehouse , When designing data granularity , Generally, the following factors need to be considered :

Type of analysis to accept 、 The minimum granularity of acceptable data and the amount of data that can be stored ;

The higher the level of granularity is defined , The less you can do a more detailed analysis in the warehouse ;

If storage resources are limited , You can only use higher data granularity ;

The data granularity partition strategy must ensure : The granularity of data can really meet the needs of users' decision analysis , This is the most important criterion in the data granularity partition strategy .

5、 CAL
Dimension is data access logic ( How to access data ), For example, the number to be taken is 10 The average height of boys under the age of , This is the statistical caliber .

6、 indicators
An indicator is a measure of caliber , That is the final result . For example, the order volume in the last seven days , Purchase conversion rate of a promotion, etc .

An indicator is specific to the calculation and implementation , It mainly consists of the following parts :

Index processing logic , such as count ,sum, avg

dimension , For example, by Department 、 Regional index statistics , Corresponding sql Medium group by

Business limits / Modifiers , For example, different payment channels are used to calculate the corresponding indicators , Order refund rate paid by wechat , Order refund rate paid by Alipay . Corresponding sql Medium where.

besides , Indicators themselves can also be derived 、 Derive more indicators , Based on these characteristics , Indicators can be classified :

Atomic index : Basic business facts , No business restrictions 、 There is no dimension . For example, the order quantity in the order table 、 The total amount of the order is an atomic indicator ;
Indicators more concerned by business parties , It has practical business meaning , You can take the indicators of data directly .
For example, the one-day order payment amount of a store is a derived indicator , It will be displayed directly on the products to the merchants .
But this index can't be taken directly from the unified middle layer of the data warehouse ( Because there are no existing fact fields , Generally, they provide wide watches ). There needs to be a bridge between the middle tier of the warehouse and the business side's index requirements , So there are derived indicators

Derived indicators : dimension + Modifiers + Atomic index . for example : The shop is near 1 Day order payment amount : Store is a dimension , near 1 Day is a modifier of time type , Payment amount is an atomic indicator ;
dimension : Observe the angle of each index ;
Modifiers : One or more values of the dimension , For example, under the dimension of gender , Men and women are 2 A modifier

Derivatives : For example, the conversion rate of a promotion activity is a derivative indicator , Because it needs to calculate the number of promotional releases and the number of promotional orders .
7、 label
Labels are set artificially 、 According to business scenario requirements , The highly refined feature identification obtained by using a certain algorithm for the target object .

It can be seen that the label is the result of artificial reprocessing , Ruwanghong 、LOL A great god . For ambiguous labels , We can distinguish labels internally , such as : Apple , We can define apple as fruit , Apple phone means mobile phone .

8、 Natural bond
A key consisting of attributes that already exist in reality , It is unique in the business concept , And has certain business implications , For example, commodities ID, staff ID.

From a warehouse Perspective , Identifiers from business systems are natural keys , For example, the employee number in the business library .

9、 Persistent key
Remain permanent and will not change . Sometimes called a supernatural persistent bond . For example, ID number is persistent key. .

Difference between natural key and persistent key : for instance , For example, the company's employees leave their jobs and rejoin them , His natural key, employee number, has changed , But his persistent key ID number is unchanged .

10、 Surrogate key
Is a key that has no business meaning . Proxy keys have many other names : Meaningless key 、 Integer key 、 Unnatural bond 、 Manual key 、 Synthetic bond, etc .

Proxy keys are simply expressed as integers produced in a sequential sequence . Page of product line 1 The row proxy key is 1, Then the proxy key in the next line is 2, Do so . The proxy key is only used to connect the dimension table and the fact table .

11、 Degenerate dimension
Degenerate dimension , It's a dimension keyword that looks like a fact table , But in fact, there is no corresponding dimension table , That is, dimension attributes are stored in the fact table , This dimension column stored in the fact table is called degenerate dimension . Like other dimensions stored in dimension tables , The degenerate dimension can also be used to filter and query the fact table 、 Realize aggregation operation, etc .

So how to define the degenerate dimension ?

For example, orders id, This dimension is of great magnitude , There is no need to use a dimension table to store , When we query or filter data, we need , So this redundancy is in the fact table , This is called degenerate dimension ,citycode We will also be redundant in the fact table , But it has a corresponding dimension table , So it's not a degenerate dimension .
12、 Run in
This is a common concept in data analysis , Drilling down can be understood as increasing the level of dimension , Thus, the data can be observed from coarse-grained to fine-grained , For example, when analyzing product sales , You can observe more fine-grained data from year to month to day along the time dimension . You can drill down from the year dimension to the month dimension 、 Dimension of day, etc .

13、 Scroll up
I see. Run in , The first volume is easy to understand , The two are opposite operations , So roll up can be understood as deleting some layers of the dimension , The operation of observing data from fine-grained to coarse-grained or Aggregate and summarize data up the dimension hierarchy .

Two 、 Relationship between data warehouse concepts
1、 Entity table , Fact table , Relationship between dimension tables
stay Kimball There are dimensions and facts in dimension modeling , stay Inmon There are entities and relationships in paradigm modeling , If we separate the two modeling methods, these concepts are easier to understand . However, there are many hybrid modeling methods , The two modeling methods are combined , These concepts are not easy to remember and confused , Especially the fact table and entity table , What are the differences and connections between them , Let's look at their respective concepts first :

Dimension table : The dimension table can be seen as a window for users to analyze a fact , The data in it should be a description of all aspects of the facts , For example, time dimension table , Regional dimension table , Dimension table is an analytical perspective of fact table .

Fact table : In fact, a fact table determines a fact through the combination of various dimensions and some index values , For example, through the time dimension , Regional organization dimension , The fact that the index value can be used to determine the value of some indexes at a certain time and place . Each piece of data in the fact table is obtained by the intersection of the data and indicator values of several dimension tables .

Entity table : An entity table is a table of actual objects , The data in the entity table must be the objective thing data , For example, all kinds of goods , It exists objectively , So you can design it as an entity table . Real time tables only describe things , There are no concrete facts , Therefore, some people call the entity table a fact table without facts .

for instance : For example, there are apple phones in the mobile phone Mall , Huawei mobile phones and other mobile phones of various brands and models , These data can form a mobile entity table , But there is no measurable data in the table .

One day the Apple phone sold 15 platform , Huawei mobile phones sold 20 platform , These mobile phone sales data are facts , Form a fact table . In this way, you can use the date dimension table and the region dimension table to perform various dimension analysis on this fact table .

2、 Difference between indicators and labels
Different concepts

Indicators are used to define 、 A standard or way of evaluating and describing a particular thing . such as : Number of new users 、 Cumulative users 、 User activity rate is an indicator to measure user development ;

Labels are set artificially 、 According to business scenario requirements , The highly refined feature identification obtained by using a certain algorithm for the target object . It can be seen that the label is the result of artificial reprocessing , Ruwanghong 、LOL A great god .

Different composition

Index name is the name of the characteristics of matter matter and quantity ; The indicator value is the indicator at a specific time 、 regional 、 Quantitative performance under , Such as human weight , The indicator name is body weight , The value of the indicator is 120 Jin ;

Label names are usually adjectives or adjectives + The structure of nouns , Labels are generally not quantifiable , Usually isolated , In addition to the basic class label , Labels processed by certain algorithms generally have no units and dimensions . If will exceed 200 Those who weigh five kilograms are called fat people .

Different categories

Classification of indicators :

Calculation logic according to indicators , Indicators can be divided into atomic indicators 、 Derived indicators 、 There are three types of derivative indicators ;

According to different descriptions of events , It is divided into process indicators and outcome indicators ;

Classification of labels :

According to the variability of labels, they are divided into static labels and dynamic labels ;

According to the label reference and evaluation indicators , It can be divided into qualitative label and quantitative label ;

The best application of indicators is monitoring 、 analysis 、 Evaluation and modeling .
The best application of labels is labeling 、 Characterization 、 Classification and feature extraction .

In particular, it should be noted that , Because the annotation of the result is also a label , So in the application scenario of algorithms related to natural language processing and machine learning , Tags are of great value for supervised learning , Just a simple indicator is difficult to achieve . The indicators are in task allocation 、 Role of performance management , The label can't do it .

3、 Differences and relationships between dimensions and indicators
Dimension is the observation angle of data , That is, from which angle to analyze the problem , Look at problems .

The indicator is to calculate the value of this result on the basis of dimension .

A dimension is generally a discrete value , For example, each independent date or region in the time or regional dimension . Therefore, in statistics , You can aggregate records with the same dimension , Apply aggregate function to accumulate 、 mean value 、 Maximum 、 Minimum value and other aggregate calculation .

Indicators are aggregated general calculations , That is, the result of aggregation operation , It is generally a continuous value .

4、 The difference between the use of natural keys and proxy keys in the data warehouse
The data warehouse toolbox says that the only primary key of a dimension table should be a proxy key instead of a natural key . Sometimes modelers are reluctant to give up using natural keys , Because they want to query the fact table with operational code , You don't want to connect with the dimension table . However , Multidimensional keys with business implications should be avoided , Because whatever assumptions we make may eventually become invalid , Because we can't control the change of Business Library .

Therefore, each connection between dimension table and fact table in data warehouse should be based on integer proxy key with no actual meaning . Avoid using natural keys as the primary keys of dimension tables .
 

原网站

版权声明
本文为[Big data Xiao Wang]所创,转载请带上原文链接,感谢
https://yzsam.com/2022/174/202206222323404699.html