当前位置:网站首页>Practice and thinking of offline data warehouse and Bi development
Practice and thinking of offline data warehouse and Bi development
2022-07-02 07:16:00 【Software development heart】
Offline warehouse and bi Practice and thinking of development
background
The author's main work before coming to vipshop was mainly based on the company's self-study bi Low code platform for reporting 、 Chart development work , On this basis, we should explore the commercialization . Therefore, I have some practice in offline data warehouse modeling and development , There are also some thoughts , Try to do a brief comb here .
What is a data warehouse
First of all, a brief explanation , What is a data warehouse . Data warehouse is a collection of data storage , Its purpose is to analyze business data and support decision-making . The input of data warehouse is the data generated by each application system of the enterprise , Filter enterprise data through a specific paradigm model , Consolidation and storage , And then provide it to the data application . It can provide... For enterprises bi Ability , The purpose is to better control and guide the business process of the enterprise .
The structure of data warehouse includes Bill Inmon Paradigms and Ralph Kimball normal form , In order to bi Data report is the final application of small and medium-sized data projects , Usually used Kimball Paradigm as a development guidance model , But usually more ods Layer stores the data introduced by the business system , Similar to a hybrid architecture .
Kimball normal form 
Inmon normal form 
Layered design
Data import layer (ODS): Structured processing of unstructured data , Structural data usually uses the offline data synchronization method to synchronize the data of the business system in full or increment , Prepare for data conversion and loading .
Data common layer (CDM):1, Dimension table (DIM), Through the sorting of business , Establish consistent data analysis dimensions . The unified definition and management of dimension tables is often an important factor in whether the final project is controllable .2, A detailed fact sheet (DWD), Dimension based modeling , Reuse relational Computing , Reduce data scanning .3, Public summary (DWS), It is usually summarized on the basis of the schedule , Build a unified statistical index , Creating a wide table .
Data application layer (ADS): complex , Individualization , Customized calculation index . Application based wide table data assembly .
Dimensional modeling
As mentioned above , With bi The modeling method based on application usually adopts Kimball normal form , The logical model is star model or snowflake model . Considering the complexity control , In actual development, we often use star model . The logical structure of a typical star model is shown in the figure below , A topic table is associated with multiple dimension tables , stay sql Layers are connected in different ways and accumulated , The use of semi cumulative function produces data on this basis , Do multidimensional analysis .
What do low code platforms do ?
Based on dimension model , In the scenario of report and chart development , Can design ide Generate model agreement and report agreement by dragging . The report protocol metadata is dynamically generated through engine parsing sql, get data . A model is listed below , agreement , Reports and dynamically generated sql Simple example of .
Suppose there's a model :
And then there's a ide Take over the whole development process ,ide Metadata of topics can be generated , Dimension metadata , Multidimensional model metadata .
Now there is a report requirement as shown in the figure below .
Then the implementer can pass ide Drag and drop method to pull indicators and dimensions ,
Suppose such a report agreement is generated ( Just the simplest example description ).
{ “id”:xxx,
“modelid”: Linking multidimensional models id, You can get the table and connection relationship ,
“reporttype”: “linebar”,
“rows”: [
{
“code”: “xxx”,
“dimid”:“ Metadata of linked dimension table ”
“elementname”: “yearmonth”,
“alias”: “ years ”
}
],
“indexs”: [
{
“code”: “xxx”,
“factid”:“ Metadata of linked topic table ”
“elementname”: “index”,
“aggregator”: “sum”,
“alias”: “ indicators ”
}
]
}
],
“options”: { Front end interaction related parameter Protocol , For example, the definition of time control },
“echartsoption”: {echarts Chart related parameters }
}
Then when the front end requests the report , The engine parses the Report Protocol , Combined with metadata such as model and subject dimension, as well as the transfer parameters of the front end of this request , Dynamically generate the requested sql( be based on postgres), similar
select
dim_date.yearmonth as years ,
sum(fact.index) as indicators
from fact fact
left join dim_date dim_date on fact.date=dim_date.date
where to_char(dim_date.date,'yyyy')='2021'
group by dim_date.yearmonth
Use this sql Count , Then splice the protocols that return data , Front end rendering to echarts In the line chart component
Such a process , And that's what happened bi Low code development of reports .
Ideal vs reality , Stepping on pits and thinking in practice
1, The pain of being downstream
Offline data warehouse cannot record all operations and operation results of business system , Data synchronization can only be periodic snapshots of synchronized data . If the business system appears similar ABA This kind of problem , The diffusion of this situation in the business system will lead to the logical inconsistency of data in the data application . And the investigation process is unbearable , In the end, they often help find out the hidden business system bug. Therefore, the perfection and robustness of business system is often the key factor for the success of data warehouse .
2,olap Database selection
Before the selection of the company database postgres, Known as the most advanced open source database . Since I joined vipshop mysql Comparison of , You can still feel postgres Strong in some aspects of performance . On some small projects ,pg Can be flattened etl Work and olap analysis . But there's a lot of data , Reuse pg It's just that the little horse pulled the cart a little reluctantly . And we often encounter a scenario in business , It is the year-on-year comparison of the calculation indicators . It is difficult to deal with the above multidimensional model and report engine directly , such as , The data synchronized by the business system has this year's data in some dimensions, but there is no last year's data , This kind of data is missing, and the report display often cannot be self consistent . In this case, if you don't manually push the stored procedure to produce a report, you can only etl The phase expands all dimension values first with Cartesian product , And then it corresponds to update Corresponding indicators , It will lead to the terrible expansion of the table data , A report can't be printed in a few minutes . But if you split the model into several , It also loses the significance of unified modeling . This has always been a thorny problem .
3, Report engine capability boundary and thinking without code
Roast before that the platform report engine is becoming more and more complex , The complexity and mental burden of using report engine drag and drop method to match a usable report is no less than that of using it directly sql Roll yourself up , The boss said that the report engine is not just to simplify the development process , It also helps precipitate the business model . I have discussed with my friends a long time ago , If the codeless platform wants to be all inclusive, the final result can only be to come up with another set of Turing complete language . So I have a thought , If only the business is precipitated in the static multidimensional model , The interaction process is entrusted to a general report engine , I don't think we can cope with customers' tricky changing needs and mental burden significantly lower than handwriting sql Have both . As a platform developer , Recognize your boundaries and carefully explore what you can do in specific areas , It's a huge challenge .
边栏推荐
- 华为机试题-20190417
- RMAN incremental recovery example (1) - without unbacked archive logs
- PHP Session原理简析
- Sqli-labs customs clearance (less2-less5)
- Common prototype methods of JS array
- Basic knowledge of software testing
- 腾讯机试题
- Pyspark build temporary report error
- Only the background of famous universities and factories can programmers have a way out? Netizen: two, big factory background is OK
- 2021-07-17C#/CAD二次开发创建圆(5)
猜你喜欢

mapreduce概念和案例(尚硅谷学习笔记)

Write a thread pool by hand, and take you to learn the implementation principle of ThreadPoolExecutor thread pool

Sqli Labs clearance summary - page 2

架构设计三原则

PHP Session原理简析

Principle analysis of spark

MySQL无order by的排序规则因素

腾讯机试题

Sqli-labs customs clearance (less15-less17)

Ingress Controller 0.47.0的Yaml文件
随机推荐
Changes in foreign currency bookkeeping and revaluation general ledger balance table (Part 2)
图解Kubernetes中的etcd的访问
Oracle segment advisor, how to deal with row link row migration, reduce high water level
使用 Compose 实现可见 ScrollBar
Explain in detail the process of realizing Chinese text classification by CNN
JSP智能小区物业管理系统
软件开发模式之敏捷开发(scrum)
UEditor . Net version arbitrary file upload vulnerability recurrence
UEditor .Net版本任意文件上传漏洞复现
Anti shake and throttling of JS
php中时间戳转换为毫秒以及格式化时间
SQLI-LABS通关(less18-less20)
2021-07-19C#CAD二次开发创建多线段
Spark的原理解析
优化方法:常用数学符号的含义
PM2 simple use and daemon
Module not found: Error: Can't resolve './$$_ gendir/app/app. module. ngfactory'
ARP攻击
Oracle EBS interface development - quick generation of JSON format data
SSM实验室设备管理