Real-time data warehouse

2022-07-04 14:22:00 【This program ape is so beautiful】

This article is just a summary of my experience with real-time data warehouses. In terms of architecture and data flow, a real-time warehouse is much like an offline one, but real-time processing has its own peculiarities.

Why have a real-time data warehouse?

We already have an offline data warehouse. The purpose of a data warehouse is reuse, but offline data is T+1. With a large volume of real-time requirements, the earlier offline computations cannot be reused, so a lot of repetitive real-time code gets written, and the cost of development and computing resources keeps growing.

Real-time data warehouse layering

ODS: raw data, including logs and business data

DWD

DIM

DWM

DWS

ADS

DWD

Each table gets its own topic: business data such as order records is written back to Kafka. Log data is emitted through side output streams (in SQL, this means multiple INSERT statements, each with its own filter). The main log types are startup and exit logs, page logs (pages only, i.e. pv logs), and behavior logs. Different kinds of data have completely different structures, so they need to be split.
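As an illustration only, the split by log type can be sketched in plain Python. The `type` field and every topic name below are hypothetical; the article does not name its fields or topics, and a real job would do this inside Flink and write to Kafka rather than to in-memory lists.

```python
from collections import defaultdict

# Hypothetical topic names, one per log type; not taken from the article.
TOPICS = {
    "start": "dwd_start_log",
    "page": "dwd_page_log",
    "action": "dwd_action_log",
}
# Hypothetical catch-all for records whose type is unknown.
DIRTY_TOPIC = "dwd_dirty_log"


def route(record: dict) -> str:
    """Pick the output topic from the (assumed) `type` field of a log record."""
    return TOPICS.get(record.get("type"), DIRTY_TOPIC)


def split_stream(records):
    """Group records by target topic, one output per log type."""
    outputs = defaultdict(list)
    for record in records:
        outputs[route(record)].append(record)
    return dict(outputs)
```

Each output list stands in for one topic, so downstream jobs only ever see records with a single structure.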

At the same time, filter out invalid values, for example by checking timestamps and uids (mainly regular-expression matching; in our case, a 13-digit number). In addition, ODS contains dimension data as well as fact data, and the dimension data needs to be written to DIM rather than DWD.
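A minimal sketch of this kind of check, under stated assumptions: the field names `ts`, `uid`, and `kind` are hypothetical, and since the text does not make clear which field the 13-digit rule belongs to, the sketch applies it to both.

```python
import re

# Exactly 13 ASCII digits, e.g. a millisecond epoch timestamp.
THIRTEEN_DIGITS = re.compile(r"[0-9]{13}")


def is_valid(record: dict) -> bool:
    """Return True when both `ts` and `uid` are present and are 13-digit numbers."""
    return all(
        THIRTEEN_DIGITS.fullmatch(str(record.get(field, ""))) is not None
        for field in ("ts", "uid")
    )


def target_layer(record: dict) -> str:
    """Dimension data goes to DIM; everything else (fact data) goes to DWD."""
    return "DIM" if record.get("kind") == "dimension" else "DWD"
```

Records that fail `is_valid` would be dropped or sent to a separate output for inspection, depending on how strict the pipeline needs to be.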

The core work of the DWD layer is splitting the data streams and identifying state.

DIM

As mentioned above, some of the ODS data is dimension data. Once Flink picks it up, it usually writes it straight to HBase, which makes it convenient to join streams against dimensions.
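The idea can be sketched with a plain dictionary standing in for the HBase table. This is an assumption-laden illustration: `upsert_dimension`, `enrich`, the `uid` row key, and the `dim_` prefix are all hypothetical names, and a real job would use an HBase client (often with a cache in front of it) instead of an in-memory dict.

```python
def upsert_dimension(table: dict, row_key: str, row: dict) -> None:
    """Write a dimension row, merging new columns into any existing row."""
    table[row_key] = {**table.get(row_key, {}), **row}


def enrich(event: dict, table: dict, key_field: str = "uid") -> dict:
    """Stream-to-dimension join: look up the event's key and attach the columns."""
    dim = table.get(event.get(key_field), {})
    # Prefix dimension columns so they cannot overwrite fields of the event.
    return {**event, **{f"dim_{k}": v for k, v in dim.items()}}
```

An event whose key has no dimension row passes through unchanged here; whether to drop, delay, or default such events is a design choice the article does not discuss.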

DWM

The DWM layer exists mainly because real-time computation is expensive to develop, operate, and maintain, yet the DWD -> DWS computations still contain a lot of repeated work. DWM extracts that repeated part into a shared layer.

Take the order wide table as an example: it requires joining the order table with the order detail table and the dimension tables. We process this once into a wide table, and in DWS the various behavior or order jobs simply use the joined data from DWM.

This layer tends to involve many stream-to-stream joins and stream-to-dimension joins.
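To make the order wide table concrete, here is a batch-style sketch of the join logic. The field names (`order_id`, `sku_id`) and the single SKU dimension are hypothetical. A real streaming job would also have to keep state and bound how long it waits for the matching side, which this sketch ignores.

```python
def build_order_wide(orders, details, sku_dim):
    """Inner-join order details to orders, then attach dimension columns."""
    orders_by_id = {o["order_id"]: o for o in orders}
    wide_rows = []
    for detail in details:
        order = orders_by_id.get(detail["order_id"])
        if order is None:
            continue  # no matching order: skip (inner join semantics)
        # Missing dimension rows contribute no extra columns.
        wide_rows.append({**order, **detail, **sku_dim.get(detail["sku_id"], {})})
    return wide_rows
```

Because the wide row is produced once, downstream DWS jobs can read it without repeating either join.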

DWS

Light aggregation, to serve all kinds of real-time queries and relieve query pressure.

It groups more real-time data by subject for easier management, and it also reduces the number of dimension lookups.

How to design a DWS table depends mainly on dimensions + measures (fact data).

Measures include uv, pv, bounce count, page entry count (session_count), continuous visit duration, and so on.

The main dimensions are channel, region, version, domestic/overseas, new/returning user, and system (iOS, Android, PC).

The job takes in detail data, merges the streams into one common data format, then aggregates by window and writes the result to a database (ClickHouse in our case).
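The merge-then-aggregate step can be sketched as follows. Everything specific here is an assumption: the 10-second tumbling window, the `channel` dimension as the only grouping key, the source names `page` and `session`, and the field names. The sketch returns the aggregated rows instead of writing them to ClickHouse.

```python
from collections import defaultdict

WINDOW_MS = 10_000  # assumed 10-second tumbling window


def normalize(event: dict, source: str) -> dict:
    """Map a source-specific record into one common format."""
    return {
        "ts": event["ts"],
        "channel": event.get("channel", "unknown"),
        "uid": event["uid"],
        "pv": 1 if source == "page" else 0,
        "session_count": 1 if source == "session" else 0,
    }


def aggregate(events):
    """Sum pv and session_count, and count distinct uids, per (window_start, channel)."""
    acc = defaultdict(lambda: {"pv": 0, "session_count": 0, "uids": set()})
    for e in events:
        window_start = e["ts"] - e["ts"] % WINDOW_MS
        slot = acc[(window_start, e["channel"])]
        slot["pv"] += e["pv"]
        slot["session_count"] += e["session_count"]
        slot["uids"].add(e["uid"])
    return {
        key: {"pv": v["pv"], "session_count": v["session_count"], "uv": len(v["uids"])}
        for key, v in acc.items()
    }
```

Note that `uv` here is distinct users per window only. A daily uv needs deduplication across windows, which this sketch does not attempt.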