How to build a quasi-real-time data warehouse?

2022-08-02 20:43:00 software testnet

Currently, data warehouses are divided into offline data warehouses and real-time data warehouses. Offline data warehouses generally use a traditional T+1 ETL scheme, while real-time data warehouses generally use ETL schemes with minute-level or even second-level latency. The underlying architectures also differ: offline data warehouses are generally built on the traditional big data architecture model, whereas real-time data warehouses are built on architectures such as Lambda and Kappa.

Real-time data warehouses are further subdivided into two categories. One is the standard real-time data warehouse, in which every ETL step is computed and persisted in real time by an engine such as Spark or Flink. The other is a simplified real-time data warehouse, or even just a simple upgrade of an offline data warehouse. This second type is called a quasi-real-time data warehouse.

Next, this article sorts out the application scenarios of the quasi-real-time data warehouse.

Put simply, a quasi-real-time data warehouse has some latency. Compared with an offline data warehouse, which computes only once a day, a quasi-real-time warehouse produces results by the hour, minute, or second, depending on business needs. Here we take 5 minutes as the boundary and produce results every 5 minutes. A quasi-real-time data warehouse can be built on Structured Streaming: it is effectively offline (batch) processing over streaming data, with batches divided by time, and the whole pipeline runs on the stream computing engine, that is, on Structured Streaming.
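To make "batches divided by time" concrete, here is a minimal pure-Python sketch of flooring event timestamps to 5-minute windows and grouping events by window. It only illustrates the idea and is not how Structured Streaming is implemented internally. The names `batch_start` and `group_into_batches` are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Assumed batch boundary, taken from the 5-minute figure in the text.
BATCH_INTERVAL = timedelta(minutes=5)


def batch_start(ts, interval=BATCH_INTERVAL):
    """Floor a timestamp to the start of its batch window.

    Works for intervals that evenly divide a day (5 minutes does).
    """
    step = int(interval.total_seconds())
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    elapsed = int((ts - midnight).total_seconds())
    return midnight + timedelta(seconds=elapsed - elapsed % step)


def group_into_batches(events):
    """Group (timestamp, payload) pairs by the window they fall into."""
    batches = defaultdict(list)
    for ts, payload in events:
        batches[batch_start(ts)].append(payload)
    return dict(batches)
```

Each key of the returned dictionary is the start of a 5-minute window, and each value holds the events that would be processed together in that batch.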

Real-time data warehouse projects differ by industry and domain. Take news and information apps as an example, such as Toutiao, Yidian Zixun, Tencent News, NetEase News, Baidu Browser, 360 Browser, Sina, and Sohu. What data sources does this type of application have? They generally include user information, privacy data, and the related business and user income data. There are also the behavior logs left by users browsing articles, and the logs of content that users publish. All of this information is first collected into Kafka.

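Next, Spark Structured Streaming consumes the raw data from Kafka. It is worth emphasizing that there are three reasons for choosing Spark Structured Streaming. First, it unifies stream and batch processing, so it can also handle batch computation. Second, it supports the file sink, which provides end-to-end consistency semantics. Third, the timing of the sink to HDFS can be controlled, for example by setting a 5-minute batch interval, which keeps latency low and processing fast.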
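The Kafka-to-HDFS step described above might look roughly like the following PySpark sketch. It assumes an existing `SparkSession` with the Kafka connector package available. The function name, the broker address, the topic, and both paths are placeholders rather than values from the article. Parquet is chosen here only as one example of a file sink format.

```python
# Assumed micro-batch interval, matching the 5-minute example in the text.
TRIGGER_INTERVAL = "5 minutes"


def start_kafka_to_hdfs(spark, bootstrap_servers, topic, output_path, checkpoint_path):
    """Start a streaming query that lands raw Kafka records as Parquet files.

    `spark` is an existing pyspark.sql.SparkSession (assumption: the
    spark-sql-kafka connector is on its classpath).
    """
    raw = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", bootstrap_servers)
        .option("subscribe", topic)
        .load()
    )
    # Kafka delivers the value as binary; keep it as a string plus the record timestamp.
    records = raw.selectExpr("CAST(value AS STRING) AS value", "timestamp")
    return (
        records.writeStream
        .format("parquet")                              # file sink
        .option("path", output_path)                    # e.g. an HDFS directory
        .option("checkpointLocation", checkpoint_path)  # needed for fault-tolerant recovery
        .outputMode("append")
        .trigger(processingTime=TRIGGER_INTERVAL)       # one micro-batch per interval
        .start()
    )
```

The function returns the running query object, so a caller would typically follow it with `awaitTermination()` to keep the job alive.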

When sinking to HDFS, you can choose whether or not to use Hudi. If Spark Streaming writes data directly to HDFS, the small-file problem is unavoidable, and there are four common ways to handle it. First, increase the batch size, although this also increases latency. Second, merge partitions. Third, merge files with an external program. Fourth, if a file has not reached the specified size, do not create a new file when writing the next batch; instead, merge the data into the existing small file. Each of these four methods has its own usage scenarios, and whichever one you choose, it means extra work. If the data is written through Hudi, however, Hudi takes care of the small-file problem.
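As an illustration of the "merge with an external program" option, the sketch below plans which small files to merge so that each merged file stays under a target size. It is a simple greedy first-fit plan in plain Python, not Hudi's file-sizing logic. The function name and the way file sizes are passed in are assumptions, and the actual reading and rewriting of files is left out.

```python
def plan_compaction(file_sizes, target_size):
    """Greedily group small files into merge groups of at most target_size bytes.

    file_sizes maps file name -> size in bytes. Files already at or above
    target_size are left alone. Returns a list of lists of file names.
    """
    small = sorted(
        (item for item in file_sizes.items() if item[1] < target_size),
        key=lambda item: item[1],
        reverse=True,  # place the largest small files first
    )
    groups = []  # each entry is [total_bytes, [file names]]
    for name, size in small:
        for group in groups:
            if group[0] + size <= target_size:
                group[0] += size
                group[1].append(name)
                break
        else:
            groups.append([size, [name]])
    # A group holding a single file gains nothing from being rewritten.
    return [names for _, names in groups if len(names) > 1]
```

A scheduled job could run such a plan per partition and rewrite each group into one file, which is the kind of extra work that writing through Hudi avoids.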

There is another issue. Apart from user behavior event logs, which are never updated, a lot of business data needs to be updated in real time, for example when users modify their information. HDFS itself does not support updates, so modifying data requires a complicated process, and throughout that process the timeliness of the data cannot be guaranteed. Hudi provides data update support with relatively short latency, for example at the minute level, and it also supports ACID.
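A hedged sketch of what an upsert through Hudi could look like from PySpark follows. The option keys are standard Hudi datasource write options. The helper names are hypothetical, as are the table and column names in the usage note, and the Hudi Spark bundle is assumed to be on the classpath.

```python
def hudi_upsert_options(table_name, record_key, precombine_field, partition_field):
    """Build Hudi write options for an upsert.

    record_key identifies a row, and precombine_field decides which version
    wins when the same key appears more than once (the larger value is kept).
    """
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.partitionpath.field": partition_field,
    }


def upsert_batch(batch_df, base_path, options):
    """Upsert one batch DataFrame into the Hudi table stored at base_path."""
    # Save mode "append" with operation "upsert" updates existing keys in place.
    batch_df.write.format("hudi").options(**options).mode("append").save(base_path)
```

For user information, a call such as `hudi_upsert_options("user_info", "user_id", "updated_at", "dt")` would be one plausible configuration, with all four names being illustrative.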

When the raw data lands on HDFS, some preprocessing can be done during the landing process, such as the data processing work previously done in a Flume Interceptor. Afterwards, the corresponding external tables can be created in Hive. These tables form one layer, called the ODS layer. They hold the rawest data and are the first layer of the data warehouse.
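Creating an ODS-layer external table over the landed files could be sketched as below. The helper only generates a Hive `CREATE EXTERNAL TABLE` statement. The function name, the Parquet storage format, and the `dt` partition column are assumptions for illustration, not details given in the article.

```python
def ods_external_table_ddl(table, columns, location, partition_col="dt"):
    """Return Hive DDL for an external, partitioned Parquet table.

    columns is a list of (column_name, hive_type) pairs.
    """
    cols = ",\n  ".join(f"`{name}` {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        f"PARTITIONED BY (`{partition_col}` STRING)\n"
        "STORED AS PARQUET\n"
        f"LOCATION '{location}'"
    )
```

Because the table is external, dropping it in Hive leaves the files on HDFS untouched. If the data is laid out in `dt=...` directories, the partitions still have to be registered, for example with `MSCK REPAIR TABLE`.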

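Once the ODS-layer Hive tables are built, data can be queried according to business needs. Whether to build higher data warehouse layers also depends on the needs of the business. After the raw data is mapped to the Hive ODS layer, it can be analyzed. The analysis uses the Presto engine, a memory-based computing framework that is generally much faster than Hive and Spark for interactive queries.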

Presto queries carry out the OLAP analysis. The Spring Boot framework is integrated and connects to Presto over JDBC, exposing an external query interface for analysts.
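The article's query service uses Spring Boot with JDBC. To stay in one language, the sketch below shows the same idea in Python against the generic DB-API instead: run an OLAP-style query and return rows as dictionaries. An in-memory `sqlite3` database stands in for Presto so the example runs anywhere. The table name, columns, and query are hypothetical.

```python
import sqlite3


def run_query(conn, sql):
    """Run a read-only SQL query on a DB-API connection and return rows as dicts."""
    cur = conn.cursor()
    try:
        cur.execute(sql)
        names = [col[0] for col in cur.description]
        return [dict(zip(names, row)) for row in cur.fetchall()]
    finally:
        cur.close()


# Hypothetical analysis query: page views per channel.
PV_BY_CHANNEL = (
    "SELECT channel, COUNT(*) AS pv "
    "FROM ods_user_behavior_log "
    "GROUP BY channel ORDER BY pv DESC"
)


def demo():
    """Exercise run_query against an in-memory stand-in for the real engine."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE ods_user_behavior_log (user_id TEXT, channel TEXT)")
    conn.executemany(
        "INSERT INTO ods_user_behavior_log VALUES (?, ?)",
        [("u1", "tech"), ("u2", "tech"), ("u3", "sports")],
    )
    try:
        return run_query(conn, PV_BY_CHANNEL)
    finally:
        conn.close()
```

In a real deployment the connection would come from a Presto client or a JDBC driver, and a web endpoint would wrap `run_query` so that analysts do not need direct cluster access.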

Copyright notice
This article was written by [software testnet]. Please include the original link when reposting. Thank you.
https://yzsam.com/2022/214/202208021746214964.html