How to build a quasi-real-time data warehouse?
2022-08-02 20:43:00 【software testnet】
Real-time data warehouses fall into two categories. The first is the standard real-time data warehouse, in which every ETL step is computed and landed in real time by an engine such as Spark or Flink. The second is a simplified real-time warehouse, sometimes no more than a modest upgrade of an offline data warehouse. This second type is called a quasi-real-time data warehouse.
Next, this article walks through the application scenarios of the quasi-real-time data warehouse.
Put simply, a quasi-real-time data warehouse accepts some delay. An offline data warehouse computes its statistics only once a day, whereas a quasi-real-time warehouse produces results every hour, every few minutes, or every few seconds, depending on business needs. Here we take 5 minutes as the limit: results are produced every 5 minutes. A warehouse like this can be built on Structured Streaming. In effect it is an offline-style operation on streaming data: the data is divided into batches by time, while the whole pipeline runs on the stream computing engine, that is, on Structured Streaming.
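The idea of dividing a stream into batches by time can be sketched in plain Python. This is only an illustration of the windowing arithmetic, not how Structured Streaming is implemented; the function names and the sample events are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timezone

BATCH_SECONDS = 5 * 60  # the 5-minute limit used in the article


def batch_start(ts: datetime, batch_seconds: int = BATCH_SECONDS) -> datetime:
    """Floor a timezone-aware timestamp to the start of its batch window."""
    epoch = int(ts.timestamp())
    return datetime.fromtimestamp(epoch - epoch % batch_seconds, tz=timezone.utc)


def split_into_batches(events):
    """Group (timestamp, payload) pairs by the 5-minute window they fall in."""
    batches = defaultdict(list)
    for ts, payload in events:
        batches[batch_start(ts)].append(payload)
    return dict(batches)


# Hypothetical behavior events; the last one falls into the next window.
events = [
    (datetime(2022, 8, 2, 20, 41, 10, tzinfo=timezone.utc), "click:a1"),
    (datetime(2022, 8, 2, 20, 44, 59, tzinfo=timezone.utc), "click:a2"),
    (datetime(2022, 8, 2, 20, 45, 0, tzinfo=timezone.utc), "click:a3"),
]
batches = split_into_batches(events)
```

Each key of `batches` is the start of a window, so results for a window can be computed as soon as that window closes.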
Real-time data warehouse projects differ by industry and domain. Take news and information products as an example: Toutiao (今日头条), Yidian Zixun (一点资讯), Tencent News, NetEase News, Baidu Browser, 360 Browser, Sina, Sohu, and so on. What data sources does this kind of application have? They generally include user information, privacy-related data, and related business and user-revenue data; the behavior logs left by users as they browse articles; and the logs of content that users publish. All of this is first collected into Kafka.
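The next step is to consume the raw data in Kafka with Spark Structured Streaming. There are three reasons for choosing Spark Structured Streaming. First, it unifies stream and batch processing, so the same engine can also handle batch computation. Second, it supports a file sink, which gives end-to-end consistency semantics. Third, it lets you control when data is sunk to HDFS. For example, setting the batch interval to 5 minutes keeps latency low while processing stays fast.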
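A minimal PySpark sketch of this step might look as follows. It assumes `spark` is a `SparkSession` started with the Kafka connector available; the function name, broker address, topic, and paths are hypothetical and not taken from the article.

```python
def start_kafka_to_hdfs(spark, bootstrap_servers, topic, output_path,
                        checkpoint_path, interval="5 minutes"):
    """Start a streaming query that lands raw Kafka records on HDFS as Parquet.

    `spark` is expected to be a pyspark.sql.SparkSession with the
    spark-sql-kafka connector on its classpath.
    """
    raw = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", bootstrap_servers)
        .option("subscribe", topic)
        .load()
    )
    # Kafka delivers the payload as binary; keep it as a string plus the
    # record timestamp supplied by the Kafka source.
    records = raw.selectExpr("CAST(value AS STRING) AS value", "timestamp")
    return (
        records.writeStream.format("parquet")       # file sink
        .option("path", output_path)
        .option("checkpointLocation", checkpoint_path)
        .trigger(processingTime=interval)           # one micro-batch per interval
        .start()
    )
```

With a real session, a call such as `start_kafka_to_hdfs(spark, "broker:9092", "user_behavior", "hdfs:///ods/user_behavior", "hdfs:///checkpoints/user_behavior")` returns a streaming query, and `awaitTermination()` on that query keeps the job running.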
When sinking to HDFS, you can choose whether or not to use Hudi. If you write data to HDFS directly from Spark Streaming, you inevitably run into the small-file problem. There are four common ways to handle it. First, increase the batch size, although this also increases latency. Second, merge files within each partition. Third, merge files with an external program. Fourth, when a file has not reached the specified size, do not create a new file for the next batch, and instead merge the new data into the existing small file. Each of the four methods has its own use cases, and whichever you choose means extra work. If you write the data through Hudi, however, Hudi takes care of the small-file problem for you.
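As an illustration of the merge-based approaches, the sketch below plans which small files to combine so that each merged file stays under a target size. It is a plain-Python greedy grouping under assumed inputs (file names mapped to sizes in bytes); it is not how Hudi or Spark handles small files internally, and `plan_compaction` is a hypothetical helper.

```python
def plan_compaction(file_sizes, target_bytes):
    """Greedily group small files so that each group can be rewritten as one
    file of at most `target_bytes`. Files at or above the target are skipped.
    Returns a list of groups, each a list of file names to merge."""
    small = sorted(
        ((name, size) for name, size in file_sizes.items() if size < target_bytes),
        key=lambda item: item[1],
        reverse=True,
    )
    groups = []  # each entry is [total_size, [file names]]
    for name, size in small:
        for group in groups:
            if group[0] + size <= target_bytes:
                group[0] += size
                group[1].append(name)
                break
        else:
            groups.append([size, [name]])
    # Rewriting a group that holds a single file gains nothing.
    return [names for _, names in groups if len(names) > 1]


MB = 1024 * 1024
files = {
    "part-0": 130 * MB,  # already large enough
    "part-1": 60 * MB,
    "part-2": 50 * MB,
    "part-3": 40 * MB,
    "part-4": 10 * MB,
}
plan = plan_compaction(files, target_bytes=128 * MB)
```

Here `plan` contains one group, `part-1`, `part-2`, and `part-4`, which together come to 120 MB; `part-3` is left for a later round because it has no partner that fits.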
There is one more problem. Apart from user behavior event logs, which are never updated, a lot of business data has to be updated in real time, for example when users modify their information. HDFS itself does not support updates, so modifying data requires a complicated process, and the timeliness of the data cannot be guaranteed along the way. With Hudi, updates are supported with a relatively short delay, for example at the minute level, and Hudi also supports ACID.
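What an update by key means can be shown with a small plain-Python sketch. It only illustrates upsert semantics (match on a record key, keep the newest version according to an ordering field); it is not Hudi's implementation, and the `user_id` and `updated_at` fields are hypothetical.

```python
def upsert(table, updates, key="user_id", ordering="updated_at"):
    """Apply `updates` to `table` (both lists of dicts) and return a new table.

    Rows are matched on `key`. When the same key appears more than once,
    the row with the largest `ordering` value wins; unseen keys are inserted.
    """
    merged = {row[key]: row for row in table}
    for row in updates:
        current = merged.get(row[key])
        if current is None or row[ordering] >= current[ordering]:
            merged[row[key]] = row
    return sorted(merged.values(), key=lambda row: row[key])


users = [
    {"user_id": 1, "city": "Beijing", "updated_at": 100},
    {"user_id": 2, "city": "Shanghai", "updated_at": 100},
]
changes = [
    {"user_id": 2, "city": "Shenzhen", "updated_at": 160},   # newer, replaces row 2
    {"user_id": 3, "city": "Hangzhou", "updated_at": 170},   # new key, inserted
    {"user_id": 2, "city": "Guangzhou", "updated_at": 150},  # late and older, ignored
]
result = upsert(users, changes)
```

On plain HDFS files, getting the same effect means rewriting the affected files yourself, which is the complicated process the paragraph above refers to.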
While the raw data is landing on HDFS, you can do some preprocessing along the way, such as the data processing that used to be done in a Flume Interceptor. Afterwards, you create the corresponding external tables in Hive. These tables form a layer of their own, called the ODS layer. They hold the most original data and make up the first layer of the warehouse.
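As a sketch of that step, the helper below generates a Hive `CREATE EXTERNAL TABLE` statement over a directory of Parquet files. The database name `ods`, the table name, the columns, and the location are all assumptions made for illustration; the article does not specify them.

```python
def ods_external_table_ddl(table, columns, location):
    """Build a Hive DDL statement for an external ODS table over Parquet files.

    `columns` is a list of (name, hive_type) pairs; `location` is the HDFS
    directory that the streaming job writes to.
    """
    cols = ",\n  ".join(f"`{name}` {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS ods.{table} (\n  {cols}\n)\n"
        f"STORED AS PARQUET\n"
        f"LOCATION '{location}'"
    )


ddl = ods_external_table_ddl(
    "user_behavior",
    [("value", "STRING"), ("timestamp", "TIMESTAMP")],
    "hdfs:///warehouse/ods/user_behavior",
)
print(ddl)
```

Because the table is external, dropping it in Hive removes only the metadata and leaves the files on HDFS untouched.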
Once the ODS-layer Hive tables are in place, you can query the data as the business requires. Whether to build higher warehouse layers on top also depends on business needs. With the raw data mapped to the Hive ODS layer, the data is ready for analysis. The analysis uses Presto, a memory-based query engine that computes much faster than Hive and Spark.
Presto queries carry out the OLAP analysis. The Spring Boot framework is integrated and connects to Presto over JDBC, exposing a query interface for analysts to use.
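The article's query service is built with Spring Boot and JDBC. The same shape can be sketched in Python against any DB-API 2.0 connection. In the sketch below, an in-memory `sqlite3` database stands in for Presto so the example runs on its own; the table, columns, and query are hypothetical.

```python
import sqlite3


def query_rows(conn, sql):
    """Run a read-only query on a DB-API 2.0 connection and return the
    result as a list of dicts keyed by column name."""
    cursor = conn.cursor()
    try:
        cursor.execute(sql)
        names = [col[0] for col in cursor.description]
        return [dict(zip(names, row)) for row in cursor.fetchall()]
    finally:
        cursor.close()


# Stand-in for the Presto connection: an in-memory SQLite database
# holding a few rows of a hypothetical ODS behavior table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ods_user_behavior (article_id TEXT, user_id INTEGER)")
conn.executemany(
    "INSERT INTO ods_user_behavior VALUES (?, ?)",
    [("a1", 1), ("a1", 2), ("a2", 1)],
)

top_articles = query_rows(
    conn,
    "SELECT article_id, COUNT(*) AS views "
    "FROM ods_user_behavior GROUP BY article_id ORDER BY views DESC",
)
```

A service endpoint would wrap `query_rows` and return the rows as JSON; swapping the stand-in for a real Presto connection should leave the function itself unchanged, provided the client follows DB-API 2.0.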