How to build a quasi-real-time data warehouse?
2022-08-02 20:43:00 【software testnet】
Real-time data warehouses fall into two categories. The first is the standard real-time data warehouse, in which every ETL step is computed and persisted in real time by an engine such as Spark or Flink. The second is a simplified real-time data warehouse, sometimes little more than an upgrade of an offline data warehouse. This second type is called a quasi-real-time data warehouse.
This article focuses on the application scenarios of the quasi-real-time data warehouse.
Put simply, a quasi-real-time data warehouse has some latency. Compared with an offline data warehouse that computes results only once a day, a quasi-real-time warehouse produces results by the hour, minute, or second, depending on business needs. Here we take 5 minutes as the limit: results are produced every 5 minutes. A quasi-real-time data warehouse of this kind can be built on Structured Streaming. This is essentially offline-style processing applied to streaming data: the data is divided into batches by time, and all of it runs on the stream computing engine, that is, on Structured Streaming.
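To make the "divide batches according to time" idea concrete, here is a minimal sketch in plain Python. It is not Structured Streaming itself; it only shows how events can be assigned to fixed 5-minute windows. The function names and the sample events are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Assumed batch interval, taken from the 5-minute limit used in the text.
WINDOW = timedelta(minutes=5)
EPOCH = datetime(1970, 1, 1)


def window_start(ts, window=WINDOW):
    """Round a naive timestamp down to the start of its window."""
    whole_windows = (ts - EPOCH) // window  # number of complete windows since the epoch
    return EPOCH + whole_windows * window


def batch_by_window(events):
    """Group (timestamp, payload) pairs into batches keyed by window start."""
    batches = defaultdict(list)
    for ts, payload in events:
        batches[window_start(ts)].append(payload)
    return dict(batches)


# Hypothetical events: two fall in the 20:40 window, one in the 20:45 window.
sample_events = [
    (datetime(2022, 8, 2, 20, 41), "view:article-1"),
    (datetime(2022, 8, 2, 20, 43), "view:article-2"),
    (datetime(2022, 8, 2, 20, 45), "publish:article-3"),
]
sample_batches = batch_by_window(sample_events)
```

Each key in `sample_batches` is the start of a 5-minute window, so each batch can be processed like a small offline job.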
Real-time data warehouse projects differ by industry and domain. Take news and information services as an example, such as Toutiao, Yidian Zixun, Tencent News, NetEase News, Baidu Browser, 360 Browser, Sina, and Sohu. What data sources does this type of application have? They generally include user information, privacy data, and related business and user revenue data; behavior logs left by users browsing articles; and logs of the content that users publish. All of this information is first collected into Kafka.
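Next, Spark Structured Streaming consumes the raw data from Kafka. It is worth emphasizing that there are three reasons for choosing Spark Structured Streaming. First, it unifies stream and batch processing, so the same engine can also handle batch calculations. Second, it supports a file sink, which provides end-to-end consistency semantics. Third, it lets you control how often data is sunk to HDFS, for example by setting a 5-minute batch interval, which keeps latency low and processing fast.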
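The Kafka-to-HDFS step could look roughly like the following PySpark sketch. It assumes an existing `SparkSession` with the Kafka source package available, Parquet as the file format, and a 5-minute processing-time trigger. The broker address, topic name, and HDFS paths are hypothetical placeholders, not values from the original project.

```python
# Hypothetical connection settings; replace with real values.
KAFKA_OPTIONS = {
    "kafka.bootstrap.servers": "broker1:9092",
    "subscribe": "user_behavior_log",
    "startingOffsets": "latest",
}

# Hypothetical output and checkpoint locations on HDFS.
SINK_OPTIONS = {
    "path": "hdfs:///warehouse/ods/user_behavior_log",
    "checkpointLocation": "hdfs:///checkpoints/user_behavior_log",
}

TRIGGER_INTERVAL = "5 minutes"  # one micro-batch every 5 minutes


def start_ingest(spark):
    """Start a streaming query that copies Kafka records to Parquet files.

    `spark` is expected to be a pyspark.sql.SparkSession.
    """
    raw = spark.readStream.format("kafka").options(**KAFKA_OPTIONS).load()
    # Kafka delivers key and value as binary, so cast them to strings.
    events = raw.selectExpr(
        "CAST(key AS STRING) AS key",
        "CAST(value AS STRING) AS value",
        "timestamp",
    )
    return (
        events.writeStream.format("parquet")
        .options(**SINK_OPTIONS)
        .outputMode("append")
        .trigger(processingTime=TRIGGER_INTERVAL)
        .start()
    )
```

The checkpoint location is what lets the file sink recover after a failure without writing the same batch twice, so it should live on reliable storage.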
When sinking to HDFS, you can choose whether or not to use Hudi. If you write data directly to HDFS through Spark Streaming, you inevitably have to deal with the small-file problem. There are four common approaches. First, increase the batch size, which also increases latency. Second, merge files within each partition. Third, merge files with an external program. Fourth, if a file has not reached the specified size, do not create a new file when writing the next batch; instead, merge the new data into the existing small file. Each of these four methods has its own usage scenarios, and whichever you choose means extra work. If you write the data through Hudi, however, Hudi handles the small-file problem for you.
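As an illustration of the merge-based approaches, the sketch below plans which small files to combine so that each merged file stays at or below a target size. It only produces a plan from file sizes and does not touch HDFS. The function name, the greedy grouping strategy, and the sample sizes are assumptions made for this example, not the method used by Hudi or by any particular tool.

```python
def plan_compaction(file_sizes, target_bytes):
    """Group small files into merge batches of at most target_bytes each.

    file_sizes maps a file path to its size in bytes. Files already at or
    above the target are left alone. Returns a list of path lists; each
    inner list would be rewritten as a single larger file.
    """
    small = sorted(
        (size, path) for path, size in file_sizes.items() if size < target_bytes
    )
    groups, current, current_size = [], [], 0
    for size, path in small:
        # Close the current group when adding this file would exceed the target.
        if current and current_size + size > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        groups.append(current)
    # Merging a single file with nothing else gains nothing, so drop such groups.
    return [group for group in groups if len(group) > 1]


# Hypothetical sizes in bytes, with a 100-byte target to keep the example small.
sample_plan = plan_compaction(
    {"a": 10, "b": 20, "c": 30, "d": 200, "e": 90}, target_bytes=100
)
```

Here `a`, `b`, and `c` end up in one merge group, `d` is already large enough, and `e` is left alone because no other small file fits alongside it.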
There is one more problem. Apart from user behavior event logs, which are never updated, a lot of business data needs to be updated in real time, for example when users modify their information. HDFS itself does not support updates, so modifying data requires a complicated process, and the timeliness of the data cannot be guaranteed throughout that process. With Hudi, data updates are supported with relatively short delays, for example at the minute level, and Hudi also supports ACID.
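The update semantics described here can be sketched as a key-based upsert in which the newest version of each record wins. This is a plain-Python illustration of the idea only, not Hudi's API; the field names `user_id` and `updated_at` are hypothetical.

```python
def upsert(table, updates, key="user_id", ordering="updated_at"):
    """Insert new records and overwrite existing ones that are not newer.

    table is a dict keyed by record key. For each incoming record, the
    stored version is replaced only if the incoming ordering value is
    greater than or equal to the stored one.
    """
    for record in updates:
        record_key = record[key]
        existing = table.get(record_key)
        if existing is None or record[ordering] >= existing[ordering]:
            table[record_key] = record
    return table


# Hypothetical user records: the first batch inserts, the second updates.
users = {}
upsert(users, [
    {"user_id": 1, "nickname": "alice", "updated_at": 1},
    {"user_id": 2, "nickname": "bob", "updated_at": 1},
])
upsert(users, [
    {"user_id": 1, "nickname": "alice_2022", "updated_at": 5},  # newer, applied
    {"user_id": 2, "nickname": "stale", "updated_at": 0},       # older, ignored
])
```

On plain HDFS files the same effect requires rewriting whole files, which is the complicated process mentioned above.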
When the raw data lands on HDFS, you can do some preprocessing during the landing process, such as the data processing that was previously done in a Flume Interceptor. After that, you can create the corresponding external tables in Hive. These tables form one layer of the warehouse hierarchy, called the ODS layer. They hold the most raw data and are the first layer of the warehouse.
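An ODS-layer external table over the landed files might be declared with DDL like the one generated below. This is a sketch under assumptions: the files are stored as Parquet, the table is partitioned by a string date column named `dt`, and the table name, columns, and location are hypothetical.

```python
def ods_external_table_ddl(table, columns, location, partition_col="dt"):
    """Build a HiveQL statement for an external, date-partitioned Parquet table.

    columns is a list of (name, hive_type) pairs.
    """
    column_lines = ",\n  ".join(f"`{name}` {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n"
        f"  {column_lines}\n"
        f")\n"
        f"PARTITIONED BY (`{partition_col}` STRING)\n"
        f"STORED AS PARQUET\n"
        f"LOCATION '{location}'"
    )


# Hypothetical ODS table for the user behavior log.
sample_ddl = ods_external_table_ddl(
    "ods.user_behavior_log",
    [("user_id", "BIGINT"), ("article_id", "BIGINT"), ("event_type", "STRING")],
    "hdfs:///warehouse/ods/user_behavior_log",
)
```

Because the table is external, dropping it in Hive removes only the metadata and leaves the files on HDFS in place.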
Once the ODS-layer Hive tables are built, you can query the data according to business needs. Whether to build higher warehouse layers depends on the needs of the business. After the raw data is mapped to the Hive ODS layer, it is analyzed with Presto, an in-memory query engine whose queries run much faster than on Hive or Spark.
Presto queries carry out the OLAP analysis. A Spring Boot application connects to Presto over JDBC and exposes an external query interface for analysts.