How to build a quasi-real-time data warehouse?
2022-08-02 20:43:00 【software testnet】
Real-time data warehouses fall into two categories. One is the standard real-time data warehouse, in which every ETL process is computed and landed in real time by Spark, Flink, or a similar engine. The other is a simplified real-time data warehouse, or even just a simple upgrade of an offline data warehouse; this type is called a quasi-real-time data warehouse.
Next, this article focuses on the application scenarios of the quasi-real-time data warehouse.
Put simply, a quasi-real-time data warehouse has some latency. Compared with an offline data warehouse that computes only once a day, a quasi-real-time warehouse produces results by the hour, minute, or second, depending on business needs. Here we take 5 minutes as the limit, producing results every 5 minutes. Such a warehouse can be built on Structured Streaming: this is offline-style processing on top of streaming data, with batches divided by time, and all of the data is handled on the stream computing engine, that is, on Structured Streaming.
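The idea of dividing a stream into batches by time can be sketched in plain Python. This is only an illustration of the 5-minute bucketing, not how Structured Streaming is implemented; the event payloads and the function names `batch_start` and `group_into_batches` are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timezone

BATCH_SECONDS = 5 * 60  # the 5-minute limit used in the text


def batch_start(ts: datetime) -> datetime:
    """Round a timestamp down to the start of its 5-minute batch."""
    epoch = int(ts.timestamp())
    return datetime.fromtimestamp(epoch - epoch % BATCH_SECONDS, tz=timezone.utc)


def group_into_batches(events):
    """Group (timestamp, payload) pairs by the batch their timestamp falls in."""
    batches = defaultdict(list)
    for ts, payload in events:
        batches[batch_start(ts)].append(payload)
    return dict(batches)


# Hypothetical browsing events; the last one falls exactly on a batch boundary.
events = [
    (datetime(2022, 8, 2, 20, 41, 10, tzinfo=timezone.utc), "click:article-1"),
    (datetime(2022, 8, 2, 20, 44, 59, tzinfo=timezone.utc), "click:article-2"),
    (datetime(2022, 8, 2, 20, 45, 0, tzinfo=timezone.utc), "click:article-3"),
]
batches = group_into_batches(events)
```

The first two events land in the batch starting at 20:40 and the third opens the batch starting at 20:45, so a result for 20:40 to 20:45 can be published as soon as that batch closes.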
Real-time data warehouse projects differ by industry and domain. Take news and information apps as an example, such as Toutiao, Yidian Zixun, Tencent News, NetEase News, Baidu Browser, 360 Browser, Sina, and Sohu. What data sources does this type of application have? They generally include user information, privacy-related data, and the related business and user revenue data; the behavior logs left when users browse articles; and the logs of content that users publish. All of this information is first collected into Kafka.
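After that, Spark Structured Streaming consumes the raw data from Kafka. It is worth emphasizing that there are three reasons for choosing Spark Structured Streaming. First, it unifies stream and batch processing, so it can also handle batch computation. Second, it supports the file sink, which provides end-to-end consistency semantics. Third, it lets you control when data is sunk to HDFS, for example by setting a 5-minute batch interval, which keeps latency low and processing fast.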
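A minimal PySpark sketch of this step might look as follows. The broker address, topic name, and HDFS paths are placeholders invented for illustration, and the function assumes it is handed an existing `SparkSession` that has the Kafka connector package available.

```python
# Hypothetical values: broker address, topic, and HDFS paths are placeholders.
KAFKA_OPTIONS = {
    "kafka.bootstrap.servers": "kafka-broker:9092",
    "subscribe": "user_behavior_log",
    "startingOffsets": "latest",
}
SINK_OPTIONS = {
    "path": "hdfs:///warehouse/ods/user_behavior_log",
    "checkpointLocation": "hdfs:///checkpoints/user_behavior_log",
}
TRIGGER_INTERVAL = "5 minutes"  # the batch interval discussed in the text


def start_kafka_to_hdfs(spark):
    """Start a streaming query that lands raw Kafka records on HDFS as Parquet."""
    raw = spark.readStream.format("kafka").options(**KAFKA_OPTIONS).load()
    # Kafka delivers the value as binary; cast it to a string for later parsing.
    lines = raw.selectExpr("CAST(value AS STRING) AS line", "timestamp")
    return (
        lines.writeStream.format("parquet")  # file sink
        .options(**SINK_OPTIONS)
        .outputMode("append")
        .trigger(processingTime=TRIGGER_INTERVAL)
        .start()
    )
```

The checkpoint location is what lets the file sink recover after a failure without duplicating output, and the processing-time trigger is the knob that controls how often a batch is written to HDFS.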
When sinking to HDFS, you can choose whether or not to use Hudi. If you write data directly to HDFS with Spark Streaming, you inevitably have to deal with the small-file problem, for which there are four common approaches. First, increase the batch size, although this also increases latency. Second, merge files within a partition. Third, merge files with an external program. Fourth, if a file has not reached the specified size, do not create a new file when writing the next batch, but merge the new data into the existing small file instead. Each of these four methods has its own use cases, and whichever you choose, it means extra work. If you write the data through Hudi, however, Hudi handles the small-file problem for you.
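As a rough illustration of the merging approaches, the sketch below plans which small files to combine so that each merged file stays under a target size. It is a simple greedy grouping, not what Hudi or any particular tool does internally; the 128 MB target and the file names are assumptions.

```python
MB = 1024 * 1024
TARGET_BYTES = 128 * MB  # assumed target file size


def plan_compaction(file_sizes, target=TARGET_BYTES):
    """Greedily group small files so each group merges into at most `target` bytes.

    `file_sizes` maps file name -> size in bytes. Files at or above the target
    are left alone. Returns only groups that actually merge two or more files.
    """
    small = sorted(
        ((name, size) for name, size in file_sizes.items() if size < target),
        key=lambda item: item[1],
        reverse=True,
    )
    groups, current, current_size = [], [], 0
    for name, size in small:
        # Close the current group when adding this file would exceed the target.
        if current and current_size + size > target:
            groups.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        groups.append(current)
    return [group for group in groups if len(group) > 1]


# Hypothetical output files of a few 5-minute batches.
files = {
    "part-0": 200 * MB,
    "part-1": 60 * MB,
    "part-2": 50 * MB,
    "part-3": 40 * MB,
    "part-4": 10 * MB,
}
plan = plan_compaction(files)
```

Here `part-0` is already large enough and is skipped, while the four small files are paired into two merge groups. A real job would then rewrite each group as one file and remove the originals, which is exactly the extra work the text refers to.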
There is another problem. Apart from user behavior event logs, which are never updated, a lot of business data needs to be updated in real time, for example when users modify their information. HDFS itself does not support updates, so modifying data requires a complicated process, and the timeliness of the data cannot be guaranteed along the way. With Hudi, data updates are supported with relatively short latency, for example at the minute level, and Hudi also supports ACID.
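The update semantics in question can be illustrated with a small upsert sketch: records are matched by a key, and for the same key the record with the latest ordering value wins. This only mimics the observable behavior; Hudi's actual write path is considerably more involved, and the field names `user_id` and `updated_at` are hypothetical.

```python
def upsert(table, changes, key="user_id", ordering="updated_at"):
    """Merge `changes` into `table` by record key; the latest `ordering` value wins."""
    merged = {row[key]: row for row in table}
    for row in changes:
        existing = merged.get(row[key])
        # Insert unseen keys; replace existing rows only with newer (or equal) ones.
        if existing is None or row[ordering] >= existing[ordering]:
            merged[row[key]] = row
    return sorted(merged.values(), key=lambda row: row[key])


users = [
    {"user_id": 1, "nickname": "alice", "updated_at": 100},
    {"user_id": 2, "nickname": "bob", "updated_at": 100},
]
changes = [
    {"user_id": 2, "nickname": "bobby", "updated_at": 160},  # update of an existing user
    {"user_id": 3, "nickname": "carol", "updated_at": 170},  # brand-new user
    {"user_id": 1, "nickname": "stale", "updated_at": 50},   # late, older record: ignored
]
result = upsert(users, changes)
```

On plain HDFS, achieving the same effect means rewriting whole files by hand, which is why the text describes the process as complicated.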
When the raw data lands on HDFS, you can do some preprocessing during the landing process, such as the data processing previously done in a Flume Interceptor. Afterwards, we can create the corresponding external tables in Hive. These tables form one layer of the hierarchy, called the ODS layer. They hold the rawest data and make up the first layer of the warehouse.
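As a sketch of what such an ODS table might look like, the helper below builds a Hive `CREATE EXTERNAL TABLE` statement. The `ods` database name, the column list, the Parquet format, and the HDFS location are all assumptions for illustration; the real schema depends on how the data was landed.

```python
def ods_external_table_ddl(table, columns, location, database="ods"):
    """Build a Hive DDL statement for an external table over files already on HDFS."""
    cols = ",\n  ".join(f"`{name}` {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (\n  {cols}\n)\n"
        "STORED AS PARQUET\n"
        f"LOCATION '{location}'"
    )


# Hypothetical table over the landed behavior logs.
ddl = ods_external_table_ddl(
    table="user_behavior_log",
    columns=[("line", "STRING"), ("timestamp", "TIMESTAMP")],
    location="hdfs:///warehouse/ods/user_behavior_log",
)
print(ddl)
```

Because the table is external, dropping it in Hive removes only the metadata and leaves the landed files on HDFS untouched, which suits a layer that holds the raw data.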
Once the ODS-layer Hive tables are built, you can query the data according to business needs. Whether to build higher warehouse layers depends on the needs of the business. After the raw data has been mapped to the Hive ODS layer, it is analyzed with the Presto analysis engine, a memory-based computing framework that computes much faster than Hive and Spark.
Presto queries carry out the OLAP analysis. We integrate the Spring Boot framework and connect to Presto over JDBC to expose a query interface for analysts.