当前位置:网站首页>4. Data splitting of Flink real-time project
4. Data splitting of Flink real-time project
2022-07-03 21:27:00 【Contestant No.1 position】
1. Abstract
The log data we collected earlier has been saved to Kafka in , As log data ODS layer , from kafka Of ODS The log data read by the layer is divided into 3 class , Page log 、 Start log and exposure log . Although these three types of data are user behavior data , But it has a completely different data structure , So we need to split it . Write the split logs back to Kafka In different topics , As log DWD layer .
The page log is output to the main stream , The startup log is output to the startup side output stream , The exposure log is output to the exposure side output stream
2. Identify new and old users
The client business itself has the identification of new and old users , But it's not accurate enough , Need to reconfirm with real-time calculation ( Business operations are not involved , Just make a simple status confirmation ).
Data splitting is realized by using side output stream
According to the content of log data , Divide the log data into 3 class : Page log 、 Start log and exposure log . Push the data of different streams to the downstream kafka Different Topic in
3. Code implementation
In the bag app Create flink Mission BaseLogTask.java,
adopt flink consumption kafka The data of , Then record the consumption checkpoint Deposit in hdfs in , Remember to create the path manually , Then give the authority
checkpoint Optional use , You can turn it off during the test .
MyKafkaUtil.java Tool class
4. New and old visitor status repair
Rules for identifying new and old customers
Identify new and old visitors , The front end will record the status of new and old customers , Maybe not , Here again , preservation mid State of one day ( Save first visit date as status ), Wait until there is a log in the back equipment , Get the date from the status and compare the log generation date , If the status is not empty , And the status date is not equal to the current date , It means it's a regular visitor , If is_new Mark is 1, Then repair its state .
5. Data splitting is realized by using side output stream
After the above new and old customers repair , Then divide the log data into 3 class
Start log label definition : OutputTag<String> startTag = new OutputTag<String>("start"){};
And exposure log label definitions : OutputTag<String> displayTag = new OutputTag<String>("display"){};
The page log is output to the main stream , The startup log is output to the startup side output stream , The exposure log is output to the exposure log side output stream .
The data is split and sent to kafka
- dwd_start_log: start log
- dwd_display_log: Exposure log
- dwd_page_log: Page log
边栏推荐
- Capture de paquets et tri du contenu externe - - autoresponder, composer, statistiques [3]
- Service discovery and load balancing mechanism -service
- Analyse de REF nerf
- The 12th Blue Bridge Cup
- 内存分析器 (MAT)
- Pengcheng cup Web_ WP
- [vulnhub shooting range] impulse: lupinone
- Basic preprocessing and data enhancement of image data
- Inventory 2021 | yunyuansheng embracing the road
- QFileDialog
猜你喜欢

The post-90s resigned and started a business, saying they would kill cloud database

XAI+网络安全?布兰登大学等最新《可解释人工智能在网络安全应用》综述,33页pdf阐述其现状、挑战、开放问题和未来方向

JS three families

MySQL - database backup

使用dnSpy对无源码EXE或DLL进行反编译并且修改

gslb(global server load balance)技术的一点理解

Tidb's initial experience of ticdc6.0

Yyds dry goods inventory TCP & UDP

Pengcheng cup Web_ WP

仿网易云音乐小程序
随机推荐
Rhcsa third day operation
[Yugong series] go teaching course 002 go language environment installation in July 2022
MySQL——规范数据库设计
Collections SQL communes
The 12th Blue Bridge Cup
17 websites for practicing automated testing. I'm sure you'll like them
Solve the problem that openocd fails to burn STM32 and cannot connect through SWD
Station B, dark horse programmer, employee management system, access conflict related (there is an unhandled exception at 0x00007ff633a4c54d (in employee management system.Exe): 0xc0000005: read locat
APEC industry +: father of the king of the ox mill, industrial Internet "king of the ox mill anti-wear faction" Valentine's Day greetings | Asia Pacific Economic media | ChinaBrand
How to choose cache read / write strategies in different business scenarios?
Visiontransformer (I) -- embedded patched and word embedded
Basic preprocessing and data enhancement of image data
请教大家一个问题,用人用过flink sql的异步io关联MySQL中的维表吗?我按照官网设置了各种
Software testing skills, JMeter stress testing tutorial, obtaining post request data in x-www-form-urlencoded format (24)
The White House held an open source security summit, attended by many technology giants
Transformation between yaml, Jason and Dict
How to install sentinel console
MySQL——数据库备份
The post-90s resigned and started a business, saying they would kill cloud database
Global and Chinese market of gallic acid 2022-2028: Research Report on technology, participants, trends, market size and share