4. Data splitting of Flink real-time project
2022-07-03 21:27:00 【Contestant No.1 position】
1. Abstract
The log data we collected earlier has been saved to Kafka as the ODS layer of log data. The log data read from the Kafka ODS layer falls into 3 classes: page logs, start logs, and exposure logs. Although all three are user behavior data, their structures are completely different, so we need to split them and write the split logs back to different Kafka topics, which form the log DWD layer.
Page logs are output to the main stream, start logs to a start side-output stream, and exposure logs to an exposure side-output stream.
2. Identify new and old users
The client itself carries a new/old-visitor flag, but it is not accurate enough and needs to be re-confirmed by the real-time computation (no business operations are involved; it is just a simple state confirmation).
Data splitting is implemented with side output streams.
Based on the content of each record, the log data is divided into 3 classes: page logs, start logs, and exposure logs. Each stream is then pushed to a different downstream Kafka topic.
3. Code implementation
Create the Flink task BaseLogTask.java in the app package.
Consume the Kafka data with Flink, then store the consumption checkpoints in HDFS. Remember to create the HDFS path manually and grant it the proper permissions.
Checkpointing is optional; you can turn it off during testing.
MyKafkaUtil.java tool class
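As a sketch of what such a tool class might prepare (the broker address and group id below are placeholder assumptions, not from the article), the consumer Properties could be built like this; in the real class they would be handed to a Flink Kafka consumer together with a string deserialization schema:

```java
import java.util.Properties;

// Sketch of a Kafka tool class. The broker address is a placeholder
// assumption; replace it with your cluster's bootstrap servers.
class MyKafkaUtil {
    private static final String BOOTSTRAP_SERVERS = "hadoop102:9092";

    // Builds the consumer properties a Flink Kafka source would need.
    static Properties consumerProps(String groupId) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", BOOTSTRAP_SERVERS);
        props.setProperty("group.id", groupId);
        return props;
    }
}
```

Keeping the connection details in one tool class means every Flink task (BaseLogTask included) reads Kafka the same way.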
4. New and old visitor status repair
Rules for identifying new and old visitors
The front end records a new/old-visitor status, but it may be missing, so we confirm it again here. We keep state per mid (the first visit date is saved as the state). When a later log arrives for the same device, we take the date from the state and compare it with the date the log was generated. If the state is not empty and the state date is not equal to the current date, the device is a returning visitor; in that case, if the is_new mark is 1, we repair it.
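The repair rule above can be sketched as a plain-Java function, leaving Flink out for clarity. In the real job, lastVisitDate would be read from a keyed ValueState per mid; the class and parameter names here are illustrative:

```java
// Sketch of the new/old visitor repair rule. In the real Flink job,
// lastVisitDate would come from a ValueState<String> keyed by mid,
// and would be set to the log date on the device's first visit.
class VisitorStateRepair {
    // lastVisitDate: first-visit date saved in state, or null if none yet
    // logDate:       date the current log was generated
    // isNew:         the is_new mark carried by the client log
    static String repair(String lastVisitDate, String logDate, String isNew) {
        // State exists and was written on an earlier day: a returning visitor.
        if ("1".equals(isNew) && lastVisitDate != null && !lastVisitDate.equals(logDate)) {
            return "0"; // repair the client's mark
        }
        return isNew; // keep the mark (state empty, or same-day visit)
    }
}
```

Note that a same-day second visit keeps is_new = 1, since the state date equals the log date.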
5. Data splitting with side output streams
After the new/old visitor repair above, the log data is divided into 3 classes.
Start log tag definition: OutputTag<String> startTag = new OutputTag<String>("start"){};
Exposure log tag definition: OutputTag<String> displayTag = new OutputTag<String>("display"){};
Page logs are output to the main stream, start logs to the start side-output stream, and exposure logs to the exposure side-output stream.
The split data is then sent to Kafka:
- dwd_start_log: Start log
- dwd_display_log: Exposure log
- dwd_page_log: Page log
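The routing above can be sketched as a plain function over the parsed log record. The field names "start" and "displays" are assumptions about the log schema; in the Flink ProcessFunction, the three return values would correspond to ctx.output(startTag, ...), ctx.output(displayTag, ...), and the main collector:

```java
import java.util.Map;

// Sketch of the splitting rule. The field names "start" and "displays"
// are assumed from the log schema; adjust them to your actual records.
class LogSplitter {
    // Returns the DWD topic a parsed log record should be routed to.
    static String classify(Map<String, Object> log) {
        if (log.containsKey("start")) {
            return "dwd_start_log";   // start side-output stream
        } else if (log.containsKey("displays")) {
            return "dwd_display_log"; // exposure side-output stream
        }
        return "dwd_page_log";        // main stream (page log)
    }
}
```

Checking for the start field first matters: a record is only considered a page or exposure log once it is known not to be a start log.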