User behavior collection platform
2022-07-05 04:10:00 【cpuCode】
Data warehouse concept
A data warehouse (Data Warehouse) is a strategic collection of data from all of an enterprise's systems, built to support the enterprise's decision-making process.
Analyzing the data in the warehouse helps an enterprise improve its business processes, control costs, and improve product quality.
The data warehouse is not the final destination of the data; it prepares the data for that final destination.
This preparation includes:
- Cleaning
- Escaping
- Classification
- Restructuring
- Merging
- Splitting
- Statistics
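A few of the preparation steps listed above can be sketched in plain Python. This is only an illustrative sketch: the records, field names, and helper functions (`clean`, `merge`, `split_by`, `total_amount`) are hypothetical and not part of the original project, where this work would normally run in Hive or Spark.

```python
# Hypothetical raw records; field names and values are made up for illustration.
raw_orders = [
    {"order_id": "1", "user": " alice ", "region": "north", "amount": "10.5"},
    {"order_id": "2", "user": "bob", "region": "south", "amount": None},  # missing amount
    {"order_id": "3", "user": "carol", "region": "north", "amount": "4.5"},
]
raw_users = {
    "alice": {"level": "gold"},
    "bob": {"level": "silver"},
    "carol": {"level": "gold"},
}


def clean(records):
    """Cleaning: drop records with a missing amount, trim whitespace, cast types."""
    cleaned = []
    for r in records:
        if r["amount"] is None:
            continue
        cleaned.append({**r, "user": r["user"].strip(), "amount": float(r["amount"])})
    return cleaned


def merge(records, users):
    """Merging: attach user attributes to each order record."""
    return [{**r, **users.get(r["user"], {})} for r in records]


def split_by(records, key):
    """Splitting / classification: group records by the value of one field."""
    groups = {}
    for r in records:
        groups.setdefault(r[key], []).append(r)
    return groups


def total_amount(records):
    """Statistics: sum the order amounts."""
    return sum(r["amount"] for r in records)


prepared = merge(clean(raw_orders), raw_users)
by_region = split_by(prepared, "region")
totals = {region: total_amount(rs) for region, rs in by_region.items()}
print(totals)  # {'north': 15.0}
```

The record with the missing amount is removed during cleaning, so only the "north" region remains in the final statistics.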
Project requirements and architecture design
Project requirements analysis
Project requirements:
- Build a data collection platform for user behavior data
- Build a data collection platform for business data
- Dimensional modeling of the data warehouse
- Analyze the core e-commerce subjects (devices, members, products, regions, activities, and so on), with close to 100 statistical report metrics
- Use an ad hoc query tool to analyze metrics at any time
- Monitor cluster performance and raise an alert when an anomaly occurs
- Metadata management
- Data quality monitoring
Project framework
Technology selection
Data collection and transmission: Flume, DataX, Maxwell, Kafka, Sqoop
Data storage: MySQL, HDFS, HBase, Redis, MongoDB
Data computation: Hive, Spark, Flink, Tez, Storm
Data query: Presto, Kylin, Impala, DataX
Data visualization: Superset, QuickBI, DataV
Task scheduling: DolphinScheduler, Azkaban, Oozie
Cluster monitoring: Zabbix
Metadata management: Atlas
System data flow design
Business interaction data: data generated during business processes, such as logins, orders, users, products, and payments. It is usually stored in a database such as MySQL or Oracle.
User behavior tracking data: data generated while users interact with the client product, such as page views, clicks, dwell time, comments, likes, and favorites.
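To make the idea of a tracked behavior event more concrete, here is a minimal sketch of how one event could be serialized as a JSON log line. The event types come from the examples above; the field names (`user_id`, `event`, `page`, `ts`) and the function `build_event` are assumptions for illustration, not the log format used by this project.

```python
import json
import time

# Event types taken from the examples in the text; the names are assumed spellings.
EVENT_TYPES = {"page_view", "click", "stay", "comment", "like", "favorite"}


def build_event(user_id, event_type, page, ts_ms=None):
    """Build one behavior-log line as a JSON string (hypothetical schema)."""
    if event_type not in EVENT_TYPES:
        raise ValueError(f"unknown event type: {event_type}")
    event = {
        "user_id": user_id,
        "event": event_type,
        "page": page,
        # Millisecond timestamp; defaults to the current time.
        "ts": ts_ms if ts_ms is not None else int(time.time() * 1000),
    }
    return json.dumps(event, ensure_ascii=False)


line = build_event("u_001", "click", "product_detail", ts_ms=1656993000000)
print(line)
```

One JSON object per line is a common choice for this kind of data because a collector such as Flume can ship it line by line without parsing it first.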
Framework version selection
Apache: operations and maintenance are cumbersome, and compatibility between components has to be verified yourself. (Generally used by large companies with strong technical teams and dedicated operations staff.) (Recommended.)
CDH: the most widely used distribution in China, but CM is not open source, and it starts charging this year at 10,000 US dollars per node.
HDP: open source and open to secondary development, but not as stable as CDH; rarely used in China.
Product | Version |
---|---|
Java | 1.8 |
Hadoop | 3.1.3 |
Hive | 3.1.2 |
Flume | 1.9.0 |
Zookeeper | 3.5.7 |
Kafka | 2.4.1 |
DataX | 3.0 |
Maxwell | 1.29.2 |
When selecting frameworks, try not to pick the very latest release; choose the stable version from about half a year before the latest one.
Server selection
Physical machines:
- With 128 GB of memory, a 20-core physical CPU (40 threads), an 8 TB HDD, and a 2 TB SSD, a single Dell server is quoted at a little over 40,000 yuan. A physical machine generally lasts about 5 years.
- Dedicated operations staff are required, at about 10,000 yuan per month on average. Electricity is also a significant expense.
Cloud virtual machines:
- Taking Alibaba Cloud as an example, a similar configuration costs about 50,000 yuan per year; most of the cost is disk.
- Much of the operations work is handled by Alibaba Cloud, so maintenance is relatively easy.
How enterprises choose:
- Well-funded financial companies and companies with no direct conflict with Alibaba choose Alibaba Cloud.
- Small and medium-sized companies that are raising funding or preparing to go public choose Alibaba Cloud, then buy physical machines once the funding comes through.
- Companies with a long-term plan and sufficient funds choose physical machines.
Cluster resource planning and design
(Assumption: each server has 8 TB of disk and 128 GB of memory)
- 1 million daily active users, each generating about 100 log records per day: 1,000,000 * 100 = 100 million records
- Each record is about 1 KB, so 100 million records per day: 100,000,000 / 1024 / 1024 ≈ 100 GB
- No server expansion within half a year: 100 GB * 180 days ≈ 18 TB
- Keep 3 replicas: 18 TB * 3 = 54 TB
- Reserve a 20% to 30% buffer: 54 TB / 0.7 ≈ 77 TB
- That is about 10 servers with 8 TB of disk each
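The estimate above can be reproduced as a short calculation. This is a sketch that follows the same rounding as the text (95 GB per day rounded up to 100 GB, and 1 TB treated as 1000 GB); the variable names are chosen here for illustration.

```python
import math

daily_active_users = 1_000_000
logs_per_user = 100          # average log records per user per day
log_size_kb = 1              # about 1 KB per log record
retention_days = 180         # no expansion for half a year
replicas = 3                 # number of copies kept
usable_fraction = 0.7        # leave up to 30% of the disk as buffer
disk_per_server_tb = 8

logs_per_day = daily_active_users * logs_per_user             # 100,000,000 records
exact_gb_per_day = logs_per_day * log_size_kb / 1024 / 1024   # about 95.4 GB
daily_gb = 100                                                # rounded up, as in the text

raw_tb = daily_gb * retention_days / 1000                     # 18 TB (1 TB taken as 1000 GB)
replicated_tb = raw_tb * replicas                             # 54 TB
required_tb = replicated_tb / usable_fraction                 # about 77.1 TB
servers = math.ceil(required_tb / disk_per_server_tb)         # 10 servers

print(f"{exact_gb_per_day:.1f} GB/day, {required_tb:.1f} TB needed, {servers} servers")
```

Changing any single assumption, such as the retention period or the replica count, shows directly how the required number of servers moves.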
Cluster server planning:
Service | Sub-service | Server cpucode101 | Server cpucode102 | Server cpucode103 |
---|---|---|---|---|
HDFS | NameNode | √ | | |
HDFS | DataNode | √ | √ | √ |
HDFS | SecondaryNameNode | √ | | |
YARN | NodeManager | √ | √ | √ |
YARN | ResourceManager | √ | | |
Zookeeper | Zookeeper Server | √ | √ | √ |
Flume (log collection) | Flume | √ | √ | |
Kafka | Kafka | √ | √ | √ |
Flume (Kafka consumer) | Flume | √ | | |
Hive | Hive | √ | | |
MySQL | MySQL | √ | | |
DataX | DataX | √ | | |
Maxwell | Maxwell | √ | | |
Presto | Coordinator | √ | | |
Presto | Worker | √ | √ | √ |
DolphinScheduler | MasterServer | √ | | |
DolphinScheduler | WorkerServer | √ | √ | √ |
Druid | Druid | √ | √ | √ |
Kylin | | √ | | |
HBase | HMaster | √ | | |
HBase | HRegionServer | √ | √ | √ |
Superset | | √ | | |
Atlas | | √ | | |
Solr | Jar | √ | √ | √ |
User behavior log
Overview of user behavior logging
User behavior log content
User behavior log format
Simulate and generate user behavior logs
Data acquisition module