User behavior collection platform

2022-07-05 04:10:00 cpuCode

Data warehouse concept

A data warehouse (Data Warehouse) is a strategic collection of data from all of an enterprise's systems, built to support the enterprise's decision-making process.

Analyzing the data in the warehouse can help an enterprise improve its business, control costs, and improve product quality.

The data warehouse is not the final destination of the data; it prepares the data for its final destination.

These preparations include the following operations on the data:

  • Cleaning
  • Escaping
  • Classification
  • Restructuring
  • Merging
  • Splitting
  • Statistics
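
A few of these steps can be sketched in code. The following is a minimal illustration only: the record fields (`user`, `action`) and the sample values are hypothetical and do not come from the project. It shows cleaning (dropping incomplete records and normalizing text) followed by statistics (counting by action).

```python
from collections import Counter

# Hypothetical raw records; the field names are assumptions for illustration.
raw_records = [
    {"user": "u1", "action": "click "},
    {"user": "", "action": "view"},      # missing user: removed by cleaning
    {"user": "u2", "action": "VIEW"},
    {"user": "u1", "action": "view"},
]


def clean(records):
    """Cleaning: drop records without a user and normalize the action text."""
    return [
        {"user": r["user"], "action": r["action"].strip().lower()}
        for r in records
        if r.get("user")
    ]


def count_by_action(records):
    """Statistics: count how many records exist for each action."""
    return Counter(r["action"] for r in records)


cleaned = clean(raw_records)
stats = count_by_action(cleaned)
```

In a real warehouse these steps would normally run in an engine such as Hive or Spark rather than in plain Python; the sketch only shows the shape of the operations.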

Project requirements and architecture design

Project requirements analysis

Project requirements:

  • Build a data acquisition platform for user behavior data
  • Build a data acquisition platform for business data
  • Dimensional modeling of the data warehouse
  • Analyze the core e-commerce themes (devices, members, goods, regions, activities, and so on), with close to 100 statistical report metrics
  • Use ad hoc query tools to analyze metrics at any time
  • Monitor cluster performance and raise an alert when an exception occurs
  • Metadata management
  • Quality monitoring

Project framework

Technology selection

Data acquisition and transmission: Flume, DataX, Maxwell, Kafka, Sqoop

Data storage: MySQL, HDFS, HBase, Redis, MongoDB

Data computation: Hive, Spark, Flink, Tez, Storm

Data query: Presto, Kylin, Impala, DataX

Data visualization: Superset, QuickBI, DataV

Task scheduling: DolphinScheduler, Azkaban, Oozie

Cluster monitoring: Zabbix

Metadata management: Atlas

System data flow design

Business interaction data: data generated in business processes, such as logins, orders, users, goods, and payments. It is usually stored in a database such as MySQL or Oracle.

Embedded (tracking-point) user behavior data: data generated while users interact with the client product, such as page views, clicks, dwell time, comments, likes, and favorites.
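
To make the idea of a behavior record concrete, here is a minimal sketch of how a client might serialize one event into a JSON log line. Everything in it is an assumption for illustration: the `BehaviorEvent` class, its field names, and the event type strings are hypothetical and are not the log format used by this project.

```python
import json
from dataclasses import asdict, dataclass

# Hypothetical event types, mirroring the interactions listed above.
ALLOWED_EVENT_TYPES = {"page_view", "click", "stay", "comment", "like", "favorite"}


@dataclass
class BehaviorEvent:
    """One user interaction; all field names are illustrative assumptions."""
    user_id: str
    event_type: str
    page_id: str
    ts_ms: int  # event time in milliseconds since the epoch


def to_log_line(event: BehaviorEvent) -> str:
    """Serialize an event as a single compact JSON line."""
    if event.event_type not in ALLOWED_EVENT_TYPES:
        raise ValueError(f"unknown event type: {event.event_type}")
    return json.dumps(asdict(event), separators=(",", ":"))


line = to_log_line(BehaviorEvent("u1", "click", "home", 1656993000000))
```

One JSON object per line is a common choice because line-oriented collectors can ship each record without parsing it first.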

Framework version selection

Apache: operation and maintenance are troublesome, and you have to investigate compatibility between components yourself. (Generally used by large companies with strong technical capabilities and dedicated operations staff.) (Recommended.)

CDH: the most widely used distribution in China, but CM (Cloudera Manager) is not open source, and it will start charging this year at 10,000 US dollars per node.

HDP: open source and open to secondary development, but not as stable as CDH and less used in China.

Product      Version
Java         1.8
Hadoop       3.1.3
Hive         3.1.2
Flume        1.9.0
Zookeeper    3.5.7
Kafka        2.4.1
DataX        3.0
Maxwell      1.29.2

When selecting frameworks, try not to choose the latest release; choose a version that was the latest stable release about half a year ago.

Server selection

Physical machines:

  • A Dell machine with 128 GB of memory, a 20-core physical CPU (40 threads), an 8 TB HDD, and a 2 TB SSD is quoted at a little over 40,000 (4W) per unit. A physical machine typically lasts about 5 years
  • Professional operations staff are needed, at about 10,000 per month on average. Electricity is also a significant expense

Virtual machines (cloud hosts):

  • Taking Alibaba Cloud as an example, a roughly equivalent configuration costs about 50,000 (5W) per year, with disks as the main expense
  • Alibaba Cloud handles much of the operations work, so operation and maintenance are relatively easy

Enterprise selection:

  • Well-funded companies with no direct conflict with Alibaba choose Alibaba Cloud
  • Small and medium-sized companies that are raising funds toward a listing choose Alibaba Cloud, and buy physical machines once financing is secured
  • Companies with a long-term plan and ample funds choose physical machines

Cluster resource planning and design

(Assumption: each server has 8 TB of disk and 128 GB of memory)

  • 1 million daily active users, each generating about 100 log entries on average: 1 million * 100 = 100 million entries per day
  • Each log entry is about 1 KB, so 100 million entries per day come to: 100000000 / 1024 / 1024 = about 100 GB
  • No server expansion for half a year: 100 GB * 180 days = about 18 TB
  • Keep 3 replicas: 18 TB * 3 = 54 TB
  • Reserve a 20% to 30% buffer: 54 TB / 0.7 = about 77 TB

That comes to about 10 servers with 8 TB of disk each
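
The same estimate can be written as a small script so the inputs are easy to change. This is a minimal sketch: the function names and parameters are hypothetical, and it assumes 1 TB = 1000 GB, which matches the rounded figures above.

```python
import math


def daily_log_volume_gb(daily_active_users: int, logs_per_user: int,
                        log_size_kb: float = 1.0) -> float:
    """Daily raw log volume in GB (KB -> MB -> GB using 1024)."""
    return daily_active_users * logs_per_user * log_size_kb / 1024 / 1024


def required_storage_tb(daily_gb: float, retention_days: int = 180,
                        replicas: int = 3, buffer_ratio: float = 0.3,
                        gb_per_tb: int = 1000) -> float:
    """Total disk needed in TB after replication and reserved buffer."""
    retained_tb = daily_gb * retention_days / gb_per_tb
    replicated_tb = retained_tb * replicas
    # Reserving buffer_ratio of the disk means the data may only fill
    # the remaining (1 - buffer_ratio) share.
    return replicated_tb / (1 - buffer_ratio)


def servers_needed(total_tb: float, disk_per_server_tb: float = 8.0) -> int:
    """Number of servers, rounded up to a whole machine."""
    return math.ceil(total_tb / disk_per_server_tb)


exact_daily_gb = daily_log_volume_gb(1_000_000, 100)  # a little under 100 GB
rounded_total_tb = required_storage_tb(100)           # rounded daily volume
exact_total_tb = required_storage_tb(exact_daily_gb)

servers_rounded = servers_needed(rounded_total_tb)
servers_exact = servers_needed(exact_total_tb)
```

The unrounded daily volume is slightly below 100 GB, so the exact total is somewhat lower than 77 TB, but both variants round up to the same number of 8 TB servers.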

Cluster server planning:

Service name             Sub-service          cpucode101   cpucode102   cpucode103
HDFS                     NameNode
                         DataNode
                         SecondaryNameNode
Yarn                     NodeManager
                         ResourceManager
Zookeeper                Zookeeper Server
Flume (collect logs)     Flume
Kafka                    Kafka
Flume (consume Kafka)    Flume
Hive                     Hive
MySQL                    MySQL
DataX                    DataX
Maxwell                  Maxwell
Presto                   Coordinator
                         Worker
DolphinScheduler         MasterServer
                         WorkerServer
Druid                    Druid
Kylin
HBase                    HMaster
                         HRegionServer
Superset
Atlas
Solr                     Jar

User behavior log

Overview of user behavior logging

User behavior log content

User behavior log format

Simulate and generate user behavior logs

Data acquisition module

Copyright notice
This article was written by [cpuCode]. Please include the original link when reposting. Thank you.
https://yzsam.com/2022/02/202202140702037158.html