Design and implementation of data analysis system
2022-06-11 15:14:00 【Tcoder-l3est】
A big data demo, adapted from the Taobao Double 11 data analysis and prediction course case by the Xiamen University Database Lab blog (xmu.edu.cn).
Experimental environment
- Operating system: Linux (the lab machine runs Ubuntu 18.04; the cluster environment is CentOS 6.5)
- Hadoop version: 2.9.0
- JDK version: 1.8
- Java IDE: Eclipse 3.8
- Spark version: 2.3.0
- MySQL 8.0.29
- Tomcat 8.5
- ECharts 3.4.0
- Hive 3.1.2
- Sqoop 1.4.6
Experiment introduction
Based on a Dynamic Web project with JSP + Hive + MySQL + Spark, this system implements visual analysis of Double 11 shopping data, using ECharts for visualization and Spark's SVM to predict repeat customers.
Experimental framework

Data set description
There are three data sets: the user behavior log file user_log.csv, the repeat-customer training set train.csv, and the repeat-customer test set test.csv.
The fields of the user behavior log user_log.csv are defined as follows:
- user_id | buyer id
- item_id | item id
- cat_id | item category id
- merchant_id | seller id
- brand_id | brand id
- month | transaction time: month
- day | transaction time: day
- action | behavior, in {0,1,2,3}: 0 means click, 1 means add to cart, 2 means purchase, 3 means favorite the item
- age_range | buyer age range: 1 means age < 18, 2 means [18,24], 3 means [25,29], 4 means [30,34], 5 means [35,39], 6 means [40,49], 7 and 8 mean age >= 50, 0 and NULL mean unknown
- gender | gender: 0 means female, 1 means male, 2 and NULL mean unknown
- province | shipping address: province
Repeat customer data set
- user_id | buyer id
- age_range | buyer age range: same encoding as in user_log.csv
- gender | gender: 0 means female, 1 means male, 2 and NULL mean unknown
- merchant_id | merchant id
- label | whether the user is a repeat customer: 0 means not a repeat customer, 1 means a repeat customer, -1 means the user is outside the prediction range we need to consider. NULL appears only in the test set and marks the values to be predicted.
Upload the data set to HDFS:
cd /usr/local/hadoop
./bin/hdfs dfs -mkdir -p /dbtaobao/dataset/user_log
# Upload
./bin/hdfs dfs -put /usr/local/dbtaobao/dataset/small_user_log.csv /dbtaobao/dataset/user_log
Create a database in Hive
Hive is a data warehouse tool built on Hadoop for data extraction, transformation, and loading; it provides a mechanism to store, query, and analyze large-scale data stored in Hadoop.
Hive maps structured data files to database tables and provides SQL query capability, translating SQL statements into MapReduce jobs for execution.
hive> create database dbtaobao;
hive> use dbtaobao;
Create a Hive external table:
hive> CREATE EXTERNAL TABLE dbtaobao.user_log(user_id INT,item_id INT,cat_id INT,merchant_id INT,brand_id INT,month STRING,day STRING,action INT,age_range INT,gender INT,province STRING) COMMENT 'Now create dbtaobao.user_log!' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION '/dbtaobao/dataset/user_log';
Data analysis in Hive
For example:
hive> select brand_id from user_log limit 10; -- query the brand ids of the first 10 transaction records
Hive supports SQL-like statements for insert, delete, query, and update; they are automatically translated into MapReduce jobs.
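The charts later in this article come from aggregate queries of the same flavor. The queries below are a sketch of my own (not from the original course case); they assume only the user_log schema defined above:

```sql
-- purchases (action = 2) by gender, feeding the male/female comparison chart
SELECT gender, COUNT(*) AS purchases
FROM user_log
WHERE action = 2
GROUP BY gender;

-- purchases by province, feeding the per-province comparison chart
SELECT province, COUNT(*) AS purchases
FROM user_log
WHERE action = 2
GROUP BY province
ORDER BY purchases DESC;
```

Each such query runs as a MapReduce job and its result can later be exported to MySQL for the web frontend.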
Import data from Hive into MySQL
Use Sqoop to export the data from Hive into MySQL.
First create a database in MySQL:
mysql> show databases; # show all databases
mysql> create database dbtaobao; # create the dbtaobao database
mysql> use dbtaobao; # switch to the database
Create the table:
mysql> CREATE TABLE `dbtaobao`.`user_log` (`user_id` varchar(20),`item_id` varchar(20),`cat_id` varchar(20),`merchant_id` varchar(20),`brand_id` varchar(20), `month` varchar(6),`day` varchar(6),`action` varchar(6),`age_range` varchar(6),`gender` varchar(6),`province` varchar(10)) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Import the data:
cd /usr/local/sqoop
bin/sqoop export --connect jdbc:mysql://localhost:3306/dbtaobao --username root --password root --table user_log --export-dir '/user/hive/warehouse/dbtaobao.db/inner_user_log' --fields-terminated-by ',';
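After the Sqoop export finishes, a quick sanity check in MySQL (my addition, not part of the original steps) confirms that the rows arrived:

```sql
USE dbtaobao;
-- the row count should match the Hive table
SELECT COUNT(*) FROM user_log;
-- spot-check a few rows
SELECT * FROM user_log LIMIT 10;
```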
Spark SVM
Use Spark's SVM to predict repeat customers.
Read the data from HDFS:
val train_data = sc.textFile("/dbtaobao/dataset/train_after.csv")
val test_data = sc.textFile("/dbtaobao/dataset/test_after.csv")
Build the training model:
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

// columns 1-3 are the input features; column 4 is the label
val train = train_data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(4).toDouble, Vectors.dense(parts(1).toDouble, parts(2).toDouble, parts(3).toDouble))
}
val test = test_data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(4).toDouble, Vectors.dense(parts(1).toDouble, parts(2).toDouble, parts(3).toDouble))
}
// set the number of iterations and train an SVMWithSGD model
val numIterations = 1000
val model = SVMWithSGD.train(train, numIterations)
Evaluate the model; set the threshold and output the scores and classification results:
// with the threshold set, predict() returns the class label
model.setThreshold(0.0)
val scoreAndLabels = test.map { point =>
  val score = model.predict(point.features)
  score + " " + point.label
}
scoreAndLabels.foreach(println)
Finally, upload the results to MySQL.
Web visualization
Developed with Tomcat + MySQL + JSP as a Dynamic Web Project.
Approach
The Java backend queries data from the database; the frontend fetches the data from the server, and the JSP pages combine ECharts with the queried data to render the visualizations, served through Tomcat on port 8080.
Results
Comparison of consumption behaviors across all sellers
Transaction comparison between male and female buyers

Age group transaction comparison

Comparison of transaction volume of commodity categories

Comparison of trading volume of each province

Comparison of SVM predictions for repeat customers

Conclusions
I became familiar with Hive. Hive is a data warehouse tool built on Hadoop for data extraction, transformation, and loading; it provides a mechanism to store, query, and analyze large-scale data stored in Hadoop. Hive maps structured data files to database tables and provides SQL queries, translating SQL statements into MapReduce jobs. It is very well suited to statistical analysis in a data warehouse!
I became familiar with Java Web (Dynamic Web Project) development, using the Eclipse IDE, Tomcat as the web container, Java for the backend, JSP for the user interface, and ECharts for chart visualization to build the data analysis system.
I became familiar with the SVM algorithm and with using the SVM in Spark's ML library. A support vector machine is a binary classification model, which I had studied in machine learning. Spark's machine learning library ships a ready-made SVM that supports only basic binary classification; parameters such as the number of iterations and regularization can be set. If a threshold is set, scores above it are treated as positive predictions and scores below it as negative, and the model can be evaluated with AUC.
Self-posed questions
About Hive external and internal tables
Before data is imported into an external table, the data has not been moved into Hive's own warehouse directory; in other words, Hive does not manage an external table's data itself.
When an external table is dropped, Hive deletes only its metadata; the data itself is not deleted.
So which kind of table should you choose? In most cases there is not much difference, so the choice is largely a matter of preference. As a rule of thumb: if all processing will be done inside Hive, create an internal table; otherwise use an external table!
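The difference can be seen with a minimal sketch (the table and path names here are made up for illustration):

```sql
-- internal (managed) table: DROP TABLE removes metadata AND data
CREATE TABLE demo_internal (id INT) STORED AS TEXTFILE;

-- external table: DROP TABLE removes only the metadata;
-- the files under LOCATION stay in HDFS
CREATE EXTERNAL TABLE demo_external (id INT)
STORED AS TEXTFILE
LOCATION '/dbtaobao/dataset/demo_external';

DROP TABLE demo_internal;  -- data files deleted
DROP TABLE demo_external;  -- data files kept in HDFS
```

This is why the user_log table above is created as EXTERNAL: the CSV uploaded to HDFS stays in place even if the table definition is dropped and recreated.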
Hive-to-MySQL import failure
It was an encoding problem: after changing the MySQL character sets to utf-8, the import worked normally.