
Design and implementation of data analysis system

2022-06-11 15:14:00 Tcoder-l3est


A big data demo, adapted from the "Taobao Double 11 data analysis and prediction" course case on the Xiamen University Database Lab blog (xmu.edu.cn).

Experimental environment

  1. Operating system: Linux (the lab machine runs Ubuntu 18.04; the cluster runs CentOS 6.5)
  2. Hadoop version: 2.9.0
  3. JDK version: 1.8
  4. Java IDE: Eclipse 3.8
  5. Spark version: 2.3.0
  6. MySQL 8.0.29
  7. Tomcat 8.5
  8. ECharts 3.4.0
  9. Hive 3.1.2
  10. Sqoop 1.4.6

Experiment introduction

Built as a Dynamic Web project with JSP, Hive, MySQL, and Spark, the system provides a visual analysis of Double 11 shopping data: ECharts renders the visualizations, and Spark's SVM predicts repeat customers.

Experimental framework

image-20220609233133730

Data set description

There are three data sets: the user behavior log user_log.csv, the repeat-customer training set train.csv, and the repeat-customer test set test.csv.

The fields of the user behavior log user_log.csv are defined as follows:

  • user_id | buyer ID
  • item_id | item ID
  • cat_id | item category ID
  • merchant_id | seller ID
  • brand_id | brand ID
  • month | month of the transaction
  • day | day of the transaction
  • action | behavior, with values in {0,1,2,3}: 0 = click, 1 = add to cart, 2 = purchase, 3 = favorite the item
  • age_range | buyer age bracket: 1 = under 18, 2 = [18,24], 3 = [25,29], 4 = [30,34], 5 = [35,39], 6 = [40,49], 7 and 8 = 50 or older, 0 and NULL = unknown
  • gender | gender: 0 = female, 1 = male, 2 and NULL = unknown
  • province | province of the shipping address

Repeat customer data set

  • user_id | buyer ID
  • age_range | buyer age bracket, encoded as in user_log.csv above
  • gender | gender: 0 = female, 1 = male, 2 and NULL = unknown
  • merchant_id | merchant ID
  • label | whether the buyer is a repeat customer: 0 = not a repeat customer, 1 = repeat customer, -1 = the user falls outside the scope of prediction. NULL appears only in the test set, marking the values to be predicted.

Upload the data set to HDFS

cd /usr/local/hadoop
./bin/hdfs dfs -mkdir -p /dbtaobao/dataset/user_log
# upload the local CSV into HDFS
./bin/hdfs dfs -put /usr/local/dbtaobao/dataset/small_user_log.csv /dbtaobao/dataset/user_log
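
To confirm the upload, list the target directory (a quick check, using the same paths as above):

./bin/hdfs dfs -ls /dbtaobao/dataset/user_log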

Create a database in Hive

Hive is a data warehouse tool built on top of Hadoop. It is used for data extraction, transformation, and loading, and provides a mechanism for storing, querying, and analyzing large-scale data held in Hadoop.

Hive maps structured data files onto database tables and offers SQL-style querying, translating SQL statements into MapReduce jobs for execution.

hive> create database dbtaobao;
hive> use dbtaobao;

Create a Hive external table:

hive> CREATE EXTERNAL TABLE dbtaobao.user_log(
          user_id INT, item_id INT, cat_id INT, merchant_id INT, brand_id INT,
          month STRING, day STRING, action INT, age_range INT, gender INT,
          province STRING)
      COMMENT 'Now create dbtaobao.user_log!'
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      STORED AS TEXTFILE
      LOCATION '/dbtaobao/dataset/user_log';

Data analysis in Hive

For example:

hive> select brand_id from user_log limit 10; -- brand IDs of the first 10 transaction records in the log

Hive accepts SQL-like statements for inserting, deleting, querying, and updating data, and automatically converts them into MapReduce jobs.
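
For instance, two further illustrative queries against the user_log table defined above (action = 2 marks a purchase, per the field definitions):

hive> select count(*) from user_log where action = 2;                          -- total number of purchases
hive> select gender, count(*) from user_log where action = 2 group by gender; -- purchases by gender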

Import data from Hive into MySQL

Use Sqoop to export the data from Hive into MySQL.

First, create a database in MySQL:

mysql> show databases;           # list all databases
mysql> create database dbtaobao; # create the dbtaobao database
mysql> use dbtaobao;             # switch to the database

Create table

mysql> CREATE TABLE `dbtaobao`.`user_log` (
           `user_id` varchar(20), `item_id` varchar(20), `cat_id` varchar(20),
           `merchant_id` varchar(20), `brand_id` varchar(20), `month` varchar(6),
           `day` varchar(6), `action` varchar(6), `age_range` varchar(6),
           `gender` varchar(6), `province` varchar(10)
       ) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Import data

cd /usr/local/sqoop
bin/sqoop export --connect jdbc:mysql://localhost:3306/dbtaobao --username root --password root --table user_log --export-dir '/user/hive/warehouse/dbtaobao.db/inner_user_log' --fields-terminated-by ',';
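
After the export finishes, a quick sanity check in MySQL confirms that the rows arrived:

mysql> use dbtaobao;
mysql> select * from user_log limit 10;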

Spark SVM

Use Spark's SVM to predict repeat customers.

  1. Read the data from HDFS

    // imports used by the snippets below (Spark MLlib, RDD-based API)
    import org.apache.spark.mllib.classification.SVMWithSGD
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    val train_data = sc.textFile("/dbtaobao/dataset/train_after.csv")
    val test_data = sc.textFile("/dbtaobao/dataset/test_after.csv")
    
  2. Build the training model

    // fields 1-3 (age_range, gender, merchant_id) are the features; field 4 is the label
    val train = train_data.map { line =>
      val parts = line.split(',')
      LabeledPoint(parts(4).toDouble,
        Vectors.dense(parts(1).toDouble, parts(2).toDouble, parts(3).toDouble))
    }
    val test = test_data.map { line =>
      val parts = line.split(',')
      LabeledPoint(parts(4).toDouble,
        Vectors.dense(parts(1).toDouble, parts(2).toDouble, parts(3).toDouble))
    }
    // set the number of iterations and train an SVMWithSGD model
    val numIterations = 1000
    val model = SVMWithSGD.train(train, numIterations)
    
  3. Evaluate the model: set a threshold, then output the scores and classification results

    // with a threshold set, predict() returns the 0/1 class rather than the raw score
    model.setThreshold(0.0)
    val scoreAndLabels = test.map { point =>
      val score = model.predict(point.features)
      score + " " + point.label
    }
    scoreAndLabels.foreach(println)
    
  4. Write the prediction results to MySQL (a sketch follows below)
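
A minimal sketch of this step, writing scoreAndLabels back to MySQL over JDBC. The table name rebuy and its schema are assumptions, as is having the MySQL JDBC driver on the classpath:

    // hypothetical target table: CREATE TABLE rebuy (score varchar(20), label varchar(20));
    scoreAndLabels.foreachPartition { iter =>
      val conn = java.sql.DriverManager.getConnection(
        "jdbc:mysql://localhost:3306/dbtaobao", "root", "root")
      val stmt = conn.prepareStatement("insert into rebuy (score, label) values (?, ?)")
      iter.foreach { line =>
        val parts = line.split(" ")
        stmt.setString(1, parts(0))
        stmt.setString(2, parts(1))
        stmt.executeUpdate()
      }
      conn.close()
    }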

Web visualization

Developed with Tomcat, MySQL, and JSP as a Dynamic Web Project.

Ideas

The Java back end queries the data from the database; the front end then fetches it from the server, the JSP pages render the charts with ECharts from the queried data, and the application is served through Tomcat on port 8080.
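
As a sketch of this flow (the class name and JSON shape are illustrative, not the original code), a servlet that queries per-gender purchase counts from user_log and returns them as JSON for the ECharts page to plot:

import java.io.IOException;
import java.sql.*;
import javax.servlet.ServletException;
import javax.servlet.http.*;

public class GenderStatServlet extends HttpServlet {
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        StringBuilder json = new StringBuilder("[");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:mysql://localhost:3306/dbtaobao", "root", "root");
             Statement stmt = conn.createStatement();
             // action is stored as varchar(6) in the table above, so compare as a string
             ResultSet rs = stmt.executeQuery(
                 "select gender, count(*) from user_log where action='2' group by gender")) {
            while (rs.next()) {
                if (json.length() > 1) json.append(',');
                json.append("{\"gender\":\"").append(rs.getString(1))
                    .append("\",\"count\":").append(rs.getLong(2)).append('}');
            }
        } catch (SQLException e) {
            throw new ServletException(e);
        }
        resp.setContentType("application/json;charset=utf-8");
        resp.getWriter().print(json.append(']'));
    }
}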

Results

Comparison of consumption behaviors of all sellers

image-20220610000233762 (https://vvtorres.oss-cn-beijing.aliyuncs.com/image-20220610000233762.png)

Transaction comparison between male and female buyers

image-20220610000302109

Age group transaction comparison

image-20220610000443946

Comparison of transaction volume of commodity categories

image-20220610000522760

Comparison of trading volume of each province

image-20220610000545122

SVM prediction of repeat customers

image-20220610000807708

Conclusion analysis

  1. I became familiar with Hive. Hive is a data warehouse tool built on top of Hadoop for data extraction, transformation, and loading, providing a mechanism for storing, querying, and analyzing large-scale data held in Hadoop. It maps structured data files onto database tables, offers SQL querying, and turns SQL statements into MapReduce jobs, which makes it well suited to statistical analysis in a data warehouse.

  2. I became familiar with Java Web development (Dynamic Web Project), using the Eclipse IDE, Tomcat as the web container, Java for the back end, JSP for the user interface, and ECharts for chart visualization, which together realize the data analysis system.

  3. I became familiar with the SVM algorithm and with using the SVM in Spark's ML library. The support vector machine is a binary classification model, which I had studied in machine learning. Spark's Machine Learning library ships an encapsulated SVM that supports only basic binary classification; parameters such as the number of iterations and regularization can be configured. If a threshold is set, results above it are treated as positive predictions and results below it as negative ones, and the model can be assessed via AUC (see the sketch below).
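
As an illustration of the AUC evaluation, a sketch using Spark MLlib's BinaryClassificationMetrics (clearing the threshold makes predict() return raw scores, which the metric needs):

    import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

    // clear the threshold so predict() returns raw margins instead of 0/1 labels
    model.clearThreshold()
    val rawScoreAndLabels = test.map { point =>
      (model.predict(point.features), point.label)
    }
    val metrics = new BinaryClassificationMetrics(rawScoreAndLabels)
    println("Area under ROC = " + metrics.areaUnderROC())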

Self-posed questions

On Hive's external and internal tables

Before data is imported into an external table, it is not moved into Hive's own warehouse directory; that is, Hive does not itself manage the data in an external table.

When an external table is dropped, Hive deletes only the table's metadata; the data itself is not deleted.

So how should you choose which kind of table to use? In most cases there is little difference, so the choice is largely personal preference. As a rule of thumb, though: if all processing will be done by Hive, create an internal table; otherwise, use an external table.
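
A small illustration of the difference (illustrative table names):

hive> CREATE TABLE managed_demo (id INT);                                     -- managed (internal) table
hive> CREATE EXTERNAL TABLE external_demo (id INT) LOCATION '/dbtaobao/demo'; -- external table
hive> DROP TABLE managed_demo;   -- removes the metadata AND the data files in the warehouse directory
hive> DROP TABLE external_demo;  -- removes only the metadata; the files under /dbtaobao/demo remain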

Failure importing from Hive into MySQL

This was an encoding problem; after changing the MySQL encodings to utf-8, the import worked normally.
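
For reference, one way to make that change (a sketch; the exact statements depend on the server setup):

mysql> ALTER DATABASE dbtaobao CHARACTER SET utf8 COLLATE utf8_general_ci;
mysql> ALTER TABLE user_log CONVERT TO CHARACTER SET utf8;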

Copyright notice: this article was written by [Tcoder-l3est]; please include the original link when reposting: https://yzsam.com/2022/162/202206111503195652.html