
Spark Hands-on Introduction Series, Part 8: Spark MLlib (II) - Practice with the SparkMLlib Machine Learning Library

2022-06-13 05:54:00 morpheusWB


[Note] The articles in this series, together with the installation packages and test data they use, can be obtained from "A Big Gift for You: The Spark Hands-on Introduction Series".

1. MLlib examples

1.1  Clustering examples

1.1.1  Algorithm description

The core task of clustering (cluster analysis) is to partition a set of target objects into clusters so that the objects within each cluster are as similar as possible, while the objects in different clusters are as different as possible. Clustering algorithms are an important part of machine learning (or rather of data mining). Besides the simplest K-Means clustering algorithm, common approaches include hierarchical methods (CURE, CHAMELEON, etc.) and grid-based algorithms (STING, WaveCluster, etc.).

A more rigorous definition of the clustering problem: given a set of elements D, in which each element has n observable attributes, use some algorithm to partition D into k subsets such that the dissimilarity between elements within each subset is as low as possible, while elements of different subsets are as dissimilar as possible. Each subset is called a cluster.

K-means clustering belongs to unsupervised learning. Regression, naive Bayes, SVM and the like all have a class label y; that is, the category of each sample is given in the training data. Clustered samples carry no label y, only features x. Suppose, for example, that the stars in the universe are represented as a set of points in three-dimensional space. The purpose of clustering is to find the latent category y of each sample x and to put samples with the same category y together. For the stars in this example, the clustering result is a set of star clusters: within a cluster the points are close to one another, while different clusters of stars are far apart.

Clustering differs from classification. Classification is learning from examples: the categories must be specified before classification, and every element is asserted to map to one of them. Clustering is learning from observation: before clustering, the categories are unknown and even the number of categories may not be given, which makes it a form of unsupervised learning. Clustering is currently widely used in statistics, biology, database technology and marketing, and many corresponding algorithms exist.

1.1.2  Example introduction

This example uses the K-Means algorithm. K-Means is an iterative reassignment clustering algorithm based on squared error, and its core idea is very simple:

1. Randomly select K center points;
2. Compute the distance from every point to the K centers and assign each point to the cluster of its nearest center;
3. Recompute each of the K cluster centers simply as the arithmetic mean (mean) of the points in the cluster;
4. Repeat steps 2 and 3 until the cluster assignments no longer change or the maximum number of iterations is reached;
5. Output the result.

The result of the K-Means algorithm depends on the choice of the initial cluster centers and easily falls into a local optimum; there is no general rule for choosing K; the algorithm is sensitive to abnormal data; it can only handle data with numeric attributes; and the resulting clustering structure may be unbalanced.
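To make the steps above concrete, here is a minimal single-machine sketch of the K-Means loop in plain Scala. It is an illustration, not the MLlib implementation: the deterministic choice of initial centers, the Euclidean distance, and the fixed iteration count are simplifying assumptions, and the convergence test of step 4 is omitted.

// Minimal single-machine K-Means sketch (illustrative; not the MLlib implementation).
// Initial centers are taken deterministically instead of randomly, for brevity.
def squaredDistance(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

def kmeans(points: Seq[Array[Double]], k: Int, maxIterations: Int): Seq[Array[Double]] = {
  var centers: Seq[Array[Double]] = points.take(k)      // step 1: choose k initial centers
  for (_ <- 1 to maxIterations) {                       // step 4: iterate (fixed count here)
    val clusters = points.groupBy { p =>                // step 2: assign points to nearest center
      centers.indices.minBy(i => squaredDistance(p, centers(i)))
    }
    centers = centers.indices.map { i =>                // step 3: recompute centers as means
      clusters.get(i).map(ps => ps.transpose.map(_.sum / ps.size).toArray).getOrElse(centers(i))
    }
  }
  centers                                               // step 5: output the result
}

For the six points of the test data in 1.1.3, kmeans(points, 2, 20) converges to centers near (0.1, 0.1, 0.1) and (9.1, 9.1, 9.1), consistent with the MLlib result shown in 1.1.5.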

This example carries out the following steps:

1. Load the data, which is stored as a text file;
2. Cluster the data set, setting 2 classes and 20 iterations, and train a model on it;
3. Print the cluster centers of the model;
4. Evaluate the model using the within-set sum of squared errors;
5. Use the model to classify single points;
6. Cross evaluation 1: output the predictions only; cross evaluation 2: output the data set together with the predictions.

1.1.3 Test data description

This example uses the data in kmeans_data.txt, which can be found in the /data/class8/ directory of the resources attached to this series. The file contains the spatial coordinates of 6 points; K-Means clustering is used to classify these points.

The data in kmeans_data.txt is as follows:

0.0 0.0 0.0

0.1 0.1 0.1

0.2 0.2 0.2

9.0 9.0 9.0

9.1 9.1 9.1

9.2 9.2 9.2

1.1.4 Program code

import org.apache.log4j.{Level, Logger}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object Kmeans {
  def main(args: Array[String]) {
    // Suppress unnecessary log output on the terminal
    Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
    Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.OFF)

    // Set up the runtime environment
    val conf = new SparkConf().setAppName("Kmeans").setMaster("local[4]")
    val sc = new SparkContext(conf)

    // Load the data set (one point per line, coordinates separated by spaces)
    val data = sc.textFile("/home/hadoop/upload/class8/kmeans_data.txt", 1)
    val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble)))

    // Cluster the data set: 2 classes, 20 iterations, training produces the model
    val numClusters = 2
    val numIterations = 20
    val model = KMeans.train(parsedData, numClusters, numIterations)

    // Print the cluster centers of the model
    println("Cluster centers:")
    for (c <- model.clusterCenters) {
      println("  " + c.toString)
    }

    // Evaluate the model with the within-set sum of squared errors
    val cost = model.computeCost(parsedData)
    println("Within Set Sum of Squared Errors = " + cost)

    // Use the model to classify single points
    println("Vector 0.2 0.2 0.2 belongs to cluster: " + model.predict(Vectors.dense("0.2 0.2 0.2".split(' ').map(_.toDouble))))
    println("Vector 0.25 0.25 0.25 belongs to cluster: " + model.predict(Vectors.dense("0.25 0.25 0.25".split(' ').map(_.toDouble))))
    println("Vector 8 8 8 belongs to cluster: " + model.predict(Vectors.dense("8 8 8".split(' ').map(_.toDouble))))

    // Cross evaluation 1: output the predicted cluster only
    val testdata = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble)))
    val result1 = model.predict(testdata)
    result1.saveAsTextFile("/home/hadoop/upload/class8/result_kmeans1")

    // Cross evaluation 2: output each data point together with its predicted cluster
    data.map { line =>
      val linevectore = Vectors.dense(line.split(' ').map(_.toDouble))
      val prediction = model.predict(linevectore)
      line + " " + prediction
    }.saveAsTextFile("/home/hadoop/upload/class8/result_kmeans2")

    sc.stop()
  }
}
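Because the result depends on the initial centers and there is no rule for choosing K (see 1.1.2), a common check is to train models for several K values and compare their within-set sum of squared errors via computeCost. A short sketch, reusing the parsedData RDD and numIterations from the program above; the range of K values tried here is an arbitrary assumption for this small data set:

// Sketch: compare the WSSSE for several values of K (assumes parsedData and
// numIterations from the program above; the range 1 to 4 is an arbitrary choice).
for (k <- 1 to 4) {
  val m = KMeans.train(parsedData, k, numIterations)
  println("K = " + k + ", WSSSE = " + m.computeCost(parsedData))
}

The WSSSE always decreases as K grows; the usual heuristic is to pick the K after which the decrease flattens out (the "elbow").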


1.1.5 Running in IDEA

Step 1: Start the Spark cluster with the following commands

$cd /app/hadoop/spark-1.1.0

$sbin/start-all.sh

Step 2: Set up the run configuration in IDEA

Set up the Kmeans run configuration in IDEA. Because the path of the input data is specified in the program itself, no input parameters need to be set in this configuration dialog.


Step 3: Run and observe the output

In the run log window you can see that the model has been computed, with two cluster centers found: (9.1, 9.1, 9.1) and (0.1, 0.1, 0.1). The model is then used to classify the test points and determine which cluster each belongs to.


Step 4: View the output files

Two output directories are created under /home/hadoop/upload/class8:


Result 1: this directory contains only the predictions, listing which cluster each of the 6 points belongs to.


Result 2: this directory contains the data set together with the predictions.


1.2  Regression algorithm example

1.2.1  Algorithm description

Linear regression is a regression-analysis method that models the relationship between one or more independent variables and a dependent variable using a function called the linear regression equation. The case with a single independent variable is called simple regression; with more than one independent variable it is called multiple regression. In practice most problems are multiple regression.

The linear regression (Linear Regression) problem belongs to the category of supervised learning (Supervised Learning), also referred to as classification (Classification) or inductive learning (Inductive Learning). In this kind of analysis, the types of the data given in the training set are known. The goal of machine learning is, for a given training set, to produce through continuous analysis and learning a classification function (Classification Function) or prediction function (Prediction Function) connecting the attribute set and the class label set; this function is called the classification model (Classification Model) or prediction model (Prediction Model). The model may be a decision tree, a rule set, a Bayesian model or a hyperplane. With this model, the value of an input object can be predicted from its feature vector, or the object's class label can be determined.

Regression problems usually use the least squares (Least Squares) method to iterate toward the optimal weight of each attribute among the features, and a loss function (Loss Function) or error function (Error Function) is defined to set the convergence condition, that is, to serve as the approximation criterion for the gradient descent algorithm.

1.2.2  Example introduction

This example shows how to import a training data set, parse it into an RDD of labeled points, and then use the LinearRegressionWithSGD algorithm to build a simple linear model that predicts label values; finally the mean squared error is computed to evaluate how well the predictions agree with the actual values.

The whole process of a linear regression analysis can be briefly described by the following three steps:

(1) Find a suitable prediction function h(x), used to predict the outcome for the input data. This step is critical: it requires some understanding or analysis of the data in order to know or guess the approximate form of the prediction function, for example whether it is a linear or a nonlinear function. If the relationship is nonlinear, linear regression cannot produce high-quality results.

(2) Construct a loss function (Loss function) that represents the deviation of the predicted output h from the label y in the training data. It may be the difference (h - y) or some other form (such as the squared difference). Summing or averaging this loss over all training data gives the function J(θ), which represents the deviation between the predicted values and the actual labels over the whole training set.

(3) Obviously, the smaller the value of J(θ), the more accurate the prediction function h is, so this step comes down to finding the minimum of J(θ). There are different ways to minimize a function; Spark uses stochastic gradient descent (stochastic gradient descent, SGD).
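As an illustration of steps (2) and (3), the following plain Scala sketch writes out the prediction h(x) = θ·x and one stochastic gradient step θ := θ - α·(h(x) - y)·x for a single sample. It is only a sketch: the learning rate α and the per-sample squared-error loss are assumptions chosen for illustration; LinearRegressionWithSGD performs updates of this kind internally over the whole RDD.

// One stochastic gradient descent step for linear regression (illustrative sketch).
// Per-sample loss: (h(x) - y)^2 / 2, whose gradient w.r.t. theta is (h(x) - y) * x.
def sgdStep(theta: Array[Double], x: Array[Double], y: Double, alpha: Double): Array[Double] = {
  val h = theta.zip(x).map { case (t, xi) => t * xi }.sum      // prediction h(x) = theta . x
  val error = h - y                                            // deviation from the label
  theta.zip(x).map { case (t, xi) => t - alpha * error * xi }  // theta := theta - alpha*error*x
}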

1.2.3 Program code

import org.apache.log4j.{Level, Logger}
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.mllib.regression.LinearRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

object LinearRegression {
  def main(args: Array[String]): Unit = {
    // Suppress unnecessary log output on the terminal
    Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
    Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.OFF)

    // Set up the runtime environment
    val conf = new SparkConf().setAppName("LinearRegression").setMaster("local[4]")
    val sc = new SparkContext(conf)

    // Load and parse the data: each line is "label,feature1 feature2 ..."
    val data = sc.textFile("/home/hadoop/upload/class8/lpsa.data")
    val parsedData = data.map { line =>
      val parts = line.split(',')
      LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
    }

    // Build the model
    val numIterations = 100
    val model = LinearRegressionWithSGD.train(parsedData, numIterations)

    // Evaluate the model on the training examples and compute the training error
    val valuesAndPreds = parsedData.map { point =>
      val prediction = model.predict(point.features)
      (point.label, prediction)
    }

    val MSE = valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }.reduce(_ + _) / valuesAndPreds.count
    println("training Mean Squared Error = " + MSE)

    sc.stop()
  }
}
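If the printed training error looks large, the SGD step size is usually the first parameter to tune. MLlib's LinearRegressionWithSGD.train also has an overload that takes a step size; a sketch, where the value 0.1 is only an assumption to start experimenting from, not a recommendation from this article:

// Sketch: passing an explicit SGD step size (the value 0.1 is an assumption).
val stepSize = 0.1
val tunedModel = LinearRegressionWithSGD.train(parsedData, numIterations, stepSize)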


1.2.4  Running in IDEA

Step 1: Start the Spark cluster with the following commands

$cd /app/hadoop/spark-1.1.0

$sbin/start-all.sh

Step 2: Set up the run configuration in IDEA

Set up the LinearRegression run configuration in IDEA. Because the path of the input data is specified in the program itself, no input parameters need to be set in this configuration dialog.


Step 3: Run and observe the output


1.3  Collaborative filtering example

1.3.1  Algorithm description

Collaborative filtering (Collaborative Filtering, abbreviated CF) is defined on Wikipedia roughly as follows: it uses the preferences of a group with similar interests and common experience to recommend information a user may find interesting. Individuals respond to the information through a cooperative mechanism (for example, by rating it), and these responses are recorded to achieve filtering and help others sift through information. The responses are not necessarily limited to information of particular interest; records of particularly uninteresting information also matter.

Collaborative filtering is often used in recommender systems. These techniques aim to fill in the missing entries of the user-item association matrix.

MLlib currently supports model-based collaborative filtering, in which users and items are represented by a small set of latent factors that are also used to predict the missing entries. MLlib uses alternating least squares (ALS) to learn these latent factors.

A user's preference for an item or piece of information may, depending on the application, be expressed in the user's ratings of items, the user's viewing records, the user's purchase records and so on. This preference information falls into two categories:

- Explicit user feedback: feedback a user provides explicitly, beyond the natural browsing or use of the website, for example ratings of items or comments on items.
- Implicit user feedback: data generated while the user uses the website, implicitly reflecting the user's preference for items, for example that the user bought an item or viewed an item's information.

Explicit user feedback accurately reflects users' real preferences for items, but it requires an extra effort from users. Implicit user behavior, after some analysis and processing, can also reflect users' preferences, although the data is less precise and the analysis of some behaviors carries considerable noise. However, if the right behavior features are chosen, implicit user feedback can also give good results; the choice of behavior features may simply differ a great deal between applications. On an e-commerce website, for example, a purchase is in fact a good implicit feedback signal.
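For implicit feedback of this kind, MLlib offers ALS.trainImplicit alongside the ALS.train used in the program below. A minimal sketch, in which the third field of each Rating carries a preference strength (for example a view count) rather than an explicit score; the implicitRatings RDD and all parameter values are illustrative assumptions:

// Sketch: ALS on implicit feedback (implicitRatings and all values are assumptions).
// Here the Rating's third field is a preference strength, e.g. a view count.
import org.apache.spark.mllib.recommendation.{ALS, Rating}
val implicitModel = ALS.trainImplicit(
  implicitRatings, // RDD[Rating] built from implicit feedback (assumed to exist)
  10,              // rank: number of latent factors
  10,              // iterations
  0.01,            // lambda: regularization parameter
  1.0)             // alpha: confidence weight on the observed preferences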

Depending on its recommendation mechanism, a recommendation engine may use only some of the data sources, and based on this data it derives certain rules or directly predicts and computes users' preferences for other items. The recommendation engine can then recommend items the user is likely to be interested in as soon as the user enters the site.

MLlib currently supports model-based collaborative filtering, in which users and products are described by a set of latent factors that can be used to predict the missing entries. In particular, MLlib implements the alternating least squares (ALS) algorithm to learn these latent factors. The implementation in MLlib has the following parameters (a configuration sketch follows the list):

- numBlocks is the number of blocks used for parallel computation (set to -1 for automatic configuration);
- rank is the number of latent factors in the model;
- iterations is the number of iterations;
- lambda is the regularization parameter of ALS;
- implicitPrefs decides whether to use the explicit-feedback version of ALS or the version adapted to implicit-feedback data sets;
- alpha is a parameter of the implicit-feedback version of ALS that governs the baseline confidence in the observed preference strengths.
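These parameters correspond to setters on an ALS instance, which can be used instead of the ALS.train shorthand that the program below relies on; all the values in this sketch are illustrative assumptions:

// Sketch: configuring ALS through its setters (all values are assumptions).
import org.apache.spark.mllib.recommendation.ALS
val als = new ALS()
  .setBlocks(-1)           // numBlocks: -1 means configure automatically
  .setRank(12)             // rank: number of latent factors
  .setIterations(10)       // iterations
  .setLambda(0.1)          // lambda: regularization parameter
  .setImplicitPrefs(false) // use the explicit-feedback version of ALS
  .setAlpha(1.0)           // alpha: only relevant when implicitPrefs is true
val alsModel = als.run(ratingsRDD) // ratingsRDD: RDD[Rating], assumed to exist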


1.3.2  Example introduction

This example applies the collaborative filtering algorithm to the data provided by GroupLens Research (http://grouplens.org/datasets/movielens/). The data is a set of movie ratings submitted by MovieLens users from the late 1990s to the early 2000s, comprising movie ratings, movie metadata (genres and year) and demographic data about the users (age, zip code, gender, occupation, etc.). The organization provides sample data sets of different sizes for different needs; each sample contains three kinds of data: ratings, user information and movie information.

The analysis of this data is carried out in the following steps:

1. Load two kinds of data:
   a) the sample rating data, using the last column (the timestamp) modulo 10 as the key and the Rating as the value;
   b) the movie catalog lookup table (movie ID -> movie title).
2. Split the sample rating table by key value into 3 parts: training (60%, with the personal ratings added), validation (20%) and test (20%).
3. Train models with different parameters, evaluate them on the validation set, and obtain the model with the best parameters.
4. Predict the ratings of the test set with the best model and compute the root mean squared error against the actual ratings.
5. Based on the user's own rating data, recommend the ten movies predicted to interest the user most (taking care to exclude movies the user has already rated).

1.3.3  Test data description

The movie rating data provided by MovieLens comes in three tables: ratings, user information and movie information. The data attached to this series contains roughly 6000 users and 1,000,000 rating records, located in the /data/class8/movielens/data directory; for a description of the three tables, refer to the README file in that directory.

1. Rating data (ratings.dat)

The rating data has four fields in the format UserID::MovieID::Rating::Timestamp, that is, user ID::movie ID::rating::rating timestamp. The fields are described as follows:

- User IDs range from 1 to 6040
- Movie IDs range from 1 to 3952
- Ratings are on a five-star scale, in the range 0~5
- The rating timestamp is in seconds
- Each user has at least 20 movie ratings

A sample of the data in ratings.dat:

1::1193::5::978300760

1::661::3::978302109

1::914::3::978301968

1::3408::4::978300275

1::2355::5::978824291

1::1197::3::978302268

1::1287::5::978302039

1::2804::5::978300719

2. User information (users.dat)

The user information has five fields in the format UserID::Gender::Age::Occupation::Zip-code, that is, user ID::gender::age::occupation::zip code. The fields are described as follows:

- User IDs range from 1 to 6040
- Gender: M for male, F for female
- Different numbers represent different age ranges, e.g. 25 stands for the 25~34 age range
- Occupation information: the test data distinguishes 21 occupation categories
- Zip code of the user's region

A sample of the data in users.dat:

1::F::1::10::48067

2::M::56::16::70072

3::M::25::15::55117

4::M::45::7::02460

5::M::25::20::55455

6::F::50::9::55117

7::M::35::1::06810

8::M::25::12::11413

3. Movie information (movies.dat)

The movie data has three fields in the format MovieID::Title::Genres, that is, movie ID::movie title::movie genres. The fields are described as follows:

- Movie IDs range from 1 to 3952
- Movie titles are the ones provided by IMDB and include the year of release
- Genres use the actual category names rather than numbers, e.g. Action, Crime, etc.

A sample of the data in movies.dat:

1::Toy Story (1995)::Animation|Children's|Comedy

2::Jumanji (1995)::Adventure|Children's|Fantasy

3::Grumpier Old Men (1995)::Comedy|Romance

4::Waiting to Exhale (1995)::Comedy|Drama

5::Father of the Bride Part II (1995)::Comedy

6::Heat (1995)::Action|Crime|Thriller

7::Sabrina (1995)::Comedy|Romance

8::Tom and Huck (1995)::Adventure|Children's

1.3.4  Program code

import java.io.File
import scala.io.Source
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.rdd._
import org.apache.spark.mllib.recommendation.{ALS, Rating, MatrixFactorizationModel}

object MovieLensALS {

  def main(args: Array[String]) {
    // Suppress unnecessary log output on the terminal
    Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
    Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.OFF)

    if (args.length != 2) {
      println("Usage: /path/to/spark/bin/spark-submit --driver-memory 2g --class week7.MovieLensALS " +
        "week7.jar movieLensHomeDir personalRatingsFile")
      sys.exit(1)
    }

    // Set up the runtime environment
    val conf = new SparkConf().setAppName("MovieLensALS").setMaster("local[4]")
    val sc = new SparkContext(conf)

    // Load the personal ratings produced by the rating program
    val myRatings = loadRatings(args(1))
    val myRatingsRDD = sc.parallelize(myRatings, 1)

    // Sample data directory
    val movieLensHomeDir = args(0)

    // Load the sample ratings; the last column (timestamp) modulo 10 becomes the key
    // and the Rating is the value, i.e. (Int, Rating)
    val ratings = sc.textFile(new File(movieLensHomeDir, "ratings.dat").toString).map { line =>
      val fields = line.split("::")
      (fields(3).toLong % 10, Rating(fields(0).toInt, fields(1).toInt, fields(2).toDouble))
    }

    // Load the movie catalog lookup table (movie ID -> movie title)
    val movies = sc.textFile(new File(movieLensHomeDir, "movies.dat").toString).map { line =>
      val fields = line.split("::")
      (fields(0).toInt, fields(1))
    }.collect().toMap

    val numRatings = ratings.count()
    val numUsers = ratings.map(_._2.user).distinct().count()
    val numMovies = ratings.map(_._2.product).distinct().count()

    println("Got " + numRatings + " ratings from " + numUsers + " users on " + numMovies + " movies.")

    // Split the sample ratings by key into 3 parts: training (60%, plus the personal
    // ratings), validation (20%) and test (20%).
    // The data is used repeatedly below, so cache it in memory.
    val numPartitions = 4
    val training = ratings.filter(x => x._1 < 6)
      .values
      .union(myRatingsRDD) // note: ratings is (Int, Rating), so take the values
      .repartition(numPartitions)
      .cache()
    val validation = ratings.filter(x => x._1 >= 6 && x._1 < 8)
      .values
      .repartition(numPartitions)
      .cache()
    val test = ratings.filter(x => x._1 >= 8).values.cache()

    val numTraining = training.count()
    val numValidation = validation.count()
    val numTest = test.count()

    println("Training: " + numTraining + ", validation: " + numValidation + ", test: " + numTest)

    // Train models with different parameters, evaluate them on the validation set,
    // and keep the model with the best parameters
    val ranks = List(8, 12)
    val lambdas = List(0.1, 10.0)
    val numIters = List(10, 20)
    var bestModel: Option[MatrixFactorizationModel] = None
    var bestValidationRmse = Double.MaxValue
    var bestRank = 0
    var bestLambda = -1.0
    var bestNumIter = -1
    for (rank <- ranks; lambda <- lambdas; numIter <- numIters) {
      val model = ALS.train(training, rank, numIter, lambda)
      val validationRmse = computeRmse(model, validation, numValidation)
      println("RMSE (validation) = " + validationRmse + " for the model trained with rank = "
        + rank + ", lambda = " + lambda + ", and numIter = " + numIter + ".")
      if (validationRmse < bestValidationRmse) {
        bestModel = Some(model)
        bestValidationRmse = validationRmse
        bestRank = rank
        bestLambda = lambda
        bestNumIter = numIter
      }
    }

    // Predict the ratings of the test set with the best model and compute the root
    // mean squared error against the actual ratings
    val testRmse = computeRmse(bestModel.get, test, numTest)

    println("The best model was trained with rank = " + bestRank + " and lambda = " + bestLambda + ", and numIter = " + bestNumIter + ", and its RMSE on the test set is " + testRmse + ".")

    // Create a naive baseline and compare it with the best model
    val meanRating = training.union(validation).map(_.rating).mean
    val baselineRmse =
      math.sqrt(test.map(x => (meanRating - x.rating) * (meanRating - x.rating)).mean)
    val improvement = (baselineRmse - testRmse) / baselineRmse * 100
    println("The best model improves the baseline by " + "%1.2f".format(improvement) + "%.")

    // Recommend the ten movies of highest predicted interest, excluding the movies
    // the user has already rated
    val myRatedMovieIds = myRatings.map(_.product).toSet
    val candidates = sc.parallelize(movies.keys.filter(!myRatedMovieIds.contains(_)).toSeq)
    val recommendations = bestModel.get
      .predict(candidates.map((0, _)))
      .collect()
      .sortBy(-_.rating)
      .take(10)

    var i = 1
    println("Movies recommended for you:")
    recommendations.foreach { r =>
      println("%2d".format(i) + ": " + movies(r.product))
      i += 1
    }

    sc.stop()
  }

  /** Compute the root mean squared error between predicted and actual ratings */
  def computeRmse(model: MatrixFactorizationModel, data: RDD[Rating], n: Long): Double = {
    val predictions: RDD[Rating] = model.predict(data.map(x => (x.user, x.product)))
    val predictionsAndRatings = predictions.map(x => ((x.user, x.product), x.rating))
      .join(data.map(x => ((x.user, x.product), x.rating)))
      .values
    math.sqrt(predictionsAndRatings.map(x => (x._1 - x._2) * (x._1 - x._2)).reduce(_ + _) / n)
  }

  /** Load the personal ratings file */
  def loadRatings(path: String): Seq[Rating] = {
    val lines = Source.fromFile(path).getLines()
    val ratings = lines.map { line =>
      val fields = line.split("::")
      Rating(fields(0).toInt, fields(1).toInt, fields(2).toDouble)
    }.filter(_.rating > 0.0)
    if (ratings.isEmpty) {
      sys.error("No ratings provided.")
    } else {
      ratings.toSeq
    }
  }
}
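Besides running in IDEA (next section), the program can also be submitted to the cluster with the command given in its own usage message. The jar name week7.jar and the class week7.MovieLensALS come from that message, and the two arguments are the paths used throughout this article; adapt them to your own packaging and environment:

$cd /app/hadoop/spark-1.1.0
$bin/spark-submit --driver-memory 2g --class week7.MovieLensALS week7.jar /home/hadoop/upload/class8/movielens/data/ /home/hadoop/upload/class8/movielens/personalRatings.txt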


1.3.5 Running in IDEA

Step 1: Start the Spark cluster with the following commands

$cd /app/hadoop/spark-1.1.0

$sbin/start-all.sh

Step 2: Rate movies to generate the personal sample data

Because the program ultimately recommends ten movies to the user, the user first has to rate the sample movie data; the recommendations for this user are then derived from the best model. Ratings can be entered with the /home/hadoop/upload/class8/movielens/bin/rateMovies program, which produces the personalRatings.txt file:


Step 3: Set up the run configuration in IDEA

Set up the MovieLensALS run configuration in IDEA. The directory containing the input data and the path of the user's rating file must be set:

- Input data directory: the directory of the data files, containing the rating, user and movie information; here it is set to /home/hadoop/upload/class8/movielens/data/
- User rating file path: the file produced by rating ten movies in the previous step; here it is set to /home/hadoop/upload/class8/movielens/personalRatings.txt

Step 4: Run and observe the output

- The output Got 1000209 ratings from 6040 users on 3706 movies indicates that the computation involves roughly one million ratings from over 6000 users on 3706 movies;

- The output Training: 602252, validation: 198919, test: 199049 indicates that the rating data has been split into training, validation and test sets in a ratio of roughly 6:2:2;

- During the computation, 8 models with different parameters are trained on the data, and the best one is selected among them; the best model improves on the naive baseline by 22.30%:

RMSE (validation) = 0.8680885498009973 for the model trained with rank = 8, lambda = 0.1, and numIter = 10.

RMSE (validation) = 0.868882967482595 for the model trained with rank = 8, lambda = 0.1, and numIter = 20.

RMSE (validation) = 3.7558695311242833 for the model trained with rank = 8, lambda = 10.0, and numIter = 10.

RMSE (validation) = 3.7558695311242833 for the model trained with rank = 8, lambda = 10.0, and numIter = 20.

RMSE (validation) = 0.8663942501841964 for the model trained with rank = 12, lambda = 0.1, and numIter = 10.

RMSE (validation) = 0.8674684744165418 for the model trained with rank = 12, lambda = 0.1, and numIter = 20.

RMSE (validation) = 3.7558695311242833 for the model trained with rank = 12, lambda = 10.0, and numIter = 10.

RMSE (validation) = 3.7558695311242833 for the model trained with rank = 12, lambda = 10.0, and numIter = 20.

The best model was trained with rank = 12 and lambda = 0.1, and numIter = 10, and its RMSE on the test set is 0.8652326018300565.

The best model improves the baseline by 22.30%.

- Using the best model obtained above, combined with the sample ratings provided by the user, the following movies are finally recommended to the user:

Movies recommended for you:

 1: Bewegte Mann, Der (1994)

 2: Chushingura (1962)

 3: Love Serenade (1996)

 4: For All Mankind (1989)

 5: Vie est belle, La (Life is Rosey) (1987)

 6: Bandits (1997)

 7: King of Masks, The (Bian Lian) (1996)

 8: I'm the One That I Want (2000)

 9: Big Trees, The (1952)

10: First Love, Last Rites (1997)


 

