2022-06-13 05:54:00 【morpheusWB】
Spark Getting Started in Practice Series -- 8. Spark MLlib (Part 2): Hands-On with the SparkMLlib Machine Learning Library
[Note] The installation packages and test data used throughout this series can be obtained from the companion resource package of the "Spark Getting Started in Practice" series.
1. MLlib Examples
1.1 Clustering examples
1.1.1 Algorithm description
Clustering (cluster analysis) groups a set of objects into clusters so that objects within the same cluster are as similar as possible, while objects in different clusters are as dissimilar as possible. Clustering algorithms are an important part of machine learning (or, more precisely, data mining); besides the simple K-Means algorithm, common approaches include hierarchical methods (CURE, CHAMELEON, etc.) and grid-based methods (STING, WaveCluster, etc.).
A more formal definition of the clustering problem: given a set of elements D, each with n observable attributes, use some algorithm to partition D into k subsets such that the dissimilarity between elements within each subset is as low as possible, while the dissimilarity between elements of different subsets is as high as possible. Each subset is called a cluster.
K-Means clustering is unsupervised learning. In regression, Naive Bayes, SVM, and similar methods, every training sample comes with a class label y; in clustering, the samples have only features x and no label y. For example, suppose the stars in the universe are represented as points in three-dimensional space. The goal of clustering is to discover the latent category y of each sample x and to group samples of the same category together. For the stars, the clustering result is a set of star clusters: points within a cluster are close to each other, while points in different clusters are far apart.
Clustering differs from classification. Classification is learning from examples: the categories must be specified before classification, and each element is asserted to map to one category. Clustering is learning from observation: before clustering, the categories -- even the number of categories -- may be unknown, which makes it a form of unsupervised learning. Clustering is widely used in statistics, biology, database technology, and marketing, and many algorithms exist for it.
1.1.2 Introduction
This example uses the K-Means algorithm. K-Means is an iterative clustering algorithm based on squared error and reassignment, and its core idea is simple:
- Randomly select K center points;
- Compute the distance from every point to the K centers and assign each point to the cluster of its nearest center;
- Recompute the center of each of the K clusters as the arithmetic mean (mean) of its points;
- Repeat steps 2 and 3 until the cluster assignments no longer change or the maximum number of iterations is reached;
- Output the result.
The result of K-Means depends on the choice of the initial cluster centers, so it can easily fall into a local optimum; there is no general rule for choosing K; the algorithm is sensitive to outliers; it can only handle numeric attributes; and the resulting clusters may be unbalanced.
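The full MLlib-based program appears in section 1.1.4. As a complement, here is a minimal plain-Scala sketch (not MLlib's implementation; the point values are simply the six rows of kmeans_data.txt) that walks through the steps listed above:
object KMeansSketch {
  // Squared Euclidean distance between two points
  def dist2(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum
  // Index of the nearest center for point p (step 2 above)
  def closest(p: Array[Double], centers: Array[Array[Double]]): Int =
    centers.indices.minBy(i => dist2(p, centers(i)))
  // One iteration: assign each point to its nearest center, then move every center
  // to the arithmetic mean of its assigned points (step 3); empty clusters are not handled
  def step(points: Seq[Array[Double]], centers: Array[Array[Double]]): Array[Array[Double]] =
    points.groupBy(p => closest(p, centers)).values.map { cluster =>
      cluster.transpose.map(_.sum / cluster.size).toArray
    }.toArray
  def main(args: Array[String]): Unit = {
    // The six points from kmeans_data.txt, with K = 2
    val points = Seq(
      Array(0.0, 0.0, 0.0), Array(0.1, 0.1, 0.1), Array(0.2, 0.2, 0.2),
      Array(9.0, 9.0, 9.0), Array(9.1, 9.1, 9.1), Array(9.2, 9.2, 9.2))
    var centers = Array(points(0), points(3))            // step 1: pick two starting centers
    for (_ <- 1 to 20) centers = step(points, centers)   // step 4: iterate
    centers.foreach(c => println(c.mkString("[", ", ", "]")))
  }
}
With this data the centers converge to roughly (0.1, 0.1, 0.1) and (9.1, 9.1, 9.1), which matches the result reported in section 1.1.5.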
This example performs the following steps:
1. Load the data, which is stored as a text file;
2. Cluster the data set: train a model with 2 classes and 20 iterations;
3. Print the cluster centers of the model;
4. Evaluate the model using the within-set sum of squared errors;
5. Use the model to predict single data points;
6. Cross evaluation 1: return only the predictions; cross evaluation 2: return the data set together with the predictions.
1.1.3 Test data description
This example uses kmeans_data.txt, which can be found in the /data/class8/ directory of the resources accompanying this series. The file contains the spatial coordinates of 6 points; K-Means clustering is used to classify them.
The contents of kmeans_data.txt are as follows:
0.0 0.0 0.0
0.1 0.1 0.1
0.2 0.2 0.2
9.0 9.0 9.0
9.1 9.1 9.1
9.2 9.2 9.2
1.1.4 Program code
import org.apache.log4j.{Level, Logger}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
object Kmeans {
def main(args: Array[String]) {
// Block unnecessary log display on the terminal
Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.OFF)
// Set up the operating environment
val conf = new SparkConf().setAppName("Kmeans").setMaster("local[4]")
val sc = new SparkContext(conf)
// Load data set
val data = sc.textFile("/home/hadoop/upload/class8/kmeans_data.txt", 1)
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble)))
// Cluster the data set: 2 classes, 20 iterations, training the model
val numClusters = 2
val numIterations = 20
val model = KMeans.train(parsedData, numClusters, numIterations)
// Print the center point of the data model
println("Cluster centers:")
for (c <- model.clusterCenters) {
println(" " + c.toString)
}
// Use the sum of squares of errors to evaluate the data model
val cost = model.computeCost(parsedData)
println("Within Set Sum of Squared Errors = " + cost)
// Use the model to test single point data
println("Vectors 0.2 0.2 0.2 is belongs to clusters:" + model.predict(Vectors.dense("0.2 0.2 0.2".split(' ').map(_.toDouble))))
println("Vectors 0.25 0.25 0.25 is belongs to clusters:" + model.predict(Vectors.dense("0.25 0.25 0.25".split(' ').map(_.toDouble))))
println("Vectors 8 8 8 is belongs to clusters:" + model.predict(Vectors.dense("8 8 8".split(' ').map(_.toDouble))))
// Cross evaluation 1, Only return results
val testdata = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble)))
val result1 = model.predict(testdata)
result1.saveAsTextFile("/home/hadoop/upload/class8/result_kmeans1")
// Cross evaluation 2: return the data set together with the predictions
data.map { line =>
val lineVector = Vectors.dense(line.split(' ').map(_.toDouble))
val prediction = model.predict(lineVector)
line + " " + prediction
}.saveAsTextFile("/home/hadoop/upload/class8/result_kmeans2")
sc.stop()
}
}
1.1.5 Running in IDEA
Step 1: Start the Spark cluster with the following commands
$cd /app/hadoop/spark-1.1.0
$sbin/start-all.sh
Step 2: Configure the run environment in IDEA
Create a run configuration for Kmeans in IDEA. Because the input data path is specified in the program, no input parameters need to be set in this configuration dialog.
Step 3: Run and observe the output
The run log shows that the model finds two cluster centers, (9.1, 9.1, 9.1) and (0.1, 0.1, 0.1), and that the test points are assigned to their clusters by the model.
Step 4: View the output files
Two output directories are created under /home/hadoop/upload/class8:
Result 1: this directory contains only the predictions, listing the cluster each of the 6 points belongs to.
Result 2: this directory contains both the data set and the predictions.
1.2 Regression algorithm example
1.2.1 Algorithm description
Linear regression is a regression analysis method that models the relationship between one or more independent variables and a dependent variable with a function called the linear regression equation. The case with a single independent variable is called simple regression; with more than one independent variable it is called multiple regression, which is the more common case in practice.
Linear regression (Linear Regression) belongs to the category of supervised learning (Supervised Learning), also referred to as classification (Classification) or inductive learning (Inductive Learning). In this kind of analysis the labels of the training data are given. The goal of machine learning is, for a given training data set, to learn through analysis a classification function (Classification Function) or prediction function (Prediction Function) that connects the attribute set to the label set; this function is called the classification model (Classification Model) or prediction model (Prediction Model). The model can be a decision tree, a rule set, a Bayesian model, or a hyperplane, and it is used to predict the value of a new input feature vector or to determine its class label.
For regression problems, the method of least squares (Least Squares) is usually used to iteratively find the optimal weight of each attribute. A loss function (Loss Function), also called an error function (Error Function), defines the convergence criterion and serves as the quantity that the gradient descent algorithm drives toward its minimum.
1.2.2 Introduction
This example shows how to import training data, parse it into an RDD of labeled points (LabeledPoint), use the LinearRegressionWithSGD algorithm to build a simple linear model that predicts label values, and finally compute the mean squared error to evaluate how well the predictions match the actual values.
The whole process of linear regression analysis can be described in three steps:
(1) Find a suitable prediction function h(x) for predicting the outcome of the input data. This step is critical: it requires some understanding or analysis of the data to know or guess the rough form of the prediction function, for example linear or nonlinear; if the relationship is nonlinear, linear regression cannot produce high-quality results.
(2) Construct a loss function (Loss Function) that represents the deviation between the predicted output h and the label y in the training data. It can be the difference (h - y) or some other form (such as the squared difference). Summing or averaging this loss over all training data gives the function J(θ), which represents the deviation between the predictions and the actual values over the whole training set.
(3) Clearly, the smaller the value of J(θ), the more accurate the prediction function h. So this step amounts to finding the minimum of J(θ). There are different ways to minimize a function; Spark uses stochastic gradient descent (SGD).
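Written out in standard notation (these are the textbook formulas, added here for concreteness rather than quoted from this article), the prediction function, the loss, and one SGD update are:
h_\theta(x) = \theta^{T}x, \qquad J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^{2}, \qquad \theta := \theta - \alpha\left(h_\theta(x^{(i)}) - y^{(i)}\right)x^{(i)}
where m is the number of training samples and α is the step size (learning rate). Batch gradient descent averages the update over all samples, while stochastic gradient descent applies it one sample (or one small batch) at a time, which is the approach LinearRegressionWithSGD takes.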
1.2.3 Program code
import org.apache.log4j.{Level, Logger}
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.mllib.regression.LinearRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
object LinearRegression {
def main(args:Array[String]): Unit ={
// Screen unnecessary logs on the display terminal
Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.OFF)
// Set up the operating environment
val conf = new SparkConf().setAppName("Kmeans").setMaster("local[4]")
val sc = new SparkContext(conf)
// Load and parse the data
val data = sc.textFile("/home/hadoop/upload/class8/lpsa.data")
val parsedData = data.map { line =>
val parts = line.split(',')
LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}
// Building the model
val numIterations = 100
val model = LinearRegressionWithSGD.train(parsedData, numIterations)
// Evaluate model on training examples and compute training error
val valuesAndPreds = parsedData.map { point =>
val prediction = model.predict(point.features)
(point.label, prediction)
}
val MSE = valuesAndPreds.map{ case(v, p) => math.pow((v - p), 2)}.reduce (_ + _) / valuesAndPreds.count
println("training Mean Squared Error = " + MSE)
sc.stop()
}
}
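As a small usage note (not part of the original program; the feature values below are invented for illustration), the trained LinearRegressionModel can also score a single point via predict on a Vector, in the same style as the K-Means example. The vector must have the same dimensionality as the training features (8 columns in the standard lpsa.data sample):
// Hypothetical single-point prediction, to be placed before sc.stop();
// the eight feature values are placeholders only
val newPoint = Vectors.dense(Array(-0.5, -2.0, -1.9, -1.0, -0.5, -0.8, -1.0, -0.9))
println("Prediction for new point = " + model.predict(newPoint))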
1.2.4 Running in IDEA
Step 1: Start the Spark cluster with the following commands
$cd /app/hadoop/spark-1.1.0
$sbin/start-all.sh
Step 2: Configure the run environment in IDEA
Create a run configuration for LinearRegression in IDEA. Because the input data path is specified in the program, no input parameters need to be set in this configuration dialog.
Step 3: Run and observe the output
1.3 Collaborative filtering instances
1.3.1 Algorithm description
Collaborative filtering (Collaborative Filtering, abbreviated CF) is defined on Wikipedia roughly as follows: it recommends information of interest to a user based on the preferences of a group with similar interests and shared experience. Individuals respond to the information through a cooperative mechanism (for example, rating it), and these responses are recorded so that information can be filtered and others can be helped to sift through it. The responses are not necessarily limited to items of particular interest; records of items users particularly dislike are also important.
Collaborative filtering is often used in recommendation systems. These techniques aim to fill in the missing entries of the user-item association matrix.
MLlib currently supports model-based collaborative filtering, in which users and items are represented by a small set of latent factors that are also used to predict the missing entries. MLlib uses the alternating least squares (ALS) algorithm to learn these latent factors.
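In matrix terms (standard factorization notation, added here for clarity rather than taken from the article), the user-item rating matrix is approximated by the product of two low-rank factor matrices:
R \approx U V^{T}, \qquad U \in \mathbb{R}^{m \times k}, \; V \in \mathbb{R}^{n \times k}, \qquad \hat{r}_{ui} = u_{u}^{T} v_{i}
where k is the number of latent factors (the rank parameter described below), and a missing rating is predicted as the inner product of the corresponding user and item factor vectors. ALS alternately fixes one factor matrix and solves a least-squares problem for the other.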
A user's preference for an item or a piece of information may, depending on the application, take the form of the user's rating of the item, the user's viewing history for the item, the user's purchase history, and so on. This preference information falls into two categories:
- Explicit user feedback: feedback the user provides explicitly, beyond naturally browsing or using the site, for example rating an item or commenting on it.
- Implicit user feedback: data generated as the user uses the site that implicitly reflect the user's preference for an item, for example purchasing an item or viewing an item's details.
Explicit feedback accurately reflects a user's real preference for an item, but it requires extra effort from the user. Implicit feedback, after some analysis and processing, can also reflect user preferences, although the data are less precise and the analysis of some behaviors carries a lot of noise. If the right behavioral features are chosen, however, implicit feedback can still give good results; the right features can differ greatly across applications. On an e-commerce site, for example, a purchase is in fact a very good implicit feedback signal.
Depending on its recommendation mechanism, a recommendation engine may use only part of the data sources and, based on those data, analyze rules or directly predict users' preferences for other items. It can then recommend items the user is likely to be interested in as soon as the user arrives.
MLlib's model-based collaborative filtering describes users and items by a set of latent factors that are used to predict the missing entries; concretely, it implements the alternating least squares (ALS) algorithm to learn these factors. The MLlib implementation has the following parameters:
- numBlocks: the number of blocks used for parallel computation (set it to -1 for automatic configuration);
- rank: the number of latent factors in the model;
- iterations: the number of iterations;
- lambda: the regularization parameter of ALS;
- implicitPrefs: decides whether to use the ALS variant for explicit feedback or the variant adapted to implicit-feedback data;
- alpha: a parameter of the implicit-feedback ALS variant that governs the baseline confidence in observed preference behavior.
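A minimal sketch of how these parameters map onto the MLlib API (the rank, iteration count, lambda, and alpha values below are placeholders; the complete program in section 1.3.4 chooses rank, lambda, and the number of iterations by validation):
// assumes: import org.apache.spark.mllib.recommendation.{ALS, Rating}
// ratings is assumed to be an already-parsed RDD[Rating]
val explicitModel = ALS.train(ratings, 12, 10, 0.1)              // rank, iterations, lambda
// For implicit-feedback data, the trainImplicit variant adds the alpha confidence parameter
val implicitModel = ALS.trainImplicit(ratings, 12, 10, 0.1, 1.0) // rank, iterations, lambda, alpha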
1.3.2 Introduction
This example applies a collaborative filtering algorithm to the data provided by GroupLens Research (http://grouplens.org/datasets/movielens/). The data set consists of MovieLens movie ratings contributed by users from the late 1990s to the early 2000s and includes movie ratings, movie metadata (genres and release year), and demographic data about the users (age, zip code, gender, occupation, and so on). The organization provides sample data sets of different sizes for different needs; each sample contains three kinds of data: ratings, user information, and movie information.
The analysis of these data proceeds in the following steps:
1. Load two kinds of data:
a) Load the sample rating data, using the remainder of the timestamp (the last column) divided by 10 as the key and the Rating as the value;
b) Load the movie catalog lookup table (movie ID -> movie title).
2. Split the sample rating data by key into 3 parts: training (60%, with the user's own ratings added), validation (20%), and test (20%).
3. Train models with different parameters, evaluate them on the validation set, and select the model with the best parameters.
4. Use the best model to predict the ratings of the test set and compute the root mean square error between the predictions and the actual ratings.
5. Based on the user's rating data, recommend the ten movies the user is most likely to be interested in (taking care to exclude movies the user has already rated).
1.3.3 Test data description
The MovieLens rating data are organized in three tables: ratings, user information, and movie information. The data attached to this series contain roughly 6,000 users and 1,000,000 ratings, located under the /data/class8/movielens/data directory; for a description of the three tables, see the README file in that directory.
1. Rating data (ratings.dat)
The rating data has four fields in the format UserID::MovieID::Rating::Timestamp, i.e. user ID :: movie ID :: rating :: rating timestamp. The fields are described as follows:
- User IDs range from 1 to 6040
- Movie IDs range from 1 to 3952
- Ratings use a five-star scale, ranging from 0 to 5
- The rating timestamp is in seconds
- Every user has at least 20 ratings
A sample of the ratings.dat data:
1::1193::5::978300760
1::661::3::978302109
1::914::3::978301968
1::3408::4::978300275
1::2355::5::978824291
1::1197::3::978302268
1::1287::5::978302039
1::2804::5::978300719
2. User information (users.dat)
The user information has five fields in the format UserID::Gender::Age::Occupation::Zip-code, i.e. user ID :: gender :: age :: occupation :: zip code. The fields are described as follows:
- User IDs range from 1 to 6040
- Gender: M for male, F for female
- Age is coded as a number representing an age range, e.g. 25 stands for ages 25~34
- Occupation: the test data use 21 occupation categories
- Zip code of the user's area
A sample of the users.dat data:
1::F::1::10::48067
2::M::56::16::70072
3::M::25::15::55117
4::M::45::7::02460
5::M::25::20::55455
6::F::50::9::55117
7::M::35::1::06810
8::M::25::12::11413
3. Movie information (movies.dat)
The movie data has three fields in the format MovieID::Title::Genres, i.e. movie ID :: movie title :: genres. The fields are described as follows:
- Movie IDs range from 1 to 3952
- Titles are provided by IMDB and include the release year
- Genres use the actual category names rather than numbers, e.g. Action, Crime
A sample of the movies.dat data:
1::Toy Story (1995)::Animation|Children's|Comedy
2::Jumanji (1995)::Adventure|Children's|Fantasy
3::Grumpier Old Men (1995)::Comedy|Romance
4::Waiting to Exhale (1995)::Comedy|Drama
5::Father of the Bride Part II (1995)::Comedy
6::Heat (1995)::Action|Crime|Thriller
7::Sabrina (1995)::Comedy|Romance
8::Tom and Huck (1995)::Adventure|Children's
1.3.4 Program code
import java.io.File
import scala.io.Source
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.rdd._
import org.apache.spark.mllib.recommendation.{ALS, Rating, MatrixFactorizationModel}
object MovieLensALS {
def main(args: Array[String]) {
// Block unnecessary log display on the terminal
Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.OFF)
if (args.length != 2) {
println("Usage: /path/to/spark/bin/spark-submit --driver-memory 2g --class week7.MovieLensALS " +
"week7.jar movieLensHomeDir personalRatingsFile")
sys.exit(1)
}
// Set up the operating environment
val conf = new SparkConf().setAppName("MovieLensALS").setMaster("local[4]")
val sc = new SparkContext(conf)
// Load the user's own ratings, generated by the rating tool
val myRatings = loadRatings(args(1))
val myRatingsRDD = sc.parallelize(myRatings, 1)
// Sample data directory
val movieLensHomeDir = args(0)
// Load the sample rating data; the remainder of the timestamp (last column) divided by 10 is the key, the Rating is the value, i.e. (Int, Rating)
val ratings = sc.textFile(new File(movieLensHomeDir, "ratings.dat").toString).map { line =>
val fields = line.split("::")
(fields(3).toLong % 10, Rating(fields(0).toInt, fields(1).toInt, fields(2).toDouble))
}
// Load movie directory comparison table ( The movie ID-> Movie title )
val movies = sc.textFile(new File(movieLensHomeDir, "movies.dat").toString).map { line =>
val fields = line.split("::")
(fields(0).toInt, fields(1))
}.collect().toMap
val numRatings = ratings.count()
val numUsers = ratings.map(_._2.user).distinct().count()
val numMovies = ratings.map(_._2.product).distinct().count()
println("Got " + numRatings + " ratings from " + numUsers + " users on " + numMovies + " movies.")
// Split the rating data by key into 3 parts: training (60%, plus the user's own ratings), validation (20%), and test (20%)
// The data are used repeatedly during training, so cache them in memory
val numPartitions = 4
val training = ratings.filter(x => x._1 < 6)
.values
.union(myRatingsRDD) // note: ratings is (Int, Rating), so take the values
.repartition(numPartitions)
.cache()
val validation = ratings.filter(x => x._1 >= 6 && x._1 < 8)
.values
.repartition(numPartitions)
.cache()
val test = ratings.filter(x => x._1 >= 8).values.cache()
val numTraining = training.count()
val numValidation = validation.count()
val numTest = test.count()
println("Training: " + numTraining + ", validation: " + numValidation + ", test: " + numTest)
// Train models with different parameters, evaluate them on the validation set, and keep the model with the best parameters
val ranks = List(8, 12)
val lambdas = List(0.1, 10.0)
val numIters = List(10, 20)
var bestModel: Option[MatrixFactorizationModel] = None
var bestValidationRmse = Double.MaxValue
var bestRank = 0
var bestLambda = -1.0
var bestNumIter = -1
for (rank <- ranks; lambda <- lambdas; numIter <- numIters) {
val model = ALS.train(training, rank, numIter, lambda)
val validationRmse = computeRmse(model, validation, numValidation)
println("RMSE (validation) = " + validationRmse + " for the model trained with rank = "
+ rank + ", lambda = " + lambda + ", and numIter = " + numIter + ".")
if (validationRmse < bestValidationRmse) {
bestModel = Some(model)
bestValidationRmse = validationRmse
bestRank = rank
bestLambda = lambda
bestNumIter = numIter
}
}
// Use the best model to predict ratings on the test set and compute the RMSE against the actual ratings
val testRmse = computeRmse(bestModel.get, test, numTest)
println("The best model was trained with rank = " + bestRank + " and lambda = " + bestLambda + ", and numIter = " + bestNumIter + ", and its RMSE on the test set is " + testRmse + ".")
// create a naive baseline and compare it with the best model
val meanRating = training.union(validation).map(_.rating).mean
val baselineRmse =
math.sqrt(test.map(x => (meanRating - x.rating) * (meanRating - x.rating)).mean)
val improvement = (baselineRmse - testRmse) / baselineRmse * 100
println("The best model improves the baseline by " + "%1.2f".format(improvement) + "%.")
// Recommend the ten movies the user is most likely to be interested in, excluding movies the user has already rated
val myRatedMovieIds = myRatings.map(_.product).toSet
val candidates = sc.parallelize(movies.keys.filter(!myRatedMovieIds.contains(_)).toSeq)
val recommendations = bestModel.get
.predict(candidates.map((0, _)))
.collect()
.sortBy(-_.rating)
.take(10)
var i = 1
println("Movies recommended for you:")
recommendations.foreach { r =>
println("%2d".format(i) + ": " + movies(r.product))
i += 1
}
sc.stop()
}
/** Compute the root mean square error between the model's predictions and the actual ratings **/
def computeRmse(model: MatrixFactorizationModel, data: RDD[Rating], n: Long): Double = {
val predictions: RDD[Rating] = model.predict(data.map(x => (x.user, x.product)))
val predictionsAndRatings = predictions.map(x => ((x.user, x.product), x.rating))
.join(data.map(x => ((x.user, x.product), x.rating)))
.values
math.sqrt(predictionsAndRatings.map(x => (x._1 - x._2) * (x._1 - x._2)).reduce(_ + _) / n)
}
/** Load user rating file **/
def loadRatings(path: String): Seq[Rating] = {
val lines = Source.fromFile(path).getLines()
val ratings = lines.map { line =>
val fields = line.split("::")
Rating(fields(0).toInt, fields(1).toInt, fields(2).toDouble)
}.filter(_.rating > 0.0)
if (ratings.isEmpty) {
sys.error("No ratings provided.")
} else {
ratings.toSeq
}
}
}
1.3.5 Running in IDEA
Step 1: Start the Spark cluster with the following commands
$cd /app/hadoop/spark-1.1.0
$sbin/start-all.sh
Step 2: Rate movies to generate the user's sample data
Because the program ultimately recommends ten movies to the user, the user must first rate the sample movies; the best model is then used to produce recommendations for this user. The ratings can be entered with the /home/hadoop/upload/class8/movielens/bin/rateMovies program, which produces the personalRatings.txt file:
Step 3: Configure the run environment in IDEA
Create a run configuration for MovieLensALS in IDEA. The folder containing the input data and the path of the user's rating file must be set:
- Input data directory: the directory containing the rating, user, and movie data; here it is set to /home/hadoop/upload/class8/movielens/data/
- User rating file path: the file with the user's ratings of ten movies from the previous step; here it is set to /home/hadoop/upload/class8/movielens/personalRatings.txt
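Equivalently, following the usage string printed by the program itself and the paths configured above, the job could be submitted from the command line instead of IDEA (the jar name and package are those assumed by the program's usage message):
$cd /app/hadoop/spark-1.1.0
$bin/spark-submit --driver-memory 2g --class week7.MovieLensALS week7.jar /home/hadoop/upload/class8/movielens/data /home/hadoop/upload/class8/movielens/personalRatings.txt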
Step 4: Run and observe the output
- The output "Got 1000209 ratings from 6040 users on 3706 movies" shows that the computation involves roughly 1,000,000 ratings from just over 6,000 users on 3,706 movies;
- The output "Training: 602252, validation: 198919, test: 199049" shows that the rating data are split into training, validation, and test sets in a ratio of roughly 6:2:2;
- During the computation, 8 models with different parameter combinations are trained and the best one is selected; the best model improves on the naive baseline by 22.30%:
RMSE (validation) = 0.8680885498009973 for the model trained with rank = 8, lambda = 0.1, and numIter = 10.
RMSE (validation) = 0.868882967482595 for the model trained with rank = 8, lambda = 0.1, and numIter = 20.
RMSE (validation) = 3.7558695311242833 for the model trained with rank = 8, lambda = 10.0, and numIter = 10.
RMSE (validation) = 3.7558695311242833 for the model trained with rank = 8, lambda = 10.0, and numIter = 20.
RMSE (validation) = 0.8663942501841964 for the model trained with rank = 12, lambda = 0.1, and numIter = 10.
RMSE (validation) = 0.8674684744165418 for the model trained with rank = 12, lambda = 0.1, and numIter = 20.
RMSE (validation) = 3.7558695311242833 for the model trained with rank = 12, lambda = 10.0, and numIter = 10.
RMSE (validation) = 3.7558695311242833 for the model trained with rank = 12, lambda = 10.0, and numIter = 20.
The best model was trained with rank = 12 and lambda = 0.1, and numIter = 10, and its RMSE on the test set is 0.8652326018300565.
The best model improves the baseline by 22.30%.
- Using the best model obtained above, combined with the sample ratings provided by the user, the following movies are finally recommended to the user:
Movies recommended for you:
1: Bewegte Mann, Der (1994)
2: Chushingura (1962)
3: Love Serenade (1996)
4: For All Mankind (1989)
5: Vie est belle, La (Life is Rosey) (1987)
6: Bandits (1997)
7: King of Masks, The (Bian Lian) (1996)
8: I'm the One That I Want (2000)
9: Big Trees, The (1952)
10: First Love, Last Rites (1997)
2. References
(1) Spark official MLlib guide: http://spark.apache.org/docs/1.1.0/mllib-guide.html
(2) "Classification and Summary of Common Machine Learning Algorithms": http://www.ctocio.com/hotnews/15919.html