
Data analysis - Thinking foreshadowing

2022-07-05 23:13:00 Dutkig

Three parts of data analysis

  1. Data collection

  2. Data mining

    The core of data mining is to extract the commercial value of data, which is what we call business intelligence (BI).

    You need to master and understand the following:
    ① The basic flow
    ② Ten algorithms
    ③ Some mathematical background

  3. Data visualization

    This part is mainly about learning to use the relevant tools.

Two principles

  1. Use third-party libraries to implement your ideas wherever possible.
  2. Choose the tools with the most users: they tend to have fewer bugs, complete documentation, and many examples.

The basic flow

  1. Business understanding: understand the project requirements from a business perspective, so that the analysis better serves the business;
  2. Data understanding: explore the data, including describing it and verifying its quality, to form a preliminary picture of it;
  3. Data preparation: clean and integrate the data;
  4. Modeling: apply mining models and tune them to get better classification results;
  5. Model evaluation: evaluate the model and review every step taken to build it, confirming whether it meets the business objectives;
  6. Deployment: put the model online.

Ten algorithms of data mining

By purpose, the ten algorithms fall into the following four categories:

  • Classification algorithms: C4.5, Naive Bayes, SVM, KNN, AdaBoost, CART
  • Clustering algorithms: K-Means, EM
  • Association analysis: Apriori
  • Link analysis: PageRank

First, a brief look at each of the 10 algorithms:


C4.5

C4.5 is a decision tree algorithm. It prunes the tree while building it, can handle continuous attributes, and can also process incomplete data.
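As a rough illustration of how a continuous attribute can be handled, the sketch below scores candidate thresholds by gain ratio, the split criterion usually associated with C4.5. It is a minimal sketch, not a full tree builder: the data, the function names, and the choice of midpoints as candidate thresholds are all assumptions made for the example, and pruning and missing values are not covered.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def gain_ratio(values, labels, threshold):
    """Gain ratio of splitting a continuous attribute at `threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0
    total = len(labels)
    weights = [len(left) / total, len(right) / total]
    info_gain = entropy(labels) - (weights[0] * entropy(left) + weights[1] * entropy(right))
    split_info = -sum(w * math.log2(w) for w in weights)
    return info_gain / split_info

def best_threshold(values, labels):
    """Try midpoints between consecutive distinct values; keep the best one."""
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return max(candidates, key=lambda t: gain_ratio(values, labels, t))

# Hypothetical toy data: temperature readings and whether a game was played.
temperatures = [64, 65, 68, 69, 70, 71, 72, 75]
played = ["yes", "yes", "yes", "yes", "no", "no", "no", "no"]
threshold = best_threshold(temperatures, played)
```

With this toy data the labels separate cleanly, so the selected threshold is 69.5, the midpoint between the last "yes" and the first "no".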

Naive Bayes

Naive Bayes is based on probability theory. To classify an unknown object, we compute the probability of each category given the object's observed features; the object is assigned to whichever category has the highest probability.
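The idea of "pick the category with the highest probability" can be sketched for categorical features as follows. This is a minimal sketch under stated assumptions: the toy data, the `train`/`predict` names, and the add-one smoothing (which assumes each feature takes two possible values) are illustrative choices, not part of the original description.

```python
from collections import Counter, defaultdict

def train(samples, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (class, feature index) -> value counts
    for features, label in zip(samples, labels):
        for i, value in enumerate(features):
            value_counts[(label, i)][value] += 1
    return class_counts, value_counts

def predict(model, features):
    """Return the class with the largest (unnormalised) posterior."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(class)
        for i, value in enumerate(features):
            seen = value_counts[(label, i)]
            # Add-one smoothing; the "+ 2" assumes two possible values per feature.
            score *= (seen[value] + 1) / (count + 2)
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical toy data: (weather, wind) -> play outside?
samples = [("sunny", "weak"), ("sunny", "strong"), ("rainy", "weak"),
           ("rainy", "strong"), ("sunny", "weak"), ("rainy", "strong")]
labels = ["yes", "yes", "no", "no", "yes", "no"]
model = train(samples, labels)
prediction = predict(model, ("sunny", "strong"))
```

Here `prediction` is "yes": the score for "yes" is 0.5 × 0.8 × 0.4 = 0.16, against 0.5 × 0.2 × 0.6 = 0.06 for "no".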


SVM

A support vector machine (SVM) builds a hyperplane as its classification model.
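To make "hyperplane" concrete, the sketch below shows how a point is classified once a hyperplane w · x + b = 0 is known, and how far each point lies from it. It does not train an SVM: the weights and bias are chosen by hand as an assumption for the example, whereas a real SVM would learn the hyperplane that maximises the margin.

```python
import math

def decision_value(weights, bias, point):
    """Signed score w . x + b; its sign says which side of the hyperplane x is on."""
    return sum(w * x for w, x in zip(weights, point)) + bias

def classify(weights, bias, point):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if decision_value(weights, bias, point) >= 0 else -1

def distance_to_hyperplane(weights, bias, point):
    """Geometric distance |w . x + b| / ||w||."""
    norm = math.sqrt(sum(w * w for w in weights))
    return abs(decision_value(weights, bias, point)) / norm

# Hypothetical hyperplane x1 + x2 - 3 = 0, chosen by hand rather than learned.
weights, bias = [1.0, 1.0], -3.0
points = [(0.0, 0.0), (1.0, 1.0), (3.0, 3.0), (2.0, 2.0)]
predictions = [classify(weights, bias, p) for p in points]
margin = min(distance_to_hyperplane(weights, bias, p) for p in points)
```

The points closest to the hyperplane, here (1, 1) and (2, 2), play the role that support vectors play in a trained model.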


KNN

In the K-nearest neighbor algorithm (K-Nearest Neighbor), each sample is represented by its k nearest neighbors: if most of a sample's k nearest neighbors belong to class A, the sample is assigned to class A as well.
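The neighbor vote described above can be sketched in a few lines. This is a minimal sketch: the Euclidean distance, the default `k=3`, and the toy points are assumptions made for the example.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical toy data: two well-separated groups.
train_points = [(1, 1), (1, 2), (2, 1), (5, 5), (5, 6), (6, 5)]
train_labels = ["A", "A", "A", "B", "B", "B"]
prediction = knn_predict(train_points, train_labels, query=(1.5, 1.5))
```

There is no training step: all the work happens at query time, which is why KNN gets slower as the training set grows.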


AdaBoost

AdaBoost builds a combined classification model during training. It is a boosting algorithm: it lets us assemble a strong classifier out of multiple weak classifiers, which makes it a commonly used classification algorithm.
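The sketch below shows one way weak classifiers can be combined into a stronger one, using one-dimensional decision stumps as the weak learners. The data, the stump definition, and the candidate thresholds (which assume integer-spaced inputs) are assumptions chosen so that no single stump can classify every point correctly.

```python
import math

def stump(threshold, polarity, x):
    """Weak classifier: predicts `polarity` left of the threshold, else -polarity."""
    return polarity if x < threshold else -polarity

def train_adaboost(xs, ys, rounds):
    n = len(xs)
    weights = [1.0 / n] * n
    # Candidate thresholds for this toy data (assumes integer-spaced inputs).
    thresholds = [min(xs) - 0.5] + [x + 0.5 for x in sorted(xs)]
    ensemble = []  # list of (alpha, threshold, polarity)
    for _ in range(rounds):
        # Pick the stump with the lowest weighted error under the current weights.
        best = None
        for t in thresholds:
            for polarity in (1, -1):
                error = sum(w for w, x, y in zip(weights, xs, ys)
                            if stump(t, polarity, x) != y)
                if best is None or error < best[0]:
                    best = (error, t, polarity)
        error, t, polarity = best
        error = min(max(error, 1e-10), 1 - 1e-10)  # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - error) / error)
        ensemble.append((alpha, t, polarity))
        # Raise the weight of misclassified samples, lower the rest, then renormalise.
        weights = [w * math.exp(-alpha * y * stump(t, polarity, x))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * stump(t, polarity, x) for alpha, t, polarity in ensemble)
    return 1 if score >= 0 else -1

xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]
ensemble = train_adaboost(xs, ys, rounds=3)
predictions = [predict(ensemble, x) for x in xs]
```

On this data a single stump misclassifies at least two points, while the weighted vote of three stumps reproduces `ys` exactly.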


CART

CART stands for Classification and Regression Trees. As the name suggests, it can build two kinds of tree: a classification tree and a regression tree. Like C4.5, it is a decision tree learning method.
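For the classification case, CART is commonly described as choosing binary splits that minimise Gini impurity. The sketch below evaluates splits on a single numeric feature under that assumption; the data and function names are made up for the example, and the regression case is not shown.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold, weighted Gini) of the best binary split on one feature."""
    best = None
    distinct = sorted(set(values))
    for a, b in zip(distinct, distinct[1:]):
        threshold = (a + b) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical toy data: age and whether a customer bought a product.
ages = [22, 25, 30, 35, 40, 45]
bought = ["no", "no", "no", "yes", "yes", "no"]
threshold, impurity = best_split(ages, bought)
```

With this data the best split is at 32.5, leaving a pure left branch and a weighted impurity of 2/9.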


Apriori

Apriori is an algorithm for mining association rules. It works by mining frequent itemsets to reveal relationships between items, and is widely used in business data mining and network security. A frequent itemset is a set of items that often appear together; an association rule suggests that there may be a strong relationship between two items.
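A level-by-level search for frequent itemsets might look like the sketch below. The shopping-basket transactions and the `min_support` value of 3 are assumptions for the example; support is counted as a number of transactions rather than a fraction.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support count} for all itemsets meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    items = {item for t in transactions for item in t}
    current = {frozenset([item]) for item in items}
    frequent = {}
    k = 1
    while current:
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        survivors = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(survivors)
        k += 1
        # Join step: union pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in survivors for b in survivors if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        current = {c for c in candidates
                   if all(frozenset(s) in survivors for s in combinations(c, k - 1))}
    return frequent

# Hypothetical shopping baskets.
transactions = [
    {"milk", "bread", "diapers"},
    {"cola", "bread", "diapers", "beer"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
frequent = apriori(transactions, min_support=3)

# Confidence of the rule {beer} -> {diapers}: support(both) / support(beer).
confidence = frequent[frozenset({"beer", "diapers"})] / frequent[frozenset({"beer"})]
```

The prune step is what gives the algorithm its name: an itemset can only be frequent if all of its subsets are, so most candidates are discarded before they are counted.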


K-Means

K-Means is a clustering algorithm. Think of it this way: the goal is to divide the objects into K classes. Suppose each class has a "center point", a kind of opinion leader, that is the core of the class. To classify a new point, just compute its distance to each of the K center points; it joins the class whose center is nearest.
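The assign-then-recompute loop behind this idea can be sketched as follows. The toy points and the hand-picked initial centers are assumptions for the example; practical implementations usually choose the initial centers more carefully.

```python
import math

def assign(points, centers):
    """Index of the nearest center for each point."""
    return [min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            for p in points]

def k_means(points, centers, max_iterations=100):
    """Alternate between assigning points and moving centers until stable."""
    for _ in range(max_iterations):
        labels = assign(points, centers)
        new_centers = []
        for i, old in enumerate(centers):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                new_centers.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centers.append(old)  # keep an empty cluster's center in place
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assign(points, centers)

# Hypothetical toy data with two obvious groups; initial centers picked by hand.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 3.0), (8.0, 8.0), (9.0, 9.0), (8.5, 10.0)]
centers, labels = k_means(points, centers=[points[0], points[3]])

# Classifying a new point only needs the distance to each center.
new_label = assign([(2.0, 2.0)], centers)[0]
```

The final `assign` call mirrors the description above: once the centers are known, a new point simply joins the class of the nearest one.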


EM

The EM algorithm, also called the expectation-maximization algorithm, is a method for finding maximum likelihood estimates of parameters. The principle: suppose we want to estimate parameters A and B, both unknown at the start, where knowing A gives information about B and knowing B in turn gives information about A. We first give A some initial value, use it to estimate B, then re-estimate A from that estimate of B, and repeat until convergence.
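One small instance of this back-and-forth is estimating the means of two Gaussians from unlabeled data. In the sketch below the means play the role of A and the per-point responsibilities play the role of B. The equal mixing weights, unit variances, data, and starting values are simplifying assumptions made for the example.

```python
import math

def em_two_means(data, mean_a, mean_b, iterations=50):
    """Estimate the means of two equal-weight, unit-variance Gaussians."""
    for _ in range(iterations):
        # E-step: responsibility of component A for each point, given current means.
        resp_a = []
        for x in data:
            like_a = math.exp(-0.5 * (x - mean_a) ** 2)
            like_b = math.exp(-0.5 * (x - mean_b) ** 2)
            resp_a.append(like_a / (like_a + like_b))
        # M-step: re-estimate each mean as a responsibility-weighted average.
        weight_a = sum(resp_a)
        weight_b = len(data) - weight_a
        mean_a = sum(r * x for r, x in zip(resp_a, data)) / weight_a
        mean_b = sum((1 - r) * x for r, x in zip(resp_a, data)) / weight_b
    return mean_a, mean_b

# Hypothetical data drawn around 1.0 and 5.0; deliberately rough starting guesses.
data = [0.8, 1.0, 1.2, 0.9, 1.1, 4.9, 5.0, 5.2, 5.1, 4.8]
mean_a, mean_b = em_two_means(data, mean_a=0.0, mean_b=6.0)
```

Starting from guesses of 0.0 and 6.0, the estimates settle close to 1.0 and 5.0. A fixed iteration count stands in for a proper convergence check here.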


PageRank

PageRank originated from measuring the influence of academic papers: the more often a paper is cited, the greater its influence. Google creatively applied the same idea to computing web page weights: the more pages a page links out to, the more "references" it makes, and the more often a page is linked to by other pages, the more often it is cited. Based on this principle, we can compute a weight for each web page.
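The weight computation can be sketched as a simple power iteration over a link graph. The four-page graph, the damping factor of 0.85 (a conventional choice), and the fixed iteration count are assumptions for the example; every linked page is assumed to appear as a key in `links`.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power iteration over a dict mapping each page to the pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A page shares its rank equally among the pages it links to.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical link graph: C is linked to by A, B and D; nothing links to D.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
rank = pagerank(links)
```

In this graph C ends up with the highest weight because three pages link to it, and D with the lowest because no page links to it.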

