On the application of cluster analysis in work
2022-06-30 23:56:00 【Little fire dragon said data】
Estimated reading time: 8 min
Reading suggestions: This article describes how clustering is applied and summarizes some lessons from practice. I hope it helps you.
Pain points addressed: What is clustering? What value does it have in data analysis? How do you cluster? What are the advantages and disadvantages of each method? I hope you read this article with these questions in mind.
00
Preface
When you hear 「clustering」, do you think of the saying 「birds of a feather flock together」? That is the essence of clustering. Cluster analysis is quite common at work. Have you ever run into any of the following problems?
- User segmentation: once a product reaches a certain scale, you want to divide users into subgroups with distinct characteristics and apply targeted strategies to each group.
- Product portfolio: as the company's product line grows, you want to group products by value and liquidity and design a marketing strategy for each group.
- Anti-cheating detection: most user actions in a product are normal behavior, but some people chase rewards by using bots and similar means to inflate data. How do we uncover this kind of cheating?
When you run into problems like these, cluster analysis can help. Below, Little Fire Dragon starts from the practical application of clustering and lifts the veil for everyone.
01
What is clustering
Clustering divides a data set into clusters according to some rule (for example, distance), so that individuals within the same cluster are as similar as possible and individuals in different clusters are as dissimilar as possible. In the data, this means that individuals with similar feature values are more likely to end up in the same cluster, and dissimilar ones are less likely to.
Tips: There are many ways to calculate distance. Here are a few common ones:
Euclidean distance
Manhattan distance
Cosine distance
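As a minimal sketch, the three distances listed above can be written in plain Python. The function names and the sample points `p` and `q` are illustrative assumptions, not part of the original article.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute differences along each coordinate."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    """1 minus the cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical sample points, chosen only for illustration.
p, q = (1.0, 2.0), (4.0, 6.0)
print(euclidean(p, q))        # 5.0
print(manhattan(p, q))        # 7.0
print(cosine_distance(p, q))  # small value: the two vectors point in similar directions
```

Euclidean and Manhattan distance depend on how far apart the points are, while cosine distance depends only on direction, so the choice affects which individuals count as 「similar」.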
「Clustering」 is an 「unsupervised algorithm」 that requires no prior knowledge (no labels); its counterpart, 「classification」, is a 「supervised algorithm」. If that sounds confusing, an example makes it easier to understand:
Classification problem
Suppose we already know that men have a visible Adam's apple, so anyone with one is labeled male. Here the Adam's apple is prior knowledge, and with this feature people can be classified as men or women.
Clustering problem
For our earliest ancestors there was no notion of men and women. But by looking at features, they could see that some people had an Adam's apple and others did not, and the group with an Adam's apple was then named male.
Here is a clustering case from real work to help build intuition.
02
Practical application of clustering
In 「strategy promotion」, clustering can group the target audience and uncover additional potential audiences. Applications therefore include, but are not limited to, precise marketing and expansion of potential target groups. Here is a case:
Before the Spring Festival, a car company pushed WeChat ads to its presumed target group, 「blue-collar workers in first-tier cities」, to bring in more leads. After clustering users by behavior, however, it found that 「white-collar workers in second-tier cities」 also had the potential to buy a car. The company therefore widened the campaign's target audience and improved lead generation. (The specific application steps are shown in the figure below.)
03
Common clustering methods and their advantages and disadvantages
Now that we have seen what clustering is for, you may want to know what clustering methods exist, how they are implemented, and what the advantages and disadvantages of each are. Little Fire Dragon summarizes them below.
There are many kinds of clustering models. They can be compared roughly along three dimensions:
1. Processing capability of the model: ability to handle different distribution shapes; ability to handle outliers; ability to handle large data sets.
2. Whether the model needs preset parameters: for example, whether the number of categories must be provided.
3. Requirements on the input data: whether the input order affects the model; whether the model restricts the types of features; and so on.
Based on these dimensions, clustering models can be divided into the following categories:
For reasons of space, Little Fire Dragon picks three commonly used models to elaborate on (the red part of the picture).
1、 Hierarchy-based approach - Hierarchical clustering
1. Model principle
Hierarchical clustering comes in two forms, 「agglomerative hierarchical clustering」 and 「divisive hierarchical clustering」. Agglomerative clustering is the more common of the two. Its core principle: start by treating each individual as its own class; in each iteration, merge the two closest classes; stop when all points have been merged into one class or a stop condition is met. It is a bottom-up approach. Divisive hierarchical clustering works in the opposite, top-down direction: it starts with all points in one class and splits it step by step.
2. Model flow
Taking 「agglomerative hierarchical clustering」 as an example:
step 1: Treat each point as its own class and calculate the distance between every pair of classes;
step 2: Merge the two closest classes into a new class and calculate its centroid;
step 3: Repeat steps 1 and 2 until the stop condition is met.
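The three steps above can be sketched roughly as follows. This is a simplified illustration, assuming Euclidean distance between class centroids and a target number of clusters as the stop condition; the function names and sample points are hypothetical, and the brute-force pair search is written for readability rather than speed.

```python
import math

def centroid(points):
    """Mean position of a list of points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def agglomerative(points, n_clusters):
    # step 1: every point starts as its own class
    clusters = [[p] for p in points]
    # step 3: repeat until the (assumed) stop condition is met
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        # step 2: merge the two closest classes into a new class
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
for cluster in agglomerative(points, n_clusters=2):
    print(cluster)
```

Recomputing every pairwise distance in each round is what makes this approach slow on large data sets, which matches the time-complexity drawback listed for this model.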
3. The advantages and disadvantages of the model
[ Advantages ]
- Strong interpretability
- No need to set K in advance (it can be used as a preliminary step to explore K for K-means)
- Handles the non-spherical clusters that K-means struggles with somewhat better
[ Disadvantages ]
- High time complexity, slow to run
- Still cannot handle non-convex distributions well (with the centroid-based merging described above)
2、 Partition-based approach - K-means clustering
1. Model principle
Some of its ideas are similar to 「agglomerative hierarchical clustering」, but the number of final clusters K must be supplied before the model starts. K points are initially chosen as centroids; each point is then assigned to its nearest centroid, and each centroid is recomputed from the points assigned to it. The guiding principle of the iteration is 「points within a class are close enough, points in different classes are far enough apart」. Iteration continues until the centroids no longer change (or an iteration limit is reached).
As for choosing K, methods such as the 「elbow method」 and the 「silhouette coefficient」 can help. I will not go into them here and will cover them in a later article.
2. Model flow
step 1: Randomly select K objects as the initial centroids of the K clusters (because the initial centroids are chosen at random, K-means can give different results from run to run);
step 2: Assign each object to its nearest centroid to form the classes, then recompute each class's centroid;
step 3: Repeat step 2 until the centroids of the K clusters no longer change.
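A minimal sketch of this flow in plain Python is shown below. It assumes Euclidean distance and stops when the centroids stop changing; the `seed` and `init` parameters, the function name, and the sample points are illustrative additions so that runs can be reproduced, not part of the original description.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0, init=None):
    # step 1: pick K initial centroids (random unless `init` is supplied)
    rng = random.Random(seed)
    centroids = list(init) if init is not None else rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # step 2a: assign every point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # step 2b: recompute each centroid as the mean of its cluster
        # (an empty cluster keeps its previous centroid)
        new_centroids = [
            tuple(sum(coord) / len(c) for coord in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        # step 3: stop once the centroids no longer change
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, k=2, seed=42)
print(centroids)
print(clusters)
```

On less cleanly separated data, changing `seed` can change the result, which is the sensitivity to initial centroids mentioned in step 1.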
3. The advantages and disadvantages of the model
[ Advantages ]
- Low time and space complexity, fast to run
[ Disadvantages ]
- Sensitive to the choice of initial centroids
- Sensitive to noise and outliers, which pull the centroids off
- Prone to converging to a local optimum
- Cannot handle non-convex distributions
3、 Density-based approach - DBSCAN clustering
1. Model principle
Neither of the two methods above handles irregularly shaped clusters well, whereas the density-based DBSCAN handles them well and is robust to noisy data. Its core principle is intuitive: draw a circle around each point, connect it with the surrounding points, and include those points in the cluster if certain rules are met, continuing until every point has been processed.
The rules come down to two parameters: the radius of the circle (eps), and the minimum number of points the circle must contain (MinPts). Below is an illustration of the result of DBSCAN clustering.
2. Model flow
step 1: Start from any unvisited point p;
step 2: Find all objects within radius eps of p;
step 3: If at least MinPts objects are found, p is a core point and forms a cluster with them, which is then expanded through any neighbors that are themselves core points; otherwise p does not meet the minimum-points requirement (it is a boundary point or noise), so move on to the next object;
step 4: Repeat steps 2 and 3 until all objects have been visited.
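The flow above can be sketched in plain Python as follows. This is a simplified illustration assuming Euclidean distance and a brute-force neighbor search; the names `region_query`, `dbscan`, and `NOISE`, as well as the sample points and parameter values, are hypothetical choices for this example.

```python
import math

NOISE = -1  # label for points that end up in no cluster

def region_query(points, i, eps):
    """Indices of all points within radius eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    labels = [None] * len(points)  # None means "not visited yet"
    cluster_id = -1
    for i in range(len(points)):            # step 1: take the next unvisited point
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)   # step 2
        if len(neighbors) < min_pts:        # step 3: not a core point
            labels[i] = NOISE               # may later be claimed as a boundary point
            continue
        cluster_id += 1                     # step 3: core point starts a new cluster
        labels[i] = cluster_id
        queue = list(neighbors)
        while queue:                        # expand the cluster through core neighbors
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id      # boundary point joins the cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)
    return labels                           # step 4: every point has been visited

# Hypothetical data: two dense squares plus one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
print(dbscan(points, eps=1.5, min_pts=3))  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

Note that the number of clusters is never supplied; it falls out of `eps` and `min_pts`, and the isolated point is reported as noise instead of being forced into a cluster.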
3. The advantages and disadvantages of the model
[ Advantages ]
- Can find clusters of arbitrary shape
- Insensitive to noise
[ Disadvantages ]
- The clustering result depends directly on the chosen parameter values (eps and MinPts)
- Because these values are fixed for the whole data set, it copes poorly with clusters of differing density
That is all for this issue. I hope it gives you a clear understanding of cluster analysis.