[Machine Learning Q&A] Cosine similarity, cosine distance, Euclidean distance, and the meaning of distance in machine learning
2022-06-30 01:34:00 【Sickle leek】
Cosine similarity, cosine distance, Euclidean distance, and the meaning of distance in machine learning

In machine learning problems, features are usually represented as vectors, so cosine similarity is commonly used to measure the similarity between two feature vectors. Cosine similarity lies in the range [-1, 1], and two identical vectors have similarity 1. Cosine distance is defined as 1 minus the cosine similarity, so its range is [0, 2], and the cosine distance between two identical vectors is 0.

Question 1: Why use cosine similarity instead of Euclidean distance in some scenarios?
For two vectors $A$ and $B$, cosine similarity is defined as

$$cos(A, B) = \frac{A \cdot B}{\|A\|_2 \|B\|_2}$$

that is, the cosine of the angle between the two vectors. It captures only the angular relationship between the vectors and ignores their absolute magnitudes; its range is [-1, 1].
When two texts are similar in content but very different in length, their word-frequency (or word-vector) features are usually far apart in Euclidean distance; yet the angle between them can be small, so the cosine similarity is high.
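A minimal Python sketch of this effect, using hypothetical word-count vectors where the longer document simply repeats the shorter one's vocabulary ten times:

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between vectors a and b
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical word-count vectors: same content, very different lengths
short_doc = np.array([1.0, 2.0, 1.0, 0.0])
long_doc = 10 * short_doc  # ten times as long, same word proportions

print(np.linalg.norm(short_doc - long_doc))    # Euclidean distance: large (~22.0)
print(cosine_similarity(short_doc, long_doc))  # cosine similarity: 1.0
```

The Euclidean distance grows with document length, while the cosine similarity stays at 1 because the direction is unchanged.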
Moreover, in domains such as text, image, and video, feature dimensions are often very high. Cosine similarity keeps its interpretation in high dimensions (1 for identical directions, 0 for orthogonal, -1 for opposite), whereas the value of the Euclidean distance depends on the dimension, has no fixed range, and its magnitude is hard to interpret.
In some scenarios, such as Word2Vec, vectors are normalized to unit length. In that case Euclidean distance and cosine distance are monotonically related:

$$\|A - B\|_2 = \sqrt{2\,(1 - cos(A, B))}$$

where $\|A-B\|_2$ is the Euclidean distance, $cos(A,B)$ is the cosine similarity, and $1 - cos(A,B)$ is the cosine distance. In this scenario, selecting the nearest neighbor by minimum distance (maximum similarity) gives the same result whether cosine similarity or Euclidean distance is used.
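This monotonic relationship can be checked numerically; a small sketch with randomly generated vectors normalized to unit length (the seed and dimension are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=5)
b = rng.normal(size=5)
a /= np.linalg.norm(a)  # normalize to unit length, as in Word2Vec-style embeddings
b /= np.linalg.norm(b)

cos_ab = np.dot(a, b)            # cosine similarity of unit vectors is just the dot product
euclid = np.linalg.norm(a - b)   # Euclidean distance

# For unit vectors: ||A - B|| = sqrt(2 * (1 - cos(A, B)))
print(np.isclose(euclid, np.sqrt(2 * (1 - cos_ab))))  # True
```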
In short, Euclidean distance reflects the absolute difference in magnitude, while cosine distance reflects the relative difference in direction.
Question 2: Is cosine distance a strictly defined distance?

Note: cosine distance is not a strictly defined distance!

Definition of a distance (metric): in a set, if every pair of elements is assigned a unique real number such that the three distance axioms (positive definiteness, symmetry, the triangle inequality) hold, then that real number is called the distance between the two elements.
(1) Positive definiteness

By the definition of cosine distance,

$$dist(A, B) = 1 - cos\,\theta = \frac{\|A\|_2 \|B\|_2 - A \cdot B}{\|A\|_2 \|B\|_2}$$

Since $\|A\|_2 \|B\|_2 - A \cdot B \ge 0$ (by the Cauchy-Schwarz inequality), we always have $dist(A, B) \ge 0$.
(2) Symmetry

By the definition of cosine distance,

$$dist(A, B) = \frac{\|A\|_2 \|B\|_2 - A \cdot B}{\|A\|_2 \|B\|_2} = \frac{\|B\|_2 \|A\|_2 - B \cdot A}{\|B\|_2 \|A\|_2} = dist(B, A)$$

so symmetry is satisfied.
(3) Triangle inequality

This property does not hold; a counterexample suffices. Take $A=(1,0)$, $B=(1,1)$, $C=(0,1)$. Then

$$dist(A,B) = 1 - \frac{\sqrt{2}}{2}, \quad dist(B,C) = 1 - \frac{\sqrt{2}}{2}, \quad dist(A,C) = 1$$

so that

$$dist(A,B) + dist(B,C) = 2 - \sqrt{2} \approx 0.586 < 1 = dist(A,C)$$
Therefore, cosine distance satisfies positive definiteness and symmetry, but not the triangle inequality.
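The counterexample above can be verified with a few lines of Python:

```python
import numpy as np

def cosine_distance(a, b):
    # Cosine distance = 1 - cosine similarity
    return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

A = np.array([1.0, 0.0])
B = np.array([1.0, 1.0])
C = np.array([0.0, 1.0])

lhs = cosine_distance(A, B) + cosine_distance(B, C)  # 2 - sqrt(2), about 0.586
rhs = cosine_distance(A, C)                          # 1.0
print(lhs < rhs)  # True: the triangle inequality is violated
```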
In addition, we know that on the unit sphere, Euclidean distance and cosine distance satisfy

$$\|A - B\| = \sqrt{2\,(1 - cos(A, B))} = \sqrt{2\,dist(A, B)}$$

which gives the relationship

$$dist(A, B) = \frac{1}{2}\|A - B\|^2$$

Clearly, on the unit sphere both cosine distance and Euclidean distance range over [0, 2]. The Euclidean distance is a valid metric, but the cosine distance is a quadratic function of it, and this quadratic relationship naturally breaks the triangle inequality.
In machine learning, many quantities are casually called distances yet fail the three distance axioms. Besides cosine distance, the KL divergence (Kullback-Leibler divergence), also called relative entropy, is often used to measure the difference between two distributions, but it satisfies neither symmetry nor the triangle inequality.
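The asymmetry of the KL divergence is easy to demonstrate numerically; a minimal sketch with two arbitrary discrete distributions (chosen here purely for illustration, with all probabilities strictly positive so the logarithm is defined):

```python
import numpy as np

def kl_divergence(p, q):
    # D_KL(P || Q) = sum_i p_i * log(p_i / q_i); assumes p, q strictly positive
    return np.sum(p * np.log(p / q))

p = np.array([0.9, 0.1])
q = np.array([0.5, 0.5])

print(kl_divergence(p, q))  # D_KL(P || Q)
print(kl_divergence(q, p))  # D_KL(Q || P): a different value, so KL is not symmetric
```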
In the field of machine learning, A/B testing is the main means of verifying a model's final effect.
Question 1: After a thorough offline evaluation of a model, why still run an online A/B test?

(1) Offline evaluation cannot eliminate the influence of model overfitting; therefore, its conclusions cannot fully replace the results of online evaluation.

(2) Offline evaluation cannot fully reproduce the online engineering environment. In general, it does not account for conditions of the online environment such as latency, data loss, and missing label data.

(3) Some business metrics of the online system cannot be computed in an offline evaluation. For example, when a new recommendation algorithm is launched, offline evaluation typically focuses on the ROC curve, the P-R curve, and so on, while only online evaluation reveals the changes in user click-through rate, dwell time, PV (page view) traffic, and other metrics brought by the algorithm.
Question 2: How to run an online A/B test?

(1) The main method of A/B testing is user bucketing: divide users into an experimental group and a control group, apply the new model to users in the experimental group, and the old model to users in the control group.

(2) When bucketing, pay attention to the independence of the samples and unbiased sampling, and make sure the same user is always assigned to the same bucket.
Question 3: How to divide the experimental group and the control group?

Suppose an algorithm engineer at a company has developed a new video recommendation model A for American users, while model B is currently serving all users.

The correct approach: split all U.S. users into an experimental group and a control group according to the parity of the last digit of their user_id, apply models A and B respectively, and verify the effect of model A.
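A minimal sketch of such a bucketing rule (the mapping of odd digits to the experimental group is a hypothetical choice; either assignment works as long as it is fixed):

```python
def assign_bucket(user_id: int) -> str:
    # Deterministic split: the same user_id always lands in the same bucket.
    # Hypothetical rule: odd last digit -> experimental group (new model A),
    # even last digit -> control group (old model B).
    last_digit = user_id % 10
    return "A" if last_digit % 2 == 1 else "B"

print(assign_bucket(10237))  # A (last digit 7 is odd)
print(assign_bucket(10238))  # B (last digit 8 is even)
```

Because the assignment is a pure function of user_id, it automatically satisfies the requirement that each user stays in the same bucket across requests.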
References

[1] Baimian Machine Learning (《百面机器学习》), Chapter 2: Model Evaluation

[2] Entropy: Shannon entropy, relative entropy, cross entropy, conditional entropy