A summary of various distances

Software development and data analysis use many different distance measures, such as Euclidean distance and Mahalanobis distance. Understanding these distances helps us build better models and plan the storage and indexing features of a data platform. These concepts have been introduced many times on the Internet; the main purpose of this article is to summarize them.

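First, we need to put some constraints on "distance" itself. The distance described here is a distance in a metric space. A good distance function $d$ should have the following properties for all points $x$, $y$ and $z$:

1. Non-negativity: $d(x, y) \ge 0$.
2. Identity: $d(x, y) = 0$ if and only if $x = y$.
3. Symmetry: $d(x, y) = d(y, x)$.
4. Triangle inequality: $d(x, z) \le d(x, y) + d(y, z)$.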

This article groups a number of common distance definitions that satisfy the above principles as follows:

Minkowski distance is suited to measuring how far apart two points are in a multidimensional continuous space; the value in every dimension must be continuous. It is not a single distance but a family of distances that includes Manhattan distance, Euclidean distance and Chebyshev distance. It is defined as follows:

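Take two points in a continuous n-dimensional space:

$$P = (x_1, x_2, \ldots, x_n), \qquad Q = (y_1, y_2, \ldots, y_n)$$

The formula of Minkowski distance is:

$$d_p(P, Q) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$$

The most commonly used cases are p = 1 (Manhattan distance) and p = 2 (Euclidean distance); as p tends to infinity, the limit is Chebyshev distance:

$$d_1(P, Q) = \sum_{i=1}^{n} |x_i - y_i|, \qquad d_2(P, Q) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}, \qquad d_\infty(P, Q) = \max_{1 \le i \le n} |x_i - y_i|$$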

Euclidean distance is the most familiar of these. In a two-dimensional space, for example, Euclidean distance is the straight-line distance between two points. Manhattan distance is the sum of the absolute values of the coordinate differences. Chebyshev distance is the largest absolute difference over all coordinates.
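The three special cases can be checked with a few lines of code. The following is a minimal sketch, not a reference implementation; the function name `minkowski_distance` and the choice to pass `math.inf` for the Chebyshev case are assumptions made for illustration.

```python
import math


def minkowski_distance(x, y, p):
    """Minkowski distance of order p between two equal-length sequences.

    p = 1 gives Manhattan distance, p = 2 gives Euclidean distance, and
    p = math.inf is handled as the limiting case, Chebyshev distance.
    """
    if len(x) != len(y):
        raise ValueError("points must have the same number of dimensions")
    if p < 1:
        raise ValueError("p must be >= 1 for the result to be a metric")
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(p):
        return max(diffs, default=0.0)
    return sum(d ** p for d in diffs) ** (1.0 / p)


a, b = (0.0, 0.0), (3.0, 4.0)
manhattan = minkowski_distance(a, b, 1)         # |3| + |4|
euclidean = minkowski_distance(a, b, 2)         # sqrt(3^2 + 4^2)
chebyshev = minkowski_distance(a, b, math.inf)  # max(|3|, |4|)
```

For the points (0, 0) and (3, 4) the three results differ (7, 5 and 4), which shows how much the choice of p changes the notion of "near".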

Minkowski distance, including Manhattan distance, Euclidean distance and Chebyshev distance, has some shortcomings:

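Mahalanobis distance is designed to address the three shortcomings of the Minkowski family noted above. Following the description on Wikipedia, for two vectors $x$ and $y$ drawn from the same distribution with covariance matrix $S$, it is defined as:

$$d_M(x, y) = \sqrt{(x - y)^{\mathsf{T}} S^{-1} (x - y)}$$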

If the covariance matrix is the identity matrix, Mahalanobis distance reduces to Euclidean distance.
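The reduction to Euclidean distance can be verified numerically. Below is a rough sketch in pure Python that evaluates sqrt((x - y)^T S^{-1} (x - y)) for a covariance matrix S that is assumed to be given (estimating S from data is out of scope here). The helper `solve_linear_system` is a hypothetical name; it uses Gaussian elimination so that the inverse never has to be formed explicitly. In practice a numerical library would normally do this work.

```python
import math


def solve_linear_system(matrix, rhs):
    """Solve matrix @ z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Work on an augmented copy so the caller's data is not modified.
    aug = [list(map(float, row)) + [float(v)] for row, v in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back substitution on the upper-triangular system.
    z = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * z[k] for k in range(row + 1, n))
        z[row] = (aug[row][n] - tail) / aug[row][row]
    return z


def mahalanobis_distance(x, y, cov):
    """Mahalanobis distance between x and y for covariance matrix cov."""
    diff = [a - b for a, b in zip(x, y)]
    z = solve_linear_system(cov, diff)  # z = cov^{-1} (x - y)
    return math.sqrt(sum(d * v for d, v in zip(diff, z)))


identity = [[1.0, 0.0], [0.0, 1.0]]
scaled = [[4.0, 0.0], [0.0, 1.0]]  # first dimension has variance 4
d_identity = mahalanobis_distance((3.0, 4.0), (0.0, 0.0), identity)
d_scaled = mahalanobis_distance((2.0, 0.0), (0.0, 0.0), scaled)
```

With the identity matrix the result equals the Euclidean distance between the two points. With variance 4 in the first dimension, an offset of 2 along that dimension counts as one standard deviation, so the distance is 1 instead of 2.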

Similarity between vectors covers two concepts: similarity of direction (angle) and similarity of magnitude. A common way to measure the distance between vectors is cosine distance, which is based on cosine similarity.

The cosine similarity between two vectors a and b is defined as follows:

$$\cos\theta = \frac{a \cdot b}{\|a\| \, \|b\|} = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \, \sqrt{\sum_{i=1}^{n} b_i^2}}$$

An intuitive reading of this formula: the denominator is the product of the lengths of the two vectors, and the numerator is the sum of the products of their components in each direction, so components that point the same way add up. A value of 1 means the two directions are exactly the same, and a value of -1 means the two vectors point in opposite directions.

Cosine distance is often used to compare text similarity, for example between documents represented as vectors of TF-IDF weights.
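As a small illustration, cosine similarity can be computed directly from the dot product and the two vector lengths. This is a sketch only: the function names are made up for the example, and taking cosine distance as `1 - similarity` is a common convention assumed here, not something the text above defines.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """Assumed convention: distance = 1 - similarity, in the range [0, 2]."""
    return 1.0 - cosine_similarity(a, b)


same = cosine_similarity([1.0, 2.0], [2.0, 4.0])        # same direction
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])  # opposite direction
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])  # perpendicular
```

Note that `[1, 2]` and `[2, 4]` come out as maximally similar even though their lengths differ: cosine similarity looks only at direction, not magnitude.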

This kind of distance measures the difference between two variables in an m-dimensional discrete, unordered space. Only the distance between two variables is defined; a single variable has no absolute magnitude. The variables can take the form of a set of labels, a string, and so on. Common distances of this kind are:

1. Edit distance: edit distance is a family of definitions. Given two strings a and b, it is the minimum number of operations needed to convert a into b.

Usually, the edit distance we refer to is Levenshtein distance, which allows only the following three single-character operations: insertion, deletion and substitution.

Calculating the edit distance requires dynamic programming, and the time complexity is O(mn), where m and n are the lengths of the two strings.
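The dynamic programming approach can be sketched as follows, assuming the Levenshtein variant with insertion, deletion and substitution each costing 1. The function name is illustrative. The version below keeps only two rows of the table, so it runs in time proportional to the product of the two string lengths and uses memory proportional to the length of `b`.

```python
def levenshtein_distance(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


d = levenshtein_distance("kitten", "sitting")
```

Here "kitten" becomes "sitting" by substituting k with s, substituting e with i, and appending g.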

In addition, Hamming distance is also a kind of edit distance, but it applies only to strings of equal length. A typical application is judging the similarity of images: first convert the images to black-and-white images of the same size, then compute the Hamming distance between them.
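Because only position-by-position comparison is involved, Hamming distance is much cheaper than general edit distance. The sketch below shows two forms: one for equal-length sequences, and one for fingerprints stored as non-negative integers (an assumption for illustration, for example a black-and-white image flattened into bits).

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


def hamming_distance_bits(x, y):
    """Hamming distance between two non-negative integers viewed as bit
    strings: XOR marks the differing bits, then the 1 bits are counted."""
    return bin(x ^ y).count("1")


d_text = hamming_distance("karolin", "kathrin")
d_bits = hamming_distance_bits(0b1011, 0b1001)
```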

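2. Jaccard distance: measures the similarity of two sets. For two sets A and B, Jaccard similarity and Jaccard distance are calculated as follows:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}, \qquad d_J(A, B) = 1 - J(A, B) = \frac{|A \cup B| - |A \cap B|}{|A \cup B|}$$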

Jaccard distance is easily recast as a comparison of two binary strings of the same length, where a 1 at a position means the item is present and a 0 means it is absent. Jaccard distance is then (number of positions where the two bits differ) / (number of positions where at least one bit is 1). Positions where both bits are 0 are not counted, since the item belongs to neither set.
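Both views, sets and binary strings, can be written down in a few lines. This is a minimal sketch with illustrative function names. Two choices are assumptions: two empty sets are treated as identical (distance 0), and in the bit-string form the denominator counts only positions where at least one bit is 1, following the set definition.

```python
def jaccard_similarity(a, b):
    """|A intersect B| / |A union B| for two iterables treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # assumed convention: two empty sets are identical
    return len(a & b) / len(union)


def jaccard_distance(a, b):
    return 1.0 - jaccard_similarity(a, b)


def jaccard_distance_bits(x, y):
    """Jaccard distance for two equal-length strings of '0' and '1'."""
    if len(x) != len(y):
        raise ValueError("bit strings must have the same length")
    either = sum(1 for p, q in zip(x, y) if p == "1" or q == "1")
    differ = sum(1 for p, q in zip(x, y) if p != q)
    return differ / either if either else 0.0


# The same pair of sets, written both ways over the items a, b, c, d.
d_sets = jaccard_distance({"a", "b", "c"}, {"b", "c", "d"})
d_bits = jaccard_distance_bits("1110", "0111")
```

The two calls describe the same sets and give the same distance: two items out of four appear in only one of the sets.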

Indexing this kind of distance is a nightmare to implement in software, and the nightmare has a name: the curse of dimensionality. Traditional tree index structures apply only to ordered data. At present there is no index structure that can search by set distance efficiently enough (I have not checked the relevant papers in detail, but I suspect that no such structure can exist in theory). There are workarounds, of course; the most common is the BK-tree. For very large data sets, however, the pruning of a BK-tree is far from ideal. In practice, for example when searching by the distance between simhash values, we can only fall back on special-case tricks to optimize it, and these come with many restrictions.
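To make the BK-tree idea concrete, here is a minimal sketch, not a production index. It assumes a metric that returns non-negative integers (Hamming distance between integer fingerprints is used below as a stand-in for something like simhash values). The class and method names are illustrative. Each child hangs off its parent under the distance between the two, and the triangle inequality lets a search skip every subtree whose edge label lies outside `[d - radius, d + radius]`.

```python
def hamming_distance_bits(x, y):
    """Hamming distance between two non-negative integers."""
    return bin(x ^ y).count("1")


class BKTree:
    """Minimal BK-tree over a metric with integer distances."""

    def __init__(self, distance):
        self.distance = distance
        self.root = None  # a node is a pair: (item, {edge_distance: child})

    def add(self, item):
        if self.root is None:
            self.root = (item, {})
            return
        node = self.root
        while True:
            d = self.distance(item, node[0])
            if d == 0:
                return  # item is already stored
            child = node[1].get(d)
            if child is None:
                node[1][d] = (item, {})
                return
            node = child

    def search(self, query, radius):
        """Return sorted (distance, item) pairs within radius of query."""
        results = []
        stack = [self.root] if self.root is not None else []
        while stack:
            item, children = stack.pop()
            d = self.distance(query, item)
            if d <= radius:
                results.append((d, item))
            # Triangle inequality: matches can only sit in subtrees whose
            # edge label is within radius of d, so the rest are pruned.
            for edge, child in children.items():
                if d - radius <= edge <= d + radius:
                    stack.append(child)
        return sorted(results)


fingerprints = [0b0000, 0b0001, 0b0011, 0b0111, 0b1111, 0b0010]
tree = BKTree(hamming_distance_bits)
for fp in fingerprints:
    tree.add(fp)
matches = tree.search(0b0011, 1)
```

The pruning only helps when the search radius is small compared with the spread of distances in the data. With a large radius almost every edge label falls inside the window and the search degrades toward a full scan, which is consistent with the weak pruning on very large data sets described above.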

This article mainly summarizes the characteristics and scope of application of the various definitions of distance. The mathematical classification may not be rigorous or formal, but I hope it offers some reference for the design of related software systems. Later, I will summarize how to implement the calculation and indexing of the distance types described above.

Besides the sources linked inline, this article draws on at least the following materials: