Text Similarity: Euclidean Distance vs Cosine Similarity
Why do we need to compare documents?
Is this a simple process?
How much Maths do I need to know?
Let’s try to answer these questions by thinking about our daily workflow. Imagine that you are running an operation where you have to read hundreds of documents every day and sort them into different folders, or perhaps evaluate documents and make suggestions based on a consumer’s needs. These are time-consuming tasks that can hurt productivity and the decision-making process.
In this article, I will present two metrics that can determine how similar two texts are: the Euclidean distance and the Cosine similarity. The only maths you need to know is how to calculate each of them. So, let’s use the equations and compute some text examples.
Euclidean Distance
In Mathematics, the Euclidean distance (or Euclidean metric) is the length of the line segment between two points, which can be calculated with the Pythagorean theorem. In NLP, these points are texts represented by their word counts. Let’s take a look at an example.
Text 1: I love ice cream
Text 2: I like ice cream
Text 3: I offer ice cream to the lady that I love
Compare the sentences using the Euclidean distance to find the two most similar sentences.
Firstly, I will create a table with all the available words, counting how many times each word appears in every text (the counts follow directly from the three sentences):

Word:    I  love  like  offer  ice  cream  to  the  lady  that
Text 1:  1   1     0     0      1    1      0   0    0     0
Text 2:  1   0     1     0      1    1      0   0    0     0
Text 3:  2   1     0     1      1    1      1   1    1     1
Compute the Euclidean distance using the equation:
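In standard notation, the Euclidean distance between two count vectors p and q over an n-word vocabulary is:

```latex
d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}
```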
Notice the computations between the different texts below:
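Working from the word counts of the three sentences, the pairwise distances come out as:

```latex
\begin{aligned}
d(\text{text 1}, \text{text 2}) &= \sqrt{1^2 + 1^2} = \sqrt{2} \approx 1.41 \\
d(\text{text 1}, \text{text 3}) &= \sqrt{6} \approx 2.45 \\
d(\text{text 2}, \text{text 3}) &= \sqrt{8} \approx 2.83
\end{aligned}
```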
According to the Euclidean distance, the shorter the distance between two texts, the more similar they are. Thus, text 1 is more similar to text 2. Indeed, these two texts have the same meaning, unlike text 3. Let’s also work through another example using Python by importing the essential modules. Here you can find the code that I used for this article.
Calculating the Euclidean distance using Python
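The original snippet is not reproduced here, so below is a minimal sketch of what it might look like, using only the standard library. The three example texts (`text_1`, `text_2`, `text_3`) are illustrative assumptions, chosen so that two texts share the same meaning while one of them is much longer:

```python
import math
from collections import Counter

def euclidean_distance(text_a: str, text_b: str) -> float:
    """Euclidean distance between the bag-of-words count vectors of two texts."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    vocabulary = set(counts_a) | set(counts_b)
    return math.sqrt(sum((counts_a[word] - counts_b[word]) ** 2
                         for word in vocabulary))

# Illustrative texts: text_1 and text_3 share the same meaning; text_2 does not.
text_1 = "i love ice cream"
text_2 = "i hate spinach soup"
text_3 = "i love ice cream i really love ice cream ice cream is what i love"

print(euclidean_distance(text_1, text_2))  # smaller distance, despite a different meaning
print(euclidean_distance(text_1, text_3))  # larger distance, despite the same meaning
```

Because text 3 repeats its words, its count vector has a much larger magnitude, which inflates its distance from the short text 1.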
In this case, the result is not what we expected. The table above presents text 1 as similar to text 2, which does not match our intuition: judging by the meaning of the sentences, text 1 is closer to text 3. So, there is a factor that spoils the result and affects the Euclidean metric.
As you may have figured out by observing the sentences, the length of the text is a factor that affects the result. Long sentences tend to have higher Euclidean distances than short ones.
Overcoming the length issue and ensuring accurate results
Cosine Similarity
A way to overcome these issues is to use the Cosine similarity metric. Cosine similarity measures the cosine of the angle between two vectors in space. It is not affected by the magnitude of the vectors, that is, by how many times the words appear in a document, which makes it efficient for comparing documents of different sizes.
Picture the texts as vectors plotted in word-count space.
Cosine similarity is an important metric because it is not affected by the length of the text. Texts of very different lengths (and therefore with a large Euclidean distance between them) may still have a small angle between their vectors, and the smaller the angle, the higher the similarity. The cosine of an angle can take values between -1 and 1; from the NLP perspective, word counts cannot be negative, so the value lies between 0 and 1. If the two texts share no words at all, the numerator, and hence the similarity, is zero. Here is the equation that computes the metric:
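For two count vectors p and q, the Cosine similarity is their dot product divided by the product of their magnitudes:

```latex
\cos(\theta) = \frac{p \cdot q}{\lVert p \rVert \, \lVert q \rVert}
             = \frac{\sum_{i=1}^{n} p_i q_i}
                    {\sqrt{\sum_{i=1}^{n} p_i^2}\,\sqrt{\sum_{i=1}^{n} q_i^2}}
```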
The Cosine similarity also lets us select documents above a certain similarity level (e.g. over 60% similarity). This is another attribute that distinguishes the two metrics: the Euclidean distance can take any non-negative value, which makes it impossible to define such a threshold when comparing documents.
Enough with the theory, let’s compute the Cosine similarity metric using Python.
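The article’s original code is not shown, so here is a minimal standard-library sketch. The three texts are illustrative assumptions (not the author’s originals): two share the same meaning, but one of them is much longer.

```python
import math
from collections import Counter

def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between the bag-of-words count vectors of two texts."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # Only words present in both texts contribute to the dot product.
    dot_product = sum(counts_a[word] * counts_b[word]
                      for word in counts_a.keys() & counts_b.keys())
    magnitude_a = math.sqrt(sum(count ** 2 for count in counts_a.values()))
    magnitude_b = math.sqrt(sum(count ** 2 for count in counts_b.values()))
    return dot_product / (magnitude_a * magnitude_b)

# Illustrative texts: text_1 and text_3 share the same meaning; text_2 does not.
text_1 = "i love ice cream"
text_2 = "i hate spinach soup"
text_3 = "i love ice cream i really love ice cream ice cream is what i love"

print(round(cosine_similarity(text_1, text_2), 2))  # low similarity
print(round(cosine_similarity(text_1, text_3), 2))  # high similarity, despite the length gap
```

Unlike the Euclidean distance, the normalisation by the vector magnitudes cancels out the effect of document length, so the long paraphrase scores much higher than the unrelated short text.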
The result verifies our hypothesis. When comparing texts of different sizes, the metric that leads to accurate results is the Cosine similarity. We can also inspect how similar two documents are and select documents by defining a similarity threshold.
Conclusion
As you may have noticed, it wasn’t difficult to compute the metrics and compare the documents. Moreover, using Python, we don’t need to carry out the computations by hand: a few lines of code produce the result, regardless of the documents’ length. So, learning how to use these metrics will be beneficial when building a model to compare different documents.
Thanks for your time!