A common issue for publishing houses and other companies analyzing large volumes of text data is a seemingly simple question: how similar are documents to one another? The use cases for these algorithms are abundant, from identifying plagiarism and avoiding duplicate-content penalties from search engines, to building contextual document recommendation engines and redirecting old content to new content as your corpus evolves.
There are many different ways to approach this problem. We're going to focus on three off-the-shelf practices and one blended ML method.
Cosine Similarity Computation:
The first approach is by far the most standard across industries: computing the cosine similarity between documents.
- Python code for computing cosine similarity using scikit-learn
- From-scratch implementation of cosine similarity
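As a minimal sketch of the second bullet, here is a from-scratch TF-IDF and cosine similarity computation in pure Python (the toy corpus and whitespace tokenization are illustrative only; scikit-learn's TfidfVectorizer and cosine_similarity cover the first bullet):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute simple TF-IDF vectors for a list of tokenized documents."""
    n = len(docs)
    # Document frequency: how many documents contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(df)
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Term frequency weighted by inverse document frequency;
        # terms appearing in every document get a weight of zero.
        vectors.append([tf[t] * math.log(n / df[t]) for t in vocab])
    return vectors

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

docs = [
    "the cat sat on the mat".split(),
    "the cat lay on the mat".split(),
    "stocks fell sharply today".split(),
]
vecs = tfidf_vectors(docs)
print(cosine_similarity(vecs[0], vecs[1]))  # near-duplicates score high
print(cosine_similarity(vecs[0], vecs[2]))  # no shared terms -> 0.0
```

The same pairwise comparison scales to a whole corpus by computing the full similarity matrix, which is exactly what scikit-learn's vectorized implementation does efficiently.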
One drawback of a simple cosine similarity comparison is that where words are sparse, or where documents contain many stopwords such as "to", "the", "so", and "a", the TF-IDF weighting at the heart of the calculation can produce high similarity scores that, from a human perspective, may not actually be interpretable or useful for one's end goals. On the other hand, cosine similarity is by far one of the fastest approaches to identifying similar documents within a corpus and can be used as the 'go-to' calculation when this work is required.
gensim for Topic Modeling:
Another primary approach to identifying document similarity is using the gensim library for semantic processing. gensim provides a variety of topic modeling techniques; the two below are the most useful for document similarity comparisons.
- DocSim: https://radimrehurek.com/gensim/similarities/docsim.html
- Similarity Queries: https://radimrehurek.com/gensim/tut3.html
One challenge with using gensim's functions is that they typically tokenize every individual word in a corpus and build a very large matrix. This is especially true of the similarities.MatrixSimilarity class, where 1,000 documents averaging 500 words each can lead to memory errors when computed locally (you'll need many gigabytes of RAM). That said, other classes make it possible to get high-quality output from gensim on relatively large corpora using local computing power.
Fuzzy String Matching with FuzzyWuzzy:
An additional standard approach to checking string similarity is fuzzy string matching based on Levenshtein distance, via the FuzzyWuzzy Python library. This performs particularly well with shorter text, or when individual sentences within documents can be compared to all sentences in other documents.
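FuzzyWuzzy's scorers such as fuzz.ratio are one-liners; for illustration, here is a from-scratch Levenshtein distance with one common normalization into a 0–100 similarity score (this normalization is not identical to FuzzyWuzzy's internal one):

```python
def levenshtein(a, b):
    """Edit distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def fuzzy_ratio(a, b):
    """Similarity in [0, 100]: 100 means identical strings."""
    if not a and not b:
        return 100
    dist = levenshtein(a, b)
    return round(100 * (1 - dist / max(len(a), len(b))))

# Two substitutions across 22 characters still score high.
print(fuzzy_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear"))  # -> 91
```

Note that simple edit distance penalizes word reordering heavily, which is why FuzzyWuzzy also offers token-based scorers such as fuzz.token_sort_ratio.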
One issue with this approach is its computational cost: the standard FuzzyWuzzy library compares every string to every other while computing any of its standard ratio measures, which can take considerable time. As a result, many who use this approach end up looking for solutions in frameworks beyond single-threaded Python to compute these measures.
ML Approaches to Document Similarity Computation:
Lastly, there have been large-scale data science competitions to help companies understand their duplicative titles from the ML side of text analysis. One of the most publicised examples is the competition to identify duplicate questions on Quora, held on Kaggle. The best-performing approaches actually combined several methods to build a large feature set for each document, to be compared against a potentially duplicative document.
This included the use of libraries such as spaCy to extract entities, FuzzyWuzzy, standard text-extraction features (such as counting punctuation), and cosine similarity computations. These large feature sets allowed many of the winning teams to deliver highly predictive results. The one distinction here is that the competition provided a clear training set to aid the comparison of documents. In most use cases, a researcher will not have pre-labeled or identified similarities available out of the box.
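A sketch of that style of feature engineering, using pure Python stand-ins for the library-based features (the feature names and choices are illustrative, not any winning team's actual set):

```python
import string

def pair_features(a, b):
    """Illustrative features for a candidate duplicate pair of texts."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    shared = tokens_a & tokens_b
    union = tokens_a | tokens_b
    return {
        # Simple surface statistics.
        "len_diff": abs(len(a) - len(b)),
        "punct_count_a": sum(ch in string.punctuation for ch in a),
        "punct_count_b": sum(ch in string.punctuation for ch in b),
        # Token overlap (Jaccard) as a cheap stand-in for cosine similarity.
        "jaccard": len(shared) / len(union) if union else 0.0,
        # Do both texts start the same way? (Duplicate questions often do.)
        "same_first_word": a.split()[:1] == b.split()[:1],
    }

feats = pair_features("How do I learn Python?", "How can I learn Python?")
```

Each pair's feature dictionary would then become one row in a training matrix for a classifier, with the human-supplied duplicate/not-duplicate label as the target.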
The Human Element:
Computing many of the metrics above is relatively easy in Python; the hard part is producing output that is interpretable and actionable by the end customers of this data. When comparing any of these approaches, consider using tools like Mechanical Turk, editorial reviews, or interns to review and clearly label duplicate or highly similar documents as a training set for machine learning algorithms, which can improve the potential output of your project. This is part of what enabled the highly accurate document comparisons in the Quora competition.
Additionally, ensure that your dataset is cleansed appropriately for each of the techniques mentioned above. Often your text requires the removal of stopwords, HTML, XML, or other encoding and parsing artifacts that may otherwise corrupt the quality of these computations.
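A minimal cleaning pass along those lines, using only the standard library (the stopword list is a tiny illustrative subset; libraries such as NLTK ship fuller lists):

```python
import html
import re

# Tiny illustrative subset of English stopwords.
STOPWORDS = {"a", "an", "the", "to", "so", "of", "and", "is"}

def clean(text):
    """Strip markup, entities, and stopwords before similarity computations."""
    text = html.unescape(text)            # decode entities like &amp;
    text = re.sub(r"<[^>]+>", " ", text)  # drop HTML/XML tags
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())  # keep alphanumerics only
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens)

print(clean("<p>The cat &amp; the mat</p>"))  # -> "cat mat"
```

Running every document through the same cleaning function before vectorization keeps the comparisons consistent across the corpus.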
The safest bet among the approaches above is cosine similarity. It is not only fast, but also provides directionally strong results to inform similarity computation over a large document set. gensim can also perform well in this regard, but one needs to understand the constraints of one's computing power and the underlying topic-modeling concepts necessary to calculate its metrics effectively. FuzzyWuzzy is of limited effectiveness on large sets of multi-sentence text due to its speed limitations, but it offers a wide variety of metrics that can serve as features for more advanced ML applications aiming at highly accurate document similarity extraction.