Latent Semantic Analysis
1. Document comparison
2. Parts
Documents
Words - Vocabulary - Terms
Words/Terms for a given Document
3. Customers
If
documents are customers, and
words are products offered, and
the words in a given document are the products bought by a given customer,
then one can start to make recommendations of the form "customers like you bought products like this".
Google (and others) use a version of LSI in their search engine.
NetFlix (and others) use a version of LSI in their recommendation engine.
LSI is used for document comparison in eDiscovery, PC (Predictive Coding), TAR (Technology Assisted Review), etc.
4. Document matrix
LSI starts with a term document matrix.
There are n documents in the set of documents D, where the index j of document d_j in D goes from 1 to n.
A document d_j is some piece of text that is to be compared to other documents. A document can be an actual document, an email message, a headline, etc.
5. Terms
All of the terms (e.g., words, collocations, etc.) for all documents form the set of m terms T, where the index i of term t_i in T goes from 1 to m.
A set (i.e., list) of stop words, or omitted words, is maintained. These are usually common words that do not add meaning, and they do not appear in the term set T.
6. Frequency
The frequency (i.e., count) of terms in each document is used in this analysis, although other measures can be used.
7. TF-IDF matrix
8. TF-IDF
A typical definition for tf-idf is that the value of element a_{i,j} is

a_{i,j} = tf_{i,j} * g_i

where g_i is the global weighting factor for term i.
9. Global weighting factor
The global weighting factor for term i in document j is defined as
where tf
i,j is the term count of term i in document j and n is the number of documents, and
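As a minimal sketch of this weighting (the toy counts below are made-up values for illustration, not data from this page), assuming NumPy:

import numpy as np

# Toy term counts tf[i, j]: 3 terms (rows) by 4 documents (columns).
tf = np.array([
    [2, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 3, 1, 1],
], dtype=float)

n = tf.shape[1]                    # number of documents
df = np.count_nonzero(tf, axis=1)  # documents containing each term
g = np.log(n / df)                 # global weighting factor g_i
a = tf * g[:, np.newaxis]          # a[i, j] = tf[i, j] * g_i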
10. Term document matrix
Let the term document matrix A be the m by n matrix where element a_{i,j} in A is a measure (e.g., count, tf-idf, etc.) of the frequency of term i in document j.
The term document matrix A can become quite large (e.g., for 1000 terms and 1000 documents it has 1,000,000 elements) but it tends to be very sparse (i.e., most elements are zero).
A row i in A (i.e., a row vector) has n elements, where each a_{i,j} is the frequency of the appearance of term i in each of documents 1 to n.
A column j in A (i.e., a column vector) has m elements, where each a_{i,j} is the frequency of the appearance of each of terms 1 to m in document j.
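As a minimal sketch of building such a matrix (the documents and stop word list below are made-up examples):

# Build a toy term document count matrix from a few one-line "documents".
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
stop_words = {"the", "on"}

# Vocabulary: all terms across all documents, minus stop words.
terms = sorted({w for d in docs for w in d.split()} - stop_words)

# a[i][j] = count of term i in document j (m rows by n columns).
a = [[d.split().count(t) for d in docs] for t in terms]

for t, row in zip(terms, a):
    print(t, row)

Even in this tiny example, most entries are zero, which illustrates the sparsity noted above.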
11. Singular Value Decomposition
An SVD can be used to effectively reduce the dimensionality of the data. That is, a larger set of data can be effectively represented by a smaller set of data. The SVD has been used in recommendation engines to make better recommendations (e.g., as a result of the NetFlix competition).
In LSI, the singular values represent semantic concepts in the data (e.g., documents). Synonyms are handled well, since synonymous terms appear in the context of similar concepts. On the other hand, terms with multiple meanings are not handled well.
12. SVD
The SVD is related to PCA (Principal Components Analysis).
from scipy.linalg import svd
The SVD transforms the matrix A as follows.

A = U Sigma V^T

If m ≤ n, then U is m by m, Sigma is m by m, and V^T is m by n.
If n ≤ m, then U is m by n, Sigma is n by n, and V^T is n by n.
The U matrix contains the left singular vectors, which are the eigenvectors of A A^T.
The V matrix contains the right singular vectors, which are the eigenvectors of A^T A.
Assume that m is less than or equal to n. Otherwise, switch m and n.
The Sigma matrix is a diagonal matrix containing the singular values sigma_{i,i}, which are the square roots of the eigenvalues of both A A^T and A^T A.
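As a quick sketch checking these facts numerically (a made-up random matrix, reusing the scipy import shown above):

import numpy as np
from scipy.linalg import svd

rng = np.random.default_rng(0)
A = rng.random((3, 5))                # m = 3 terms, n = 5 documents

U, s, Vt = svd(A, full_matrices=False)
print(U.shape, s.shape, Vt.shape)     # (3, 3) (3,) (3, 5)

# The singular values are the square roots of the eigenvalues of A A^T.
eig = np.linalg.eigvalsh(A @ A.T)          # ascending order
print(np.allclose(np.sqrt(eig[::-1]), s))  # True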
The Sigma matrix is often written or processed as a vector, as follows:

sigma = (sigma_{1,1}, sigma_{2,2}, ..., sigma_{m,m})
By convention, and by suitable permutations of the matrices, the singular values sigma_{i,i} meet the following conditions:

sigma_{1,1} ≥ sigma_{2,2} ≥ ... ≥ sigma_{m,m} ≥ 0
By selecting an appropriate (positive) value k less than m, a semantically reduced system can be obtained that can be used to model the entire system.
To reduce, a value k is selected such that k ≤ m. The reduced form is as follows, where A is approximated by A_hat:

A_hat = U_k Sigma_k V_k^T

where U_k is the first k columns of U, Sigma_k is the upper-left k by k block of Sigma, and V_k is the first k columns of V.
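As a minimal sketch of this reduction (again with a made-up matrix; the names are illustrative):

import numpy as np
from scipy.linalg import svd

rng = np.random.default_rng(0)
A = rng.random((4, 6))                 # toy term document matrix

U, s, Vt = svd(A, full_matrices=False)

k = 2                                  # reduced number of dimensions
A_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# A_hat is the best rank-k approximation of A (in the least squares sense).
print(np.linalg.norm(A - A_hat))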
13. Prototype methodology
The following methodology was used as a starting point to prototype headline topic extraction.
1. RSS (Really Simple Syndication) feeds have been collected in Thunderbird for some time now. For example, the NBC news headlines feed has about 9,000 entries. For this prototype, the NBC security feed was used, with only 241 entries.
2. The RSS feed headlines for each feed were output to a text file.
3. The selected text file headlines were organized into the Python format needed for LSI (Latent Semantic Indexing).
4. The Python libraries SciPy (Scientific Python), NumPy (Numerical Python), and Gensim (Topic modeling for humans) were used to do an SVD (Singular Value Decomposition) of the headlines (i.e., documents) and terms (see the sketch after this list).
5. In order to plot the results in a 2D scatter plot, only two dimensions of correlation were used (i.e., dimensions 2 and 3, as dimension 1 is related to the document size). With higher dimensions, a clustering technique needs to be applied to group similar items.
6. A selected part of the resulting Excel 2D scatter plot was obtained.
7. Stop words consisting of common words were used. Additional words can be added as necessary.
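As a rough sketch of the Gensim part of step 4 (the headline strings and num_topics value here are illustrative assumptions, not the actual NBC data or prototype code):

from gensim import corpora, models

headlines = [
    "hackers breach bank security systems",
    "new malware targets bank customers",
    "security flaw found in web browsers",
]
texts = [h.lower().split() for h in headlines]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

# Tf-idf weighting, then LSI via SVD; num_topics is the reduced dimension k.
tfidf = models.TfidfModel(corpus)
lsi = models.LsiModel(tfidf[corpus], id2word=dictionary, num_topics=2)

for doc in lsi[tfidf[corpus]]:
    print(doc)   # coordinates of each headline in the LSI dimensions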
14. Prototype infrastructure
15. Headline scatter plot
Here is a selected area of a plot of the two-dimension correlation of headlines and terms for the 241 headlines in the NBC News security RSS feed.
16. Changes and improvements
Here are some changes/improvements that might be needed.
17. Dimensions
More dimensions need to be used. In order to do a 2-D scatter plot, only 2 dimensions were used. The results are, as might be expected, not as promising as might be hoped. In practice, with thousands of documents (e.g., headlines), hundreds of dimensions are used.
18. Clustering
Once more dimensions are used, and a visual representation such as a scatter chart is not possible, some means of clustering (e.g., using k-means) is needed in order to group similar terms. This would be needed even if only 2 dimensions were used, as the clustering still needs to be automated.
Clustering appears simple but is actually quite involved. Clustering techniques are a primary concern of the field of machine learning.
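As a minimal sketch of k-means (assuming the document coordinates from the SVD are in a NumPy array; the coordinates below are made up), SciPy provides an implementation:

import numpy as np
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(1)
# Pretend these are (dimension 2, dimension 3) coordinates of 20 headlines.
coords = rng.random((20, 2))

# Group the headlines into 3 clusters.
centroids, labels = kmeans2(coords, 3, minit="++")
print(labels)   # cluster index for each headline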
19. Memory limits
Memory limits were exceeded with 7,500 headlines. One way to reduce memory usage is to increase the number of stop words. It appears that the 32-bit Python implementation has a 4 GB memory limit. A 64-bit Python would remove this restriction, but there were initial problems installing NLTK under 64-bit Python.
Note: The above demo was done using SciPy and NumPy without using GenSim. GenSim is designed to overcome many of these limitations, but making it work for customized headlines is more involved.
20. Time efficiency
As problem sizes increase, the use of parallel processing becomes more important. There is a Python library (e.g., Bias) that can utilize multiple cores on a CPU, and there are also ways to distribute the load over multiple machines (e.g., using MapReduce), but neither was investigated at this time.
21. Investigations
In addition, cursory investigations of the following were done. Some may prove useful in future endeavors.
Python-based NLTK (Natural Language Toolkit)
Octave, which provides much of the functionality of MATLAB without the huge expense.
22. Possible improvements
The following might be useful if adapted to the LSI method of analysis.
Time dependency of headlines.
The full body text of each headline.
Multiple words that occur together (i.e., collocations) rather than single words (see the sketch below).
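As a minimal sketch of finding collocations with NLTK (the sentence below is a made-up example):

from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

text = ("white house security officials said the white house "
        "would review security policy")
words = text.split()

# Rank adjacent word pairs by pointwise mutual information (PMI).
bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(words)
print(finder.nbest(bigram_measures.pmi, 3))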
23. End of page