Latent Semantic Analysis
by RS  admin@robinsnyder.com


1. Document comparison
LSA (Latent Semantic Analysis), sometimes called LSI (Latent Semantic Indexing), is a technique for automatically processing NL (Natural Language) text, such as in document comparison. The original LSA/LSI method was published and patented in 1988. The core idea of the method is an SVD (Singular Value Decomposition) of a VSM (Vector Space Model) representation that is truncated to reduce the dimensionality of the semantic space.

2. Parts

3. Customers
If documents are treated as customers and terms as the products they buy, then one can start to answer the question, "customers like you bought products like this".

4. Document matrix
LSI starts with a term-document matrix.

There are n documents in the set of documents D, where document dj in D is indexed by j from 1 to n.

    D = \{d_1, d_2, \ldots, d_n\}
A document dj is some piece of text that is to be compared to other documents. A document can be an actual document, an email message, a headline, etc.

5. Terms
All of the terms (e.g., words, collocations, etc.) for all documents form the set of m terms in the set of terms T, where term ti in T is indexed by i from 1 to m.

    T = \{t_1, t_2, \ldots, t_m\}
A set (i.e., list) of stop words, or omitted words, is maintained. These are usually common words that do not add meaning, and they do not appear in the term set T.

6. Frequency
The frequency (i.e., count) of terms in each document is used in this analysis, although other measures can be used.
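
As a concrete starting point, here is a minimal sketch of counting term frequencies per document. The documents and stop-word list below are made up for illustration and are not the author's actual data.

from collections import Counter

# Hypothetical documents (e.g., headlines) and a small stop-word list.
documents = [
    "security flaw found in web browser",
    "new web browser released",
    "hackers exploit security flaw in browser",
]
stop_words = {"in", "the", "a", "of", "and"}  # common words that add little meaning

# Count the frequency of each remaining term in each document.
counts = [Counter(term for term in doc.lower().split() if term not in stop_words)
          for doc in documents]

for j, c in enumerate(counts, start=1):
    print("document", j, dict(c))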

7. TF-IDF matrix
Often, the tf-idf (term frequency - inverse document frequency) approach is used to give more weight to terms that appear in fewer documents than to terms that appear in many documents.

8. TF-IDF
A typical definition of tf-idf is that the value of element ai,j is

    a_{i,j} = \mathrm{tf}_{i,j} \cdot g_i

where \mathrm{tf}_{i,j} is the frequency of term i in document j and g_i is the global weighting factor for term i.

9. Global weighting factor
The global weighting factor g_i for term i is defined as

    g_i = \log \left( \frac{n}{\mathrm{df}_i} \right)

where n is the number of documents and \mathrm{df}_i is the number of documents that contain term i, that is,

    \mathrm{df}_i = \left| \{\, j : \mathrm{tf}_{i,j} > 0 \,\} \right|

Let the term-document matrix A be the m by n matrix where element ai,j in A is a measure (e.g., count, tf-idf, etc.) of the frequency of term i in document j.

    A = [a_{i,j}], \quad 1 \le i \le m, \quad 1 \le j \le n
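
Here is a minimal sketch of this weighting in Python, assuming the log-based global factor defined above and using made-up counts; the names counts, df, and g are illustrative only.

import math

# counts[j] maps each term to its frequency tf_{i,j} in document j (toy values).
counts = [
    {"security": 1, "flaw": 1, "browser": 1},
    {"browser": 1, "released": 1},
    {"security": 1, "flaw": 1, "exploit": 1},
]
n = len(counts)
terms = sorted({t for c in counts for t in c})

# df_i: number of documents containing term i; g_i = log(n / df_i).
df = {t: sum(1 for c in counts if t in c) for t in terms}
g = {t: math.log(n / df[t]) for t in terms}

# a_{i,j} = tf_{i,j} * g_i gives the tf-idf weighted term-document matrix.
A = [[c.get(t, 0) * g[t] for c in counts] for t in terms]
for t, row in zip(terms, A):
    print(t, [round(x, 3) for x in row])

Note that a term appearing in every document (e.g., "browser" above) gets weight log(1) = 0, which is the intended down-weighting of very common terms.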

10. Term document matrix
The term-document matrix A can become quite large (e.g., for 1000 terms and 1000 documents it has 1,000,000 elements), but it tends to be very sparse (i.e., most elements are zero). A row i in A (i.e., a row vector) has n elements, and each ai,j corresponds to the frequency of the appearance of term i in each of documents 1 to n.

    A_{i,*} = (a_{i,1}, a_{i,2}, \ldots, a_{i,n})

A column j in A (i.e., a column vector) has m elements, and each ai,j corresponds to the frequency of the appearance of terms 1 to m in document j.

    A_{*,j} = (a_{1,j}, a_{2,j}, \ldots, a_{m,j})^T
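
Because most elements are zero, a sparse representation saves a great deal of memory. Here is a minimal sketch using scipy.sparse; the matrix values are made up for illustration.

import numpy as np
from scipy.sparse import csr_matrix

# Toy term-document matrix: m = 4 terms (rows), n = 3 documents (columns).
A_dense = np.array([
    [2, 0, 1],
    [0, 1, 0],
    [1, 0, 3],
    [0, 2, 0],
])

A = csr_matrix(A_dense)  # stores only the nonzero entries
print(A.nnz, "nonzero elements out of", A_dense.size)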

11. Singular Value Decomposition
An SVD can be used to effectively reduce the dimensionality of the data. That is, a larger set of data can be effectively represented by a smaller set of data. The SVD has been used in recommendation engines to make better recommendations (e.g., as a result of the Netflix competition). In LSI, the singular values represent semantic concepts in the data (e.g., documents). Synonyms are handled well, since synonymous terms appear in the context of similar concepts. On the other hand, multiple meanings of a term (polysemy) are not handled well.

12. SVD
The SVD is related to PCA (Principal Components Analysis).
from scipy.linalg import svd  # SciPy's SVD routine

The SVD transforms the matrix A as follows. If m is less than or equal to n, then the following holds.

    A_{m \times n} = U_{m \times m} \, \Sigma_{m \times m} \, V^T_{m \times n}

If n is less than or equal to m, then the following holds.

    A_{m \times n} = U_{m \times n} \, \Sigma_{n \times n} \, V^T_{n \times n}

The U matrix contains the left singular vectors, which are the eigenvectors of A A^T.

The V matrix contains the right singular vectors, which are the eigenvectors of A^T A.

Assume that m is less than or equal to n (otherwise, switch m and n below). The \Sigma matrix is a diagonal matrix containing the singular values \sigma_{i,i}, which are the square roots of the eigenvalues of both A A^T and A^T A.

    \Sigma = \mathrm{diag}(\sigma_{1,1}, \sigma_{2,2}, \ldots, \sigma_{m,m})

The matrix \Sigma is often written or processed as a vector \hat{\Sigma} as follows.

    \hat{\Sigma} = (\sigma_{1,1}, \sigma_{2,2}, \ldots, \sigma_{m,m})

By convention, and by permuting the matrices accordingly, the singular values \sigma_{i,i} meet the following conditions.

    \sigma_{1,1} \ge \sigma_{2,2} \ge \ldots \ge \sigma_{m,m} \ge 0

By selecting an appropriate (positive) value k less than m (in practice, k is much less than m), a semantically reduced system can be obtained that can be used to model the entire system. The reduced form is as follows, where A is approximated by \hat{A}:

    A \approx \hat{A} = U_k \Sigma_k V_k^T

where U_k is the first k columns of U, \Sigma_k is the upper-left k by k block of \Sigma, and V_k is the first k columns of V.
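
As a concrete illustration, here is a minimal sketch of computing the SVD and the rank-k approximation with SciPy and NumPy; the matrix values are made up for illustration.

import numpy as np
from scipy.linalg import svd

# Toy term-document matrix: 4 terms (rows) x 3 documents (columns).
A = np.array([
    [2.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 3.0],
    [0.0, 2.0, 0.0],
])

# full_matrices=False gives the "thin" SVD described above;
# s is Sigma returned as a vector of singular values, largest first.
U, s, Vt = svd(A, full_matrices=False)

k = 2  # keep only the k largest singular values
A_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print(np.round(A_hat, 2))  # rank-k approximation of A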

13. Prototype methodology
The following methodology was used as a starting point to prototype headline topic extraction.

1. RSS (Really Simple Syndication) feeds have been collected in Thunderbird for some time now. For example, the NBC news headline feed has about 9,000 entries. For example purposes, the NBC security feed was used, with only 241 entries.

2. The RSS feed headlines for each feed were output to a text file.

3. The selected text file headlines were organized into the Python format needed for LSI (Latent Semantic Indexing).

4. The Python libraries SciPy (Scientific Python), NumPy (Numerical Python), and Gensim (topic modeling for humans) were used to do an SVD (Singular Value Decomposition) of the headlines (i.e., documents) and terms, as in the sketch below.

5. In order to plot the results in a 2D scatter plot, only two dimensions of correlation were used (i.e., dimensions 2 and 3, as dimension 1 is related to the document size). With higher dimensions, a clustering technique needs to be applied to group similar items.

6. A selected part of the resulting Excel 2D scatter plot was obtained.

7. Stop words consisting of common words were used. Additional words can be added as necessary.
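
Here is a minimal sketch of steps 3 through 5 using Gensim; the headline tokens below are made up, and this is not the author's actual script.

from gensim import corpora, models

# Step 3: headlines tokenized into lists of terms (toy examples here).
texts = [
    ["security", "flaw", "found", "browser"],
    ["new", "browser", "released"],
    ["hackers", "exploit", "security", "flaw"],
]

# Map terms to integer ids and build a bag-of-words corpus.
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Step 4: LSI via SVD; num_topics is the number of retained dimensions.
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)

# Step 5: each document's coordinates in the reduced semantic space.
for doc in lsi[corpus]:
    print(doc)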

14. Prototype infrastructure
Here is a selected area of a plot of the two-dimensional correlation of headlines and terms for the 241 headlines in the NBC News security RSS feed.

15. Headline scatter plot
(Scatter plot of the headlines and terms in LSI dimensions 2 and 3.)

16. Changes and improvements
Here are some changes/improvements that might be needed.

17. Dimensions
More dimensions need to be used. In order to make a 2D scatter plot, only two dimensions were used, and the results are, as might be expected, not as promising as might be hoped. In practice, with thousands of documents (e.g., headlines), hundreds of dimensions are used.

18. Clustering
Once more dimensions are used, and a visual representation such as a scatter chart is no longer possible, some means of clustering (e.g., k-means) is needed to group similar items, as in the sketch below. This would be needed even with two dimensions, since the clustering still needs to be automated. Clustering appears simple but is actually quite involved; clustering techniques are a primary concern of the field of machine learning.
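
Here is a minimal sketch of grouping documents by their reduced coordinates with k-means, using SciPy's kmeans2 on made-up 2D points.

import numpy as np
from scipy.cluster.vq import kmeans2

# Made-up document coordinates in two LSI dimensions.
points = np.array([
    [0.9, 0.1], [1.0, 0.2], [0.8, 0.0],   # one apparent group
    [0.1, 0.9], [0.2, 1.0], [0.0, 0.8],   # another
])

# Cluster into k = 2 groups; labels[i] is the cluster of document i.
centroids, labels = kmeans2(points, 2, minit='++')
print(labels)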

19. Memory limits
Memory limits were exceeded with 7,500 headlines. One way to reduce memory usage is to increase the number of stop words. It appears that the 32-bit Python implementation has a 4 GB memory limit. The 64-bit Python would remove this restriction, but there were initial problems installing NLTK under 64-bit Python. Note: the above demo was done using SciPy and NumPy without using Gensim. Gensim is designed to overcome many of these limitations, but making it work for customized headlines is more involved.

20. Time efficiency
As problem sizes increase, the use of parallel processing becomes more important. There is a Python library (e.g., Bias) that can utilize multiple cores on a CPU, and there are also ways to distribute the load over multiple machines (e.g., using MapReduce), but neither was investigated at this time.

21. Investigations
In addition, cursory investigations of the following were done. Some may prove useful in future endeavors.

22. Possible improvements
The following might be useful if adapted to the LSI method of analysis.
