Latent Semantic Analysis
1. Document comparison
2. Parts
Documents
Words - Vocabulary - Terms
Words/Terms for a given Document
3. Customers
If
documents are customers, and
words are products offered, and
the words in a given document are the products bought by a given customer,
then one can start to make recommendations of the form "customers like you bought products like this".
Google (and others) use a version of LSI in their search engine.
NetFlix (and others) use a version of LSI in their recommendation engine.
LSI is used for document comparison in eDiscovery, PC (Predictive Coding), TAR (Technology Assisted Review), etc.
4. Document matrix
LSI starts with a term document matrix.
There are n documents in the set of documents D, where the index j of document d_j in D goes from 1 to n.
A document d_j is some piece of text that is to be compared to other documents. A document can be an actual document, an email message, a headline, etc.
5. Terms
All of the terms (e.g., words, collocations, etc.) for all documents form the set of m terms T, where the index i of term t_i in T goes from 1 to m.
A set (i.e., list) of stop words, or omitted words, is maintained. These are usually common words that do not add meaning, and they do not appear in the term set T.
6. Frequency
The frequency (i.e., count) of terms in each document is used in this analysis, although other measures can be used.
7. TF-IDF matrix
8. TF-IDF
A typical definition for tf-idf is that the value of element a_{i,j} is

a_{i,j} = tf_{i,j} * g_i

where g_i is the global weighting factor for term i.
9. Global weighting factor
The global weighting factor for term i in document j is defined as
where tf
i,j is the term count of term i in document j and n is the number of documents, and
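As a minimal sketch of this weighting (the toy counts below are made-up values for illustration, not data from this page), assuming NumPy:

import numpy as np

# Toy term counts tf[i, j]: 3 terms (rows) by 4 documents (columns).
tf = np.array([
    [2, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 3, 1, 1],
], dtype=float)

n = tf.shape[1]                    # number of documents
df = np.count_nonzero(tf, axis=1)  # documents containing each term
g = np.log(n / df)                 # global weighting factor g_i
a = tf * g[:, np.newaxis]          # a[i, j] = tf[i, j] * g_i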
10. Term document matrix
Let the term document matrix A be the m by n matrix where element a_{i,j} in A is a measure (e.g., count, tf-idf, etc.) of the frequency of term i in document j.
The term document matrix A can become quite large (e.g., for 1000 terms and 1000 documents it has 1,000,000 elements) but it tends to be very sparse (i.e., most elements are zero).
A row i in A (i.e., a row vector) has n elements, where each a_{i,j} is the frequency of the appearance of term i in each of documents 1 to n.
A column j in A (i.e., a column vector) has m elements, where each a_{i,j} is the frequency of the appearance of each of terms 1 to m in document j.
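As a minimal sketch of building such a matrix (the documents and stop word list below are made-up examples):

# Build a toy term document count matrix from a few one-line "documents".
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
stop_words = {"the", "on"}

# Vocabulary: all terms across all documents, minus stop words.
terms = sorted({w for d in docs for w in d.split()} - stop_words)

# a[i][j] = count of term i in document j (m rows by n columns).
a = [[d.split().count(t) for d in docs] for t in terms]

for t, row in zip(terms, a):
    print(t, row)

Even in this tiny example, most entries are zero, which illustrates the sparsity noted above.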
11. Singular Value Decomposition
An SVD can be used to effectively reduce the dimensionality of the data. That is, a larger set of data can be effectively represented by a smaller set of data. The SVD has been used in recommendation engines to make better recommendations (e.g., as a result of the NetFlix competition).
In LSI, the singular values represent semantic concepts in the data (e.g., documents). Synonyms are handled well, since synonymous terms appear in the context of similar concepts. On the other hand, terms with multiple meanings are not handled well.
12. SVD
The SVD is related to PCA (Principal Components Analysis).
from scipy.linalg import svd
The SVD transforms the matrix A as follows.

A = U Sigma V^T

If m ≤ n, then U is m by m, Sigma is m by m, and V^T is m by n.
If n ≤ m, then U is m by n, Sigma is n by n, and V^T is n by n.
The U matrix contains the left singular vectors, which are the eigenvectors of A A^T.
The V matrix contains the right singular vectors, which are the eigenvectors of A^T A.
Assume that m is less than or equal to n. Otherwise, switch m and n.
The Sigma matrix is a diagonal matrix containing the singular values sigma_{i,i}, which are the square roots of the eigenvalues of both A A^T and A^T A.
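As a quick sketch checking these facts numerically (a made-up random matrix, reusing the scipy import shown above):

import numpy as np
from scipy.linalg import svd

rng = np.random.default_rng(0)
A = rng.random((3, 5))                # m = 3 terms, n = 5 documents

U, s, Vt = svd(A, full_matrices=False)
print(U.shape, s.shape, Vt.shape)     # (3, 3) (3,) (3, 5)

# The singular values are the square roots of the eigenvalues of A A^T.
eig = np.linalg.eigvalsh(A @ A.T)          # ascending order
print(np.allclose(np.sqrt(eig[::-1]), s))  # True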
The Sigma matrix is often written or processed as a vector, as follows:

sigma = (sigma_{1,1}, sigma_{2,2}, ..., sigma_{m,m})
By convention, and by suitable permutations of the matrices, the singular values sigma_{i,i} meet the following conditions:

sigma_{1,1} ≥ sigma_{2,2} ≥ ... ≥ sigma_{m,m} ≥ 0
By selecting an appropriate (positive) value k less than m, a semantically reduced system can be obtained that can be used to model the entire system.
To reduce, a value k is selected such that k ≤ m. The reduced form is as follows, where A is approximated by A_hat:

A_hat = U_k Sigma_k V_k^T

where U_k is the first k columns of U, Sigma_k is the upper-left k by k block of Sigma, and V_k is the first k columns of V.
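As a minimal sketch of this reduction (again with a made-up matrix; the names are illustrative):

import numpy as np
from scipy.linalg import svd

rng = np.random.default_rng(0)
A = rng.random((4, 6))                 # toy term document matrix

U, s, Vt = svd(A, full_matrices=False)

k = 2                                  # reduced number of dimensions
A_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# A_hat is the best rank-k approximation of A (in the least squares sense).
print(np.linalg.norm(A - A_hat))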
13. Prototype methodology
The following methodology was used as a starting point to prototype headline topic extraction.
1. RSS (Really Simple Syndication) feeds have been collected in Thunderbird for some time now. For example, the NBC news headlines feed has about 9,000 entries. For this prototype, the NBC security feed was used, with only 241 entries.
2. The RSS feed headlines for each feed were output to a text file.
3. The selected text file headlines were organized into the Python format needed for LSI (Latent Semantic Indexing).
4. The Python libraries SciPy (Scientific Python), NumPy (Numerical Python), and Gensim (Topic modeling for humans) were used to do an SVD (Singular Value Decomposition) of the headlines (i.e., documents) and terms (see the sketch after this list).
5. In order to plot the results in a 2D scatter plot, only two dimensions of correlation were used (i.e., dimensions 2 and 3, as dimension 1 is related to the document size). With higher dimensions, a clustering technique needs to be applied to group similar items.
6. A selected part of the resulting Excel 2D scatter plot was obtained.
7. Stop words consisting of common words were used. Additional words can be added as necessary.
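As a rough sketch of the Gensim part of step 4 (the headline strings and num_topics value here are illustrative assumptions, not the actual NBC data or prototype code):

from gensim import corpora, models

headlines = [
    "hackers breach bank security systems",
    "new malware targets bank customers",
    "security flaw found in web browsers",
]
texts = [h.lower().split() for h in headlines]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

# Tf-idf weighting, then LSI via SVD; num_topics is the reduced dimension k.
tfidf = models.TfidfModel(corpus)
lsi = models.LsiModel(tfidf[corpus], id2word=dictionary, num_topics=2)

for doc in lsi[tfidf[corpus]]:
    print(doc)   # coordinates of each headline in the LSI dimensions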
14. Prototype infrastructure
15. Headline scatter plot
Here is a selected area of a plot of the two-dimension correlation of headlines and terms for the 241 headlines in the NBC News security RSS feed.
16. Changes and improvements
Here are some changes/improvements that might be needed.
17. Dimensions
More dimensions need to be used. In order to do a 2-D scatter plot, only 2 dimensions were used. The results are, as might be expected, not as promising as might be hoped. In practice, with thousands of documents (e.g., headlines), hundreds of dimensions are used.
18. Clustering
Once more dimensions are used, and a visual representation such as a scatter chart is not possible, some means of clustering (e.g., using k-means) is needed in order to group similar terms. This would be needed even if only 2 dimensions were used, as the clustering still needs to be automated.
Clustering appears simple but is actually quite involved. Clustering techniques are a primary concern of the field of machine learning.
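As a minimal sketch of k-means (assuming the document coordinates from the SVD are in a NumPy array; the coordinates below are made up), SciPy provides an implementation:

import numpy as np
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(1)
# Pretend these are (dimension 2, dimension 3) coordinates of 20 headlines.
coords = rng.random((20, 2))

# Group the headlines into 3 clusters.
centroids, labels = kmeans2(coords, 3, minit="++")
print(labels)   # cluster index for each headline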
19. Memory limits
Memory limits were exceeded with 7,500 headlines. One way to reduce memory usage is to increase the number of stop words. It appears that the 32-bit Python implementation has a 4 GB memory limit. A 64-bit Python would remove this restriction, but there were initial problems installing NLTK under 64-bit Python.
Note: The above demo was done using SciPy and NumPy without using GenSim. GenSim is designed to overcome many of these limitations, but making it work for customized headlines is more involved.
20. Time efficiency
As problem sizes increase, the use of parallel processing becomes more important. There is a Python library (e.g., Bias) that can utilize multiple cores on a CPU, and there are also ways to distribute the load over multiple machines (e.g., using MapReduce), but neither was investigated at this time.
21. Investigations
In addition, cursory investigations of the following were done. Some may prove useful in future endeavors.
Python-based NLTK (Natural Language Toolkit)
Octave, which provides much of the functionality of MATLAB without the huge expense.
22. Possible improvements
The following might be useful if adapted to the LSI method of analysis.
Time dependency of headlines.
The full body text of each headline.
Multiple words that occur together (i.e., collocations) rather than single words (see the sketch below).
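As a minimal sketch of finding collocations with NLTK (the sentence below is a made-up example):

from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

text = ("white house security officials said the white house "
        "would review security policy")
words = text.split()

# Rank adjacent word pairs by pointwise mutual information (PMI).
bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(words)
print(finder.nbest(bigram_measures.pmi, 3))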
23. End of page