Topic modeling: overview

This is a brief overview of topic modeling. Topic modeling approaches include the following.

The LSI (Latent Semantic Indexing) method is a VBM (Vector Based Model) that uses an SVD (Singular Value Decomposition) transformation.
The LDA (Latent Dirichlet Allocation) method is a hierarchical (and generative) Bayesian inference method.

Both use a BOW (Bag Of Words) approach.

Sets and bags:

A set has items that each appear only once.
A bag has items that can appear more than once.
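As a small illustration of the difference, here is a sketch using Python's standard library (the choice of language and the example words are assumptions, not part of the original overview). A set collapses repeats, while a bag keeps a count for each item.

```python
from collections import Counter

words = ["the", "car", "passed", "the", "other", "car"]

# A set keeps each distinct item once; repetition is lost.
word_set = set(words)

# A bag (multiset) keeps each distinct item together with its count.
word_bag = Counter(words)

print(sorted(word_set))
print(word_bag.most_common())
```

The bag is what topic modeling works from, since the counts carry the information about how prominent a word is in a document.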

Since topic modeling almost always uses a BOW approach, each document is represented by its "words" together with a count of the number of times each word occurs in that document. Better results are usually obtained by modifying the count using tf-idf (term frequency - inverse document frequency) so as not to inflate the importance of words that are repeated within a document or across the collection of documents.

TF (Term Frequency)
IDF (Inverse Document Frequency)

There are variations, but usually a logarithm function is used so as not to overly weight words that appear more and more times in a document. However, this is done in the preprocessing step so that this modification can easily be omitted or changed.
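One possible weighting can be sketched as follows. This is only one of the many variations: the term frequency is dampened as 1 + log(count) and the inverse document frequency is log(N / df), where N is the number of documents and df is the number of documents containing the word. The function name `tf_idf` and the sample bags are hypothetical.

```python
import math
from collections import Counter

def tf_idf(bags):
    """Return one {word: weight} dict per document in `bags` (a list of Counters)."""
    n_docs = len(bags)
    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter()
    for bag in bags:
        doc_freq.update(bag.keys())
    weighted = []
    for bag in bags:
        weighted.append({
            # Log-dampened count times log-scaled rarity across the corpus.
            word: (1.0 + math.log(count)) * math.log(n_docs / doc_freq[word])
            for word, count in bag.items()
        })
    return weighted

bags = [
    Counter({"car": 4, "road": 1}),
    Counter({"car": 1, "engine": 2}),
    Counter({"road": 1, "map": 3}),
]
weights = tf_idf(bags)
print(weights[0])
```

With this variant, a word that occurs four times weighs more than a word that occurs once, but less than four times as much, and a word that occurs in every document gets a weight of zero.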

The general approach to pre-processing the text of a document for topic modeling includes the following steps.

1. Remove stop words.

omit

3. Use stemming to consider some words with the same stem as the same word. For example, "car" and "cars".

word

5. Transform the resulting list of words into a bag of words (words with count) for that document.

6. Modify/weight the counts if desired (e.g., using tf-idf, etc.).

This pre-processing is the messiest part of the process. Cleaning it up, often with the help of long word lists available from different sources, helps a lot in getting valid results.
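The overall shape of such a pipeline can be sketched as below. The tiny stop-word list, the regular-expression tokenizer, and the toy stemmer are illustrative assumptions only; real projects usually rely on much longer word lists and an established stemmer from an NLP library.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "are"}

def crude_stem(word):
    """Toy stemmer: strip a trailing 's' so that "cars" matches "car"."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def to_bag_of_words(text):
    """Lowercase, tokenize, drop stop words, stem, and count."""
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [crude_stem(t) for t in tokens if t not in STOP_WORDS]
    return Counter(kept)

bag = to_bag_of_words("The cars and the car are in the garage.")
print(bag)
```

Keeping each step in its own small function makes it easy to swap in a better word list or stemmer, or to add a weighting step afterwards, without touching the rest.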

A lot of NLP (Natural Language Processing) techniques are used in this pre-processing.

The order of the words, unless used as bi-grams, tri-grams, etc., is not considered important to the analysis - probably because no one who has tried it has found it useful - probably because there is too much noise to discern any useful signal. And, to date, topic modeling has worked well without that complication.
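For those who do want to keep some word order, one simple way is to add each pair of adjacent words to the bag as an extra "word". The sketch below assumes an underscore-joined naming scheme and a hypothetical `bigrams` helper; neither comes from the original text.

```python
from collections import Counter

def bigrams(tokens):
    """Join each token with its successor, e.g. "new", "york" -> "new_york"."""
    return [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]

tokens = ["new", "york", "city", "new", "york"]

# The bag now holds single words and adjacent pairs side by side.
bag = Counter(tokens + bigrams(tokens))
print(bag)
```

The cost is a much larger vocabulary, most of it rare pairs, which is one way to see where the extra noise comes from.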

Note that LSI, LDA, etc., are designed to be general and to smooth out differences between documents, so that a document may be found relevant/similar even if the keyword of interest does not appear in that document.
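For LSI, this smoothing can be written down compactly. The notation here is an assumption, since the original gives no formulas: A is the term-by-document matrix of (possibly weighted) counts, k is the chosen number of topics, and q is the bag-of-words vector of a query or new document.

```latex
A \approx A_k = U_k \Sigma_k V_k^{T}
\qquad\text{and}\qquad
\hat{q} = \Sigma_k^{-1} U_k^{T} q
```

Only the k largest singular values are kept, so words that tend to occur in the same documents end up close together in the k-dimensional topic space. A query mapped to \hat{q} can therefore lie near a document that shares none of its exact words.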

Since the introduction of LSI, LDA, etc., derivative works have appeared (usually in academia) that cover almost any conceivable alteration to the original model. This includes temporal orderings (e.g., date and time), word orders (e.g., bi-grams, tri-grams, etc.), hierarchical levels, etc.

In topic modeling, the entire corpus of documents is pre-processed in the above manner and then LSI, LDA, etc., is applied.

Document similarity allows these documents to be grouped/clustered into groups of similar documents (using a variety of methods).

Methods to compare a new "document" to existing documents include the following.

A search query is converted into a document in the same manner as described above. A dictionary of words with importance indicated by repetition is just a search query with those important words repeated and then processed in the above manner.
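A minimal sketch of this idea follows, using cosine similarity as the comparison. The measure, the function name, and the sample documents are all assumptions. The sketch compares raw bags of words; with LSI or LDA the same comparison would be made on the topic vectors instead.

```python
import math
from collections import Counter

def cosine_similarity(bag_a, bag_b):
    """Cosine of the angle between two bags treated as sparse count vectors."""
    dot = sum(count * bag_b.get(word, 0) for word, count in bag_a.items())
    norm_a = math.sqrt(sum(c * c for c in bag_a.values()))
    norm_b = math.sqrt(sum(c * c for c in bag_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

documents = {
    "doc_cars": Counter({"car": 3, "engine": 2, "road": 1}),
    "doc_maps": Counter({"map": 4, "road": 2}),
}

# Repeating "car" in the query marks it as the more important word.
query = Counter("car engine car".split())

ranked = sorted(
    documents,
    key=lambda name: cosine_similarity(query, documents[name]),
    reverse=True,
)
print(ranked)
```

Because the query is just another bag of words, nothing in the comparison needs to know whether it came from a search box, a dictionary of weighted terms, or a sample paragraph.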

An advertiser could supply search terms, a dictionary of words with frequency counts, or examples of what they are interested in, in the form of paragraphs, documents, etc. In each case, the information supplied is converted into a list of words with frequency counts (i.e., a bag of words) that represents a "document", and then document similarity is used to find, and do something with, similar documents.

Topic modeling helps identify "similar" documents without knowing anything about the actual documents, but one must still specify which group or groups of documents are of interest. In predictive coding, humans manually identify those groups by annotating a small subset of documents. Other ways include, as mentioned above, search queries, a dictionary with frequencies, example text, etc.

Topic modeling is a very general idea that has found applications in, for example, DNA processing/research, image processing, etc.

For example, if customers are considered "documents" and the items they have bought are considered "words", with the number bought of each item as the associated "count", then, without knowing the details of any customer or product, topic modeling can help answer questions like "customers like you bought these products" (i.e., document similarity and then most associated products) and "here are products similar to the one you are considering" (word/topic similarity), etc. This is the basis of recommendation engines and was a key part of the winning solution to the Netflix competition a few years ago, though the winning method as a whole was complex.
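The customers-as-documents idea can be sketched directly on purchase counts. The customers, products, `cosine` helper, and `recommend` function below are hypothetical, and the sketch uses plain document similarity without a topic model, so it shows only the first half of "customers like you bought these products".

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two purchase-count bags."""
    dot = sum(n * b.get(item, 0) for item, n in a.items())
    norm = math.sqrt(sum(n * n for n in a.values())) * math.sqrt(
        sum(n * n for n in b.values())
    )
    return dot / norm if norm else 0.0

# Each customer is a "document"; each product bought is a "word" with a count.
purchases = {
    "alice": Counter({"tent": 1, "stove": 2, "lantern": 1}),
    "bob": Counter({"tent": 1, "stove": 1, "sleeping_bag": 1}),
    "carol": Counter({"novel": 3, "bookmark": 1}),
}

def recommend(customer, purchases):
    """Find the most similar other customer and list what only they bought."""
    own = purchases[customer]
    others = [name for name in purchases if name != customer]
    nearest = max(others, key=lambda name: cosine(own, purchases[name]))
    suggestions = sorted(item for item in purchases[nearest] if item not in own)
    return nearest, suggestions

nearest, items = recommend("alice", purchases)
print(nearest, items)
```

Nothing here depends on what a tent or a novel actually is; only the pattern of counts matters, which is the point made above.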

Note: The "cold start" problem happens, for example, when a new product is introduced that has no history, so that some other way must be provided to jump-start this product. This is why, in the Netflix competition solution, topic modeling is only part of the overall solution. One is always free to integrate specific queries into the process but, for the problem being solved, this may help or hinder the process and make the results better or worse, depending on the application, the model, the implementation, etc. It is still true that to solve any problem, one must carefully identify and define the problem that one is solving and then, if off-the-shelf methods do not work well, one needs to create or adapt some model to solve that particular problem.