Topic modeling: Conference proceedings
1. LDA model
LDA (Latent Dirichlet Allocation) is a generative Bayesian statistical NLP (Natural Language Processing) model. Here is a graphical model depiction of it, in plate notation. A plate represents an array, a plate within a plate a 2D array, and so on.
M denotes the number of documents
N denotes the number of words in a given document
K denotes the number of topics
The w are the words, grayed out because they are observable. All other parameters are latent variables.
2. Topic modeling
Infer topics from documents containing words
Infer objects from images containing pixels
Infer similarity from DNA containing nucleotides
Recommend products from customers and purchases
3. Bayesian inference
Bayesian inference has many subfields.
Analytic methods:
Conjugate priors
Variational approximation
Numeric methods:
Metropolis algorithm
Gibbs sampling
Documents that form a corpus
Words that form a vocabulary
Goal: Determine the topics in the documents, each with a distribution of word probabilities.
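To make the "conjugate priors" item above concrete, here is a minimal sketch of the conjugate-prior idea, using a toy beta-binomial model (LDA itself uses the related Dirichlet-multinomial pair; the numbers below are invented for illustration):

```python
from fractions import Fraction

# Conjugate prior: a Beta(a, b) prior on a coin's heads-probability,
# updated with binomial data, gives a Beta posterior in closed form.
def beta_binomial_update(a, b, heads, tails):
    # Posterior parameters are just the prior counts plus the observed counts.
    return a + heads, b + tails

# Uniform prior Beta(1, 1); observe 7 heads and 3 tails.
a, b = beta_binomial_update(1, 1, 7, 3)
posterior_mean = Fraction(a, a + b)   # mean of Beta(a, b) is a / (a + b)
print(a, b, posterior_mean)           # 8 4 2/3
```

Because the posterior stays in the same family as the prior, no numerical integration is needed; this is what "analytic methods" buys you.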
4. Applications
documents and words
customer purchases and products offered
images and pixels
DNA and nucleotides
proteins and amino acids
... and so on ...
5. Documents and words
6. LDA model
7. ASCUE data
Corpus: PDFs from 2002 to 2014
Converted to text (Python script)
Parsed into pages (Python script)
Lines removed (page numbers, email, headers, table of contents, index, etc.)
Simplicity: Each page was considered a document.
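The line-removal step above can be sketched as a small filter pass. The patterns below (bare page numbers, email addresses, a running header) are assumed examples, not the author's actual rules:

```python
import re

# Hypothetical cleanup pass: drop lines that look like page numbers,
# email addresses, or running headers before topic modeling.
DROP_PATTERNS = [
    re.compile(r"^\s*\d+\s*$"),               # bare page numbers
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),  # email addresses
    re.compile(r"^\s*Proceedings of .*$"),    # running header (assumed wording)
]

def clean_page(text):
    kept = []
    for line in text.splitlines():
        if any(p.search(line) for p in DROP_PATTERNS):
            continue
        kept.append(line)
    return "\n".join(kept)

page = "Proceedings of the Conference\nReal content here.\n42\nauthor@example.edu\nMore content."
print(clean_page(page))
```

The same pass generalizes to tables of contents and index pages by adding patterns.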
8. PostScript
Adobe PostScript was designed as a page description language for exact production of documents. PostScript "paints" graphics and text on a page. Text is just graphics.
Steve Jobs made PostScript popular with the introduction of the Apple LaserWriter in 1985.
9. LaserWriter
The LaserWriter is a laser printer with built-in PostScript interpreter sold by Apple Computer, Inc. from 1985 to 1988. It was one of the first laser printers available to the mass market. In combination with WYSIWYG publishing software like PageMaker, that operated on top of the graphical user interface of Macintosh computers, the LaserWriter was a key component at the beginning of the desktop publishing revolution. (Wikipedia)
In 1985 the LaserWriter cost almost $7,000 (over $16,000 in equivalent 2019 dollars). It had 1.5MB of memory and printed at 300 dpi.
10. PDF documents
Adobe introduced PDF (Portable Document Format) documents, which use a simplified PostScript (code unrolled via abstract interpretation) with added meta-features for links, context, etc.
PDF became a popular document format.
11. Text extraction
In data science applications involving text, it is often necessary to extract text from PDF documents. Assume the documents are not encrypted, or that the encryption keys are known.
How can one extract text from PDF documents?
12. Text as graphics
In a Word document, the text is grouped together in meaningful groups.
In PDF, based on PostScript, text is just graphics painted on a page.
Thus, characters can be extracted in small groups, but those groups must be pieced together to reconstruct the text.
13. Text extraction
There are some programs that attempt to extract text from PDF.
For this purpose (and for conversion to SVG), I have used Ghostscript, an open-source PostScript/PDF interpreter, to do a reasonable extraction of text that then needs additional work to be useful.
14. Mechanics
Goal of the conference proceedings topic modeling: run the LDA topic modeling algorithm for 30 topics to get, for each topic, a distribution of words and their probabilities.
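A "topic distribution of words and probabilities" is just the topic's word counts normalized to sum to 1. A sketch with invented toy counts (a real run would have 30 such topics over the full vocabulary):

```python
# Turn a topic's word counts into a probability distribution
# and list its top words, highest probability first.
def top_words(counts, n=3):
    total = sum(counts.values())
    return sorted(((w, c / total) for w, c in counts.items()),
                  key=lambda wp: -wp[1])[:n]

counts = {"computer": 50, "network": 30, "student": 15, "course": 5}
print(top_words(counts))  # [('computer', 0.5), ('network', 0.3), ('student', 0.15)]
```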
15. Computation
For a simple run (100 iterations, convergence assumed): about 30 minutes (on a fast 16GB i7 laptop)
With a cluster of 4 machines, 4 variations can be tried at a time.
Future: a Parallella cluster, with a multi-core algorithm on each machine and each machine working on a different variation.
16. ASCUE 1
2,505 documents/pages,
20,745 words
What issues do you see?
17. ASCUE 2
2,505 documents/pages,
20,704 words
What issues do you see?
18. ASCUE 3
2,505 documents/pages,
13,545 words
What issues do you see?
19. Statistics
What are the two main branches of statistics?
frequentist statistics
Bayesian statistics
They both solve the same types of problems. In some cases, one approach works better, or is easier, than the other.
Analogy: wave-particle duality in physics
20. Bayes rule
Bayes' rule is based on flipping a conditional probability: P(A|B) = P(B|A) P(A) / P(B).
21. Diagnosis
Diagnosis:
Probability that someone has cancer (base rate): 0.004, or 0.4%
Cancer test accuracy (sensitivity and specificity): 98.0%
You test positive for cancer. What is the probability that you have cancer?
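Working the numbers above through Bayes' rule (assuming "98% accuracy" means both sensitivity and specificity are 0.98):

```python
# Bayes' rule on the diagnosis numbers: base rate 0.4%, test accuracy 98%.
p_cancer = 0.004
p_pos_given_cancer = 0.98    # sensitivity
p_pos_given_healthy = 0.02   # false-positive rate (1 - specificity)

# Total probability of a positive test, then flip the conditional.
p_pos = p_cancer * p_pos_given_cancer + (1 - p_cancer) * p_pos_given_healthy
p_cancer_given_pos = p_cancer * p_pos_given_cancer / p_pos
print(round(p_cancer_given_pos, 3))  # 0.164
```

So a positive result means only about a 16% chance of cancer, far lower than most people guess, because the low base rate dominates.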
22. Other domains
Knowing material, passing test, "A" student
Knowing material, passing certification, competent
Knowing something, failing lie detector test, lied
Not taking drug, failing test, accused/guilty
23. Conditional probability
If "X" (patient) has disease "Y" (measles), "X" (patient) will have "Z" (red spots).
"X" (patient) has "Z" (red spots).
What is the probability that "X" (patient) has "Y" (measles)?
We do not yet know if "X" (patient) has "Y" (measles).
If document "X" has topic "Y" in it, "X" should contain word "Z".
Document "X" contains word "Z".
What is the probability document "X" has topic "Y"?
We do not yet know if document "X" contains topic "Y".
24. Generative technique
Given a generative model, Bayesian networks (graphical models) allow us to go backwards and infer the hidden (latent) topics/distributions from the observable data.
25. Math: Many extensions
26. Generative techniques
Create a generative model for documents
Generate sample documents from the model
Run the algorithm on the generated documents
See how well the original model is obtained.
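The first two steps above can be sketched as follows. The two-topic model, its word lists, and its probabilities are all invented for illustration; each document draws from a single topic for simplicity (full LDA mixes topics within a document):

```python
import random

# A tiny two-topic generative model: topic id -> (words, probabilities).
TOPICS = {
    0: (["computer", "network", "software"], [0.5, 0.3, 0.2]),
    1: (["student", "course", "grade"],      [0.5, 0.3, 0.2]),
}

def generate_doc(rng, doc_len=20):
    topic = rng.randrange(len(TOPICS))      # one topic per doc, for simplicity
    words, probs = TOPICS[topic]
    return topic, rng.choices(words, weights=probs, k=doc_len)

rng = random.Random(0)
docs = [generate_doc(rng) for _ in range(5)]
for topic, words in docs:
    print(topic, words[:5])
```

Running LDA on such generated documents and comparing the recovered topics against TOPICS is the sanity check described above.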
27. Gibbs Sampling
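A minimal collapsed Gibbs sampler for LDA, as a sketch of the technique (pure Python, toy-sized; not the author's production code, and real runs would use far larger corpora and more iterations):

```python
import random

# docs: list of documents, each a list of word ids in 0..V-1.
def lda_gibbs(docs, K, V, iters=50, alpha=0.1, beta=0.01, seed=0):
    rng = random.Random(seed)
    ndk = [[0] * K for _ in docs]        # document-topic counts
    nkw = [[0] * V for _ in range(K)]    # topic-word counts
    nk = [0] * K                         # total tokens per topic
    z = []                               # topic assignment for every token
    for d, doc in enumerate(docs):       # random initialization
        zd = []
        for w in doc:
            k = rng.randrange(K)
            zd.append(k)
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
        z.append(zd)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]              # remove this token's assignment
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # full conditional p(z_i = k | all other assignments)
                wts = [(ndk[d][j] + alpha) * (nkw[j][w] + beta) / (nk[j] + V * beta)
                       for j in range(K)]
                r = rng.random() * sum(wts)
                for k, wt in enumerate(wts):
                    r -= wt
                    if r <= 0:
                        break
                z[d][i] = k              # reassign under the sampled topic
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    return ndk, nkw

# Two tiny "documents" over a 4-word vocabulary.
docs = [[0, 0, 1, 1, 0], [2, 3, 2, 2, 3]]
ndk, nkw = lda_gibbs(docs, K=2, V=4)
print(ndk)
```

Each sweep resamples every token's topic from its full conditional; the counts ndk and nkw are the quantities that, once normalized, give the document-topic and topic-word distributions.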
28. Bag of words
Set: no repetitions
Bag: repetitions
In a bag, every occurrence of a word is counted, but word order and position in the document are ignored.
Words are "exchangeable" which makes the theory and math work out in nice ways.
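The set/bag distinction, and the order-invariance behind exchangeability, in a few lines of Python:

```python
from collections import Counter

# Bag vs. set for the same token stream: the set drops repeats,
# the bag (Counter) keeps per-word counts but ignores word order.
tokens = "the cat sat on the mat the end".split()
bag = Counter(tokens)
print(sorted(set(tokens)))
print(bag["the"], bag["cat"])  # 3 1
```

Reversing (or shuffling) the token stream yields the same bag, which is exactly the exchangeability that LDA's math relies on.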
Most common method:
LDA
29. LDA
LDA: Latent Dirichlet Allocation
Latent: the topic distributions are hidden
Dirichlet: the Dirichlet distribution, a probability distribution over probability distributions, serves as the prior
Allocation: words are allocated to topic distributions
Poor naming choice: LDA can mean "Linear Discriminant Analysis" or "Latent Dirichlet Allocation".
30. Lots of probability
The Dirichlet prior is conjugate to the multinomial distribution, which makes the computational equations work out nicely.
31. Software
Custom Python code written by the author.
gensim - a Python-based library
Mallet - Java-based
R statistical packages
Mahout - a topic modeling module within a larger (Java-based) machine learning system
C code from original developers
... and so on ...
There are many variations of the basic algorithm as this is a current research area.
32. ASCUE 1
What issues do you see?
33. ASCUE 2
What issues do you see?
34. ASCUE 3
What issues do you see?
35. Other visualization techniques
36. Word clouds
37. Machine learning
Machine learning (artificial intelligence)
Big data
Statistics
38. Machine learning
39. Deep learning
feature extraction for machine learning
deep learning
40. Applications
Topic modeling
Sentiment analysis
Recommendation engine