Topic modeling: Conference proceedings
1. LDA model
LDA (Latent Dirichlet Allocation) is a generative Bayesian statistical NLP (Natural Language Processing) model. Here is a graphical model depiction of it, in plate notation. A plate represents an array, a plate within a plate a 2D array, and so on.
M denotes the number of documents
N denotes the number of words in a given document
K denotes the number of topics
The w are the words, grayed out because they are observable. All other parameters are latent variables.
2. Topic modeling
Infer topics from documents containing words
Infer objects from images containing pixels
Infer similarity from DNA containing nucleotides
Recommend products from customers and purchases
3. Bayesian inference
Bayesian inference has many subfields.
Analytic methods:
Conjugate priors
Variational approximation
Numeric methods:
Metropolis algorithm
Gibbs sampling
Documents that form a corpus
Words that form a vocabulary
Goal: Determine the topics in the documents, each with a distribution of word probabilities.
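To make the "conjugate priors" item above concrete, here is a minimal sketch of the conjugate-prior idea, using a toy beta-binomial model (LDA itself uses the related Dirichlet-multinomial pair; the numbers below are invented for illustration):

```python
from fractions import Fraction

# Conjugate prior: a Beta(a, b) prior on a coin's heads-probability,
# updated with binomial data, gives a Beta posterior in closed form.
def beta_binomial_update(a, b, heads, tails):
    # Posterior parameters are just the prior counts plus the observed counts.
    return a + heads, b + tails

# Uniform prior Beta(1, 1); observe 7 heads and 3 tails.
a, b = beta_binomial_update(1, 1, 7, 3)
posterior_mean = Fraction(a, a + b)   # mean of Beta(a, b) is a / (a + b)
print(a, b, posterior_mean)           # 8 4 2/3
```

Because the posterior stays in the same family as the prior, no numerical integration is needed; this is what "analytic methods" buys you.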
4. Applications
documents and words
customer purchases and products offered
images and pixels
DNA and nucleotides
proteins and amino acids
... and so on ...
5. Documents and words
6. LDA model
7. ASCUE data
Corpus: PDFs from 2002 to 2014
Converted to text (Python script)
Parsed into pages (Python script)
Lines removed (page numbers, email, headers, table of contents, index, etc.)
Simplicity: Each page was considered a document.
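The line-removal step above can be sketched as a small filter pass. The patterns below (bare page numbers, email addresses, a running header) are assumed examples, not the author's actual rules:

```python
import re

# Hypothetical cleanup pass: drop lines that look like page numbers,
# email addresses, or running headers before topic modeling.
DROP_PATTERNS = [
    re.compile(r"^\s*\d+\s*$"),               # bare page numbers
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),  # email addresses
    re.compile(r"^\s*Proceedings of .*$"),    # running header (assumed wording)
]

def clean_page(text):
    kept = []
    for line in text.splitlines():
        if any(p.search(line) for p in DROP_PATTERNS):
            continue
        kept.append(line)
    return "\n".join(kept)

page = "Proceedings of the Conference\nReal content here.\n42\nauthor@example.edu\nMore content."
print(clean_page(page))
```

The same pass generalizes to tables of contents and index pages by adding patterns.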
8. PostScript
Adobe PostScript was designed as a page description language for exact production of documents. PostScript "paints" graphics and text on a page. Text is just graphics.
Steve Jobs made PostScript popular with the introduction of the Apple LaserWriter in 1985.
9. LaserWriter
The LaserWriter is a laser printer with built-in PostScript interpreter sold by Apple Computer, Inc. from 1985 to 1988. It was one of the first laser printers available to the mass market. In combination with WYSIWYG publishing software like PageMaker, that operated on top of the graphical user interface of Macintosh computers, the LaserWriter was a key component at the beginning of the desktop publishing revolution. (Wikipedia)
In 1985 the LaserWriter cost almost $7,000 (over $16,000 in equivalent 2019 dollars). It had 1.5MB of memory and printed at 300 dpi.
10. PDF documents
Adobe introduced PDF (Portable Document Format) documents, which use a simplified PostScript (code unrolled via abstract interpretation) with added meta-features for links, context, etc.
PDF became a popular document format.
11. Text extraction
In data science applications involving text, it is often necessary to extract text from PDF documents. Assume the documents are not encrypted, or that the encryption keys are known.
How can one extract text from PDF documents?
12. Text as graphics
In a Word document, the text is grouped together in meaningful groups.
In PDF, based on PostScript, text is just graphics painted on a page.
Thus, characters can be extracted in small groups, but those groups must be pieced together to reconstruct the text.
13. Text extraction
There are some programs that attempt to extract text from PDF.
For this purpose (and for conversion to SVG), I have used Ghostscript, an open-source PostScript/PDF interpreter, to do a reasonable extraction of text that then needs additional work to be useful.
14. Mechanics
Goal of the conference proceedings topic modeling: run the LDA topic modeling algorithm for 30 topics to get, for each topic, a distribution of words and their probabilities.
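A "topic distribution of words and probabilities" is just the topic's word counts normalized to sum to 1. A sketch with invented toy counts (a real run would have 30 such topics over the full vocabulary):

```python
# Turn a topic's word counts into a probability distribution
# and list its top words, highest probability first.
def top_words(counts, n=3):
    total = sum(counts.values())
    return sorted(((w, c / total) for w, c in counts.items()),
                  key=lambda wp: -wp[1])[:n]

counts = {"computer": 50, "network": 30, "student": 15, "course": 5}
print(top_words(counts))  # [('computer', 0.5), ('network', 0.3), ('student', 0.15)]
```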
15. Computation
For a simple run (100 iterations, convergence assumed): about 30 minutes (on a fast 16GB i7 laptop)
With a cluster of 4 machines, 4 variations can be tried at a time.
Future: a Parallella cluster, with a multi-core algorithm on each machine and each machine working on a different variation.
16. ASCUE 1
2,505 documents/pages,
20,745 words
What issues do you see?
17. ASCUE 2
2,505 documents/pages,
20,704 words
What issues do you see?
18. ASCUE 3
2,505 documents/pages,
13,545 words
What issues do you see?
19. Statistics
What are the two main branches of statistics?
frequentist statistics
Bayesian statistics
They both solve the same types of problems. In some cases, one approach works better, or is easier, than the other.
Analogy: wave-particle duality in physics
20. Bayes rule
Bayes' rule is based on flipping a conditional probability: P(A|B) = P(B|A) P(A) / P(B).
21. Diagnosis
Diagnosis:
Probability that someone has cancer (base rate): 0.004, or 0.4%
Cancer test accuracy (sensitivity and specificity): 98.0%
You test positive for cancer. What is the probability that you have cancer?
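Working the numbers above through Bayes' rule (assuming "98% accuracy" means both sensitivity and specificity are 0.98):

```python
# Bayes' rule on the diagnosis numbers: base rate 0.4%, test accuracy 98%.
p_cancer = 0.004
p_pos_given_cancer = 0.98    # sensitivity
p_pos_given_healthy = 0.02   # false-positive rate (1 - specificity)

# Total probability of a positive test, then flip the conditional.
p_pos = p_cancer * p_pos_given_cancer + (1 - p_cancer) * p_pos_given_healthy
p_cancer_given_pos = p_cancer * p_pos_given_cancer / p_pos
print(round(p_cancer_given_pos, 3))  # 0.164
```

So a positive result means only about a 16% chance of cancer, far lower than most people guess, because the low base rate dominates.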
22. Other domains
Knowing material, passing test, "A" student
Knowing material, passing certification, competent
Knowing something, failing lie detector test, lied
Not taking drug, failing test, accused/guilty
23. Conditional probability
If "X" (patient) has disease "Y" (measles), "X" (patient) will have "Z" (red spots).
"X" (patient) has "Z" (red spots).
What is the probability that "X" (patient) has "Y" (measles)?
We do not yet know if "X" (patient) has "Y" (measles).
If document "X" has topic "Y" in it, "X" should contain word "Z".
Document "X" contains word "Z".
What is the probability document "X" has topic "Y"?
We do not yet know if document "X" contains topic "Y".
24. Generative technique
Given a generative model, Bayesian networks (graphical models) allow us to go backwards and infer the hidden (latent) topics/distributions from the observable data.
25. Math: Many extensions
26. Generative techniques
Create a generative model for documents
Generate sample documents from the model
Run the algorithm on the generated documents
See how well the original model is obtained.
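The first two steps above can be sketched as follows. The two-topic model, its word lists, and its probabilities are all invented for illustration; each document draws from a single topic for simplicity (full LDA mixes topics within a document):

```python
import random

# A tiny two-topic generative model: topic id -> (words, probabilities).
TOPICS = {
    0: (["computer", "network", "software"], [0.5, 0.3, 0.2]),
    1: (["student", "course", "grade"],      [0.5, 0.3, 0.2]),
}

def generate_doc(rng, doc_len=20):
    topic = rng.randrange(len(TOPICS))      # one topic per doc, for simplicity
    words, probs = TOPICS[topic]
    return topic, rng.choices(words, weights=probs, k=doc_len)

rng = random.Random(0)
docs = [generate_doc(rng) for _ in range(5)]
for topic, words in docs:
    print(topic, words[:5])
```

Running LDA on such generated documents and comparing the recovered topics against TOPICS is the sanity check described above.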
27. Gibbs Sampling
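A minimal collapsed Gibbs sampler for LDA, as a sketch of the technique (pure Python, toy-sized; not the author's production code, and real runs would use far larger corpora and more iterations):

```python
import random

# docs: list of documents, each a list of word ids in 0..V-1.
def lda_gibbs(docs, K, V, iters=50, alpha=0.1, beta=0.01, seed=0):
    rng = random.Random(seed)
    ndk = [[0] * K for _ in docs]        # document-topic counts
    nkw = [[0] * V for _ in range(K)]    # topic-word counts
    nk = [0] * K                         # total tokens per topic
    z = []                               # topic assignment for every token
    for d, doc in enumerate(docs):       # random initialization
        zd = []
        for w in doc:
            k = rng.randrange(K)
            zd.append(k)
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
        z.append(zd)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]              # remove this token's assignment
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # full conditional p(z_i = k | all other assignments)
                wts = [(ndk[d][j] + alpha) * (nkw[j][w] + beta) / (nk[j] + V * beta)
                       for j in range(K)]
                r = rng.random() * sum(wts)
                for k, wt in enumerate(wts):
                    r -= wt
                    if r <= 0:
                        break
                z[d][i] = k              # reassign under the sampled topic
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    return ndk, nkw

# Two tiny "documents" over a 4-word vocabulary.
docs = [[0, 0, 1, 1, 0], [2, 3, 2, 2, 3]]
ndk, nkw = lda_gibbs(docs, K=2, V=4)
print(ndk)
```

Each sweep resamples every token's topic from its full conditional; the counts ndk and nkw are the quantities that, once normalized, give the document-topic and topic-word distributions.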
28. Bag of words
Set: no repetitions
Bag: repetitions
In a bag, every occurrence of a word is counted, but word order and position in the document are ignored.
Words are "exchangeable" which makes the theory and math work out in nice ways.
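The set/bag distinction, and the order-invariance behind exchangeability, in a few lines of Python:

```python
from collections import Counter

# Bag vs. set for the same token stream: the set drops repeats,
# the bag (Counter) keeps per-word counts but ignores word order.
tokens = "the cat sat on the mat the end".split()
bag = Counter(tokens)
print(sorted(set(tokens)))
print(bag["the"], bag["cat"])  # 3 1
```

Reversing (or shuffling) the token stream yields the same bag, which is exactly the exchangeability that LDA's math relies on.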
Most common method:
LDA
29. LDA
LDA: Latent Dirichlet Allocation
Latent: the topic distributions are hidden
Dirichlet: the Dirichlet distribution, a probability distribution over probability distributions, serves as the prior
Allocation: words are allocated to topic distributions
Poor naming choice: LDA can mean "Linear Discriminant Analysis" or "Latent Dirichlet Allocation".
30. Lots of probability
The Dirichlet prior is conjugate to the multinomial distribution, which makes the computational equations work out nicely.
31. Software
Custom Python code written by the author.
gensim - a Python-based library
Mallet - Java-based
R statistical packages
Mahout - a topic modeling module within a larger (Java-based) machine learning system
C code from original developers
... and so on ...
There are many variations of the basic algorithm as this is a current research area.
32. ASCUE 1
What issues do you see?
33. ASCUE 2
What issues do you see?
34. ASCUE 3
What issues do you see?
35. Other visualization techniques
36. Word clouds
37. Machine learning
Machine learning (artificial intelligence)
Big data
Statistics
38. Machine learning
39. Deep learning
feature extraction for machine learning
deep learning
40. Applications
Topic modeling
Sentiment analysis
Recommendation engine