Topic modeling: Conference proceedings
by RS  admin@robinsnyder.com : 1024 x 640


1. LDA model
The NLP (Natural Language Processing) model LDA (Latent Dirichlet Allocation) is a generative Bayesian statistical model. Here is a graphical model depiction of this model. A plate represents an array; a plate within a plate, a 2D array; and so on. [Figure: LDA plate notation] The w are the words, grayed out because they are observable. All other parameters are latent variables.
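The generative process behind the plate diagram can be sketched in code. This is a minimal illustration, not the talk's implementation; the parameter names (alpha, beta, theta, phi) follow the usual LDA plate notation, and the sizes are arbitrary.

```python
import random

def generate_corpus(n_docs=5, doc_len=20, n_topics=3, vocab_size=10,
                    alpha=0.1, beta=0.1, seed=42):
    """Sketch of the LDA generative process:
    for each topic k, draw a word distribution phi_k ~ Dirichlet(beta);
    for each document d, draw a topic mixture theta_d ~ Dirichlet(alpha);
    for each word slot, draw a topic z from theta_d, then a word w from phi_z."""
    rng = random.Random(seed)

    def dirichlet(param, size):
        # Dirichlet draw via normalized Gamma samples (standard construction)
        draws = [rng.gammavariate(param, 1.0) for _ in range(size)]
        total = sum(draws)
        return [x / total for x in draws]

    def categorical(probs):
        # draw an index with the given probabilities
        r, acc = rng.random(), 0.0
        for i, p in enumerate(probs):
            acc += p
            if r < acc:
                return i
        return len(probs) - 1

    phi = [dirichlet(beta, vocab_size) for _ in range(n_topics)]
    docs = []
    for _ in range(n_docs):
        theta = dirichlet(alpha, n_topics)
        docs.append([categorical(phi[categorical(theta)])
                     for _ in range(doc_len)])
    return docs

docs = generate_corpus()
```

Topic modeling inverts this process: only the words (the w's) are observed, and inference recovers plausible theta and phi.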

2. Topic modeling

3. Bayesian inference
Bayesian inference has many subfields. Goal: determine the topics in the documents, each topic being a distribution of word probabilities.

4. Applications

5. Documents and words

6. LDA model

7. ASCUE data

8. PostScript
Adobe PostScript was designed as a page description language for exact production of documents. PostScript "paints" graphics and text on a page. Text is just graphics.

Steve Jobs made PostScript popular with the introduction of the Apple LaserWriter in 1985.

9. LaserWriter
The LaserWriter is a laser printer with built-in PostScript interpreter sold by Apple Computer, Inc. from 1985 to 1988. It was one of the first laser printers available to the mass market. In combination with WYSIWYG publishing software like PageMaker, that operated on top of the graphical user interface of Macintosh computers, the LaserWriter was a key component at the beginning of the desktop publishing revolution. (Wikipedia)

In 1985 the LaserWriter cost almost $7,000 (over $16,000 in equivalent 2019 dollars). It had 1.5MB of memory and printed at 300 dpi.

10. PDF documents
Adobe introduced PDF (Portable Document Format) documents using a simplified PostScript (code unrolled via abstract interpretation) with added meta-features for links, metadata, etc.

PDF became a popular document format.

11. Text extraction
In data science applications involving text, it may be necessary to extract text from PDF documents. Assume the documents are not encrypted, or that the encryption keys are known.

How can one extract text from PDF documents?

12. Text as graphics
In a Word document, the text is grouped together in meaningful groups.

In PDF, based on PostScript, text is just graphics painted on a page.

Thus, characters can be extracted in small groups, but those groups need to be stitched together to make usable text.

13. Text extraction
There are some programs that attempt to extract text from PDF.

For this purpose (and for conversion to SVG), I have used Ghostscript, an open-source implementation of Adobe PostScript, to do a reasonable extraction of text, which then needs additional work to be useful.
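Ghostscript's `txtwrite` output device does this kind of extraction from the command line. The sketch below wraps that invocation in Python; the function names are mine, and the file paths are placeholders.

```python
import shutil
import subprocess

def gs_extract_text_cmd(pdf_path, txt_path):
    """Build the Ghostscript command line for the 'txtwrite' output device,
    which writes the text content of a PDF to a plain-text file."""
    return ["gs", "-dBATCH", "-dNOPAUSE", "-sDEVICE=txtwrite",
            f"-sOutputFile={txt_path}", pdf_path]

def extract_text(pdf_path, txt_path):
    """Run Ghostscript if it is installed; return True on success."""
    if shutil.which("gs") is None:
        return False  # Ghostscript not available on this machine
    result = subprocess.run(gs_extract_text_cmd(pdf_path, txt_path),
                            capture_output=True)
    return result.returncode == 0

cmd = gs_extract_text_cmd("paper.pdf", "paper.txt")
```

The extracted text still needs the cleanup described above (rejoining character groups, removing layout artifacts) before it is suitable for topic modeling.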

14. Mechanics
Goal of conference proceedings topic modeling.

15. Computation
For a simple run (100 iterations, convergence assumed): about 30 minutes on a fast 16 GB i7 laptop.

For a cluster of 4 machines, 4 variations can be tried at a time.

Future: a Parallella cluster, with a multi-core algorithm on each machine and each machine running a different variation.

16. ASCUE 1
[Chart: ASCUE topic modeling] 2,505 documents/pages, 20,745 words

What issues do you see?

17. ASCUE 2
[Chart: ASCUE topic modeling] 2,505 documents/pages, 20,704 words

What issues do you see?

18. ASCUE 3
[Chart: ASCUE topic modeling] 2,505 documents/pages, 13,545 words

What issues do you see?

19. Statistics
What are the two main branches of statistics? Both solve the same types of problems; in some cases, one approach works better, or is easier, than the other.

Analogy: wave-particle duality in physics

20. Bayes rule
Bayes rule is based on flipping a conditional probability. [Figure: Bayes rule]
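In standard notation, Bayes rule flips P(B given A) into P(A given B), with the denominator expanded by the law of total probability:

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)},
\qquad
P(B) = P(B \mid A)\, P(A) + P(B \mid \neg A)\, P(\neg A).
```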

21. Diagnosis
Diagnosis: you test positive for cancer. What is the probability that you have cancer? [Figure: Bayes rule sensitivity analysis]
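The diagnosis question is a direct application of Bayes rule. The numbers below are illustrative assumptions, not figures from the talk: a 1% prevalence, 90% sensitivity, and 95% specificity give a surprisingly low posterior.

```python
def posterior_positive(prevalence, sensitivity, specificity):
    """P(disease | positive test) via Bayes rule.
    sensitivity = P(positive | disease); specificity = P(negative | no disease)."""
    p_pos_given_d = sensitivity
    p_pos_given_not_d = 1.0 - specificity  # false-positive rate
    # total probability of a positive test
    p_pos = p_pos_given_d * prevalence + p_pos_given_not_d * (1.0 - prevalence)
    return p_pos_given_d * prevalence / p_pos

# Assumed numbers: 1% prevalence, 90% sensitivity, 95% specificity
print(round(posterior_positive(0.01, 0.90, 0.95), 3))  # 0.154
```

Even with a fairly accurate test, most positives come from the large healthy population, so the posterior probability of cancer is only about 15%.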

22. Other domains

23. Conditional probability
We do not yet know whether "X" (patient) has "Y" (measles). We do not yet know whether document "X" contains topic "Y".

24. Generative technique
Given a model, Bayesian networks/graphical models allow us to go backwards and infer hidden (latent) topics/distributions from the observable data. [Figure: LDA model]

25. Math: Many extensions

26. Generative techniques

27. Gibbs Sampling
[Figure: Gibbs sampling]
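A collapsed Gibbs sampler is the classic inference method for LDA: each word token's topic assignment is resampled in turn from its conditional distribution given all other assignments. This is a minimal sketch with my own variable names, not the implementation used for the ASCUE runs.

```python
import random

def gibbs_lda(docs, n_topics, vocab_size, alpha=0.1, beta=0.1,
              iters=50, seed=0):
    """Minimal collapsed Gibbs sampler for LDA over docs of word ids."""
    rng = random.Random(seed)
    ndk = [[0] * n_topics for _ in docs]               # doc-topic counts
    nkw = [[0] * vocab_size for _ in range(n_topics)]  # topic-word counts
    nk = [0] * n_topics                                # topic totals
    z = []
    for d, doc in enumerate(docs):                     # random initialization
        zd = []
        for w in doc:
            k = rng.randrange(n_topics)
            zd.append(k)
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
        z.append(zd)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                            # remove this token
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # conditional weight of each topic for this token
                weights = [(ndk[d][j] + alpha) * (nkw[j][w] + beta)
                           / (nk[j] + vocab_size * beta)
                           for j in range(n_topics)]
                r, acc, k = rng.random() * sum(weights), 0.0, n_topics - 1
                for j, wt in enumerate(weights):
                    acc += wt
                    if r < acc:
                        k = j
                        break
                z[d][i] = k                            # add it back
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    return z, ndk, nkw

z, ndk, nkw = gibbs_lda([[0, 0, 1, 1], [2, 2, 3, 3]],
                        n_topics=2, vocab_size=4)
```

After enough sweeps, the counts ndk and nkw give smoothed estimates of the per-document topic mixtures and per-topic word distributions.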

28. Bag of words
In a bag-of-words model, each document is reduced to its words and their counts; word order is ignored. Words are "exchangeable", which makes the theory and the math work out in nice ways.

Most common method: LDA
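The bag-of-words reduction is one line with the standard library; the tokenization here (lowercase, split on whitespace) is a deliberate oversimplification.

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase, split on whitespace, and count occurrences:
    word order is discarded, multiplicity is kept."""
    return Counter(text.lower().split())

bag = bag_of_words("To be or not to be")
print(bag["to"], bag["be"], bag["or"])  # 2 2 1
```

Note that "to be or not to be" and "be to be to or not" produce the same bag, which is exactly the exchangeability that LDA relies on.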

29. LDA
[Figure: LDA model] LDA: Latent Dirichlet Allocation. Poor naming choice: LDA can mean "Linear Discriminant Analysis" or "Latent Dirichlet Allocation".

30. Lots of probability
The nice "math" makes the computational equations work out nicely.

31. Software
There are many variations of the basic algorithm as this is a current research area.

32. ASCUE 1
[Chart: ASCUE topic modeling]

What issues do you see?

33. ASCUE 2
[Chart: ASCUE topic modeling]

What issues do you see?

34. ASCUE 3
[Chart: ASCUE topic modeling]

What issues do you see?

35. Other visualization techniques

36. Word clouds

37. Machine learning

38. Machine learning

39. Deep learning

40. Applications

41. End of page
