Python: NLTK and bi-chars
by RS  admin@robinsnyder.com : 1024 x 640


1. Python: NLTK and bi-chars
This page looks at frequency distributions of the words in a text, and then at character-level bi-grams (bi-chars).

2. Declaration of Independence
The hard-coded text used is part of the Declaration of Independence.
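As a minimal sketch, the hard-coded text might be stored as a list of lines. The name `text_lines`, the exact excerpt, and the line breaks are assumptions made here for illustration.

```python
# Opening paragraph of the Declaration of Independence, hard-coded as a
# list of lines. The excerpt and its line breaks are assumptions.
text_lines = [
    "When in the Course of human events, it becomes necessary for one people",
    "to dissolve the political bands which have connected them with another,",
    "and to assume among the powers of the earth, the separate and equal",
    "station to which the Laws of Nature and of Nature's God entitle them,",
    "a decent respect to the opinions of mankind requires that they should",
    "declare the causes which impel them to the separation.",
]

print(len(text_lines), "lines")
```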


3. Join the text
The text is stored as a list; joining the list elements yields the text as a single string.
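A short sketch of the join step, assuming the list elements are lines without trailing newlines and that a single space is a suitable separator. The names `text_lines` and `text` are illustrative.

```python
# Two sample lines stand in for the full hard-coded list (assumption).
text_lines = [
    "When in the Course of human events,",
    "it becomes necessary for one people",
]

# str.join places the separator between consecutive list elements.
text = " ".join(text_lines)
print(text)
```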


4. Word list
Regular expressions can be used to obtain words in the text.
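One way this might look, using `re.findall` from the standard library. The pattern `[a-z']+` and the lowercasing step are assumptions; the page's own pattern may differ.

```python
import re

text = "When in the Course of human events, it becomes necessary for one people"

# Lowercase first so "When" and "when" count as the same word, then pull
# out runs of letters and apostrophes; punctuation and spaces are skipped.
words = re.findall(r"[a-z']+", text.lower())
print(words)
```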


5. Frequencies
The number of times each word occurs can be obtained from the nltk package as follows.
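A hedged sketch of the counting step with `nltk.FreqDist`, which is a subclass of `collections.Counter`. The fallback to `Counter` when nltk is not installed is an addition made here so the example runs either way; the sample word list is also illustrative.

```python
try:
    from nltk import FreqDist  # FreqDist subclasses collections.Counter
except ImportError:
    # Fallback (assumption): Counter provides the same counting behaviour.
    from collections import Counter as FreqDist

words = ["the", "laws", "of", "nature", "and", "of", "nature's", "god"]

# Build the distribution: each distinct word is a key, its count the value.
dist = FreqDist(words)
print(dist["of"])
print(dist["god"])
```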

Note that this counting is not hard to program explicitly, if needed.
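For comparison, the explicit version can be sketched with a plain dictionary. The names `words` and `counts` are illustrative.

```python
words = ["the", "laws", "of", "nature", "and", "of", "nature's", "god"]

counts = {}
for word in words:
    # dict.get returns 0 for a word that has not been seen yet.
    counts[word] = counts.get(word, 0) + 1

print(counts)
```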

The resulting dictionary maps each word (the key) to its count (the value).

6. Print the distribution
The distribution can then be printed.
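A possible way to print the distribution, most frequent word first. `collections.Counter` stands in for the nltk distribution here (an assumption made to keep the example dependency-free), and the output layout is illustrative.

```python
from collections import Counter

words = ["the", "laws", "of", "nature", "and", "of", "nature", "of"]
dist = Counter(words)

# Sort by descending count, breaking ties alphabetically by word.
ordered = sorted(dist.items(), key=lambda item: (-item[1], item[0]))
lines = [f"{count:3d} {word}" for word, count in ordered]
print("\n".join(lines))
```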


7. Program
Here is the Python code [#6]


8. Output
Here is the output of the Python code.


9. Bi-grams
A bi-gram is a pair of adjacent items. Bi-grams are often formed at the word level; here they are formed at the character level (bi-chars).

Claude Shannon applied this approach by hand, since computers like those we have today did not exist at the time.
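Character-level bi-grams can be collected by sliding a window of width two over the text. The following is a minimal sketch with an illustrative sample string; `Counter` is used for the counts.

```python
from collections import Counter

text = "the theory"

# Every pair of adjacent characters, including pairs that contain a space.
bichars = [text[i:i + 2] for i in range(len(text) - 1)]
dist = Counter(bichars)

print(bichars)
print(dist["th"])
```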

10. Program
Here is the Python code [#7]


11. Output
Here is the output of the Python code.


12. Observations
Notice how the uni-char simulation generates random-looking words, since each character is chosen by its overall frequency alone.

Notice how the bi-char simulation generates more English-looking words, since it chooses each next character according to how likely that character is to follow the current one.
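The difference between the two simulations can be sketched as follows. This is not the page's program: the function names, the sample text, the start character, and the dead-end fallback are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict

text = ("when in the course of human events it becomes necessary for one "
        "people to dissolve the political bands which have connected them "
        "with another")

# Uni-char model: how often each character occurs overall.
uni = Counter(text)

# Bi-char model: for each character, how often each character follows it.
follow = defaultdict(Counter)
for current, nxt in zip(text, text[1:]):
    follow[current][nxt] += 1


def simulate_uni(n, rng):
    """Draw n characters independently, weighted by overall frequency."""
    chars = list(uni)
    return "".join(rng.choices(chars, weights=[uni[c] for c in chars], k=n))


def simulate_bi(n, rng, start="t"):
    """Draw each character weighted by how often it follows the previous one."""
    out = [start]
    while len(out) < n:
        options = follow[out[-1]]
        if not options:  # dead end: fall back to overall frequencies
            options = uni
        chars = list(options)
        out.append(rng.choices(chars, weights=[options[c] for c in chars])[0])
    return "".join(out)


rng = random.Random(0)  # seeded so repeated runs give the same text
print(simulate_uni(60, rng))
print(simulate_bi(60, rng))
```

Every adjacent pair in the bi-char output also occurs somewhere in the source text, which is why it tends to look more like English than the uni-char output.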

13. Assumptions
Some assumptions were made to simplify the code.
