Natural Language Parsing examples

NLP (Natural Language Processing) software systems, including POS (Part of Speech) analysis, can be used to attempt to analyze unstructured text automatically.

To stimulate ideas, here is output from several available NLP and POS systems.

The default settings were used. There are many ways to customize and tweak each system depending on the domain of application.

A leading NLP group is at Stanford University, at the following URL.

Their software is summarized and available from the following URL.

The NLTK (Natural Language Tool Kit) is available at the following URL.

"NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning."

Here are some example sentences.

This article contains a discussion of the history of commercial and academic efforts to automate patent classifications. It also suggests new approaches (adding additional structured language to the text) that (it asserts) lead to statistically meaningful improvements.

Here is the parse tree from NLTK for the first sentence.

Here is the output from the Stanford Parser, a set of "implementations of probabilistic natural language parsers in Java: highly optimized PCFG and dependency parsers, a lexicalized PCFG parser, and a deep learning reranker". For each sentence, the output is a bracketed phrase-structure tree followed by the typed dependencies.

(ROOT (S (NP (DT This) (NN article)) (VP (VBZ contains) (S (NP (NP (DT a) (NN discussion)) (PP (IN of) (NP (NP (DT the) (NN history)) (PP (IN of) (NP (UCP (JJ commercial) (CC and) (JJ academic)) (NNS efforts)))))) (VP (TO to) (VP (VB automate) (NP (NN patent) (NNS classifications)))))) (. .)))

det(article-2, This-1)
nsubj(contains-3, article-2)
root(ROOT-0, contains-3)
det(discussion-5, a-4)
nsubj(automate-15, discussion-5)
det(history-8, the-7)
prep_of(discussion-5, history-8)
amod(efforts-13, commercial-10)
conj_and(commercial-10, academic-12)
amod(efforts-13, academic-12)
prep_of(history-8, efforts-13)
aux(automate-15, to-14)
xcomp(contains-3, automate-15)
nn(classifications-17, patent-16)
dobj(automate-15, classifications-17)

(ROOT (S (NP (PRP It)) (ADVP (RB also)) (VP (VBZ suggests) (NP (NP (JJ new) (NNS approaches)) (PRN (-LRB- -LRB-) (VP (VBG adding) (NP (JJ additional) (JJ structured) (NN language)) (PP (TO to) (NP (DT the) (NN text)))) (-RRB- -RRB-)) (SBAR (WHNP (WDT that)) (S (PRN (-LRB- -LRB-) (S (NP (PRP it)) (VP (VBZ asserts))) (-RRB- -RRB-)) (VP (VBP lead) (PP (TO to) (NP (ADJP (RB statistically) (JJ meaningful)) (NNS improvements)))))))) (. .)))

nsubj(suggests-3, It-1)
advmod(suggests-3, also-2)
root(ROOT-0, suggests-3)
amod(approaches-5, new-4)
dobj(suggests-3, approaches-5)
nsubj(lead-20, approaches-5)
dep(approaches-5, adding-7)
amod(language-10, additional-8)
amod(language-10, structured-9)
dobj(adding-7, language-10)
det(text-13, the-12)
prep_to(adding-7, text-13)
nsubj(asserts-18, it-17)
parataxis(lead-20, asserts-18)
rcmod(approaches-5, lead-20)
advmod(meaningful-23, statistically-22)
amod(improvements-24, meaningful-23)
prep_to(lead-20, improvements-24)
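To work with the typed dependencies programmatically, each `relation(head-index, dependent-index)` entry can be split into its parts. The following is a minimal sketch using only the Python standard library; the `Dependency` class, the `parse_dependencies` function, and the regular expression are illustrative assumptions, not part of the Stanford Parser or NLTK.

```python
import re
from typing import List, NamedTuple


class Dependency(NamedTuple):
    """One typed dependency, e.g. det(article-2, This-1). Hypothetical helper type."""
    relation: str
    head: str
    head_index: int
    dependent: str
    dependent_index: int


# Matches entries of the form relation(head-N, dependent-M).
DEP_PATTERN = re.compile(r"(\w+)\((\S+)-(\d+), (\S+)-(\d+)\)")


def parse_dependencies(text: str) -> List[Dependency]:
    """Extract every typed dependency found in the given text."""
    return [
        Dependency(rel, head, int(head_idx), dep, int(dep_idx))
        for rel, head, head_idx, dep, dep_idx in DEP_PATTERN.findall(text)
    ]


# The first three dependencies of the first example sentence.
sample = (
    "det(article-2, This-1) nsubj(contains-3, article-2) "
    "root(ROOT-0, contains-3)"
)
deps = parse_dependencies(sample)
for d in deps:
    print(f"{d.relation}: {d.head} ({d.head_index}) -> {d.dependent} ({d.dependent_index})")
```

Keeping the token indices as integers makes it straightforward to tell apart repeated words, such as the two occurrences of "of" in the first sentence.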

Here is the output from the Stanford POS Tagger, a "maximum-entropy (CMM) part-of-speech (POS) tagger for English, Arabic, Chinese, French, and German, in Java". Each token is followed by an underscore and its tag.

This_DT article_NN contains_VBZ a_DT discussion_NN of_IN the_DT history_NN of_IN commercial_JJ and_CC academic_JJ efforts_NNS to_TO automate_VB patent_NN classifications_NNS ._. It_PRP also_RB suggests_VBZ new_JJ approaches_NNS -LRB-_-LRB- adding_VBG additional_JJ structured_JJ language_NN to_TO the_DT text_NN -RRB-_-RRB- that_WDT -LRB-_-LRB- it_PRP asserts_VBZ -RRB-_-RRB- lead_NN to_TO statistically_RB meaningful_JJ improvements_NNS ._.
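The `word_TAG` format above is easy to convert into (word, tag) pairs for further processing. The sketch below uses only the Python standard library; the function name `parse_tagged` is a hypothetical helper. It splits on the last underscore so that bracket tokens such as `-LRB-_-LRB-` are handled. NLTK offers a comparable helper, `nltk.tag.str2tuple`, for single tokens.

```python
from collections import Counter
from typing import List, Tuple


def parse_tagged(text: str, sep: str = "_") -> List[Tuple[str, str]]:
    """Split whitespace-separated word_TAG items into (word, tag) pairs."""
    pairs = []
    for item in text.split():
        # rpartition splits on the LAST separator, so the tag is always the final field.
        word, _, tag = item.rpartition(sep)
        pairs.append((word, tag))
    return pairs


# A fragment of the tagger output shown above.
fragment = "It_PRP also_RB suggests_VBZ new_JJ approaches_NNS -LRB-_-LRB- adding_VBG ._."
pairs = parse_tagged(fragment)
print(pairs)

# Count how often each tag occurs in the fragment.
tag_counts = Counter(tag for _, tag in pairs)
print(tag_counts.most_common())
```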

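Here is the part-of-speech output of the NLTK tagger and of the Stanford tagger (called via the NLTK API), both using the default settings, along with the differences between the two taggers. In the list of differences, each token appears twice: first with the NLTK tag, then with the Stanford tag.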

NLTK tagger: (default settings)
0. This : DT = Determiner
1. article : NN = Noun, singular or mass
2. contains : VBZ = Verb, 3rd person singular present
3. a : DT = Determiner
4. discussion : NN = Noun, singular or mass
5. of : IN = Preposition or subordinating conjunction
6. the : DT = Determiner
7. history : NN = Noun, singular or mass
8. of : IN = Preposition or subordinating conjunction
9. commercial : JJ = Adjective
10. and : CC = Coordinating conjunction
11. academic : JJ = Adjective
12. efforts : NNS = Noun, plural
13. to : TO = to
14. automate : VB = Verb, base form
15. patent : NN = Noun, singular or mass
16. classifications. : NNP = Proper noun, singular
17. It : NNP = Proper noun, singular
18. also : RB = Adverb
19. suggests : VBZ = Verb, 3rd person singular present
20. new : JJ = Adjective
21. approaches : NNS = Noun, plural
22. ( : VBP = Verb, non-3rd person singular present
23. adding : VBG = Verb, gerund or present participle
24. additional : JJ = Adjective
25. structured : JJ = Adjective
26. language : NN = Noun, singular or mass
27. to : TO = to
28. the : DT = Determiner
29. text : NN = Noun, singular or mass
30. ) : : = Colon or ellipsis
31. that : IN = Preposition or subordinating conjunction
32. ( : CD = Cardinal number
33. it : PRP = Personal pronoun
34. asserts : VBZ = Verb, 3rd person singular present
35. ) : : = Colon or ellipsis
36. lead : NN = Noun, singular or mass
37. to : TO = to
38. statistically : RB = Adverb
39. meaningful : JJ = Adjective
40. improvements : NNS = Noun, plural
41. . : . = Terminator

Stanford tagger: (default settings)
0. This : DT = Determiner
1. article : NN = Noun, singular or mass
2. contains : VBZ = Verb, 3rd person singular present
3. a : DT = Determiner
4. discussion : NN = Noun, singular or mass
5. of : IN = Preposition or subordinating conjunction
6. the : DT = Determiner
7. history : NN = Noun, singular or mass
8. of : IN = Preposition or subordinating conjunction
9. commercial : JJ = Adjective
10. and : CC = Coordinating conjunction
11. academic : JJ = Adjective
12. efforts : NNS = Noun, plural
13. to : TO = to
14. automate : VB = Verb, base form
15. patent : JJ = Adjective
16. classifications. : NN = Noun, singular or mass
17. It : PRP = Personal pronoun
18. also : RB = Adverb
19. suggests : VBZ = Verb, 3rd person singular present
20. new : JJ = Adjective
21. approaches : NNS = Noun, plural
22. ( : VBP = Verb, non-3rd person singular present
23. adding : VBG = Verb, gerund or present participle
24. additional : JJ = Adjective
25. structured : JJ = Adjective
26. language : NN = Noun, singular or mass
27. to : TO = to
28. the : DT = Determiner
29. text : NN = Noun, singular or mass
30. ) : NN = Noun, singular or mass
31. that : WDT = Wh-determiner
32. ( : VBZ = Verb, 3rd person singular present
33. it : PRP = Personal pronoun
34. asserts : VBZ = Verb, 3rd person singular present
35. ) : JJ = Adjective
36. lead : NN = Noun, singular or mass
37. to : TO = to
38. statistically : RB = Adverb
39. meaningful : JJ = Adjective
40. improvements : NNS = Noun, plural
41. . : . = Terminator

Differences:
15. patent : NN = Noun, singular or mass
15. patent : JJ = Adjective
16. classifications. : NNP = Proper noun, singular
16. classifications. : NN = Noun, singular or mass
17. It : NNP = Proper noun, singular
17. It : PRP = Personal pronoun
30. ) : : = Colon or ellipsis
30. ) : NN = Noun, singular or mass
31. that : IN = Preposition or subordinating conjunction
31. that : WDT = Wh-determiner
32. ( : CD = Cardinal number
32. ( : VBZ = Verb, 3rd person singular present
35. ) : : = Colon or ellipsis
35. ) : JJ = Adjective
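A list of differences like the one above can be produced by walking the two tag sequences in parallel and keeping the positions where the tags disagree. The following is a minimal sketch under the assumption that both taggers were given the same token list; the function name `tag_differences` and the `start` parameter are hypothetical, and this is not necessarily how the listing above was generated.

```python
from typing import List, Tuple

TaggedToken = Tuple[str, str]  # (word, tag)


def tag_differences(
    first: List[TaggedToken], second: List[TaggedToken], start: int = 0
) -> List[Tuple[int, str, str, str]]:
    """Return (index, word, first_tag, second_tag) wherever the two tags differ."""
    if len(first) != len(second):
        raise ValueError("both taggers must be run on the same token list")
    differences = []
    for index, ((word, tag_a), (_, tag_b)) in enumerate(zip(first, second), start=start):
        if tag_a != tag_b:
            differences.append((index, word, tag_a, tag_b))
    return differences


# Tokens 14 to 18 from the two listings above.
nltk_tags = [("automate", "VB"), ("patent", "NN"), ("classifications.", "NNP"),
             ("It", "NNP"), ("also", "RB")]
stanford_tags = [("automate", "VB"), ("patent", "JJ"), ("classifications.", "NN"),
                 ("It", "PRP"), ("also", "RB")]

diffs = tag_differences(nltk_tags, stanford_tags, start=14)
for index, word, nltk_tag, stanford_tag in diffs:
    print(f"{index}. {word} : {nltk_tag} (NLTK) vs {stanford_tag} (Stanford)")
```

Comparing by position rather than by word keeps repeated tokens, such as the two opening parentheses, distinct.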