Internet searches: Estimating probabilities
1. Internet searches: Estimating probabilities
This page looks at one way to use advanced Internet searches to estimate certain probabilities.
The notes date from the presidential election of 2000, when Bill Clinton was President, Al Gore was the Democratic nominee, and George W. Bush was the Republican nominee.
2. Search companies
At that time, AltaVista was a popular search engine and Google was an up-and-coming search engine.
One problem with Internet searches is that a given search string may return too many hits. Another is that it may be hard to judge how valid the information found by the search is.
Many search engines offer advanced search options that support logical operations, resulting in fewer hits that must be looked at.
Many offer shortcuts for advanced logical operations.
3. AltaVista
Each search engine has its own syntax for advanced queries.
The results that follow use the AltaVista search engine, at http://www.altavista.com, because Google did not seem to handle the logical Or operation correctly (or there was something that we could not get right).
4. Clinton-Gore
The searches that follow were run in March 2000, an election year in which Bill Clinton was President and Vice President Al Gore was on his way to winning the nomination to run for President.
5. Logical operations
Logical Not is "!".
Logical And is "&".
Logical Or is "|".
The plus sign "+" includes pages with the next operand.
The minus sign "-" excludes pages with the next operand.
The Near operator is "~". For example, "Clinton ~ Gore" would find pages where the word "Clinton" is within 10 words of the word "Gore".
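As a rough illustration of what a Near test does, here is a minimal Python sketch. The function name `near`, the way words are split, and the case-insensitive matching are assumptions made for this example. They are not a description of how AltaVista actually counted words.

```python
import re

def near(text, a, b, window=10):
    """Return True if word `a` occurs within `window` words of word `b`.

    Hypothetical helper: words are runs of letters/digits, compared
    case-insensitively. A real search engine may tokenize differently.
    """
    words = re.findall(r"\w+", text.lower())
    pos_a = [i for i, w in enumerate(words) if w == a.lower()]
    pos_b = [i for i, w in enumerate(words) if w == b.lower()]
    # True if any occurrence of a is at most `window` positions from any b.
    return any(abs(i - j) <= window for i in pos_a for j in pos_b)

close_text = "President Clinton met with Vice President Gore today"
far_text = "Clinton " + "filler " * 20 + "Gore"

print(near(close_text, "Clinton", "Gore"))
print(near(far_text, "Clinton", "Gore"))
```

In the first text the two names are 5 words apart, so the test succeeds. In the second they are 21 words apart, so it fails.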
6. Example search
We will search for Clinton and/or Gore. Remember, though, that we will also find hits that do not refer to President Clinton or Vice President Gore.
This problem would become worse when Bush was elected President, as there are many hits for the plant sense of "bush".
7. Advanced query options
Clinton & Gore would find pages with both Clinton and Gore.
Gore & ! Clinton would find pages with Gore, but not Clinton.
Clinton & ! Gore would find pages with Clinton, but not Gore.
Clinton | Gore would find pages with Clinton or Gore or both.
Clinton would find pages with Clinton.
+Clinton would find pages with Clinton.
+Clinton+Gore would find pages with both Clinton and Gore.
+Clinton-Gore would find pages with Clinton but not Gore.
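The queries above can be read as set operations on the pages that contain each word. The following Python sketch shows the idea on a tiny made-up index. The `pages` data and the `having` helper are hypothetical and exist only for this illustration.

```python
# Toy index (made-up data): page id -> set of words on that page.
pages = {
    1: {"clinton", "gore", "election"},
    2: {"clinton", "president"},
    3: {"gore", "senate"},
    4: {"bush", "garden"},
}

def having(word):
    """Return the set of page ids whose text contains `word`."""
    return {pid for pid, words in pages.items() if word in words}

clinton = having("clinton")
gore = having("gore")

both = clinton & gore           # Clinton & Gore,   +Clinton+Gore
only_clinton = clinton - gore   # Clinton & ! Gore, +Clinton-Gore
only_gore = gore - clinton      # Gore & ! Clinton
either = clinton | gore         # Clinton | Gore

print(both, only_clinton, only_gore, either)
```

Python's set operators `&` and `|` happen to use the same symbols as the advanced query syntax, and set difference `-` plays the role of "& !".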
8. Search hits on 2000-03-13
     hits  advanced search
---------  ----------------
1,370,179  Clinton & ! Gore
  400,246  Gore & ! Clinton
  118,899  Clinton & Gore
---------  ----------------
1,854,859  Clinton | Gore

  118,899  Clinton & Gore
1,370,179  Clinton & ! Gore
---------  ----------------
1,482,243  Clinton

  118,899  Gore & Clinton
  400,246  Gore & ! Clinton
---------  ----------------
  511,253  Gore
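Notice that the numbers do not quite add up, but are very close. For example, 118,899 + 1,370,179 = 1,489,078, while the search for Clinton alone reported 1,482,243 hits.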
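The size of each mismatch can be checked with a few lines of Python. This is only a sketch of the arithmetic, using the hit counts from the table above. The names `hits`, `checks`, and `discrepancies` are chosen for this example.

```python
# Hit counts reported on 2000-03-13 (from the table above).
hits = {
    "Clinton & ! Gore": 1_370_179,
    "Gore & ! Clinton": 400_246,
    "Clinton & Gore": 118_899,
    "Clinton | Gore": 1_854_859,
    "Clinton": 1_482_243,
    "Gore": 511_253,
}

# Each total should, in principle, equal the sum of its disjoint parts.
checks = [
    ("Clinton | Gore", ["Clinton & ! Gore", "Gore & ! Clinton", "Clinton & Gore"]),
    ("Clinton", ["Clinton & Gore", "Clinton & ! Gore"]),
    ("Gore", ["Clinton & Gore", "Gore & ! Clinton"]),
]

def discrepancies():
    """Return (total query, sum of parts, reported total, difference) rows."""
    rows = []
    for total, parts in checks:
        summed = sum(hits[p] for p in parts)
        rows.append((total, summed, hits[total], summed - hits[total]))
    return rows

for total, summed, reported, diff in discrepancies():
    print(f"{total}: parts sum to {summed:,}, reported {reported:,}, "
          f"difference {diff:,}")
```

In all three cases the parts add up to slightly more than the reported total: by 34,465 for Clinton | Gore, by 6,835 for Clinton, and by 7,892 for Gore.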
9. Probability
Here is some probability for those interested. Inside P( ), the bar "|" means "given" and is not the logical Or of the search syntax.
P(Clinton | Gore) = 118,899 / 511,253 ≈ 0.233.
P(Gore | Clinton) = 118,899 / 1,482,243 ≈ 0.0802.
Thus, it is about 0.233/0.0802, or about 2.9, times more likely that Clinton is found on a page with Gore on it than that Gore is found on a page with Clinton on it.
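The same estimates can be reproduced in Python. This is a minimal sketch that treats the hit counts from the 2000-03-13 table as page counts. The variable names are chosen for this example.

```python
# Hit counts from the 2000-03-13 searches.
both = 118_899       # Clinton & Gore
clinton = 1_482_243  # Clinton
gore = 511_253       # Gore

# Conditional probability estimated as a ratio of counts.
p_clinton_given_gore = both / gore
p_gore_given_clinton = both / clinton
ratio = p_clinton_given_gore / p_gore_given_clinton

print(f"P(Clinton | Gore) ~ {p_clinton_given_gore:.3f}")
print(f"P(Gore | Clinton) ~ {p_gore_given_clinton:.4f}")
print(f"ratio ~ {ratio:.1f}")
```

Because the joint count appears in both estimates, it cancels in the ratio, which is simply the number of Clinton pages divided by the number of Gore pages.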
Suppose that Gore wins the election and becomes President. How do you expect that these search results would change?
10. Narrowing the search
     hits  advanced search
---------  ----------------
1,449,241  Clinton
  581,603  Clinton ~ President
  356,430  Clinton ~ Bill
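One way to read this table is as the share of Clinton pages that survive each narrower query. The short Python sketch below computes those shares from the counts above. The dictionary name `narrowed` is an assumption for this example.

```python
# Hit counts from the narrowing table above.
clinton_total = 1_449_241
narrowed = {
    "Clinton ~ President": 581_603,
    "Clinton ~ Bill": 356_430,
}

# Fraction of all Clinton hits that each Near query keeps.
shares = {query: count / clinton_total for query, count in narrowed.items()}

for query, share in shares.items():
    print(f"{query}: {share:.1%} of Clinton hits")
```

By these counts, roughly 40% of the Clinton pages have "President" nearby and roughly 25% have "Bill" nearby.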