Visualizing Simpson's Paradox
1. Visualizing Simpson's Paradox
This page gives an overview of Simpson's Paradox in terms of the following.
Finding minimal instances of Simpson's Paradox (using Lua)
Finding the asymptotic trend of Simpson's Paradox (using cluster computing)
Visualizing Simpson's Paradox (using explanatory graphics techniques)
2. References
[182] Snyder, R. (2016). Visualizing Simpson's Statistical Paradox. Annual Meeting of the Southeastern Chapter of the Institute for Operations Research and the Management Sciences.
[179] Snyder, R. (2015). Investigating approximate statistical properties of scaled problem instances of Simpson's Paradox using big data cluster computing techniques. 45th Annual Meeting of the Southeastern Region of the Decision Sciences Institute.
[178] Snyder, R. (2015). Using node.js to manage distributed computation of Python programs using inexpensive computing clusters. The Journal of Computing in Small Colleges. Vol. 31. No. 3.
[177] Snyder, R. (2015). Developing a simple Lua program to find minimal instances of Simpson's Paradox. Annual Meeting of the Southeastern Chapter of the Institute for Operations Research and the Management Sciences.
3. Minimal instances
The Lua code to find minimal instances in a general manner (omitted) consists of many nested loops, checking every possibility in a breadth-first manner.
In simple form, if n balls are divided between red and blue and then distributed among 4 jars, each of which holds at least one ball of each color, then there are 8 degrees of freedom (one count per color per jar), which requires 8 simple nested loops (i.e., an outer loop and 7 inner loops).
This makes the problem intractable in general but, with the decreasing cost and increasing power of small computers (e.g., credit-card-size computers), it becomes easier to increase the value of n, the total number of balls used, and therefore get a better approximate idea of the asymptotic statistical behavior of Simpson's Paradox.
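The exhaustive search described above can be sketched in Python. This is not the author's Lua program (which is omitted), only a minimal illustration under stated assumptions: the names `compositions`, `is_paradox`, and `find_instances` are hypothetical, jars 1 and 2 are taken to form the first pair and jars 3 and 4 the second, and only one direction of the reversal is tested (mirror images appear as separate arrangements). For a fixed total n the eighth count is determined by the other seven, so the sketch enumerates the arrangements with `itertools.combinations` instead of writing the nested loops by hand.

```python
from itertools import combinations


def compositions(total, parts):
    """Yield every way to write `total` as an ordered sum of `parts` positive integers."""
    # Stars and bars: pick parts-1 cut points strictly between 0 and total.
    for cuts in combinations(range(1, total), parts - 1):
        bounds = (0,) + cuts + (total,)
        yield tuple(bounds[i + 1] - bounds[i] for i in range(parts))


def is_paradox(counts):
    """counts = (b1, r1, b2, r2, b3, r3, b4, r4): blue and red counts for jars 1..4."""
    b1, r1, b2, r2, b3, r3, b4, r4 = counts
    # b1/(b1+r1) > b2/(b2+r2) is equivalent to b1*r2 > b2*r1 (no floating point needed).
    first_pair = b1 * r2 > b2 * r1
    second_pair = b3 * r4 > b4 * r3
    # Combined jars 1+3 versus jars 2+4: the comparison reverses.
    combined = (b1 + b3) * (r2 + r4) < (b2 + b4) * (r1 + r3)
    return first_pair and second_pair and combined


def find_instances(n):
    """Return every arrangement of n balls (at least one of each color per jar) that is an instance."""
    return [counts for counts in compositions(n, 8) if is_paradox(counts)]
```

Calling `find_instances(n)` for increasing n gives the arrangements that show the paradox for each total; the number of arrangements grows quickly with n, which is why larger totals call for more computing power.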
4. Minimal instance
One:
letters: 2/(2+1)=0.667*
digits: 3/(3+2)=0.600
Two:
letters: 2/(2+5)=0.286*
digits: 1/(1+3)=0.250
Both:
letters: 4/(4+6)=0.400
digits: 4/(4+5)=0.444*
The minimal instance requires 19 balls (8 blue, 11 red). Each fraction above is blue/(blue+red), and an asterisk marks the larger fraction in each pair.
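The arithmetic in the listing can be checked with a short Python sketch. The dictionary layout and the names `tubes`, `blue_share`, and `report` are assumptions made for illustration; the counts themselves are taken from the listing, with the "Both" row computed as the sum of "One" and "Two".

```python
from fractions import Fraction

# (blue, red) counts taken from the listing above.
tubes = {
    "One": {"letters": (2, 1), "digits": (3, 2)},
    "Two": {"letters": (2, 5), "digits": (1, 3)},
}
# "Both" is the ball-by-ball sum of "One" and "Two".
tubes["Both"] = {
    side: tuple(tubes["One"][side][i] + tubes["Two"][side][i] for i in range(2))
    for side in ("letters", "digits")
}


def blue_share(blue, red):
    """Exact fraction of blue balls in one tube."""
    return Fraction(blue, blue + red)


def report(tubes):
    """Return the listing lines, starring the side with the larger blue share."""
    lines = []
    for pair, sides in tubes.items():
        best = max(sides, key=lambda side: blue_share(*sides[side]))
        lines.append(f"{pair}:")
        for side, (blue, red) in sides.items():
            star = "*" if side == best else ""
            share = float(blue_share(blue, red))
            lines.append(f"{side}: {blue}/({blue}+{red})={share:.3f}{star}")
    return lines


print("\n".join(report(tubes)))
```

The star moves from "letters" in each separate pair to "digits" in the combined pair, which is the reversal that defines the paradox.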
5. Asymptotic behavior
To determine the asymptotic behavior of instances of Simpson's Paradox, cluster computing techniques were used.
6. Big data clusters - hardware
Cluster computing was used to run tests to find the asymptotic behavior of instances of Simpson's Paradox.
Here is a cluster of 6 Raspberry Pi credit-card size computers.
Here are six SSH windows, one to each of the Raspberry Pi computers.
The node.js window (on the right) controls the work given to each Pi, when requested.
Note: Bigger and faster computers were used in the cluster for Simpson's Paradox computations.
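The idea of handing out work only when a machine asks for it can be sketched in a single Python process. This is not the node.js and SSH setup described above: threads stand in for the cluster machines, a `queue.Queue` stands in for the coordinator, and the names `arrangements_to_check` and `run_cluster` are hypothetical. As a placeholder job, each worker counts how many arrangements exist for a given number of balls, assuming 8 positive counts per arrangement.

```python
import math
import queue
import threading


def arrangements_to_check(n):
    """Placeholder job: number of ways to split n balls into 8 positive counts."""
    return math.comb(n - 1, 7)


def run_cluster(ball_counts, workers=6):
    """Let `workers` threads pull ball counts from a shared queue until it is empty."""
    tasks = queue.Queue()
    for n in ball_counts:
        tasks.put(n)
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                n = tasks.get_nowait()  # request the next job
            except queue.Empty:
                return  # no work left, so this worker stops
            value = arrangements_to_check(n)
            with lock:
                results[n] = value

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Because each worker asks for its next job only after finishing the previous one, faster machines naturally take on more of the work, which suits a cluster of mixed hardware.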
7. Instances checked
X is the number of balls
Y is the number of arrangements checked for each number of balls (less than 5,000,000,000)
8. Instances found
X is the number of balls
Y is the number of instances found (less than 30,000,000)
9. Computation time
X is the number of balls
Y is the computation time for each number of balls: up to 600.00 seconds (10 minutes), but usually a lot less.
Note that some computers in the cluster were faster than others. There were more than just the six Raspberry Pi computers used.
10. Instances by percent
X is the number of balls
Y is the percent of instances found
Minimal instance has 19 balls.
11. Explanatory graphics
Explanatory graphics attempts to present graphics in a certain way to clearly "explain" what it is that one wants someone else to see.
Simpson's Paradox can be difficult to comprehend.
12. Comparison
Data visualization: finding what exactly is important in a legal case.
Information visualization: presenting data in a way to make a specific point to the jury.
13. Legal analogy
In a legal analogy, custom scripts and code can be used to explore ideas using data visualization. Once one has found the important ideas, one then uses information visualization to effectively convey that idea to the jury.
14. Larry Page
Larry Page, co-founder of Google, came from a user interface point of view and has a saying that, in terms of the user interface, the user is always right. Thus, Google attempts to create the user interface such that the user can easily do what the user wants to do - all to Google's advantage, of course.
15. Page rank algorithm
The name of the Google PageRank algorithm is a play on the name of Larry Page.
The PageRank algorithm can be expressed in linear algebra form and, in that sense, was a very good use of linear algebra to solve a practical problem.
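In its usual textbook linear algebra form, the rank vector r satisfies r = (1 - d)/n + d M r, where M is the column-stochastic link matrix, n is the number of pages, and d is a damping factor. A minimal sketch of solving this by repeated multiplication (power iteration) follows. It is an illustration, not Google's implementation: the function name `pagerank`, the damping value 0.85, and the fixed iteration count are assumptions, and every linked page is assumed to appear as a key in `links`.

```python
def pagerank(links, damping=0.85, iterations=100):
    """links maps each page to the list of pages it links to."""
    pages = sorted(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no links spreads its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```

Each pass of the outer loop is one matrix-vector multiplication written out with dictionaries, and the ranks always sum to 1.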
16. MapReduce
The Google MapReduce algorithm provides a way to efficiently implement PageRank across huge collections of computers to provide fast response time.
Note: The Google MapReduce is loosely based on the functional concepts of map and reduce.
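The functional concepts of map and reduce can be illustrated on a single machine with Python's built-in `map` and `functools.reduce`. This sketch is not Google's MapReduce; it only shows the two ideas, using the ball counts of the minimal instance as data. The record layout and the names `to_pair` and `combine` are assumptions for illustration.

```python
from functools import reduce

# (pair, side, blue, red) records from the minimal instance.
records = [
    ("One", "letters", 2, 1),
    ("One", "digits", 3, 2),
    ("Two", "letters", 2, 5),
    ("Two", "digits", 1, 3),
]


def to_pair(record):
    """Map step: keep only the key (side) and the value (blue, red)."""
    _, side, blue, red = record
    return side, (blue, red)


def combine(totals, pair):
    """Reduce step: add one (blue, red) value into the running totals for its side."""
    side, (blue, red) = pair
    old_blue, old_red = totals.get(side, (0, 0))
    return {**totals, side: (old_blue + blue, old_red + red)}


both = reduce(combine, map(to_pair, records), {})
print(both)  # {'letters': (4, 6), 'digits': (4, 5)}
```

The result matches the "Both" row of the minimal instance: the map step turns each record into a key and a value, and the reduce step sums the values that share a key.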
17. Grok the interface
Can you "grok" that interface?
The term is from the 1961 science fiction novel Stranger in a Strange Land by Robert Heinlein.
Google attempts to create the user interface such that the user can easily do what the user wants to do. The user can "grok" the interface such that they intuitively understand it and how to use it.
How close does Google get to this goal?
The following is a step-by-step progression of graphics and of the decisions made to make the graphics and visualization clearer.
The graphics were developed and generated using Python and the PIL (Python Imaging Library).
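A minimal sketch of drawing one pair of tubes with PIL follows, assuming the Pillow package is installed. It is not the author's drawing code: the function name `draw_pair`, the sizes, and the layout (blue balls stacked below red balls) are assumptions made for illustration.

```python
from PIL import Image, ImageDraw

BALL = 20    # ball diameter in pixels (assumed)
MARGIN = 10  # space around and between tubes (assumed)


def draw_pair(left, right):
    """left and right are (blue, red) counts for the two tubes of one pair."""
    tallest = max(sum(left), sum(right))
    width = 2 * BALL + 3 * MARGIN
    height = tallest * BALL + 2 * MARGIN
    image = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(image)
    for column, (blue, red) in enumerate((left, right)):
        x0 = MARGIN + column * (BALL + MARGIN)
        # Tube outline, drawn tall enough for the fuller tube of the pair.
        draw.rectangle([x0, MARGIN, x0 + BALL, height - MARGIN], outline="black")
        colors = ["blue"] * blue + ["red"] * red
        for row, color in enumerate(colors):
            bottom = height - MARGIN - row * BALL  # stack balls from the bottom up
            draw.ellipse([x0 + 2, bottom - BALL + 2, x0 + BALL - 2, bottom - 2], fill=color)
    return image


image = draw_pair((2, 1), (3, 2))
```

The returned image can be written to disk with `image.save("pair.png")`, where the file name is a placeholder; labels and percentages could be added with `ImageDraw.Draw.text`.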
18. Empty pairs of tubes
The kinesthetic example used by the author uses actual clear tubes containing red and blue balls. For visualization, three pairs of tubes are needed, called here sets "A", "B", and "A+B". Each pair consists of a "left" and a "right" tube.
19. Put balls into each tube
The next step is to place the red and blue balls into the tubes, using a minimal example of Simpson's Paradox, as follows.
Note that the left and right tubes of pair "A+B" have the same number and color of balls as the corresponding balls in the left and right tubes of pair "A" and pair "B" combined.
The next step is for the reader to decide, for each pair of tubes, which tube gives the best probability of picking a blue (or red) ball, where the balls in each tube are put into a jar (or bag) and one must pick a ball out of the jar (or bag) without knowing which color of ball will be chosen. To do this, one must calculate percentages.
20. Show the percentages
It can help the reader to provide the percentages, since it is easier to verify percentages rather than calculate them. Here is the result.
The percentages are color coded to the color of the balls, using black for the "100.0" percent.
21. Show the best choices
For each pair of tubes, it can help to show the best choices, left or right, for each pair.
This is done with a color-coded box around the percentage, "left" or "right", that provides the best chance of picking that color ball.
22. Identify each ball
To help show that the "left" and "right" balls in pair "A+B" are a summation of the "left" and "right" balls in pair "A" and the "left" and "right" balls in pair "B", respectively, it can help to identify each ball uniquely, as follows.
In doing this, the naming of "A", "B", and "A+B" is changed, for better or worse, respectively, to "One", "Two", and "Both".
23. Distinguish balls
To help distinguish balls between the "left" and "right" tubes, the "left" tubes can use letters while the right tubes can use digits, as follows.
In doing this, the naming of "left" and "right" is changed, for better or worse, to "letters" and "digits", respectively.
This, then, is the final result. A primary objective in the visualization is that the visualization should allow the reader to check individual parts of the visualization that lead to the overall result.
That is, the reader can abstract similarities and differences while moving from the high-level view to the low-level view.
24. Tell a story with the data
A popular idea is to tell a story with the data using visualization techniques as part of explanatory graphics.
A popular book, and the origin of the idea, is Storytelling with Data: A Data Visualization Guide for Business Professionals, by Cole Knaflic (who did some data science at Google).
25. Cole Knaflic
Cole Nussbaumer Knaflic tells stories with data. She is founder & CEO of storytelling with data (SWD) and author of Storytelling With Data: A Data Visualization Guide for Business Professionals and Storytelling with Data: Let's Practice! SWD's well-regarded workshops and presentations are highly sought after by data-minded individuals, companies, and philanthropic organizations all over the world. Learn more at storytellingwithdata.com.