Data science ideas: initial
1. Class project
A number of assignments will be for an individual or group project in data science.
You should find a problem for which you wish to apply data science techniques.
class - idea from another class
work - idea from work, co-op, etc.
home - idea from personal interest.
2. Scope
Whenever I have an idea for an identified project, I start with the following.
all of the possible ways to do it (pie in the sky)
minimum way to get started
I usually end up starting with the minimum way to get started, then adding requirements as needed.
YAGNI (You Ain't Gonna Need It).
3. Software systems development
Rule: 90% of the functionality can be obtained with 10% of the effort.
The remaining 10% of the functionality takes 90% of the effort.
The remaining 10% may be the most critical.
This concept is related to the Pareto principle (which is the 80-20 rule).
4. Phases
Proposal
Data
Visualization
Method with simple examples
Application prototype (code)
Written communication (docx)
Oral communication (pptx)
This page has ideas for data science projects, investigations, etc.
Any remarks in square brackets are from projects on which I have worked.
5. Geographic information systems
USGS satellite image data [solar panel and swimming pool detection]
Open Street Map [bicycle route mapping]
Google Maps API, etc. [bicycle route mapping, individual and for groups]
6. Bayesian classification
Any yes-no question where data is available.
Sentiment analysis [Footwear company - German comments, Google translation API]
Spam analysis
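As a rough illustration of a yes-no classifier, here is a minimal naive Bayes sketch for spam versus non-spam text. The training sentences, the labels "spam" and "ham", and the function names train and predict are all made up for this example. It assumes whitespace tokenization and add-one (Laplace) smoothing.

```python
import math
from collections import Counter

def train(labeled_docs):
    """Count words per label. labeled_docs is a list of (text, label) pairs."""
    word_counts = {}          # label -> Counter of words seen with that label
    label_counts = Counter()  # label -> number of documents with that label
    vocab = set()
    for text, label in labeled_docs:
        words = text.lower().split()
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(words)
        vocab.update(words)
    return word_counts, label_counts, vocab

def predict(text, model):
    """Return the label with the highest log posterior score."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        # log prior for the label
        score = math.log(label_counts[label] / total_docs)
        n_words = sum(word_counts[label].values())
        for w in text.lower().split():
            # add-one smoothing so unseen words do not give log(0)
            score += math.log((word_counts[label][w] + 1) / (n_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up training data, for illustration only.
docs = [
    ("win money now", "spam"),
    ("free money offer", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project status meeting", "ham"),
]
model = train(docs)
print(predict("free money", model))
print(predict("monday meeting", model))
```

The same structure applies to sentiment (positive or negative) once labeled comments are available.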
7. Topic modeling
Documents (Customers), Vocabulary (Products), Words used in documents (Products bought by a customer).
News headlines [RSS feeds for banking, security, political, etc., analysis, change over time]
Text comparisons [LSI vector-based, LDA]
Certification question updates [PDF to text, topic modeling, relating common parts of text with questions]
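Topic modeling methods usually start from a document-by-vocabulary count matrix (or, in the customer analogy above, a customer-by-product matrix). The sketch below builds that matrix and compares documents by cosine similarity on the raw counts. It is not LSI or LDA itself, only the starting point for them. The headlines and the function names term_matrix and cosine are invented for the example.

```python
import math
from collections import Counter

def term_matrix(docs):
    """Return (vocab, rows), where rows[i][j] counts vocab[j] in docs[i]."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    rows = []
    for d in docs:
        counts = Counter(d.lower().split())
        rows.append([counts[w] for w in vocab])
    return vocab, rows

def cosine(u, v):
    """Cosine similarity between two count vectors (0.0 if either is all zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Made-up headlines, for illustration only.
headlines = [
    "bank raises interest rates",
    "central bank holds interest rates",
    "election results announced",
]
vocab, rows = term_matrix(headlines)
print(cosine(rows[0], rows[1]))  # the two banking headlines share three words
print(cosine(rows[0], rows[2]))  # no shared words
```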
8. Intellectual property forensic analysis
Code comparisons [complex regular expression and LCS and MED comparisons, cluster computing]
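Assuming LCS here means longest common subsequence and MED means minimum edit distance, both can be sketched with standard dynamic programming. The function names are made up for this example. The same functions work on strings or on lists of source lines.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def edit_distance(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

print(lcs_length("ABCBDAB", "BDCABA"))     # 4
print(edit_distance("kitten", "sitting"))  # 3

# Comparing two made-up code fragments line by line.
old = ["x = 1", "y = 2", "print(x + y)"]
new = ["x = 1", "y = 3", "print(x + y)"]
print(lcs_length(old, new), edit_distance(old, new))
```

A high LCS length relative to file length, or a low edit distance, suggests that two files share a common origin. A real comparison would normally normalize whitespace and identifiers first.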
9. Dimensionality reduction
10. Time series data
Time series analysis involves data ordered in time, often with trends or periodic cycles.
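One common way to look for a periodic cycle is autocorrelation: compare the series with a shifted copy of itself and see which shift (lag) matches best. The sketch below uses a made-up series with a known period of 4. The function name autocorrelation is hypothetical.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of series at the given lag (lag >= 1)."""
    n = len(series)
    mean = sum(series) / n
    variance_sum = sum((x - mean) ** 2 for x in series)
    covariance_sum = sum((series[i] - mean) * (series[i + lag] - mean)
                         for i in range(n - lag))
    return covariance_sum / variance_sum

# Made-up data that repeats every 4 samples.
series = [1, 3, 5, 3] * 6

# The lag with the highest autocorrelation is a candidate for the period.
best_lag = max(range(1, 9), key=lambda k: autocorrelation(series, k))
print(best_lag)
```

Real data is noisier, so it usually helps to inspect the autocorrelation at many lags, not just the single best one.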
11. Data collection, analysis, and display
Moving sensor data [Kinect skeleton data collection, analysis and display in real time]
Moving sensor data [Sony Oui controller connected to Arduino]
News headlines collection [Raspberry Pi over several years]
Crayon drawing color analysis [Nursing research on crayon drawings of children]
Real Estate listings [Amazon Cloud, 100,000+ listings, changes updated every 20 minutes, on vanilla web-hosting system]
12. Decision trees and random forests
Grouping data in a tree structure from most important/prevalent to least important/prevalent.
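One common way to decide which attribute is "most important" is information gain: the attribute whose split reduces the entropy of the outcome the most goes at the root of the tree. The sketch below shows only that selection step, not a full tree builder. The rows, the attribute names, and the function names are invented for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr, target):
    """Reduction in entropy of rows[target] after splitting on rows[attr]."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Made-up data: does someone go for a bicycle ride?
rows = [
    {"outlook": "sunny", "windy": "no",  "ride": "yes"},
    {"outlook": "sunny", "windy": "yes", "ride": "yes"},
    {"outlook": "rainy", "windy": "no",  "ride": "no"},
    {"outlook": "rainy", "windy": "yes", "ride": "no"},
]
gains = {a: information_gain(rows, a, "ride") for a in ("outlook", "windy")}
root = max(gains, key=gains.get)
print(gains, root)
```

Here "outlook" fully determines the outcome and "windy" tells nothing, so "outlook" would be the root split. A random forest repeats this kind of tree building on many random subsets of the rows and attributes.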
13. Natural language processing
text to voice [pet food marketing]
text to voice [vocabulary learning system]
Google translation API [sentiment analysis]
14. Text processing
Email comparisons [Enron database, legal search pruning]
Patent comparisons [Google patent database]
Compact encoding and fast search of texts [searching ancient documents in many languages]
SQL query analysis and fix-ups [6,000 queries in 150 client databases]
15. Regression
Linear regression
Logistic regression (hill-climbing algorithms)
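Logistic regression has no closed-form solution, so the coefficients are found by repeatedly stepping uphill on the log-likelihood. The sketch below uses plain gradient ascent as one such hill-climbing method, with a single input variable. The data, the learning rate, and the step count are made-up choices for illustration.

```python
import math

def sigmoid(z):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, rate=0.01, steps=20000):
    """Fit p(y=1 | x) = sigmoid(w*x + b) by gradient ascent on the log-likelihood."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        # Gradient of the log-likelihood with respect to w and b.
        errors = [y - sigmoid(w * x + b) for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs))
        grad_b = sum(errors)
        # Step uphill.
        w += rate * grad_w
        b += rate * grad_b
    return w, b

# Made-up data: the outcome becomes more likely as x grows, with some overlap.
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 1, 0, 1, 1]
w, b = fit_logistic(xs, ys)
print(w, b)
print(sigmoid(w * 1 + b), sigmoid(w * 6 + b))  # predicted probabilities at x=1 and x=6
```

The value of x where the predicted probability crosses 0.5 is -b/w. The classes overlap in this data on purpose: with perfectly separable data the coefficients would keep growing without converging.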
16. Clustering
Clustering is used to partition a set of data into groups.
parametric - number of groups is known at start
non-parametric - number of groups is not known at start
k-means clustering [graphics display and convex hull]
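For the case where the number of groups is known at the start, here is a minimal k-means sketch on 2D points. It alternates between assigning each point to its nearest center and moving each center to the mean of its points. The points are made up, and the initialization (first k points as centers) is a simplification. Real implementations usually pick starting centers more carefully.

```python
def kmeans(points, k, iterations=20):
    """Cluster 2D points into k groups. Returns (centers, clusters)."""
    centers = [points[i] for i in range(k)]  # naive start: the first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda c: (p[0] - centers[c][0]) ** 2 + (p[1] - centers[c][1]) ** 2,
            )
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its points.
        centers = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
            if cl else centers[i]  # keep the old center if a cluster is empty
            for i, cl in enumerate(clusters)
        ]
    return centers, clusters

# Made-up points forming two obvious groups.
points = [(1, 1), (9, 9), (1, 2), (2, 1), (8, 9), (9, 8)]
centers, clusters = kmeans(points, 2)
print(centers)
print(clusters)
```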
17. Gaussian mixtures
Gaussian mixture models are used to infer multiple (normal) distributions in aggregate data.
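As a rough sketch of how the separate distributions can be inferred, the following uses the expectation-maximization (EM) procedure for a mixture of two one-dimensional normal distributions. The data, the crude initialization, and the function names are assumptions made for this example.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_two_gaussians(data, steps=50):
    """Fit a two-component 1D Gaussian mixture by EM. Returns (weights, means, variances)."""
    means = [min(data), max(data)]  # crude start: the extreme values
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    for _ in range(steps):
        # E-step: how responsible is each component for each data point?
        resp = []
        for x in data:
            p = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(2)]
            total = sum(p)
            resp.append([pk / total for pk in p])
        # M-step: re-estimate each component from its weighted points.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(data)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var, 1e-6)  # guard against a collapsing variance
    return weights, means, variances

# Made-up aggregate data drawn from two groups, around 1 and around 5.
data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
weights, means, variances = fit_two_gaussians(data)
print(weights, means, variances)
```

Unlike k-means, each point belongs to every component with some probability, so overlapping groups can still be separated.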
18. Statistical distributions
Inferring distributions for arbitrary data and generating test data [generate realistic test data]
Reverse engineering high level specifications from rules and patterns [reverse engineering charts from PDF]
19. Kernel density estimation
Kernel density estimation
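Kernel density estimation builds a smooth density curve by placing a small "bump" (kernel) on each observed value and averaging the bumps. The sketch below assumes a Gaussian kernel and a hand-picked bandwidth. The sample values and the function name kde are made up.

```python
import math

def kde(samples, x, bandwidth):
    """Gaussian kernel density estimate at x from the given samples."""
    n = len(samples)
    total = sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
    return total / (n * bandwidth * math.sqrt(2 * math.pi))

# Made-up observations: two values close together and one far away.
samples = [1.0, 1.5, 4.0]
for x in (0.0, 1.25, 2.75, 4.0):
    print(x, kde(samples, x, bandwidth=0.5))
```

A smaller bandwidth gives a spikier estimate that follows the individual points, and a larger one gives a smoother estimate that may hide real structure.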
20. Neural networks
Neural networks are intended to recognize patterns in a yes-no manner. [best buy for computer given competing data]
21. Manifold learning
22. Support Vector Machines
23. Deep learning and TensorFlow
Deep learning attempts to reduce the amount of "feature extraction" needed to analyze data.
TensorFlow is an end-to-end open source platform for machine learning. It has a comprehensive, flexible ecosystem of tools, libraries and community resources that lets researchers push the state-of-the-art in ML and developers easily build and deploy ML powered applications. https://tensorflow.org
24. Data sources
Sensor data
Log data (web server, etc.)
Program code errors
Student scores, attendance, etc.
News stories from news RSS feeds
Government data
Census data
Crime statistics [crime statistics project]