A guide to the usefulness of data science covers such topics as algorithms, logistic regression, financial modeling, data visualization, and data engineering.
Machine generated contents note: Big Data and Data Science Hype -- Getting Past the Hype -- Why Now? -- Datafication -- The Current Landscape (with a Little History) -- Data Science Jobs -- A Data Science Profile -- Thought Experiment: Meta-Definition -- OK, So What Is a Data Scientist, Really? -- In Academia -- In Industry -- Statistical Thinking in the Age of Big Data -- Statistical Inference -- Populations and Samples -- Populations and Samples of Big Data -- Big Data Can Mean Big Assumptions -- Modeling -- Exploratory Data Analysis -- Philosophy of Exploratory Data Analysis -- Exercise: EDA -- The Data Science Process -- A Data Scientist's Role in This Process -- Thought Experiment: How Would You Simulate Chaos? -- Case Study: RealDirect -- How Does RealDirect Make Money? -- Exercise: RealDirect Data Strategy -- Machine Learning Algorithms -- Three Basic Algorithms -- Linear Regression -- k-Nearest Neighbors (k-NN) -- k-means -- Exercise: Basic Machine Learning Algorithms -- Thought Experiment -- Financial Modeling -- In-Sample, Out-of-Sample, and Causality -- Preparing Financial Data -- Log Returns -- Example: The S&P Index -- Working out a Volatility Measurement -- Exponential Downweighting -- The Financial Modeling Feedback Loop -- Why Regression? -- Adding Priors -- A Baby Model -- Exercise: GetGlue and Timestamped Event Data -- Exercise: Financial Data -- William Cukierski -- Background: Data Science Competitions -- Background: Crowdsourcing -- The Kaggle Model -- A Single Contestant -- Their Customers -- Thought Experiment: What Are the Ethical Implications of a Robo-Grader? -- Feature Selection -- Example: User Retention -- Filters -- Wrappers -- Embedded Methods: Decision Trees -- Entropy -- The Decision Tree Algorithm -- Handling Continuous Variables in Decision Trees -- Random Forests -- User Retention: Interpretability Versus Predictive Power -- David Huffaker: Google's Hybrid Approach to Social Research -- Moving from Descriptive to Predictive -- Social at Google -- Privacy -- Thought Experiment: What Is the Best Way to Decrease Concern and Increase Understanding and Control? --
A Real-World Recommendation Engine -- Nearest Neighbor Algorithm Review -- Some Problems with Nearest Neighbors -- Beyond Nearest Neighbor: Machine Learning Classification -- The Dimensionality Problem -- Singular Value Decomposition (SVD) -- Important Properties of SVD -- Principal Component Analysis (PCA) -- Alternating Least Squares -- Fix V and Update U -- Last Thoughts on These Algorithms -- Thought Experiment: Filter Bubbles -- Exercise: Build Your Own Recommendation System -- Sample Code in Python -- Data Visualization History -- Gabriel Tarde -- Mark's Thought Experiment -- What Is Data Science, Redux? -- Processing -- Franco Moretti -- A Sample of Data Visualization Projects -- Mark's Data Visualization Projects -- New York Times Lobby: Moveable Type -- Project Cascade: Lives on a Screen -- Cronkite Plaza -- eBay Transactions and Books -- Public Theater Shakespeare Machine -- Goals of These Exhibits -- Data Science and Risk -- About Square -- The Risk Challenge -- The Trouble with Performance Estimation -- Model Building Tips -- Data Visualization at Square -- Ian's Thought Experiment -- Data Visualization for the Rest of Us -- Data Visualization Exercise -- Social Network Analysis at Morning Analytics -- Case-Attribute Data versus Social Network Data -- Social Network Analysis -- Terminology from Social Networks -- Centrality Measures -- The Industry of Centrality Measures -- Thought Experiment -- Morningside Analytics -- How Visualizations Help Us Find Schools of Fish -- More Background on Social Network Analysis from a Statistical Point of View -- Representations of Networks and Eigenvalue Centrality -- A First Example of Random Graphs: The Erdos-Renyi Model -- A Second Example of Random Graphs: The Exponential Random Graph Model -- Data Journalism -- A Bit of History on Data Journalism -- Writing Technical Journalism: Advice from an Expert -- Correlation Doesn't Imply Causation -- Asking Causal Questions -- Confounders: A Dating Example -- OK Cupid's Attempt -- The Gold Standard: Randomized Clinical Trials -- A/B Tests -- Second Best: Observational Studies -- Simpson's Paradox -- The Rubin Causal Model -- Visualizing Causality --
Definition: The Causal Effect -- Three Pieces of Advice -- Madigan's Background -- Thought Experiment -- Modern Academic Statistics -- Medical Literature and Observational Studies -- Stratification Does Not Solve the Confounder Problem -- What Do People Do About Confounding Things in Practice? -- Is There a Better Way? -- Research Experiment (Observational Medical Outcomes Partnership) -- Closing Thought Experiment -- Claudia's Data Scientist Profile -- The Life of a Chief Data Scientist -- On Being a Female Data Scientist -- Data Mining Competitions -- How to Be a Good Modeler -- Data Leakage -- Market Predictions -- Amazon Case Study: Big Spenders -- A Jewelry Sampling Problem -- IBM Customer Targeting -- Breast Cancer Detection -- Pneumonia Prediction -- How to Avoid Leakage -- Evaluating Models -- Accuracy: Meh -- Probabilities Matter, Not 0s and 1s -- Choosing an Algorithm -- A Final Example -- Parting Thoughts -- About David Crawshaw -- Thought Experiment -- MapReduce -- Word Frequency Problem -- Enter MapReduce -- Other Examples of MapReduce -- What Can't MapReduce Do? -- Pregel -- About Josh Wills -- Thought Experiment -- On Being a Data Scientist -- Data Abundance Versus Data Scarcity -- Designing Models -- Economic Interlude: Hadoop -- A Brief Introduction to Hadoop -- Cloudera -- Back to Josh: Workflow -- So How to Get Started with Hadoop? -- Process Thinking -- Naive No Longer -- Helping Hands -- Your Mileage May Vary -- Bridging Tunnels -- Some of Our Work -- What Just Happened? -- What Is Data Science (Again)? -- What Are Next-Gen Data Scientists? -- Being Problem Solvers -- Cultivating Soft Skills -- Being Question Askers -- Being an Ethical Data Scientist -- Career Advice.