Karl's ML Learning Log

Another take on regularization: preferring use of more features

2017-02-20T00:00:00Z

Continuing through the cs231n lectures, it's mostly a review of linear optimization. But I found the motivation for regularization here interesting:

$x = [1, 1, 1, 1]$
$w_1 = [1, 0, 0, 0]$
$w_2 = [0.25, 0.25, 0.25, 0.25]$
$w_1^Tx = w_2^Tx = 1$

Without regularization, these two weight vectors will contribute the same loss for this example, but we prefer $w_2$ as it draws on more features. An L2 regularization term will penalize $w_1$ more than $w_2$; regularization is a way to prefer an even spread of weights across the features.
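To make the comparison concrete, here is a minimal sketch that computes the score and the L2 penalty (taken here as the plain sum of squared weights, without any regularization strength factor) for the two weight vectors above.

```python
import numpy as np

x = np.array([1.0, 1.0, 1.0, 1.0])
w1 = np.array([1.0, 0.0, 0.0, 0.0])
w2 = np.array([0.25, 0.25, 0.25, 0.25])

# Both weight vectors produce the same score for x...
score1 = w1 @ x
score2 = w2 @ x

# ...but the L2 penalty (sum of squared weights) differs.
l2_w1 = np.sum(w1 ** 2)
l2_w2 = np.sum(w2 ** 2)

print(score1, score2)  # 1.0 1.0
print(l2_w1, l2_w2)    # 1.0 0.25
```

The penalty for $w_1$ is four times that of $w_2$, so with everything else equal the regularized loss favors the evenly spread weights.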

The way I've thought about regularization previously is simply that large weights allow the function to "stretch further" to fit the training set, making the model less likely to generalize to unseen data. This might be more relevant when considering non-linear classifiers.

To reconcile both of these interpretations: all else being equal, if you are relying solely on a smaller set of features, you are more likely to be over-fitting the examples you are training on.

Practice vectorizing numpy operations, and why KNN stinks at image classification

2017-01-17T00:00:00Z

I'm working my way through cs231n, Stanford's class on convolutional neural networks. The first assignment sets the stage by having you implement a k-nearest neighbor classifier by hand.

This is a nice warm-up as it gets you set up with the framework for image classification: train on a series of labeled images, and then predict the class of a new set of test images.

Reviewing vectorization

The first thing that was useful about this was simply implementing KNN by hand, first without any help from numpy's vectorizing methods, then help along one dimension, and finally figuring out how to do it without any loops at all. I spent some time reviewing universal functions and broadcasting, which come in handy when re-implementing the distance matrix calculation.
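As an illustration of the loop-free version, here is a rough sketch (not the assignment's solution; the function names and the random toy data are made up for this example). It relies on the identity $\|a - b\|^2 = \|a\|^2 - 2a^Tb + \|b\|^2$ plus broadcasting, and checks the result against a straightforward two-loop version.

```python
import numpy as np

def dists_two_loops(X_test, X_train):
    """L2 distance between every test and train row, one pair at a time."""
    dists = np.zeros((X_test.shape[0], X_train.shape[0]))
    for i in range(X_test.shape[0]):
        for j in range(X_train.shape[0]):
            dists[i, j] = np.sqrt(np.sum((X_test[i] - X_train[j]) ** 2))
    return dists

def dists_no_loops(X_test, X_train):
    """Same matrix via ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2 and broadcasting."""
    test_sq = np.sum(X_test ** 2, axis=1)[:, np.newaxis]    # shape (n_test, 1)
    train_sq = np.sum(X_train ** 2, axis=1)[np.newaxis, :]  # shape (1, n_train)
    cross = X_test @ X_train.T                              # shape (n_test, n_train)
    sq = test_sq - 2 * cross + train_sq
    return np.sqrt(np.maximum(sq, 0))  # clip tiny negatives caused by rounding

rng = np.random.default_rng(0)
X_train = rng.random((5, 12))  # 5 flattened toy "images" of 12 pixels each
X_test = rng.random((3, 12))
same = np.allclose(dists_two_loops(X_test, X_train), dists_no_loops(X_test, X_train))
print(same)
```

The `(n_test, 1)` and `(1, n_train)` shapes are what let broadcasting expand the squared norms across the full distance matrix without any explicit loop.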

Why KNN stinks at image classification

It's useful to see that applying a nearest neighbor approach to image classification is not effective, and it makes sense: if you are doing a pixel-wise distance comparison between two images, it will be terribly sensitive to any translation of image features; imagine the exact same image having a distance of zero to itself, but then a large distance to a version of itself shifted over by a few pixels.

This motivates the need for the translation-invariant nature of features learned by convolutional layers.
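The shift sensitivity described above is easy to reproduce on a toy example. This sketch uses a made-up 16x16 image and a pixel-wise L1 distance, chosen only for illustration.

```python
import numpy as np

# A 16x16 "image": a bright 8x8 square on a black background.
img = np.zeros((16, 16))
img[4:12, 4:12] = 1.0

# The same image shifted 3 pixels to the right (no wrap-around occurs here).
shifted = np.roll(img, 3, axis=1)

def l1_distance(a, b):
    """Pixel-wise L1 distance, the kind of metric a raw-pixel KNN compares."""
    return np.sum(np.abs(a - b))

d_self = l1_distance(img, img)
d_shifted = l1_distance(img, shifted)
print(d_self, d_shifted)  # 0.0 48.0
```

The content of the two images is identical, yet the small shift changes 48 pixel positions, while the square itself only covers 64 pixels.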

I'm keeping my assignments in a private Bitbucket repo out of respect for what might be a desire on Stanford's part to keep the solutions private, but it might turn out this is overly cautious.

Hands-on lecture development for University of Michigan's undergrad ML course and a WIP EM notebook

2016-12-22T00:00:00Z

I've been offline in my logging of learning for a while, in part because, in addition to my regular job (where I'm learning practical stuff about running NN toolkits, especially MXNet), I ended up helping out with the University of Michigan's ML class. It was a great opportunity to learn some concepts, but it also took up most of my free learning time, including the time to write up my learning!

In any case, I figured I'd link to the notebooks I helped develop; they are available on the course's GitHub repo. I helped with hands-on lectures 12-21, covering:

Ensemble learning with bagging
Evaluating models with learning curves
Unsupervised learning and Principal Components Analysis (PCA)
Bayesian networks
Latent Variable models and Expectation Maximization
Hidden Markov Models, the forward-backward algorithm and the Baum-Welch algorithm (EM for HMMs)
Introduction to neural networks: the backpropagation algorithm and some of the basics on convolutional neural networks

I'm not sure how instructive they will be without having also read through the lecture notes and perhaps attended the lectures, but perhaps there's some good stuff in there for others.

For some of these I was already familiar with the material and was able to pull something together pretty easily (e.g. decision trees, bagging, learning curves, PCA), but towards the end it was quite challenging: each week I needed to both learn a topic and develop an exercise for it, including Bayesian networks, EM and HMMs.

One concept that was particularly vexing to me was expectation maximization. I've followed up on the original hands-on lecture to go a bit more deeply into it with an in-progress notebook (that is currently unlisted). I will say more once I've wrapped it up, but I hope to make a nice standalone notebook that will help others get up to speed more quickly.

Learning features in ANNs and review of Perceptron

2016-10-11T00:00:00Z

I continued on, watching week 2's lectures of Geoffrey Hinton's neural network Coursera course, which covered some of the different kinds of high-level architectures, including feed-forward and recurrent neural networks.

One interesting perspective I hadn't fully understood is that a key benefit of ANNs is the potential to learn the feature units. A standard pattern recognition system starts out with some feature representation of the data that is usually hand coded, and the model is then trained to weight the features to best recognize the data. With ANNs, you deliberately start out with a more raw source of data and leave it to the network to learn features which emerge in the earlier layer(s) of the network, and the subsequent layers then weight those features as per the standard pattern recognition paradigm.

This also reminds me of the idea of an "adaptive basis function" that I came across in perusing MLPP, meaning the model adapts the basis function that is applied before training a linear model. So you could contrast, say, kernel methods where a hand picked kernel is used as a basis function to get the overall model to fit non-linear decision boundaries with a multi-layer ANN that learns its basis function (the first layers of the network) adaptively based on the training data.

Perceptron

The lecture reviewed the perceptron algorithm, where the model is a linear combination of weights and the input vector plus a bias weight. You iterate over each training sample and update the weights if the output does not match, as follows: if the output is zero when it should be one, add the sample to the weights. If the output is one when it should be zero, subtract the sample from the weights. This can be proven to always find weights that give the correct answer for every training sample if the data can be perfectly fit. The big caveat is that if the data cannot be perfectly fit (i.e. is not linearly separable), then it will not converge.

The perceptron is also covered in the second chapter of Python Machine Learning.
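The update rule can be sketched in a few lines of plain Python. This is a minimal illustration rather than the lecture's or the book's code: the function names, the `max_epochs` cap, the choice to output one when the weighted sum is exactly zero, and the logical AND toy data are all assumptions made for the example. The bias is treated as a weight on a constant input of 1.

```python
def predict(weights, bias, x):
    """Binary threshold unit: output 1 if the weighted sum is non-negative."""
    total = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if total >= 0 else 0

def train_perceptron(samples, labels, max_epochs=100):
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, target in zip(samples, labels):
            output = predict(weights, bias, x)
            if output == 0 and target == 1:
                # Output too low: add the sample to the weights.
                weights = [w + xi for w, xi in zip(weights, x)]
                bias += 1.0
                mistakes += 1
            elif output == 1 and target == 0:
                # Output too high: subtract the sample from the weights.
                weights = [w - xi for w, xi in zip(weights, x)]
                bias -= 1.0
                mistakes += 1
        if mistakes == 0:  # a full pass with no errors: converged
            break
    return weights, bias

# Logical AND is linearly separable, so training should converge.
samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 0, 0, 1]
weights, bias = train_perceptron(samples, labels)
predictions = [predict(weights, bias, x) for x in samples]
print(predictions)  # [0, 0, 0, 1]
```

On data that is not linearly separable (XOR, for instance) the loop would simply run until `max_epochs` without ever completing an error-free pass.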

Getting started with Coursera Probabilistic Graphical Models Course

2016-10-05T00:00:00Z

Another day, another Coursera course!

I previously grumbled about this course having been taken offline, so I'm glad that it is back. It appears they have broken the original course into 3 pieces and slowed things down a bit, e.g. "week 1" in the original now spans about a week and a half. Given that I'm going to attempt to keep up with this and the ANN course, that's fine by me :)

The introductory lectures from week one cover how graphical models represent relationships between random variables. I'm mostly familiar with the high-level concepts at this level, and it was nice to find the probability review of joint distributions, conditioning on one or more variables, integrating out variables with marginalization, etc. completely familiar.

Factors

One new concept to me is that of factors. A factor is a mapping of every possible assignment of the cross-product space of a set of random variables to a real number. A probability distribution is an example of a factor, as it maps every possible outcome in the event space to a real number. A conditional distribution is another example. A factor's scope is the set of variables whose cross-product it assigns values to.

Factors can be marginalized just as probability distributions can, and can also be multiplied together. When you multiply two factors, you essentially get a join of the two on their shared variables (a cross join if they share none), with the assigned value being the product of the outputs of the two original factors.

Factor reduction is kind of like a 'select where', only looking at rows where a variable has a particular value.

The video explains that factors are useful because the tools for manipulating factors are the tools we'll use to work with high dimensional probability distributions.
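The three factor operations described above (product, marginalization and reduction) can be sketched with plain dictionaries. This is only an illustrative representation, not the course's code: a factor here is a scope tuple plus a table keyed by assignments, and all function names and the example numbers are made up.

```python
from itertools import product

# A factor is a scope (tuple of variable names) plus a table that maps
# every assignment of those variables to a real number.
def factor_product(scope_a, table_a, scope_b, table_b, domains):
    """Multiply two factors: join on shared variables, multiply the values."""
    scope = scope_a + tuple(v for v in scope_b if v not in scope_a)
    table = {}
    for assignment in product(*(domains[v] for v in scope)):
        row = dict(zip(scope, assignment))
        key_a = tuple(row[v] for v in scope_a)
        key_b = tuple(row[v] for v in scope_b)
        table[assignment] = table_a[key_a] * table_b[key_b]
    return scope, table

def marginalize(scope, table, var):
    """Sum out `var`, leaving a factor over the remaining variables."""
    idx = scope.index(var)
    new_scope = scope[:idx] + scope[idx + 1:]
    new_table = {}
    for assignment, value in table.items():
        key = assignment[:idx] + assignment[idx + 1:]
        new_table[key] = new_table.get(key, 0.0) + value
    return new_scope, new_table

def reduce_factor(scope, table, var, value):
    """Keep only the rows where `var == value` (the 'select where')."""
    idx = scope.index(var)
    new_scope = scope[:idx] + scope[idx + 1:]
    new_table = {a[:idx] + a[idx + 1:]: v for a, v in table.items() if a[idx] == value}
    return new_scope, new_table

domains = {"A": (0, 1), "B": (0, 1)}
p_a = {(0,): 0.5, (1,): 0.5}                # P(A)
p_b_given_a = {(0, 0): 0.75, (0, 1): 0.25,  # P(B | A), keyed by (A, B)
               (1, 0): 0.25, (1, 1): 0.75}

joint_scope, joint = factor_product(("A",), p_a, ("A", "B"), p_b_given_a, domains)
_, marginal_b = marginalize(joint_scope, joint, "A")
_, given_a1 = reduce_factor(joint_scope, joint, "A", 1)
print(joint)       # {(0, 0): 0.375, (0, 1): 0.125, (1, 0): 0.125, (1, 1): 0.375}
print(marginal_b)  # {(0,): 0.5, (1,): 0.5}
print(given_a1)    # {(0,): 0.125, (1,): 0.375}
```

Note that the reduced factor is not normalized: its values sum to 0.5, not 1, which is a reminder that factors are more general than probability distributions.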

Getting started with Geoffrey Hinton's Coursera Neural Networks class, a nice summary of unsupervised learning

2016-10-04T00:00:00Z

Most of the first week's lectures were pure review, the usual summary of ANNs and machine learning, but one thing I appreciated was the summary of unsupervised learning from this lecture. It defines several related goals:

To create an internal representation of the input that is useful for subsequent supervised or reinforcement learning
To provide a compact, low-dimensional representation of the input (e.g. PCA is a linear method for this)
To provide an economical high-dimensional representation of the input in terms of learned features
To find sensible clusters in the input, which is an example of a very sparse code in which only one of the features is non-zero

I understood that dimensionality reduction and clustering were forms of unsupervised learning, but the connection that clustering is really just an extremely sparse representation, with a single non-zero feature per input, is interesting. I'm excited that half of the course will cover unsupervised learning, as I haven't really covered any material with ANNs for that purpose, and I'm beginning to understand that one of the coolest parts of ANNs is the learned features.

Regression models with scikit-learn

2016-09-28T00:00:00Z

I've been meaning to get back to chapter 10 of Python Machine Learning which covers regression models. Here's my notebook which differs from the author's in that I used a pipeline for preprocessing and explored the performance of a few more models just for kicks.

Attacking a regression problem shares many of the same techniques as a classification problem: preprocessing the data, performing cross validation etc. The main differences are in exploratory analysis, visualizing model performance and in the evaluation metric itself.

Exploring quantitative variables

The two tools used in exploring the housing dataset are the scatter matrix and the correlation matrix. One bone I have to pick with the chapter is that the dataset contains exclusively quantitative variables; just because the output variable is continuous doesn't mean every input variable will be too. It would have been nice had the example included a mix. Anyways:

A scatter matrix is a nice way to quickly visualize the distribution of each quantitative variable individually as well as how the variables relate to each other pairwise.
A correlation matrix helps find positive or negative correlations between each pair of quantitative variables.
Combined, these help you quickly scan relationships, look for variables that are strongly correlated with the dependent (output) variable, and see whether each relationship is linear (and thus likely to work well with a linear model) or might require a non-linear model (polynomial, random forest).

Here in the correlation matrix we can see that there are two variables that are strongly correlated with the price (MEDV): the number of rooms (RM) and % lower status population (LSTAT):

but in the scatter matrix we can see that RM has a more linear relationship:

so in fitting a 1d model, that's the best variable to choose.

Implementing a basic single variable model

Implementing a single variable linear regression model was a pretty simple adaptation of the perceptron coded up in chapter 2. In fact, we just need to remove the part where we took the linear combination of the weights and input vector and fed it into an activation function that forced the value to -1 or 1; the linear combination is the output of the model.
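A minimal sketch of that idea, assuming batch gradient descent on half the mean squared error (the book's implementation may differ in its details; the function name, learning rate, iteration count and toy data are made up here):

```python
import numpy as np

def fit_linear_gd(x, y, lr=0.1, n_iter=1000):
    """Fit y ~ w * x + b by batch gradient descent on half the mean squared error."""
    w, b = 0.0, 0.0
    for _ in range(n_iter):
        pred = w * x + b   # the linear combination *is* the output
        error = pred - y
        w -= lr * np.mean(error * x)
        b -= lr * np.mean(error)
    return w, b

x = np.linspace(-1.0, 1.0, 21)
y = 2.0 * x + 1.0          # noise-free toy data
w, b = fit_linear_gd(x, y)
print(round(float(w), 4), round(float(b), 4))  # 2.0 1.0
```

There is no thresholding step anywhere: the raw weighted sum is compared directly against the continuous target.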

Regression models explored

After walking through implementing a single variable regression model by hand, the chapter goes on to explore a few off the shelf scikit-learn models.

In addition to basic linear regression, the book also covers some regularized flavors (note: we've covered regularization before):

L2 penalized (adding the sum of squared weights to the cost function) is called Ridge Regression
L1 penalized (adding the sum of absolute values of the weights to the cost function) is called LASSO, short for Least Absolute Shrinkage and Selection Operator
Elastic Net includes both regularization terms, and is tuned by an L1 to L2 ratio parameter
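The three penalty terms can be written out directly. This is a simplified sketch: the names `alpha` and `l1_ratio` mirror common usage, but library implementations may scale the terms differently, so treat the exact constants as an assumption.

```python
import numpy as np

def l2_penalty(w, alpha):
    """Ridge: alpha times the sum of squared weights."""
    return alpha * np.sum(w ** 2)

def l1_penalty(w, alpha):
    """LASSO: alpha times the sum of absolute weights."""
    return alpha * np.sum(np.abs(w))

def elastic_net_penalty(w, alpha, l1_ratio):
    """A mix of both; l1_ratio=1 is pure L1, l1_ratio=0 is pure L2."""
    return l1_ratio * l1_penalty(w, alpha) + (1 - l1_ratio) * l2_penalty(w, alpha)

w = np.array([0.5, -2.0, 0.0])
ridge_term = l2_penalty(w, alpha=1.0)                        # 0.25 + 4.0 = 4.25
lasso_term = l1_penalty(w, alpha=1.0)                        # 0.5 + 2.0 = 2.5
enet_term = elastic_net_penalty(w, alpha=1.0, l1_ratio=0.5)  # 0.5 * 2.5 + 0.5 * 4.25 = 3.375
print(ridge_term, lasso_term, enet_term)  # 4.25 2.5 3.375
```

The large weight dominates the L2 term much more than the L1 term, which is why Ridge tends to shrink big weights hard while LASSO is more willing to push small ones all the way to zero.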

RANSAC

In addition to regularized linear regression models, there's another nifty technique used to minimize the effect of outliers called RANSAC.

RANSAC aims to reduce the effect of outliers by iteratively randomly selecting a subset, assuming they are 'inliers' and then selecting all other points that are within a threshold of the fitted line:

Select a random number of samples to be inliers and fit the model.
Test all other data points against the fitted model and add those points that fall within a user-given tolerance to the inliers.
Refit the model using all inliers.
Estimate the error of the fitted model versus the inliers.
Terminate the algorithm if the performance meets a certain user-defined threshold or if a fixed number of iterations has been reached; go back to step 1 otherwise.
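The steps above can be sketched for a simple line fit. This is a toy version under stated assumptions, not scikit-learn's implementation: it always runs a fixed number of iterations and keeps the largest consensus set instead of stopping on an error threshold, and the function name, parameters and data are made up for the example.

```python
import numpy as np

def ransac_line(x, y, n_sample=2, threshold=1.0, n_iter=50, seed=0):
    """Toy RANSAC for a line y ~ slope * x + intercept."""
    rng = np.random.default_rng(seed)
    best_inliers = None
    for _ in range(n_iter):
        # Step 1: fit to a random subset assumed to be inliers.
        idx = rng.choice(len(x), size=n_sample, replace=False)
        slope, intercept = np.polyfit(x[idx], y[idx], deg=1)
        # Step 2: every point within the tolerance joins the consensus set.
        residuals = np.abs(y - (slope * x + intercept))
        inliers = residuals <= threshold
        # Keep the largest consensus set seen so far.
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Step 3: refit using all inliers of the best consensus set.
    slope, intercept = np.polyfit(x[best_inliers], y[best_inliers], deg=1)
    return slope, intercept, best_inliers

x = np.arange(20, dtype=float)
y = 2.0 * x + 1.0
y[[3, 10, 16]] += [30.0, -25.0, 40.0]  # three gross outliers
slope, intercept, inliers = ransac_line(x, y)
print(round(float(slope), 4), round(float(intercept), 4), int(inliers.sum()))  # 2.0 1.0 17
```

Selecting by consensus size rather than by error on the inliers avoids a trap in the naive reading of the steps: a tiny sample that happens to include an outlier fits itself perfectly, so a low error alone says little.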

One disadvantage is the metric for deciding whether points are within the threshold is dataset dependent, as the book describes:

Using the residual_metric parameter, we provided a callable lambda function that simply calculates the absolute vertical distances between the fitted line and the sample points. By setting the residual_threshold parameter to 5.0, we only allowed samples to be included in the inlier set if their vertical distance to the fit line is within 5 distance units, which works well on this particular dataset. By default, scikit-learn uses the MAD estimate to select the inlier threshold, where MAD stands for the Median Absolute Deviation of the target values y. However, the choice of an appropriate value for the inlier threshold is problem-specific, which is one disadvantage of RANSAC.

Evaluating performance

In evaluating how well a model performed, we need something more sophisticated than merely counting up the % of correct classifications. The two metrics the book presents are mean squared error (MSE) and the so-called $R^2$, which rescales MSE by dividing it by the variance of the response variable and subtracting the result from 1, so that a perfect fit scores 1 and always predicting the mean scores 0.

$$R^2 = 1 - \frac{MSE}{Var(y)}$$

I like it because it makes it easier to compare apples to apples across datasets. (Note that on the test set, $R^2$ can end up being less than 0.)
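A small sketch of the formula, using made-up numbers, shows the three regimes: a perfect fit, a model no better than predicting the mean, and a model that is worse than the mean.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - MSE / Var(y), with both computed on the same set of points."""
    mse = np.mean((y_true - y_pred) ** 2)
    return 1.0 - mse / np.var(y_true)

y_true = np.array([1.0, 2.0, 3.0, 4.0])
perfect = r_squared(y_true, y_true)                      # exact predictions
baseline = r_squared(y_true, np.full(4, y_true.mean()))  # always predict the mean
bad = r_squared(y_true, np.array([4.0, 3.0, 2.0, 1.0]))  # worse than the mean
print(perfect, baseline, bad)  # 1.0 0.0 -3.0
```

A negative value simply means the model's squared error exceeds the variance of the targets, which is what a badly overfit model can do on held-out data.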

Visualizing performance with Residual Plots

To evaluate the performance visually when you are working with more than one or two input variables, making visualizing the line or plane impossible, residual plots come in handy. As the book notes,

We can plot the residuals (the differences or vertical distances between the actual and predicted values) versus the predicted values to diagnose our regression model. Those residual plots are a commonly used graphical analysis for diagnosing regression models to detect nonlinearity and outliers, and to check if the errors are randomly distributed... If we see patterns in a residual plot, it means that our model is unable to capture some explanatory information, which is leaked into the residuals

Here's the residual plot for the housing dataset:

So what do you look for in these plots? Ideally it just looks like noise

for a good regression model, we would expect that the errors are randomly distributed and the residuals should be randomly scattered around the centerline. If we see patterns in a residual plot, it means that our model is unable to capture some explanatory information, which is leaked into the residuals as we can slightly see in our preceding residual plot. Furthermore, we can also use residual plots to detect outliers, which are represented by the points with a large deviation from the centerline.

Comparing model performance

Here's a summary of the $R^2$ performance of a bunch of models. Interesting to see random forest kicking ass as usual and that a high degree polynomial overfits the training set so badly. It's also interesting that on this dataset, plain old linear regression outperforms the regularized alternatives as well as RANSAC.

Model	train $R^2$	test $R^2$
Random Forest	0.983379	0.827613
LR	0.764545	0.673383
Ridge	0.764475	0.672546
Decision Tree	0.851129	0.662887
Lasso	0.753127	0.653209
Quadratic	0.951138	0.652489
ElasticNet	0.751682	0.652375
RANSAC	0.722514	0.595687
Cubic	1.000000	-1030.784778

Reflections on a summer of learning

2016-09-05T00:00:00Z

My summer of full-time learning is winding down—I've lined up a job as a research engineer at a lab focused on autonomous vehicles. I'll be able to apply some of my learnings as well as my general software engineering skills while continuing to learn. Perhaps I'll write more on the process of finding a job later, for now I'll just say I feel extremely fortunate to have found something that meets the learning / applying criteria so quickly and to put a nice bow on the summer.

I'm finding that I want to shift my thinking to plan for the longer haul process of continuing to gain expertise in machine learning. But first, what did I learn these past few months?

Most broadly speaking, I'd say I've made a lot of progress in these areas:

I've become comfortable with the practical concerns in attacking supervised learning problems presented in the form of a (perhaps messy) labeled dataset, where one hopes to train a model that generalizes well to unlabeled data.
I've laid some of the theoretical foundations for continuing to study machine learning from a probabilistic perspective. Along the way I've gained an appreciation for thinking mathematically in general, and a desire to continue to strengthen my foundations in math.
I've continued a breadth-first perusal of the huge field of machine learning and am better able to connect the dots with some of the foundations of statistics and probability theory under my belt.

Applied Machine Learning

Working through most of Python Machine Learning turned out to be a perfect first step in getting my hands dirty applying ML. As an experienced programmer, this was mostly pure fun; a lot of the nitty gritty details in thinking about getting an environment set up to work with Jupyter notebooks and scikit-learn, and thinking about how to pipe data through various transformations and making it all work in Python came pretty naturally. The author is a gifted teacher and when he delves into the underlying theory, he does a nice job at making it approachable.

A quick brain dump of what I've covered here:

how supervised learning problems are framed
the process of working with a data set
- preprocessing and cleaning
- exploring relationships of variables
- building a pipeline to combine with different classifiers
- reducing dimensionality
- examining feature importance
- evaluating generality of model with cross-validation
- understanding whether more data would benefit your model with learning curves
- tuning parameters of model
models & techniques
- gradient descent
- logistic regression
- decision trees
- ensemble learning: bagging trees to make a random forest

One chapter I didn't cover yet that I'd like to soon is regression. I get the idea of regression and feel like I could plug in scikit-learn models like linear and random forest regression, so it didn't feel as compelling to study compared with some of the probability theory I spent time on instead. But it's still worth going through the exercise, and there are some data sets I'd like to play with that would require regression (e.g. predicting the margin of victory of an NBA game).

Probability Fundamentals

I've studied these specific topics as I work through Wasserman's All of Statistics:

axioms of probability
- a mathematical definition of probability: sample space, sigma algebra (the collection of measurable subsets of the sample space, i.e. the events), and a mapping from every event to a number
random variables
- mathematical definition
- transformation of random variables
basic statistics and sampling theory
probability distributions: densities and cumulative distribution functions
joint probability distributions, marginal distributions
conditional probability
expectation
probability inequalities

I've also begun to study the convergence of random variables, but have found my rusty recollection of real analysis, and of the convergence of sequences of regular ol' numbers, to be a bit lacking. This motivates wanting to study real analysis. But I've gotten far enough to appreciate the Central Limit Theorem and some of its implications, even if I couldn't prove it :)

On a related note, I wrote a review of this book on Amazon if you're curious about more of my thoughts on it.

So I only barely covered the first half of that book and never really got into the inference part. Doh! But I think this provides a realistic sense of what one might hope to cover in a few months, particularly if also studying other stuff along the way. I spent plenty of time dabbling in some of the more advanced inference techniques, both Bayesian and frequentist, but wouldn't say I've studied them rigorously. I think I'm merely ready to study these at this point:

I have a good understanding of how the concepts of random variables and joint distributions relate to datasets
I'm getting more comfortable reading and writing math notation
I'm beginning to understand how probability theory is useful not only in the definition of some predictive models but also in reasoning about the predictive quality of models

Math fundamentals

One area of study that I didn't anticipate emerging was what you might call mathematical reasoning. I'll excerpt from Introduction to Mathematical Thinking to provide some motivation:

But during the nineteenth century, as mathematicians tackled problems of ever greater complexity, they began to discover that their intuitions were sometimes inadequate to guide their work...This introspection led, in the middle of the nineteenth century, to the adoption of a new and different conception of the mathematics, where the primary focus was no longer on performing a calculation or computing an answer, but formulating and understanding abstract concepts and relationships. This was a shift in emphasis from doing to understanding. Mathematical objects were no longer thought of as given primarily by formulas, but rather as carriers of conceptual properties. Proving something was no longer a matter of transforming terms in accordance with rules, but a process of logical deduction from concepts.

This revolution—for that is what it amounted to—completely changed the way mathematicians thought of their subject. Yet, for the rest of the world, the shift may as well have not occurred. The first anyone other than professional mathematicians knew that something had changed was when the new emphasis found its way into the undergraduate curriculum. If you, as a college math student, find yourself reeling after your first encounter with this “new math,” you can lay the blame at the feet of the mathematicians Lejeune Dirichlet, Richard Dedekind, Bernhard Riemann, and all the others who ushered in the new approach.

and then

Over many years, we have grown accustomed to the fact that advancement in an industrial society requires a workforce that has mathematical skills. But if you look more closely, those skills fall into two categories. The first category comprises people who, given a mathematical problem (i.e., a problem already formulated in mathematical terms), can find its mathematical solution. The second category comprises people who can take a new problem, say in manufacturing, identify and describe key features of the problem mathematically, and use that mathematical description to analyze the problem in a precise fashion.

In the past, there was a huge demand for employees with type 1 skills, and a small need for type 2 talent. Our mathematics education process largely met both needs. It has always focused primarily on producing people of the first variety, but some of them inevitably turned out to be good at the second kind of activities as well. So all was well. But in today’s world, where companies must constantly innovate to stay in business, the demand is shifting toward type 2 mathematical thinkers—to people who can think outside the mathematical box, not inside it. Now, suddenly, all is not well.

There will always be a need for people with mastery of a range of mathematical techniques, who are able to work alone for long periods, deeply focused on a specific mathematical problem, and our education system should support their development. But in the twenty-first century, the greater demand will be for type 2 ability. Since we don’t have a name for such individuals (“mathematically able” or even “mathematician” popularly imply type 1 mastery), I propose to give them one: innovative mathematical thinkers.

This new breed of individuals (well, it’s not new, I just don’t think anyone has shone a spotlight on them before) will need to have, above all else, a good conceptual (in an operational sense) understanding of mathematics, its power, its scope, when and how it can be applied, and its limitations. They will also have to have a solid mastery of some basic mathematical skills, but that skills mastery does not have to be stellar. A far more important requirement is that they can work well in teams, often cross-disciplinary teams, they can see things in new ways, they can quickly learn and come up to speed on a new technique that seems to be required, and they are very good at adapting old methods to new situations.

I stumbled upon this by way of studying the axioms of probability. At first I found it really hard to understand things presented in this way. I wanted an "intuitive" description. But in trying over and over again to fully understand the material as well as the problem sets I started to see some of the beauty in feeling comfortable reasoning about concepts this way, and not always wanting to shy away from mathematical notation; shying away means waiting around for someone else to provide a much more verbose description of the formula. It’s kind of like not being able to read a map and instead waiting for someone to write out a series of turn by turn directions. Wouldn’t you like to be able to read math natively? It's clear that the pioneers of ML are really good at thinking mathematically, and often what they bring to the table is precisely framing problems involving uncertainty and inference mathematically from a probabilistic perspective.

Where this leaves me

I need to spend some more time planning my studies for the next few months as I start my job but so far this is what I'm imagining:

Deeper dive into machine learning from a probabilistic perspective. Specifically, the book Machine Learning: A Probabilistic Perspective seems like a perfect next step, as the sometimes dense mathematical exposition feels approachable now. Some of it will be review, but a lot of it will be a natural follow-up to my probability studies; I'll likely make this my primary roadmap instead of the second half of Wasserman.
Bring my practical skills to larger scale datasets: I'd like to be able to do all the things I'm now comfortable with in scikit-learn and notebooks in a distributed fashion. Part of this will be devops stuff like bringing up a cluster of machines hosting Docker containers, using Spark, some of the deep learning frameworks (MXNet, TensorFlow) etc., and another will be gaining comfort authoring custom transformation logic (e.g. a preprocessing technique) that could work at scale.
Continued study of some math fundamentals. I mentioned real analysis as one topic. I also might work my way through Thinking Mathematically.
Specific study of deep learning. Surprisingly, I hardly touched deep learning during the past few months. Being the hottest most advanced technique out there it was quite tempting to dive in, but I felt it was smarter to start with the fundamentals. But now's the time to really get into it. I already understand how a multi-layer perceptron could be plugged into my usual supervised learning evaluation playbook, but I have a lot to learn about convolutional networks, recurrent networks, how deep learning can automatically learn features and other topics specific to computer vision. Thankfully all of this will be required by my new job.

I also am bursting with ideas and would like to keep playing with datasets, and chipping away at my csv -> notebook automatic data science starter kit side-project, but we'll see.

Disappointing improvements using one-hot / binary encoding, improving performance with help of Python profiler

2016-08-30T00:00:00Z

I put my new categorical encoder to use in a follow up attempt to the Red Hat business Kaggle competition.

The first thing I ran into was how slow it was to encode the variables with thousands of values. I did some profiling to see if there were any huge wins to be had:

$ python -m cProfile -o profile.bin $(which py.test) tests/test_preprocessing_transforms.py::test_profile_omniencode_redhat

and then

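import pstats
# Load the saved profile, drop directory prefixes, and sort by cumulative time.
p = pstats.Stats('profile.bin')
p.strip_dirs()
p.sort_stats('cumtime')
# Only show entries whose file name matches the module under test.
p.print_stats('preprocessing_transforms')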

As I suspected, using list comprehensions instead of finding ways to do everything with numpy arrays was slow:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.006    0.006   11.809   11.809 test_preprocessing_transforms.py:15(test_profile_omniencode_redhat)
        1    0.000    0.000   10.213   10.213 preprocessing_transforms.py:48(transform)
        1    0.000    0.000   10.209   10.209 preprocessing_transforms.py:50(<listcomp>)
        4    0.090    0.022   10.209    2.552 preprocessing_transforms.py:54(_encode_column)
    40000    0.155    0.000    4.701    0.000 preprocessing_transforms.py:65(splat)
        1    0.000    0.000    1.050    1.050 test_preprocessing_transforms.py:4(<module>)
        1    0.000    0.000    0.817    0.817 preprocessing_transforms.py:1(<module>)
    24772    0.079    0.000    0.079    0.000 preprocessing_transforms.py:70(<listcomp>)
    40000    0.073    0.000    0.073    0.000 preprocessing_transforms.py:66(<listcomp>)
        1    0.000    0.000    0.037    0.037 preprocessing_transforms.py:44(fit)
        1    0.000    0.000    0.037    0.037 preprocessing_transforms.py:45(<dictcomp>)
        4    0.001    0.000    0.037    0.009 preprocessing_transforms.py:79(_column_info)
        1    0.000    0.000    0.004    0.004 preprocessing_transforms.py:21(transform)
        4    0.000    0.000    0.001    0.000 preprocessing_transforms.py:90(_partition_one_hot)
        4    0.000    0.000    0.000    0.000 preprocessing_transforms.py:59(<dictcomp>)
     1746    0.000    0.000    0.000    0.000 preprocessing_transforms.py:102(<lambda>)
        4    0.000    0.000    0.000    0.000 preprocessing_transforms.py:105(<listcomp>)
        4    0.000    0.000    0.000    0.000 preprocessing_transforms.py:61(<listcomp>)
        4    0.000    0.000    0.000    0.000 preprocessing_transforms.py:110(_num_onehot)
       41    0.000    0.000    0.000    0.000 preprocessing_transforms.py:127(capacity)
        4    0.000    0.000    0.000    0.000 preprocessing_transforms.py:63(<listcomp>)
       41    0.000    0.000    0.000    0.000 preprocessing_transforms.py:122(num_bin_vals)

however, after updating a list comprehension to a more performant approach to building a numpy array of integers representing the bits for a binary encoded value, I only saw a 20% boost; not bad, but still over a 50x slowdown compared to ignoring or ordinally encoding the columns with lots of unique values:

Evaluating random forest ignore
 _Starting fitting full training set
 _Finished fitting full training set in 3.86 seconds
 _Starting evaluating on full test set
  Full test accuracy (0.05 of dataset): 0.880
 _Finished evaluating on full test set in 16.32 seconds
Evaluating random forest ordinal
 _Starting fitting full training set
 _Finished fitting full training set in 4.26 seconds
 _Starting evaluating on full test set
  Full test accuracy (0.05 of dataset): 0.885
 _Finished evaluating on full test set in 16.10 seconds
Evaluating random forest omni 20
 _Starting fitting full training set
 _Finished fitting full training set in 376.31 seconds
 _Starting evaluating on full test set
  Full test accuracy (0.05 of dataset): 0.885
 _Finished evaluating on full test set in 1050.23 seconds
Evaluating random forest omni 50
 _Starting fitting full training set
 _Finished fitting full training set in 417.19 seconds
 _Starting evaluating on full test set
  Full test accuracy (0.05 of dataset): 0.886
 _Finished evaluating on full test set in 1102.41 seconds

and worst of all, the new encoder, for all the trouble I went through to write it, performed no better than ordinal encoding (simply assuming the thousands of unique values could be mapped to a sequence of numbers), which runs contrary to results reported elsewhere. Side note: I also profiled that author's binary encoder and it was just as slow as mine.

So I'm glad to have my OmniEncoder as a tool ready to apply to other data sets, but it was disappointing to see it didn't do anything for me on this particular dataset.

Creating a confidence interval using Hoeffding's inequality

2016-08-29T00:00:00Z

Wrapping up the reading on inequalities (Wasserman's All of Statistics Chapter 4), the book shows how Hoeffding's inequality can be used to construct a confidence interval on the parameter $p$ for a binomial random variable.

Starting with Hoeffding's inequality (as applied to a series of samples of a Bernoulli, e.g a series of coin flips):

$$ P(|\bar{X}_n - p| > \epsilon) \leq 2e^{-2n\epsilon^2} $$

we previously showed how this can be used to place bounds on the true error rate of an estimator given the observed error rate of a classifier on a training set. Switching examples to a series of coin flips, the analogous question is: if we flip a coin with true weight $p$ 100 times, what is the probability that the observed fraction of heads is within 0.2 of $p$?

What if, instead, we'd like to place +/- bounds on the number of heads to expect given a particular weight? We ask: if I flip the coin 100 times, how many heads on either side of $100p$ will trap the observed number of heads, say, 90% of the time? Another way of saying this is, what is the 90% confidence interval on coin flips?

Given a particular probability bound $\alpha$:

$$2e^{-2n\epsilon^2} = \alpha$$

we can solve for $\epsilon$. We'll call this $\epsilon_n$:

$$\epsilon_n = \sqrt{\frac{1}{2n}\log\left(\frac{2}{\alpha}\right)}$$

leaving us with

$$ P(|\bar{X}_n - p| > \epsilon_n) \leq \alpha $$

now if we consider the random interval $C = (\bar{X}_n - \epsilon_n, \bar{X}_n + \epsilon_n)$, the probability that $p$ falls outside of the interval is $P(p \notin C) = P(|\bar{X}_n - p| > \epsilon_n) \leq \alpha$. This means $P(p \in C) \geq 1 - \alpha$; the random interval $C$ traps the true parameter $p$ with probability at least $1-\alpha$. We call $C$ a $1-\alpha$ confidence interval.

To make this a bit more concrete, I created a gsheet that explores the confidence intervals around coin flip experiments. I set $p$ to 0.5 for the sake of example, but the width of the interval is independent of $p$.
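
For anyone who would rather poke at this in code than in a spreadsheet, here is a minimal sketch of the same calculation. The function names are my own, and clipping the interval to $[0, 1]$ is an extra convenience that is not part of the derivation above.

```python
import math


def hoeffding_epsilon(n, alpha):
    """Half-width eps_n = sqrt(log(2 / alpha) / (2 n)) of the Hoeffding interval."""
    return math.sqrt(math.log(2.0 / alpha) / (2.0 * n))


def hoeffding_interval(x_bar, n, alpha):
    """(1 - alpha) confidence interval for p given the observed fraction of heads x_bar."""
    eps = hoeffding_epsilon(n, alpha)
    # Clip to [0, 1] since p is a probability (an addition for convenience).
    return max(0.0, x_bar - eps), min(1.0, x_bar + eps)


eps = hoeffding_epsilon(n=100, alpha=0.10)
low, high = hoeffding_interval(x_bar=0.5, n=100, alpha=0.10)
print(f"eps_n = {eps:.4f}, interval = ({low:.4f}, {high:.4f})")
```

With 100 flips and $\alpha = 0.1$ the half-width comes out to roughly 0.12, so about 12 heads on either side of the observed count. Since $\epsilon_n$ scales as $1/\sqrt{n}$, quadrupling the number of flips halves the width.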

Recovering PGM course materials

I was annoyed to find that the materials for the coursera course on probabilistic graphical models have been removed pending a relaunch of the course on some new platform. Thankfully I found a torrent of the materials. To save you the trouble of navigating shady blogs to find the torrent file, I'll host it here too for a bit.

A better categorical encoder

2016-08-26T00:00:00Z

Following up on the learnings from the Kaggle Red Hat business challenge attempt, and inspired by the work of others, I implemented a scikit-learn pipeline transformer that gracefully handles categorical variables with tons of values. It works by having a maximum number of columns a given variable will expand out into (say, 20) and then one-hot encoding as many of the frequently occurring values as possible, leaving enough space to binary encode the rest.

I considered that perhaps this should be broken down into two transformers: one for one-hot, one for binary encoding, but I think there is enough commonality that placing them together makes sense. For instance, in both cases you need a frequency count of the unique values showing up in a column. And the concern about how many one-hot columns to create is related to the number of columns to leave available for binary encoding.

In the absence of me publishing my project open source yet, here is an example use case:

>>> OmniEncoder(max_cols=4).fit_transform(
...             pd.DataFrame(
...                 columns=['color'],
...                 data=[
...                     ['green'],
...                     ['red'],
...                     ['green'],
...                     ['yellow'],
...                     ['orange'],
...                     ['red'],
...                     ['purple']
...                 ]
...             )
...         )
   color_green  color_red  color_10  color_01
0            1          0         0         0
1            0          1         0         0
2            1          0         0         0
3            0          0         1         1
4            0          0         0         1
5            0          1         0         0
6            0          0         1         0

Given we need to encode 5 colors in a maximum of 4 columns, we only have room for 2 one-hot encoded columns, which are assigned to the two most frequently occurring colors. The remaining three are then encoded within two binary columns.
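
The column budget can be worked out on its own. The following is a standalone sketch of that arithmetic (a plain function rather than the encoder's static method, with names of my choosing): each binary column group of `bits` columns can distinguish `2**bits - 1` values, because the all-zeros pattern is left for values that are not binary encoded.

```python
def num_onehot(n, k):
    """Largest number of one-hot columns such that n values still fit in k columns."""

    def capacity(oh):
        bits = k - oh
        # `bits` binary columns distinguish 2**bits - 1 values; all-zeros is reserved.
        return oh + (2 ** bits - 1 if bits > 0 else 0)

    oh = min(n, k)
    while oh > 0 and capacity(oh) < n:
        oh -= 1
    return oh


# The color example: 5 distinct values, at most 4 columns.
print(num_onehot(5, 4))  # 2
```

Here 3 one-hot columns would leave 1 binary column, for a capacity of 3 + 1 = 4 values, which is one short; 2 one-hot columns leave 2 binary columns, for 2 + 3 = 5.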

and here is the source for the encoder:

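import heapq

import pandas as pd


# BaseTransformer is not shown in this post; it is assumed to be the project's
# own scikit-learn style transformer base class.
class OmniEncoder(BaseTransformer):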
    """
    Encodes a categorical variable using no more than k columns. As many values as possible
    are one-hot encoded, the remaining are fit within a binary encoded set of columns.
    If necessary some are dropped (e.g if (#unique_values) > 2^k).

    In deciding which values to one-hot encode, those that appear more frequently are
    preferred.
    """
    def __init__(self, max_cols=20):
        self.column_infos = {}
        self.max_cols = max_cols
        if max_cols < 3 or max_cols > 100:
            raise ValueError("max_cols {} not within range(3, 100)".format(max_cols))

    def fit(self, X, y=None, **fit_params):
        self.column_infos = {col: self._column_info(X[col], self.max_cols) for col in X.columns}
        return self

    def transform(self, X, **transform_params):
        return pd.concat(
            [self._encode_column(X[col], self.max_cols, *self.column_infos[col]) for col in X.columns],
            axis=1
        )

    @staticmethod
    def _encode_column(col, max_cols, one_hot_vals, binary_encoded_vals):
        num_one_hot = len(one_hot_vals)
        num_bits = max_cols - num_one_hot if len(binary_encoded_vals) > 0 else 0

        binary_val_to_index = {val: idx + 1 for idx, val in enumerate(binary_encoded_vals)}

        bit_cols = [format(2 ** i, 'b').zfill(num_bits) for i in reversed(range(num_bits))]

        col_names = ["{}_{}".format(col.name, val) for val in one_hot_vals] + ["{}_{}".format(col.name, bit_col) for bit_col in bit_cols]

        def splat(v):
            v_one_hot = [1 if v == ohv else 0 for ohv in one_hot_vals]

            v_bits = [0]*num_bits
            if v in binary_val_to_index:
                v_bits = [int(b) for b in format(binary_val_to_index[v], 'b').zfill(num_bits)]

            return pd.Series(v_one_hot + v_bits)

        df = col.apply(splat)
        df.columns = col_names

        return df

    @staticmethod
    def _column_info(col, max_cols):
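        """
        Counts the unique values in a column and splits them into the values to
        one-hot encode and the values to binary encode.

        :param col: pd.Series
        :param max_cols: the maximum number of columns to encode into
        :return: ['val1', 'val2'], ['val55', 'val59']
        """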
        val_counts = dict(col.value_counts())
        num_one_hot = OmniEncoder._num_onehot(len(val_counts), max_cols)
        return OmniEncoder._partition_one_hot(val_counts, num_one_hot)

    @staticmethod
    def _partition_one_hot(val_counts, num_one_hot):
        """
        Partitions the values in val_counts into a list of values that should be
        one-hot encoded and a list of values that should be binary encoded.

        The `num_one_hot` most popular values are chosen to be one-hot encoded.

        :param val_counts: {'val': 433}
        :param num_one_hot: the number of elements to be one-hot encoded
        :return: ['val1', 'val2'], ['val55', 'val59']
        """
        one_hot_vals = [k for (k, count) in heapq.nlargest(num_one_hot, val_counts.items(), key=lambda t: t[1])]
        one_hot_vals_lookup = set(one_hot_vals)

        bin_encoded_vals = [val for val in val_counts if val not in one_hot_vals_lookup]

        return sorted(one_hot_vals), sorted(bin_encoded_vals)


    @staticmethod
    def _num_onehot(n, k):
        """
        Determines the number of onehot columns we can have to encode n values
        in no more than k columns, assuming we will binary encode the rest.

        :param n: The number of unique values to encode
        :param k: The maximum number of columns we have
        :return: The number of one-hot columns to use
        """
        num_one_hot = min(n, k)

        def num_bin_vals(num):
            if num == 0:
                return 0
            return 2 ** num - 1

        def capacity(oh):
            """
            Capacity given we are using `oh` one hot columns.
            """
            return oh + num_bin_vals(k - oh)

        while capacity(num_one_hot) < n and num_one_hot > 0:
            num_one_hot -= 1

        return num_one_hot

A Wrinkle in Universal Preprocessing of Dataframes

2016-08-24T00:00:00Z

Challenges in automatic preprocessing

I took a crack at another Kaggle competition, Predicting Red Hat Business Value, in part to get another data point in evaluating my seemingly universal approach to building a preprocessing function:

one-hot encode categorical variables
impute missing values for quantitative variables

There are a few nuances to doing this well, including avoiding one-hot encoding binary variables, and doing some additional preprocessing before feeding the data into algorithms that prefer scaled data. But one of the biggest premises of my project is that this boring somewhat tedious exercise of building a preprocessing pipeline could be automated.

The problem that kept me from quickly getting to a point where I'd preprocessed the data and could feed it into a couple of first classifiers (I usually start with logistic regression and random forest) was that some of the categorical variables had thousands of possible values, so applying one-hot encoding resulted in a dataframe with almost 10,000 columns! This exhausted the RAM of my laptop. Even after sampling down to 1% of the data so that it would run, it was an unreasonable amount of dimensionality to be working with.

So in my first attempt, I simply dropped the problematic columns. From there, my usual approach got me up to the mid 80s in performance, nothing impressive, particularly given that there's a trick everyone discovered that means the top 100 submissions are above 99%.

Ideally there would be an approach that could handle categorical variables like this more gracefully. I'm not the first one to observe / play with this. So far I've tried a basic approach of re-including the categorical columns with many unique values by converting them to a single quantitative column. The updated pipeline looks like this:

preprocessor = Pipeline([
    ('features', DfFeatureUnion([
        ('quantitative', Pipeline([
            ('combine-q', DfFeatureUnion([
                ('highd', Pipeline([
                    ('select-highd', ColumnSelector(high_dim_cat_columns)),
                    ('encode-highd', EncodeCategorical())                        
                ])),
                ('select-quantitative', ColumnSelector(q_columns, c_type='float')),
            ])),
            ('impute-missing', DfTransformerAdapter(Imputer(strategy='median'))),
            ('scale', DfTransformerAdapter(StandardScaler()))
        ])),
        ('categorical', Pipeline([
            ('select-categorical', ColumnSelector(cat_columns)),
            ('apply-onehot', DfOneHot()),
            ('spread-binary', SpreadBinary())
        ])),
    ]))
])

where EncodeCategorical maps each unique value to a sequential number. This boosted my performance from 85% to 88%.
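
The source for EncodeCategorical isn't shown here, so the following is only a hypothetical sketch of the idea in plain Python. The function names, the sorted ordering, and reserving 0 for values not seen during fitting are all my assumptions.

```python
def fit_ordinal(values):
    """Map each unique value to 1, 2, 3, ... in sorted order."""
    return {val: idx + 1 for idx, val in enumerate(sorted(set(values)))}


def transform_ordinal(values, mapping):
    """Replace each value with its number; unseen values get 0 (an assumption)."""
    return [mapping.get(v, 0) for v in values]


mapping = fit_ordinal(["b", "a", "c", "a"])
print(mapping)                                    # {'a': 1, 'b': 2, 'c': 3}
print(transform_ordinal(["c", "a", "z"], mapping))  # [3, 1, 0]
```

Note that this imposes an arbitrary ordering on the values, which is exactly the worry with treating a categorical column as a quantitative one.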

I think my next step in handling this case will be to have some rules of thumb for encoding based on the number of unique values a categorical variable has:

less than 20 values: one-hot encode
more than that: create log_2(num_unique_values) columns and assign the bits of the quantitative encoding to each.

The latter option is explored in the aforementioned post, and it seems like that gives nearly the performance of one-hot encoding while obviously having fewer resulting dimensions. The post finds that

it seems that with decent consistency binary coding performs well, without a significant increase in dimensionality. Ordinal, as expected, performs consistently poorly.
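
To make the binary option concrete, here is a rough sketch in plain Python. It is not the code from that post; the function name and the choice to number values in sorted order starting from 0 are mine.

```python
def binary_encode(values):
    """Encode each value as the bits of its ordinal code, one 0/1 entry per column."""
    codes = {val: idx for idx, val in enumerate(sorted(set(values)))}
    # ceil(log2(number of unique values)), computed with integers; at least 1 column.
    num_bits = max(1, (len(codes) - 1).bit_length())
    rows = [
        [(codes[v] >> bit) & 1 for bit in reversed(range(num_bits))]
        for v in values
    ]
    return rows, num_bits


rows, num_bits = binary_encode(["red", "green", "blue", "red", "teal", "pink"])
print(num_bits)  # 3 columns for 5 unique values
print(rows[0])   # red has code 3 -> [0, 1, 1]
```

By the same arithmetic, a column with 10,000 unique values needs only 14 binary columns instead of 10,000 one-hot columns.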

Project update

My automatic data science project is plodding along. I have code to build a preprocessing pipeline from a list of quantitative, categorical and binary (e.g. already one-hot encoded) variables, as well as to generate code that could be pasted into a cell. So I should be getting close to generating a real notebook, as the rest of the steps are pretty straightforward once you have a preprocessing pipeline to combine with various classifiers; each notebook has a cell like this:

pipe_lr = Pipeline([
        ('wrangle', preprocessor),
        ('lr', LogisticRegression(C=100.0, random_state=0))
    ])

pipe_rf = Pipeline([
        ('wrangle', preprocessor),
        ('rf', RandomForestClassifier(criterion='entropy', n_estimators=10, random_state=0))
    ])

followed by cross validation.

Probability inequalities applied to bounding expected loss of classifiers

2016-08-23T00:00:00Z

Inequalities applied to bounding expected loss of classifiers

I returned to reading on Probability Inequalities that I started weeks ago with Markov's and Chebyshev's inequalities. Reminding myself,

Inequalities are useful for bounding quantities that might otherwise be hard to compute.

I was excited to find a concrete link between this reading and some of the decision theory I dived into and took some notes on last week.

We can model training data as a series of IID samples drawn from a joint distribution:

$$ (X_1, Y_1), ... (X_n, Y_n) \sim P_{X,Y} $$

and a classifier as a function $\hat{Y} = f(X)$. The expected loss of the classifier is:

$$ \sum_{y \neq \hat{y}} p(y|x) $$

and the empirical risk on our training set is:

$$ \frac{1}{n} \sum_{i=1}^{n} I(y_i \neq \hat{y}_i) $$

i.e. the fraction of times our classifier is wrong on our training set. What can we say about the true error rate of our classifier?

We can think of each sample from our training set as a Bernoulli trial (a coin flip) with unknown mean $p$. The true mean would tell us our true error rate, and the probability that any given sample is misclassified. What we do know is the observed or sample mean $\bar{X}_n$ noted above.

We can use probability inequalities to help us here! First let's use Chebyshev's inequality:

$$ P(|X - \mu| \geq t) \leq \frac{\sigma^2}{t^2} $$

we can apply this to reason about how likely our sample mean $\bar{X}_n$ is to be within some bound $\epsilon$ of $p$:

$$ P(|\bar{X}_n - p| > \epsilon) \leq \frac{V(\bar{X}_n)}{\epsilon^2} = \frac{p(1-p)}{n\epsilon^2} \leq \frac{1}{4n\epsilon^2} $$

since $p(1-p) \leq \frac{1}{4}$ for all $p$. So for instance, say we have 100 samples and wish to know how likely it is that the observed error rate is within 0.2 of the true error rate. We plug in $\epsilon = 0.2$ and $n=100$ to get a bound of $0.0625$ on the probability of a larger gap, so the two are within 0.2 of each other with probability at least $0.9375$.

This is kind of neat. We can bound, with a stated probability, how far the true error rate of our classifier can be from the observed one.

Hoeffding's Inequality

Hoeffding's inequality as applied to a Bernoulli random variable provides a tighter bound than Chebyshev's:

$$ P(|\bar{X}_n - p| > \epsilon) \leq 2e^{-2n\epsilon^2} $$

(note this is proven in an appendix of All of Statistics as well as on the Wikipedia page)

plugging in the same parameters, we get

$$ P(|\bar{X}_n - p| > 0.2) \leq 2e^{-2(100)(0.2)^2} = 0.00067 $$

so with 100 samples, we can be pretty damn certain that the observed error rate is within 0.2 (20 percentage points) of the true error rate.
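
As a quick sanity check on the two numbers above, here is a small sketch that evaluates both bounds for the same $n$ and $\epsilon$. The function names are my own.

```python
import math


def chebyshev_bound(n, eps):
    """Upper bound 1 / (4 n eps^2) on P(|X_bar - p| > eps) for Bernoulli samples."""
    return 1.0 / (4.0 * n * eps ** 2)


def hoeffding_bound(n, eps):
    """Upper bound 2 exp(-2 n eps^2) on P(|X_bar - p| > eps) for Bernoulli samples."""
    return 2.0 * math.exp(-2.0 * n * eps ** 2)


for n in (100, 1000):
    print(n, chebyshev_bound(n, 0.2), hoeffding_bound(n, 0.2))
```

For $n = 100$ this reproduces 0.0625 and roughly 0.00067. The Chebyshev bound shrinks like $1/n$ while the Hoeffding bound shrinks exponentially in $n$, which is why the gap widens so quickly as samples are added.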

Machine Learning from a Decision Theoretic perspective with the help of MathMonk

2016-08-19T00:00:00Z

I spent a couple of mornings this week diving into more of MathMonk's videos on Machine Learning. I've mentioned his probability theory playlist before but in reviewing my curriculum resources I was reminded he also has an extensive playlist on machine learning. After perusing a bit I was pleased to find that my understanding of probability theory allowed me to mostly follow a lot of his big picture framing of machine learning from a probabilistic perspective.

The following runs through a lot of notes I took while watching the following videos carefully:

generative vs discriminative models
decision theory parts one, two, three, four
The big picture parts one, two, three

My big takeaway is that machine learning can be framed as reasoning about a joint distribution over inputs $X$, outputs $Y$, and the data you have, while also treating the parameters of any model as a random variable $\Theta$.

$p(y|x, D) = \int p(y|x, D, \theta) p(\theta|x, D) d\theta$

Since integrating over all parameters is hairy, one approach is called "point estimation" where you find the optimal set of parameters and plug those in. This is the approach taken by many of the supervised learning algorithms I've studied so far where you use something like gradient descent to estimate which values for the parameters yield the least cost.

It's interesting to think about actually integrating over all parameters of the model instead, which could give you a better model for the conditional distribution. This is where a technique like Monte Carlo methods could come into play: you could approximate the integral by randomly sampling the parameters and averaging the resulting predictions.
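
Here is a toy sketch of that idea, under assumptions that are entirely mine: a coin-flip model with no inputs $x$, a uniform prior on $\theta$, and data consisting of a count of heads and tails. In that setting the posterior is a Beta distribution and the integral has a known closed form, which makes it easy to check the sampling approximation.

```python
import random


def mc_posterior_predictive(heads, tails, num_samples=100_000, seed=0):
    """Approximate p(y=1 | D) = integral of theta * p(theta | D) d(theta) by sampling."""
    rng = random.Random(seed)
    # Posterior is Beta(a, b) under a uniform Beta(1, 1) prior (an assumed toy model).
    a, b = 1 + heads, 1 + tails
    total = 0.0
    for _ in range(num_samples):
        theta = rng.betavariate(a, b)  # theta ~ p(theta | D)
        total += theta                 # p(y=1 | theta) = theta
    return total / num_samples


def exact_posterior_predictive(heads, tails):
    """Closed form for the same integral: the mean of Beta(1 + heads, 1 + tails)."""
    return (1 + heads) / (2 + heads + tails)


print(mc_posterior_predictive(7, 3), exact_posterior_predictive(7, 3))
```

In real models the posterior usually can't be sampled this directly, which is where MCMC and friends come in.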

While I was able to follow the videos now, boy it sure still is hairy. It's humbling to think that it will likely take a continued dedication to study for months / years to come before I could hope to really master all of this stuff, but at least I'm finding that some of the advanced generative model stuff is feeling more approachable.

Generative / discriminative

Notes from this video

Discriminative: p(y|x)

some are probabilistic: logistic regression
some aren't: tree based, svms, etc

Generative (joint): p(x,y)

f(x|y)p(y)
- first choose a y, according to its marginal distribution. then we choose a point. "the generative process"
- could keep doing this, you could generate a data set
p(y|x)f(x)
assuming you have a density, is more powerful
requires estimating a density, which is hard

In order to grok generative models, stats comes in handy

Decision Theory

Notes from decision theory parts one, two, three, four.

goal: minimize expected loss

loss function: $L(y, \hat{y})$

can penalize false negatives, false positives differently depending on goal of classifier (e.g might highly penalize false negative in medical testing, or false positive in spam detection)

Examples:

"0-1 loss" $L(y, \hat{y}) = I(y \neq \hat{y}) = \begin{cases} 0 & y = \hat{y} \\ 1 & otherwise \end{cases}$.
"square loss" $L(y, \hat{y}) = (y-\hat{y})^2$
Can penalize differently based on confusion matrix, or combinations / ratios therein (e.g precision & recall, f1 score)

For Supervised Learning:

We are given a labeled data set $(x_1, y_1), \ldots, (x_n, y_n)$. For some new $x$ we wish to predict the true $y$; our prediction is $\hat{y}$.

First attempts

Given some new $x$, minimize $L(y, \hat{y})$... but we don't know the true $y$.

Choose $f$ where $f(x) = \hat{y}$ to minimize $L(y, f(x))$... but for what $x$'s? And we still don't know $y$.

Making this concrete with Probability

$X$ and $Y$ are random variables with a joint distribution $(X,Y) \sim p$. We minimize the average loss, i.e. the conditional expectation of the loss given an observation.

$E(L(Y, \hat{y}) | X = x) = \sum_{y} L(y, \hat{y}) p(y|x)$

This is now a well posed problem, though we are now faced with the challenge of finding or estimating $p(y|x)$.

Plugging in the "0-1" loss this becomes

$E(L(Y, \hat{y}) | X = x) = \sum_{y \neq \hat{y}} p(y|x) = 1 - p(\hat{y}|x)$

Note: we will abbreviate $p(y=\hat{y}|x)$ as $p(\hat{y} | x)$.

If we want to minimize this conditional expected loss:

$\hat{y} = \text{argmin}_{t} E(L(Y, t) | X = x) = \text{argmin}_{t} \big(1 - p(t|x)\big) = \text{argmax}_{t}\, p(t|x)$

That is, we choose $\hat{y}$ to be the most likely $y$ for a given $x$, the most likely class. The key quantity that we need to solve this problem is the conditional distribution $p(y|x)$.
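
A tiny numeric check of this, using a made-up conditional distribution $p(y|x)$ for one fixed $x$ (the class names and probabilities are purely illustrative):

```python
def expected_zero_one_loss(y_hat, p_y_given_x):
    """Sum of p(y | x) over all y different from the prediction y_hat."""
    return sum(p for y, p in p_y_given_x.items() if y != y_hat)


def bayes_prediction(p_y_given_x):
    """Prediction that minimizes the conditional expected 0-1 loss."""
    return min(p_y_given_x, key=lambda y_hat: expected_zero_one_loss(y_hat, p_y_given_x))


p = {"cat": 0.2, "dog": 0.5, "bird": 0.3}  # hypothetical p(y | x)
print(bayes_prediction(p))              # dog
print(expected_zero_one_loss("dog", p))  # 0.5, i.e. 1 - p(dog | x)
```

Minimizing the expected loss and taking the most likely class pick out the same answer, as the derivation says they should.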

How do we pose the problem for a predictive function $\hat{Y} = f(X)$?

we want to minimize

$E L(Y, \hat{Y}) = E L(Y, f(X)) \\ = \sum_{x, y} L(y, f(x)) p(x,y) \\ = \sum_{x, y} L(y, f(x)) p(y|x) p(x) \\ = \sum_x \Big (\sum_y L(y, f(x)) p(y|x) \Big) p(x)$

let's let $\sum_y L(y, f(x)) p(y|x) = g(x, f(x))$

$ \sum_x \Big (\sum_y L(y, f(x)) p(y|x) \Big) p(x) \\ = \sum_x g(x, f(x)) p(x) \\ = E^X(g(X, f(X)))$

(where $E^X$ is the expected value w.r.t the marginal distribution of $X$)

let's suppose for some $x', t$ that $g(x', f(x')) > g(x', t)$

and define

$f_0(x) = \begin{cases} f(x) & x \neq x' \\ t & x = x' \end{cases}$

so $\forall x,\ g(x, f(x)) \geq g(x, f_0(x))$ (with equality everywhere except at $x'$)

and since expectation is order preserving, we can now state that

$E^X g(X, f(X)) \geq E^X g(X, f_0(X))$

Let's choose $f$ to minimize $g(X, f(X))$

$f^*(x) = \text{argmin}_t g(x, t)$

(the value of $t$ that minimizes $g(x,t)$)

$E^X g(X, f(X)) \geq E^X g(X, f_0(X)) \geq E^X g(X, f^*(X))$

So we've found that to minimize $E L(Y, f(X))$ we can minimize pointwise for each $x$: we don't need the marginal distribution $p(x)$, and $p(y|x)$ is again the key quantity.

Note: showing this result used some functional analysis. Another way to put it is that we applied the law of iterated expectation which states that $E(E(Y|X)) = E(Y)$

$E L(Y, \hat{Y}) = E^X(E(L(Y, \hat{Y})|X))$ and we are minimizing $E(L(Y, \hat{Y})|X)$

Square Loss

$L(y, \hat{y}) = (y-\hat{y})^2$ with $(X,Y) \sim p$

note: in the case of regression, $Y$ is real valued and continuous, so $Y$ has a density function not a mass function.

$E(L(Y, \hat{y}) | X = x) = \int L(y, \hat{y}) p(y|x) dy \\ = \int (y - \hat{y})^2 p(y|x)dy$

let's suppose that $p(y|x)$ is smooth enough that we can differentiate under the integral, then set the derivative to zero

$0 = \frac{d}{d\hat{y}} E(L(Y, \hat{y}) | X=x) \\ = \frac{d}{d\hat{y}} \int (\hat{y} - y)^2 p(y|x) dy \\ = \int 2(\hat{y} - y) p(y|x) dy \\ = 2\hat{y} \int p(y|x) dy - 2\int y p(y|x) dy \\ = 2\hat{y} \cdot 1 - 2 E(Y|X=x)$

solving

$ 2\hat{y} \cdot 1 - 2 E(Y|X=x) = 0 \\ \hat{y} = E(Y|X=x)$

note: we can take another derivative and see that we have a minimum, not a maximum

$E(Y|X=x) = \text{argmin}_{\hat{y}} E(L(Y, \hat{y}) | X = x)$

this is a nice clean result. Given a particular $x$, we choose our predicted $\hat{y}$ to be the expected value of $y$ given $x$.

To generalize this to a predictive function for any $x$

$f(x) = E(Y|X=x)$
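
To convince myself numerically, here is a small sketch that searches a grid of candidate predictions for the one with the least expected square loss. It uses a made-up discrete $p(y|x)$ for one fixed $x$ rather than a density; the same result holds with the integral swapped for a sum.

```python
def expected_square_loss(y_hat, pmf):
    """Sum over y of (y - y_hat)^2 p(y | x)."""
    return sum(p * (y - y_hat) ** 2 for y, p in pmf.items())


pmf = {0.0: 0.2, 1.0: 0.5, 4.0: 0.3}          # hypothetical p(y | x) for one fixed x
mean = sum(y * p for y, p in pmf.items())      # E(Y | X = x) = 1.7

candidates = [i / 100 for i in range(0, 401)]  # grid of predictions from 0 to 4
best = min(candidates, key=lambda t: expected_square_loss(t, pmf))
print(best, mean)
```

The grid search lands on the conditional mean, not on the most likely value (which would be 1.0 here), so the best prediction really does depend on the loss being used.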

The big picture

From this video.

We are trying to minimize $E L(Y, f(X))$

how do many of the concepts and techniques of ML fall out of trying to solve this problem?

First, $p(y|x)$ is the key quantity we need to solve this problem in principle.

We have data $D = ((x_1, y_1), ..., (x_n, y_n))$

Discriminative

Estimate $p(y|x)$ directly using our data $D$

Examples: kNN, Trees, SVMs.

Generative

This data really comes from a generative process; you might miss out on important context if you don't make use of the marginal distributions for $X$ and/or $Y$.

We need to estimate the joint distribution $p(x,y)$ using D.

This is harder. But richer; we can always recover the conditional using $p(y|x) = \frac{p(x,y)}{p(x)}.$

The generative approach says,

$p(x,y) = p(x|y)p(y)$

that is, we can choose a y and then generate a sample x.
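
A minimal sketch of that generative process, assuming a made-up model: a binary label with $p(y=1) = 0.3$ and a one-dimensional Gaussian $p(x|y)$ whose mean depends on the class. None of these numbers come from the videos.

```python
import random


def generate_dataset(n, p_y1=0.3, means=(0.0, 3.0), std=1.0, seed=0):
    """Draw n (x, y) pairs by first choosing y ~ p(y), then x ~ p(x | y)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        y = 1 if rng.random() < p_y1 else 0  # choose a class from its marginal
        x = rng.gauss(means[y], std)         # then generate a point for that class
        data.append((x, y))
    return data


for x, y in generate_dataset(5):
    print(f"y={y}, x={x:.2f}")
```

Repeating the two steps over and over is all it takes to generate a whole data set from the model.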

Parameters / Latent Vars

Our model has parameters and/or latent variables $\Theta$. We will model this as a random variable.

$p(x,y | \Theta)$

We can integrate out $\Theta$ to recover our conditional distribution

$p(y|x, D) = \int p(y|x, D, \theta) p(\theta|x, D) d\theta$

we can often get a nice analytic expression for $p(y|x, D, \theta)$
we usually can't get a closed form expression for $p(\theta|x, D)$
the integral is usually nasty and can't be done analytically

Computing this problem exactly is often intractable.

Approaches

Next video: so what are the different approaches to solving:

$p(y|x, D) = \int p(y|x, D, \theta) p(\theta|x, D) d\theta$

Exact inference

Assume a nice enough model that facilitates exact inference for parts or all of the pieces of this puzzle

multivariate gaussian: can do everything analytically
conjugate priors
graphical models

Point estimates of $\Theta$

Maximum likelihood estimate (MLE) of $\Theta$
Maximum a posteriori estimate (MAP). $\Theta_{MAP} = \text{argmax}_{\theta} p(\theta | x, D)$. Plug in $p(y|x, D, \Theta_{MAP})$ for $p(y|x,D)$.
Optimization, expectation maximization (EM) to approximate
empirical bayes takes a point estimate for part of $\Theta$
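
To see how the first two point estimates differ, here is a sketch for the simplest model I could think of: a coin with unknown weight $\theta$ and a Beta$(a, b)$ prior. The model and the default prior are my choices for illustration, not something from the videos.

```python
def mle(heads, n):
    """Maximum likelihood estimate of theta: the observed fraction of heads."""
    return heads / n


def map_estimate(heads, n, a=2.0, b=2.0):
    """Mode of the Beta(a + heads, b + n - heads) posterior.

    The formula is valid when both posterior parameters exceed 1.
    """
    return (heads + a - 1) / (n + a + b - 2)


print(mle(3, 3))           # 1.0: three heads in a row looks like a two-headed coin
print(map_estimate(3, 3))  # 0.8: the prior pulls the estimate back toward 0.5
```

With a flat Beta(1, 1) prior the two estimates coincide, and as $n$ grows the prior's influence fades.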

Deterministic Approximation

Using some method to deterministically approximate this integral

Laplace Approximation
Variational methods
Expectation Propagation

Stochastic Approximation

MCMC (Gibbs sampling, Metropolis Hastings): approximate the integral
Importance Sampling: approximate expected values (particle filtering)

Generalizing outside of supervised learning

Density estimation

The problem we've defined is relevant to unsupervised techniques like density estimation as well

$D = (X_1, ..., X_n)$ where the $X_i$ are iid.

Goal: estimate the distribution these r.vs share.

Possible approaches:

histogram
kernel density estimation

but from a probabilistic perspective:

Params $\Theta$

Suppose that $D$ is generated by choosing a $\theta$, then draw $X_1, ..., X_n$ using $\theta$.

We need to compute:

$p(x | D) = \int p(x | \theta) p(\theta | D) d\theta$

$p(x | \theta)$ is nice
we call $p(\theta | D)$ our "posterior" distribution; it can be nasty
the integral can be nasty

So we run into similar problems as with the supervised learning case: attempting to work with nasty parts without closed forms and with hard to compute integrals.

Multi-stage sampling, Moment Generating Functions

2016-08-15T00:00:00Z

Hierarchical models

Example 3.28 in Wasserman (p. 56) runs through an example that I found interesting: choose a county at random, and then randomly sample n people. Note how many of those people have a disease. We can model this as conditional probability using two random variables:

X = number of people in the sample who have the disease
Q = the % of people in a county with a disease

Since Q varies from county to county it is a random variable. If we assume $Q \sim \text{Uniform}(0, 1)$, we now have:

$$ X|Q = q \sim \text{Binomial}(n, q) $$

I.e. the number of people with the disease in a sample from a given county is conditioned on the county randomly chosen and the proportion of people there who have the disease. Having posed the problem this way, we can find the expectation and variance using familiar tricks, e.g. $E(X) = E(E(X|Q)) = E(nQ) = nE(Q) = n / 2$.
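
The two-stage sampling is easy to simulate, which gives a quick check on $E(X) = n/2$. This is just a sketch; the function name, sample size and number of trials are arbitrary choices of mine.

```python
import random


def simulate_mean_cases(n, trials=20_000, seed=0):
    """Average number of sick people in a sample of n, over many random counties."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        q = rng.random()                                  # Q ~ Uniform(0, 1)
        x = sum(1 for _ in range(n) if rng.random() < q)  # X | Q = q ~ Binomial(n, q)
        total += x
    return total / trials


print(simulate_mean_cases(n=20))  # should land close to n / 2 = 10
```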

I like that this example and the concept of hierarchical models ties conditional probability to the idea of "multi-stage sampling" that I studied a ways back in Stanford's basic stats course.

Moment generating functions

Wasserman briefly covers Moment Generating functions, which are useful as an alternative approach to finding moments of random variables, and therefore the expected value (the 1st moment) and variance (combination of 1st and second moment).

The moment generating function, or Laplace transform of a random variable $X$ is defined as:

$$ \psi_X(t) = E(e^{tX}) = \int e^{tx}dF(x)$$

and it turns out that evaluating derivatives of $\psi$ at 0 yields successive moments of X:

$$ \psi^{(k)}(0) = E(X^k)$$

Aside from the mechanics of using these functions to find moments of a couple of random variables in examples, I was left wanting for more background.

In googling around I came across this book which had just a bit more background on the topic that I found helpful. Side note: this (free) book strikes me as very similar in spirit to Wasserman's in that it aims to strike a balance between concision and rigor, though it only covers about the 1st half of the material All of Statistics does. I've added a link to the curriculum page for future reference. I also found this coverage of generating functions, which includes moment generating functions, to be helpful in putting MGFs in broader context.

Anyways, to understand moment generating functions, it's worth taking a step back and thinking about generating functions, and specifically how you can use a Taylor series to write a (sufficiently smooth) function as a sum of terms built from its derivatives taken at a single point.

With this in mind, the first derivative of $\psi$ at zero can be evaluated using a Taylor expansion of $e^{tX}$:

$\psi'(0) = \frac{d}{dt} E(1 + tX + \frac{t^2X^2}{2!} + ...)\vert_{t=0} \\ = E(\frac{d}{dt}(1 + tX + \frac{t^2X^2}{2!} + ...))\vert_{t=0} \\ = E(X)$

similarly:

$\psi''(0) = \frac{d^2}{dt^2} E(1 + tX + \frac{t^2X^2}{2!} + ...)\vert_{t=0} \\ = E(\frac{d^2}{dt^2}(1 + tX + \frac{t^2X^2}{2!} + ...))\vert_{t=0} \\ = E(X^2)$

etc.

So I'm not sure why diving into these details was helpful, maybe to just sit with the details long enough to see that, yes, this is a clever formulation that allows us an alternate way to find moments. Note that instead of needing to integrate, we need to take successive derivatives, and in some contexts this could be easier to work with.
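
One way to see the derivative trick in action is numerically. The sketch below uses a fair six-sided die as the random variable (my choice of example) and approximates $\psi'(0)$ and $\psi''(0)$ with central finite differences; the step size `h` is an arbitrary small number.

```python
import math


def mgf(t, pmf):
    """psi_X(t) = E(exp(t X)) for a discrete random variable with the given pmf."""
    return sum(p * math.exp(t * x) for x, p in pmf.items())


def first_derivative_at_zero(pmf, h=1e-4):
    """Central difference approximation to psi'(0)."""
    return (mgf(h, pmf) - mgf(-h, pmf)) / (2 * h)


def second_derivative_at_zero(pmf, h=1e-4):
    """Central difference approximation to psi''(0)."""
    return (mgf(h, pmf) - 2 * mgf(0.0, pmf) + mgf(-h, pmf)) / h ** 2


die = {face: 1 / 6 for face in range(1, 7)}
print(first_derivative_at_zero(die))   # close to E(X) = 3.5
print(second_derivative_at_zero(die))  # close to E(X^2) = 91/6
```

From those two numbers the variance follows as $E(X^2) - E(X)^2$.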

It also turns out to be true that if two random variables have the same MGF, they have the same CDFs:

Let $X_1, X_2$ be two RVs with mgfs $m_1, m_2$. If $m_1(t)=m_2(t)$ for all $t\in(-\epsilon,+\epsilon)$, for some $\epsilon>0$, then the two RVs have identical cdfs (and therefore identical pdfs or pmfs).

So for now I think I'll tuck this fact away and remember there exists this alternative way of working with random variables.

Conditional Expectation

2016-08-11T00:00:00Z

Conditional Expectation

I returned to chapter 3 this morning since I still hadn't covered conditional expectation. At first glance, the definition of conditional expectation is pretty straight forward:

$$E(X|Y=y) = \int x f_{X|Y}(x|y) dx$$

what's a bit subtle is that this is not a value, but a function of $y$. As the book states,

Whereas $E(X)$ is a number, $E(X|Y=y)$ is a function of $y$. Before we observe $Y$ we don't know the value of $E(X|Y=y)$ so it is a random variable which we denote $E(X|Y)$. In other words, $E(X|Y)$ is the random variable whose value is $E(X|Y=y)$ when $Y=y$.

I wrote up a series of examples related to conditional distributions and conditional expectation as a way of both reviewing conditional expectation and solidifying the reading. It also uses The Rule of Iterated Expectations which says $E[E(Y|X)] = E(Y)$.
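
Here is a small discrete illustration, separate from that write-up, with a made-up joint pmf. It shows $E(X|Y=y)$ as a function of $y$ (a dict from $y$ to a number) and checks the rule of iterated expectations in the form $E[E(X|Y)] = E(X)$ using exact fractions.

```python
from fractions import Fraction as F

# Hypothetical joint pmf p(x, y), keyed by (x, y).
joint = {
    (0, 0): F(1, 4), (1, 0): F(1, 4),
    (0, 1): F(1, 8), (1, 1): F(3, 8),
}


def marginal_y(joint):
    """p(y), found by summing the joint pmf over x."""
    out = {}
    for (x, y), p in joint.items():
        out[y] = out.get(y, 0) + p
    return out


def cond_expectation_x_given_y(joint):
    """E(X | Y = y) for every y: a function of y, stored as a dict."""
    p_y = marginal_y(joint)
    return {
        y: sum(x * p for (x, yy), p in joint.items() if yy == y) / p_y[y]
        for y in p_y
    }


p_y = marginal_y(joint)
e_x_given_y = cond_expectation_x_given_y(joint)
iterated = sum(e_x_given_y[y] * p_y[y] for y in p_y)  # E[E(X | Y)]
e_x = sum(x * p for (x, y), p in joint.items())       # E(X)
print(e_x_given_y, iterated, e_x)
```

Here $E(X|Y=0) = 1/2$ and $E(X|Y=1) = 3/4$, and averaging those over $p(y)$ gives $5/8$, the same as $E(X)$.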

With this review of conditional probability in mind, I went back and looked at this problem again which considers flipping a coin $N$ times where $N \sim Poisson(\lambda)$. I previously found this problem mind blowing but it seems more straightforward thinking in terms of conditional expectation, even though the fact that the resulting random variable is also a Poisson remains pretty nifty.

Variance of linear combos of R.Vs, Markov's and Chebyshev's inequalities

2016-08-09T00:00:00Z

A hw problem on variance

Continuing to chug my way through homework problems, I worked through this problem which asks

Let

$$ f_{X,Y}(x,y) = \begin{cases} c(x+y) & 0 \leq x \leq 1, 0 \leq y \leq 2 \\ 0 & \text{otherwise} \end{cases} $$

Where $c$ is a constant. Find $V(2X - 3Y + 8)$.

This was a pretty straightforward problem, albeit a tad tedious, as it involves taking all of these steps:

Finding the constant c by noting that integrating the joint distribution must equal 1
Finding the marginal distributions for X and Y
Finding E(X), E(Y), V(X), V(Y) and Cov(X,Y)
Breaking down V(2X - 3Y +8) into components of the above to make the final calculation

It still helped to play with the formulas and iron in the concepts of expectation and variance. It also was a good refresher on finding marginal distributions by integrating out variables within a joint distribution.
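The steps above can be sketched with exact arithmetic. This is only one way to organize the calculation: because the density is a polynomial on a rectangle, every moment is a sum of monomial integrals, so the sketch computes the moments straight from the joint density rather than writing out the marginals explicitly. The helper names are mine.

```python
from fractions import Fraction

# Support of the joint density: 0 <= x <= 1, 0 <= y <= 2.
X_MAX, Y_MAX = 1, 2


def monomial_integral(a, b):
    """Exact integral of x**a * y**b over the rectangle [0, X_MAX] x [0, Y_MAX]."""
    return Fraction(X_MAX ** (a + 1), a + 1) * Fraction(Y_MAX ** (b + 1), b + 1)


# The density c * (x + y) must integrate to 1, which pins down c.
c = 1 / (monomial_integral(1, 0) + monomial_integral(0, 1))


def expectation(a, b):
    """E(X**a * Y**b) = c * integral of x**a * y**b * (x + y) over the support."""
    return c * (monomial_integral(a + 1, b) + monomial_integral(a, b + 1))


# Means, variances and covariance.
mean_x, mean_y = expectation(1, 0), expectation(0, 1)
var_x = expectation(2, 0) - mean_x ** 2
var_y = expectation(0, 2) - mean_y ** 2
cov_xy = expectation(1, 1) - mean_x * mean_y

# V(aX + bY + k) = a^2 V(X) + b^2 V(Y) + 2ab Cov(X, Y); the constant 8 drops out.
coef_x, coef_y = 2, -3
variance = coef_x ** 2 * var_x + coef_y ** 2 * var_y + 2 * coef_x * coef_y * cov_xy

print(c, var_x, var_y, cov_xy, variance)
```

Using `Fraction` keeps every intermediate value exact, which makes it easier to compare against a hand calculation.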

Probability inequalities

I got started reading about probability inequalities. The first two theorems, Markov's inequality and Chebyshev's inequality are pretty easy to understand. As the book puts it

Inequalities are useful for bounding quantities that might otherwise be hard to compute. They will be used in the theory of convergence in the next chapter.

Markov's inequality

Markov's inequality helps bound non-negative random variables. Let's say $X$ is a non-negative random variable, and we know its mean, $E(X)$. We wish to know $P(X > t)$, but let's say $X$'s distribution function is hard to compute for some reason. We can find a bound on $P(X > t)$ by starting with the definition of expectation (note we'll use shorthand $f(x)$ for $f_X(x)$):

$E(X) = \int_0^{\infty} x f(x)dx$

and breaking this into two parts:

$\int_0^{\infty} x f(x)dx = \int_0^{t} x f(x) dx + \int_t^{\infty} x f(x)dx$

which, since the integrand is non-negative, must be greater than or equal to

$\int_t^{\infty} x f(x)dx$

from here it follows that

$\int_t^{\infty} x f(x)dx \geq \int_t^{\infty} t f(x)dx = t \int_t^{\infty} f(x)dx = t P(X > t)$

So we have

$E(X) \geq t P(X > t)$

which can be rearranged as

$P(X > t) \leq \frac{E(X)}{t}$

and there you have Markov's inequality.
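To get a feel for how loose the bound can be, here is a minimal sketch comparing it with an exact tail probability. The distribution is my assumption for illustration: an exponential with mean 1, for which $P(X > t) = e^{-t}$ is known in closed form.

```python
import math

# Assumption for illustration: X ~ Exponential with mean 1, so P(X > t) = exp(-t) exactly.
mean = 1.0


def exact_tail(t):
    """P(X > t) for an exponential random variable with the given mean."""
    return math.exp(-t / mean)


def markov_bound(t):
    """Markov's upper bound E(X) / t on P(X > t)."""
    return mean / t


comparison = [(t, exact_tail(t), markov_bound(t)) for t in (1, 2, 5, 10)]
for t, tail, bound in comparison:
    print(f"t={t:>2}  P(X > t)={tail:.5f}  Markov bound={bound:.5f}")
```

The bound always holds, but it only uses the mean, so it decays like $1/t$ while the true tail here decays exponentially.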

Chebyshev's inequality

We can use this result to reason about $P(|X - \mu| \geq t)$ by squaring both sides of the inequality inside the probability, applying Markov's inequality to the non-negative random variable $(X - \mu)^2$, and noting the definition of variance:

$P(|X-\mu| \geq t) = P(|X-\mu|^2 \geq t^2) \leq \frac{E(X-\mu)^2}{t^2} = \frac{\sigma^2}{t^2}$.

It's pretty neat that such simple derivations lead to results that are so useful. As Wikipedia notes

Chebyshev's inequality guarantees that, for a wide class of probability distributions, "nearly all" values are close to the mean—the precise statement being that no more than $\frac{1}{k^2}$ of the distribution's values can be more than k standard deviations away from the mean... The inequality has great utility because it can be applied to any probability distribution in which the mean and variance are defined.

And to paraphrase Wikipedia further, it also provides a more general, albeit looser, counterpart to the 68–95–99.7 rule that applies to Normal distributions: under Chebyshev's inequality a minimum of just 75% of values must lie within two standard deviations of the mean and about 89% within three standard deviations.
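The comparison with the 68–95–99.7 rule can be tabulated directly. This sketch uses only the bound $1 - \frac{1}{k^2}$ and the standard identity $P(|Z| \leq k) = \text{erf}(k/\sqrt{2})$ for a Normal; the function names are mine.

```python
import math


def chebyshev_min_within(k):
    """Lower bound on P(|X - mu| < k * sigma), valid for any distribution with finite variance."""
    return 1 - 1 / k ** 2


def normal_within(k):
    """P(|X - mu| <= k * sigma) when X is Normal."""
    return math.erf(k / math.sqrt(2))


rows = [(k, chebyshev_min_within(k), normal_within(k)) for k in (1, 2, 3)]
for k, guaranteed, normal in rows:
    print(f"k={k}  Chebyshev guarantees >= {guaranteed:.3f}  Normal gives {normal:.4f}")
```

For $k = 1$ Chebyshev guarantees nothing at all, which is a good reminder that the price of generality is a much weaker statement.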

Side project

I continue to work on the automatic notebook generator. Will report more when I have a decent chunk of progress to coherently describe.

A couple of expectation problems and progress on building preprocessing pipelines

2016-08-04T00:00:00Z

Morning probability work

I worked out the expected value and variance of a geometric random variable, which wasn't all that fun or illuminating as it came down to using the formulas for expectation and variance:

$E(X) = \sum_{x} x f_X(x)$
$V(X) = E(X^2) - E(X)^2 = \sum_{x} x^2 f_X(x) - E(X)^2$

and futzing around with algebraic manipulation and some tricks for evaluating infinite series. This didn't make me eager to do the same with a Poisson RV so I'm going to leave that as a TODO for now, might just skip it for good.

I will say it's pretty neat that the expectation and variance of a Poisson, having $f_X(x) = e^{-\lambda} \frac{\lambda^x}{x!}$, both turn out to be $\lambda$. This prompted me to review the Poisson distribution by watching this video and a couple of others from that playlist.
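Short of doing the series manipulation by hand, the claim is easy to check numerically by plugging the Poisson pmf into the two formulas above and truncating the infinite sums. This is a sketch, not a proof; the rate 3.5 and the cutoff of 100 terms are arbitrary choices of mine.

```python
import math


def poisson_pmf(x, lam):
    """f_X(x) = exp(-lam) * lam**x / x!"""
    return math.exp(-lam) * lam ** x / math.factorial(x)


def poisson_mean_and_variance(lam, terms=100):
    """Approximate E(X) and V(X) by truncating the infinite sums after `terms` terms."""
    mean = sum(x * poisson_pmf(x, lam) for x in range(terms))
    second_moment = sum(x ** 2 * poisson_pmf(x, lam) for x in range(terms))
    return mean, second_moment - mean ** 2


lam = 3.5  # arbitrary rate chosen for illustration
mean, variance = poisson_mean_and_variance(lam)
print(mean, variance)
```

The truncation is harmless here because the pmf terms shrink factorially fast once $x$ is well past $\lambda$.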

Preprocessing pandas dataframes

I'm continuing to progress on my automatic preprocessing pipeline builder. One challenge has been that all of the built-in pipeline transformers that scikit-learn provides deal with numpy arrays, so while they work when you pass in a pandas dataframe, the output is always a 2d numpy array.

I prefer to keep things as dataframes so that the column names remain available after preprocessing, in case there is any other intermediate exploration.

I wrote this helper class which helps adapt most of the basic transformers:

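import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


# Assumption: scikit-learn's BaseEstimator and TransformerMixin are used as base
# classes here in place of BaseTransformer, which is not defined in this post.
class DfTransformerAdapter(BaseEstimator, TransformerMixin):
    """Adapts a scikit-learn Transformer to return a pandas DataFrame"""
    def __init__(self, transformer):
        self.transformer = transformer

    def fit(self, X, y=None, **fit_params):
        self.transformer.fit(X, y=y, **fit_params)
        return self

    def transform(self, X, **transform_params):
        raw_result = self.transformer.transform(X, **transform_params)
        # Only valid for transformers that keep the same columns in the same order.
        return pd.DataFrame(raw_result, columns=X.columns, index=X.index)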

but some of the more complex transformers that update or combine the columns in some way, including FeatureUnion and one hot encoding, require more work.

Santosh Venkatesh's Theory of Probability Book, lining up more problems TODO and progress on preprocessing functionality of automatic data science tool

2016-08-02T00:00:00Z

Another Probability Theory book

I purchased the book The Theory of Probability and recommend it as another resource (and have added it to the curriculum page accordingly). Had I found this before All of Statistics I might have chosen it as my main book, but I'm far enough along down this path, with enough of the curriculum tied to it, that I will instead stay the course and use this new book as a reference.

What's nice about the book is that it provides a bit more motivation for having the theory of probability in the first place, and weaves together a bit more of a narrative throughout the first half as it tours many topics. The second half provides more abstract foundations that support the first half, and the preface has this crazy diagram showing how a reader can navigate the book in several different ways.

The tradeoff, of course, is that the book is over twice as long and covers fewer topics. But if I were to work through this book, I would likely need to spend less time looking for additional resources elsewhere, as I've found myself needing to do as I work through All of Statistics.

I was disappointed to find that solutions to the exercises are only provided to instructors, as it makes the book a bit less valuable for working through as a self-study, but it turns out if you view the source of the book's web page on the publisher's website you might find a naked url to the solutions manual within a JSON object encoded on the page :)

Transcribing more problems

I spent some more time transcribing problems from CMU's stats classes onto the homework section and have plenty of work cut out for me on the topics of expectation, inequalities and convergence of random variables. It's time consuming, but a nice way to line things up and prime my head for concepts ahead in the reading.

Automatic data science

I continued working on producing an idealized titanic notebook that would be generated by my tool, still getting the preprocessing steps in place.

It's looking like this so far:

preprocessor = Pipeline([
    ('features', FeatureUnion([
        ('quantitative', Pipeline([
            ('extract', ColumnSelector(
                ['Age', 'SibSp', 'Fare', 'Pclass', 'Parch'],
                c_type='float'
            )),
            ('impute-missing', Imputer(strategy='median')),
            ('scale', StandardScaler())
        ])),
        ('categorical', Pipeline([
            ('encode-categorical', EncodeCategorical()),
            ('combine-binary', FeatureUnion([
                ('onehot', Pipeline([
                    ('select-onehot', ColumnSelector(
                        ['Embarked']
                    )),
                    ('apply-onehot', OneHotEncoder())
                ])),
                ('select-binary', ColumnSelector(
                    ['Sex']
                ))
            ]))
        ]))
    ]))
])

along with prose describing the different preprocessing steps necessary.

Study buddy help on a mind blowing probability problem (Binomial where N is Poisson), forging ahead on covariance material

2016-07-28T00:00:00Z

Tricky problem tackled with help from study buddy

I spent a couple of mornings attacking this problem which asks:

Suppose a coin has head probability $p$.

(a) If we toss the coin n times, where n is a fixed constant, and let $X, Y$ denote the number of heads and tails, respectively. Show that $X$ and $Y$ are dependent.

(b) If we toss the coin $N$ times, where $N \sim \text{Poisson}(\lambda)$ and let $X, Y$ denote the number of heads and tails, respectively. Show that $X$ and $Y$ are independent. What are their distributions?

I was able to figure out the first part on my own, but part b was a bit of a head scratcher.

My first instinct was to think of it in terms of transforming a random variable, but really, the right approach is to view the problem as a conditional distribution of $X$ and $Y$ given $N$. Random variable transformation is applying a function to a random variable, not plugging one random variable into another.

My goal ultimately ended up being able to understand the solution set. There were a couple of tricks in using the constraint that $X+Y$ must be $N$. The solution uses conditional probability, the law of total probability and some algebraic manipulation to make the sums of factorials end up being exponentials that look like a Poisson.

One cool result is that the number of heads you end up getting when you consider flipping a coin $N$ times where $N \sim \text{Poisson}(\lambda)$ is itself $\sim \text{Poisson}$, albeit with a different parameter ($\lambda p$ rather than $\lambda$). It's also kind of mind bending that the number of heads and tails are independent. If you consider a fixed number of coin flips, the number of heads and tails are related, but when you consider the joint distribution over all possible values of $n$ drawn from a Poisson, they are independent.
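A simulation makes this less mysterious. The sketch below is my own illustration, not the solution set's argument: it draws $N$ from a Poisson (using Knuth's multiplication method, since the standard library has no Poisson sampler), flips that many coins, and then looks at the sample mean and variance of the heads count and the sample covariance between heads and tails. The parameter values are arbitrary.

```python
import math
import random


def sample_poisson(lam, rng):
    """Knuth's multiplication method for drawing from Poisson(lam); fine for small lam."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate(lam, p, trials, seed=0):
    """Return lists of heads and tails counts from `trials` Poisson-many coin-flip experiments."""
    rng = random.Random(seed)
    heads, tails = [], []
    for _ in range(trials):
        n = sample_poisson(lam, rng)
        x = sum(rng.random() < p for _ in range(n))
        heads.append(x)
        tails.append(n - x)
    return heads, tails


def mean(values):
    return sum(values) / len(values)


def covariance(xs, ys):
    mx, my = mean(xs), mean(ys)
    return mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])


lam, p = 4.0, 0.3  # arbitrary values chosen for illustration
heads, tails = simulate(lam, p, trials=50_000)
mean_heads = mean(heads)                    # theory: lam * p
var_heads = covariance(heads, heads)        # theory: lam * p, as for any Poisson
cov_heads_tails = covariance(heads, tails)  # theory: 0 if heads and tails are independent

print(mean_heads, var_heads, cov_heads_tails)
```

A covariance near zero is only consistent with independence rather than proof of it, but together with the matching mean and variance it lines up with what part (b) asks you to show.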

Anyways, I wasn't really able to fully grok part (b) until I chatted with my study buddy Veejay, who I mentioned earlier. We've ended up chatting for a couple hours most weeks about probability, ML libraries and programming paradigms, and it's been really great. First of all, he's awesome at math, having a PhD in physics and all, and I'm lucky he enjoys teaching to the point where he's happy to serve as my graduate student instructor of sorts. I'm honestly not sure I'd be able to get through this probability homework without having someone to consult with. I hope that any insights I have on the hands-on python ML stuff and geeking out about programming at least repays him partially.

Variance and co-variance

Having at least attempted each problem related to chapter 2 material, I returned my attention to chapter 3 where I last left off. I took a minute to write up an exercise computing the expectation and variance of a Binomial random variable and read through the definitions of co-variance. This theorem is what I wish to remember:

$$ Cov(X, Y) = E(XY) - E(X)E(Y) $$

which is more intuitive to me and practical than the definition $Cov(X,Y) = E((X - \mu_X)(Y - \mu_Y))$.

Note that when two variables are independent, $E(XY) = E(X)E(Y)$ so the covariance is zero, which makes sense. The book mentions that the converse can't always be assumed; there can be cases where the covariance is equal to 0 but the variables are still dependent.
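A standard counterexample for the converse: let $X$ be uniform on $\{-1, 0, 1\}$ and $Y = X^2$. This is my choice of example, not necessarily the book's. $Y$ is completely determined by $X$, yet the covariance works out to exactly zero.

```python
from fractions import Fraction

# X is uniform on {-1, 0, 1} and Y = X**2, so Y is completely determined by X.
pmf_x = {-1: Fraction(1, 3), 0: Fraction(1, 3), 1: Fraction(1, 3)}

e_x = sum(x * p for x, p in pmf_x.items())
e_y = sum(x ** 2 * p for x, p in pmf_x.items())
e_xy = sum(x * x ** 2 * p for x, p in pmf_x.items())

# Cov(X, Y) = E(XY) - E(X)E(Y)
cov = e_xy - e_x * e_y

# Dependence: P(X = 0, Y = 0) differs from P(X = 0) * P(Y = 0).
p_x0_and_y0 = pmf_x[0]                # Y = 0 happens exactly when X = 0
p_x0_times_p_y0 = pmf_x[0] * pmf_x[0]

print(cov, p_x0_and_y0, p_x0_times_p_y0)
```

Covariance only picks up linear association, and here the relationship between $X$ and $Y$ is symmetric around zero, so the linear part cancels out.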

Expressing a Binomial as the sum of two Poisson random variables, and working on automatic data science project

2016-07-25T00:00:00Z

Morning Probability HW

After getting organized in enumerating more of the homework problems from the two Carnegie Mellon stats courses, I spent some time working through this problem which poses:

Assume $X \sim \text{Poisson}(\lambda)$ and $Y \sim \text{Poisson}(\mu)$, and that $X, Y$ are independent. Let $n$ be a positive integer. Show that $(X \vert X+Y = n) \sim \text{Binomial}(n, p)$ with $p = \frac{\lambda}{\lambda + \mu}$.

it was a nice way to review conditional probability, joint distribution functions and play with Poisson random variables a bit, and also to hammer home the point that random variables can be manipulated, transformed, and expressed in terms of other random variables. I look forward to being fully caught up in the homework so that I can begin completing the exercises in the material I've studied in the book on expectation and then move onto inequalities and the convergence of random variables.
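The result can be checked numerically by computing the conditional pmf straight from the definition and comparing it against the Binomial pmf. This is a sketch with arbitrary parameter values of my choosing, not a substitute for the proof.

```python
import math


def poisson_pmf(k, rate):
    """P(K = k) for K ~ Poisson(rate)."""
    return math.exp(-rate) * rate ** k / math.factorial(k)


def binomial_pmf(k, n, p):
    """P(K = k) for K ~ Binomial(n, p)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


def conditional_pmf(k, n, lam, mu):
    """P(X = k | X + Y = n) for independent X ~ Poisson(lam), Y ~ Poisson(mu)."""
    # Numerator: P(X = k, Y = n - k), which factors by independence.
    joint = poisson_pmf(k, lam) * poisson_pmf(n - k, mu)
    # Denominator: P(X + Y = n), summing over every way to split n.
    total = sum(poisson_pmf(j, lam) * poisson_pmf(n - j, mu) for j in range(n + 1))
    return joint / total


lam, mu, n = 2.0, 3.0, 6  # arbitrary values chosen for illustration
p = lam / (lam + mu)
max_gap = max(
    abs(conditional_pmf(k, n, lam, mu) - binomial_pmf(k, n, p)) for k in range(n + 1)
)
print(max_gap)
```

The largest gap between the two pmfs should be at the level of floating point noise.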

Afternoon automatic data science project work

This afternoon I resumed work on the automatic data science project. I'm getting started with a (seemingly) simple problem: given a labeled dataset for a classification problem along with information about which columns are the inputs, outputs, and what kind of variables they are, produce a notebook that loads the dataset, preprocesses it and applies cross validation with a logistic regression model. The notebook should explain its reasoning for doing each kind of preprocessing operation (e.g. one-hot encoding of categorical variables, standardization of quantitative variables) as it goes.

I figure this is just enough to define the bones of the project and will serve as a good first milestone. The very first thing I'm doing is hand writing what I'd like the output to look like for kaggle's titanic dataset based on my learnings from previous attempts, and from there I'll start working on how to automate that process so that it could work for kaggle's forest cover type dataset.

After that, I can start making it a bit more rich (exploratory analysis, trying out more models) and work on inferring variable types based on the dataset so that ideally the tool could work completely hands-free with a new dataset. I had started working on this sub-problem first a few weeks ago and have made some progress, but have decided it is a bit too low-level and would rather get the bones in place first.

Back in the saddle with probability hw, initial steps on automatic data science project

2016-07-15T00:00:00Z

July is a busy month of vacationing on both Lake MI and Lake Huron, with a trip to NYC this week in between (lucky me!). So in case you're wondering if I've gone dark here, expect to see things picking up again the week after next.

I did manage to find a little bit of time this week to get back to studying and I've also been chipping away at a side project, which I'll describe a bit more below.

I have also decided to keep studying all the way through August, which gives me a little bit more breathing room to stay the course. I was getting a bit flustered when I thought August was going to be the month I would turn towards looking for jobs. The challenge is in prioritizing stats study, python ML stuff and wanting to complete a side project.

Problem set organization

I spent some time earlier this week identifying problem set TODOs and organizing them on the problem set page so I can keep picking up problems to work through. My goal is to keep working through all of the homework problems from two of CMU's stat courses that use the All of Statistics book (per my curriculum), and I realized I hadn't really identified and written up all of the problems relevant to chapters 2 and 3 yet.

Another random variable transformation problem

I worked through this problem today:

Let $X_1, ..., X_n \sim \text{Exp}(\beta)$ be IID. Let $Y = \text{max}\{X_1, ..., X_n\}.$ Find the PDF of $Y$.

It's back from chapter 2, as I found a few problems I hadn't completed yet from the CMU hw assignments. This was a good problem to review as it requires reasoning about distributions soundly. The first trick is to reason about IID (independent and identically distributed). Since $Y \leq y$ exactly when every $X_i \leq y$, being independent means that we can multiply together the individual probabilities $P(X_i \leq y)$ to get the CDF of $Y$.

The next thing that got me was actually figuring out what the CDF of the exponential distribution is. You can look it up online, but I wanted to see if I could do the basic integration of

$F_X(x) = \int_{0}^{x} \frac{1}{\beta} e^{-s/\beta} ds$

it's a very simple application of u-substitution but I needed to review that concept before getting it.
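Here is a minimal sketch that checks the resulting answer by simulation. It assumes the parametrization in which $\text{Exp}(\beta)$ has mean $\beta$, so that $F_X(x) = 1 - e^{-x/\beta}$ for $x \geq 0$; if your convention uses a rate instead, swap $1/\beta$ for the rate. The values of `beta`, `n` and `y` are arbitrary.

```python
import math
import random


def cdf_of_max(y, beta, n):
    """F_Y(y) = P(X_1 <= y, ..., X_n <= y) = (1 - exp(-y / beta)) ** n for y >= 0."""
    return (1 - math.exp(-y / beta)) ** n


def pdf_of_max(y, beta, n):
    """f_Y(y) = d/dy F_Y(y) = (n / beta) * exp(-y / beta) * (1 - exp(-y / beta)) ** (n - 1)."""
    return (n / beta) * math.exp(-y / beta) * (1 - math.exp(-y / beta)) ** (n - 1)


beta, n, y = 2.0, 5, 4.0  # arbitrary values chosen for illustration
rng = random.Random(0)
trials = 20_000

# random.expovariate takes the rate, which is 1 / beta when beta is the mean.
hits = sum(
    max(rng.expovariate(1 / beta) for _ in range(n)) <= y for _ in range(trials)
)
empirical_cdf = hits / trials

print(empirical_cdf, cdf_of_max(y, beta, n))
```

The empirical fraction of simulated maxima below `y` should sit close to the closed-form CDF, and differentiating that CDF gives the PDF the problem asks for.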

Automatic data science side project

As I wrote before, one of the projects that has intrigued me is building a tool that can produce a python notebook from a dataset; imagine being able to upload the titanic dataset and getting back something like [this notebook](http://nbviewer.jupyter.org/github/krosaen/ml-study/blob/master/kaggle/titanic2/titanic2.ipynb). In many ways, this is as much an engineering challenge as a real application of ML, but I think it will be a great way to solidify my understanding of how to apply supervised learning effectively, and I think it's something that other aspiring data scientists would find useful. It also has a lot of possibilities for the future should I get the basics working; for instance, it could automatically figure out that it is working with a time series, or geographic data, and do the appropriate exploratory analysis.

The first challenge I've been working on is inferring what kind of variable each column in a dataset is: quantitative, categorical or ordinal. I'm using the actual datasets from the test sets I've worked on so far as test cases and think that some heuristics will suffice to do a decent job. But if required, it might be interesting to build a model to do this inference, pulling in many datasets with known variable types to train it.

Samples as Random Variables, Sample Mean and Variance and Automatic Data Science

2016-06-27T00:00:00Z

Sample Mean and Variance

All of Statistics chapter 3 briefly covers the sample mean and variance, defined as follows:

If $X_1, ..., X_n$ are random variables then we define the sample mean to be

$$\overline{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$$

and the sample variance to be

$$S^2_n = \frac{1}{n-1} \sum_{i=1}^{n}(X_i - \overline{X}_n)^2$$

and it shouldn't be surprising that if $X_1, ..., X_n$ are IID (independent and identically distributed) with $\mu = E(X_i)$ and $\sigma^2 = V(X_i)$ then

$$E(\overline{X}_n) = \mu, V(\overline{X}_n) = \frac{\sigma^2}{n} \text{ and } E(S^2_n) = \sigma^2$$

that is, the average value of the sample mean and sample variance are equal to those of the distributions they are drawn from.
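The first two identities can be verified exactly on a toy population. This sketch uses a fair six-sided die and samples of size 3 (both my choices for illustration) and enumerates every equally likely ordered sample.

```python
from fractions import Fraction
from itertools import product

faces = [1, 2, 3, 4, 5, 6]  # a fair die plays the role of the population distribution
n = 3                       # sample size

mu = Fraction(sum(faces), len(faces))
sigma2 = sum((x - mu) ** 2 for x in faces) / len(faces)

# Under IID draws every ordered sample of size n is equally likely,
# so expectations are plain averages over all 6**n samples.
sample_means = [Fraction(sum(s), n) for s in product(faces, repeat=n)]
mean_of_sample_mean = sum(sample_means) / len(sample_means)
var_of_sample_mean = sum((m - mu) ** 2 for m in sample_means) / len(sample_means)

print(mu, mean_of_sample_mean)
print(sigma2 / n, var_of_sample_mean)
```

Treating the sample mean itself as a random variable with its own distribution is exactly what the enumeration does.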

Samples are random variables

A question I've grappled with is how to map the concept of random variables over to day to day work with data sets. Are random variables the columns or the rows? I think the answer is: both.

Let's consider a dataset with the age and height of 1000 people surveyed from Ann Arbor, MI. You've got 2 columns and 1000 rows. I imagined we have 2 random variables, one for age and one for height, and, with each, an associated probability distribution over the entire population. But as the book continues to mention "a sequence of random variables $X_1, ... X_n$" and IID samples, it's clear it is talking about the rows, so what's the deal?

Cross referencing back with Stanford's Free online Probability and Statistics Course section on sampling distributions helped here:

In our study of Probability and Random Variables, we discussed the long-run behavior of a variable, considering the population of all possible values taken by that variable. For example, we talked about the distribution of blood types among all U.S. adults and the distribution of the random variable X, representing a male's height. In this module, we focus directly on the relationship between the values of a variable for a sample and its values for the entire population from which the sample was taken. This ... is the bridge between probability and our ultimate goal of the course, statistical inference.

and

Statistics vary from sample to sample due to sampling variability, and therefore can be regarded as random variables whose distribution we call sampling distribution.

(Side note: it's annoying how often free MOOCs keep their content behind a login wall, would have loved to link to the section directly.)

So if we have a dataset that covered the entire population, each row would just be a data point, not a random variable, but in practice, we are usually working with a sample from a population, and thus need to consider the variation of the samples themselves, e.g., what is the expected mean and variance of a sample from a distribution, which itself also has a mean and variance? Also note: in practice we often don't have access to some master dataset of the entire population.

Back to the example I posed with 1000 rows of age and height: the random variables at play are:

$A$: the age of the population of Ann Arbor
$H$: the height of the population of Ann Arbor
$A_1, ... A_{1000}$: a sequence of IID random variables representing the age column of my dataset, each distributed according to $A$
$H_1, ... H_{1000}$: a sequence of IID random variables representing the height column of my dataset, each distributed according to $H$

To me, this is exciting, as probability theory and stats/ml are finally starting to come together. Soon I'll get to the inference part where, given a sample, how do we infer parameters about the random variable of the entire population?

Parameter vs Statistic

As defined in the Stanford class:

A parameter is a number that describes the population; a statistic is a number that is computed from the sample.

So if we assume that the height of a population follows a Normal distribution, the population's mean height $\mu$ is a parameter, and the sample mean $\overline{X}_n$ is a statistic.

Unbiased sample variance: why divide by n-1

The formula for sample variance divides by n-1 instead of n, why is this? Cross referencing with Khan Academy helped here.

It first helps to know that the formula using n instead of n-1:

$$ \frac{1}{n} \sum_{i=1}^{n}(X_i - \overline{X}_n)^2 $$

is known as the biased sample variance. Intuitively, dividing by $n$ is biased towards underestimating the population variance because it could be that the population mean lies outside of the sample altogether. Dividing by n-1 accounts for this. Or as this SO answer puts it,

Because the observed values fall, on average, closer to the sample mean than to the population mean, the standard deviation which is calculated using deviations from the sample mean underestimates the desired standard deviation of the population.

However, this isn't just some fudge factor where someone said, "well, we should just divide by n-1 instead to make it a little larger", it's also so that when we calculate the expected value of the sample variance, it ends up equaling the variance of the population. That is, this theorem:

$$E(S^2_n) = \sigma^2$$

can be proven when we define the sample variance dividing by n-1.
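Short of the proof, the difference between the two divisors can be seen exactly on a toy population. As an illustration (my setup, not the book's), take a fair die as the population, enumerate every equally likely sample of size 3, and average each version of the sample variance.

```python
from fractions import Fraction
from itertools import product

faces = [1, 2, 3, 4, 5, 6]  # a fair die plays the role of the population distribution
n = 3                       # sample size

mu = Fraction(sum(faces), len(faces))
sigma2 = sum((x - mu) ** 2 for x in faces) / len(faces)


def sum_of_squared_deviations(sample):
    """Sum of (X_i - sample mean)**2 over the sample."""
    sample_mean = Fraction(sum(sample), len(sample))
    return sum((x - sample_mean) ** 2 for x in sample)


samples = list(product(faces, repeat=n))

# Expected value of each estimator: average over all equally likely samples.
expected_unbiased = sum(sum_of_squared_deviations(s) / (n - 1) for s in samples) / len(samples)
expected_biased = sum(sum_of_squared_deviations(s) / n for s in samples) / len(samples)

print(sigma2, expected_unbiased, expected_biased)
```

Dividing by $n$ comes out low by exactly the factor $\frac{n-1}{n}$, which is the shortfall that dividing by $n-1$ corrects.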

Automatic data science

I mentioned in passing last week an idea I've been tossing around that would help one get started with a dataset by producing an iPython notebook automatically based on the dataset, inferring the variable types, what clean up and scaling are necessary, doing appropriate exploratory analysis and the first steps towards trying out a couple of models.

This was born from observing (as others have), after working through Python Machine Learning and attempting a couple of Kaggle competitions, that many of the steps are pretty routine.

It turns out there are already a few other tools of note related to this, under the umbrella term automatic data science that I spent some time looking at today:

auto-sklearn: a single model that fits the fit/predict API of sklearn and handles a bunch of stuff for you under the hood
TPOT: "will automate the most tedious part of machine learning by intelligently exploring thousands of possible pipelines to find the best one for your data". I found the tutorial with the titanic dataset underwhelming as all of the data munging still needed to be done by hand.
Data Robot: commercial tool that helps find / tune the right model, deploys it to cloud as an API

As for my potential side project: I still think I could add some value in producing something that is explanatory. The goal would not be to replace what a data scientist does, but to get you started. You go from data set to notebook and run with it, further tweaking, trying out feature engineering and testing as you see fit.

But side project viability aside, I think the field of automatic data science is an exciting one. Data science is so data driven and functional by its nature that pipelines of routine operations fall out quite naturally, and automating much of this promises to provide even more leverage to those who can think creatively at the top level about how / where to apply ML in the first place. It also is a bit daunting as it likely means that it will be increasingly unlikely one can make a great living simply figuring out how to come up with a model and deploy it for a routine classification task.

Getting started with artificial neural nets and image classification, side project ideas

2016-06-23T00:00:00Z

Can't wait to catch up with the Deep Learning cool kids

Today I couldn't hold out any longer: I jumped ahead and got started on chapter 12 of Python Machine Learning on Artificial Neural Networks. The chapters I've skipped wholesale to get here include:

8: Applying Machine Learning to Sentiment Analysis
9: Embedding a Machine Learning Model into a Web Application
10: Predicting Continuous Target Variables with Regression Analysis
11: Working with Unlabeled Data – Clustering Analysis

I definitely plan on coming back to all but the web app chapter, probably in the order of 10, 11, 8.

Side projects

Part of the reason I couldn't hold out any longer in jumping into neural nets is that deep learning / tensorflow and the like are all the rage and I'm just excited to get into it. The other reason is that I'm trying to figure out which side-project idea I'd like to embark on and one of them: building a "shazam" for bird calls, may benefit from neural networks for the audio classification work. (The other leading candidate: a data science starter tool that takes a csv and spits out an iPython notebook that preprocesses, explores and classifies the data according to the kind of input / output variables at play: lmk if you have any thoughts on either of the ideas!)

Deep learning defined

In the beginning of chapter 12, Sebastian notes

During the previous decade, many more major breakthroughs resulted in what we now call deep learning algorithms, which can be used to create feature detectors from unlabeled data to pre-train deep neural networks—neural networks that are composed of many layers.

and

... the error gradients that we will calculate later via backpropagation would become increasingly small as more layers are added to a network. This vanishing gradient problem makes the model learning more challenging. Therefore, special algorithms have been developed to pretrain such deep neural network structures, which is called deep learning.

I knew that having many layers is the big thing with deep learning, but hadn't read before that pretraining was a key part of it. However, in googling around about this, I came across this SO answer that claims pretraining is the wave of the past and using different activation functions (rectified linear units or ReLUs) is really the key:

the reason you don't see people pretraining (I mean this in an unsupervised pretraining sense) conv nets is because there have been various innovations in purely supervised training that have rendered unsupervised pretraining unnecessary... So as of now, a lot of the top performing conv nets seem to be of a purely supervised nature. That's not to say that unsupervised pretraining or using unsupervised techniques may not be important in the future. But some incredibly deep conv nets have been trained, have matched or surpassed human level performance on very rich datasets, just using supervised training.

¯\_(ツ)_/¯

Whatever the bleeding edge may be, working my way through backprop and image classification tasks will be worthwhile.

Classifying digits with tree based methods

Chapter 12 works with the MNIST handwritten digits dataset (the same one used in Andrew Ng's Coursera class).

My assumption was that this is used as an example for neural networks because the other models we studied so far can't handle this task. However, in trying out a couple of tree and forest models, I found they perform quite well—the random forest with 100 classifiers performs even better than the reported accuracy of the neural net from the book!

decision tree depth 10 training fit: 0.912
decision tree depth 10 test accuracy: 0.872
decision tree depth 100 training fit: 1.000
decision tree depth 100 test accuracy: 0.886
random forest 10 estimators training fit: 0.999
random forest 10 estimators test accuracy: 0.946
random forest 100 estimators training fit: 1.000
random forest 100 estimators test accuracy: 0.969

Eventually it would be nice to take on a task that really was uniquely suited for neural nets.

Understanding Expectation, Moments and Variance with help from the transformation of random variables

2016-06-22T00:00:00Z

Transformation of Random Variables and Expectation

I resumed probability study in earnest starting from the beginning of chapter 3 of All of Statistics which covers expectation of random variables.

One cool thing that struck me was how the transformation of random variables comes into play. I found this topic quite challenging when studying chapter 2, but feel pretty good about it now, I even managed to reason about a transformation of multiple random variables in this homework problem!

The way transformation comes up again is with The Rule of the Lazy Statistician:

$E(r(X)) = \sum_x r(x) f_X(x)$

and for the continuous case:

$E(r(X)) = \int_{-\infty}^{\infty} r(x) f_X(x) dx$

This means that we can sidestep the challenge of finding the PDF of the transformed variable and plug the function in directly; one of the examples shows how you can compute $E(Y)$ where $X \sim \text{Uniform}(0,1)$ and $Y = r(X) = e^X$ by both using the Lazy Statistician rule and by going through the work to derive $f_Y(y)$:

$E(Y) = E(r(X)) = \int r(x) f_X(x) dx = \int_0^1 e^x dx = e - 1$

vs deriving $f_Y(y)$ which turns out to be $\frac{1}{y}$ for $1 < y < e$ and computing

$E(Y) = \int_1^e y \frac{1}{y} dy = e - 1$
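Both routes are easy to check numerically. This sketch uses a simple midpoint rule (my choice of integrator, any quadrature would do) to evaluate the two integrals and compare them with $e - 1$.

```python
import math


def midpoint_integral(f, a, b, steps=100_000):
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    width = (b - a) / steps
    return width * sum(f(a + (i + 0.5) * width) for i in range(steps))


# Route 1 (lazy statistician): integrate r(x) * f_X(x) with r(x) = e^x and f_X(x) = 1 on (0, 1).
lazy = midpoint_integral(lambda x: math.exp(x) * 1.0, 0.0, 1.0)

# Route 2 (derive f_Y first): integrate y * f_Y(y) with f_Y(y) = 1 / y on (1, e).
direct = midpoint_integral(lambda y: y * (1.0 / y), 1.0, math.e)

print(lazy, direct, math.e - 1)
```

The first route never needs $f_Y$ at all, which is the whole appeal of the rule.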

Framing a problem in terms of R.V transformation

It also means that the skill of reasoning about transformed random variables comes in handy in framing / solving problems like: what is the expected value of this random variable? Example 3.8 was interesting to me:

Take a stick of unit length and break it at random. Let $Y$ be the length of the longer piece. What is the mean of $Y$?

We can solve this once we frame it as a transformation of $X \sim \text{Uniform}(0,1)$ via $r(X) = \max(X, 1-X)$ and note that

$r(x) = \begin{cases} 1 - x & 0 < x < 0.5 \\ x & 0.5 \leq x \leq 1 \end{cases} $

making $E(Y) = E(r(X)) = \int r(x) f_X(x)dx = \int r(x) \times 1 dx = \int_0^{0.5} (1-x)dx + \int_{0.5}^1 xdx = \frac{3}{4}$.
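
A simulation is an easy way to double check this. The sketch below is my own check (not from the book): it breaks many sticks at a uniform random point and averages the longer piece.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable
n = 200_000
total = 0.0
for _ in range(n):
    x = random.random()        # break point, X ~ Uniform(0, 1)
    total += max(x, 1.0 - x)   # Y = length of the longer piece
estimate = total / n
print(estimate)  # should land close to 3/4
```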

Moments and Variance

The kth moment of a random variable is defined as $E(X^k)$. Applying The Rule of the Lazy Statistician, this means the kth moment is also equal to $\int x^k f_X(x) dx$.

The kth central moment is $E((X - \mu)^k) = \int (x - \mu)^k f_X(x) dx$

Variance is the 2nd central moment. $V(X) = \sigma^2 = E((X - \mu)^2) = \int (x - \mu)^2 f_X(x) dx$

$\newcommand{\abs}[1]{\lvert#1\rvert}$Variance is the most common measure of spread of a random variable. Why not just use the 1st central moment instead? Because $E(X - \mu) = E(X) - \mu = \mu - \mu = 0$. That said, we can use $E\abs{X - \mu}$, it's just not as common.
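
To see these definitions side by side, here is a small sketch using a fair six-sided die (my own choice of example) and exact fractions. The helper `expect` is a made-up name that applies the discrete Lazy Statistician rule.

```python
from fractions import Fraction

# Fair six-sided die: each face has probability 1/6.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def expect(r):
    """E(r(X)) via the Rule of the Lazy Statistician (discrete case)."""
    return sum(r(x) * p for x, p in pmf.items())

mu = expect(lambda x: x)
first_central = expect(lambda x: x - mu)       # 1st central moment
variance = expect(lambda x: (x - mu) ** 2)     # 2nd central moment
mean_abs_dev = expect(lambda x: abs(x - mu))   # E|X - mu|
print(mu, first_central, variance, mean_abs_dev)
```

The first central moment comes out as exactly 0, while the variance and the mean absolute deviation are both positive measures of spread.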

Homework

I managed to solve one direction in the proof for this homework problem, showing that for a discrete random variable, if the probability of a single value is equal to 1, the variance must be 0. This is because the sum of all probabilities must add up to 1, leaving the probability of all other values 0, implying you have a point mass distribution, which has variance 0. It also makes some intuitive sense: if you have only one value of your random variable that has probability = 1 and everything else has probability 0, there is no measure of spread to speak of; every sample will have the same value.

scikit-learn Pipeline gotchas, k-fold cross-validation, hyperparameter tuning and improving my score on Kaggle's Forest Cover Type Competition

2016-06-20T00:00:00Z

I spent the past few days exploring the topics from chapter 6 of Python Machine Learning, "Learning Best Practices for Model Evaluation and Hyperparameter Tuning". Instead of working through the exact same examples as the author I applied the learnings to another take at Kaggle's Forest Cover Type Dataset in a follow up notebook to my first attempt.

K-fold cross-validation

One of the first things you learn about in applying ML is the importance of cross-validation: evaluating the performance of your model on a portion of your dataset separate from what you used to train your model. The easiest way is to hold out a test set and compare performance using that:

train your model on 70% of your labeled data
evaluate the trained model on the remaining 30%

K-fold cross-validation improves on this by letting you do this multiple times so you can see whether the test performance varies based on which samples you used to train / test.

By running through the train/test comparison several times you can get a better estimate of model performance, and sanity check that the model is not performing wildly differently after being trained on different segments of your labeled data, which in itself could indicate instability in your model or too small a sample set.
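
To make the mechanics concrete, here is a hand-rolled sketch of how the folds are carved up. The function name `k_fold_indices` is hypothetical, and it skips the shuffling you would normally want.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first few folds.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(10, 3))
for train_idx, test_idx in folds:
    print(len(train_idx), test_idx)
```

In practice scikit-learn's `KFold` and `cross_val_score` (in `sklearn.model_selection`) do this bookkeeping for you.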

The need for Pipelines

As soon as you start doing k-fold cross validation it becomes very handy to make use of scikit-learn helper methods to do this grunt work for you. While hand rolling a 70/30 split isn't so bad, iterating over k different folds, tallying up the performances etc feels more like busy work.

But the question is, how can you package up something that represents all of the steps for use over and over again? This is where scikit-learn pipelines come in very handy.

The key point to me is that you can't just preprocess your entire labeled dataset once and then slice and dice it for cross-validation thereafter. Preprocessing and dimensionality reduction steps are fit on the training set too, so it is important that they only have access to whatever training set you are using during the k-fold process. Otherwise you are granting some of your data processing steps access to "unseen" data and aren't really testing how the model will fare in the wild with new unseen data.

So if I need to pass along something that will fit / transform / predict, I need to construct a pipeline, which is a composite that implements the same interface as all of the learning models:

fit: given samples, tune some internal parameters
transform: return some transformation of samples, possibly making use of anything you tuned while fitting

In addition to any number of fit/transform steps, pipelines implement a predict method that uses the last item in the pipeline, which it expects to be a model.

So in order to participate in this I needed to refactor my hand rolled preprocessing function into one that constructs a pipeline.

This post was very handy in finding the FeatureUnion pipeline that can help you break down your preprocessing and recombine it again. So you can combine two pipelines:

one that extracts categorical variables and applies one hot encoding, and then scales the outputs to -1 / 1
one that extracts quantitative variables, imputes missing values with each variable's mean and then applies mean scaling

into one and have a little pipeline that is ready to preprocess your data. This can then be further combined in a larger pipeline with other steps like PCA followed by your classifier.
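
The two-branch preprocessing described above might look roughly like this. It is a simplified sketch, not my notebook's code: the column layout and the `select_*` helpers are hypothetical, it uses `SimpleImputer` from current scikit-learn releases, and it leaves out the -1 / 1 scaling of the one-hot outputs.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

# Hypothetical layout: column 0 is categorical, columns 1-2 are quantitative.
def select_categorical(X):
    return X[:, [0]]

def select_quantitative(X):
    return X[:, [1, 2]]

categorical = Pipeline([
    ("select", FunctionTransformer(select_categorical)),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
quantitative = Pipeline([
    ("select", FunctionTransformer(select_quantitative)),
    ("impute", SimpleImputer(strategy="mean")),   # fill gaps with the column mean
    ("scale", StandardScaler()),
])
# FeatureUnion runs both branches and stacks their outputs side by side.
preprocess = FeatureUnion([("cat", categorical), ("quant", quantitative)])

X = np.array([[0, 1.0, 10.0], [1, np.nan, 20.0], [2, 3.0, 30.0], [1, 5.0, np.nan]])
features = preprocess.fit_transform(X)
print(features.shape)
```

Newer scikit-learn versions also offer `ColumnTransformer` (in `sklearn.compose`) for this kind of per-column routing.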

Pipeline gotchas

Pipelines are very cool but can still be a little clunky to work with. I found that when parallelization happens behind the scenes, any hand-rolled Pipeline step needs to extend BaseEstimator to inherit some magic parameter serialization logic, and that anything that needs to persist between fit and transform has to be declared as a parameter in the __init__ method of the class in order to work. This isn't really documented anywhere outside of blog posts and stack overflow questions.
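
Here is a minimal sketch of a hand-rolled step that follows this convention as I understand it. `ColumnSelector` is a made-up class for illustration. The important part is that `__init__` stores its arguments unchanged under the same names, since `get_params()` and `clone()` depend on that.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone

class ColumnSelector(BaseEstimator, TransformerMixin):
    """Hypothetical hand-rolled step that keeps only the given column indices."""

    def __init__(self, columns=None):
        # Store constructor arguments as-is, under the same attribute name.
        self.columns = columns

    def fit(self, X, y=None):
        return self  # nothing to learn for this step

    def transform(self, X):
        return np.asarray(X)[:, self.columns]

selector = ColumnSelector(columns=[0, 2])
copy = clone(selector)  # helpers like cross-validation clone estimators before fitting
print(copy.get_params())
print(copy.fit_transform(np.arange(12).reshape(4, 3)))
```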

Hyperparameter tuning

Having each model I wanted to test all rolled up into tidy pipelines, I was ready to make use of some other cool stuff in scikit-learn, including hyperparameter tuning with grid search, which will evaluate the performance of a model with varying parameter values using k-fold cross validation along the way. Note: "hyperparameter" is just a fancy name for the parameters used to tune the models themselves. For instance, the maximum depth you allow a decision tree to grow to is a hyperparameter.

Here is a graph of how model performance varied across the tested parameters for Logistic regression:

Note that C is the inverse regularization parameter, so increasing it reduces the dampening of parameter values, allowing the weights of the trained model to vary as much as they like to fit the data. In this case, LR was not going to overfit and it was best to essentially remove all regularization.

Kernel SVM:

and Random Forest:

For once, these graphs were produced via original work, not copy/paste from Python ML examples :) What strikes me is that yes, these parameters matter, and also that in these cases the performance across parameter space is convex, making me wonder whether it could be worth applying something smarter than exhaustive search. As this post notes, most people don't bother as it makes it harder to parallelize the search across parameters, and it instead recommends randomized search.

Conceptually, hyperparameter tuning is an optimization task, just like model training... Smarter tuning methods are available. Unlike the “dumb” alternatives of grid search and random search, smart hyperparameter tuning are much less parallelizable. Instead of generating all the candidate points upfront and evaluating the batch in parallel, smart tuning techniques pick a few hyperparameter settings, evaluate their quality, then decide where to sample next. This is an inherently iterative and sequential process. It is not very parallelizable. The goal is to make fewer evaluations overall and save on the overall computation time. If wall clock time is your goal, and you can afford multiple machines, then I suggest sticking to random search.

Parallelization gotchas

In evaluating many possible parameters across k folds, you can easily ramp up to 60+ training runs on a given model, which can be really slow, especially on SVM models. Scikit-learn lets you run all of this across multiple cores to speed things up via the n_jobs param. This worked well until I ran it during the hyperparameter tuning of random forest classifiers: running multiple 100-tree models in parallel crashed the notebook cell. I had to guess that this was what was happening, and found that for that case setting n_jobs=1 worked; I just needed to let it run for about 15 minutes to complete and find the optimal parameters.
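
For reference, a grid search over a pipeline looks roughly like the sketch below. The synthetic data, the step names and the grid of `C` values are all stand-ins chosen for illustration, not the ones from my notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
# "<step name>__<parameter>" addresses a parameter of one pipeline step.
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}

# n_jobs=1 runs the fits one at a time; n_jobs=-1 would use all cores.
search = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=1)
search.fit(X, y)
n_fits = len(search.cv_results_["params"]) * 5  # candidates x folds
print(search.best_params_, round(search.best_score_, 3), n_fits)
```

Even this tiny grid needs 20 fits, which shows how quickly the count grows once you add a second or third parameter.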

Improving performance on Kaggle

After all of this futzing around, how did the improved parameters fare? The adjustments were enough to get me a 1-3% boost.

While this is kind of underwhelming, it was nice to see something concrete and this process can always help you if you happen to have guessed a really bad parameter value: the graphs above show that there were much worse values I could have chosen in the LR and SVM models.

Model	Untuned Test Accuracy	Tuned Test Accuracy	Untuned Kaggle Score	Tuned Kaggle Score
Logistic Regression	0.658	0.658	0.56
Kernel SVM	0.815	0.825	0.72143	0.73570
Random Forest	0.82	0.85	0.71758	0.74463

First attempt at Kaggle's Forest Cover Type competition, learning how slow SVMs can be

2016-06-14T00:00:00Z

I made my first few attempts at the Forest Cover Type Prediction Kaggle Competition, one of the recommended starter projects.

My broader goal for this week is to use this as training ground for some of the additional techniques I've been reading about in the Python ML book, including more sophisticated validation methods, hyperparameter tuning, and ensemble learning methods like bagging, but I haven't gotten that far.

A few things make this competition more interesting than the Titanic: Machine Learning from Disaster:

Much larger datasets: 15k training samples / 565k to predict vs merely hundreds
55 features vs ~15
multi-class prediction instead of binary
Non-linear models perform significantly better (with Titanic logistic regression performed nearly as well as anything)

On the other hand, there doesn't seem to be much need or opportunity for feature engineering, e.g. there are no fuzzier features that could be further broken down as was the case in the Titanic competition where, for instance, I extracted a 'cabin deck' feature out of the raw cabin feature, and others have done stuff with surnames.

Today was a nice review of the basic initial attack to a competition like this:

build a function to preprocess the data handling any missing data, encoding categorical features appropriately, and scaling quantitative features
evaluate a few models examining training / test accuracy
submit

The initial results:

alg	70/30 training fit	70/30 test accuracy	training time	prediction time	full training fit	full test accuracy (kaggle submission)
Logistic Regression	0.68	0.67	3.07	0.01	0.60	0.55999
Decision Tree Depth 6	0.70	0.68	0.06	0.00	0.69	0.57956
Random Forest Depth 10	0.99	0.82	0.23	0.04	0.99	0.71758
Kernel SVM	0.91	0.82	3.44	6.2	0.90	0.72143
Kernel SVM on PCA reduced data	0.90	0.82	2.27	3.26

Note that runtime became a significant factor this time. I became acquainted with the fact that SVM prediction times are slow compared to logistic regression, decision trees and random forests. It's one thing for the training to take a while, but when the prediction is also slow that makes SVMs quite a bit less appealing, especially when they don't perform much better than random forest models. What makes SVMs slow for prediction is that the prediction time is proportional to the number of support vectors of the model, which in turn grows with the number of training samples (assuming the decision boundaries are tricky).

One of the benefits of reducing the dimensions of the dataset with PCA is to improve the performance of training / prediction, so after seeing that with 25 out of the 55 features, 95% variance was retained, I tried running all of the methods on the reduced dataset. Kernel SVM performed just as well and ran almost 50% faster.

A last hope for kernel SVM performing the best by a larger margin is to explore tuning its parameters a bit; I read that SVMs tend to need this.

Oh, another thing I was curious about is whether applying tree based methods (trees or random forests) to scaled data makes any difference. One property of tree based methods is that you don't have to scale the features to be centered around zero as is required by many other classifiers, but, having already scaled the data, I used it with each algorithm out of laziness. I followed up and re-trained a random forest on the unscaled dataset and the performance was identical.

In any case, with a best submission score of ~72% which isn't particularly impressive on the leaderboard I'm hoping either tuning or applying ensemble learning techniques can get me up into the 80s.

Dimensionality reduction with Principal Components Analysis

2016-06-10T00:00:00Z

I'm onto the topics related to data compression and dimensionality reduction in Chapter 5 of Python Machine Learning.

PCA

Principal components analysis works by finding the eigenvectors of the covariance matrix and projecting the data onto a reduced subset of the eigenvectors. I found this post helpful for getting some intuition for how it works and to review the concept of eigenvectors.

One thing that has struck me about PCA is how eigenvectors/values themselves can provide insight during initial exploration of a dataset, and this reminded me of a segment from the Talking Machines podcast during q&a about first steps with a dataset starting at 6:55. Ryan describes how, after looking at things like histograms of each dimension and a scatter matrix, applying PCA and looking at the eigenvalues of the covariance matrix in decreasing order can give you an idea of how much structure there is in the dataset; if it falls off quickly after a few eigenvalues, you will have an easier time. This kind of graph should prove useful in exploring a dataset even if you don't choose to apply the transformation:

Remembering that PCA is also covered in Andrew Ng's ML course, I also went back and re-watched the related videos and found his explanations are stronger than the book's, and include additional tips for choosing the number of dimensions, and for how to reconstruct an approximation of the original feature space.

Unsupervised

One point the book makes that I found interesting is that PCA is an unsupervised compression mechanism; it makes no use of the class labels in deciding how important the extracted features are. This can be contrasted with a supervised feature importance mechanism such as the random forest feature importance technique from chapter 4:

Whereas a random forest uses the class membership information to compute the node impurities, variance measures the spread of values along a feature axis

Later in chapter 5 we cover another supervised technique with linear discriminant analysis.

Problem formulation

With Principal Components Analysis we're projecting n-dimensional data into a lower k-dimensional space.

In doing so we need to minimize this expression:

$ \newcommand{\abs}[1]{\lvert#1\rvert} \newcommand{\norm}[1]{\lVert#1\rVert} $

$\frac{1}{m} \sum_{i=1}^{m} \norm{x^{(i)} - x_{approx}^{(i)} }^2$

which is the average squared distance between each point and the location it gets projected to.

Ng points out that PCA is not linear regression despite the cosmetic similarity in the 2-d -> 1-d case.

Every feature has equal importance, no output or special variable as with linear regression. This becomes more apparent when looking at the 3d case:

I also found some of the animations from this stats.stackexchange answer very interesting. First, visualizing the possible vectors that the data could be projected onto, noting that PCA finds the line that minimizes the squared projection distance from the points to this line:

and even cooler,

you can imagine that the black line is a solid rod and each red line is a spring. The energy of the spring is proportional to its squared length (this is known in physics as the Hooke's law), so the rod will orient itself such as to minimize the sum of these squared distances. I made a simulation of how it will look like, in the presence of some viscous friction

Procedure

Apply mean normalization and optionally feature scaling to feature matrix $X$ (resulting values should be centered at 0)
Compute covariance matrix $\Sigma = \frac{1}{m}\sum_{i=1}^{m} (x^{(i)})(x^{(i)})^T$
- will be $n \times n$ matrix
Compute eigenvectors
- Can use singular value decomposition function to compute eigenvectors since covariance matrix is symmetric positive semidefinite. The columns of the $U$ matrix returned are the eigenvectors.
Choose first K columns (corresponding to eigenvectors having largest eigenvalues) of $U$, call this matrix $U_{reduce}$
To reduce feature matrix $X$ from $n$ to $k$ dimensions, compute $Z = U_{reduce}^T X$
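
The procedure above can be sketched in numpy. This is a minimal sketch on synthetic data with variable names of my own choosing. It assumes the rows of `X` are examples, so applying $U_{reduce}^T$ to every example is written `X @ U_reduce`.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: m = 200 examples (rows), n = 3 correlated features (columns).
X = rng.normal(size=(200, 3)) @ np.array([[3.0, 0.0, 0.0],
                                          [1.0, 1.0, 0.0],
                                          [0.0, 0.5, 0.1]])

X_centered = X - X.mean(axis=0)          # mean normalization
m = X_centered.shape[0]
Sigma = X_centered.T @ X_centered / m    # n x n covariance matrix
U, S, Vt = np.linalg.svd(Sigma)          # columns of U are the eigenvectors

k = 2
U_reduce = U[:, :k]                      # eigenvectors with the k largest eigenvalues
Z = X_centered @ U_reduce                # m x k compressed dataset
print(Sigma.shape, Z.shape)
```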

Choosing K

The total variation in the original feature matrix can be viewed as:

$\frac{1}{m} \sum_{i=1}^{m} \norm{x^{(i)} }^2$

If we look at the ratio of projection error to the total variation of the original matrix, we get a measure of how much information we've lost:

$\dfrac {\frac{1}{m} \sum_{i=1}^{m} \norm{x^{(i)} - x_{approx}^{(i)} }^2} {\frac{1}{m} \sum_{i=1}^{m} \norm{x^{(i)} }^2} \leq 0.01 $

"99% of variance retained"

You can tolerate up to 5 or 10% loss depending on how aggressive you want to be in reduction.

So we can keep removing vectors with the lowest eigenvalues until this expression is violated, or perhaps use binary search.

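Another tip: the svd function returns more than just the $U$ matrix. Applied to the covariance matrix $\Sigma$ (numpy returns the diagonal of $S$ as a 1-d array, largest value first), you get:

U, S, V = np.linalg.svd(Sigma)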

and

$\dfrac {\frac{1}{m} \sum_{i=1}^{m} \norm{x^{(i)} - x_{approx}^{(i)} }^2} {\frac{1}{m} \sum_{i=1}^{m} \norm{x^{(i)} }^2} $

can be computed by

$1 - \dfrac {\sum_{i=1}^k s_{ii}} {\sum_{i=1}^n s_{ii}}$

So you don't actually have to re-run PCA for each value of $k$, you can figure out the optimal $K$ using the $S$ matrix returned from the svd procedure, and then use the first $k$ vectors of $U$ accordingly.
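
As a sketch of that shortcut, the helper below (the name `smallest_k` is mine) picks the smallest $k$ from the eigenvalues alone. The eigenvalues are made up for illustration.

```python
import numpy as np

def smallest_k(S, retain=0.99):
    """Smallest k whose leading eigenvalues keep at least `retain` of the variance."""
    retained = np.cumsum(S) / np.sum(S)   # variance retained for k = 1, 2, ..., n
    return int(np.searchsorted(retained, retain) + 1)

# Made-up eigenvalues in decreasing order, the order np.linalg.svd returns them in.
S = np.array([5.0, 3.0, 1.5, 0.5])
print(smallest_k(S, 0.99), smallest_k(S, 0.90))
```

With these values the retained fractions are 0.5, 0.8, 0.95 and 1.0, so a 90% target needs three components and a 99% target needs all four.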

Reconstructing the original feature space

Recall that we get our compressed dataset $Z$ via our reduction matrix $U_{reduce}$ that has the $K$ eigenvectors with the largest eigenvalues

$Z = U_{reduce}^T X$

To reconstruct an approximated $X$ we can reverse the computation:

$X_{approx} = U_{reduce} Z$
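
Here is a small numpy sketch of the round trip, again on synthetic data and assuming rows are examples (so the reconstruction is written `Z @ U_reduce.T`). It also checks that the measured projection error ratio matches the eigenvalue formula from the previous section.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic data with very different spread per feature; rows are examples.
X = rng.normal(size=(100, 4)) * np.array([5.0, 2.0, 0.5, 0.1])
X = X - X.mean(axis=0)

U, S, Vt = np.linalg.svd(X.T @ X / X.shape[0])

def reconstruction_error(k):
    """Ratio of squared projection error to total variation for k components."""
    U_reduce = U[:, :k]
    Z = X @ U_reduce             # compress to k dimensions
    X_approx = Z @ U_reduce.T    # map back to the original n dimensions
    return np.sum((X - X_approx) ** 2) / np.sum(X ** 2)

for k in range(1, 5):
    print(k, reconstruction_error(k), 1 - S[:k].sum() / S.sum())
```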

My chapter 5 notebook

You can see my chapter 5 notebook here which is pretty much the same as the author's but I add in a couple of things to cross reference the minor differences in Andrew Ng's course, such as verifying that applying SVD also provides the eigenvectors/values and tying together the terms "cumulative explained variance" and "preservation of variance".

A notebook on inverse transform sampling

2016-06-09T00:00:00Z

Inverse Transform Sampling

In continuing to study the transformation of random variables, I had a bit of an aha moment in understanding the significance of a result I proved in this homework problem:

Let $F$ be a CDF, and $U$ a random variable uniformly distributed on $[0, 1]$. Then $F^{-1}(U)$ is a random variable with CDF $F$.

The implication is that we can use the inverse CDF function of any random variable and combine it with a (uniform) random number generator to make a random number generator that adheres to its distribution. This is known as "inverse transform sampling". I wrote up a notebook attempting to tie together the math with some code to make the concept stick.
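
A minimal sketch of the idea, using the exponential distribution as an example of my own choosing since its inverse CDF has a closed form: $F(x) = 1 - e^{-\lambda x}$ gives $F^{-1}(u) = -\ln(1 - u) / \lambda$.

```python
import math
import random

def sample_exponential(rate, rng):
    """Draw from Exponential(rate) by pushing a uniform draw through F^{-1}."""
    u = rng.random()                    # U ~ Uniform[0, 1)
    return -math.log(1.0 - u) / rate    # F^{-1}(u) for F(x) = 1 - exp(-rate * x)

rng = random.Random(42)
samples = [sample_exponential(2.0, rng) for _ in range(100_000)]
sample_mean = sum(samples) / len(samples)
# Empirical CDF at x = 0.5 versus the true F(0.5) = 1 - exp(-1).
empirical_cdf = sum(s <= 0.5 for s in samples) / len(samples)
print(sample_mean, empirical_cdf, 1 - math.exp(-1.0))
```

The sample mean should land near $1/\lambda = 0.5$ and the empirical CDF near the true value.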

Working with multiple random variables: conditionals, marginals, transformations and IID samples

2016-06-08T00:00:00Z

I spent much of today working through all parts of chapter 2 that I haven't carefully covered yet. I've realized that the way to do it is to sit down with the book, paper and pencil and attempt every example provided. It helps solidify the concepts, and is a quick mini problem set to attempt, making the problem sets easier to tackle later as well.

All of Statistics is really a fantastic book, you just really need to take it one page, one example at a time. There's no fluff, no time spent re-explaining. But if you carefully work through the examples, much of it sticks.

Independent random variables

Independence with random variables is a similar concept as studied earlier with random events. The definition is less interesting to me than the theorem concerning their joint distribution function:

$X$ and $Y$ are independent if and only if $f_{X,Y}(x, y) = f_X(x)f_Y(y)$ for all values $x$ and $y$.

The book has a couple of examples using joint distributions of two discrete random variables where you can exhaustively check every x,y pair and verify that $f(x,y) = f(x)f(y)$.
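
The exhaustive check is easy to mechanize. The sketch below uses two made-up joint distributions (not the book's tables) and a hypothetical helper `is_independent`.

```python
from fractions import Fraction
from itertools import product

F = Fraction
# Two made-up joint distributions over x, y in {0, 1}.
independent_joint = {(0, 0): F(1, 4), (0, 1): F(1, 4), (1, 0): F(1, 4), (1, 1): F(1, 4)}
dependent_joint = {(0, 0): F(1, 2), (0, 1): F(0), (1, 0): F(0), (1, 1): F(1, 2)}

def is_independent(joint):
    """Check f(x, y) == f_X(x) * f_Y(y) for every (x, y) pair."""
    xs = sorted({x for x, _ in joint})
    ys = sorted({y for _, y in joint})
    f_x = {x: sum(joint[(x, y)] for y in ys) for x in xs}   # marginal of X
    f_y = {y: sum(joint[(x, y)] for x in xs) for y in ys}   # marginal of Y
    return all(joint[(x, y)] == f_x[x] * f_y[y] for x, y in product(xs, ys))

print(is_independent(independent_joint), is_independent(dependent_joint))
```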

The other examples show how, if you know or assume two random variables are independent, you can derive the joint distribution function $f_{X,Y}(x, y)$ by multiplying their density functions together.

Finally, there's a useful theorem that if you can break a joint density function of two random variables that have a range of a (possibly infinite) rectangle into the product of two functions (any function, not necessarily the density functions), then the variables are independent. So, for instance, if we have

$f_{X,Y}(x,y) = c\,e^{-(x+y)}$

over a rectangular region (say for $0 \leq x \leq 2, 0 \leq y < \infty$, with $c$ the normalizing constant), we know $X$ and $Y$ are independent because we can express $f_{X,Y}(x,y)$ as the product $c\,e^{-x} \cdot e^{-y}$. And again, note that this does not mean that $f_X(x) = e^{-x}$.

Conditional distributions

Conditional probability generalizes to random variables, distributions and relates back to marginal distributions:

$f_{X \mid Y} (x \mid y) = \frac{f_{X,Y}(x,y)}{f_Y(y)}$

so given a joint density function, we can compute the marginal distribution and use them together to compute conditional probability.

For discrete distributions we can plug in the values directly. For the continuous case, we need to integrate over the area where X is defined (as when computing probabilities using probability density functions).
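
For the discrete case, plugging in the values looks like this. The joint mass function is made up for illustration and the function names are mine.

```python
from fractions import Fraction as F

# Made-up joint mass function f(x, y) for x, y in {0, 1}.
joint = {(0, 0): F(1, 8), (0, 1): F(3, 8), (1, 0): F(2, 8), (1, 1): F(2, 8)}

def marginal_y(y):
    """f_Y(y): sum the joint mass over every x."""
    return sum(p for (_, yy), p in joint.items() if yy == y)

def conditional_x_given_y(x, y):
    """f_{X|Y}(x | y) = f_{X,Y}(x, y) / f_Y(y)."""
    return joint[(x, y)] / marginal_y(y)

print(conditional_x_given_y(0, 1), conditional_x_given_y(1, 1))
```

Here $f_Y(1) = 5/8$, so the two conditional probabilities given $Y = 1$ are $3/5$ and $2/5$, which sum to 1 as they should.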

Random Vectors and IID Samples

Beyond joint distributions of two variables, we can generalize to random vectors. A random vector $X = (X_1, ..., X_n)$ has PDF $f(x_1, ..., x_n)$. We can define marginal and conditional distributions in the same way as for bivariate distributions (the bivariate case is just a random vector of length 2, which generalizes to length $n$).

The book briefly describes IID Samples:

If $X_1, ..., X_n$ are independent and each has the same marginal distribution with CDF $F$ we say $X_1, ..., X_n$ are IID (independent and identically distributed) and write $X_1, ..., X_n \sim F$. If $F$ has density $f$ we may also write $X_1, ..., X_n \sim f$. We call $X_1, ..., X_n$ a random sample of size n from $F$.

This seems like it will be a very important concept when I finally get to the inference part of the book and I'm finally feeling like the probability theory is getting closer to connecting to the ML topics.

Transformation of several random variables

The techniques covered before to transform a random variable using some function still apply, but the example provided is confusing. It seems that the most challenging part is reasoning about how the bounds of the new random variable should be crafted.

In Example 2.48, we are asked to find the density of $Y = X_1 + X_2$ where $X_1$ and $X_2$ are independent Uniform(0, 1), so that the joint density of $(X_1, X_2)$ is:

$f_x(x_1, x_2) = \begin{cases} 1 & 0 < x_1 < 1, 0 < x_2 < 1 \\ 0 & otherwise \end{cases} $

The solution is found by first finding the CDF:

$F_Y(y) = \begin{cases} 0 & y < 0 \\ \frac{y^2}{2} & 0 \leq y < 1 \\ 1 - \frac{(2 - y)^2}{2} & 1 \leq y < 2 \\ 1 & y \geq 2 \end{cases} $

and then differentiating to get the PDF:

$f_y(y) = \begin{cases} y & 0 \leq y \leq 1 \\ 2 - y & 1 \leq y \leq 2 \\ 0 & otherwise \end{cases} $

To find the CDF the author describes:

Suppose that $0 < y \leq 1$. Then $A_y$ (the part of the unit square where $x_1 + x_2 \leq y$) is the triangle with vertices (0,0), (y,0) and (0,y). The area of this triangle is $y^2 / 2$. If $1 < y < 2$ then $A_y$ is everything in the unit square except the triangle with vertices (1, y-1), (1,1), (y - 1, 1). This set has area $1 - (2-y)^2 / 2$.
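
Since the joint density is 1 on the unit square, $F_Y(y)$ is just the area of $A_y$, which a simulation can confirm. This is my own sanity check, not part of the book's solution.

```python
import random

def cdf_sum_of_uniforms(y):
    """F_Y(y) for Y = X1 + X2 with X1, X2 independent Uniform(0, 1)."""
    if y < 0:
        return 0.0
    if y < 1:
        return y ** 2 / 2             # area of the small corner triangle
    if y < 2:
        return 1 - (2 - y) ** 2 / 2   # unit square minus the far corner triangle
    return 1.0

rng = random.Random(7)
draws = [rng.random() + rng.random() for _ in range(200_000)]
for y in (0.5, 1.0, 1.5):
    empirical = sum(d <= y for d in draws) / len(draws)
    print(y, empirical, cdf_sum_of_uniforms(y))
```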

I will come back to this example with a fresh brain later and see if I can get it to click—I'll need to understand this if I hope to tackle this problem in my homework TODO list.

Transforming random variables, joint and marginal distributions, and The Rule of the Lazy Statistician

2016-06-06T00:00:00Z

HW on transformation of random variables and joint density functions

I've been catching up on probability homework from chapter 2.

In this problem I had to prove this theorem:

Let $F$ be a cdf, and $U$ a random variable uniformly distributed on $[0, 1]$. Then $F^{-1}(U)$ is a random variable with cdf $F$.

After looking carefully at the definitions of inverse (or quantile) CDFs and reviewing the techniques for how to transform a random variable by applying a function to it, I finally (mostly) got it right before checking with the solution. I wasn't precise in thinking where the resulting random variable was defined, but got the main point. Was a good review of working carefully with the definitions of random variables and cumulative distribution functions.

This problem:

Let $X$ have a CDF $F$. Find the CDF of $X^+ = \text{max}\{0,X\}$.

was more straightforward and again reviewed the technique of transforming a random variable.

These two problems both concerned joint density functions. In both cases, thinking clearly about how to integrate over 2d regions was the key.

Marginal Distributions

I read about marginal distributions: you can integrate out a random variable from the joint distribution to get back to a distribution without that variable.

For joint density function $f_{X,Y}$ the marginal density $f_X(x) = \int f(x,y)dy$

In the discrete case, you sum over all values of the variable you are marginalizing.

$f_X(x) = P(X=x) = \sum_y P(X = x, Y = y) = \sum_y f(x,y)$
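
For the continuous case, here is a numerical sketch of integrating out $y$. The joint density $f(x,y) = x + y$ on the unit square is an illustrative choice of mine; its marginal works out analytically to $f_X(x) = x + \frac{1}{2}$.

```python
def joint_density(x, y):
    """Illustrative joint density f(x, y) = x + y on the unit square."""
    return x + y if 0 <= x <= 1 and 0 <= y <= 1 else 0.0

def marginal_x(x, n=10_000):
    """Integrate y out of the joint density with a midpoint Riemann sum."""
    h = 1.0 / n
    return h * sum(joint_density(x, (j + 0.5) * h) for j in range(n))

for x in (0.0, 0.25, 1.0):
    print(x, marginal_x(x), x + 0.5)   # numeric vs analytic marginal
```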

Studying Expectation

I began reading ahead into chapter 3 and watching math monk probability videos on expectation. The formulas are pretty straightforward for both the discrete and continuous cases. Some tidbits that were new to me:

A random variable may not have a well defined expectation
A random variable may have a well defined infinite expectation

Well defined

In order to determine whether an expectation is well defined, break the sum / integral up into its $> 0$ and $< 0$ cases (positive and negative parts); so long as at least one of them is finite, the entire summation / integral is "well defined".

An example of a continuous random variable $X$ with an undefined expectation is The Cauchy distribution.

$ f(x) = \frac {1}{\pi(1 + x^2)}$

$ E(X) = \int_{-\infty}^{\infty} \frac {x}{\pi(1 + x^2)} dx $

It can be shown that $\int_{-\infty}^{0} \frac {x}{\pi(1 + x^2)} dx $ diverges to $-\infty$ and $\int_{0}^{\infty} \frac {x}{\pi(1 + x^2)} dx $ diverges to $+\infty$, making the sum of the two undefined. This is a bit hand wavy (see Wikipedia for a more rigorous description), but gives the intuition behind why this integral is undefined: you can't add negative infinity and positive infinity together and have a well defined value.
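
To make the divergence concrete, here is the positive half worked out (my own calculation rather than the book's), truncating the upper limit at $T$:

$\int_0^T \frac{x}{\pi(1 + x^2)} dx = \frac{1}{2\pi} \ln(1 + T^2) \to \infty$ as $T \to \infty$

By symmetry the negative half is the same quantity with a minus sign, so it grows without bound in the other direction.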

Expectation rule aka The Rule of the Lazy Statistician

Another interesting property concerns how to compute the expectation of a function of a random variable.

What if we know the density function $f_X(x)$ of $X$ and, for some function $g$, wish to know $E(g(X))$, but don't know the density of $g(X)$?

The rule of the Lazy Statistician says we can plug $g(x)$ in for $x$ as follows:

$E(g(X)) = \sum_x g(x) f_X(x)$

and for the continuous case:

$E(g(X)) = \int_{-\infty}^{\infty} g(x) f_X(x) dx$

This seems pretty handy: we know the densities of uniform distributions, normal distributions and a host of others, so if we would like to find the expected value of a random variable $Y = g(X)$ where $X$ has a PDF $f_X$ that we already know, we can go that route without having to derive $f_Y$ and compute the integral $\int_{-\infty}^{\infty} y f_Y(y) dy$.

Feature selection via L1 regularization penalty, greedily removing least impactful features and random forests

2016-06-02T00:00:00Z

Wrapped up Chapter 4 of Python Machine Learning today which covered selecting meaningful features. Note that feature selection is distinct from feature extraction as the author describes.

Using feature selection, we select a subset of the original features. In feature extraction, we derive information from the feature set to construct a new feature subspace.

Reducing the number of features can help combat overfitting, and also improve the performance of a model. Plus, we may get all the way down to 2 or 3 dimensions, where we can visualize the dataset graphically.

L1 vs L2 penalty for feature selection

As previously reviewed, overfitting can be detected when the predictive performance of a trained model is worse on an unseen test dataset than on the training dataset. One technique to combat this mentioned earlier is 'regularization', where an additional term is added to the cost function used during training: the sum of the squares of the weights of the model. This means, all else being equal, we would like our weights to be smaller. Smaller weights have a lower chance of stretching the model to overfit the dataset it was trained on.

Using the sum of the squared weights is called L2 regularization. If we simply sum the absolute values, we're using L1 regularization. For reasons the author does a pretty good job of explaining, using L1 regularization, in addition to shrinking the parameters, also tends to send some parameters down to zero faster than others, effectively eliminating some parameters from the model, or selecting a subset of features.

So: long story short, if you'd like to reduce the number of features at play in your model, and not just shrink them, using an L1 penalty for regularization (available in LogisticRegression) can do the job.
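
A small sketch of the effect on synthetic data (not the wine dataset from the book). The value `C=0.05` is an arbitrary choice that makes the regularization fairly strong, and `count_zero_weights` is a helper name of my own.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data: only 3 of the 20 features carry signal.
X, y = make_classification(n_samples=300, n_features=20, n_informative=3,
                           n_redundant=0, random_state=0)

def count_zero_weights(penalty, C):
    """Fit logistic regression and count weights that are exactly zero."""
    # liblinear is one of the solvers that supports both penalties.
    model = LogisticRegression(penalty=penalty, C=C, solver="liblinear")
    model.fit(X, y)
    return int(np.sum(model.coef_ == 0))

zeros_l1 = count_zero_weights("l1", C=0.05)
zeros_l2 = count_zero_weights("l2", C=0.05)
print(zeros_l1, zeros_l2)
```

The L1 model should zero out a good share of the weights, while the L2 model shrinks them without eliminating any.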

Here are two graphs of how the weights are affected by regularization parameters in L2 and L1 regularization:

(The book had shown just the L1 case, but I thought it'd be interesting to see both for comparison). Note that with the L1 penalty, some of the features drop out more quickly than others, e.g. 'magnesium' is down to zero as C gets to 10^4.

Sequential feature selection

Another technique to remove features is to try removing each one and see which has the least impact on the performance against the training / test set. This can be repeated until you are down to the number of features you'd like, or the performance suffers too much.

The book implements this approach but it's pretty straightforward so I'll leave it at that.

Random forests' insights into feature importance

Remember that decision trees work by dividing and conquering the training data by splitting on feature thresholds at each level. It may determine, for instance, that the way to maximize the difference in entropy between two levels in a tree is to have one leaf handle every row where sex is 'male' and the other where sex is 'female'.

So you can imagine that as part of this process, a decision tree is in effect learning what the most decisive features are, and as such, scikit-learn's random forest classifier makes this available via the .feature_importances_ attribute of the model.

The book warns

... the random forest technique comes with an important gotcha that is worth mentioning... if two or more features are highly correlated, one feature may be ranked very highly while the information of the other feature(s) may not be fully captured. On the other hand, we don't need to be concerned about this problem if we are merely interested in the predictive performance of a model rather than the interpretation of feature importances.

Second attempt at Kaggle's Titanic data set, accuracy up to 78%, notes on preprocessing and Pandas

2016-06-01T00:00:00Z

Getting up to 78% on the Titanic dataset

My goal today was to get at or above 77% in Kaggle's Titanic: Learning from Disaster starter competition after getting 73% on my first attempt, and I managed to do so, though not in the way I expected.

I spent most of the time exploring the dataset to decide whether two of the features I dropped from the dataset could be of use and concluded that they could:

It appears that those who depart from Southampton croak disproportionately and that those who stayed in cabin sectors b, c, d and e survived disproportionately. Note that I did a little bit of feature engineering by segmenting the raw cabin into a sector by taking the first letter.

Indeed, adding these features boosted logistic regression by a couple points (from 73% to 75%), but it wasn't until I simply tried submitting a solution predicted by a decision tree model that I reached my goal—and it turns out doing so with my original simplified dataset worked just as well!

So my intuition that futzing with more models was less important than looking more deeply at the features was not quite right. I think the answer is that both are important, and it always helps to get some intuition by exploring the data.

Preprocessing data

The two things I spent the most time on during these attempts were preprocessing the data and figuring out how to get matplotlib to do what I wanted.

For preprocessing features, here is the approach:

Binary Categorical variables should be mapped to 0, 1 if they haven't already
Ordinal categorical variables should be mapped to real numbers, e.g. 1, 2, 3, ... up to the N possible values the variable takes
Categorical variables with more than two values need to be one-hot encoded, that is, flattened out into one binary variable per possible value. So variable 'color' with values 'blue', 'green', 'red' gets mapped to 'is_green', 'is_blue', 'is_red'. Any missing values can simply have 0s across the board
Missing values for quantitative variables can be filled in with the median or mean. So if we don't know how old someone is, we assume he/she is the average age.
All quantitative features are standardized to mean 0 and standard deviation 1. This puts them on a common scale; it does not by itself make the values normally distributed
All categorical features (now mapped to binary) are pushed to -1 or 1
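Two of the steps above, one-hot encoding and median imputation, can be sketched in plain Python. This is only an illustration and not the code from the notebooks; the helper names `one_hot` and `fill_missing` and the sample values are made up.

```python
import statistics


def one_hot(values, categories):
    """Map each value to a dict of binary indicator columns.

    A missing value (None) gets 0 in every column.
    """
    return [
        {f"is_{c}": int(v == c) for c in categories}
        for v in values
    ]


def fill_missing(values):
    """Replace None with the median of the known values."""
    known = [v for v in values if v is not None]
    median = statistics.median(known)
    return [median if v is None else v for v in values]


colors = ["blue", None, "red"]
ages = [22.0, None, 38.0, 26.0]

print(one_hot(colors, ["blue", "green", "red"]))
print(fill_missing(ages))
```

In practice Pandas covers both steps (for instance `get_dummies` and `fillna`), so a hand-rolled version like this is mostly useful for seeing what the transformation does.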

An important point emphasized by the Python ML book:

Again, it is also important to highlight that we fit the StandardScaler only once on the training data and use those parameters to transform the test set or any new data point.

The preprocessing higher order function I wrote takes this into account by capturing a StandardScaler fit to the training set that is used by the preprocessing function it returns, which can then be reused to preprocess other datasets.
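The closure idea can be sketched as follows. This is a simplified stand-in, not the function from the notebooks: `make_standardizer` is a hypothetical name, it handles a single numeric column, and it uses the population standard deviation, which is what StandardScaler uses.

```python
import statistics


def make_standardizer(train_values):
    """Fit mean and std on the training values only, and return a
    function that applies those same parameters to any dataset."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values)  # population std (ddof=0)

    def standardize(values):
        # Reuses the training mean/std; nothing is re-fit here.
        return [(v - mean) / std for v in values]

    return standardize


train_ages = [20.0, 30.0, 40.0]
test_ages = [30.0, 50.0]

standardize = make_standardizer(train_ages)
print(standardize(train_ages))
print(standardize(test_ages))
```

The test values are scaled relative to the training mean of 30, so a test age of 30 maps to 0 even though the test set's own mean is 40.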

See the notebooks for all the details.

Pandas

Pandas is quite helpful for most data processing needs mentioned above. I'm finding it's worth taking time to understand the two key data structures provided: Series and DataFrames. The book Python for Data Analysis, by the author of Pandas, is worth picking up too.

Pandas suffers from a bit of a kitchen sink OO API for my taste compared to the elegance of, say, Clojure's Seq library or Ruby's Enumerable, but I'm overall grateful they exist and have come to prefer working with them over using numpy 2d arrays directly.

Kaggle

I'm loving Kaggle. Having real datasets to work with and a sense via the leaderboards as to whether you are doing a competitive job is great, and the forums have plenty of tips if you are looking for hints. One key decision they made product wise that I appreciate is to allow you to submit and receive a score for competitions after they have closed, so every competition can continue to serve as a training ground. Perhaps more important than recognizing ML skills, they are building up a database of challenges / tutorials.

The Python Machine Learning book is the perfect way to get ramped up to feeling comfortable attempting competitions. If you are in a rush, reading chapters 1, 3 and 4 should be plenty to get you going.

First attempt at Kaggle's starter competition classifying Titanic passenger survivorship

2016-05-31T00:00:00Z

Over the holiday weekend I wrapped up my first attempt at Kaggle's competition that asks you to predict survivorship of Titanic passengers. It's a perfect starter project as it has a mix of categorical and quantitative variables and a single binary output. There are a couple of hurdles in preprocessing the data (missing values, mapping categorical variables, etc.) so making it most of the way through chapter 4 of Python Machine Learning was timely.

My solution was relatively quick and dirty; I simply dropped a couple of the variables that were mildly inconvenient to work with and didn't bother with any exploratory analysis, as I wanted to get to a working solution ASAP. So I tried out logistic regression, linear and non-linear SVM, decision tree and random forest models on my cleaned data and they all performed nearly the same. I split the provided training dataset into a 70/30 training/test split so that I could do some basic checks on overfitting; the tree based models indeed performed better on the training set than the test, but no worse on the test than logistic regression. I then made two solutions by using the trained logistic regression and random forest models and preprocessing function on the provided test dataset.

Random forest:

Logistic regression:

I now rank 3931st out of 4287 submissions to that competition—I think I can consider my ML sabbatical mission accomplished!

It has been fun though to get my feet wet. From reading around (but not peeking!) it seems that it should be possible to get closer to 80%, and many suspect that those who score much higher are cheating and looking at the full dataset.

I've gotten started on a follow up attempt where I hope to get into the high 70s.

K-nearest neighbors, getting started with ch04 (data preprocessing) and Kaggle's Titanic data set

2016-05-26T00:00:00Z

K-nearest neighbors algorithm

Wrapped up chapter 3 of Python Machine Learning by applying the K-nearest neighbor algorithm.

KNN is clever: it just indexes the training data so that predicting a new value means finding the k closest data points and using them to vote on the class membership. It can learn on the fly, as new known data points can be added to the training set at any time. The disadvantages are the space needed to keep the model on hand and a live lookup cost that grows with the size of the training set (either linear or logarithmic depending on how the storage / indexing is optimized). It also apparently performs poorly in high dimensional space, where being "close" becomes less meaningful and where sparseness can also mean there isn't really anything close to a new data point we wish to classify. However, we'll learn in the coming chapters about dimensionality reduction.
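A minimal brute-force sketch of the idea, assuming Euclidean distance and a plain majority vote (the function name and the toy points are made up for illustration). This is the linear-lookup variant; a tree-based index would be needed for the logarithmic case.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # "Training" is just storing the data; all the work happens at lookup,
    # where every stored point is compared against the query.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
          (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, (0.15, 0.15)))
print(knn_predict(points, labels, (0.95, 1.0)))
```

Adding a new labeled point is just an append to `points` and `labels`, which is the "learn on the fly" property mentioned above.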

Data preprocessing

Got started on chapter 4 which is all about data preprocessing. There are some standard techniques for dealing with missing data, including filtering out, and filling in values with something sensible like the mean (e.g for any missing 'height' value just fill in the average height). My WIP.

My first Kaggle

Per recommendations I chose this starter kaggle competition that asks you to build a model to predict survivorship as my first Kaggle competition. I feel like I've learned just enough scikit-learn to tackle this so I don't want to delay any further, even if the data preprocessing know-how is arriving just in time. I just got far enough to load in the data and sketch out a preprocessing strategy. My WIP.

Discrete and continuous random variable review, and down the math rabbit hole

2016-05-25T00:00:00Z

Discrete and continuous random variable review

When do continuous random variables have a density?

In Math Monk's video on types of random variables an interesting factoid is brought up: a continuous random variable, that is, from a measure theory perspective, a random variable whose cumulative distribution function is continuous, may not have a probability density. So when people say "continuous random variables" they usually mean "continuous random variable that has a density". The classic counter-example of a continuous function without a density is the Cantor function, a strange staircase-like function that is continuous but whose derivative is zero almost everywhere, so it cannot be recovered by integrating a density.

Backing up a bit, random variables have a cumulative distribution function $F_X$ whose derivative is a probability density function $f_X$ (that is, $F_X(x) = \int_{-\infty}^{x} f_X(u)du$), so it sort of makes sense that if you can imagine a continuous function that doesn't have a well defined derivative, then you could not come up with a density and thus the random variable would not "have a density".

As for why this matters, ¯\_(ツ)_/¯, but I suppose it's worth remembering that there exist continuous random variables that do not have a density.

Density vs mass

In reviewing discrete and continuous random variables (with density) again, I was reminded of the difference between a probability mass function, which applies to discrete random variables, and a probability density function, which applies to continuous random variables.

With discrete random variables, the probability mass function is $f_X(x) = P(X=x)$. There exists a probability that $X$ has each discrete value.

With continuous random variables, the probability density $f_X(x)$ is the function such that $P(a < X < b) = \int_a^b f_X(x)dx$. Note that you can't really think of this as being the probability that $X$ is equal to $x$, as $P(X=x)$ is always zero in the continuous case.

Math Monk's videos use different notation for the mass and density functions, and the All of Statistics book just uses $f_X$ in both cases.

Both discrete and continuous random variables have cumulative distribution functions of $F_X(x) = P(X \leq x)$.

Important random variables

Both the All of Statistics book and Math Monk's playlist do a quick review of popular discrete and continuous random variables, e.g. Binomial, Bernoulli, Gaussian / normal, Exponential, Beta.

I found a notebook exploring some of these distributions that is a nice complement. It's also nice to see that scipy has a ton of random variables ready to use.

Down the math rabbit hole

When reviewing probability theory I occasionally find myself down a rabbit hole looking for a ground truth of sorts, and this afternoon was one of those times. I landed on this introduction to higher mathematics playlist and have added it to the "math & proof fundamentals" section of the ml curriculum page. The first two videos are a nice "what the hell is math" and "what does it mean to know or prove something" review.

First, this reminds me that reflecting back to much of the math curriculum of high school and college, I'm disappointed in the habit most of the courses had of avoiding proofs and instead jumping to the "here's the formula to plug stuff into" and "practice taking word problems and figuring out how to apply some combination of formulas to get the right answer" pairing. These were supposedly advanced courses, but I guess in being part of an engineering curriculum, they necessarily exchanged rigor for breadth. But I wonder if I had instead taken math major focused courses where proofs were the bread and butter I wouldn't be in a much stronger position right now as I work towards being fluent in the language of probability theory and bayesian statistics.

All is not lost however: the years of engineering focused math still honed problem solving skills, and ultimately, even the art of rigorous proofs boils down to problem solving. One nice thing about the study of algorithms that has been a continuous part of programming is that looking at working code that implements an algorithm is a sort of formal definition in itself. And I've been able to use probability theory as a nice test case of really mastering the fundamental axioms of a field and seeing how the theorems build upon each other all the way up. The end goal is to have a deeper understanding of some of the more advanced bayesian ML techniques rather than only being able to take an off-the-shelf package and load data into it (though I want to do that too).

The trick will be to not get too caught up in studying every layer of mathematics beneath probability theory, e.g real analysis and measure theory (and topologies and ...). Dipping beneath the surface every now and again is fine, and I enjoy thinking about the fundamentals of math in terms of methods of proofs and how to think mathematically, but if I'm not careful, this may turn into a decade long sabbatical :)

Random Variables again, regularization to combat high variance and a tour of some classifiers in Scikit-learn (SVMs, Decision Trees)

2016-05-24T00:00:00Z

Morning probability warmup

Watched a couple more math monk videos and got caught up for a second on the definition of a random variable as it pertains to measure theory (at this spot in the video). It corresponds to what the All of Stats book notes,

Recall that a probability measure is defined on a $\sigma$-algebra $\mathcal{A}$ of a sample space $\Omega$. A random variable $X$ is a measurable map $X: \Omega \rightarrow \Bbb R$. Measurable means that, for every $x, \{\omega: X(\omega) \leq x\} \in \mathcal{A}$.

Googling around unearthed this concise yet rigorous overview of probability that helped.

Random Variables are useful because they map outcomes of a sample space to real numbers, so that probabilities on this sample space can be defined in terms of the Borel sigma algebra. Specifically, a random variable $X(\omega)$ is a mapping that assigns a real number to each outcome $\omega$ of the sample space $\Omega$. Then we can talk about the probability of events of the form $\{\omega \mid X(\omega) \leq x\}$, where $x$ is a given real number, by using a probability measure defined over the interval $(-\infty, x]$, an interval that falls within the Borel sigma algebra. This probability value, viewed as a function of $x$, is called the cumulative distribution function for $X$ and is usually denoted as $F_x(x)$.

So essentially a random variable must be measurable, as evidenced by the existence of a cumulative distribution function. Ahh random variables, such a joy to continuously try to grok. Also note that in practice, most naturally occurring random variables are measurable, so these technical details about a measurable function aren't really that important.

I've added this to the ml curriculum page.

Regularization to combat overfitting

One topic chapter 3 of Python Machine Learning covers is regularization, a technique used to combat overfitting data.

When a model overfits the data it has been trained on, it has "high variance". It's called "variance" because you can imagine that if you trained the model on different samples of a larger dataset, the weights would vary quite a bit as it overfit whatever it was trained on. Overfitting is bad because it means a model is unlikely to generalize and accurately fit / predict unseen data. (Note: the opposite problem where a model fails to fit the data at all is "high bias").

Regularization can help combat overfitting by penalizing large weights, and this is accomplished by adding this term to the cost function:

$$\lambda \sum_{i=1}^n w_i^2$$
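As a quick numeric illustration of that penalty term, here is a small sketch. The function names, the weight vectors and the data cost of 1.0 are made up for the example and do not come from the book.

```python
def l2_penalty(weights, lam):
    """lambda times the sum of squared weights."""
    return lam * sum(w * w for w in weights)


def regularized_cost(data_cost, weights, lam):
    """Unregularized cost plus the L2 penalty."""
    return data_cost + l2_penalty(weights, lam)


small = [0.5, -0.5, 0.5]
large = [5.0, -5.0, 5.0]

# Same (made-up) data cost for both, so only the penalty differs.
print(regularized_cost(1.0, small, lam=0.1))
print(regularized_cost(1.0, large, lam=0.1))
```

Scaling the weights by 10 scales the penalty by 100, which is why the optimizer is pushed toward smaller weights as $\lambda$ grows.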

Chapter 3 of Python Machine Learning covers this briefly, though I think the choice of example with logistic regression wasn't the best, since it doesn't actually correspond to a bias / variance tradeoff.

The book showed how the parameters are allowed to get larger when the regularization parameter $C$ increases (which is the inverse of $\lambda$, so the regularization effect is decreasing), but as you can see in the second graph I created, both the training and test accuracy get better, so this isn't really a great use case.

Later, when tuning $\gamma$ with SVMs we show how increasing the parameter too much actually does lead to overfitting, which corresponds to excellent training fit but poorer test fit.

SVMs

Speaking of SVMs, the book briefly covers support vector machines in chapter 3. I get the big idea that instead of optimizing across all points when attempting to separate the classes of data, you are focusing on the points near the boundary, maximizing the margins, but the book doesn't go into enough theory to really derive the implementation. That's fine for now, it's just cool to see how easy it is to pass in the same data sets into different algorithms using scikit-learn. It was also cool to see the kernel trick in action, where by transforming a dataset a previously linearly inseparable data set becomes separable with the same algorithm. Here's the same XOR dataset separated by SVMs with a linear and then a Gaussian kernel:

Decision Trees

Decision trees are cool. The learning algorithm automatically constructs a binary decision tree where the class membership is determined at the leaves. Decision trees can fit nonlinearly separable data sets and have the added bonus that the models are themselves interpretable by humans.

The trouble with decision trees is that they can overfit the data. This can be mitigated by limiting their depth.

One of the most popular and powerful "ensemble learning" techniques is to combine multiple decision trees that together vote on class membership (random forests). This tends to make the model more robust against overfitting and improve accuracy to boot. The trees tend to cancel each other's overfitting problems so that less tuning is required when training the model. Each tree in the forest looks at a random subset of features and a random sample of the data.

Building a tree: maximizing information gain

The book does a nice job at concisely describing how the learning process works: by maximizing the information gain between levels in the tree. Information gain is quantified by a difference in "impurity". A node where every element belongs to the same class is perfectly pure, and a node where the data is evenly distributed across all classes is perfectly impure, so you can imagine that if you progressively sort the data into nodes where they are grouped together by class, you have built a tree that can help you classify.

The three measures of impurity the book covers are Gini index, entropy and classification error. I won't bother rehashing it here beyond a quick example of how the Gini index is defined:

$$\sum_{i=1}^{c} p(i \mid t)(1 - p(i \mid t)) = 1 - \sum_{i=1}^{c} p(i \mid t)^2$$

where $p(i \mid t)$ is the proportion of samples at node $t$ that belong to class $i$, and $c$ is the number of classes. So the impurity is maximized when the samples are evenly spread across the classes and minimized (at zero) when they all belong to the same class.
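A small sketch of the Gini index and the resulting information gain for a binary split, assuming the gain is the parent's impurity minus the size-weighted impurity of the two children. The function names and the toy labels are made up for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity, 1 - sum_i p(i|t)^2, of the class labels at a node."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def information_gain(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted


parent = ["a", "a", "b", "b"]

print(gini(["a", "a", "a", "a"]))                       # pure node
print(gini(parent))                                     # evenly mixed
print(information_gain(parent, ["a", "a"], ["b", "b"]))  # perfect split
print(information_gain(parent, ["a", "b"], ["a", "b"]))  # useless split
```

The learner would evaluate many candidate splits this way and keep the one with the largest gain.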

Playing with scikit

The book runs through training the same dataset using scikit-learn's decision tree and random forest learner. Very cool to see how easy this is.

As a bonus, I also explored the overfitting balance with tree depth, where I couldn't really get a single decision tree to overfit the iris dataset that's used across chapter 3, and applied the random forest classifier to the xor dataset:

A conference-like day: high level understanding of MCMC and connecting with a study buddy

2016-05-20T00:00:00Z

Today I didn't really get anything done, but had fun exploring some high level ideas and got to geek out about ML for a couple of hours. It was a lot like attending a conference!

High level reading / listening about Bayesian methods

Talking Machines

First, I carefully re-listened to the first 5 minutes or so of episode 10 of The Talking Machines podcast for another excellent overview of a concept from Ryan Adams, this time on Markov Chain Monte Carlo a.k.a MCMC. What was most interesting to me was where MCMC fits within Bayesian Inference:

In Bayesian models, we write down some joint distribution between stuff we can see in the world (the data we can observe) and some properties of the data that we want to understand. We can use conditional probability to reason about the things we can’t see given the things we can see. Ideally we could marginalize away maybe things we can’t see but that we don’t care about. This comes up e.g with LDA maybe we don’t care about what particular topic is assigned to one word in a document—we’re more interested in what’s going on in the corpus in general, integrating out the specific assignments.
Marginalization is something that we do a lot of, but unfortunately it involves summing out some big state space, or integrating over a big state space. Integration is something that is hard to do.
Aside: we can divide a lot of the computations we do in ML and stats into two categories:
- finding the best configuration that we frame in terms of optimization
- integrating out different possible configurations looking at many different hypotheses, and that’s what marginalization or integration is about
There are many approaches to marginalization.
- Sometimes, in a few models, it turns out that you can do marginalization exactly using something like dynamic programming. You can propagate information on a graph and perform a hard sum or a hard integral
- More often, you have complicated data and you have to do something fancier. This field is known as approximate inference. Quite a few different ways to think about this. Two dominating approaches:
  - Variational inference: try to come up with an approximating distribution
  - Markov Chain Monte Carlo

Ok so that frames MCMC within Bayesian inference as one of the main techniques for marginalization.

So what is MCMC:

What we’re trying to do is ask questions about the data, and we can often write these questions as expectations under distributions. The idea of Monte Carlo is that we can use samples from a distribution, compute their averages and those give us estimates of these expectations we want to compute. Many questions we ask about data can be framed in these terms.
The problem is that, in addition to integration being hard, drawing samples from distributions is hard.
The idea of Markov Chain Monte Carlo is that we can define a random walk where the steps in the random walk are simple and have a small set of rules, such that we can write an algorithm to simulate the random walk. As we run this random walk forward, it will converge to a sample from this very complicated distribution.
MCMC dates back to last century, a big advance came with the Metropolis Hastings algorithm which was developed surrounding the Manhattan Project
MCMC is the gold standard for performing statistical inference in some of the hardest problems in stats, ML and physics. It can be challenging and slow to perform, still an active area of research.

This is close to a transcript from the episode, worth a quick listen if you'd like to hear it from Ryan himself.
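To make the random walk idea concrete, here is a minimal sketch of a random-walk Metropolis sampler, the symmetric-proposal special case of Metropolis Hastings. It is not from the podcast: the one-dimensional target (a standard normal known only up to a constant), the step size, the burn-in length and the seed are all assumptions chosen to keep the example small.

```python
import math
import random


def metropolis(log_density, start, n_steps, step_size=1.0, seed=0):
    """Random-walk Metropolis sampler for a 1-D unnormalized density."""
    rng = random.Random(seed)
    x = start
    samples = []
    for _ in range(n_steps):
        # Simple rule for each step: propose a small Gaussian jump.
        proposal = x + rng.gauss(0.0, step_size)
        # Accept with probability min(1, p(proposal) / p(x)).
        log_ratio = log_density(proposal) - log_density(x)
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x = proposal
        samples.append(x)  # a rejected step repeats the current state
    return samples


def log_target(x):
    # Standard normal up to an additive constant (assumed target).
    return -0.5 * x * x


samples = metropolis(log_target, start=0.0, n_steps=20000)
kept = samples[2000:]  # discard an initial burn-in stretch

# Monte Carlo estimates of E[X] and E[X^2] as plain sample averages.
mean_estimate = sum(kept) / len(kept)
second_moment = sum(s * s for s in kept) / len(kept)
print(mean_estimate, second_moment)
```

The sampler only ever evaluates density ratios, so the normalizing constant (the hard integral) is never needed. The averages should land near 0 and 1 for this target.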

The Master Algorithm

I remembered that The Master Algorithm also talked about MCMC towards the end of its chapter on Bayesian methods, so I went back and re-read that next. It has a nice build up from bayes theorem to naive bayes classification to markov chains to bayesian networks to markov networks.

PyData talk

Next I remembered that I had this video from the PyData conference on my YouTube watch later list:

I watched it and it got me excited about some of the bayesian statistical libraries like PyMC3.

Overall I think Bayesian / graphical models are exciting because they have the promise of providing insight in the model itself along with the predictive power you'd get with something like a neural network:

The talk also listed what seem like great resources, and I've added them to the curriculum page: Doing Bayesian Data Analysis and examples ported to Python and PyMC3 and iPython notebook.

I think working through a book like that could be a nice follow up to the Python Machine Learning book. But the trick is to not have so many days like today that I don't even get finished with that :)

An ML Study Buddy

A fun outcome of the HN Coverage of my sigmoid notebook is that a fellow aspiring ML expert, VJ, reached out to me, and we'd scheduled a skype call for today. We ended up comparing notes and experiences in our efforts to craft a good curriculum and geeking out about all things ML and programming for a while. One frustration we share is in finding material that does a good job of both providing rigor and giving enough background to help gain intuition behind the concepts. Most material is either pages of formal definitions or "for dummy" watered down analogies, but the key seems to be to connect the two together.

We plan to keep in touch and work through some of the Python ML book together and attack a Kaggle data set as well. VJ seems to have a stronger stats background and has offered to help me get over any probability theory humps which is great.

Implementing logistic regression by swapping in a new cost function to previous single-layer neural network implementation

2016-05-19T00:00:00Z

Getting back to logistic regression, I again stared at the derivation of the cost function that the Python Machine Learning book provides. It starts with a likelihood function, a topic I haven't gotten to yet in my All of Statistics book. After reading up a bit on wikipedia, reading ahead in All of Stats, and reviewing the logistic regression section of Advanced Data Analysis from an Elementary Point of View, I decided I get the rough idea and will come back to it later when the stats book gets there.

But suffice it to say that we can derive a cost function, $J$, that is related to maximizing the log likelihood of the parameters $w$ of our model.

$$J(\mathbf{w}) = \sum_{i=1}^{m} - y^{(i)} \log \bigg( \phi\big(z^{(i)}\big) \bigg) - \big(1 - y^{(i)}\big) \log\bigg(1-\phi\big(z^{(i)}\big)\bigg).$$

Here $z^{(i)}$ is the linear combination of our weights $w$ and $x^{(i)}$ (the values for each feature in a given sample), $\phi$ is our trusty sigmoid function and $y^{(i)}$ is the correct output. So the cost function penalizes $\phi(z^{(i)})$ being different from $y^{(i)}$:

as the book also explains,

We can see that the cost approaches 0 (plain blue line) if we correctly predict that a sample belongs to class 1. Similarly, we can see on the y axis that the cost also approaches 0 if we correctly predict y = 0 (dashed line). However, if the prediction is wrong, the cost goes towards infinity. The moral is that we penalize wrong predictions with an increasingly larger cost.
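The cost function $J$ can be written out directly in a few lines. This is a plain Python sketch for illustration, not the book's or the bonus notebook's implementation; the tiny dataset and the weight vectors are made up.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def logistic_cost(weights, X, y):
    """J(w) = sum_i -y_i log(phi(z_i)) - (1 - y_i) log(1 - phi(z_i))."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        # z is the linear combination of the weights and the sample's features.
        z = sum(w * x for w, x in zip(weights, x_i))
        phi = sigmoid(z)
        total += -y_i * math.log(phi) - (1 - y_i) * math.log(1 - phi)
    return total


X = [[1.0, 2.0], [1.0, -2.0]]  # first column acts as a constant bias input
y = [1, 0]

good = logistic_cost([0.0, 3.0], X, y)       # both samples predicted correctly
bad = logistic_cost([0.0, -3.0], X, y)       # both samples predicted wrongly
undecided = logistic_cost([0.0, 0.0], X, y)  # phi = 0.5 for every sample
print(good, undecided, bad)
```

The confident-and-correct weights give a cost close to 0, the all-zero weights give $2 \log 2$, and the confident-and-wrong weights are penalized far more heavily, matching the shape described in the quote.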

With a cost function to apply, the book also notes

If we were to implement logistic regression ourselves, we could simply substitute the cost function J in our Adaline implementation from Chapter 2

The author has published a bonus notebook doing just that, and my goal for today was to implement it as well as part of my chapter 3 work. Mission accomplished!

Probability Density Function hw problem and more math monk vids on conditional probability and independence

2016-05-18T00:00:00Z

I ended up spending the entire day on probability theory, starting with problem 4 from chapter 2 that did some stuff with probability density functions and cumulative distribution functions. I've fully wrapped up chapter 1 material now, having completed all of the problems from both CMU courses from chapter 1 or related material and have transcribed all of the chapter 2 problems in the homework section so it's easy to kick off each day with some probability homework. I've found morning is much better for that sort of work. The only challenge is in not ending up spending the whole day on it, starving the ML work, like happened today :)

Today's homework problem has two parts, the first pretty straight forward, integrating the PDF to get the CDF, but the second part was really tricky. I had to look at the solution to know where to start, but it was helpful to carefully work through it.

The problem is:

Let $Y = 1/X$. Find the probability density function $f_Y(y)$ for $Y$.

I didn't really know where to begin, it seems strange to express one random variable in terms of another. But after peeking, I could see it was a matter of using the definition of what a CDF is and plugging in $1/X$ directly like so:

$F_Y(y) = P(Y \leq y) = P(\frac{1}{X} \leq y) = P(X \geq \frac{1}{y}) = 1 - P(X < \frac{1}{y}) = 1 - F_X(\frac{1}{y})$, for positive $X$ and $y$. The last step uses the fact that $X$ has a density, so $P(X = \frac{1}{y}) = 0$.

From there it was a matter of plugging and chugging, and using a trick where you can take the reciprocal of both sides of an inequality and flip the inequality sign to manipulate to a clean version of $F_Y(y)$ and then finally taking the derivative to get $f_Y(y)$.
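The final differentiation step can be written out with the chain rule. This is a sketch for the general case, assuming $X$ is positive with density $f_X$ and $y > 0$; it does not plug in the specific density from the homework problem:

$$f_Y(y) = \frac{d}{dy} F_Y(y) = \frac{d}{dy}\left[1 - F_X\left(\frac{1}{y}\right)\right] = -f_X\left(\frac{1}{y}\right) \cdot \left(-\frac{1}{y^2}\right) = \frac{1}{y^2} f_X\left(\frac{1}{y}\right)$$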

As far as looking at solutions: I really don't have much choice when I'm totally stuck given I'm doing this solo, but I figure taking just enough of a peek to get me over the hump when necessary is similar to going to office hours and getting hints and/or working with other students. And before I do that I re-read relevant sections of the book and google around too, sometimes I can get unstuck but today I really didn't know where to begin without the peek.

Math Monk on independence

Now that I'm into chapter 2 material in problem sets, I figured I should catch up on the mathematical monk probability primer playlist while I was at it, and watched a few more today, including a final review of measure theory, conditional probability and independence.

Independence is a funny concept and not really as intuitive as you might think. For instance, disjoint events are not independent because they violate the definition of independence $P(A \cap B) = P(A)P(B)$; each event has non-zero probability but the probability of their intersection is zero. Here's a screenshot of math monk's drawings of independent events:

The top box is two independent events, the bottom 3 independent events.

Sometimes events are just assumed to be independent, like in the case of tossing coins; the first toss has no way of influencing the next, so we assume each toss is an independent event.
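The definition is easy to check by brute force on a small sample space. Here is a sketch using one roll of a fair die; the events are made up for illustration and are not the ones from the video.

```python
from fractions import Fraction

OMEGA = frozenset(range(1, 7))  # outcomes of one roll of a fair die


def prob(event):
    """Probability of an event under the uniform measure on OMEGA."""
    return Fraction(len(event & OMEGA), len(OMEGA))


def independent(a, b):
    """Check the definition P(A and B) == P(A) * P(B) exactly."""
    return prob(a & b) == prob(a) * prob(b)


even = {2, 4, 6}
at_most_four = {1, 2, 3, 4}
low = {1, 2}
high = {5, 6}

# Overlapping events can be independent: 1/3 == 1/2 * 2/3.
print(independent(even, at_most_four))
# Disjoint events with non-zero probability are not: 0 != 1/3 * 1/3.
print(independent(low, high))
```

Using `Fraction` keeps the comparison exact, so the check is not thrown off by floating point rounding.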

Here's another example using conditional independence where $A$ and $B$ are conditionally independent given $C$, but not independent in general (within the entire sample space $\Omega$). Conditional probability creates a smaller sample space within which you can test for independence.

The definition of conditional independence:

$P(A \cap B \mid C) = P(A \mid C)P(B \mid C)$

A little example problem was proposed in the video that I've worked out here.

HN coverage of my sigmoid function notebook, working on a logistic regression implementation

2016-05-17T00:00:00Z

I submitted my notebook about the sigmoid function to hacker news this morning and was pleasantly surprised to see it picked up and make the front page. The feedback in the comments was mixed, but after getting past some nerd snark, there were a few nuggets I could learn from including this comment about how applying the log odds function to Bayes' theorem can be useful, will need to dig in a bit to fully grok. I was also pleased by this comment which pointed out that the intuitive justification for logistic regression presented is similar to that provided in the excellent Advanced Data Analysis from an Elementary Point of View (and more direct link to chapter on logistic regression).

On a related note, I am back into the Python ML book and am taking some time to implement logistic regression from scratch. The book presents some theory behind the cost function that is optimized,

To explain how we can derive the cost function for logistic regression, let's first derive the likelihood L that we want to maximize when we build a logistic regression model, assuming that the individual samples in our dataset are independent of one another.

And presents a hairy formula without further explanation. Thankfully that same chapter from the Advanced Data Analysis book mentioned above goes into a bit more detail.

The implementation should only require some modest updates to the cost function that I used in my Adaline implementation from chapter 2. The author has also provided a bonus notebook on GitHub implementing logistic regression that I will avoid peeking at until I get mine to work.

A notebook exploring why a sigmoid function is used in logistic regression and conditional probability HW

2016-05-16T00:00:00Z

Got off to a good start this morning on some probability hw, showing that for a fixed $B$ with $P(B) > 0$, $P(\cdot \mid B)$ is a probability (see solution).

I also spent more time exploring the sigmoid function, resulting in this IPython notebook that attempts to explain why it is used in logistic regression. While I was at it I set up a notebooks section on the site to house the 3 notebooks so far. The sigmoid exploration was really a diversion / deep dive as part of my work on chapter 3 of the Python Machine Learning book, which briefly covers logistic regression in its tour of classification algorithms, but it seemed like a standalone topic that could be of interest to others and for me to come back to later.

More probability HW, making sense of the odds ratio underlying logistic regression

2016-05-12T00:00:00Z

Homework

This morning I solved problem 7 from All of Statistics. It was tough for me, taking about an hour and a half, and I needed to peek at the solution to get over one minor hump, but I was proud to have nearly gotten it, including thinking to use induction to prove the 2nd part on my own.

The part that got me was this step:

$(A_{n+1} \cap (\bigcup_{i=1}^{n} A_i)^c) \cup \bigcup_{i=1}^{n} A_i$
= $[\bigcup_{i=1}^{n} A_i \cup A_{n+1}] \cap [\bigcup_{i=1}^{n} A_i \cup (\bigcup_{i=1}^{n} A_i)^c]$

because I didn't remember / know that union distributes over intersection in set theory! I won't forget it now that I spent 20 minutes stumped on it :)
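A quick way to convince yourself of the distributive step is to try it on concrete sets. This is only a spot check with made-up sets, not a proof:

```python
# Hypothetical finite sets standing in for the pieces of the proof.
universe = set(range(10))
union_n = {0, 1, 2, 3, 4}   # plays the role of A_1 ∪ ... ∪ A_n
a_next = {3, 4, 5, 6}       # plays the role of A_{n+1}
complement = universe - union_n

# Left side: (A_{n+1} ∩ (∪A_i)^c) ∪ (∪A_i)
lhs = (a_next & complement) | union_n

# Right side: (∪A_i ∪ A_{n+1}) ∩ (∪A_i ∪ (∪A_i)^c)
rhs = (union_n | a_next) & (union_n | complement)
```

The second bracket on the right is the whole universe, so the right side collapses to `union_n | a_next`, which is what the proof needs.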

It's also interesting that this problem uses a similar trick to problem 1 in crafting an intermediate variable that can be proven to be disjoint so that you can add up the probabilities.

Back to Python ML

Making sense of the odds ratio

In the Python ML book it goes through how the cost function is derived for logistic regression using the odds ratio, which is $\frac{p}{(1 - p)}$. I was left unsatisfied with this being presented as matter of fact so went googling around to get a bit more of a sense of why this ratio would be used instead of the direct probability.

This article was helpful:

... there is no way to express in one number how X affects Y in terms of probability. The effect of X on the probability of Y has different values depending on the value of X. So while we would love to use probabilities because they’re intuitive, you’re just not going to be able to describe that effect in a single number.
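The point in the quote can be seen numerically. In this sketch the weight `w` and bias `b` are arbitrary values chosen for illustration: a one-unit increase in x always shifts the log odds by exactly `w`, while the change in probability depends on where x starts.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Log of the odds ratio p / (1 - p)."""
    return math.log(p / (1 - p))

w, b = 1.0, 0.0  # hypothetical model parameters

prob_changes = []
log_odds_changes = []
for x in (-4, 0, 4):
    p_before = sigmoid(w * x + b)
    p_after = sigmoid(w * (x + 1) + b)
    prob_changes.append(p_after - p_before)
    log_odds_changes.append(logit(p_after) - logit(p_before))
```

Every entry of `log_odds_changes` comes out as `w` (up to floating point error), whereas `prob_changes` is largest near x = 0 and tiny in the tails.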

Collaborative filtering via Matrix Factorization, more probability homework, scikit-learn version of perceptron

2016-05-11T00:00:00Z

Collaborative filtering via Matrix Factorization

This morning's Talking Machines warmup was Ryan's intro in episode 8 where he describes a popular approach to collaborative filtering, a.k.a. recommender systems.

The approach is "probabilistic matrix factorization" and is similar to approaches used for principal component analysis, singular value decomposition and non-negative matrix factorization (I'm only vaguely familiar with these, but good to jot down for connecting the dots later; I know the python ML book covers principal component analysis).

Using Netflix movie recommendations as an example, say you have a million by 25,000 matrix where each row is a user's ratings for the 25,000 movies available on Netflix. This is obviously a very sparse matrix as most people don't watch 25,000 movies. The approach is to assume this matrix is approximately low rank and can be viewed as the product of:

1,000,000 x 100 matrix
100 x 25,000 matrix

The 100 dimensions in each matrix are latent properties of movies that are discovered automatically and might correspond to, say, "has explosions", "is foreign film", "romantic comedy", "Seth Rogen stoner buddy action" etc., and the first matrix assigns these properties to users and the 2nd to movies. Each predicted rating for a movie can be viewed as the inner product of a user vector and a movie vector.
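Here is a tiny sketch of the "predicted rating is an inner product" idea, shrunk down to 2 users, 3 movies and 2 latent dimensions. The factor values are invented for illustration; in the real approach they would be learned from the observed ratings.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical latent factors: ["has explosions", "romantic comedy"].
user_factors = [
    [1.0, 0.0],   # user 0 only cares about explosions
    [0.5, 0.5],   # user 1 weighs both equally
]
# One row per movie, i.e. the transpose of the second (k x movies) matrix.
movie_factors = [
    [4.0, 2.0],
    [1.0, 5.0],
    [5.0, 1.0],
]

# The full predicted ratings matrix is the product of the two factor matrices.
predicted = [[dot(u, m) for m in movie_factors] for u in user_factors]
```

Even though the observed ratings matrix is sparse, the product fills in a prediction for every user and movie pair.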

This approach apparently fared well in the Netflix prize competition and has other benefits such as being able to cluster similar movies together and in discovering topics that are more effective at categorizing movies than humans come up with by hand.

Ryan draws parallels to this approach with topic modeling and observes that it's kind of an interesting mix between supervised and unsupervised learning as it is both discovering structure in an unsupervised manner and then training a dataset against that structure in a supervised way by using actual user ratings. I had a similar thought when I first learned about topic modeling: it's really cool that you get both a useful interpretable model and a predictive function out of the exercise.

Homework section on site

I set up a homework section to index each problem I've worked through. I knocked out problems 5 and 6 from chapter 1 of All of Statistics today, though I admittedly needed to peek at the solutions to get me oriented.

Getting started on chapter 3

I got started on chapter 3 of Python Machine Learning, starting off by redoing what we did by hand in chapter 2 with off-the-shelf stuff, and then getting into the theory behind logistic regression. Here's the WIP notebook.

High level understanding of bayesian non-parametric models, more stats, and back to Python Machine Learning with stochastic gradient descent.

2016-05-10T00:00:00Z

Talking Machines warmup: non-parametric models, the Chinese restaurant and Indian buffet processes

Warmed up this morning by listening to another intro from The Talking Machines Podcast, this time Episode 7 where Ryan summarizes a couple of random processes that are useful in constructing Bayesian non-parametric models.

The analogy for the Chinese restaurant process is a restaurant with an infinite number of tables. The first guest arrives and chooses a table. The next arrives and chooses either the existing table or a new one with some probability. New guests choose among existing tables in proportion to their popularity (the number of guests already seated there), while still having a small chance of striking out on their own at a new table. The end result after N iterations is that you have randomly partitioned the population, with the number of partitions not being predetermined.
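The seating rule can be simulated in a few lines. This sketch assumes the usual formulation with a concentration parameter `alpha` (not mentioned in the podcast summary above): with n guests already seated, the next guest joins a table of size m with probability m / (n + alpha) and opens a new table with probability alpha / (n + alpha).

```python
import random

def chinese_restaurant_process(n_guests, alpha=1.0, seed=0):
    """Return a list of table sizes after seating n_guests."""
    rng = random.Random(seed)
    tables = []  # tables[k] is the number of guests at table k
    for n in range(n_guests):
        # Draw a point in [0, n + alpha]; the first n units belong to
        # existing tables (one unit per seated guest), the rest to a new table.
        r = rng.uniform(0, n + alpha)
        cumulative = 0.0
        for k, size in enumerate(tables):
            cumulative += size
            if r < cumulative:
                tables[k] += 1
                break
        else:
            # No existing table was selected, so open a new one.
            tables.append(1)
    return tables

tables = chinese_restaurant_process(100)
```

The number of tables is not fixed in advance; it falls out of the simulation, and larger `alpha` values tend to produce more tables.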

The analogy for the Indian buffet process is a buffet with an infinite number of dishes. The first guest chooses a random number of dishes. The next chooses a random number of dishes, with some probability of choosing among the already chosen dishes. Subsequent guests choose among chosen dishes according to their popularity, and may try new random ones as well. The end result after N iterations is that you have randomly assigned features to a population, again with the number of features not being predetermined.
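A similar sketch for the buffet, assuming the standard formulation with a parameter `alpha`: guest i takes each previously sampled dish k with probability m_k / i (where m_k is how many earlier guests took it), then tries a Poisson(alpha / i) number of new dishes. The Poisson sampler below uses Knuth's multiplication method since the standard library has none built in.

```python
import math
import random

def sample_poisson(lam, rng):
    """Draw from a Poisson(lam) distribution using Knuth's method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def indian_buffet_process(n_guests, alpha=2.0, seed=0):
    """Return (choices, dish_counts): the set of dishes per guest and
    how many guests took each dish."""
    rng = random.Random(seed)
    dish_counts = []
    choices = []
    for i in range(1, n_guests + 1):
        chosen = set()
        # Existing dishes are taken in proportion to their popularity.
        for k, m_k in enumerate(dish_counts):
            if rng.random() < m_k / i:
                chosen.add(k)
        # Then the guest tries some brand new dishes.
        for _ in range(sample_poisson(alpha / i, rng)):
            chosen.add(len(dish_counts))
            dish_counts.append(0)
        for k in chosen:
            dish_counts[k] += 1
        choices.append(chosen)
    return choices, dish_counts

choices, dish_counts = indian_buffet_process(20)
```

Each guest's set of dishes plays the role of a binary feature vector, and the total number of features grows as more guests arrive.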

Both processes are clever ways of generating a finite projection of an infinite dimensional process: you build it up with a finite data set as you go, which lets you build up as much complexity in your model as you need.

The article linked to on hierarchical Chinese restaurant process has some nice blurbs on ML and (non)parametric models:

Another important dichotomy in machine learning distinguishes between parametric and nonparametric models. A parametric model involves a fixed representation that does not grow structurally as more data are observed. Examples include linear regression and clustering methods in which the number of clusters is fixed a priori. A nonparametric model, on the other hand, is based on representations that are allowed to grow structurally as more data are observed. Nonparametric approaches are often adopted when the goal is to impose as few assumptions as possible and to "let the data speak."

… In particular, modern classifiers such as decision trees, boosting and nearest neighbor methods are nonparametric, as are the class of supervised learning systems built on “kernel methods,” including the support vector machine. (See Hastie et al. [2001] for a good review of these methods.) Theoretical developments in supervised learning have shown that as the number of data points grows, these methods can converge to the true labeling function underlying the data, even when the data lie in an uncountably infinite space and the labeling function is arbitrary [Devroye et al. 1996]. This would clearly not be possible for parametric classifiers

The intro includes a lot of great summaries of related topics before getting to the meat:

parametric vs non-parametric models
graphical models
promise of non-parametric models for unsupervised learning
topic modeling

Was nice to stumble upon this resource, will re-read later when I get into non-parametric models. Also: author of paper has a mass market book on algorithms that looks interesting.

Another homework problem

I worked through another problem from the all of stats book, more reasoning about sets, proving that $(\bigcup_{i \in I} A_i)^c = \bigcap_{i \in I} A_i^c$ for arbitrary index set $I$.

Back to Python Machine Learning

I'm aiming to do probability in the morning and ML in the afternoons for a bit. Today I finally wrapped up chapter 2 by implementing a stochastic gradient descent variant of the Adaline perceptron (i.e. iterative weight updates instead of batch). My ch02 notebook has also been updated.
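For the record, the core difference from the batch version is that the weights move after every single sample rather than once per pass. This is a minimal sketch of that idea and not the code from my notebook or the book; the function names, learning rate, epoch count and toy data are all made up for illustration.

```python
import random

def train_adaline_sgd(X, y, eta=0.05, epochs=50, seed=0):
    """Adaline trained with per-sample (stochastic) weight updates."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    indices = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit samples in a new random order each epoch
        for i in indices:
            # Adaline uses the raw linear output, not the thresholded label.
            output = sum(w_j * x_j for w_j, x_j in zip(w, X[i])) + b
            error = y[i] - output
            # Update immediately from this one sample.
            w = [w_j + eta * error * x_j for w_j, x_j in zip(w, X[i])]
            b += eta * error
    return w, b

def predict(X, w, b):
    """Threshold the linear output at zero to get class labels."""
    return [
        1 if sum(w_j * x_j for w_j, x_j in zip(w, x)) + b >= 0.0 else -1
        for x in X
    ]

# Toy, linearly separable data with labels in {-1, 1}.
X = [[-2.0], [-1.0], [1.0], [2.0]]
y = [-1, -1, 1, 1]
w, b = train_adaline_sgd(X, y)
predictions = predict(X, w, b)
```

Shuffling each epoch matters here: without it the updates follow the same cycle every pass.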

Grokking probability fundamentals paying off, stats problem set progress

2016-05-09T00:00:00Z

Seeing some benefits of grokking probability fundamentals

I've started re-listening to the introductions to each episode of The Talking Machines podcast as Ryan Adams does such a good job explaining something interesting each time. In Episode 6 he talks about Determinantal Point Processes. The paper linked to in the episode notes describes a DPP formally:

This is much easier to read now given my study of random variables and probability measures as it follows the template of describing how it maps subsets of elements in an outcome space to real numbers (in this case using determinants of a matrix). I can't say the rest of the paper is smooth reading for me; I'll need more background on graphical models, but this experience gives me reassurance that all this time spent on the fundamentals will pay off and that jumping directly to more advanced topics would not be a great idea. That said, I'm feeling closer to ready to perhaps jump right into a study of graphical models.

I also spent some time re-reading chapters of the All of Statistics book Saturday evening before going to sleep; fluency in mathematical notation and concepts of probability is starting to make this seem possible as the notation is beginning to quickly map to already understood chunks in my head.

Another take on random variables

I had a detour this morning looking up resources for probabilistic graphical models, and it seems like the best text is Probabilistic Graphical Models Principles Computation (download). In its 'foundations' chapter, it describes random variables:

Our discussion of probability distributions deals with events. Formally, we can consider any event from the set of measurable events. The description of events is in terms of sets of outcomes. In many cases, however, it would be more natural to consider attributes of the outcome. For example, if we consider a patient, we might consider attributes such as “age,” “gender,” and “smoking history” that are relevant for assigning probability over possible diseases and symptoms. We would like then consider events such as “age > 55, heavy smoking history, and suffers from repeated cough.”

To use a concrete example, consider again a distribution over a population of students in a course. Suppose that we want to reason about the intelligence of students, their final grades, and so forth. We can use an event such as GradeA to denote the subset of students that received the grade A and use it in our formulation. However, this discussion becomes rather cumbersome if we also want to consider students with grade B, students with grade C, and so on. Instead, we would like to consider a way of directly referring to a student’s grade in a clean, mathematical way.

The formal machinery for discussing attributes and their values in different outcomes are random variables. A random variable is a way of reporting an attribute of the outcome. For example, suppose we have a random variable Grade that reports the final grade of a student, then the statement P (Grade = A) is another notation for P (GradeA).

This solidifies the connection to everyday datasets: random variables are what assign the value to the columns of each row for each feature.

Problem set

I made some modest progress on the first homework assignment of the all of stats book in proving the "Continuity of Probabilities".

Measure theory as it relates to Cumulative distribution functions (CDFs), working on problem sets.

2016-05-05T00:00:00Z

Moar Measure

Following up on my study of measure theory, which in turn, is a study of probability theory because a probability is a measure, here are some properties of a measure $\mu$ (via the first few videos of the mathematical monk's playlist on probability):

Monotonicity: $A \subset B \implies \mu(A) \leq \mu(B)$
Subadditivity: $E_1, E_2, ... \in \mathcal{A} \implies \mu(\bigcup_{i}E_i) \leq \sum_{i}\mu(E_i)$
Continuity from below: $E_1, E_2, ... \in \mathcal{A}$ and $E_1 \subset E_2 \subset ... \implies \mu(\bigcup_{i=1}^{\infty}E_i) = \lim_{i\to\infty} \mu(E_i)$
Continuity from above: if $E_1, E_2, ... \in \mathcal{A}$ and $E_1 \supset E_2 \supset ... $ and $\mu(E_1) < \infty$ then $\mu(\bigcap_{i=1}^{\infty} E_i) = \lim_{i\to\infty} \mu(E_i)$

The above are true of all measures, and thus all probability measures. Here are some more properties of probability measures.

Let $(\Omega, \mathcal{A}, P)$ be a probability measure space with $E, F, E_i \in \mathcal{A}$

$P(E \cup F) = P(E) + P(F)$ if $E \cap F = \emptyset$
$P(E \cup F) = P(E) + P(F) - P(E \cap F)$
$P(E) = 1 - P(E^C)$
$P(E \cap F^C) = P(E) - P(E \cap F)$

The generalization of $P(E \cup F) = P(E) + P(F) - P(E \cap F)$ is called the inclusion-exclusion principle and can be visualized with 3 sets:

All of these are enumerated as properties of probability measures in the All of Stats book, but I don't mind running through it again with my newfound appreciation for measure theory. Most of this stuff can be visualized with Venn diagrams.
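Besides Venn diagrams, the identities above can be spot-checked on a small finite sample space. This sketch assumes a uniform probability on a hypothetical 12-outcome space (think of a fair 12-sided die); the events are arbitrary picks.

```python
from fractions import Fraction

omega = set(range(1, 13))  # hypothetical: one roll of a fair 12-sided die

def P(event):
    """Uniform probability: favorable outcomes over all outcomes."""
    return Fraction(len(event & omega), len(omega))

E = {w for w in omega if w % 2 == 0}   # even rolls
F = {w for w in omega if w % 3 == 0}   # multiples of 3
G = {w for w in omega if w <= 4}       # rolls of 4 or less

checks = {
    "union": P(E | F) == P(E) + P(F) - P(E & F),
    "complement": P(E) == 1 - P(omega - E),
    "difference": P(E - F) == P(E) - P(E & F),
    # Inclusion-exclusion for 3 sets: add singles, subtract pairs, add the triple.
    "inclusion_exclusion_3": P(E | F | G) == (
        P(E) + P(F) + P(G)
        - P(E & F) - P(E & G) - P(F & G)
        + P(E & F & G)
    ),
}
```

Using `Fraction` keeps the arithmetic exact, so the equality checks are not at the mercy of floating point rounding.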

Borel Probability measures and CDFs

Let's consider a Borel measure on $\Bbb R$, which is a measure where $\Omega$ is $\Bbb R$ and $\mathcal{A}$ is $\mathcal{B}(\Bbb R)$, i.e. it is a measure on $(\Bbb R, \mathcal{B}(\Bbb R))$.

$\mathcal{B}(\Bbb R)$ is the Borel $\sigma$-algebra, which is the smallest $\sigma$-algebra containing all of the open sets of $\Bbb R$.

Going back to cumulative distribution functions (CDFs), it turns out that every CDF implies a unique Borel probability measure and that a Borel probability measure implies a unique CDF.

More concretely, a CDF is defined as a function $F: \Bbb R \to \Bbb R$ s.t.

$x \leq y \implies F(x) \leq F(y)$ $(x,y \in \Bbb R)$ ($F$ is non-decreasing)
$\lim_{x \searrow a} F(x) = F(a)$ ($F$ is right-continuous)
$\lim_{x \to \infty} F(x) = 1$
$\lim_{x \to -\infty} F(x) = 0$

Theorem: $F(x) = P((-\infty, x])$ defines an equivalence between CDFs $F$ and Borel probability measures $P$.
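One practical consequence of the theorem is that a CDF alone is enough to recover the probability of any half-open interval, since $P((a, b]) = F(b) - F(a)$. A small sketch, using the Exponential(1) CDF purely as an example distribution:

```python
import math

def F(x):
    """CDF of an Exponential(1) distribution (illustrative choice)."""
    return 1.0 - math.exp(-x) if x > 0 else 0.0

def prob_interval(a, b):
    """P((a, b]) recovered from the CDF alone."""
    return F(b) - F(a)

half = prob_interval(0.0, math.log(2))  # the median of Exponential(1) is ln 2
```

Intervals like these generate the Borel sets, which is the intuition for why the CDF pins down the whole measure.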

This is kind of interesting as it frames the scope of what kind of probability measures can be uniquely described with a CDF: those that are measures over the Borel $\sigma$-algebra, i.e. those defined on the sets generated from the open sets of real numbers $\Bbb R$.

So this means we might in some cases be trying to reason about probability measures that live on spaces other than $(\Bbb R, \mathcal{B}(\Bbb R))$, and we'll need to use measure theory as a CDF won't cut it.

Mathematical monk's recommended resources

In this video the teacher recommends some books, so I thought I'd take note in case I need even moar stuff to read / reference / study:

Rudin's principles of mathematical analysis
Jacod Protter probability essentials
Real Analysis: Modern Techniques (advanced)

All are googleable to find excerpts / problem sets.

Useful tool for looking up LaTeX symbols

As I attempt to get more fluent in typing out TeX for Mathjax, being able to find a symbol quickly is important. This tool rocks: it lets you draw the symbol and finds closely related symbols along with the needed TeX.

Starting on problem sets

I'm finally diving into the All of Stats problem sets from the course website, here's my WIP for problem 1.

Probability through the lens of measure theory

2016-05-04T00:00:00Z

I noticed an aside in All of statistics that mentions the notion of a "measure", and also noticed that the Mathematical Monk's probability playlist kicks off with several videos about measure theory. The Wikipedia article on probability also begins by defining a probability as a "measure". So in my compulsion to leave no rock unturned in surveying the field of probability, I spent some time making some sense of measure theory in how it relates to probability theory.

Recall that a probability $P$ assigns a real number to each event $A$ within a sample space of outcomes $\Omega$. Each event $A$, in turn, is a set of outcomes $\omega_i$ within $A$.

When thinking of sample spaces and assigning probability to outcomes and events, it technically is only possible to do this for sets that are measurable, and measure theory rigorously defines what this means. The result is that the only events or subsets within a sample space $\Omega$ that are measurable are the sets within a $\sigma$-algebra (or $\sigma$-field) $\mathcal{A} \subset 2^\Omega$, which satisfies 3 key properties:

$\emptyset \in \mathcal{A}$
$A \in \mathcal{A}$ implies that $A^c \in \mathcal{A}$ (closed under complement)
if $A_1, A_2, ... \in \mathcal{A}$ then $\cup_{i=1}^\infty A_i \in \mathcal{A}$ (closed under countable unions)
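On a finite sample space these three properties can be checked mechanically. The helper names below are made up for this sketch, and it leans on the fact that for a finite collection, closure under pairwise unions is enough to get closure under countable unions.

```python
from itertools import combinations

def is_sigma_algebra(omega, collection):
    """Check the three sigma-algebra properties on a finite sample space."""
    omega = frozenset(omega)
    sets = {frozenset(s) for s in collection}
    # 1. Contains the empty set.
    if frozenset() not in sets:
        return False
    # 2. Closed under complement.
    if any(omega - s not in sets for s in sets):
        return False
    # 3. Closed under unions (pairwise suffices for a finite collection).
    return all(a | b in sets for a, b in combinations(sets, 2))

def power_set(omega):
    """All subsets of omega, i.e. 2^omega."""
    items = list(omega)
    return [set(c) for r in range(len(items) + 1) for c in combinations(items, r)]

omega = {1, 2, 3}
full = is_sigma_algebra(omega, power_set(omega))                 # largest one
trivial = is_sigma_algebra(omega, [set(), omega])                # smallest one
broken = is_sigma_algebra(omega, [set(), {1}, omega])            # missing {2, 3}
repaired = is_sigma_algebra(omega, [set(), {1}, {2, 3}, omega])
```

The `broken` collection fails because the complement of `{1}` is absent; adding `{2, 3}` fixes it.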

A measure is a function from a $\sigma$-algebra $\mathcal{A}$ to the non-negative extended real numbers, $\mu: \mathcal{A} \to [0, \infty]$ s.t:

$\mu(\emptyset) = 0$
$\mu(\cup_{i=1}^\infty E_i) = \sum_{i=1}^{\infty} \mu(E_i)$ for any pairwise disjoint sets $E_1, E_2, ... \in \mathcal{A}$

From the standpoint of measure theory, a probability $P$ can be thought of as a measure on (the measurable $\sigma$-field of) an event space $\Omega$ (assigns a real number to all measurable subsets) where $P(\Omega) = 1$.

So why the hell does any of this matter? Are there concrete examples of subsets of sample spaces that are not measurable? The Wikipedia article on non-measurable sets says of the Banach–Tarski paradox, one of the more famous examples motivating the concept of a non-measurable set, "Obviously this construction has no meaning in the physical world." It also mentions that measurable sets are, "rich enough to include every conceivable definition of a set that arises in standard mathematics." So it seems like non-measurable sets are a mathematical construct invented to help resolve some paradoxes in math.

But after googling around, there are hints that eventually there will be interesting random variables that cannot be easily described by probability density functions and measure theory will be helpful then. There was another whiff that perhaps this will be useful when I get to thinking about the convergence of random variables. So perhaps getting caught up on thinking about non-measurable subsets within $\Omega$ misses the point.

If nothing else, digging deeper into the underpinnings of probability theory has helped ingrain the basics more deeply in my head.

update I think it really comes down to this: measure theory is just a slightly more general mathematical construct than probability; a probability is a measure with some additional constraints (i.e. $P(\Omega) = 1$). Thus, studying measure theory is studying probability theory as knowing theorems of measure theory means knowing theorems of probability theory.

update 2 I found a helpful quote from this probability axioms overview that further illuminates when it is necessary to consider a measurable collection of subsets of $\Omega$:

there is no reason to talk about sigma algebras at all unless we consider sample spaces S that are uncountably infinite.

Resources:

Finally grokking random variables and going back to review curriculum of probability and stats.

2016-05-03T00:00:00Z

Grokking Random Variables

Continuing to get a handle on the purpose of random variables, Khan Academy's video helps as well explaining,

What's so useful about defining random variables? It will become more apparent later on, but the simple way of thinking about it is that as soon as you start to quantify outcomes, you can start to do a little more math on the outcomes and you can start to use a little mathematical notation on the outcome.

He goes on to explain that random variables serve as shorthand for defining and reasoning about random processes (or experiments). So my grappling yesterday about this being the new way of describing things was on the right track. Thinking back, I think a better way of explaining random variables when first introducing them is to make very clear, "this is the new way of talking about experiments: before, we just described the experiment in plain words, but now we are being explicit about the experiment we are dealing with by mapping every outcome to a number."

The all of statistics book motivates the need for random variables as well with,

Statistics and data mining are concerned with data. How do we link sample spaces and events to data? The link is provided by the concept of a random variable.

Finally, Norvig's notebook explains how the concept of a random variable makes explicit the division of the event space into components, which is otherwise done implicitly when describing experiments.

So far, we have talked of an outcome as being a single state of the world. But it can be useful to break that state of the world down into components. We call these components random variables. For example, when we consider an experiment in which we roll two dice and observe their sum, we could model the situation with two random variables, one for each die. (Our representation of outcomes has been doing that implicitly all along, when we concatenate two parts of a string, but the concept of a random variable makes it official.)

Ok, I think I get it now. Moving on.

Reviewing curriculum

After getting a little further into random variables and distributions, I realized I had skipped the section in All of Statistics on conditional probability and Bayes' theorem, only having covered the surface level stuff in the Stanford course. Bayes' theorem is extremely relevant to ML, so I want to go back and fiddle around with Bayes' theorem a bit; it's pretty easy to prove and I'd like to get to a point where I can do so without looking at the book.

But I also had a bit of a crisis in making sure my curriculum is laid out and I have a clear path towards learning what will be most important. The Stanford course and Khan Academy's material on probability and statistics is easy to follow and working through it will get me somewhere, but they both are pretty surface level on distributions, density functions and distribution functions. Looking ahead in the All of Stats book I see it covers many things that are beyond the scope of the Stanford course and Khan Academy, including things like conditional distributions, multivariate distributions, marginal distributions, inequalities, convergence of random variables and more. These things are important and are what really start to bridge the gap between an intro to probability and statistical inference.

So basically I'm worried that I'll end up going really deep and thorough into basic probability and stats and then not have the resources I need to really grok the more advanced stuff; the All of Stats book is very dense and I don't have any accompanying videos or lectures to help. Looking more closely at the homework assignments that I was planning on doing, they are very heavy on proofs, which I'm not opposed to in principle but wonder if it will be the best use of my time. To help, I spent some time writing up a comprehensive index of concepts that I'd like to learn and did an audit of the materials I've found and which cover each concept / topic. To bolster the more advanced concepts I found what seems to be a really good set of videos on probability from 'mathematicalmonk' on YouTube, who's a math professor at Duke.

I continue to update the ML curriculum page and am piecing together a probability and stats overview page as well. While it feels uncomfortable to be stalling like this, I think it's smart overall to pause and make sure my curriculum is as effective as it can be, and that I can set a pace that strikes the right balance of breadth and depth given my time frame. There are no brownie points for going through the motions.

MathJax in posts and resolving confusion about random variables vs probability functions

2016-05-02T00:00:00Z

MathJax

Following up with Friday's fiddling to render notebooks and host them, I thought it could be useful to author learning log posts themselves as notebooks when appropriate. But what I realized for now is that I just want the nice looking math formatting, which, by convention, is provided by LaTeX and MathJax within IPython notebooks. So I've simply included the MathJax library on learning log posts for now and you can see some of the fancy formatting below.

Random Variables

I struggled a bit with the definition of a random variable in how it relates to the definition of probability. Terminology so far:

Experiments

$\Omega$: Sample space: the set of all possible outcomes in an experiment
$\omega$: a single outcome or point in $\Omega$
$A$: an event, which is a subset of $\Omega$, or a set of $\omega_i$s

Probability

A probability $P$ (aka probability distribution or probability measure) assigns a real number to every event $A$; that number describes the probability that the event occurs.

First of all, why does it assign a real number to every event instead of outcome? If it were the latter, it would achieve the former. I think it doesn't matter much and is just jargon.

Random Variable

A random variable is a mapping from each outcome to a real number.

It sort of seems like a random variable could function as a probability measure; it would map a real number for each outcome, and that real number would be the probability that the outcome occurs.

However, after grappling with this a bit, I've come to understand that the concept of a random variable is another layer of indirection when reasoning about experiments. It maps outcomes to real numbers, but those real numbers aren't themselves speaking to the probability that an outcome will occur; that's still left to a probability distribution.

Many resources, including the All of Statistics book and the Stanford Course first introduce experiments and probabilities before getting to the concept of a random variable, but if we were to retroactively introduce the concept of a random variable to the earlier material, it would serve to assign labels to each discrete outcome, not to describe the probability.

Let's look at the experiment, "roll a die once". This could be described as a random variable X that maps each outcome to the number showing on the die:

roll 1: 1
roll 2: 2
roll 3: 3
roll 4: 4
roll 5: 5
roll 6: 6

And then a probability P aka probability mass function $f_X(x) = P(X=x)$ of this discrete random variable could be described as:

$$ f_X(x) = \begin{cases} 1/6 & x = 1 \\ 1/6 & x = 2 \\ 1/6 & x = 3\\ 1/6 & x = 4 \\ 1/6 & x = 5 \\ 1/6 & x = 6 \end{cases} $$

Good ol' Wikipedia came through to clarify this point for me on the random variable page:

A random variable's possible values might represent the possible outcomes of a yet-to-be-performed experiment, or the possible outcomes of a past experiment whose already-existing value is uncertain (for example, due to imprecise measurements or quantum uncertainty)

All of this might seem very pedantic, but I wonder if anyone else has had this brief struggle when first faced with the concept of a random variable? In any case, once you are introduced to "random variables" it's kind of like the new way of thinking about probabilities thereafter and the old way of just thinking about outcomes without the concept of a random variable was a brief stepping stone where it could be considered implicit.

Beyond the basics of representing outcomes as we would naturally consider them (e.g. each roll of a pair of dice), a random variable can map to non-distinct values, really anything. For instance, for the experiment "roll two dice", a random variable could be "the sum of the dice". It could also be something completely arbitrary, like mapping 1, 2 and 3 to -45 and 4, 5 and 6 to 93939393.
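To see the layers of indirection side by side, here is a small sketch that treats a random variable as a plain function on outcomes and derives its probability mass function from a uniform probability on the sample space. For the arbitrary mapping, the sketch assumes it is applied to the first die only.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Sample space for "roll two dice": all 36 ordered pairs, equally likely.
omega = list(product(range(1, 7), repeat=2))

def X(outcome):
    """Random variable: the sum of the two dice."""
    return sum(outcome)

def Y(outcome):
    """An arbitrary random variable based on the first die only."""
    return -45 if outcome[0] <= 3 else 93939393

def pmf(random_variable):
    """P(RV = value) for each value, assuming uniform outcomes."""
    counts = Counter(random_variable(w) for w in omega)
    return {value: Fraction(n, len(omega)) for value, n in counts.items()}

pmf_sum = pmf(X)
pmf_arbitrary = pmf(Y)
```

The random variables themselves carry no probabilities; those only appear once the probability on the underlying outcomes is pushed through the mapping.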

Cumulative distribution functions

Another curveball in learning about random variables is the tendency to jump right into cumulative distribution functions, even before the more familiar probability mass function. A cumulative distribution function is the probability that a random variable has a value less than or equal to a particular value:

$F_X(x) = P(X \leq x)$

When graphing a CDF of a discrete random variable you get a step function that eventually reaches one when you've exhausted the range of possible values for x.
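The step behavior is easy to see by building the CDF of a single fair die directly from its probability mass function. A minimal sketch:

```python
from fractions import Fraction

# PMF of one fair die.
pmf = {face: Fraction(1, 6) for face in range(1, 7)}

def cdf(x):
    """F_X(x) = P(X <= x): add up the mass at every value not exceeding x."""
    return sum((p for value, p in pmf.items() if value <= x), Fraction(0))

steps = [cdf(x) for x in (0, 1, 2, 3, 3.5, 6, 100)]
```

The function is flat between faces (for example `cdf(3)` and `cdf(3.5)` agree), jumps by 1/6 at each face, and stays at 1 once x passes 6.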

Why is thinking of probability in terms of its cumulative value useful, or at least why does it seem to be preferred to the probability mass function? My understanding so far is that CDFs play more nicely with continuous random variables. You can't really reason about the probability of a single value (it's infinitesimally small, so the probability is always zero); you have to look at the probability over a range of values, à la area under the curve of the probability density function. Speaking of which, the relationship between the density function $f_X$ and the cumulative distribution function $F_X$ is the integral:

$F_X(x) = \int_{-\infty}^x f_X(t)\,dt$
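The integral can be checked numerically. This sketch uses the standard normal density as an example (my choice, not from any of the sources above), approximates the integral with a midpoint Riemann sum starting at -10 instead of $-\infty$, and compares against the closed form written with `math.erf`.

```python
import math

def pdf(t):
    """Standard normal density (illustrative choice of f_X)."""
    return math.exp(-t * t / 2.0) / math.sqrt(2.0 * math.pi)

def cdf_numeric(x, lower=-10.0, steps=20000):
    """Approximate the integral of pdf from lower to x with midpoints.

    lower stands in for -infinity; the mass below -10 is negligible.
    """
    h = (x - lower) / steps
    return h * sum(pdf(lower + (i + 0.5) * h) for i in range(steps))

def cdf_exact(x):
    """Closed form of the standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

gap = abs(cdf_numeric(1.0) - cdf_exact(1.0))
```

The two agree to several decimal places, which is the "area under the curve" picture in action.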

Norvig's notebook corroborates this intuition,

The principles (of continuous samples spaces) are the same (as with discrete sample spaces): probability is still the ratio of the favorable cases to all the cases, but now instead of counting cases, we have to (in general) compute integrals to compare the sizes of cases. Here we will cover a simple example, which we first solve approximately by simulation, and then exactly by calculation.