# KarlRosaen

A couple of expectation problems and progress on building preprocessing pipelines

## Morning probability work

I worked out the expected value and variance of a geometric random variable, which wasn't all that fun or illuminating, as it came down to applying the formulas for expectation and variance:

- $E(X) = \sum_{x} x f_X(x)$
- $V(X) = E(X^2) - E(X)^2 = \sum_{x} x^2 f_X(x) - E(X)^2$

and futzing around with algebraic manipulation and some tricks for evaluating infinite series. This didn't make me eager to do the same with a Poisson RV, so I'm going to leave that as a TODO for now; I might just skip it for good.
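For the record, using the pmf $f_X(x) = (1-p)^{x-1} p$ for $x = 1, 2, \ldots$ (the number-of-trials convention), the main series trick is differentiating the geometric series term by term:

$$\sum_{x=1}^{\infty} x q^{x-1} = \frac{d}{dq} \sum_{x=0}^{\infty} q^x = \frac{d}{dq} \frac{1}{1-q} = \frac{1}{(1-q)^2}$$

so with $q = 1 - p$ the results come out to:

$$E(X) = p \sum_{x=1}^{\infty} x (1-p)^{x-1} = \frac{p}{p^2} = \frac{1}{p}, \qquad V(X) = \frac{1-p}{p^2}$$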

I will say it's pretty neat that the expectation and variance of a Poisson, with pmf $f_X(x) = e^{-\lambda} \frac{\lambda^x}{x!}$, both turn out to be $\lambda$. This prompted me to review the Poisson distribution by watching this video and a couple of others from that playlist.
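Rather than grinding through the series, a quick simulation makes the claim easy to sanity check ($\lambda = 3$ and the sample size are arbitrary choices here):

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.poisson(lam=3, size=100_000)

# Both the sample mean and the sample variance should land close to lambda = 3
print(samples.mean(), samples.var())
```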

## Preprocessing pandas dataframes

I'm continuing to progress on my automatic preprocessing pipeline builder. One challenge has been that all of the built-in pipeline transformers that scikit-learn provides deal with numpy arrays, so while they work when you pass in a pandas dataframe, the output is always a 2d numpy array.

I prefer to keep things as dataframes so that the column names remain available after preprocessing, in case there is any other intermediate exploration.
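To make the problem concrete, here's a minimal sketch (`StandardScaler` stands in for any of the built-in transformers):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({'height': [1.6, 1.7, 1.8], 'weight': [55.0, 70.0, 85.0]})
result = StandardScaler().fit_transform(df)
print(type(result))  # <class 'numpy.ndarray'> -- the column names are gone
```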

I wrote this helper class to adapt most of the basic transformers:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class DfTransformerAdapter(BaseEstimator, TransformerMixin):
    """Adapts a scikit-learn Transformer to return a pandas DataFrame"""

    def __init__(self, transformer):
        self.transformer = transformer

    def fit(self, X, y=None, **fit_params):
        self.transformer.fit(X, y=y, **fit_params)
        return self

    def transform(self, X, **transform_params):
        # The wrapped transformer returns a 2d numpy array;
        # re-wrap it with the original column names
        raw_result = self.transformer.transform(X, **transform_params)
        return pd.DataFrame(raw_result, columns=X.columns)
```
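As a quick usage sketch (the adapter class is repeated so the snippet runs standalone; `StandardScaler` and the toy data are arbitrary choices), the column names survive the round trip:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler


class DfTransformerAdapter(BaseEstimator, TransformerMixin):
    """Adapts a scikit-learn Transformer to return a pandas DataFrame"""

    def __init__(self, transformer):
        self.transformer = transformer

    def fit(self, X, y=None, **fit_params):
        self.transformer.fit(X, y=y, **fit_params)
        return self

    def transform(self, X, **transform_params):
        raw_result = self.transformer.transform(X, **transform_params)
        return pd.DataFrame(raw_result, columns=X.columns)


df = pd.DataFrame({'height': [1.6, 1.7, 1.8], 'weight': [55.0, 70.0, 85.0]})
scaled = DfTransformerAdapter(StandardScaler()).fit(df).transform(df)
print(list(scaled.columns))  # ['height', 'weight']
```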


but some of the more complex transformers that update or combine columns, including FeatureUnion and one-hot encoding, require more work.
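To illustrate why column-changing transformers don't fit the adapter, here's a sketch using pandas' own `get_dummies` as a stand-in for one-hot encoding: the output columns are derived from the *values*, so there is no longer a one-to-one mapping back to `X.columns`:

```python
import pandas as pd

df = pd.DataFrame({'color': ['red', 'blue', 'red'], 'size': [1, 2, 3]})
encoded = pd.get_dummies(df, columns=['color'])
print(list(encoded.columns))  # ['size', 'color_blue', 'color_red']
```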