# KarlRosaen

Disappointing improvements using one-hot / binary encoding, and improving performance with the help of the Python profiler

I put my new categorical encoder to use in a follow-up attempt at the Red Hat business Kaggle competition.

The first thing I ran into was how slow it was to encode the columns with thousands of unique values. I did some profiling to see if there were any huge wins to be had:

```shell
$ python -m cProfile -o profile.bin $(which py.test) tests/test_preprocessing_transforms.py::test_profile_omniencode_redhat
```


and then, to inspect the results:

```python
import pstats

p = pstats.Stats('profile.bin')
p.strip_dirs()
p.sort_stats('cumtime')
p.print_stats('preprocessing_transforms')
```



As I suspected, the list comprehensions I'd used instead of finding ways to do everything with numpy arrays were the bottleneck:

```
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.006    0.006   11.809   11.809 test_preprocessing_transforms.py:15(test_profile_omniencode_redhat)
        1    0.000    0.000   10.213   10.213 preprocessing_transforms.py:48(transform)
        1    0.000    0.000   10.209   10.209 preprocessing_transforms.py:50(<listcomp>)
        4    0.090    0.022   10.209    2.552 preprocessing_transforms.py:54(_encode_column)
    40000    0.155    0.000    4.701    0.000 preprocessing_transforms.py:65(splat)
        1    0.000    0.000    1.050    1.050 test_preprocessing_transforms.py:4(<module>)
        1    0.000    0.000    0.817    0.817 preprocessing_transforms.py:1(<module>)
    24772    0.079    0.000    0.079    0.000 preprocessing_transforms.py:70(<listcomp>)
    40000    0.073    0.000    0.073    0.000 preprocessing_transforms.py:66(<listcomp>)
        1    0.000    0.000    0.037    0.037 preprocessing_transforms.py:44(fit)
        1    0.000    0.000    0.037    0.037 preprocessing_transforms.py:45(<dictcomp>)
        4    0.001    0.000    0.037    0.009 preprocessing_transforms.py:79(_column_info)
        1    0.000    0.000    0.004    0.004 preprocessing_transforms.py:21(transform)
        4    0.000    0.000    0.001    0.000 preprocessing_transforms.py:90(_partition_one_hot)
        4    0.000    0.000    0.000    0.000 preprocessing_transforms.py:59(<dictcomp>)
     1746    0.000    0.000    0.000    0.000 preprocessing_transforms.py:102(<lambda>)
        4    0.000    0.000    0.000    0.000 preprocessing_transforms.py:105(<listcomp>)
        4    0.000    0.000    0.000    0.000 preprocessing_transforms.py:61(<listcomp>)
        4    0.000    0.000    0.000    0.000 preprocessing_transforms.py:110(_num_onehot)
       41    0.000    0.000    0.000    0.000 preprocessing_transforms.py:127(capacity)
        4    0.000    0.000    0.000    0.000 preprocessing_transforms.py:63(<listcomp>)
       41    0.000    0.000    0.000    0.000 preprocessing_transforms.py:122(num_bin_vals)
```


However, after replacing a list comprehension with a more performant way of building the numpy array of integers representing the bits of a binary-encoded value, I only saw a 20% boost. Not bad, but still over a 50x slowdown compared to ignoring the columns with lots of unique values, or encoding them ordinally:

```
Evaluating random forest ignore
_Starting fitting full training set
_Finished fitting full training set in 3.86 seconds
_Starting evaluating on full test set
Full test accuracy (0.05 of dataset): 0.880
_Finished evaluating on full test set in 16.32 seconds
Evaluating random forest ordinal
_Starting fitting full training set
_Finished fitting full training set in 4.26 seconds
_Starting evaluating on full test set
Full test accuracy (0.05 of dataset): 0.885
_Finished evaluating on full test set in 16.10 seconds
Evaluating random forest omni 20
_Starting fitting full training set
_Finished fitting full training set in 376.31 seconds
_Starting evaluating on full test set
Full test accuracy (0.05 of dataset): 0.885
_Finished evaluating on full test set in 1050.23 seconds
Evaluating random forest omni 50
_Starting fitting full training set
_Finished fitting full training set in 417.19 seconds
_Starting evaluating on full test set
Full test accuracy (0.05 of dataset): 0.886
_Finished evaluating on full test set in 1102.41 seconds
```
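The kind of change I made is replacing a per-row list comprehension with numpy broadcasting when extracting the bits of each binary-encoded value. My actual encoder code isn't shown here; this is a minimal sketch with made-up function names:

```python
import numpy as np

def encode_bits_listcomp(codes, num_bits):
    # Slow: a Python-level loop iteration per row and per bit.
    return np.array([[(c >> b) & 1 for b in range(num_bits)]
                     for c in codes])

def encode_bits_vectorized(codes, num_bits):
    # Faster: broadcast every code against every bit position at once,
    # producing an (n_rows, num_bits) array in one numpy operation.
    codes = np.asarray(codes).reshape(-1, 1)
    bits = np.arange(num_bits)
    return (codes >> bits) & 1

# Both produce the same bit matrix (least-significant bit first).
codes = np.random.randint(0, 1 << 12, size=40_000)
assert np.array_equal(encode_bits_listcomp(codes, 12),
                      encode_bits_vectorized(codes, 12))
```

Even so, as the timings above show, vectorizing this one spot only bought about 20%; the rest of the cost is spread across the per-column bookkeeping.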


And worst of all, after the trouble of writing this new encoder, it performed no better than ordinal encoding (simply assuming the thousands of unique values can be mapped to a sequence of numbers), which runs contrary to results reported elsewhere. Side note: I also profiled that author's binary encoder, and it was just as slow as mine.
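For reference, what I mean by ordinal encoding is nothing more than assigning each unique category an integer. A minimal sketch with numpy (the column values here are made up; the real columns had thousands of uniques):

```python
import numpy as np

# A categorical column; np.unique sorts the uniques and return_inverse
# gives each row's integer code into that sorted array.
values = np.array(["act_b", "act_a", "act_b", "act_c"])
uniques, codes = np.unique(values, return_inverse=True)
# uniques -> ['act_a', 'act_b', 'act_c']
# codes   -> [1, 0, 1, 2]
```

The integer ordering is arbitrary, which is exactly why I expected binary / one-hot encoding to beat it; on this dataset, the random forest apparently didn't care.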

So I'm glad to have my OmniEncoder as a tool ready to apply to other datasets, but it was disappointing to see that it didn't do anything for me on this one.