Disappointing results from one-hot / binary encoding, and improving performance with the help of the Python profiler

I put my new categorical encoder to use in a follow-up attempt at the Predicting Red Hat Business Value Kaggle competition.

The first thing I ran into was how slow it was to encode the variables with thousands of unique values. I did some profiling to see if there were any huge wins to be had:

$ python -m cProfile -o profile.bin $(which py.test) tests/test_preprocessing_transforms.py::test_profile_omniencode_redhat

and then

import pstats
p = pstats.Stats('profile.bin')
p.strip_dirs()                             # drop long path prefixes from filenames
p.sort_stats('cumtime')                    # most expensive call trees first
p.print_stats('preprocessing_transforms')  # only show calls in my own module


As I suspected, using list comprehensions instead of finding ways to do everything with numpy arrays was slow:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.006    0.006   11.809   11.809 test_preprocessing_transforms.py:15(test_profile_omniencode_redhat)
        1    0.000    0.000   10.213   10.213 preprocessing_transforms.py:48(transform)
        1    0.000    0.000   10.209   10.209 preprocessing_transforms.py:50(<listcomp>)
        4    0.090    0.022   10.209    2.552 preprocessing_transforms.py:54(_encode_column)
    40000    0.155    0.000    4.701    0.000 preprocessing_transforms.py:65(splat)
        1    0.000    0.000    1.050    1.050 test_preprocessing_transforms.py:4(<module>)
        1    0.000    0.000    0.817    0.817 preprocessing_transforms.py:1(<module>)
    24772    0.079    0.000    0.079    0.000 preprocessing_transforms.py:70(<listcomp>)
    40000    0.073    0.000    0.073    0.000 preprocessing_transforms.py:66(<listcomp>)
        1    0.000    0.000    0.037    0.037 preprocessing_transforms.py:44(fit)
        1    0.000    0.000    0.037    0.037 preprocessing_transforms.py:45(<dictcomp>)
        4    0.001    0.000    0.037    0.009 preprocessing_transforms.py:79(_column_info)
        1    0.000    0.000    0.004    0.004 preprocessing_transforms.py:21(transform)
        4    0.000    0.000    0.001    0.000 preprocessing_transforms.py:90(_partition_one_hot)
        4    0.000    0.000    0.000    0.000 preprocessing_transforms.py:59(<dictcomp>)
     1746    0.000    0.000    0.000    0.000 preprocessing_transforms.py:102(<lambda>)
        4    0.000    0.000    0.000    0.000 preprocessing_transforms.py:105(<listcomp>)
        4    0.000    0.000    0.000    0.000 preprocessing_transforms.py:61(<listcomp>)
        4    0.000    0.000    0.000    0.000 preprocessing_transforms.py:110(_num_onehot)
       41    0.000    0.000    0.000    0.000 preprocessing_transforms.py:127(capacity)
        4    0.000    0.000    0.000    0.000 preprocessing_transforms.py:63(<listcomp>)
       41    0.000    0.000    0.000    0.000 preprocessing_transforms.py:122(num_bin_vals)
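
A big chunk of that time is in splat() and its per-row list comprehensions, which build the bit representation of each binary-encoded value one row at a time. The fix is to vectorize that expansion over the whole column with numpy bit-shifting; something like the sketch below, where the function name and shapes are my reconstruction rather than the actual OmniEncoder internals:

import numpy as np

def binary_encode_column(codes, num_bits):
    """Turn a 1-D array of non-negative integer codes into an
    (n_rows, num_bits) array of 0/1 ints, one output column per bit."""
    codes = np.asarray(codes, dtype=np.int64)
    shifts = np.arange(num_bits, dtype=np.int64)
    # Broadcasting shifts every code by 0..num_bits-1 in one shot,
    # instead of looping over rows in a Python list comprehension.
    return (codes[:, None] >> shifts) & 1

# e.g. binary_encode_column(np.array([0, 5, 6]), num_bits=3)
# -> [[0, 0, 0], [1, 0, 1], [0, 1, 1]]  (least significant bit first)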

However, after updating a list comprehension to a more performant way of building a numpy array of integers representing the bits of a binary-encoded value, I only saw about a 20% boost. Not bad, but still more than a 50x slowdown compared to ignoring the columns with lots of unique values or encoding them ordinally:

Evaluating random forest ignore
 _Starting fitting full training set
 _Finished fitting full training set in 3.86 seconds
 _Starting evaluating on full test set
  Full test accuracy (0.05 of dataset): 0.880
 _Finished evaluating on full test set in 16.32 seconds
Evaluating random forest ordinal
 _Starting fitting full training set
 _Finished fitting full training set in 4.26 seconds
 _Starting evaluating on full test set
  Full test accuracy (0.05 of dataset): 0.885
 _Finished evaluating on full test set in 16.10 seconds
Evaluating random forest omni 20
 _Starting fitting full training set
 _Finished fitting full training set in 376.31 seconds
 _Starting evaluating on full test set
  Full test accuracy (0.05 of dataset): 0.885
 _Finished evaluating on full test set in 1050.23 seconds
Evaluating random forest omni 50
 _Starting fitting full training set
 _Finished fitting full training set in 417.19 seconds
 _Starting evaluating on full test set
  Full test accuracy (0.05 of dataset): 0.886
 _Finished evaluating on full test set in 1102.41 seconds

And worst of all, for all the trouble I went through to write this new encoder, it performed no better than ordinal encoding (simply assuming the thousands of unique values can be mapped to a sequence of integers), which runs contrary to results reported elsewhere. Side note: I also profiled that author's binary encoder, and it was just as slow as mine.
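
For comparison, ordinal encoding really is that simple; with pandas it's essentially pandas.factorize. A toy sketch with a made-up column standing in for the Red Hat group/char fields, not the actual competition code:

import pandas as pd

df = pd.DataFrame({"group_1": ["group 17304", "group 8", "group 17304", "group 245"]})
codes, uniques = pd.factorize(df["group_1"])
df["group_1"] = codes   # each distinct string becomes an arbitrary integer: [0, 1, 0, 2]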

So I'm glad to have my OmniEncoder as a tool ready to apply to other datasets, but it was disappointing to see that it didn't buy me anything on this particular one.