Disappointing gains from one-hot / binary encoding, and improving performance with the Python profiler
I put my new categorical encoder to use in a follow-up attempt at the Red Hat Business Value Kaggle competition.
The first thing I ran into was how slow it was to encode the variables with thousands of unique values. I did some profiling to see whether there were any huge wins to be had:
$ python -m cProfile -o profile.bin $(which py.test) tests/test_preprocessing_transforms.py::test_profile_omniencode_redhat
and then inspected the results with pstats:
import pstats
p = pstats.Stats('profile.bin')
p.strip_dirs()                             # drop long directory prefixes from filenames
p.sort_stats('cumtime')                    # sort by cumulative time spent in each call
p.print_stats('preprocessing_transforms')  # only print rows matching my module
As I suspected, doing the work with list comprehensions instead of vectorized numpy operations was the bottleneck (a sketch of the difference follows the profile output):
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.006 0.006 11.809 11.809 test_preprocessing_transforms.py:15(test_profile_omniencode_redhat)
1 0.000 0.000 10.213 10.213 preprocessing_transforms.py:48(transform)
1 0.000 0.000 10.209 10.209 preprocessing_transforms.py:50(<listcomp>)
4 0.090 0.022 10.209 2.552 preprocessing_transforms.py:54(_encode_column)
40000 0.155 0.000 4.701 0.000 preprocessing_transforms.py:65(splat)
1 0.000 0.000 1.050 1.050 test_preprocessing_transforms.py:4(<module>)
1 0.000 0.000 0.817 0.817 preprocessing_transforms.py:1(<module>)
24772 0.079 0.000 0.079 0.000 preprocessing_transforms.py:70(<listcomp>)
40000 0.073 0.000 0.073 0.000 preprocessing_transforms.py:66(<listcomp>)
1 0.000 0.000 0.037 0.037 preprocessing_transforms.py:44(fit)
1 0.000 0.000 0.037 0.037 preprocessing_transforms.py:45(<dictcomp>)
4 0.001 0.000 0.037 0.009 preprocessing_transforms.py:79(_column_info)
1 0.000 0.000 0.004 0.004 preprocessing_transforms.py:21(transform)
4 0.000 0.000 0.001 0.000 preprocessing_transforms.py:90(_partition_one_hot)
4 0.000 0.000 0.000 0.000 preprocessing_transforms.py:59(<dictcomp>)
1746 0.000 0.000 0.000 0.000 preprocessing_transforms.py:102(<lambda>)
4 0.000 0.000 0.000 0.000 preprocessing_transforms.py:105(<listcomp>)
4 0.000 0.000 0.000 0.000 preprocessing_transforms.py:61(<listcomp>)
4 0.000 0.000 0.000 0.000 preprocessing_transforms.py:110(_num_onehot)
41 0.000 0.000 0.000 0.000 preprocessing_transforms.py:127(capacity)
4 0.000 0.000 0.000 0.000 preprocessing_transforms.py:63(<listcomp>)
41 0.000 0.000 0.000 0.000 preprocessing_transforms.py:122(num_bin_vals)
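To make the contrast concrete, here is a minimal sketch of the kind of change involved. The names and data are hypothetical, not the actual encoder code: it expands a column of integer codes into the bit columns a binary encoder produces, first with a per-value list comprehension and then with a single broadcast numpy expression.

import numpy as np

# Hypothetical stand-in for one high-cardinality column: ordinal codes 0..6999.
codes = np.random.randint(0, 7000, size=40000)
num_bits = int(np.ceil(np.log2(codes.max() + 1)))

def encode_bits_listcomp(codes, num_bits):
    # Slow: expand each value into its bits one at a time in Python.
    return np.array([[(value >> bit) & 1 for bit in range(num_bits)]
                     for value in codes])

def encode_bits_vectorized(codes, num_bits):
    # Faster: shift and mask the whole column at once via broadcasting.
    shifts = np.arange(num_bits)
    return (codes[:, None] >> shifts) & 1

assert np.array_equal(encode_bits_listcomp(codes, num_bits),
                      encode_bits_vectorized(codes, num_bits))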
However, after replacing the list comprehension with a more performant, vectorized way of building the numpy array of bits for a binary-encoded value, I only saw a 20% boost. Not bad, but still over a 50x slowdown compared to ignoring the high-cardinality columns or encoding them ordinally:
Evaluating random forest ignore
_Starting fitting full training set
_Finished fitting full training set in 3.86 seconds
_Starting evaluating on full test set
Full test accuracy (0.05 of dataset): 0.880
_Finished evaluating on full test set in 16.32 seconds
Evaluating random forest ordinal
_Starting fitting full training set
_Finished fitting full training set in 4.26 seconds
_Starting evaluating on full test set
Full test accuracy (0.05 of dataset): 0.885
_Finished evaluating on full test set in 16.10 seconds
Evaluating random forest omni 20
_Starting fitting full training set
_Finished fitting full training set in 376.31 seconds
_Starting evaluating on full test set
Full test accuracy (0.05 of dataset): 0.885
_Finished evaluating on full test set in 1050.23 seconds
Evaluating random forest omni 50
_Starting fitting full training set
_Finished fitting full training set in 417.19 seconds
_Starting evaluating on full test set
Full test accuracy (0.05 of dataset): 0.886
_Finished evaluating on full test set in 1102.41 seconds
Worst of all, for all the trouble I went through to write this new encoder, it performed no better than ordinal encoding (simply mapping the thousands of unique values to a sequence of integers, as in the sketch below), which runs contrary to results reported elsewhere. Side note: I also profiled that author's binary encoder, and it was just as slow as mine.
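For reference, ordinal encoding is about as simple as it gets. A minimal sketch with pandas, using a made-up column rather than the actual competition data:

import pandas as pd

df = pd.DataFrame({'category': ['type 1', 'type 5', 'type 1', 'type 8762']})
# factorize maps each unique value to an integer code in order of first appearance.
df['category'], uniques = pd.factorize(df['category'])
print(df['category'].tolist())  # [0, 1, 0, 2]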
So I'm glad to have my OmniEncoder as a tool ready to apply to other datasets, but it was disappointing to see that it didn't do anything for me on this particular one.