DictVectorizer

A note on a handy tool in scikit-learn.

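Until now I had been indexing categorical data by hand to build the matrix known as a one-hot encoding, or 1-of-K. It turns out scikit-learn ships a convenience class that does exactly this.
That class is DictVectorizer.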
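
For comparison, the manual approach described above can be sketched roughly as follows. This is only an illustration in plain Python: the `cities` list and the variable names are hypothetical, not taken from any real dataset.

```python
# Hypothetical toy data; the values are made up for illustration.
cities = ['Dubai', 'London', 'San Francisco', 'London']

# Step 1: indexing, i.e. assign a column index to each distinct category.
index = {}
for city in cities:
    if city not in index:
        index[city] = len(index)

# Step 2: build the 1-of-K matrix, one row per sample,
# with a single 1 in the column of that sample's category.
one_hot = [[0] * len(index) for _ in cities]
for row, city in enumerate(cities):
    one_hot[row][index[city]] = 1

print(index)
print(one_hot)
```

DictVectorizer takes over both steps, and additionally passes numeric values through unchanged.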

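Converts data given as a list of JSON-like objects, [{}, {}, ...], into a matrix
Handy for building the matrix known as a one-hot encoding or 1-of-K
If a column is very high-dimensional (around one million distinct values), it apparently fails with a MemoryError, for example when toarray() is called on the result (this probably depends on how much memory is available)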
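
One way to sidestep the memory issue mentioned in the last point is to keep the result sparse and never call toarray(). The following is a minimal sketch under that assumption; the `samples` data and the feature name `user` are hypothetical, and the size is kept small so it runs quickly.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical data: 1,000 samples, each with its own distinct category value.
samples = [{'user': 'u%d' % i} for i in range(1000)]

vec = DictVectorizer()          # sparse=True is the default
X = vec.fit_transform(samples)  # SciPy sparse matrix, not a dense array

print(X.shape)  # rows x one-hot columns
print(X.nnz)    # number of values actually stored
```

Here the sparse matrix stores one value per sample, whereas a dense array of the same shape would need a cell for every row and column combination. Many scikit-learn estimators accept sparse input directly, so the dense conversion is often unnecessary.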

Example

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from sklearn.feature_extraction import DictVectorizer
measurements = [
    {'city': 'Dubai', 'temperature': 31.0, 'country': 'U.A.E.'},
    {'city': 'London', 'country': 'U.K.', 'temperature': 27.0},
    {'city': 'San Francisco', 'country': 'U.S.', 'temperature': 24.0},
]
vec = DictVectorizer()
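# Fit once to learn the mapping from feature names to columns
print(vec.fit(measurements))

# Transform to a sparse matrix, then convert to a dense array
print(vec.fit_transform(measurements).toarray())

# Feature names, in column order
# (scikit-learn releases before 1.0 used get_feature_names() instead)
print(vec.get_feature_names_out())

# Indexing result: feature name -> column index
print(vec.vocabulary_)

# Values not in the vocabulary are ignored, so every city column stays 0
print(vec.transform([{'city': 'Cambridge', 'country': 'U.K.', 'temperature': 19.0}]).toarray())

# Values in the vocabulary get a 1 in the matching column
print(vec.transform([{'city': 'London', 'country': 'U.K.', 'temperature': 19.0}]).toarray())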

It is a pity I spent all that time doing the indexing by hand.

Reference URLs

http://stackoverflow.com/questions/24251959/memoryerror-in-toarray-when-using-dictvectorizer-of-scikit-learn
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html
http://scikit-learn.org/stable/modules/feature_extraction.html