Overview of Spark MLlib (Spark 1.2)

With the move to Spark 1.0, MLlib appears to have gained support for sparse data representations.
The implementation depends on Breeze (in the Scala case).

Classification and regression

linear models

${ \displaystyle f(\mathbf{w}) := \lambda R(\mathbf{w}) + \frac{1}{n}\sum_{i=1}^n L(\mathbf{w}; \mathbf{x}_i, y_i) }$

Here the regularizer $R(\mathbf{w})$ can be either L1 or L2; both are supported.
It appears to be solved mainly with SGD.
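To make the objective concrete, here is a minimal sketch that evaluates $f(\mathbf{w})$ on a toy dataset. The function names, the toy data, and the use of squared loss as a placeholder for $L$ are all assumptions for illustration; this is not MLlib code.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def l1(w):
    # R(w) = ||w||_1
    return sum(abs(wj) for wj in w)

def l2(w):
    # R(w) = (1/2) * ||w||_2^2 (the 1/2 factor is a convention assumed here)
    return 0.5 * sum(wj * wj for wj in w)

def squared_loss(w, x, y):
    # Placeholder loss L(w; x, y); any per-sample loss could be plugged in.
    return 0.5 * (dot(w, x) - y) ** 2

def objective(w, data, loss, reg, lam):
    # f(w) = lam * R(w) + (1/n) * sum_i L(w; x_i, y_i)
    n = len(data)
    return lam * reg(w) + sum(loss(w, x, y) for x, y in data) / n

# Hypothetical toy data: (features, label) pairs fit exactly by w = [1, -1].
data = [([1.0, 0.0], 1.0), ([0.0, 1.0], -1.0)]
w = [1.0, -1.0]
f_l1 = objective(w, data, squared_loss, l1, 0.1)
f_l2 = objective(w, data, squared_loss, l2, 0.1)
print(f_l1, f_l2)
```

Since the data term is zero for this `w`, the two values differ only in the regularization term.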

SVMs

The loss function is the hinge loss.

logistic regression

The loss function is the logistic loss.

linear regression

The loss function is the squared error.
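The three linear models above differ only in the loss. A small sketch of each, written as a function of the score $\mathbf{w}^T\mathbf{x}$, assuming labels in {-1, +1} for the hinge and logistic losses and a 1/2 factor on the squared error (both conventions are assumptions of this sketch):

```python
import math

def hinge_loss(score, y):
    # SVM: max(0, 1 - y * w^T x)
    return max(0.0, 1.0 - y * score)

def logistic_loss(score, y):
    # Logistic regression: log(1 + exp(-y * w^T x))
    return math.log1p(math.exp(-y * score))

def squared_loss(score, y):
    # Linear regression: (1/2) * (w^T x - y)^2
    return 0.5 * (score - y) ** 2

# Compare the three losses for a positive example at a few scores.
for score in (-2.0, 0.0, 2.0):
    print(score, hinge_loss(score, 1.0), logistic_loss(score, 1.0), squared_loss(score, 1.0))
```

The hinge loss is exactly zero once the margin exceeds 1, while the logistic loss only approaches zero.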

naive Bayes

A method built on the naive assumption that the features are conditionally independent given the class.

${ \displaystyle p(y | \mathbf{x}) \propto p(\mathbf{x} | y) \times p(y) = p(x_1 | y) \times \ldots \times p(x_d | y) \times p(y) }$
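The factorization above can be evaluated directly once the per-feature likelihoods are known. A minimal sketch with hand-written probability tables (the class names, feature values, and probabilities are made up for illustration; MLlib estimates these from data rather than taking them as input):

```python
def naive_bayes_posterior(x, priors, likelihoods):
    """Return p(y | x) for each class y.

    priors:      {y: p(y)}
    likelihoods: {y: [ {value: p(x_j = value | y)} for each feature j ]}
    """
    scores = {}
    for y, prior in priors.items():
        score = prior
        for j, xj in enumerate(x):
            # Naive assumption: multiply the per-feature likelihoods.
            score *= likelihoods[y][j][xj]
        scores[y] = score
    total = sum(scores.values())
    # Normalize so that the posteriors sum to 1.
    return {y: s / total for y, s in scores.items()}

# Hypothetical two-class problem with two binary features.
priors = {"spam": 0.5, "ham": 0.5}
likelihoods = {
    "spam": [{1: 0.8, 0: 0.2}, {1: 0.1, 0: 0.9}],
    "ham": [{1: 0.2, 0: 0.8}, {1: 0.5, 0: 0.5}],
}
posterior = naive_bayes_posterior([1, 0], priors, likelihoods)
print(posterior)
```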

decision trees

Needs no explanation.

ensembles of trees (Random Forests and Gradient-Boosted Trees)

I am not familiar with these.

Collaborative filtering

alternating least squares (ALS)

The method that alternately optimizes the user factors and the item factors.
Implicit feedback is also supported.
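The alternating idea is easiest to see with rank-1 factors, where each half-step has a closed form. A minimal sketch for explicit ratings only, assuming a dict of observed entries and a ridge penalty `lam` (the function name and the toy ratings are hypothetical; the implicit-feedback variant is not shown):

```python
def als_rank1(ratings, n_users, n_items, lam=0.01, iters=20):
    """ratings: {(user, item): value} for the observed entries only."""
    u = [1.0] * n_users
    v = [1.0] * n_items
    for _ in range(iters):
        # Fix the item factors and solve each user factor (1-D ridge regression).
        for i in range(n_users):
            obs = [(j, r) for (a, j), r in ratings.items() if a == i]
            u[i] = sum(r * v[j] for j, r in obs) / (lam + sum(v[j] ** 2 for j, _ in obs))
        # Fix the user factors and solve each item factor the same way.
        for j in range(n_items):
            obs = [(i, r) for (i, b), r in ratings.items() if b == j]
            v[j] = sum(r * u[i] for i, r in obs) / (lam + sum(u[i] ** 2 for i, _ in obs))
    return u, v

def squared_error(ratings, u, v):
    return sum((r - u[i] * v[j]) ** 2 for (i, j), r in ratings.items())

# Hypothetical ratings from a rank-1 matrix; entry (1, 2) is left unobserved.
ratings = {(0, 0): 1.0, (0, 1): 2.0, (0, 2): 3.0, (1, 0): 2.0, (1, 1): 4.0}
u, v = als_rank1(ratings, n_users=2, n_items=3)
sse = squared_error(ratings, u, v)
print(sse, u[1] * v[2])  # fit error, and the prediction for the unobserved entry
```

With higher rank, each half-step becomes a small least-squares solve per user or item instead of a scalar division.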

http://labs.yahoo.com/files/HuKorenVolinsky-ICDM08.pdf

Clustering

k-means

Needs no explanation.
k-means++, a method for initializing the cluster centers, and k-means||, its distributed variant, are both implemented.
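A minimal sketch of the sequential k-means++ seeding step, for reference: each new center is drawn with probability proportional to its squared distance from the nearest center chosen so far. The function names and the toy points are assumptions, and the distributed k-means|| variant is not shown.

```python
import random

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_pp_init(points, k, rng):
    # First center: uniform at random.
    centers = [rng.choice(points)]
    while len(centers) < k:
        # D(x)^2: squared distance from each point to its nearest chosen center.
        d2 = [min(sq_dist(p, c) for c in centers) for p in points]
        # Sample the next center proportionally to D(x)^2.
        # Already-chosen points have weight 0 and cannot be picked again.
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers

# Hypothetical data: two well-separated groups of points.
points = [(0.0, 0.0), (0.0, 0.1), (10.0, 10.0), (10.0, 10.1)]
centers = kmeans_pp_init(points, k=2, rng=random.Random(0))
print(centers)
```

This assumes `k` does not exceed the number of distinct points; otherwise all weights become zero and the sampling fails.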

Dimensionality reduction

singular value decomposition (SVD)

The standard singular value decomposition.

principal component analysis (PCA)

The standard principal component analysis.

Feature extraction and transformation

TF-IDF

Computes the Term Frequency and Inverse Document Frequency features that are common in language processing.
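A minimal sketch of the computation on tokenized documents. The smoothed IDF form $\log\frac{m+1}{df+1}$ used here is, to my understanding, the one the MLlib guide describes, but treat it as an assumption; the feature hashing MLlib applies to terms is left out, and the toy documents are made up.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    m = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    # Smoothed IDF (assumed form): log((m + 1) / (df + 1)).
    idf = {t: math.log((m + 1) / (df[t] + 1)) for t in df}
    # TF is the raw term count within each document.
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]

docs = [["spark", "mllib", "spark"], ["spark", "breeze"]]
vectors = tf_idf(docs)
print(vectors)
```

A term that appears in every document ("spark" here) gets weight 0 under this IDF.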

Word2Vec

Builds vector representations of words.
A fairly new method.

StandardScaler

Scales each feature to zero mean and unit variance.
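What this does, as a minimal column-wise sketch. Using the sample standard deviation and mapping constant columns to 0 are assumptions of this sketch, not statements about MLlib's behavior.

```python
import statistics

def standard_scale(rows):
    """rows: list of equal-length feature lists (needs at least two rows)."""
    columns = list(zip(*rows))
    means = [statistics.mean(c) for c in columns]
    stds = [statistics.stdev(c) for c in columns]  # sample standard deviation
    return [
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stds)]
        for row in rows
    ]

rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
scaled = standard_scale(rows)
print(scaled)
```

Both columns end up on the same scale even though the second is ten times larger in the input.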

Normalizer

Scales each sample so that its $L^p$ norm is 1.
The default is an $L^2$ norm of 1, i.e. samples are projected onto the unit sphere.
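Unlike StandardScaler, this works per sample (row), not per feature. A minimal sketch, with the function name and the choice to leave an all-zero vector unchanged as assumptions:

```python
def normalize(x, p=2.0):
    # L^p norm of the sample: (sum_j |x_j|^p)^(1/p)
    norm = sum(abs(v) ** p for v in x) ** (1.0 / p)
    # An all-zero vector has no direction, so it is returned unchanged here.
    return [v / norm for v in x] if norm > 0 else list(x)

unit_l2 = normalize([3.0, 4.0])         # default p = 2
unit_l1 = normalize([1.0, 3.0], p=1.0)  # entries sum to 1 in absolute value
print(unit_l2, unit_l1)
```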

Optimization (developer)

stochastic gradient descent

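Basically, an optimization method that updates the model each time one sample arrives.
The step size needs to be tuned for each dataset.
It does decay the step size at every iteration, as $\gamma = \frac{s}{ \sqrt{t}}$.
For L1 regularization, a proximal update is also supported; its objective uses a regularizer of the form ${ \displaystyle||\mathbf{w}_t - \mathbf{w}_{t-1}||}$, which keeps the new iterate close to the previous one.
Because it is RDD-based, in practice each update uses the gradient averaged over a sampled subset of the data (mini-batch SGD) rather than a single sample.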
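Putting the pieces above together: a minimal sketch of mini-batch SGD with the $s/\sqrt{t}$ step decay and an L1 proximal step, which for L1 reduces to soft-thresholding. Squared loss, the function names, the hyperparameters, and the toy data are assumptions for illustration; this is not MLlib's code.

```python
import math
import random

def soft_threshold(w, shrink):
    # Proximal operator of shrink * ||w||_1: move each weight toward 0 by `shrink`.
    return [math.copysign(max(abs(wj) - shrink, 0.0), wj) for wj in w]

def minibatch_sgd_l1(data, dim, step=0.5, lam=0.01, iters=2000, batch=2, seed=0):
    rng = random.Random(seed)
    w = [0.0] * dim
    for t in range(1, iters + 1):
        gamma = step / math.sqrt(t)  # decaying step size s / sqrt(t)
        sample = rng.sample(data, min(batch, len(data)))
        # Average the squared-loss gradient over the sampled mini-batch.
        grad = [0.0] * dim
        for x, y in sample:
            err = sum(wj * xj for wj, xj in zip(w, x)) - y
            for j, xj in enumerate(x):
                grad[j] += err * xj / len(sample)
        # Gradient step on the loss, then proximal step for the L1 term.
        w = [wj - gamma * gj for wj, gj in zip(w, grad)]
        w = soft_threshold(w, gamma * lam)
    return w

# Hypothetical noiseless data generated by y = 2 * x0 + 0 * x1.
data = [([1.0, 0.0], 2.0), ([0.0, 1.0], 0.0), ([1.0, 1.0], 2.0), ([2.0, 1.0], 4.0)]
w = minibatch_sgd_l1(data, dim=2)
print(w)
```

The L1 term pulls the weight of the irrelevant second feature toward 0 and slightly shrinks the first.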

limited-memory BFGS (L-BFGS)

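A limited-memory version of BFGS, the optimization method that avoids computing the Hessian required by Newton's method by approximating it instead.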

KZKY memo

Notes for myself.