python: pairwise-distance
numpyでfor-loopしながらpairwise distanceを計算すると非常に遅い.
pairwise distanceはgram行列のカーネルにrbfを使った時に絶対に出てくるので,高速計算は必須要件.
比較
- 普通のfor-loop
- 行列計算ベースで一気に行う方法
- numbaで比較する
環境
- Ubuntu14.04
- Intel(R) Core(TM) i5-2540M CPU @ 2.60GHz
設定1
- #samples=1000
- #dimensions=10, 100
コード1
#!/usr/bin/env python import numpy as np import time from numba import jit # dataset n = 1000 d = 100 # or 10 X = np.random.rand(n, d) gamma = 0.01 # numpy st = time.time() for x in X: for y in X: np.exp(-gamma * np.sum((x - y) ** 2)) pass pass et = time.time() print "elapsed time (numpy): %f [s]" % (et - st) # numpy efficiently computed def distmat(X): H = np.tile(np.diag(np.dot(X, X.T)), (X.shape[0], 1)) G = np.dot(X, X.T) distmat_ = H - 2 * G + H.T return distmat_ st = time.time() Y = np.exp(-gamma * distmat(X)) et = time.time() print "elapsed time (numpy efficiently computed): %f [s]" % (et - st) # for-loop with jit @jit("f8[:, :](f8[:, :])") def distmat_jit(X): """ """ Y = np.zeros((n, n)) for i in range(n): x = X[i, :] for j in range(n): s = 0 y = X[j, :] if i > j: Y[i, j] = Y[j, i] continue pass for k in range(d): s += (x[k] - y[k]) ** 2 pass Y[i, j] = np.exp(-gamma * s) pass return Y st = time.time() distmat_jit(X) et = time.time() print "elapsed time (numba): %f [s]" % (et - st)
結果1
- #dimensions=10
elapsed time (numpy): 15.017164 [s] elapsed time (numpy efficiently computed): 0.144054 [s] elapsed time (numba): 0.144754 [s]
- #dimensions=100
elapsed time (numpy): 18.905993 [s] elapsed time (numpy efficiently computed): 0.334139 [s] elapsed time (numba): 0.269915 [s]
次元を増やすとnumba (llvm jit)のほうが早いが,次元が小さならnumpyで十分.
しかし,numpyのやり方はメモリを食うのでnumbaがいいと思われる.
設定2
- #samples=10000
- #dimensions=618 (isolet dataの次元数)
次元,サンプルともに増やす.
コード2
#!/usr/bin/env python import numpy as np import time from numba import jit # dataset n = 10000 d = 618 X = np.random.rand(n, d) gamma = 0.01 # numpy efficiently computed def distmat(X): H = np.tile(np.diag(np.dot(X, X.T)), (X.shape[0], 1)) G = np.dot(X, X.T) distmat_ = H - 2 * G + H.T return distmat_ st = time.time() Y = np.exp(-gamma * distmat(X)) et = time.time() print "elapsed time (numpy efficiently computed): %f [s]" % (et - st) # for-loop with jit @jit("f8[:, :](f8[:, :])") def distmat_jit(X): """ """ Y = np.zeros((n, n)) for i in range(n): x = X[i, :] for j in range(n): s = 0 y = X[j, :] if i > j: Y[i, j] = Y[j, i] continue pass for k in range(d): s += (x[k] - y[k]) ** 2 pass Y[i, j] = np.exp(-gamma * s) pass return Y st = time.time() distmat_jit(X) et = time.time() print "elapsed time (numba): %f [s]" % (et - st)
結果2
elapsed time (numpy efficiently computed): 15.835750 [s] elapsed time (numba): 37.839368 [s]
numpyは計算終わらないので除去,
O(n^2d)だから#dimensionsを増やしても#samplesの方が速度に影響を与える
numpy efficiently computedのほうがいいという解釈か.
大体(#samples, #dimensions) > (10000, 1000)は,もうGPUを使ったほうがよいと思われる.
余談
自分自身同士の計算を含まないような距離計算のみなら
scipy.spatial.distance.pdist
で十分.