KZKY memo

自分用メモ.

python: pairwise-distance

numpyでfor-loopしながらpairwise distanceを計算すると非常に遅い.
pairwise distanceはgram行列のカーネルにrbfを使った時に絶対に出てくるので,高速計算は必須要件.

比較

環境

  • Ubuntu14.04
  • Intel(R) Core(TM) i5-2540M CPU @ 2.60GHz

設定1

  • #samples=1000
  • #dimensions=10, 100

コード1

#!/usr/bin/env python

import numpy as np
import time

from numba import jit

# dataset
n = 1000
d = 100  # or 10
X = np.random.rand(n, d)
gamma = 0.01


# numpy
st = time.time()
for x in X:
    for y in X:
        np.exp(-gamma * np.sum((x - y) ** 2))
        pass
    pass
et = time.time()
print "elapsed time (numpy): %f [s]" % (et - st)
        
# numpy efficiently computed
def distmat(X):
    H = np.tile(np.diag(np.dot(X, X.T)), (X.shape[0], 1))
    G = np.dot(X, X.T)
    distmat_ = H - 2 * G + H.T
    return distmat_
st = time.time()
Y = np.exp(-gamma * distmat(X))
et = time.time()
print "elapsed time (numpy efficiently computed): %f [s]" % (et - st)


# for-loop with jit
@jit("f8[:, :](f8[:, :])")
def distmat_jit(X):
    """
    """
    Y = np.zeros((n, n))
    for i in range(n):
        x = X[i, :]
        for j in range(n):
            s = 0
            y = X[j, :]

            if i > j:
                Y[i, j] = Y[j, i]
                continue
                pass
            
            for k in range(d):
                s += (x[k] - y[k]) ** 2
                pass
            
            Y[i, j] = np.exp(-gamma * s)
            pass
    return Y
    
st = time.time()
distmat_jit(X)
et = time.time()
print "elapsed time (numba): %f [s]" % (et - st)

結果1

  • #dimensions=10
elapsed time (numpy): 15.017164 [s]
elapsed time (numpy efficiently computed): 0.144054 [s]
elapsed time (numba): 0.144754 [s]
  • #dimensions=100
elapsed time (numpy): 18.905993 [s]
elapsed time (numpy efficiently computed): 0.334139 [s]
elapsed time (numba): 0.269915 [s]

次元を増やすとnumba (llvm jit)のほうが早いが,次元が小さならnumpyで十分.
しかし,numpyのやり方はメモリを食うのでnumbaがいいと思われる.

環境2

サーバで計算.

設定2

  • #samples=10000
  • #dimensions=618 (isolet dataの次元数)

次元,サンプルともに増やす.

コード2

#!/usr/bin/env python

import numpy as np
import time

from numba import jit

# dataset
n = 10000
d = 618 
X = np.random.rand(n, d)
gamma = 0.01


# numpy efficiently computed
def distmat(X):
    H = np.tile(np.diag(np.dot(X, X.T)), (X.shape[0], 1))
    G = np.dot(X, X.T)
    distmat_ = H - 2 * G + H.T
    return distmat_
st = time.time()
Y = np.exp(-gamma * distmat(X))
et = time.time()
print "elapsed time (numpy efficiently computed): %f [s]" % (et - st)


# for-loop with jit
@jit("f8[:, :](f8[:, :])")
def distmat_jit(X):
    """
    """
    Y = np.zeros((n, n))
    for i in range(n):
        x = X[i, :]
        for j in range(n):
            s = 0
            y = X[j, :]

            if i > j:
                Y[i, j] = Y[j, i]
                continue
                pass
            
            for k in range(d):
                s += (x[k] - y[k]) ** 2
                pass
            
            Y[i, j] = np.exp(-gamma * s)
            pass
    return Y
    
st = time.time()
distmat_jit(X)
et = time.time()
print "elapsed time (numba): %f [s]" % (et - st)

結果2

elapsed time (numpy efficiently computed): 15.835750 [s]
elapsed time (numba): 37.839368 [s]

numpyは計算終わらないので除去,
O(n^2d)だから#dimensionsを増やしても#samplesの方が速度に影響を与える
numpy efficiently computedのほうがいいという解釈か.

大体(#samples, #dimensions) > (10000, 1000)は,もうGPUを使ったほうがよいと思われる.

余談

自分自身同士の計算を含まないような距離計算のみなら
scipy.spatial.distance.pdist
で十分.