TensorFlow: Convolutional Neural Networks

Overview

A sample that trains a CNN on CIFAR10, both on a single GPU and on multiple GPUs.
The CIFAR10 dataset consists of 32x32-pixel color images in 10 classes. In summary:

#classes	10
#samples/class	6000
#train samples	50000
#test samples	10000

Goals

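Highlight the standard practices for designing, training, and evaluating a DNN architecture
Provide a template for building larger, better models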

Highlights of the Tutorial

conv/relu/max pooling/lrn
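Visualization of inputs, losses, activations, and gradients during training
How to compute moving averages of the learned parameters and use them for prediction
Learning rate scheduling
Reading files using queues (see here)
How to train on multiple GPUs
Parameter sharing and updates across multiple GPUs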
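
As a note on the learning rate scheduling item: the sketch below shows exponential decay in staircase form in plain Python. I believe the tutorial code uses tf.train.exponential_decay for this, but the function name decayed_learning_rate and the default values here are hypothetical, chosen only for illustration.

```python
def decayed_learning_rate(global_step, initial_lr=0.1, decay_factor=0.1,
                          decay_steps=1000, staircase=True):
    """Return initial_lr * decay_factor ** (global_step / decay_steps).

    With staircase=True the exponent is an integer, so the rate drops in
    discrete steps every decay_steps instead of decaying continuously.
    All default values are illustrative, not taken from the tutorial.
    """
    if staircase:
        exponent = global_step // decay_steps
    else:
        exponent = global_step / decay_steps
    return initial_lr * decay_factor ** exponent


# The rate stays flat inside one interval and drops at the boundary.
rates = [decayed_learning_rate(step) for step in (0, 999, 1000, 2000)]
```

With staircase=False the same function decays smoothly between the boundaries.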

Model Architecture

A slightly modified version of the famous AlexNet architecture.

# learnable parameters = 1,068,298
# multiply-add = 19.5M
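
The parameter count can be checked by hand. The layer shapes below are not written in this memo. They are my assumption about cifar10.py at the time (5x5 convolutions with 64 filters, a 24x24 input reduced to 6x6 by two stride-2 poolings, then 384, 192 and 10 units). Under that assumption the sum reproduces the number above.

```python
def conv_params(kernel_h, kernel_w, in_channels, out_channels):
    """Weights plus one bias per output channel."""
    return kernel_h * kernel_w * in_channels * out_channels + out_channels


def dense_params(in_dim, out_dim):
    """Weights plus one bias per output unit."""
    return in_dim * out_dim + out_dim


# Assumed layer shapes (not stated in the memo itself).
layers = [
    ("conv1", conv_params(5, 5, 3, 64)),
    ("conv2", conv_params(5, 5, 64, 64)),
    ("local3", dense_params(6 * 6 * 64, 384)),
    ("local4", dense_params(384, 192)),
    ("softmax_linear", dense_params(192, 10)),
]
total = sum(count for _, count in layers)  # 1068298
```

Most of the parameters sit in the first fully connected layer (local3), not in the convolutions.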

Code Organization

The code is here. It is better to clone the whole repository:

git clone https://github.com/tensorflow/tensorflow.git

CIFAR-10 Model

cifar10.py contains roughly 765 ops.

To make the code more reusable, it is best to split it into functions like this:

Model inputs: inputs() and distorted_inputs()
Model prediction: inference()
Model training: loss() and train()

The other samples are organized the same way.

Model Inputs

Images are read with tf.FixedLengthRecordReader and put into an example queue.
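
A fixed-length reader fits because every record in the CIFAR-10 binary files has the same size: 1 label byte followed by 32x32x3 pixel bytes stored channel by channel (red plane, green plane, blue plane). The sketch below decodes one such record in plain Python. parse_record is a hypothetical helper for illustration, not part of the tutorial code.

```python
LABEL_BYTES = 1
HEIGHT, WIDTH, DEPTH = 32, 32, 3
IMAGE_BYTES = HEIGHT * WIDTH * DEPTH
RECORD_BYTES = LABEL_BYTES + IMAGE_BYTES  # 3073 bytes per example


def parse_record(record):
    """Split one raw record into (label, image[row][col] = (r, g, b))."""
    if len(record) != RECORD_BYTES:
        raise ValueError("expected %d bytes, got %d" % (RECORD_BYTES, len(record)))
    label = record[0]
    pixels = record[LABEL_BYTES:]
    plane = HEIGHT * WIDTH  # size of one color plane
    image = [
        [
            tuple(pixels[c * plane + row * WIDTH + col] for c in range(DEPTH))
            for col in range(WIDTH)
        ]
        for row in range(HEIGHT)
    ]
    return label, image


# Synthetic record: label 7, constant red=10, green=20, blue=30.
fake = bytes([7]) + bytes([10]) * 1024 + bytes([20]) * 1024 + bytes([30]) * 1024
label, image = parse_record(fake)
```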

Preprocessing:

cropping to 24x24 pixel
whitening per image
random flipping
random brightness
random contrast
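
A rough sketch of three of the steps above (crop, flip, per-image whitening) in plain Python, on a single-channel image stored as a list of rows to keep it short. The function names are hypothetical. The whitening follows the usual definition: subtract the mean and divide by the standard deviation, with a lower bound of 1/sqrt(N) so that a constant image does not divide by zero.

```python
import math
import random


def random_crop(image, size, rng):
    """Cut a size x size window at a random position."""
    top = rng.randint(0, len(image) - size)
    left = rng.randint(0, len(image[0]) - size)
    return [row[left:left + size] for row in image[top:top + size]]


def random_flip_left_right(image, rng):
    """Mirror the image horizontally with probability 0.5."""
    if rng.random() < 0.5:
        return [row[::-1] for row in image]
    return image


def per_image_whitening(image):
    """Scale one image to zero mean and (about) unit variance."""
    values = [v for row in image for v in row]
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    adjusted_std = max(math.sqrt(variance), 1.0 / math.sqrt(n))
    return [[(v - mean) / adjusted_std for v in row] for row in image]


rng = random.Random(0)
raw = [[(r * 32 + c) % 7 for c in range(32)] for r in range(32)]
processed = per_image_whitening(random_flip_left_right(random_crop(raw, 24, rng), rng))
```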

Model Prediction

The network architecture is
conv + relu + pool + lrn
\+ conv + relu + lrn + pool
\+ affine + relu
\+ affine + relu
\+ affine + linear_softmax

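The exercise says to switch the output to a softmax to match cuda-convnet, but I ignored it.
For prediction, the index of the maximum is the same whether or not you exponentiate and normalize, so the softmax is skipped. It is needed for the cross entropy, though, so it is applied inside "def loss".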
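
The argmax claim is easy to check in plain Python: softmax is monotonic, so it never changes which logit is largest, while the cross entropy does need the normalized probabilities. The helper names are hypothetical.

```python
import math


def softmax(logits):
    """Exponentiate and normalize; subtracting the max keeps exp() stable."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest element."""
    return max(range(len(values)), key=values.__getitem__)


def cross_entropy(logits, label):
    """Negative log probability of the true class."""
    return -math.log(softmax(logits)[label])


logits = [2.0, -1.0, 0.5]
same_prediction = argmax(logits) == argmax(softmax(logits))
loss_for_class_0 = cross_entropy(logits, 0)
```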

Model Training

Since this is a classification problem, cross-entropy is used. The objective function adds weight decay (L2 norm) to the error term.
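
A minimal sketch of that objective, assuming the L2 term is defined as sum(w^2) / 2 (the definition used by tf.nn.l2_loss) and scaled by a per-layer coefficient. The function names and the coefficient 0.004 in the example are only illustrative.

```python
def l2_loss(weights):
    """Half the sum of squares."""
    return sum(w * w for w in weights) / 2.0


def total_loss(cross_entropy_mean, weight_groups):
    """Cross entropy plus the weight decay terms.

    weight_groups is a list of (weights, decay_coefficient) pairs, so each
    layer can have its own coefficient (0.0 disables decay for that layer).
    """
    decay = sum(wd * l2_loss(weights) for weights, wd in weight_groups)
    return cross_entropy_mean + decay


# One layer with decay, one without (values are illustrative).
loss = total_loss(1.0, [([1.0, 2.0], 0.004), ([3.0], 0.0)])
```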

Launching and Training the Model

Notes before running

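According to this, the code does not run at tag=0.6.0. It says to go back to tag=0.5.0, but in my environment that produced the same error (something like "do not run before init").
For some reason it ran in a CPU-only environment, and cifar10_multi_gpu_train.py also ran.

There, "tf.device('/cpu:0')" is attached to "tf.Graph().as_default()", so I added the same

in cifar10_train.py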

  with tf.Graph().as_default(), tf.device('/cpu:0'):

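With this added, it ran. Judging by batch/sec, the computation does seem to happen on the GPU. At the time of writing I tried the master branch and it worked for now (I think other tags work too). I do not really understand why, but without this an op apparently gets run before init.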

Run

python cifar10_train.py

Evaluating a Model

Run

python cifar10_eval.py

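Prediction uses the moving averages of the trained parameters that were collected during training. I think of it as a kind of ensemble. This reportedly raises precision @ 1 by about 3%.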
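
The moving average itself is a one-line update, shadow = decay * shadow + (1 - decay) * value, which is the rule tf.train.ExponentialMovingAverage applies to each variable. A plain-Python sketch with a hypothetical helper name and an illustrative decay:

```python
def update_moving_average(shadow, value, decay):
    """One exponential moving average step."""
    return decay * shadow + (1.0 - decay) * value


# A parameter that jitters around 1.0 during training.
trajectory = [1.0 + (0.2 if step % 2 == 0 else -0.2) for step in range(200)]

shadow = trajectory[0]
for value in trajectory[1:]:
    shadow = update_moving_average(shadow, value, decay=0.99)

# The raw parameter ends 0.2 away from 1.0; the averaged copy is much closer.
raw_error = abs(trajectory[-1] - 1.0)
averaged_error = abs(shadow - 1.0)
```

At evaluation time the averaged copies are loaded in place of the raw parameters, which is why it behaves a little like averaging many nearby models.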

After working through the How-Tos, nothing in the code is particularly hard to follow.

Training a Model Using Multiple GPU Cards

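Computation is synchronous-parallel. This is the simplest form of distributed training: a model replica is placed on each GPU. The gradients computed on the GPUs are collected on the cpu device and averaged, and the model parameters are updated with the result.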

Placing Variables and Operations on Devices

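A replica of the model is placed on each GPU, and the gradients are computed there. Here each replica is called a tower.

Give the operations in each tower a unique name with tf.name_scope
Run the operations on a gpu with tf.device()

All variables are stored on the cpu, and at computation time

tf.get_variable_scope().reuse_variables()

shares them so that the gpus can access the parameters.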
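
To picture what reuse means, here is a toy variable store in plain Python. It is not TensorFlow code. VariableStore and its methods are hypothetical and only imitate the behavior as I understand it: without reuse, asking for an existing name is an error, and with reuse, the existing object is returned instead of a new one being created.

```python
class VariableStore:
    """Toy stand-in for a variable scope (illustration only)."""

    def __init__(self):
        self._variables = {}
        self.reuse = False

    def reuse_variables(self):
        """From now on, look variables up instead of creating them."""
        self.reuse = True

    def get_variable(self, name, initializer):
        if name in self._variables:
            if not self.reuse:
                raise ValueError("variable %r already exists" % name)
            return self._variables[name]
        if self.reuse:
            raise ValueError("variable %r does not exist" % name)
        self._variables[name] = initializer()
        return self._variables[name]


def build_tower(store):
    """Every tower asks for the same variable names."""
    return store.get_variable("conv1/weights", lambda: [0.0] * 4)


store = VariableStore()
tower0_weights = build_tower(store)  # first tower creates the variable
store.reuse_variables()
tower1_weights = build_tower(store)  # second tower gets the same object
```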

The major difference from cifar10_train.py is this part:

    # Create an optimizer that performs gradient descent.
    opt = tf.train.GradientDescentOptimizer(lr)

    # Calculate the gradients for each model tower.
    tower_grads = []
    for i in xrange(FLAGS.num_gpus):
      with tf.device('/gpu:%d' % i):
        with tf.name_scope('%s_%d' % (cifar10.TOWER_NAME, i)) as scope:
          # Calculate the loss for one tower of the CIFAR model. This function
          # constructs the entire CIFAR model but shares the variables across
          # all towers.
          loss = tower_loss(scope)

          # Reuse variables for the next tower.
          tf.get_variable_scope().reuse_variables()

          # Retain the summaries from the final tower.
          summaries = tf.get_collection(tf.GraphKeys.SUMMARIES, scope)

          # Calculate the gradients for the batch of data on this CIFAR tower.
          grads = opt.compute_gradients(loss)

          # Keep track of the gradients across all towers.
          tower_grads.append(grads)

    # We must calculate the mean of each gradient. Note that this is the
    # synchronization point across all towers.
    grads = average_gradients(tower_grads)

In order, this part does the following.

creates the optimizer
loops over the towers
- computes the loss
- reuses the variables
- computes the grads
computes average_gradients
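
The averaging step can be sketched in plain Python. This follows the data layout seen in the excerpt (one list of (gradient, variable) pairs per tower, in the same variable order), but it is only my illustration with gradients as lists of floats and variables as names, not the tutorial's average_gradients.

```python
def average_gradients(tower_grads):
    """Average each variable's gradient across towers.

    tower_grads: one entry per tower, each a list of (gradient, variable)
    pairs in the same variable order. Returns a single list of
    (mean_gradient, variable) pairs.
    """
    averaged = []
    # zip(*...) regroups the pairs so that each step sees one variable
    # with its gradient from every tower.
    for grads_and_vars in zip(*tower_grads):
        grads = [grad for grad, _ in grads_and_vars]
        mean = [sum(column) / len(grads) for column in zip(*grads)]
        variable = grads_and_vars[0][1]  # shared, so the first tower's copy
        averaged.append((mean, variable))
    return averaged


tower0 = [([1.0, 2.0], "weights"), ([0.0], "biases")]
tower1 = [([3.0, 4.0], "weights"), ([2.0], "biases")]
mean_grads = average_gradients([tower0, tower1])
```

Because the mean needs the gradient from every tower, this step cannot finish until all towers have finished their backward pass.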

After that,

    # Apply the gradients to adjust the shared variables.
    apply_gradient_op = opt.apply_gradients(grads, global_step=global_step)

applies the grads through the optimizer created before the tower loop.

The rest is almost the same.

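So the main differences from the single-GPU version seem to be the added tower loop, the use of a gpu device for computing each tower loss, and the name_scope that distinguishes the towers. As for where the sync happens, ~~tower_loss ends with a control_dependencies, so I think execution does not reach the later averaging op until the loss computation of every tower has finished~~. I suspect it is simply because execution leaves tf.device('/gpu:%d' % i) and enters tf.device('/cpu:0'). Note that the comment in the code marks average_gradients as the synchronization point, which fits the fact that the mean needs the gradients of every tower. cifar10.distorted_inputs inside tower_loss creates the filename_queue, so I think there is one queue per gpu device.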

Launching and Training the Model on Multiple GPU cards

python cifar10_multi_gpu_train.py --num_gpus=2

Run it like this.

References

https://www.tensorflow.org/versions/0.6.0/tutorials/deep_cnn/index.html#convolutional-neural-networks

KZKY memo

Personal notes.