Ubuntu忘備録 (Ubuntu Memorandum): May 2018

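Agenda

Building on nijiflow, I used machine learning to build a classifier that tells KanColle (艦これ) images from Azur Lane (アズールレーン) images, and published the trained graph and related files.
I also built and published a Twitter bot for it.
The dataset was built from the tag metadata of images on Pixiv.
I used 6,500 images each for KanColle and Azur Lane.
Training ran for about 48 hours in a GPU environment (Google Colaboratory), for about 10,000 epochs.
The recognition rate (on the test data) exceeds 80%.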

Results

INFO:tensorflow:Restoring parameters from drive/fleetclassify/checkpoints/fleetclassify_v1_1.0_224_8/model.ckpt-110969

eval/Precision[0.855130792]
eval/Recall[0.86558044]
eval/Accuracy[0.862]
INFO:tensorflow:Finished evaluation at 2018-05-07-12:22:21

TIPS:

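- Overusing Colaboratory counts as a terms-of-use violation and gets your GPU access suspended, so don't do it
- When Colaboratory reads large training data stored on Drive, the run sometimes stops after 90 minutes
- Resuming MobileNet/nijiflow training only requires dropping one line from the command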

TODO:

- Compared with nijinet's 98% (?), the accuracy still seems low
- I want a script that plots the training progress of the trained graph

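I built a classifier that sorts images into KanColle and Azur Lane.
The scripts and the trained graph are public.
( https://github.com/MichinariNukazawa/daisy_fleetclassify )
To make the classifier easy to try, I also run a Twitter bot (@DFleetclassify).
(It is only a simple trial bot with no dedicated server, so it is unstable and goes offline at irregular times.)

fleetclassify is based on nijiflow. nijiflow itself is a 2D/3D illustration discriminator built with transfer learning on MobileNet. See the linked page for details.
The details of nijiflow are also described in SIG2D Letter1, which is distributed by SIG2D and will be published at a later date.

Partly because that has not been published yet, I am leaving a rough record of the work I did.

The accuracy of fleetclassify is around 80%, which is low compared with nijiflow. (nijiflow achieves 98% even though it uses MobileNet.)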

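# Steps up to training

Since SIG2D Letter1 has not been published yet, I describe the steps up to and including preprocessing.
Mainly, this means setting up a GPU execution environment for TensorFlow and creating the training data.

# Environment setup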

```
sudo apt install python3-pip -y
pip3 install tensorflow-gpu
```

# Getting nijiflow

git clone --depth=1 -b niji https://github.com/fallthrough/models

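# Creating the training data

## Collecting training images

I collected the images used as training data from pixiv.
https://github.com/MichinariNukazawa/pixivpy_wrapper

In the pixivpy_wrapper repository,
`kancolle.sh` creates `${HOME}/pixiv_data/image__艦これ`, and
`azure.sh` creates `${HOME}/pixiv_data/image__アズールレーン`.
Each directory receives the image files and their metadata, `data.json`.
As a side note, `data.json` is not strict JSON, so reading it takes a small workaround.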
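
The post does not say in what way `data.json` deviates from strict JSON. As one possible workaround, the sketch below assumes the file is a run of valid JSON values written back to back (for example one object per image), which `json.load` would reject. The function name and that assumed layout are hypothetical. If the real file differs in another way, a different trick is needed.

```python
import json


def iter_concatenated_json(text):
    """Yield each JSON value found back to back in `text`.

    Assumption (not confirmed by the post): the file is a sequence of
    valid JSON values separated only by whitespace.
    """
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not accept leading whitespace, so skip it first.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        # raw_decode returns the parsed value and the index where it ended.
        value, pos = decoder.raw_decode(text, pos)
        yield value
```

With this, `list(iter_concatenated_json(open(path, encoding="utf-8").read()))` would return all records as a list.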

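## Preprocessing the data

nijiflow generates the TensorFlow training data binaries with
`models/research/slim/create_niji_dataset.py`.
That script reads an image-list text file in a simple format that seems to be specific to nijiflow (?).
I convert the data downloaded by pixivpy_wrapper into this nijiflow data file.
This lets me reuse nijiflow's preprocessing as is.

I wrote `util/nijiflow_source_from_path.py` to do the conversion.
It reads `data.json` from a pixivpy_wrapper download directory and creates `nijiflow.list`.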

```
python3 nijiflow_source_from_path.py \
0 ${HOME}/pixiv_data/image__艦これ アズールレーン,アズレン
python3 nijiflow_source_from_path.py \
1 ${HOME}/pixiv_data/image__アズールレーン 艦これ,艦隊これくしょん
```
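A `nijiflow.list` file is written into each of the specified directories.

The processing is roughly as follows.
- Exclude anything other than illustrations (ugoira, manga)
- Exclude images tagged as both Azur Lane and KanColle
(to keep crossover fan art, where characters from both KanColle and Azur Lane appear in a single picture, out of the training data)
- Exclude images other than jpg
(pixiv images include png. nijiflow used only jpg, so I simply excluded them)
- (TODO) Exclude images whose image.mode is not RGB
(about 20 files with mode "L" seem to be mixed in)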
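
The filtering rules above can be sketched roughly as follows. This is not the actual `nijiflow_source_from_path.py`: the record keys (`type`, `tags`, `file`) and the function name are hypothetical stand-ins, because the real `data.json` schema is not shown here. The RGB check is left out, as it is still a TODO in the list.

```python
def build_list_lines(records, class_id, exclude_tags):
    """Return nijiflow.list lines for the records that pass the filters.

    Each record is a dict with hypothetical keys:
      "type" - e.g. "illust", "ugoira", "manga"
      "tags" - list of tag strings
      "file" - image file name relative to the download directory
    """
    exclude = set(exclude_tags)
    lines = []
    for rec in records:
        if rec["type"] != "illust":
            continue  # drop ugoira and manga
        if exclude.intersection(rec["tags"]):
            continue  # drop crossover works tagged with the other title
        if not rec["file"].lower().endswith(".jpg"):
            continue  # keep jpg only
        lines.append("{} {}".format(rec["file"], class_id))
    return lines
```

For the KanColle directory this would be called with `class_id=0` and `exclude_tags=["アズールレーン", "アズレン"]`, mirroring the command line shown earlier.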

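The file format is as follows.
One entry per line: the relative path of the source file and the class ID. It is a tsv-style file format delimited by a half-width space (?).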
```tsv:nijiflow.list
68375709_p0.jpg 0
68375681_p0.jpg 0
68375623_p0.jpg 0
68375609_p0.jpg 0
68375495_p0.jpg 0
68375432_p0.jpg 0
```
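
A minimal reader for this format might look like the sketch below. It assumes the separator is a single space, as in the sample above, and splits on the last space so that a path containing spaces would still parse. The function name is made up for illustration.

```python
def parse_nijiflow_list(text):
    """Parse "<relative path> <class id>" lines into (path, int) tuples."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        # Split on the last space only; the class ID is the final field.
        path, class_id = line.rsplit(" ", 1)
        entries.append((path, int(class_id)))
    return entries
```

Counting entries per class ID from the parsed result is a quick way to confirm that both classes ended up with a similar number of images.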

## Converting to a training dataset

I use nijiflow's script to create the TensorFlow training dataset.
The dataset is a directory containing multiple files.
```
python3 models/research/slim/create_niji_dataset.py \
    --output_dir=drive/fleetclassify/fleetclassify_dataset \
    ${HOME}/pixiv_data/image__艦これ/nijiflow.list \
    ${HOME}/pixiv_data/image__アズールレーン/nijiflow.list
```

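# Running the training

tensorflow-gpu requires CUDA 9.0. I was working on Ubuntu 18.04, and its default CUDA 9.1 reportedly does not work.
(I broke my Ubuntu environment while swapping the CUDA installation, so) from here on I worked on Google Colaboratory.

Reportedly the problem can be avoided by leaving out only the driver when installing.
https://medium.com/@taylordenouden/installing-tensorflow-gpu-on-ubuntu-18-04-89a142325138

## Getting the checkpoint (pretrained base model data)

Following the nijinet procedure, download and extract the checkpoint.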
```
mkdir -p drive/fleetclassify/checkpoints/pretrained
pushd drive/fleetclassify/checkpoints/pretrained
wget http://download.tensorflow.org/models/mobilenet_v1_1.0_224_2017_06_14.tar.gz
tar xvzf mobilenet_v1_1.0_224_2017_06_14.tar.gz
popd
```

That is all for preparation. Training comes next.

# Errors I ran into during the work, and how I resolved them

# 9:

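Remove posts that contain multiple images. Manga aside, compilation posts also include characters from other works. Compressing the files beforehand had no effect, at least not on the dataset size.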

####

```
WARNING:tensorflow:From /home/nuka/.local/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/datasets/base.py:198: retry (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version.
Instructions for updating:
Use the retry module or similar alternatives.
WARNING:tensorflow:From train_image_classifier.py:400: create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.create_global_step
Traceback (most recent call last):
File "train_image_classifier.py", line 576, in <module>
    tf.app.run()
File "/home/nuka/.local/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
File "train_image_classifier.py", line 406, in main
    FLAGS.dataset_name, FLAGS.dataset_split_name, FLAGS.dataset_dir)
File "/home/nuka/flow/models/research/slim/datasets/dataset_factory.py", line 59, in get_dataset
    reader)
File "/home/nuka/flow/models/research/slim/datasets/niji.py", line 85, in get_split
    with open(os.path.join(dataset_dir, 'metadata.json')) as f:
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/tmp/datasets/niji/metadata.json'
```
It was simply a mistake in the dataset argument.
`--dataset_dir=${HOME}/flow/nijiflow_data/fleetclassify \`

####

```
InvalidArgumentError (see above for traceback): Cannot assign a device for operation 'gradients/MobilenetV1/Logits/Dropout_1b/dropout/div_grad/BroadcastGradientArgs': Operation was explicitly assigned to /device:GPU:0 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0 ]. Make sure the device specification refers to a valid device.
[[Node: gradients/MobilenetV1/Logits/Dropout_1b/dropout/div_grad/BroadcastGradientArgs = BroadcastGradientArgs[T=DT_INT32, _device="/device:GPU:0"](gradients/MobilenetV1/Logits/Dropout_1b/dropout/div_grad/Shape, gradients/MobilenetV1/Logits/Dropout_1b/dropout/div_grad/Shape_1)]]

```
This error occurs because GPU training had been configured in advance. Installing the GPU build fixed it.
`pip3 install --upgrade tensorflow-gpu`

####

```
File "/usr/lib/python3.6/imp.py", line 343, in load_dynamic
return _load(spec)
ImportError: libcublas.so.9.0: cannot open shared object file: No such file or directory

Failed to load the native TensorFlow runtime.
```
The cause appears to have been my local Ubuntu 18.04 environment.
Reportedly CUDA 9.1 does not work and CUDA 9.0 has to be installed.
https://developer.nvidia.com/cuda-90-download-archive?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1704&target_type=debnetwork
`sudo apt-get install cuda-9-0`
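
Before importing TensorFlow again, it can help to check whether the dynamic loader finds the library named in the ImportError. The sketch below uses only the standard `ctypes` module. The function name is made up, and the check only says whether the file can be opened, not whether the rest of the CUDA installation is consistent.

```python
import ctypes


def can_load_shared_library(name):
    """Return True if the dynamic loader can open the library `name`."""
    try:
        ctypes.CDLL(name)
    except OSError:
        # Raised when the file is missing or cannot be loaded.
        return False
    return True
```

Calling `can_load_shared_library("libcublas.so.9.0")` after installing CUDA 9.0 would then show whether the library is visible to the loader in the current environment.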

####

```
INFO:tensorflow:global step 380: loss = 0.6584 (1.684 sec/step)
INFO:tensorflow:global step 390: loss = 0.6717 (1.288 sec/step)
INFO:tensorflow:global step 400: loss = 0.6621 (2.100 sec/step)
INFO:tensorflow:global step 410: loss = 0.6558 (1.993 sec/step)
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.DataLossError'>, corrupted record at 78840124
    [[Node: parallel_read/ReaderReadV2_1 = ReaderReadV2[_device="/job:localhost/replica:0/task:0/device:CPU:0"](parallel_read/TFRecordReaderV2_1, parallel_read/filenames)]]
INFO:tensorflow:Finished training! Saving model to disk.
Traceback (most recent call last):
File "models/research/slim/train_image_classifier.py", line 576, in <module>
    tf.app.run()
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
File "models/research/slim/train_image_classifier.py", line 572, in main
    sync_optimizer=optimizer if FLAGS.sync_replicas else None)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 784, in train
    ignore_live_threads=ignore_live_threads)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/supervisor.py", line 828, in stop
    ignore_live_threads=ignore_live_threads)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
File "/usr/local/lib/python3.6/dist-packages/six.py", line 693, in reraise
    raise value
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/queue_runner_impl.py", line 252, in _run
    enqueue_callable()
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1249, in _single_operation_run
    self._call_tf_sessionrun(None, {}, [], target_list, None)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1420, in _call_tf_sessionrun
    status, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.DataLossError: corrupted record at 78840124
    [[Node: parallel_read/ReaderReadV2_1 = ReaderReadV2[_device="/job:localhost/replica:0/task:0/device:CPU:0"](parallel_read/TFRecordReaderV2_1, parallel_read/filenames)]]
```

```

INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.DataLossError'>, corrupted record at 99555679
    [[Node: parallel_read/ReaderReadV2_1 = ReaderReadV2[_device="/job:localhost/replica:0/task:0/device:CPU:0"](parallel_read/TFRecordReaderV2_1, parallel_read/filenames)]]
INFO:tensorflow:Caught OutOfRangeError. Stopping Training. FIFOQueue '_3_prefetch_queue/fifo_queue' is closed and has insufficient elements (requested 1, current size 0)
    [[Node: fifo_queue_Dequeue = QueueDequeueV2[component_types=[DT_FLOAT, DT_FLOAT], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/device:CPU:0"](prefetch_queue/fifo_queue)]]

Caused by op 'fifo_queue_Dequeue', defined at:
File "models/research/slim/train_image_classifier.py", line 576, in <module>
    tf.app.run()
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
File "models/research/slim/train_image_classifier.py", line 474, in main
    clones = model_deploy.create_clones(deploy_config, clone_fn, [batch_queue])
File "/content/models/research/slim/deployment/model_deploy.py", line 193, in create_clones
    outputs = model_fn(*args, **kwargs)
File "models/research/slim/train_image_classifier.py", line 456, in clone_fn
    images, labels = batch_queue.dequeue()
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/data_flow_ops.py", line 440, in dequeue
    self._queue_ref, self._dtypes, name=name)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gen_data_flow_ops.py", line 3730, in queue_dequeue_v2
    timeout_ms=timeout_ms, name=name)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 3290, in create_op
    op_def=op_def)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 1654, in __init__
    self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

OutOfRangeError (see above for traceback): FIFOQueue '_3_prefetch_queue/fifo_queue' is closed and has insufficient elements (requested 1, current size 0)
    [[Node: fifo_queue_Dequeue = QueueDequeueV2[component_types=[DT_FLOAT, DT_FLOAT], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/device:CPU:0"](prefetch_queue/fifo_queue)]]

INFO:tensorflow:Finished training! Saving model to disk.
Traceback (most recent call last):
File "models/research/slim/train_image_classifier.py", line 576, in <module>
    tf.app.run()
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
File "models/research/slim/train_image_classifier.py", line 572, in main
    sync_optimizer=optimizer if FLAGS.sync_replicas else None)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 784, in train
    ignore_live_threads=ignore_live_threads)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/supervisor.py", line 828, in stop
    ignore_live_threads=ignore_live_threads)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
File "/usr/local/lib/python3.6/dist-packages/six.py", line 693, in reraise
    raise value
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/queue_runner_impl.py", line 252, in _run
    enqueue_callable()
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1249, in _single_operation_run
    self._call_tf_sessionrun(None, {}, [], target_list, None)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1420, in _call_tf_sessionrun
    status, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.DataLossError: corrupted record at 99555679
    [[Node: parallel_read/ReaderReadV2_1 = ReaderReadV2[_device="/job:localhost/replica:0/task:0/device:CPU:0"](parallel_read/TFRecordReaderV2_1, parallel_read/filenames)]]

```
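I removed the files with broken EXIF from the test data. That seems to have been the cause.
(The test data generation step does not check for broken EXIF. Whether EXIF is used during training is unclear, though.)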
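
Separately from the EXIF cleanup described above, a `corrupted record at <offset>` message can be narrowed down by walking the TFRecord framing directly. The sketch below is not what I did at the time. It relies on the documented TFRecord layout (a little-endian uint64 length, a uint32 CRC of the length, the payload, a uint32 CRC of the payload) and only checks that the lengths add up to the end of the file. It does not verify the CRC values, so it catches truncated shards but not flipped bytes. The function name is hypothetical.

```python
import os
import struct


def count_tfrecord_frames(path):
    """Walk the TFRecord framing and return the number of complete records.

    Raises ValueError with the byte offset if a record is cut short.
    CRC fields are skipped, not verified.
    """
    size = os.path.getsize(path)
    count = 0
    with open(path, "rb") as f:
        offset = 0
        while offset < size:
            header = f.read(12)  # uint64 length + uint32 CRC of the length
            if len(header) < 12:
                raise ValueError("truncated header at offset %d" % offset)
            (length,) = struct.unpack("<Q", header[:8])
            # Next record starts after the payload and its uint32 CRC.
            end = offset + 12 + length + 4
            if end > size:
                raise ValueError("truncated record at offset %d" % offset)
            f.seek(end)
            offset = end
            count += 1
    return count
```

Running this over every shard of the dataset directory shows which file, if any, ends in the middle of a record.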

####

```
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.NotFoundError'>, drive/fleetflow/fleetclassify_dataset/niji_train_00000-of-00100.tfrecord; No such file or directory
    [[Node: parallel_read/ReaderReadV2 = ReaderReadV2[_device="/job:localhost/replica:0/task:0/device:CPU:0"](parallel_read/TFRecordReaderV2, parallel_read/filenames)]]
```
The cause was a corrupted file on Google Drive. Re-uploading it fixed the problem.
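
After uploading, a quick check that every shard actually arrived can save a failed run. The sketch below builds the expected names from the pattern seen in the log (`niji_train_00000-of-00100.tfrecord`). The function name and the default of 100 shards are assumptions taken from that log line, not from nijiflow's code.

```python
import os


def missing_shards(dataset_dir, split="train", num_shards=100):
    """Return the expected shard file names that are absent from dataset_dir."""
    expected = [
        "niji_%s_%05d-of-%05d.tfrecord" % (split, i, num_shards)
        for i in range(num_shards)
    ]
    return [
        name
        for name in expected
        if not os.path.isfile(os.path.join(dataset_dir, name))
    ]
```

An empty list means all shards are present. It says nothing about whether their contents are intact.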

####

```
INFO:tensorflow:Restoring parameters from drive/fleetflow/checkpoints/pretrained/mobilenet_v1_1.0_224.ckpt
2018-04-30 02:46:43.464872: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Out of range: Read less bytes than requested
Traceback (most recent call last):
File "models/research/slim/train_image_classifier.py", line 576, in <module>
    tf.app.run()
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
File "models/research/slim/train_image_classifier.py", line 572, in main
    sync_optimizer=optimizer if FLAGS.sync_replicas else None)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 747, in train
    master, start_standard_services=False, config=session_config) as sess:
File "/usr/lib/python3.6/contextlib.py", line 83, in __enter__
    raise RuntimeError("generator didn't yield") from None
RuntimeError: generator didn't yield
```
Cause unknown. I am re-running it for now, which seems to have resolved it (?).

#### When Google Colaboratory disconnects

On Firefox on Ubuntu no dialog appeared, so I did not notice.
```
INFO:tensorflow:global step 130: loss = 0.6706 (7.589 sec/step)
INFO:tensorflow:global step 140: loss = 0.5755 (2.013 sec/step)
INFO:tensorflow:Saving checkpoint to path drive/fleetflow/checkpoints/fleetclassify_v1_1.0_224_1/model.ckpt
INFO:tensorflow:global_step/sec: 0.233577
INFO:tensorflow:Recording summary at step 141.
2018-04-30 04:03:20.286157: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at save_restore_v2_ops.cc:134 : Unknown: drive/fleetflow/checkpoints/fleetclassify_v1_1.0_224_1/model.ckpt-140.data-00000-of-00001.tempstate1704126194757243596; Input/output error
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.UnknownError'>, drive/fleetflow/fleetclassify_dataset/niji_train_00017-of-00100.tfrecord; Input/output error
    [[Node: parallel_read/ReaderReadV2_3 = ReaderReadV2[_device="/job:localhost/replica:0/task:0/device:CPU:0"](parallel_read/TFRecordReaderV2_3, parallel_read/filenames)]]
INFO:tensorflow:global_step/sec: 0.000113881
INFO:tensorflow:Caught OutOfRangeError. Stopping Training. FIFOQueue '_2_prefetch_queue/fifo_queue' is closed and has insufficient elements (requested 1, current size 0)
    [[Node: fifo_queue_Dequeue = QueueDequeueV2[component_types=[DT_FLOAT, DT_FLOAT], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/device:CPU:0"](prefetch_queue/fifo_queue)]]

Caused by op 'fifo_queue_Dequeue', defined at:
File "models/research/slim/train_image_classifier.py", line 576, in <module>
    tf.app.run()
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
File "models/research/slim/train_image_classifier.py", line 474, in main
    clones = model_deploy.create_clones(deploy_config, clone_fn, [batch_queue])
File "/content/models/research/slim/deployment/model_deploy.py", line 193, in create_clones
    outputs = model_fn(*args, **kwargs)
File "models/research/slim/train_image_classifier.py", line 456, in clone_fn
    images, labels = batch_queue.dequeue()
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/data_flow_ops.py", line 440, in dequeue
    self._queue_ref, self._dtypes, name=name)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gen_data_flow_ops.py", line 3730, in queue_dequeue_v2
    timeout_ms=timeout_ms, name=name)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 3290, in create_op
    op_def=op_def)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 1654, in __init__
    self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

OutOfRangeError (see above for traceback): FIFOQueue '_2_prefetch_queue/fifo_queue' is closed and has insufficient elements (requested 1, current size 0)
    [[Node: fifo_queue_Dequeue = QueueDequeueV2[component_types=[DT_FLOAT, DT_FLOAT], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/device:CPU:0"](prefetch_queue/fifo_queue)]]

INFO:tensorflow:Finished training! Saving model to disk.
2018-04-30 06:24:16.213965: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at save_restore_v2_ops.cc:109 : Unknown: drive/fleetflow/checkpoints/fleetclassify_v1_1.0_224_1; Input/output error
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1327, in _do_call
    return fn(*args)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1312, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1420, in _call_tf_sessionrun
    status, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.UnknownError: drive/fleetflow/checkpoints/fleetclassify_v1_1.0_224_1; Input/output error
    [[Node: save/SaveV2 = SaveV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT64], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/SaveV2/tensor_names, save/SaveV2/shape_and_slices, MobilenetV1/Conv2d_0/BatchNorm/beta, MobilenetV1/Conv2d_0/BatchNorm/beta/RMSProp, MobilenetV1/Conv2d_0/BatchNorm/beta/RMSProp_1, MobilenetV1/Conv2d_0/BatchNorm/gamma, MobilenetV1/Conv2d_0/BatchNorm/gamma/RMSProp, MobilenetV1/Conv2d_0/BatchNorm/gamma/RMSProp_1, MobilenetV1/Conv2d_0/BatchNorm/moving_mean, MobilenetV1/Conv2d_0/BatchNorm/moving_variance, MobilenetV1/Conv2d_0/weights, MobilenetV1/Conv2d_0/weights/RMSProp, MobilenetV1/Conv2d_0/weights/RMSProp_1, MobilenetV1/Conv2d_10_depthwise/BatchNorm/beta, MobilenetV1/Conv2d_10_depthwise/BatchNorm/beta/RMSProp, MobilenetV1/Conv2d_10_depthwise/BatchNorm/beta/RMSProp_1,
〜
MobilenetV1/Conv2d_9_pointwise/BatchNorm/gamma/RMSProp, MobilenetV1/Conv2d_9_pointwise/BatchNorm/gamma/RMSProp_1, MobilenetV1/Conv2d_9_pointwise/BatchNorm/moving_mean, MobilenetV1/Conv2d_9_pointwise/BatchNorm/moving_variance, MobilenetV1/Conv2d_9_pointwise/weights, MobilenetV1/Conv2d_9_pointwise/weights/RMSProp, MobilenetV1/Conv2d_9_pointwise/weights/RMSProp_1, MobilenetV1/Logits/Conv2d_1c_1x1/biases, MobilenetV1/Logits/Conv2d_1c_1x1/biases/RMSProp, MobilenetV1/Logits/Conv2d_1c_1x1/biases/RMSProp_1, MobilenetV1/Logits/Conv2d_1c_1x1/weights, MobilenetV1/Logits/Conv2d_1c_1x1/weights/RMSProp, MobilenetV1/Logits/Conv2d_1c_1x1/weights/RMSProp_1, global_step)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/supervisor.py", line 990, in managed_session
    yield sess
File "/usr/local/lib/python3.6/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 780, in train
    sv.saver.save(sess, sv.save_path, global_step=sv.global_step)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 1676, in save
    {self.saver_def.filename_tensor_name: checkpoint_file})
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 905, in run
    run_metadata_ptr)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1140, in _run
    feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1321, in _do_run
    run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1340, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: drive/fleetflow/checkpoints/fleetclassify_v1_1.0_224_1; Input/output error
    [[Node: save/SaveV2 = SaveV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT64], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/SaveV2/tensor_names, save/SaveV2/shape_and_slices, MobilenetV1/Conv2d_0/BatchNorm/beta, MobilenetV1/Conv2d_0/BatchNorm/beta/RMSProp, MobilenetV1/Conv2d_0/BatchNorm/beta/RMSProp_1, MobilenetV1/Conv2d_0/BatchNorm/gamma, MobilenetV1/Conv2d_0/BatchNorm/gamma/RMSProp, MobilenetV1/Conv2d_0/BatchNorm/gamma/RMSProp_1, MobilenetV1/Conv2d_0/BatchNorm/moving_mean, MobilenetV1/Conv2d_0/BatchNorm/moving_variance, MobilenetV1/Conv2d_0/weights, MobilenetV1/Conv2d_0/weights/RMSProp, MobilenetV1/Conv2d_0/weights/RMSProp_1, MobilenetV1/Conv2d_10_depthwise/BatchNorm/beta, MobilenetV1/Conv2d_10_depthwise/BatchNorm/beta/RMSProp, MobilenetV1/Conv2d_10_depthwise/BatchNorm/beta/RMSProp_1, MobilenetV1/Conv2d_10_depthwise/BatchNorm/gamma,
〜
MobilenetV1/Conv2d_9_pointwise/BatchNorm/gamma/RMSProp_1, MobilenetV1/Conv2d_9_pointwise/BatchNorm/moving_mean, MobilenetV1/Conv2d_9_pointwise/BatchNorm/moving_variance, MobilenetV1/Conv2d_9_pointwise/weights, MobilenetV1/Conv2d_9_pointwise/weights/RMSProp, MobilenetV1/Conv2d_9_pointwise/weights/RMSProp_1, MobilenetV1/Logits/Conv2d_1c_1x1/biases, MobilenetV1/Logits/Conv2d_1c_1x1/biases/RMSProp, MobilenetV1/Logits/Conv2d_1c_1x1/biases/RMSProp_1, MobilenetV1/Logits/Conv2d_1c_1x1/weights, MobilenetV1/Logits/Conv2d_1c_1x1/weights/RMSProp, MobilenetV1/Logits/Conv2d_1c_1x1/weights/RMSProp_1, global_step)]]

Caused by op 'save/SaveV2', defined at:
File "models/research/slim/train_image_classifier.py", line 576, in <module>
    tf.app.run()
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
File "models/research/slim/train_image_classifier.py", line 563, in main
    saver=tf.train.Saver(max_to_keep=1000000),
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 1311, in __init__
    self.build()
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 1320, in build
    self._build(self._filename, build_save=True, build_restore=True)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 1357, in _build
    build_save=build_save, build_restore=build_restore)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 806, in _build_internal
    save_tensor = self._AddSaveOps(filename_tensor, saveables)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 326, in _AddSaveOps
    save = self.save_op(filename_tensor, saveables)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 242, in save_op
    tensors)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 1680, in save_v2
    shape_and_slices=shape_and_slices, tensors=tensors, name=name)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 3290, in create_op
    op_def=op_def)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 1654, in __init__
    self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

UnknownError (see above for traceback): drive/fleetflow/checkpoints/fleetclassify_v1_1.0_224_1; Input/output error
    [[Node: save/SaveV2 = SaveV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT64], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/SaveV2/tensor_names, save/SaveV2/shape_and_slices, MobilenetV1/Conv2d_0/BatchNorm/beta, MobilenetV1/Conv2d_0/BatchNorm/beta/RMSProp, MobilenetV1/Conv2d_0/BatchNorm/beta/RMSProp_1, MobilenetV1/Conv2d_0/BatchNorm/gamma, MobilenetV1/Conv2d_0/BatchNorm/gamma/RMSProp, MobilenetV1/Conv2d_0/BatchNorm/gamma/RMSProp_1, MobilenetV1/Conv2d_0/BatchNorm/moving_mean, MobilenetV1/Conv2d_0/BatchNorm/moving_variance, MobilenetV1/Conv2d_0/weights, MobilenetV1/Conv2d_0/weights/RMSProp, MobilenetV1/Conv2d_0/weights/RMSProp_1, MobilenetV1/Conv2d_10_depthwise/BatchNorm/beta, MobilenetV1/Conv2d_10_depthwise/BatchNorm/beta/RMSProp, MobilenetV1/Conv2d_10_depthwise/BatchNorm/beta/RMSProp_1, MobilenetV1/Conv2d_10_depthwise/BatchNorm/gamma, MobilenetV1/Conv2d_10_depthwise/BatchNorm/gamma/RMSProp, MobilenetV1/Conv2d_10_depthwise/BatchNorm/gamma/RMSProp_1,
〜
MobilenetV1/Conv2d_9_pointwise/BatchNorm/gamma/RMSProp, MobilenetV1/Conv2d_9_pointwise/BatchNorm/gamma/RMSProp_1, MobilenetV1/Conv2d_9_pointwise/BatchNorm/moving_mean, MobilenetV1/Conv2d_9_pointwise/BatchNorm/moving_variance, MobilenetV1/Conv2d_9_pointwise/weights, MobilenetV1/Conv2d_9_pointwise/weights/RMSProp, MobilenetV1/Conv2d_9_pointwise/weights/RMSProp_1, MobilenetV1/Logits/Conv2d_1c_1x1/biases, MobilenetV1/Logits/Conv2d_1c_1x1/biases/RMSProp, MobilenetV1/Logits/Conv2d_1c_1x1/biases/RMSProp_1, MobilenetV1/Logits/Conv2d_1c_1x1/weights, MobilenetV1/Logits/Conv2d_1c_1x1/weights/RMSProp, MobilenetV1/Logits/Conv2d_1c_1x1/weights/RMSProp_1, global_step)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "models/research/slim/train_image_classifier.py", line 576, in <module>
    tf.app.run()
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
File "models/research/slim/train_image_classifier.py", line 572, in main
    sync_optimizer=optimizer if FLAGS.sync_replicas else None)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 784, in train
    ignore_live_threads=ignore_live_threads)
File "/usr/lib/python3.6/contextlib.py", line 99, in __exit__
    self.gen.throw(type, value, traceback)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/supervisor.py", line 1000, in managed_session
    self.stop(close_summary_writer=close_summary_writer)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/supervisor.py", line 828, in stop
    ignore_live_threads=ignore_live_threads)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
File "/usr/local/lib/python3.6/dist-packages/six.py", line 693, in reraise

    raise value
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/queue_runner_impl.py", line 252, in _run
    enqueue_callable()
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1249, in _single_operation_run
    self._call_tf_sessionrun(None, {}, [], target_list, None)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1420, in _call_tf_sessionrun
    status, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.UnknownError: drive/fleetflow/fleetclassify_dataset/niji_train_00017-of-00100.tfrecord; Input/output error
    [[Node: parallel_read/ReaderReadV2_3 = ReaderReadV2[_device="/job:localhost/replica:0/task:0/device:CPU:0"](parallel_read/TFRecordReaderV2_3, parallel_read/filenames)]]

```
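
After a disconnect like this, it is useful to know which checkpoint was last written completely, since the log shows a write that stopped at a `.tempstate...` file. The sketch below lists a training directory and returns the highest step that has a `model.ckpt-<step>.index` file. Treating the presence of the `.index` file as "this checkpoint is complete" is an assumption, and the function name is made up. TensorFlow's own `tf.train.latest_checkpoint` serves a similar purpose by reading the `checkpoint` state file.

```python
import os
import re

# Matches e.g. "model.ckpt-140.index" and captures the step number.
_INDEX_RE = re.compile(r"^model\.ckpt-(\d+)\.index$")


def latest_checkpoint_step(train_dir):
    """Return the highest step with a model.ckpt-<step>.index file, or None."""
    steps = []
    for name in os.listdir(train_dir):
        match = _INDEX_RE.match(name)
        if match:
            steps.append(int(match.group(1)))
    return max(steps) if steps else None
```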

####

```
python3 models/research/slim/train_image_classifier.py --train_dir=drive/fleetclassify/checkpoints/fleetclassify_v1_1.0_224_$(date +'%Y%m%d_%H%M') --dataset_dir=drive/fleetclassify/fleetclassify_dataset --dataset_name=niji --dataset_split_name=train --model_name=mobilenet_v1 --preprocessing_name=mobilenet_v1 --save_interval_secs=600 --save_summaries_secs=600 --checkpoint_path=drive/fleetclassify/checkpoints/pretrained/mobilenet_v1_1.0_224.ckpt --checkpoint_exclude_scopes=MobilenetV1/Logits
python3: Relink `/lib/x86_64-linux-gnu/libudev.so.1' with `/lib/x86_64-linux-gnu/librt.so.1' for IFUNC symbol `clock_gettime'
Segmentation fault (コアダンプ)

```
The cause was that cuDNN was not installed.
https://medium.com/@taylordenouden/installing-tensorflow-gpu-on-ubuntu-18-04-89a142325138

####

```
Traceback (most recent call last):
File "models/research/slim/export_inference_graph.py", line 59, in <module>
import tensorflow as tf
ImportError: No module named tensorflow
```

I had typed `python` instead of `python3` (forgot the 3).

That's all.
