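Agenda
Building on nijiflow, I used machine learning to build a classifier for KanColle (艦これ) and Azur Lane (アズールレーン) images, and published the trained graph and related files. I also built and published a Twitter bot for it.
The dataset was built from the metadata of tagged images on Pixiv.
I used 6,500 images each for KanColle and Azur Lane.
Training ran for about 48 hours in a GPU environment (Google Colaboratory), for about 10,000 epochs.
The recognition rate (on the test data) exceeds 80%.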
Results
INFO:tensorflow:Restoring parameters from drive/fleetclassify/checkpoints/fleetclassify_v1_1.0_224_8/model.ckpt-110969
eval/Precision[0.855130792]
eval/Recall[0.86558044]
eval/Accuracy[0.862]
INFO:tensorflow:Finished evaluation at 2018-05-07-12:22:21
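Precision and recall can be folded into a single F1 score. The evaluation output above does not report F1, so the sketch below only does the arithmetic on the two values printed in the log.

```python
# Values copied from the evaluation log above.
precision = 0.855130792
recall = 0.86558044

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 = {f1:.3f}")
```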
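TIPS:
- Overusing Colaboratory violates the terms of use and gets your GPU access suspended, so don't.
- When a large training dataset is placed on Drive and used from Colaboratory, the session sometimes stops after 90 minutes.
- Training of MobileNet/nijiflow can be resumed by omitting one line from the command.
TODO:
- Compared with nijinet's 98% (?), the accuracy still seems low.
- I would like a script that plots the training progress of the trained graph.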
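I built a classifier that distinguishes KanColle images from Azur Lane images.
The scripts and the trained graph are public.
( https://github.com/MichinariNukazawa/daisy_fleetclassify )
To make the classifier easy to try, I also run a Twitter bot (@DFleetclassify).
(It is a simple trial bot without a dedicated server, so it is unstable and goes offline at irregular times.)
fleetclassify is based on nijiflow. nijiflow itself is a 2D/3D illustration discriminator built with transfer learning on MobileNet. See the linked page for details.
The details of nijiflow are also described in SIG2D Letter1, which was distributed by SIG2D and will be published online at a later date.
Since it has not been published yet, I am leaving rough notes here on the work I did.
The accuracy of fleetclassify is about 80%, which is lower than nijiflow's. (nijiflow achieves 98% even though it uses MobileNet.)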
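# Steps before training
Since SIG2D Letter1 has not been published yet, this section covers the steps up to preprocessing. They mainly consist of setting up a TensorFlow GPU environment and creating the training data.
# Environment setup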
```
sudo apt install python3-pip -y
pip3 install tensorflow-gpu
```
# Getting nijiflow
`git clone --depth=1 -b niji https://github.com/fallthrough/models`
# Creating the training data
## Collecting training images
I collected the images used as training data from pixiv.
https://github.com/MichinariNukazawa/pixivpy_wrapper
In the pixivpy_wrapper repository,
`kancolle.sh` creates `${HOME}/pixiv_data/image__艦これ`, and
`azure.sh` creates `${HOME}/pixiv_data/image__アズールレーン`.
Each directory contains the image files and their metadata, `data.json`.
As a side note, `data.json` is not strictly valid JSON, so reading it takes a small workaround.
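The post does not say in what way `data.json` deviates from JSON. Purely as an illustration, the sketch below assumes one common case: several JSON objects written back to back without an enclosing array. The helper name `iter_json_objects` and the sample fields are hypothetical, not taken from pixivpy_wrapper.

```python
import json


def iter_json_objects(text):
    """Yield each JSON value from text that holds concatenated JSON values."""
    decoder = json.JSONDecoder()
    pos = 0
    length = len(text)
    while pos < length:
        # raw_decode does not accept leading whitespace, so skip it manually.
        while pos < length and text[pos].isspace():
            pos += 1
        if pos >= length:
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


# Hypothetical sample: two objects, no surrounding "[...]", no commas between them.
sample = '{"id": 1, "tags": ["a"]}\n{"id": 2, "tags": []}\n'
entries = list(iter_json_objects(sample))
print(len(entries))
```

If the real file is broken in a different way (for example trailing commas), a different workaround is needed.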
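## Preprocessing the data
nijiflow generates the binary TensorFlow training data with `models/research/slim/create_niji_dataset.py`.
This script reads image-list text files in a simple format that appears to be specific to nijiflow (?).
I convert the data downloaded by pixivpy_wrapper into this nijiflow data file.
That way, the nijiflow preprocessing can be reused as-is.
I wrote `util/nijiflow_source_from_path.py` to do the conversion.
It reads `data.json` from a pixivpy_wrapper download directory and creates `nijiflow.list`.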
```
python3 nijiflow_source_from_path.py \
0 ${HOME}/pixiv_data/image__艦これ アズールレーン,アズレン
python3 nijiflow_source_from_path.py \
1 ${HOME}/pixiv_data/image__アズールレーン 艦これ,艦隊これくしょん
```
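A `nijiflow.list` file is written into each of the specified directories.
The script roughly does the following.
- Excludes everything other than illustrations (ugoira, manga)
- Excludes Azur Lane x KanColle images
(to keep crossover fan art, where KanColle and Azur Lane characters appear in a single picture, out of the training data)
- Excludes images other than jpg
(pixiv images include png files. nijiflow used only jpg, so I simply excluded them)
- (TODO) Excludes images whose image.mode is not RGB
(about 20 "L" files seem to be mixed in)
The file format is as follows.
Each line holds one source file's relative path and its class ID. It is a tsv-like file format delimited by a half-width space (?).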
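The exclusion rules listed above can be sketched as a single predicate. This is not the code of `nijiflow_source_from_path.py`; the field names `type`, `tags`, and `filename` are assumptions about what a metadata entry might look like, and the RGB check (still a TODO in the list) is left out.

```python
def keep_entry(entry, exclude_tags):
    """Return True if a (hypothetical) metadata entry should go into nijiflow.list."""
    # Keep plain illustrations only (drops ugoira and manga).
    if entry.get("type") != "illust":
        return False
    # Drop crossover works: any tag that belongs to the other class.
    if any(tag in exclude_tags for tag in entry.get("tags", [])):
        return False
    # Keep jpg files only, mirroring what nijiflow used.
    if not entry.get("filename", "").lower().endswith(".jpg"):
        return False
    return True


# Exclusion tags as passed on the command line for the 艦これ directory.
exclude = {"アズールレーン", "アズレン"}
entries = [
    {"type": "illust", "tags": ["艦これ"], "filename": "1_p0.jpg"},
    {"type": "manga", "tags": ["艦これ"], "filename": "2_p0.jpg"},
    {"type": "illust", "tags": ["艦これ", "アズレン"], "filename": "3_p0.jpg"},
    {"type": "illust", "tags": ["艦これ"], "filename": "4_p0.png"},
]
kept = [e["filename"] for e in entries if keep_entry(e, exclude)]
print(kept)
```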
```tsv:nijiflow.list
68375709_p0.jpg 0
68375681_p0.jpg 0
68375623_p0.jpg 0
68375609_p0.jpg 0
68375495_p0.jpg 0
68375432_p0.jpg 0
```
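A quick way to sanity-check a generated list is to parse it back and count entries per class. The sketch below assumes the whitespace-delimited layout shown above; the last sample line (class 1) is made up for illustration.

```python
from collections import Counter


def parse_list(text):
    """Parse lines of '<relative path> <class id>' into (path, class_id) tuples."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the last run of whitespace so the path may itself contain spaces.
        path, class_id = line.rsplit(None, 1)
        rows.append((path, int(class_id)))
    return rows


sample = "68375709_p0.jpg 0\n68375681_p0.jpg 0\n12345678_p0.jpg 1\n"
rows = parse_list(sample)
counts = Counter(class_id for _, class_id in rows)
print(counts)
```

With the real files, the two counts should each be close to the 6,500 images per class mentioned at the top.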
## Converting to training data
Use the nijiflow script to create the TensorFlow training dataset. The dataset is a directory containing multiple files.
```
python3 models/research/slim/create_niji_dataset.py \
--output_dir=drive/fleetclassify/fleetclassify_dataset \
${HOME}/pixiv_data/image__艦これ/nijiflow.list \
${HOME}/pixiv_data/image__アズールレーン/nijiflow.list
```
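# Running the training
tensorflow-gpu requires CUDA 9.0. I was working on Ubuntu 18.04, but its default CUDA 9.1 reportedly does not work. (I broke my Ubuntu environment while swapping the CUDA installation, so) from here on I worked in Google Colaboratory.
Reportedly, this can be avoided by installing everything except the driver.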
https://medium.com/@taylordenouden/installing-tensorflow-gpu-on-ubuntu-18-04-89a142325138
## Getting the checkpoint (pretrained base model data)
Following the nijinet procedure, download and extract the checkpoint.
```
mkdir -p drive/fleetclassify/checkpoints/pretrained
pushd drive/fleetclassify/checkpoints/pretrained
wget http://download.tensorflow.org/models/mobilenet_v1_1.0_224_2017_06_14.tar.gz
tar xvzf mobilenet_v1_1.0_224_2017_06_14.tar.gz
popd
```
That is all for the preparation. Training comes next.
# Errors encountered during the work and how I resolved them
# 9:
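Remove posts that contain multiple images. Manga aside, compilation posts also include characters from other works. Pre-compressing the files made no difference, at least to the dataset size.
####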
```
WARNING:tensorflow:From /home/nuka/.local/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/datasets/base.py:198: retry (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version.
Instructions for updating:
Use the retry module or similar alternatives.
WARNING:tensorflow:From train_image_classifier.py:400: create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.create_global_step
Traceback (most recent call last):
File "train_image_classifier.py", line 576, in <module>
tf.app.run()
File "/home/nuka/.local/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 126, in run
_sys.exit(main(argv))
File "train_image_classifier.py", line 406, in main
FLAGS.dataset_name, FLAGS.dataset_split_name, FLAGS.dataset_dir)
File "/home/nuka/flow/models/research/slim/datasets/dataset_factory.py", line 59, in get_dataset
reader)
File "/home/nuka/flow/models/research/slim/datasets/niji.py", line 85, in get_split
with open(os.path.join(dataset_dir, 'metadata.json')) as f:
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/tmp/datasets/niji/metadata.json'
```
It was simply a mistake in the dataset directory argument.
`--dataset_dir=${HOME}/flow/nijiflow_data/fleetclassify \`
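Since the traceback above shows `niji.py` opening `metadata.json` inside the dataset directory, a small pre-flight check can catch a wrong `--dataset_dir` before the training job starts. This is a minimal sketch; `check_dataset_dir` is a hypothetical helper, not part of nijiflow.

```python
import os
import tempfile


def check_dataset_dir(dataset_dir):
    """Return a list of problems found in dataset_dir (empty if none)."""
    problems = []
    if not os.path.isdir(dataset_dir):
        problems.append("not a directory: " + dataset_dir)
        return problems
    # niji.py opens metadata.json from dataset_dir, per the traceback above.
    if not os.path.isfile(os.path.join(dataset_dir, "metadata.json")):
        problems.append("metadata.json missing in " + dataset_dir)
    return problems


# Demo on a throwaway directory: first without, then with metadata.json.
with tempfile.TemporaryDirectory() as d:
    before = check_dataset_dir(d)
    open(os.path.join(d, "metadata.json"), "w").close()
    after = check_dataset_dir(d)
print(before, after)
```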
####
```
InvalidArgumentError (see above for traceback): Cannot assign a device for operation 'gradients/MobilenetV1/Logits/Dropout_1b/dropout/div_grad/BroadcastGradientArgs': Operation was explicitly assigned to /device:GPU:0 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0 ]. Make sure the device specification refers to a valid device.
[[Node: gradients/MobilenetV1/Logits/Dropout_1b/dropout/div_grad/BroadcastGradientArgs = BroadcastGradientArgs[T=DT_INT32, _device="/device:GPU:0"](gradients/MobilenetV1/Logits/Dropout_1b/dropout/div_grad/Shape, gradients/MobilenetV1/Logits/Dropout_1b/dropout/div_grad/Shape_1)]]
```
This error occurred because GPU training was already configured while only the CPU device was available. Installing the GPU build fixed it.
`pip3 install --upgrade tensorflow-gpu`
####
```
File "/usr/lib/python3.6/imp.py", line 343, in load_dynamic
return _load(spec)
ImportError: libcublas.so.9.0: cannot open shared object file: No such file or directory
Failed to load the native TensorFlow runtime.
```
The cause appears to have been my local Ubuntu 18.04 environment.
Reportedly CUDA 9.1 does not work, and CUDA 9.0 must be installed.
https://developer.nvidia.com/cuda-90-download-archive?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1704&target_type=debnetwork
`sudo apt-get install cuda-9-0 `
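Whether the dynamic loader can find the library named in the ImportError can be checked without importing TensorFlow. A minimal sketch using `ctypes` (the helper name `can_load` is mine):

```python
import ctypes


def can_load(library_name):
    """Return True if the dynamic loader can open library_name."""
    try:
        ctypes.CDLL(library_name)
        return True
    except OSError:
        return False


# The library name is taken from the ImportError above.
print("libcublas.so.9.0 loadable:", can_load("libcublas.so.9.0"))
```

If this prints `False` after installing CUDA 9.0, the library directory is probably missing from the loader's search path.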
####
```
INFO:tensorflow:global step 380: loss = 0.6584 (1.684 sec/step)
INFO:tensorflow:global step 390: loss = 0.6717 (1.288 sec/step)
INFO:tensorflow:global step 400: loss = 0.6621 (2.100 sec/step)
INFO:tensorflow:global step 410: loss = 0.6558 (1.993 sec/step)
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.DataLossError'>, corrupted record at 78840124
[[Node: parallel_read/ReaderReadV2_1 = ReaderReadV2[_device="/job:localhost/replica:0/task:0/device:CPU:0"](parallel_read/TFRecordReaderV2_1, parallel_read/filenames)]]
INFO:tensorflow:Finished training! Saving model to disk.
Traceback (most recent call last):
File "models/research/slim/train_image_classifier.py", line 576, in <module>
tf.app.run()
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 126, in run
_sys.exit(main(argv))
File "models/research/slim/train_image_classifier.py", line 572, in main
sync_optimizer=optimizer if FLAGS.sync_replicas else None)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 784, in train
ignore_live_threads=ignore_live_threads)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/supervisor.py", line 828, in stop
ignore_live_threads=ignore_live_threads)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/coordinator.py", line 389, in join
six.reraise(*self._exc_info_to_raise)
File "/usr/local/lib/python3.6/dist-packages/six.py", line 693, in reraise
raise value
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/queue_runner_impl.py", line 252, in _run
enqueue_callable()
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1249, in _single_operation_run
self._call_tf_sessionrun(None, {}, [], target_list, None)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1420, in _call_tf_sessionrun
status, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.DataLossError: corrupted record at 78840124
[[Node: parallel_read/ReaderReadV2_1 = ReaderReadV2[_device="/job:localhost/replica:0/task:0/device:CPU:0"](parallel_read/TFRecordReaderV2_1, parallel_read/filenames)]]
```
```
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.DataLossError'>, corrupted record at 99555679
[[Node: parallel_read/ReaderReadV2_1 = ReaderReadV2[_device="/job:localhost/replica:0/task:0/device:CPU:0"](parallel_read/TFRecordReaderV2_1, parallel_read/filenames)]]
INFO:tensorflow:Caught OutOfRangeError. Stopping Training. FIFOQueue '_3_prefetch_queue/fifo_queue' is closed and has insufficient elements (requested 1, current size 0)
[[Node: fifo_queue_Dequeue = QueueDequeueV2[component_types=[DT_FLOAT, DT_FLOAT], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/device:CPU:0"](prefetch_queue/fifo_queue)]]
Caused by op 'fifo_queue_Dequeue', defined at:
File "models/research/slim/train_image_classifier.py", line 576, in <module>
tf.app.run()
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 126, in run
_sys.exit(main(argv))
File "models/research/slim/train_image_classifier.py", line 474, in main
clones = model_deploy.create_clones(deploy_config, clone_fn, [batch_queue])
File "/content/models/research/slim/deployment/model_deploy.py", line 193, in create_clones
outputs = model_fn(*args, **kwargs)
File "models/research/slim/train_image_classifier.py", line 456, in clone_fn
images, labels = batch_queue.dequeue()
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/data_flow_ops.py", line 440, in dequeue
self._queue_ref, self._dtypes, name=name)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gen_data_flow_ops.py", line 3730, in queue_dequeue_v2
timeout_ms=timeout_ms, name=name)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 3290, in create_op
op_def=op_def)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 1654, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
OutOfRangeError (see above for traceback): FIFOQueue '_3_prefetch_queue/fifo_queue' is closed and has insufficient elements (requested 1, current size 0)
[[Node: fifo_queue_Dequeue = QueueDequeueV2[component_types=[DT_FLOAT, DT_FLOAT], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/device:CPU:0"](prefetch_queue/fifo_queue)]]
INFO:tensorflow:Finished training! Saving model to disk.
Traceback (most recent call last):
File "models/research/slim/train_image_classifier.py", line 576, in <module>
tf.app.run()
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 126, in run
_sys.exit(main(argv))
File "models/research/slim/train_image_classifier.py", line 572, in main
sync_optimizer=optimizer if FLAGS.sync_replicas else None)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 784, in train
ignore_live_threads=ignore_live_threads)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/supervisor.py", line 828, in stop
ignore_live_threads=ignore_live_threads)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/coordinator.py", line 389, in join
six.reraise(*self._exc_info_to_raise)
File "/usr/local/lib/python3.6/dist-packages/six.py", line 693, in reraise
raise value
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/queue_runner_impl.py", line 252, in _run
enqueue_callable()
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1249, in _single_operation_run
self._call_tf_sessionrun(None, {}, [], target_list, None)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1420, in _call_tf_sessionrun
status, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.DataLossError: corrupted record at 99555679
[[Node: parallel_read/ReaderReadV2_1 = ReaderReadV2[_device="/job:localhost/replica:0/task:0/device:CPU:0"](parallel_read/TFRecordReaderV2_1, parallel_read/filenames)]]
```
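I removed the files with broken EXIF from the test data. That appears to have been the cause.
(The test data generation step does not check for broken EXIF or the like. Whether EXIF is used during training is unclear, though.)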
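As a complementary check, independent of the EXIF explanation, the `.tfrecord` shards themselves can be scanned for truncation. The sketch below assumes the standard TFRecord framing (an 8-byte little-endian length, a 4-byte CRC of the length, the payload, and a 4-byte CRC of the payload). It only walks the lengths and does not verify the CRCs, so it detects cut-off files but not flipped bytes.

```python
import os
import struct
import tempfile


def count_tfrecord_records(path):
    """Walk the TFRecord framing and return the number of complete records.

    Raises ValueError at the first truncated record. CRCs are NOT verified.
    """
    size = os.path.getsize(path)
    count = 0
    with open(path, "rb") as f:
        while True:
            header = f.read(12)  # uint64 length + uint32 CRC of the length
            if not header:
                return count
            if len(header) < 12:
                raise ValueError("truncated header in record %d" % count)
            (length,) = struct.unpack("<Q", header[:8])
            # Payload plus its trailing uint32 CRC must fit in the file.
            if f.tell() + length + 4 > size:
                raise ValueError("truncated payload in record %d" % count)
            f.seek(length + 4, os.SEEK_CUR)
            count += 1


def _fake_record(payload):
    # CRC fields are zero-filled placeholders; this sketch never checks them.
    return struct.pack("<Q", len(payload)) + b"\0" * 4 + payload + b"\0" * 4


with tempfile.TemporaryDirectory() as d:
    good = os.path.join(d, "good.tfrecord")
    with open(good, "wb") as f:
        f.write(_fake_record(b"abc") + _fake_record(b"defgh"))
    n_good = count_tfrecord_records(good)

    bad = os.path.join(d, "bad.tfrecord")
    with open(bad, "wb") as f:
        f.write(_fake_record(b"abc")[:-2])  # cut off mid-record
    try:
        count_tfrecord_records(bad)
        bad_detected = False
    except ValueError:
        bad_detected = True
print(n_good, bad_detected)
```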
####
```
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.NotFoundError'>, drive/fleetflow/fleetclassify_dataset/niji_train_00000-of-00100.tfrecord; No such file or directory
[[Node: parallel_read/ReaderReadV2 = ReaderReadV2[_device="/job:localhost/replica:0/task:0/device:CPU:0"](parallel_read/TFRecordReaderV2, parallel_read/filenames)]]
```
The cause was a corrupted file on Google Drive. Re-uploading it fixed the problem.
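One way to notice a damaged upload early is to compare a checksum of the local file with a checksum of the copy read back from the Drive mount. A minimal sketch, with temporary files standing in for the real shards:

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Stand-ins for a local shard and its uploaded copy.
with tempfile.TemporaryDirectory() as d:
    local = os.path.join(d, "local.tfrecord")
    uploaded = os.path.join(d, "uploaded.tfrecord")
    for p in (local, uploaded):
        with open(p, "wb") as f:
            f.write(b"example shard bytes")
    same = sha256_of(local) == sha256_of(uploaded)
print(same)
```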
####
```
INFO:tensorflow:Restoring parameters from drive/fleetflow/checkpoints/pretrained/mobilenet_v1_1.0_224.ckpt
2018-04-30 02:46:43.464872: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Out of range: Read less bytes than requested
Traceback (most recent call last):
File "models/research/slim/train_image_classifier.py", line 576, in <module>
tf.app.run()
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 126, in run
_sys.exit(main(argv))
File "models/research/slim/train_image_classifier.py", line 572, in main
sync_optimizer=optimizer if FLAGS.sync_replicas else None)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 747, in train
master, start_standard_services=False, config=session_config) as sess:
File "/usr/lib/python3.6/contextlib.py", line 83, in __enter__
raise RuntimeError("generator didn't yield") from None
RuntimeError: generator didn't yield
```
Cause unknown. Re-running it seems to have resolved it for now (?).
#### When the Google Colaboratory session is cut off
In Firefox on Ubuntu no dialog appeared, so I did not notice.
```
INFO:tensorflow:global step 130: loss = 0.6706 (7.589 sec/step)
INFO:tensorflow:global step 140: loss = 0.5755 (2.013 sec/step)
INFO:tensorflow:Saving checkpoint to path drive/fleetflow/checkpoints/fleetclassify_v1_1.0_224_1/model.ckpt
INFO:tensorflow:global_step/sec: 0.233577
INFO:tensorflow:Recording summary at step 141.
2018-04-30 04:03:20.286157: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at save_restore_v2_ops.cc:134 : Unknown: drive/fleetflow/checkpoints/fleetclassify_v1_1.0_224_1/model.ckpt-140.data-00000-of-00001.tempstate1704126194757243596; Input/output error
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.UnknownError'>, drive/fleetflow/fleetclassify_dataset/niji_train_00017-of-00100.tfrecord; Input/output error
[[Node: parallel_read/ReaderReadV2_3 = ReaderReadV2[_device="/job:localhost/replica:0/task:0/device:CPU:0"](parallel_read/TFRecordReaderV2_3, parallel_read/filenames)]]
INFO:tensorflow:global_step/sec: 0.000113881
INFO:tensorflow:Caught OutOfRangeError. Stopping Training. FIFOQueue '_2_prefetch_queue/fifo_queue' is closed and has insufficient elements (requested 1, current size 0)
[[Node: fifo_queue_Dequeue = QueueDequeueV2[component_types=[DT_FLOAT, DT_FLOAT], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/device:CPU:0"](prefetch_queue/fifo_queue)]]
Caused by op 'fifo_queue_Dequeue', defined at:
File "models/research/slim/train_image_classifier.py", line 576, in <module>
tf.app.run()
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 126, in run
_sys.exit(main(argv))
File "models/research/slim/train_image_classifier.py", line 474, in main
clones = model_deploy.create_clones(deploy_config, clone_fn, [batch_queue])
File "/content/models/research/slim/deployment/model_deploy.py", line 193, in create_clones
outputs = model_fn(*args, **kwargs)
File "models/research/slim/train_image_classifier.py", line 456, in clone_fn
images, labels = batch_queue.dequeue()
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/data_flow_ops.py", line 440, in dequeue
self._queue_ref, self._dtypes, name=name)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gen_data_flow_ops.py", line 3730, in queue_dequeue_v2
timeout_ms=timeout_ms, name=name)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 3290, in create_op
op_def=op_def)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 1654, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
OutOfRangeError (see above for traceback): FIFOQueue '_2_prefetch_queue/fifo_queue' is closed and has insufficient elements (requested 1, current size 0)
[[Node: fifo_queue_Dequeue = QueueDequeueV2[component_types=[DT_FLOAT, DT_FLOAT], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/device:CPU:0"](prefetch_queue/fifo_queue)]]
INFO:tensorflow:Finished training! Saving model to disk.
2018-04-30 06:24:16.213965: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at save_restore_v2_ops.cc:109 : Unknown: drive/fleetflow/checkpoints/fleetclassify_v1_1.0_224_1; Input/output error
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1327, in _do_call
return fn(*args)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1312, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1420, in _call_tf_sessionrun
status, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.UnknownError: drive/fleetflow/checkpoints/fleetclassify_v1_1.0_224_1; Input/output error
[[Node: save/SaveV2 = SaveV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT64], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/SaveV2/tensor_names, save/SaveV2/shape_and_slices, MobilenetV1/Conv2d_0/BatchNorm/beta, MobilenetV1/Conv2d_0/BatchNorm/beta/RMSProp, MobilenetV1/Conv2d_0/BatchNorm/beta/RMSProp_1, MobilenetV1/Conv2d_0/BatchNorm/gamma, MobilenetV1/Conv2d_0/BatchNorm/gamma/RMSProp, MobilenetV1/Conv2d_0/BatchNorm/gamma/RMSProp_1, MobilenetV1/Conv2d_0/BatchNorm/moving_mean, MobilenetV1/Conv2d_0/BatchNorm/moving_variance, MobilenetV1/Conv2d_0/weights, MobilenetV1/Conv2d_0/weights/RMSProp, MobilenetV1/Conv2d_0/weights/RMSProp_1, MobilenetV1/Conv2d_10_depthwise/BatchNorm/beta, MobilenetV1/Conv2d_10_depthwise/BatchNorm/beta/RMSProp, MobilenetV1/Conv2d_10_depthwise/BatchNorm/beta/RMSProp_1,
〜
MobilenetV1/Conv2d_9_pointwise/BatchNorm/gamma/RMSProp, MobilenetV1/Conv2d_9_pointwise/BatchNorm/gamma/RMSProp_1, MobilenetV1/Conv2d_9_pointwise/BatchNorm/moving_mean, MobilenetV1/Conv2d_9_pointwise/BatchNorm/moving_variance, MobilenetV1/Conv2d_9_pointwise/weights, MobilenetV1/Conv2d_9_pointwise/weights/RMSProp, MobilenetV1/Conv2d_9_pointwise/weights/RMSProp_1, MobilenetV1/Logits/Conv2d_1c_1x1/biases, MobilenetV1/Logits/Conv2d_1c_1x1/biases/RMSProp, MobilenetV1/Logits/Conv2d_1c_1x1/biases/RMSProp_1, MobilenetV1/Logits/Conv2d_1c_1x1/weights, MobilenetV1/Logits/Conv2d_1c_1x1/weights/RMSProp, MobilenetV1/Logits/Conv2d_1c_1x1/weights/RMSProp_1, global_step)]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/supervisor.py", line 990, in managed_session
yield sess
File "/usr/local/lib/python3.6/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 780, in train
sv.saver.save(sess, sv.save_path, global_step=sv.global_step)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 1676, in save
{self.saver_def.filename_tensor_name: checkpoint_file})
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 905, in run
run_metadata_ptr)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1140, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1321, in _do_run
run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1340, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: drive/fleetflow/checkpoints/fleetclassify_v1_1.0_224_1; Input/output error
[[Node: save/SaveV2 = SaveV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT64], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/SaveV2/tensor_names, save/SaveV2/shape_and_slices, MobilenetV1/Conv2d_0/BatchNorm/beta, MobilenetV1/Conv2d_0/BatchNorm/beta/RMSProp, MobilenetV1/Conv2d_0/BatchNorm/beta/RMSProp_1, MobilenetV1/Conv2d_0/BatchNorm/gamma, MobilenetV1/Conv2d_0/BatchNorm/gamma/RMSProp, MobilenetV1/Conv2d_0/BatchNorm/gamma/RMSProp_1, MobilenetV1/Conv2d_0/BatchNorm/moving_mean, MobilenetV1/Conv2d_0/BatchNorm/moving_variance, MobilenetV1/Conv2d_0/weights, MobilenetV1/Conv2d_0/weights/RMSProp, MobilenetV1/Conv2d_0/weights/RMSProp_1, MobilenetV1/Conv2d_10_depthwise/BatchNorm/beta, MobilenetV1/Conv2d_10_depthwise/BatchNorm/beta/RMSProp, MobilenetV1/Conv2d_10_depthwise/BatchNorm/beta/RMSProp_1, MobilenetV1/Conv2d_10_depthwise/BatchNorm/gamma,
〜
MobilenetV1/Conv2d_9_pointwise/BatchNorm/gamma/RMSProp_1, MobilenetV1/Conv2d_9_pointwise/BatchNorm/moving_mean, MobilenetV1/Conv2d_9_pointwise/BatchNorm/moving_variance, MobilenetV1/Conv2d_9_pointwise/weights, MobilenetV1/Conv2d_9_pointwise/weights/RMSProp, MobilenetV1/Conv2d_9_pointwise/weights/RMSProp_1, MobilenetV1/Logits/Conv2d_1c_1x1/biases, MobilenetV1/Logits/Conv2d_1c_1x1/biases/RMSProp, MobilenetV1/Logits/Conv2d_1c_1x1/biases/RMSProp_1, MobilenetV1/Logits/Conv2d_1c_1x1/weights, MobilenetV1/Logits/Conv2d_1c_1x1/weights/RMSProp, MobilenetV1/Logits/Conv2d_1c_1x1/weights/RMSProp_1, global_step)]]
Caused by op 'save/SaveV2', defined at:
File "models/research/slim/train_image_classifier.py", line 576, in <module>
tf.app.run()
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 126, in run
_sys.exit(main(argv))
File "models/research/slim/train_image_classifier.py", line 563, in main
saver=tf.train.Saver(max_to_keep=1000000),
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 1311, in __init__
self.build()
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 1320, in build
self._build(self._filename, build_save=True, build_restore=True)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 1357, in _build
build_save=build_save, build_restore=build_restore)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 806, in _build_internal
save_tensor = self._AddSaveOps(filename_tensor, saveables)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 326, in _AddSaveOps
save = self.save_op(filename_tensor, saveables)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 242, in save_op
tensors)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 1680, in save_v2
shape_and_slices=shape_and_slices, tensors=tensors, name=name)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 3290, in create_op
op_def=op_def)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 1654, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
UnknownError (see above for traceback): drive/fleetflow/checkpoints/fleetclassify_v1_1.0_224_1; Input/output error
[[Node: save/SaveV2 = SaveV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT64], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/SaveV2/tensor_names, save/SaveV2/shape_and_slices, MobilenetV1/Conv2d_0/BatchNorm/beta, MobilenetV1/Conv2d_0/BatchNorm/beta/RMSProp, MobilenetV1/Conv2d_0/BatchNorm/beta/RMSProp_1, MobilenetV1/Conv2d_0/BatchNorm/gamma, MobilenetV1/Conv2d_0/BatchNorm/gamma/RMSProp, MobilenetV1/Conv2d_0/BatchNorm/gamma/RMSProp_1, MobilenetV1/Conv2d_0/BatchNorm/moving_mean, MobilenetV1/Conv2d_0/BatchNorm/moving_variance, MobilenetV1/Conv2d_0/weights, MobilenetV1/Conv2d_0/weights/RMSProp, MobilenetV1/Conv2d_0/weights/RMSProp_1, MobilenetV1/Conv2d_10_depthwise/BatchNorm/beta, MobilenetV1/Conv2d_10_depthwise/BatchNorm/beta/RMSProp, MobilenetV1/Conv2d_10_depthwise/BatchNorm/beta/RMSProp_1, MobilenetV1/Conv2d_10_depthwise/BatchNorm/gamma, MobilenetV1/Conv2d_10_depthwise/BatchNorm/gamma/RMSProp, MobilenetV1/Conv2d_10_depthwise/BatchNorm/gamma/RMSProp_1,
〜
MobilenetV1/Conv2d_9_pointwise/BatchNorm/gamma/RMSProp, MobilenetV1/Conv2d_9_pointwise/BatchNorm/gamma/RMSProp_1, MobilenetV1/Conv2d_9_pointwise/BatchNorm/moving_mean, MobilenetV1/Conv2d_9_pointwise/BatchNorm/moving_variance, MobilenetV1/Conv2d_9_pointwise/weights, MobilenetV1/Conv2d_9_pointwise/weights/RMSProp, MobilenetV1/Conv2d_9_pointwise/weights/RMSProp_1, MobilenetV1/Logits/Conv2d_1c_1x1/biases, MobilenetV1/Logits/Conv2d_1c_1x1/biases/RMSProp, MobilenetV1/Logits/Conv2d_1c_1x1/biases/RMSProp_1, MobilenetV1/Logits/Conv2d_1c_1x1/weights, MobilenetV1/Logits/Conv2d_1c_1x1/weights/RMSProp, MobilenetV1/Logits/Conv2d_1c_1x1/weights/RMSProp_1, global_step)]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "models/research/slim/train_image_classifier.py", line 576, in <module>
tf.app.run()
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 126, in run
_sys.exit(main(argv))
File "models/research/slim/train_image_classifier.py", line 572, in main
sync_optimizer=optimizer if FLAGS.sync_replicas else None)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 784, in train
ignore_live_threads=ignore_live_threads)
File "/usr/lib/python3.6/contextlib.py", line 99, in __exit__
self.gen.throw(type, value, traceback)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/supervisor.py", line 1000, in managed_session
self.stop(close_summary_writer=close_summary_writer)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/supervisor.py", line 828, in stop
ignore_live_threads=ignore_live_threads)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/coordinator.py", line 389, in join
six.reraise(*self._exc_info_to_raise)
File "/usr/local/lib/python3.6/dist-packages/six.py", line 693, in reraise
raise value
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/queue_runner_impl.py", line 252, in _run
enqueue_callable()
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1249, in _single_operation_run
self._call_tf_sessionrun(None, {}, [], target_list, None)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1420, in _call_tf_sessionrun
status, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.UnknownError: drive/fleetflow/fleetclassify_dataset/niji_train_00017-of-00100.tfrecord; Input/output error
[[Node: parallel_read/ReaderReadV2_3 = ReaderReadV2[_device="/job:localhost/replica:0/task:0/device:CPU:0"](parallel_read/TFRecordReaderV2_3, parallel_read/filenames)]]
```
####
```
python3 models/research/slim/train_image_classifier.py --train_dir=drive/fleetclassify/checkpoints/fleetclassify_v1_1.0_224_$(date +'%Y%m%d_%H%M') --dataset_dir=drive/fleetclassify/fleetclassify_dataset --dataset_name=niji --dataset_split_name=train --model_name=mobilenet_v1 --preprocessing_name=mobilenet_v1 --save_interval_secs=600 --save_summaries_secs=600 --checkpoint_path=drive/fleetclassify/checkpoints/pretrained/mobilenet_v1_1.0_224.ckpt --checkpoint_exclude_scopes=MobilenetV1/Logits
python3: Relink `/lib/x86_64-linux-gnu/libudev.so.1' with `/lib/x86_64-linux-gnu/librt.so.1' for IFUNC symbol `clock_gettime'
Segmentation fault (コアダンプ)
```
The cause was that cuDNN was not installed.
https://medium.com/@taylordenouden/installing-tensorflow-gpu-on-ubuntu-18-04-89a142325138
####
```
Traceback (most recent call last):
File "models/research/slim/export_inference_graph.py", line 59, in <module>
import tensorflow as tf
ImportError: No module named tensorflow
```
I had typed `python` instead of `python3` (forgot the 3).
That is all.
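To see which interpreter a command name actually starts and whether it can import a module, a tiny script like the following can be run with both `python` and `python3`. It is a sketch written to work on either major version; `has_module` is a hypothetical helper.

```python
import sys


def has_module(name):
    """Return True if `name` can be imported by the running interpreter."""
    try:
        __import__(name)
        return True
    except ImportError:
        return False


print("interpreter: %s (Python %d.%d)"
      % (sys.executable, sys.version_info[0], sys.version_info[1]))
print("tensorflow importable: %s" % has_module("tensorflow"))
```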