Building a Docker Image: Docker + TensorRT + PyTorch + SSH

Build a Docker image that contains a PyTorch + TensorRT development environment and has SSH access enabled.

1. Pull the NVIDIA PyTorch container

Installation

Reference: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch

$ docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:20.12-py3

=============
== PyTorch ==
=============

NVIDIA Release 20.12 (build 17950526)
PyTorch Version 1.8.0a0+1606899

Container image Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.

Copyright (c) 2014-2020 Facebook Inc.
Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
Copyright (c) 2012-2014 Deepmind Technologies (Koray Kavukcuoglu)
Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
Copyright (c) 2011-2013 NYU (Clement Farabet)
Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
Copyright (c) 2006 Idiap Research Institute (Samy Bengio)
Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
Copyright (c) 2015 Google Inc.
Copyright (c) 2015 Yangqing Jia
Copyright (c) 2013-2016 The Caffe contributors
All rights reserved.

NVIDIA Deep Learning Profiler (dlprof) Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION. All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying project or file.
ERROR: Detected NVIDIA GeForce 940MX GPU, which is not supported by this container
ERROR: No supported GPU(s) detected to run this container

ERROR: This container was built for NVIDIA Driver Release 455.32 or later, but
version 440.26 was detected and compatibility mode is UNAVAILABLE.

[[CUDA Driver UNAVAILABLE (cuInit(0) returned 803)]]

NOTE: MOFED driver for multi-node communication was not detected.
Multi-node communication performance may be reduced.

NOTE: The SHMEM allocation limit is set to the default of 64MB. This may be
insufficient for PyTorch. NVIDIA recommends the use of the following flags:
nvidia-docker run --ipc=host ...

root@d0301077522d:/workspace#

ERROR

The container prints key log messages when it starts. The log above contains the following ERROR messages:

ERROR: Detected NVIDIA GeForce 940MX GPU, which is not supported by this container
ERROR: No supported GPU(s) detected to run this container

ERROR: This container was built for NVIDIA Driver Release 455.32 or later, but
version 440.26 was detected and compatibility mode is UNAVAILABLE.

[[CUDA Driver UNAVAILABLE (cuInit(0) returned 803)]]

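In short, the host NVIDIA driver must be at least the version the container was built for (here 455.32 or later, while 440.26 was detected), and the container also supports only certain GPUs (the GeForce 940MX is rejected).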
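To spot these problems without reading the whole banner, the startup log can be scanned programmatically. The sketch below uses hypothetical helpers introduced here (error_lines, driver_check); it assumes the driver message has exactly the wording shown above and that driver versions are dotted integers, so they can be compared as tuples.

```python
import re

# Matches the two-line driver message from the startup log above:
# "built for NVIDIA Driver Release 455.32 or later, but\nversion 440.26 was detected"
DRIVER_RE = re.compile(
    r"built for NVIDIA Driver Release (\d+(?:\.\d+)*) or later, but\s+"
    r"version (\d+(?:\.\d+)*) was detected"
)


def error_lines(log: str) -> list[str]:
    """Return the lines of the startup log that begin with 'ERROR:'."""
    return [line for line in log.splitlines() if line.startswith("ERROR:")]


def parse_version(text: str) -> tuple[int, ...]:
    """Turn '455.32' into (455, 32) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def driver_check(log: str):
    """Return (required, detected, ok), or None if the driver message is absent."""
    match = DRIVER_RE.search(log)
    if match is None:
        return None
    required, detected = match.group(1), match.group(2)
    return required, detected, parse_version(detected) >= parse_version(required)


LOG = """ERROR: Detected NVIDIA GeForce 940MX GPU, which is not supported by this container
ERROR: No supported GPU(s) detected to run this container

ERROR: This container was built for NVIDIA Driver Release 455.32 or later, but
version 440.26 was detected and compatibility mode is UNAVAILABLE.
"""

print(len(error_lines(LOG)))  # 3
print(driver_check(LOG))      # ('455.32', '440.26', False)
```

Comparing tuples rather than strings matters: as strings, "1000.0" would sort before "455.32".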

Verification

Check the available GPUs

nvidia-smi
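If the GPU check needs to be scripted, nvidia-smi can print machine-readable output with its --query-gpu and --format=csv,noheader options. The sketch below is a hypothetical helper (list_gpus, parse_gpu_csv are names introduced here); it assumes GPU names contain no commas, and the sample line in the usage is illustrative, not real output from this machine.

```python
import shutil
import subprocess

# Ask nvidia-smi for one CSV line per GPU: index, name, driver version.
QUERY = ["nvidia-smi", "--query-gpu=index,name,driver_version", "--format=csv,noheader"]


def parse_gpu_csv(text: str) -> list[dict]:
    """Parse nvidia-smi CSV lines into dicts (assumes no commas in GPU names)."""
    gpus = []
    for line in text.strip().splitlines():
        index, name, driver = [field.strip() for field in line.split(",")]
        gpus.append({"index": int(index), "name": name, "driver": driver})
    return gpus


def list_gpus() -> list[dict]:
    """Return the visible GPUs, or [] if nvidia-smi is not on PATH.

    check=True raises CalledProcessError if nvidia-smi fails, e.g. when the
    driver is unavailable as in the log above.
    """
    if shutil.which("nvidia-smi") is None:
        return []
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_csv(out)


# Illustrative input only:
print(parse_gpu_csv("0, NVIDIA GeForce 940MX, 440.26\n"))
```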

Check the TensorRT version and that it works

# python3
>>> import tensorrt
>>> print(tensorrt.__version__)
>>> assert tensorrt.Builder(tensorrt.Logger())

Check the PyTorch version

root@d0301077522d:/workspace# python -c "import torch; print(torch.__version__)"
1.8.0a0+1606899
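The TensorRT and PyTorch checks above can also be run as a single script inside the container. This is a minimal sketch with hypothetical helper names (module_version, environment_report); it reports None for a module that fails to import and "unknown" for one without a __version__ attribute, and only calls torch.cuda.is_available() when torch imports successfully.

```python
import importlib


def module_version(name: str):
    """Return the module's __version__, 'unknown' if it has none, or None if the import fails."""
    try:
        module = importlib.import_module(name)
    except ImportError:
        return None
    return getattr(module, "__version__", "unknown")


def environment_report() -> dict:
    """Collect torch/tensorrt versions and, if torch is present, CUDA availability."""
    report = {name: module_version(name) for name in ("torch", "tensorrt")}
    if report["torch"] is not None:
        import torch
        report["cuda_available"] = torch.cuda.is_available()
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```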

2. Remote-SSH configuration

Reference: How to connect PyCharm to a Docker container on a remote server to run and debug code (Part 1)

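Restart the container with the full command below. Note that it publishes a port for SSH: -p 31652:22 maps port 31652 on the host to port 22 inside the container.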

docker run -it --rm -v /data/zj:/workdir --workdir=/workdir/ --gpus all --shm-size 16g -p 31652:22 nvcr.io/nvidia/pytorch:20.12-py3 /bin/bash
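When the same container is launched repeatedly with different images or ports, it can help to build the command in one place. The sketch below is a hypothetical helper (docker_run_args is a name introduced here) that reproduces the command above; the defaults for workdir and shm_size are assumptions taken from that command. The resulting list can be passed to subprocess.run, or joined with shlex.join for display.

```python
import shlex


def docker_run_args(image: str, host_dir: str, ssh_port: int,
                    workdir: str = "/workdir", shm_size: str = "16g") -> list[str]:
    """Build the argument list for the `docker run` command shown above."""
    return [
        "docker", "run", "-it", "--rm",
        "-v", f"{host_dir}:{workdir}",   # mount the host data directory
        f"--workdir={workdir}/",         # start in the mounted directory
        "--gpus", "all",                 # expose every GPU to the container
        "--shm-size", shm_size,          # raise /dev/shm above the 64MB default
        "-p", f"{ssh_port}:22",          # host port -> sshd on port 22 in the container
        image, "/bin/bash",
    ]


args = docker_run_args("nvcr.io/nvidia/pytorch:20.12-py3", "/data/zj", 31652)
print(shlex.join(args))
```

The --shm-size flag addresses the SHMEM warning from the startup log; the log's other suggestion, --ipc=host, is an alternative the original command does not use.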

Set a password for the root user:

passwd

Configure SSH

Install SSH inside the container:

apt update
apt install openssh-server
apt install openssh-client

Edit the SSH configuration file and append the following at the end:

vim /etc/ssh/sshd_config

# Allow the root user to log in over SSH
PermitRootLogin yes
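One caveat with appending: sshd_config(5) states that, unless noted otherwise, the first obtained value for each keyword is used, so an earlier active PermitRootLogin line would win over the appended one. The sketch below (enable_root_login and apply_to_file are hypothetical names introduced here) removes any active PermitRootLogin lines, keeps commented-out ones, and appends a single "PermitRootLogin yes"; it assumes the file has no Match blocks, since a line appended after a Match block would only apply inside that block.

```python
import re


def enable_root_login(config: str) -> str:
    """Return config with exactly one active 'PermitRootLogin yes' line at the end."""
    # Drop active PermitRootLogin directives; lines starting with '#' are kept as-is.
    lines = [line for line in config.splitlines()
             if not re.match(r"\s*PermitRootLogin\b", line, re.IGNORECASE)]
    lines.append("PermitRootLogin yes")
    return "\n".join(lines) + "\n"


def apply_to_file(path: str = "/etc/ssh/sshd_config") -> None:
    """Rewrite the config file in place (needs root inside the container)."""
    with open(path) as f:
        updated = enable_root_login(f.read())
    with open(path, "w") as f:
        f.write(updated)
```

Running enable_root_login twice gives the same result, so the edit is safe to repeat.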

Finally, restart the SSH service:

/etc/init.d/ssh restart

Verify SSH

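Check that SSH into the container works from the remote server itself (host port 31652 is mapped to port 22 inside the container):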

ssh root@127.0.0.1 -p 31652

From your local machine, check that you can connect to the Docker container on the remote server:

ssh root@<remote IP> -p 31652
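Before trying a full login, it can be useful to confirm that an SSH server actually answers on the published port. Per RFC 4253, an SSH server sends an identification string starting with "SSH-" as soon as the connection opens. Checking that banner is more telling than a bare TCP connect, because Docker's port publishing can accept the TCP connection on the host even when sshd is not running inside the container. The sketch below uses only the standard socket module; ssh_banner is a hypothetical helper name.

```python
import socket


def ssh_banner(host: str, port: int, timeout: float = 3.0):
    """Return the server's SSH identification string, or None if none is received."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            banner = sock.recv(256)  # sshd sends e.g. b"SSH-2.0-OpenSSH_...\r\n" first
    except OSError:  # refused, unreachable, or timed out
        return None
    if not banner.startswith(b"SSH-"):
        return None  # connection accepted but no SSH server answered
    return banner.decode("ascii", errors="replace").strip()


if __name__ == "__main__":
    # Replace 127.0.0.1 with the remote server's IP when checking from your local machine.
    print(ssh_banner("127.0.0.1", 31652))
```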

Once SSH is configured inside the container on the remote server, both PyCharm and VS Code on your local machine can connect to it.

3. Build an image to save the custom configuration

References:

  1. Docker Getting Started (8): packaging a container into an image, and importing/exporting images
  2. Saving a container as an image
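# Syntax
docker commit [OPTIONS] CONTAINER [REPOSITORY[:TAG]]
# For example (run on the host while the container is still running; it was started with --rm, so it is deleted when it exits)
docker commit --author zjykzj --message "Add Remote-SSH configuration inside the nvcr.io/nvidia/pytorch:20.12-py3 container" 4580f75c32ec zjykzj/nvidia/pytorch:20.12-py3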
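If the commit step is scripted, the target image reference can be normalized first. The sketch below uses hypothetical helpers (split_image_ref, docker_commit_args); it assumes references of the form repo[:tag], does not handle @sha256 digests, and treats a colon before the last "/" as a registry port rather than a tag, defaulting the tag to "latest" as Docker does.

```python
import shlex


def split_image_ref(ref: str) -> tuple[str, str]:
    """Split 'repo[:tag]' into (repo, tag); the tag defaults to 'latest'."""
    # A colon after the last '/' starts the tag; an earlier one is a registry port (host:5000/img).
    slash = ref.rfind("/")
    colon = ref.rfind(":")
    if colon > slash:
        return ref[:colon], ref[colon + 1:]
    return ref, "latest"


def docker_commit_args(container: str, image: str, author: str, message: str) -> list[str]:
    """Build the argument list for `docker commit` with an explicit tag."""
    repo, tag = split_image_ref(image)
    return ["docker", "commit", "--author", author, "--message", message,
            container, f"{repo}:{tag}"]


if __name__ == "__main__":
    print(shlex.join(docker_commit_args(
        "4580f75c32ec", "zjykzj/nvidia/pytorch:20.12-py3", "zjykzj",
        "Add Remote-SSH configuration")))
```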

You can then start containers from the newly built image:

docker run -it --rm -v /data/zj:/workdir --workdir=/workdir/ --gpus all --shm-size 16g -p 31652:22 zjykzj/nvidia/pytorch:20.12-py3 /bin/bash

4. Push the image to Docker Hub

Reference: Using Docker Hub

docker push zjykzj/nvidia/pytorch:20.12-py3