Docker, NVIDIA Docker, and DIGITS installation memo

Prerequisite environment

    • OS: Ubuntu 16.04
    • CUDA Driver and Toolkit already installed

Steps

Install Docker

Create the following script and run it with sudo.

$ cat ./nvidia-docker-setup_1604.sh
#! /bin/sh
# Install Docker
apt-get update
apt-get install apt-transport-https ca-certificates
apt-key adv --keyserver hkp://p80.pool.sks-keyservers.net:80 --recv-keys 58118E89F3A912897C070ADBF76221572C52609D
echo "deb https://apt.dockerproject.org/repo ubuntu-xenial main" >> /etc/apt/sources.list.d/docker.list
apt-get update
apt-get install linux-image-extra-$(uname -r) linux-image-extra-virtual
apt-get install docker-engine
docker run hello-world
# Install nvidia-docker and nvidia-docker-plugin
wget -P /tmp https://github.com/NVIDIA/nvidia-docker/releases/download/v1.0.0-rc.3/nvidia-docker_1.0.0.rc.3-1_amd64.deb
dpkg -i /tmp/nvidia-docker*.deb && rm /tmp/nvidia-docker*.deb
# Test nvidia-smi
nvidia-docker run --rm nvidia/cuda nvidia-smi

$ chmod +x ./nvidia-docker-setup_1604.sh
$ sudo ./nvidia-docker-setup_1604.sh
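
One thing to watch in the script above: the echo line appends to docker.list with >>, so running the script a second time adds the same repository line again. A minimal sketch of an append-only-if-missing helper follows. The name add_line_once is hypothetical and not part of the original script.

```shell
#!/bin/sh
# add_line_once is a hypothetical helper, not part of the original script.
# Appends a line to a file only if that exact line is not already present.
add_line_once() {
    file=$1
    line=$2
    # -q: quiet, -x: match the whole line, -F: treat the pattern as a fixed string
    if ! grep -qxF -- "$line" "$file" 2>/dev/null; then
        printf '%s\n' "$line" >> "$file"
    fi
}

# Usage inside the setup script (run as root):
# add_line_once /etc/apt/sources.list.d/docker.list \
#     "deb https://apt.dockerproject.org/repo ubuntu-xenial main"
```

If the file does not exist yet, grep fails and the line is written, which also creates the file.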

See this link for reference (http://qiita.com/ksasaki/items/bd85786171424901b27d).

Start the container

nvtest@WT72:~$ sudo nvidia-docker run --name digits -d -p 8080:34448 nvidia/digits
Using default tag: latest
latest: Pulling from nvidia/digits

862a3e9af0ae: Already exists 
6498e51874bf: Already exists 
159ebdd1959b: Already exists 
0fdbedd3771a: Already exists 
7a1f7116d1e3: Already exists 
1a2b8e5c1cb0: Pull complete 
f79c18aad824: Pull complete 
d750f0e72581: Pull complete 
d399aa23f362: Pull complete 
f7534fde9b83: Pull complete 
ab6e25a40827: Pull complete 
ef0932bdd7af: Pull complete 
6616cddeb677: Pull complete 
37db32ac8c63: Pull complete 
Digest: sha256:be653fe4642928b584f44d03b206f6ddb433508c82b425d95ce6d277daa3462e
Status: Downloaded newer image for nvidia/digits:latest
7f0cb8c00720ce141b7614831f194707926dc6814250f17a722fc3ecb414961d
nvtest@WT72:~$ 

Problem: the container cannot reach the outside network.

Confirming the symptom

nvtest@WT72:~$ sudo nvidia-docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS                     NAMES
7f0cb8c00720        nvidia/digits       "./digits-server"   5 minutes ago       Up 5 minutes        0.0.0.0:8080->34448/tcp   digits
052d94824a82        nvidia/cuda         "/bin/bash"         37 minutes ago      Up 37 minutes                                 awesome_brown
nvtest@WT72:~$ sudo docker exec -it digits /bin/bash
root@7f0cb8c00720:/usr/share/digits#                                                                                                            
root@7f0cb8c00720:/usr/share/digits# 
root@7f0cb8c00720:/usr/share/digits# ping google.com
^C
root@7f0cb8c00720:/usr/share/digits# exit
exit
nvtest@WT72:~$ 

No matter how long you wait, ping gets no response.
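
To avoid waiting indefinitely, the same check can be run non-interactively with a bounded ping, and it can also tell a name-resolution failure apart from a complete lack of connectivity. The sketch below is illustrative only. The helper name check_container_net is hypothetical, 8.8.8.8 (Google Public DNS) is used on the assumption that ICMP to it is allowed from your network, and the image is assumed to provide getent and ping.

```shell
#!/bin/sh
# DOCKER can be overridden (for example DOCKER="sudo docker"); the default is docker.
DOCKER=${DOCKER:-docker}

# check_container_net is a hypothetical helper name.
# Prints "dns-ok", "dns-failure" or "no-network" for the given container.
check_container_net() {
    container=$1
    # getent asks the resolver configured inside the container.
    if $DOCKER exec "$container" getent hosts google.com >/dev/null 2>&1; then
        echo "dns-ok"
    # -c 1: send one packet, -W 2: wait at most 2 seconds for the reply.
    elif $DOCKER exec "$container" ping -c 1 -W 2 8.8.8.8 >/dev/null 2>&1; then
        echo "dns-failure"
    else
        echo "no-network"
    fi
}

# Usage:
# check_container_net digits
```

A result of "dns-failure" means packets reach the outside by IP address but host names do not resolve, which is the case a DNS setting on the Docker daemon is meant to address.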

Solution (for Ubuntu 15 and later)

First, find the IP addresses of the DNS servers the system is currently using.

On Ubuntu 15 or later

nmcli device show <interface name> | grep IP4.DNS

nvtest@WT72:~$ nmcli device show enp4s0 | grep IP4.DNS
IP4.DNS[1]:                             xx.xx.xx.aa
IP4.DNS[2]:                             xx.xx.xx.bb
IP4.DNS[3]:                             xx.xx.xx.xx
IP4.DNS[4]:                             xx.xx.xx.xx
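
The addresses listed here become the --dns options passed to the Docker daemon through DOCKER_OPTS. Rather than copying them by hand, the option string can be generated from the nmcli output. This is a minimal sketch: dns_opts_from_nmcli is a hypothetical helper name, and it emits every IP4.DNS entry, whereas this memo uses only the first two.

```shell
#!/bin/sh
# dns_opts_from_nmcli is a hypothetical helper name.
# Reads `nmcli device show <interface>` output on stdin and prints one
# "--dns <address>" option per IP4.DNS entry, separated by spaces.
dns_opts_from_nmcli() {
    awk '/^IP4\.DNS/ { printf "%s--dns %s", sep, $2; sep = " " } END { print "" }'
}

# Usage (enp4s0 is the interface name from the example above):
# nmcli device show enp4s0 | dns_opts_from_nmcli
```

The printed string can be pasted between the quotes of DOCKER_OPTS in /etc/default/docker.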

Edit the Docker configuration file.

$ sudo vi /etc/default/docker
$ cat /etc/default/docker
# Docker Upstart and SysVinit configuration file

#
# THIS FILE DOES NOT APPLY TO SYSTEMD
#
#   Please see the documentation for "systemd drop-ins":
#   https://docs.docker.com/engine/articles/systemd/
#

# Customize location of Docker binary (especially for development testing).
#DOCKERD="/usr/local/bin/dockerd"

# Use DOCKER_OPTS to modify the daemon startup options.
DOCKER_OPTS="--dns xx.xx.xx.aa --dns xx.xx.xx.bb"

# If you need Docker to use an HTTP proxy, it can also be specified here.
#export http_proxy="http://127.0.0.1:3128/"

# This is also a handy place to tweak where Docker's temporary files go.
#export TMPDIR="/mnt/bigdrive/docker-tmp"
$

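Starting with Ubuntu 15, Docker is started by systemd, which does not read /etc/default/docker on its own (as the comment in that file notes), so the unit file must also be changed as shown below: add an EnvironmentFile line and append $DOCKER_OPTS to ExecStart. Reference: (http://blog.benhall.me.uk/2015/07/setting-dockers-docker_opts-on-ubuntu-15-04/).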

$ cat /lib/systemd/system/docker.service 
[Unit]
Description=Docker Application Container Engine
Documentation=https://docs.docker.com
After=network.target docker.socket
Requires=docker.socket

[Service]
Type=notify
# the default is not to use systemd for cgroups because the delegate issues still
# exists and systemd currently does not support the cgroup feature set required
# for containers run by docker
ExecStart=/usr/bin/dockerd -H fd://
ExecReload=/bin/kill -s HUP $MAINPID
# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNOFILE=infinity
LimitNPROC=infinity
LimitCORE=infinity
# Uncomment TasksMax if your systemd version supports it.
# Only systemd 226 and above support this version.
TasksMax=infinity
TimeoutStartSec=0
# set delegate yes so that systemd does not reset the cgroups of docker containers
Delegate=yes
# kill only the docker process, not all processes in the cgroup
KillMode=process

[Install]
WantedBy=multi-user.target
$ sudo vi /lib/systemd/system/docker.service 
$ cat /lib/systemd/system/docker.service 
[Unit]
Description=Docker Application Container Engine
Documentation=https://docs.docker.com
After=network.target docker.socket
Requires=docker.socket

[Service]
Type=notify
# the default is not to use systemd for cgroups because the delegate issues still
# exists and systemd currently does not support the cgroup feature set required
# for containers run by docker
EnvironmentFile=/etc/default/docker
ExecStart=/usr/bin/dockerd -H fd:// $DOCKER_OPTS
ExecReload=/bin/kill -s HUP $MAINPID
# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNOFILE=infinity
LimitNPROC=infinity
LimitCORE=infinity
# Uncomment TasksMax if your systemd version supports it.
# Only systemd 226 and above support this version.
TasksMax=infinity
TimeoutStartSec=0
# set delegate yes so that systemd does not reset the cgroups of docker containers
Delegate=yes
# kill only the docker process, not all processes in the cgroup
KillMode=process

[Install]
WantedBy=multi-user.target
nvtest@WT72:~$ 
nvtest@WT72:~$ sudo systemctl daemon-reload
nvtest@WT72:~$ sudo service docker restart  
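
The comment in /etc/default/docker mentions systemd drop-ins. As an alternative to editing /lib/systemd/system/docker.service directly (a file that belongs to the package and may be replaced on upgrade), the same two changes can be placed in a drop-in file. The sketch below is one possible way to do that, not the method used in this memo. The helper name write_docker_dropin and the file name dns.conf are assumptions.

```shell
#!/bin/sh
# write_docker_dropin is a hypothetical helper name; dns.conf is an arbitrary file name.
# Writes a drop-in that adds $DOCKER_OPTS to the dockerd command line.
# $1: drop-in directory (on a real host: /etc/systemd/system/docker.service.d)
write_docker_dropin() {
    dir=$1
    mkdir -p "$dir" || return 1
    # The quoted 'EOF' keeps $DOCKER_OPTS literal so that systemd expands it.
    # The leading "-" in EnvironmentFile means a missing file is not an error.
    # The empty ExecStart= clears the packaged command before setting the new one.
    cat > "$dir/dns.conf" <<'EOF'
[Service]
EnvironmentFile=-/etc/default/docker
ExecStart=
ExecStart=/usr/bin/dockerd -H fd:// $DOCKER_OPTS
EOF
}

# Usage (as root), followed by the same reload and restart as above:
# write_docker_dropin /etc/systemd/system/docker.service.d
# systemctl daemon-reload
# systemctl restart docker
```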

Solution (for Ubuntu 14 and earlier)

On Ubuntu 14 or earlier

nmcli dev list iface <interface name> | grep IP4

$ nmcli dev list iface eth0 | grep IP4
IP4.ADDRESS[1]:                         ip = xx.xx.xx.xx/23, gw = xx.xx.xx.xx
IP4.DNS[1]:                             xx.xx.xx.aa
IP4.DNS[2]:                             xx.xx.xx.bb
IP4.DNS[3]:                             xx.xx.xx.xx
IP4.DNS[4]:                             xx.xx.xx.xx
IP4.DOMAIN[1]:                          xxxxxx.com
IP4.WINS[1]:                            xx.xx.xx.xx
IP4.WINS[2]:                            xx.xx.xx.xx
IP4.WINS[3]:                            xx.xx.xx.xx

Edit the Docker configuration file.

$ sudo vi /etc/default/docker
$ cat /etc/default/docker
# Docker Upstart and SysVinit configuration file

#
# THIS FILE DOES NOT APPLY TO SYSTEMD
#
#   Please see the documentation for "systemd drop-ins":
#   https://docs.docker.com/engine/articles/systemd/
#

# Customize location of Docker binary (especially for development testing).
#DOCKERD="/usr/local/bin/dockerd"

# Use DOCKER_OPTS to modify the daemon startup options.
DOCKER_OPTS="--dns xx.xx.xx.aa --dns xx.xx.xx.bb"

# If you need Docker to use an HTTP proxy, it can also be specified here.
#export http_proxy="http://127.0.0.1:3128/"

# This is also a handy place to tweak where Docker's temporary files go.
#export TMPDIR="/mnt/bigdrive/docker-tmp"
$
$ sudo service docker restart  

Download the MNIST dataset

nvtest@WT72:~$ sudo nvidia-docker restart digits
nvtest@WT72:~$ sudo nvidia-docker exec -it digits /bin/bash
root@7f0cb8c00720:/usr/share/digits#                                                                                                            
root@7f0cb8c00720:/usr/share/digits# 
root@0fa423b2a7ee:/usr/share/digits# ./tools/download_data/main.py mnist /opt/mnist
Downloading url=http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz ...
Downloading url=http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz ...
Downloading url=http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz ...
Downloading url=http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz ...
Uncompressing file=train-images-idx3-ubyte.gz ...
Uncompressing file=train-labels-idx1-ubyte.gz ...
Uncompressing file=t10k-images-idx3-ubyte.gz ...
Uncompressing file=t10k-labels-idx1-ubyte.gz ...
Reading labels from /opt/mnist/train-labels.bin ...
Reading images from /opt/mnist/train-images.bin ...
Reading labels from /opt/mnist/test-labels.bin ...
Reading images from /opt/mnist/test-images.bin ...
Dataset directory is created successfully at '/opt/mnist'
Done after 130.443932056 seconds.
root@0fa423b2a7ee:/usr/share/digits# ll /opt/mnist/
total 65024
drwxr-xr-x  4 root root     4096 Sep 17 17:18 ./
drwxr-xr-x  3 root root     4096 Sep 17 12:49 ../
-rw-r--r--  1 root root  1648877 Sep 17 17:18 t10k-images-idx3-ubyte.gz
-rw-r--r--  1 root root     4542 Sep 17 17:18 t10k-labels-idx1-ubyte.gz
drwxr-xr-x 12 root root     4096 Sep 17 17:18 test/
-rw-r--r--  1 root root  7840016 Sep 17 17:18 test-images.bin
-rw-r--r--  1 root root    10008 Sep 17 17:18 test-labels.bin
drwxr-xr-x 12 root root     4096 Sep 17 17:18 train/
-rw-r--r--  1 root root  9912422 Sep 17 17:18 train-images-idx3-ubyte.gz
-rw-r--r--  1 root root 47040016 Sep 17 17:18 train-images.bin
-rw-r--r--  1 root root    28881 Sep 17 17:18 train-labels-idx1-ubyte.gz
-rw-r--r--  1 root root    60008 Sep 17 17:18 train-labels.bin
root@0fa423b2a7ee:/usr/share/digits# 
root@0fa423b2a7ee:/usr/share/digits# ll /opt/mnist/train
total 3524
drwxr-xr-x 12 root root    4096 Sep 17 17:18 ./
drwxr-xr-x  4 root root    4096 Sep 17 17:18 ../
drwxr-xr-x  2 root root  163840 Sep 17 17:18 0/
drwxr-xr-x  2 root root  208896 Sep 17 17:18 1/
drwxr-xr-x  2 root root  167936 Sep 17 17:18 2/
drwxr-xr-x  2 root root  172032 Sep 17 17:18 3/
drwxr-xr-x  2 root root  167936 Sep 17 17:18 4/
drwxr-xr-x  2 root root  147456 Sep 17 17:18 5/
drwxr-xr-x  2 root root  147456 Sep 17 17:18 6/
drwxr-xr-x  2 root root  192512 Sep 17 17:18 7/
drwxr-xr-x  2 root root  159744 Sep 17 17:18 8/
drwxr-xr-x  2 root root  163840 Sep 17 17:18 9/
-rw-r--r--  1 root root      20 Sep 17 17:18 labels.txt
-rw-r--r--  1 root root 1860000 Sep 17 17:18 train.txt
root@0fa423b2a7ee:/usr/share/digits# 
root@0fa423b2a7ee:/usr/share/digits# ll /opt/mnist/test 
total 700
drwxr-xr-x 12 root root   4096 Sep 17 17:18 ./
drwxr-xr-x  4 root root   4096 Sep 17 17:18 ../
drwxr-xr-x  2 root root  36864 Sep 17 17:18 0/
drwxr-xr-x  2 root root  36864 Sep 17 17:18 1/
drwxr-xr-x  2 root root  36864 Sep 17 17:18 2/
drwxr-xr-x  2 root root  36864 Sep 17 17:18 3/
drwxr-xr-x  2 root root  36864 Sep 17 17:18 4/
drwxr-xr-x  2 root root  32768 Sep 17 17:18 5/
drwxr-xr-x  2 root root  36864 Sep 17 17:18 6/
drwxr-xr-x  2 root root  36864 Sep 17 17:18 7/
drwxr-xr-x  2 root root  36864 Sep 17 17:18 8/
drwxr-xr-x  2 root root  32768 Sep 17 17:18 9/
-rw-r--r--  1 root root     20 Sep 17 17:18 labels.txt
-rw-r--r--  1 root root 300000 Sep 17 17:18 test.txt
root@0fa423b2a7ee:/usr/share/digits# 
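
As a quick sanity check on the download, the number of files in each class directory can be counted. This is a sketch that assumes each class directory (0 to 9) directly contains one image file per sample. The helper name count_per_class is hypothetical.

```shell
#!/bin/sh
# count_per_class is a hypothetical helper name.
# Prints "<class> <number of files>" for each class subdirectory of $1.
count_per_class() {
    root=$1
    for d in "$root"/*/; do
        [ -d "$d" ] || continue
        # tr removes the padding that some wc implementations add.
        n=$(find "$d" -type f | wc -l | tr -d ' ')
        printf '%s %s\n' "$(basename "$d")" "$n"
    done
}

# Usage inside the container:
# count_per_class /opt/mnist/train
# count_per_class /opt/mnist/test
```

MNIST contains 60,000 training images and 10,000 test images, so under the layout assumption above the per-class counts should add up to those totals.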

Access DIGITS from a web browser and specify the downloaded MNIST data as the training data.

Screenshot from 2016-09-18 01:24:04.png