TensorFlow-GPU in Docker on Fedora 22

Let's go over the major technologies involved in my setup (your mileage may vary); you can verify your own versions with the commands shown right after the list:

  • NVIDIA GTX 970
  • Driver Version: 364.19
  • CUDA Version 7.5
  • GCC 5.3
  • TensorFlow GPU Docker image
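Before touching CUDA itself, it's worth double-checking the first items on that list: nvidia-smi reports the driver version and visible GPUs, and gcc --version the compiler (nvcc will only show up once the toolkit is installed):

$ nvidia-smi
$ gcc --version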

Drivers

Assuming you already have the regular drivers for your GPU installed from the NVIDIA site, proceed by downloading the CUDA Toolkit 7.5 repository package: select Linux, x86_64, Fedora, 21, rpm (remote), which should give you something like:

http://developer.download.nvidia.com/compute/cuda/repos/fedora21/x86_64/cuda-repo-fedora21-7.5-18.x86_64.rpm  
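Since dnf can install packages straight from a URL, adding the repository is a one-liner:

sudo dnf install -y http://developer.download.nvidia.com/compute/cuda/repos/fedora21/x86_64/cuda-repo-fedora21-7.5-18.x86_64.rpm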

With the repository in place, run the following command to install the necessary packages:

sudo dnf install -y cuda cuda-drivers gpu-deployment-kit
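To confirm the toolkit landed where expected, ask nvcc for its version:

/usr/local/cuda-7.5/bin/nvcc --version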

cuDNN requires registration, so if you don't want to register, feel free to skip this step. The cuDNN library is distributed as a tar archive; after downloading it, run this command to extract it alongside the rest of the CUDA tools:

sudo tar xvf ~/Downloads/cudnn-7.5-linux-x64-v5.0-rc.tgz -C /usr/local/  
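The archive unpacks a cuda/ directory, so assuming the usual cuDNN layout the header and libraries should now sit next to the toolkit's own files; a quick look plus a linker-cache refresh doesn't hurt:

ls /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*
sudo ldconfig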

To check that your installation is all right, let's compile a CUDA binary and execute it.

Compiling deviceQuery

The binary we're going to compile prints information about the GPU device we're using. Even though CUDA 7.5 officially supports GCC only up to 4.9, we can simply bypass the version check with an unholy sed, because everything seems to work just fine:

$ sudo sed -i 's/#if __GNUC__ > 4 || (__GNUC__ == 4 \&\& __GNUC_MINOR__ > 9)/#if __GNUC__ > 5 || (__GNUC__ == 5 \&\& __GNUC_MINOR__ > 9)/g' /usr/local/cuda-7.5/targets/x86_64-linux/include/host_config.h
$ mkdir ~/cuda
$ cp -r /usr/local/cuda-7.5/samples/* ~/cuda
$ cd ~/cuda/1_Utilities/deviceQuery
$ make
$ ./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GTX 970"  
  CUDA Driver Version / Runtime Version          8.0 / 7.5
  CUDA Capability Major/Minor version number:    5.2
  Total amount of global memory:                 4095 MBytes (4294246400 bytes)
  (13) Multiprocessors, (128) CUDA Cores/MP:     1664 CUDA Cores
  GPU Max Clock rate:                            1253 MHz (1.25 GHz)
  Memory Clock rate:                             3505 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 1835008 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 3 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 7.5, NumDevs = 1, Device0 = GeForce GTX 970  
Result = PASS  

Note that if you don't want to copy all the samples to your home folder, you can go YOLO, cd into the deviceQuery directory under /usr/local/cuda-7.5/samples, and run sudo make.
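For a second opinion beyond deviceQuery, the bandwidthTest sample from the same tree exercises actual host-to-device transfers and should likewise finish with Result = PASS:

$ cd ~/cuda/1_Utilities/bandwidthTest
$ make
$ ./bandwidthTest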

Running the Container

The script provided by TensorFlow here seems to be rather Ubuntu/Debian-centric; it doesn't work on Fedora, so I had to tweak it in the following way:

#!/usr/bin/env bash

set -e

export CUDA_HOME=${CUDA_HOME:-/usr/local/cuda}

if [ ! -d "${CUDA_HOME}/lib64" ]; then
  echo "Failed to locate CUDA libs at ${CUDA_HOME}/lib64."
  exit 1
fi

# CUDA libraries: turn each toolkit .so into a -v bind-mount flag
export CUDA_SO=$(\ls "${CUDA_HOME}"/lib64/*.so* | \
                    xargs -I{} echo '-v {}:{}')
# libcuda from /usr/lib64 (the driver-side library)
export CUDA_SO="$CUDA_SO $(\ls /usr/lib64/libcuda.so* | \
                    xargs -I{} echo '-v {}:{}')"
# libnv* from /usr/lib64 (the remaining driver libraries)
export CUDA_SO="$CUDA_SO $(\ls /usr/lib64/libnv*.so* | \
                    xargs -I{} echo '-v {}:{}')"
# NVIDIA device nodes (/dev/nvidia0, /dev/nvidiactl, /dev/nvidia-uvm, ...);
# matching /dev/nvidia* avoids catching unrelated nodes like /dev/nvram
export DEVICES=$(\ls /dev/nvidia* | \
                    xargs -I{} echo '--device {}:{}')

if [[ -z "${DEVICES}" ]]; then
  echo "Failed to locate NVIDIA device(s). Did you want the non-GPU container?"
  exit 1
fi

docker run -it -e LD_LIBRARY_PATH="${CUDA_HOME}/lib64:/usr/lib64" $CUDA_SO $DEVICES "$@"
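If you're curious what the script will actually pass to docker, the generated flags can be previewed without starting a container; the exact output naturally varies with your driver and toolkit versions:

$ CUDA_HOME=/usr/local/cuda
$ \ls "${CUDA_HOME}"/lib64/*.so* /usr/lib64/libcuda.so* | xargs -I{} echo '-v {}:{}'
$ \ls /dev/nvidia* | xargs -I{} echo '--device {}:{}'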

Save the script as tf-gpu-docker.sh, make it executable with chmod +x tf-gpu-docker.sh, and execute:

$ ./tf-gpu-docker.sh gcr.io/tensorflow/tensorflow:latest-gpu python
Python 2.7.6 (default, Jun 22 2015, 17:58:13)  
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.  
>>> import tensorflow as tf
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcublas.so locally  
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcudnn.so locally  
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcufft.so locally  
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcuda.so.1 locally  
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcurand.so locally  
>>> session = tf.Session()
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:900] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero  
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties:  
name: GeForce GTX 970  
major: 5 minor: 2 memoryClockRate (GHz) 1.253  
pciBusID 0000:03:00.0  
Total memory: 4.00GiB  
Free memory: 3.49GiB  
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0  
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0:   Y  
I tensorflow/core/common_runtime/gpu/gpu_device.cc:755] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 970, pci bus id: 0000:03:00.0)  

Voila! We have a Python shell from the running container, we've verified that our CUDA device has been passed through successfully, and we're ready to start teaching machines how to learn.
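As a final smoke test, the whole round trip fits in a one-liner: asking TensorFlow to log device placement should show the addition landing on /gpu:0 (this reuses the script and image from above):

$ ./tf-gpu-docker.sh gcr.io/tensorflow/tensorflow:latest-gpu python -c \
    "import tensorflow as tf; sess = tf.Session(config=tf.ConfigProto(log_device_placement=True)); print(sess.run(tf.constant(1.0) + tf.constant(41.0)))"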