====== GPU Resources ======
This is a collaborative resource; please improve it. Log in with your MCIN username and ID and add your discoveries.

===== Items of Interest / for Discussion =====

==== Resources ====

  * [[https://developer.nvidia.com/openacc/3-steps-to-more-science|OpenACC - Tutorial - Steps to More Science]]

"Here are three simple steps to start accelerating your code with GPUs. We will be using PGI OpenACC compiler for C, C++, FORTRAN, along with tools from the PGI Community Edition."

  * [[https://devblogs.nvidia.com/parallelforall/performance-portability-gpus-cpus-openacc/|Performance Portability from GPUs to CPUs with OpenACC]]

"...performance on multicore CPUs for HPC apps using MPI + OpenACC is equivalent to MPI + OpenMP code. Compiling and running the same code on a Tesla K80 GPU can provide large speedups."

  * [[http://www.nvidia.com/object/data-center-managment-tools.html|Data Center Management Tools]]
    * The GPU Deployment Kit
    * Ganglia
    * Slurm
    * NVIDIA Docker
    * Others?

===== Preventing Job Clobbering =====

There are currently 3 GPUs in ace-gpu-1. To select one of the three (0, 1, or 2), set the CUDA_VISIBLE_DEVICES environment variable. This can be accomplished by adding the following line to your ~/.bash_profile file on ace-gpu-1, where X is 0, 1, or 2:

<code>
export CUDA_VISIBLE_DEVICES=X
</code>

This only takes effect when you log in, so log out, log back in, and run the following to make sure that it worked:

<code>
echo $CUDA_VISIBLE_DEVICES
</code>

If it outputs the ID that you selected, then you're ready to use the GPU.
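If you prefer to pin the GPU per job rather than in your ~/.bash_profile, the same variable can be set from inside a script. A minimal sketch (the device index 1 is just an example; the assignment must happen before any CUDA-using framework is imported, because the framework reads the device list once at initialization):

```python
import os

# Pin this process to GPU 1. Must be done before importing a CUDA-using
# framework (TensorFlow, Theano, etc.), which reads the device list
# only once at start-up.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

# The framework will now see exactly one device, renumbered as GPU 0.
print(os.environ["CUDA_VISIBLE_DEVICES"])  # prints "1"
```

A job launched this way cannot touch the other two GPUs, which is the point of the clobbering-prevention scheme above.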
+ | |||
+ | ==== Sharing a single GPU ==== | ||
+ | To configure TensorFlow to not pre-allocate all GPU memory you can use the following Python code: | ||
+ | |||
+ | <code> | ||
+ | # configures TensorFlow to not try to grab all the GPU memory | ||
+ | config = tf.ConfigProto(allow_soft_placement=True) | ||
+ | config.gpu_options.allow_growth = True | ||
+ | session = tf.Session(config=config) | ||
+ | K.set_session(session) | ||
+ | </code> | ||
+ | |||
+ | This has been found to work only to a certain extent, and when there are several jobs that use a significant amount of the GPU resources, jobs can still be ruined even when using the above code | ||
===== GPU Info =====
+ | |||
+ | For CPU and GPU usage: | ||
+ | |||
+ | <code> | ||
+ | glances | ||
+ | </code> | ||
+ | |||
The Nvidia Visual Profiler (https://developer.nvidia.com/nvidia-visual-profiler) would be useful for GPU monitoring if we had X visualization, but we do not:

<code>
/usr/local/cuda/bin/nvvp
</code>

===== GPU Accounting =====

SysAdmins: to enable accounting mode:

<code>
sudo nvidia-smi -i 0 -am ENABLED
</code>

Users: to check whether accounting mode is enabled or disabled:

<code>
nvidia-smi -i 0 -q -d ACCOUNTING
</code>

Output example:

<code>
==============NVSMI LOG==============

Timestamp                       : Thu Apr 27 09:09:50 2017
Driver Version                  : 375.39

Attached GPUs                   : 1
GPU 0000:01:00.0
    Accounting Mode             : Enabled
    Accounting Mode Buffer Size : 1920
    Accounted Processes
        Process ID              : 15819
            GPU Utilization     : 100 %
            Memory Utilization  : 6 %
            Max memory usage    : 187 MiB
            Time                : 3769 ms
            Is Running          : 0
...
</code>
Users: to check GPU stats per process:

<code>
nvidia-smi -i 0 --query-accounted-apps=gpu_name,pid,gpu_util,max_memory_usage,time --format=csv
</code>
+ | |||
+ | Output example: | ||
+ | |||
+ | <code> | ||
+ | gpu_name, pid, gpu_utilization [%], max_memory_usage [MiB], time [ms] | ||
+ | TITAN X (Pascal), 15819, 100 %, 187 MiB, 3769 ms | ||
+ | TITAN X (Pascal), 15633, 87 %, 8465 MiB, 200626 ms | ||
+ | TITAN X (Pascal), 15944, 0 %, 153 MiB, 382 ms | ||
+ | TITAN X (Pascal), 16000, 0 %, 155 MiB, 299 ms | ||
+ | TITAN X (Pascal), 15862, 80 %, 8465 MiB, 215039 ms | ||
+ | TITAN X (Pascal), 15842, 41 %, 425 MiB, 721223 ms | ||
+ | TITAN X (Pascal), 16294, 74 %, 8465 MiB, 231517 ms | ||
+ | TITAN X (Pascal), 16436, 70 %, 10425 MiB, 229470 ms | ||
+ | TITAN X (Pascal), 16118, 40 %, 155 MiB, 1310156 ms | ||
+ | TITAN X (Pascal), 16908, 72 %, 8465 MiB, 511122 ms | ||
+ | TITAN X (Pascal), 17102, 73 %, 8465 MiB, 833806 ms | ||
+ | TITAN X (Pascal), 17900, 0 %, 153 MiB, 358 ms | ||
+ | TITAN X (Pascal), 18018, 0 %, 153 MiB, 235 ms | ||
+ | TITAN X (Pascal), 17632, 75 %, 8465 MiB, 823193 ms | ||
+ | TITAN X (Pascal), 18376, 74 %, 8529 MiB, 827336 ms | ||
+ | TITAN X (Pascal), 18637, 74 %, 8465 MiB, 547161 ms | ||
+ | TITAN X (Pascal), 16377, 54 %, 153 MiB, 0 ms | ||
+ | TITAN X (Pascal), 18752, 55 %, 8465 MiB, 0 ms | ||
+ | </code> | ||
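The CSV output above is easy to post-process, e.g. to find which processes peaked highest in memory. A minimal sketch: the helper name is ours, the sample rows are copied from the output above, and in practice you would capture the text via subprocess rather than hard-coding it:

```python
import csv
import io

# A few rows copied from the query output above.
sample = """\
gpu_name, pid, gpu_utilization [%], max_memory_usage [MiB], time [ms]
TITAN X (Pascal), 15819, 100 %, 187 MiB, 3769 ms
TITAN X (Pascal), 16436, 70 %, 10425 MiB, 229470 ms
TITAN X (Pascal), 15633, 87 %, 8465 MiB, 200626 ms
"""

def parse_accounted_apps(text):
    """Parse `nvidia-smi --query-accounted-apps ... --format=csv` output
    into a list of dicts with numeric fields."""
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    rows = []
    for row in reader:
        rows.append({
            "gpu_name": row["gpu_name"],
            "pid": int(row["pid"]),
            # Strip the " %", " MiB" and " ms" unit suffixes.
            "gpu_util": int(row["gpu_utilization [%]"].split()[0]),
            "max_mem_mib": int(row["max_memory_usage [MiB]"].split()[0]),
            "time_ms": int(row["time [ms]"].split()[0]),
        })
    return rows

apps = parse_accounted_apps(sample)
# Largest peak-memory consumers first.
apps.sort(key=lambda a: a["max_mem_mib"], reverse=True)
print(apps[0]["pid"])  # prints 16436, the heaviest process in the sample
```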
+ | |||
+ | Users: Accounting help | ||
+ | <code> | ||
+ | nvidia-smi --help-query-accounted-apps | ||
+ | </code> | ||
+ | |||
==== nvidia-smi flags used ====

<code>
-i,  --id=              Target a specific GPU.
-am, --accounting-mode= Enable or disable Accounting Mode: 0/DISABLED, 1/ENABLED
-q,  --query            Display GPU or Unit info.
-d,  --display=         Display only selected information: MEMORY,
                        UTILIZATION, ECC, TEMPERATURE, POWER, CLOCK,
                        COMPUTE, PIDS, PERFORMANCE, SUPPORTED_CLOCKS,
                        PAGE_RETIREMENT, ACCOUNTING.
                        Flags can be combined with comma e.g. ECC,POWER.
                        Sampling data with max/min/avg is also returned
                        for POWER, UTILIZATION and CLOCK display types.
                        Doesn't work with -u or -x flags.
</code>

  * [[http://docs.nvidia.com/deploy/driver-persistence/index.html#persistence-mode]]
  * [[http://docs.nvidia.com/deploy/driver-persistence/index.html#persistence-daemon]]

===== Deep Learning =====

  * [[https://docs.google.com/document/d/1pNXD06fHx-NnnFjz10hTZTaeVKXCfDsgl2CGt2iMWYc/edit|Deep Learning Notes]]

===== Freesurfer =====

  * [[https://surfer.nmr.mgh.harvard.edu/fswiki/SystemRequirements]]
  * [[https://surfer.nmr.mgh.harvard.edu/fswiki/DevelopersGuide]]

FreeSurfer 6.0 ships with CUDA (as well as OpenMP) support, but we have had issues compiling FreeSurfer with CUDA in the recent past. GPU/CUDA is no longer actively supported, as FreeSurfer is permanently stuck in the past on version 5.0.35...

===== Nvidia-Docker =====