gpu_resources (last modified 2017/06/08 17:48 by adoyle)
This is a collaborative resource; please improve it. Log in using your MCIN user name and ID and add your discoveries.

===== Items of Interest / for Discussion =====

==== Resources ====

  * [[https://developer.nvidia.com/openacc/3-steps-to-more-science|OpenACC Tutorial: Steps to More Science]]

"Here are three simple steps to start accelerating your code with GPUs. We will be using PGI OpenACC compiler for C, C++, FORTRAN, along with tools from the PGI Community Edition."

  * [[https://devblogs.nvidia.com/parallelforall/performance-portability-gpus-cpus-openacc/|Performance Portability from GPUs to CPUs with OpenACC]]

"...performance on multicore CPUs for HPC apps using MPI + OpenACC is equivalent to MPI + OpenMP code. Compiling and running the same code on a Tesla K80 GPU can provide large speedups."

  * [[http://www.nvidia.com/object/data-center-managment-tools.html|Data Center Management Tools]]
    * The GPU Deployment Kit
    * Ganglia
    * Slurm
    * NVIDIA Docker
    * Others?

===== Preventing Job Clobbering =====
There are currently three GPUs in ace-gpu-1. To select one of them, set the CUDA_VISIBLE_DEVICES environment variable. One way is to add the following line to your ~/.bash_profile file on ace-gpu-1, where X is 0, 1, or 2:

<code>
export CUDA_VISIBLE_DEVICES=X
</code>

This takes effect only at login, so log out, log back in, and run the following to confirm that it worked:

<code>
echo $CUDA_VISIBLE_DEVICES
</code>

If it prints the ID you selected, you're ready to use that GPU.
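The variable can also be set for a single command, which overrides any login-wide value from ~/.bash_profile for that one job only. A small sketch (the subshell below stands in for a real training command):

```shell
# Pin one job to GPU 1; your ~/.bash_profile default is untouched.
CUDA_VISIBLE_DEVICES=1 sh -c 'echo "this job sees GPU $CUDA_VISIBLE_DEVICES"'
# prints: this job sees GPU 1
```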

==== Sharing a single GPU ====

To configure TensorFlow not to pre-allocate all GPU memory, you can use the following Python code:
<code>
import tensorflow as tf

# Allocate GPU memory on demand instead of grabbing it all at startup.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)
</code>
In practice this works only to a certain extent: when several jobs each use a significant share of GPU resources, jobs can still be clobbered even with the code above.
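A script can also pin itself to one GPU from inside Python, provided this happens before TensorFlow (or any other CUDA-using library) is imported, since the driver reads the variable only once at initialization. A sketch; the device ID "2" is just an example:

```python
import os

# Must run before TensorFlow/CUDA initializes; the variable is read once.
os.environ["CUDA_VISIBLE_DEVICES"] = "2"

# Frameworks imported after this point see a single GPU, renumbered as 0.
print(os.environ["CUDA_VISIBLE_DEVICES"])  # prints: 2
```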
===== GPU Info =====
nsight
</code>

The NVIDIA Visual Profiler (https://developer.nvidia.com/nvidia-visual-profiler) would be useful for GPU monitoring if we had X visualization available, but we do not:

<code>
/usr/local/cuda/bin/nvvp
</code>

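Since the server is headless, one workaround is to record a profile with the command-line profiler nvprof and open the resulting file in the Visual Profiler on your own machine. An untested sketch; ''./my_app'' is a placeholder for your own program:

```shell
# Collect a timeline on the headless server; requires CUDA's nvprof.
# "./my_app" is a placeholder for the program you want to profile.
if command -v nvprof >/dev/null 2>&1; then
    nvprof --output-profile profile.nvvp ./my_app
else
    echo "nvprof not found; is CUDA on your PATH?" >&2
fi
```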
===== GPU Accounting =====
</code>

Example output:

<code>
==============NVSMI LOG==============

Timestamp                       : Thu Apr 27 09:09:50 2017
Driver Version                  : 375.39

Attached GPUs                   : 1
GPU 0000:01:00.0
    Accounting Mode             : Enabled
    Accounting Mode Buffer Size : 1920
    Accounted Processes
        Process ID              : 15819
        GPU Utilization         : 100 %
        Memory Utilization      : 6 %
        Max memory usage        : 187 MiB
        Time                    : 3769 ms
        Is Running              : 0
...
</code>

Users: to check GPU stats per process:

<code>
Doesn't work with -u or -x flags.
</code>
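Accounting data can also be pulled in machine-readable form. A sketch assuming the ''--query-accounted-apps'' interface described in the nvidia-smi documentation (field names may vary by driver version):

```shell
# CSV summary of accounted processes; requires accounting mode enabled.
FIELDS="pid,gpu_utilization,mem_utilization,max_memory_usage,time"
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-accounted-apps="$FIELDS" --format=csv
else
    echo "nvidia-smi not found (run this on ace-gpu-1)" >&2
fi
```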

  * [[http://docs.nvidia.com/deploy/driver-persistence/index.html#persistence-mode]]
  * [[http://docs.nvidia.com/deploy/driver-persistence/index.html#persistence-daemon]]
===== Deep Learning =====