====== Differences ====== This shows you the differences between two versions of the page.
gpu_resources [2017/05/05 15:25] csteel [nvidia-smi flags used] |
gpu_resources [2017/06/08 17:48] (current) adoyle [Preventing Job Clobbering] |
||
---|---|---|---|
Line 28: | Line 28: | ||
===== Preventing Job Clobbering ===== | ===== Preventing Job Clobbering ===== | ||
- | Today I was training a model and inadvertently kicked Konrad's job off the GPU. I discovered how to configure TensorFlow so that it doesn't do this: | + | There are currently three GPUs in ace-gpu-1. To select one of them (0, 1 or 2), set the CUDA_VISIBLE_DEVICES environment variable. You can do this by adding the following line to your ~/.bash_profile file on ace-gpu-1, where X is 0, 1 or 2: |
+ | |||
+ | <code> | ||
+ | export CUDA_VISIBLE_DEVICES=X | ||
+ | </code> | ||
+ | |||
+ | This only takes effect at login, so log out and back in, then run the following to check that it worked: | ||
+ | |||
+ | <code> | ||
+ | echo $CUDA_VISIBLE_DEVICES | ||
+ | </code> | ||
+ | |||
+ | If it outputs the ID that you selected then you're ready to use the GPU. | ||
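The same restriction can also be applied per process rather than per login. The following is a minimal sketch, assuming the three GPU IDs described above; `select_gpu` and `VALID_GPU_IDS` are hypothetical names used only for illustration. The variable has to be set before the process initializes CUDA, so in practice before importing TensorFlow.

```python
import os

# The three GPU IDs on ace-gpu-1, as described above (assumption: IDs are 0, 1, 2).
VALID_GPU_IDS = ("0", "1", "2")


def select_gpu(gpu_id, environ=os.environ):
    """Restrict the current process to one GPU via CUDA_VISIBLE_DEVICES.

    `environ` is a parameter only so the function can be tried out on a
    plain dict without touching the real environment.
    """
    gpu_id = str(gpu_id)
    if gpu_id not in VALID_GPU_IDS:
        raise ValueError(
            "GPU id must be one of %s, got %r" % (", ".join(VALID_GPU_IDS), gpu_id)
        )
    environ["CUDA_VISIBLE_DEVICES"] = gpu_id
    return gpu_id


# Typical use, at the very top of a training script:
#   select_gpu(1)
#   import tensorflow as tf   # import only after the variable is set
```

Inside the process, the selected GPU is then the only visible device, so the framework addresses it as device 0 whichever physical ID was chosen.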
+ | |||
+ | ==== Sharing a single GPU ==== | ||
+ | To stop TensorFlow from pre-allocating all GPU memory, you can use the following Python code: | ||
<code> | <code> | ||
Line 38: | Line 53: | ||
</code> | </code> | ||
- | We should develop some kind of policy to run jobs on ace-gpu-1 so that we don't inadvertently ruin other peoples' processes. | + | This works only to a certain extent: when several jobs each use a significant share of the GPU's resources, jobs can still be ruined even with the above code. |
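One way to reduce the risk is to check how much memory each GPU is already using before starting a job. The sketch below is an illustration only, not an established procedure on ace-gpu-1: it calls `nvidia-smi` with its `--query-gpu` and `--format=csv,noheader,nounits` options, and the helper names (`parse_gpu_memory`, `least_used_gpu`, `query_gpu_memory`) are hypothetical.

```python
import subprocess

# nvidia-smi prints one CSV line per GPU: index, used MiB, total MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Return {gpu_index: (used_mib, total_mib)} from nvidia-smi CSV output."""
    usage = {}
    for line in csv_text.strip().splitlines():
        index, used, total = [field.strip() for field in line.split(",")]
        usage[int(index)] = (int(used), int(total))
    return usage


def least_used_gpu(usage):
    """Return the index of the GPU with the most free memory."""
    return max(usage, key=lambda i: usage[i][1] - usage[i][0])


def query_gpu_memory():
    """Run nvidia-smi and parse its output (requires the NVIDIA driver)."""
    output = subprocess.check_output(QUERY).decode("utf-8")
    return parse_gpu_memory(output)
```

This is only a snapshot: another job can claim memory between the check and the start of your own job, so it lowers the chance of clobbering rather than ruling it out.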
===== GPU Info ===== | ===== GPU Info ===== | ||
Line 70: | Line 84: | ||
nsight | nsight | ||
</code> | </code> | ||
+ | |||
+ | Nvidia Visual Profiler (https://developer.nvidia.com/nvidia-visual-profiler) would be useful for GPU monitoring if we had an X display, but we do not: | ||
+ | <code> | ||
+ | /usr/local/cuda/bin/nvvp | ||
+ | </code> | ||
+ | |||
===== GPU Accounting ===== | ===== GPU Accounting ===== | ||
Line 155: | Line 175: | ||
* [[http://docs.nvidia.com/deploy/driver-persistence/index.html#persistence-mode]] | * [[http://docs.nvidia.com/deploy/driver-persistence/index.html#persistence-mode]] | ||
+ | |||
* [[http://docs.nvidia.com/deploy/driver-persistence/index.html#persistence-daemon]] | * [[http://docs.nvidia.com/deploy/driver-persistence/index.html#persistence-daemon]] | ||
===== Deep Learning ===== | ===== Deep Learning ===== |