User Tools

Site Tools


gpu_resources

====== Differences ====== This shows you the differences between two versions of the page.

Link to this comparison view

Both sides previous revision Previous revision
Next revision
Previous revision
gpu_resources [2017/05/05 15:25]
csteel [nvidia-smi flags used]
gpu_resources [2017/06/08 17:48] (current)
adoyle [Preventing Job Clobbering]
Line 28: Line 28:
 ===== Preventing Job Clobbering ===== ===== Preventing Job Clobbering =====
  
-Today I was training a model and inadvertently kicked Konrad'​s ​job off the GPUI discovered how to configure TensorFlow ​so that it doesn't do this:+There are currently 3 GPU'​s ​in ace-gpu-1. To select one of the three (0, 1, 2), set the CUDA_​VISIBLE_​DEVICES environment variableThis can be accomplished by adding the following line to your ~/​.bash_profile file on ace-gpu-1, where X is either 0, 1 or 2: 
 + 
 +<​code>​ 
 +export CUDA_VISIBLE_DEVICES=X 
 +</​code>​ 
 + 
 +This will only take effect when you log in, so log out and back in and try the following to ensure ​that it worked: 
 + 
 +<​code>​ 
 +echo $CUDA_VISIBLE_DEVICES 
 +</​code>​ 
 + 
 +If it outputs the ID that you selected then you're ready to use the GPU. 
 + 
 +==== Sharing a single GPU ==== 
 +To configure TensorFlow to not pre-allocate all GPU memory you can use the following Python code:
  
 <​code>​ <​code>​
Line 38: Line 53:
 </​code>​ </​code>​
  
-We should develop some kind of policy ​to run jobs on ace-gpu-1 so that we don't inadvertently ruin other peoples'​ processes. +This has been found to work only to a certain extent, and when there are several ​jobs that use a significant amount of the GPU resources, jobs can still be ruined even when using the above code
 ===== GPU Info ===== ===== GPU Info =====
  
Line 70: Line 84:
 nsight nsight
 </​code>​ </​code>​
 +
 +Nvidia Visual Profiler (https://​developer.nvidia.com/​nvidia-visual-profiler) would be useful for GPU monitoring if we had X visualization,​ but we do not:
 +<​code>​
 +/​usr/​local/​cuda/​bin/​nvvp
 +</​code>​
 +
  
 ===== GPU Accounting ===== ===== GPU Accounting =====
Line 155: Line 175:
  
 * [[http://​docs.nvidia.com/​deploy/​driver-persistence/​index.html#​persistence-mode]] * [[http://​docs.nvidia.com/​deploy/​driver-persistence/​index.html#​persistence-mode]]
 +
 * [[http://​docs.nvidia.com/​deploy/​driver-persistence/​index.html#​persistence-daemon]] * [[http://​docs.nvidia.com/​deploy/​driver-persistence/​index.html#​persistence-daemon]]
 ===== Deep Learning ===== ===== Deep Learning =====
gpu_resources.1493997920.txt.gz · Last modified: 2017/05/05 15:25 by csteel