====== GPU Resources ======

This is a collaborative resource, please improve it. Log in using your MCIN user name and ID and add your discoveries.

===== Items of Interest / for Discussion? =====

==== Resources ====

  * [[https://developer.nvidia.com/openacc/3-steps-to-more-science|OpenACC Tutorial - Steps to More Science]] "Here are three simple steps to start accelerating your code with GPUs. We will be using the PGI OpenACC compiler for C, C++, FORTRAN, along with tools from the PGI Community Edition."
  * [[https://devblogs.nvidia.com/parallelforall/performance-portability-gpus-cpus-openacc/|Performance Portability from GPUs to CPUs with OpenACC]] "...performance on multicore CPUs for HPC apps using MPI + OpenACC is equivalent to MPI + OpenMP code. Compiling and running the same code on a Tesla K80 GPU can provide large speedups."
  * [[http://www.nvidia.com/object/data-center-managment-tools.html|Data Center Management Tools]]
    * The GPU Deployment Kit
    * Ganglia
    * Slurm
    * NVIDIA Docker
    * Others???

===== Preventing Job Clobbering =====

There are currently three GPUs in ace-gpu-1. To select one of them (0, 1 or 2), set the CUDA_VISIBLE_DEVICES environment variable. You can do this by adding the following line to your ~/.bash_profile file on ace-gpu-1, where X is either 0, 1 or 2:

<code>
export CUDA_VISIBLE_DEVICES=X
</code>

This only takes effect when you log in, so log out and back in, then run the following to make sure it worked:

<code>
echo $CUDA_VISIBLE_DEVICES
</code>

If it prints the ID you selected, you are ready to use the GPU.

==== Sharing a single GPU ====

To configure TensorFlow so that it does not pre-allocate all of the GPU memory, you can use the following Python code:

<code>
import tensorflow as tf
from keras import backend as K

# configure TensorFlow to not grab all the GPU memory up front
config = tf.ConfigProto(allow_soft_placement=True)
config.gpu_options.allow_growth = True
session = tf.Session(config=config)
K.set_session(session)
</code>

This only works to a certain extent: when several jobs each use a significant share of the GPU's resources, jobs can still fail even with the code above.
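If you prefer to pin the GPU per script rather than per login, you can also set CUDA_VISIBLE_DEVICES from inside Python, provided it happens before TensorFlow initializes CUDA (i.e. before the first TensorFlow import). Here is a minimal sketch combining this with the memory options above; the 0.3 memory fraction is just an example value, adjust it to your share of the card:

<code>
# pin this process to a single GPU; must run before the first
# TensorFlow import, otherwise the setting is ignored
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "2"  # use GPU 2 only

import tensorflow as tf
from keras import backend as K

config = tf.ConfigProto(allow_soft_placement=True)
config.gpu_options.allow_growth = True
# optional hard cap: claim at most ~30% of the visible GPU's memory
config.gpu_options.per_process_gpu_memory_fraction = 0.3
K.set_session(tf.Session(config=config))
</code>

Unlike allow_growth alone, the hard cap bounds each job's footprint up front, which makes sharing a single card more predictable.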
===== GPU Info =====

For CPU and GPU usage:

<code>
glances
</code>

Other info:

<code>
nvcc -V
</code>
<code>
nvidia-smi
</code>
<code>
lspci -vnn | grep VGA -A 12
</code>
<code>
dpkg -l | grep -i nvidia
</code>
<code>
ssh -X ace-gpu-1 nsight
</code>

The [[https://developer.nvidia.com/nvidia-visual-profiler|Nvidia Visual Profiler]] would be useful for GPU monitoring if we had X visualization, but we do not:

<code>
/usr/local/cuda/bin/nvvp
</code>

===== GPU Accounting =====

SysAdmins: to enable Accounting mode:

<code>
sudo nvidia-smi -i 0 -am ENABLED
</code>

Users: to check whether Accounting mode is enabled or disabled:

<code>
nvidia-smi -i 0 -q -d ACCOUNTING
</code>

Example output:

<code>
==============NVSMI LOG==============

Timestamp                       : Thu Apr 27 09:09:50 2017
Driver Version                  : 375.39

Attached GPUs                   : 1
GPU 0000:01:00.0
    Accounting Mode             : Enabled
    Accounting Mode Buffer Size : 1920
    Accounted Processes
        Process ID              : 15819
            GPU Utilization     : 100 %
            Memory Utilization  : 6 %
            Max memory usage    : 187 MiB
            Time                : 3769 ms
            Is Running          : 0
...
</code>

Users: to check GPU stats per process:

<code>
nvidia-smi -i 0 --query-accounted-apps=gpu_name,pid,gpu_util,max_memory_usage,time --format=csv
</code>

Example output:

<code>
gpu_name, pid, gpu_utilization [%], max_memory_usage [MiB], time [ms]
TITAN X (Pascal), 15819, 100 %, 187 MiB, 3769 ms
TITAN X (Pascal), 15633, 87 %, 8465 MiB, 200626 ms
TITAN X (Pascal), 15944, 0 %, 153 MiB, 382 ms
TITAN X (Pascal), 16000, 0 %, 155 MiB, 299 ms
TITAN X (Pascal), 15862, 80 %, 8465 MiB, 215039 ms
TITAN X (Pascal), 15842, 41 %, 425 MiB, 721223 ms
TITAN X (Pascal), 16294, 74 %, 8465 MiB, 231517 ms
TITAN X (Pascal), 16436, 70 %, 10425 MiB, 229470 ms
TITAN X (Pascal), 16118, 40 %, 155 MiB, 1310156 ms
TITAN X (Pascal), 16908, 72 %, 8465 MiB, 511122 ms
TITAN X (Pascal), 17102, 73 %, 8465 MiB, 833806 ms
TITAN X (Pascal), 17900, 0 %, 153 MiB, 358 ms
TITAN X (Pascal), 18018, 0 %, 153 MiB, 235 ms
TITAN X (Pascal), 17632, 75 %, 8465 MiB, 823193 ms
TITAN X (Pascal), 18376, 74 %, 8529 MiB, 827336 ms
TITAN X (Pascal), 18637, 74 %, 8465 MiB, 547161 ms
TITAN X (Pascal), 16377, 54 %, 153 MiB, 0 ms
TITAN X (Pascal), 18752, 55 %, 8465 MiB, 0 ms
</code>

Users: Accounting help:

<code>
nvidia-smi --help-query-accounted-apps
</code>
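The CSV output above is straightforward to post-process. The following is a minimal, untested sketch that shells out to nvidia-smi and prints one summary line per accounted process; noheader and nounits are standard --format options that strip the header row and the %/MiB/ms suffixes:

<code>
import csv
import subprocess

# query per-process accounting records from GPU 0 as bare CSV
cmd = ["nvidia-smi", "-i", "0",
       "--query-accounted-apps=pid,gpu_util,max_memory_usage,time",
       "--format=csv,noheader,nounits"]
rows = subprocess.check_output(cmd).decode().splitlines()

for pid, gpu_util, max_mem, time_ms in csv.reader(rows):
    print("pid %-6s  gpu %3s %%  peak %5s MiB  %8s ms"
          % (pid.strip(), gpu_util.strip(), max_mem.strip(), time_ms.strip()))
</code>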
==== nvidia-smi flags used ====

<code>
-i,  --id=                Target a specific GPU.
-am, --accounting-mode=   Enable or disable Accounting Mode: 0/DISABLED, 1/ENABLED
-q,  --query              Display GPU or Unit info.
-d,  --display=           Display only selected information: MEMORY, UTILIZATION,
                          ECC, TEMPERATURE, POWER, CLOCK, COMPUTE, PIDS,
                          PERFORMANCE, SUPPORTED_CLOCKS, PAGE_RETIREMENT,
                          ACCOUNTING. Flags can be combined with a comma,
                          e.g. ECC,POWER. Sampling data with max/min/avg is
                          also returned for POWER, UTILIZATION and CLOCK
                          display types. Doesn't work with -u or -x flags.
</code>

  * [[http://docs.nvidia.com/deploy/driver-persistence/index.html#persistence-mode]]
  * [[http://docs.nvidia.com/deploy/driver-persistence/index.html#persistence-daemon]]

===== Deep Learning =====

  * [[https://docs.google.com/document/d/1pNXD06fHx-NnnFjz10hTZTaeVKXCfDsgl2CGt2iMWYc/edit|Deep Learning Notes]]

===== Freesurfer =====

  * [[https://surfer.nmr.mgh.harvard.edu/fswiki/SystemRequirements]]
  * [[https://surfer.nmr.mgh.harvard.edu/fswiki/DevelopersGuide]]

FreeSurfer 6.0 can be built with CUDA (as well as OpenMP) support, but we have had issues compiling FreeSurfer with CUDA in the recent past. FreeSurfer no longer actively supports GPU/CUDA, as its CUDA support is permanently stuck in the past on version 5.0.35.

===== Nvidia-Docker =====

Request to install on ACE-GPU-1 so that we can use nvidia-docker. Requirements:

  * Docker >= 1.9 (official docker-engine only)
  * NVIDIA drivers >= 340.29 with the binary nvidia-modprobe

Why:

  * Nvidia-Docker is officially supported by NVIDIA.
  * It allows GPU applications to be containerized.
  * Containers built using this tool should run on both ACE-GPU-1 and Guillimin.
  * Official GitHub site for the project: https://github.com/NVIDIA/nvidia-docker
  * Requirements page for installation: https://github.com/NVIDIA/nvidia-docker/wiki/Installation

Status:

  * [[https://huia.cbrain.mcgill.ca/glpi/index.php?redirect=ticket_386&noAUTO=1|GLPI Ticket 386]]
  * For discussion at the next IT Team Meeting.

===== OpenACC =====

  * [[http://www.openacc.org/|openacc.org]]

OpenACC directives are complementary to, and interoperate with, existing HPC programming models including OpenMP, MPI, and **CUDA**. The directives and programming model defined in the OpenACC API document allow programmers to create high-level host+accelerator programs without the need to explicitly initialize the accelerator, manage data or program transfers between the host and accelerator, or initiate accelerator startup and shutdown. The OpenACC Application Program Interface describes a collection of compiler directives to specify loops and regions of code in standard C, C++ and Fortran to be offloaded from a host CPU to an attached accelerator. OpenACC is designed for portability across operating systems, host CPUs, and a wide range of accelerators, including APUs, GPUs, and many-core coprocessors.

Status:

  * [[https://huia.cbrain.mcgill.ca/glpi/front/ticket.form.php?id=387|GLPI Ticket 387]]
  * For discussion at the next IT Team Meeting.

===== Sun Grid Engine - SGE =====

  * [[http://serverfault.com/questions/322073/howto-set-up-sge-for-cuda-devices|Howto set up SGE for CUDA devices?]]
  * [[http://gridscheduler.sourceforge.net/howto/loadsensor.html|Setting Up A Load Sensor in Grid Engine]] (a rough sketch follows at the end of this section)
  * [[https://github.com/kyamagu/sge-gpuprolog|kyamagu/sge-gpuprolog - Scripts to manage NVIDIA GPU devices in SGE 6.2u5]]
  * [[https://github.com/mozhgan-kch/sge-gpuprolog|mozhgan-kch/sge-gpuprolog - fork of kyamagu/sge-gpuprolog]]
  * [[http://marc.info/?l=npaci-rocks-discussion&m=132872224919575&w=2|Rocks-Discuss - Grid Engine GPU load sensor]]
  * [[https://wikis.nyu.edu/display/NYUHPC/Tutorial+-+Submitting+a+job+using+qsub|Tutorial - Submitting a job using qsub]]

Status:

  * ???
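To make the load-sensor idea linked above concrete, here is a rough, untested Python sketch. The complex name gpu.free is hypothetical (it would first have to be defined with qconf), and "no compute process on the GPU" is just one possible idleness heuristic:

<code>
#!/usr/bin/env python
# Rough sketch of an SGE load sensor that reports idle GPUs as the
# (hypothetical) host complex "gpu.free". Follows the standard load
# sensor protocol: block on stdin, emit a begin/.../end report per
# request, exit on "quit".
import socket
import subprocess
import sys

HOST = socket.gethostname()

def count_free_gpus():
    # UUIDs of all GPUs on this host
    all_gpus = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=uuid", "--format=csv,noheader"]
    ).decode().split()
    # UUIDs of GPUs that currently run a compute process (assumed busy)
    busy = subprocess.check_output(
        ["nvidia-smi", "--query-compute-apps=gpu_uuid", "--format=csv,noheader"]
    ).decode().split()
    return len(set(all_gpus) - set(busy))

while True:
    line = sys.stdin.readline()
    if not line or line.strip() == "quit":
        break
    print("begin")
    print("%s:gpu.free:%d" % (HOST, count_free_gpus()))
    print("end")
    sys.stdout.flush()
</code>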
