C++ Coder

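HPC (high-performance computing) architecture, implementation, compiler directive optimization, algorithm optimization: LLVM, Clang, OpenCL, CUDA, OpenACC, C++ AMP, OpenMP, MPI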

Understanding performance counters

http://devgurus.amd.com/thread/159558

This question is marked as assumed answered.

chersanya 2012-8-5 12:03 PM

I have a kernel in which each work-item processes tens of elements (first some computation, then a global memory read and write). The profiler has helped me a lot in optimizing it, but I want to go further. The profiling data now looks like this (almost the same for all kernel runs):

GlobalWorkSize 126720
WorkGroupSize 256
VGPRs 13
FCStacks 2
ALUInsts 6787.63
FetchInsts 52.47
WriteInsts 26.47
ALUBusy 98.31
ALUFetchRatio 129.37
ALUPacking 72.16
FetchSize 411503.38
CacheHit 0.09
FetchUnitBusy 89.44
FetchUnitStalled 93.86
WriteUnitStalled 0.00
FastPath 9.19
CompletePath 28.91
PathUtilization 30.52

LDS is not used at all; kernel occupancy is 50% (limited by VGPRs to 16 waves).

I can't understand several points here:

What exactly does the FCStacks value mean? I have only one loop (a for) and no if statements, yet its value is 2.
How can ALUBusy be 98% when ALUPacking is this low (72%)? Judging by ALUPacking, not all VLIW slots are filled, so ALUBusy shouldn't be so close to 100%.
FetchUnitStalled > FetchUnitBusy, although the documentation says FetchUnitBusy includes the stalled time. How is that possible?

And how can I improve ALUPacking up to 100%?
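
As a sanity check on the posted profile, a few of the counters can be related by simple arithmetic. The sketch below is not from the thread: it assumes a wavefront of 64 work-items and that ALUFetchRatio is ALUInsts divided by FetchInsts. The function names are illustrative only.

```cpp
// Values copied from the profile posted above.
constexpr int kGlobalWorkSize = 126720;
constexpr int kWorkGroupSize = 256;
constexpr double kAluInsts = 6787.63;
constexpr double kFetchInsts = 52.47;

// Assumption: 64 work-items per wavefront (not stated in the thread).
constexpr int kWavefrontSize = 64;

// Number of work-groups needed to cover the global size (rounded up).
constexpr int work_groups(int global_size, int group_size) {
    return (global_size + group_size - 1) / group_size;
}

// Number of wavefronts that make up one work-group (rounded up).
constexpr int wavefronts_per_group(int group_size, int wavefront_size) {
    return (group_size + wavefront_size - 1) / wavefront_size;
}

// Assumed definition of ALUFetchRatio: ALU instructions per fetch instruction.
constexpr double alu_fetch_ratio(double alu_insts, double fetch_insts) {
    return alu_insts / fetch_insts;
}
```

Under these assumptions the posted values give 495 work-groups of 4 wavefronts each and a ratio of about 129.36, close to the reported ALUFetchRatio of 129.37.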

Understanding performance counters

nou 2012-8-6 6:18 AM (in reply to chersanya)

The ALU is counted as busy even with purely scalar code. ALU packing at 72% is quite high. You can try putting the work-item code into a static loop, for(int i=0;i<2;i++). The compiler will unroll it, and you get a quick and dirty way to vectorize the code.
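
The static-loop suggestion can be sketched on the CPU side as follows. This is a plain C++ analogue, not OpenCL kernel code and not the poster's kernel: `square_plus_one`, `work_item` and `demo_covers_all_elements` are hypothetical names, and the per-element math is a placeholder.

```cpp
#include <cstddef>

// Hypothetical per-element computation standing in for the real kernel math.
constexpr float square_plus_one(float x) { return x * x + 1.0f; }

// Elements handled by one emulated work-item; a compile-time constant so the
// loop below has a fixed trip count that a compiler can fully unroll.
constexpr std::size_t kElemsPerItem = 2;

// Emulates the body of the work-item with index `item`.
// The iterations do not depend on each other, which is what leaves room for
// placing their instructions side by side once the loop is unrolled.
constexpr void work_item(std::size_t item, const float* in, float* out) {
    for (std::size_t i = 0; i < kElemsPerItem; ++i) {
        const std::size_t idx = item * kElemsPerItem + i;
        out[idx] = square_plus_one(in[idx]);
    }
}

// Self-check: two emulated work-items must cover all four elements.
constexpr bool demo_covers_all_elements() {
    const float in[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float out[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    for (std::size_t item = 0; item < 2; ++item) work_item(item, in, out);
    for (std::size_t i = 0; i < 4; ++i) {
        if (out[i] != square_plus_one(in[i])) return false;
    }
    return true;
}
```

With this layout the number of work-items drops by a factor of `kElemsPerItem`, so the global work size would have to shrink accordingly. Whether the extra registers pay off depends on the kernel.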

Re: Understanding performance counters

chersanya 2012-8-6 7:57 AM (in reply to nou)

I already have a static loop in the kernel, but unrolling it (either with #pragma or manually) leads to poorer performance, probably because of register usage, but I'm not sure.

And what exactly is the ALUPacking value? I think it's the average of (used VLIW slots)/(available VLIW slots, i.e. 5 in the case of VLIW5), but that is just speculation.

Re: Understanding performance counters

nou 2012-8-6 2:43 PM (in reply to chersanya)

From the profiler manual: ALUBusy - The percentage of GPUTime ALU instructions are processed.

And for ALUPacking - The ALU vector packing efficiency (in percentage). This value indicates how well the Shader Compiler packs the scalar or vector ALU in your kernel to the 5-way VLIW instructions. Value range: 0% (bad) to 100% (optimal). Values below 70 percent indicate that ALU dependency chains may be preventing full utilization of the processor.

Re: Understanding performance counters
chersanya 2012-8-6 3:22 PM (in reply to nou)
Yes, I've read this.
But "how well (the Shader Compiler packs...)" is not a very concrete description of a counter. That's why I'm asking whether my guess is right (ALUPacking = [used VLIW slots]/[available VLIW slots, i.e. 5 in the case of VLIW5]).
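
The guess above can be written down as a small calculation. This is only the poster's speculated formula, not a confirmed definition of the counter, and the bundle streams are made-up examples.

```cpp
// Assumption from the post: 5 ALU slots per VLIW bundle (VLIW5).
constexpr int kVliwWidth = 5;

// slots_used[i] is the number of occupied ALU slots in bundle i (0..kVliwWidth).
// Returns occupied slots as a percentage of all available slots.
constexpr double alu_packing_percent(const int* slots_used, int bundle_count) {
    if (bundle_count <= 0) return 0.0;
    int used = 0;
    for (int i = 0; i < bundle_count; ++i) used += slots_used[i];
    return 100.0 * used / (static_cast<double>(bundle_count) * kVliwWidth);
}

// Hypothetical bundle streams, for illustration only.
constexpr int kMixedBundles[5] = {5, 4, 3, 4, 2};  // 18 of 25 slots filled
constexpr int kScalarBundles[4] = {1, 1, 1, 1};    // one slot per bundle
```

Under this guess the mixed stream comes out at 72% and the purely scalar stream at 20%, even though in both cases every bundle still occupies the ALU for an issue.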

Re: Understanding performance counters

binying 2012-8-7 8:34 AM (in reply to nou)

FCStacks

The size of the flow control stack used by the kernel (valid only for AMD Radeon HD 6000 series GPU devices or older). This number may affect the number of wavefronts in-flight. To reduce the stack size, reduce the amount of flow control nesting in the kernel.

This is from the profiler manual. Note that it is valid only for the HD 6000 series or older.

Re: Understanding performance counters

binying 2012-8-7 8:58 AM (in reply to binying)

ALUBusy measures the percentage of GPU time ALU instructions are processed. There are many reasons for a low ALUBusy number, for example, not enough active wavefront to hide instruction latency or heavy memory access.

Code divergence can be measured with VALUUtilization counter if you have SI hardware.

http://devgurus.amd.com/thread/158655

ALUPacking measures the ALU vector packing efficiency (in percentage). This value indicates how well the Shader Compiler packs the scalar or vector ALU in your kernel to the 5-way VLIW instructions. Value range: 0% (bad) to 100% (optimal). Values below 70 percent indicate that ALU dependency chains may be preventing full utilization of the processor.

So I think it makes sense that ALUBusy 98% and low ALUPacking (72%) occur at the same time.
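
The manual's remark about dependency chains can be illustrated with a toy packer. The sketch below is a simplified model, not the Shader Compiler's algorithm: it assumes every operation has at most one producer, ignores slot-type restrictions, and greedily places each operation in the earliest bundle that has a free slot and comes after its producer. All names are hypothetical.

```cpp
constexpr int kVliwWidth = 5;  // slots per bundle (VLIW5)
constexpr int kMaxOps = 64;    // capacity of this toy model

// dep[i] is the index of the operation that op i waits for, or -1 if none.
// Operations must be listed so that dep[i] < i. Returns the number of
// bundles used, or -1 if the input violates the constraints above.
constexpr int bundles_needed(const int* dep, int op_count) {
    if (op_count < 0 || op_count > kMaxOps) return -1;
    int bundle_of[kMaxOps] = {};  // bundle assigned to each op
    int fill[kMaxOps] = {};       // occupied slots per bundle
    int bundle_count = 0;
    for (int i = 0; i < op_count; ++i) {
        if (dep[i] >= i) return -1;
        // Earliest legal bundle: the one after the producer, or bundle 0.
        int b = (dep[i] < 0) ? 0 : bundle_of[dep[i]] + 1;
        while (fill[b] == kVliwWidth) ++b;  // skip bundles that are full
        bundle_of[i] = b;
        ++fill[b];
        if (b + 1 > bundle_count) bundle_count = b + 1;
    }
    return bundle_count;
}

// Occupied slots as a percentage of the slots in all bundles used.
constexpr double packing_percent(const int* dep, int op_count) {
    const int bundles = bundles_needed(dep, op_count);
    if (bundles <= 0) return 0.0;
    return 100.0 * op_count / (static_cast<double>(bundles) * kVliwWidth);
}

// Ten ops each: fully independent, one long chain, two chains of five.
constexpr int kIndependent[10] = {-1, -1, -1, -1, -1, -1, -1, -1, -1, -1};
constexpr int kChain[10] = {-1, 0, 1, 2, 3, 4, 5, 6, 7, 8};
constexpr int kTwoChains[10] = {-1, 0, 1, 2, 3, -1, 5, 6, 7, 8};
```

In this model the independent operations fill two bundles completely (100%), the single chain needs ten bundles (20%), and two interleaved chains need five (40%). In all three cases the ALU issues a bundle on every step, which is consistent with a high ALUBusy alongside a lower ALUPacking.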

Re: Understanding performance counters

binying 2012-8-7 9:00 AM (in reply to binying)

FetchUnitBusy: The percentage of GPUTime the Fetch unit is active. The result includes the stall time (FetchUnitStalled). This is measured with all extra fetches and any cache or memory effects taken into account. Value range: 0% to 100% (fetch-bound).
FetchUnitStalled: The percentage of GPUTime the Fetch unit is stalled. Try reducing the number of fetches or reducing the amount per fetch if possible. Value range: 0% (optimal) to 100% (bad).
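
Taken literally, the two definitions imply that the stalled percentage can never exceed the busy percentage. The small check below makes that reading explicit and applies it to the values from the original post; the struct and function names are illustrative, and the thread itself does not explain the discrepancy.

```cpp
// The two fetch counters, both as percentages of GPU time.
struct FetchCounters {
    double busy_percent;
    double stalled_percent;
};

// Literal reading of the manual: busy time includes stall time,
// so stalled <= busy is expected.
constexpr bool consistent_with_manual(const FetchCounters& c) {
    return c.stalled_percent <= c.busy_percent;
}

// Share of time the unit would be active without being stalled.
// A negative value means the counters contradict the literal reading.
constexpr double active_not_stalled(const FetchCounters& c) {
    return c.busy_percent - c.stalled_percent;
}

// Values from the profile in the original post.
constexpr FetchCounters kPosted{89.44, 93.86};
```

For the posted values the check fails and the difference is negative, which is exactly the inconsistency the original question points out.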

Re: Understanding performance counters
binying 2012-8-7 9:07 AM (in reply to binying)
How to improve ALUPacking?

Use int4/float4/etc. for memory accesses and element-wise operations, since this is the type of workload a graphics card is optimized for, in both memory access and ALU load.

Try to avoid bank conflicts across the device...
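
The float4 advice can be pictured with a plain C++ stand-in. `Float4` here is an ordinary struct, not OpenCL's built-in type, and `scale_add` is a hypothetical operation chosen only to show four independent lanes.

```cpp
// Plain C++ stand-in for a four-component vector type such as float4.
struct Float4 {
    float x, y, z, w;
};

// Element-wise a * s + b. The four lanes do not depend on each other,
// so a VLIW compiler has four independent operations to place per call.
constexpr Float4 scale_add(Float4 a, float s, Float4 b) {
    return Float4{a.x * s + b.x, a.y * s + b.y, a.z * s + b.z, a.w * s + b.w};
}
```

In an actual kernel the same shape would also turn four scalar loads and stores into one vector load and one vector store per work-item.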
- Re: Understanding performance counters
  chersanya 2012-8-7 10:53 AM (in reply to binying)
  To be honest, your answers didn't give any new information: everything is already in the manual. But the manual is not 100% clear. For example, look at the FetchUnitBusy and FetchUnitStalled counters: according to the documentation, Stalled time can't be greater than Busy, but it is.
  - Re: Understanding performance counters
    binying 2012-8-7 10:54 PM (in reply to chersanya)
    ALUBusy is the percentage of time the ALU is actually executing.
    
    ALUPacking is the percentage of code that has been successfully packed into VLIW instructions. The compiler takes scalar code and generates vector or VLIW code for the hardware. Sometimes code cannot be vectorized, which results in lower performance, i.e. lower ALUPacking.
    
    To improve ALUPacking, you could also reduce conditional statements, for loops, etc. so that the compiler can vectorize more easily. But I think it is very difficult to reach 100% ALUPacking.
    
    As for the FetchUnitBusy and FetchUnitStalled counters, I would speculate that their relationship is something like that of ALUBusy and ALUPacking.

Posted on 2013-01-09 13:36 by jackdong | Views (536) | Comments (0) | Category: OpenCL
