
C++ Coder

HPC (high-performance computing) architecture, implementation, compiler instruction optimization, algorithm optimization, LLVM CLANG OpenCL CUDA OpenACC C++AMP OpenMP MPI

http://devgurus.amd.com/thread/159558

Understanding performance counters

This question is assumed answered.

chersanya (Newbie)
chersanya 2012-8-5 12:03 PM

I have a kernel in which each work-item processes tens of elements (first some computation, then a global memory read and write). The profiler has helped me a lot in optimizing it, but I want to go further. The profiling data now looks like this (almost the same for all kernel runs):

 

  1. GlobalWorkSize     126720  
  2. WorkGroupSize     256  
  3. VGPRs     13  
  4. FCStacks     2  
  5. ALUInsts     6787.63  
  6. FetchInsts     52.47  
  7. WriteInsts     26.47  
  8. ALUBusy     98.31  
  9. ALUFetchRatio     129.37  
  10. ALUPacking     72.16  
  11. FetchSize     411503.38  
  12. CacheHit     0.09  
  13. FetchUnitBusy     89.44  
  14. FetchUnitStalled     93.86  
  15. WriteUnitStalled 0.00  
  16. FastPath     9.19  
  17. CompletePath     28.91  
  18. PathUtilization     30.52  

 

LDS is not used at all, and kernel occupancy is 50% (VGPR-limited, 16 waves).

I can't understand several points here:

  • What exactly does the FCStacks value mean? I have only one loop (a for loop) and no if statements, yet its value is 2.
  • How can ALUBusy be 98% when ALUPacking is low (72%)? Judging by ALUPacking, not all VLIW slots are filled, so ALUBusy shouldn't be so close to 100%.
  • FetchUnitStalled > FetchUnitBusy, although the documentation says that FetchUnitBusy includes the stalled time. How is that possible?

 

And how can I bring ALUPacking up to 100%?
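As a quick sanity check on the table above, a few numbers can be derived from the posted counters. The sketch below uses Python purely for illustration. It assumes a wavefront size of 64 work-items and that ALUFetchRatio is simply ALUInsts divided by FetchInsts; neither assumption is stated in the thread, and the names `derive` and `WAVEFRONT_SIZE` are made up for this example.

```python
# Counter values copied from the profiler output posted above.
counters = {
    "GlobalWorkSize": 126720,
    "WorkGroupSize": 256,
    "ALUInsts": 6787.63,
    "FetchInsts": 52.47,
    "ALUFetchRatio": 129.37,
}

# Assumption: 64 work-items per wavefront (not stated in the thread).
WAVEFRONT_SIZE = 64


def derive(c, wavefront_size=WAVEFRONT_SIZE):
    """Derive launch geometry and the ALU-to-fetch ratio from raw counters."""
    work_groups = c["GlobalWorkSize"] // c["WorkGroupSize"]
    waves_per_group = c["WorkGroupSize"] // wavefront_size
    return {
        "work_groups": work_groups,
        "wavefronts_per_group": waves_per_group,
        "total_wavefronts": work_groups * waves_per_group,
        # Assumption: the reported ratio is ALUInsts / FetchInsts.
        "alu_per_fetch": c["ALUInsts"] / c["FetchInsts"],
    }


if __name__ == "__main__":
    for name, value in derive(counters).items():
        print(name, value)
```

With the posted values, the computed ratio agrees closely with the reported ALUFetchRatio of 129.37, which is consistent with the assumed definition.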

  • Understanding performance counters
    nou (Expert)
    nou 2012-8-6 6:18 AM (in reply to chersanya)

    The ALU is counted as busy even with pure scalar code. ALU packing at 72% is quite high. You can try putting the work-item code into a static loop, for(int i=0;i<2;i++). The compiler will unroll it, which gives you a quick and dirty way to vectorize the code.
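To see why processing two independent elements per work-item can raise packing, here is a toy model in Python. It is not how the AMD shader compiler works: it assumes a 5-slot bundle in which any operation fits any slot, and it greedily packs operations whose inputs were produced in an earlier bundle. All names (`pack`, `packing`, `chain`) are hypothetical.

```python
VLIW_WIDTH = 5  # slots per bundle on VLIW5 hardware


def pack(deps):
    """Greedily pack operations into bundles.

    deps maps each operation name to the list of operations it depends on.
    An operation may be issued only after all its dependencies have been
    issued in an earlier bundle.
    """
    done = set()
    remaining = list(deps)
    bundles = []
    while remaining:
        ready = [op for op in remaining if all(d in done for d in deps[op])]
        if not ready:
            raise ValueError("dependency cycle")
        bundle = ready[:VLIW_WIDTH]
        bundles.append(bundle)
        done.update(bundle)
        remaining = [op for op in remaining if op not in done]
    return bundles


def packing(bundles):
    """Fraction of slots filled across all bundles."""
    return sum(len(b) for b in bundles) / (VLIW_WIDTH * len(bundles))


def chain(prefix, n):
    """A serial dependency chain: each operation needs the previous one."""
    return {f"{prefix}{i}": ([f"{prefix}{i - 1}"] if i else []) for i in range(n)}


if __name__ == "__main__":
    one_element = chain("a", 4)
    two_elements = {**chain("a", 4), **chain("b", 4)}  # the "unrolled" loop
    print(packing(pack(one_element)))
    print(packing(pack(two_elements)))
```

In this model a single serial chain fills one slot per bundle, while two independent chains fill two, so packing doubles without adding bundles. The real compiler has slot restrictions and register limits that this sketch ignores.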

    • Re: Understanding performance counters
      chersanya (Newbie)
      chersanya 2012-8-6 7:57 AM (in reply to nou)

      I already have a static loop in the kernel, but unrolling it (either with #pragma or manually) leads to worse performance, probably because of register usage, but I'm not sure.

      And what exactly is the ALUPacking value? I think it's the average of (used VLIW instructions)/(available VLIW instructions, i.e. 5 in the case of VLIW5), but that is just speculation.

      • Re: Understanding performance counters
        nou (Expert)
        nou 2012-8-6 2:43 PM (in reply to chersanya)

        From the profiler manual: ALUBusy - The percentage of GPUTime ALU instructions are processed.

        And on ALUPacking: The ALU vector packing efficiency (in percentage). This value indicates how well the Shader Compiler packs the scalar or vector ALU in your kernel to the 5-way VLIW instructions. Value range: 0% (bad) to 100% (optimal). Values below 70 percent indicate that ALU dependency chains may be preventing full utilization of the processor.

        • Re: Understanding performance counters
          chersanya (Newbie)
          chersanya 2012-8-6 3:22 PM (in reply to nou)

          Yes, I've read this.

          But "how well (the Shader Compiler packs...)" is not a very concrete description of a counter. That's why I'm asking whether my guess is right (ALUPacking = [used VLIW instructions]/[available VLIW instructions, i.e. 5 in the case of VLIW5]).
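If chersanya's guess is right (the thread never confirms it), the counter can be read as an average slot count. The Python sketch below only restates that guess as arithmetic; `packing_percent` and `average_slots` are names invented for this example.

```python
VLIW_WIDTH = 5  # slots per bundle on VLIW5


def packing_percent(slots_used_per_bundle):
    """ALUPacking under the guessed definition: used slots / available slots."""
    used = sum(slots_used_per_bundle)
    available = VLIW_WIDTH * len(slots_used_per_bundle)
    return 100.0 * used / available


def average_slots(packing_pct):
    """Average number of filled slots per bundle implied by a packing value."""
    return packing_pct / 100.0 * VLIW_WIDTH


if __name__ == "__main__":
    # Four hypothetical bundles with 5, 5, 3 and 1 slots filled.
    print(packing_percent([5, 5, 3, 1]))
    # The 72.16% reported in the question.
    print(average_slots(72.16))
```

Under this reading, the reported 72.16% would correspond to roughly 3.6 of 5 slots filled per bundle on average.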

        • Re: Understanding performance counters
          binying (Novice)
          binying 2012-8-7 8:34 AM (in reply to nou)
          FCStacks: The size of the flow control stack used by the kernel (valid only for AMD Radeon HD 6000 series GPU devices or older). This number may affect the number of wavefronts in-flight. To reduce the stack size, reduce the amount of flow control nesting in the kernel.

          This is from the profiler manual. Note that it is valid only for HD 6000 or older.

          • Re: Understanding performance counters
            binying (Novice)
            binying 2012-8-7 8:58 AM (in reply to binying)

            ALUBusy measures the percentage of GPU time ALU instructions are processed. There are many reasons for a low ALUBusy number, for example, not enough active wavefronts to hide instruction latency, or heavy memory access.

            Code divergence can be measured with the VALUUtilization counter if you have SI hardware.

             

            http://devgurus.amd.com/thread/158655

             

            ALUPacking measures the ALU vector packing efficiency (in percentage). This value indicates how well the Shader Compiler packs the scalar or vector ALU in your kernel to the 5-way VLIW instructions. Value range: 0% (bad) to 100% (optimal). Values below 70 percent indicate that ALU dependency chains may be preventing full utilization of the processor.

             

            So I think it makes sense that an ALUBusy of 98% and a low ALUPacking (72%) occur at the same time.

            • Re: Understanding performance counters
              binying (Novice)
              binying 2012-8-7 9:00 AM (in reply to binying)
              FetchUnitBusy: The percentage of GPUTime the Fetch unit is active. The result includes the stall time (FetchUnitStalled). This is measured with all extra fetches and any cache or memory effects taken into account. Value range: 0% to 100% (fetch-bound).
              FetchUnitStalled: The percentage of GPUTime the Fetch unit is stalled. Try reducing the number of fetches or reducing the amount per fetch if possible. Value range: 0% (optimal) to 100% (bad).
              • Re: Understanding performance counters
                binying (Novice)
                binying 2012-8-7 9:07 AM (in reply to binying)

                How to improve ALUPacking?

                 

                Use int4/float4/etc. for memory accesses and total element operations, as this is the type of workload a graphics card is optimized for, in both memory access and ALU load.

                 

                Try to avoid bank conflicts across the device...

                • Re: Understanding performance counters
                  chersanya (Newbie)
                  chersanya 2012-8-7 10:53 AM (in reply to binying)

                  To be honest, your answers didn't give any new information: everything is already in the manual. But the manual is not 100% clear. For example, look at the FetchUnitBusy and FetchUnitStalled counters: according to the documentation, Stalled time can't be greater than Busy time, but here it is.
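The contradiction can be stated as a one-line check. The Python sketch below encodes only what the manual's wording implies (stalled time is part of busy time); the function name is hypothetical, and the thread does not explain why the profiler reports values that violate it.

```python
def fetch_counters_consistent(busy, stalled):
    """True if the stalled percentage does not exceed the busy percentage.

    This follows from the manual's statement that FetchUnitBusy
    includes the stall time (FetchUnitStalled).
    """
    return 0.0 <= stalled <= busy <= 100.0


if __name__ == "__main__":
    # Values from the profiler output in the question.
    print(fetch_counters_consistent(busy=89.44, stalled=93.86))
```

For the posted values (busy 89.44, stalled 93.86) the check fails, which is exactly the inconsistency being asked about.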

                  • Re: Understanding performance counters
                    binying (Novice)
                    binying 2012-8-7 10:54 PM (in reply to chersanya)

                    ALUBusy is the % of time the ALU is actually executing.

                     

                    ALUPacking is the percentage of code that has been successfully packed into VLIW instructions. The compiler takes scalar code and generates vector or VLIW code for the hardware. Sometimes code cannot be vectorized, which results in lower performance and shows up as a lower ALUPacking value.

                     

                    To improve ALUPacking, you could also reduce conditional statements, for loops, etc. so that the compiler can vectorize more easily. But I think it is very difficult to reach 100% ALUPacking.

                     

                    As for the FetchUnitBusy and FetchUnitStalled counters, I would speculate that their relationship is something like that of ALUBusy and ALUPacking.

posted on 2013-01-09 13:36 by jackdong. Views (536), Comments (0). Category: OpenCL