
C++ Coder

HPC (high-performance computing) architecture, implementation, compiler instruction optimization, algorithm optimization, LLVM CLANG OpenCL CUDA OpenACC C++AMP OpenMP MPI

http://devgurus.amd.com/thread/159558

Understanding performance counters

This question is assumed answered.

chersanya (Newbie), 2012-8-5 12:03 PM

I have a kernel in which each work-item processes tens of elements (first some computation, then a global memory read and write). The profiler has helped a lot in optimizing it, but I want to go further. The profiling data now looks like this (almost the same for all kernel runs):

 

  Counter             Value
  GlobalWorkSize      126720
  WorkGroupSize       256
  VGPRs               13
  FCStacks            2
  ALUInsts            6787.63
  FetchInsts          52.47
  WriteInsts          26.47
  ALUBusy             98.31
  ALUFetchRatio       129.37
  ALUPacking          72.16
  FetchSize           411503.38
  CacheHit            0.09
  FetchUnitBusy       89.44
  FetchUnitStalled    93.86
  WriteUnitStalled    0.00
  FastPath            9.19
  CompletePath        28.91
  PathUtilization     30.52

 

LDS is not used at all, and kernel occupancy is 50% (VGPR-limited, 16 waves).

I can't understand several points here:

  • What exactly does the FCStacks value mean? I have only one loop (a for loop) and no if statements, yet its value is 2.
  • How can ALUBusy be 98% with an ALUPacking of only 72%? As I read ALUPacking, not all VLIWs are completely filled, so ALUBusy shouldn't be so close to 100%.
  • FetchUnitStalled > FetchUnitBusy, although the documentation says that FetchUnitBusy includes the stalled time. How is that possible?

 

And how can I improve ALUPacking toward 100%?
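
A few of the quoted counters can be cross-checked against each other. The sketch below is not from the thread: it assumes that ALUFetchRatio is simply ALUInsts divided by FetchInsts, and it encodes the manual's statement that FetchUnitBusy includes stall time as the condition stalled <= busy. The function names are made up for illustration.

```c
/* Ratio of ALU instructions to fetch instructions, recomputed from the
   two raw counters (assumed definition of ALUFetchRatio). */
double alu_fetch_ratio(double alu_insts, double fetch_insts) {
    return alu_insts / fetch_insts;
}

/* Number of work-groups launched for a given global and work-group size. */
unsigned work_groups(unsigned global_size, unsigned group_size) {
    return global_size / group_size;
}

/* If FetchUnitBusy really includes the stalled time, the stalled
   percentage cannot exceed the busy percentage.
   Returns 1 when the pair is consistent with that reading, 0 otherwise. */
int fetch_counters_consistent(double busy_pct, double stalled_pct) {
    return stalled_pct <= busy_pct;
}
```

With the values above, alu_fetch_ratio(6787.63, 52.47) lands within rounding of the reported 129.37, work_groups(126720, 256) gives 495, and fetch_counters_consistent(89.44, 93.86) returns 0, which is exactly the inconsistency raised in the third question.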

  • Understanding performance counters
    nou (Expert), 2012-8-6 6:18 AM (in reply to chersanya)

    The ALU is counted as busy even with purely scalar code. ALU packing at 72% is quite high. You can try putting the work-item code into a static loop, for(int i=0;i<2;i++). The compiler will unroll it, and you get a quick and dirty way to vectorize the code.
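
A minimal sketch of the static-loop idea, written as plain C so it can run on a host. The compute function is a hypothetical stand-in for the real kernel body, and gid plays the role that get_global_id(0) would have in an OpenCL kernel; neither comes from the thread.

```c
#include <stddef.h>

/* Hypothetical stand-in for the real per-element computation. */
static float compute(float x) {
    return x * x + 1.0f;
}

/* Body of one work-item. The loop has a fixed trip count of 2, so a
   compiler is free to unroll it; the two iterations touch different
   elements and do not depend on each other, which gives a VLIW
   scheduler more independent operations to pack together. */
void work_item(const float *in, float *out, size_t gid) {
    for (int i = 0; i < 2; i++) {
        size_t idx = 2 * gid + (size_t)i;
        out[idx] = compute(in[idx]);
    }
}
```

Because each work-item now handles two elements, the global work size would have to be halved to cover the same data. Whether this helps depends on register use, since the unrolled body keeps more values live at once.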

    • Re: Understanding performance counters
      chersanya (Newbie), 2012-8-6 7:57 AM (in reply to nou)

      I already have a static loop in the kernel, but unrolling it (either with #pragma or manually) leads to poorer performance, probably because of registers, though I'm not sure.

      And what exactly is the ALUPacking value? I think it's the average of (used VLIW instructions)/(available VLIW instructions, i.e. 5 in the case of VLIW5), but that is just speculation.

      • Re: Understanding performance counters
        nou (Expert), 2012-8-6 2:43 PM (in reply to chersanya)

        From the profiler manual: ALUBusy - The percentage of GPUTime ALU instructions are processed.

        And on ALUPacking: The ALU vector packing efficiency (in percentage). This value indicates how well the Shader Compiler packs the scalar or vector ALU in your kernel to the 5-way VLIW instructions. Value range: 0% (bad) to 100% (optimal). Values below 70 percent indicate that ALU dependency chains may be preventing full utilization of the processor.

        • Re: Understanding performance counters
          chersanya (Newbie), 2012-8-6 3:22 PM (in reply to nou)

          Yes, I've read this.

          But "how well (the Shader Compiler packs...)" is not a very concrete description of a counter. That's why I'm asking whether my guess is right (ALUPacking = [used VLIW instructions]/[available VLIW instructions, i.e. 5 in the case of VLIW5]).

        • Re: Understanding performance counters
          binying (Novice), 2012-8-7 8:34 AM (in reply to nou)
          FCStacks: The size of the flow control stack used by the kernel (valid only for AMD Radeon HD 6000 series GPU devices or older). This number may affect the number of wavefronts in-flight. To reduce the stack size, reduce the amount of flow control nesting in the kernel.

          This is from the profiler manual. Note that it is valid only for HD 6000 or older.

          • Re: Understanding performance counters
            binying (Novice), 2012-8-7 8:58 AM (in reply to binying)

            ALUBusy measures the percentage of GPU time ALU instructions are processed. There are many reasons for a low ALUBusy number, for example, not enough active wavefronts to hide instruction latency, or heavy memory access.

            Code divergence can be measured with the VALUUtilization counter if you have SI hardware.

             

            http://devgurus.amd.com/thread/158655

             

            ALUPacking measures the ALU vector packing efficiency (in percentage). This value indicates how well the Shader Compiler packs the scalar or vector ALU in your kernel to the 5-way VLIW instructions. Value range: 0% (bad) to 100% (optimal). Values below 70 percent indicate that ALU dependency chains may be preventing full utilization of the processor.

             

            So I think it makes sense that an ALUBusy of 98% and a low ALUPacking (72%) occur at the same time.
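
Why the two counters can diverge is easier to see in a toy model. The sketch below is an assumption-laden illustration, not the profiler's actual definition: it treats each array entry as one clock cycle of a 5-slot VLIW ALU, counts a cycle as busy whenever any bundle is issued, and uses chersanya's guess (filled slots over available slots) for packing.

```c
#define VLIW_WIDTH 5

/* slots[c] is the number of VLIW slots filled in cycle c:
   0 means the ALU was idle, 1..5 means a bundle was issued. */

/* Percentage of cycles in which any bundle was issued. */
double alu_busy_pct(const int *slots, int cycles) {
    int busy = 0;
    for (int c = 0; c < cycles; c++) {
        if (slots[c] > 0) {
            busy++;
        }
    }
    return 100.0 * busy / cycles;
}

/* Percentage of available slots that were filled, over issued bundles only. */
double alu_packing_pct(const int *slots, int cycles) {
    int bundles = 0;
    int used = 0;
    for (int c = 0; c < cycles; c++) {
        if (slots[c] > 0) {
            bundles++;
            used += slots[c];
        }
    }
    return bundles ? 100.0 * used / (VLIW_WIDTH * bundles) : 0.0;
}
```

In this model, a ten-cycle trace with six 4-slot bundles and four 3-slot bundles is busy 100% of the time while its packing is only 72%: the ALU never idles, but each bundle leaves one or two slots empty.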

            • Re: Understanding performance counters
              binying (Novice), 2012-8-7 9:00 AM (in reply to binying)
              FetchUnitBusy: The percentage of GPUTime the Fetch unit is active. The result includes the stall time (FetchUnitStalled). This is measured with all extra fetches and any cache or memory effects taken into account. Value range: 0% to 100% (fetch-bound).
              FetchUnitStalled: The percentage of GPUTime the Fetch unit is stalled. Try reducing the number of fetches or reducing the amount per fetch if possible. Value range: 0% (optimal) to 100% (bad).
              • Re: Understanding performance counters
                binying (Novice), 2012-8-7 9:07 AM (in reply to binying)

                How to improve ALUPacking?

                 

                Use int4/float4/etc. for memory accesses and element-wise operations, as this is the type of workload a graphics card is optimized for, both in memory access and in ALU load.

                 

                Try to avoid bank conflicts across the device...
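
The float4 advice can be sketched in plain C. OpenCL C has float4 as a built-in type; the float4_t struct below is only a host-side stand-in for it, and the multiply-add operation and all function names are hypothetical, chosen to show four independent lanes handled per load and store.

```c
#include <stddef.h>

/* Plain-C stand-in for OpenCL's built-in float4 type. */
typedef struct {
    float x, y, z, w;
} float4_t;

/* Element-wise a * b + c on all four lanes. The lanes do not depend on
   each other, which is what lets a VLIW compiler place them in the
   same bundle. */
float4_t mad4(float4_t a, float4_t b, float4_t c) {
    float4_t r;
    r.x = a.x * b.x + c.x;
    r.y = a.y * b.y + c.y;
    r.z = a.z * b.z + c.z;
    r.w = a.w * b.w + c.w;
    return r;
}

/* One work-item: a single four-float load, four independent
   multiply-adds, and a single four-float store. */
void work_item4(const float4_t *in, float4_t *out, size_t gid,
                float4_t scale, float4_t bias) {
    out[gid] = mad4(in[gid], scale, bias);
}
```

In an actual kernel the same body would operate on float4 values directly, with the buffer declared as a float4 pointer so that each work-item issues one wide fetch instead of four narrow ones.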

                • Re: Understanding performance counters
                  chersanya (Newbie), 2012-8-7 10:53 AM (in reply to binying)

                  To be honest, your answers didn't give any new information: everything is already in the manual. But the manual is not 100% clear. For example, look at the FetchUnitBusy and FetchUnitStalled counters: according to the documentation, Stalled time can't be greater than Busy time, but it is.

                  • Re: Understanding performance counters
                    binying (Novice), 2012-8-7 10:54 PM (in reply to chersanya)

                    ALUBusy is the % of time the ALU is actually executing.

                     

                    ALUPacking is the percentage of code that has been successfully packed into VLIW. The compiler takes scalar code and generates vector or VLIW code for the hardware. Sometimes code cannot be vectorized, which results in lower performance, i.e. a lower ALUPacking.

                     

                    To improve ALUPacking, you could also reduce conditional statements, for loops, etc. so that the compiler can vectorize more easily. But I think it is very difficult to reach 100% ALUPacking.

                     

                    As for the FetchUnitBusy and FetchUnitStalled counters, I would speculate that their relationship is something like that of ALUBusy and ALUPacking.

Posted on 2013-01-09 13:36 by jackdong. Category: OpenCL