青青草原综合久久大伊人导航_色综合久久天天综合_日日噜噜夜夜狠狠久久丁香五月_热久久这里只有精品

C++ Coder

HCP高性能計算架構,實現,編譯器指令優化,算法優化, LLVM CLANG OpenCL CUDA OpenACC C++AMP OpenMP MPI

C++博客 首頁 新隨筆 聯系 聚合 管理
  98 Posts :: 0 Stories :: 0 Comments :: 0 Trackbacks
http://devgurus.amd.com/thread/158866

VLIW on Cypress and vector addition

此問題被 假設已回答。

cadorinoNewbie
cadorino 2012-7-2 上午10:31

Hi to everybody.
I'm thinking about VLIW utilization on a 5870 HD.

Suppose you have the following kernel:

 

__kernel void saxpy(const __global float * x, __global float * y, const float a)

{

          uint guid = get_global_id(0);

                    y[guid] = a * x[guid] + y[guid];

}

 

Each work item operates on a single vector element and no vectorization (float4).
Is the compiler still capable of packing instructions to exploit the 4 ALUs of each processing element?

Is there any tool to determine the way instructions are packed into VLIW?

 

Thank you very much!

  • 633 瀏覽次數
  • Re: VLIW on Cypress and vector addition
    MicahVillmowModerator
    MicahVillmow 2012-7-2 上午10:44 (回復 cadorino)

    The compiler only packs the VLIW with computation within a work-item, not across work-items. Multicore ware has some software that will allow you to pack across work-items, which can be found here:http://www.multicorewareinc.com/index.php?option=com_content&view=article&id=71&Itemid=106

    Micah Villmow
    Advanced Micro Devices Inc.
    --------------------------------
    The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied.
    • Re: VLIW on Cypress and vector addition
      cadorinoNewbie
      cadorino 2012-7-2 上午11:18 (回復 MicahVillmow)

      Ok, so what if there are not enough alu instructions per work-item to be packed to exploit the 4 ALUs?

      • Re: VLIW on Cypress and vector addition
        MicahVillmowModerator
        MicahVillmow 2012-7-2 上午11:27 (回復 cadorino)

        Then your program is not utilizing the entire machine and cannot reach peak efficiency.

        Micah Villmow
        Advanced Micro Devices Inc.
        --------------------------------
        The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied.
        • Re: VLIW on Cypress and vector addition
          cadorinoNewbie
          cadorino 2012-7-2 下午1:00 (回復 MicahVillmow)

          Ok, so it is possible for a PE to get a VLIW containing less than 4 operations, and in this case some ALUs are idle. Right?

        • Re: VLIW on Cypress and vector addition
          cadorinoNewbie
          cadorino 2012-7-2 下午1:00 (回復 MicahVillmow)

          Ok, so it is possible for a PE to get a VLIW containing less than 4 operations, and in this case some ALUs are idle. Right?

          • Re: VLIW on Cypress and vector addition
            realhetNovice
            realhet 2012-7-3 上午3:05 (回復 cadorino)

            Peak efficiency is only when all your 4 or 5 ALUs have work to do in every cycles.

            If you disassemble your code (use kernel analyzer), you can easily spot the idle ALUs.

             

            for example:

            10  x: SUB_INT     T0.x,  PV9.z,  KC0[2].x

                       w: SETGE_UINT  ____,  PV9.z,  KC0[2].x    ;y,z,t sleeps

            11  z: AND_INT     ____,  T0.y,  PV10.w              ;x,y,w,t sleeps

            12  y: CNDE_INT    T1.y,  PV11.z,  T0.z,  T0.x     ;x,z,w,t sleeps

            13  x: ADD_INT     ____,  KC0[2].x,  PV12.y        ;y,z,w,t sleeps

            this is so unoptimal that is does only 5 operations under 4 clocks, the possible maximum would be 4*5=20 operations (on vliw5)

             

            826  x: XOR_INT     T1.x,  R28.w,  T0.w

                         y: SETGT_UINT  ____,  T1.x,  T0.w

                         z: XOR_INT     T3.z,  KC0[13].z,  R20.y      VEC_021

                         w: SETGT_UINT  T2.w,  T2.w,  R15.y      VEC_201

                         t: SETGT_UINT  T0.w,  R9.x,  T1.y

            827  x: ADD_INT     ____,  T0.z,  T2.z

                         y: ADD_INT     T2.y,  T0.y,  T2.x      VEC_021

                         z: ADD_INT     T0.z,  T1.z,  T2.y      VEC_210

                         w: ADD_INT     ____,  PV826.y,  T3.y      VEC_021

                         t: SETGT_UINT  ____,  T3.w,  R5.x

            this one is maximum utilization. 10 operationc in 2 clocks.

             

            There are tricks to improve local paralellism in code (other than simply vectorizing everything) like breaking dependency chains:

            for example  a+b+c+d  -> (a+b)+(c+d)

            • Re: VLIW on Cypress and vector addition
              nouExpert
              nou 2012-7-3 上午4:06 (回復 realhet)

              or you can proceed 2-4 work items in one work item. like this. but it has disadvantage as it increase register usage.

              __kernel void saxpy(const __global float * x, __global float * y, const float a)

              {

                        uint guid = get_global_id(0);

                             for(int i=0;i<4;i++)

                                  y[guid*4+1] = a * x[guid*4+i] + y[guid*4+i];

              }

  • Re: VLIW on Cypress and vector addition
    kbraffordNovice
    kbrafford 2012-7-3 下午12:24 (回復 cadorino)

    If he rewrote the kernel to use float4 types, would that also cause MMX/SSE registers to be used with compiled for a CPU device?

     

    __kernel void saxpy(const __global float4 * x, __global float4 * y, const float a)

    {

              uint guid = get_global_id(0);

     

              y[guid] = (float4)a * x[guid] + y[guid];

    }

    • Re: VLIW on Cypress and vector addition
      nouExpert
      nou 2012-7-3 下午1:58 (回復 kbrafford)

      yes you need use explicit vector types to utilize SSE instructions on CPU with AMD OpenCL

      • Re: VLIW on Cypress and vector addition
        cadorinoNewbie
        cadorino 2012-7-3 下午7:48 (回復 nou)

        Thank you very much you all, your answers are really useful!
        I already wrote a float4 version of the kernel, but I posted the float version since I'm trying to get into details of low-level aspects of VLIW compilation and execution.

         

        For what regards vectorized types, I agree that on CPU this enables SSE execution. Bu what for GPUs? For example, is the sum of two float4 elements spread across the 4 ALUs or the sum of each component is executed sequentially on a single ALU?

        • Re: VLIW on Cypress and vector addition
          nouExpert
          nou 2012-7-3 下午11:34 (回復 cadorino)

          I am not entirely sure what you mean with single ALU. but one workitem is executed only on one 5D/4D unit. and you will get packed instruction if is there enough independent instructions.

        • Re: VLIW on Cypress and vector addition
          realhetNovice
          realhet 2012-7-4 上午1:26 (回復 cadorino)

          On hd6xxx there are 4 physical ALUs for each workitems. The compiler will schedule operations for each of the ALUs on every single clocks. Also the compiler must ensure that there are no data dependency across the ALUs (eg. ALU x cannot use the result from ALU y in a single clock).

          In our case [Very Large Instruction Word] means that one instruction contains at most 4 subinstructions for each of the four ALUs.

           

          If it's not complicated enough: on the HD4xxx,5xxx there is a fifth ALU which handles the complicated instructions. So 4 ALUs can do simple math like mul, add, and the 5th can handle special things like cos().

           

          SSE is SIMD. It means one instruction will do the same operation on 4 different datas. SIMD can interpreted as a special case of VLIW where all ALU's have to do the same operation on data packed into vectors (eg. float4)

           

          GCN architecture dropped VLIW. On that a simple sequential code which even contains long dependency chains will do fine. So there is no need to have 4 or 5 independent execution paths in your algo, but for maximum utilization you'll have to feed it with 4x more workitems.

           

          (you know, the example code here is rather theoretical: Its bottleneck is memory IO, all the ALUs are sleeping and waiting for the memory units. Also the hardest ALU calculation is not the a*b+c (1 mad instruction) but get_global_id(0) (modulo/rangecheck/add operations) and address calculations for the 3 indirectly addressed buffers.)

          • Re: VLIW on Cypress and vector addition
            cadorinoNewbie
            cadorino 2012-7-4 下午3:01 (回復 realhet)

            Great answer! Thank you 
            So, summarizing, the grouping of instructions may allow to exploit all the 4/5 ALUs and this exploitation depends on the program and on the compiler. 

          • Re: VLIW on Cypress and vector addition
            bridgmanNewbie
            bridgman 2012-7-4 下午7:31 (回復 realhet)

            >>On hd6xxx there are 4 physical ALUs for each workitems.

             

            It's actually only the hd69xx (and Trinity) which use VLIW4. The rest of the hd6xxx family uses VLIW5, similar to the earlier parts.

             

            The GCN parts (hd77xx and higher) use 4 scalar SIMDs in a CU rather than 1 VLIW4 SIMD.

            The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied.
  • Re: VLIW on Cypress and vector addition
    cadorinoNewbie
    cadorino 2012-7-5 上午6:46 (回復 cadorino)

    Great topic. I would like to ask you another thing without starting another topic.
    Given a disassembled kernel like:

     

    00 ALU: ADDR(32) CNT(11) KCACHE0(CB0:0-15) KCACHE1(CB1:0-15)

          0  t: MULLO_INT   ____,  R1.x,  KC0[1].x     

          1  x: ADD_INT     ____,  R0.x,  PS0     

          2  w: ADD_INT     ____,  PV1.x,  KC0[6].x     

          3  z: LSHL        ____,  PV2.w,  2     

          4  y: ADD_INT     ____,  KC1[0].x,  PV3.z     

             z: ADD_INT     ____,  KC1[1].x,  PV3.z     

          5  x: LSHR        R2.x,  PV4.z,  2     

             y: LSHR        R0.y,  PV4.z,  2     

             w: LSHR        R0.w,  PV4.y,  2     

    01 TEX: ADDR(48) CNT(2)

          6  VFETCH R0.x___, R0.w, fc153  MEGA(4)

             FETCH_TYPE(NO_INDEX_OFFSET)

          7  VFETCH R1.x___, R0.y, fc153  MEGA(4)

             FETCH_TYPE(NO_INDEX_OFFSET)

    02 ALU: ADDR(43) CNT(1) KCACHE0(CB1:0-15)

          8  x: MULADD_e    R0.x,  KC0[2].x,  R0.x,  R1.x     

    03 MEM_RAT_CACHELESS_STORE_RAW: RAT(11)[R2].x___, R0, ARRAY_SIZE(4)  MARK  VPM

    END_OF_PROGRAM

     

    can I assume that each instruction 0 - 8 is executed in a clock cycle?

    In the example I've 9 instructions. Can I say that they take 9 cycles to get executed? (obviously, after the fetch clause the wavefront will be switched off until data is available, but I'm referring exclusively to executing instructions, ignoring the time spent in waiting for memory accesses to complete)

    • Re: VLIW on Cypress and vector addition
      realhetNovice
      realhet 2012-7-5 上午7:51 (回復 cadorino)

      yes, the first ALU clause takes 6 core clock cycles and the second takes one.

      But these are relatively small clauses interleaved with memory clauses so there are lots of penalties at the transitions of the clauses.

      Ideally alu clauses can hold 128 slots but these are small ones.

       

      Lets say you have 2 wavefronts A,B and C assigned to a compute unit:

      ALU               memory unit

      A: 00 ALU       idle

      B: 00 ALU       A: 01 TEX

      C: 00 ALU       A: 01 TEX still (it's slow operation compared to small ALU stuff)

      A: 02 ALU       B: 01 TEX

      idle                 B: 01 TEX still

      B: 02 ALU       A: 03 MEM

      idle                 C: 01 TEX

      idle                 C: 01 TEX still

      C: 02 ALU       B: 03 MEM

      idle                 C: 03 MEM

       

      (oups maybe the memory output unit is separated from the texture unit but it's only an illustration of how the different parts of a compute unit can work in paralell)

      (and the compiler did a good job compiling that MULADD into one instruction)

posted on 2013-01-09 16:37 jackdong 閱讀(512) 評論(0)  編輯 收藏 引用 所屬分類: OpenCL
青青草原综合久久大伊人导航_色综合久久天天综合_日日噜噜夜夜狠狠久久丁香五月_热久久这里只有精品
  • <ins id="pjuwb"></ins>
    <blockquote id="pjuwb"><pre id="pjuwb"></pre></blockquote>
    <noscript id="pjuwb"></noscript>
          <sup id="pjuwb"><pre id="pjuwb"></pre></sup>
            <dd id="pjuwb"></dd>
            <abbr id="pjuwb"></abbr>
            欧美激情一区| 亚洲欧洲日本mm| 国产精品久久看| 欧美搞黄网站| 欧美日本一区| 国产日韩精品电影| 国产精品日本欧美一区二区三区| 欧美精品一卡| 欧美大片免费观看| 欧美视频在线观看一区| 国产精品国产三级欧美二区| 久久精品国产亚洲a| 久久国产福利| 欧美福利专区| 国产精品久久久久久久久久久久 | 久久欧美中文字幕| 噜噜噜噜噜久久久久久91| 欧美激情欧美狂野欧美精品| 99国产精品私拍| 久久久亚洲欧洲日产国码αv| 欧美激情精品久久久六区热门 | 久久久五月天| 亚洲靠逼com| 噜噜噜躁狠狠躁狠狠精品视频 | 欧美一区二区三区四区高清| 欧美日本国产一区| 国模套图日韩精品一区二区| 亚洲女与黑人做爰| 亚洲免费观看| 欧美日韩国产成人在线免费| 亚洲日本中文| 亚洲人成毛片在线播放女女| 久久另类ts人妖一区二区| 国产欧美日韩精品在线| 午夜久久久久| 亚洲综合视频在线| 国产综合婷婷| 欧美成人中文字幕| 欧美激情在线狂野欧美精品| 亚洲精品中文字幕在线| 亚洲日本一区二区三区| 欧美日韩亚洲成人| 欧美日韩精品一区二区天天拍小说| 亚洲精品免费在线播放| 99热在这里有精品免费| 国产精品美女久久久久aⅴ国产馆| 亚洲一二三四区| 欧美一区二区私人影院日本| 依依成人综合视频| 亚洲人成人一区二区在线观看 | 精品999成人| 亚洲美女网站| 红桃视频成人| 亚洲午夜国产一区99re久久| 国产一区欧美| 亚洲福利视频三区| 国产精品久久影院| 欧美aⅴ99久久黑人专区| 欧美日韩亚洲综合一区| 开元免费观看欧美电视剧网站| 欧美日韩免费视频| 美女尤物久久精品| 国产亚洲成人一区| 一区二区三区国产| 在线观看欧美亚洲| 欧美一区二区免费观在线| 亚洲免费观看在线观看| 久久国产色av| 亚洲免费视频观看| 欧美日韩国产色视频| 蜜臀99久久精品久久久久久软件| 国产精品日韩欧美综合| 亚洲高清在线观看| 亚洲三级毛片| 欧美激情1区| 日韩一区二区久久| 一区二区高清在线观看| 亚洲精品国产精品国自产在线| 狠狠色伊人亚洲综合网站色| 亚洲欧美三级伦理| 久久一日本道色综合久久| 国产精品亚洲激情| 久久久999国产| 欧美国产三区| 国产精品99久久不卡二区| 欧美日韩1234| 久久精品一区二区三区不卡牛牛 | 校园春色综合网| 欧美99在线视频观看| 日韩一本二本av| 国产精品区二区三区日本| 久久岛国电影| 亚洲午夜视频| 美日韩精品免费观看视频| 亚洲精品免费一区二区三区| 欧美日本不卡高清| 久久久久久亚洲精品不卡4k岛国| 欧美黑人一区二区三区| 欧美专区第一页| 在线午夜精品| 最近中文字幕mv在线一区二区三区四区 | 国产精品久久久久久影院8一贰佰| 亚洲午夜电影| 99国产精品视频免费观看一公开| 亚洲欧美中日韩| 日韩一级免费| 亚洲经典视频在线观看| 国产精品视频免费| 国产精品高精视频免费| 欧美精品一区二区三区一线天视频| 亚洲一区二区三区乱码aⅴ蜜桃女| 亚洲激情成人| 亚洲人www| 在线视频成人| 伊人久久婷婷色综合98网| 国产精品久久久久一区二区| 欧美精品在线免费| 欧美精品在线观看| 欧美精品一区三区| 欧美网站在线观看| 国产精品拍天天在线| 国产欧美亚洲日本| 国产精品一香蕉国产线看观看| 国产精品国产馆在线真实露脸| 国产精品久久一区主播| 国产欧美精品xxxx另类| 好看的av在线不卡观看| 在线日韩欧美视频| 一区二区久久久久| 性欧美xxxx大乳国产app| 久久久久久久网站| 亚洲国产精品美女| 亚洲午夜三级在线| 美女主播视频一区| 欧美精品二区三区四区免费看视频| 欧美人交a欧美精品| 国产精品久久久久一区二区| 亚洲国产高清视频| 亚洲欧美日韩天堂| 亚洲福利电影| 久久精品国产精品亚洲| 欧美喷水视频| 激情av一区二区| 亚洲欧美激情一区| 亚洲国产精品第一区二区三区| 一本久久综合| 欧美精品久久99| 在线观看欧美| 久久免费黄色| 亚洲欧美日韩网| 欧美激情第一页xxx| 亚洲精品视频在线观看免费| 欧美一二三区精品| 欧美亚州一区二区三区 | 亚洲精品一二区| 久久久久久伊人| 国产综合久久久久久鬼色| 日韩视频一区二区在线观看 | 中文精品视频| 久久婷婷蜜乳一本欲蜜臀| 国产一区导航| 欧美在线免费播放| 亚洲亚洲精品三区日韩精品在线视频 | 国产精品一区=区| 亚洲一区二区三区精品视频| 亚洲国产精品免费| 欧美成人综合网站| 在线亚洲激情| 亚洲欧美国产精品专区久久| 国产亚洲精品激情久久| 老牛影视一区二区三区| 米奇777超碰欧美日韩亚洲| 亚洲激情在线观看视频免费| 亚洲人成久久| 国产在线精品一区二区夜色| 六月婷婷久久| 欧美四级在线观看| 久久久久五月天| 欧美美女福利视频| 久久久噜噜噜久久中文字免 | 快she精品国产999| 亚洲一区观看| 久久在线视频在线| 久久9热精品视频| 欧美女同在线视频| 久久精品最新地址| 国产精品亚洲аv天堂网| 免费观看欧美在线视频的网站| 欧美日韩在线视频首页| 久久夜色精品国产欧美乱| 国产精品亚洲а∨天堂免在线| 欧美激情a∨在线视频播放| 中国av一区| 99精品欧美一区| 久久亚洲综合网| 久久精品99国产精品| 国产精品videosex极品| 亚洲福利视频免费观看| 国产专区精品视频| 性做久久久久久久久|