青青草原综合久久大伊人导航_色综合久久天天综合_日日噜噜夜夜狠狠久久丁香五月_热久久这里只有精品

C++ Coder

HCP高性能計(jì)算架構(gòu),實(shí)現(xiàn),編譯器指令優(yōu)化,算法優(yōu)化, LLVM CLANG OpenCL CUDA OpenACC C++AMP OpenMP MPI

C++博客 首頁(yè) 新隨筆 聯(lián)系 聚合 管理
  98 Posts :: 0 Stories :: 0 Comments :: 0 Trackbacks
http://devgurus.amd.com/thread/158866

VLIW on Cypress and vector addition

此問(wèn)題被 假設(shè)已回答。

cadorinoNewbie
cadorino 2012-7-2 上午10:31

Hi to everybody.
I'm thinking about VLIW utilization on a 5870 HD.

Suppose you have the following kernel:

 

__kernel void saxpy(const __global float * x, __global float * y, const float a)

{

          uint guid = get_global_id(0);

                    y[guid] = a * x[guid] + y[guid];

}

 

Each work item operates on a single vector element and no vectorization (float4).
Is the compiler still capable of packing instructions to exploit the 4 ALUs of each processing element?

Is there any tool to determine the way instructions are packed into VLIW?

 

Thank you very much!

  • 633 瀏覽次數(shù)
  • Re: VLIW on Cypress and vector addition
    MicahVillmowModerator
    MicahVillmow 2012-7-2 上午10:44 (回復(fù) cadorino)

    The compiler only packs the VLIW with computation within a work-item, not across work-items. Multicore ware has some software that will allow you to pack across work-items, which can be found here:http://www.multicorewareinc.com/index.php?option=com_content&view=article&id=71&Itemid=106

    Micah Villmow
    Advanced Micro Devices Inc.
    --------------------------------
    The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied.
    • Re: VLIW on Cypress and vector addition
      cadorinoNewbie
      cadorino 2012-7-2 上午11:18 (回復(fù) MicahVillmow)

      Ok, so what if there are not enough alu instructions per work-item to be packed to exploit the 4 ALUs?

      • Re: VLIW on Cypress and vector addition
        MicahVillmowModerator
        MicahVillmow 2012-7-2 上午11:27 (回復(fù) cadorino)

        Then your program is not utilizing the entire machine and cannot reach peak efficiency.

        Micah Villmow
        Advanced Micro Devices Inc.
        --------------------------------
        The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied.
        • Re: VLIW on Cypress and vector addition
          cadorinoNewbie
          cadorino 2012-7-2 下午1:00 (回復(fù) MicahVillmow)

          Ok, so it is possible for a PE to get a VLIW containing less than 4 operations, and in this case some ALUs are idle. Right?

        • Re: VLIW on Cypress and vector addition
          cadorinoNewbie
          cadorino 2012-7-2 下午1:00 (回復(fù) MicahVillmow)

          Ok, so it is possible for a PE to get a VLIW containing less than 4 operations, and in this case some ALUs are idle. Right?

          • Re: VLIW on Cypress and vector addition
            realhetNovice
            realhet 2012-7-3 上午3:05 (回復(fù) cadorino)

            Peak efficiency is only when all your 4 or 5 ALUs have work to do in every cycles.

            If you disassemble your code (use kernel analyzer), you can easily spot the idle ALUs.

             

            for example:

            10  x: SUB_INT     T0.x,  PV9.z,  KC0[2].x

                       w: SETGE_UINT  ____,  PV9.z,  KC0[2].x    ;y,z,t sleeps

            11  z: AND_INT     ____,  T0.y,  PV10.w              ;x,y,w,t sleeps

            12  y: CNDE_INT    T1.y,  PV11.z,  T0.z,  T0.x     ;x,z,w,t sleeps

            13  x: ADD_INT     ____,  KC0[2].x,  PV12.y        ;y,z,w,t sleeps

            this is so unoptimal that is does only 5 operations under 4 clocks, the possible maximum would be 4*5=20 operations (on vliw5)

             

            826  x: XOR_INT     T1.x,  R28.w,  T0.w

                         y: SETGT_UINT  ____,  T1.x,  T0.w

                         z: XOR_INT     T3.z,  KC0[13].z,  R20.y      VEC_021

                         w: SETGT_UINT  T2.w,  T2.w,  R15.y      VEC_201

                         t: SETGT_UINT  T0.w,  R9.x,  T1.y

            827  x: ADD_INT     ____,  T0.z,  T2.z

                         y: ADD_INT     T2.y,  T0.y,  T2.x      VEC_021

                         z: ADD_INT     T0.z,  T1.z,  T2.y      VEC_210

                         w: ADD_INT     ____,  PV826.y,  T3.y      VEC_021

                         t: SETGT_UINT  ____,  T3.w,  R5.x

            this one is maximum utilization. 10 operationc in 2 clocks.

             

            There are tricks to improve local paralellism in code (other than simply vectorizing everything) like breaking dependency chains:

            for example  a+b+c+d  -> (a+b)+(c+d)

            • Re: VLIW on Cypress and vector addition
              nouExpert
              nou 2012-7-3 上午4:06 (回復(fù) realhet)

              or you can proceed 2-4 work items in one work item. like this. but it has disadvantage as it increase register usage.

              __kernel void saxpy(const __global float * x, __global float * y, const float a)

              {

                        uint guid = get_global_id(0);

                             for(int i=0;i<4;i++)

                                  y[guid*4+1] = a * x[guid*4+i] + y[guid*4+i];

              }

  • Re: VLIW on Cypress and vector addition
    kbraffordNovice
    kbrafford 2012-7-3 下午12:24 (回復(fù) cadorino)

    If he rewrote the kernel to use float4 types, would that also cause MMX/SSE registers to be used with compiled for a CPU device?

     

    __kernel void saxpy(const __global float4 * x, __global float4 * y, const float a)

    {

              uint guid = get_global_id(0);

     

              y[guid] = (float4)a * x[guid] + y[guid];

    }

    • Re: VLIW on Cypress and vector addition
      nouExpert
      nou 2012-7-3 下午1:58 (回復(fù) kbrafford)

      yes you need use explicit vector types to utilize SSE instructions on CPU with AMD OpenCL

      • Re: VLIW on Cypress and vector addition
        cadorinoNewbie
        cadorino 2012-7-3 下午7:48 (回復(fù) nou)

        Thank you very much you all, your answers are really useful!
        I already wrote a float4 version of the kernel, but I posted the float version since I'm trying to get into details of low-level aspects of VLIW compilation and execution.

         

        For what regards vectorized types, I agree that on CPU this enables SSE execution. Bu what for GPUs? For example, is the sum of two float4 elements spread across the 4 ALUs or the sum of each component is executed sequentially on a single ALU?

        • Re: VLIW on Cypress and vector addition
          nouExpert
          nou 2012-7-3 下午11:34 (回復(fù) cadorino)

          I am not entirely sure what you mean with single ALU. but one workitem is executed only on one 5D/4D unit. and you will get packed instruction if is there enough independent instructions.

        • Re: VLIW on Cypress and vector addition
          realhetNovice
          realhet 2012-7-4 上午1:26 (回復(fù) cadorino)

          On hd6xxx there are 4 physical ALUs for each workitems. The compiler will schedule operations for each of the ALUs on every single clocks. Also the compiler must ensure that there are no data dependency across the ALUs (eg. ALU x cannot use the result from ALU y in a single clock).

          In our case [Very Large Instruction Word] means that one instruction contains at most 4 subinstructions for each of the four ALUs.

           

          If it's not complicated enough: on the HD4xxx,5xxx there is a fifth ALU which handles the complicated instructions. So 4 ALUs can do simple math like mul, add, and the 5th can handle special things like cos().

           

          SSE is SIMD. It means one instruction will do the same operation on 4 different datas. SIMD can interpreted as a special case of VLIW where all ALU's have to do the same operation on data packed into vectors (eg. float4)

           

          GCN architecture dropped VLIW. On that a simple sequential code which even contains long dependency chains will do fine. So there is no need to have 4 or 5 independent execution paths in your algo, but for maximum utilization you'll have to feed it with 4x more workitems.

           

          (you know, the example code here is rather theoretical: Its bottleneck is memory IO, all the ALUs are sleeping and waiting for the memory units. Also the hardest ALU calculation is not the a*b+c (1 mad instruction) but get_global_id(0) (modulo/rangecheck/add operations) and address calculations for the 3 indirectly addressed buffers.)

          • Re: VLIW on Cypress and vector addition
            cadorinoNewbie
            cadorino 2012-7-4 下午3:01 (回復(fù) realhet)

            Great answer! Thank you 
            So, summarizing, the grouping of instructions may allow to exploit all the 4/5 ALUs and this exploitation depends on the program and on the compiler. 

          • Re: VLIW on Cypress and vector addition
            bridgmanNewbie
            bridgman 2012-7-4 下午7:31 (回復(fù) realhet)

            >>On hd6xxx there are 4 physical ALUs for each workitems.

             

            It's actually only the hd69xx (and Trinity) which use VLIW4. The rest of the hd6xxx family uses VLIW5, similar to the earlier parts.

             

            The GCN parts (hd77xx and higher) use 4 scalar SIMDs in a CU rather than 1 VLIW4 SIMD.

            The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied.
  • Re: VLIW on Cypress and vector addition
    cadorinoNewbie
    cadorino 2012-7-5 上午6:46 (回復(fù) cadorino)

    Great topic. I would like to ask you another thing without starting another topic.
    Given a disassembled kernel like:

     

    00 ALU: ADDR(32) CNT(11) KCACHE0(CB0:0-15) KCACHE1(CB1:0-15)

          0  t: MULLO_INT   ____,  R1.x,  KC0[1].x     

          1  x: ADD_INT     ____,  R0.x,  PS0     

          2  w: ADD_INT     ____,  PV1.x,  KC0[6].x     

          3  z: LSHL        ____,  PV2.w,  2     

          4  y: ADD_INT     ____,  KC1[0].x,  PV3.z     

             z: ADD_INT     ____,  KC1[1].x,  PV3.z     

          5  x: LSHR        R2.x,  PV4.z,  2     

             y: LSHR        R0.y,  PV4.z,  2     

             w: LSHR        R0.w,  PV4.y,  2     

    01 TEX: ADDR(48) CNT(2)

          6  VFETCH R0.x___, R0.w, fc153  MEGA(4)

             FETCH_TYPE(NO_INDEX_OFFSET)

          7  VFETCH R1.x___, R0.y, fc153  MEGA(4)

             FETCH_TYPE(NO_INDEX_OFFSET)

    02 ALU: ADDR(43) CNT(1) KCACHE0(CB1:0-15)

          8  x: MULADD_e    R0.x,  KC0[2].x,  R0.x,  R1.x     

    03 MEM_RAT_CACHELESS_STORE_RAW: RAT(11)[R2].x___, R0, ARRAY_SIZE(4)  MARK  VPM

    END_OF_PROGRAM

     

    can I assume that each instruction 0 - 8 is executed in a clock cycle?

    In the example I've 9 instructions. Can I say that they take 9 cycles to get executed? (obviously, after the fetch clause the wavefront will be switched off until data is available, but I'm referring exclusively to executing instructions, ignoring the time spent in waiting for memory accesses to complete)

    • Re: VLIW on Cypress and vector addition
      realhetNovice
      realhet 2012-7-5 上午7:51 (回復(fù) cadorino)

      yes, the first ALU clause takes 6 core clock cycles and the second takes one.

      But these are relatively small clauses interleaved with memory clauses so there are lots of penalties at the transitions of the clauses.

      Ideally alu clauses can hold 128 slots but these are small ones.

       

      Lets say you have 2 wavefronts A,B and C assigned to a compute unit:

      ALU               memory unit

      A: 00 ALU       idle

      B: 00 ALU       A: 01 TEX

      C: 00 ALU       A: 01 TEX still (it's slow operation compared to small ALU stuff)

      A: 02 ALU       B: 01 TEX

      idle                 B: 01 TEX still

      B: 02 ALU       A: 03 MEM

      idle                 C: 01 TEX

      idle                 C: 01 TEX still

      C: 02 ALU       B: 03 MEM

      idle                 C: 03 MEM

       

      (oups maybe the memory output unit is separated from the texture unit but it's only an illustration of how the different parts of a compute unit can work in paralell)

      (and the compiler did a good job compiling that MULADD into one instruction)

posted on 2013-01-09 16:37 jackdong 閱讀(514) 評(píng)論(0)  編輯 收藏 引用 所屬分類(lèi): OpenCL
青青草原综合久久大伊人导航_色综合久久天天综合_日日噜噜夜夜狠狠久久丁香五月_热久久这里只有精品
  • <ins id="pjuwb"></ins>
    <blockquote id="pjuwb"><pre id="pjuwb"></pre></blockquote>
    <noscript id="pjuwb"></noscript>
          <sup id="pjuwb"><pre id="pjuwb"></pre></sup>
            <dd id="pjuwb"></dd>
            <abbr id="pjuwb"></abbr>
            中日韩午夜理伦电影免费| 亚洲午夜国产一区99re久久| 欧美制服丝袜| 亚洲免费在线播放| 亚洲午夜精品一区二区| 日韩一级免费| 日韩视频一区二区在线观看| 亚洲精品在线电影| 亚洲新中文字幕| 午夜电影亚洲| 久久一区激情| 欧美日韩mp4| 国产精品免费在线| 国内成人精品2018免费看| 狠狠久久亚洲欧美专区| 亚洲国产欧美日韩| 在线亚洲欧美| 久久精品国产99国产精品澳门| 欧美在线啊v| 欧美国产亚洲另类动漫| 在线视频日韩精品| 久久精品网址| 欧美日韩中文字幕在线| 国产偷国产偷亚洲高清97cao | 久久精品夜色噜噜亚洲aⅴ| 久久久噜噜噜久久人人看| 欧美精品免费观看二区| 国产酒店精品激情| 亚洲国内欧美| 久久精品成人一区二区三区| 欧美激情亚洲综合一区| 亚洲一区二区三区激情| 欧美a级理论片| 国产麻豆一精品一av一免费| 亚洲三级观看| 久久一二三国产| 这里只有精品视频在线| 免费永久网站黄欧美| 国产日本精品| 夜夜躁日日躁狠狠久久88av| 久久免费视频网站| 亚洲一区免费网站| 欧美激情精品久久久| 精品成人国产在线观看男人呻吟| 欧美亚洲免费电影| 性欧美办公室18xxxxhd| 亚洲高清在线观看一区| 99热这里只有精品8| 美国十次成人| 国内精品福利| 欧美在线视频观看| 一区二区三区精品久久久| 欧美电影免费观看| 影音先锋亚洲视频| 久久久久9999亚洲精品| 亚洲无线视频| 欧美视频一区| 亚洲视频在线观看网站| 亚洲人精品午夜| 欧美aaaaaaaa牛牛影院| 在线视频成人| 欧美成ee人免费视频| 久久久久.com| 一区二区三区在线看| 久久久久亚洲综合| 久久成人亚洲| 影音先锋日韩有码| 欧美成人资源| 欧美风情在线观看| 亚洲人永久免费| 亚洲欧洲免费视频| 欧美日韩成人| 亚洲一区图片| 午夜精品久久久久久久| 国产一区二区三区四区| 久久精品人人| 久久一区二区三区av| 日韩视频一区二区在线观看 | 午夜国产精品视频| 国产综合视频| 你懂的视频欧美| 欧美激情综合在线| 亚洲一区美女视频在线观看免费| 一本色道久久88综合日韩精品 | 国产综合色产| 欧美福利专区| 欧美日韩一区二| 先锋影音国产一区| 久久九九热re6这里有精品| 亚洲人成艺术| 在线亚洲激情| 亚洲成人影音| 一区二区三区日韩欧美| 国语自产精品视频在线看一大j8 | 国产精品一区二区欧美| 久色婷婷小香蕉久久| 欧美精品三级日韩久久| 亚洲欧美日韩网| 久久综合导航| 欧美成人激情视频| 欧美精品乱人伦久久久久久| 亚洲欧美日韩另类| 久久免费视频在线观看| 夜夜躁日日躁狠狠久久88av| 亚洲欧美视频在线观看视频| 亚洲国产精品精华液2区45| 一区二区三区av| 在线视频国产日韩| 亚洲在线免费观看| 亚洲精品中文字幕女同| 香蕉成人久久| 亚洲午夜电影| 欧美激情精品久久久久久免费印度| 亚洲欧美另类久久久精品2019| 久久久噜噜噜久久久| 欧美有码视频| 欧美午夜性色大片在线观看| 欧美电影美腿模特1979在线看 | 国产欧美日韩在线视频| 亚洲三级国产| 亚洲电影视频在线| 欧美一二三视频| 亚洲欧美经典视频| 欧美日韩精品一二三区| 欧美激情在线免费观看| 激情久久综合| 欧美在线高清视频| 欧美在线看片| 国产欧美精品一区| 亚洲制服丝袜在线| 亚洲欧美综合国产精品一区| 欧美日韩国产探花| 亚洲欧洲在线视频| 亚洲三级色网| 欧美大片一区二区| 亚洲国产经典视频| 亚洲欧洲精品一区二区精品久久久| 久久久久国产精品一区| 久久婷婷人人澡人人喊人人爽| 国产日韩精品一区二区浪潮av| 一区二区三区免费在线观看| 国产精品99久久久久久久女警| 欧美另类videos死尸| 亚洲精品乱码久久久久| 一区二区三区 在线观看视| 欧美日韩亚洲免费| 亚洲天天影视| 久久成人免费| 在线成人黄色| 欧美a级片网站| 99热精品在线观看| 欧美在线国产精品| 国产亚洲午夜| 老司机免费视频一区二区三区| 欧美国产日本在线| 99re热这里只有精品免费视频| 欧美精品国产一区二区| 在线视频欧美一区| 久久精品30| 亚洲国产另类久久精品| 欧美日韩国产黄| 亚洲一区二区三区中文字幕在线| 欧美怡红院视频一区二区三区| 国产一区二区电影在线观看| 久久久午夜电影| 亚洲精品乱码久久久久久| 亚洲一区二区三区四区五区午夜 | 国产精品成人aaaaa网站| 亚洲视频www| 久久尤物视频| 一区二区三区免费看| 国产欧美午夜| 欧美1区3d| 亚洲欧美日韩高清| 欧美激情1区2区3区| 亚洲一二三区在线观看| 狠狠综合久久| 欧美日韩免费观看一区=区三区| 亚洲欧美日韩国产中文在线| 欧美肥婆在线| 欧美一区二区三区婷婷月色 | 日韩写真在线| 国产午夜亚洲精品羞羞网站| 免费不卡在线观看| 亚洲欧美久久久久一区二区三区| 欧美www视频| 欧美一区二区三区的| 日韩亚洲不卡在线| 在线观看视频一区| 国产精品美女久久福利网站| 两个人的视频www国产精品| 亚洲自拍啪啪| 亚洲精品综合精品自拍| 免费观看日韩av| 久久成人亚洲| 篠田优中文在线播放第一区| avtt综合网| 亚洲精品国产品国语在线app| 国产一区二区精品久久| 国产精品视频午夜|