C++ Coder

HCP高性能計(jì)算架構(gòu)，實(shí)現(xiàn)，編譯器指令優(yōu)化，算法優(yōu)化， LLVM CLANG OpenCL CUDA OpenACC C++AMP OpenMP MPI

C++博客

管理

98 Posts :: 0 Stories :: 0 Comments :: 0 Trackbacks

公告

對(duì)學(xué)習(xí)編程者的忠告：眼過(guò)千遍不如手過(guò)一遍！書(shū)看千行不如手敲一行！手敲千行不如單步一行！單步源代碼千行不如單步對(duì)應(yīng)匯編一行！

常用鏈接

留言簿(2)

隨筆分類(lèi)

隨筆檔案

搜索

閱讀排行榜

評(píng)論排行榜

VLIW on Cypress and vector addition

http://devgurus.amd.com/thread/158866

VLIW on Cypress and vector addition

此問(wèn)題被 假設(shè)已回答。

cadorino 2012-7-2 上午10:31

Hi to everybody.
I'm thinking about VLIW utilization on a 5870 HD.

Suppose you have the following kernel:

__kernel void saxpy(const __global float * x, __global float * y, const float a)
{
uint guid = get_global_id(0);
y[guid] = a * x[guid] + y[guid];
}

Each work item operates on a single vector element and no vectorization (float4).
Is the compiler still capable of packing instructions to exploit the 4 ALUs of each processing element?

Is there any tool to determine the way instructions are packed into VLIW?

Thank you very much!

Re: VLIW on Cypress and vector addition
MicahVillmow 2012-7-2 上午10:44 (回復(fù) cadorino)
The compiler only packs the VLIW with computation within a work-item, not across work-items. Multicore ware has some software that will allow you to pack across work-items, which can be found here:http://www.multicorewareinc.com/index.php?option=com_content&view=article&id=71&Itemid=106
Micah Villmow
Advanced Micro Devices Inc.
--------------------------------
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied.
舉報(bào)濫用

喜愛(ài) (0)
- Re: VLIW on Cypress and vector addition
  cadorino 2012-7-2 上午11:18 (回復(fù) MicahVillmow)
  Ok, so what if there are not enough alu instructions per work-item to be packed to exploit the 4 ALUs?
  舉報(bào)濫用
  
  喜愛(ài) (0)
  - Re: VLIW on Cypress and vector addition
    MicahVillmow 2012-7-2 上午11:27 (回復(fù) cadorino)
    Then your program is not utilizing the entire machine and cannot reach peak efficiency.
    Micah Villmow
    Advanced Micro Devices Inc.
    --------------------------------
    The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied.
    舉報(bào)濫用
    
    喜愛(ài) (0)
    - Re: VLIW on Cypress and vector addition
      cadorino 2012-7-2 下午1:00 (回復(fù) MicahVillmow)
      Ok, so it is possible for a PE to get a VLIW containing less than 4 operations, and in this case some ALUs are idle. Right?
      舉報(bào)濫用
      
      喜愛(ài) (0)
    - Re: VLIW on Cypress and vector addition
      cadorino 2012-7-2 下午1:00 (回復(fù) MicahVillmow)
      Ok, so it is possible for a PE to get a VLIW containing less than 4 operations, and in this case some ALUs are idle. Right?
      舉報(bào)濫用
      
      喜愛(ài) (0)
      - Re: VLIW on Cypress and vector addition
        realhet 2012-7-3 上午3:05 (回復(fù) cadorino)
        Peak efficiency is only when all your 4 or 5 ALUs have work to do in every cycles.
        If you disassemble your code (use kernel analyzer), you can easily spot the idle ALUs.
        
        for example:
        10 x: SUB_INT     T0.x, PV9.z, KC0[2].x
                   w: SETGE_UINT ____, PV9.z, KC0[2].x    ;y,z,t sleeps
        11 z: AND_INT     ____, T0.y, PV10.w              ;x,y,w,t sleeps
        12 y: CNDE_INT    T1.y, PV11.z, T0.z, T0.x     ;x,z,w,t sleeps
        13 x: ADD_INT     ____, KC0[2].x, PV12.y        ;y,z,w,t sleeps
        this is so unoptimal that is does only 5 operations under 4 clocks, the possible maximum would be 4*5=20 operations (on vliw5)
        
        826 x: XOR_INT     T1.x, R28.w, T0.w
                     y: SETGT_UINT ____, T1.x, T0.w
                     z: XOR_INT     T3.z, KC0[13].z, R20.y      VEC_021
                     w: SETGT_UINT T2.w, T2.w, R15.y      VEC_201
                     t: SETGT_UINT T0.w, R9.x, T1.y
        827 x: ADD_INT     ____, T0.z, T2.z
                     y: ADD_INT     T2.y, T0.y, T2.x      VEC_021
                     z: ADD_INT     T0.z, T1.z, T2.y      VEC_210
                     w: ADD_INT     ____, PV826.y, T3.y      VEC_021
                     t: SETGT_UINT ____, T3.w, R5.x
        this one is maximum utilization. 10 operationc in 2 clocks.
        
        There are tricks to improve local paralellism in code (other than simply vectorizing everything) like breaking dependency chains:
        for example a+b+c+d -> (a+b)+(c+d)
        舉報(bào)濫用
        
        喜愛(ài) (0)
        Re: VLIW on Cypress and vector addition
        nou 2012-7-3 上午4:06 (回復(fù) realhet)
        or you can proceed 2-4 work items in one work item. like this. but it has disadvantage as it increase register usage.
        __kernel void saxpy(const __global float * x, __global float * y, const float a)
        {
                  uint guid = get_global_id(0);
                       for(int i=0;i<4;i++)
                            y[guid*4+1] = a * x[guid*4+i] + y[guid*4+i];
        }
        舉報(bào)濫用
        
        喜愛(ài) (0)
Re: VLIW on Cypress and vector addition
kbrafford 2012-7-3 下午12:24 (回復(fù) cadorino)
If he rewrote the kernel to use float4 types, would that also cause MMX/SSE registers to be used with compiled for a CPU device?

__kernel void saxpy(const __global float4 * x, __global float4 * y, const float a)
{
uint guid = get_global_id(0);

y[guid] = (float4)a * x[guid] + y[guid];
}
舉報(bào)濫用

喜愛(ài) (0)
- Re: VLIW on Cypress and vector addition
  nou 2012-7-3 下午1:58 (回復(fù) kbrafford)
  yes you need use explicit vector types to utilize SSE instructions on CPU with AMD OpenCL
  舉報(bào)濫用
  
  喜愛(ài) (0)
  - Re: VLIW on Cypress and vector addition
    cadorino 2012-7-3 下午7:48 (回復(fù) nou)
    Thank you very much you all, your answers are really useful!
    I already wrote a float4 version of the kernel, but I posted the float version since I'm trying to get into details of low-level aspects of VLIW compilation and execution.
    
    For what regards vectorized types, I agree that on CPU this enables SSE execution. Bu what for GPUs? For example, is the sum of two float4 elements spread across the 4 ALUs or the sum of each component is executed sequentially on a single ALU?
    舉報(bào)濫用
    
    喜愛(ài) (0)
    - Re: VLIW on Cypress and vector addition
      nou 2012-7-3 下午11:34 (回復(fù) cadorino)
      I am not entirely sure what you mean with single ALU. but one workitem is executed only on one 5D/4D unit. and you will get packed instruction if is there enough independent instructions.
      舉報(bào)濫用
      
      喜愛(ài) (0)
    - Re: VLIW on Cypress and vector addition
      realhet 2012-7-4 上午1:26 (回復(fù) cadorino)
      On hd6xxx there are 4 physical ALUs for each workitems. The compiler will schedule operations for each of the ALUs on every single clocks. Also the compiler must ensure that there are no data dependency across the ALUs (eg. ALU x cannot use the result from ALU y in a single clock).
      In our case [Very Large Instruction Word] means that one instruction contains at most 4 subinstructions for each of the four ALUs.
      
      If it's not complicated enough: on the HD4xxx,5xxx there is a fifth ALU which handles the complicated instructions. So 4 ALUs can do simple math like mul, add, and the 5th can handle special things like cos().
      
      SSE is SIMD. It means one instruction will do the same operation on 4 different datas. SIMD can interpreted as a special case of VLIW where all ALU's have to do the same operation on data packed into vectors (eg. float4)
      
      GCN architecture dropped VLIW. On that a simple sequential code which even contains long dependency chains will do fine. So there is no need to have 4 or 5 independent execution paths in your algo, but for maximum utilization you'll have to feed it with 4x more workitems.
      
      (you know, the example code here is rather theoretical: Its bottleneck is memory IO, all the ALUs are sleeping and waiting for the memory units. Also the hardest ALU calculation is not the a*b+c (1 mad instruction) but get_global_id(0) (modulo/rangecheck/add operations) and address calculations for the 3 indirectly addressed buffers.)
      舉報(bào)濫用
      
      喜愛(ài) (0)
      - Re: VLIW on Cypress and vector addition
        cadorino 2012-7-4 下午3:01 (回復(fù) realhet)
        Great answer! Thank you
        So, summarizing, the grouping of instructions may allow to exploit all the 4/5 ALUs and this exploitation depends on the program and on the compiler.
        
        舉報(bào)濫用
        
        喜愛(ài) (0)
      - Re: VLIW on Cypress and vector addition
        bridgman 2012-7-4 下午7:31 (回復(fù) realhet)
        >>On hd6xxx there are 4 physical ALUs for each workitems.
        
        It's actually only the hd69xx (and Trinity) which use VLIW4. The rest of the hd6xxx family uses VLIW5, similar to the earlier parts.
        
        The GCN parts (hd77xx and higher) use 4 scalar SIMDs in a CU rather than 1 VLIW4 SIMD.
        The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied.
        舉報(bào)濫用
        
        喜愛(ài) (0)
        Re: VLIW on Cypress and vector addition
        realhet 2012-7-4 下午11:13 (回復(fù) bridgman)
        My bad I didn't use specific series numbers. Somehow I thought if I learn ISA on the 7970 I will be able to reuse the same code on the whole 7xxx series later. But it was until your post Thx for the info.
        舉報(bào)濫用
        
        喜愛(ài) (0)
Re: VLIW on Cypress and vector addition
cadorino 2012-7-5 上午6:46 (回復(fù) cadorino)
Great topic. I would like to ask you another thing without starting another topic.
Given a disassembled kernel like:

00 ALU: ADDR(32) CNT(11) KCACHE0(CB0:0-15) KCACHE1(CB1:0-15)
      0 t: MULLO_INT   ____, R1.x, KC0[1].x
      1 x: ADD_INT     ____, R0.x, PS0
      2 w: ADD_INT     ____, PV1.x, KC0[6].x
      3 z: LSHL        ____, PV2.w, 2
      4 y: ADD_INT     ____, KC1[0].x, PV3.z
         z: ADD_INT     ____, KC1[1].x, PV3.z
      5 x: LSHR        R2.x, PV4.z, 2
         y: LSHR        R0.y, PV4.z, 2
         w: LSHR        R0.w, PV4.y, 2
01 TEX: ADDR(48) CNT(2)
      6 VFETCH R0.x___, R0.w, fc153 MEGA(4)
         FETCH_TYPE(NO_INDEX_OFFSET)
      7 VFETCH R1.x___, R0.y, fc153 MEGA(4)
         FETCH_TYPE(NO_INDEX_OFFSET)
02 ALU: ADDR(43) CNT(1) KCACHE0(CB1:0-15)
      8 x: MULADD_e    R0.x, KC0[2].x, R0.x, R1.x
03 MEM_RAT_CACHELESS_STORE_RAW: RAT(11)[R2].x___, R0, ARRAY_SIZE(4) MARK VPM
END_OF_PROGRAM

can I assume that each instruction 0 - 8 is executed in a clock cycle?
In the example I've 9 instructions. Can I say that they take 9 cycles to get executed? (obviously, after the fetch clause the wavefront will be switched off until data is available, but I'm referring exclusively to executing instructions, ignoring the time spent in waiting for memory accesses to complete)
舉報(bào)濫用

喜愛(ài) (0)
- Re: VLIW on Cypress and vector addition
  realhet 2012-7-5 上午7:51 (回復(fù) cadorino)
  yes, the first ALU clause takes 6 core clock cycles and the second takes one.
  But these are relatively small clauses interleaved with memory clauses so there are lots of penalties at the transitions of the clauses.
  Ideally alu clauses can hold 128 slots but these are small ones.
  
  Lets say you have 2 wavefronts A,B and C assigned to a compute unit:
  ALU               memory unit
  A: 00 ALU       idle
  B: 00 ALU       A: 01 TEX
  C: 00 ALU       A: 01 TEX still (it's slow operation compared to small ALU stuff)
  A: 02 ALU       B: 01 TEX
  idle                 B: 01 TEX still
  B: 02 ALU       A: 03 MEM
  idle                 C: 01 TEX
  idle                 C: 01 TEX still
  C: 02 ALU       B: 03 MEM
  idle                 C: 03 MEM
  
  (oups maybe the memory output unit is separated from the texture unit but it's only an illustration of how the different parts of a compute unit can work in paralell)
  (and the compiler did a good job compiling that MULADD into one instruction)
  舉報(bào)濫用
  
  喜愛(ài) (0)
  - Re: VLIW on Cypress and vector addition
    cadorino 2012-7-5 上午8:19 (回復(fù) realhet)
    Fantastic. Are wavefronts switched only when a fetch occurs?
    舉報(bào)濫用
    
    喜愛(ài) (0)

轉(zhuǎn)至原文

posted on 2013-01-09 16:37 jackdong 閱讀(514) 評(píng)論(0) 編輯收藏引用所屬分類(lèi): OpenCL

只有注冊(cè)用戶(hù)登錄后才能發(fā)表評(píng)論。


相關(guān)文章: 淺談多節(jié)點(diǎn)CPU+GPU協(xié)同計(jì)算負(fù)載均衡性設(shè)計(jì) VLIW on Cypress and vector addition Low ALUBusy and low FetchUnitBusy Understanding performance counters ALUBusy question 適用于ATI卡的GPU計(jì)算MD5的小程序源碼，基于AMD APP SDK開(kāi)發(fā) Test latency for clEnqueueNDRangeKernel 采用OpenCL標(biāo)準(zhǔn)實(shí)現(xiàn)FPGA設(shè)計(jì)

網(wǎng)站導(dǎo)航: 博客園 IT新聞 BlogJava 博問(wèn) Chat2DB 管理

青青草原综合久久大伊人导航_色综合久久天天综合_日日噜噜夜夜狠狠久久丁香五月_热久久这里只有精品

C++ Coder

公告

常用鏈接

留言簿(2)

隨筆分類(lèi)

隨筆檔案

搜索

最新評(píng)論

閱讀排行榜

評(píng)論排行榜

VLIW on Cypress and vector addition

Re: VLIW on Cypress and vector addition

Re: VLIW on Cypress and vector addition

Re: VLIW on Cypress and vector addition

Re: VLIW on Cypress and vector addition

Re: VLIW on Cypress and vector addition

Re: VLIW on Cypress and vector addition

Re: VLIW on Cypress and vector addition

Re: VLIW on Cypress and vector addition

Re: VLIW on Cypress and vector addition

Re: VLIW on Cypress and vector addition

Re: VLIW on Cypress and vector addition

Re: VLIW on Cypress and vector addition

Re: VLIW on Cypress and vector addition

Re: VLIW on Cypress and vector addition

Re: VLIW on Cypress and vector addition

Re: VLIW on Cypress and vector addition

Re: VLIW on Cypress and vector addition

Re: VLIW on Cypress and vector addition