
Vector extension for RISC-V and HWACHA project

As the RISC-V community grows, more accomplishments get integrated into the specifications. One of the most interesting and notable additions from the last update (2017) is the vector extension, also known as "V". This extension hints that the ISA will evolve toward vector processing, and it is directly related to the Hwacha project. So that's the main topic of this post: to explain what the Hwacha project is, why it is a big deal for the RISC-V ISA, and how it works, now that it is truly open source and available to developers on GitHub.

"V" extension on the ISA

Before talking about Hwacha, let's talk about the additions to the RISC-V ISA related to the vector ("V") extension. From the v2.1 specification to the latest (at the time this post was written) v2.2, a lot of major additions have been made to the ISA, although the vector extension itself is still at version v0.2. According to the specification proposed by the Foundation, the vector extension follows the principles of the architectures developed by Seymour Cray in the 70's, where processing is based on local registers close to the CPU instead of accessing external main memory. The Cray-style architectures also introduced the concept of using pipeline parallelism to push the math operations through the ALU, instead of using a lot of parallel hardware that is not fully occupied during the whole computation.
[Image: Cray-1]

The ISA also proposes 32 vector data registers (v0-v31), 8 vector predicate registers (vp0-vp7) and a configurable vector length register; why and how they are used will be explained later. Also, to support different hardware implementations, the ISA allows configuring the maximum vector length, the number of predicate registers, the rounding mode and the fixed-point saturation status of the vector operations (vcmax, vcnpred, vxrm, vxsat). Keep in mind that we're talking about the ISA, so no hardware implementation details should show up in the document.

These registers are used by the so-called vector unit, which must be configured before use. The specification mentions that the maximum vector length of each vector data register (which corresponds to the number of vector elements) should be configurable at least in 4-element granularity, which I guess is meant to promote re-using the same vector register for different data types (uint32, uint16, uint8...). Each vector data register also has a vctype register (4 bits wide) that lets the programmer dynamically change the size of the element data to a value equal to or smaller than the previously configured vcmax. As you might imagine, if the programmer writes a vctype value above vcmax, an exception should occur. The same must happen if the programmer writes a vcmax value bigger than the hardware implementation supports: an illegal instruction exception is raised. There are other details regarding zeroing and exceptions, so if you want something more detailed, go to the ISA specification and check it.

[Image: screenshot from the vector extension specification]

There is also a vector length register, which should be programmed with the vector length the vector processor is asked to operate on. The main idea is to give the application the power to change it on each iteration of a stripmine loop. I won't go further into the proposed extension because it's quite simple compared with the work that will be released for vector operations in the near future, but I think these highlights help set the stage for the Hwacha project that I'll talk about next.
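
To make the stripmining idea concrete, here is a minimal C sketch. It assumes a hypothetical setvl() helper that plays the role of the vector-length request (the application asks for n elements, the hardware answers with how many it will actually process this pass) and a hypothetical vadd_f32() standing in for the vector operations; neither is a real API.

#include <stddef.h>

/* Hypothetical helpers standing in for the vector-length request and
 * the vector operations themselves; names are illustrative only.     */
extern size_t setvl(size_t requested);              /* returns granted length */
extern void   vadd_f32(float *dst, const float *a,
                       const float *b, size_t vl);  /* dst[i] = a[i] + b[i]   */

void vvadd_stripmine(size_t n, float *result, const float *x, const float *y)
{
    while (n > 0) {
        size_t vl = setvl(n);          /* hardware grants vl <= n this pass  */
        vadd_f32(result, x, y, vl);    /* operate on the next vl elements    */
        result += vl; x += vl; y += vl;
        n -= vl;                       /* no separate fringe loop is needed  */
    }
}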

The UCB-BAR project

The Hwacha project, developed by the UCB-BAR (UC Berkeley Architecture Research) team, is a vector-fetch architecture designed to deliver high performance at low power/energy for a wide range of applications while being a favorable compiler target. According to their documentation, the key feature of this implementation is the high degree of decoupling between vector data access and vector execution.

[Image: Hwacha, the Korean rocket launcher]

The name of the project refers to an old rocket-launcher machine created in Korea, which somehow represents the power of processing multiple vectors at once. The project was built on top of the Rocket Chip Generator, where you can build your own SoC from different sets of configurations already written in the Chisel[1] language. Hwacha uses the RoCC[2] interface to exchange data between the main CPU and the vector processor through the custom ISA opcodes available. The approach used by the Hwacha team is based on traditional vector architecture, with a coupled scalar processor responsible for loading up the vector processor settings and kicking off the vector CPU right after that.
[Image: SIMD vs. SIMT]
The Hwacha Vector-Fetch Architecture Manual explains that both SIMD and SIMT (operation over multiple data/threads) have drawbacks that compromise the efficiency of the whole system: the former loses flexibility and the latter loses on scalar processing. Traditional vector machines, however, can have both scalar support and dynamism as principles by using the vector length register (VLR) and the hardware vector length (HVL). The concept behind these machines is that the software requests an application vector length and the hardware answers, through the vector length register, with the length it can actually process; this makes the vector coprocessor flexible across different HVLs and also avoids fringe code to process the last elements of the array. Another positive aspect is that, in contrast to SIMT, Hwacha stores only one copy of the addresses of the manipulated variables, so it performs address calculation once per loop iteration, using the hardware resources better.

Hwacha hardware

The hardware developed in Hwacha v3.8.1 (the only hardware document available at the time of writing) is shown in the image below, where we can distinguish the rocket tile, which has instruction/data caches, from the Hwacha coprocessor, which has a single instruction cache. The control thread[3] of the Hwacha programming model is executed by the Rocket CPU, exploiting vector-fetch decoupling, whereas the worker thread[4] runs in the Hwacha coprocessor. The instruction cache present in the Hwacha coprocessor is used to pre-fetch the instructions pointed to by the control thread for the vector CPU. Also, the vector and Rocket CPUs are both connected to the L2 cache memory subsystem through the TileLink bus, which keeps the data coherent between the shared registers (Hwacha CPU) and the data cache (Rocket tile). If more memory is needed, the base SoC proposed in this diagram shows that the memory system is also connected to an AMBA AXI4 bus with DRAMs as slave devices to increase the bulk storage of the whole system.

[Image: Hwacha v3.8.1 SoC block diagram]

One of the major wins of access/execute decoupling in the Hwacha project is the idea of letting the cache-coherent memory system worry about keeping the data coherent, and letting the control CPU compute addresses while the vector accelerator is processing vector data. As the image below shows, the hardware is composed of a scalar unit (SXU+SMU), a set of configurable lanes (VXU+VMU) and a vector runahead unit (VRU).

[Image: Hwacha microarchitecture: scalar unit, vector lanes and VRU]

When the control processor finds a vf instruction, the accelerator starts decoding and fetching instructions (using the PC as reference) from a 4 KiB instruction cache until it finds a vstop. Then the single-issue, 4-stage pipeline (scalar unit) front-end begins splitting the job across the multiple lanes. There are a lot of details around the architecture, mainly related to execution behavior and the memory unit, but I won't go further, especially because this design document is not updated with the latest improvements/changes of Hwacha v4, presented at the latest summit.

The Hwacha project template

On the UCB-BAR GitHub page you can find a Hwacha project template, in the same manner as the RISC-V template, used to demonstrate how you can create your own rocket-chip SoC. The template helps users get started with the vector accelerator coupled to an RV64G core and run benchmarks as well as assembly tests on the core. I started taking a look last month at the version available on their GitHub, but there are still some details that should be fixed, so at this moment it's not completely working as expected, in my opinion. Besides that, it's also important that the Hwacha team release updated documentation that matches the latest hardware presented at the RISC-V Summit of 2018. The configuration proposed by the team for the control processor is:

// hwacha.ISCA2016Config
// file > TopLevelConfigs.scala
class ISCA2016Config extends Config(
  new Process28nmConfig ++
  new WithNBanksPerMemChannel(4) ++
  new With32BtbEntires ++ new HwachaConfig)

class HwachaConfig extends Config(new DefaultHwachaConfig ++ new DefaultConfig)

// file > Configs.scala (rocket-chip/src/main/scala/system)
class DefaultConfig extends Config(new WithNBigCores(1) ++ new BaseConfig)

// file > Configs.scala (rocket-chip/src/main/scala/subsystem)
class WithNBigCores(n: Int) extends Config((site, here, up) => {
  case RocketTilesKey => {
    val big = RocketTileParams(
      core   = RocketCoreParams(mulDiv = Some(MulDivParams(
        mulUnroll = 8,
        mulEarlyOut = true,
        divEarlyOut = true))),
      dcache = Some(DCacheParams(
        rowBits = site(SystemBusKey).beatBits,
        nMSHRs = 0,
        blockBytes = site(CacheBlockBytes))),
      icache = Some(ICacheParams(
        rowBits = site(SystemBusKey).beatBits,
        blockBytes = site(CacheBlockBytes))))
    List.tabulate(n)(i => big.copy(hartId = i))
  }
})

And for the Hwacha setup we have:

class DefaultHwachaConfig extends Config((site, here, up) => {
    case HwachaIcacheKey => ICacheParams(
      nSets = 64,
      nWays = 1,
      rowBits = 1 * 64,
      nTLBEntries = 8,
      fetchBytes = 8, // Fetch one 8 byte instruction
      latency = 1
    )
    // Same as core's icache: NITLBEntries, NRAS, ECCCode, WordBits, Replacer

    case HwachaCommitLog => true

    // hwacha constants
    case HwachaNAddressRegs => 32
    case HwachaNScalarRegs => 64
    case HwachaNVectorRegs => 256
    case HwachaNPredRegs => 16
    case HwachaRegBits => math.max(log2Up(site(HwachaNVectorRegs)), log2Up(site(HwachaNScalarRegs)))
    case HwachaPredRegBits => log2Up(site(HwachaNPredRegs))
    case HwachaRegLen => 64
    case HwachaMaxVLen =>
      site(HwachaNBanks) * site(HwachaNSRAMRFEntries) *
        site(HwachaBankWidth) / site(HwachaRegLen)

    case HwachaNDTLB => 8
    case HwachaNPTLB => 4
    case HwachaLocalScalarFPU => false

    // Multi-lane constants
    case HwachaNLanes => 1

    // lane constants
    case HwachaBankWidth => 128
    case HwachaNBanks => 4
    case HwachaNSRAMRFEntries => 256
    case HwachaNFFRFEntries => 16
    case HwachaNFFRFReadPorts => 3
    case HwachaNPredRFEntries => 256
    case HwachaNPredRFReadPorts => 3
    case HwachaNOperandLatches => 6
    case HwachaNPredLatches => 4
    case HwachaWriteSelects => 2
    case HwachaRFAddrBits => math.max(log2Up(site(HwachaNSRAMRFEntries)), log2Up(site(HwachaNFFRFEntries)))
    case HwachaPRFAddrBits => log2Up(site(HwachaNPredRFEntries))

    case HwachaStagesALU => 1
    case HwachaStagesPLU => 0
    case HwachaStagesIMul => 3
    case HwachaStagesDFMA => 4
    case HwachaStagesSFMA => 3
    case HwachaStagesHFMA => 3
    case HwachaStagesFConv => 2
    case HwachaStagesFCmp => 1

    case HwachaNSeqEntries => 8

    case HwachaNVVAQEntries => 4
    case HwachaNVPAQEntries => 24
    case HwachaNVSDQEntries => 4
    case HwachaNVLDQEntries => 4
    case HwachaNVLTEntries => 64

    case HwachaNSMUEntries => 16
    case HwachaBuildVRU => true

    case BuildRoCC => Seq(
      (p: Parameters) => {
         val hwacha = LazyModule.apply(new Hwacha()(p))
         hwacha
      }
    )
    // Set TL network to 128bits wide
    case SystemBusKey => up(SystemBusKey, site).copy(beatBytes = 16)

    case HwachaConfPrec => false
    case HwachaVRUMaxOutstandingPrefetches => 20
    case HwachaVRUEarlyIgnore => 1
    case HwachaVRUMaxRunaheadBytes => 16777216
    case HwachaCMDQLen => 32
    case HwachaVSETVLCompress => true
  }
)
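
One number worth working out from the parameters above: HwachaMaxVLen is derived as HwachaNBanks × HwachaNSRAMRFEntries × HwachaBankWidth / HwachaRegLen, which for the defaults shown gives 4 × 256 × 128 / 64 = 2048. In other words, the SRAM register file of the single configured lane can hold up to 2048 elements of 64 bits, which bounds the hardware vector length.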

So, for the sanity of readers who are not used to Chisel+Rocket Chip, to understand such files please read this link and this one, which may help you understand a little bit better. Because we don't have something quite stable yet, I decided to explain what I understood about Hwacha ISA usage through their custom repositories, which support the vectorized ISA via a custom version of the spike ISA simulator. I won't go into the details of how to install and use tools such as the toolchain, tests, etc.; you can take a look at this and work through it yourself. Be aware that these tools are non-standard with respect to the current RISC-V ISA, because they contain modifications to support these custom structures.

Testing the ISA

[Image: Hwacha programming-model registers]

First of all, we should keep in mind that we have a set of registers that are used to drive our vector computations in different ways. The shared registers are used to get data from the control processor's data cache, whereas the address registers store the base addresses of the variables used in the computation (such as n, x, y, a in CSAXPY). The predicate registers store per-element conditions that decide whether the vector CPU operates on each element or not. We also have the vector configuration register and the vector length register, which are responsible for configuring the vector accelerator according to the application requirements.

Check the following code:

#include "util.h"
#include "vec_vvadd.h"

// Input/Reference Data
#include "dataset1.h"


int main( int argc, char* argv[] )
{
  float result[DATA_SIZE];
  
  vec_vvadd_asm(DATA_SIZE, result, input_data_X, input_data_Y);
 
  return 0;
}

and the corresponding assembly function:

#include "vec-util.h"

.text
.align 2

.globl vec_vvadd_asm
.type  vec_vvadd_asm,@function

# assumes calling convention:
# a0 has int n
# a1 has float* result  <---
# a2 has float* x
# a3 has float* y
vec_vvadd_asm:
    li a4, VCFG(3, 0, 0, 1)
    vsetcfg a4
stripmine:
    vsetvl t0, a0 #a0 is requested vec len, actual is placed in t0
    vmca va0, a2
    vmca va1, a3
    vmca va2, a1
    la t5, vvadd_v
    vf 0(t5)
    slli t1, t0, 2
    add a1, a1, t1
    add a2, a2, t1
    add a3, a3, t1
    sub a0, a0, t0
    bnez a0, stripmine
    fence
    ret

# vector thread asm
.align 3
vvadd_v:
    vpset vp0
    vlw vv0, va0
    vlw vv1, va1
    vfadd.s vv2, vv0, vv1
    vsw vv2, va2
    vstop

This code was developed to be used as a benchmark for Hwacha v4 and is available at this link. In summary, this assembly code performs additions over an array of 32-bit floats using the vector coprocessor. The code can be split in two parts:

  1. What runs on the main CPU, or control processor;
  2. What runs on the vector CPU, or vector accelerator in Hwacha's case.

Control CPU

First, everything from the sub-routine labeled vec_vvadd_asm down to the ret instruction is the segment of code that runs on the control CPU, and everything from the vvadd_v sub-routine through the vstop instruction is the code that runs on the vector CPU. With these concepts fixed, we can analyze what each assembly instruction does, starting from the control CPU sub-routine. On each loop iteration, the control thread processes new information and sends it to the vector processor. The two lines below configure the vector processor to use 3 vector data registers of 64 bits and 1 predicate register. The VCFG macro takes 4 arguments: the number of double (64-bit), word (32-bit) and half-word (16-bit) vector registers, and, as the last argument, the number of predicate registers.

vec_vvadd_asm:
    li a4, VCFG(3, 0, 0, 1)
    vsetcfg a4

This configuration was defined to use three 64-bit vector registers because, in this example, we need one vector each for x, y and the result (vv0, vv1 and vv2 in the vector code below).

stripmine:
    vsetvl t0, a0 #a0 is requested vec len, actual is placed in t0
    vmca va0, a2
    vmca va1, a3
    vmca va2, a1

In the section above, the requested vector length is passed to the accelerator with vsetvl, and the accelerator returns the length it actually set in the t0 register. The lines with the vmca instruction are responsible for loading the addresses of the variables used into the accelerator, in the corresponding vector address registers va0, va1 and va2. Note: it's important to highlight that the RISC-V registers a0, a1, a2 and a3 used in this example are the ones that hold the function arguments according to the calling convention, so when the sub-routine is called as vec_vvadd_asm(DATA_SIZE, result, input_data_X, input_data_Y), the arguments land in those registers.
[Image: RISC-V calling convention registers]
That's why the a0 register is used to control the number of remaining elements across loop iterations and the others are used as the addresses of the variables.

    la t5, vvadd_v
    vf 0(t5)

These two lines above just tell the accelerator where it should start executing the vector commands; from then on it starts fetching ahead the corresponding instructions at the label vvadd_v.

    slli t1, t0, 2
    add a1, a1, t1
    add a2, a2, t1
    add a3, a3, t1
    sub a0, a0, t0
    bnez a0, stripmine
    fence
    ret

In this segment, the control thread computes how many elements are still left before it can exit the sub-routine. Since the number of vector elements processed by the vector instructions is stored in t0 and can change between iterations (for the fringe elements it will not be the same), we must advance the pointers of the address variables. The slli instruction multiplies the number of processed elements by 4 because we're using floats (i.e. 32 bits), so the next element address works out to:

element_n_address = base_address + element_number * 4 bytes

Following the code, we have the sub instruction, which subtracts the number of processed elements from the total remaining count stored in a0. Finally, the bnez keeps looping while the number of elements left to process is still greater than zero; once we're done, fence guarantees that stores and loads complete in the right order and all data stays coherent. That covers all the aspects of the control thread running on the control CPU.

Vector Coprocessor

In the case of the vector CPU, things run a little differently: since this thread is responsible for executing the operations, it should execute only vector instructions.

    vpset vp0
    vlw vv0, va0
    vlw vv1, va1

The code snippet above sets the predicate register vp0 and loads the data pointed to by the vector address registers va0 and va1 (the x and y base addresses, respectively) into the vector registers vv0 and vv1. After this, the code below adds vv0 and vv1 into vv2 using a single-precision floating-point unit (that's why the instruction ends with .s).

    vfadd.s vv2, vv0, vv1
    vsw vv2, va2
    vstop

With the data computed, it stores the contents of vv2 at the address held in va2 (which points to the result variable in data memory). It finishes with the vstop instruction to indicate the end of processing. So that's all about this program; now let's apply a simple tweak to understand how things really work on the vector CPU.
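
For reference, the whole kernel (control thread plus vector thread) computes nothing more than the plain scalar loop below; everything else is bookkeeping so the accelerator can chew through the arrays in vector-length-sized chunks.

/* Scalar reference for the vvadd kernel above. */
void vvadd_ref(int n, float *result, const float *x, const float *y)
{
    for (int i = 0; i < n; i++)
        result[i] = x[i] + y[i];
}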

Understanding how it's processed

See this code, a complete example built on top of the benchmarks, where I simply changed some lines and added a function to be called during the assembly execution of the control thread. I changed the stripmine loop to call the function check_stats_ctrl_th, which basically stores into an array the vector length processed on each iteration. Before calling the function, the only important thing to notice is that we need to take care of backing up the registers used by the logic, because since we're inserting these custom assembly instructions the compiler might not do this for us.

stripmine:
    vsetvl t0, a0 #a0 is requested vec len, actual is placed in t0
    # back up the return address and stack pointer...
    mv      t5, ra
    mv      t6, sp
    # we only back up argument register a3, because both the logic
    # here and check_stats_ctrl_th make use of it
    mv      t4, a3
    call    check_stats_ctrl_th
    # ...restore them
    mv      ra, t5
    mv      sp, t6
    mv      a3, t4
    vmca va0, a2
    vmca va1, a3
    vmca va2, a1
    la t5, vvadd_v
    vf 0(t5)
    slli t1, t0, 2
    add a1, a1, t1
    add a2, a2, t1
    add a3, a3, t1
    sub a0, a0, t0
    bnez a0, stripmine
    fence
    ret
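
A minimal sketch of what such a helper could look like in C is shown below, assuming a GCC-style inline-asm read of t0 and that the compiler has not already clobbered t0 in the function prologue; the array size and names are illustrative, not the actual benchmark code. Note that, strictly by the calling convention, a0-a2 and t0 are also caller-saved, so a more defensive version of the assembly above would back them up as well; the simple helper below just happens not to touch them.

#define MAX_ITERS 64                 /* assumed upper bound, illustrative only */

static long vlen_log[MAX_ITERS];
static int  iter_cnt;

/* Called from the stripmine loop: copies the vector length that
 * vsetvl left in t0 and records it so it can be printed later.  */
void check_stats_ctrl_th(void)
{
    long vl;
    asm volatile ("mv %0, t0" : "=r"(vl));  /* read t0 into a C variable */
    if (iter_cnt < MAX_ITERS)
        vlen_log[iter_cnt++] = vl;
}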

In the C function I copy the content of the temporary register t0, which holds the current vector length reported by the vector processor, and store it into an array to print out later. The results were:
[Image: per-iteration vector lengths printed by the modified test]

From the picture you can see that across 29 iterations the vector CPU processed 680 elements per iteration, except the last one, where it set itself up for 280. This kind of flexibility is an excellent deal for the software side, because it doesn't need to care about fringe elements, and it also lets the vector CPU work independently with different length configurations. It's important to highlight that I used spike with the Hwacha extension to test this, because the current hwacha-template has been released but is still under development.

That's the end of this post for now. I hope it gives you a good introduction to this subject and to how it should work in the near future, with more RISC-V updates to the ISA. Also, check out another vector CPU that was presented at the RISC-V Summit 2018, developed by the ETH team; the guy heading up that presentation is Brazilian too =).

References

  1. Chisel is a hardware construction language developed by UCB-BAR, embedded in the Scala language, used to describe and generate hardware at a higher level of abstraction. ↩︎

  2. RoCC is a custom interface for the Rocket Chip generator to extend the RISC-V ISA. Check more at this [link](https://docs.google.com/document/d/1CH2ep4YcL_ojsa3BVHEW-uwcKh1FlFTjH_kg5v8bxVw/edit). ↩︎

  3. Control thread refers to the part of the assembly programming model proposed by Hwacha that configures the vector processor to operate as the application demands; usually this term refers to the code running on the non-vector CPU to which Hwacha is coupled. ↩︎

  4. Worker thread refers to the part of the assembly programming model proposed by Hwacha that executes the raw vector-processing job over the vector elements; it is the segment of code that the Hwacha accelerator itself runs. ↩︎
