Hi, a quick update. There has been some fascinating discussion on comp.arch over the past month, in which various people, including Mitch Alsup, have been educating me about multi-issue architectures and how to do precise exceptions, branch speculation and much more, using scoreboards based on a PROPER understanding of a 58 year old supercomputer: the CDC 6600, designed by Seymour Cray.

Updates are at https://www.crowdsupply.com/libre-risc-v/m-class, though they are slightly behind: I have 7 outstanding that still need to be reviewed and published.

The SoC is to be quad core: dual issue for 64 bit, quad issue for 32 bit vectorised instructions, and 8 issue for 16 and 8 bit vectors. Vectors are variable length and predicated, and are dropped directly onto the multi-issue queue, rather than being a fake SIMD knockoff.
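To make "dropped directly onto the multi-issue queue" a bit more concrete, here is a rough Python sketch of the idea, not the actual hardware or nmigen code. The issue widths are the figures from this post. Everything else (the `ElementOp` type, the function names, and the assumption that element i simply uses register number base+i) is made up for illustration.

```python
from dataclasses import dataclass

# Issue width per element width, taken from the figures in this post.
ISSUE_WIDTH = {64: 2, 32: 4, 16: 8, 8: 8}

@dataclass
class ElementOp:
    """One scalar operation produced from a vector instruction (hypothetical type)."""
    op: str
    rd: int
    rs1: int
    rs2: int

def expand_vector(op, rd, rs1, rs2, vl, predicate):
    """Expand one vector instruction of length vl into per-element ops.

    Elements whose predicate bit is 0 are skipped entirely.
    Assumption: element i uses registers rd+i, rs1+i, rs2+i.
    """
    return [ElementOp(op, rd + i, rs1 + i, rs2 + i)
            for i in range(vl) if (predicate >> i) & 1]

def issue_groups(ops, elwidth):
    """Split the element ops into groups that could issue in the same cycle."""
    width = ISSUE_WIDTH[elwidth]
    return [ops[i:i + width] for i in range(0, len(ops), width)]

# A 6-element vector add with elements 1 and 4 masked off.
ops = expand_vector("add", rd=8, rs1=16, rs2=24, vl=6, predicate=0b101101)
print(len(ops), "element ops")
print(len(issue_groups(ops, 32)), "issue cycle(s) at 32 bit")
print(len(issue_groups(ops, 64)), "issue cycle(s) at 64 bit")
```

The point of the sketch is that there is no separate SIMD datapath: the element ops are ordinary scalar ops, and the only thing that changes with element width is how many of them can be issued together.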

3D instructions will be added as necessary. One already identified is the conversion from FP ARGB to 32 bit ARGB pixels. These are to be done as vectors, with vectorisation taken care of by the multi-issue execution engine, just like for every other instruction.

Target performance is 5-6 GFLOPs, 100 M pixels/sec (1280x720 @ 25fps) and 30 M triangles/s. If we exceed that, whoopee. Power budget is 2.5 watts for the total system. The VPU target is 720p decode, again only accelerating with custom ops where needed.
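As a quick sanity check on the pixel target, the arithmetic can be sketched in a few lines of Python. The function name is made up. Reading the gap between the raw 720p25 rate and 100 M pixels/sec as headroom for overdraw and multiple passes is my assumption, not something stated above.

```python
def pixels_per_second(width, height, fps):
    """Raw pixel rate for writing every pixel of every frame exactly once."""
    return width * height * fps

TARGET_PIXELS_PER_SEC = 100_000_000  # the 100 M pixels/sec target

raw = pixels_per_second(1280, 720, 25)
headroom = TARGET_PIXELS_PER_SEC / raw

print(f"raw 720p25 rate: {raw / 1e6:.2f} M pixels/sec")
print(f"target allows about {headroom:.1f} pixel writes per displayed pixel")
```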

It is a fascinating and scary microarchitecture that subdivides the register file into 4 banks, each 32 bits wide, of 2R1W cells with individual byte-write-enable lines and passthrough from write to read ports in the same cycle. The 4 banks are split odd-even and hi32-lo32, which is why it is dual issue for 64 bit and quad issue for 32 bit.
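A toy Python model may help show why that banking gives dual issue at 64 bit and quad issue at 32 bit. It is only a sketch: the bank numbering is invented for illustration, it models just the single write port per bank, and it ignores the read ports and byte-write-enables.

```python
def banks_for(regnum, width, half=0):
    """Return the set of banks touched when writing a result to a register.

    Hypothetical numbering: bank = 2 * (regnum is odd) + (hi32 half).
    A 64 bit access uses both halves; a 32 bit access uses only `half`
    (0 = lo32, 1 = hi32).
    """
    parity = regnum & 1
    if width == 64:
        return {2 * parity, 2 * parity + 1}
    if width == 32:
        return {2 * parity + half}
    raise ValueError("only 64 and 32 bit accesses are modelled here")

def conflict_free(accesses):
    """True if no bank is written twice, i.e. the writes fit in one cycle."""
    used = set()
    for regnum, width, half in accesses:
        banks = banks_for(regnum, width, half)
        if used & banks:
            return False
        used |= banks
    return True

# Two 64 bit results, one even and one odd register: all four banks, no clash.
print(conflict_free([(4, 64, 0), (5, 64, 0)]))
# Two 64 bit results in even registers: both want the even banks.
print(conflict_free([(4, 64, 0), (6, 64, 0)]))
# Four 32 bit results, one per bank.
print(conflict_free([(4, 32, 0), (4, 32, 1), (5, 32, 0), (5, 32, 1)]))
```

With four banks and one write port each, at most two 64 bit results or four 32 bit results can be written per cycle, and only if they land in different banks.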

A separate additional bus exists for operand forwarding, even though the write-through capability of the regfile also provides a type of operand forwarding.

This is because FMAC normally requires 3R1W. If the FMACs are chained (a src of one is the dest of the previous), the extra forwarding bus will allow 4 32 bit FMACs per clock per 800 MHz core, for a total of 12 GFLOPs, 2x the target performance. Without the extra bus, performance is halved, as the banks of only 2R1W regfile become the bottleneck.
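The port arithmetic behind those numbers can be sketched in Python as follows. The function names are made up, and the model is deliberately crude: it assumes every FMAC in a chain gets exactly one operand from the forwarding bus, and it counts each FMAC as one operation, which is what makes the total come out at roughly 12 G (counting a multiply and an add separately would double it).

```python
BANKS = 4
READ_PORTS_PER_BANK = 2          # 2R1W cells
READ_PORTS = BANKS * READ_PORTS_PER_BANK
MAX_ISSUE_32BIT = 4              # quad issue for 32 bit
CLOCK_HZ = 800_000_000           # 800 MHz
CORES = 4

def fmacs_per_clock(forwarded_operands_per_fmac):
    """How many 3-operand FMACs the regfile read ports can feed per clock."""
    reads_needed = 3 - forwarded_operands_per_fmac
    return min(MAX_ISSUE_32BIT, READ_PORTS // reads_needed)

def fmacs_per_second(per_clock):
    """Whole-SoC FMAC rate, counting one FMAC as one operation."""
    return per_clock * CLOCK_HZ * CORES

with_bus = fmacs_per_clock(1)     # chained: one operand arrives via the bus
without_bus = fmacs_per_clock(0)  # all three operands come from the regfile

print(with_bus, "FMACs/clock,", fmacs_per_second(with_bus) / 1e9, "G per second")
print(without_bus, "FMACs/clock,", fmacs_per_second(without_bus) / 1e9, "G per second")
```

In this model 8 read ports can feed four FMACs that each need only two regfile reads, but only two FMACs that each need three, which is where the factor of two comes from.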

I won't go into detail on how the 16 and 8 bit ops are handled, it's too scary :) It took weeks to work out.

Overall it is going well, though we could use some help. We picked nmigen because it results in readable, flexible code: no nonsense with #ifdefs, and we have too much that needs OO programming to mess about with #ifdefs in native Verilog.

--
---
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68