Hi
I've now started wiring a RISC-V CPU generated with Rocket-Chip into a top level: https://github.com/leviathanch/SauMauPing1 However, for some reason it doesn't start trying to fetch instructions. I'll investigate tomorrow; for now, judging by how much my head hurts right now, I expect it to rain cats and dogs tonight and tomorrow. I'll relax now and get back tomorrow with lots of coffee.
Another important topic which came up when I tried to repurpose the memory controller from the North Point project: We will have to develop SRAM for our MCU or solder an external SRAM chip onto the board... like this one: https://github.com/freecores/zbt_sram_controller/blob/master/ZBTSRAM61NLP_NV...
Didn't we have a talk a while ago with one of the folks from OpenRAM? That's exactly their department. We should tell them to get ready to start dimensioning SRAM cells as soon as we've got the first workable results from PearlRiver.
Cheers David
PS: Turned spell checker off now ^^
On Mon, Jun 25, 2018 at 7:55 PM, David Lanzendörfer david.lanzendoerfer@o2s.ch wrote:
Hi
I've now started wiring a RISC-V CPU generated with Rocket-Chip into a top level: https://github.com/leviathanch/SauMauPing1 However, for some reason it doesn't start trying to fetch instructions. I'll investigate tomorrow; for now, judging by how much my head hurts right now, I expect it to rain cats and dogs tonight and tomorrow. I'll relax now and get back tomorrow with lots of coffee.
Another important topic which came up when I tried to repurpose the memory controller from the North Point project: We will have to develop SRAM for our MCU or solder an external SRAM chip onto the board... like this one: https://github.com/freecores/zbt_sram_controller/blob/master/ZBTSRAM61NLP_NV...
a much faster option is xSPI (HyperRAM). it's fewer pins (12), runs at 150mhz for the non-DDR variant, and can do up to 150 mbyte/sec. if you want 300 mbyte/sec, put in two HyperRAM interfaces.
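(back-of-the-envelope, assuming the usual 8-bit HyperRAM data bus, which is where the 12-pin count comes from: 1 byte per transfer × 150 MHz single-data-rate ≈ 150 mbyte/sec per interface, hence two interfaces for 300 mbyte/sec.)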
edmund's team has a libre-licensed HyperRAM implementation done already.
i think it was micron who have 64 mbyte HyperRAM ICs commercially available.
now... if OpenRAM were to implement a libre-licensed HyperRAM chip instead of SRAM that would be REALLY interesting.
l.
Hi
a much faster option is xSPI (HyperRAM). it's fewer pins (12), runs at 150mhz for the non-DDR variant, and can do up to 150 mbyte/sec. if you want 300 mbyte/sec, put in two HyperRAM interfaces.
edmund's team has a libre-licensed HyperRAM implementation done already.
i think it was micron who have 64 mbyte HyperRAM ICs commercially available.
Fantastic! I'll immediately look at it!
now... if OpenRAM were to implement a libre-licensed HyperRAM chip instead of SRAM that would be REALLY interesting.
Well. We directly talk and cooperate with the OpenRAM folks, so we can certainly arrange that :-)
Cheers David
Dear Luke,
I’m currently working on a list of requirements and tech specs for the GPU. Unfortunately I did not find any documents which describe the GC800’s specs completely (e.g. power consumption, area estimation and so on). Would you mind helping me find a proper document with complete info? Otherwise, is it possible for you to describe the GPU specs completely? These are what I’ve found so far from your emails (it would be great if you could fill in the question-mark parts):
Deadline = ?
The GPU must be matched by the Gallium3D driver.
RTL must be sufficient to run on an FPGA.
Software must be licensed under LGPLv2+ or BSD/MIT.
Hardware (RTL) must be licensed under BSD or MIT with no “NON-COMMERCIAL CLAUSES”.
Any proposals will be competing against the Vivante GC800 (using the Etnaviv driver).
The GPU is integrated (like the Mali400), so all that the GPU needs to do is write to an area of memory (the framebuffer, or an area of the framebuffer). The SoC - which in this case has a RISC-V core and has peripherals such as the LCD controller - will take care of the rest. In this architecture, the GPU, the CPU and the peripherals are all on the same AXI4 shared memory bus. They all have access to the same shared DDR3/DDR4 RAM. So as a result the GPU will use AXI4 to write directly to the framebuffer and the rest will be handled by the SoC.
The job must be done by a team that shows sufficient expertise to reduce the risk. (Do you mean a team with good CVs? What about if the team shows you an acceptable FPGA prototype? I’m talking about a team of students who do not have big industrial CVs but know how to handle this job, just like RocketChip or MIAOW etc.)
Best regards, Manili
--- crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68
On Wed, Jun 27, 2018 at 3:44 AM, Mohammad Amin Nili manili.devteam@gmail.com wrote:
Dear Luke,
I’m currently working on a list of requirements and tech specs for the GPU. Unfortunately I did not find any documents which describe the GC800’s specs completely (e.g. power consumption, area estimation and so on). Would you mind helping me find a proper document with complete info? Otherwise, is it possible for you to describe the GPU specs completely? These are what I’ve found so far from your emails (it would be great if you could fill in the question-mark parts):
Deadline = ?
about 12-18 months which is really tight. if an FPGA (or simulation) plus the basics of the software driver are at least prototyped by then it *might* be ok.
if using nyuzi as the basis it *might* be possible to begin the software port in parallel because jeff went to the trouble of writing a cycle-accurate simulation.
The GPU must be matched by the Gallium3D driver
that's the *recommended* approach, as i *suspect* it will result in less work than, for example, writing an entire OpenGL stack from scratch.
RTL must be sufficient to run on an FPGA.
a *demo* must run on an FPGA as an initial milestone.
Software must be licensed under LGPLv2+ or BSD/MIT.
and no other licenses. GPLv2+ is out.
Hardware (RTL) must be licensed under BSD or MIT with no “NON-COMMERCIAL CLAUSES”. Any proposals will be competing against Vivante GC800 (using Etnaviv driver).
in terms of price, performance and power budget, yes. if you look up the numbers (triangles/sec, pixels/sec, power usage, die area) you'll find it's really quite modest. nyuzi right now requires FOUR times the silicon area of e.g. MALI400 to achieve the same performance as MALI400, meaning that the power usage alone would be well in excess of the budget.
The GPU is integrated (like the Mali400), so all that the GPU needs to do is write to an area of memory (the framebuffer, or an area of the framebuffer). The SoC - which in this case has a RISC-V core and has peripherals such as the LCD controller - will take care of the rest. In this architecture, the GPU, the CPU and the peripherals are all on the same AXI4 shared memory bus. They all have access to the same shared DDR3/DDR4 RAM. So as a result the GPU will use AXI4 to write directly to the framebuffer and the rest will be handled by the SoC. The job must be done by a team that shows sufficient expertise to reduce the risk. (Do you mean a team with good CVs? What about if the team shows you an acceptable FPGA prototype?
that would be fantastic as it would demonstrate not only competence but also commitment, and it would take out the "risk" of being "unknown" entirely.
I’m talking about a team of students who do not have big industrial CVs but know how to handle this job (just like RocketChip or MIAOW etc…).
works perfectly for me :)
btw i am happy to put together a crowd-funding campaign (that's already underway) that would also help fund this effort.
l.
Hello Luke.
On 06/27/2018 09:38 AM, Luke Kenneth Casson Leighton wrote:
in terms of price, performance and power budget, yes. if you look up the numbers (triangles/sec, pixels/sec, power usage, die area) you'll find it's really quite modest. nyuzi right now requires FOUR times the silicon area of e.g. MALI400 to achieve the same performance as MALI400, meaning that the power usage alone would be well in excess of the budget.
5 years back, during a freelance job, I got the chance to take a deeper look at the MALI implementation in Verilog.
I think we should do better than that crappy^H "legacy" code. There are very, very long combinatorial paths, which limit the working frequency significantly. These worst-case paths go through a lot of adder stages in the address translation logic.
IMHO, our goal should be to beat the Mali and similar cores.
In this architecture, the GPU, the CPU and the peripherals are all on the same AXI4 shared memory bus. They all have access to the same shared DDR3/DDR4 RAM. So as a result the GPU will use AXI4 to write directly to the framebuffer and the rest will be handled by the SoC.
BTW, AXI4 is somewhat open, but ARM holds patents around it. Please prefer a really free and open system-on-chip bus like Wishbone. And yes, I know, Wishbone does not have a streaming feature like AXI4...
Regards, Hagen Sankowski
On Wed, Jun 27, 2018 at 9:40 AM, Hagen SANKOWSKI hsank@posteo.de wrote:
Hello Luke.
On 06/27/2018 09:38 AM, Luke Kenneth Casson Leighton wrote:
in terms of price, performance and power budget, yes. if you look up the numbers (triangles/sec, pixels/sec, power usage, die area) you'll find it's really quite modest. nyuzi right now requires FOUR times the silicon area of e.g. MALI400 to achieve the same performance as MALI400, meaning that the power usage alone would be well in excess of the budget.
5 years back, during a freelance job, I got the chance to take a deeper look at the MALI implementation in Verilog.
I think we should do better than that crappy^H "legacy" code. There are very, very long combinatorial paths, which limit the working frequency significantly. These worst-case paths go through a lot of adder stages in the address translation logic.
IMHO, our goal should be to beat the Mali and similar cores.
for the first version it's really not a high priority. low-power is far more important. with power being a square law on size, high-frequency clock rates become a bit of a problem. much better to put in double the hardware at half the speed and gain a 2x reduction in power consumption. 3D tiling is inherently parallelisable so it's not an issue.
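roughly, the textbook CMOS dynamic-power argument (generic symbols for illustration, not measured figures for any particular process):

    P_dyn ≈ α · C · V_dd² · f

if the achievable clock frequency scales roughly linearly with the supply voltage, then two copies of the hardware at half the clock, with V_dd lowered to match, give

    P' ≈ 2 · α · C · (V_dd/2)² · (f/2) = ¼ · α · C · V_dd² · f

for the same throughput: up to 4x less dynamic power in the ideal case. leakage and the limited voltage-scaling range are what pull that back towards the more conservative 2x figure above.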
In this architecture, the GPU, the CPU and the peripherals are all on the same AXI4 shared memory bus. They all have access to the same shared DDR3/DDR4 RAM. So as a result the GPU will use AXI4 to write directly to the framebuffer and the rest will be handled by the SoC.
BTW, AXI4 is somewhat open, but ARM holds patents around it. Please prefer a really free and open system-on-chip bus like Wishbone. And yes, I know, Wishbone does not have a streaming feature like AXI4...
i know of a commercial product that put the vga_lcd code into production. it used wishbone. it failed to reach speeds beyond 1024x768 @ 55hz, 15bpp.
also, AC97 is required and to do that efficiently and effectively AXI4 is necessary.
l.
Hi all
I've now managed to strip away the Xilinx/FPGA specific configuration quirks and only produce a plain Verilog core which is platform agnostic: https://github.com/libresilicon/SauMauPing1/tree/master/builds
Next steps are:
- Wire all the peripherals out
- Wire a GPU in there.
For the GPU I'd prefer either Verilog (because of the Yosys support) or Chisel/Scala (Generates Verilog)
Cheers -lev
On Wed, Jun 27, 2018 at 8:28 PM, David Lanzendörfer david.lanzendoerfer@o2s.ch wrote:
Hi all
I've now managed to strip away the Xilinx/FPGA specific configuration quirks
ah i need those, for testing on a ZC706.
and only produce a plain Verilog core which is platform agnostic: https://github.com/libresilicon/SauMauPing1/tree/master/builds Next steps are:
- Wire all the peripherals out
david do you have someone who can write a priority muxer, in chisel3? the IOF code by SiFive can't cope with the many-pins-to-one-input case and i'd like chisel3 to be one of the back-ends for the libre pinmux code-generator.
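something like this minimal chisel3 sketch is the kind of thing i mean (the module and port names are made up for illustration, and it simply leans on chisel3.util's PriorityMux rather than anything from the SiFive IOF code):

// ----------------------------------------------------------------
import chisel3._
import chisel3.util.PriorityMux

// route many candidate pad inputs onto one peripheral input;
// the lowest-numbered pad whose select line is asserted wins.
class PriorityPinMux(numPins: Int) extends Module {
  val io = IO(new Bundle {
    val pads = Input(Vec(numPins, Bool())) // candidate pad inputs
    val sel  = Input(Vec(numPins, Bool())) // per-pad routing enables
    val out  = Output(Bool())              // the single input seen by the peripheral
  })
  // PriorityMux returns the pad value belonging to the first asserted select;
  // when nothing is selected it falls through to the last entry, so a
  // default / pull value can be wired up there.
  io.out := PriorityMux(io.sel, io.pads)
}
// ----------------------------------------------------------------

the pinmux code-generator back-end would then emit something along these lines per peripheral input rather than it being hand-written.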
l.
Hello David,
Is it possible for you to add a README file for how to use it?
Best regards, Manili
On Jun 27, 2018, at 11:58 PM, David Lanzendörfer david.lanzendoerfer@o2s.ch wrote:
Hi all
I've now managed to strip away the Xilinx/FPGA specific configuration quirks and only produce a plain Verilog core which is platform agnostic: https://github.com/libresilicon/SauMauPing1/tree/master/builds
Next steps are:
- Wire all the peripherals out
- Wire a GPU in there.
For the GPU I'd prefer either Verilog (because of the Yosys support) or Chisel/Scala (Generates Verilog)
Cheers -lev
Hi Manili
Is it possible for you to add a README file for how to use it?
Yeah. I'm working on that. Now I've got everything under control, the next step is to add the FPGA top levels back and allow people to choose whether they wanna build for an ASIC or for an FPGA. I guess I should have a README done by tonight, ok?
Cheers David
Hello list!
On 06/28/2018 08:34 AM, David Lanzendörfer wrote:
Now I've got everything under control, the next step is to add the FPGA top levels back and allow people to choose whether they wanna build for an ASIC or for an FPGA. I guess I should have a README done by tonight, ok?
BTW, there is a best-practice style for switching between ASIC and FPGA targets. Let me explain:
1. On ASICs there has to be a RESET for every register (latch, flip-flop). But on FPGAs, RESETs are useless (you can re-configure the whole FPGA instead). And without a global line (the RESET) which has to go everywhere, the timing (and the working frequency) gets better.
2. With the FPGA configuration, every register gets an initial value, so use initial values instead of a RESET.
Example code (in Verilog)
// --------------------------------------------------------------
localparam c_register_reset = 0; // 1st, define reset value

`ifdef CODINGSTYLE_FPGA
reg [31:0] r_register = c_register_reset; // assign initial value
`else // ASIC-like
reg [31:0] r_register;                    // for ASIC w/o initial value
`endif

`ifdef CODINGSTYLE_FPGA
always @ (posedge clk) begin
`else
always @ (posedge rst or posedge clk) begin
    if (rst)
        r_register <= c_register_reset;   // reset value for ASICs only
    else
`endif
    // operational with posedge clock
    // ...
end
// ---------------------------------------------------------------
Usually, I put the CODINGSTYLE_FPGA into my Makefile environment to switch between both targets.
Regards, Hagen Sankowski
Hi
After quite a bit of research I had to realize that there isn't a single free DRAM controller available. The SiFive folks use the vendor-specific closed IP cores... They're just using the vendor-specific ones for the FPGA but don't have their own integrated into rocket-chip... Like for Xilinx they use the proprietary one of theirs, and so on. No one has bothered writing an actually free DRAM controller for DDR3/DDR4 in Chisel so that one could actually build it as an ASIC. That's also most likely why SiFive didn't publish the actual code from which they've made the chips... Evaluated for you: we can't build this thing until we have designed our own DRAM controller... Anyway, I'm now doing a transplant and adapting their "Xilinx specific IP core" into my MIA702 version... And the pain repeats for every new board...
Cheers David
On Thursday, 28 June 2018 3:28:21 AM HKT David Lanzendörfer wrote:
Hi all
I've now managed to strip away the Xilinx/FPGA specific configuration quirks and only produce a plain Verilog core which is platform agnostic: https://github.com/libresilicon/SauMauPing1/tree/master/builds
Next steps are:
- Wire all the peripherals out
- Wire a GPU in there.
For the GPU I'd prefer either Verilog (because of the Yosys support) or Chisel/Scala (Generates Verilog)
Cheers -lev
--- crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68
On Thu, Jun 28, 2018 at 4:28 PM, David Lanzendörfer david.lanzendoerfer@o2s.ch wrote:
Hi
After quite a bit of research I had to realize that there isn't a single free DRAM controller available.
there is... they're just damn hard to find.
They use the vendor specific closed IP cores...
yyup.
They're just using the vendor specific ones for the FPGA but don't have their own integrated in rocket-chip...
correct. and those typically don't have impedance matching / training on them, they assume fixed 40 ohms impedance, so if you build a PCB with that specific FPGA where the layers don't match up correctly to get the exact impedance on transmit and receive that the *specific* DRAM requires, you're absolutely screwed.
this is why you can only use *specific* SO-DIMMs and DRAMs with specific FPGAs.
Like for Xilinx they use the proprietary one of theirs, and so on. No one has bothered writing an actually free DRAM controller for DDR3/DDR4
DDR3 yes (see below).
in chisel
no and i'm not hugely enamoured with chisel, so no great loss there.
That's also most likely why SiFive didn't publish their actual code from which they've made the chips....
yyup.
ok so here's some links:
https://github.com/enjoy-digital/litedram
http://libre-riscv.org/shakti/m_class/DDR/ (hmm must update that)
http://bugs.libre-riscv.org/show_bug.cgi?id=21
https://www.ohwr.org/projects/ddr3-sp6-core/wiki/wiki
so there *is* an actual DDR3 controller *and* there is symbioticaeda who have an LPDDR3 PHY which they're happy to make libre given the right price.
there's one other DDR3 controller that i know of, unfortunately it's GPL licensed.
l.
Hi Luke
ok so here's some links:
https://github.com/enjoy-digital/litedram
http://libre-riscv.org/shakti/m_class/DDR/ (hmm must update that)
http://bugs.libre-riscv.org/show_bug.cgi?id=21
https://www.ohwr.org/projects/ddr3-sp6-core/wiki/wiki
Great! I will try to work them into SauMauPing! Thanks!
so there *is* an actual DDR3 controller *and* there is symbioticaeda who have an LPDDR3 PHY which they're happy to make libre given the right price.
That shouldn't be a problem as soon as we've got PearlRiver done and everything works in the lab. We can also use the fab in Tai Po to serve mass orders in case there is demand for their IP (in combination with other IPs, of course). As soon as the process is ready they should join us in Mumble and discuss with us how much cash they'd expect from a potential mass tape-out.
there's one other DDR3 controller that i know of, unfortunately it's GPL licensed.
As I mentioned before, the GPL isn't worth anything when used for licensing semiconductor IP, because it stops covering the flow halfway through. So worrying about "GPL or whatever" is pointless ;-) As soon as Cedric is done with the first revision of the LibreSilicon public license we will have something actually able to defend intellectual ownership rights over IPs. But right now he's on a business trip, trying not to get attacked by drop bears ;-)
Cheers David
Hi
After quite a bit of research I had to realize that there isn't a single free DRAM controller available. The SiFive folks use the vendor-specific closed IP cores... They're just using the vendor-specific ones for the FPGA but don't have their own integrated into rocket-chip... Like for Xilinx they use the proprietary one from Xilinx, and so on.
No one seems to have bothered writing an actually free DRAM controller for DDR3/DDR4 in Chisel so that one could actually build it as an ASIC. That's also most likely why SiFive didn't publish the actual code from which they've made the chips... Proprietary code, proprietary hardware. Gotta swallow the vomit back down every time I've gotta read the word "freedom" in the file paths of their repo and in their newsflashes right now...
Evaluated for you: we can't build this thing until we have designed our own DRAM controller... Anyway, I'm now doing a transplant and adapting their "Xilinx specific IP core" into my MIA702 version...[1] And the pain repeats for every new board...
So it's a good thing we've already started the evaluation now. We'll have to hire some additional folks to develop the missing IP cores needed for a tape-out and push them to GitHub, as soon as we get there.
Cheers David
[1] https://item.taobao.com/item.htm?id=563964492619&_u=
On Thursday, 28 June 2018 3:28:21 AM HKT David Lanzendörfer wrote:
Hi all
I've now managed to strip away the Xilinx/FPGA specific configuration quirks and only produce a plain Verilog core which is platform agnostic: https://github.com/libresilicon/SauMauPing1/tree/master/builds
Next steps are:
- Wire all the peripherals out
- Wire a GPU in there.
For the GPU I'd prefer either Verilog (because of the Yosys support) or Chisel/Scala (Generates Verilog)
Cheers -lev
libresilicon-developers@list.libresilicon.com