--- crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68
On Wed, Jun 27, 2018 at 10:03 PM, Mohammad Amin Nili <manili.devteam@gmail.com> wrote:
Would you mind telling me what you mean by “state” DS from the main core, and why you think it is a problem?
ok so you can't just execute GPU instructions on the main core, right? can you? because they're assembly code designed for.... the GPU, right? not the CPU, yes?
so you have some OpenGL data structures - "state" - which is on the main CPU, right? and, obviously, it's f***-all use sitting there on the CPU, you have to "package" that up, and get it to the GPU, yes?
but you can't just "throw it at the GPU and hope like hell it'll magically get there and do something", can you?
so you have to:
(a) "package" the OpenGL data structures - "state" - up into a format THAT THE GPU UNDERSTANDS.
(b) *TELL* the GPU "here's your data, do something".
(c) *** STOP *** the CPU from executing (or context-switch - do something else) whilst the GPU is working on it.
(d) have the GPU *** RETURN *** or otherwise communicate back to the CPU to tell it when the job is done.
that's damn complicated, isn't it? and how many data structures are there, and how much data to send over? you now need to design a hardware-software API, to deal with all that state, yes?
basically a CPU-GPU interface needs an IPC (Inter-Process Communication) mechanism that has to take into account TWO TOTALLY DIFFERENT ARCHITECTURES, doesn't it?
even if the only hardware was that ChiselGPU code, it would *still* be necessary to write some IPC system, packaging up the data on the CPU, telling the ChiselGPU engine "go", waiting for it to say "done", yes?
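to make the package / go / wait / done sequence concrete, here is a minimal sketch in python. everything in it is an assumption made up for illustration: the `STATE_FORMAT` layout, the `OP_CLEAR` opcode, and the `fake_gpu` thread standing in for real hardware. it is not how ChiselGPU or any real driver does it, it only shows that all four steps have to exist in some form.

```python
import queue
import struct
import threading

# Hypothetical wire format: opcode, width, height, clear colour.
# A real GPU would define its own (far larger) command layout.
STATE_FORMAT = "<IHHI"
OP_CLEAR = 1


def package_state(state):
    """(a) Pack CPU-side state into the byte layout the 'GPU' understands."""
    return struct.pack(STATE_FORMAT, OP_CLEAR, state["width"],
                       state["height"], state["clear_colour"])


def fake_gpu(commands, completions):
    """Stand-in for the GPU: decode a command, do the work, signal 'done'."""
    while True:
        packet = commands.get()
        if packet is None:  # shutdown marker for this sketch, not a command
            break
        op, width, height, colour = struct.unpack(STATE_FORMAT, packet)
        framebuffer = [colour] * (width * height) if op == OP_CLEAR else []
        completions.put(framebuffer)  # (d) report back to the CPU


def render(state):
    commands, completions = queue.Queue(), queue.Queue()
    gpu = threading.Thread(target=fake_gpu, args=(commands, completions))
    gpu.start()
    commands.put(package_state(state))  # (b) "here's your data, do something"
    framebuffer = completions.get()     # (c) CPU blocks until the GPU is done
    commands.put(None)
    gpu.join()
    return framebuffer
```

even this toy needs a wire format, two queues and a blocking wait, and it only carries four fields of state. a real OpenGL context carries vastly more.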
now compare that to just... taking the MesaGL source code and hitting the "compile" button, and taking the gallium3d-llvm source code and hitting the "compile" button. does that sound a lot easier?
in the case where you keep all of the code, the state, and the data structures in *ONE* processor, the entire development process becomes drastically, drastically simpler, yes?
one thing: it may be possible to begin profiling the gallium3d-llvm code *right now* (even on x86) to assess the inner loops and see where most of the time is spent, in each of the different areas associated with 3D rendering. take a look at jeff's nyuzi2016 paper to see what i mean. it's *really* important to know how many cycles are spent (on average, per pixel) transferring data from memory into registers (and back). it's really important to know how many cycles per pixel are spent on rasterisation, and so on.
whilst gallium3d-llvm on x86 will be heavily-optimised for SSE, it will at least give a good indication.
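as a rough idea of the kind of per-stage accounting meant here, a minimal python sketch follows. the stage functions (`load_tile`, `shade`, `store_tile`) are toy stand-ins, not gallium3d code, and the "cycles" are only an estimate: wall-clock nanoseconds multiplied by an assumed nominal clock rate. real measurements would want hardware performance counters or a cycle-accurate simulator.

```python
import time
from collections import defaultdict


class StageProfile:
    """Accumulates wall-clock time per named pipeline stage."""

    def __init__(self):
        self.ns = defaultdict(int)

    def timed(self, stage, fn, *args):
        start = time.perf_counter_ns()
        result = fn(*args)
        self.ns[stage] += time.perf_counter_ns() - start
        return result

    def cycles_per_pixel(self, pixels, clock_hz):
        # Estimate only: elapsed seconds * assumed clock / pixel count.
        return {stage: ns * 1e-9 * clock_hz / pixels
                for stage, ns in self.ns.items()}


# Toy stand-ins for the real stages (hypothetical, for illustration only).
def load_tile(memory, count):
    return list(memory[:count])          # memory -> "registers"


def shade(pixels):
    return [(p * 3 + 1) & 0xFFFFFFFF for p in pixels]


def store_tile(memory, pixels):
    memory[:len(pixels)] = pixels        # "registers" -> memory


def profile_frame(width, height, clock_hz=1_000_000_000):
    count = width * height
    memory = list(range(count))
    prof = StageProfile()
    pixels = prof.timed("load", load_tile, memory, count)
    pixels = prof.timed("shade", shade, pixels)
    prof.timed("store", store_tile, memory, pixels)
    return prof.cycles_per_pixel(count, clock_hz), memory
```

calling `profile_frame(64, 64)` returns a dict with one estimated cycles-per-pixel figure for each of "load", "shade" and "store", which is the shape of breakdown that would show whether the memory traffic or the computation dominates.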
So if I understood right you are talking about a Larrabee-like RISC-V architecture?
yeah pretty much. with a focus on finding out *where* time is spent, then investigating if fixed-functions can be designed to speed that up (and reduce power consumption at the same time).
also see what difference low-level data types (FP16, FP12) and different SIMD / Vector widths would make.
also what i would *really* like to know is: if RISC-V were extended to 64 registers (it's currently 32), could the extra 32 registers (on a 64-bit system) be effectively used as a substitute for a tiling architecture's scratch-RAM area? a 4x4 tile at 32bpp is basically 16 32-bit registers, which is only 8 64-bit SIMD registers. which really is not a lot.
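the tile arithmetic above is easy to check for other tile sizes. a small python helper (the name `registers_for_tile` is made up here) just divides the tile's total bits by the register width, rounding up:

```python
def registers_for_tile(tile_w, tile_h, bits_per_pixel, register_bits):
    """Number of registers needed to hold one tile entirely in registers."""
    total_bits = tile_w * tile_h * bits_per_pixel
    return -(-total_bits // register_bits)  # ceiling division


print(registers_for_tile(4, 4, 32, 32))  # 16
print(registers_for_tile(4, 4, 32, 64))  # 8
print(registers_for_tile(8, 8, 32, 64))  # 32
```

so by the same sum, an 8x8 tile at 32bpp would take exactly 32 64-bit registers, i.e. all of the extra ones, leaving nothing spare for depth or intermediate values.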
l.