(10-14-2014, 01:53 PM)LordDecapo Wrote: And maga, I forward on an 8-bit; it's easy with a simple XOR array to detect hazards.
A forward looks like this.
Say you have these instructions
($ means register and & means RAM, btw):
$1 + &2 = $3
$3 + $4 = $5
Yup, that's how I go, but it's still part of the CU, as it's not on the PU side. Basically, for what I described as a 1-stage-execution pipelined CPU, I have a tag flowing along with the data beside the PU, calculated as early in the previous stages as I reasonably can. Worst case, assuming I can't do that, forwarding would be decided by a register in the exe stage that ORs the results of two XORs against the two inputs for the next cycle, by passing the address data forward a stage.
In short, the exe stage XORs its output register address with the input register addresses for the next cycle, by looking at the input registers in the fetch stage (or whatever stage comes just before execution).
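That XOR check fits in a few lines; here's a minimal sketch, assuming register addresses are small integers (the function name is mine, not from any real build):

```python
# Hedged sketch of hazard detection for a 1-stage-execute pipeline.
# Register addresses are small ints (e.g. 3 bits for 8 registers).

def needs_forward(prev_dest: int, next_src_a: int, next_src_b: int) -> bool:
    """Forward the ALU result when the instruction entering execute
    reads the register the previous instruction is about to write.
    In hardware: bitwise XOR of the address lines, then a NOR-reduce
    (all bits equal -> match), with the two matches ORed together."""
    match_a = (prev_dest ^ next_src_a) == 0  # XOR == 0 means addresses are equal
    match_b = (prev_dest ^ next_src_b) == 0
    return match_a or match_b

# The example above: $1+&2=$3 followed by $3+$4=$5.
# The second instruction reads $3, which the first writes, so forward.
print(needs_forward(3, 3, 4))  # True
print(needs_forward(3, 1, 4))  # False
```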
Pretty much the same ordeal for the 2-stage exe, except the forward from stage 2's output goes into stage 1's input. Depending on how slow the data loop without forwarding is, it would look further ahead than just the next cycle of data to decide whether that data loop is quick enough; if it isn't, it would activate forwarding but hold the value in a register until required.
That's how I'd do them; sounds pretty similar to yours.
(10-14-2014, 02:47 PM)LordDecapo Wrote: "Resembles pipelining"
It's either pipelined or not pipelined; there is no grey area.
And your system IS timing-based. You wanted to keep control wires and data in sync so you don't need buffers; that is practically the literal definition of pipelining via a timing-based system, since you have to spend A LOT of time getting the timing PERFECT.
A compiler won't be able to rename registers effectively, as renaming usually uses a pool of "tags" that is a lot larger than your actual register count.
And it can only rearrange things so much; you have to forward or do some tricky CU magic to get it fully optimized. A good computer system is a mix of A BUNCH of individually complex systems all working together, not just a simple data loop and a smart compiler.
A compiler also cannot get your branch conditions resolved sooner, as your branching effectiveness is determined by your architecture and your CU/branch hardware.
Memory allocation only helps with preventing storage corruption and inter-program security; it doesn't speed up a system unless you plan to access a lot of data quickly and only use each piece once or twice. If you have a way for it to speed up processing, do share a better-refined explanation, please.
Your whole "make logic time match bus time" idea isn't that effective at all. Bus time in a well-thought-out system can be near zero, and your parts are usually smaller than about 90 blocks (a 6-tick ALU with each tick of logic being 15 blocks apart).
You would end up with the same speed but a MUCH, MUCH bigger system, and the bussing delay would actually be greater.
Also, what are you referring to as chunks? The game's chunks of loaded blocks? (16x16)
A forward router doesn't necessarily take any ticks at all: use a 1-tick shifter at the output of the ALU that inverts the signal, then have it invert again before the forward, where you can disable the torches by giving them a second control power source. That way you get, in essence, a 0-tick output demux (it's a demux, not a router).
All Boolean ops should be in sync and take the same time; unless you have a terrible ALU, add, subtract, and all the Boolean ops will ALWAYS take the same time frame.
And forwarding is only for the ALU, not mult/div; you're right on that. And for a sequential mult/div system to work, you HAVE TO have a buffer or MC glitches will mess up your data at a 3-tick throughput. I have tried 3 ticks; it DOESN'T work, at all. Minecraft has a bug with torches: your data tick width will not always be 3, and you will find that the bug randomly gives you data tick widths between 1 and 4, which isn't consistent enough to be feasible. A 4-tick clock MAY work, but 5 is the only width guaranteed to function at all. Good luck with forwarding on a 5-tick clock, though: you would need at least a dual forwarding system, so you can hand data not only to the immediately following instruction but to the one after that as well. This isn't TOO hard, but if you don't know pipelining and hazard detection well (no offense, it doesn't seem like you are grasping those concepts that well), it can be a BITCH AND A HALF to implement. I suggest lengthening your clock.
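Dual forwarding extends the same XOR-style check to the two most recent in-flight results; a sketch under that assumption (names and the priority rule are mine):

```python
# Hedged sketch of dual forwarding: compare a source register against
# the destinations of the last two instructions still in flight.

def forward_select(src: int, dest_1back: int, dest_2back: int) -> str:
    """Pick where an operand comes from: the result one instruction
    back, two instructions back, or the register file. On a double
    match the newer result must win, since it overwrites the older."""
    if src == dest_1back:
        return "fwd_stage1"
    if src == dest_2back:
        return "fwd_stage2"
    return "regfile"

# $3+$4=$5 ; $5+$1=$6 ; $5+$6=$7 -> the third inst needs both paths.
print(forward_select(5, 6, 5))  # fwd_stage2 (reads $5, written 2 back)
print(forward_select(6, 6, 5))  # fwd_stage1 (reads $6, written 1 back)
```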
A slower clock can increase speed and performance due to lower stage counts. If you don't believe me, I'll post a picture of the math I did testing 3 clock speeds and 2 different stage counts. The faster clock only helps if you hardly ever branch. It averaged out that an 8-tick clock with 8 stages is 0.56% slower than a 10-tick clock with 5 stages over the course of the tests I ran.
The tests use a static branch prediction of "true", which is actually the worse-case scenario; static false is better, but fuck it.
I used these scenarios
(the format is %false branches - %true branches - 1 branch per x instructions):
0-100-1/2
30-70-1/2
50-50-1/2
70-30-1/2
0-100-1/4
30-70-1/4
50-50-1/4
70-30-1/4
0-100-1/8
30-70-1/8
50-50-1/8
70-30-1/8
0-100-1/16
30-70-1/16
50-50-1/16
70-30-1/16
Note that real-life programs average around 30-70-1/8 for calculation-heavy programs
and around 50-50-1/16 for memory-intensive programs; with the normal programs I see people running in MC, it's more like 30-70-1/4, sometimes 30-70-1/2.
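A rough way to run scenarios like these yourself; a minimal sketch where the flush penalty model (stages x clock ticks per mispredict) is my assumption, not a figure from the post:

```python
# Hedged sketch: average ticks per instruction under static
# predict-true, where false (not-taken) branches mispredict and
# flush the pipe. Penalty model is an assumption: stages * clock.

def avg_ticks_per_inst(clock: int, stages: int,
                       frac_false: float, branch_rate: float) -> float:
    """clock in ticks; branch_rate = branches per instruction;
    frac_false = fraction of branches mispredicted under predict-true."""
    mispredict_penalty = stages * clock  # refill the whole pipeline
    return clock + branch_rate * frac_false * mispredict_penalty

# The 30-70-1/4 scenario, for both configurations discussed:
fast_deep = avg_ticks_per_inst(clock=8, stages=8,
                               frac_false=0.3, branch_rate=1 / 4)
slow_short = avg_ticks_per_inst(clock=10, stages=5,
                                frac_false=0.3, branch_rate=1 / 4)
print(fast_deep, slow_short)  # the deeper pipe pays more per mispredict
```

Sweeping `frac_false` and `branch_rate` over the scenario list above shows how quickly the deeper pipeline's advantage erodes as branches get more frequent.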
My 10-tick, 5-stage design handles more frequent branches A LOT better over the course of 100 instructions.
My math on a different scenario, which assumed a 100% prediction success rate, still showed the higher stage count to be slower in serial computations and branching. (I haven't really played with testing many conditions and collecting the flags with tags to figure out where to finally branch to. I'd suspect this would be quicker, though, as it lets you run branches like a vector processor until the final branch condition is done, then branch based on the collection of flags; this could be very good for, in C++ terminology, switch statements and nested if statements.)
Some quick math thanks to Excel. These are only approximations, not too accurate (can't be asked to make it 100% accurate, but this will be close enough to represent rough performance).
Key:
Tr = Times Repeated, SC = Serial Computations, !SC = Non-Serial Computations,
Bc = Branch Count, Bt = Branch Time.
Config  exe Stage Count  Throughput  Notes
A       1                8           Fwd capable from execution stage 1 output (Es1o) to Es1i (input); no accumulative branch testing
B       2                5           Fwd like A, w/ SPR(s) preventing the effective data loop exceeding the exe stage time
C       1                8           Same as A, w/ group branch calculation
D       2                5           Same as B, w/ group branch calculation
Test  Tr  SC  !SC  Bc  Bt
A     75   0    0   0  32
B     75   0    8   0  32
C     75   8    8   0  32
D     75   8    0   0  32
E     75   8    0   8  32
F     75   0    8   8  32
G     75   8    8   8  32
H     75   0    0   8  32
Test  Config A  Config B  Config C  Config D
A            8        10         8        10
B         4808      3010      4808      3010
C         9608      9010      9608      9010
D         4808      6010      4808      6010
E       158408    198010     28808     31210
F       158408    195010     28808     28210
G       163208    201010     33608     34210
H       153608    192010     24008     25210
A > B & C > D @ serial computations
A < B & C < D @ non-serial computations
A&B < C&D @ branching
Math used (again, this wasn't perfect math; I did it quickly, so treat it as rough):
SCp = exe Stage Count [eSc] * Throughput [Tp]
A&B = TR*((SC*SCp) + (!SC*Tp) + (SCp*Bc*Bt)) + SCp
C&D = TR*((SC*SCp) + (!SC*Tp) + (Bc*(Bt+SCp))) + SCp
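The two cost formulas can be checked against the results table in a few lines; a sketch with my own function and variable names, using the parenthesization that reproduces every table entry:

```python
# Sketch of the cost model: SCp = eSc * Tp, and the trailing + SCp
# term is the one-off pipeline fill. nSC stands in for !SC.

def cost_AB(TR, SC, nSC, Bc, Bt, eSc, Tp):
    """Configs A & B (forwarding only): each branch costs SCp * Bt."""
    SCp = eSc * Tp
    return TR * (SC * SCp + nSC * Tp + SCp * Bc * Bt) + SCp

def cost_CD(TR, SC, nSC, Bc, Bt, eSc, Tp):
    """Configs C & D (group branch calc): each branch costs Bt + SCp."""
    SCp = eSc * Tp
    return TR * (SC * SCp + nSC * Tp + Bc * (Bt + SCp)) + SCp

# Test rows (Tr, SC, !SC, Bc, Bt) from the tables above.
tests = {"A": (75, 0, 0, 0, 32), "B": (75, 0, 8, 0, 32),
         "C": (75, 8, 8, 0, 32), "D": (75, 8, 0, 0, 32),
         "E": (75, 8, 0, 8, 32), "F": (75, 0, 8, 8, 32),
         "G": (75, 8, 8, 8, 32), "H": (75, 0, 0, 8, 32)}
for name, row in tests.items():
    print(name,
          cost_AB(*row, eSc=1, Tp=8), cost_AB(*row, eSc=2, Tp=5),
          cost_CD(*row, eSc=1, Tp=8), cost_CD(*row, eSc=2, Tp=5))
```

Running this reprints the results table, one row per test, in the same A/B/C/D column order.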
Also, a note to TSO regarding my described setup: a 4-tick ALU for 8 bits plus 1 tick per extra 8 bits, forwarding included within that time, and a 1-tick input stage buffer, for an effective data loop of 5 ticks.
So multiplication on that via looped addition is 5 ticks per loop; assuming you have a good CU, it would loop-add the higher value the lower value's number of times.
A=8, B=5: 5x5 = 25 ticks
A=13, B=11: 5x11 = 55 ticks
A=16, B=16: 5x16 = 80 ticks
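As a sanity check on those tick counts, a small sketch (the constant and function name are mine):

```python
# Sketch: multiplication by looped addition on the described datapath.
# Effective data loop = 5 ticks (4-tick ALU + 1-tick input buffer);
# the CU loop-adds the larger operand "smaller operand" times.

LOOP_TICKS = 5

def mul_by_looped_add(a: int, b: int) -> tuple:
    """Return (product, ticks taken) for repeated addition."""
    big, small = max(a, b), min(a, b)
    acc = 0
    for _ in range(small):  # one 5-tick data loop per addition
        acc += big
    return acc, small * LOOP_TICKS

print(mul_by_looped_add(8, 5))    # (40, 25)
print(mul_by_looped_add(13, 11))  # (143, 55)
print(mul_by_looped_add(16, 16))  # (256, 80)
```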
I know your design excels at multiplication; however, you have a dedicated multiplier, while this is emulated without even a shifter. With a shifter integrated, you could perform this within 2 cycles (10 ticks), and the same goes for division.
However, most of our PEs don't have multipliers or dividers implemented, though some do have shifters.
Also note that very few people have put much effort into the CU, so currently there are no GPRM PEs with a CU good enough to make looped addition as quick as possible; I believe two PEs do, though.
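The shifter-based scheme mentioned above is classic shift-and-add; a sketch of the algorithm itself, with no claim about cycle counts on any particular redstone build:

```python
# Hedged sketch: shift-and-add multiplication, the technique a shifter
# enables. For each set bit of b, add a correspondingly shifted copy
# of a to the accumulator.

def mul_shift_add(a: int, b: int) -> int:
    """Multiply by scanning b's bits; at most bit-width iterations,
    versus "smaller operand" iterations for looped addition."""
    acc = 0
    while b:
        if b & 1:          # this bit contributes a shifted copy of a
            acc += a
        a <<= 1            # shift the partial product left
        b >>= 1            # move to the next bit of b
    return acc

print(mul_shift_add(13, 11))  # 143
print(mul_shift_add(16, 16))  # 256
```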
Also, bitwise logic, addition, and subtraction (and branching, in some cases) in this setup are very reliably a consistent 5 ticks per instruction, so programming isn't so hard with these kinds of machines, plus it encourages good ASM practice.
The simpler GPRM PEs that feature everything I said are around 18-ish ticks, but those use no pipelining and are expected to be slow; the simple pipelined ones are around 10-12 ticks, and the more advanced ones around 8 ticks. The one I described is theoretical and based on what we currently have, but with some extra parallelized logic to reduce computation times (yet to be built, but logically it already works).
Also, thanks for the explanation of how yours works; it cleared a few things up.