Forums - Open Redstone Engineers
I am TSO - Printable Version

+- Forums - Open Redstone Engineers (https://forum.openredstone.org)
+-- Forum: ORE General (https://forum.openredstone.org/forum-39.html)
+--- Forum: Introductions (https://forum.openredstone.org/forum-18.html)
+--- Thread: I am TSO (/thread-4807.html)



RE: I am TSO - Magazorb - 10-14-2014

The last two paragraphs explain the most to me; see, I hadn't known how you intended to keep the cores running the program without being spoon-fed instructions.

Your terminology of throughput seems a little different from ours as well. For us it essentially represents the number of operations a device can do at its best, as in quantity/time, measuring from what's initiated that will/has come out correct.
So to give an example: a device that takes 100 ticks to output what it's been given is fed a new input every 3 ticks. The output of every input comes out 100 ticks later, and is correct; 3 ticks after it's output, the computation for the input that followed is output.
Going back to throughput = quantity of operations/time, we can deduce that the throughput is 1 per 3 ticks, or 3+(1/3) os^-1 (operations per second, at 10 game ticks a second).
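
To make that arithmetic concrete, here's a tiny sketch (Python, assuming 1 redstone tick = 0.1 s; the numbers are the ones from the example above):

Code:
# Throughput of a pipelined device: set by how often it accepts input,
# not by how long any single result takes to appear.
TICK = 0.1            # seconds per redstone tick (assumption: 10 ticks/s)

latency_ticks = 100   # input -> matching output
interval_ticks = 3    # a new input is accepted every 3 ticks

ops_per_tick = 1 / interval_ticks
ops_per_second = 1 / (interval_ticks * TICK)
print(ops_per_tick, ops_per_second)   # 0.333... ops/tick, 3.33... ops/s
# The 100-tick latency never enters the throughput figure.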

I do think you were following this, but I believe you went astray when considering, say, a multiplication instruction and each operation required to fulfil it.
Where I think the misunderstanding came in (probably due to my bad wording) is that you thought of it as the instruction as a whole and not the operation.

Yes, as I have said, this would be optimized for throughput, but what I stated about it being slow in serial computations still stands true (as you seem to comprehend).

But one little thing you don't seem to follow about our general designs is that our Fwds will fetch data for the ALU before the GPRs even have that data; by doing this you can advance the execution of serial computations and thus increase performance. It's easy to make this a self-controlled feature; the CU just takes care of it for you.
You also seem to have the idea that we have to pay for two extra buffers, one before and one after the ALU... this is true in some designs, but those with a well-thought-out Fwd effectively limit your serial use of data to just the one buffer before the ALU. The value does go into the other stage buffer for writeback, after which it can be fetched again, but the Fwd means that while it's going through that stage, the ALU can simultaneously work on that same value by quickly looping it back into its input, which allows for quicker serial computations. (This was a note suggesting something you may find advantageous.)

With what we have, doing this on a clock-and-stage basis rather than timing all the controls, we can still easily get a throughput of one operation per 8 ticks, and these are generalized things that currently exist.

Going on to what is theoretically possible with that system: a 4-tick ALU for 8 bits, plus 1 tick per further 8 bits, with 2 outputs, one conditionally enabled by the Fwd and the other always on. From that point you only need 1 tick for the ALU input stage buffer, resulting in a throughput of 1 per 5 ticks, or 2 os^-1 (yes, I'm making up random units of measurement, but they are usable and comparable either way).
And this is all while keeping an extremely low exe stage count: 1. You could easily increase this to 2 and get 3+(1/3) os^-1 while maintaining what's essentially the same Fwd; however, in that case the throughput of non-serial computations would be 1 per 3 ticks, but serial computational throughput would drop to 1 per 6 ticks.

So you see, for being significantly simpler, they retain a lot of performance. As for the compiler magic: most people would just program efficiently the first time if optimized code was required, thus rendering the compiler redundant. (This was more to prove the point that non-serial computations in our systems are just as quick, if we desire.)

I've already explained, seemingly extensively, why your system not having the Fwd would seriously reduce its serial computation performance.

Like I've said a few times, you have good ideas, and if implemented well they will make for a nice system; it's just not pushing performance, relatively speaking. (I'm unsure what you're comparing what we have against, versus what you hope to have; if you could explain, that would be nice.)

I do concur with much of what you say about it being quicker to make it timing-based; theoretically this was always true, however most people never did it due to the CU complications of having a timing-based PE.


RE: I am TSO - jxu - 10-14-2014

stop talking and start building
tl;dr


RE: I am TSO - TSO - 10-14-2014

I'm going to address two of the things you mentioned right now. I remembered something in Wireworld that makes data entry dependent upon gates that contain their own clocks, giving clock-everywhere logic; this is called timing-dependent circuitry in the context of Wireworld. If that is what you mean here, then mine is certainly not timing based. I was thinking of something more like just keeping data together in a clockless system that resembles pipelining, so for any individual stage in each operation, data is as slow as the slowest operation, but data speed between independent operations is not limited this way. (If I ever start to build this thing, what I mean will quickly become apparent.)

Now, the compiler is accounting for way more than you think it is. The compiler is renaming registers, prearranging out-of-order execution, inserting lines of code to assist in condition prediction on some of the longer functions if the next line of code is a branch, rearranging the conditions in order to account for variable ALU operation timing, and managing/verifying the memory allocations before runtime. It is also synchronizing some aspects of the time dependence and variable clock rate of the entire system, and making up for the fact that I don't understand how programming works at all.
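
(For what one of those passes might look like: a minimal sketch of compile-time register renaming in Python, with a made-up three-operand instruction format; nothing here is the actual compiler, just the idea.)

Code:
# Hypothetical compile-time register renaming pass. Each instruction is
# (dest, src1, src2); every write gets a fresh name from a tag pool so a
# later scheduling pass can reorder without WAW/WAR hazards.
def rename(program, pool):
    mapping = {}               # architectural name -> current tag
    fresh = iter(pool)
    out = []
    for dest, a, b in program:
        a = mapping.get(a, a)  # read the newest version of each source
        b = mapping.get(b, b)
        mapping[dest] = next(fresh)
        out.append((mapping[dest], a, b))
    return out

prog = [("r1", "r2", "r3"), ("r1", "r1", "r4"), ("r2", "r1", "r1")]
print(rename(prog, ["t%d" % i for i in range(16)]))
# [('t0', 'r2', 'r3'), ('t1', 't0', 'r4'), ('t2', 't1', 't1')]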

I know that forwarding receives data before it is written; what I am saying is that because I fully intend to make all my operational units the exact length needed to bus themselves back to the registers, the extra two ticks inside the registers become transparent compared to a forward that needs some kind of control unit and a way to select whether it forwards or writes back (a 1-tick multiplexer at best on the output), plus an additional front-end buffer (another 1 tick with locking repeaters). I'll do the same example using my multiplier that you did above. The distance from the front of the multiplier to the back is 5 chunks. The distance from the input of the multiplier to the CPU is also five chunks (hey, would you look at that). The distance from the output of the multiplier to the register bank is zero chunks, and the distance from the register bank to the multiplier input is five chunks (weird, it's like it was built to just fit right in there).

The multiplier repeats every three ticks and takes 17 ticks to complete. In your design, we need one tick in a locking buffer at the input and 1 tick in a router at the back, plus all the work to make a control unit and a five-chunk bus to return the output to the input. In my design, we have a register write that takes two ticks, a register read that is actually simultaneous with the second write tick, and then it just reuses the five-chunk bus.

So, you cost the system 19 ticks per operation, which is within the next clock cycle from my multiplier, and everything else cancels out. I take 17 ticks, then 1 tick, and then the next clock (one tick after that) results in the next read. This is also nineteen ticks, but mine is much easier to build. In this case we are always equal; we are only unequal if yours and mine end up on different sides of a clock cycle. We have the same serial and parallel execution times of 24 ticks per serial repetition. (It also just occurred to me that you will never forward a multiplier, because the multiplier gives a 32-bit output but takes a sixteen-bit input.) The only place where you exceed me in speed is the Boolean functions, specifically the OR, NAND, and NOR functions. My AND function is actually faster than yours. XNOR is the same for both, and XOR is significantly faster for you than for me because I integrated it into the adder.

Also, you say yours is a four-tick ALU plus one tick for every eight bits? Making your clock five ticks instead of three?


RE: I am TSO - LordDecapo - 10-14-2014

OK, so I can't finish reading all this without a quick reply... I have a long-ass reply I'm writing on my laptop; checked this on my phone...
Do you know what Fwd is? And you do know that EVERY SINGLE INSTRUCTION except branching has a writeback?! That is WHY Fwd is so important..
And Maga, I Fwd on an 8-bit; it's easy with a simple XOR array to detect hazards..

A Fwd is like this.
Say you have these insts
($ means register and & means RAM, btw):
$1+&2=$3
$3+$4=$5
See how you NEED the result from the first inst for use in the 2nd? And say you have a pipeline where one stage is Execute and one stage is Writeback. That means you won't have $3 correct until the cycle after $3+$4=$5, meaning if you don't Fwd, then $5 is now corrupted.
You Fwd by taking the output of the ALU and sending it right back to where $3 would come back into the ALU. This makes it so you can use the data at the same time it's writing back, i.e. preventing a stall.
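
(A minimal sketch of that forward decision in Python; the instruction encoding and the values are made up, only the selection logic matters:)

Code:
# If the destination of the instruction still in Execute matches a source
# of the next instruction, take the ALU output instead of the stale
# register-file value.
def read_operand(reg, regfile, exe_dest, alu_out):
    if reg == exe_dest:       # hazard detected
        return alu_out        # forward straight from the ALU output
    return regfile[reg]       # normal register read

regfile = {"$1": 7, "$3": 0, "$4": 5}  # $3 hasn't been written back yet
exe_dest, alu_out = "$3", 10           # $1 + &2 just produced 10
a = read_operand("$3", regfile, exe_dest, alu_out)  # forwarded -> 10
b = read_operand("$4", regfile, exe_dest, alu_out)  # regfile   -> 5
print(a + b)  # 15: correct, no stall, even though writeback is in flight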

If you have a separate inst for writeback, your CPU will be SLLLLOOOOOWWWWW AS BAWLS. No way around that slowness at all whatsoever.


RE: I am TSO - LordDecapo - 10-14-2014

Fwd support isn't defined in the IS; it is defined in your ISA (your architectural implementation of your IS).

You CANNOT, whatsoever, have instructions write back out of order... unless you have a reorder buffer before your writeback. And the RoB will have to be able to Fwd as well, or your speed will suffer severely... unless you program with that in mind, in which case your compiler's complexity will be IMMENSE.
You cannot have multiple programs running at once UNLESS you hyperthread, meaning having multiple sets of registers, each program using its own, or you do A LOT of context swapping, meaning you store all register data to RAM for program0 and load all register data back from RAM for program1... every single time you swap programs.

And a flip-flop between stages IS A BUFFER..
Now on to page 9!


RE: I am TSO - LordDecapo - 10-14-2014

"Resembles pipelineing"
It's pipelined or not pipelined, no grey area.

And your system IS TIMING based, u wanted to keep control wires and data in synce so u don't need buffers,. Is like the literal definition of pipelineing via timing base system, since u have to focus A LOT of time to timing it all PERFECT.

A compiler won't be able to rename registers effectively, as renaming usually uses a pool of "tags" that is a lot larger than your actual register count.
And it will only be able to rearrange things so much; you have to Fwd or do some tricky CU magic to get it fully optimized. A good computer system is a mix of A BUNCH of individually complex systems all working together, not just a simple data loop and a smart compiler.
A compiler also cannot help you get your branch conditions sooner, as your branching effectiveness is determined by your architecture and your CU/branch hardware.
Memory allocation only helps with preventing storage corruption and inter-program security; it doesn't speed up a system unless you plan to access a lot of data quickly and only use it once or twice each. If you have a way for it to speed up processing, do share, in better-refined explanations please.

Your whole "make logic same as buss time" isn't that effective.. At all., buss time in a well thought out system can be near zero, and ur parts are usually smaller then like 90 blocks (6tick ALU with each tick of logic being the 15block apart)
U would end up with the same speed but a MUCH MUCH bigger system, and the bussing delay would actually be greater.

Also, what are you referring to as chunks? The game's chunks of loaded blocks? (16x16)
A Fwd router doesn't necessarily take any ticks at all: use a 1-tick shifter at the output of the ALU that inverts the signal, then have it invert again before the Fwd, where you can disable the torches by giving them a 2nd control power source. So you can have, in essence, a 0-tick output demux (it's a demux, not a router).

All Booleans should be synced and take the same time; if you have anything except a shit ALU, then add, sub and all Booleans will ALWAYS take the same time frame.
And Fwd is only for the ALU, not mult/div; you're right on that. And for a sequential mult/div system to work, you HAVE TO have a buffer or you will get MC glitches fucking up your data at a 3-tick throughput. I have tried 3 ticks; it DOESN'T WORK. AT ALL. Minecraft has a bug with torches: your data tick width will not always be 3, and you will find that the bug randomly gives you data tick widths between 1 and 4, which isn't consistent enough to be feasible. A 4-tick clock MAY work, but 5 is the only guarantee it will function at all. But good luck with Fwd on a 5-tick: you would need at least a dual-Fwd system (so you can give data not only to the immediately following inst, but to the one following that as well). This isn't TOO hard, but if you don't know pipelining and hazard detection well (no offense, it doesn't seem like you're grasping those concepts that well), it can be a BITCH AND A HALF to implement. I suggest lengthening your clock.
A slower clock can increase speed and performance due to lower stage counts. If you don't believe me, I'll post a picture of the math I did that tests 3 clock speeds and 2 different stage counts. The faster clock speed only helps if you hardly ever branch. It averages out that an 8-tick clock with 8 stages is 0.56% slower than a 10-tick clock with 5 stages, averaged over the course of the tests I ran.
The tests use a static branch prediction of true, which is actually a worst-case scenario. Static false is better, but fuck it.
I used these scenarios
(the format is %false branches - %true branches - 1 branch per x inst):
0-100-1/2
30-70-1/2
50-50-1/2
70-30-1/2
0-100-1/4
30-70-1/4
50-50-1/4
70-30-1/4
0-100-1/8
30-70-1/8
50-50-1/8
70-30-1/8
0-100-1/16
30-70-1/16
50-50-1/16
70-30-1/16

Note that IRL programs average around 30-70-1/8 for calculation-heavy programs,
and around 50-50-1/16 for memory-intensive programs; with the normal programs I see people running in MC, it's like 30-70-1/4, sometimes 30-70-1/2.
My 10-tick 5-stage handles more frequent branches A LOT better over the course of 100 inst.
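
(A rough sketch of that kind of comparison; the penalty model here, flushing everything in front of the stage that resolves the flags, is my assumption rather than LordDecapo's exact math:)

Code:
# Average ticks per instruction under static prediction, where every
# mispredicted branch costs (stages - 1) * clock ticks of bubbles.
def avg_ticks_per_inst(clock, stages, mispredict_rate, branch_freq):
    penalty = (stages - 1) * clock
    return clock + branch_freq * mispredict_rate * penalty

# Scenarios from the list above (mispredict rate = %false, predict-true):
for false_rate, freq in [(0.70, 1/2), (0.30, 1/8), (0.30, 1/16)]:
    fast = avg_ticks_per_inst(8, 8, false_rate, freq)    # 8-tick, 8 stages
    slow = avg_ticks_per_inst(10, 5, false_rate, freq)   # 10-tick, 5 stages
    print(false_rate, freq, round(fast, 2), round(slow, 2))
# Frequent branches (70% false, 1 per 2 inst): 27.6 vs 24.0 -> 10t/5s wins.
# Rare branches (30% false, 1 per 16 inst): 9.05 vs 10.75 -> 8t/8s wins.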


RE: I am TSO - LordDecapo - 10-14-2014

Note I didn't test 100-0 on those branch tests, as that scenario is equivalent to the throughput if you didn't have any branches at all.


RE: I am TSO - TSO - 10-14-2014

It stops when branching? Why?


RE: I am TSO - LordDecapo - 10-14-2014

(10-14-2014, 05:24 PM)TSO Wrote: It stops when branching? Why?

Doesn't "stop" it bubbles.
Let's say a CPU has a static prediction On branching to False (like the simulations I have examples of) So immediately after reading the branch inst. it starts loading the next line in program order, which it reads if the branch is False.
Now let's say that if you have a 5 pipeline stages. And u don't get the flags to check ur condition till stage 4, that means by the time u have realized the branch is actually true, u have the inst in stages 1 2 and 3 that are now invalid. So u have to clear those (really easy if u have buffers, just dont clock the buffer between 3&4 for 3 cycles and that incorrect data will get trashed.
Those 3 stages are the "stop" that ur referring to, where nothing gets done.
That stop time (bubbles) are what I used in my calculations as being the Pipeline Penalty.
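
(The same mechanism as a tiny sketch; the stage count, flag stage and pseudo-instruction format are taken from the example above:)

Code:
# 5-stage pipe, flags resolve in stage 4, static predict-not-taken:
# each taken branch squashes the 3 younger instructions behind it.
STAGES, FLAG_STAGE = 5, 4

def cycles(trace):
    """trace: list of ("op",) or ("br", taken)."""
    total = len(trace) + STAGES - 1       # ideal pipelined cycle count
    for inst in trace:
        if inst[0] == "br" and inst[1]:   # branch turned out taken
            total += FLAG_STAGE - 1       # 3 bubble cycles
    return total

print(cycles([("op",), ("br", True), ("op",), ("op",)]))
# 4 inst + 4 fill cycles + 3 bubbles = 11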


RE: I am TSO - Magazorb - 10-14-2014

(10-14-2014, 01:53 PM)LordDecapo Wrote: And Maga, I Fwd on an 8-bit; it's easy with a simple XOR array to detect hazards..

A Fwd is like this.
Say you have these insts
($ means register and & means RAM, btw):
$1+&2=$3
$3+$4=$5

Yup, that's how I'd go, but it's still part of the CU, as it's not on the PU side. Basically, in what I described as a 1-stage-execution pipelined CPU, I have a tag flowing along with the data beside the PU that gets calculated as early as I reasonably can in previous stages; at worst, assuming I can't do that, the Fwd would be calculated by having a register in the exe stage that ORs the outputs of two XORs against the two inputs for the next cycle, by passing the address data forward a stage.

In short, the exe stage would XOR its output register address with the input register addresses for the next cycle, by looking at the input registers in the fetch stage (or whatever stage is just before execution).

Pretty much the same ordeal for the 2-stage exe, but the Fwd from stage 2's output goes into stage 1's input; and depending on how slow the data loop without Fwd is, it would look further ahead than just the next cycle of data to decide whether that data loop is quick enough. If it isn't, it would activate the Fwd but hold the value in a register until required.

That's how I'd do them; sounds pretty similar to yours.
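
(Sketching that XOR hazard check; the 3-bit register addresses are values I made up for illustration:)

Code:
# XOR the exe stage's destination address with each source address of
# the next instruction; an all-zero result means the addresses match,
# so the forward path gets enabled for that operand.
def matches(dest_addr, src_addr):
    return (dest_addr ^ src_addr) == 0

exe_dest = 0b011             # exe stage is writing register $3
next_srcs = (0b011, 0b100)   # next inst reads $3 and $4
fwd_enable = [matches(exe_dest, s) for s in next_srcs]
print(fwd_enable)            # [True, False]: forward the first operand only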

(10-14-2014, 02:47 PM)LordDecapo Wrote: "Resembles pipelining"... [full post quoted above]

My maths on a different scenario, which had a 100% success rate, still showed a higher stage count proving slower in serial computations and branching. (I haven't really played with testing many conditions and collecting the flags with tags to figure out where to finally branch to; I'd suspect this would be quicker though, as it allows you to run branches like a vector processor until the final branch condition is done, then branch based on the collection of flags. This could be very good for, in C++ terminology, switch statements and nested if statements.)

Some quick maths thanks to Excel :) Only approximations, not too accurate (can't be asked to make it 100% accurate, but this will be close enough to represent rough performance).

Key:
Tr=Times Repeated, SC=Serial Computations, !SC=Non-Serial Computations,
Bc=Branch Count, Bt=Branch Time.


Config  exe Stage Count  Throughput  Notes
A       1                8           Fwd from execution stage 1 output (Es1o) to Es1i (input), no accumulative branch testing
B       2                5           Fwd like A, w/ SPR(s) preventing the effective data loop exceeding the exe stage time
C       1                8           Same as A, w/ group branch calculation
D       2                5           Same as B, w/ group branch calculation

Test  Tr  SC  !SC  Bc  Bt
A     75   0    0   0  32
B     75   0    8   0  32
C     75   8    8   0  32
D     75   8    0   0  32
E     75   8    0   8  32
F     75   0    8   8  32
G     75   8    8   8  32
H     75   0    0   8  32

Test  Config A  Config B  Config C  Config D
A            8        10         8        10
B         4808      3010      4808      3010
C         9608      9010      9608      9010
D         4808      6010      4808      6010
E       158408    198010     28808     31210
F       158408    195010     28808     28210
G       163208    201010     33608     34210
H       153608    192010     24008     25210


A>B&C>D @ Serial Computations
A<B&C<D @ Non Serial Computations
A&B<C&D @ Branching

Maths used (again, this wasn't perfect maths; I did it quickly, so it isn't fully correct):
SCp = exe Stage Count [eSc] * Throughput [Tp]
A&B = Tr*((SC*SCp) + (!SC*Tp) + (Bc*SCp*Bt)) + Tp
C&D = Tr*((SC*SCp) + (!SC*Tp) + (Bc*(SCp+Bt))) + Tp

Also a note to TSO regarding my described setup: a 4-tick ALU for 8 bits plus 1 tick per extra 8 bits, including Fwd within that time, plus a 1-tick input stage buffer, so an effective data loop of 5 ticks.
So doing multiplication on that via looped addition = 5 ticks per loop. Assuming you have a good CU, it would loop-add the higher value, the lower value's number of times: A=8, B=5: 5x5=25 ticks
A=13, B=11: 5x11=55 ticks
A=16, B=16: 5x16=80 ticks

I know your design excels at multiplication; however, you have a dedicated multiplier, while this is emulated without even a shifter. If you had a shifter integrated you could perform this within 2 cycles (10 ticks), same with division as well.
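
(The looped-addition tick counts above as a quick check; the 5-tick figure is the effective data loop from my description:)

Code:
# Looped-addition multiply: add the larger operand, smaller-operand times,
# at 5 ticks per trip around the data loop.
def looped_add_ticks(a, b, loop_ticks=5):
    return min(a, b) * loop_ticks

for a, b in [(8, 5), (13, 11), (16, 16)]:
    print(a, "x", b, "=", looped_add_ticks(a, b), "ticks")
# 25, 55 and 80 ticks, matching the worked numbers above.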

However, most of our PEs don't have multipliers or dividers implemented, though some do have shifters.
Also to note: very few people have put much effort into the CU, so currently there are no GPRM PEs with a CU good enough to make looped addition as quick as possible; I believe two PEs do, though.

Also, the bitwise logic, addition and subtraction, as well as branching in some cases, are in this setup very reliably a consistent 5 ticks per instruction, so programming isn't so hard with these kinds of machines, plus good ASM practice :)

The simpler GPRM PEs that feature everything I said are around 18-ish ticks, but those use no pipelining and so are expected to be slow; the simple pipelined ones are around 10-12-ish ticks, and the more advanced ones around 8 ticks. The one I described is theoretical and based on what we currently have, but with some extra parallelized logic to reduce computation times (yet to be made, but logically it already works).

Also, thanks for the explanations of how yours works; it cleared a few things up :)