10-14-2014, 02:19 AM
The last two paragraphs explain the most to me; see, I hadn't known how you intended to keep the cores running the program without being spoon-fed instructions.
Your terminology for throughput seems a little different from ours as well. For us it essentially represents the number of operations a device can do at its best, i.e. quantity/time, counting only operations that were initiated and will/have come out correct.
So to give an example: a device that takes 100 ticks to output a result is given a new input every 3 ticks. The output for each input comes out 100 ticks later, but it is correct, and 3 ticks after it is output, the result for the following input comes out.
Going back to throughput = quantity of operations / time, we can deduce that the throughput is 1 operation per 3 ticks, or about 3⅓ os^-1 (operations per second, taking 1 tick = 0.1 s, which is what these figures imply).
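Just to make the arithmetic explicit, here's a minimal sketch of that conversion; the 1 tick = 0.1 s figure is my assumption, taken from how the os^-1 numbers in this post work out:

```python
# Throughput = quantity of operations / time, independent of latency.
latency_ticks = 100    # time from an input to its own output
interval_ticks = 3     # a new input is accepted every 3 ticks
tick_seconds = 0.1     # assumed length of one tick

ops_per_tick = 1 / interval_ticks             # ~0.333 ops per tick
ops_per_second = ops_per_tick / tick_seconds  # ~3.33 ops per second

print(f"{ops_per_tick:.3f} ops/tick, {ops_per_second:.2f} ops/s")
# Note that the 100-tick latency never enters the throughput figure.
```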
I do think you were following this, but I believe you went astray on the case of, say, running a multiplication instruction and each operation required to fulfil it.
Where I think the misunderstanding crept in, probably due to my bad wording, is that you thought of it as the instruction as a whole rather than the individual operations.
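As a concrete illustration of what I mean by counting operations rather than instructions, here's a hypothetical shift-and-add multiply; this isn't any particular machine's multiplier, just a sketch of how one multiply instruction breaks down into several ALU operations:

```python
def mul_as_adds(a: int, b: int, width: int = 8):
    """One multiply 'instruction' carried out as several add operations."""
    result, ops = 0, 0
    for bit in range(width):
        if (b >> bit) & 1:
            result += a << bit   # each conditional add is one ALU operation
            ops += 1
    return result, ops

print(mul_as_adds(13, 11))  # (143, 3): this multiply costs 3 add operations
```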
Yes, as I have said, this would be optimised for throughput, but where I stated that it would be slow on serial computations, that still stands true (as you seem to comprehend).
But one little thing you seem not to follow about our general designs is that our Fwds will fetch data for the ALU before the GPRs even have that data; by doing this you can advance the execution of serial computations and thus increase performance. It's easy to make this a self-controlled feature; the CU just takes care of it for you.
But you seem to have the idea that we have to pay for 2 extra buffers, 1 before and 1 after the ALU... that's true in some designs, but those with a well-thought-out Fwd effectively limit the serial cost to just the 1 buffer before the ALU. The value does go into the other stage buffer for write-back, after which it can be fetched again, but the Fwd means that while it is going through that stage, the ALU can simultaneously work on the same value it is still writing back: it is quickly looped straight back into the ALU input, which allows for quicker serial computations. (This was a note suggesting something you may find advantageous.)
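A rough sketch of why that loop-back matters for a chain of dependent operations; the stage lengths here are placeholders I've picked for illustration, not the actual numbers from either of our machines:

```python
def serial_chain_ticks(n_ops, ex_ticks, wb_ticks, forwarding):
    """Tick at which the last result of a chain of dependent ops appears."""
    issue = 0                       # when the current op can enter the ALU
    result = 0
    for _ in range(n_ops):
        result = issue + ex_ticks   # ALU output ready
        # With forwarding, the next op takes the value straight off the ALU
        # output; without it, it also waits for the write-back buffer.
        issue = result if forwarding else result + wb_ticks
    return result

for fwd in (False, True):
    print(fwd, serial_chain_ticks(10, ex_ticks=4, wb_ticks=2, forwarding=fwd))
# Without the Fwd: 58 ticks for the 10-op chain; with it: 40 ticks.
```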
With what we have, doing this on a clock-and-stage basis rather than timing all the controls, we can still easily get a throughput of 1 operation per 8 ticks, and these are generalised designs that currently exist.
Going on to what is theoretically possible with that system: a 4-tick ALU for 8 bits, plus 1 tick per further 8 bits,
with 2 outputs, 1 that is conditionally on based on the Fwd and the other always on. From that point you only need 1 tick for the ALU input stage buffer, resulting in a throughput of 1 operation per 5 ticks, or 2 os^-1 (yes, I'm making up units of measurement, but they are usable and comparable either way).
And this is all while keeping an extremely low exe stage count of 1. You could easily increase this to 2 and get 3⅓ os^-1 while maintaining what is essentially the same Fwd; in that case the throughput of non-serial computations would be 1 operation per 3 ticks, but serial computational throughput would drop to 1 operation per 6 ticks.
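Here's the rough model I'm using for those figures, spelled out; treat it as my reading of the numbers rather than a description of any specific build (the 4-tick ALU and 1-tick buffer are the values quoted above):

```python
def intervals(alu_ticks, buffer_ticks, exe_stages):
    """Initiation interval in ticks for independent vs dependent operations."""
    stage_ticks = alu_ticks / exe_stages + buffer_ticks  # slowest stage sets the pace
    non_serial = stage_ticks               # a new independent op every stage time
    serial = stage_ticks * exe_stages      # a dependent op waits out the whole execute
    return non_serial, serial

print(intervals(4, 1, 1))  # (5.0, 5.0): 1 op / 5 ticks = 2 os^-1 either way
print(intervals(4, 1, 2))  # (3.0, 6.0): 3-tick non-serial, 6-tick serial interval
```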
So you see, for being significantly simpler, they do retain a lot of performance. As for the compiler magic, most people would just program efficiently the first time if optimised code were required, rendering the compiler redundant. (This was more to prove the point that non-serial computations, in the way you see them, are just as quick in our systems if we desire.)
I've already explained, seemingly at length, why your system not having the Fwd would seriously reduce serial computation performance.
Like I've said a few times, you have good ideas, and if implemented well they will make for a nice system; however, it's not as if it's pushing performance relative to anything in particular. (I'm unsure what you're comparing, what we have versus what you hope to have; if you could please explain, that would be nice.)
I do concur with much of what you say about it being quicker to make it timing-based; theoretically this was always true, but most people never did it due to the CU complications of having a timing-based PE.