10-12-2014, 03:55 PM
This hot air is too steamy, you're recycling too much. TSO, you have some good ideas but you're overlooking key details. The Fwd is a good thing to have and utilise: it allows you to forward data from the ALU back into the ALU rather than having to write it back before use. This increases serial performance, so you wouldn't want to get rid of that.
So say we have the example I gave, whose setup you seem to misunderstand: the ALU stage has no buffers in it, only around it. I'll be more realistic this time and use figures based on what we can currently implement with no issues whatsoever, which is a 6 tick ALU with a 1 tick forward (CU assisted) and a 1 tick repeater-lock stage buffer.
Now I'm going to be unrealistic to give your suggested system as much of an advantage as possible: we shall have a write back of 1 tick, and a read from that new write back of 1 tick. You'll notice that this equates to the same as the Fwd, but the difference is that the Fwd only has to go back to the buffer before. Meanwhile, to achieve this speed with the registers you would have to have them all repeater locked and comparative read disabled. In turn this means the memory density is low, and it gives you only approx 30 dust of signal total (15 before the lock and 15 after) to get all your registers back to the ALU input, so ignoring the buffer as you suggest, you don't want that.
And what you'll find is that yours is no quicker in serial compute, and that's in your most favourable configuration.
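To put numbers on that comparison, here's a rough Python sketch. The tick counts are the ones quoted above; the way they are added together (ALU, then forward plus stage buffer, or ALU, then write back plus register read) is my reading of the setup, and the function names are made up for illustration.

```python
# Tick counts taken from the post; how they add up is an assumption.
ALU_TICKS = 6
FWD_TICKS = 1          # CU-assisted forward
BUFFER_TICKS = 1       # repeater-lock stage buffer
WRITE_BACK_TICKS = 1   # deliberately optimistic
REG_READ_TICKS = 1     # deliberately optimistic


def serial_ticks_with_fwd(n_ops: int) -> int:
    """Ticks for n_ops dependent operations using the forwarding path."""
    return n_ops * (ALU_TICKS + FWD_TICKS + BUFFER_TICKS)


def serial_ticks_with_registers(n_ops: int) -> int:
    """Ticks for n_ops dependent operations going through the registers."""
    return n_ops * (ALU_TICKS + WRITE_BACK_TICKS + REG_READ_TICKS)
```

Under these assumptions both paths cost the same per dependent operation, so dropping the Fwd buys nothing in serial compute even with the optimistic register timings.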
Now allow me to explain further: those register setups haven't been done before except maybe in a 4G/SPRM. We always tend to use at least 8G/SPRs in our PEs, so you can easily see that without the Fwd you can very easily lose serial performance. Unfortunately your oversight is that you've forgotten about serial computations; you were so sure you could fill idles with out of order executions, but unfortunately there are times where this isn't possible.
The resulting difference is that when a serial computation comes along, which is inevitable at some point, you will lose computations to idles. This is still a big issue in IRL mass computations.
I guess you could via pulse timing achieve a max of 3 ticks per instruction throughput, but that's unlikely due to MC being a derp; 5 ticks is likely though. But there's nothing stopping our systems from doing that anyhow, providing the data isn't serial, and if it is, based on current creations all that would do is slow us down to 8 ticks. But with timing comes an issue of how easily an interrupt can cause a corruption.
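As a rough illustration of the 5 tick and 8 tick figures, the sketch below totals the ticks for a stream of instructions where some depend on the previous result. The per-instruction costs come from the paragraph above; the simple additive model (no pipeline fill or drain, no interrupts) and the function name are assumptions.

```python
def total_ticks(depends_on_previous, issue_ticks=5, serial_ticks=8):
    """Sum the ticks for an instruction stream.

    depends_on_previous: iterable of bools, True when an instruction
    needs the previous instruction's result (a serial step).
    """
    return sum(serial_ticks if dep else issue_ticks
               for dep in depends_on_previous)


# Example stream: three independent instructions and one serial step.
stream = [False, False, True, False]
ticks = total_ticks(stream)  # 5 + 5 + 8 + 5
```

The more serial steps in the stream, the closer the average gets to 8 ticks per instruction regardless of how fast the independent ones issue.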
Now to go back to the device you have, the one that gives instructions to other "CPUs" as you called it: how do you intend to have multiple of these PEs attached to one of these and fully saturate them when you have so few commands that would actually take up multiple PE cycles, if you will? Besides, it's not like we don't have these functions, it's just that we don't bother to reorder instructions at compile time due to the amount of overhead that would be on whatever has to compile it before you could finally exe. (I'd say it's overhead as it's actually something that has to be processed before it can have a use, so it does add to making something longer in some ways.)
Heck, I've stated I'm currently planning on making a SIMD array, and how to address all the PEs is a similar issue, because I have to fetch data as quickly as I can while it's computing, and SIMD compute speed is vastly quicker than memory speed. That brings the same issue you will have with addressing multiple PEs: you will only be able to use multiples when you have multiples executing instructions that take multiple cycles.
The only difference between my SIMD array and your multi-MIMD PE would be that mine's data limited and yours is instruction limited.
When these kinds of things happen you'll need a PE that can stockpile multiple instructions/data coherently.
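The saturation problem can be sketched with a simple bound. Assume (my simplification, not a measured design) that one dispatcher hands out an instruction every `issue_cycles` and each instruction occupies a PE for `exec_cycles`; the names are hypothetical.

```python
def busy_pes(n_pes, exec_cycles, issue_cycles=1):
    """Upper bound on how many PEs one dispatcher can keep busy at once."""
    # Instructions that can be in flight before the first one finishes
    # (ceiling division without importing math).
    in_flight = -(-exec_cycles // issue_cycles)
    return min(n_pes, in_flight)
```

With single-cycle instructions `busy_pes(8, 1)` is 1, so seven of eight PEs sit idle; only multi-cycle instructions, as in `busy_pes(8, 4)`, let the dispatcher feed several PEs. The same bound applies to the SIMD case if you read `issue_cycles` as the time to fetch one batch of data.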
I've seen other guys do things similar to your suggestions; they gave up, not realising how much they'd gotten themselves into. You seem to know a lot about what you want to do, but you've yet to figure out the shortcomings.