10-14-2014, 01:06 AM
What you call write-back is how my computer always runs. The only thing it can do is write back unless an instruction to external memory is given.
I call write-back what you call forwarding. Mine simply does not have the ability to do what you call forwarding; there is no way to manage that inside the clock and instruction set that I want. Therefore I am stuck with a complete register write between each repetition, but because the buses from the registers back to the processing element for any operation are (usually) exactly the same physical length as the operation itself, it isn't losing as much time in bussing delay.
For the most part what we have is identical; the only difference is that mine isn't losing time with the two ticks on the front and back of each ALU element. So all I had to do was figure out how much time is lost in register writing and reading in order to see at what point your forwarding becomes more efficient than my lack thereof. For the really big operations, this takes care of itself so well that the amount of code requiring a forward had to become unreasonably large before yours was any more efficient. For the smaller and faster operations, there is more of a loss, and the amount of code required to make forwarding more efficient was smaller. I finally showed that the phase arrangement of the register write relative to the clock (and therefore the next read) was the main determinant of the range where mine would be more efficient than yours.
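If it helps to see that phase effect concretely, here is a toy calculation. Every tick count in it is a placeholder I picked for illustration, not the machine's real numbers:

```python
# Toy gap calculation for a dependent pair of instructions.
# All tick counts are invented placeholders; the point is only to show
# why the phase of the register write relative to the clock decides
# which scheme wins.
import math

CLOCK = 3       # issue clock, in ticks
REG_WRITE = 3   # assumed ticks to write a result back to the registers
LATCH = 2       # assumed latch delay on the front and back of their ALU

def gap_mine(op_ticks):
    """Ticks between issuing an op and issuing a dependent op in my
    design: wait for the result, write it to the registers, then read
    it at the next clock edge (reads are clocked, writes are not)."""
    ready = op_ticks + REG_WRITE
    return math.ceil(ready / CLOCK) * CLOCK

def gap_theirs(op_ticks):
    """Same gap with forwarding: no register round trip, but the op
    carries the two latch delays into and out of the ALU element."""
    return op_ticks + 2 * LATCH

for op in (3, 4, 5, 6, 9, 24):
    m, t = gap_mine(op), gap_theirs(op)
    print(f"{op:2}-tick op: mine {m:2} ticks, forwarding {t:2} ticks"
          f" -> {'mine' if m <= t else 'forwarding'} wins")
```

Notice the 4-tick op is the one case where forwarding wins in this toy: its write-back lands just past a clock edge. Since every operation in my machine is a multiple of three ticks, that off-phase case mostly doesn't come up.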
The compiler is a program. There happens to be a hardware component designed to make things easier for the compiler by checking conflicts in hardware, but the compiler is a program that will arrange things based on how conflicts and bubbles present themselves, which is all predictable. (I don't actually know how to write a compiler, but we'll get there when we get there.)
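To give a feel for what that program would do (I'm making up the instruction format and the hazard rule here, since the real compiler doesn't exist yet):

```python
# Rough sketch of the scheduling pass: because every latency and bubble
# is predictable, a compiler can reorder code so a dependent instruction
# never issues before its operand has been written back.

CLOCK = 3

def schedule(instrs):
    """instrs: list of (name, latency_ticks, deps), where deps is the
    set of names this instruction reads. Greedily issue one instruction
    per clock, picking any instruction whose operands are already
    written back; otherwise emit a bubble (nop) for that slot."""
    done_at = {}                   # name -> tick its result is readable
    pending = list(instrs)
    t, order = 0, []
    while pending:
        ready = [i for i in pending
                 if all(done_at.get(d, float("inf")) <= t for d in i[2])]
        if ready:
            name, lat, _ = ready[0]
            pending.remove(ready[0])
            done_at[name] = t + lat   # readable after write-back
            order.append((t, name))
        else:
            order.append((t, "nop"))  # unavoidable bubble
        t += CLOCK
    return order

prog = [("a", 6, set()), ("b", 6, set()), ("c", 12, {"a"}), ("d", 6, {"c"})]
for tick, name in schedule(prog):
    print(f"t={tick:2}: issue {name}")
```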
Well, sort of... instructions can be decoded in three ticks, but the latency of any particular instruction varies. So there can be a three-tick clock, with an instruction every three ticks, but the operations don't write back in the same amount of time. They are all multiples of three ticks, but the operations definitely return out of order.
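A quick illustration with made-up latencies:

```python
# Issue one instruction every 3 ticks, give each a (made-up) latency
# that is a multiple of 3, and watch results come back out of order.
latencies = {"add": 6, "shift": 3, "mult": 15, "xor": 3}

events = []
for i, (op, lat) in enumerate(latencies.items()):
    issue = i * 3                    # one instruction per 3-tick clock
    events.append((issue + lat, op, issue))

for done, op, issue in sorted(events):
    print(f"{op:5} issued t={issue:2}, wrote back t={done:2}")
```

Here mult issues before xor but writes back nine ticks after it.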
This machine reaches its maximum potential running multiple programs that don't share too many of the same operations and don't conflict over their data.
You missed a tiny bit of how multicore would be implemented, and I missed a bit about some of the caveats. Each core would have its own cache of program data, which reads once every clock. The paging system is part of the external hardware. The number of buses to an individual core's cache would equal the number of instructions paged in at a time: for one core, if it takes eight ticks to get an address in main memory, and four ticks to read an address out of the program memory cache, we need to address four lines of code out of main memory at a time and have four data buses. For two cores, the main memory can still write four lines of code at a time and switch between the two cores every other write; the cores would not be able to share buses.

If sixteen lines of code are paged at a time, then parallel pipelined decoding of an address in the main memory will take four ticks. If paged 32 at a time, it would take three ticks. This is the main reason why I need to figure out some sort of balance in the memory, because 32 buses is far too many to work with, but it is the fastest possible amount to page and allows for more processors. After a point, more than one read unit is needed, but each read unit can easily address out to at least 32 processors if it writes 32 lines of code at a time to each one.

Do note that my computer is the size of its ALU and program memory, so cramming 32 of these computers together is no problem if you can figure out a bussing solution. At that point they probably would be sharing buses, but they would not all share the same one, and you could not consecutively write to two that share a bus. My design is also really tall.
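Here is my reading of how the single-core numbers hang together, using the three-tick clock from earlier. Treat the formula as my interpretation of the example, not the actual hardware:

```python
# Back-of-the-envelope for the paging width: a core consumes one line
# per clock, so the number of lines paged per access (and hence buses
# to the cache) has to cover the whole fetch latency. The tick values
# come from the post; the relation itself is my interpretation.
import math

CLOCK = 3          # ticks per instruction
MAIN_ACCESS = 8    # ticks to get an address in main memory
CACHE_READ = 4     # ticks to read an address out of the program cache

def lines_per_page():
    """Lines (and buses) needed per core so the cache never runs dry."""
    return math.ceil((MAIN_ACCESS + CACHE_READ) / CLOCK)

print(lines_per_page())  # -> 4, matching the four buses above
```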
This final assumption is absolutely correct. No stopping, no clocking: once data leaves a unit, it's the next unit's problem to deal with. The write location and conditions also ship out with the data so that everything is organized and stays together. Everything decodes when it needs to. The clock is only at the register and program memory read; the write is not clocked. Imagine it as the most organized set of race conditions ever created: absolutely everything depends on the data being somewhere exactly when it is supposed to be, with tolerance inside of a tick. The only reason I can get away with this is because Minecraft flip-flops stabilize instantly and a one-tick tolerance is extremely easy to synchronize.
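If you want a feel for it, here's a sketch of a packet moving through one unclocked unit. All the field names are invented for illustration:

```python
# Sketch of "data carries its own write-back info": nothing here is
# clocked; each packet simply shows up at the next unit with everything
# it needs, on a deadline the layout guarantees.
from dataclasses import dataclass

@dataclass
class Packet:
    value: int
    dest_reg: int        # write location travels with the data
    write_enable: bool   # ...and so do the write conditions
    arrives_at: int      # tick the packet reaches the next unit

def alu_stage(p: Packet, op_ticks: int) -> Packet:
    """An unclocked unit: consume the packet, emit the result exactly
    op_ticks later. No handshake, no stall; the next unit must simply
    be ready at p.arrives_at + op_ticks."""
    return Packet(p.value + 1, p.dest_reg, p.write_enable,
                  p.arrives_at + op_ticks)

p = Packet(value=7, dest_reg=3, write_enable=True, arrives_at=0)
p = alu_stage(p, op_ticks=6)   # leaves the ALU at tick 6
print(p)                       # the register file must catch it then
```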