By no means has this post been edited, considering I just got done writing an English essay.
To clarify, I would consider that a write back, whereas a forward would be something like halting a multiplier in order to inject a partial sum into the middle. That would basically never happen, because you would have needed to calculate the partial sum beforehand... using the multiplier (or using the register AND, the bit shift, the adder, and the forward injection process, which is not faster by any stretch of the imagination).
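To make the forward-versus-write-back distinction concrete, here is a minimal sketch in Python. The tick values are assumptions for illustration (a 6-tick ALU, 1-tick forward, 1-tick register write and read), not figures from either design:

```python
# Minimal sketch (hypothetical tick values) contrasting result forwarding
# with a full write-back/read-back round trip for two dependent ALU ops.

ALU_TICKS = 6        # time for one ALU operation (assumed)
FORWARD_TICKS = 1    # ALU output fed straight back to the ALU input (assumed)
WRITE_TICKS = 1      # write the result into a register (assumed)
READ_TICKS = 1       # read that register back onto the input bus (assumed)

def two_dependent_ops(use_forwarding: bool) -> int:
    """Total ticks for op B, which consumes op A's result."""
    first = ALU_TICKS
    handoff = FORWARD_TICKS if use_forwarding else WRITE_TICKS + READ_TICKS
    return first + handoff + ALU_TICKS

print(two_dependent_ops(True))   # with forwarding: 13 ticks
print(two_dependent_ops(False))  # via the register file: 14 ticks
```

The point of the forward is exactly this handoff term: it skips the register round trip for back-to-back dependent operations.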
The question that must be considered, though, is twofold. Is this write back detrimental to the parallel performance of the unit, and can one justify adding one tick to every operation just for a faster write back?
The first question can go either way, depending on whether you implement the write back so that it doesn't interfere with the input busses. In fact, you can even write back between cycles if the clock is slow enough to leave open time between operations. (Ideally you wouldn't have this, but the instruction pointer decode time dictates it.) In my layout, the instruction pointer is designed for parallel pipelined decoding (figured this out last night, it's really cool: easily four ticks, maybe three, but at three you're starting to push the limits of the memory) in order to keep up with the clock, so that style of write back is not possible in my system. This leaves us with a buffer at the front of an ALU operation, but my IS has no room for a write-back command, or for a command to pause the register write that would otherwise occur, without some kind of output switching. As you stated, this switching would cost one tick at the output, but it would actually lose one tick plus the clock once you add a repeater at the input, because the next instruction has just been halted. Otherwise, our loss is one tick at the front and one tick at the back. Then comes the write-back delay itself: it is the length (in chunks) of the operational unit. So we have a command that goes through, doesn't write to the registers, may or may not hold up the command that occurs a few lines later, doesn't calculate its conditions, and adds two ticks to all operations. Good luck managing that.
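The overhead bookkeeping above can be tallied in a few lines. This is only a sketch of the accounting as described (one tick at the output switch, one at the input repeater, plus a write-back transit equal to the unit's length in chunks); the example unit length is made up:

```python
# Sketch of the per-instruction overhead described above (assumed values).

OUTPUT_SWITCH = 1   # tick lost switching the ALU output
INPUT_REPEATER = 1  # tick lost at the added input repeater

def writeback_overhead(unit_length_chunks: int) -> dict:
    """Ticks paid by every op vs. ticks the write back itself takes."""
    return {
        "per_instruction": OUTPUT_SWITCH + INPUT_REPEATER,  # paid by every op
        "writeback_transit": unit_length_chunks,            # paid per write back
    }

print(writeback_overhead(4))  # e.g. a 4-chunk operational unit
```

The key asymmetry is that the two-tick cost lands on every instruction, while the transit cost lands only on instructions that actually write back.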
This moves into the "is it worth it" part. If the operations are already optimized to bus themselves back to the registers within their own operation delay time, then the distance from their output back to their input is nearly equal to the length of the existing bussing from the registers to the input anyway. The loss between register writing and internal write back is then only the register write time plus the delay to the next operation. If that delay has been optimized with something like my four-tick instruction pointer, it should be no more than three ticks, and the delay is a constant intrinsic to your design preferences. Our loss is now about six ticks per write back, versus two ticks on every single instruction. Now, I'm not stupid enough to claim that the write back has no advantage; I'm simply saying there is a minimum frequency below which it is not advantageous. Let's work out what that is for six operations and ask whether a two-tick loss per operation is really justified. We start with two write backs per six operations. By your model, six operations means a twelve-tick loss. The write back loses two ticks, plus the two clock cycles for the two held operations, plus the processing time to repeat the operations. So your loss is 14 ticks, two clock cycles, and two operation delays. My way takes an extra two instructions to command the recalculation, plus two operation delays for the calculations, somewhere between two and eight ticks in the clock/write synchronization (a value you control in the design), plus two ticks lost in the two register writes and two ticks lost in the two register reads. So we lost six to twelve ticks, two clock cycles, and two operation delays.
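The tick bookkeeping above can be re-run mechanically. All figures below are taken from the discussion (six operations, two write backs, two ticks of overhead per operation, and one to four ticks of clock/write synchronization per write back); only the variable names are mine:

```python
# Re-running the bookkeeping for six operations containing two write backs.

OPS = 6
WRITEBACKS = 2

# Forwarding-style design: two ticks of overhead on every operation,
# plus two more ticks when the write backs actually occur.
their_ticks = 2 * OPS + 2          # 12 + 2 = 14 ticks
# (plus two clock cycles and two operation delays, common to both sides)

# Recalculation-style design, per write back: one extra register write
# tick, one extra register read tick, and 1..4 ticks of clock/write sync.
mine_best  = WRITEBACKS * (1 + 1 + 1)   # short sync: 6 ticks
mine_worst = WRITEBACKS * (1 + 1 + 4)   # long sync: 12 ticks

print(their_ticks, mine_best, mine_worst)  # -> 14 6 12
```

Both sides also carry the same two clock cycles and two operation delays, which is why they cancel in the comparison that follows.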
Most of this cancels, except the six to twelve ticks against the fourteen ticks. That's a difference of two to eight ticks, and notice that it's exactly the loss in the constant, user-defined clock delay of my system. So the smaller the clock-delay loss on my system, the more efficient write-back avoidance becomes. With the long clock delay, mine is only two ticks faster, so at a write-back content of 33% the two are roughly equal. With the short delay, the difference is eight ticks in my favor. Now, looking closer: for each write back we add beyond the two in six, the loss is two ticks per write back on mine (at the short clock delay) and one tick per write back on yours, so we become equally efficient when the code contains about 66% write backs. So the question is: with pipelined hardware multiply and divide, how often do we encounter that many write backs? That is a question for the designer about the intended operations, and I don't think more than 2/3 of the code is intended to write back for most uses I can think of. There are still huge issues with a fully serialized group of instructions, but this is avoided as best as possible by letting the faster operations run at their own speed instead of being limited by the slower ones. Of course, as you said, this is still highly limited if a long operation is called.
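One consistent reading of the break-even arithmetic above, sketched in Python. The per-write-back deltas (two ticks per extra write back on the recalculation side, one on the forwarding side) and the two-tick head start at two write backs per six operations are taken from the discussion; the framing as a closing gap is my interpretation:

```python
# Break-even sketch: recalculation starts head_start_ticks ahead at
# start_wb write backs per ops operations; each additional write back
# costs it 2 ticks but costs forwarding only 1, closing the gap by
# 1 tick per extra write back.

def breakeven_fraction(head_start_ticks: int, start_wb: int = 2,
                       ops: int = 6) -> float:
    """Write-back fraction of the code at which the two designs tie."""
    extra_wb = head_start_ticks / (2 - 1)  # 1 tick closed per extra write back
    return (start_wb + extra_wb) / ops

print(round(breakeven_fraction(2), 2))  # -> 0.67, i.e. the ~66% figure
```

A two-tick head start thus vanishes after two more write backs, which is where the roughly 66% write-back-content figure comes from.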
Now, finally, for what you mentioned about me being limited to memory speed. I will admit that my method has a loss in code density: for every write back, I need one extra line of code. Whether yours needs the extra line is up to you and your microcoding decisions. But if I can decode a 7-bit instruction pointer in four ticks, we could venture to say that I can decode a 16-bit address in eight ticks (and I can). Now, if I load more than one memory location at a time (say... oh... four lines), we remove two ticks from the decode, because four memory locations can be addressed by the same number, and we get four and a half lines per every two instructions completed. So we are doing okay there.
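The decode figures quoted above fit in a couple of lines. The two-tick saving for dropping the low two address bits when fetching four-line blocks is taken from the discussion, not derived; everything else is just the quoted numbers:

```python
# Decode figures as quoted: 7 address bits -> 4 ticks, 16 bits -> 8 ticks.
DECODE_TICKS = {7: 4, 16: 8}

def batched_decode_ticks(address_bits: int, lines_per_fetch: int) -> int:
    """Decode ticks when one address selects a whole block of lines."""
    saving = 2 if lines_per_fetch == 4 else 0  # figure from the text
    return DECODE_TICKS[address_bits] - saving

print(batched_decode_ticks(16, 1))  # -> 8 ticks per single-line fetch
print(batched_decode_ticks(16, 4))  # -> 6 ticks per four-line fetch
```

So a four-line fetch amortizes one six-tick decode across four instructions instead of one eight-tick decode per instruction.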
The issue with my layout isn't instruction paging per se; it's probably the conditional branching. I have not yet given this system a way to make sure instructions that are already in the pipeline but not on the taken branch get ignored. I think I know how to do it, but it seems to put a massive load on my instruction pointer and register write hardware: everything depends on their ability to block the register writes of the incorrect operations while allowing the register writes of the correct ones. This sounds straightforward at first: just block all the writes after the jump for the length of the pipeline. But then we remember that not everything takes the same amount of time to compute. A divide issued before the jump could arrive at the registers after a shift from the wrong side of the jump. Now we have to block the shift, allow the divide, ensure the divide doesn't have a data conflict with the jump, and ensure the divide's conditions don't nullify the jump.
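The squash problem above boils down to one rule: whether an in-flight operation may write the registers depends on when it issued relative to the taken jump, not on when it completes. Here is a toy model; the instruction names and tick latencies are invented for illustration:

```python
# Toy squash model: ops complete out of order, but only ops issued at or
# before the taken jump may write the register file.

def allowed_writes(ops, jump_issue_tick):
    """ops: list of (name, issue_tick, latency_ticks).
    Returns names allowed to write, in completion order."""
    allowed = []
    # Walk ops in completion order: a slow divide issued before the jump
    # can finish after a fast shift issued on the wrong side of it.
    for name, issue, latency in sorted(ops, key=lambda o: o[1] + o[2]):
        if issue <= jump_issue_tick:   # issued before (or with) the jump
            allowed.append(name)       # its write must go through
        # ops issued after the jump are squashed: write blocked
    return allowed

ops = [("divide", 0, 20), ("jump", 2, 4), ("shift", 4, 2)]
print(allowed_writes(ops, jump_issue_tick=2))  # -> ['jump', 'divide']
```

Note the shift completes first (tick 6) yet is blocked, while the divide completes last (tick 20) yet must be allowed through, which is exactly why a fixed-length "block everything for one pipeline length" rule fails.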
Interrupts wouldn't cause data errors in and of themselves, but an interrupt could use a register that is already in use, so there would need to be special interrupt registers and a separate interrupt bus.
As for the compiler: if you run the program more than a certain number of times (like four or five), the compile-time cost starts to pay for itself. This computer is not designed to run a program only once.
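The amortization claim is simple arithmetic: if compiling costs C extra ticks and each compiled run saves S ticks, the compile pays for itself after C/S runs. The 500/100 figures below are made up purely for illustration:

```python
# Back-of-envelope amortization for the "run it four or five times" claim.

def breakeven_runs(compile_cost_ticks: float, saving_per_run_ticks: float) -> float:
    """Number of runs after which the compile-time cost is recovered."""
    return compile_cost_ticks / saving_per_run_ticks

print(breakeven_runs(500, 100))  # -> 5.0 runs
```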
I think the difference between myself and these "other guys" is that I can acknowledge that my system does have downsides, but I also aim to strike a balance between those downsides and the efficiency gained by removing as much as possible. I think I do see what all the problems will be; I just never spelled out the caveats when I described parts of this system, because I thought you would all see and understand them.
(10-12-2014, 03:55 PM)Magazorb Wrote: This hot air is too steamy; you're recycling too much. TSO, you have some good ideas, but you're overlooking key details. The Fwd is a good thing to have and utilise: it allows you to forward data from the ALU back into the ALU rather than having to write it back before use. This increases serial performance; you wouldn't want to get rid of that.
So take the example I gave, which you seem to misunderstand: the ALU stage has no buffers in it, only around it. This time I'll be more realistic and use figures based on what we can currently implement with no issues whatsoever: a 6-tick ALU with a 1-tick, CU-assisted forward and a 1-tick repeater-lock stage buffer.
Now I'm going to be unrealistic to give your suggested system as much of an advantage as possible: we shall have a write back of 1 tick and a read from that new write back of 1 tick. You'll notice that equates to the same as the Fwd, but the difference is that the Fwd only has to go back to the buffer before it, whereas to achieve this speed with the registers you would have to have them all repeater-locked with comparative read disabled. In turn, this means the memory density is low, and it gives you only approximately 30 dust of signal total (15 before the lock and 15 after) to get all your registers back to the ALU input, ignoring the buffer, since you suggest you don't want that.
And what you'll find is that yours is no quicker in serial compute, and that's in your most favorable configuration.
Now allow me to explain further: those register setups haven't been done before, except maybe in a 4G/SPRM; we always tend to use at least 8G/SPRs in our PEs, so you can easily see that without the Fwd you can very easily lose serial performance. Unfortunately, your oversight is that you've forgotten about serial computations: you assumed you could fill the idles with out-of-order execution, but there are times when that isn't possible.
The result is that when a serial computation comes along, which is inevitable at some point, you will lose computations to idle time. This is still a big issue in real-world mass computation.
I guess via pulse timing you could achieve a maximum throughput of 3 ticks per instruction, but that's unlikely due to MC being a derp; 5 ticks is likely, though. But there's nothing stopping our systems from doing that anyhow, provided the data isn't serial; and if it is, based on current creations, all that would do is slow us down to 8 ticks. But with timing comes the issue of how easily an interrupt can cause corruption.
Now, to go back to the device you have, the one that gives instructions to other "CPUs" as you called it: how do you intend to attach multiple of these PEs to one of them and fully saturate them, when you have so few commands that would actually take up multiple PE cycles? Besides, it's not like we don't have these functions; it's just that we don't bother to reorder instructions at compile time, due to the amount of overhead on whatever has to compile it before you can finally execute. (I'd call it overhead since it's something that has to be processed before it can be of use, so it does make things longer in some ways.)
Heck, I've stated I'm currently planning on making a SIMD array, and how to address all the PEs is a similar issue, because I have to fetch data as quickly as I can while it's computing, and SIMD compute speed vastly outpaces memory speed. That raises the same issue you will have with addressing multiple PEs: you will only be able to use multiples when you have multiples executing instructions that take multiple cycles.
The only difference between my SIMD array and your multi-MIMD PE would be that mine is data-limited and yours is instruction-limited.
When these kinds of things happen, you'll need a PE that can stockpile multiple instructions and data coherently.
I've seen other guys do things similar to your suggestions; they gave up, not realising how much they'd gotten themselves into. You seem to know a lot about what you want to do, but you've yet to figure out the shortcomings.