10-14-2014, 04:20 AM
I'm going to address two of the things you mentioned right now. I remembered something in Wireworld that makes data entry dependent upon gates that contain their own clocks, giving clock-everywhere logic; this is called timing-dependent circuitry in the context of Wireworld. If that is what you mean here, then mine is certainly not timing based. I was thinking of something more like just keeping data together in a clockless system that resembles pipelining, so for any individual stage in each operation, data is as slow as the slowest operation, but data speed between independent operations is not limited this way. (If I ever start to build this thing, what I mean will quickly become apparent.)
Now, the compiler is accounting for way more than you think it is. The compiler is renaming registers, prearranging out-of-order execution, inserting lines of code to assist in condition prediction on some of the longer functions if the next line of code is a branch, rearranging the conditions in order to account for variable ALU operation timing, and managing/verifying the memory allocations before runtime. It is also synchronizing some aspects of the time dependence and variable clock rate of the entire system, and making up for the fact that I don't understand how programming works at all.
I know that forwarding receives data before it is written. What I am saying is that because I fully intend to make all my operational units the exact length needed to make them bus themselves back to the registers, the extra two ticks inside the registers become transparent compared to a forward that needs some kind of control unit and a way to select whether it forwards or writes back (a 1 tick multiplexer at best on the output), plus an additional front end buffer (another 1 tick with locking repeaters). I'll do the same example you did above, using my multiplier. The distance from the front of the multiplier to the back is five chunks. The distance from the input of the multiplier to the CPU is also five chunks (hey, would you look at that). The distance from the output of the multiplier to the register bank is zero chunks, and the distance from the register bank to the multiplier input is five chunks (weird, it's like it was built to just fit right in there).
The multiplier repeats every three ticks and takes 17 ticks to complete. In your design, we need 1 tick in a locking buffer at the input and 1 tick in a router at the back, plus all this work to make a control unit and a five chunk bus to return the output to the input. In my design, we have a register write that takes two ticks, a register read that is actually simultaneous with the second write tick, and then I just reuse the five chunk bus.
So, you cost the system 19 ticks per operation (1 tick in, 17 in the multiplier, 1 tick out), which is within the next clock cycle from my multiplier, and everything else cancels out. I take 17 ticks, and then 1 tick, and then the next clock (one tick after that) results in the next read. This is also 19 ticks, but mine is much easier to build. In this case we are always equal. We are only unequal if yours and mine end up on different sides of a clock cycle. We have the same serial and parallel execution times of 24 ticks per serial repetition. (It also just occurred to me that you will never forward a multiplier, because the multiplier gives a 32 bit output but takes a 16 bit input.) The only place where you exceed me in speed is the Boolean functions, specifically the OR, NAND, and NOR functions. My AND function is actually faster than yours. XNOR is the same for both, and XOR is significantly faster for you than for me because I integrated it into the adder.
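To make the tick counting above easier to check, here is a rough sketch of the arithmetic in Python. The function names are made up for illustration, and the three tick clock period is an assumption taken from the multiplier's repeat rate; it only models the 19 tick comparison and the "same side of a clock cycle" point, nothing else about either design.

```python
MULTIPLY_TICKS = 17  # multiplier latency stated above
CLOCK_PERIOD = 3     # assumption: the three tick repeat rate is also the clock


def forwarding_latency(op_ticks, input_buffer=1, output_router=1):
    """Forwarding path: locking buffer in front, router/multiplexer behind."""
    return input_buffer + op_ticks + output_router


def writeback_latency(op_ticks, write_ticks=2):
    """Write-back path: a two tick register write. The read overlaps the
    second write tick, so it adds nothing on top of the write."""
    return op_ticks + write_ticks


def align_to_clock(ticks, period=CLOCK_PERIOD):
    """Round a latency up to the next clock edge (ceiling to a multiple)."""
    return -(-ticks // period) * period


forward = forwarding_latency(MULTIPLY_TICKS)
writeback = writeback_latency(MULTIPLY_TICKS)
print(forward, writeback)                                  # 19 19
print(align_to_clock(forward), align_to_clock(writeback))  # 21 21
```

The two paths only differ in practice when `align_to_clock` gives different results for them, which is the "different sides of a clock cycle" case.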
Also, you say yours is a four tick ALU plus one tick for every eight bits? Doesn't that make your clock five ticks instead of three?