Forums - Open Redstone Engineers
I am TSO - Printable Version

+- Forums - Open Redstone Engineers (https://forum.openredstone.org)
+-- Forum: ORE General (https://forum.openredstone.org/forum-39.html)
+--- Forum: Introductions (https://forum.openredstone.org/forum-18.html)
+--- Thread: I am TSO (/thread-4807.html)

Pages: 1 2 3 4 5 6 7 8 9 10 11 12 13 14


RE: I am TSO - TSO - 10-12-2014

(10-12-2014, 02:25 AM)Magazorb Wrote:

(10-11-2014, 06:02 AM)TSO Wrote: [spoiler]
(10-10-2014, 04:22 PM)Magazorb Wrote: Metric prefixes:
8Mb = 1(1x10^6) or 1,000,000B, 8KB = 8(8x10^3) or 64,000b, 4Kb = 1(4x10^3) or 4,000b
Binary Prefixes:
8Mib = 1(1024^2) or 1,048,576B, 8KiB = (8x8)(1024^1) or 65536b, 4Kib = 4(1024^1) or 4096b
the 1024 came from 2^10 which is the bases of which binary prefixes are powering from, so 8GB = 8((2^10)^3) = 8589934592Bytes or 6871947674bits.
The binary prefixes were loosely based on the French metric prefixes, from which the SI prefixes later came
1.) I know my prefixes quite well, possibly better than some of you guys, so I don't need help with that.
2.) 8Mb is not equal to 1(1*10^6), it's much more like 8*10^6 b, and you have similar math errors throughout this section. Also, they are by no means "loosely" related; the relationship of the digits between base ten and base two is log(n)=bl(n)/bl(10). Every (useful) metric prefix is a factor of 1000. log(1000)=bl(1000)/bl(10), now log(1000)=3 and bl(1000)=9.97. Let's re-examine that number as 10. Now we are looking at 2^10, which is 1024. So the relationship between the two is just (10^(log(1024)log([prefix])/3))B=[prefix]iB. End of story.
[/spoiler]

OK, first off, you did originally use it incorrectly. I didn't say that you didn't know it, I was merely clarifying; sorry if it seemed I was suggesting you didn't know. Also, I said 8Mb is 1(1*10^6)B, which is true: B = byte, b = bit. You just misread it.

Also, the naming of the binary prefixes is loosely based on the metric prefixes; they just change the ending of each metric prefix to -bi, hence why I said loosely based. The values themselves are altogether different, yes, but the names of the prefixes are related.

You can tie the powers of the prefixes together rather easily by saying 10^3 ~ 1024^1, or 10^9 ~ 1024^3; basically the binary power is metric_power/3. This becomes less accurate as the powers increase, so it's best to keep metric and binary separate, but regardless, the names still relate.
As I said, the logarithmic equation above is the exact relationship between the radices.
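For anyone following along, here's a quick Python sketch of the standard SI/IEC values both posters are arguing around (the numbers are the standard definitions; the code itself is just for illustration):

```python
# SI (metric) prefixes are powers of 1000; IEC (binary) prefixes are powers of 1024.
METRIC = {"k": 1000**1, "M": 1000**2, "G": 1000**3}
BINARY = {"Ki": 1024**1, "Mi": 1024**2, "Gi": 1024**3}

for (m, mv), (b, bv) in zip(METRIC.items(), BINARY.items()):
    # Each binary prefix is slightly larger than its metric cousin: 1024^n / 1000^n = 1.024^n
    print(f"1 {m}B = {mv:>13,} B    1 {b}B = {bv:>13,} B    ratio = {bv / mv:.6f}")

# The 8 GB figure from the quoted post, interpreted both ways:
print(8 * METRIC["G"], "bytes in 8 GB (metric)")    # 8000000000
print(8 * BINARY["Gi"], "bytes in 8 GiB (binary)")  # 8589934592
```

So the names line up one-for-one, but the values drift apart by a factor of 1.024 per prefix step, which is the "metric_power/3" approximation above becoming less accurate as the powers grow.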

magazorb Wrote:

(10-11-2014, 06:02 AM)TSO Wrote: [spoiler]
magazorb Wrote:[spoiler]
(10-09-2014, 05:16 AM)TSO Wrote: Sorry, I meant mostly how to generate the conditions (flags, I guess) in the same cycle as the operation, then I started wondering how many of these conditions I need... then I asked my father for help. He programmed back in the 90's when computers had no memory and actual program speed mattered because computers were still so slow (back when programmers actually had to be good at programming.)

What I learned was this: all your possible conditions are computed at the same time and given a true/false value. These values then go to a single register the same size as the set of possible conditions (between eight and sixteen); you then use a bit mask to get the conditions you want out of this register (you can actually call more than one flag at the same time, then operate a condition on those two). Your goal is to never jump, EVER, especially more than one or two lines. Long goto jumps take longer, are harder to understand in debugging, and are a sign that the programmer does not understand the program being written. If/then/else statements also slow down the system, but are far better than a goto because the compiler will place the destination code for the if statement near the line the condition is on. The compiler will preload the program data line by line regardless of condition destinations; if an if/then/else occurs, a path is chosen, and all code for the other path is deleted, followed by loading all the subsequent program data for the line that was chosen. The nearer these lines are to the jump condition, the less gets loaded and unloaded. The amount of queued data that gets deleted depends upon how large the jump is, with a goto being the worst: nearly all data is dumped and reloaded, and all stages of the CPU are completely cleared, because gotos inherently do not jump near the line their condition is on. And if you pipelined and it jumps conditionally, you have to somehow rid yourself of all the crap in the pipeline that is no longer correct because it was part of the other conditional side.

Luckily, my design inherently helps with this, but it will still suck to have to think about it.
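The flag-register-and-bitmask scheme described above can be sketched in a few lines of Python; the flag names and bit positions here are invented for illustration, not taken from any real architecture:

```python
# Hypothetical flag bits packed into one status register.
ZERO, CARRY, NEGATIVE, OVERFLOW = 1 << 0, 1 << 1, 1 << 2, 1 << 3

def compute_flags(result, carry_out, signed_overflow):
    """Set every condition bit in parallel, true/false, in one pass."""
    flags = 0
    if result == 0:
        flags |= ZERO
    if carry_out:
        flags |= CARRY
    if result < 0:
        flags |= NEGATIVE
    if signed_overflow:
        flags |= OVERFLOW
    return flags

def branch_if(flags, mask):
    """A conditional jump just masks the flag register; more than one
    bit can be tested at once, e.g. mask = ZERO | CARRY."""
    return bool(flags & mask)

flags = compute_flags(result=0, carry_out=True, signed_overflow=False)
print(branch_if(flags, ZERO))      # True  (branch-if-equal fires)
print(branch_if(flags, NEGATIVE))  # False (branch-if-less-than does not)
```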


Computers back in the 90's did have registers, if they were GPRMs. This wasn't used too often though, because the processors at the time were slower than the memory, so having registers didn't improve performance much and added to the instruction size; it was deemed unnecessary at the time because of the extra memory requirements for the IS.

Two other popular PE types were stack-based and accumulator-based. These two were both popular for near enough the same reasons; the positives and negatives of each are nearly the same, and are mainly the opposite of the pros and cons of GPRM PEs. There were also other types of PEs that other systems used, but the three most popular and main ones were the stack, accumulator, and GPR based PEs.

"all your possible conditions are computed at the same time and given a true/false value."
The idea and direction you're going with this is good; IRL some do this. However, in MC it isn't really a thing, with most things being small enough that they reach the new address and return before the clock, thus giving no performance gain, or being predictively prefetched for pipelined CPUs, which also gives no gain.
This is a good idea though, and in bigger MC things it may be useful; the idea has gone around a couple of times but has yet to be implemented, so 5pts. to you if you do it first with at least some gain Big Grin

"(back when programmers actually had to be good at programming.)" Actual programming is harder now than it's ever been for pros. The only programmers that aren't good tend to be game devs (sometimes they know some code, but they aren't really programmers, they use software instead) and those who can't be bothered to learn how to write optimal code; those doing it professionally are made to max out their hardware's abilities, and this becomes very hard.
To do this you have to learn various languages (normally this would be done with a combination of Fortran, C++ and ASM, with GPGPU being utilised heavily).
Generally this style of programming is never taught but is in high demand; it also tends not to run too well on configurations other than its intended system config(s).

I'm also unclear as to what you're defining as a compiler; you're probably correct, but it almost sounds like you want to use it in active programs instead of compiling before use.
[/spoiler]

He said there were like 10 or 16 registers (remember, he only really worked with x86 assembly), and you only worked with about four. All 16 were special purpose, but if you weren't using the function they related to in that instruction, they doubled as general purpose; there were really only five you worked with: A, B, C, D, and FLAGS. Add in that x86 is RISCey, and you see that you only need four of these registers: the one involved with your operation, any two main registers, and FLAGS, so you never use D either. The FLAGS are all predicted simultaneously and parallel to execution, and the jump conditions were all opcodes that simply masked the FLAGS register and jumped to the specified line. x86 processors never performed more than one task in the same line of code, for the most part. There was some special stuff that played with things, and there were numbers you could insert into the opcode location that represented opcodes that didn't exist (or something like that) that made the CPU "fuck up just right" and perform operations that used many lines of code in one line. I have no idea how that works, but it abuses the fact that x86 is microcode heavy but doesn't know what to do if it receives a number that isn't a real opcode. Results range from exactly what you wanted to happen to computer suicide.

His only comment when I asked about stacks was, "You never use stacks." He used pointers all the time, though (in fact, there were a few string manipulation programs that only used pointers and pointers pointing to pointers and other stuff). He also never made comments on his code, so nobody knew how to debug it (not that the debugging team was any good at their job to begin with), so he just debugged it himself.

He says the abilities of modern computers make even lazy coding fast enough for the intended application: goto is returning, as are stacks. When he did things, it would be a few lines in C that then resulted in a subroutine branch to a short assembly program, then some more lines of C, et cetera. Some of his programs were more assembly than C, some weren't; it depended on what needed to be done. The programs were intended for the x86 family, which is apparently an industry standard, or something.

Fortran was rarely used because the compile overhead was too great.

A compiler is a program that converts the written program into machine code. The more distant the programming language is from machine code, the more time is spent compiling, and the longer the resulting program. Some languages, such as assembly, have zero line expansion (assembly is simply a mnemonic for machine code), while some languages (cough Java cough) are so distant from the machine code that only an absolute moron would try to write in them unless you need the code to be insensitive to the machine it's running on. Of note: Python code compiles at runtime; there is no separate compile time. Now, that being said, some good compilers do actually alter and rearrange the lines of code at compile time to help optimize things like conditional jumps.
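One way to see the "Python compiles at runtime" point: CPython turns source into bytecode the moment the code is loaded, and its compiler even does some rearranging of its own (constant folding), a tiny instance of the compile-time optimization mentioned at the end. A minimal sketch, assuming the standard CPython interpreter:

```python
import dis

# CPython compiles source to bytecode at runtime -- there is no separate
# ahead-of-time compile step for the programmer.
code = compile("x = 2 + 3", "<string>", "exec")

# The compiler has already folded 2 + 3 into the constant 5,
# a small example of a compiler rewriting your code for you.
print(5 in code.co_consts)  # True

dis.dis(code)  # show the resulting bytecode
```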

Also, I meant the computer queues up program data, not the compiler (damn auto correct)[/spoiler]

Seems you and I are on the same page altogether XD. Yes, all you say is mostly correct: x86-based processors use GPRM PEs, with more modern x86 CPUs having many x86 MIMD PEs. While higher clock speeds do allow for lazy programmers, and languages like Java are nearly impossible to fail with, heavily optimized C++ and Fortran still have the original issues you state, and more, as there are more PEs you would have to vectorize; modern Intel and AMD processors have 16 MIMD PEs (assuming a 4-core Intel chip or a 4-module AMD processor; 32 MIMD PEs on an i7 5960X, 4 per core).

Also, x86 is considered a CISC (pretty much a complicated and overdeveloped RISC).
He considered it RISC because all the assembly commands are very much RISC with one operation on two registers per line. The compiler expands out the instruction set based on the opcode and the fields involved.

The registers were all special purpose. For example, if you wrote to BX, you couldn't perform a move in the next operation because BX was the move command's index register for a machine-coded array. You would lose what was already written to BX when that happened, and the move would be wrong (unless, of course, you were performing an indexed move in an array, which you never did because that would waste more clock cycles than using pointers). AX was the accumulator register, CX was the count register for a machine-coded array, and DX was some even more obscure thing you absolutely never used. The only reason he even knew the second functions of the first three was because he would occasionally get a job that needed a program to run faster, and the assembly code always had these registers used for their special purpose. (The only reason these are any slower is because the machine-coded array needs like four other registers to define other properties of the array and its address. That being said, many people thought that if you did more than twelve (IIRC) consecutive equal move operations on the same array, these became useful... but they didn't realize this meant they had just created twelve consecutive equal RAW data conflicts in the pipeline, so instead the program slowed down while the CPU damned you to Hell; and you never needed that many moves unless you were moving every element in the array by one and were treating it as a stack... and there's the magic word.)

magazorb Wrote:

(10-11-2014, 06:02 AM)TSO Wrote: [spoiler]
magazorb Wrote:[spoiler]
(10-10-2014, 01:48 AM)TSO Wrote: So you would be saving a clock cycle per instruction.

I spoke with him, and yes you do exactly what I described when programming assembly for the 386, with one slight exception. The instruction set does not carry the conditional with it, there is a branching operation and the CPU uses some kind of hardware look ahead in order to set the flags in one clock cycle so that the next cycle will pipeline correctly.


Also, when optimizing for speed on an ALU where not every operation takes the same amount of time but multiple simultaneous operations are possible, is it better to put the fast operations close to the CPU and have them be a hell of a lot faster than the slow ones, or put the fast farthest away and have it all kinda balance out? For example, my current instruction set, which I have discussed with LD would allow for a bit shift to occur three ticks after being instructed, and repeatable every six ticks, with the ability to run all the other operations at such speeds as well (the CPU can have a three tick clock). The Boolean operators are four ticks out, but also repeatable every six ticks. At the other end, the MOD function is 106 ticks out, so that's like 34 near operations for every far operation.

Doesn't really matter how much you speed up things that are going slow; if you have them done before your next clock comes, it's just waiting. So if you can, put other stuff there and let that be further out. Again, this is a unique thing with your designs; most don't have this issue, so it might be a little tricky figuring out a method of doing this automatically for the programmer, or for the programmer to know where to do this.
[/spoiler]

1.) There was a slight error: DVQ/MOD and MUL/SQR are six ticks out, but will probably take about 96 and 12 ticks to complete, respectively. On the other hand, it just now occurred to me how I could pipeline these functions and have them point back toward the CPU so that the actual operator will bus itself (if I make the operator four chunks long (easily done) and place the output toward the CPU, then it ends up right back at the cache when it's all done, because the bus from the CPU to the input is going to be four chunks long), cutting out the four bussing ticks needed to return the value and allowing more than one divide or multiply to be requested in sequence, though it would still take 96 or 12 ticks to compute each operation.

2.) The computer won't alter the code for the programmer to account for the timing difference, and the programmer also doesn't write the program with the timing difference in mind. Place the blatantly obvious secret sauce recipe here. (If you can't figure out how I'm going to be creating this particular workaround, we have a problem, because it's already been discussed in this thread.)[/spoiler]

I'm not too well informed as to all the details of your overall system or architecture. However, what I can say is that your main two concerns are data loops and throughput; I'm sure you're already aware of that, though. Data dependencies will also make a difference, so it's mostly about how you intend to get rid of data dependencies or reduce them: increased pipelining will increase throughput, but it also complicates things and often allows more data dependencies to accrue, in turn increasing the amount of time you have to idle.

Also, I doubt anyone really knows what your "secret sauce" is, as you seem somewhat inconsistent about what would otherwise seem to be it.
Basically, it's somewhat vague by our general conventions.
Let's have an extremely short thought experiment. If you don't rearrange or alter data in runtime, and you don't rearrange or alter data at programming time, where would it happen? What other time could possibly exist in the program development flow chart where the data could be rearranged or altered? There is only one useful answer to this question.

Not every secret sauce is the same. It takes more than one ingredient to make a recipe.
magazorb Wrote:

(10-11-2014, 06:02 AM)TSO Wrote: [spoiler]
magazorb Wrote:[spoiler]
(10-10-2014, 05:45 AM)TSO Wrote: That completely defeats the purpose of a forum. I advise you get around to reading it, we've been discussing making the impossible possible, and (at least according to GG) have actually made a small breakthrough in redstone computing (it sounds like I knocked a tick or two off of the fastest theoretical clock rate)

Hi

we've/you've not really been talking about things that were seemingly impossible, albeit you do have some great ideas in general. Also, sorry to say it, but you won't be claiming performance king just yet; maybe after a few revisions and optimisations, once you settle in and learn stuff, your combination of ideas with some of our members' might do, though (it's not really expected that new members even know half this much about CS, really, so you're doing well Tongue)

If it's not too much to ask, once you have an IS drawn up, may I see it please?
[/spoiler]

Due to a particular breakthrough I had (it was legitimately like a moment of enlightenment, LD sort of knows about it), I actually have almost all the instruction set drawn up now, as well as the entire bussing arrangement. I also don't think I'm going to ever be performance king, but I will certainly be the king of dual purpose logic.

I have a programmer father that hates wasting time in a program, my friend's father is a hardware developer at HP, one of my father's friends is a hardware developer at IBM, and one of my father's other friends has a degree in library science. Between my inquisitive nature and their experience, there is literally no way I wouldn't know this much about computer engineering. (Although, it is funny to get the two hardware developers in the same room, because they suddenly get real cautious about how they answer my questions. Let's just say neither of them are the low guy on the totem pole at their company.)[/spoiler]

Nice to know you sit yourself in a position where you can easily acquire information about real-life EE and CS; no doubt it's useful. However, MC does have many limitations and possibilities that IRL stuff doesn't account for, so this is generally the challenge when coming from RL-based knowledge. It's still very useful, as most of it can be applied, though. Another thing to bear in mind is that technology is so vast that no small collective group of individuals fully knows the capabilities of hardware/programming; I even doubt that everyone's knowledge collectively would.

In short, you will hit walls and you will destroy some, but a lot of the time it's unexpected; that's all part of the fun.
The funniest thing is that the library scientist is by far the most useful for what I'm trying to get done.

magazorb Wrote:

(10-11-2014, 06:02 AM)TSO Wrote: [spoiler]
(10-10-2014, 08:26 PM)LordDecapo Wrote:
(10-07-2014, 02:42 AM)TSO Wrote: Or you can rename registers as you go... but that's for noobs.
cough cough, i can do that, and it only takes 3 ticks.
Noob.

LordDecapo Wrote:[spoiler]
(10-09-2014, 10:18 PM)greatgamer34 Wrote: Most people(including me) in minecraft have a Compare instruction then followed by a Flag register read.

The Flag register read can be BEQ(branch if equal to), BGT(branch if greater than), BLT(branch if less than), BNEQ(branch if not equal to)....etc, along these lines of any amount of possible conditions.

This then reads from the proper flags register and performs a jump to a specified address(usually found in the IMM or a pointer).

Ok, so my branching works a bit different, because of my crazy-ass cRISC or CISC or whatever you want to consider it architecture.
I have 2 modes in my IS, Memory/System is mode0 and ALU functions are Mode1
Branching is a Mode0, And it is also a multi line
I have a specially made 3-deep queue with multi-read, so when a branch is detected, it reads locations (Inst)0, (Inst)1, and (Inst)2 from the queue, and routes their data to specific parts.

Inst0 is the main inst; it tells what conditions to look for, whether it is a call or a return, whether it is conditional or not, and whether it's direct, relative-positive, or relative-negative.

Inst1 is the destination address (so I can have 65535 or whatever lines of code on just a PROM, more if I access external memory, which is easy to do) that is only loaded into the PC if the condition is true.

Inst2 is the function that defines where the flag can arise from, and this inst MUST be a Mode1, so you can add, sub, or, xor, compare, etc. to generate any flag you want.

All of that gets decoded and sorted out in about 7 ticks; then the branch is determined on the next cycle, whether the conditions are met. It has static prediction of false, so you only get hit with a 1-cycle penalty after a true flag comes through, leaving the penalty of branching not that devastating.
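If I'm reading the three-slot scheme right, a rough functional sketch looks like this (all names, flag bits, and widths here are my own invention for illustration, not LD's actual implementation):

```python
# Two example flag bits; a real FLAGS register would carry more.
ZERO, NEG = 1 << 0, 1 << 1

def decode_branch(inst0_mask, inst1_dest, inst2_alu_result):
    """Inst0 carries the condition mask (call/return and relative info
    omitted here); Inst1 is the destination address; Inst2 is an ordinary
    Mode1 ALU op whose result generates the flags being tested."""
    flags = (ZERO if inst2_alu_result == 0 else 0) | (NEG if inst2_alu_result < 0 else 0)
    taken = bool(flags & inst0_mask)
    # Static prediction of false (not-taken): only a branch that turns
    # out taken pays the 1-cycle penalty described in the post.
    penalty = 1 if taken else 0
    return (inst1_dest if taken else None), penalty

print(decode_branch(ZERO, 0x40, inst2_alu_result=0))  # (64, 1): taken, one-cycle hit
print(decode_branch(NEG, 0x40, inst2_alu_result=7))   # (None, 0): fall through, free
```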

I will be making a forum post this weekend with pics and such of my MC CPU, since you can't join the server, and will explain the IS in detail for those who are interested.


...It sounds really complicated... Tongue
...or maybe not...
Is it basically the same jump instruction system as mine, but without the parallel computing part?

I haven't quite gotten around to acquiring a free .rar extractor that only comes with minimal bonus material.

LordDecapo Wrote:

(10-10-2014, 01:48 AM)TSO Wrote: So you would be saving a clock cycle per instruction.

I spoke with him, and yes you do exactly what I described when programming assembly for the 386, with one slight exception. The instruction set does not carry the conditional with it, there is a branching operation and the CPU uses some kind of hardware look ahead in order to set the flags in one clock cycle so that the next cycle will pipeline correctly.


Also, when optimizing for speed on an ALU where not every operation takes the same amount of time but multiple simultaneous operations are possible, is it better to put the fast operations close to the CPU and have them be a hell of a lot faster than the slow ones, or put the fast farthest away and have it all kinda balance out? For example, my current instruction set, which I have discussed with LD would allow for a bit shift to occur three ticks after being instructed, and repeatable every six ticks, with the ability to run all the other operations at such speeds as well (the CPU can have a three tick clock). The Boolean operators are four ticks out, but also repeatable every six ticks. At the other end, the MOD function is 106 ticks out, so that's like 34 near operations for every far operation.

No, please, no, do not do a 3-tick clock. It's "theoretically" the fastest you can get with torches... but NO, just NO! MC bugs are so disgraceful that your clock cycles will become uneven and will corrupt data in ways you never knew were possible... trust me, I had a huge project, and I backed off the complexity and simplified the logic I was gonna use in my CU to get a little longer clock... well, more than a little: 3 ticks to 10 ticks. But the throughput and penalty percentages are ridiculously less now as well, so it gives you better performance under normal operating conditions. Clock speed DOESN'T mean more power; you have to take into consideration the IS, and the possible penalties the CPU could suffer from such small pipeline stages... and a 3-tick clock leaves 2 ticks for logic, 1 tick to store, so it's really dumb xD. I learned this the hard way... the PC was the thing that we found killed it the fastest.
[/spoiler]

Again, there are actually many errors in that statement, as well as a massive oversight on my part. The clock is limited to the seven ticks it will take to decode the instruction pointer. I honestly have absolutely no idea how to speed that up without reducing the amount of cache space in the computer used for queuing up instruction data.

Three ticks does not give you two ticks for logic and one tick for store (at least in my architecture, just because of how every function would need to store at its input side); it gives three to store, however long it takes to calculate, three to write to the output bus, and three to store in the data registers. (Also, there is a device in the game that can store data for up to four ticks; you'll never guess what it is. And no, it's not some "command block bull shit".)


Final announcement: the instruction set is nearly complete. It is still actually the reverse engineering of the processes in the CPU and ALU, but my moment of enlightenment allowed me to engineer the CPU and bussing layout all in my head. It occurred to me that opcodes are pointers, which is why I know how far away the inputs for each ALU function are from the CPU (that'll give you something to think about).[/spoiler]

Repeater locks: 1 tick to store and the rest can be logic. It's pretty much the way to go.

Also loving this thread of yours Big Grin
Cue another philosophical minecraft moment...

Is it the way to go? Do you honestly need that repeater to even lock? If we multiply by accumulation, and place 16 3-tick adders with output connected to input "incorrectly" so the shift happens on its own, do you really need to hold the other operand's data in a locking repeater, or do we just need it to pause for three ticks at the front of each add? If they are 1-tick adders sixteen blocks long, does the data for the second operand need any pause at all, or does the repeater every fifteen blocks suffice?

If they are 1-tick adders sixteen blocks long, are they really 1 tick... or are they zero-tick, because redstone needs a repeater every fifteen blocks?
If you direct this multiplier back toward the registers, did you remove the output bussing, or is the bus performing the operations?
Could that be applied to other elements of the processor?
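The multiply-by-accumulation trick being described is shift-and-add multiplication; in software the same accumulation looks like this (a sketch only, with the "wiring offset" written as an explicit shift):

```python
def shift_add_multiply(a, b, width=16):
    """Multiply by accumulation: one conditional add per bit of b,
    each stage adding a copy of a offset by one more place -- the
    software analogue of 16 chained adders whose input wiring is
    offset one position per stage, so the shift happens on its own."""
    acc = 0
    for i in range(width):
        if (b >> i) & 1:       # bit i of b decides whether stage i adds
            acc += a << i      # the stage's wiring offset IS the shift
    return acc

print(shift_add_multiply(1234, 56))  # 69104, same as 1234 * 56
```

In hardware terms, no stage ever "performs" a shift; the routing between stages does it for free, which is exactly the question being posed about the bus performing the operations.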

(10-12-2014, 03:52 AM)Magazorb Wrote:
(10-12-2014, 03:14 AM)͝ ͟ ͜ Wrote:
(10-12-2014, 02:25 AM)Magazorb Wrote: In short you will waste away your life playing a stupid block game building outdated computers that serve no purpose

fix'd

Well that's not nice :'(

But it is true


RE: I am TSO - Magazorb - 10-12-2014

(10-12-2014, 03:59 AM)TSO Wrote:
(10-12-2014, 02:25 AM)Magazorb Wrote:

(10-11-2014, 06:02 AM)TSO Wrote: [spoiler]
(10-10-2014, 04:22 PM)Magazorb Wrote: Metric prefixes:
8Mb = 1(1x10^6) or 1,000,000B, 8KB = 8(8x10^3) or 64,000b, 4Kb = 1(4x10^3) or 4,000b
Binary Prefixes:
8Mib = 1(1024^2) or 1,048,576B, 8KiB = (8x8)(1024^1) or 65536b, 4Kib = 4(1024^1) or 4096b
the 1024 came from 2^10 which is the bases of which binary prefixes are powering from, so 8GB = 8((2^10)^3) = 8589934592Bytes or 6871947674bits.
The Binary prefixes was losely based around the French metrics or later came SI metrics
1.) I know my prefixes quite well, possibly better than some of you guys, so I don't need help with that.
2.) 8Mb is not equal to 1(1*10^6), it's much more like 8*10^6 b, and you have similar math errors throughout this section. Also, they are by no means "loosely" related, the relationship of the digits between base ten and base two is log(n)=bl(n)/bl(10). Every (useful) metric prefix is a factor of 1000. Log(1000)=bl(1000)/bl(10), now log(1000)=3 and bl(1000)= 9.96. Let's re examine that number as 10. Now we are looking at 2^10, which is 1024. So the relationship between the two is just (10^(log(1024)log([prefix])/3))B=[prefix]iB. End of story.
[/spoiler]

OK, first of you did originally use it incorrectly, i didn't say that you didn't know it, i just merely was clearing, sorry if it seemed i was suggesting you didn't know, also i said 8Mb is 1(1*10^6)B which is true, B= Byte, b=bit, you just miss read it.

Also the prefix naming of binary prefixed is losely based on metric prefixs, they just change the ending of metric prefixes to bi instead, hence why i said losely based, the values them self are all together different yes, but the names of prefixes are related.

You can easily tie the powers of prefixes to together rather easily by saying 10^3~1024^1, or 10^9~1024^3, basically powers of binary be metric_power/3, this become less accurate as the powers increase, so best to keep metrics and binary's separate, but regardless name still relates.

As I said, the logarithmic equation above is the exact relationship between the radicies.

magazorb Wrote:

(10-11-2014, 06:02 AM)TSO Wrote: [spoiler]
magazorb Wrote:[spoiler]
(10-09-2014, 05:16 AM)TSO Wrote: Sorry, I meant mostly how to generate the conditions (flags, I guess) in the same cycle as the operation, then I started wondering how many of these conditions I need... then I asked my father for help. He programmed back in the 90's when computers had no memory and actual program speed mattered because computers were still so slow (back when programmers actually had to be good at programming.)

What I learned was this: all your possible conditions are computed at the same time and given a true/false value. These values then go to a single register the same size as all the possible conditions (between eight and sixteen), you then use a bit mask to get the conditions you want out of this register (you can actually call more than one flag at the same time, then operate a condition on those two). Your goal is to never jump, EVER, especially more than one or two lines. Long goto jumps take longer, are harder to understand in debugging, and are a sign that the programmer does not understand the program being written. If, Then, Else statements also slow down the system, but are far better than a goto because the compiler will place the destination code for the if statement near the line the condition is on. The compiler will preload the program data line by line regardless of condition destinations, if an if, then, else occurs, a path is chosen, and all code for the other path is deleted, followed by loading all the subsequent program data for the line that was chosen. The nearer these lines are to the jump condition, the less gets loaded and unloaded. The amount of qued data that gets deleted depends upon how large the jump is, with a goto being the worst. Nearly all data is dumped and reloaded and all stages of the CPU are completely cleared in a goto because they inherently do not jump near the line their condition is on. And if you pipelined and it jumps conditionally, you have to somehow rid yourself of all the crap in the pipeline that is no longer correct because it was part of the other conditional side.

Luckily, my design inherently helps with this, but it will still suck to have to think about it.


Computers back in the 90's did have registers, if they was GPRMs, this wasn't used to often though because the processors at the time was slower then the memory speeds so having registers didn't improve performance much, and added to the instruction size, so was deemed uncesseriy for the time because of the extra memory requirements for the IS.

Two other popular PE types wear Stack based and Acumulator based, this two wear both popular for near enough the same reason, positives and negatives of each are near the same and are mainly the other type of PEs that pros and cons wear oposite of GPRM PEs, there was also other types of PEs that other systems used but the 3 most popular and main ones where the Stack,Accumulator and GPR based PEs

"all your possible conditions are computed at the same time and given a true/false value."
The idea and direction you're going with this is good; IRL some do this. However, in MC it isn't really a thing, with most things being small enough that they reach the new address and return before the clock, thus giving no performance gain, or being predictively prefetched for pipelined CPUs, which also gives no gain.
This is a good idea though, and in bigger MC things it may be useful; the idea has gone around a couple of times but has yet to be implemented, so 5pts. to you if you do it first with at least some gain Big Grin

"(back when programmers actually had to be good at programming.)" Actual programming is harder now than it's ever been for pros. The only programmers that aren't good tend to be game devs (sometimes they know some code but they aren't really programmers, they use software instead) and those who can't be bothered to learn how to write optimal code; those doing it professionally are made to max out their hardware's abilities, and this becomes very hard.
To do this you have to learn various languages (normally this would be done with a combination of Fortran, C++ and ASM, with GPGPU being utilised heavily).
Generally this style of programming is never taught but is in high demand; it also tends not to run too well on configurations other than its intended system config(s).

I'm also unclear as to what you're defining as a compiler; you're probably correct, but it sounds almost like you want to use it in active programs instead of compiling before use.
[/spoiler]

He said there were like 10 or 16 registers (remember he only really worked with x86 assembly), and you only worked with about four. All 16 were special purpose, but if you weren't using the function they related to in that instruction, they doubled as general purpose. There were really only five you worked with: A, B, C, D, FLAGS. Add in that x86 is RISCey, and you see that you only need four of these registers: the one involved with your operation, any two main registers, and FLAGS, so you never use D either. The FLAGS are all predicted simultaneously and parallel to execution, and the jump conditions were all opcodes that simply masked the FLAGS register and jumped to the specified line. x86 processors never performed more than one task in the same line of code, for the most part. There was some special stuff that played with things, and there were numbers you could insert into the opcode location representing opcodes that didn't exist (or something like that) that made the CPU "fuck up just right" and perform operations that would normally take many lines of code in one line. I have no idea how that works, but it abuses the fact that x86 is microcode heavy yet doesn't know what to do if it receives a number that isn't a real opcode. Results range from exactly what you wanted to happen to computer suicide.

His only comment when I asked about stacks was, "You never use stacks." He used pointers all the time, though (in fact, there were a few string manipulation programs that only used pointers, and pointers pointing to pointers, and other stuff). He also never made comments on his code, so nobody knew how to debug it (not that the debugging team was any good at their job to begin with), so he just debugged it himself.

He says the abilities of modern computers make even lazy coding fast enough for the intended application: goto is returning, as are stacks. When he did things, it would be a few lines in C that then resulted in a subroutine branch to a short assembly program, then some more lines of C, et cetera. Some of his programs were more assembly than C, some weren't; it depended on what needed to be done. The programs were intended for the x86 family, which is apparently an industry standard, or something.

Fortran was rarely used because the compile overhead was too great.

A compiler is a program that converts the written program into machine code. The more distant the programming language is from machine code, the more time is spent compiling and the longer the resulting program. Some languages, such as assembly, have zero line expansion (assembly is simply a mnemonic for machine code), while some languages (cough Java cough) are so distant from the machine code that only an absolute moron would try to write in them unless you need the code to be insensitive to the machine it's running on. Of note: Python code compiles at runtime; there is no separate compile time. Now, that being said, some good compilers do actually alter and rearrange the lines of code at compile time to help optimize things like conditional jumps.
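The "zero line expansion" point can be made concrete with a toy assembler where every mnemonic maps one-to-one onto a machine word. The opcodes below are made up for the sketch, not any real ISA:

```python
# Toy assembler: each mnemonic maps 1:1 to an invented opcode byte,
# so one source line becomes exactly one machine word (zero expansion).
OPCODES = {"MOV": 0x01, "ADD": 0x02, "SUB": 0x03, "JMP": 0x04}

def assemble(lines):
    """One source line in, one (opcode, operand) word out."""
    machine = []
    for line in lines:
        mnemonic, operand = line.split()
        machine.append((OPCODES[mnemonic], int(operand)))
    return machine

program = ["MOV 7", "ADD 1", "JMP 0"]
machine = assemble(program)
print(machine)                       # [(1, 7), (2, 1), (4, 0)]
print(len(machine) == len(program))  # True: zero line expansion
```

A high-level language compiler, by contrast, may emit many machine words per source line, which is exactly the expansion being described.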

Also, I meant the computer queues up program data, not the compiler (damn autocorrect)[/spoiler]

Seems you and I are on the same page altogether XD. Yes, all you say is mostly correct: x86 based processors use GPRM PEs, with more modern x86 CPUs having many x86 MIMD PEs. While higher clock speeds do allow for lazy programmers, and languages like Java are nearly impossible to fail with, heavily optimized C++ and Fortran still have the original issues you state, and more, as there are more PEs you would have to vectorize; modern Intel and AMD processors have 16 MIMD PEs (assuming a 4 core Intel chip or a 4 module AMD processor; 32 MIMD PEs on an i7 5960X, 4 per core).

Also, x86 is considered a CISC (pretty much a complicated and overdeveloped RISC).
He considered it RISC because all the assembly commands are very much RISC with one operation on two registers per line. The compiler expands out the instruction set based on the opcode and the fields involved.

The registers were all special purpose. For example, if you wrote to BX, you couldn't perform a move in the next operation because BX was the move command's index register for a machine coded array. You would lose what was already written to BX when that happened and the move would be wrong (unless, of course, you were performing an indexed move in an array, which you never did because that would waste more clock cycles than using pointers). AX was the accumulator register, CX was the count register for a machine coded array, and DX was some even more obscure thing you absolutely never used. The only reason he even knew the second functions of the first three was because he would occasionally get a job that needed a program to run faster. The assembly code always had these registers used for their special purpose. (The only reason these are any slower is because the machine coded array needs like four other registers to define other properties of the array and its address. That being said, many people thought that if you did more than twelve (IIRC) consecutive equal move operations on the same array, these became useful... but they didn't realize this meant that you just created twelve consecutive equal RAW data conflicts in the pipeline, so instead the program slowed down while the CPU damned you to Hell, and you never needed that many moves unless you were moving every element in the array by one and were treating it as a stack... and there's the magic word.)

magazorb Wrote:

(10-11-2014, 06:02 AM)TSO Wrote: [spoiler]
magazorb Wrote:[spoiler]
(10-10-2014, 01:48 AM)TSO Wrote: So you would be saving a clock cycle per instruction.

I spoke with him, and yes you do exactly what I described when programming assembly for the 386, with one slight exception. The instruction set does not carry the conditional with it, there is a branching operation and the CPU uses some kind of hardware look ahead in order to set the flags in one clock cycle so that the next cycle will pipeline correctly.


Also, when optimizing for speed on an ALU where not every operation takes the same amount of time but multiple simultaneous operations are possible, is it better to put the fast operations close to the CPU and have them be a hell of a lot faster than the slow ones, or put the fast farthest away and have it all kinda balance out? For example, my current instruction set, which I have discussed with LD would allow for a bit shift to occur three ticks after being instructed, and repeatable every six ticks, with the ability to run all the other operations at such speeds as well (the CPU can have a three tick clock). The Boolean operators are four ticks out, but also repeatable every six ticks. At the other end, the MOD function is 106 ticks out, so that's like 34 near operations for every far operation.

Doesn't really matter how much you speed up things that are going slow; if you have them done before your next clock comes, they're just waiting. So if you can, put other stuff there and let that be further out. Again, this is a unique thing with your designs; most don't have this issue, so it might be a little tricky figuring out a method of doing this automatically for the programmer, or for the programmer to know where to do this.
[/spoiler]

1.) There was a slight error: DVQ/MOD and MUL/SQR are six ticks out, but will probably take about 96 and 12 ticks to complete, respectively. On the other hand, it just now occurred to me how I could pipeline these functions and have them point back toward the CPU so that the actual operator will bus itself (if I make the operator four chunks long (easily done) and place the output toward the CPU, then it ends up right back at the cache when it's all done, because the bus to the input from the CPU is going to be four chunks long), cutting out the four bussing ticks needed to return the value and allowing more than one divide or multiply to be requested in sequence, though it still would take 96 or 12 ticks to compute each operation.

2.) The computer won't alter the code for the programmer to account for the timing difference, and the programmer also doesn't write the program with the timing difference in mind. Place the blatantly obvious secret sauce recipe here. (If you can't figure out how I'm going to create this particular workaround, we have a problem, because it's already been discussed in this thread.)[/spoiler]

I'm not too well informed as to all the details of your overall system or architecture. However, what I can say is that your main two concerns are data loops and throughput; I'm sure you're already aware of that, though. Data dependencies will also make a difference, so it's mostly about how you intend to get rid of data dependencies or reduce them: increased pipelining will increase throughput, but it also complicates things and often allows more data dependencies to occur, in turn increasing the amount of time you have to idle.

Also, I doubt anyone really knows what your "secret sauce" is, as you seem somewhat inconsistent about what would otherwise appear to be it.
Basically, it's somewhat vague by our general conventions.
Let's have an extremely short thought experiment. If you don't rearrange or alter data at runtime, and you don't rearrange or alter data at programming time, where would it happen? What other time could possibly exist in the program development flow chart where the data could be rearranged or altered? There is only one useful answer to this question.

Not every secret sauce is the same. It takes more than one ingredient to make a recipe.
Assumed you wouldn't do it then because that's too easy XD; it would give an overhead though that may be annoying.

(10-12-2014, 03:59 AM)TSO Wrote:
magazorb Wrote:

(10-11-2014, 06:02 AM)TSO Wrote: [spoiler]
magazorb Wrote:[spoiler]
(10-10-2014, 05:45 AM)TSO Wrote: That completely defeats the purpose of a forum. I advise you get around to reading it, we've been discussing making the impossible possible, and (at least according to GG) have actually made a small breakthrough in redstone computing (it sounds like I knocked a tick or two off of the fastest theoretical clock rate)

Hi

We've/you've not really been talking about things that were seemingly impossible, albeit you do have some great ideas in general. Also, sorry to say it, but you won't be claiming performance king just yet; maybe after a few revisions and optimisations, once you settle in and learn stuff, your combination of ideas with some of our members' might do, though. (It's not really expected that new members even know half this much about CS, really, so you're doing good Tongue)

If it's not much to ask, once you have an IS drawn up, may I see it please?
[/spoiler]

Due to a particular breakthrough I had (it was legitimately like a moment of enlightenment, LD sort of knows about it), I actually have almost all the instruction set drawn up now, as well as the entire bussing arrangement. I also don't think I'm going to ever be performance king, but I will certainly be the king of dual purpose logic.

I have a programmer father that hates wasting time in a program, my friend's father is a hardware developer at HP, one of my father's friends is a hardware developer at IBM, and one of my father's other friends has a degree in library science. Between my inquisitive nature and their experience, there is literally no way I wouldn't know this much about computer engineering. (Although, it is funny to get the two hardware developers in the same room, because they suddenly get real cautious about how they answer my questions. Let's just say neither of them are the low guy on the totem pole at their company.)
[/spoiler]

Nice to know you sit yourself in a position where you can easily acquire information about real life EE and CS; no doubt it's useful. However, MC does have many limitations and possibilities that IRL stuff doesn't account for, so this is generally the challenge when coming from RL based knowledge. It's still very useful, as most of it can be applied, though. Also, another thing to bear in mind is that technology is so vast that no small group of individuals fully knows the capabilities of hardware/programming; I even doubt that everyone's knowledge collectively would.

In short, you will hit walls and you will destroy some, but a lot of the time it's unexpected, and that's all part of the fun.
The funniest thing is that the library scientist is by far the most useful for what I'm trying to get done.

magazorb Wrote:

(10-11-2014, 06:02 AM)TSO Wrote: [spoiler]
(10-10-2014, 08:26 PM)LordDecapo Wrote:
(10-07-2014, 02:42 AM)TSO Wrote: Or you can rename registers as you go... but that's for noobs.
cough cough, i can do that, and it only takes 3 ticks.
Noob.

LordDecpo Wrote:[spoiler]
(10-09-2014, 10:18 PM)greatgamer34 Wrote: Most people(including me) in minecraft have a Compare instruction then followed by a Flag register read.

The Flag register read can be BEQ (branch if equal to), BGT (branch if greater than), BLT (branch if less than), BNEQ (branch if not equal to)... etc., along these lines for any number of possible conditions.

This then reads from the proper flags register and performs a jump to a specified address(usually found in the IMM or a pointer).

Ok so my branching works a bit different, cause my crazy ass cRISC or CISC or whatever u want to consider it architecture.
I have 2 modes in my IS, Memory/System is mode0 and ALU functions are Mode1
Branching is a Mode0, And it is also a multi line
I have a specially made 3 deep queue with multiread, so when a branch is detected, it reads locations (Inst)0, (Inst)1, and (Inst)2 from the queue and routes their data to specific parts.

Inst0 is the Main Inst: it tells what conditions to look for, whether it is a Call or a Return, whether it is conditional or not, and if it's direct, relativePos. or relativeNeg.

Inst1 is the Destination address (so i can have 65535 or what ever lines of code on just a PROM, more if i access external memory, which is easy to do) that is only loaded into the PC if the condition is true

Inst2 is the function that defines where the flag can arise from, and this inst MUST be a Mode1, so u can add, sub, or, xor, compare, etc. to generate any flag you want.

All of that gets decoded and sorted out in about 7 ticks, then the branch is determined on the next cycle if the conditions are met. It has static prediction of False, so u only get hit with a 1 cycle penalty after a True flag comes through, leaving the penalty of branching not that devastating.
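That static predict-false scheme is cheap to reason about: only branches that come back True pay the 1 cycle penalty. A quick illustrative helper (the figures in the example are made up, not measurements of this CPU):

```python
def effective_cpi(base_cpi, branch_fraction, taken_fraction, penalty=1):
    """Average cycles per instruction under static not-taken prediction:
    only the branches that turn out True pay the penalty."""
    return base_cpi + branch_fraction * taken_fraction * penalty

# If 20% of instructions branch and half of those are taken:
print(effective_cpi(1.0, 0.2, 0.5))  # 1.1
```

So even a fairly branchy program only loses a fraction of a cycle per instruction on average, which is why the penalty is "not that devastating."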

I will be making a forum post this weekend with pics and such of my MC CPU, since u cant join server, and will explain the IS in detail in it for those who are interested.


...It sounds really complicated... Tongue
...or maybe not...
Is it basically the same jump instruction system as mine, but without the parallel computing part?

I haven't quite gotten around to acquiring a free .rar extractor that only comes with minimal bonus material.

LordDecapo Wrote:

(10-10-2014, 01:48 AM)TSO Wrote: So you would be saving a clock cycle per instruction.

I spoke with him, and yes you do exactly what I described when programming assembly for the 386, with one slight exception. The instruction set does not carry the conditional with it, there is a branching operation and the CPU uses some kind of hardware look ahead in order to set the flags in one clock cycle so that the next cycle will pipeline correctly.


Also, when optimizing for speed on an ALU where not every operation takes the same amount of time but multiple simultaneous operations are possible, is it better to put the fast operations close to the CPU and have them be a hell of a lot faster than the slow ones, or put the fast farthest away and have it all kinda balance out? For example, my current instruction set, which I have discussed with LD would allow for a bit shift to occur three ticks after being instructed, and repeatable every six ticks, with the ability to run all the other operations at such speeds as well (the CPU can have a three tick clock). The Boolean operators are four ticks out, but also repeatable every six ticks. At the other end, the MOD function is 106 ticks out, so that's like 34 near operations for every far operation.

No please no, do not do a 3 tick clock. It's "theoretically" the fastest u can get with torches,,, but NO just NO! MC bugs are so disgraceful that ur clock cycles will become uneven and will corrupt data in ways u never knew were possible... trust me, i had a huge project, and backed off the complexity and simplified the logic i was gonna use in my CU to get a little longer clock,, well more than a little,, 3 ticks to 10 ticks, but the throughput and penalty %ages are ridiculously less now as well, so it gives you better performance under normal operating conditions. Clock speed DOESN'T mean more power,, u have to take into consideration the IS, and the possible penalties the CPU could suffer from such small pipeline stages,,, and a 3 tick clock leaves 2 ticks for logic, 1 tick to store, so it's really dumb xD i learned this the hard way... PC was the thing that we found killed it the fastest.
[/spoiler]

Again, there are actually many errors in that statement, as well as a massive oversight on my part. The clock is limited to the seven ticks it will take to decode the instruction pointer. I honestly have absolutely no idea how to speed that up without reducing the amount of cache space in the computer used for queuing up instruction data.

Three ticks does not give you two ticks for logic and one tick for store (at least in my architecture, just because of how every function would need to store at its input side); it gives three to store, however long it takes to calculate, three to write to the output bus, and three to store in the data registers. (Also, there is a device in the game that can store data for up to four ticks; you'll never guess what it is. And no, it's not some "command block bull shit".)


Final announcement: the instruction set is nearly complete, it is still actually the reverse engineering of the processes in the CPU and ALU, but my moment of enlightenment allowed for me to engineer the CPU and bussing layout all in my head. It occurred to me that op codes are pointers, which is why I know how far away the inputs for each ALU function are from the CPU (that'll give you something to think about).[/spoiler]

Repeater locks: 1 tick store and the rest can be logic. It's pretty much the way to go.

Also loving this thread of yours Big Grin
Cue another philosophical minecraft moment...

Is it the way to go? Do you honestly need that repeater to even lock? If we multiply by accumulation, and place 16 3 tick adders with output connected to input incorrectly so the shift happens on its own, do you really need to hold the other operand's data in a locking repeater, or do we just need it to pause for three ticks at the front of each add? If they are 1 tick adders sixteen blocks long, does the data for the second operand need any pause at all, or does the repeater every fifteen blocks suffice?
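That multiply-by-accumulation structure computes the same thing as the classic shift-and-add loop; a minimal software sketch of it, assuming 16-bit unsigned operands:

```python
def shift_add_multiply(a, b, width=16):
    """Multiply by accumulation: one conditional add per bit of b.
    The left shift of a at each stage is the shift the hardware gets
    'for free' by wiring each adder's output one place over."""
    product = 0
    for i in range(width):
        if (b >> i) & 1:
            product += a << i
    return product & ((1 << (2 * width)) - 1)

print(shift_add_multiply(13, 11))  # 143
```

Each loop iteration corresponds to one of the 16 chained adders, which is why no extra shifter (or locked operand) is strictly required.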

If they are 1 tick adders sixteen blocks long, are they really 1 tick... or are they zero tick because redstone needs a repeater every fifteen blocks?
If you direct this multiplier back toward the registers, did you remove the output bussing, or is the bus performing the operations?
Could that be applied to other elements of the processor?
You don't need to be philosophical about this. If you wanted a 3 tick pipeline, for instance, 1 tick would be on the repeater lock as a stage buffer and the other 2 would be logic between stages; for say an 8 tick one, it would be 1 tick on locks and 7 on logic. It's just quicker than using SR latches. If you wanted to make an extreme data loop with simple pipelining and 1 Fwd, you can theoretically achieve an effective data loop of 5 ticks: 1 tick for buffer, 4 for ULG or adder/subtractor, with a CU that looks ahead to see if a Fwd is required and redirects the data forward into the buffer before the stage. A little messy and large, but it still holds in theory; the PE would also require two copies of the final half of the logic for the ALU stage, half acting as the Fwd (whose output is naturally disabled by the CU unless a Fwd is required), the other unobstructed and continuing to the GPRs. Though good luck to anyone trying to implement an 8 bit version of that.

But it's a good idea to have stage buffers, as sometimes interrupts can disrupt the flow of data and corrupt things without it being noticeable by the CU itself, in turn giving invalid results and unknown calculations thereafter.

Probably not so much of a worry for a simpler system without them.

If you really want, you can go without stage buffers in MC if everything is synced enough and you know where everything will be and when, but it can make things trickier to implement and get working.

(10-12-2014, 03:59 AM)TSO Wrote:
(10-12-2014, 03:52 AM)Magazorb Wrote:
(10-12-2014, 03:14 AM)͝ ͟ ͜ Wrote:
(10-12-2014, 02:25 AM)Magazorb Wrote: In short you will waste away your life playing a stupid block game building outdated computers that serve no purpose

fix'd

Well that's not nice :'(

But it is true



RE: I am TSO - TSO - 10-12-2014

magazorb Wrote:Assumed you wouldn't do it then because that's too easy XD; it would give an overhead though that may be annoying.
I really don't care about how long compile time takes. Also, I don't think compile time is considered overhead.

magazorb Wrote:You don't need to be philosophical about this. If you wanted a 3 tick pipeline, for instance, 1 tick would be on the repeater lock as a stage buffer and the other 2 would be logic between stages; for say an 8 tick one, it would be 1 tick on locks and 7 on logic. It's just quicker than using SR latches. If you wanted to make an extreme data loop with simple pipelining and 1 Fwd, you can theoretically achieve an effective data loop of 5 ticks: 1 tick for buffer, 4 for ULG or adder/subtractor, with a CU that looks ahead to see if a Fwd is required and redirects the data forward into the buffer before the stage. A little messy and large, but it still holds in theory; the PE would also require two copies of the final half of the logic for the ALU stage, half acting as the Fwd (whose output is naturally disabled by the CU unless a Fwd is required), the other unobstructed and continuing to the GPRs. Though good luck to anyone trying to implement an 8 bit version of that.

But it's a good idea to have stage buffers, as sometimes interrupts can disrupt the flow of data and corrupt things without it being noticeable by the CU itself, in turn giving invalid results and unknown calculations thereafter.

Probably not so much of a worry for a simpler system without them.

If you really want, you can go without stage buffers in MC if everything is synced enough and you know where everything will be and when, but it can make things trickier to implement and get working.

I argue that we did need to ask the question. In fact, I argue we need to go on because you didn't consider where that could have gone.

We'll start small with your counter claim and ask ourselves where buffers and locking repeaters are needed. (I think you'll be mildly surprised as to where this goes.)

You already answered this: they allow the pipeline to properly forward. Out of order execution and register renaming are ways around a forward.

So, where do pipelines forward? In the ALU? The CPU? What about out of order execution? Register renaming?

Well, a forward occurs if an incoming instruction RAW conflicts with one of the instructions inside the pipeline, or an incoming instruction WAR conflicts with a later instruction (it's possible in a for loop). Out of order execution triggers for the same reason, and register renaming triggers if a RAW hazard presents between the incoming instruction and the first instruction in the pipeline. If we add that the CPU always does the forward, then we can also say that the ALU will never forward. Going back, this means no buffers or repeater locks are needed anywhere in the ALU, because the ALU will never forward. We see that only the CPU forwards in a data conflict, but a forward can be avoided by the other two methods.
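Those trigger rules can be written out mechanically. A sketch using an invented instruction format of (destination, sources) tuples, checking an incoming instruction against the ones already in flight (a real pipeline tracks stage timing too; this only shows the register overlap test):

```python
# Invented instruction format: (dest_register, (source_registers,)).
def has_raw(incoming, pipeline):
    """RAW hazard: the incoming instruction reads a register that an
    instruction still in the pipeline is going to write."""
    _, sources = incoming
    return any(dest in sources for dest, _ in pipeline)

def has_war(incoming, pipeline):
    """WAR hazard: the incoming instruction writes a register that an
    instruction already in flight still needs to read."""
    dest, _ = incoming
    return any(dest in sources for _, sources in pipeline)

pipeline = [("r1", ("r2", "r3"))]               # in flight: r1 = r2 op r3
print(has_raw(("r4", ("r1", "r5")), pipeline))  # True: reads r1 too soon
print(has_war(("r2", ("r6", "r7")), pipeline))  # True: overwrites r2 too soon
print(has_raw(("r4", ("r6", "r7")), pipeline))  # False: no conflict
```

When `has_raw` fires, the CPU forwards (or renames); when neither fires, the instruction can issue with no stall at all.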

With that out of the way, let us then consider a pipelined, out of order execution CPU with register renaming where we are guaranteed there will never be a data conflict in the instructions. We see that the renaming and order changing algorithms never trigger, we see the pipeline never forwards, and we see no pipeline bubbles at any point. What does this mean? All the additions to the CPU are unnecessary; they can be removed. All the buffers, all the locking repeaters, all the hardware for the register renaming and the out of order execution disappear. There is no wasted time in the CPU and nothing inside is idle or unused. There is 100% hardware efficiency in the pipeline, with none of the clock cycle lost to buffers and all of the clock cycle in the logic. The only problem with the consideration is that guarantee. There is no way to guarantee all programs will be written without data conflicts unless we use magic. Obviously that hypothetical was entirely useless.

Wait a minute, though. Was that a caveat I used? Indeed it was. Let us explore that caveat: "without the use of magic." Does that imply this is possible with magic? I would argue it does. With the use of magic, it is possible to guarantee a program could always be written without conflicts. What about secret sauce? I think someone here already equated the two, so that makes this a legal move, as long as we adjust the scope of the claim slightly. We can assume that with secret sauce or magic, it should be equally possible to guarantee a processor will only receive programs with no data conflicts. Note what that scope change was: we moved from guaranteeing a program can be written without conflicts to guaranteeing a program can be received without conflicts. This is because secret sauce has no scope over the programmer, only the computer. In fact, the magic can even be removed from the statement, because it has been stated before that secret sauce is not magic.

Now (home stretch), let's look at what out of order execution and register renaming do. I will use identical terminology for a third time. These operations are intended to rearrange and alter the lines of code in order to remove data hazards at runtime. So if data hazards are avoided by rearranging or altering data, and secret sauce removes data hazards, then by the transitive law of hypothetical models (note this law is still a hypothesis), we could say the secret sauce must rearrange or alter data... but when?

Just to hammer in the fact that I have been reusing terminology this whole time, I'm going to quote it.
TSO Wrote:Let's have an extremely short thought experiment. If you don't rearrange or alter data in runtime, and you don't rearrange or alter data at programming time, where would it happen? What other time could possibly exist in the program development flow chart where the data could be rearranged or altered?

How on earth will we solve this conundrum? Is it even possible? I guess we will never know...
But wait (there's more): we already answered this question in a completely different experiment! We already discovered the compiler and the magical compile time! (and there was much rejoicing)

With this magical compiler, we could entirely remove all data hazards. What does that mean? All we need is a compiler that can rearrange and alter instructions whenever a conflict occurs. And what does that mean? It means we can guarantee a program with no data hazards, which means...*Uses copy paste*

TSO Wrote:All the additions to the CPU are unnecessary; they can be removed. All the buffers, all the locking repeaters, all the hardware for the register renaming and the out of order execution disappear. There is no wasted time in the CPU and nothing inside is idle or unused. There is 100% hardware efficiency in the pipeline, with none of the clock cycle lost to buffers and all of the clock cycle in the logic.

Unlike last time, though, we don't need magic to make that guarantee. At the same time, two different points in the discussion have suddenly fused, because they were always the same. One of the secret sauces has been universal to the design from the beginning, always hiding in plain sight for people to infer. You just never took the time to see it.
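The compiler pass being argued for here, rearranging instructions whenever a conflict occurs so the CPU never sees a hazard, can be sketched as a toy scheduler over invented (dest, sources) instruction tuples. This skips several legality checks a real scheduler needs; it only shows an adjacent RAW conflict being removed before the program ever reaches the processor:

```python
def schedule(program):
    """Naive compile-time reordering: if instruction i+1 reads what
    instruction i writes, look for a later independent instruction
    to slot in between so the pipeline never sees the conflict."""
    prog = list(program)
    i = 0
    while i < len(prog) - 1:
        dest, _ = prog[i]
        _, next_sources = prog[i + 1]
        if dest in next_sources:  # adjacent RAW conflict
            for j in range(i + 2, len(prog)):
                d2, s2 = prog[j]
                # candidate must not depend on prog[i] or clobber
                # anything prog[i+1] is about to read
                if dest not in s2 and d2 not in next_sources and d2 != dest:
                    prog.insert(i + 1, prog.pop(j))
                    break
        i += 1
    return prog

program = [("r1", ("r2", "r3")),   # r1 = r2 op r3
           ("r4", ("r1", "r5")),   # RAW on r1 with the line above
           ("r6", ("r7", "r8"))]   # independent
print(schedule(program))
# [('r1', ('r2', 'r3')), ('r6', ('r7', 'r8')), ('r4', ('r1', 'r5'))]
```

After the pass, the two conflicting instructions are no longer adjacent, so a short pipeline never needs to forward, which is exactly the guarantee being claimed.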

And finally, with that in mind, an answer arises to the whole of the argument.
TSO Wrote:Is it the way to go? Do you honestly need that repeater to even lock? If we multiply by accumulation, and place 16 3 tick adders with output connected to input incorrectly so the shift happens on its own, do you really need to hold the other operand's data in a locking repeater, or do we just need it to pause for three ticks at the front of each add? If they are 1 tick adders sixteen blocks long, does the data for the second operand need any pause at all, or does the repeater every fifteen blocks suffice?

Indeed, all that was needed was a repeater with the same delay as the adder, not a buffer, and nothing was ever needed between the adders. You only need to ensure that signals which originated at the same time propagate together. You do not need to lock them in place.

TSO Wrote:If they are 1 tick adders sixteen blocks long, are they really 1 tick... or are they zero tick because redstone needs a repeater every fifteen blocks?
If you direct this multiplier back toward the registers, did you remove the output bussing, or is the bus performing the operations?
Could that be applied to other elements of the processor?
Again, yes: the adder is effectively zero tick because it is no slower than a bus of equivalent length, meaning the adder can be used as the return bus for the operation.

Now, with much preamble (and a shoehorn), we get where I wanted you to go. If the bus operates on the numbers, and opcodes are really just pointers to hardware, could we decode en route? What about the decoders for the register addresses? Could the entire CPU be zero ticks relative to the bus of equal length? I think you know the answer already. If you had read the PM I forwarded to you, you would have already seen all of this post in a much shorter form.

You should all know the answer to that final question, though, because I always give you a, "Yes," for my thought experiments.


RE: I am TSO - Xray_Doc - 10-12-2014

Tl;dr


RE: I am TSO - Magazorb - 10-12-2014

This hot air is too steamy; you're recycling too much. TSO, you have some good ideas, but you're overlooking key details. The Fwd is a good thing to have and utilise: it allows you to forward data from the ALU output back into the ALU input rather than having to write it back before use. This increases serial performance, so you wouldn't want to be rid of it.

So say we have the example I gave, which you seem to misunderstand the setup of: the ALU stage has no buffers in it, only around it. I'll be more realistic this time and use figures based on what we can currently implement with no issues whatsoever: a 6-tick ALU with a 1-tick, CU-assisted forward and a 1-tick repeater-lock stage buffer.
Now I'm going to be unrealistic and give your suggested system as much of an advantage as possible: we shall have a write back of 1 tick, and a read from that new write back of 1 tick. You'll notice that equates to the same as the Fwd, but the difference is that the Fwd only has to go back to the buffer before it, whereas to achieve this speed with the registers you would have to have them all repeater locked with comparative read disabled. In turn this means the memory density is low, and it gives you only approx. 30 dust of signal total (15 before the lock and 15 after) to get all your registers back to the ALU input, and that's ignoring the buffer, which you suggest you don't want.
And what you'll find is that yours is no quicker in serial compute, and that's in your most favourable configuration.
Allow me to explain further: those register setups haven't been done before except maybe in a 4G/SPRM; we always tend to use at least 8 G/SPRs in our PEs. So you can easily see that without the Fwd you can very easily lose serial performance. Unfortunately your oversight is that you've forgotten about serial computations: you thought you could fill the idle slots with out-of-order execution, but there are times where this isn't possible.

The resulting difference is that when a serial computation comes along, which is inevitable at some point, you will lose computations to idle time; this is still a big issue in real-world mass computation.
I guess you could, via pulse timing, achieve a maximum throughput of 3 ticks per instruction, but that's unlikely due to MC being a derp; 5 ticks is likely though. There's nothing stopping our systems from doing that anyhow, providing the data isn't serial, and if it is, then based on current creations all that would do is slow us down to 8 ticks. But with timing comes the issue of how easily an interrupt can cause corruption.

Now to go back to the device you have, the one that gives instructions to other "CPUs" as you called it: how do you intend to attach multiple of these PEs to one of them and fully saturate them when you have so few commands that would actually take up multiple PE cycles, if you will? Besides, it's not like we don't have those functions; it's just that we don't bother to reorder instructions at compile time because of the amount of overhead that would land on whatever has to compile it before you could finally execute. (I'd say it's overhead because it's something that has to be processed before it can have a use, so it does add to making something longer in some ways.)

Heck, I've stated I'm currently planning on making a SIMD array, and addressing all the PEs is a similar issue, because I have to fetch data as quickly as I can while it's computing, and SIMD computation speed vastly outstrips memory speed. That brings the same issue you will have with addressing multiple PEs: you will only be able to use multiples when you have multiples executing instructions that take multiple cycles.
The only difference between my SIMD array and your multi-MIMD PE would be that mine is data-limited and yours is instruction-limited.
When these kinds of things happen you'll need a PE that can stockpile multiple instructions/data coherently.

I've seen other guys do things similar to your suggestions; they gave up, not realising how much they'd gotten themselves into. You seem to know a lot about what you want to do, but you've yet to figure out the shortcomings.


RE: I am TSO - jxu - 10-13-2014

TSO don't listen to them, minecraft is not IRL, even if it's stupid, give it a shot (minecraft always has some interesting quirks)

And I know it's true


RE: I am TSO - TSO - 10-13-2014

By no means has this post been edited, considering I just got done writing an English essay.
(10-12-2014, 03:55 PM)Magazorb Wrote: This hot air is too steamy; you're recycling too much. TSO, you have some good ideas, but you're overlooking key details. The Fwd is a good thing to have and utilise: it allows you to forward data from the ALU output back into the ALU input rather than having to write it back before use. This increases serial performance, so you wouldn't want to be rid of it.

To clarify, I would consider that a write back, while a forward would be something like halting a multiplier in order to inject a partial sum into the middle. That would basically never happen, because you would have needed to calculate the partial sum beforehand... using the multiplier (or using the register AND, the bit shift, the adder, and the forward injection process, which is not faster by any stretch of the imagination).

The question that must be considered, though, is twofold. Is this write back detrimental to the parallel performance of the unit, and can one justify adding one tick to every operation just for a faster write back?

The first question can go either way depending on whether or not you implement the write back so that it doesn't interfere with the input busses. In fact, you can even write back between cycles if the clock is slow enough for the system to have open time between operation repetitions. (Ideally you wouldn't have this, but the instruction pointer decode time dictates that.) In my personal layout, the instruction pointer is designed for parallel pipelined decoding (figured this out last night, it's really cool: easily four ticks, maybe three, but then you're starting to push the limits of the memory) in order to keep up with the clock, so this write back is not possible in my system that way. This leaves us with a buffer at the front of an ALU operation, but in my IS there is no space for a write back command, or for a command to pause the register write that would occur regardless of write back, without some kind of output switching. As you stated, this switching would be one tick at the output, but it would actually lose one tick plus the clock by adding a repeater at the input, because the next instruction has just been halted. Otherwise, our loss at each end is one tick at the front and one tick at the back. But then comes the write back delay: it is the length (in chunks) of the operational unit. So we have a command that goes through, doesn't write to registers, may or may not hold up the command that occurs a few lines later, does not calculate its conditions, and adds two ticks to all operations. Good luck managing that.

This moves into the "is it worth it" part. If the operations are already optimized to bus themselves back to the registers within their own operation delay time, then the distance from their output back to their input is nearly equal to the length of the existing bussing from the registers to the input anyway. The loss between register writing and internal write back is now only the register write time plus the delay to the next operation. If this delay has been optimized with something like my four-tick instruction pointer, it should be no more than three ticks, and the delay is a constant intrinsic to your personal design preference. Our loss is now about six ticks per write back versus two ticks on every single instruction. Now, I'm not stupid enough to claim that the write back does not have its advantage; I'm simply saying that there is a minimum frequency below which it is not advantageous. We'll see what that number is for six operations and ask whether a two-tick loss per operation is really justified. Let's see. We start with two write backs per six operations. By your model we have six operations, so that's a twelve-tick loss for our six operations. The write back loses two ticks plus the two clock cycles for the two held operations, as well as the processing time to repeat the operations. So your loss was 14 ticks, two clock cycles, and two operation delays. My way takes an extra two instructions to command recalculation, plus two operation delays for the calculations, somewhere between two and eight ticks in the clock/write synchronization (a value you control in the design), plus two ticks lost in the two register writes and two ticks lost in the two register reads. So we lost six to twelve ticks, two clock cycles, and two operation delays.

Most of this cancels, leaving the six to twelve ticks against the fourteen ticks: a difference of two to eight ticks. Notice that this is exactly the loss in the constant, user-defined clock delay of my system. So the smaller the clock delay loss is on my system, the more efficient write back avoidance becomes. For the long delay, mine is two ticks faster, so a write back content of 33% makes the two about equal. For the short delay, the difference is eight ticks in my favor. Now, looking closely, for each write back we add into the six, the loss is two ticks per write back on mine (at the short clock delay) and one tick per write back on yours. So we become equally efficient if the code contains 66% write backs. The question is now: with pipelined hardware multiply and divide, how often do we encounter that many write backs? This is a question for the designer about the intended operations, and I don't think more than 2/3 of the code is intended to write back for most of what I can think of. There are still huge issues with a fully serialized group of instructions, but this is avoided as best as possible by having the faster operations run at their own speed instead of being limited by the speed of the slower operations. Of course, as you said, this is still highly limited if a long operation is called.
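For readers keeping score, the tick accounting in the last two paragraphs can be sketched in a few lines. This is only a model of the figures as stated in the thread (2 ticks of buffering per operation, 1 tick per forward, 1-tick register writes and reads, and a designer-chosen 2 to 8 tick clock sync); the function names are mine:

```python
def forwarding_loss(ops, forwards):
    # The forwarding scheme as I read it: 2 ticks of stage buffering on
    # every operation, plus 1 tick per forwarded result.
    return 2 * ops + forwards

def writeback_loss(writebacks, clock_sync_ticks):
    # The write-back scheme: per write back, 1 tick to write and 1 tick to
    # re-read the register, plus 2 to 8 ticks of clock/write sync in total.
    return 2 * writebacks + clock_sync_ticks

# Six operations, two of which depend on the previous result:
print(forwarding_loss(6, 2))   # 14 ticks
print(writeback_loss(2, 8))    # 12 ticks (long sync):  2 ticks apart
print(writeback_loss(2, 2))    # 6 ticks  (short sync): 8 ticks apart
```

The break-even points quoted in the thread (33% and 66% write-back content) fall out of sweeping the write-back fraction through this same model.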

Now, finally, for what you mentioned about me being limited to memory speed. I will admit that my method has a loss in code density: for every write back, I need one extra line of code. Whether or not yours needs the extra line is up to you and your microcoding decisions. But if I can decode a 7-bit instruction pointer in four ticks, we could venture to say that I could decode a 16-bit address in 8 ticks (and I can). Now, if I load more than one memory location at a time (say... oh... four lines), we remove two ticks from the decode, because four memory locations can be addressed by the same number, and we get four and a half lines per every two instructions completed. So we are doing okay there.
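One piece of that paging argument is pure arithmetic: fetching a power-of-two block of consecutive lines per access removes that many bits from the address that has to be decoded. A sketch (helper name mine; I make no claim here about how address bits map to decode ticks, since that depends on the decoder layout):

```python
import math

def address_bits(memory_lines, lines_per_page):
    # Fetching lines_per_page (a power of two) consecutive lines per access
    # means each page shares one address, dropping log2(lines_per_page) bits.
    return int(math.log2(memory_lines)) - int(math.log2(lines_per_page))

print(address_bits(2**16, 1))  # 16 bits decoded per single-line fetch
print(address_bits(2**16, 4))  # 14 bits: a shorter, faster decode
```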

The issue with my layout isn't instruction paging per se; it's probably the conditional branching. I have not yet given this system a way to prevent instructions that are not in the branch but are already in the pipeline from being executed. I think I know how I can do this, but it seems like I am putting a massive load on my instruction pointer and register write hardware: everything depends on their ability to block register writing by the incorrect operations while allowing register writing by the correct operations. This sounds straightforward at first: just block all the writes after the jump for the length of the pipeline. But then we remember that not everything takes the same amount of time to compute. A divide issued before the jump could arrive at the registers after a shift on the wrong side of the jump. Now we have to block the shift, allow the divide, ensure the divide doesn't have a data conflict with the jump, and ensure the divide's conditions don't nullify the jump.
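The branch hazard described above can be shown with a toy completion-time model. All names and numbers below are illustrative, not from the actual design: the point is that squashing must key off issue order, while the naive "block every write arriving within a pipeline length of the jump" rule also kills the slow divide that was issued before the jump:

```python
# (name, issue_tick, latency_ticks); the jump issues at tick 3 and takes 3.
ops = [("div", 0, 12), ("jmp", 3, 3), ("shl", 6, 3)]
JUMP_ISSUE, JUMP_LATENCY, PIPE_LEN = 3, 3, 9

# Correct rule: squash only operations *issued* after the taken jump.
kept = [name for name, issue, _ in ops if issue <= JUMP_ISSUE]
print(kept)  # ['div', 'jmp'] -- the early divide survives, the shift dies

# Naive rule: block every register write *arriving* within a pipeline
# length after the jump resolves. The slow divide gets caught as well.
jump_done = JUMP_ISSUE + JUMP_LATENCY
blocked = [name for name, issue, lat in ops
           if jump_done < issue + lat <= jump_done + PIPE_LEN]
print(blocked)  # ['div', 'shl'] -- the divide is wrongly blocked too
```

This is exactly the divide-arriving-after-the-shift case in the paragraph above: arrival order and issue order disagree when latencies differ.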

Interrupts wouldn't cause data errors in and of themselves, but an interrupt could use a register that is already in use, so there need to be special interrupt registers, and a separate interrupt bus.

As for the compiler, if you run the program more than a certain number of times (like four or five), the losses in compile time start to pay for themselves. This computer is not designed to run a program only once.

I think the difference between myself and these "other guys" is that I can acknowledge that my system does have downsides, but I also aim at making a balance between the downsides and the efficiency of removing as much as possible. I think that I do see what all the problems will be, but never declared the caveats when I described parts of this system because I thought you would all see and understand them.


RE: I am TSO - Xray_Doc - 10-13-2014

(10-12-2014, 02:28 PM)Xray_Doc Wrote: Tl;dr



RE: I am TSO - Magazorb - 10-14-2014

I can only speak for myself when I say that I see and understand the drawbacks your idea would suggest; when I ask about them, I'm really asking how you intend to minimize those effects.

Fwd in our terminology seems different from yours, as does write back, so between what I had tried to point out and what you have pointed out, we got lost in translation.

By write back I mean writing back to the GPRs, and by Fwd I mean writing back to the ALU input from the ALU output. These aren't particularly advanced meanings, if anything overly simplistic, and you may find them a little awkward at first, but truly no two institutes have the exact same technical terminology.

As for the "compiler" (I'm a little unsure whether the device you describe technically counts as a full compiler, though I'm aware it's not far off): could it do this in a single pass, given an unlimited number of tags, if you will?
My only suggestion is to just have it continuously cycle over the code until it comes up with no more logical improvements, in other words until it stops optimizing.

I'll hand it to you that you handle the bulk stuff quite well, as in allowing for maximum throughput; however, I must comment that under the setups you've described there do seem to be a few cases where it will struggle.

Also, I don't think you quite followed the instruction stuff fully. I was assuming that you could pulse it as you would other stuff and get a throughput of 3 ticks/instruction, and I was then expanding on that by asking how you could make effective use of a 2nd PE given how much time you have between fetching instructions for the first PE, as well as how you'd deal with keeping the data streams between PE1 and PE2 coherent.

Or correct me if I'm wrong, but you do wish to only pulse data through the device and not use buffers, controlling the logic via timing, so that regardless of the stage size of something you could always maintain an instruction every 3 ticks.


RE: I am TSO - TSO - 10-14-2014

What you call write back is how my computer always runs. The only thing it can do is write back, unless an instruction to external memory is given.

I call write back what you call forwarding. Mine simply does not have the ability to do what you call forwarding; there is no way to manage that inside the clock and instruction set that I want. Therefore I am stuck with a complete register write between each repetition, but because the busses from the registers back to the processing element for any operation are (usually) exactly the same physical length as the operation itself, it isn't losing as much time in bussing delay.

For the most part what we have is identical; the only difference is that mine isn't losing time with the two ticks on the front and back of each ALU element. So all I had to do was figure out how much time is lost in register writing and reading in order to see at what point your forward becomes more efficient than my lack thereof. For the really big operations, this takes care of itself so well that the amount of code requiring a forward became unreasonably large before yours was any more efficient. For the smaller and faster operations there is more of a loss, and the amount of code required to make forwarding more efficient was smaller. I finally showed that the phase arrangement of the register write relative to the clock (and therefore the next read) is the main determinant of the range where mine is more efficient than yours.

The compiler is a program. There happens to be a component designed to make things easier for the compiler because it is able to check conflicts by using hardware, but the compiler is a program that will arrange things based on how conflicts and bubbles present themselves, which is all predictable. (I don't actually know how to write a compiler, but we'll get there when we get there.)

Well, sort of... instructions can be decoded in three ticks, but the throughput of any particular instruction varies. So there can be a three-tick clock with an instruction issued every three ticks, but the operations don't write back in the same amount of time. They are all multiples of three ticks, but the operations definitely return out of order.
This machine reaches its maximum potential running multiple programs that don't share too many of the same operations and don't conflict in their data.
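That out-of-order return is easy to visualise with a toy schedule: issue one instruction every three ticks, give each op a latency that is a multiple of three, and sort by completion time (the ops and latencies below are made up for illustration):

```python
# (name, latency_ticks); one instruction issues every 3 ticks, and every
# latency is a multiple of 3.
program = [("mul", 12), ("add", 3), ("shl", 3)]

completions = sorted((3 * i + lat, name)
                     for i, (name, lat) in enumerate(program))
print([name for _, name in completions])  # ['add', 'shl', 'mul']
```

Issue order is mul, add, shl; return order is add, shl, mul, which is exactly the "definitely return out of order" behaviour described above.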

You missed a tiny bit of how multicore would be implemented, and I missed a bit about some of the caveats. Each core would have its own cache of program data, which reads once every clock. The paging system is part of the external hardware. The number of busses to an individual core's cache would equal the number of instructions paged in at a time. So for one core, if it takes eight ticks to get an address in main memory and four ticks to read an address out of the program memory cache, we need to address four lines of code out of main memory at a time and have four data busses. For two cores, the main memory can still write four lines of code at a time and switch between the two cores every other write. The cores would not be able to share busses. If sixteen lines of code are paged at a time, then parallel pipelined decoding of an address in the main memory will take four ticks; if paged 32 at a time, three ticks. This is the main reason I need to figure out some sort of balance in the memory, because 32 busses is far too many to work with, but it is the fastest possible amount to page and allows for more processors. After a point, more than one read unit is needed, but each read unit could easily address out to at least 32 processors if it writes 32 lines of code at a time to each one. Do note that my computer is the size of its ALU and program memory, so cramming 32 of these computers together is no problem if you can figure out a bussing solution. At that point they probably would share busses, but they would not all share the same one, and you could not consecutively write to two that share a bus. My design is also really tall.
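The two-core figure follows from a simple throughput balance: main memory supplies a page of lines every memory-latency period, and each core consumes one line per cache read. A sketch under those steady-state assumptions (helper name mine; bus conflicts ignored):

```python
def cores_fed(lines_paged, mem_latency_ticks, cache_read_ticks):
    # Supply: lines_paged lines delivered every mem_latency_ticks.
    # Demand per core: one line consumed every cache_read_ticks.
    return (lines_paged / mem_latency_ticks) * cache_read_ticks

# Four lines paged per 8-tick main-memory access, 4-tick cache reads:
print(cores_fed(4, 8, 4))  # 2.0 -- the two-core case described above
```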

This final assumption is absolutely correct. No stopping, no clocking: once data leaves a unit, it's the next unit's problem to deal with. The write location and conditions also ship out with the data so that everything is organized and stays together. Everything decodes when it needs to. The clock is only at the register and program memory read; the write is not clocked. Imagine it as the most organized set of race conditions ever created: absolutely everything depends on the data being somewhere exactly when it is supposed to be, with tolerance inside of a tick. The only reason I can get away with this is that Minecraft flip flops stabilize instantly and a 1-tick tolerance is extremely easy to synchronize.