I am TSO

TSO · 10-11-2014, 06:02 AM

(10-10-2014, 04:22 PM)Magazorb Wrote: Metric prefixes:
8Mb = 1(1x10^6) or 1,000,000B, 8KB = 8(8x10^3) or 64,000b, 4Kb = 1(4x10^3) or 4,000b
Binary Prefixes:
8Mib = 1(1024^2) or 1,048,576B, 8KiB = (8x8)(1024^1) or 65536b, 4Kib = 4(1024^1) or 4096b
the 1024 came from 2^10 which is the bases of which binary prefixes are powering from, so 8GB = 8((2^10)^3) = 8589934592Bytes or 6871947674bits.
The Binary prefixes was losely based around the French metrics or later came SI metrics

1.) I know my prefixes quite well, possibly better than some of you guys, so I don't need help with that.
2.) 8Mb is not equal to 1(1*10^6); it is 8*10^6 b (or 1*10^6 B), and you have similar math errors throughout this section. Also, they are by no means "loosely" related: the relationship of the digits between base ten and base two is log(n) = bl(n)/bl(10), where log is the base-ten logarithm and bl is the base-two logarithm. Every (useful) metric prefix is a power of 1000. Log(1000) = bl(1000)/bl(10); now log(1000) = 3 and bl(1000) ≈ 9.97. Let's round that number to 10. Now we are looking at 2^10, which is 1024. So the relationship between the two is just 10^(log(1024)*log([prefix])/3) B = [prefix]iB, where [prefix] is the metric multiplier (1000 for kilo, 10^6 for mega, and so on). End of story.
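The relationship above can be checked numerically. This is only a minimal sketch: the function name binary_multiplier and the sample prefixes are made up for illustration, and the results are rounded because the logarithms are floating point.

```python
import math

def binary_multiplier(metric_multiplier: float) -> float:
    """Map a metric multiplier (1000, 10^6, ...) to its binary counterpart
    using 10^(log(1024) * log(prefix) / 3), as stated in the post."""
    return 10 ** (math.log10(1024) * math.log10(metric_multiplier) / 3)

# kilo -> kibi, mega -> mebi, giga -> gibi
for name, metric in (("kilo", 1e3), ("mega", 1e6), ("giga", 1e9)):
    print(name, round(binary_multiplier(metric)))

# Bits versus bytes: 8 Mb is 8*10^6 bits, which is 10^6 bytes.
bits_in_8_megabit = 8 * 10**6
print(bits_in_8_megabit // 8)
```

Rounding is the design choice to note here: the formula is exact on paper, but evaluating it through log10 and a power leaves a tiny floating point residue.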

magazorb Wrote:

(10-09-2014, 05:16 AM)TSO Wrote: Sorry, I meant mostly how to generate the conditions (flags, I guess) in the same cycle as the operation, then I started wondering how many of these conditions I need... then I asked my father for help. He programmed back in the 90's when computers had no memory and actual program speed mattered because computers were still so slow (back when programmers actually had to be good at programming.)

What I learned was this: all your possible conditions are computed at the same time and given a true/false value. These values then go to a single register the same size as all the possible conditions (between eight and sixteen), you then use a bit mask to get the conditions you want out of this register (you can actually call more than one flag at the same time, then operate a condition on those two). Your goal is to never jump, EVER, especially more than one or two lines. Long goto jumps take longer, are harder to understand in debugging, and are a sign that the programmer does not understand the program being written. If, Then, Else statements also slow down the system, but are far better than a goto because the compiler will place the destination code for the if statement near the line the condition is on. The compiler will preload the program data line by line regardless of condition destinations, if an if, then, else occurs, a path is chosen, and all code for the other path is deleted, followed by loading all the subsequent program data for the line that was chosen. The nearer these lines are to the jump condition, the less gets loaded and unloaded. The amount of queued data that gets deleted depends upon how large the jump is, with a goto being the worst. Nearly all data is dumped and reloaded and all stages of the CPU are completely cleared in a goto because they inherently do not jump near the line their condition is on. And if you pipelined and it jumps conditionally, you have to somehow rid yourself of all the crap in the pipeline that is no longer correct because it was part of the other conditional side.

Luckily, my design inherently helps with this, but it will still suck to have to think about it.

Computers back in the 90's did have registers if they were GPRMs. This wasn't used too often, though, because the processors at the time were slower than the memory, so having registers didn't improve performance much and added to the instruction size; it was deemed unnecessary at the time because of the extra memory requirements for the IS.

Two other popular PE types were stack based and accumulator based. These two were both popular for nearly the same reason: the positives and negatives of each are nearly the same, and their pros and cons were the opposite of those of GPRM PEs. There were also other types of PEs that other systems used, but the three most popular were the stack, accumulator and GPR based PEs.

"all your possible conditions are computed at the same time and given a true/false value."
The idea and direction you're going with this is good; IRL some do this. However, in MC it isn't really a thing, with most things being small enough that they reach the new address and return before the clock, thus giving no performance gain, or being predictively prefetched for pipelined CPUs, which also gives no gain.
This is a good idea, though, and in bigger MC things it may be useful. The idea has gone around a couple of times but has yet to be implemented, so 5pts. to you if you do it first with at least some gain.

"(back when programmers actually had to be good at programming.)" Actual programming is harder now than it's ever been for pros. The only programmers that aren't good tend to be game devs (sometimes they know some code but they aren't really programmers, they use software instead) and those who can't be asked to learn how to write optimal code; those doing it professionally are made to max out their hardware's abilities, and this becomes very hard.
To do this you have to learn various languages (normally this would be done with a combination of Fortran, C++ and ASM, with GPGPU being utilised heavily).
Generally this style of programming is never learned but is in high demand; it also tends not to run too well on configurations other than its intended system config(s).

I'm also unclear as to what you're defining as a compiler, probably correct but sounds almost like you want to use it in active programs instead of compiled before use.

He said there were like 10 or 16 registers (remember he only really worked with x86 assembly), and you only worked with about four. All 16 were special purpose, but if you weren't using the function they related to in that instruction, they doubled as general purpose, there were really only five you worked with: A,B,C,D, FLAGS. Add in that x86 is RISCey, and you see that you only need four of these registers: the one involved with your operation, any two main registers, and FLAGS, so you never use D either. The FLAGS are all predicted simultaneously and parallel to execution, and the jump conditions were all opcodes that simply masked the FLAGS register and jumped to the specified line. X86 processors never performed more than one task in the same line of code, for the most part. There was some special stuff that played with things, and there were numbers you could insert into the opcode location that represented opcodes that didn't exist (or something like that) that made the CPU "fuck up just right" and perform operations that used many lines of code in one line. I have no idea how that works, but it abuses the fact that x86 is microcode heavy but doesn't know what to do if it receives a number that isn't a real opcode. Results range from exactly what you wanted to happen to computer suicide.
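The "mask the FLAGS register and jump" idea can be sketched in a few lines. This is a simplified model, not how a 386 is built: the bit positions used for CF, ZF and SF follow the x86 FLAGS layout, but the 8-bit width and the function names compare, jump_if_equal and jump_if_below_or_equal are assumptions made for illustration.

```python
CF = 1 << 0  # carry flag (a borrow, for a subtraction)
ZF = 1 << 6  # zero flag
SF = 1 << 7  # sign flag

def compare(a: int, b: int, width: int = 8) -> int:
    """Compute a - b and return every condition at once as one FLAGS word."""
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    result = (a - b) & mask
    flags = 0
    if a < b:
        flags |= CF                  # unsigned borrow
    if result == 0:
        flags |= ZF                  # operands were equal
    if result >> (width - 1):
        flags |= SF                  # top bit of the result is set
    return flags

def jump_if_equal(flags: int) -> bool:
    """A conditional jump only masks the register: taken when ZF is set."""
    return bool(flags & ZF)

def jump_if_below_or_equal(flags: int) -> bool:
    """Two flags tested with a single mask: taken when CF or ZF is set."""
    return bool(flags & (CF | ZF))

flags = compare(3, 5)
print(jump_if_equal(flags), jump_if_below_or_equal(flags))
```

The point of the layout is that the compare step never needs to know which condition the program will ask about; each jump opcode just picks its own mask.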

His only comment when I asked about stacks was, "You never use stacks." He used pointers all the time, though (in fact, there were a few string manipulation programs that only used pointers and pointers pointing to pointers and other stuff). He also never made comments on his code, so nobody knew how to debug it, (not that the debugging team was any good at their job to begin with), so he just debugged it himself.

He says the abilities of modern computers make even lazy coding fast enough for the intended application: goto is returning, as are stacks. When he did things, it would be a few lines in C that then resulted in a subroutine branch to a short assembly program, then some more lines of C, et cetera. Some of his programs were more assembly than C, some weren't; it depended on what needed to be done. The programs were intended for the x86 family, which is apparently an industry standard, or something.

Fortran was rarely used because the compile overhead was too great.

A compiler is a program that converts the written program into machine code. The more distant the programming language is from machine code, the more time is spent compiling, and the longer the resulting program. Some languages, such as assembly, have zero line expansion (assembly is simply a mnemonic for machine code), while some languages (cough java cough) are so distant from the machine code that only an absolute moron would try to write in them unless you need the code to be insensitive to the machine it's running on. Of note: Python code compiles at runtime; there is no separate compile step. Now, that being said, some good compilers do actually alter and rearrange the lines of code at compile time to help optimize things like conditional jumps.
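To make the Python remark concrete, here is a minimal sketch, assuming CPython: the built-in compile() turns source text into a bytecode object while the program is already running. The source string and the "<demo>" filename are made up for illustration.

```python
import types

source = "x = 1 + 2"                       # hypothetical program text
code = compile(source, "<demo>", "exec")   # compiled while the program runs
namespace = {}
exec(code, namespace)                      # execute the freshly compiled bytecode

print(isinstance(code, types.CodeType))    # True
print(namespace["x"])                      # 3
```

Note that the result is bytecode for the Python virtual machine, not machine code for the CPU, which is part of why the same script runs unchanged on different machines.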

Also, I meant the computer queues up program data, not the compiler (damn auto correct).

magazorb Wrote:

(10-10-2014, 01:48 AM)TSO Wrote: So you would be saving a clock cycle per instruction.

I spoke with him, and yes you do exactly what I described when programming assembly for the 386, with one slight exception. The instruction set does not carry the conditional with it, there is a branching operation and the CPU uses some kind of hardware look ahead in order to set the flags in one clock cycle so that the next cycle will pipeline correctly.

Also, when optimizing for speed on an ALU where not every operation takes the same amount of time but multiple simultaneous operations are possible, is it better to put the fast operations close to the CPU and have them be a hell of a lot faster than the slow ones, or put the fast farthest away and have it all kinda balance out? For example, my current instruction set, which I have discussed with LD would allow for a bit shift to occur three ticks after being instructed, and repeatable every six ticks, with the ability to run all the other operations at such speeds as well (the CPU can have a three tick clock). The Boolean operators are four ticks out, but also repeatable every six ticks. At the other end, the MOD function is 106 ticks out, so that's like 34 near operations for every far operation.

It doesn't really matter how much you speed up things that are going slow: if you have the result before your next clock comes, it's just waiting. So if you can, put other stuff there and let that be further out. Again, this is a unique thing with your designs; most don't have this issue, so it might be a little tricky figuring out a method of doing this automatically for the programmer, or for the programmer to know where to do this.

1.) There was a slight error, DVQ/MOD and MUL/SQR are six ticks out, but will probably take about 96 and 12 ticks to complete, respectively. On the other hand, it just now occurred to me how I could pipeline these functions and have them point back toward the CPU so that the actual operator will bus itself (If I make the operator four chunks long (easily done) and place the output toward the CPU, then it ends up right back at the cache when it's all done because the bus to the input from the CPU is going to be four chunks long), cutting out the four bussing ticks needed to return the value and allowing more than one divide or multiply to be requested in sequence, though it still would take 96 or 16 ticks to compute each operation.

2.) The computer won't alter the code for the programmer to account for the timing difference, and the programmer also doesn't write the program with the timing difference in mind. Place the blatantly obvious secret sauce recipe here. (If you can't figure out how I'm going to be creating this particular workaround, we have a problem, because it's already been discussed in this thread.)
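To put numbers on the difference between how many ticks "out" an operation is and how often it is "repeatable", here is a minimal sketch. It assumes an operation is described by just two figures, its latency and its repeat interval, and that requests are issued back to back; finish_tick is a made-up helper name, not part of anyone's design.

```python
def finish_tick(n_ops: int, latency: int, repeat_interval: int) -> int:
    """Tick at which the n-th back-to-back operation delivers its result."""
    if n_ops < 1:
        raise ValueError("need at least one operation")
    return latency + (n_ops - 1) * repeat_interval

# Bit shift figures from the quoted question: 3 ticks out, repeatable every 6.
print(finish_tick(1, 3, 6))   # 3
print(finish_tick(4, 3, 6))   # 21
```

For long runs the repeat interval dominates, so pipelining a slow unit so that it accepts requests in sequence can matter more than shaving ticks off its latency.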

magazorb Wrote:

(10-10-2014, 05:45 AM)TSO Wrote: That completely defeats the purpose of a forum. I advise you get around to reading it, we've been discussing making the impossible possible, and (at least according to GG) have actually made a small breakthrough in redstone computing (it sounds like I knocked a tick or two off of the fastest theoretical clock rate)

Hi

we've/you've not really been talking about things that were seemingly impossible, albeit you do have some great ideas in general. Also, sorry to say it to you, but you won't be claiming performance king just yet. Maybe after a few revisions and optimisations, once you settle in and learn stuff, your ideas combined with those of some of our members might do it, though (it's not really expected that new members even know half this much about CS really, so you're doing good)

If it's not much to ask for, once you have an IS drawn up may I see it please?

Due to a particular breakthrough I had (it was legitimately like a moment of enlightenment, LD sort of knows about it), I actually have almost all the instruction set drawn up now, as well as the entire bussing arrangement. I also don't think I'm going to ever be performance king, but I will certainly be the king of dual purpose logic.

I have a programmer father that hates wasting time in a program, my friend's father is a hardware developer at HP, one of my father's friends is a hardware developer at IBM, and one of my father's other friends has a degree in library science. Between my inquisitive nature and their experience, there is literally no way I wouldn't know this much about computer engineering. (Although, it is funny to get the two hardware developers in the same room, because they suddenly get real cautious about how they answer my questions. Let's just say neither of them are the low guy on the totem pole at their company.)

(10-10-2014, 08:26 PM)LordDecapo Wrote:
(10-07-2014, 02:42 AM)TSO Wrote: Or you can rename registers as you go... but that's for noobs.
cough cough, i can do that, and it only takes 3 ticks.

Noob.

LordDecapo Wrote:

(10-09-2014, 10:18 PM)greatgamer34 Wrote: Most people(including me) in minecraft have a Compare instruction then followed by a Flag register read.

The Flag register read can be BEQ(branch if equal to), BGT(branch if greater than), BLT(branch if less than), BNEQ(branch if not equal to)....etc, along these lines of any amount of possible conditions.

This then reads from the proper flags register and performs a jump to a specified address(usually found in the IMM or a pointer).

Ok so my branching works a bit different, cause of my crazy ass cRISC or CISC or whatever u want to consider it Architecture.
I have 2 modes in my IS: Memory/System is Mode0 and ALU functions are Mode1.
Branching is a Mode0, and it is also a multi line.
I have a specially made 3 deep Queue with multiread, so when a branch is detected, it reads Locations (Inst)0, (Inst)1, and (Inst)2 from the Queue, and routes their data to specific parts.

Inst0 is the Main Inst, it tells what conditions to look for, whether it is a Call or a Return, whether it is conditional or not, and whether it's direct, RelativePos. or RelativeNeg.

Inst1 is the Destination address (so i can have 65535 or whatever lines of code on just a PROM, more if i access external memory, which is easy to do) that is only loaded into the PC if the condition is true.

Inst2 is the function that defines where the flag can arise from, and this inst MUST be a Mode1, so u can add, sub, or, xor, compare, etc. to generate any flag you want.

All of that gets decoded and sorted out in about 7 ticks, then on the next cycle the branch is determined by whether the conditions are met. It has static prediction of False, so u only get hit with a 1 cycle penalty after a True flag comes through, leaving the penalty of branching not that devastating.

I will be making a forum post this weekend with pics and such of my MC CPU, since u can't join the server, and will explain the IS in detail in it for those who are interested.

...It sounds really complicated... :P

...or maybe not...
Is it basically the same jump instruction system as mine, but without the parallel computing part?
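The cost of the static "predict False" scheme in the quoted post can be estimated with a rough sketch. It assumes a correctly predicted branch costs one cycle and a True flag adds the one cycle penalty LD mentions; the function name and the sample fractions are hypothetical, not measurements from any build.

```python
def average_branch_cycles(taken_fraction: float, penalty_cycles: int = 1) -> float:
    """Average cycles per branch when every branch is predicted not taken."""
    if not 0.0 <= taken_fraction <= 1.0:
        raise ValueError("taken_fraction must be between 0 and 1")
    base_cycles = 1  # assumption: a correctly predicted branch costs one cycle
    return base_cycles + taken_fraction * penalty_cycles

# Hypothetical mixes: no branches taken, a quarter taken, all taken.
for fraction in (0.0, 0.25, 1.0):
    print(fraction, average_branch_cycles(fraction))
```

Under these assumptions the worst case is two cycles per branch, which fits the remark that the penalty is "not that devastating".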

I haven't quite gotten around to acquiring a free .rar extractor that only comes with minimal bonus material.

LordDecapo Wrote:

(10-10-2014, 01:48 AM)TSO Wrote: So you would be saving a clock cycle per instruction.

I spoke with him, and yes you do exactly what I described when programming assembly for the 386, with one slight exception. The instruction set does not carry the conditional with it, there is a branching operation and the CPU uses some kind of hardware look ahead in order to set the flags in one clock cycle so that the next cycle will pipeline correctly.

Also, when optimizing for speed on an ALU where not every operation takes the same amount of time but multiple simultaneous operations are possible, is it better to put the fast operations close to the CPU and have them be a hell of a lot faster than the slow ones, or put the fast farthest away and have it all kinda balance out? For example, my current instruction set, which I have discussed with LD would allow for a bit shift to occur three ticks after being instructed, and repeatable every six ticks, with the ability to run all the other operations at such speeds as well (the CPU can have a three tick clock). The Boolean operators are four ticks out, but also repeatable every six ticks. At the other end, the MOD function is 106 ticks out, so that's like 34 near operations for every far operation.

No please no, do not do a 3 tick clock. It's "theoretically" the fastest u can get with torches, but NO just NO! MC bugs are so disgraceful that ur clock cycles will become uneven and will corrupt data in ways u never knew were possible... trust me, i had a huge project, and backed off the complexity and simplified the logic i was gonna use in my CU to get it to be a little longer clock, well more than a little, 3 ticks to 10 ticks, but the throughput and penalty %ages are ridiculously less now as well, so it gives you better performance under normal operating conditions. Clock speed DOESN'T mean more Power, u have to take into consideration the IS, and the possible penalties the CPU could suffer from such small pipeline stages, and a 3 tick clock leaves 2 ticks for logic, 1 tick to store, so it's really dumb xD i learned this the hard way... PC was the thing that we found killed it the fastest.

Again, there are actually many errors in that statement, as well as a massive oversight on my part. The clock is limited to the seven ticks it will take to decode the instruction pointer. I honestly have absolutely no idea how to speed that up without reducing the amount of cache space in the computer used for queuing up instruction data.

Three ticks does not give you two ticks for logic and one tick for store (at least in my architecture, just because of how every function would need to store at its input side), it gives three to store, however long it takes to calculate, three to write to the output bus, and three to store in the data registers. (Also, there is a device in the game that can store data up to four ticks, you'll never guess what it is. And no, it's not some "command block bull shit".)

Final announcement: the instruction set is nearly complete, it is still actually the reverse engineering of the processes in the CPU and ALU, but my moment of enlightenment allowed for me to engineer the CPU and bussing layout all in my head. It occurred to me that op codes are pointers, which is why I know how far away the inputs for each ALU function are from the CPU (that'll give you something to think about).
