Forums - Open Redstone Engineers
I am TSO - Printable Version

+- Forums - Open Redstone Engineers (https://forum.openredstone.org)
+-- Forum: ORE General (https://forum.openredstone.org/forum-39.html)
+--- Forum: Introductions (https://forum.openredstone.org/forum-18.html)
+--- Thread: I am TSO (/thread-4807.html)

Pages: 1 2 3 4 5 6 7 8 9 10 11 12 13 14


RE: I am TSO - TSO - 10-12-2014

(10-12-2014, 02:25 AM)Magazorb Wrote:

(10-11-2014, 06:02 AM)TSO Wrote: [spoiler]
(10-10-2014, 04:22 PM)Magazorb Wrote: Metric prefixes:
8Mb = 1(1x10^6) or 1,000,000B, 8KB = 8(8x10^3) or 64,000b, 4Kb = 1(4x10^3) or 4,000b
Binary Prefixes:
8Mib = 1(1024^2) or 1,048,576B, 8KiB = (8x8)(1024^1) or 65536b, 4Kib = 4(1024^1) or 4096b
the 1024 came from 2^10 which is the bases of which binary prefixes are powering from, so 8GB = 8((2^10)^3) = 8589934592Bytes or 6871947674bits.
The binary prefixes were loosely based on the French metric prefixes, from which the SI prefixes later came
1.) I know my prefixes quite well, possibly better than some of you guys, so I don't need help with that.
2.) 8Mb is not equal to 1(1*10^6), it's much more like 8*10^6 b, and you have similar math errors throughout this section. Also, they are by no means "loosely" related; the relationship of the digits between base ten and base two is log(n)=bl(n)/bl(10). Every (useful) metric prefix is a factor of 1000. log(1000)=bl(1000)/bl(10), now log(1000)=3 and bl(1000)=9.97. Let's re-examine that number as 10. Now we are looking at 2^10, which is 1024. So the relationship between the two is just (10^(log(1024)log([prefix])/3))B=[prefix]iB. End of story.
[/spoiler]

OK, first off, you did originally use it incorrectly. I didn't say that you didn't know it, I was merely clarifying; sorry if it seemed I was suggesting you didn't know. Also, I said 8Mb is 1(1*10^6)B, which is true: B = byte, b = bit. You just misread it.

Also, the naming of the binary prefixes is loosely based on the metric prefixes; they just change the ending of each metric prefix to -bi, hence why I said loosely based. The values themselves are altogether different, yes, but the names of the prefixes are related.

You can tie the powers of the prefixes together rather easily by saying 10^3 ~ 1024^1, or 10^9 ~ 1024^3; basically the binary power is metric_power/3. This becomes less accurate as the powers increase, so it's best to keep metric and binary separate, but regardless, the names still relate.
As I said, the logarithmic equation above is the exact relationship between the radices.
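For anyone following along, here's a quick Python sketch of the standard SI/IEC values both posters are arguing around (the numbers are the standard definitions; the code itself is just for illustration):

```python
# SI (metric) prefixes are powers of 1000; IEC (binary) prefixes are powers of 1024.
METRIC = {"k": 1000**1, "M": 1000**2, "G": 1000**3}
BINARY = {"Ki": 1024**1, "Mi": 1024**2, "Gi": 1024**3}

for (m, mv), (b, bv) in zip(METRIC.items(), BINARY.items()):
    # Each binary prefix is slightly larger than its metric cousin: 1024^n / 1000^n = 1.024^n
    print(f"1 {m}B = {mv:>13,} B    1 {b}B = {bv:>13,} B    ratio = {bv / mv:.6f}")

# The 8 GB figure from the quoted post, interpreted both ways:
print(8 * METRIC["G"], "bytes in 8 GB (metric)")    # 8000000000
print(8 * BINARY["Gi"], "bytes in 8 GiB (binary)")  # 8589934592
```

So the names line up one-for-one, but the values drift apart by a factor of 1.024 per prefix step, which is the "metric_power/3" approximation above becoming less accurate as the powers grow.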

magazorb Wrote:

(10-11-2014, 06:02 AM)TSO Wrote: [spoiler]
magazorb Wrote:[spoiler]
(10-09-2014, 05:16 AM)TSO Wrote: Sorry, I meant mostly how to generate the conditions (flags, I guess) in the same cycle as the operation, then I started wondering how many of these conditions I need... then I asked my father for help. He programmed back in the 90's when computers had no memory and actual program speed mattered because computers were still so slow (back when programmers actually had to be good at programming.)

What I learned was this: all your possible conditions are computed at the same time and given a true/false value. These values then go to a single register the same size as the set of possible conditions (between eight and sixteen); you then use a bit mask to get the conditions you want out of this register (you can actually call more than one flag at the same time, then operate a condition on those two). Your goal is to never jump, EVER, especially more than one or two lines. Long goto jumps take longer, are harder to understand in debugging, and are a sign that the programmer does not understand the program being written. If/then/else statements also slow down the system, but are far better than a goto because the compiler will place the destination code for the if statement near the line the condition is on. The compiler will preload the program data line by line regardless of condition destinations; if an if/then/else occurs, a path is chosen, and all code for the other path is deleted, followed by loading all the subsequent program data for the line that was chosen. The nearer these lines are to the jump condition, the less gets loaded and unloaded. The amount of queued data that gets deleted depends upon how large the jump is, with a goto being the worst: nearly all data is dumped and reloaded, and all stages of the CPU are completely cleared, because gotos inherently do not jump near the line their condition is on. And if you pipelined and it jumps conditionally, you have to somehow rid yourself of all the crap in the pipeline that is no longer correct because it was part of the other conditional side.

Luckily, my design inherently helps with this, but it will still suck to have to think about it.
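The flag-register-and-bitmask scheme described above can be sketched in a few lines of Python; the flag names and bit positions here are invented for illustration, not taken from any real architecture:

```python
# Hypothetical flag bits packed into one status register.
ZERO, CARRY, NEGATIVE, OVERFLOW = 1 << 0, 1 << 1, 1 << 2, 1 << 3

def compute_flags(result, carry_out, signed_overflow):
    """Set every condition bit in parallel, true/false, in one pass."""
    flags = 0
    if result == 0:
        flags |= ZERO
    if carry_out:
        flags |= CARRY
    if result < 0:
        flags |= NEGATIVE
    if signed_overflow:
        flags |= OVERFLOW
    return flags

def branch_if(flags, mask):
    """A conditional jump just masks the flag register; more than one
    bit can be tested at once, e.g. mask = ZERO | CARRY."""
    return bool(flags & mask)

flags = compute_flags(result=0, carry_out=True, signed_overflow=False)
print(branch_if(flags, ZERO))      # True  (branch-if-equal fires)
print(branch_if(flags, NEGATIVE))  # False (branch-if-less-than does not)
```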


Computers back in the 90's did have registers, if they were GPRMs. This wasn't used too often though, because the processors at the time were slower than the memory, so having registers didn't improve performance much and added to the instruction size; it was deemed unnecessary at the time because of the extra memory requirements for the IS.

Two other popular PE types were stack-based and accumulator-based. These two were both popular for near enough the same reasons; the positives and negatives of each are nearly the same, and are mainly the opposite of the pros and cons of GPRM PEs. There were also other types of PEs that other systems used, but the three most popular and main ones were the stack, accumulator, and GPR based PEs.

"all your possible conditions are computed at the same time and given a true/false value."
The idea and direction you're going with this is good; IRL some do this. However, in MC it isn't really a thing, with most things being small enough that they reach the new address and return before the clock, thus giving no performance gain, or being predictively prefetched for pipelined CPUs, which also gives no gain.
This is a good idea though, and in bigger MC things it may be useful; the idea has gone around a couple of times but has yet to be implemented, so 5pts. to you if you do it first with at least some gain Big Grin

"(back when programmers actually had to be good at programming.)" Actual programming is harder now than it's ever been for pros. The only programmers that aren't good tend to be game devs (sometimes they know some code, but they aren't really programmers, they use software instead) and those who can't be bothered to learn how to write optimal code; those doing it professionally are made to max out their hardware's abilities, and this becomes very hard.
To do this you have to learn various languages (normally this would be done with a combination of Fortran, C++ and ASM, with GPGPU being utilised heavily).
Generally this style of programming is never taught but is in high demand; it also tends not to run too well on configurations other than its intended system config(s).

I'm also unclear as to what you're defining as a compiler; you're probably correct, but it almost sounds like you want to use it in active programs instead of compiling before use.
[/spoiler]

He said there were like 10 or 16 registers (remember, he only really worked with x86 assembly), and you only worked with about four. All 16 were special purpose, but if you weren't using the function they related to in that instruction, they doubled as general purpose; there were really only five you worked with: A, B, C, D, and FLAGS. Add in that x86 is RISCey, and you see that you only need four of these registers: the one involved with your operation, any two main registers, and FLAGS, so you never use D either. The FLAGS are all predicted simultaneously and parallel to execution, and the jump conditions were all opcodes that simply masked the FLAGS register and jumped to the specified line. x86 processors never performed more than one task in the same line of code, for the most part. There was some special stuff that played with things, and there were numbers you could insert into the opcode location that represented opcodes that didn't exist (or something like that) that made the CPU "fuck up just right" and perform operations that used many lines of code in one line. I have no idea how that works, but it abuses the fact that x86 is microcode heavy but doesn't know what to do if it receives a number that isn't a real opcode. Results range from exactly what you wanted to happen to computer suicide.

His only comment when I asked about stacks was, "You never use stacks." He used pointers all the time, though (in fact, there were a few string manipulation programs that only used pointers and pointers pointing to pointers and other stuff). He also never made comments on his code, so nobody knew how to debug it (not that the debugging team was any good at their job to begin with), so he just debugged it himself.

He says the abilities of modern computers make even lazy coding fast enough for the intended application: goto is returning, as are stacks. When he did things, it would be a few lines in C that then resulted in a subroutine branch to a short assembly program, then some more lines of C, et cetera. Some of his programs were more assembly than C, some weren't; it depended on what needed to be done. The programs were intended for the x86 family, which is apparently an industry standard, or something.

Fortran was rarely used because the compile overhead was too great.

A compiler is a program that converts the written program into machine code. The more distant the programming language is from machine code, the more time is spent compiling, and the longer the resulting program. Some languages, such as assembly, have zero line expansion (assembly is simply a mnemonic for machine code), while some languages (cough Java cough) are so distant from the machine code that only an absolute moron would try to write in them unless you need the code to be insensitive to the machine it's running on. Of note: Python code compiles at runtime; there is no separate compile time. Now, that being said, some good compilers do actually alter and rearrange the lines of code at compile time to help optimize things like conditional jumps.
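One way to see the "Python compiles at runtime" point: CPython turns source into bytecode the moment the code is loaded, and its compiler even does some rearranging of its own (constant folding), a tiny instance of the compile-time optimization mentioned at the end. A minimal sketch, assuming the standard CPython interpreter:

```python
import dis

# CPython compiles source to bytecode at runtime -- there is no separate
# ahead-of-time compile step for the programmer.
code = compile("x = 2 + 3", "<string>", "exec")

# The compiler has already folded 2 + 3 into the constant 5,
# a small example of a compiler rewriting your code for you.
print(5 in code.co_consts)  # True

dis.dis(code)  # show the resulting bytecode
```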

Also, I meant the computer queues up program data, not the compiler (damn auto correct)[/spoiler]

Seems you and I are on the same page altogether XD. Yes, all you say is mostly correct: x86-based processors use GPRM PEs, with more modern x86 CPUs having many x86 MIMD PEs. While higher clock speeds do allow for lazy programmers, and languages like Java are nearly impossible to fail with, heavily optimized C++ and Fortran still have the original issues you state, and more, as there are more PEs you would have to vectorize; modern Intel and AMD processors have 16 MIMD PEs (assuming a 4-core Intel chip or a 4-module AMD processor; 32 MIMD PEs on an i7 5960X, 4 per core).

Also, x86 is considered a CISC (pretty much a complicated and overdeveloped RISC).
He considered it RISC because all the assembly commands are very much RISC with one operation on two registers per line. The compiler expands out the instruction set based on the opcode and the fields involved.

The registers were all special purpose. For example, if you wrote to BX, you couldn't perform a move in the next operation because BX was the move command's index register for a machine-coded array. You would lose what was already written to BX when that happened, and the move would be wrong (unless, of course, you were performing an indexed move in an array, which you never did because that would waste more clock cycles than using pointers). AX was the accumulator register, CX was the count register for a machine-coded array, and DX was some even more obscure thing you absolutely never used. The only reason he even knew the second functions of the first three was because he would occasionally get a job that needed a program to run faster, and the assembly code always had these registers used for their special purpose. (The only reason these are any slower is because the machine-coded array needs like four other registers to define other properties of the array and its address. That being said, many people thought that if you did more than twelve (IIRC) consecutive equal move operations on the same array, these became useful... but they didn't realize this meant they had just created twelve consecutive equal RAW data conflicts in the pipeline, so instead the program slowed down while the CPU damned you to Hell; and you never needed that many moves unless you were moving every element in the array by one and were treating it as a stack... and there's the magic word.)

magazorb Wrote:

(10-11-2014, 06:02 AM)TSO Wrote: [spoiler]
magazorb Wrote:[spoiler]
(10-10-2014, 01:48 AM)TSO Wrote: So you would be saving a clock cycle per instruction.

I spoke with him, and yes you do exactly what I described when programming assembly for the 386, with one slight exception. The instruction set does not carry the conditional with it, there is a branching operation and the CPU uses some kind of hardware look ahead in order to set the flags in one clock cycle so that the next cycle will pipeline correctly.


Also, when optimizing for speed on an ALU where not every operation takes the same amount of time but multiple simultaneous operations are possible, is it better to put the fast operations close to the CPU and have them be a hell of a lot faster than the slow ones, or put the fast farthest away and have it all kinda balance out? For example, my current instruction set, which I have discussed with LD would allow for a bit shift to occur three ticks after being instructed, and repeatable every six ticks, with the ability to run all the other operations at such speeds as well (the CPU can have a three tick clock). The Boolean operators are four ticks out, but also repeatable every six ticks. At the other end, the MOD function is 106 ticks out, so that's like 34 near operations for every far operation.

Doesn't really matter how much you speed up things that are going slow; if you have them done before your next clock comes, it's just waiting. So if you can, put other stuff there and let that be further out. Again, this is a unique thing with your designs; most don't have this issue, so it might be a little tricky figuring out a method of doing this automatically for the programmer, or for the programmer to know where to do this.
[/spoiler]

1.) There was a slight error: DVQ/MOD and MUL/SQR are six ticks out, but will probably take about 96 and 12 ticks to complete, respectively. On the other hand, it just now occurred to me how I could pipeline these functions and have them point back toward the CPU so that the actual operator will bus itself (if I make the operator four chunks long (easily done) and place the output toward the CPU, then it ends up right back at the cache when it's all done, because the bus from the CPU to the input is going to be four chunks long), cutting out the four bussing ticks needed to return the value and allowing more than one divide or multiply to be requested in sequence, though it would still take 96 or 12 ticks to compute each operation.

2.) The computer won't alter the code for the programmer to account for the timing difference, and the programmer also doesn't write the program with the timing difference in mind. Place the blatantly obvious secret sauce recipe here. (If you can't figure out how I'm going to be creating this particular workaround, we have a problem, because it's already been discussed in this thread.)[/spoiler]

I'm not too well informed as to all the details of your overall system or architecture. However, what I can say is that your main two concerns are data loops and throughput; I'm sure you're already aware of that, though. Data dependencies will also make a difference, so it's mostly about how you intend to get rid of data dependencies or reduce them: increased pipelining will increase throughput, but it also complicates things and often allows more data dependencies to accrue, in turn increasing the amount of time you have to idle.

Also, I doubt anyone really knows what your "secret sauce" is, as you seem somewhat inconsistent about what would otherwise seem to be it.
Basically, it's somewhat vague by our general conventions.
Let's have an extremely short thought experiment. If you don't rearrange or alter data in runtime, and you don't rearrange or alter data at programming time, where would it happen? What other time could possibly exist in the program development flow chart where the data could be rearranged or altered? There is only one useful answer to this question.

Not every secret sauce is the same. It takes more than one ingredient to make a recipe.
magazorb Wrote:

(10-11-2014, 06:02 AM)TSO Wrote: [spoiler]
magazorb Wrote:[spoiler]
(10-10-2014, 05:45 AM)TSO Wrote: That completely defeats the purpose of a forum. I advise you get around to reading it, we've been discussing making the impossible possible, and (at least according to GG) have actually made a small breakthrough in redstone computing (it sounds like I knocked a tick or two off of the fastest theoretical clock rate)

Hi

we've/you've not really been talking about things that were seemingly impossible, albeit you do have some great ideas in general. Also, sorry to say it, but you won't be claiming performance king just yet; maybe after a few revisions and optimisations, once you settle in and learn stuff, your combination of ideas with some of our members' might do, though (it's not really expected that new members even know half this much about CS, really, so you're doing well Tongue)

If it's not too much to ask, once you have an IS drawn up, may I see it please?
[/spoiler]

Due to a particular breakthrough I had (it was legitimately like a moment of enlightenment, LD sort of knows about it), I actually have almost all the instruction set drawn up now, as well as the entire bussing arrangement. I also don't think I'm going to ever be performance king, but I will certainly be the king of dual purpose logic.

I have a programmer father that hates wasting time in a program, my friend's father is a hardware developer at HP, one of my father's friends is a hardware developer at IBM, and one of my father's other friends has a degree in library science. Between my inquisitive nature and their experience, there is literally no way I wouldn't know this much about computer engineering. (Although, it is funny to get the two hardware developers in the same room, because they suddenly get real cautious about how they answer my questions. Let's just say neither of them are the low guy on the totem pole at their company.)[/spoiler]

Nice to know you sit yourself in a position where you can easily acquire information about real-life EE and CS; no doubt it's useful. However, MC does have many limitations and possibilities that IRL stuff doesn't account for, so this is generally the challenge when coming from RL-based knowledge. It's still very useful, as most of it can be applied, though. Another thing to bear in mind is that technology is so vast that no small collective group of individuals fully knows the capabilities of hardware/programming; I even doubt that everyone's knowledge collectively would.

In short, you will hit walls and you will destroy some, but a lot of the time it's unexpected; that's all part of the fun.
The funniest thing is that the library scientist is by far the most useful for what I'm trying to get done.

magazorb Wrote:

(10-11-2014, 06:02 AM)TSO Wrote: [spoiler]
(10-10-2014, 08:26 PM)LordDecapo Wrote:
(10-07-2014, 02:42 AM)TSO Wrote: Or you can rename registers as you go... but that's for noobs.
cough cough, i can do that, and it only takes 3 ticks.
Noob.

LordDecapo Wrote:[spoiler]
(10-09-2014, 10:18 PM)greatgamer34 Wrote: Most people(including me) in minecraft have a Compare instruction then followed by a Flag register read.

The Flag register read can be BEQ(branch if equal to), BGT(branch if greater than), BLT(branch if less than), BNEQ(branch if not equal to)....etc, along these lines of any amount of possible conditions.

This then reads from the proper flags register and performs a jump to a specified address(usually found in the IMM or a pointer).

Ok, so my branching works a bit different, because of my crazy-ass cRISC or CISC or whatever you want to consider it architecture.
I have 2 modes in my IS, Memory/System is mode0 and ALU functions are Mode1
Branching is a Mode0, And it is also a multi line
I have a specially made 3-deep queue with multi-read, so when a branch is detected, it reads locations (Inst)0, (Inst)1, and (Inst)2 from the queue, and routes their data to specific parts.

Inst0 is the main inst; it tells what conditions to look for, whether it is a call or a return, whether it is conditional or not, and whether it's direct, relative-positive, or relative-negative.

Inst1 is the destination address (so I can have 65535 or whatever lines of code on just a PROM, more if I access external memory, which is easy to do) that is only loaded into the PC if the condition is true.

Inst2 is the function that defines where the flag can arise from, and this inst MUST be a Mode1, so you can add, sub, or, xor, compare, etc. to generate any flag you want.

All of that gets decoded and sorted out in about 7 ticks; then the branch is determined on the next cycle, whether the conditions are met. It has static prediction of false, so you only get hit with a 1-cycle penalty after a true flag comes through, leaving the penalty of branching not that devastating.
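If I'm reading the three-slot scheme right, a rough functional sketch looks like this (all names, flag bits, and widths here are my own invention for illustration, not LD's actual implementation):

```python
# Two example flag bits; a real FLAGS register would carry more.
ZERO, NEG = 1 << 0, 1 << 1

def decode_branch(inst0_mask, inst1_dest, inst2_alu_result):
    """Inst0 carries the condition mask (call/return and relative info
    omitted here); Inst1 is the destination address; Inst2 is an ordinary
    Mode1 ALU op whose result generates the flags being tested."""
    flags = (ZERO if inst2_alu_result == 0 else 0) | (NEG if inst2_alu_result < 0 else 0)
    taken = bool(flags & inst0_mask)
    # Static prediction of false (not-taken): only a branch that turns
    # out taken pays the 1-cycle penalty described in the post.
    penalty = 1 if taken else 0
    return (inst1_dest if taken else None), penalty

print(decode_branch(ZERO, 0x40, inst2_alu_result=0))  # (64, 1): taken, one-cycle hit
print(decode_branch(NEG, 0x40, inst2_alu_result=7))   # (None, 0): fall through, free
```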

I will be making a forum post this weekend with pics and such of my MC CPU, since you can't join the server, and will explain the IS in detail for those who are interested.


...It sounds really complicated... Tongue
...or maybe not...
Is it basically the same jump instruction system as mine, but without the parallel computing part?

I haven't quite gotten around to acquiring a free .rar extractor that only comes with minimal bonus material.

LordDecapo Wrote:

(10-10-2014, 01:48 AM)TSO Wrote: So you would be saving a clock cycle per instruction.

I spoke with him, and yes you do exactly what I described when programming assembly for the 386, with one slight exception. The instruction set does not carry the conditional with it, there is a branching operation and the CPU uses some kind of hardware look ahead in order to set the flags in one clock cycle so that the next cycle will pipeline correctly.


Also, when optimizing for speed on an ALU where not every operation takes the same amount of time but multiple simultaneous operations are possible, is it better to put the fast operations close to the CPU and have them be a hell of a lot faster than the slow ones, or put the fast farthest away and have it all kinda balance out? For example, my current instruction set, which I have discussed with LD would allow for a bit shift to occur three ticks after being instructed, and repeatable every six ticks, with the ability to run all the other operations at such speeds as well (the CPU can have a three tick clock). The Boolean operators are four ticks out, but also repeatable every six ticks. At the other end, the MOD function is 106 ticks out, so that's like 34 near operations for every far operation.

No, please, no, do not do a 3-tick clock. It's "theoretically" the fastest you can get with torches... but NO, just NO! MC bugs are so disgraceful that your clock cycles will become uneven and will corrupt data in ways you never knew were possible... trust me, I had a huge project, and I backed off the complexity and simplified the logic I was gonna use in my CU to get a little longer clock... well, more than a little: 3 ticks to 10 ticks. But the throughput and penalty percentages are ridiculously less now as well, so it gives you better performance under normal operating conditions. Clock speed DOESN'T mean more power; you have to take into consideration the IS, and the possible penalties the CPU could suffer from such small pipeline stages... and a 3-tick clock leaves 2 ticks for logic, 1 tick to store, so it's really dumb xD. I learned this the hard way... the PC was the thing that we found killed it the fastest.
[/spoiler]

Again, there are actually many errors in that statement, as well as a massive oversight on my part. The clock is limited to the seven ticks it will take to decode the instruction pointer. I honestly have absolutely no idea how to speed that up without reducing the amount of cache space in the computer used for queuing up instruction data.

Three ticks does not give you two ticks for logic and one tick for store (at least in my architecture, just because of how every function would need to store at its input side); it gives three to store, however long it takes to calculate, three to write to the output bus, and three to store in the data registers. (Also, there is a device in the game that can store data for up to four ticks; you'll never guess what it is. And no, it's not some "command block bull shit".)


Final announcement: the instruction set is nearly complete. It is still actually the reverse engineering of the processes in the CPU and ALU, but my moment of enlightenment allowed me to engineer the CPU and bussing layout all in my head. It occurred to me that opcodes are pointers, which is why I know how far away the inputs for each ALU function are from the CPU (that'll give you something to think about).[/spoiler]

Repeater locks: 1 tick to store and the rest can be logic. It's pretty much the way to go.

Also loving this thread of yours Big Grin
Cue another philosophical minecraft moment...

Is it the way to go? Do you honestly need that repeater to even lock? If we multiply by accumulation, and place 16 3-tick adders with output connected to input "incorrectly" so the shift happens on its own, do you really need to hold the other operand's data in a locking repeater, or do we just need it to pause for three ticks at the front of each add? If they are 1-tick adders sixteen blocks long, does the data for the second operand need any pause at all, or does the repeater every fifteen blocks suffice?

If they are 1-tick adders sixteen blocks long, are they really 1 tick... or are they zero-tick, because redstone needs a repeater every fifteen blocks?
If you direct this multiplier back toward the registers, did you remove the output bussing, or is the bus performing the operations?
Could that be applied to other elements of the processor?
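The multiply-by-accumulation trick being described is shift-and-add multiplication; in software the same accumulation looks like this (a sketch only, with the "wiring offset" written as an explicit shift):

```python
def shift_add_multiply(a, b, width=16):
    """Multiply by accumulation: one conditional add per bit of b,
    each stage adding a copy of a offset by one more place -- the
    software analogue of 16 chained adders whose input wiring is
    offset one position per stage, so the shift happens on its own."""
    acc = 0
    for i in range(width):
        if (b >> i) & 1:       # bit i of b decides whether stage i adds
            acc += a << i      # the stage's wiring offset IS the shift
    return acc

print(shift_add_multiply(1234, 56))  # 69104, same as 1234 * 56
```

In hardware terms, no stage ever "performs" a shift; the routing between stages does it for free, which is exactly the question being posed about the bus performing the operations.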

(10-12-2014, 03:52 AM)Magazorb Wrote:
(10-12-2014, 03:14 AM)͝ ͟ ͜ Wrote:
(10-12-2014, 02:25 AM)Magazorb Wrote: In short you will waste away your life playing a stupid block game building outdated computers that serve no purpose

fix'd

Well that's not nice :'(

But it is true


RE: I am TSO - Magazorb - 10-12-2014

(10-12-2014, 03:59 AM)TSO Wrote:
(10-12-2014, 02:25 AM)Magazorb Wrote:

(10-11-2014, 06:02 AM)TSO Wrote: [spoiler]
(10-10-2014, 04:22 PM)Magazorb Wrote: Metric prefixes:
8Mb = 1(1x10^6) or 1,000,000B, 8KB = 8(8x10^3) or 64,000b, 4Kb = 1(4x10^3) or 4,000b
Binary Prefixes:
8Mib = 1(1024^2) or 1,048,576B, 8KiB = (8x8)(1024^1) or 65536b, 4Kib = 4(1024^1) or 4096b
the 1024 came from 2^10 which is the bases of which binary prefixes are powering from, so 8GB = 8((2^10)^3) = 8589934592Bytes or 6871947674bits.
The Binary prefixes was losely based around the French metrics or later came SI metrics
1.) I know my prefixes quite well, possibly better than some of you guys, so I don't need help with that.
2.) 8Mb is not equal to 1(1*10^6), it's much more like 8*10^6 b, and you have similar math errors throughout this section. Also, they are by no means "loosely" related, the relationship of the digits between base ten and base two is log(n)=bl(n)/bl(10). Every (useful) metric prefix is a factor of 1000. Log(1000)=bl(1000)/bl(10), now log(1000)=3 and bl(1000)= 9.96. Let's re examine that number as 10. Now we are looking at 2^10, which is 1024. So the relationship between the two is just (10^(log(1024)log([prefix])/3))B=[prefix]iB. End of story.
[/spoiler]

OK, first of you did originally use it incorrectly, i didn't say that you didn't know it, i just merely was clearing, sorry if it seemed i was suggesting you didn't know, also i said 8Mb is 1(1*10^6)B which is true, B= Byte, b=bit, you just miss read it.

Also the prefix naming of binary prefixed is losely based on metric prefixs, they just change the ending of metric prefixes to bi instead, hence why i said losely based, the values them self are all together different yes, but the names of prefixes are related.

You can easily tie the powers of prefixes to together rather easily by saying 10^3~1024^1, or 10^9~1024^3, basically powers of binary be metric_power/3, this become less accurate as the powers increase, so best to keep metrics and binary's separate, but regardless name still relates.

As I said, the logarithmic equation above is the exact relationship between the radicies.

magazorb Wrote:

(10-11-2014, 06:02 AM)TSO Wrote: [spoiler]
magazorb Wrote:[spoiler]
(10-09-2014, 05:16 AM)TSO Wrote: Sorry, I meant mostly how to generate the conditions (flags, I guess) in the same cycle as the operation, then I started wondering how many of these conditions I need... then I asked my father for help. He programmed back in the 90's when computers had no memory and actual program speed mattered because computers were still so slow (back when programmers actually had to be good at programming.)

What I learned was this: all your possible conditions are computed at the same time and given a true/false value. These values then go to a single register the same size as all the possible conditions (between eight and sixteen), you then use a bit mask to get the conditions you want out of this register (you can actually call more than one flag at the same time, then operate a condition on those two). Your goal is to never jump, EVER, especially more than one or two lines. Long goto jumps take longer, are harder to understand in debugging, and are a sign that the programmer does not understand the program being written. If, Then, Else statements also slow down the system, but are far better than a goto because the compiler will place the destination code for the if statement near the line the condition is on. The compiler will preload the program data line by line regardless of condition destinations, if an if, then, else occurs, a path is chosen, and all code for the other path is deleted, followed by loading all the subsequent program data for the line that was chosen. The nearer these lines are to the jump condition, the less gets loaded and unloaded. The amount of qued data that gets deleted depends upon how large the jump is, with a goto being the worst. Nearly all data is dumped and reloaded and all stages of the CPU are completely cleared in a goto because they inherently do not jump near the line their condition is on. And if you pipelined and it jumps conditionally, you have to somehow rid yourself of all the crap in the pipeline that is no longer correct because it was part of the other conditional side.

Luckily, my design inherently helps with this, but it will still suck to have to think about it.


Computers back in the 90's did have registers, if they was GPRMs, this wasn't used to often though because the processors at the time was slower then the memory speeds so having registers didn't improve performance much, and added to the instruction size, so was deemed uncesseriy for the time because of the extra memory requirements for the IS.

Two other popular PE types wear Stack based and Acumulator based, this two wear both popular for near enough the same reason, positives and negatives of each are near the same and are mainly the other type of PEs that pros and cons wear oposite of GPRM PEs, there was also other types of PEs that other systems used but the 3 most popular and main ones where the Stack,Accumulator and GPR based PEs

"all your possible conditions are computed at the same time and given a true/false value."
The idea and direction you're going with this is good; IRL some do this. However, in MC it isn't really a thing, with most things being small enough that they reach the new address and return before the clock, thus giving no performance gain, or being predictively prefetched for pipelined CPUs, which also gives no gain.
This is a good idea though, and in bigger MC things it may be useful; the idea has gone around a couple of times but has yet to be implemented, so 5pts. to you if you do it first with at least some gain Big Grin

"(back when programmers actually had to be good at programming.)" Actual programming is harder now than it's ever been for pros. The only programmers that aren't good tend to be game devs (sometimes they know some code but they aren't really programmers, they use software instead) and those who can't be bothered to learn how to write optimal code; those doing it professionally are made to max out their hardware's abilities, and this becomes very hard.
To do this you have to learn various languages (normally this would be done with a combination of Fortran, C++ and ASM, with GPGPU being utilised heavily).
Generally this style of programming is never taught but is in high demand; it also tends not to run too well on configurations other than its intended system config(s).

I'm also unclear as to what you're defining as a compiler; you're probably correct, but it sounds almost like you want to use it in active programs instead of compiling before use.
[/spoiler]

He said there were like 10 or 16 registers (remember he only really worked with x86 assembly), and you only worked with about four. All 16 were special purpose, but if you weren't using the function they related to in that instruction, they doubled as general purpose. There were really only five you worked with: A, B, C, D, FLAGS. Add in that x86 is RISCey, and you see that you only need four of these registers: the one involved with your operation, any two main registers, and FLAGS, so you never use D either. The FLAGS are all predicted simultaneously and parallel to execution, and the jump conditions were all opcodes that simply masked the FLAGS register and jumped to the specified line. x86 processors never performed more than one task in the same line of code, for the most part. There was some special stuff that played with things, and there were numbers you could insert into the opcode location representing opcodes that didn't exist (or something like that) that made the CPU "fuck up just right" and perform operations that would normally take many lines of code in one line. I have no idea how that works, but it abuses the fact that x86 is microcode heavy yet doesn't know what to do if it receives a number that isn't a real opcode. Results range from exactly what you wanted to happen to computer suicide.

His only comment when I asked about stacks was, "You never use stacks." He used pointers all the time, though (in fact, there were a few string manipulation programs that only used pointers, and pointers pointing to pointers, and other stuff). He also never made comments on his code, so nobody knew how to debug it (not that the debugging team was any good at their job to begin with), so he just debugged it himself.

He says the abilities of modern computers make even lazy coding fast enough for the intended application: goto is returning, as are stacks. When he did things, it would be a few lines in C that then resulted in a subroutine branch to a short assembly program, then some more lines of C, et cetera. Some of his programs were more assembly than C, some weren't; it depended on what needed to be done. The programs were intended for the x86 family, which is apparently an industry standard, or something.

Fortran was rarely used because the compile overhead was too great.

A compiler is a program that converts the written program into machine code. The more distant the programming language is from machine code, the more time is spent compiling and the longer the resulting program. Some languages, such as assembly, have zero line expansion (assembly is simply a mnemonic for machine code), while some languages (cough Java cough) are so distant from the machine code that only an absolute moron would try to write in them unless you need the code to be insensitive to the machine it's running on. Of note: Python code compiles at runtime; there is no separate compile time. Now, that being said, some good compilers do actually alter and rearrange the lines of code at compile time to help optimize things like conditional jumps.
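The "zero line expansion" point can be made concrete with a toy assembler where every mnemonic maps one-to-one onto a machine word. The opcodes below are made up for the sketch, not any real ISA:

```python
# Toy assembler: each mnemonic maps 1:1 to an invented opcode byte,
# so one source line becomes exactly one machine word (zero expansion).
OPCODES = {"MOV": 0x01, "ADD": 0x02, "SUB": 0x03, "JMP": 0x04}

def assemble(lines):
    """One source line in, one (opcode, operand) word out."""
    machine = []
    for line in lines:
        mnemonic, operand = line.split()
        machine.append((OPCODES[mnemonic], int(operand)))
    return machine

program = ["MOV 7", "ADD 1", "JMP 0"]
machine = assemble(program)
print(machine)                       # [(1, 7), (2, 1), (4, 0)]
print(len(machine) == len(program))  # True: zero line expansion
```

A high-level language compiler, by contrast, may emit many machine words per source line, which is exactly the expansion being described.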

Also, I meant the computer queues up program data, not the compiler (damn autocorrect)[/spoiler]

Seems you and I are on the same page altogether XD. Yes, all you say is mostly correct: x86 based processors use GPRM PEs, with more modern x86 CPUs having many x86 MIMD PEs. While higher clock speeds do allow for lazy programmers, and languages like Java are nearly impossible to fail with, heavily optimized C++ and Fortran still have the original issues you state, and more, as there are more PEs you would have to vectorize; modern Intel and AMD processors have 16 MIMD PEs (assuming a 4 core Intel chip or a 4 module AMD processor; 32 MIMD PEs on an i7 5960X, 4 per core).

Also, x86 is considered a CISC (pretty much a complicated and overdeveloped RISC).
He considered it RISC because all the assembly commands are very much RISC with one operation on two registers per line. The compiler expands out the instruction set based on the opcode and the fields involved.

The registers were all special purpose. For example, if you wrote to BX, you couldn't perform a move in the next operation because BX was the move command's index register for a machine coded array. You would lose what was already written to BX when that happened and the move would be wrong (unless, of course, you were performing an indexed move in an array, which you never did because that would waste more clock cycles than using pointers). AX was the accumulator register, CX was the count register for a machine coded array, and DX was some even more obscure thing you absolutely never used. The only reason he even knew the second functions of the first three was because he would occasionally get a job that needed a program to run faster. The assembly code always had these registers used for their special purpose. (The only reason these are any slower is because the machine coded array needs like four other registers to define other properties of the array and its address. That being said, many people thought that if you did more than twelve (IIRC) consecutive equal move operations on the same array, these became useful... but they didn't realize this meant that you just created twelve consecutive equal RAW data conflicts in the pipeline, so instead the program slowed down while the CPU damned you to Hell, and you never needed that many moves unless you were moving every element in the array by one and were treating it as a stack... and there's the magic word.)

magazorb Wrote:

(10-11-2014, 06:02 AM)TSO Wrote: [spoiler]
magazorb Wrote:[spoiler]
(10-10-2014, 01:48 AM)TSO Wrote: So you would be saving a clock cycle per instruction.

I spoke with him, and yes you do exactly what I described when programming assembly for the 386, with one slight exception. The instruction set does not carry the conditional with it, there is a branching operation and the CPU uses some kind of hardware look ahead in order to set the flags in one clock cycle so that the next cycle will pipeline correctly.


Also, when optimizing for speed on an ALU where not every operation takes the same amount of time but multiple simultaneous operations are possible, is it better to put the fast operations close to the CPU and have them be a hell of a lot faster than the slow ones, or put the fast farthest away and have it all kinda balance out? For example, my current instruction set, which I have discussed with LD would allow for a bit shift to occur three ticks after being instructed, and repeatable every six ticks, with the ability to run all the other operations at such speeds as well (the CPU can have a three tick clock). The Boolean operators are four ticks out, but also repeatable every six ticks. At the other end, the MOD function is 106 ticks out, so that's like 34 near operations for every far operation.

Doesn't really matter how much you speed up things that are going slow; if you have them done before your next clock comes, they're just waiting. So if you can, put other stuff there and let that be further out. Again, this is a unique thing with your designs; most don't have this issue, so it might be a little tricky figuring out a method of doing this automatically for the programmer, or for the programmer to know where to do this.
[/spoiler]

1.) There was a slight error: DVQ/MOD and MUL/SQR are six ticks out, but will probably take about 96 and 12 ticks to complete, respectively. On the other hand, it just now occurred to me how I could pipeline these functions and have them point back toward the CPU so that the actual operator will bus itself (if I make the operator four chunks long (easily done) and place the output toward the CPU, then it ends up right back at the cache when it's all done, because the bus to the input from the CPU is going to be four chunks long), cutting out the four bussing ticks needed to return the value and allowing more than one divide or multiply to be requested in sequence, though it still would take 96 or 12 ticks to compute each operation.

2.) The computer won't alter the code for the programmer to account for the timing difference, and the programmer also doesn't write the program with the timing difference in mind. Place the blatantly obvious secret sauce recipe here. (If you can't figure out how I'm going to create this particular workaround, we have a problem, because it's already been discussed in this thread.)[/spoiler]

I'm not too well informed as to all the details of your overall system or architecture. However, what I can say is that your main two concerns are data loops and throughput; I'm sure you're already aware of that, though. Data dependencies will also make a difference, so it's mostly about how you intend to get rid of data dependencies or reduce them: increased pipelining will increase throughput, but it also complicates things and often allows more data dependencies to occur, in turn increasing the amount of time you have to idle.

Also, I doubt anyone really knows what your "secret sauce" is, as you seem somewhat inconsistent about what would otherwise appear to be it.
Basically, it's somewhat vague by our general conventions.
Let's have an extremely short thought experiment. If you don't rearrange or alter data at runtime, and you don't rearrange or alter data at programming time, where would it happen? What other time could possibly exist in the program development flow chart where the data could be rearranged or altered? There is only one useful answer to this question.

Not every secret sauce is the same. It takes more than one ingredient to make a recipe.
Assumed you wouldn't do it then because that's too easy XD; it would give an overhead though that may be annoying.

(10-12-2014, 03:59 AM)TSO Wrote:
magazorb Wrote:

(10-11-2014, 06:02 AM)TSO Wrote: [spoiler]
magazorb Wrote:[spoiler]
(10-10-2014, 05:45 AM)TSO Wrote: That completely defeats the purpose of a forum. I advise you get around to reading it, we've been discussing making the impossible possible, and (at least according to GG) have actually made a small breakthrough in redstone computing (it sounds like I knocked a tick or two off of the fastest theoretical clock rate)

Hi

We've/you've not really been talking about things that were seemingly impossible, albeit you do have some great ideas in general. Also, sorry to say it, but you won't be claiming performance king just yet; maybe after a few revisions and optimisations, once you settle in and learn stuff, your combination of ideas with some of our members' might do, though. (It's not really expected that new members even know half this much about CS, really, so you're doing good Tongue)

If it's not much to ask, once you have an IS drawn up, may I see it please?
[/spoiler]

Due to a particular breakthrough I had (it was legitimately like a moment of enlightenment, LD sort of knows about it), I actually have almost all the instruction set drawn up now, as well as the entire bussing arrangement. I also don't think I'm going to ever be performance king, but I will certainly be the king of dual purpose logic.

I have a programmer father that hates wasting time in a program, my friend's father is a hardware developer at HP, one of my father's friends is a hardware developer at IBM, and one of my father's other friends has a degree in library science. Between my inquisitive nature and their experience, there is literally no way I wouldn't know this much about computer engineering. (Although, it is funny to get the two hardware developers in the same room, because they suddenly get real cautious about how they answer my questions. Let's just say neither of them are the low guy on the totem pole at their company.)
[/spoiler]

Nice to know you sit yourself in a position where you can easily acquire information about real life EE and CS; no doubt it's useful. However, MC does have many limitations and possibilities that IRL stuff doesn't account for, so this is generally the challenge when coming from RL based knowledge. It's still very useful, as most of it can be applied, though. Also, another thing to bear in mind is that technology is so vast that no small group of individuals fully knows the capabilities of hardware/programming; I even doubt that everyone's knowledge collectively would.

In short, you will hit walls and you will destroy some, but a lot of the time it's unexpected, and that's all part of the fun.
The funniest thing is that the library scientist is by far the most useful for what I'm trying to get done.

magazorb Wrote:

(10-11-2014, 06:02 AM)TSO Wrote: [spoiler]
(10-10-2014, 08:26 PM)LordDecapo Wrote:
(10-07-2014, 02:42 AM)TSO Wrote: Or you can rename registers as you go... but that's for noobs.
cough cough, i can do that, and it only takes 3 ticks.
Noob.

LordDecpo Wrote:[spoiler]
(10-09-2014, 10:18 PM)greatgamer34 Wrote: Most people(including me) in minecraft have a Compare instruction then followed by a Flag register read.

The Flag register read can be BEQ (branch if equal to), BGT (branch if greater than), BLT (branch if less than), BNEQ (branch if not equal to)... etc., along these lines for any number of possible conditions.

This then reads from the proper flags register and performs a jump to a specified address(usually found in the IMM or a pointer).

Ok so my branching works a bit different, cause my crazy ass cRISC or CISC or whatever u want to consider it architecture.
I have 2 modes in my IS, Memory/System is mode0 and ALU functions are Mode1
Branching is a Mode0, And it is also a multi line
I have a specially made 3 deep queue with multiread, so when a branch is detected, it reads locations (Inst)0, (Inst)1, and (Inst)2 from the queue and routes their data to specific parts.

Inst0 is the Main Inst: it tells what conditions to look for, whether it is a Call or a Return, whether it is conditional or not, and if it's direct, relativePos. or relativeNeg.

Inst1 is the Destination address (so i can have 65535 or what ever lines of code on just a PROM, more if i access external memory, which is easy to do) that is only loaded into the PC if the condition is true

Inst2 is the function that defines where the flag can arise from, and this inst MUST be a Mode1, so u can add, sub, or, xor, compare, etc. to generate any flag you want.

All of that gets decoded and sorted out in about 7 ticks, then the branch is determined on the next cycle if the conditions are met. It has static prediction of False, so u only get hit with a 1 cycle penalty after a True flag comes through, leaving the penalty of branching not that devastating.
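That static predict-false scheme is cheap to reason about: only branches that come back True pay the 1 cycle penalty. A quick illustrative helper (the figures in the example are made up, not measurements of this CPU):

```python
def effective_cpi(base_cpi, branch_fraction, taken_fraction, penalty=1):
    """Average cycles per instruction under static not-taken prediction:
    only the branches that turn out True pay the penalty."""
    return base_cpi + branch_fraction * taken_fraction * penalty

# If 20% of instructions branch and half of those are taken:
print(effective_cpi(1.0, 0.2, 0.5))  # 1.1
```

So even a fairly branchy program only loses a fraction of a cycle per instruction on average, which is why the penalty is "not that devastating."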

I will be making a forum post this weekend with pics and such of my MC CPU, since u cant join server, and will explain the IS in detail in it for those who are interested.


...It sounds really complicated... Tongue
...or maybe not...
Is it basically the same jump instruction system as mine, but without the parallel computing part?

I haven't quite gotten around to acquiring a free .rar extractor that only comes with minimal bonus material.

LordDecapo Wrote:

(10-10-2014, 01:48 AM)TSO Wrote: So you would be saving a clock cycle per instruction.

I spoke with him, and yes you do exactly what I described when programming assembly for the 386, with one slight exception. The instruction set does not carry the conditional with it, there is a branching operation and the CPU uses some kind of hardware look ahead in order to set the flags in one clock cycle so that the next cycle will pipeline correctly.


Also, when optimizing for speed on an ALU where not every operation takes the same amount of time but multiple simultaneous operations are possible, is it better to put the fast operations close to the CPU and have them be a hell of a lot faster than the slow ones, or put the fast farthest away and have it all kinda balance out? For example, my current instruction set, which I have discussed with LD would allow for a bit shift to occur three ticks after being instructed, and repeatable every six ticks, with the ability to run all the other operations at such speeds as well (the CPU can have a three tick clock). The Boolean operators are four ticks out, but also repeatable every six ticks. At the other end, the MOD function is 106 ticks out, so that's like 34 near operations for every far operation.

No please no, do not do a 3 tick clock. It's "theoretically" the fastest u can get with torches,,, but NO just NO! MC bugs are so disgraceful that ur clock cycles will become uneven and will corrupt data in ways u never knew were possible... trust me, i had a huge project, and backed off the complexity and simplified the logic i was gonna use in my CU to get a little longer clock,, well more than a little,, 3 ticks to 10 ticks, but the throughput and penalty %ages are ridiculously less now as well, so it gives you better performance under normal operating conditions. Clock speed DOESN'T mean more power,, u have to take into consideration the IS, and the possible penalties the CPU could suffer from such small pipeline stages,,, and a 3 tick clock leaves 2 ticks for logic, 1 tick to store, so it's really dumb xD i learned this the hard way... PC was the thing that we found killed it the fastest.
[/spoiler]

Again, there are actually many errors in that statement, as well as a massive oversight on my part. The clock is limited to the seven ticks it will take to decode the instruction pointer. I honestly have absolutely no idea how to speed that up without reducing the amount of cache space in the computer used for queuing up instruction data.

Three ticks does not give you two ticks for logic and one tick for store (at least in my architecture, just because of how every function would need to store at its input side); it gives three to store, however long it takes to calculate, three to write to the output bus, and three to store in the data registers. (Also, there is a device in the game that can store data for up to four ticks; you'll never guess what it is. And no, it's not some "command block bull shit".)


Final announcement: the instruction set is nearly complete, it is still actually the reverse engineering of the processes in the CPU and ALU, but my moment of enlightenment allowed for me to engineer the CPU and bussing layout all in my head. It occurred to me that op codes are pointers, which is why I know how far away the inputs for each ALU function are from the CPU (that'll give you something to think about).[/spoiler]

Repeater locks: 1 tick store and the rest can be logic. It's pretty much the way to go.

Also loving this thread of yours Big Grin
Cue another philosophical minecraft moment...

Is it the way to go? Do you honestly need that repeater to even lock? If we multiply by accumulation, and place 16 3 tick adders with output connected to input incorrectly so the shift happens on its own, do you really need to hold the other operand's data in a locking repeater, or do we just need it to pause for three ticks at the front of each add? If they are 1 tick adders sixteen blocks long, does the data for the second operand need any pause at all, or does the repeater every fifteen blocks suffice?
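That multiply-by-accumulation structure computes the same thing as the classic shift-and-add loop; a minimal software sketch of it, assuming 16-bit unsigned operands:

```python
def shift_add_multiply(a, b, width=16):
    """Multiply by accumulation: one conditional add per bit of b.
    The left shift of a at each stage is the shift the hardware gets
    'for free' by wiring each adder's output one place over."""
    product = 0
    for i in range(width):
        if (b >> i) & 1:
            product += a << i
    return product & ((1 << (2 * width)) - 1)

print(shift_add_multiply(13, 11))  # 143
```

Each loop iteration corresponds to one of the 16 chained adders, which is why no extra shifter (or locked operand) is strictly required.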

If they are 1 tick adders sixteen blocks long, are they really 1 tick... or are they zero tick because redstone needs a repeater every fifteen blocks?
If you direct this multiplier back toward the registers, did you remove the output bussing, or is the bus performing the operations?
Could that be applied to other elements of the processor?
You don't need to be philosophical about this. If you wanted a 3 tick pipeline, for instance, 1 tick would be on the repeater lock as a stage buffer and the other 2 would be logic between stages; for say an 8 tick one, it would be 1 tick on locks and 7 on logic. It's just quicker than using SR latches. If you wanted to make an extreme data loop with simple pipelining and 1 Fwd, you can theoretically achieve an effective data loop of 5 ticks: 1 tick for buffer, 4 for ULG or adder/subtractor, with a CU that looks ahead to see if a Fwd is required and redirects the data forward into the buffer before the stage. A little messy and large, but it still holds in theory; the PE would also require two copies of the final half of the logic for the ALU stage, half acting as the Fwd (whose output is naturally disabled by the CU unless a Fwd is required), the other unobstructed and continuing to the GPRs. Though good luck to anyone trying to implement an 8 bit version of that.

But it's a good idea to have stage buffers, as sometimes interrupts can disrupt the flow of data and corrupt things without it being noticeable by the CU itself, in turn giving invalid results and unknown calculations thereafter.

Probably not so much of a worry for a simpler system without them.

If you really want, you can go without stage buffers in MC if everything is synced enough and you know where everything will be and when, but it can make things trickier to implement and get working.

(10-12-2014, 03:59 AM)TSO Wrote:
(10-12-2014, 03:52 AM)Magazorb Wrote:
(10-12-2014, 03:14 AM)͝ ͟ ͜ Wrote:
(10-12-2014, 02:25 AM)Magazorb Wrote: In short you will waste away your life playing a stupid block game building outdated computers that serve no purpose

fix'd

Well that's not nice :'(

But it is true



RE: I am TSO - TSO - 10-12-2014

magazorb Wrote:Assumed you wouldn't do it then because that's too easy XD; it would give an overhead though that may be annoying.
I really don't care about how long compile time takes. Also, I don't think compile time is considered overhead.

magazorb Wrote:You don't need to be philosophical about this. If you wanted a 3 tick pipeline, for instance, 1 tick would be on the repeater lock as a stage buffer and the other 2 would be logic between stages; for say an 8 tick one, it would be 1 tick on locks and 7 on logic. It's just quicker than using SR latches. If you wanted to make an extreme data loop with simple pipelining and 1 Fwd, you can theoretically achieve an effective data loop of 5 ticks: 1 tick for buffer, 4 for ULG or adder/subtractor, with a CU that looks ahead to see if a Fwd is required and redirects the data forward into the buffer before the stage. A little messy and large, but it still holds in theory; the PE would also require two copies of the final half of the logic for the ALU stage, half acting as the Fwd (whose output is naturally disabled by the CU unless a Fwd is required), the other unobstructed and continuing to the GPRs. Though good luck to anyone trying to implement an 8 bit version of that.

But it's a good idea to have stage buffers, as sometimes interrupts can disrupt the flow of data and corrupt things without it being noticeable by the CU itself, in turn giving invalid results and unknown calculations thereafter.

Probably not so much of a worry for a simpler system without them.

If you really want, you can go without stage buffers in MC if everything is synced enough and you know where everything will be and when, but it can make things trickier to implement and get working.

I argue that we did need to ask the question. In fact, I argue we need to go on because you didn't consider where that could have gone.

We'll start small with your counter claim and ask ourselves where buffers and locking repeaters are needed. (I think you'll be mildly surprised as to where this goes.)

You already answered this: they allow the pipeline to properly forward. Out of order execution and register renaming are ways around a forward.

So, where do pipelines forward? In the ALU? The CPU? What about out of order execution? Register renaming?

Well, a forward occurs if an incoming instruction RAW conflicts with one of the instructions inside the pipeline, or an incoming instruction WAR conflicts with a later instruction (it's possible in a for loop). Out of order execution triggers for the same reason, and register renaming triggers if a RAW hazard presents between the incoming instruction and the first instruction in the pipeline. If we add that the CPU always does the forward, then we can also say that the ALU will never forward. Going back, this means no buffers or repeater locks are needed anywhere in the ALU, because the ALU will never forward. We see that only the CPU forwards in a data conflict, but a forward can be avoided by the other two methods.
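Those trigger rules can be written out mechanically. A sketch using an invented instruction format of (destination, sources) tuples, checking an incoming instruction against the ones already in flight (a real pipeline tracks stage timing too; this only shows the register overlap test):

```python
# Invented instruction format: (dest_register, (source_registers,)).
def has_raw(incoming, pipeline):
    """RAW hazard: the incoming instruction reads a register that an
    instruction still in the pipeline is going to write."""
    _, sources = incoming
    return any(dest in sources for dest, _ in pipeline)

def has_war(incoming, pipeline):
    """WAR hazard: the incoming instruction writes a register that an
    instruction already in flight still needs to read."""
    dest, _ = incoming
    return any(dest in sources for _, sources in pipeline)

pipeline = [("r1", ("r2", "r3"))]               # in flight: r1 = r2 op r3
print(has_raw(("r4", ("r1", "r5")), pipeline))  # True: reads r1 too soon
print(has_war(("r2", ("r6", "r7")), pipeline))  # True: overwrites r2 too soon
print(has_raw(("r4", ("r6", "r7")), pipeline))  # False: no conflict
```

When `has_raw` fires, the CPU forwards (or renames); when neither fires, the instruction can issue with no stall at all.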

With that out of the way, let us then consider a pipelined, out of order execution CPU with register renaming where we are guaranteed there will never be a data conflict in the instructions. We see that the renaming and order changing algorithms never trigger, we see the pipeline never forwards, and we see no pipeline bubbles at any point. What does this mean? All the additions to the CPU are unnecessary; they can be removed. All the buffers, all the locking repeaters, all the hardware for the register renaming and the out of order execution disappear. There is no wasted time in the CPU and nothing inside is idle or unused. There is 100% hardware efficiency in the pipeline, with none of the clock cycle lost to buffers and all of the clock cycle in the logic. The only problem with the consideration is that guarantee. There is no way to guarantee all programs will be written without data conflicts unless we use magic. Obviously that hypothetical was entirely useless.

Wait a minute, though. Was that a caveat I used? Indeed it was. Let us explore that caveat: "without the use of magic." Does that imply this is possible with magic? I would argue it does. With the use of magic, it is possible to guarantee a program could always be written without conflicts. What about secret sauce? I think someone here already equated the two, so that makes this a legal move, as long as we adjust the scope of the claim slightly. We can assume that with secret sauce or magic, it should be equally possible to guarantee a processor will only receive programs with no data conflicts. Note what that scope change was: we moved from guaranteeing a program can be written without conflicts to guaranteeing a program can be received without conflicts. This is because secret sauce has no scope over the programmer, only the computer. In fact, the magic can even be removed from the statement, because it has been stated before that secret sauce is not magic.

Now (home stretch), let's look at what out of order execution and register renaming do. I will use identical terminology for a third time. These operations are intended to rearrange and alter the lines of code in order to remove data hazards at runtime. So if data hazards are avoided by rearranging or altering data, and secret sauce removes data hazards, then by the transitive law of hypothetical models (note this law is still a hypothesis), we could say the secret sauce must rearrange or alter data... but when?

Just to hammer in the fact that I have been reusing terminology this whole time, I'm going to quote it.
TSO Wrote:Let's have an extremely short thought experiment. If you don't rearrange or alter data in runtime, and you don't rearrange or alter data at programming time, where would it happen? What other time could possibly exist in the program development flow chart where the data could be rearranged or altered?

How on earth will we solve this conundrum? Is it even possible? I guess we will never know...
But wait (there's more): we already answered this question in a completely different experiment! We already discovered the compiler and the magical compile time! (and there was much rejoicing)

With this magical compiler, we could entirely remove all data hazards. What does that mean? All we need is a compiler that can rearrange and alter instructions whenever a conflict occurs. And what does that mean? It means we can guarantee a program with no data hazards, which means...*Uses copy paste*

TSO Wrote:All the additions to the CPU are unnecessary; they can be removed. All the buffers, all the locking repeaters, all the hardware for the register renaming and the out of order execution disappear. There is no wasted time in the CPU and nothing inside is idle or unused. There is 100% hardware efficiency in the pipeline, with none of the clock cycle lost to buffers and all of the clock cycle in the logic.

Unlike last time, though, we don't need magic to make that guarantee. At the same time, two different points in the discussion have suddenly fused, because they were always the same. One of the secret sauces has been universal to the design from the beginning, always hiding in plain sight for people to infer. You just never took the time to see it.
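The compiler pass being argued for here, rearranging instructions whenever a conflict occurs so the CPU never sees a hazard, can be sketched as a toy scheduler over invented (dest, sources) instruction tuples. This skips several legality checks a real scheduler needs; it only shows an adjacent RAW conflict being removed before the program ever reaches the processor:

```python
def schedule(program):
    """Naive compile-time reordering: if instruction i+1 reads what
    instruction i writes, look for a later independent instruction
    to slot in between so the pipeline never sees the conflict."""
    prog = list(program)
    i = 0
    while i < len(prog) - 1:
        dest, _ = prog[i]
        _, next_sources = prog[i + 1]
        if dest in next_sources:  # adjacent RAW conflict
            for j in range(i + 2, len(prog)):
                d2, s2 = prog[j]
                # candidate must not depend on prog[i] or clobber
                # anything prog[i+1] is about to read
                if dest not in s2 and d2 not in next_sources and d2 != dest:
                    prog.insert(i + 1, prog.pop(j))
                    break
        i += 1
    return prog

program = [("r1", ("r2", "r3")),   # r1 = r2 op r3
           ("r4", ("r1", "r5")),   # RAW on r1 with the line above
           ("r6", ("r7", "r8"))]   # independent
print(schedule(program))
# [('r1', ('r2', 'r3')), ('r6', ('r7', 'r8')), ('r4', ('r1', 'r5'))]
```

After the pass, the two conflicting instructions are no longer adjacent, so a short pipeline never needs to forward, which is exactly the guarantee being claimed.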

And finally, with that in mind, an answer arises to the whole of the argument.
TSO Wrote:Is it the way to go? Do you honestly need that repeater to even lock? If we multiply by accumulation, and place 16 3 tick adders with output connected to input incorrectly so the shift happens on its own, do you really need to hold the other operand's data in a locking repeater, or do we just need it to pause for three ticks at the front of each add? If they are 1 tick adders sixteen blocks long, does the data for the second operand need any pause at all, or does the repeater every fifteen blocks suffice?

Indeed, all that was needed was a repeater with the same delay as the adder, not a buffer, and nothing was ever needed between the adders. You only need to ensure that signals which originated at the same time propagate together. You do not need to lock them in place.

TSO Wrote:If they are 1 tick adders sixteen blocks long, are they really 1 tick... or are they zero tick because redstone needs a repeater every fifteen blocks?
If you direct this multiplier back toward the registers, did you remove the output bussing, or is the bus performing the operations?
Could that be applied to other elements of the processor?
Again, yes: the adder is effectively zero tick because it is no slower than a bus of equivalent length, meaning the adder can be used as the return bus for the operation.

Now, with much preamble (and a shoehorn), we get where I wanted you to go. If the bus operates on the numbers, and opcodes are really just pointers to hardware, could we decode en route? What about the decoders for the register addresses? Could the entire CPU be zero ticks relative to the bus of equal length? I think you know the answer already. If you had read the PM I forwarded to you, you would have already seen all of this post in a much shorter form.

You should all know the answer to that final question, though, because I always give you a, "Yes," for my thought experiments.


RE: I am TSO - Xray_Doc - 10-12-2014

Tl;dr


RE: I am TSO - Magazorb - 10-12-2014

This hot air is too steamy; you're recycling too much. TSO, you have some good ideas, but you're overlooking key details. The Fwd is a good thing to have and utilise: it allows you to forward data from the ALU output back into the ALU input rather than having to write it back before use. This increases serial performance, so you wouldn't want to be rid of it.

So say we have the example I gave, which you seem to misunderstand the setup of: the ALU stage has no buffers in it, only around it. I'll be more realistic this time and use figures based on what we can currently implement with no issues whatsoever: a 6-tick ALU with a 1-tick, CU-assisted forward and a 1-tick repeater-lock stage buffer.
Now I'm going to be unrealistic and give your suggested system as much of an advantage as possible: we shall have a write back of 1 tick, and a read from that new write back of 1 tick. You'll notice that equates to the same as the Fwd, but the difference is that the Fwd only has to go back to the buffer before it, whereas to achieve this speed with the registers you would have to have them all repeater locked with comparative read disabled. In turn this means the memory density is low, and it gives you only approx. 30 dust of signal total (15 before the lock and 15 after) to get all your registers back to the ALU input, and that's ignoring the buffer, which you suggest you don't want.
And what you'll find is that yours is no quicker in serial compute, and that's in your most favourable configuration.
Allow me to explain further: those register setups haven't been done before except maybe in a 4G/SPRM; we always tend to use at least 8 G/SPRs in our PEs. So you can easily see that without the Fwd you can very easily lose serial performance. Unfortunately your oversight is that you've forgotten about serial computations: you thought you could fill the idle slots with out-of-order execution, but there are times where this isn't possible.

The resulting difference is that when a serial computation comes along, which is inevitable at some point, you will lose computations to idle time; this is still a big issue in real-world mass computation.
I guess you could, via pulse timing, achieve a maximum throughput of 3 ticks per instruction, but that's unlikely due to MC being a derp; 5 ticks is likely though. There's nothing stopping our systems from doing that anyhow, providing the data isn't serial, and if it is, then based on current creations all that would do is slow us down to 8 ticks. But with timing comes the issue of how easily an interrupt can cause corruption.

Now to go back to the device you have, the one that gives instructions to other "CPUs" as you called it: how do you intend to attach multiple of these PEs to one of them and fully saturate them when you have so few commands that would actually take up multiple PE cycles, if you will? Besides, it's not like we don't have those functions; it's just that we don't bother to reorder instructions at compile time because of the amount of overhead that would land on whatever has to compile it before you could finally execute. (I'd say it's overhead because it's something that has to be processed before it can have a use, so it does add to making something longer in some ways.)

Heck, I've stated I'm currently planning on making a SIMD array, and addressing all the PEs is a similar issue, because I have to fetch data as quickly as I can while it's computing, and SIMD computation speed vastly outstrips memory speed. That brings the same issue you will have with addressing multiple PEs: you will only be able to use multiples when you have multiples executing instructions that take multiple cycles.
The only difference between my SIMD array and your multi-MIMD PE would be that mine is data-limited and yours is instruction-limited.
When these kinds of things happen you'll need a PE that can stockpile multiple instructions/data coherently.

I've seen other guys do things similar to your suggestions; they gave up, not realising how much they'd gotten themselves into. You seem to know a lot about what you want to do, but you've yet to figure out the shortcomings.


RE: I am TSO - jxu - 10-13-2014

TSO don't listen to them, minecraft is not IRL, even if it's stupid, give it a shot (minecraft always has some interesting quirks)

And I know it's true


RE: I am TSO - TSO - 10-13-2014

By no means has this post been edited, considering I just got done writing an English essay.
(10-12-2014, 03:55 PM)Magazorb Wrote: This hot air is too steamy; you're recycling too much. TSO, you have some good ideas, but you're overlooking key details. The Fwd is a good thing to have and utilise: it allows you to forward data from the ALU output back into the ALU input rather than having to write it back before use. This increases serial performance, so you wouldn't want to be rid of it.

To clarify, I would consider that a write back, while a forward would be something like halting a multiplier in order to inject a partial sum into the middle. That would basically never happen, because you would have needed to calculate the partial sum beforehand... using the multiplier (or using the register AND, the bit shift, the adder, and the forward injection process, which is not faster by any stretch of the imagination).

The question that must be considered, though, is twofold. Is this write back detrimental to the parallel performance of the unit, and can one justify adding one tick to every operation just for a faster write back?

The first question can go either way depending on whether or not you implement the write back so that it doesn't interfere with the input busses. In fact, you can even write back between cycles if the clock is slow enough for the system to have open time between operation repetitions. (Ideally you wouldn't have this, but the instruction pointer decode time dictates that.) In my personal layout, the instruction pointer is designed for parallel pipelined decoding (figured this out last night, it's really cool: easily four ticks, maybe three, but then you're starting to push the limits of the memory) in order to keep up with the clock, so this write back is not possible in my system that way. This leaves us with a buffer at the front of an ALU operation, but in my IS there is no space for a write back command, or for a command to pause the register write that would occur regardless of write back, without some kind of output switching. As you stated, this switching would be one tick at the output, but it would actually lose one tick plus the clock by adding a repeater at the input, because the next instruction has just been halted. Otherwise, our loss at each end is one tick at the front and one tick at the back. But then comes the write back delay: it is the length (in chunks) of the operational unit. So we have a command that goes through, doesn't write to registers, may or may not hold up the command that occurs a few lines later, does not calculate its conditions, and adds two ticks to all operations. Good luck managing that.

This moves into the "is it worth it" part. If the operations are already optimized to bus themselves back to the registers within their own operation delay time, then the distance from their output back to their input is nearly equal to the length of the existing bussing from the registers to the input anyway. The loss between register writing and internal write back is now only the register write time plus the delay to the next operation. If this delay has been optimized with something like my four-tick instruction pointer, it should be no more than three ticks, and the delay is a constant intrinsic to your personal design preference. Our loss is now about six ticks per write back versus two ticks on every single instruction. Now, I'm not stupid enough to claim that the write back does not have its advantage; I'm simply saying that there is a minimum frequency below which it is not advantageous. We'll see what that number is for six operations and ask whether a two-tick loss per operation is really justified. Let's see. We start with two write backs per six operations. By your model we have six operations, so that's a twelve-tick loss for our six operations. The write back loses two ticks plus the two clock cycles for the two held operations, as well as the processing time to repeat the operations. So your loss was 14 ticks, two clock cycles, and two operation delays. My way takes an extra two instructions to command recalculation, plus two operation delays for the calculations, somewhere between two and eight ticks in the clock/write synchronization (a value you control in the design), plus two ticks lost in the two register writes and two ticks lost in the two register reads. So we lost six to twelve ticks, two clock cycles, and two operation delays.

Most of this cancels, leaving the six to twelve ticks against the fourteen ticks: a difference of two to eight ticks. Notice that this is exactly the loss in the constant, user-defined clock delay of my system. So the smaller the clock delay loss is on my system, the more efficient write back avoidance becomes. For the long delay, mine is two ticks faster, so a write back content of 33% makes the two about equal. For the short delay, the difference is eight ticks in my favor. Now, looking closely, for each write back we add into the six, the loss is two ticks per write back on mine (at the short clock delay) and one tick per write back on yours. So we become equally efficient if the code contains 66% write backs. The question is now: with pipelined hardware multiply and divide, how often do we encounter that many write backs? This is a question for the designer about the intended operations, and I don't think more than 2/3 of the code is intended to write back for most of what I can think of. There are still huge issues with a fully serialized group of instructions, but this is avoided as best as possible by having the faster operations run at their own speed instead of being limited by the speed of the slower operations. Of course, as you said, this is still highly limited if a long operation is called.
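For readers keeping score, the tick accounting in the last two paragraphs can be sketched in a few lines. This is only a model of the figures as stated in the thread (2 ticks of buffering per operation, 1 tick per forward, 1-tick register writes and reads, and a designer-chosen 2 to 8 tick clock sync); the function names are mine:

```python
def forwarding_loss(ops, forwards):
    # The forwarding scheme as I read it: 2 ticks of stage buffering on
    # every operation, plus 1 tick per forwarded result.
    return 2 * ops + forwards

def writeback_loss(writebacks, clock_sync_ticks):
    # The write-back scheme: per write back, 1 tick to write and 1 tick to
    # re-read the register, plus 2 to 8 ticks of clock/write sync in total.
    return 2 * writebacks + clock_sync_ticks

# Six operations, two of which depend on the previous result:
print(forwarding_loss(6, 2))   # 14 ticks
print(writeback_loss(2, 8))    # 12 ticks (long sync):  2 ticks apart
print(writeback_loss(2, 2))    # 6 ticks  (short sync): 8 ticks apart
```

The break-even points quoted in the thread (33% and 66% write-back content) fall out of sweeping the write-back fraction through this same model.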

Now, finally, for what you mentioned about me being limited to memory speed. I will admit that my method has a loss in code density: for every write back, I need one extra line of code. Whether or not yours needs the extra line is up to you and your microcoding decisions. But if I can decode a 7-bit instruction pointer in four ticks, we could venture to say that I could decode a 16-bit address in 8 ticks (and I can). Now, if I load more than one memory location at a time (say... oh... four lines), we remove two ticks from the decode, because four memory locations can be addressed by the same number, and we get four and a half lines per every two instructions completed. So we are doing okay there.
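One piece of that paging argument is pure arithmetic: fetching a power-of-two block of consecutive lines per access removes that many bits from the address that has to be decoded. A sketch (helper name mine; I make no claim here about how address bits map to decode ticks, since that depends on the decoder layout):

```python
import math

def address_bits(memory_lines, lines_per_page):
    # Fetching lines_per_page (a power of two) consecutive lines per access
    # means each page shares one address, dropping log2(lines_per_page) bits.
    return int(math.log2(memory_lines)) - int(math.log2(lines_per_page))

print(address_bits(2**16, 1))  # 16 bits decoded per single-line fetch
print(address_bits(2**16, 4))  # 14 bits: a shorter, faster decode
```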

The issue with my layout isn't instruction paging per se; it's probably the conditional branching. I have not yet given this system a way to prevent instructions that are not in the branch but are already in the pipeline from being executed. I think I know how I can do this, but it seems like I am putting a massive load on my instruction pointer and register write hardware: everything depends on their ability to block register writing by the incorrect operations while allowing register writing by the correct operations. This sounds straightforward at first: just block all the writes after the jump for the length of the pipeline. But then we remember that not everything takes the same amount of time to compute. A divide issued before the jump could arrive at the registers after a shift on the wrong side of the jump. Now we have to block the shift, allow the divide, ensure the divide doesn't have a data conflict with the jump, and ensure the divide's conditions don't nullify the jump.
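The branch hazard described above can be shown with a toy completion-time model. All names and numbers below are illustrative, not from the actual design: the point is that squashing must key off issue order, while the naive "block every write arriving within a pipeline length of the jump" rule also kills the slow divide that was issued before the jump:

```python
# (name, issue_tick, latency_ticks); the jump issues at tick 3 and takes 3.
ops = [("div", 0, 12), ("jmp", 3, 3), ("shl", 6, 3)]
JUMP_ISSUE, JUMP_LATENCY, PIPE_LEN = 3, 3, 9

# Correct rule: squash only operations *issued* after the taken jump.
kept = [name for name, issue, _ in ops if issue <= JUMP_ISSUE]
print(kept)  # ['div', 'jmp'] -- the early divide survives, the shift dies

# Naive rule: block every register write *arriving* within a pipeline
# length after the jump resolves. The slow divide gets caught as well.
jump_done = JUMP_ISSUE + JUMP_LATENCY
blocked = [name for name, issue, lat in ops
           if jump_done < issue + lat <= jump_done + PIPE_LEN]
print(blocked)  # ['div', 'shl'] -- the divide is wrongly blocked too
```

This is exactly the divide-arriving-after-the-shift case in the paragraph above: arrival order and issue order disagree when latencies differ.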

Interrupts wouldn't cause data errors in and of themselves, but an interrupt could use a register that is already in use, so there need to be special interrupt registers, and a separate interrupt bus.

As for the compiler, if you run the program more than a certain number of times (like four or five), the losses in compile time start to pay for themselves. This computer is not designed to run a program only once.

I think the difference between myself and these "other guys" is that I can acknowledge that my system does have downsides, but I also aim at making a balance between the downsides and the efficiency of removing as much as possible. I think that I do see what all the problems will be, but never declared the caveats when I described parts of this system because I thought you would all see and understand them.


RE: I am TSO - Xray_Doc - 10-13-2014

(10-12-2014, 02:28 PM)Xray_Doc Wrote: Tl;dr



RE: I am TSO - Magazorb - 10-14-2014

I can only speak for myself when I say that I see and understand the drawbacks your idea would suggest; when I ask about them, I'm really asking how you intend to minimize those effects.

Fwd in our terminology seems different from yours, as does write back, so between what I had tried to point out and what you have pointed out, we got lost in translation.

By write back I mean writing back to the GPRs, and by Fwd I mean writing back to the ALU input from the ALU output. These aren't particularly advanced meanings, if anything overly simplistic, and you may find them a little awkward at first, but truly no two institutes have the exact same technical terminology.

As for the "compiler" (I'm a little unsure whether the device you describe technically counts as a full compiler, though I'm aware it's not far off): could it do this in a single pass, given an unlimited number of tags, if you will?
My only suggestion is to just have it continuously cycle over the code until it comes up with no more logical improvements, in other words until it stops optimizing.

I'll hand it to you that you handle the bulk stuff quite well, as in allowing for maximum throughput; however, I must comment that under the setups you've described there do seem to be a few cases where it will struggle.

Also, I don't think you quite followed the instruction stuff fully. I was assuming that you could pulse it as you would other stuff and get a throughput of 3 ticks/instruction, and I was then expanding on that by asking how you could make effective use of a 2nd PE given how much time you have between fetching instructions for the first PE, as well as how you'd deal with keeping the data streams between PE1 and PE2 coherent.

Or correct me if I'm wrong, but you do wish to only pulse data through the device and not use buffers, controlling the logic via timing, so that regardless of the stage size of something you could always maintain an instruction every 3 ticks.


RE: I am TSO - TSO - 10-14-2014

What you call write back is how my computer always runs. The only thing it can do is write back, unless an instruction to external memory is given.

I call write back what you call forwarding. Mine simply does not have the ability to do what you call forwarding; there is no way to manage that inside the clock and instruction set that I want. Therefore I am stuck with a complete register write between each repetition, but because the busses from the registers back to the processing element for any operation are (usually) exactly the same physical length as the operation itself, it isn't losing as much time in bussing delay.

For the most part what we have is identical; the only difference is that mine isn't losing time with the two ticks on the front and back of each ALU element. So all I had to do was figure out how much time is lost in register writing and reading in order to see at what point your forward becomes more efficient than my lack thereof. For the really big operations, this takes care of itself so well that the amount of code requiring a forward became unreasonably large before yours was any more efficient. For the smaller and faster operations there is more of a loss, and the amount of code required to make forwarding more efficient was smaller. I finally showed that the phase arrangement of the register write relative to the clock (and therefore the next read) is the main determinant of the range where mine is more efficient than yours.

The compiler is a program. There happens to be a component designed to make things easier for the compiler because it is able to check conflicts by using hardware, but the compiler is a program that will arrange things based on how conflicts and bubbles present themselves, which is all predictable. (I don't actually know how to write a compiler, but we'll get there when we get there.)

Well, sort of... instructions can be decoded in three ticks, but the throughput of any particular instruction varies. So there can be a three-tick clock with an instruction issued every three ticks, but the operations don't write back in the same amount of time. They are all multiples of three ticks, but the operations definitely return out of order.
This machine reaches its maximum potential running multiple programs that don't share too many of the same operations and don't conflict in their data.
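That out-of-order return is easy to visualise with a toy schedule: issue one instruction every three ticks, give each op a latency that is a multiple of three, and sort by completion time (the ops and latencies below are made up for illustration):

```python
# (name, latency_ticks); one instruction issues every 3 ticks, and every
# latency is a multiple of 3.
program = [("mul", 12), ("add", 3), ("shl", 3)]

completions = sorted((3 * i + lat, name)
                     for i, (name, lat) in enumerate(program))
print([name for _, name in completions])  # ['add', 'shl', 'mul']
```

Issue order is mul, add, shl; return order is add, shl, mul, which is exactly the "definitely return out of order" behaviour described above.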

You missed a tiny bit of how multicore would be implemented, and I missed a bit about some of the caveats. Each core would have its own cache of program data, which reads once every clock. The paging system is part of the external hardware. The number of busses to an individual core's cache would equal the number of instructions paged in at a time. So for one core, if it takes eight ticks to get an address in main memory and four ticks to read an address out of the program memory cache, we need to address four lines of code out of main memory at a time and have four data busses. For two cores, the main memory can still write four lines of code at a time and switch between the two cores every other write. The cores would not be able to share busses. If sixteen lines of code are paged at a time, then parallel pipelined decoding of an address in the main memory will take four ticks; if paged 32 at a time, three ticks. This is the main reason I need to figure out some sort of balance in the memory, because 32 busses is far too many to work with, but it is the fastest possible amount to page and allows for more processors. After a point, more than one read unit is needed, but each read unit could easily address out to at least 32 processors if it writes 32 lines of code at a time to each one. Do note that my computer is the size of its ALU and program memory, so cramming 32 of these computers together is no problem if you can figure out a bussing solution. At that point they probably would share busses, but they would not all share the same one, and you could not consecutively write to two that share a bus. My design is also really tall.
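The two-core figure follows from a simple throughput balance: main memory supplies a page of lines every memory-latency period, and each core consumes one line per cache read. A sketch under those steady-state assumptions (helper name mine; bus conflicts ignored):

```python
def cores_fed(lines_paged, mem_latency_ticks, cache_read_ticks):
    # Supply: lines_paged lines delivered every mem_latency_ticks.
    # Demand per core: one line consumed every cache_read_ticks.
    return (lines_paged / mem_latency_ticks) * cache_read_ticks

# Four lines paged per 8-tick main-memory access, 4-tick cache reads:
print(cores_fed(4, 8, 4))  # 2.0 -- the two-core case described above
```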

This final assumption is absolutely correct. No stopping, no clocking: once data leaves a unit, it's the next unit's problem to deal with. The write location and conditions also ship out with the data so that everything is organized and stays together. Everything decodes when it needs to. The clock is only at the register and program memory read; the write is not clocked. Imagine it as the most organized set of race conditions ever created: absolutely everything depends on the data being somewhere exactly when it is supposed to be, with tolerance inside of a tick. The only reason I can get away with this is that Minecraft flip flops stabilize instantly and a 1-tick tolerance is extremely easy to synchronize.