My CPU

PaukkuPalikka · (This post was last modified: 07-01-2018, 09:44 PM by PaukkuPalikka.)

This time for real.

This thread serves two purposes:
1. progress updates and possibly discussion
2. increasing the likelihood of the project actually getting finished

Expecting to finish before August 1st 2019, but this may fluctuate in one direction or another.

More information soon [TM] (sometime next week).

Trecar · 07-03-2018, 11:26 AM

Will there be a branch predictor?

PaukkuPalikka

Trecar Wrote: Will there be a branch predictor?

Most likely at least a direction predictor. There will almost definitely not be a branch target buffer, as it is very laggy, and it's possible to accomplish the same thing by hoisting the branch (or parts of it, maybe with an EPIC-style "prepare-to-branch" instruction) ahead of time. This way the branch prediction hardware runs only when a branch is known to be executed. Hoisting up the branch along with the comparison will increase the misprediction penalty, which can in turn be traded for a delay slot. Delay slots are likely to be a thing, and I think they will only rarely be wasted.
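
For illustration only, here is a minimal sketch of one common kind of direction predictor: a small table of 2-bit saturating counters indexed by the low bits of the branch address. Nothing here is decided for this CPU; the class name, the table size of 16 and the initial counter value are all assumptions made for the example.

```python
class DirectionPredictor:
    """Table of 2-bit saturating counters (hypothetical sizes, not a design decision)."""

    def __init__(self, entries=16):
        self.entries = entries
        # Counter values 0-1 predict "not taken", 2-3 predict "taken".
        # Every counter starts at 1 (weakly not taken), an arbitrary choice.
        self.counters = [1] * entries

    def _index(self, pc):
        # Index the table with the low bits of the branch address.
        return pc % self.entries

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        # Move the counter one step toward the actual outcome, saturating at 0 and 3.
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def count_mispredictions(predictor, pc, outcomes):
    """Run one branch through the predictor and count wrong guesses."""
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses
```

Feeding it a loop branch that is taken nine times and then falls through shows the usual behaviour: one miss while warming up, then one miss per loop exit.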

I don't know when this will be decided.

LambdaPI · (This post was last modified: 07-03-2018, 07:06 PM by LambdaPI.)

But will it support virtual memory? And will it have any data/instruction caches?

-probably dont need virtual memory for minecraft.

PaukkuPalikka · (This post was last modified: 07-03-2018, 08:08 PM by PaukkuPalikka.)

LambdaPI Wrote: But will it support virtual memory? And will it have any data/instruction caches?

-probably dont need virtual memory for minecraft.

It will have data and instruction caches for sure. I haven't thought about the details yet, though; the plan is to run simulations before committing to any specifics, as caching can easily end up being a performance bottleneck. For the data cache, something like 4x 8 byte blocks could be possible, but that's likely to change. There will also be a queue system for memory operations. More details on that soon [TM].
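
To give an idea of what such a simulation could look like, here is a small sketch of a cache model with 4 blocks of 8 bytes. It assumes a direct-mapped organisation, which is purely my assumption for the example (the post above does not fix the associativity), and the names `TinyCache` and `hit_rate` are made up.

```python
class TinyCache:
    """Direct-mapped cache model that only tracks tags, hits and misses.

    The direct-mapped layout is an assumption for this sketch.
    """

    def __init__(self, num_blocks=4, block_size=8):
        self.num_blocks = num_blocks
        self.block_size = block_size
        self.tags = [None] * num_blocks  # None means the block is empty
        self.hits = 0
        self.misses = 0

    def access(self, address):
        """Return True on a hit, False on a miss (the block is then filled)."""
        block_number = address // self.block_size
        index = block_number % self.num_blocks
        tag = block_number // self.num_blocks
        if self.tags[index] == tag:
            self.hits += 1
            return True
        self.tags[index] = tag
        self.misses += 1
        return False


def hit_rate(cache, addresses):
    """Replay an address trace and return the fraction of accesses that hit."""
    for address in addresses:
        cache.access(address)
    total = cache.hits + cache.misses
    return cache.hits / total if total else 0.0
```

Replaying address traces from test programs through a model like this, with different block counts and sizes, is one way to compare configurations before building anything.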

As for virtual memory... I haven't given much thought to it yet (in the context of this project specifically), but if it is going to be implemented, it would probably be a single address space that's shared for everything. That would allow for a virtually tagged cache without aliasing checking, so translation could be deferred to lower levels of the memory hierarchy. There is one thing where having address translation would be very useful (might also require a high page granularity), but more details on that later once I get there :P

PaukkuPalikka · (This post was last modified: 07-08-2018, 08:09 PM by PaukkuPalikka.)

Update 1
So this memory queue is an idea I had a while ago after reading about decoupled access-execute architectures (James E. Smith, 1984, "Decoupled Access/Execute Computer Architectures"). The basic idea in DAE is to partition the instruction stream, statically or dynamically, into two separate streams: one for memory access-related instructions and another for the rest. These streams then communicate through architecturally exposed queues, which allows the streams to "slip" ahead of one another and execute out of order with respect to each other. Memory access can then be performed before the data is actually needed.

The actual memory queue is similar in many ways to the load/store queues used in out-of-order processors. The load queue is exposed as an architectural queue (an idea from DAE), mapped to a register name for both reads and writes. This has some important effects: register pressure is relieved a bit and cache misses can be better overlapped with other execution. Hoisting loads upwards in code is a common technique in static scheduling, but it has a few drawbacks. Without modifying the architecture, there would have to be hardware that marks a register as invalid in case of a cache miss and generates a stall if an invalid register is accessed. Some other solutions to this have been proposed. The Mill uses deferred loads, which retire after a certain number of cycles, and pickup loads, which associate a name with each loaded value, with which it can later be read from a buffer. Another, more general way is to use prefetch instructions. The queue effectively uses the . The CPU will stall if the queue is read while the queue head is not ready.
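
A toy model of the stall rule might make this clearer. This is only a sketch and not the actual queue design: the `LoadQueue` class, its method names and the latencies of 2 and 10 cycles are invented for the example, and the value is captured at issue time for simplicity.

```python
from collections import deque


class LoadQueue:
    """Toy model of an architecturally visible load queue (hypothetical)."""

    def __init__(self, memory, hit_latency=2, miss_latency=10):
        self.memory = memory              # dict: address -> value
        self.hit_latency = hit_latency    # assumed latencies, not real numbers
        self.miss_latency = miss_latency
        self.entries = deque()            # (ready_cycle, value), oldest first
        self.cycle = 0
        self.stall_cycles = 0

    def tick(self, n=1):
        """Advance time by n cycles of unrelated work."""
        self.cycle += n

    def issue_load(self, address, hit=True):
        """Push a load; its value becomes readable after the access latency."""
        latency = self.hit_latency if hit else self.miss_latency
        # Simplification: the value is captured now, only its arrival is delayed.
        self.entries.append((self.cycle + latency, self.memory[address]))

    def read(self):
        """Pop the queue head, stalling until it is ready.

        Reading an empty queue raises IndexError in this model.
        """
        ready_cycle, value = self.entries.popleft()
        if ready_cycle > self.cycle:
            self.stall_cycles += ready_cycle - self.cycle
            self.cycle = ready_cycle
        return value
```

If a missing load is issued and read only 4 cycles later, this model stalls for 6 cycles; if the load is hoisted far enough that 12 cycles of other work sit between issue and read, there is no stall at all.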

So loads can be hoisted up in code to tolerate latency. What about stores? A store can miss in the cache too, and in my case it will have to wait for the cache block to be fetched first. OOO processors also buffer stores for this purpose and provide store-to-load forwarding. My queue will do the same (it doesn't actually have to be a queue for stores, but having them both in the same structure simplifies things). Another thing that can be done is store merging: if a store to the same address already exists and is waiting in the queue, its value can simply be overwritten by the new store. This would not be possible without store-to-load forwarding.
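
As a rough illustration of store-to-load forwarding and store merging together, here is a sketch of a store buffer keyed by address. It is not the real queue: `StoreBuffer` and its methods are hypothetical names, and addresses are treated as whole words, so partial overlaps are ignored.

```python
class StoreBuffer:
    """Toy store buffer with forwarding and merging (hypothetical model)."""

    def __init__(self, memory):
        self.memory = memory   # dict: address -> value, stands in for the cache
        self.pending = {}      # address -> value; dicts keep insertion order

    def store(self, address, value):
        """Buffer a store. Returns True if it merged into a waiting store."""
        merged = address in self.pending
        # Merging: a waiting store to the same address is simply overwritten.
        self.pending[address] = value
        return merged

    def load(self, address):
        """Forward from a waiting store if there is one, else read memory."""
        if address in self.pending:
            return self.pending[address]
        return self.memory.get(address, 0)

    def drain_one(self):
        """Write the oldest waiting store to memory; return its address or None."""
        if not self.pending:
            return None
        address = next(iter(self.pending))
        self.memory[address] = self.pending.pop(address)
        return address
```

In this model a load always sees the newest buffered value for its address, so overwriting an older waiting store never loses a value that a later load still needs.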

Here is how the queue will work:
[bla bla to be updated later]

I've also started working on a test compiler to get some statistics about the occurrence of different instructions and scheduling stuff.

EEVV

CP(auk)U :o
