07-08-2018, 07:56 PM
(This post was last modified: 07-08-2018, 08:09 PM by PaukkuPalikka.)
Update 1
So this memory queue is an idea I had a while ago after reading about decoupled access-execute architectures (James E. Smith (1984), Decoupled Access/Execute Computer Architectures). The basic idea in DAE is to partition the instruction stream, statically or dynamically, into two separate streams: one for memory access-related instructions and another for everything else. These streams then communicate through architecturally exposed queues, which allows them to "slip" ahead of one another and execute out of order with respect to each other. Memory accesses can then be performed before the data is actually needed.
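To make that concrete, here is a toy Python sketch of the slip idea (the memory contents, addresses and function names are just made up for illustration): the access stream runs ahead, issuing loads and pushing the results into a queue, and the execute stream pops them later when it actually needs the operands.
[code]
from collections import deque

# Toy decoupled access/execute model: the access stream runs ahead and
# pushes loaded values into a queue; the execute stream pops them later.
memory = {0x10: 7, 0x14: 35}           # pretend data memory
load_queue = deque()                    # architecturally exposed queue

def access_stream(addresses):
    # Access stream: issue all the loads up front (it has "slipped" ahead).
    for addr in addresses:
        load_queue.append(memory[addr])

def execute_stream(n):
    # Execute stream: consume loaded values whenever it needs operands.
    total = 0
    for _ in range(n):
        total += load_queue.popleft()   # would stall here if the queue were empty
    return total

access_stream([0x10, 0x14])
print(execute_stream(2))                # 42
[/code]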
The actual memory queue is similar in many ways to the load/store queues used in out-of-order processors. The load queue is exposed as an architectural queue (an idea from DAE), read- and write-mapped to a register name. This has some important effects: register pressure is relieved a bit, and cache misses can be better overlapped with other execution. Hoisting loads upwards in code is a common technique in static scheduling, but it has a few drawbacks: without modifying the architecture, there would have to be hardware that marks a register as invalid on a cache miss and generates a stall when an invalid register is accessed. Some other solutions to this have been proposed. The Mill uses deferred loads, which retire after a fixed number of cycles, and pickup loads, which associate a name with each loaded value so it can later be read from a buffer. Another, more general way is to use prefetch instructions. The queue handles this naturally: the CPU will stall only if the queue is read while the queue head is not yet ready.
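Roughly, the semantics I mean could be sketched like this (a toy Python model; the fixed miss latency and the cycle counting are purely illustrative assumptions, not the real design):
[code]
class LoadQueueEntry:
    def __init__(self, value, ready_at):
        self.value = value
        self.ready_at = ready_at        # cycle at which the cache fill completes

class LoadQueue:
    """Toy model: the queue head is read-mapped to a register name.
    Reading that register pops the head; if the head entry is not ready
    yet (cache miss still outstanding), the read stalls until it is."""
    MISS_LATENCY = 20                   # illustrative miss latency in cycles

    def __init__(self):
        self.entries = []
        self.cycle = 0

    def issue_load(self, value, hit=True):
        latency = 1 if hit else self.MISS_LATENCY
        self.entries.append(LoadQueueEntry(value, self.cycle + latency))

    def read_head(self):
        head = self.entries.pop(0)
        if self.cycle < head.ready_at:
            self.cycle = head.ready_at  # stall: wait for the data to arrive
        return head.value

q = LoadQueue()
q.issue_load(7, hit=False)              # hoisted load that misses in the cache
q.cycle += 10                           # ...other useful work overlaps the miss
print(q.read_head(), "at cycle", q.cycle)  # stalls only for the remaining cycles
[/code]
The point of the hoisting is visible at the end: the 10 cycles of other work hide part of the miss, so the read only stalls for what is left.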
So loads can be hoisted up in code to tolerate latency. What about stores? A store can miss in the cache too, and in my case it will have to wait for the cache block to be fetched first. OOO processors buffer stores for this purpose as well, and provide store-to-load forwarding. My queue will do the same (it doesn't actually have to be a queue for stores, but keeping both in the same structure simplifies things). Another thing that can be done is store merging: if a store to the same address is already waiting in the queue, its value can simply be overwritten by the new store. This would not be possible without store-to-load forwarding.
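As a rough illustration of the forwarding and merging behaviour (again just a toy Python model with made-up names; the real structure would also track readiness and ordering):
[code]
class StoreBuffer:
    """Toy model of buffered stores with store-to-load forwarding
    and store merging."""
    def __init__(self, memory):
        self.memory = memory
        self.pending = {}               # addr -> value waiting to be written

    def store(self, addr, value):
        # Store merging: a newer store to the same address simply
        # overwrites the value still waiting in the buffer.
        self.pending[addr] = value

    def load(self, addr):
        # Store-to-load forwarding: a load first checks the buffer,
        # so it sees the youngest pending store instead of stale memory.
        if addr in self.pending:
            return self.pending[addr]
        return self.memory[addr]

    def drain(self):
        # Eventually the buffered stores are written back to memory.
        self.memory.update(self.pending)
        self.pending.clear()

mem = {0x20: 0}
sb = StoreBuffer(mem)
sb.store(0x20, 1)
sb.store(0x20, 2)                       # merged with the previous store
print(sb.load(0x20))                    # 2, forwarded from the buffer
sb.drain()
print(mem[0x20])                        # 2
[/code]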
Here is how the queue will work:
[bla bla to be updated later]
I've also started working on a test compiler to get some statistics about the occurrence of different instructions and about scheduling.