11-17-2014, 05:17 PM
(Note that 100% of this was written before my first reply to this thread. Also, due to time constraints, do not expect the promised assembly function explanations to come any time soon, if at all. I have too much work to do at the moment.)
@maga:
Before I start with the quoting, I will immediately say that you have made several mistakes with assumptions about single and double precision compatibility. I will make an edit sometime tomorrow that details the exact method that 32- and 64-bit CPUs would deal with each type, as well as how 32- and 64-bit FPUs would deal with it (an FPU has hardware support for at least one of the precision types). To make things confusing, the CPU stuff will all be written for x86 without any debugging or checking (good luck actually making it work), in AT&T syntax (you have been warned). Convenient ASCII art diagram thingies might be included, because I'm not a total monster and vim actually makes their creation pretty easy; the 32-bit CPU code will also contain a diagram of The Stack, because I ran out of registers after about three lines of code. Do note that I'm not really using The Stack as a stack, just as an extremely convenient and rather large register. (Also, another great quote from my father I just got this evening: "Nobody actually codes in assembly anymore because computers are so fast now... hell, modern computers are so fast they can almost run Java." His programming quotes are the best...)
(11-14-2014, 03:07 AM)Magazorb Wrote: Somehow I doubt Nvidia will make much of a dent... do remember that double precision performance (programmers and developers are starting to make more use of it for various reasons) on Nvidia is often bad, no idea how. GTX Titan Blacks are somewhere between 8 and 9 GFlops of single precision floats (can't remember the exact number for base clock speed) but only around 2.6 GFlops of double (I don't think they use 64-bit processors, rather emulated 64-bit with 32-bit processors, maybe fancy CU stuff, that's normally Nvidia's style). The 290X has around the same amount of double precision for significantly less, and scales perfectly for single precision, meaning twice the speed of double.
Only 3 GFlops? My laptop's GPU (Intel HD 3000) does 129 GFlops at 1.3 GHz with 12 cores (about 4600 fewer than the cited GPUs) and whatever the Sandy Bridge I/O bus speed is for the bandwidth, an Xbox does 137 GFlops at about 60 GB/s bandwidth, and my i5 can do even better than that thanks to the AVX extensions that nobody uses (256-bit processing using all four threads). Methinks you hath misplaced some decimals (though I think the HD graphics unit is 64-bit, and the i5 certainly is).
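(For reference, peak figures like those are just cores x clock x FLOPs per cycle per core. Here's a minimal C sketch of that arithmetic; the per-cycle throughput numbers and the i5 clock are my own rough assumptions, not datasheet values.)
[code]
#include <stdio.h>

/* Peak FLOPS is just: cores x clock x FLOPs-per-cycle-per-core.
 * The per-cycle figures below are rough assumptions for illustration,
 * not datasheet values. */
static double peak_gflops(double cores, double clock_ghz, double flops_per_cycle)
{
    return cores * clock_ghz * flops_per_cycle;
}

int main(void)
{
    /* HD 3000: 12 EUs at ~1.3 GHz; assume ~8 FLOPs per cycle per EU */
    printf("HD 3000 : ~%.0f GFLOPS\n", peak_gflops(12, 1.3, 8));

    /* Quad-core i5 with AVX, ~3 GHz assumed: Sandy Bridge can issue one
     * 8-wide single-precision add and one 8-wide multiply per cycle,
     * so call it 16 FLOPs per cycle per core. */
    printf("i5 + AVX: ~%.0f GFLOPS\n", peak_gflops(4, 3.0, 16));

    /* A GPU really doing only 2.6 GFLOPS would be slower than one scalar
     * CPU core, which is why the quoted figure is almost certainly TFLOPS. */
    return 0;
}
[/code]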
magazorb Wrote: Yes, GTX 770 and up take the single precision win against AMD... but with no double support until you reach the Titan on Nvidia's consumer line-up, there will probably be plenty of games that run badly with Nvidia; Star Citizen for example, which will use doubles (currently single, but the engine is being modded to support double).
My thoughts about why it's using a 32-bit CUDA processor over 64-bit are mainly down to the fact that its double precision performance is merely 1/3 of its single precision performance. Normally you would think that running twice as many 32-bit processors instead of 64-bit ones would have the same effect as a 64-bit processor that can process two 32-bit values when required, or even two 32-bit processors running with carry conditionally chained in series based on which precision is used.
No. Two 32-bit processors running 32-bit single precision will be mutually independent. A 64-bit processor cannot perform two single-precision operations simultaneously without many, many complications, because the single-precision operations still need to be independent. So either the system needs to correct wrong carries and mask the exponents back into the lower half, while masking in the leading digits of the upper number and then removing the bit 31 carry and accounting for the bit 22 carry if they do carry (this adds quite a few steps to the execution), or it needs two parallel single-precision hardware units and one double-precision hardware unit, all sharing the same 64-bit bus, with an opcode extension choosing between them (so now we need to invoke something like SADDQ or DADDQ for our single- and double-precision opcodes; this makes each 64-bit processor four times larger than each 32-bit processor, and that doesn't account for the extra hardware needed to support the additional operations). Instead, double-precision processors tend to pad the exponent and mantissa portions with leading zeroes, therefore being exactly the same speed as one 32-bit processor (half as fast as two of them). All of this assumes the execution is done in dedicated hardware. If software is used, the difference is in which bits are masked out for which part of each process, and then some method is used to prevent the mantissa operands from interfering while monitoring the carries in the upper and lower halves: for example, separating them with three zeroes in the case of an add and then monitoring the carry out of bit 23 (hint: it lands in bit 24), adding one to the exponent of the corresponding number if there is a carry (the exponents are in a different register), bit-shifting only the value that carried, doing nothing to the signs (which are in yet another register), and then masking all three registers back together to get our two single-precision numbers. This actually has a very small loss compared to double precision, which comes from the extra shift on the lower value (it has to decide which of four masks to use when recombining the three registers).
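Since the promised asm isn't coming any time soon, here's a stripped-down C sketch of the carry problem instead, using plain integers rather than the full sign/exponent/mantissa dance. The principle is the same: a carry out of the low half lands in bit 32 and corrupts the upper result unless you pad, mask, or correct it.
[code]
#include <stdint.h>
#include <stdio.h>

/* Two independent 32-bit adds done naively as one 64-bit add:
 * the carry out of the low halves spills into bit 32 and corrupts
 * the upper result. */
int main(void)
{
    uint32_t a_hi = 5, a_lo = 0xFFFFFFFFu;   /* low add will carry */
    uint32_t b_hi = 7, b_lo = 1;

    uint64_t a = ((uint64_t)a_hi << 32) | a_lo;
    uint64_t b = ((uint64_t)b_hi << 32) | b_lo;

    uint64_t naive = a + b;
    printf("naive upper result: %u (wanted %u)\n",
           (uint32_t)(naive >> 32), a_hi + b_hi);      /* 13, not 12 */

    /* One software fix: recompute the low-half carry and subtract it
     * back out of the upper half. Real packed hardware simply cuts the
     * carry chain at bit 32 instead. */
    uint64_t carry = ((uint64_t)a_lo + b_lo) >> 32;
    printf("fixed upper result: %u\n", (uint32_t)((naive >> 32) - carry));

    return 0;
}
[/code]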
magazorb Wrote: However, if we think about this in further depth, we know that to keep good clock speeds we are required to pipeline. This then becomes an issue if your computation of a 64-bit value is based on running the lower half in first and chasing the second half in after; of course, doing this you have to wait for the first half to compute and for the carry flag to cycle and be ready. In a processor with a short pipeline this probably wouldn't be much of an issue, but in more complex pipelines it may not be ready by the next cycle (this is very easily the case: if you want really high throughput you would generally have stage buffers at both the input and output of the execution stage, and in doing so create an issue where, to keep clock speed up, you make the data loop 3 cycles instead of the 2 you'd have with only an input buffer).
I'd like to remind you that this is just my suspicion; I don't know for sure why single precision is 3 times faster than double.
This depends on design choices, but the extra cycles more likely come from the actual composition of the single- and double-precision numbers in the registers. They have to be parsed a few more times and then put back together again when not using double precision on 64-bit or single precision on 32-bit. (I will edit in some x86-like asm to show you what each system would have to do.)
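In the meantime, here's roughly what "parsed a few more times and put back together" means in C for one IEEE-754 single; a software-packed scheme has to do this kind of shuffling for two values per 64-bit register around every operation, and that's where the extra cycles hide.
[code]
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Split one IEEE-754 single into sign / exponent / mantissa and
 * reassemble it, i.e. the parsing a packed-float scheme pays for. */
int main(void)
{
    float f = -6.75f;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);          /* safe type-pun */

    uint32_t sign     = bits >> 31;
    uint32_t exponent = (bits >> 23) & 0xFF;
    uint32_t mantissa = bits & 0x7FFFFF;
    printf("sign=%u exp=%u (unbiased %d) mantissa=0x%06X\n",
           sign, exponent, (int)exponent - 127, mantissa);

    /* ...the actual arithmetic would happen here, with the three
     * fields sitting in separate registers... */

    uint32_t rebuilt = (sign << 31) | (exponent << 23) | mantissa;
    float g;
    memcpy(&g, &rebuilt, sizeof g);
    printf("rebuilt: %g\n", g);              /* -6.75 again */
    return 0;
}
[/code]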
magazorb Wrote: I suspect that this means Nvidia will have to address this if they wish to remain performance king, and if my suspicion of them using 32-bit processors is correct, then that will take up a lot of their new effective space, thus resulting in not many actual things being put on....
Seems legit, except that (I'll use your layout of the 390X) it sounds more like they are all composed of either 4096 one-bit processors, 64 individual 64-bit processors (since you could view each individual bit trace in an ALU as a processing unit in its own right), or 64 RISC processors with 64 possible operations, where each ALU-internal unit (like each adder, shifter, multiplier, etc.) is being referred to as a processing unit. No matter what, it seems fairly obvious that these are all MEG compatible (the Marketing Engineering instruction set extension).
magazorb Wrote: If I was to make as educated a guess as I could, based on the way Nvidia seems to go, it would probably be either of the following:
Based on 980 arch:
6 GPCs, 4 SMMs on each GPC and 128 CUDA cores on each SMM == 3072 CUDA cores
or
Based on GTXTB arch:
6 GPCs, 3 SMXs on each GPC and 256 CUDA cores on each SMX == 4608
Both are being optimistic, but both seem possible. The GTXTB, from what I can see of the arch, has:
256 CUDA cores physically on each SMX (3840), however only 192 functional on each SMX (2880),
but for that reason it wouldn't surprise me at all to see a:
6 GPCs with 3 SMXs each and 192 CUDA cores on each SMX == 3456.
It should be noted that, from what I can find, the 980 has 4 GPCs with 4 SMMs each and 128 CUDA cores both physical and functional, while the GTXTB has 5 GPCs with 3 SMXs each, 256 (3840) physical CUDA cores and 192 (2880) functional.
I scaled both up to 6 GPCs based on how much extra effective room I expected the new die to have over the old ones they were based on; this assumes that the issue I propose of Nvidia using a 32-bit processor isn't upgraded to 64-bit.
It also seems safe to assume Nvidia would use a 6x64b (384b) memory interface clocked at 1750 MHz (7 GHz effective); however, with an increase in cores it wouldn't be unreasonable for them to use either an even higher clocked memory controller or more of them (maybe an 8x64b (512b) memory interface).
To correct GG, the 390X is listed with 64 CUs, 64 SPs per CU and thus 4096 SPs.
This wouldn't be hard to expect as a step up from the 290X, with the 290X having:
4 SEs, 11 CUs per SE (44), 64 SPs per CU (2816), and also an 8x64b (512b) memory interface.
From this we can say fairly safely that the 390X with the supposed stats would architecturally be as follows:
4 SEs, 16 CUs per SE (64), 64 SPs per CU (4096), 8x64b (512b) MI
By this point you probably notice the similarities between Nvidia and AMD. This isn't really a coincidence; both of them in their most fundamental state are M-SIMD PEAs (Multiple Single-Instruction-stream Multiple-Data-stream Processing Element Arrays).
We can say that fundamentally both CUDA and SP cores are SIMD PEs controlled by a common CU; the CUs in AMD GPUs are quite obviously CUs, and in Nvidia you may also consider the SMXs and SMMs as CUs.
Now, having bussing go from the PCIe interface to all the CUs would be ugly, so we use another device whose technical name I'm unaware of; I'll go by what AMD calls it, based on them having quite similar naming to what we do, and use the name SE (Shader Engine). Nvidia's GPCs could be considered the equivalent.
It should be noted that the CU also makes use of other devices besides the PEs.
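(Quick sanity check on the core-count guesses above; this is nothing but multiplying the quoted numbers back together, none of which I'm vouching for.)
[code]
#include <stdio.h>

/* Core counts are just groups x SMs-per-group x cores-per-SM,
 * using the numbers from the quote above. */
int main(void)
{
    struct { const char *label; int groups, sm, cores; } guess[] = {
        { "980-style   (6 GPC x 4 SMM x 128)", 6, 4, 128 },  /* 3072 */
        { "GTXTB-style (6 GPC x 3 SMX x 256)", 6, 3, 256 },  /* 4608 */
        { "cut-down    (6 GPC x 3 SMX x 192)", 6, 3, 192 },  /* 3456 */
        { "390X listed (4 SE  x 16 CU x 64)",  4, 16, 64 },  /* 4096 */
    };
    for (int i = 0; i < 4; i++)
        printf("%s = %d cores\n", guess[i].label,
               guess[i].groups * guess[i].sm * guess[i].cores);
    return 0;
}
[/code]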
Also, you forgot that there isn't a single central control unit in a GPU. That would be rather silly considering that absolutely every single internal processing unit is doing exactly the same thing at exactly the same time to exactly the same data for exactly the same reason: they all share a control bus. They don't need a central unit because they aren't CPUs, they are CPU add-ons that use the existing CPU as their control unit. For example (this is my best guess for some of these things, I haven't quite had a chance to read the full GT manual, but the pictures seem to indicate this), according to my laptop, the operating data is in memory 0xE2800000-0xE2BFFFFF; this space is a series of structures that contain different matrices that operate on the primary bitmap held in memory 0xD0000000-0xDFFFFFFF. The corresponding opcodes are sent through I/O 0x8000-0x803F, I/O 0x3B0-0x3BB and I/O 0x3C0-0x3DF (as far as I can tell, one of those sets is the VGA operations port, one is the HDMI operations port, and one is the 3D operations port, in no particular order). It should be noted that the VGA terminal bitmap is memory 0xA0000-0xBFFFF (this is a fact that I am 100% certain of).
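If you want to poke at that legacy VGA window yourself on Linux, something along these lines will map it through /dev/mem. This is only a sketch: it needs root, and kernels built with strict /dev/mem protection will refuse the mapping.
[code]
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map the legacy VGA window (0xA0000-0xBFFFF, 128 KiB) via /dev/mem
 * and dump the first few bytes. Sketch only; needs root. */
int main(void)
{
    int fd = open("/dev/mem", O_RDONLY);
    if (fd < 0) { perror("open /dev/mem"); return 1; }

    size_t len = 0x20000;                    /* 0xA0000..0xBFFFF */
    volatile uint8_t *vga = mmap(NULL, len, PROT_READ, MAP_SHARED,
                                 fd, 0xA0000);
    if (vga == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    printf("first bytes of the VGA window: %02X %02X %02X %02X\n",
           vga[0], vga[1], vga[2], vga[3]);

    munmap((void *)vga, len);
    close(fd);
    return 0;
}
[/code]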
On the memory input port of the graphics unit is a simple decoder/synchronizer that transforms incoming data into the proper type depending on whether it is intended for VGA, HDMI, or 3D, and ensures that the proper corresponding opcode also gets sent to the 12 processing units inside the GPU. If there were a central unit in the GPU, then there would only be the central unit's I/O ports, and then why wouldn't you just spend an extra hundred or so bucks and have that central unit be an actual multicore x86 processor with all the bells and whistles (except hyperthreading; in this case hyperthreading is very bad juju) that serves as the whole computer's CPU?
magazorb Wrote: So an SE, loosely speaking, will bus data to and from the CUs, the PCIe interface and the extraneous components within the SE.
Rather conveniently, both AMD and Nvidia have roughly the same clock speeds on just about everything; the only real difference is the memory interface, where AMD uses more bit-width to achieve higher bandwidth and Nvidia uses higher clock speeds to get higher bandwidth. Theoretically:
Reference cards often have:
AMD: 512b x 5GHz = 2560Gb/s
Nvidia: 384b x 7GHz = 2688Gb/s
Aftermarket cards often have:
AMD: 512b x 5.5GHz = 2816Gb/s
Nvidia: 384b x 7.2GHz = 2764.8Gb/s
That's not how that works. The input bit rate is actually determined by a.) how quickly these things can get code out of the processor (unfortunately, data conflicts are rather prominent in the system, so their memory bandwidth goes down quite a bit) and b.) the actual interface to the processor. I don't think they just come with a 512-pin plug onboard. I'm fairly certain that PCIe x32 is something like 300 pins, so you're going to have to cut that number down immediately, and I know that PCIe x32 is only capable of about 31.5 GB/s, so you've got yourself a problem there.
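Rough numbers for both pipes, to put the mismatch in one place. The on-card figures are the ones quoted above; the PCIe figures assume PCIe 3.0 at about 0.985 GB/s per lane after 128b/130b encoding.
[code]
#include <stdio.h>

int main(void)
{
    /* On-card memory interface: bus width (bits) x effective rate (GT/s) / 8 */
    double amd_ref = 512 * 5.0 / 8;          /* 320 GB/s */
    double nv_ref  = 384 * 7.0 / 8;          /* 336 GB/s */

    /* Host link: PCIe 3.0 is ~0.985 GB/s per lane after 128b/130b encoding */
    double pcie3_x16 = 16 * 0.985;           /* ~15.8 GB/s */
    double pcie3_x32 = 32 * 0.985;           /* ~31.5 GB/s */

    printf("512-bit @ 5 GT/s : %.0f GB/s\n", amd_ref);
    printf("384-bit @ 7 GT/s : %.0f GB/s\n", nv_ref);
    printf("PCIe 3.0 x16     : %.1f GB/s\n", pcie3_x16);
    printf("PCIe 3.0 x32     : %.1f GB/s\n", pcie3_x32);
    return 0;
}
[/code]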
magazorb Wrote: Latency is also another factor for performance, although less so with GPUs, and isn't too much to worry about.
Having more SEs, CUs and/or PEs is always a good thing. Having more:
SEs; will help with trafficking data around (good if you have a lot of data movement)
CUs; will allow you to work on different instruction streams (this acts almost like a thread)
PEs; with more PEs come more processes and thus more computational power.
You might be curious about the AMD 7850K & 7700K APUs at this point, knowing that AMD designated them as 12 & 10 "Compute Cores", and yes, that is based on the CU count for the GPU cores, each CU having 64 SPs, thus 8 (64 SP/CU) = 512 SPs & 6 (64 SP/CU) = 384 SPs.
On a note about the 7850K and 7700K: they do use the same module design as the FX series; however, the Steamroller modules are a more optimized version of the Piledriver module, resulting in better overall performance per core per clock.
References:
GTX980 arch: http://hexus.net/tech/reviews/graphics/7...m-maxwell/
GTXTB arch: http://www.bit-tech.net/hardware/graphic...k-review/1
290X arch: http://www.guru3d.com/articles-pages/rad...rks,4.html
A10-7 arch: http://www.anandtech.com/show/7677/amd-k...10-7850k/3
Multiple processes in a GPU are actually rather rare, because their entire goal is to make one image at a time, so they all aim to do exactly the same thing at exactly the same time to as much corresponding data as possible. They do not aim for the MIMD goal of CPUs, because a GPU only needs to perform the matrix and encoding transforms that turn the bitmap into the final code for the screen to interpret and decode into pixel illuminations. A GPU is inherently a unitasker and is fully intended to stay this way.
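If you want the flavour of that lockstep behaviour in a dozen lines, here's a plain C stand-in: one instruction stream, the same transform applied to every data element, which is what each PE does in parallel. The 2x2 rotation is just a toy example, not any particular GPU's pipeline.
[code]
#include <stdio.h>

/* The whole GPU, in spirit: one instruction stream (this loop body)
 * applied in lockstep to a pile of data. Here one 2x2 matrix rotates
 * every point by 90 degrees. */
int main(void)
{
    float m[2][2] = { { 0.0f, -1.0f },
                      { 1.0f,  0.0f } };
    float pts[4][2] = { {1, 0}, {0, 1}, {-1, 0}, {2, 3} };

    for (int i = 0; i < 4; i++) {            /* on a GPU these iterations */
        float x = pts[i][0], y = pts[i][1];  /* run on separate PEs       */
        pts[i][0] = m[0][0] * x + m[0][1] * y;
        pts[i][1] = m[1][0] * x + m[1][1] * y;
        printf("(%g, %g) -> (%g, %g)\n", x, y, pts[i][0], pts[i][1]);
    }
    return 0;
}
[/code]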