11-15-2014, 06:11 AM
(11-14-2014, 03:07 AM)Magazorb Wrote: Somehow I doubt Nvidia will make much of a dent... remember that double-precision performance (which programmers and developers are starting to make more use of, for various reasons) is often poor on Nvidia. No idea how, but a GTX Titan Black does somewhere between 8 and 9 TFLOPS of single-precision floats (I can't remember the exact number at base clock) but only around 2.6 TFLOPS of double. I don't think they use 64-bit processors; rather, they emulate 64-bit with 32-bit processors, maybe with some fancy CU tricks, which is normally Nvidia's style. The 290X has around the same double-precision throughput for significantly less money, and it scales perfectly to single precision, meaning twice the speed of double.
Yes, GTX 770 and up take the single-precision win against AMD... but with no real double support until you reach the Titan in Nvidia's consumer line-up, there will probably be plenty of games that run badly on Nvidia. Star Citizen, for example, will use doubles (it's currently single precision, but the engine is being modified to support double).
My thought on why it's using a 32-bit CUDA processor rather than 64-bit comes mainly from the fact that its double-precision performance is merely 1/3 of its single-precision performance. Normally, running twice as many 32-bit processors instead of 64-bit ones would be expected to have the same effect as a 64-bit processor that can process two 32-bit values when required, or even two 32-bit processors running in series with carry, conditionally on which precision is in use — in which case you'd expect a 1/2 ratio, not 1/3.
However, if we think about this in more depth, we know that to keep clock speeds high we have to pipeline. That becomes an issue if a 64-bit computation works by feeding the lower half in first and chasing the second half in after it: you have to wait for the first half to compute and for the carry flag to cycle through and be ready. In a processor with a short pipeline this probably wouldn't be much of an issue, but with more complex pipelines the carry may not be ready by the next cycle. This is very easily the case if you want really high throughput: you would generally have stage buffers at both the input and the output of the execution stage, which makes the carry loop 3 cycles instead of 2 (as it would be with only an input buffer).
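The split-add idea above can be sketched in a few lines. This is purely illustrative Python, not how any real CUDA core is wired; it just shows the carry dependency between the two halves:

```python
MASK32 = 0xFFFFFFFF

def add64_via_32(a, b):
    """Add two 64-bit values using two 32-bit additions chained by a carry.

    The low halves are added first; the carry out of that addition must be
    ready before the high halves can be summed, which is exactly the
    pipeline dependency described above.
    """
    lo = (a & MASK32) + (b & MASK32)        # first 32-bit add
    carry = lo >> 32                         # carry flag from the low half
    hi = ((a >> 32) & MASK32) + ((b >> 32) & MASK32) + carry  # second add waits on the carry
    return ((hi & MASK32) << 32) | (lo & MASK32)

print(hex(add64_via_32(0xFFFFFFFF, 1)))  # 0x100000000: carry propagated into the high word
```

In hardware, that `carry` hand-off is the serialising step: the second add can't be issued until the first one's carry has made it through the pipeline stages.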
I'd like to remind you that this is just my suspicion; I don't know for sure why single precision is 3 times faster than double.
I suspect this means Nvidia will have to address it if they wish to remain performance king, and if my suspicion that they're using 32-bit processors is correct, then fixing that will eat up a lot of their new effective die space, leaving little room for anything else.
If I were to make as educated a guess as I can, based on the way Nvidia seems to operate, it would probably be one of the following:
Based on the 980 arch:
6 GPCs, 4 SMMs per GPC, 128 CUDA cores per SMM == 3072 CUDA cores
or
Based on the GTXTB arch:
6 GPCs, 3 SMXs per GPC, 256 CUDA cores per SMX == 4608
Both are optimistic, but both seem possible. The GTXTB, from what I can see of the arch, has 256 CUDA cores physically on each SMX (3840) but only 192 functional per SMX (2880), so for that reason it wouldn't surprise me at all to see:
6 GPCs, 3 SMXs each, 192 CUDA cores per SMX == 3456.
It should be noted that, from what I can find, the 980 has 4 GPCs with 4 SMMs each and 128 CUDA cores both physical and functional, while the GTXTB has 5 GPCs with 3 SMXs each and 256 (3840) physical / 192 (2880) functional CUDA cores. I scaled both up to 6 GPCs based on how much extra effective room I expect the new die to have over the old ones they're based on; this assumes the 32-bit processor issue I propose isn't fixed by moving to 64-bit.
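The core counts above all come from the same multiplication (GPCs × SMs per GPC × cores per SM); a quick sanity check of the speculated and actual configs:

```python
# Reproducing the core-count arithmetic above. The 6-GPC configs are the
# post's speculation, not confirmed specs.
def cuda_cores(gpcs, sms_per_gpc, cores_per_sm):
    return gpcs * sms_per_gpc * cores_per_sm

print(cuda_cores(6, 4, 128))  # 980-style scaling            -> 3072
print(cuda_cores(6, 3, 256))  # GTXTB-style, all cores alive -> 4608
print(cuda_cores(6, 3, 192))  # GTXTB-style, 192 functional  -> 3456
print(cuda_cores(4, 4, 128))  # actual GTX 980               -> 2048
print(cuda_cores(5, 3, 192))  # actual GTXTB (functional)    -> 2880
```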
It also seems safe to assume Nvidia would use a 6×64b (384b) memory interface clocked at 1750MHz (7GHz effective); however, with an increase in cores it wouldn't be unreasonable for them to use either an even higher-clocked memory controller or more of them (maybe an 8×64b (512b) memory interface).
To correct GG: the 390X is listed with 64 CUs at 64 SPs per CU, and thus 4096 SPs.
That wouldn't be a surprising step up from the 290X, the 290X having:
4 SEs, 11 CUs per SE (44), 64 SPs per CU (2816), and an 8×64b (512b) memory interface.
From this we can fairly safely say that a 390X with the supposed stats would architecturally look like:
4 SEs, 16 CUs per SE (64), 64 SPs per CU (4096), 8×64b (512b) MI
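The AMD side is the same kind of multiplication, just with AMD's names (SEs × CUs per SE × SPs per CU); the 390X row here uses the rumoured figures from above, not a confirmed spec:

```python
# AMD shader-count arithmetic from the post; the 390X config is the
# rumoured one, not an official spec.
def stream_processors(ses, cus_per_se, sps_per_cu):
    return ses * cus_per_se * sps_per_cu

print(stream_processors(4, 11, 64))  # 290X          -> 2816
print(stream_processors(4, 16, 64))  # rumoured 390X -> 4096
```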
By this point you've probably noticed the similarities between Nvidia and AMD. That's not really a coincidence: both of them, in their most fundamental form, are M-SIMD PEAs (Multiple Single-Instruction-stream Multiple-Data-stream Processing Element Arrays).
Fundamentally, both CUDA cores and SPs are SIMD PEs controlled by a common CU (control unit). The CUs in AMD GPUs are quite obviously CUs, and on Nvidia you can consider the SMXs and SMMs to be the CUs.
Now, running buses from the PCIe interface directly to every CU would be ugly, so another device sits in between. I don't know its generic technical name, so I'll go with AMD's term, SE (Shader Engine), since AMD's naming is quite similar to ours; Nvidia's GPCs can be considered the equivalent.
It should also be noted that the CU makes use of other devices besides the PEs.
So an SE, loosely speaking, buses data between the CUs, the PCIe interface, and the other components within the SE.
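The M-SIMD idea above can be shown with a toy model: each CU holds one instruction stream and broadcasts each instruction to its whole array of PEs, which apply it to different data elements. All names here are made up for illustration; real GPUs obviously don't dispatch Python lambdas:

```python
# Toy M-SIMD model: one CU = one instruction stream broadcast to many PEs.
def run_cu(instruction_stream, data_lanes):
    for op in instruction_stream:                 # single instruction stream...
        data_lanes = [op(x) for x in data_lanes]  # ...applied to every PE's data lane
    return data_lanes

# Two CUs (hence "Multiple" SIMD) running independent streams:
cu0 = run_cu([lambda x: x + 1, lambda x: x * 2], [0, 1, 2, 3])
cu1 = run_cu([lambda x: x * x], [1, 2, 3])
print(cu0)  # [2, 4, 6, 8]
print(cu1)  # [1, 4, 9]
```

One CU on its own is plain SIMD; having many of them, each with its own stream, is what makes the whole array "M-SIMD".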
Rather conveniently, both AMD and Nvidia run roughly the same clock speeds on just about everything; the only real difference is the memory interface, where AMD uses more bit-width to reach its bandwidth and Nvidia uses higher clock speeds. Theoretically:
Reference cards often have:
AMD: 512b × 5GHz = 2560Gb/s (320GB/s)
Nvidia: 384b × 7GHz = 2688Gb/s (336GB/s)
Aftermarket cards often have:
AMD: 512b × 5.5GHz = 2816Gb/s (352GB/s)
Nvidia: 384b × 7.2GHz = 2764.8Gb/s (345.6GB/s)
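Those figures all come from the same formula: bus width in bits times effective memory clock gives Gb/s, and dividing by 8 gives GB/s:

```python
# Theoretical memory bandwidth: bus width (bits) x effective clock (GHz),
# divided by 8 to convert bits to bytes.
def bandwidth_gb_per_s(bus_bits, effective_clock_ghz):
    return bus_bits * effective_clock_ghz / 8

print(bandwidth_gb_per_s(512, 5.0))  # AMD reference      -> 320.0 GB/s
print(bandwidth_gb_per_s(384, 7.0))  # Nvidia reference   -> 336.0 GB/s
print(bandwidth_gb_per_s(512, 5.5))  # AMD aftermarket    -> 352.0 GB/s
print(bandwidth_gb_per_s(384, 7.2))  # Nvidia aftermarket -> 345.6 GB/s
```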
Latency is another factor in performance, though less so with GPUs; it isn't too much to worry about.
Having more SEs, CUs, and/or PEs is always a good thing. Having more:
SEs: helps with trafficking data around (good if you have a lot of data movement)
CUs: lets you work on more independent instruction streams (each acts rather like a thread)
PEs: more PEs means more lanes computing at once, and thus more computational power.
You might be curious about the AMD 7850K & 7700K APUs at this point, knowing that AMD rates them at 12 & 10 "Compute Cores". Yes, that count includes the CU count of the GPU portion, each CU having 64 SPs; thus 8 CUs × 64 SP/CU = 512 SPs and 6 CUs × 64 SP/CU = 384 SPs.
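AMD's marketing number is just CPU cores plus GPU CUs added together; assuming 4 CPU cores on each chip (both are dual-module parts), the arithmetic works out:

```python
# AMD "Compute Cores" marketing count = CPU cores + GPU CUs, per the post's
# reading. SP totals follow from 64 SPs per CU.
def compute_cores(cpu_cores, gpu_cus):
    return cpu_cores + gpu_cus

print(compute_cores(4, 8), 8 * 64)  # A10-7850K: 12 "Compute Cores", 512 SPs
print(compute_cores(4, 6), 6 * 64)  # A10-7700K: 10 "Compute Cores", 384 SPs
```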
One note about the 7850K and 7700K: they do use the same module design as the FX series, but the Steamroller modules are a more optimized version of the Piledriver module, resulting in better overall performance per core per clock.
References:
GTX980 arch: http://hexus.net/tech/reviews/graphics/7...m-maxwell/
GTXTB arch: http://www.bit-tech.net/hardware/graphic...k-review/1
290X arch: http://www.guru3d.com/articles-pages/rad...rks,4.html
A10-7 arch: http://www.anandtech.com/show/7677/amd-k...10-7850k/3
Holy fuck. You put time into this.