Forums - Open Redstone Engineers
R9-390X and GTX Titan II - Printable Version

+- Forums - Open Redstone Engineers (https://forum.openredstone.org)
+-- Forum: Off-Topic (https://forum.openredstone.org/forum-4.html)
+--- Forum: General computing and engineering (https://forum.openredstone.org/forum-66.html)
+--- Thread: R9-390X and GTX Titan II (/thread-5047.html)



R9-390X and GTX Titan II - greatgamer34 - 11-12-2014

Here's some info leaked on the new R9

Here's some shit on the Titan II

Yw


RE: R9-390X and GTX Titan II - TSO - 11-13-2014

You're funny.
Anyway, I don't know how many flops they can do, which is all that matters. (Yes, I/O bandwidth is important, but anyone of decent skill can figure out how to work around that.)


RE: R9-390X and GTX Titan II - greatgamer34 - 11-13-2014

Well, with the 390X having 4096 CU, I think it's gonna rival the 295X2's performance. And Nvidia? They're just gonna blow shit outta the water!


RE: R9-390X and GTX Titan II - Magazorb - 11-14-2014

Somehow I doubt Nvidia will make much of a dent... remember, double precision performance on Nvidia is often bad (and programmers and developers are starting to make more use of doubles for various reasons). No idea how: the GTX Titan Black has around 4.5TFlops of single precision floats (I can't remember its exact number at base clock) but only about 1.5TFlops of double. I don't think they use 64-bit processors, but rather emulate 64-bit with 32-bit processors, or maybe some fancy CU stuff; that's normally Nvidia's style. The 290X has around the same double precision performance for significantly less money, and it scales perfectly to single precision, meaning twice the speed of double.
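
As a sanity check on those numbers: quoted peak throughput is normally just cores x clock x 2, since one fused multiply-add counts as two floating point operations per cycle. A minimal sketch of that arithmetic, assuming the commonly listed Titan Black figures (2880 CUDA cores, 889MHz base clock, FP64 at 1/3 the FP32 rate), which come out a little above what I remembered:

Code:
#include <stdio.h>

int main(void) {
    /* Assumed specs: GTX Titan Black, 2880 CUDA cores at an 889MHz
       base clock, with FP64 running at 1/3 the FP32 rate. */
    double cores = 2880;
    double clock_hz = 889e6;

    /* One FMA per core per cycle counts as two floating point ops. */
    double sp_flops = cores * clock_hz * 2.0;
    double dp_flops = sp_flops / 3.0;

    printf("single precision: %.2f TFlops\n", sp_flops / 1e12); /* ~5.12 */
    printf("double precision: %.2f TFlops\n", dp_flops / 1e12); /* ~1.71 */
    return 0;
}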
Yes, the GTX 770 and up take the single precision win against AMD... but with no double support until you reach the Titan in Nvidia's consumer lineup, there will probably be plenty of games that run badly on Nvidia. Star Citizen, for example, will use doubles (it's currently single, but the engine is being modified to support double).

My thoughts on why it uses 32-bit CUDA processors rather than 64-bit mostly come down to the fact that its double precision performance is merely 1/3 of its single precision performance. Normally, running twice as many 32-bit processors instead of 64-bit ones would easily be thought to have the same effect as a 64-bit processor that can process two 32-bit values when required, or as two 32-bit processors conditionally chained in series through the carry, depending on which precision is in use.
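
To make that carry chaining concrete, here's a minimal sketch (the function name is mine, purely for illustration) of a 64-bit add built from two 32-bit adds; note how the upper half can't start until the lower half's carry-out is known:

Code:
#include <stdint.h>
#include <stdio.h>

/* 64-bit addition from two 32-bit adds: the upper half cannot be
   computed until the lower half's carry-out is known, which is the
   serial dependency a 32-bit pipeline would have to wait on. */
static uint64_t add64_via_32(uint32_t a_lo, uint32_t a_hi,
                             uint32_t b_lo, uint32_t b_hi) {
    uint32_t lo = a_lo + b_lo;
    uint32_t carry = (lo < a_lo);      /* carry-out of the low half */
    uint32_t hi = a_hi + b_hi + carry; /* must wait for carry */
    return ((uint64_t)hi << 32) | lo;
}

int main(void) {
    uint64_t a = 0x00000001FFFFFFFFull, b = 1;
    uint64_t sum = add64_via_32((uint32_t)a, (uint32_t)(a >> 32),
                                (uint32_t)b, (uint32_t)(b >> 32));
    printf("%llx\n", (unsigned long long)sum); /* prints 200000000 */
    return 0;
}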

However, if we think about this in more depth, we know that to keep good clock speeds we are required to pipeline, and that becomes an issue if your 64-bit computation works by running the lower half in first and chasing the second half in after it: you have to wait for the first half to compute and for the carry flag to cycle and be ready. In a processor with a short pipeline this probably wouldn't be much of an issue, but in more complex pipelines the carry may not be ready by the next cycle. (This is very easily the case if you want really high throughput: you would generally have stage buffers at both the input and output of the execution stage, and in doing so you make the data loop 3 cycles instead of 2, as it would be with only an input buffer, just to keep the clock speed up.)
I'd like to remind you that this is just my suspicion; I don't know for sure why single precision is 3 times faster than double.

I suspect this means Nvidia will have to address it if they wish to remain performance king, and if my suspicion that they use 32-bit processors is correct, then fixing that will take up a lot of their new effective die space, leaving little room to add much else.

If I were to make as educated a guess as I can, based on the way Nvidia seems to go, it would probably be one of the following.
Based on the 980 arch:
6 GPCs, 4 SMMs per GPC, 128 CUDA cores per SMM == 3072 CUDA cores
or
Based on the GTX Titan Black arch:
6 GPCs, 3 SMXs per GPC, 256 CUDA cores per SMX == 4608
Both are optimistic, but both seem possible. The Titan Black, from what I can see of the arch, has 256 CUDA cores physically on each SMX (3840) but only 192 functional per SMX (2880), so it wouldn't surprise me at all to see:
6 GPCs, 3 SMXs each, 192 CUDA cores per SMX == 3456.

It should be noted that, from what I can find, the 980 has 4 GPCs of 4 SMMs, with all 128 CUDA cores per SMM physical and functional, while the Titan Black has 5 GPCs of 3 SMXs, with 256 (3840) physical and 192 (2880) functional CUDA cores per SMX.
I scaled both up to 6 GPCs based on how much extra effective room I expect the new die to have over the old ones they were based on; this assumes the 32-bit processor issue I propose isn't widened to 64-bit. (The arithmetic behind all these counts is sketched below.)
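
All of those totals are just clusters x SMs per cluster x cores per SM, and the same formula covers AMD's SE/CU/SP hierarchy further down. A quick sketch, with the configurations being my speculation rather than confirmed specs:

Code:
#include <stdio.h>

/* Total shader cores = clusters x SMs per cluster x cores per SM.
   The same formula covers AMD: SEs x CUs per SE x SPs per CU. */
static int total_cores(int clusters, int sms_per_cluster, int cores_per_sm) {
    return clusters * sms_per_cluster * cores_per_sm;
}

int main(void) {
    printf("980-style guess:     %d\n", total_cores(6, 4, 128)); /* 3072 */
    printf("GTXTB-style guess:   %d\n", total_cores(6, 3, 256)); /* 4608 */
    printf("GTXTB-style (funct): %d\n", total_cores(6, 3, 192)); /* 3456 */
    printf("390X (rumoured):     %d\n", total_cores(4, 16, 64)); /* 4096 */
    return 0;
}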

It also seems safe to assume Nvidia would use a 6x64b (384b) memory interface clocked at 1750MHz (7GHz effective); however, with an increase in cores it wouldn't be unreasonable for them to use either an even higher clocked memory controller or more of them (maybe an 8x64b (512b) memory interface).

To correct GG: the 390X is listed with 64 CUs at 64 SPs per CU, and thus 4096 SPs.
This wouldn't be hard to expect as a step up from the 290X, the 290X having:
4 SEs, 11 CUs per SE (44), 64 SPs per CU (2816), and an 8x64b (512b) memory interface.
From this we can say fairly safely that the 390X, with its supposed stats, would architecturally be as follows:
4 SEs, 16 CUs per SE (64), 64 SPs per CU (4096), 8x64b (512b) MI

By this point you probably notice the similarities between Nvidia and AMD. This isn't really a coincidence: both of them, in their most fundamental state, are M-SIMD PEAs (Multiple Single-Instruction-stream, Multiple-Data-stream Processing Element Arrays).
We can say that fundamentally both CUDA cores and SPs are SIMD PEs controlled by a common control unit: AMD's CUs are quite obviously such control units, and on Nvidia you may consider the SMXs and SMMs to be the CUs.

Now, having busing run from the PCIe interface to every CU directly would be ugly, so there's another device in between whose technical name I don't know. I'll go by what AMD calls it, since their naming is quite similar to ours: the SE (Shader Engine); Nvidia's GPCs could be considered the equivalent.
It should be noted that the CU also makes use of other devices besides the PEs.

So an SE, loosely speaking, busses data to and from the CUs, the PCIe interface, and the other components within the SE.

Rather conveniently, both AMD and Nvidia have roughly the same clock speeds on just about everything; the only real difference is the memory interface, where AMD uses more bit-width to achieve higher bandwidth and Nvidia uses higher clock speeds. Theoretically:

Reference cards often have:
AMD: 512b x 5GHz = 2560Gb/s
N-Vidia: 384b x 7GHz = 2688Gb/s

After-market cards often have:
AMD: 512b x 5.5GHz = 2816Gb/s
N-Vidia: 384b x 7.2GHz = 2764.8Gb/s
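
Note those figures are in gigabits per second (bus width times effective transfer rate); divide by 8 for the more familiar GB/s, as this quick sketch works through:

Code:
#include <stdio.h>

/* Peak memory bandwidth: bus width (bits) x effective rate (GT/s)
   gives Gbit/s; divide by 8 for GB/s. */
static void bw(const char *name, int bus_bits, double rate_gtps) {
    double gbits = bus_bits * rate_gtps;
    printf("%-11s %6.1f Gb/s = %5.1f GB/s\n", name, gbits, gbits / 8.0);
}

int main(void) {
    bw("AMD ref:", 512, 5.0);    /* 2560.0 Gb/s = 320.0 GB/s */
    bw("Nvidia ref:", 384, 7.0); /* 2688.0 Gb/s = 336.0 GB/s */
    bw("AMD AIB:", 512, 5.5);    /* 2816.0 Gb/s = 352.0 GB/s */
    bw("Nvidia AIB:", 384, 7.2); /* 2764.8 Gb/s = 345.6 GB/s */
    return 0;
}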

Latency is another factor in performance, although it matters less with GPUs and isn't much to worry about.

Having more SEs, CUs and/or PEs is always a good thing. Having more:
SEs will help with trafficking data around (good if you have a lot of data movement);
CUs will let you work on different instruction streams (a CU acts almost like a thread);
PEs will give you more parallel operations and thus more computational power.
(A toy software model of this follows below.)
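
To illustrate the M-SIMD idea in software: each CU broadcasts one instruction at a time to its whole PE array, while separate CUs can run separate instruction streams. A toy model, with all names and sizes being mine:

Code:
#include <stdio.h>

#define PES_PER_CU 4 /* processing elements per control unit (toy size) */
#define NUM_CUS    2 /* independent instruction streams */

typedef float (*op_fn)(float);

static float twice(float x)  { return 2.0f * x; }
static float square(float x) { return x * x; }

/* One CU broadcasts a single operation to all of its PEs (the SIMD
   part); different CUs may run different ops (the "M" in M-SIMD). */
static void cu_step(op_fn op, float lanes[PES_PER_CU]) {
    for (int pe = 0; pe < PES_PER_CU; pe++) /* every PE runs the same op */
        lanes[pe] = op(lanes[pe]);
}

int main(void) {
    float data[NUM_CUS][PES_PER_CU] = {{1, 2, 3, 4}, {1, 2, 3, 4}};
    op_fn streams[NUM_CUS] = { twice, square }; /* per-CU instruction stream */

    for (int cu = 0; cu < NUM_CUS; cu++)
        cu_step(streams[cu], data[cu]);

    for (int cu = 0; cu < NUM_CUS; cu++) {
        for (int pe = 0; pe < PES_PER_CU; pe++)
            printf("%g ", data[cu][pe]);
        printf("\n"); /* CU0: 2 4 6 8; CU1: 1 4 9 16 */
    }
    return 0;
}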

You might be curious about the AMD 7850K & 7700K APUs at this point, knowing that AMD designates them as 12 and 10 "Compute Cores". Yes, that count includes the GPU's CUs, each CU having 64 SPs: thus 8 CUs (64 SP/CU) = 512 SPs and 6 CUs (64 SP/CU) = 384 SPs.

A note about the 7850K and 7700K: they do use the same module design as the FX series, but the Steamroller module is a more optimized version of the Piledriver module, giving better overall performance per core per clock.

References:
GTX980 arch: http://hexus.net/tech/reviews/graphics/74849-nvidia-geforce-gtx-980-28nm-maxwell/
GTXTB arch: http://www.bit-tech.net/hardware/graphics/2014/02/26/nvidia-geforce-gtx-titan-black-review/1
290X arch: http://www.guru3d.com/articles-pages/radeon-r9-290-review-benchmarks,4.html
A10-7 arch: http://www.anandtech.com/show/7677/amd-kaveri-review-a8-7600-a10-7850k/3


RE: R9-390X and GTX Titan II - Nickster258 - 11-15-2014

(11-14-2014, 03:07 AM)Magazorb Wrote: [full post quoted above]

Holy fuck. You put time into this.


RE: R9-390X and GTX Titan II - Magazorb - 11-15-2014

(11-15-2014, 06:11 AM)Nickster258 Wrote: Holy fuck. You put time into this.

Of course. I'm trying to work out ways to get a SIMD processor working in MC that wouldn't be so bad; I'd love to get to the point where I have enough power to process simple 3D shapes with some color data Big Grin


RE: R9-390X and GTX Titan II - TSO - 11-15-2014

I have an even longer and more time-wasting response that will help clarify some stuff for you, Maga. Mainly it addresses the fact that double precision is not simply twice single precision, nor can it be easily modeled as two single precision registers. It adds three bits to the exponent and another 29 to the mantissa; single precision starts with eight bits in the exponent and 23 bits in the mantissa. The mantissa is in sign-magnitude form and, IIRC, the exponent is stored in biased (offset) form. The register layout is the same, and the extra three bits in the exponent aren't a huge problem because you can just shift some things around to make the relative exponents match; what really matters is the extra six mantissa bits, and the second sign bit that two single point registers have but double precision doesn't. This means we have to make three single point registers: one for bits 0-22 of the less significant data register, one for bits 23-31 of the lower data register plus bits 0-13 of the upper data register, and a third for bits 14-19 of the upper data register. That's the short answer for why it's going to be three times slower; my long reply to your reply will have everything fully explained.
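
To put those field widths in concrete terms, here's a minimal sketch (assuming the standard IEEE 754 layouts) that pulls the fields out of a double and splits its 52 mantissa bits into the 23 + 23 + 6 chunks described above:

Code:
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    /* IEEE 754 double: 1 sign bit, 11 exponent bits (biased by 1023),
       52 mantissa bits, versus single's 1 / 8 / 23 layout. */
    double x = -6.25;
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);

    uint64_t sign     = bits >> 63;
    uint64_t exponent = (bits >> 52) & 0x7FF;      /* 11 bits */
    uint64_t mantissa = bits & 0xFFFFFFFFFFFFFull; /* 52 bits */

    /* The 52 mantissa bits split across three single-precision
       mantissas exactly as above: 23 + 23 + 6. */
    uint32_t chunk0 = (uint32_t)( mantissa        & 0x7FFFFF); /* bits 0-22  */
    uint32_t chunk1 = (uint32_t)((mantissa >> 23) & 0x7FFFFF); /* bits 23-45 */
    uint32_t chunk2 = (uint32_t)( mantissa >> 46);             /* bits 46-51 */

    printf("sign=%llu exp=%llu (unbiased %lld) mantissa=%013llx\n",
           (unsigned long long)sign, (unsigned long long)exponent,
           (long long)exponent - 1023, (unsigned long long)mantissa);
    printf("chunks: %06x %06x %02x\n", chunk0, chunk1, chunk2);
    return 0;
}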


RE: R9-390X and GTX Titan II - greatgamer34 - 11-15-2014

TSO

It depends on what standard you're using. If you're talking about the IEEE 754 standard, then it adds more bits to the exponent.


RE: R9-390X and GTX Titan II - TSO - 11-16-2014

If it doesn't add more bits to the exponent, then there would be an extra 9 mantissa bits to deal with instead of the six stated above (a 64-bit double with an 8-bit exponent would have 64 - 1 - 8 = 55 mantissa bits, 9 more than two singles' 46), giving the same issue again.


RE: R9-390X and GTX Titan II - Magazorb - 11-16-2014

At any rate, I was curious about that (4096-bit) part in the title; I looked around and there's some speculation that they've moved the VRAM onto the die.