11-12-2014, 07:34 PM
(This post was last modified: 11-12-2014, 08:15 PM by greatgamer34.)
11-13-2014, 06:56 AM
You're funny.
Anyway, I don't know how many flops they can do, which is all that matters. (Yes, I/O bandwidth is important, but anyone of decent skill can figure out how to work around that.)
11-13-2014, 05:54 PM
Well, with the 390X having 4096 CUs I think it's gonna rival the 295X2's performance. And Nvidia? They're just gonna blow shit outta the water!
Somehow I doubt Nvidia will make much of a dent... do remember that double precision performance (programmers and developers are starting to make more use of it for various reasons) on Nvidia is often bad, no idea how. GTX Titan Blacks have 4.5 TFLOPS of single precision floats (can't remember the exact number at base clock) but only 1.5 TFLOPS of double (I don't think they use 64-bit processors, rather 64-bit emulated with 32-bit processors, maybe fancy CU stuff; that's normally Nvidia's style). The 290X has around the same amount of double precision for significantly less, and scales perfectly for single precision, meaning twice the speed of double.
Yes, GTX 770 and up take the single precision win against AMD... but with no double support until you reach the Titan in Nvidia's consumer lineup, there will probably be plenty of games that run badly on Nvidia. Star Citizen for example, which will use doubles (currently single, but the engine is being modified to support double).

My thoughts on why it's using a 32-bit CUDA processor rather than 64-bit mainly come from the fact that its double precision performance is merely 1/3 of its single precision performance. Normally, the difference of running 32-bit processors instead of 64-bit ones but having twice as many would easily be thought of as having the same effect as a 64-bit processor that can process two 32-bit values when required, or even two 32-bit processors running conditionally in series with carry, depending on which precision is used.

However, if we think about this in more depth, we know that to keep good clock speeds we are required to pipeline. This then becomes an issue if your computation on a 64-bit value is based on running the lower half through first and chasing the second half in after it: you have to wait for the first half to compute and for the carry flag to cycle through and be ready. In a processor with a short pipeline this probably wouldn't be much of an issue, but with more complex pipelines the carry may not be ready by the next cycle. (This is very easily the case if you want really high throughput: you would generally have stage buffers at both the input and output of the execution stage, and in doing so create an issue where, to keep the clock speed up, you make the data loop 3 cycles instead of 2, say if we had only an input buffer.) There's a rough sketch of the carry-chasing dependency just below. I'd like to remind you that this is just my suspicion; I don't know for sure why single precision is 3 times faster than double.

I suspect this means Nvidia will have to address it if they wish to remain performance king, and if my suspicion that they're using 32-bit processors is correct, then fixing it will take up a lot of their new effective space, resulting in not much else actually being added. If I were to make as educated a guess as I can, based on the way Nvidia seems to go, it would probably be either of the following:

Based on the 980 arch: 6 GPCs, 4 SMMs per GPC, 128 CUDA cores per SMM = 3072 CUDA cores
Based on the GTX Titan Black arch: 6 GPCs, 3 SMXs per GPC, 256 CUDA cores per SMX = 4608

Both are optimistic, but both seem possible. From what I can see of the arch, the GTX Titan Black has 256 CUDA cores physically on each SMX (3840), however only 192 functional per SMX (2880), so for that reason it wouldn't surprise me at all to see: 6 GPCs with 3 SMXs each and 192 CUDA cores per SMX = 3456. It should be noted that, from what I can find, the 980 has 4 GPCs with 4 SMMs of 128 CUDA cores each, physical and functional, while the GTX Titan Black has 5 GPCs with 3 SMXs of 256 (3840) physical / 192 (2880) functional CUDA cores. I scaled both up to 6 GPCs based on how much extra effective room I expect the new die to have over the old ones they're based on; this assumes the issue I propose of Nvidia using 32-bit processors isn't changed to 64-bit.
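To make the carry-chasing point a bit more concrete, here is a tiny C++ sketch (my own toy example, not anything from Nvidia's or AMD's actual designs, and a plain integer add rather than a real double precision FPU) of a 64-bit add built from two 32-bit halves; the high-half add cannot start until the carry out of the low half is known, which is exactly the pipeline dependency described above:

```cpp
#include <cstdint>
#include <cstdio>

// Toy illustration: a 64-bit add done on 32-bit hardware. The low halves go
// through first; the high halves have to wait for the carry.
uint64_t add64_via_32(uint32_t a_lo, uint32_t a_hi,
                      uint32_t b_lo, uint32_t b_hi) {
    uint32_t lo = a_lo + b_lo;               // step 1: low halves
    uint32_t carry = (lo < a_lo) ? 1u : 0u;  // carry out of bit 31
    uint32_t hi = a_hi + b_hi + carry;       // step 2: depends on the carry
    return (static_cast<uint64_t>(hi) << 32) | lo;
}

int main() {
    uint64_t a = 0x00000001FFFFFFFFull;
    uint64_t b = 0x0000000000000001ull;
    uint64_t r = add64_via_32(static_cast<uint32_t>(a), static_cast<uint32_t>(a >> 32),
                              static_cast<uint32_t>(b), static_cast<uint32_t>(b >> 32));
    printf("%llx + %llx = %llx\n",
           (unsigned long long)a, (unsigned long long)b, (unsigned long long)r);
    return 0;
}
```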
It also seems safe to assume Nvidia would use a 6x64b (384-bit) memory interface clocked at 1750 MHz (7 GHz effective); however, with an increase in cores it wouldn't be unreasonable for them to either use an even higher-clocked memory controller or more of them (maybe an 8x64b (512-bit) memory interface).

To correct GG, the 390X has 64 CUs listed, 64 SPs per CU, and thus 4096 SPs. This wouldn't be hard to expect as a step up from the 290X, the 290X having: 4 SEs, 11 CUs per SE (44), 64 SPs per CU (2816), and also an 8x64b (512-bit) memory interface. From this we can say fairly safely that a 390X with the supposed stats would architecturally be as follows: 4 SEs, 16 CUs per SE (64), 64 SPs per CU (4096), 8x64b (512b) MI.

By this point you probably notice similarities between Nvidia and AMD. This isn't really a coincidence: both of them in their most fundamental state are M-SIMD PEAs (Multiple Single-Instruction-stream Multiple-Data-stream Processing Element Arrays). We can say that fundamentally both CUDA cores and SPs are SIMD PEs controlled by a common CU. The CUs in AMD GPUs are quite obviously CUs, and in Nvidia you may also consider the SMXs and SMMs as CUs. Now, having bussing going to all the CUs from the PCIe interface would be ugly, so another device is used, whose technical name I don't know, so I'll go with what AMD calls it (their naming being quite similar to what we use): SE (Shader Engine). Nvidia's GPCs could be considered the equivalent. It should be noted that the CU also makes use of other devices besides the PEs. So an SE, loosely speaking, will bus data to and from the CUs, the PCIe interface, and extraneous components within the SE.

Rather conveniently, both AMD and Nvidia have roughly the same clock speeds on just about everything; the only real difference is the memory interface, where AMD uses more bit-width to achieve higher bandwidth and Nvidia uses higher clock speeds to get higher bandwidth. Theoretically:

Reference cards often have: AMD: 512b x 5GHz = 2560 Gb/s; Nvidia: 384b x 7GHz = 2688 Gb/s
Aftermarket cards often have: AMD: 512b x 5.5GHz = 2816 Gb/s; Nvidia: 384b x 7.2GHz = 2764.8 Gb/s
(There's a quick sanity-check sketch of this arithmetic at the end of this post.)

Latency is also another factor in performance, although less so with GPUs, and isn't too much to worry about. Having more SEs, CUs and/or PEs is always a good thing. Having more:
SEs will help with trafficking data around (good if you have a lot of data movement);
CUs will allow you to work on different instruction streams (this acts pretty much like a thread);
PEs means more parallel processing and thus more computational power.

You might be curious about the AMD 7850K & 7700K APUs at this point, knowing that AMD designates them as 12 and 10 "Compute Cores", and yes, that is based on the CU count for the GPU cores, each CU having 64 SPs: thus 8 CUs (64 SP/CU) = 512 SPs and 6 CUs (64 SP/CU) = 384 SPs. A note about the 7850K and 7700K: they do use the same module design as the FX series, however the Steamroller modules are a more optimized version of the Piledriver module, resulting in better performance overall per core per clock.

References:
GTX 980 arch: http://hexus.net/tech/reviews/graphics/7...m-maxwell/
GTX Titan Black arch: http://www.bit-tech.net/hardware/graphic...k-review/1
290X arch: http://www.guru3d.com/articles-pages/rad...rks,4.html
A10-7 arch: http://www.anandtech.com/show/7677/amd-k...10-7850k/3
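And the promised sanity check on those bandwidth figures: a few lines of C++ that just multiply bus width by effective memory clock (my own arithmetic as an illustration, not vendor-published numbers):

```cpp
#include <cstdio>

// Bandwidth in Gb/s is bus width in bits times effective clock in GHz;
// divide by 8 for GB/s.
double bandwidth_gbps(int bus_width_bits, double effective_clock_ghz) {
    return bus_width_bits * effective_clock_ghz;   // gigabits per second
}

int main() {
    printf("AMD reference   : 512b x 5.0GHz = %6.1f Gb/s (%5.1f GB/s)\n",
           bandwidth_gbps(512, 5.0), bandwidth_gbps(512, 5.0) / 8.0);
    printf("Nvidia reference: 384b x 7.0GHz = %6.1f Gb/s (%5.1f GB/s)\n",
           bandwidth_gbps(384, 7.0), bandwidth_gbps(384, 7.0) / 8.0);
    return 0;
}
```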
11-15-2014, 06:11 AM
(11-14-2014, 03:07 AM)Magazorb Wrote: Somehow I doubt Nvidia will make much of a dent... do remember that double precision performance (programmers and developers are starting to make more use of it for various reasons) on Nvidia is often bad, no idea how. GTX Titan Blacks were somewhere between 8 and 9 GFlops of single precision floats (can't remember the exact number at base clock) but only around 2.6 GFlops of double (I don't think they use 64-bit processors, rather 64-bit emulated with 32-bit processors, maybe fancy CU stuff; that's normally Nvidia's style). The 290X has around the same amount of double precision for significantly less, and scales perfectly for single precision, meaning twice the speed of double.
Holy fuck. You put time into this.
11-15-2014, 06:15 AM
11-15-2014, 08:46 AM
I have an even longer and more time-wasting response that will help clarify some stuff for you, maga. Mainly it addresses the fact that double precision is not actually twice as precise as single precision, nor can it be easily modeled as two single precision registers. It adds three bits to the exponent and another 29 to the mantissa. Single precision starts with eight bits in the exponent and 23 bits in the mantissa. The mantissa is in sign-magnitude form and, IIRC, the exponent is stored with a bias rather than in two's complement. The register layout is the same, and the extra three bits in the exponent aren't a huge problem because you can just shift some things around to make the relative exponents match, but what really matters is the extra six mantissa bits and the missing second sign bit that double precision has compared to two single precision registers. This means we have to make three single precision registers: one for bits 0-22 of the less significant data register, one for bits 23-31 of the lower data register and bits 0-13 of the upper data register, and a third for bits 14-19 of the upper data register. This is the short answer for why it's going to be three times slower; my long reply to your reply will have everything fully explained.
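For reference, the field widths under discussion are the standard IEEE 754 ones (1 sign / 8 exponent / 23 mantissa bits for single, 1 / 11 / 52 for double), and you can poke at them directly. A minimal C++ sketch just to show the layouts; nothing here is GPU-specific:

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

// Dump the sign, exponent and mantissa fields of IEEE 754 single and double.
void dump_float(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    printf("float  %g: sign=%u exp=0x%02x (8 bits) mant=0x%06x (23 bits)\n",
           f, bits >> 31, (bits >> 23) & 0xFFu, bits & 0x7FFFFFu);
}

void dump_double(double d) {
    uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);
    printf("double %g: sign=%u exp=0x%03x (11 bits) mant=0x%013llx (52 bits)\n",
           d, (unsigned)(bits >> 63), (unsigned)((bits >> 52) & 0x7FFu),
           (unsigned long long)(bits & 0xFFFFFFFFFFFFFull));
}

int main() {
    dump_float(-1.5f);
    dump_double(-1.5);
    return 0;
}
```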
11-15-2014, 08:54 PM
TSO
It depends on what standard you're using. If you're talking about the IEEE 7xx standard, then it adds more bits to the exponent.
11-16-2014, 05:30 PM
At any rate, I was curious about that (4096-bit) part in the title; I looked around and there's some speculation that they have moved the VRAM onto the die.
11-17-2014, 05:17 PM
(Note that 100% of this was written before my first reply to this thread. Also, due to time constraints, do not expect the promised assembly function explanations to come any time soon, if at all. I have too much work to do at the moment.)
@maga: Before I start with the quoting, I will immediately say that you have made several mistakes with assumptions about single and double precision compatibility. I will make an edit sometime tomorrow that details the exact method by which 32- and 64-bit CPUs would deal with each type, as well as how 32- and 64-bit FPUs would deal with it (an FPU has hardware support for at least one of the precision types). To make things confusing, the CPU stuff will all be written for x86 without any debugging or checking (good luck actually making it work) in AT&T syntax (you have been warned). Convenient ASCII art diagram thingies might be included because I'm not a total monster and vim actually makes their creation pretty easy, and the 32-bit CPU code will also contain a diagram of The Stack because I ran out of registers after about three lines of code. Do note that I'm not really using The Stack as a stack, just as an extremely convenient and rather large register. (Also, another great quote from my father I got just this evening: "Nobody actually codes in assembly anymore because computers are so fast now... hell, modern computers are so fast they can almost run Java." His programming quotes are the best...)

(11-14-2014, 03:07 AM)Magazorb Wrote: Somehow I doubt Nvidia will make much of a dent... do remember that double precision performance (programmers and developers are starting to make more use of it for various reasons) on Nvidia is often bad, no idea how. GTX Titan Blacks were somewhere between 8 and 9 GFlops of single precision floats (can't remember the exact number at base clock) but only around 2.6 GFlops of double (I don't think they use 64-bit processors, rather 64-bit emulated with 32-bit processors, maybe fancy CU stuff; that's normally Nvidia's style). The 290X has around the same amount of double precision for significantly less, and scales perfectly for single precision, meaning twice the speed of double.

Only 3 GFlops? My laptop's GPU (Intel HD 3000) does 129 GFlops at 1.3 GHz with 12 cores (about 4600 fewer than the cited GPUs) and whatever the Sandy Bridge I/O bus speed is for the bandwidth; an Xbox does 137 GFlops at about 60 GB/s of bandwidth; and my i5 can do even better than that thanks to the AVX extensions that nobody uses (256-bit processing using all four threads; there's a tiny AVX sketch at the end of this post). Methinks you hath misplaced some decimals (though I think the HD Graphics unit is 64-bit, and the i5 certainly is).

magazorb Wrote: Yes, GTX 770 and up take the single precision win against AMD... but with no double support until you reach the Titan in Nvidia's consumer lineup, there will probably be plenty of games that run badly on Nvidia. Star Citizen for example, which will use doubles (currently single, but the engine is being modified to support double).

No. Two 32-bit processors running 32-bit single precision will be mutually independent.
A 64-bit processor cannot perform two single precision operations simultaneously without many, many complications, because the single precision operations still need to be independent. Either the system needs to correct wrong carries and mask the exponents back into the lower half, while masking in the leading digits of the upper number, and then remove the bit-31 carry and account for the bit-22 carry if they do carry (this adds quite a few steps to the execution); or you have two parallel single precision hardware units and one double precision hardware unit all sharing the same 64-bit bus, with an opcode extension choosing between them (so now we need to invoke something like SADDQ or DADDQ for our single and double precision opcodes; this makes each 64-bit processor four times larger than each 32-bit processor, and that doesn't account for the extra hardware needed to support the additional operations). Instead, double precision processors tend to pad the exponent and numerical portions with leading zeroes, therefore being exactly the same speed as one 32-bit processor (half as fast as two of them).

These all assume the execution is done in dedicated hardware. If software is used, the difference is in which bits are masked out for which part of each process, and then some method is used to prevent the numerical operands from interfering while the carries in the upper and lower halves are monitored: for example, separating with three zeroes in the case of an add, monitoring the carry out of bit 23 (hint: it's in bit 24), adding one to the exponent of the corresponding number if there is a carry (the exponents are in a different register), bit-shifting only the value that carried, doing nothing to the signs (which are in yet another register), and then masking all three registers back together to get our two single precision numbers. This actually has a very small loss compared to double precision, which comes from the extra shift on the lower value (it has to decide which of four masks to use when recombining the three registers).

magazorb Wrote: However, if we think about this in more depth, we know that to keep good clock speeds we are required to pipeline. This then becomes an issue if your computation on a 64-bit value is based on running the lower half through first and chasing the second half in after it: you have to wait for the first half to compute and for the carry flag to cycle through and be ready. In a processor with a short pipeline this probably wouldn't be much of an issue, but with more complex pipelines the carry may not be ready by the next cycle. (This is very easily the case if you want really high throughput: you would generally have stage buffers at both the input and output of the execution stage, and in doing so create an issue where, to keep the clock speed up, you make the data loop 3 cycles instead of 2, say if we had only an input buffer.)

This depends on choices, but the extra cycles more likely come from the actual composition of the single and double precision numbers in the registers: they have to be parsed a few more times and then put back together again when not using double precision on 64-bit or single precision on 32-bit. (I will edit in some x86-like asm to show you what each system would have to do; in the meantime there's a rough stand-in for the masking idea just below.)
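Not the promised assembly, but as a stopgap here is a toy C++ sketch of the masking idea (my own example, using plain integer lanes rather than real IEEE fields, so take it as an illustration of "keep the halves from interfering", nothing more):

```cpp
#include <cstdint>
#include <cstdio>

// Two independent 32-bit lanes packed into one 64-bit register, added so
// that a carry out of the low lane cannot corrupt the high lane.
uint64_t packed_add(uint64_t a, uint64_t b) {
    const uint64_t LOW  = 0x00000000FFFFFFFFull;
    const uint64_t HIGH = 0xFFFFFFFF00000000ull;
    uint64_t lo = (a + b) & LOW;                    // low lane, carry out discarded
    uint64_t hi = ((a & HIGH) + (b & HIGH)) & HIGH; // high lane, isolated from low carry
    return hi | lo;
}

int main() {
    // Low lanes: 0xFFFFFFFF + 1 would carry into the high lane on a plain add.
    uint64_t a = 0x00000002FFFFFFFFull;   // high lane = 2, low lane = 0xFFFFFFFF
    uint64_t b = 0x0000000300000001ull;   // high lane = 3, low lane = 1
    printf("plain 64-bit add: %016llx (high lane polluted by the carry)\n",
           (unsigned long long)(a + b));
    printf("masked lane add : %016llx (lanes stay independent)\n",
           (unsigned long long)packed_add(a, b));
    return 0;
}
```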
magazorb Wrote: I suspect this means Nvidia will have to address this if they wish to remain performance king, and if my suspicion that they're using 32-bit processors is correct, then that will take up a lot of their new effective space, resulting in not much else actually being added.

...seems legit, except that (I'll use your layout of the 390X) it sounds more like they are all composed of either 4096 one-bit processors, 64 individual 64-bit processors (since you could view each individual bit trace in an ALU as a processing unit in its own right), or 64 RISC processors with 64 possible operations where each ALU internal unit (each adder, shifter, multiplier etc.) is being referred to as a processing unit. No matter what, it seems fairly obvious that these are all MEG compatible (the Marketing Engineering instruction set extension).

Also, you forgot that there isn't a single central control unit in a GPU. That would be rather silly considering that absolutely every single internal processing unit is doing exactly the same thing at exactly the same time to exactly the same data for exactly the same reason: they all share a control bus. They don't need a central unit because they aren't CPUs; they are CPU add-ons that use the existing CPU as their control unit. For example (this is my best guess for some of these things, I haven't quite had a chance to read the full GT manual, but the pictures seem to indicate this), according to my laptop, the operating data is in memory 0xE2800000-0xE2BFFFFF; this space is a series of structures containing different matrices that operate on the primary bitmap held in memory 0xD0000000-0xDFFFFFFF. The corresponding opcodes are sent through I/O 0x8000-0x803F, I/O 0x3B0-0x3BB and I/O 0x3C0-0x3DF (as far as I can tell, one of those sets is the VGA operations port, one is the HDMI operations port, and one is the 3D operations port, in no particular order). It should be noted that the VGA terminal bitmap is memory 0xA0000-0xBFFFF (this is a fact I am 100% certain of). On the memory input port of the graphics unit is a simple decoder/synchronizer that transforms incoming data into the proper type depending on whether it is intended for VGA, HDMI, or 3D, and ensures that the proper corresponding opcode also gets sent to the 12 processing units inside the GPU. If there were a central unit in the GPU, then there would only be the central unit's I/O ports, and then why wouldn't you just spend an extra hundred or so bucks and make that central unit an actual multicore x86 processor with all the bells and whistles (except hyper-threading; in this case hyper-threading is very bad juju) that serves as the whole computer's CPU?

magazorb Wrote: So an SE, loosely speaking, will bus data to and from the CUs, the PCIe interface, and extraneous components within the SE.

That's not how that works. The input bit rate is actually determined by a) how quickly these things can get code out of the processor (unfortunately, data conflicts are rather prominent in the system, so the memory bandwidth goes down quite a bit) and b) the actual interface to the processor. I don't think they just come with a 512-pin plug onboard. I'm fairly certain that PCIe x32 is something like 300 pins, so you're going to have to cut that number down immediately, and I know that PCIe x32 is only capable of about 31.5 GB/s, so you've got yourself a problem there.

magazorb Wrote: Latency is also another factor in performance, although less so with GPUs, and isn't too much to worry about.
Multiple processes in a GPU are actually rather rare, because their entire goal is to make one image at a time, so they all aim to do exactly the same thing at exactly the same time to as much corresponding data as possible. They do not aim for the MIMD goal of CPUs, because a GPU only needs to perform the matrix and encoding transforms that turn the bitmap into the final code for the screen to interpret and decode into pixel illuminations. A GPU is inherently a unitasker and is fully intended to stay that way.
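On the AVX aside from earlier in this post: here is a minimal C++ example of what "256-bit processing" buys you, namely eight single precision operations per instruction (just an illustration of the register width, compiled with -mavx; nothing GPU-specific):

```cpp
#include <immintrin.h>
#include <cstdio>

int main() {
    alignas(32) float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    alignas(32) float b[8] = {10, 20, 30, 40, 50, 60, 70, 80};
    alignas(32) float c[8];

    __m256 va = _mm256_load_ps(a);     // load 8 floats at once
    __m256 vb = _mm256_load_ps(b);
    __m256 vc = _mm256_add_ps(va, vb); // 8 single precision adds in one instruction
    _mm256_store_ps(c, vc);

    for (int i = 0; i < 8; ++i) printf("%g ", c[i]);
    printf("\n");
    return 0;
}
```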
11-17-2014, 06:17 PM
I think he meant TFLOPS.
Wow, doesn't he know how to bitch? XD Yeah, sorry about the FPU stuff. I did write up a post that I was meant to put up (I think I just might not have pressed send), but I lost it and can't be asked to write a full apology about my misunderstanding of FPUs again. So yes, take what I say about FPUs lightly. I'm aware there are 64-bit units that can be broken down to run several smaller precisions (whether they need extra hardware for that or what, I've no idea, so don't quote me on the exact details, but they exist, they do... search it up).
Also, I DID make it clear they have several control units. Modern GPUs are for the most part only M-SIMD processors, meaning one CU controls many processors doing virtually the same thing, with several arrays of those: for AMD it's 40/44 in the 290/290X respectively, and I suspect the 390X will be 64, each having 64 PEs, thus 64x64 = 4096 cores (64 works rather beautifully here, so it's easy to see why someone would choose it).

Also, to correct you: a lot of the time in smaller GPGPU tasks, GPUs do have to do many things at once, especially if you branch; this is well documented in plenty of GPGPU research (there's a toy sketch of the branching issue just below). I never once said or tried to suggest that GPUs try to run as MIMD, but they do have some aspects that loosely relate.

Also, I did mean TFLOPS (that kind of typo would have been obvious to self-correct XD). If you're here for the argument, really, don't bother; I don't care for arguing, I just pointed out some points. The fact they aren't 100% accurately explained wasn't the point; I kept the accuracy low on purpose, because most people don't care about knowing absolutely everything.

Also, the bandwidth I gave was bandwidth; I'm fully aware of how fast they execute instructions. However, there are times when a GPU has to load a lot of data rather quickly (normally in GPGPU), and that's when the bandwidth proves its worth; it has shown useful improvements in the past, hence why both AMD and Nvidia have been trying to increase memory bandwidth (Nvidia, I believe, did a 512-bit memory interface once, then bailed on it due to production costs, while AMD pushed for it on the 290(X)). Either way, it was a bandwidth measurement, not an instruction measurement.

No offence, but if GPUs only needed to do graphics processing they wouldn't have cores. The whole transition to cores happened because it really did make GPUs more versatile. AMD did a huge thing with core count and Nvidia got left behind back when Nvidia had the 400 series; many people still choose AMD for its double precision performance over Nvidia. Do please try to remember that like 10 years ago you would be hard pressed to find a GPU that had any cores on it; they normally had dedicated processors for particular graphics tasks, think of them almost like ASICs but with some flexibility.

I do know a little about image processing, but my post wasn't about image processing, it was about compute, which image processing is also part of. And yes, for image processing the GPU mostly runs uniformly, but it still has several processes to go through (many more nowadays) to resolve a single image. I'm aware that every pixel is processed and blah blah blah; again, that wasn't the sole purpose of my post. If that's all you want, you can go and make yourself an ASIC that does nothing but image processing; it would get you better performance, but GPUs do a whole lot more than they used to.

Don't get me wrong, I don't mean to be rude. I made some observations about where Nvidia is lagging behind AMD (mainly double precision) and stated some of the things I believe are the reason, then made a few speculations about future specs based on recent architectures from both vendors, then explained the M-SIMD arrays they tend to have. Also, about CUs: Nvidia only recently moved to several CUs per GPU; they seriously did run several hundred cores off a single CU. It was just impractical, hence why they moved away from it.
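Since I brought up branching, here is a toy C++ model (purely my own illustration, not how any vendor's scheduler actually works) of why divergent branches hurt on a lockstep SIMD group: the group ends up walking through both sides of the branch with lanes masked in and out:

```cpp
#include <cstdio>

constexpr int LANES = 8;   // hypothetical width of one SIMD group

// One control unit, many lockstep lanes: when lanes disagree on a branch,
// the whole group executes BOTH paths and keeps results per-lane via a mask.
void simd_group(const int* x, int* out) {
    bool mask[LANES];
    // "if (x < 0)": every lane evaluates the condition in lockstep
    for (int i = 0; i < LANES; ++i) mask[i] = (x[i] < 0);
    // then-path: walked by the whole group, results kept where the mask is set
    for (int i = 0; i < LANES; ++i) if (mask[i])  out[i] = -x[i];
    // else-path: walked by the whole group again, for the remaining lanes
    for (int i = 0; i < LANES; ++i) if (!mask[i]) out[i] = x[i] * 2;
}

int main() {
    int x[LANES]   = {-3, 5, -1, 7, 2, -8, 4, -6};
    int out[LANES] = {0};
    simd_group(x, out);
    for (int i = 0; i < LANES; ++i) printf("%d ", out[i]);
    printf("\n");
    return 0;
}
```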
I could have also explained how CPUs use 1D memory arrays in their ISs while GPUs normally use 2D arrays; furthermore, they are built very differently. I'm not going to explain the differences, they should be apparent enough. But may I ask you something: if you have two single precision FPUs, what's to stop you adding a few muxes and a few of the extra parts required to replicate a double precision FPU, or vice versa? Just because things are different doesn't mean they can't be used for it.

As for input rates, I never said anything about them. For the most part GPUs have close to full use of the bandwidth going into them, but not much for returning values, and that normally isn't an issue for anything you'd use a consumer-grade GPU for anyhow.

Also, I'd advise learning C++ if you want to program in this much depth, that or learn all the other ISs related to processors, the features they bring to ASM, and how to make full use of them. (Most people will say they know ASM at the point they can do 5+3, big deal; the true programmers will probably decline to claim they know it in real depth even when they're in the top 10% of programming skill, purely because they realise how much they don't know. Get yourself there and you're doing well.)

I hope this clarifies some of what I was saying in that post, but I don't mind if this goes on some more. Again, apologies for the lack of FPU knowledge.

P.S. Tell your father he's correct, nobody does, because C++ and C compile so well into ASM that it's pretty pointless. The only reason the better programmers like to drop down to it is to see if they can make little optimisations or to check whether there's a small issue with something; normally it's used for optimising and understanding what's going on better. But only a fool would believe Java will do everything; those who believe in good programming practice typically believe it's a human's job, not a job for a tool.
Or just maybe Nvidia's "new" card is just a K80, which seems to have somewhat obscure results.
http://www.anandtech.com/show/8729/nvidi...-gk210-gpu
http://www.nvidia.com/object/tesla-servers.html
I got a bit suspicious about why I thought a GTX Titan Black had such high GFLOPS as I quoted; I made a derp. The GTX Titan Z had 8.1 TFLOPS (sorry). TFLOPS for the GTX Titan Black were: double: 1.5, single: 4.5. So I guess technically the 290X in raw compute outdoes the GTX Titan Black (but Nvidia still has more of the extra tech to help process image data relevant to gaming). So it seems like AMD has a nice opportunity to catch up where it lags behind and get ahead where it doesn't, depending on how well their new bus system goes.
Some interesting things here: http://wccftech.com/amd-fiji-r9-390x-specs-leak/ (they seem to suck on the ads, sorry).
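For anyone wanting to sanity-check TFLOPS claims like these, the usual back-of-envelope is cores x clock x FLOPs per core per clock. The figures in this little C++ sketch are hypothetical round numbers of my own (2 FLOPs per core per clock assumes an FMA pipe), not quoted specs for any of the cards above:

```cpp
#include <cstdio>

// Theoretical peak: cores * clock (GHz) * FLOPs per core per clock, in TFLOPS.
double peak_tflops(int cores, double clock_ghz, int flops_per_core_per_clock) {
    return cores * clock_ghz * flops_per_core_per_clock / 1000.0;
}

int main() {
    // Hypothetical part: 2048 cores at 1.0 GHz, one FMA (2 FLOPs) per clock.
    double sp = peak_tflops(2048, 1.0, 2);
    printf("single precision peak: %.2f TFLOPS\n", sp);
    // If double precision runs at 1/3 of the single precision rate, as discussed above:
    printf("double precision peak: %.2f TFLOPS\n", sp / 3.0);
    return 0;
}
```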
11-24-2014, 02:40 PM
(This post was last modified: 11-24-2014, 02:41 PM by greatgamer34.)
Floating-point Operations Per Second
12-02-2014, 09:28 PM
Oh
like a t-flip-flop right
12-03-2014, 01:35 AM
12-03-2014, 05:20 AM
12-03-2014, 06:02 AM
flip.....
fl0p
12-07-2014, 04:01 AM
oh
so if it is flip-flop then it help feet run fast and run fast cpu is good thx