The Architecture
The passage of time enables an engineering company to fine-tune, polish and tweak an extant architecture so that it is a better fit for modern games. As mentioned on the previous page, AMD is putting more focus on software than hardware this time around, and because the GCN architecture is the backbone of both major gaming consoles, AMD doesn't have an opportunity for a complete overhaul, even if time and money were no object. RX 480 is an evolution.
Let's take it from the top. The now-familiar block diagram shows RX 480 comprises the aforementioned 36 Compute Units arranged in four blocks of nine. Each block, containing 576 cores, has its own geometry processor which is fed via four Asynchronous Compute Engines (ACE), or half as many as R9 290X. Though schedulers do exist in each CU, the RX 480 has a couple of hardware schedulers (HWS) up top, just like the 3rd Generation Fury series of GPUs.
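The shader arithmetic behind that layout can be checked in a few lines. This is a small sketch that uses only the figures quoted above (four blocks, nine CUs per block, 576 cores per block); the variable names are ours, not AMD's.

```python
# Figures quoted in the text: 36 CUs in four blocks of nine, 576 cores per block.
BLOCKS = 4
CUS_PER_BLOCK = 9
CORES_PER_BLOCK = 576

total_cus = BLOCKS * CUS_PER_BLOCK               # Compute Units on the chip
cores_per_cu = CORES_PER_BLOCK // CUS_PER_BLOCK  # cores inside each CU
total_cores = BLOCKS * CORES_PER_BLOCK           # cores on the chip

print(total_cus, cores_per_cu, total_cores)
```

The per-CU figure falls straight out of the per-block numbers, which squares with the CU topology being carried over from earlier GCN parts.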
What's interesting now is that Polaris can use its HWS to commandeer a chunk of the CUs by reserving them for time-sensitive operations. An example of this might be running audio - TrueAudio is now computed through the CUs, not a hardware chip - for, say, virtual reality, where the sound has to match the display. In effect, part of the Polaris chip can be run in what we'd term a high-priority mode.
Efficiency is gained by doing less work for the same output, and this is especially true at the front of the pipeline. A primitive is the most basic element of graphics - a triangle is a three-vertex primitive - and anything unseen by the viewer is wasted rendering. Imagine a scene where an object is turning so that its back is no longer seen. Those primitives are rendered pointlessly because they have no visible area and should be culled. The Primitive Discard Accelerator (PDA), enhanced in this series, aims to remove more of them, especially when the scene is bulked up via tessellation. Nvidia already has an efficient engine installed within its latest GeForce cards to do just this; AMD catches up with Polaris. The real-world benefits aren't anywhere near as pronounced as shown on the above chart, however, because rendering is more than efficient occlusion.
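To make the idea of discarding primitives concrete, here is a toy screen-space test. It is a simplified illustration and not AMD's hardware logic, which the company has not detailed. It assumes counter-clockwise winding marks a front-facing triangle, and the function names are hypothetical.

```python
def signed_area(a, b, c):
    """Signed area of a 2D triangle; positive when a, b, c wind counter-clockwise."""
    return 0.5 * ((b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1]))


def should_discard(a, b, c, eps=1e-9):
    """Discard triangles that face away from the viewer or cover no area.

    Assumption: counter-clockwise winding means front-facing.
    """
    return signed_area(a, b, c) <= eps


front = ((0, 0), (1, 0), (0, 1))       # counter-clockwise: visible
back = ((0, 0), (0, 1), (1, 0))        # same triangle seen from behind
degenerate = ((0, 0), (1, 1), (2, 2))  # collinear vertices: zero area

for name, tri in (("front", front), ("back", back), ("degenerate", degenerate)):
    print(name, "discard" if should_discard(*tri) else "keep")
```

Tessellation multiplies the number of small triangles in a scene, so a cheap early test of this kind is run far more often, which is why the text singles out tessellated scenes.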
Geometry instancing, which has been around since the DX9 days, is a great technique for reducing overall scene bandwidth as it uses multiple copies of one mesh to represent, say, a bunch of trees, lots of grass or a number of buildings that naturally look similar to each other in most games. A small cache reduces the need to move more data around the card, and any saved bandwidth is a boon when other parts of the GPU are bandwidth-bound.
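A rough back-of-the-envelope model shows why instancing saves data movement. The byte sizes and scene numbers below are hypothetical and chosen purely for illustration; real vertex formats and per-instance data vary by game.

```python
VERTEX_BYTES = 32     # hypothetical size of one vertex (position, normal, UV)
TRANSFORM_BYTES = 64  # one 4x4 matrix of 32-bit floats per instance


def naive_bytes(vertices_per_mesh, copies):
    """Every copy ships its own full set of vertices."""
    return vertices_per_mesh * VERTEX_BYTES * copies


def instanced_bytes(vertices_per_mesh, copies):
    """One shared mesh plus a per-copy transform."""
    return vertices_per_mesh * VERTEX_BYTES + copies * TRANSFORM_BYTES


# Hypothetical scene: 500 trees built from the same 1,000-vertex mesh.
naive = naive_bytes(1000, 500)
instanced = instanced_bytes(1000, 500)
print(naive, instanced, naive / instanced)
```

The saving grows with the number of copies, because only the small per-instance transform scales with the count.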
We could copy and paste the following statement for any recent CPU or GPU: improving the prefetch unit helps keep the compute unit busier and more efficient. AMD didn't go on to describe exactly what has changed between the generations of GCN. What we do know is that the basic topology of each CU is identical to the R9 290X's, which itself harks back to an older architecture.
An area-expensive way of increasing performance is to simply add more cache/buffers at each potential bottleneck. This is what happens with RX 480; AMD sacrifices some die area in order to keep the CUs better fed. Overall local L2 GPU cache is increased to 2MB, and a number of buffers have been augmented, as well.
Again, one cannot launch a new GPU without improving the memory subsystem in a way that increases usable bandwidth. Knowing the RX 480 ships with a 256-bit bus, which is considered optimum for a mainstream card but too narrow for something truly high-end, AMD jacks up the potential memory bandwidth by using GDDR5 memory rated at 8Gbps, or the same as the GTX 1070. Simple maths tells us that 256GB/s is on tap, but effective bandwidth works out higher because of the enhanced memory-compression technology. Do bear in mind that some RX 480 retail cards will ship with 7Gbps memory in order to achieve a lower price point.
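The simple maths is bus width multiplied by per-pin data rate, divided by eight to convert bits to bytes. A minimal sketch follows; the function name is ours.

```python
def peak_bandwidth_gb_s(bus_width_bits, data_rate_gbps):
    """Theoretical peak memory bandwidth in GB/s, before any compression gains."""
    return bus_width_bits * data_rate_gbps / 8  # 8 bits per byte


print(peak_bandwidth_gb_s(256, 8))  # cards with 8Gbps GDDR5
print(peak_bandwidth_gb_s(256, 7))  # cheaper cards with 7Gbps GDDR5
```

On the same 256-bit bus, the 7Gbps cards give up an eighth of the peak figure.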
You'll remember that the Fury cards used an improved delta colour compression (DCC) technology that yielded more usable bandwidth when data was compressible. AMD adds some more DCC sauce on RX 480, through software and hardware, with a peak 8:1 compression ratio if the colour data is identical.
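The principle behind delta compression can be sketched with a toy model: store one base value per block and then only the differences from it. AMD has not disclosed the actual block sizes or encodings, so the eight-pixel block, the 8-bit channel and the function names here are all assumptions made for illustration.

```python
RAW_BITS = 8  # assumed bits per uncompressed pixel value


def compressed_bits(pixels):
    """Bits needed to store a block as one base value plus fixed-width deltas."""
    raw = len(pixels) * RAW_BITS
    base = pixels[0]
    deltas = [p - base for p in pixels[1:]]
    widest = max((abs(d) for d in deltas), default=0)
    if widest == 0:
        return RAW_BITS  # identical colour: the base value alone is enough
    bits_per_delta = widest.bit_length() + 1  # magnitude plus a sign bit
    # Fall back to raw storage when the deltas would cost more than the pixels.
    return min(raw, RAW_BITS + len(deltas) * bits_per_delta)


def compression_ratio(pixels):
    return len(pixels) * RAW_BITS / compressed_bits(pixels)


print(compression_ratio([50] * 8))                                  # flat colour
print(compression_ratio([100, 101, 99, 100, 102, 100, 100, 101]))   # gentle gradient
print(compression_ratio([0, 255] * 4))                              # noisy data
```

In this toy model an eight-pixel block of identical colour reaches 8:1, while noisy data gains nothing, which matches the text's caveat that the benefit depends on how compressible the data is.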
The sum of these improvements offers, according to AMD, up to 15 per cent better performance on a clock-for-clock CU basis when comparing RX 480 to R9 290X. The 'up to' statement is important because in many current applications the gains will remain in single-digit numbers.
Architecturally, AMD set out to create a GPU whose major design decisions were set against a backdrop of increasing energy efficiency whilst also maintaining the throughput seen from previous generations. This feels like a Radeon R9 290(X) card cajoled into a smaller energy footprint through the use of 14nm process technology and iterative improvements. Drawing cross-company parallels, the GTX 980 Ti has had the same treatment to effectively become the GTX 1070.
Display
The video encode/decode ecosystem changes faster than graphics standards such as DX or Vulkan. RX 480 adds support for VP9, though it will only be operational through a software update. HEVC 4K60 encode was on the previous-generation Fiji hardware but we don't know the encode speed here.
Those who bemoaned the fact that previous Radeons were stunted as living-room-friendly cards will rejoice at RX 480 supporting HDMI 2.0b and DisplayPort 1.4. And of course, the GPU is compatible with the growing number of FreeSync monitors in the marketplace.
Virtual Reality and Asynchronous Compute
It's no secret that AMD is pitching RX 480 as the mass-market VR card of choice, stating that it opens up the possibility of an excellent VR experience at mainstream prices. Perusing the architecture shows no overt technology primed for VR, unlike Simultaneous Multi-Projection from Nvidia which reduces rendering overhead by using single-pass stereo. AMD continues to use a variable-resolution technology for minimising the load on the GPU, working by reducing the res at the peripheries of your vision. We'll have to see how it pans out against the competition once more VR benchmarks become available, but what we can say for now is that the RX 480 does pass Valve's VR Ready Test with a score of 6.4, or GeForce GTX 970 territory.
AMD, though, has always been strong in asynchronous compute, because it has dedicated hardware engines to handle the task of doing more by concurrent (graphics and compute) execution. RX 480 adds in a feature called Quick Response Queue (QRQ) whose job it is to designate a particular task as a priority. Other, lower-priority tasks are still processed by the ACEs concurrently, but the QRQ is able to insert that high-priority task in between. A classic example of QRQ is time warping for VR, and it can use the reservation function to ensure that the task is done within the appropriate time.
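The queue-jumping behaviour can be illustrated with a software priority queue. This is a conceptual sketch only: it runs tasks one at a time rather than modelling the concurrency of the hardware, and the task names and priority levels are hypothetical.

```python
import heapq
import itertools

HIGH, NORMAL = 0, 1        # lower number is served first
_order = itertools.count()  # tie-breaker that preserves submission order


def submit(queue, name, priority=NORMAL):
    """Add a task; high-priority work sorts ahead of anything still waiting."""
    heapq.heappush(queue, (priority, next(_order), name))


def drain(queue):
    """Pop tasks in the order a scheduler following these priorities would run them."""
    done = []
    while queue:
        _, _, name = heapq.heappop(queue)
        done.append(name)
    return done


queue = []
submit(queue, "shadow pass")
submit(queue, "physics")
submit(queue, "timewarp", priority=HIGH)  # latency-critical VR task
submit(queue, "post-processing")

execution_order = drain(queue)
print(execution_order)
```

The timewarp task is submitted third but runs first, while the remaining tasks keep their original relative order.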
All of these enhancements give the RX 480 an iterative improvement over what's gone before. A couple of per cent here, a couple of per cent there should add up to solid performance at common resolutions.