RV530
RV530 is a fairly different beast internally to RV515, although there are many functional similarities. Refer back a page to RV515 as needed, as we build up the picture of RV530 based on differences compared to that chip.
Vertex processing
RV530 shares the same vertex shader (VS) processing hardware as RV515, except that there are five units to RV515's pair.
Pixel processing and thread dispatch
There's still only one fragment quad in RV530, just like RV515. The difference is that the four fragment processors that make up that quad are three times as 'wide'. Each fragment to be shaded is therefore processed by three ALUs in lock step, rather than a single ALU. The ALUs are the same '4D' units, each capable of the same instruction issue rate, and each block of three ALUs has access to the same texturing resources.
The increase in ALU instructions in the pixel shader programs of modern games means that the triply parallel design of RV530's fragment hardware has obvious benefits. Pure arithmetic rate goes up threefold, and doing more work per cycle is the underlying tenet of modern 3D graphics processing.
Pixel threads are therefore also three times bigger, batched 12 wide in queues of 4, but RV530 maintains only the same 128 threads in flight that RV515 does.
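As a rough check on the figures above, the fragment ALU counts can be worked through in a few lines of Python. This is a sketch of the arithmetic only, not a model of the hardware; the constant and function names are ours, invented for illustration.

```python
# Illustrative arithmetic only; all names here are hypothetical.
QUADS = 1                      # both chips have a single fragment quad
PROCESSORS_PER_QUAD = 4        # four fragment processors make up the quad
RV515_ALUS_PER_PROCESSOR = 1   # one ALU per fragment on RV515
RV530_ALUS_PER_PROCESSOR = 3   # three ALUs in lock step on RV530


def fragment_alus(alus_per_processor: int) -> int:
    """Total fragment ALUs for a chip with the given per-processor width."""
    return QUADS * PROCESSORS_PER_QUAD * alus_per_processor


rv515_alus = fragment_alus(RV515_ALUS_PER_PROCESSOR)
rv530_alus = fragment_alus(RV530_ALUS_PER_PROCESSOR)

# Prints: 4 12 3
print(rv515_alus, rv530_alus, rv530_alus // rv515_alus)
```

The 12 ALUs line up with the 12-wide thread batches, and the ratio of 3 is the threefold rise in pure arithmetic rate, assuming both chips run at the same clock.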
Texture processing and pixel output
RV530 possesses the same texture processing ability as RV515, pairing 4 address and sampler units to create the entire TMU. It can also bilinearly fetch from single channel textures and pack the fetch into a four channel result, like RV515, in a feature ATI calls Fetch4.
ROP count is the same as RV515's at 4, and all features are identical, barring RV530's double Z-only rate. That still means two colour and two Z writes per cycle, but four Z (depth) writes if you mask off colour.
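To make the Fetch4 idea concrete, here is a minimal software sketch in Python of gathering the 2x2 footprint of a single channel texture into one four-component result. It is not ATI's implementation: the function name, the clamp-at-edge addressing and the order of the four components are all assumptions made for illustration.

```python
def fetch4(texture, x, y):
    """Return the 2x2 neighbourhood at (x, y) as one four-component tuple.

    `texture` is a list of rows of single-channel values. Edge clamping
    and the component order are assumptions, not documented behaviour.
    """
    height, width = len(texture), len(texture[0])
    x1 = min(x + 1, width - 1)   # clamp the right-hand neighbour
    y1 = min(y + 1, height - 1)  # clamp the lower neighbour
    return (texture[y][x], texture[y][x1], texture[y1][x], texture[y1][x1])


# A tiny single-channel texture, for example a depth map.
depth = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
]

print(fetch4(depth, 1, 1))  # (0.5, 0.6, 0.8, 0.9)
```

The point of the feature is that the shader receives four unfiltered samples from one fetch, which it can then compare or weight itself, rather than issuing four separate fetches.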
Ring bus memory controller
RV530, at 157 million transistors, has the same dual-ring memory controller as R520, just with a 256-bit internal width and a 128-bit external interface to the DRAM devices. The ring bus controller lets client interfaces issue memory requests, with writes to the DRAMs going via a crossbar switch which arbitrates write access to the correct device.
Read requests traverse the ring intelligently, at least as far as the memory controller can govern them, given its programmable interface. Because the memory controller 'knows' where each broad block of data is and stores addresses for those blocks, it sends requests round the shortest path to each of the five ring stops on the ring.
Four ring stops are for the DRAM devices themselves, which connect to their stop in pairs. The fifth is for general I/O to things like the PCI Express bus and by extension ATI HyperMemory, allowing the memory controller to address those resources properly.
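The shortest-path idea on a five-stop ring can be sketched as follows. This is a toy model under stated assumptions, not a description of ATI's controller logic: the stop numbering, the direction labels and the notion that a request can travel either way round are ours, chosen to illustrate the principle.

```python
RING_STOPS = 5  # four DRAM stops plus one general I/O stop


def shortest_route(src: int, dst: int, stops: int = RING_STOPS):
    """Return (direction, hops) for the shorter way round the ring.

    Stops are numbered 0..stops-1 in clockwise order (an assumption).
    Ties go to the clockwise direction.
    """
    clockwise = (dst - src) % stops
    counter = (src - dst) % stops
    if clockwise <= counter:
        return ("clockwise", clockwise)
    return ("counter-clockwise", counter)


# Routes from stop 0 to every stop on the ring.
for dst in range(RING_STOPS):
    print(dst, shortest_route(0, dst))
```

With five stops, no destination is ever more than two hops away when the shorter direction is chosen, which is one way to see how a ring layout can keep latency down.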
We'll go into further depth in a separate piece, but suffice to say the memory controller was designed for flexibility, reduced latency and scalability in terms of clock rate. Wire density is reduced because of the controller's layout, and only cost holds back the external interface, with ATI showing the full double-wide variant on R520.