“Low-level thinking in high-level shading languages” (Emil Persson, 2013), along with its follow-up “Low-level Shader Optimization for Next-Gen and DX11”, is in my top 3 most influential presentations, one that changed the way I think about shader programming in general (since I know you are wondering, the other 2 are Naty Hoffman’s Physically Based Shading and John Hable’s Uncharted 2 HDR Lighting). When I started graphics programming, shaders were handcrafted in assembly, the HLSL compiler being in its infancy. Back then you could beat the compiler and manually produce superior shader assembly. This changed over the years: the compiler improved immensely and I learned to rely on it more and not pay much attention to, or think about, the produced assembly code.
Continue reading “Low-level thinking in high-level shading languages 2023”
A gentler introduction to ReSTIR
Recently I started exploring ReSTIR, using mainly the Gentle Introduction to ReSTIR SIGGRAPH course and the original paper. I began with direct illumination (ReSTIR DI), to quickly set it up and get something working. ReSTIR is a very interesting technique that gives great results, but there is a lot of maths behind it that might dissuade people who want to dip their toes in it, which is a shame. Resources like the Gentle Introduction help a lot towards clarifying some of the theory behind it, but it is still maths heavy. In this post I will attempt a more “qualitative” discussion of ReSTIR, going straight to the results and avoiding referencing the maths behind it too much.
Continue reading “A gentler introduction to ReSTIR”
Raytraced Order Independent Transparency part 2
In the previous blog post I discussed how raytracing can be used to achieve order independent transparency (OIT) for some types of transparencies and how it compares to other OIT methods like per pixel linked lists and Multi-Layer Alpha Blending (MLAB). The basic idea, since DXR doesn’t support distance sorted traversal of the BVH, was to use a closest hit shader to find the intersection closest to the camera and then use the position of that intersection as the origin of a new ray to trace through the BVH. That worked well in that it achieved OIT, but the fact that each ray has to traverse the TLAS from the top every time we find an intersection is not ideal.
Continue reading “Raytraced Order Independent Transparency part 2”
Raytraced Order Independent Transparency
About a year ago I reviewed a number of Order Independent Transparency (OIT) techniques (part 1, part 2, part 3), each achieving a different combination of performance, quality and memory requirements. None of them fully solved OIT though, and I ended the series wondering what raytraced transparency would look like. Recently I added (some) DXR support to the toy engine and I was curious to see how it would work, so I did a quick implementation.
Continue reading “Raytraced Order Independent Transparency”
Experimenting with fp16, part 2
In the previous blog post I discussed how enabling fp16 for a particular shader didn’t seem to make a performance difference and also forced the compiler to allocate a larger number of VGPRs compared to the fp32 version (108 vs 81), which seemed weird as one of the (expected) advantages of fp16 is reduced register allocation. So I spent some more time investigating why this was happening. The shader I am referring to is the ResolveTemporal.hlsl one from the FidelityFX SSSR sample I recently integrated into my toy renderer.
Continue reading “Experimenting with fp16, part 2”
Experimenting with fp16 in shaders
With recent GPUs and shader models there is good support for 16 bit floating point numbers and operations in shaders. On paper, the main advantages of an fp16 representation are that it allows packing two 16 bit numbers into a single 32 bit register, reducing the register allocation for a shader and increasing occupancy, and that it allows a reduction of ALU instruction count by having a single instruction operate on a packed 32 bit register directly (i.e. affecting the two packed fp16 numbers independently). I spent some time investigating what fp16 looks like at the ISA level (GCN 5) and am sharing some notes I took.
I started with a very simple compute shader implementing some fp16 maths as a test. I compiled it using shader model 6.2 and the -enable-16bit-types DXC command line argument.
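To make the packing idea a bit more concrete, here is a small CPU-side sketch in Python (not shader code, and not taken from the post) that stores two fp16 values in one 32 bit word and reads them back. The helper names `pack_half2` and `unpack_half2` are made up for this illustration, and the low/high ordering of the two halves is an assumption.

```python
import struct

def pack_half2(a, b):
    """Round two floats to fp16 and pack them into one 32 bit word.

    Assumption for this sketch: `a` goes in the low 16 bits, `b` in the high 16 bits.
    """
    return struct.unpack("<I", struct.pack("<2e", a, b))[0]

def unpack_half2(word):
    """Split a 32 bit word back into its two fp16 values (returned as Python floats)."""
    return struct.unpack("<2e", struct.pack("<I", word))

word = pack_half2(1.0, -2.0)
print(hex(word))           # one 32 bit value holding both numbers
print(unpack_half2(word))  # the two values recovered independently

# fp16 only has a 10 bit mantissa, so most values are rounded when stored.
print(unpack_half2(pack_half2(0.1, 0.0))[0])
```

The round trip through `0.1` shows the precision cost that comes with the smaller representation.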
Continue reading “Experimenting with fp16 in shaders”
Stream compaction using wave intrinsics
It is common knowledge that removing unnecessary work is a crucial mechanism for achieving good performance on the GPU. We routinely create lists of visible model instances, for example, using frustum and other means of culling to avoid rendering geometry that will not contribute to the final image. While it is easy to create such lists on the CPU, it may not be as trivial for work generated on the GPU, for example when using GPU driven culling/rendering, or deciding which pixels in the image to raytrace reflections for. Such operations typically produce lists with invalid (culled) work items, which does not make effective use of the GPU’s batch processing nature, as it either has to skip over shader code or ends up with idle (inactive) threads in a wave.
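As a rough illustration of what compaction means here, the following Python sketch removes culled items from a list using an exclusive prefix sum over the "keep" flags. It is a sequential CPU stand-in, not the GPU implementation from the post; on the GPU the prefix count and the total would typically come from wave intrinsics such as `WavePrefixCountBits` and `WaveActiveCountBits`. The function name `compact` is hypothetical.

```python
def compact(items, keep):
    """Return the items whose keep flag is set, in their original order.

    Each surviving item is written to the slot given by the number of
    surviving items before it (an exclusive prefix sum of the flags).
    """
    slots = []
    running = 0
    for flag in keep:
        slots.append(running)      # survivors seen so far = output index
        running += 1 if flag else 0

    out = [None] * running         # `running` is now the compacted length
    for item, flag, slot in zip(items, keep, slots):
        if flag:
            out[slot] = item
    return out

visible = compact([10, 11, 12, 13, 14], [True, False, True, True, False])
print(visible)
```

The output list contains only valid work items, so every thread that later consumes it has something to do.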
Continue reading “Stream compaction using wave intrinsics”
Notes on screenspace reflections with FidelityFX SSSR
Today I set out to replace the old SSR implementation in the toy engine with the one from AMD’s FidelityFX, but in the end I got distracted and spent the day studying how it works instead. This is a modern SSR solution that implements a lot of good practices, so I’ve gathered my notes in a blog post in case someone finds it of interest. This is not intended as an exhaustive description of the technique, more like a few interesting observations.
The technique takes as input the main rendertarget, the worldspace normal buffer, a roughness buffer, a hierarchical depth buffer and an environment cubemap. The hierarchical depth buffer is a mip chain where each mip level pixel is the minimum of the previous level’s 2×2 area depths (mip 0 corresponds to the screen-sized, original depth buffer). It will be used later to speed up raymarching but can also be used in many other techniques, like GPU occlusion culling.
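The mip chain construction described above can be sketched on the CPU as follows. This is a minimal Python illustration, assuming a square depth buffer with power-of-two dimensions stored as a list of rows; the function name `build_min_depth_mips` is made up for the example, and a real implementation would do this in a compute shader and handle odd sizes.

```python
def build_min_depth_mips(depth):
    """Build a min-depth mip chain from a square, power-of-two depth buffer.

    mips[0] is the original buffer; each texel of mips[n + 1] is the minimum
    of the corresponding 2x2 block of mips[n].
    """
    mips = [depth]
    while len(mips[-1]) > 1:
        prev = mips[-1]
        size = len(prev) // 2
        mips.append([
            [min(prev[2 * y][2 * x],     prev[2 * y][2 * x + 1],
                 prev[2 * y + 1][2 * x], prev[2 * y + 1][2 * x + 1])
             for x in range(size)]
            for y in range(size)
        ])
    return mips

depth = [
    [0.9, 0.8,  0.5, 0.6],
    [0.7, 0.95, 0.4, 0.3],
    [0.2, 0.25, 1.0, 1.0],
    [0.3, 0.35, 1.0, 0.99],
]
for level, mip in enumerate(build_min_depth_mips(depth)):
    print(level, mip)
```

Because every coarser texel stores a conservative bound for the area it covers, a ray can skip large empty regions by testing against the coarse levels first.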
Continue reading “Notes on screenspace reflections with FidelityFX SSSR”
Order Independent Transparency: Endgame
In the past 2 posts (part 1, part 2), I discussed the complexity of correctly sorting and rendering transparent surfaces and I went through a few OIT options, including per pixel linked lists, transmittance function approximations and the role rasteriser ordered views can play in all this. In this last post I will continue and wrap up my OIT exploration by discussing a couple more transmittance function approximations that can be used to implement improved transparency rendering.
Continue reading “Order Independent Transparency: Endgame”
Order independent transparency, part 2
In the previous blog post we discussed how to use a per-pixel linked list (PPLL) to implement order independent transparency and how the unbounded nature of overlapping transparent surfaces can be problematic in terms of memory requirements, and ultimately may lead to rendering artifacts. In this blog post we explore approximations that are bounded in terms of memory.
Also in the previous blog post we discussed the transmittance function
and how it can be used to describe how radiance is reduced as it travels through transparent surfaces
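The formula itself is not reproduced in this excerpt. As a hedged sketch, a commonly used definition (an assumption here, not necessarily the exact form from the post) is that the transmittance at a given depth is the product of (1 - alpha) over all transparent surfaces in front of that depth. The Python below uses that definition to composite unsorted fragments; the names `transmittance` and `composite` are hypothetical and colours are single floats for brevity.

```python
def transmittance(alphas_in_front):
    """Fraction of radiance that survives after passing through the given surfaces.

    Assumed definition: product of (1 - alpha) over the surfaces in front.
    """
    t = 1.0
    for alpha in alphas_in_front:
        t *= 1.0 - alpha
    return t

def composite(fragments, background):
    """Blend (depth, colour, alpha) fragments given in any order over a background."""
    result = 0.0
    t = 1.0  # transmittance accumulated so far, nearest surface first
    for _, colour, alpha in sorted(fragments, key=lambda f: f[0]):
        result += t * alpha * colour  # this surface, dimmed by what is in front of it
        t *= 1.0 - alpha
    return result + t * background

print(transmittance([0.5, 0.5]))
print(composite([(2.0, 1.0, 0.5), (1.0, 0.0, 0.5)], background=0.0))
```

The result does not depend on the order in which fragments are submitted, which is the property the bounded-memory approximations try to preserve without storing and sorting every fragment.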