DirectX 11 advanced tutorial: Tiled Based Deferred Shading

2022-06-12 06:01:00 【Senior brother Dai Dai】

Preface

Many games contain a large number of point lights (PointLight). Environment artists want a scene to capture the atmosphere of the real world, so it is not surprising to find thousands of point lights placed in a single scene.

The following sections describe how a large number of point lights performs in the traditional rendering pipelines.

Point light calculation in traditional forward rendering (Traditional Forward Rendering)

Roughly speaking, each object is drawn in its own render pass, and the point lights that affect that object are passed to the shader as an array and evaluated there.

Summary: in traditional forward rendering many objects may cover the same pixel, so overdraw is very high. Much of the shading work is wasted, because only the front-most fragment of each pixel ends up on screen. Deferred rendering was introduced to address this.

Traditional deferred rendering (Traditional Deferred Rendering)

Traditional deferred rendering is simple: render every object in the scene once, writing several geometry buffers (the GBuffer), and then compute shading from the GBuffer in a full-screen pass. The most common way to render a point light is to treat it as a geometric sphere (LightSphereVolume) and rasterize that sphere, so that only the pixels within the light's radius are shaded. No light has to be evaluated for every pixel on the screen. The render target is set to additive blending, so N lights are accumulated in N passes, which gives the final shading result for all point lights.

Summary: compared with forward rendering, only the front-most layer of pixels is shaded, so overdraw drops substantially and far less computation is wasted. However, N point lights require N light-sphere render passes, and every pass reads the GBuffer textures and writes a shading result, which wastes a great deal of GPU bandwidth. As shown below:

Graphics engineers therefore proposed a more efficient pipeline for computing point lights in deferred rendering: Tiled Deferred Shading

Tile-based deferred rendering (Tiled Based Deferred Shading)

The schematic of traditional deferred rendering above illustrates its main defect, high GPU bandwidth usage. The ideal improved model is as follows:

The ideal state: for shading, each pixel reads the GBuffer only once and writes its shading result only once.
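To make the contrast concrete, here is a deliberately simplified traffic model in C++. It is only a sketch: the function names are hypothetical, every input is an assumption supplied by the caller, and it ignores caches, compression and the cost of reading the light list itself.

```cpp
#include <cstdint>

// Illustrative model only; all inputs are assumptions chosen by the caller.
// Light-volume deferred: every light pass re-reads the GBuffer for each pixel
// the light covers, and additive blending reads and writes the render target.
constexpr std::uint64_t LightVolumeTraffic(std::uint64_t lightCount,
                                           std::uint64_t pixelsPerLight,
                                           std::uint64_t gbufferBytesPerPixel,
                                           std::uint64_t outputBytesPerPixel) {
    return lightCount * pixelsPerLight *
           (gbufferBytesPerPixel + 2 * outputBytesPerPixel);
}

// Ideal model from the text: each pixel reads the GBuffer once and
// writes its shading result once, regardless of the number of lights.
constexpr std::uint64_t IdealTraffic(std::uint64_t pixelCount,
                                     std::uint64_t gbufferBytesPerPixel,
                                     std::uint64_t outputBytesPerPixel) {
    return pixelCount * (gbufferBytesPerPixel + outputBytesPerPixel);
}
```

The first figure grows with the number of lights and their screen coverage, while the second depends only on the screen resolution, which is the property the tiled approach tries to get close to.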

To approach this ideal, graphics engineers proposed the idea of tiling (tiled): on top of deferred rendering, the whole screen is divided into a grid of tiles, each tile (tile) having a resolution of, for example, 16x16 pixels. A compute shader (computeShader), with its strong parallelism, works out which lights affect which tiles, and only those lights are then used to shade the pixels of the corresponding tile.
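As a minimal host-side sketch of this tiling, the number of thread groups to dispatch is the screen size divided by the tile size, rounded up. `TileGrid` and `ComputeTileGrid` are hypothetical names, and the default tile size of 16 is simply the value quoted in the text.

```cpp
#include <cstdint>

// Hypothetical helper type: how many tiles (thread groups) cover the screen.
struct TileGrid {
    std::uint32_t tilesX;
    std::uint32_t tilesY;
};

// Round up so that partially covered tiles at the right and bottom
// edges of the screen still get a thread group.
constexpr TileGrid ComputeTileGrid(std::uint32_t width, std::uint32_t height,
                                   std::uint32_t tileSize = 16) {
    return TileGrid{(width + tileSize - 1) / tileSize,
                    (height + tileSize - 1) / tileSize};
}
```

For a 1920x1080 target this gives 120x68 tiles, so the compute shader would be dispatched with 120 x 68 x 1 groups; the bottom row of tiles is only half covered by the screen.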

TiledBasedDeferredShading is abbreviated to TBDS below.

The TBDS rendering process:

(1) Render the GBuffer for the entire scene

(2) In the computeShader, split the screen into tiles (tile). A tile is usually 16x16 or 32x32 pixels. For each tile, compute the maximum and minimum PosZ over all of its pixels (camera space is generally a good choice)

// PointLight, the constants (View, lightCount, cameraPos, ScreenWidth, ScreenHeight),
// GroundThreadSize and the helper functions are not shown in this excerpt.
Texture2D<float4> DepthTex:register(t0);
Texture2D<float4> WorldPosTex:register(t1);
Texture2D<float4> WorldNormalTex:register(t2);
Texture2D<float4> SpecularRoughMetalTex:register(t3);
Texture2D<float4> AlbedoTex:register(t4);
SamplerState clampLinearSample:register(s0);
StructuredBuffer<PointLight> PointLights : register(t5);
RWTexture2D<float4> OutputTexture : register(u0);

// Per-tile shared state; reset explicitly inside CS rather than with an initializer
groupshared uint minDepthInt;
groupshared uint maxDepthInt;
groupshared uint visibleLightCount;
groupshared uint visibleLightIndices[1024];

[numthreads(GroundThreadSize, GroundThreadSize, 1)]
void CS(
	uint3 groupId :  SV_GroupID,
	uint3 groupThreadId : SV_GroupThreadID,
	uint groupIndex : SV_GroupIndex,
	uint3 dispatchThreadId : SV_DispatchThreadID)
{
	// (2) Compute the camera-space MaxZ and MinZ of each tile
	float depth = DepthTex[dispatchThreadId.xy].r;
	float viewZ = DepthBufferConvertToLinear(depth);
	// For non-negative floats the raw bit pattern sorts like the value,
	// so the integer atomics below can be applied to viewZ
	uint depthInt = asuint(viewZ);

	// One thread resets the shared state for this tile
	if (groupIndex == 0)
	{
		minDepthInt = 0xFFFFFFFF;
		maxDepthInt = 0;
		visibleLightCount = 0;
	}
	GroupMemoryBarrierWithGroupSync();

	// Pixels whose depth is exactly 0 do not contribute to the range
	if (depth != 0.0)
	{
		InterlockedMin(minDepthInt, depthInt);
		InterlockedMax(maxDepthInt, depthInt);
	}

	// Wait until every thread in the tile has contributed
	GroupMemoryBarrierWithGroupSync();

	float minViewZ = asfloat(minDepthInt);
	float maxViewZ = asfloat(maxDepthInt);
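The listing keeps the view-space depth in `uint` variables because the `InterlockedMin` / `InterlockedMax` atomics work on integers. This is valid because, for non-negative IEEE 754 floats, comparing the raw bit patterns as unsigned integers gives the same order as comparing the float values. The C++ sketch below checks that property on the CPU; `AsUint`, `AsFloat`, `DepthRange` and `TileDepthRange` are hypothetical names, the first two standing in for HLSL's `asuint` and `asfloat`.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical CPU stand-ins for HLSL asuint / asfloat (bit reinterpretation).
inline std::uint32_t AsUint(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof(u));
    return u;
}

inline float AsFloat(std::uint32_t u) {
    float f;
    std::memcpy(&f, &u, sizeof(f));
    return f;
}

struct DepthRange {
    float minZ;
    float maxZ;
};

// Mirrors the per-tile reduction sequentially: start from the extreme bit
// patterns and fold in every depth as an unsigned integer.
// Only valid when all depths are non-negative.
inline DepthRange TileDepthRange(const std::vector<float>& viewZ) {
    std::uint32_t minBits = 0xFFFFFFFFu;
    std::uint32_t maxBits = 0u;
    for (float z : viewZ) {
        minBits = std::min(minBits, AsUint(z));
        maxBits = std::max(maxBits, AsUint(z));
    }
    return DepthRange{AsFloat(minBits), AsFloat(maxBits)};
}
```

The trick breaks for negative values, whose bit patterns sort in reverse, so it relies on the depth being measured along a positive view axis.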

(3) Compute the frustum of each tile (tile), that is, the side planes of the tile's view frustum in camera space

	float3 frustumEqn0, frustumEqn1, frustumEqn2, frustumEqn3;
	uint tileResWidth = GroundThreadSize * GetNumTilesX();
	uint tileResHeight = GroundThreadSize * GetNumTilesY();
	uint pxm = GroundThreadSize * groupId.x;
	uint pym = GroundThreadSize * groupId.y;
	uint pxp = GroundThreadSize * (groupId.x + 1);
	uint pyp = GroundThreadSize * (groupId.y + 1);

	// four corners of the tile, clockwise from top-left
	float3 frustum0 = ConvertProjToView(float4(pxm / (float)tileResWidth*2.f - 1.f, (tileResHeight - pym) / (float)tileResHeight*2.f - 1.f, 1.f, 1.f)).xyz;
	float3 frustum1 = ConvertProjToView(float4(pxp / (float)tileResWidth*2.f - 1.f, (tileResHeight - pym) / (float)tileResHeight*2.f - 1.f, 1.f, 1.f)).xyz;
	float3 frustum2 = ConvertProjToView(float4(pxp / (float)tileResWidth*2.f - 1.f, (tileResHeight - pyp) / (float)tileResHeight*2.f - 1.f, 1.f, 1.f)).xyz;
	float3 frustum3 = ConvertProjToView(float4(pxm / (float)tileResWidth*2.f - 1.f, (tileResHeight - pyp) / (float)tileResHeight*2.f - 1.f, 1.f, 1.f)).xyz;
	frustumEqn0 = CreatePlaneEquation(frustum0, frustum1);
	frustumEqn1 = CreatePlaneEquation(frustum1, frustum2);
	frustumEqn2 = CreatePlaneEquation(frustum2, frustum3);
	frustumEqn3 = CreatePlaneEquation(frustum3, frustum0);
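`CreatePlaneEquation` and `TestFrustumSides` are not shown in the listing. One plausible reading, sketched here in C++ as an assumption rather than the author's code, is that each side plane passes through the camera origin and two adjacent tile corners, so its normal is the normalized cross product of the two corner vectors, and a sphere passes the test when its signed distance to every plane is smaller than its radius. `Vec3` and the small vector helpers are hypothetical.

```cpp
#include <cmath>

struct Vec3 {
    float x, y, z;
};

inline Vec3 Cross(Vec3 a, Vec3 b) {
    return Vec3{a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}

inline float Dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

inline Vec3 Normalize(Vec3 v) {
    const float len = std::sqrt(Dot(v, v));
    return Vec3{v.x / len, v.y / len, v.z / len};
}

// Plane through the origin and the points b and c. Because it contains the
// origin, the plane is fully described by its unit normal.
inline Vec3 CreatePlaneEquation(Vec3 b, Vec3 c) { return Normalize(Cross(b, c)); }

// The sphere is kept when it is inside or touching all four side planes.
// With corners listed clockwise from top-left (z pointing into the screen),
// the normals point out of the frustum, so "inside" means a negative distance.
inline bool TestFrustumSides(Vec3 center, float radius,
                             Vec3 plane0, Vec3 plane1, Vec3 plane2, Vec3 plane3) {
    return Dot(center, plane0) < radius && Dot(center, plane1) < radius &&
           Dot(center, plane2) < radius && Dot(center, plane3) < radius;
}
```

Because all four planes pass through the origin, three floats per plane are enough, which matches the `float3` plane variables in the listing.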

(4) For each tile (tile), loop over all point lights, cull them with both the tile frustum and the tile depth range, and append the global index of every light that affects the tile to the tile's list of visible lights

	// (4) Find the point lights that intersect this tile and record their indices
	uint threadCount = GroundThreadSize * GroundThreadSize;
	// Passes needed for threadCount threads to cover every light (rounded up)
	uint passCount = (int(lightCount) + threadCount - 1) / threadCount;

	for (uint i = 0; i < passCount; ++i)
	{
		// Each thread tests one light per pass
		uint lightIndex = i * threadCount + groupIndex;
		if (lightIndex >= lightCount)
			continue;

		PointLight light = PointLights[lightIndex];
		float3 viewLightPos = mul(float4(light.pos, 1.0), View).xyz;
		// First cull against the four side planes of the tile frustum
		if(TestFrustumSides(viewLightPos, light.radius, frustumEqn0, frustumEqn1, frustumEqn2, frustumEqn3))
		{
			// Then cull against the tile's depth range [minViewZ, maxViewZ]
			if (minViewZ - viewLightPos.z < light.radius && viewLightPos.z - maxViewZ < light.radius)
			{
				// Reserve a slot atomically; the list holds at most 1024 entries per tile
				uint offset;
				InterlockedAdd(visibleLightCount, 1, offset);
				visibleLightIndices[offset] = lightIndex;
			}
		}
	}

	// Make the finished list visible to every thread in the tile
	GroupMemoryBarrierWithGroupSync();
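The loop above spreads the light list across the threads of a group: with `threadCount` threads, the thread with index `groupIndex` tests lights `groupIndex`, `groupIndex + threadCount`, and so on. The C++ sketch below emulates that indexing sequentially, as a CPU model rather than GPU code, together with the depth-range test. `LightDepth`, `DepthOverlaps` and `CullLightsForTile` are hypothetical names, and the frustum-side test is left out to keep the example short.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical reduced light: only what the depth test needs.
struct LightDepth {
    float viewZ;   // view-space depth of the light center
    float radius;  // radius of influence
};

// Same depth-range test as the listing: the interval [z - r, z + r]
// must overlap the tile's [minViewZ, maxViewZ] interval.
inline bool DepthOverlaps(const LightDepth& l, float minViewZ, float maxViewZ) {
    return (minViewZ - l.viewZ < l.radius) && (l.viewZ - maxViewZ < l.radius);
}

inline std::vector<std::uint32_t> CullLightsForTile(
    const std::vector<LightDepth>& lights, std::uint32_t threadCount,
    float minViewZ, float maxViewZ) {
    std::vector<std::uint32_t> visible;
    const std::uint32_t lightCount = static_cast<std::uint32_t>(lights.size());
    const std::uint32_t passCount = (lightCount + threadCount - 1) / threadCount;
    for (std::uint32_t pass = 0; pass < passCount; ++pass) {
        // On the GPU these iterations run in parallel, one per thread.
        for (std::uint32_t groupIndex = 0; groupIndex < threadCount; ++groupIndex) {
            const std::uint32_t lightIndex = pass * threadCount + groupIndex;
            if (lightIndex >= lightCount) continue;
            if (DepthOverlaps(lights[lightIndex], minViewZ, maxViewZ)) {
                visible.push_back(lightIndex);  // InterlockedAdd + store in the shader
            }
        }
    }
    return visible;
}
```

The sequential version returns indices in ascending order, whereas on the GPU the order in the list depends on which thread wins each atomic add; the shading result does not depend on that order because the contributions are summed.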

(5) Loop over the tile's list of visible lights and shade every pixel in the tile. This way each GBuffer RT is read only once per pixel and the Shading result is written only once, so GPU bandwidth usage is low


	// Accumulated lighting for this pixel; its declaration is not shown in the
	// original listing, so starting from opaque black is an assumption
	float4 color = float4(0.0, 0.0, 0.0, 1.0);

	if (visibleLightCount > 0)
	{
		// Sample at the texel center (+0.5) so the linear sampler does not blend neighbouring texels
		float2 uv = float2((dispatchThreadId.x + 0.5) / ScreenWidth, (dispatchThreadId.y + 0.5) / ScreenHeight);

		// G-Buffer position (one float unused)
		float3 worldPos = WorldPosTex.SampleLevel(clampLinearSample, uv, 0).xyz;

		// G-Buffer normal (one float unused)
		float3 worldNormal = WorldNormalTex.SampleLevel(clampLinearSample, uv, 0).xyz;
		worldNormal = normalize(worldNormal);

		float3 albedo = AlbedoTex.SampleLevel(clampLinearSample, uv, 0).xyz;

		// G-Buffer specular / roughness / metal (one float unused)
		float3 gBufferAttrbite = SpecularRoughMetalTex.SampleLevel(clampLinearSample, uv, 0).xyz;
		float specular = gBufferAttrbite.x;
		float roughness = gBufferAttrbite.y;
		float metal = gBufferAttrbite.z;

		for (uint index = 0; index < visibleLightCount; ++index)
		{
			uint lightIndex = visibleLightIndices[index];
			PointLight light = PointLights[lightIndex];
			float3 pixelToLightDir = light.pos - worldPos;
			float distance = length(pixelToLightDir);
			float3 L = normalize(pixelToLightDir);
			float3 V = normalize(cameraPos - worldPos);
			float3 H = normalize(L + V);
			float4 attenuation = light.attenuation;
			float attenua = 1.0 / (attenuation.x + attenuation.y * distance + distance * distance * attenuation.z);
			float3 radiance = light.color * attenua;

			//f(cook_torrance) = D* F * G /(4 * (wo.n) * (wi.n))
			float D = DistributionGGX(worldNormal, H, roughness);
			float G = GeometrySmith(worldNormal, V, L, roughness);
			float3 fo = GetFresnelF0(albedo, metal);
			float cosTheta = max(dot(V, H), 0.0);
			float3 F = FresnelSchlick(cosTheta, fo);
			float3 ks = F;
			float3 kd = float3(1.0, 1.0, 1.0) - ks;
			kd *= 1.0 - metal;

			float3 dfg = D * G * F;
			float nDotl = max(dot(worldNormal, L), 0.0);
			float nDotv = max(dot(worldNormal, V), 0.0);
			float denominator = 4.0 * nDotv * nDotl;
			float3 specularFactor = dfg / max(denominator, 0.001);

			color.xyz += (kd * albedo / PI + specularFactor * specular) * radiance * nDotl * 2.2;
		}
	}

	// Single write of the final shading result for this pixel
	OutputTexture[dispatchThreadId.xy] = color;
} // end of CS
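`DistributionGGX`, `GeometrySmith`, `FresnelSchlick` and `GetFresnelF0` are not shown in the listing. The C++ sketch below assumes they follow the widely used GGX, Smith/Schlick-GGX and Schlick formulations, and works on scalar dot products instead of vectors to stay short, so the signatures differ from the shader's. The remap k = (roughness + 1)^2 / 8 is a conventional choice for direct lighting, not something the source confirms. `Attenuation` restates the attenuation formula that does appear in the listing.

```cpp
#include <cmath>

constexpr float kPi = 3.14159265358979f;

// GGX (Trowbridge-Reitz) normal distribution, using alpha = roughness^2.
inline float DistributionGGX(float nDotH, float roughness) {
    const float a = roughness * roughness;
    const float a2 = a * a;
    const float d = nDotH * nDotH * (a2 - 1.0f) + 1.0f;
    return a2 / (kPi * d * d);
}

// Schlick-GGX geometry term for a single direction (view or light).
inline float GeometrySchlickGGX(float nDotX, float roughness) {
    const float r = roughness + 1.0f;
    const float k = r * r / 8.0f;  // assumed direct-lighting remap
    return nDotX / (nDotX * (1.0f - k) + k);
}

// Smith's method: combine the view and light terms.
inline float GeometrySmith(float nDotV, float nDotL, float roughness) {
    return GeometrySchlickGGX(nDotV, roughness) * GeometrySchlickGGX(nDotL, roughness);
}

// Schlick's Fresnel approximation for one color channel.
inline float FresnelSchlick(float cosTheta, float f0) {
    return f0 + (1.0f - f0) * std::pow(1.0f - cosTheta, 5.0f);
}

// Attenuation as written in the listing: 1 / (x + y * d + z * d^2).
inline float Attenuation(float constant, float linear, float quadratic, float distance) {
    return 1.0f / (constant + linear * distance + quadratic * distance * distance);
}
```

A CPU version like this can be handy for checking a few pixels of the compute shader output by hand, for example by reading back one texel and comparing it with the value computed from the same GBuffer inputs.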