GPU Cache Hierarchy: Understanding L1, L2, and VRAM | Charles Grassi
Charles Grassi
December 29, 2025
Summary
This article discusses GPU cache hierarchy, explaining how memory access patterns affect texture fetch latency and shader performance. It covers L1 and L2 caches, VRAM, cache lines, spatial locality, coherent versus incoherent texture sampling, warp-level cache coalescing, mipmapping, and practical optimization strategies such as texture atlasing and channel packing. It emphasizes the importance of profiling to identify and address cache-related bottlenecks.
Content Sections
GPU Cache Hierarchy: Understanding L1, L2, and VRAM
Every texture sample in your shader triggers a memory request. The latency of this request depends on where the data lives in the GPU's memory hierarchy. Fixing cache access patterns can significantly improve shader performance, even without algorithmic changes.
GPU Memory Hierarchy Overview
GPU memory forms a hierarchy: registers, L1 cache, L2 cache, and VRAM. Registers offer the fastest access but the smallest capacity, while VRAM provides the largest capacity at the highest latency. Each level trades latency against size per streaming multiprocessor (SM) and bandwidth, so a hit in L1 is an order of magnitude cheaper than a trip to VRAM.
Cache Lines and Spatial Locality
When sampling a texel, the GPU fetches an entire cache line (128 bytes for L1). This leverages spatial locality, assuming neighboring pixels are likely to be accessed. Coherent access patterns, where neighboring fragments sample neighboring texels, are more efficient than random access, which causes cache thrashing.
The Warp and Cache Coalescing
GPUs run threads in warps (32 threads on NVIDIA). The hardware coalesces memory requests within a warp: if all threads access addresses within the same 128-byte chunk, the result is a single fetch. Access patterns rank from broadcast and coalesced (fastest) to scattered and bank-conflicted (slowest).
What Causes Cache Misses?
Common causes of cache misses include random UVs, dependent texture reads, large UV jumps, texture thrashing (too many unique textures), and working set overflow. These patterns can lead to significant latency penalties.
Mipmapping: The Cache's Best Friend
Mipmaps serve as a cache optimization. Distant objects sample from lower-resolution mip levels, which shrinks the working set so more of it fits in cache and neighboring fragments land on the same cache lines, increasing hit rates.
Explicit LOD for Cache Control
GLSL provides textureLod (explicit mip level) and textureGrad (explicit derivatives) for direct control over level-of-detail selection, enabling cache-aware sampling when automatic derivative-based selection is unsuitable.
Measuring Cache Efficiency
GPU profilers like NVIDIA Nsight and AMD Radeon GPU Profiler expose cache hit rates and memory throughput. Key metrics include L1 and L2 hit rates and texture memory throughput.
Texture Fetch Latency Hiding
GPUs hide memory latency by scheduling other warps while one warp waits for data. High occupancy is crucial for effective latency hiding. Avoid dependent texture reads and maintain a small working set.
Texture Memory Layout
GPUs use Morton order (Z-order curves) to store textures, keeping 2D-neighboring texels close in memory.
Practical Optimization Strategies
Practical optimization strategies include texture atlasing, channel packing, reducing dependent reads, and using bindless textures. Channel packing involves combining related data into fewer textures to reduce memory traffic.
Architecture Differences
Cache behavior varies across different GPU architectures (NVIDIA, AMD, mobile). Optimizations should be performed on the target hardware.
Quick Wins: Copy-Paste Patterns
ORM (occlusion-roughness-metallic) texture packing and applying a mip bias to blur passes are quick wins that can be implemented directly in shaders.
Key Takeaways
Key takeaways: keep access patterns coherent, treat mipmaps as a cache optimization, pack related data together, avoid dependent texture reads, and profile with Nsight or RGP to identify cache-related issues. Measure first, optimize second, and verify the improvement.