Are UUIDv4s Sabotaging Your App? Indexes Under Fire

Ahmed Ghazey
6 min read · Dec 23, 2023


In my role as a backend engineer, I had the opportunity to work on a globally used application with millions of users. At this scale, inserting and retrieving data efficiently is critical, especially for tables containing millions of records, and column indexes play a major part in that optimization.

Indexes are powerful weapons in our arsenal for tuning SQL query performance. The core purpose of an index is to provide rapid, O(log N) lookups of rows matching conditions on the indexed columns. It does this by maintaining a tree structure built from the indexed column’s values. Well-chosen indexes therefore allow the database to avoid costly linear scans of entire tables.
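
To make that concrete, here is a minimal Python sketch contrasting the two access patterns. The sorted list plus binary search is only a stand-in for a real B-tree index, and the key count is invented for illustration.

```python
import bisect
import uuid

# Toy "index": one million keys kept in sorted order, a stand-in
# for the leaf level of a real B-tree.
keys = sorted(str(uuid.uuid4()) for _ in range(1_000_000))

def indexed_lookup(sorted_keys, target):
    """O(log N) binary search: the behavior an index provides."""
    i = bisect.bisect_left(sorted_keys, target)
    return i if i < len(sorted_keys) and sorted_keys[i] == target else -1

def full_scan(all_keys, target):
    """O(N) sequential scan: what happens without an index."""
    for i, key in enumerate(all_keys):
        if key == target:
            return i
    return -1

target = keys[750_000]
assert indexed_lookup(keys, target) == full_scan(keys, target)
# The binary search compares ~20 keys; the scan compares ~750,001.
```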

Before diving into the details of SQL indexing algorithms, it is helpful to briefly review CPU and memory caching fundamentals. When a processor executes database operations, the instructions and data ultimately get fetched into the CPU registers and calculation units. However, prior to this there is an entire memory hierarchy at play. Data passes from disk or SSD storage into main memory, getting cached along the way in structures like the buffer pool and CPU caches. Pages of data are swapped between these storage layers based on algorithms trying to predict what the CPU will need next.

Modern processors use a cache hierarchy to accelerate data access. Closest to the core are the L1 and L2 caches, built from static RAM: very fast but small (KBs to a few MBs). L3 is a larger (MBs), slower cache shared across cores. When the CPU needs data, it first checks these processor caches, which hold frequently used values, index entries, and so on. A cache miss forces a fetch from main memory (RAM), which is roughly 40x or more slower; if the data is not in memory either, disk/SSD storage (1000x+ slower) must be accessed, with buffer caches working to minimize those trips. Data is swapped between storage layers in optimized block sizes based on locality, sequential prefetching, and replacement algorithms.

Disk latency for loading entire database pages is the harshest cost (10+ ms on spinning disks), so optimizations like indexing aim to keep hot data in the faster memory and cache layers. The CPU’s word size (32/64 bit) further limits how much of any fetched chunk the core can process per cycle. Understanding this pipeline from disk up through the registers is key to SQL performance.
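
The cost of losing locality is easy to feel even from Python. The sketch below reads the same buffer twice, once in sequential order and once in shuffled order; the 256 MB size and 64-byte stride are arbitrary choices for illustration, and interpreter overhead blunts the gap, but a difference usually still shows.

```python
import random
import time

data = bytearray(256 * 1024 * 1024)      # 256 MB, far larger than any CPU cache
offsets = list(range(0, len(data), 64))  # one offset per 64-byte cache line

def touch(order):
    # Read one byte per cache line, in the given order.
    total = 0
    for i in order:
        total += data[i]
    return total

for label, order in (("sequential", offsets),
                     ("random", random.sample(offsets, len(offsets)))):
    start = time.perf_counter()
    touch(order)
    print(f"{label:10s} {time.perf_counter() - start:.2f}s")
```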

Now that we have covered the data pipeline from disk up to the processor, we can connect this back to our original discussion about using UUIDs as primary keys. As a reminder, the core question is: How do UUID-based primary keys affect performance for write/read-intensive applications in SQL databases like MySQL and PostgreSQL?

UUID version 4 identifiers rely on a cryptographically secure pseudo-random number generator to produce values that are unique with extremely high probability across space and time. The version 4 spec defines a 128-bit number comprising 122 random bits, 4 predefined version bits (‘0100’ in the identifier standard), and 2 variant bits. When stringified, the value is laid out as five hexadecimal groups of 8-4-4-4-12 digits separated by dashes, totaling 36 characters (e.g. df6fdea1-10c3-474c-ae62-e63def80de0b). This structure, coupled with the sheer probability space of 2¹²² potential values, means version 4 UUIDs can be produced independently, without coordination, and still practically guarantee uniqueness.
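
Python’s standard uuid module makes each of these pieces easy to inspect:

```python
import uuid

u = uuid.uuid4()
print(u)              # e.g. df6fdea1-10c3-474c-ae62-e63def80de0b
print(len(str(u)))    # 36 characters: 32 hex digits plus 4 dashes
print(u.version)      # 4, encoded by the '0100' version bits
print(u.variant)      # 'specified in RFC 4122', the variant bits
print(len(u.bytes))   # 16 bytes: the raw 128-bit value
```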

As we have established, UUID version 4 values satisfy the uniqueness requirement of a primary key through their fully random generation. However, while UUIDs can uniquely identify rows, understanding database performance requires looking under the hood. Specifically, we need to explore how SQL databases like PostgreSQL and MySQL physically store data on disk, and how their indexing structures are impacted by data types like UUIDs.

At the heart of database systems is efficiently inserting new records and rapidly retrieving required information. This rules out the naive approach of a sequential search: looping through each entry until we find a match is unsustainably slow. The inefficiency shows up in two areas:

Query Performance: a full scan has O(N) read complexity, paying disk I/O to load and iterate over every page until the match is found.

Modification Throughput: in a naively sorted file, inserting or updating records can force entire storage pages to be rewritten as offsets shift, an equally heavy O(N) write toll.

Designing Data-Intensive Applications by Martin Kleppmann provides an excellent foundational primer (Ch. 2 and 3).

MySQL and PostgreSQL rely on B-Trees to structure database indexes [1] [2]. To understand the performance implications of different data patterns, it helps to visualize how B-Tree nodes — representing memory pages — reshape to accommodate inserts. The University of San Francisco’s B-Tree simulator provides an interactive way to experiment.

Key observations:

  • Inserting sorted data minimally disrupts tree structure. Pages rarely need rebalancing, avoiding expensive rewrite I/O.
  • Inserting random, unsorted data shakes up the tree, triggering cascading node splits and merges to maintain balance. This index reorganization incurs substantial disk access costs.

Ingesting randomized data forces extensive node splitting, merging, and page rewriting as nodes overflow and underfill; sorted inserts neatly extend existing patterns. This experiment highlights how the sequence of indexed values directly impacts underlying maintenance costs.
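
You can approximate the same experiment in code. The toy model below packs sorted keys into fixed-capacity “pages” that split on overflow; the page capacity and key counts are invented for illustration, and real B-trees are far more sophisticated, but the locality contrast it prints mirrors what the simulator shows.

```python
import bisect
import uuid

PAGE_CAPACITY = 64  # hypothetical keys per page; real pages hold far more

def insert_all(keys):
    """Insert keys into fixed-capacity sorted pages, recording which
    page each insert lands on: a crude proxy for B-tree leaf behavior."""
    pages, touched = [[]], []
    for key in keys:
        # Route the key to the first page whose largest key can cover it.
        fences = [page[-1] for page in pages[:-1]]
        p = bisect.bisect_left(fences, key)
        bisect.insort(pages[p], key)
        touched.append(p)
        if len(pages[p]) > PAGE_CAPACITY:  # overflow: split the page in two
            half = len(pages[p]) // 2
            pages[p:p + 1] = [pages[p][:half], pages[p][half:]]
    return pages, touched

n = 10_000
for label, keys in (("sequential", [f"{i:08d}" for i in range(n)]),
                    ("random", [str(uuid.uuid4()) for _ in range(n)])):
    pages, touched = insert_all(keys)
    hot = len(set(touched[-500:]))  # distinct pages hit by the last 500 inserts
    print(f"{label:10s} total pages: {len(pages):4d}, "
          f"pages touched by last 500 inserts: {hot}")
```

Sequential keys keep concentrating writes on the rightmost page, while random UUIDs scatter writes across nearly every page, which is exactly the working-set blowup that hurts buffer caches.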

So far we have explored how the randomness of UUID values can degrade write performance for insert, update, and delete operations due to increased B-tree index maintenance. However, to fully evaluate the tradeoffs of using UUIDs as keys, we also need to examine the potential impact on read query performance.

While UUIDs deliver uniqueness, they pay for it both in size (128 bits versus 64 for a BIGINT) and in randomness. Together these put significant pressure on cache efficiency, indexing, sequential I/O, JOINs, and sorting for databases trying to service read queries.

  • Indexing overhead: the randomness of UUID keys destroys index locality and clustering, leading to larger indexes, more levels in B-trees, and reduced cache efficiency. Index scans must traverse more blocks to answer the same queries.
  • Memory/cache pressure: larger indexes mean fewer pages fit in available memory and buffer caches, so more expensive I/O is needed to fetch index pages during scans. Random access patterns also defeat sequential prefetching.
  • Table scans: the clustering inefficiencies of UUID indexes increase the blocks and I/O needed to scan entire tables, since the lack of locality hurts sequential block access.
  • Join performance: when joining on UUID columns, the loss of locality particularly slows nested loop joins, as index lookups on the inner side land on cold, scattered pages.
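
The size gap alone is easy to quantify. The snippet below compares the footprint of a BIGINT key against a UUID stored as raw bytes versus as text; the column types in the comments are the usual choices, not prescriptions.

```python
import struct
import uuid

u = uuid.uuid4()

print(len(struct.pack(">q", 123_456_789)))  # 8  bytes: BIGINT auto-increment key
print(len(u.bytes))                         # 16 bytes: BINARY(16) / native uuid
print(len(str(u)))                          # 36 bytes: CHAR(36) text form
```

Every secondary index and every foreign key that references the table repeats the primary key, so a key that is 2x to 4.5x wider multiplies across the whole schema.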

In this story, we explored UUIDs and their problematic relationship with database indexing in depth. While UUIDs provide uniqueness, their randomness introduces performance tradeoffs around write amplification, read efficiency, and storage overhead.

As next steps, I plan to cover alternative unique identifier schemes that aim to preserve UUID benefits while optimizing sequential efficiency. Potential topics include:

  • ULIDs
  • Snowflake
  • Composite keys

If you found this information useful, I sincerely appreciate you taking a moment to hit the clap button and follow me to stay updated on future posts. Reader feedback is invaluable for helping guide and improve my technical writing. Until next time, happy engineering!


Written by Ahmed Ghazey

Seasoned Senior Staff Software Engineer with a proven track record of delivering complex software solutions, managing teams, and providing technical leadership.