gitGood.dev
Back to Blog

Top 50 Database Interview Questions in 2026 (Beyond SQL Syntax)

P
Patrick Wilson
13 min read

Top 50 Database Interview Questions in 2026 (Beyond SQL Syntax)

Three things changed about database interviews in the past two years:

  1. Vector databases went from niche to standard. Every backend interview now asks at least one question about embedding storage and ANN search.
  2. Distributed SQL matured. CockroachDB, Spanner, TiDB, Aurora DSQL, Neon - candidates are expected to reason about transactions across regions.
  3. Postgres won. The market consolidated around Postgres-flavored databases for transactional workloads. If you don't know Postgres internals, you'll struggle.

Here are the 50 questions you'll see in 2026 backend interviews, organized by area.


SQL Fundamentals (1-10)

1. What's the difference between INNER, LEFT, RIGHT, and FULL JOIN?

INNER returns only matched rows. LEFT keeps all rows from the left table (NULLs on no match). RIGHT keeps all from the right. FULL keeps all from both.

Trap: candidates write LEFT JOIN then filter with WHERE right.col = X, which silently turns it into an INNER JOIN. Use the join condition or WHERE right.col IS NULL OR right.col = X.

2. What's the difference between WHERE and HAVING?

WHERE filters before aggregation. HAVING filters after. WHERE is faster (less data to aggregate).

3. Explain GROUP BY with multiple columns.

Each unique combination becomes one row. Selecting columns not in the GROUP BY (or aggregates) is undefined behavior in many databases - Postgres rejects it, MySQL silently picks an arbitrary value.

4. What's a window function?

Computes across a set of rows related to the current row, without collapsing rows like GROUP BY does. Examples: ROW_NUMBER(), RANK(), LAG(), SUM() OVER (PARTITION BY ...).

The classic interview problem: "give me the top 3 highest-paid employees per department." Window functions solve it cleanly.

5. What's a CTE (Common Table Expression) and when do you use them?

WITH foo AS (SELECT ...) lets you name a subquery. Use for readability and recursive queries (org charts, comment threads). Be aware: in Postgres, CTEs were materialization-fenced before v12, which could hurt performance.

6. What's the difference between UNION and UNION ALL?

UNION removes duplicates (requires a sort). UNION ALL keeps everything. UNION ALL is always faster - use it unless you need dedup.

7. Explain subqueries vs joins.

Often equivalent results. Joins are usually faster because the planner has more flexibility. Use subqueries when the logic is naturally nested (correlated subqueries for "exists this related row").

8. What does EXPLAIN show you?

The query plan: which indexes are used, join order, estimated row counts, costs. EXPLAIN ANALYZE runs the query and shows actual values. The first thing you check when a query is slow.

9. What's a covering index?

An index that contains all columns the query needs, so the database doesn't need to read the table. Add columns with INCLUDE (Postgres, SQL Server) for index-only scans.

10. What's a composite index and why does column order matter?

An index on multiple columns. The order matters because the index is sorted by the first column, then the second within the first, etc. An index on (a, b) helps queries filtering on a or (a, b) - but not b alone.


Indexes and Query Performance (11-18)

11. When does the database refuse to use an index?

  • Function applied to the indexed column (WHERE LOWER(name) = ... unless you have a functional index)
  • Implicit type conversion (WHERE varchar_col = 123)
  • Selectivity too low - if 80% of rows match, a full scan is faster
  • Bind parameter sniffing went wrong (cached plan was for a different value distribution)

12. What's a B-tree index vs a hash index?

B-tree - sorted, supports range queries, the default for most use cases.
Hash - O(1) equality lookups, no range queries. Rarely the right choice in practice.

Postgres also has GIN (for arrays/JSONB), GiST (for geometric/full-text), and BRIN (for naturally-ordered data like timestamps).

13. Explain partial indexes.

An index that covers only rows matching a WHERE clause. Useful when most queries filter on a small subset (WHERE status = 'active'). Smaller index, faster updates.

14. What's a clustered index?

The table data is physically sorted by the index. SQL Server's primary key is clustered by default. Postgres doesn't have clustered indexes - use CLUSTER to physically reorder, but it's a one-time operation.

15. How do you debug a slow query?

  1. EXPLAIN ANALYZE to see the actual plan
  2. Check for sequential scans on large tables
  3. Look at row count estimates vs actuals - large mismatches mean stale stats (ANALYZE)
  4. Check indexes - are they being used?
  5. Check parameter types and casts
  6. Check if the query is hitting a bad cached plan

16. What's connection pooling and why do you need it?

Each connection has memory and process overhead. Without pooling, a spike in clients exhausts the database. Use PgBouncer or your driver's pool. For Postgres, also know max_connections is a hard limit and pooling is mandatory at scale.

17. What's a lock and when do you get blocked?

Locks prevent concurrent operations from conflicting. Common blockers: long-running transactions holding locks, DDL changes (e.g., ALTER TABLE taking an ACCESS EXCLUSIVE lock), foreign key checks.

18. What's VACUUM in Postgres?

Reclaims space from dead tuples (created by UPDATE/DELETE under MVCC). Autovacuum runs in the background. Manually tune for high-churn tables. VACUUM FULL rewrites the whole table - rarely the right answer.


Transactions and Isolation (19-26)

19. Explain ACID.

Atomicity - all or nothing. Consistency - constraints hold. Isolation - concurrent transactions don't interfere (depending on level). Durability - committed data survives crashes.

20. What are the SQL isolation levels?

Read Uncommitted - sees uncommitted writes (rare in practice).
Read Committed - sees only committed data. Default in Postgres.
Repeatable Read - reads in a transaction see a consistent snapshot. Default in MySQL.
Serializable - transactions appear to run one at a time.

21. What's a "phantom read"?

A query returns different rows when re-run in the same transaction because another transaction inserted matching rows. Prevented by Serializable isolation.

22. Explain MVCC (multi-version concurrency control).

Each write creates a new version with a transaction ID. Reads see the version visible to their snapshot. No read locks. Trade-off: dead tuples accumulate, requiring vacuum.

23. What's a deadlock and how do databases handle them?

Two transactions each hold a lock the other wants. The database detects the cycle and aborts one transaction. Always retry transactions that fail with deadlock errors.

24. What's SELECT FOR UPDATE?

Acquires a row lock for the duration of the transaction. Use to prevent the "read-modify-write" race. Be careful - holds the lock until commit, can cascade into deadlocks.

25. Explain optimistic vs pessimistic locking.

Pessimistic - lock the row when you read it. Safer, slower under contention.
Optimistic - read without locking; on write, check the version hasn't changed. Faster when conflicts are rare. Use a version column or row hash.

26. What's a distributed transaction and why are they hard?

A transaction spanning multiple databases or services. Hard because there's no global clock and partial failures are common. 2PC works but is slow and can stall on coordinator failure. Most modern systems prefer eventual consistency or sagas.


Schema Design (27-32)

27. Normalize vs denormalize - when do you do each?

Normalize for OLTP (write-heavy, consistency-critical). Denormalize for OLAP/reads (avoid joins). Most production systems do both - normalized core with denormalized read models.

28. When do you use a JSON/JSONB column?

When the schema is genuinely variable (per-tenant config, plugin data) or for sparse fields. Don't use it as a lazy way to avoid schema design - you lose type safety, indexing efficiency, and validation.

29. UUID vs auto-increment integer for primary keys?

UUIDs - distributed-friendly, no collision across services, but larger and bad for clustered indexes. Use UUIDv7 (time-ordered) to avoid the page-fragmentation problem of UUIDv4.
Auto-increment - smaller, faster, but exposes growth rate and is hard in distributed setups.

In 2026, UUIDv7 is the default for new systems.

30. Soft delete vs hard delete?

Soft delete (deleted_at timestamp) for user-facing data where audit and recovery matter. Hard delete for ephemeral data and GDPR compliance. Most systems use both - soft delete then purge after a retention period.

31. How do you handle schema migrations safely?

Expand-migrate-contract:

  1. Add new column, write to both
  2. Backfill data
  3. Switch reads to new column
  4. Drop old column

Never block writes. For huge tables, batch the backfill. Tools: Flyway, Liquibase, sqitch, golang-migrate.

32. How do you model many-to-many relationships?

Junction table: users_groups (user_id, group_id, ...). Composite primary key on the foreign keys. Index both directions if you query both ways.


Distributed Databases (33-40)

33. Explain CAP theorem.

In the presence of a network partition, you must choose between consistency (all nodes see the same data) and availability (every request gets a response). You can't have both. In normal operation, you can have all three.

Most modern systems are CP (consistency + partition tolerance) or AP (availability + partition tolerance) with tunable consistency.

34. What's eventual consistency?

If no new writes occur, all replicas eventually converge to the same value. Used in Dynamo-style systems. Acceptable for many use cases (likes, view counts), unacceptable for others (account balances).

35. Explain leader-follower replication.

One primary handles writes; replicas asynchronously copy from the primary. Reads can go to replicas. Risks: replication lag, stale reads, split-brain on failover.

36. What's quorum (R + W > N)?

In Dynamo-style systems, with N replicas, R = read quorum, W = write quorum. If R + W > N, every read sees at least one acknowledged write. Tune for your consistency vs latency needs.

37. Explain consistent hashing.

Maps keys to a ring of nodes. Adding/removing a node only re-keys ~1/N of data. Used by Dynamo, Cassandra, memcached, Redis Cluster, distributed caches everywhere.

38. What's Spanner's TrueTime and why does it matter?

A globally synchronized clock with bounded uncertainty. Lets Spanner do externally consistent transactions across regions without a single global coordinator. The first system to make global ACID practical.

39. CockroachDB / TiDB / Aurora DSQL - what are they?

Distributed SQL databases that scale horizontally while maintaining ACID transactions. CockroachDB uses Raft + multi-version timestamps. TiDB separates storage (TiKV) from SQL. Aurora DSQL (AWS, 2024) targets serverless, infinite scale, with strong consistency and active-active multi-region.

40. When would you choose DynamoDB over Postgres?

  • Write throughput beyond what a single Postgres instance handles
  • Predictable single-key access patterns
  • Need for serverless scaling without operational burden
  • Tolerance for the access-pattern-driven schema design

When NOT to choose DynamoDB: complex queries, joins, analytics, strong relational integrity, frequent schema changes.


Modern: Vector, Time-Series, Search (41-46)

41. What's a vector database?

Stores high-dimensional vectors (embeddings) and provides approximate nearest neighbor (ANN) search. Used for semantic search, RAG, recommendations. Examples: Pinecone, Weaviate, Qdrant, pgvector.

42. Explain HNSW.

Hierarchical Navigable Small World graphs. The dominant ANN algorithm in 2026. Builds a layered graph; search starts from a high-level node and descends. Trade-off: more accurate than IVF, more memory.

43. When would you use pgvector vs a dedicated vector DB?

pgvector - you already have Postgres, vector workload is moderate, you want SQL + vector search in one place. Dedicated vector DB - very high recall requirements, billions of vectors, very high QPS.

For most teams in 2026, pgvector is the right call until you have a real reason not to.

44. What's a time-series database?

Optimized for append-heavy, time-stamped data with retention policies. Examples: InfluxDB, TimescaleDB (Postgres extension), Prometheus' TSDB. Storage is column-oriented and time-partitioned.

Combine keyword search (BM25) with semantic search (vectors), then rerank. Catches both exact-match queries (where vectors fail) and concept queries (where keywords fail). Standard pattern for RAG in 2026.

46. What's a graph database and when do you need one?

Stores nodes and edges. Optimized for traversals (friends of friends, paths). Examples: Neo4j, Amazon Neptune. Use only when your queries are inherently graph-shaped - otherwise relational with proper indexes is faster.


Operational and Design (47-50)

47. How do you back up a production database?

Continuous WAL archiving + periodic base backups. Test the restore - a backup you've never restored doesn't exist. For Postgres: pg_basebackup + WAL archiving, or managed solutions (RDS, Aurora, Neon).

48. How do you scale reads vs writes?

Reads - read replicas, caching, denormalized read models, CQRS.
Writes - vertical scaling first, then sharding, then a distributed database.

Most apps die from cache invalidation problems before they need write sharding.

49. Design a multi-tenant database architecture.

Three options:

  1. Shared schema, tenant_id column - cheap, hard to isolate
  2. Schema per tenant - middle ground, harder migrations
  3. Database per tenant - strong isolation, expensive operationally

Pick based on tenant size and isolation requirements.

50. When would you use NoSQL?

  • Predictable single-key access patterns at scale (DynamoDB)
  • Document model fits naturally (MongoDB)
  • Time-series or telemetry (Influx, Timescale)
  • Graph traversals (Neo4j)
  • Embeddings + ANN (vector DBs)

When NOT: you need joins, transactions, ad-hoc queries, or schema evolution. Default to Postgres - reach for NoSQL only when there's a clear reason.


How to Approach Database Questions in an Interview

Three things separate strong candidates from weak ones:

  1. Reason about workload first. Read/write ratio, access patterns, consistency requirements. Then pick the database. Picking before understanding is a red flag.
  2. Talk about cost. $/GB, $/IOPS, scaling cost curves. "It scales horizontally" without "it costs $X to add a node" is a junior answer.
  3. Acknowledge trade-offs. Every choice loses something. Saying "this is the right answer" is weaker than "this is the right answer because we accepted X trade-off."

Final Thoughts

Most production database problems are not about syntax. They're about understanding when your assumptions break - the index that helped at 10K rows hurts at 10M, the eventual consistency that worked yesterday creates a customer-facing bug today, the schema that fit the v1 product is wrong for v3.

The fastest way to ace these interviews: load 100M rows of fake data into Postgres on your laptop and try to break it. The questions become obvious when you've felt the bottlenecks.


Want practice databases interview prep with structured mock interviews? gitGood.dev has database tracks covering SQL fluency, indexing, isolation, and modern distributed systems.