Sep 14, 2025

2 min read

Notes | How Discord Indexes Trillions of Messages

Photo by Bruna Fossile

How Discord Indexes Trillions of Messages

Early Architecture

Used Elasticsearch with messages sharded over indices across two clusters.
Messages were lazily indexed: only when needed for search.
Workers pulled messages in batches for bulk indexing.
Redis backed the real-time indexing queue.

Issues with Redis & Bulk Indexing

Redis queues started dropping messages once CPU maxed out.
Bulk indexing problems:
- A batch of 50 messages could fan out to 50 Elasticsearch nodes.
- If one failed → entire batch retried, adding load.
Larger clusters → indexing slowdown & higher failure rates.

Cluster Growth Problems

Clusters grew to 200+ nodes with terabytes of data.
Bulk ops spread across many nodes → performance bottlenecks.
Hard to perform software upgrades:
- Draining nodes took too long.
- Forced to keep legacy OS/Elasticsearch versions.
Critical patches (e.g., log4shell) required full downtime.

Lucene Limitations

Each Elasticsearch index = a Lucene index.
MAX_DOC limit ≈ 2B docs per index.
Once hit → all indexing operations fail.
Large guilds = very large indices → hit limit quickly.

Kubernetes & Cells

Initially avoided stateful workloads on Kubernetes.
Elastic Operator looked promising for orchestration.
Problems:
- Clusters had 200+ nodes → high coordination overhead.
- Cluster states don’t scale well.
- Master nodes often went OOM, causing failures.
Solution: introduce logical “cells” (multiple clusters grouped).

Improved Reliability & Scaling

Adopted dedicated node roles:
- Master nodes → cluster coordination.
- Ingest nodes → stateless, handle preprocessing.
- Data nodes → indexing & queries.
Switched Redis → PubSub for indexing:
- Guaranteed delivery.
- Could tolerate large backlogs.
- Elasticsearch failures = slowdown, not message loss.
Built PubSub router:
- Streams & batches messages.
- Routes to correct cluster/index.

Direct Messages (DMs)

Sharded by user instead of channel.
Stores all of a user’s DMs together → efficient search.

Scaling Guilds

Growing guilds hit the Lucene MAX_DOC limit.
Needed strategies to split data across shards without performance loss.

Query Optimization

Indices typically use a single primary shard.
Ensures all data is co-located → avoids cross-shard queries.