akshatsonic

Notes | How Discord Indexes Trillions of Messages

Early Architecture

  • Used Elasticsearch with messages sharded over indices across two clusters.
  • Messages were lazily indexed: only when needed for search.
  • Workers pulled messages from the queue in batches for bulk indexing.
  • Redis backed the real-time indexing queue (worker loop sketched after this list).
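
A minimal sketch of what this early worker loop could look like, assuming a Redis list as the queue and the elasticsearch-py bulk helper (and a Redis/redis-py version that supports multi-element LPOP); the queue key, index naming, and message fields are illustrative, not Discord's actual code:

```python
import json

import redis
from elasticsearch import Elasticsearch, helpers

r = redis.Redis()
es = Elasticsearch("http://localhost:9200")

INDEX_QUEUE = "messages-to-index"  # hypothetical queue key
BATCH_SIZE = 50                    # batches of ~50 messages, as in the notes


def message_to_doc(msg: dict) -> dict:
    """Turn a queued message into a bulk action targeting its guild's index."""
    return {
        "_index": f"messages-guild-{msg['guild_id']}",  # illustrative naming
        "_id": msg["message_id"],
        "_source": {"content": msg["content"], "author_id": msg["author_id"]},
    }


def run_once() -> None:
    # Pull up to BATCH_SIZE queued messages, then send one bulk request.
    raw = r.lpop(INDEX_QUEUE, BATCH_SIZE) or []
    actions = [message_to_doc(json.loads(m)) for m in raw]
    if actions:
        helpers.bulk(es, actions)
```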

Issues with Redis & Bulk Indexing

  • Redis queues started dropping messages once CPU maxed out.
  • Bulk indexing problems:
    • A batch of 50 messages could fan out to 50 Elasticsearch nodes.
    • If one failed → entire batch retried, adding load (sketched after this list).
  • Larger clusters → indexing slowdown & higher failure rates.
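
A sketch of the whole-batch retry behaviour described above, assuming the elasticsearch-py bulk helper; the retry policy shown is illustrative, not Discord's exact code:

```python
import time

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")


def index_batch(actions: list[dict], max_attempts: int = 3) -> None:
    for attempt in range(max_attempts):
        # Each action may target a different guild's index, so one batch of
        # ~50 messages can touch shards on up to 50 different data nodes.
        _, errors = helpers.bulk(es, actions, raise_on_error=False)
        if not errors:
            return
        # One failed node poisons the whole batch: everything gets retried,
        # including actions that already succeeded, adding load to the cluster.
        time.sleep(2 ** attempt)
    raise RuntimeError(f"bulk indexing failed after {max_attempts} attempts")
```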

Cluster Growth Problems

  • Clusters grew to 200+ nodes with terabytes of data.
  • Bulk ops spread across many nodes → performance bottlenecks.
  • Hard to perform software upgrades:
    • Draining nodes took too long (see the drain sketch after this list).
    • Forced to keep legacy OS/Elasticsearch versions.
  • Critical patches (e.g., Log4Shell) required full downtime.
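
For context, draining an Elasticsearch node is typically done with allocation filtering and then waiting for its shards to relocate; a sketch of that standard procedure, assuming the elasticsearch-py client (node name and polling interval are placeholders, and this is not necessarily Discord's exact tooling):

```python
import time

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
NODE_TO_DRAIN = "data-node-042"  # placeholder node name


def drain_node() -> None:
    # Tell the cluster to move all shards off the node.
    es.cluster.put_settings(
        transient={"cluster.routing.allocation.exclude._name": NODE_TO_DRAIN}
    )
    # With terabytes of data per node, this relocation is what made upgrades
    # painfully slow on 200+ node clusters.
    while True:
        stats = es.nodes.stats(metric="indices")
        node = next(
            (n for n in stats["nodes"].values() if n["name"] == NODE_TO_DRAIN),
            None,
        )
        if node is None or node["indices"]["docs"]["count"] == 0:
            return
        time.sleep(30)
```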

Lucene Limitations

  • Each Elasticsearch shard is a Lucene index, so a single-primary-shard index is effectively one Lucene index.
  • Lucene’s MAX_DOCS limit is ~2.1 billion documents per index.
  • Once hit → all indexing operations fail.
  • Large guilds = very large indices → hit the limit quickly (monitoring sketch after this list).
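
A sketch of watching per-index document counts against Lucene's ceiling, assuming the elasticsearch-py client; the warning threshold is an illustrative choice:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

LUCENE_MAX_DOCS = 2_147_483_519  # Lucene's IndexWriter.MAX_DOCS (Integer.MAX_VALUE - 128)
WARN_RATIO = 0.8                 # illustrative threshold


def indices_near_limit() -> list[str]:
    # List every index whose doc count is approaching the Lucene ceiling.
    at_risk = []
    for row in es.cat.indices(format="json"):
        docs = int(row["docs.count"] or 0)
        if docs > LUCENE_MAX_DOCS * WARN_RATIO:
            at_risk.append(row["index"])
    return at_risk
```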

Kubernetes & Cells

  • Initially avoided stateful workloads on Kubernetes.
  • Elastic Operator looked promising for orchestration.
  • Problems:
    • Clusters had 200+ nodes → high coordination overhead.
    • Elasticsearch’s cluster state doesn’t scale well to that many nodes and indices.
    • Master nodes often went OOM, causing failures.
  • Solution: introduce logical “cells”, i.e. groups of multiple smaller clusters treated as one unit (routing sketch after this list).
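
A sketch of what the cell abstraction could look like as a routing table, mapping a guild to the cell, cluster, and index that own its messages; the types, field names, and values are illustrative, not Discord's actual schema:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexLocation:
    cell: str     # logical group of clusters, e.g. "cell-1"
    cluster: str  # physical cluster inside the cell
    index: str    # index holding this guild's messages


# Illustrative routing table; in practice this mapping would live in a database.
ROUTING: dict[int, IndexLocation] = {
    1001: IndexLocation(cell="cell-1", cluster="cell-1-es-a", index="messages-17"),
    1002: IndexLocation(cell="cell-2", cluster="cell-2-es-b", index="messages-42"),
}


def locate(guild_id: int) -> IndexLocation:
    """Resolve where a guild's messages should be indexed and searched."""
    return ROUTING[guild_id]
```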

Improved Reliability & Scaling

  • Adopted dedicated node roles:
    • Master nodes → cluster coordination.
    • Ingest nodes → stateless, handle preprocessing.
    • Data nodes → indexing & queries.
  • Switched Redis → PubSub for indexing:
    • Guaranteed delivery.
    • Could tolerate large backlogs.
    • Elasticsearch failures = slowdown, not message loss.
  • Built PubSub router:
    • Streams & batches messages.
    • Routes each batch to the correct cluster/index (router sketch after this list).
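
A sketch of a Pub/Sub-driven indexing router in this spirit, assuming Google Cloud Pub/Sub and elasticsearch-py; the project, subscription, and routing lookup are placeholders, and per-message indexing stands in for the real router's batching:

```python
import json

from elasticsearch import Elasticsearch, helpers
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription = subscriber.subscription_path("my-project", "message-index-sub")

_clients: dict[str, Elasticsearch] = {}


def client_for(url: str) -> Elasticsearch:
    # Cache one client per cluster so the router can talk to many cells.
    if url not in _clients:
        _clients[url] = Elasticsearch(url)
    return _clients[url]


def route(guild_id: int) -> tuple[str, str]:
    # Placeholder lookup: in reality this would consult the cell routing table.
    return ("http://cell-1-es-a:9200", f"messages-{guild_id % 100}")


def handle(message: pubsub_v1.subscriber.message.Message) -> None:
    msg = json.loads(message.data)
    cluster_url, index = route(msg["guild_id"])
    _, errors = helpers.bulk(
        client_for(cluster_url),
        [{"_index": index, "_id": msg["message_id"], "_source": msg}],
        raise_on_error=False,
    )
    # Ack only on success: an Elasticsearch failure means a growing Pub/Sub
    # backlog (a slowdown), not lost messages.
    if not errors:
        message.ack()
    else:
        message.nack()


future = subscriber.subscribe(subscription, callback=handle)
future.result()  # block and process messages until interrupted
```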

Direct Messages (DMs)

  • Sharded by user instead of channel.
  • Stores all of a user’s DMs together → efficient search (sketched after this list).
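
A sketch of sharding DMs by user: hash the user ID to pick an index, so every DM a user has sent or received lands in the same place; the index naming and count are illustrative:

```python
import hashlib

DM_INDEX_COUNT = 64  # illustrative number of DM indices


def dm_index_for(user_id: int) -> str:
    # Deterministic user → index mapping keeps all of a user's DMs co-located.
    bucket = int(hashlib.md5(str(user_id).encode()).hexdigest(), 16) % DM_INDEX_COUNT
    return f"dm-messages-{bucket:02d}"
```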

Scaling Guilds

  • Growing guilds hit the Lucene MAX_DOCS limit.
  • Needed strategies to split a guild’s data across shards without performance loss (one possible approach sketched after this list).
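
One possible way to split an oversized guild is to give it several sub-indices, route new messages by ID, and query the sub-indices together with a wildcard index pattern; a sketch under that assumption (not necessarily Discord's mechanism):

```python
def guild_subindex(guild_id: int, message_id: int, num_subindices: int) -> str:
    # Spread a very large guild's messages across several sub-indices.
    return f"messages-guild-{guild_id}-{message_id % num_subindices}"


def guild_search_target(guild_id: int) -> str:
    # Elasticsearch accepts wildcard index names, so one query can span all
    # of a guild's sub-indices.
    return f"messages-guild-{guild_id}-*"
```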

Query Optimization

  • Indices typically use a single primary shard.
  • Ensures all data is co-located → avoids cross-shard queries (index settings sketched below).
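
A sketch of creating a single-primary-shard message index, assuming the elasticsearch-py 8.x client; the index name and replica count are illustrative:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# One primary shard keeps the whole index on a single Lucene index, so a
# guild's search never has to fan out and merge results across shards.
es.indices.create(
    index="messages-guild-1001",
    settings={"number_of_shards": 1, "number_of_replicas": 1},
)
```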