A Leiden run on your entity graph groups `Acme`, `Widgetco`, and three shared executives into one cluster, while a separate `Healthcare Regulation` group of entities ends up in a different cluster. What made the algorithm group them this way?

It analysed the graph's edge structure and grouped nodes that are densely connected to each other but sparsely connected to the rest of the graph.

It read the document text and classified topics semantically.

It analysed the graph's edge structure and grouped nodes that are densely connected to each other but sparsely connected to the rest of the graph.

It grouped entities alphabetically by name.

It used the vector store's partitioning scheme.

What is a 'community' in the GraphRAG sense, and why is it useful?

A cluster of entities densely connected to each other but sparsely to the rest — a discovered topic you can summarise once for global questions.

A group of platform users; useful for billing.

A cluster of entities densely connected to each other but sparsely to the rest — a discovered topic you can summarise once for global questions.

A vector-store shard; useful for latency.

A set of prompts; useful for caching.

Community detection over the entity graph

Using network topology (Leiden/Louvain) to discover the latent topical neighbourhoods hiding inside a chaotic entity graph.

0/4 done

Overview

Using network topology (Leiden/Louvain) to discover the latent topical neighbourhoods hiding inside a chaotic entity graph.

Why it matters

Once entity extraction has produced thousands of nodes and edges, the graph is just as overwhelming as an unindexed pile of documents — unless you can see its shape. Community detection algorithms (most commonly Louvain or its more stable successor Leiden) solve this by analysing the graph's topology alone — not the text, just which nodes connect to which — and finding clusters of entities that are densely connected to each other but only sparsely connected to the rest of the graph.

Concretely: if Acme, Widgetco, and three of their executives all link to each other through acquired, worksAt, and foundedBy edges, but rarely connect to entities in an unrelated Healthcare Regulation cluster, the algorithm groups the first five nodes into one community and leaves the second group separate — without ever having 'read' what the documents were about.

This matters for retrieval because communities become the unit for the next lesson's hierarchical summaries: instead of the LLM trying to reason over 10,000 raw nodes, it can reason over a much smaller number of pre-summarised topical clusters, then drill into the specific community relevant to a query. Community detection is what turns 'a big tangled graph' into 'an organised map of the corpus's actual subject areas.'

How it actually works

Once you have an entity graph, community detection (Leiden, Louvain) partitions it into clusters that are densely connected internally and sparsely connected to the rest. These clusters are latent topics the data discovered on its own — not topics you hand-labelled.

CALL gds.graph.project('entityGraph', ['Entity'], {MENTIONS: {orientation: 'UNDIRECTED'}});
CALL gds.leiden.write('entityGraph', { writeProperty: 'communityId' })
  YIELD communityCount, modularity;

Why this matters for retrieval. Naive RAG evaluates every chunk in isolation, so it structurally cannot answer global questions like 'what are the major themes this quarter?' — no single chunk contains the theme. Communities give you a scalable scaffold: summarise each community once, and a global query reads a handful of community summaries instead of thousands of chunks.

Quality dependency. Community structure is only as good as the edges feeding it. If low-confidence or false extraction edges are included, two unrelated topics merge into one noisy community and every summary built on it inherits the error. So filter edges by confidence before running detection, and validate a few communities with a domain expert before automating on them.

Modularity is your sanity score: a higher modularity means the partition found genuinely separable groups rather than slicing a uniform blob.

Analogy

Community detection is spotting friend-groups at a wedding. Nobody handed you a guest list by table — you infer the clusters from who keeps talking to whom. Add a few fake friendships (bad edges) and you'll seat rivals together; clean the connections first.

Pitfalls & how to avoid them

Running detection on noisy edges. Symptom: incoherent communities. Fix: confidence-filter edges first.
Treating communities as ground truth. Fix: validate with SMEs before automating retrieval on them.
Ignoring modularity. Symptom: meaningless partitions. Fix: check modularity; low score = no real structure.
One flat level of communities. Fix: build a hierarchy (next lesson) for multi-scale questions.

Apply it to your system

Imagine your corpus as a graph.

›Which 'global' questions do your users ask that no single chunk can answer?
›Whose expert eye would you trust to sanity-check the discovered communities?
›What edge-confidence threshold would you set before clustering?

Reading in progress · 0 of 4 activities done