Why does standard RAG fail on multi-hop questions?

Standard RAG retrieves passages by similarity to the question in one shot, so it rarely finds a second fact that connects only through a first. On HotpotQA, BM25 retrieval accuracy drops from 53.7% on single-hop questions to 25.9% on multi-hop ones, a drop attributed to semantic drift, where an error on one hop carries into the next.

What is ReMindRAG?

ReMindRAG is the method in the OpenReview paper JnKfAqLJb4. It uses LLM-guided graph traversal with node exploration, node exploitation, and memory replay, storing past traversal paths inside knowledge-graph edge embeddings so repeat queries cost fewer tokens.

How does guided traversal reduce inference cost?

Guided traversal visits only the nodes a query needs rather than re-ranking a large passage set, and ReMindRAG caches successful paths in edge embeddings without retraining, so similar future questions skip redundant LLM calls.

How is guided-traversal RAG relevant to enterprise search?

Most real enterprise questions are multi-hop: they chain a person to a project to a document to a decision across separate tools. Guided traversal over a knowledge graph follows those links directly, which is the model SemanticOS uses to connect fragmented systems.

Guided-Traversal RAG: Fixing Multi-Hop Retrieval

Q: What is guided-traversal RAG?

Guided-traversal RAG is a retrieval method that walks a knowledge graph step by step, using a language model to decide which connected node to visit next, instead of fetching a flat list of passages by similarity. ReMindRAG, described in a paper on OpenReview, is one such method, built for multi-hop questions.

TL;DR: Guided-traversal RAG, as described in the ReMindRAG paper on OpenReview, walks a knowledge graph one connected node at a time, letting a language model choose the next hop, instead of pulling a flat list of similar passages. That addresses multi-hop retrieval, the failure mode enterprise RAG hits most, and it cuts inference cost by caching successful paths in edge embeddings. If your questions chain facts across systems, brute-force similarity search is the wrong tool.

Ask a retrieval system a simple question and standard RAG does fine. Ask it a question whose answer depends on a second fact that only connects through a first, and it usually misses. That gap is the subject of a 2025 OpenReview paper, ReMindRAG, and it is the exact shape of most questions people ask inside a company.

What is guided-traversal RAG?

Guided-traversal RAG is retrieval that moves through a knowledge graph step by step. A knowledge graph is a set of entities (people, documents, projects, decisions) joined by typed relationships. Rather than scoring every passage against the query at once, the system starts at a relevant node and uses a language model to pick the next edge to follow, then the next, until it has gathered the facts an answer needs.
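The loop described above can be sketched in a few lines. This is a minimal illustration, not ReMindRAG's implementation: the graph, the entity names, and the `keyword_chooser` function (a stand-in for the language model's decision) are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple

# Toy knowledge graph: node -> list of (relation, neighbor). Names are illustrative.
Graph = Dict[str, List[Tuple[str, str]]]
Chooser = Callable[[str, List[Tuple[str, str]], List[str]], int]


def guided_traverse(graph: Graph, start: str, choose_edge: Chooser,
                    max_hops: int = 5) -> List[str]:
    """Walk from `start`, asking `choose_edge` which edge to follow at each hop.

    `choose_edge` plays the role of the language model: it receives the current
    node, the candidate edges, and the path so far, and returns the index of the
    edge to follow, or -1 to stop.
    """
    path = [start]
    node = start
    for _ in range(max_hops):
        # Skip neighbors already on the path so the walk cannot loop.
        edges = [e for e in graph.get(node, []) if e[1] not in path]
        if not edges:
            break
        idx = choose_edge(node, edges, path)
        if idx < 0:
            break
        node = edges[idx][1]
        path.append(node)
    return path


def keyword_chooser(wanted: List[str]) -> Chooser:
    """Hypothetical stand-in for an LLM: follow the first edge with a wanted relation."""
    def choose(node: str, edges: List[Tuple[str, str]], path: List[str]) -> int:
        for i, (relation, _) in enumerate(edges):
            if relation in wanted:
                return i
        return -1
    return choose


graph: Graph = {
    "alice": [("manages", "project_atlas"), ("reports_to", "bob")],
    "project_atlas": [("documented_in", "design_doc")],
    "design_doc": [("records", "decision_17")],
}
path = guided_traverse(
    graph, "alice", keyword_chooser(["manages", "documented_in", "records"])
)
print(path)
```

In a real system the chooser would be a prompt to a language model that sees the question and the candidate edges; the point here is only that each step looks at one node's neighbors instead of the whole corpus.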

The contrast is with brute-force retrieval: embed the question, compare it to every chunk in a vector index, return the top matches. That works when the answer sits in one passage. It breaks when the answer is assembled from several passages that are not individually similar to the question.
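For comparison, here is a minimal sketch of the brute-force pattern. The bag-of-words `embed` function is a toy substitute for a learned dense encoder, and the three chunks are invented for illustration; the sketch only shows why a chunk that holds the second hop can rank below a distractor that shares more words with the question.

```python
import math
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system would use a dense encoder."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_k(question: str, chunks: List[str], k: int = 2) -> List[Tuple[float, str]]:
    """Score every chunk against the question and return the k best."""
    q = embed(question)
    scored = [(cosine(q, embed(c)), c) for c in chunks]
    return sorted(scored, reverse=True)[:k]


chunks = [
    "the sepsis protocol was written by dr rivera",      # first hop
    "dr rivera works at st mary hospital",               # second hop (the answer)
    "the sepsis ward protocol covers visiting hours",    # distractor
]
question = "which hospital employs the author of the sepsis protocol"
retrieved = top_k(question, chunks, k=2)
for score, text in retrieved:
    print(round(score, 3), text)
```

With these toy inputs the two returned chunks are the first hop and the distractor. The chunk naming the hospital shares only one word with the question, so it falls outside the top two.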

The OpenReview paper calls its method ReMindRAG and frames the problem as a synergy failure: existing knowledge-graph RAG systems “often struggle to achieve effective synergy between system effectiveness and cost efficiency,” ending up with either weak answers or excessive prompt tokens and inference time (OpenReview, 2025).

Why does brute-force RAG fail on multi-hop questions?

A multi-hop question requires combining two or more facts that connect through an intermediate entity. “Which hospital employs the cardiologist who co-authored the 2024 sepsis protocol?” needs the protocol, then its author, then that author’s employer. No single passage is similar to the whole question.

The numbers are stark. On HotpotQA, a standard BM25 retriever’s accuracy drops from 53.7% on single-hop (“easy”) questions to 25.9% on multi-hop (“hard”) questions (arXiv survey, 2022). The same survey names the cause: semantic drift, where any retrieval error on one hop accumulates into the next, so the system wanders away from the real answer.
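A rough way to see why errors compound: assume, purely for illustration, that each hop succeeds independently with the same probability. That independence assumption and the 0.8 figure below are not from the survey or from HotpotQA; they only show the shape of the effect.

```python
def chain_success(per_hop_accuracy: float, hops: int) -> float:
    """Probability that every hop in a chain succeeds, assuming independent hops."""
    return per_hop_accuracy ** hops


# Illustrative per-hop accuracy of 0.8; not a measured value.
for hops in (1, 2, 3):
    print(hops, round(chain_success(0.8, hops), 3))
```

Under that assumption a retriever that is right 80% of the time per hop gets the full chain right about 64% of the time at two hops and about 51% at three.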

The benchmark exists because this is hard. HotpotQA was built by researchers at CMU, Stanford, and Université de Montréal specifically to test “natural, multi-hop questions” against the full scope of Wikipedia, and they report that performance in the open setting is “substantially lower” than when the right paragraphs are handed over (HotpotQA, 2018). Adding more documents to a brute-force index tends to make this worse, not better, because it adds distractors.

How guided traversal changes the retrieval path

Guided traversal attacks semantic drift by never asking for the whole answer in one query. The OpenReview method, ReMindRAG, describes three moves (OpenReview, 2025):

Node exploration: branch out to neighboring entities the query might need.
Node exploitation: follow the edge that most likely leads to the answer.
Memory replay: reuse paths remembered from earlier traversals, the part the authors flag as “most notably” theirs.

Memory replay is the cost lever. ReMindRAG “memorizes traversal experience within KG edge embeddings, mirroring the way LLMs ‘memorize’ world knowledge within their parameters, but in a train-free manner” (OpenReview, 2025). In plain terms: when a traversal succeeds, the path it took is written back into the graph’s edges. A later question that needs the same route can reuse it instead of paying for the LLM to rediscover it. No retraining, no fine-tuning.
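The write-back idea can be sketched with a toy cache. This is a deliberate simplification and not the paper's mechanism: ReMindRAG stores experience in edge embeddings, while the sketch below stores plain sets of query terms on each edge. The class name `ReplayGraph`, the `min_overlap` threshold, the fixed `target` stop condition, and the `llm_pick` stub are all assumptions for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Set, Tuple

Edge = Tuple[str, str]


class ReplayGraph:
    """Toy graph whose edges remember which query terms they helped answer."""

    def __init__(self, adjacency: Dict[str, List[str]], min_overlap: int = 2):
        self.adjacency = adjacency
        self.min_overlap = min_overlap
        # Assumption: term sets stand in for the paper's edge embeddings.
        self.edge_memory: Dict[Edge, Set[str]] = defaultdict(set)
        self.llm_calls = 0

    def traverse(self, query: str, start: str, target: str,
                 llm_pick: Callable[[str, List[str]], str],
                 max_hops: int = 6) -> List[str]:
        terms = set(query.lower().split())
        path, node = [start], start
        for _ in range(max_hops):
            if node == target:
                break
            neighbors = self.adjacency.get(node, [])
            if not neighbors:
                break
            # Replay: reuse an edge that served an earlier, similar query.
            remembered = [
                n for n in neighbors
                if len(self.edge_memory[(node, n)] & terms) >= self.min_overlap
            ]
            if remembered:
                node = remembered[0]
            else:
                self.llm_calls += 1  # pay for the model only when memory is empty
                node = llm_pick(node, neighbors)
            path.append(node)
        if node == target:
            # Write the successful path back into the edges it used.
            for edge in zip(path, path[1:]):
                self.edge_memory[edge] |= terms
        return path


adjacency = {
    "sepsis_protocol": ["review_board", "author"],
    "author": ["employer_hospital"],
}
route = {"sepsis_protocol": "author", "author": "employer_hospital"}
g = ReplayGraph(adjacency)


def pick(node: str, neighbors: List[str]) -> str:
    """Hypothetical LLM stub: always knows the right next node."""
    return route[node]


first = g.traverse("which hospital employs the sepsis protocol co-author",
                   "sepsis_protocol", "employer_hospital", pick)
calls_after_first = g.llm_calls
second = g.traverse("which hospital employs the sepsis protocol author",
                    "sepsis_protocol", "employer_hospital", pick)
print(first, calls_after_first, second, g.llm_calls)
```

The first query pays for one model call per hop. The second, similar query follows the remembered edges and adds no calls, which is the cost effect the paragraph above describes, without any retraining step.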

That matters because the cost of knowledge-graph RAG is dominated by LLM calls during traversal. Re-ranking a large passage set on every query, or re-reasoning a path you have already walked, burns tokens and adds latency. Visiting only the nodes a query needs, and caching the ones that worked, is how the paper claims to improve accuracy and cost at the same time. The authors report ReMindRAG outperforms existing baselines “across various benchmark datasets and LLM backbones” and have released their code (OpenReview, 2025).

Why this is the failure enterprise RAG hits most

Public benchmarks use Wikipedia, but the structure is the same inside a company, and usually worse. Real questions are almost never single-hop. “Why did we give Vantage Health that pricing exception, and who signed off?” chains a customer to a deal to an approval thread to a person. Those facts live in a CRM, a contract store, a chat tool, and an email account. No vector index of document chunks links them, because the connection is relational, not textual.

This is the case for treating retrieval as graph traversal rather than passage search. If the entities and their relationships are modeled explicitly, a query can walk from customer to deal to approval thread to approver in three hops. If they are not, the system is back to guessing which chunks look similar to a question that no chunk resembles.
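A small sketch of what "modeled explicitly" means in practice. The entity identifiers and relation names below are hypothetical; the sketch assumes relationships are stored as typed edges, so a question becomes a sequence of relation lookups, and a missing relationship shows up as a broken chain rather than a plausible-looking wrong passage.

```python
from typing import Dict, List, Optional, Tuple

# (source entity, relation) -> target entity. All names are hypothetical.
EDGES: Dict[Tuple[str, str], str] = {
    ("customer:vantage_health", "has_deal"): "deal:pricing_exception",
    ("deal:pricing_exception", "approved_in"): "thread:approval",
    ("thread:approval", "signed_off_by"): "person:approver",
}


def walk(start: str, relations: List[str]) -> Optional[List[str]]:
    """Follow one typed relation per hop; return None if any link is not modeled."""
    path = [start]
    for relation in relations:
        nxt = EDGES.get((path[-1], relation))
        if nxt is None:
            return None  # the chain is broken: this relationship is not in the graph
        path.append(nxt)
    return path


found = walk("customer:vantage_health", ["has_deal", "approved_in", "signed_off_by"])
missing = walk("customer:vantage_health", ["has_deal", "escalated_to"])
print(found)
print(missing)
```

Here the relation sequence is fixed by hand. In guided traversal, choosing the next relation at each hop is the step a language model would take.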

A concrete example

Vantage Health, a mid-size insurer, ran a support assistant on standard RAG over its internal wiki, ticket history, and policy documents. Single-fact questions worked. Then an agent asked: “For the claims escalation we approved last quarter, what exception applied, and which underwriter owns that policy line now?”

Brute-force retrieval returned the escalation ticket and a generic exceptions page, but never connected the exception to the underwriter, because the underwriter’s name appeared in neither document. The agent got a confident, wrong answer. Semantic drift in one screen.

Modeled as a graph, the same question is a short walk: escalation → exception clause → policy line → current owner. Each hop is a known relationship, so traversal follows the chain instead of hoping one passage holds all four facts. This is the model SemanticOS is built around: a knowledge graph and AI search layer that connects fragmented tools so a single query can traverse people, documents, and decisions across systems. The retrieval research and the enterprise problem are the same problem at different scales.

Key takeaways

Guided-traversal RAG walks a knowledge graph hop by hop with LLM guidance, instead of retrieving a flat list of similar passages.
Brute-force RAG fails on multi-hop questions: BM25 accuracy falls from 53.7% to 25.9% between single-hop and multi-hop questions on HotpotQA, driven by semantic drift (arXiv, 2022).
The OpenReview method ReMindRAG combines node exploration, exploitation, and memory replay, caching successful paths in edge embeddings to raise accuracy and cut inference cost, without retraining (OpenReview, 2025).
Most enterprise questions are multi-hop and relational, so retrieval there is better framed as graph traversal than passage similarity.



