I audited a RAG pipeline at work last month. 43% of the retrieval calls were for questions the model already knew the answer to. Another 19% retrieved documents that were technically related to the query but completely useless for answering it. Over half the retrieval budget was wasted.
This is not a bug in our system. It is a design problem baked into how RAG works. Standard pipelines retrieve on every query, whether the model needs external knowledge or not. And when they retrieve, they use the same generic strategy regardless of what kind of knowledge gap the model actually has.
A paper called Skill-RAG landed last week and it addresses both problems at once. The idea is simple but the execution is sharp: use the model’s own hidden states to detect when it’s about to fail a query, and then route that query to a retrieval strategy matched to the specific type of failure.
The Waste Problem in Standard RAG
Think about how you use Google for a minute. You don’t search for things you already know. You search when you hit a gap in your knowledge. But standard RAG systems don’t have this filter. Every incoming query triggers a vector search, regardless of whether the model could have answered it perfectly well from its own parameters.
This creates two kinds of waste:
Unnecessary retrieval: The model knows the answer, but we search anyway. The retrieved documents either confirm what the model already knows (wasted compute) or introduce conflicting information that confuses the answer (active harm).
Wrong retrieval strategy: Different knowledge gaps need different retrieval approaches. A factual recall failure (“What year was the Eiffel Tower built?”) needs a simple document lookup. A multi-hop reasoning failure (“Which country has the highest GDP per capita among nations that also have universal healthcare?”) needs a chained retrieval strategy that pulls multiple documents and synthesizes across them. One-size-fits-all retrieval handles neither well.
How Skill-RAG Detects Knowledge Failures
The clever part of Skill-RAG is the trigger mechanism. Instead of retrieving on every query, it trains a lightweight probe on the LLM’s hidden representations to predict whether the model is about to fail.
The probe looks at the model’s internal state while it is processing the query, before it generates any output. If the hidden representations show patterns associated with confident, well-grounded knowledge, the probe says “skip retrieval.” If they show patterns associated with uncertainty or approaching failure, the probe says “retrieve.”
When a model knows something well, its hidden states form tight, consistent representations. When it is uncertain, the representations are scattered and inconsistent across layers. The probe learns to distinguish these two patterns.
# Skill-RAG: Hidden-state probe for retrieval triggering
import torch
import torch.nn as nn


class KnowledgeFailureProbe(nn.Module):
    """
    Lightweight binary classifier trained on LLM hidden states.
    Predicts whether the model will fail the current query
    based on internal representations BEFORE generation begins.
    """
    def __init__(self, hidden_dim, num_layers_to_probe=4):
        super().__init__()
        # Pool representations from last N layers
        self.num_layers = num_layers_to_probe
        input_dim = hidden_dim * num_layers_to_probe
        self.classifier = nn.Sequential(
            nn.Linear(input_dim, 256),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(256, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
            nn.Sigmoid()
        )

    def forward(self, hidden_states_per_layer):
        """
        hidden_states_per_layer: list of [batch, seq_len, hidden_dim]
        from the last N transformer layers.
        """
        # Take the last token's representation from each layer
        layer_reps = []
        for h in hidden_states_per_layer[-self.num_layers:]:
            layer_reps.append(h[:, -1, :])  # last token
        combined = torch.cat(layer_reps, dim=-1)
        failure_prob = self.classifier(combined)
        return failure_prob  # > 0.5 means "model will likely fail, retrieve"


def should_retrieve(model, query_tokens, probe, threshold=0.5):
    """
    Decide whether to trigger retrieval for this query.
    Returns True if the probe predicts the model will fail.
    """
    with torch.no_grad():
        outputs = model(query_tokens, output_hidden_states=True)
    hidden_states = outputs.hidden_states
    failure_prob = probe(hidden_states)
    return failure_prob.item() > threshold
Training this probe is cheap. You run the base LLM on a dataset of questions with known correct answers, record which ones it gets right and wrong, and train the probe to predict that outcome from the hidden states. A few thousand labeled examples and a couple of hours on a single GPU are enough.
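Here is a minimal sketch of what that loop might look like, assuming a HuggingFace-style model and tokenizer. The helpers generate_answer() and answers_match() are placeholders for your own generation and answer-scoring code, not anything from the paper:

# Sketch: label questions by whether the base model fails them,
# then fit the probe on pooled hidden-state features.
import torch
import torch.nn as nn

def build_probe_dataset(model, tokenizer, qa_pairs, num_layers=4):
    features, labels = [], []
    for question, gold_answer in qa_pairs:
        tokens = tokenizer(question, return_tensors="pt")
        with torch.no_grad():
            out = model(**tokens, output_hidden_states=True)
        # Same pooling as the probe: last token, last N layers, concatenated
        reps = [h[:, -1, :] for h in out.hidden_states[-num_layers:]]
        features.append(torch.cat(reps, dim=-1).squeeze(0))
        prediction = generate_answer(model, tokenizer, question)  # placeholder
        labels.append(float(not answers_match(prediction, gold_answer)))  # 1.0 = model failed
    return torch.stack(features), torch.tensor(labels).unsqueeze(1)

def train_probe(probe, features, labels, epochs=10, lr=1e-3):
    optimizer = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = nn.BCELoss()  # probe ends in a sigmoid
    for _ in range(epochs):
        optimizer.zero_grad()
        # Features are already pooled, so feed the classifier head directly
        loss = loss_fn(probe.classifier(features), labels)
        loss.backward()
        optimizer.step()
    return probe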
Skill-Matched Retrieval Strategies
The second innovation is routing. When the probe says “retrieve,” Skill-RAG doesn’t just fire off a generic vector search. It classifies the failure mode and selects a retrieval “skill” matched to that specific type of gap.
The paper identifies three main failure categories:
Factual recall failure: The model lacks a specific piece of information. Route to a dense passage retriever that returns the most relevant single document.
Multi-hop reasoning failure: The model needs to connect information across multiple sources. Route to a chained retriever that pulls multiple documents and presents them in a structured reasoning format.
Temporal knowledge failure: The model’s knowledge is outdated. Route to a retriever that prioritizes recency, filtering by publication date or temporal metadata.
# Skill-matched retrieval routing
import re
import torch


class SkillRAG:
    def __init__(self, model, tokenizer, probe, retrievers):
        self.model = model
        self.tokenizer = tokenizer
        self.probe = probe
        self.retrievers = {
            'factual': retrievers['dense_passage'],
            'multi_hop': retrievers['chained'],
            'temporal': retrievers['recency_weighted']
        }
        self.failure_classifier = FailureModeClassifier()

    def answer(self, query):
        tokens = self.tokenizer(query, return_tensors="pt").input_ids

        # One forward pass supplies hidden states for both the gate
        # and the failure-mode classifier
        with torch.no_grad():
            outputs = self.model(tokens, output_hidden_states=True)
        hidden_states = outputs.hidden_states

        # Step 1: Should we retrieve at all? (probe > 0.5 means likely failure)
        if self.probe(hidden_states).item() <= 0.5:
            # Model knows this, skip retrieval entirely
            return self.model.generate(tokens)

        # Step 2: What kind of failure is this?
        failure_mode = self.failure_classifier.classify(query, hidden_states)

        # Step 3: Route to the matched retrieval skill
        retriever = self.retrievers[failure_mode]
        documents = retriever.retrieve(query, top_k=5)

        # Step 4: Generate with retrieved context
        augmented_input = self.format_with_context(query, documents)
        return self.model.generate(augmented_input)

    def format_with_context(self, query, documents):
        # Simple prompt assembly; swap in your own template
        context = "\n\n".join(doc.text for doc in documents)
        prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
        return self.tokenizer(prompt, return_tensors="pt").input_ids


class FailureModeClassifier:
    """
    Classifies the type of knowledge gap based on query features.
    This version uses keyword heuristics only; hidden_states is
    accepted so a learned classifier can slot in later.
    """
    def classify(self, query, hidden_states=None):
        q = query.lower()

        # Temporal signals: dates, "latest", "current", "recent"
        temporal_keywords = ['latest', 'current', 'recent', '2024', '2025', '2026']
        if any(kw in q for kw in temporal_keywords):
            return 'temporal'

        # Multi-hop signals: comparison, causation, multi-entity
        multi_hop_patterns = [r'compared to', r'relationship between',
                              r'how does .+ affect', r'which .* among']
        if any(re.search(p, q) for p in multi_hop_patterns):
            return 'multi_hop'

        # Default: factual recall
        return 'factual'
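Wiring it up might look something like this. The retriever classes are hypothetical stand-ins for whatever your stack provides; only the dict keys matter to SkillRAG:

# Hypothetical wiring; DensePassageRetriever, ChainedRetriever, and
# RecencyWeightedRetriever are stand-ins for your own retrieval components
rag = SkillRAG(
    model=model,
    tokenizer=tokenizer,
    probe=trained_probe,
    retrievers={
        'dense_passage': DensePassageRetriever(index),
        'chained': ChainedRetriever(index, max_hops=3),
        'recency_weighted': RecencyWeightedRetriever(index),
    },
)
print(rag.answer("What changed in the latest PyTorch release?"))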
The Results
Skill-RAG was evaluated on three standard QA benchmarks:
- HotpotQA (multi-hop)
- Natural Questions (factual)
- TriviaQA (broad knowledge)
The accuracy improvements over uniform RAG baselines are solid but not jaw-dropping. What caught my eye is the efficiency story. The per-query retrieval cost drops significantly when you skip retrieval for questions the model can already answer. On TriviaQA, where the base model already knows a large fraction of answers, the retrieval cost reduction is substantial.
This is where the real value sits. If you are running a RAG system at scale, processing millions of queries per day, cutting unnecessary retrieval calls by even 30% translates directly to lower latency and lower infrastructure costs. And you are not losing accuracy to get there; you are gaining it, because you are no longer introducing noise from irrelevant retrievals.
This Is Where RAG Is Heading
I have been watching the RAG space evolve for about two years now, and the direction is clear. We are moving from monolithic pipelines to composable retrieval systems.
Early RAG was simple: embed your documents, embed your query, find the nearest neighbors, stuff them into the prompt. It worked surprisingly well for a first pass.
Then came retrieval quality improvements: reranking, hybrid search, metadata filtering. These made individual retrievals better but didn’t change the fundamental architecture.
Skill-RAG represents the next shift. The questions that matter now are not “how do we retrieve better documents” but “when should we retrieve at all” and “what kind of retrieval does this specific query need.”
Knowing when to search and what kind of search to run is becoming the central design question.
What You Can Steal From This Paper
Even if you don’t implement the full Skill-RAG system, there are two ideas you can apply to your existing RAG pipeline today.
Add a retrieval gate - Before you retrieve, run a quick check on whether the model actually needs help. This can be as simple as asking the model to generate an answer with a confidence score, then only retrieving when confidence is below a threshold. The hidden-state probe is more elegant but a confidence-based gate gets you 80% of the benefit with 10% of the effort.
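A minimal sketch of that kind of gate, assuming an LLM client you can prompt directly. The prompt wording, the llm.complete() call, and the 0.7 threshold are all illustrative choices, not from the paper:

# Sketch: a self-reported-confidence retrieval gate
def confidence_gate(llm, query, threshold=0.7):
    prompt = (
        f"Question: {query}\n"
        "Answer the question, then on a new line rate your confidence "
        "from 0.0 to 1.0 in the format 'CONFIDENCE: <score>'."
    )
    response = llm.complete(prompt)  # hypothetical client call
    confidence = 0.0
    for line in response.splitlines():
        if line.startswith("CONFIDENCE:"):
            try:
                confidence = float(line.split(":", 1)[1])
            except ValueError:
                pass  # unparseable score: treat as low confidence
    return confidence < threshold  # True means "retrieve"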
Separate your retrieval strategies - Don’t use the same vector search for every query type. At minimum, have a single-document retriever for factual questions and a multi-document retriever for complex ones. Route based on query complexity. You can start with keyword heuristics and graduate to a classifier later.
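The keyword-heuristic starting point can be as small as this sketch; the marker list and the two retriever objects are placeholders you would tune for your own traffic:

# Sketch: route between a single-document and a multi-document retriever
COMPLEX_MARKERS = ('compared to', 'relationship between', 'why', 'how does')

def route_retrieval(query, single_doc_retriever, multi_doc_retriever):
    q = query.lower()
    if any(marker in q for marker in COMPLEX_MARKERS):
        # Complex query: pull several documents to synthesize across
        return multi_doc_retriever.retrieve(query, top_k=8)
    # Simple factual lookup: one strong document usually suffices
    return single_doc_retriever.retrieve(query, top_k=1)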
The Catch
The probe needs to be trained on failures from your specific model on your specific domain. The paper’s probes were trained on general QA benchmarks. If your domain has a different failure distribution, you will need to retrain.
The failure mode classifier is somewhat brittle. The three categories (factual, multi-hop, temporal) don’t cover all possible failure modes. What about numerical reasoning failures? Or failures caused by ambiguous queries? The framework is extensible, but the paper only demonstrates three skills.
And there is a bootstrap problem. To train the probe, you need a dataset of queries where you know whether the model will succeed or fail. Building that dataset requires running the model on a large evaluation set, which takes compute upfront.
The Takeaway
RAG systems are maturing. The era of retrieve-on-everything, one-strategy-fits-all pipelines is ending. Skill-RAG shows a concrete path toward systems that are smarter about when they retrieve and what retrieval strategy they use.
If you are building or maintaining a RAG system, start thinking about retrieval as a collection of specialized skills rather than a single pipe. The plumbing matters more than most people think.