Sablebourne · Methodology
How we measure LLM brand visibility
A transparent, deterministic approach to tracking how large language models represent a brand. Every number on the dashboard traces back to an inspectable rule.
1. The query universe
We curate a representative set of natural prompts that reflect how potential customers, journalists, analysts, and enthusiasts actually ask about a brand. Each query carries:
- Category: e.g. direct brand, performance, style, comparisons, skepticism.
- Priority weight: 1–5, reflecting business impact. A weight-5 query counts five times as much as a weight-1 query in aggregate metrics.
- Query style: structured (research-style phrasing) or consumer_voice (natural language). LLMs often behave differently on the two, so we track both.
For Nike, we ship 10 categories spanning direct-brand awareness through comparison, innovation, sustainability, and skepticism-bait queries.
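As a rough illustration, a query record with the three attributes above could be modelled as follows. The field names, the `BrandQuery` type, and the sample queries are assumptions made for this sketch, not the shipped schema.

```typescript
// Hypothetical query record; names are illustrative, not the shipped schema.
type QueryStyle = "structured" | "consumer_voice";
type PriorityWeight = 1 | 2 | 3 | 4 | 5;

interface BrandQuery {
  id: string;
  text: string;
  category: string; // e.g. "direct_brand", "comparisons", "skepticism"
  priorityWeight: PriorityWeight;
  style: QueryStyle;
}

// Made-up sample queries, one per style.
const exampleQueries: BrandQuery[] = [
  {
    id: "q1",
    text: "Which running shoe brands lead on sustainability?",
    category: "sustainability",
    priorityWeight: 5,
    style: "structured",
  },
  {
    id: "q2",
    text: "are nike shoes actually worth the money",
    category: "skepticism",
    priorityWeight: 3,
    style: "consumer_voice",
  },
];

// Group queries by style so the two phrasings can be tracked separately.
function groupByStyle(queries: BrandQuery[]): Record<QueryStyle, BrandQuery[]> {
  const groups: Record<QueryStyle, BrandQuery[]> = {
    structured: [],
    consumer_voice: [],
  };
  for (const q of queries) {
    groups[q.style].push(q);
  }
  return groups;
}
```

Restricting the weight to a 1–5 union type means an out-of-range weight fails at compile time rather than skewing an aggregate later.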
2. The response pipeline
- Each query is sent to every configured run target (a provider/model pair) or, in this demo, replayed from a fixture. See section 3 for how multi-model output is aggregated.
- URLs are extracted from the response and normalised to domains.
- Domains are classified by rule into source_type (owned_brand, editorial, retailer, forum, wikipedia, review, social, government) and quality_tier.
- A deterministic analysis engine runs regex-based checks for brand presence, mention rank, directness, competitor mentions, framing tags, and owned-source support.
- A composite visibility score (0–100) is computed from the analysis output.
- Typed recommendations are generated from the analysis using readable rules.
No LLM-as-judge. The only LLM call is the one that produced the response under analysis. Everything after that is readable TypeScript you can step through.
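The URL-extraction and domain-classification steps can be sketched roughly as below. This is a minimal illustration, not the production code: the function names, the regular expression, the forum list, and the fallback to "editorial" are all assumptions.

```typescript
type SourceType =
  | "owned_brand" | "editorial" | "retailer" | "forum"
  | "wikipedia" | "review" | "social" | "government";

// Pull http(s) URLs out of free text and reduce each to a bare domain.
function extractDomains(responseText: string): string[] {
  const urlPattern = /https?:\/\/[^\s)\]>"']+/g;
  const domains = new Set<string>();
  for (const raw of responseText.match(urlPattern) ?? []) {
    // Drop sentence punctuation that was captured at the end of the URL.
    const cleaned = raw.replace(/[.,;:!?]+$/, "");
    try {
      const host = new URL(cleaned).hostname.toLowerCase().replace(/^www\./, "");
      domains.add(host);
    } catch {
      // Skip strings that look like URLs but do not parse.
    }
  }
  return Array.from(domains);
}

// Illustrative forum list (assumption); a real rule table would be longer.
const FORUM_DOMAINS = ["reddit.com", "quora.com"];

const matchesDomain = (domain: string, base: string): boolean =>
  domain === base || domain.endsWith("." + base);

// First matching rule wins; falling back to "editorial" is an assumption.
function classifyDomain(domain: string, ownedDomains: string[]): SourceType {
  if (ownedDomains.some((d) => matchesDomain(domain, d))) return "owned_brand";
  if (matchesDomain(domain, "wikipedia.org")) return "wikipedia";
  if (domain.endsWith(".gov")) return "government";
  if (FORUM_DOMAINS.some((d) => matchesDomain(domain, d))) return "forum";
  return "editorial";
}
```

Because both functions are pure and rule-based, the same response text always yields the same classification, which is what makes each downstream number traceable to a rule.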
3. Multi-model aggregation rule
We run the same query against multiple LLMs (e.g. Claude, OpenAI) and store every run independently. Because each provider produces a different answer, mixing them in a single number would silently average behaviour you actually want to compare.
Headline KPIs use a single primary target. The primary target is the first entry of DEFAULT_RUN_TARGETS. Weighted Visibility Index, mention rate, first-mention rate, owned-source penetration, category performance, top opportunities, query-style breakdown, and movers all reflect this single target. The dashboard header surfaces which target is in use.
Multi-provider views span every target. The 30-day trend renders one line per target; the Model Performance panel compares every target head-to-head; the query detail page exposes a tab strip so analysts can compare providers run-by-run for one query.
A switcher could replace this rule with per-provider headline KPIs later. We keep the policy in one place — getPrimaryRunTarget() — so swapping it is a single-line change.
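A minimal sketch of how such a single-policy function might look is shown below. Only the names DEFAULT_RUN_TARGETS and getPrimaryRunTarget() come from the text above; the `RunTarget` and `Run` shapes, the placeholder target values, and the `runsForHeadlineKpis` helper are assumptions.

```typescript
// Hypothetical shapes; the real types may differ.
interface RunTarget {
  provider: string;
  model: string;
}

interface Run {
  queryId: string;
  target: RunTarget;
  score: number;
}

// Placeholder values, not real provider or model names.
const DEFAULT_RUN_TARGETS: RunTarget[] = [
  { provider: "provider-a", model: "model-a" },
  { provider: "provider-b", model: "model-b" },
];

// The whole policy lives here: the primary target is the first configured entry.
function getPrimaryRunTarget(targets: RunTarget[] = DEFAULT_RUN_TARGETS): RunTarget {
  const primary = targets[0];
  if (!primary) {
    throw new Error("No run targets configured");
  }
  return primary;
}

// Headline KPIs only see runs produced by the primary target.
function runsForHeadlineKpis(runs: Run[]): Run[] {
  const primary = getPrimaryRunTarget();
  return runs.filter(
    (r) => r.target.provider === primary.provider && r.target.model === primary.model,
  );
}
```

Multi-provider views would skip this filter and group runs by target instead.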
4. The Query Visibility Score (0–100)
The score is the sum of five transparent components. Weights are defined in src/lib/config/index.ts and can be tuned per engagement.
Presence · max 25
Is the brand mentioned at all?
Mention Rank · max 25
First = 25 · Top-3 = 18 · Later = 10
Directness · max 20
Direct = 20 · Category incl. = 12
Framing · max 15
Sum of per-tag score impacts, clamped to 0–15.
Source Support · max 15
Owned-domain citation (+6), multi-domain breadth (+4 for 3+), single-domain support (+2).
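The five components can be combined as in the sketch below. It rests on several assumptions: the `AnalysisResult` shape is hypothetical, "Top-3" is read as ranks 2 and 3, a response that is neither direct nor a category inclusion scores 0 for directness, and the +2 single-domain bonus applies only when one or two distinct domains are cited. The sketch implements only the three source bonuses listed above.

```typescript
type Directness = "direct" | "category_inclusion" | "none";

// Hypothetical analysis output; field names are assumptions.
interface AnalysisResult {
  brandMentioned: boolean;
  mentionRank: number | null; // 1-based rank among mentioned brands, null if absent
  directness: Directness;
  framingImpacts: number[]; // per-tag score impacts, positive or negative
  citedDomains: string[];
  ownedDomainCited: boolean;
}

const clamp = (value: number, lo: number, hi: number): number =>
  Math.min(hi, Math.max(lo, value));

function queryVisibilityScore(a: AnalysisResult): number {
  // Presence: all or nothing.
  const presence = a.brandMentioned ? 25 : 0;

  // Mention rank: First = 25, Top-3 = 18, Later = 10.
  let rank = 0;
  if (a.brandMentioned && a.mentionRank !== null) {
    rank = a.mentionRank === 1 ? 25 : a.mentionRank <= 3 ? 18 : 10;
  }

  // Directness: Direct = 20, category inclusion = 12.
  const directness =
    a.directness === "direct" ? 20 : a.directness === "category_inclusion" ? 12 : 0;

  // Framing: sum of per-tag impacts, clamped to 0–15.
  const framing = clamp(a.framingImpacts.reduce((sum, x) => sum + x, 0), 0, 15);

  // Source support: owned citation +6, breadth +4 for 3+ domains, else +2.
  const distinctDomains = new Set(a.citedDomains).size;
  let source = 0;
  if (a.ownedDomainCited) source += 6;
  if (distinctDomains >= 3) source += 4;
  else if (distinctDomains >= 1) source += 2;
  source = clamp(source, 0, 15);

  return presence + rank + directness + framing + source;
}
```

For example, a response that mentions the brand second, discusses it directly, carries framing impacts of +5 and -2, and cites three domains including an owned one scores 25 + 18 + 20 + 3 + 10 = 76.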
5. The Weighted Visibility Index
WVI is the priority-weighted average of the latest Query Visibility Score across all active queries:
WVI = Σ(score × priorityWeight) / Σ(priorityWeight)
A weight-5 query counts five times as much as a weight-1 query. This ensures executive metrics reflect business priority, not query count.
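The formula translates directly into code. In this sketch the `ScoredQuery` shape is an assumption, as is the choice to return 0 when there are no active queries.

```typescript
// Hypothetical input: the latest score and priority weight per active query.
interface ScoredQuery {
  score: number; // latest Query Visibility Score, 0–100
  priorityWeight: number; // 1–5
}

// WVI = Σ(score × priorityWeight) / Σ(priorityWeight)
function weightedVisibilityIndex(queries: ScoredQuery[]): number {
  let weightedSum = 0;
  let totalWeight = 0;
  for (const q of queries) {
    weightedSum += q.score * q.priorityWeight;
    totalWeight += q.priorityWeight;
  }
  // Assumption: an empty query set yields 0 rather than NaN.
  return totalWeight === 0 ? 0 : weightedSum / totalWeight;
}
```

With one weight-5 query scoring 80 and one weight-1 query scoring 20, the WVI is (80 × 5 + 20 × 1) / (5 + 1) = 70, whereas the unweighted average would be 50.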
6. Framing tags
Framing tags capture the sentiment and narrative lens through which the brand is discussed. They are the most nuanced signal: positive framings raise the score, mixed framings are neutral or slightly negative, and negative framings reduce it.
See the full dictionary with score impacts on the Settings page.
7. Recommendations
Three scopes of recommendation are generated:
- Query-level: generated immediately after each run from deterministic rules.
- Cluster: pattern detection across the latest run per query (e.g. repeated absence, repeated negative framing in a category).
- Global: threshold-triggered across the whole query universe (e.g. overall absence rate > 40%).
Each recommendation carries an evidence payload linking back to the specific runs or queries that triggered it, never a black box.
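A global rule with its evidence payload might look like the sketch below, using the 40% absence-rate example from the list above. The `Recommendation` and `LatestRun` shapes and the action type name are hypothetical.

```typescript
// Hypothetical shapes; the real taxonomy and payload fields may differ.
interface LatestRun {
  queryId: string;
  runId: string;
  brandMentioned: boolean;
}

interface Recommendation {
  scope: "query" | "cluster" | "global";
  actionType: string;
  message: string;
  evidence: { runIds: string[]; queryIds: string[] };
}

const ABSENCE_RATE_THRESHOLD = 0.4;

// Fires when the brand is absent from more than 40% of the latest runs.
function globalAbsenceRecommendation(latestRuns: LatestRun[]): Recommendation | null {
  if (latestRuns.length === 0) return null;

  const absentRuns = latestRuns.filter((r) => !r.brandMentioned);
  const absenceRate = absentRuns.length / latestRuns.length;
  if (absenceRate <= ABSENCE_RATE_THRESHOLD) return null;

  return {
    scope: "global",
    actionType: "improve_brand_presence", // illustrative name, not from the taxonomy
    message: `Brand absent in ${Math.round(absenceRate * 100)}% of latest runs`,
    // The evidence points back to exactly the runs that tripped the rule.
    evidence: {
      runIds: absentRuns.map((r) => r.runId),
      queryIds: absentRuns.map((r) => r.queryId),
    },
  };
}
```

Returning `null` when the threshold is not crossed keeps the rule readable: a recommendation exists only if the evidence for it does.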
The full taxonomy contains 15 action types.
Honest caveat
This score is a directional brand-intelligence tool, not a scientific measurement. Use it to identify patterns, prioritise action, and track relative change over time, not as an absolute truth. Sablebourne exists to help brands interpret these signals in the context of their category, audience, and strategic goals.