Agent Experience Score

The Agent Experience Score report surfaces insights about where AI agents are hitting bottlenecks during sessions, such as missing context, ambiguous instructions, and scope drift. This is a fundamentally new signal: rather than inferring effectiveness from output, the report evaluates each session transcript from the agent’s perspective to identify what went well and what created friction.

The report provides an overall agent effectiveness score across your organization, with a detailed per-session view of the score, its components, and the work the session produced.

Note: This report is powered by AI Code Insights. AI Code Insights must be deployed to developer machines before data will appear in this report. Agent Experience scoring must be enabled by your DX admin.

When to use Agent Experience Score

This report helps teams:

Identify systemic friction points — See where agents are consistently struggling—unclear requirements, lack of developer steering, or scope drift—so you can address root causes rather than symptoms.
Compare agent effectiveness across teams and tools — Break down scores by team, repository, or AI tool to understand where agents perform well and where they don’t, and why.
Connect friction to output — Open a session’s output to see whether agent friction affected the commits, PRs, and deployments that came out of the session.
Guide improvements to developer-agent workflows — Use dimension-level scores to target specific interventions: better prompting practices, clearer initial requirements, or tighter session scoping.

How the score works

After each AI coding session, the session transcript is evaluated across three dimensions using a structured rubric called the Agentic Friction Scale. Each dimension receives a score from 1 to 5.

Session score = mean of the three dimension scores.

The report headline shows a selected statistic—mean, median, p75, or p90—across all session scores in the filtered period. You can switch between these using the calculation control.
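
As a rough illustration of the arithmetic described above, the sketch below computes a session score as the mean of the three dimension scores, then a headline statistic across a set of session scores. The function names and sample values are hypothetical, and the percentile method (linear interpolation via Python's `statistics.quantiles` with `method="inclusive"`) is an assumption; the report's own percentile method is not specified here.

```python
from statistics import mean, median, quantiles


def session_score(requirements: int, steering: int, scope: int) -> float:
    """Mean of the three dimension scores, each expected to be from 1 to 5."""
    dims = (requirements, steering, scope)
    if any(not 1 <= d <= 5 for d in dims):
        raise ValueError("each dimension score must be between 1 and 5")
    return sum(dims) / len(dims)


def headline(scores: list[float], calculation: str = "mean") -> float:
    """Headline statistic across session scores: mean, median, p75, or p90."""
    if calculation == "mean":
        return mean(scores)
    if calculation == "median":
        return median(scores)
    if calculation in ("p75", "p90"):
        # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
        # Assumption: inclusive linear interpolation; needs at least two scores.
        cuts = quantiles(scores, n=100, method="inclusive")
        return cuts[int(calculation[1:]) - 1]
    raise ValueError(f"unknown calculation: {calculation}")


# Hypothetical sessions, given as (requirements, steering, scope).
sample_scores = [session_score(*dims) for dims in [(2, 2, 2), (3, 3, 3), (4, 4, 4), (5, 5, 5)]]
```

With these sample sessions, `headline(sample_scores, "mean")` and `headline(sample_scores, "median")` both give 3.5, while `headline(sample_scores, "p75")` gives 4.25 under the interpolation assumed here.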

Scoring dimensions

Each session is evaluated on three dimensions that capture different aspects of the developer-agent interaction:

Dimension	What it measures
Requirements	Whether the initial goal was clear and the context provided in the initial prompt was helpful.
Steering	As the session progressed, whether the user provided helpful input and context to work towards the goal.
Scope	Whether the session stayed on track and did not deviate from the initial goal.

The report shows the aggregate score for each dimension alongside the overall score, so you can quickly see which dimensions are dragging the score down.

Session details

Clicking into a data point on the chart opens a session list showing every evaluated session in that time period. The list includes the contributor, session title, AI tool, message count, Agent Experience score, and date—sortable by any column.

Clicking View details on any session opens a detail view with three tabs:

Overview

The Overview tab shows:

AI summary — A short AI-generated summary of what happened during the session.
Session insights — A condensed view of the session’s output, including the repositories and branches where commits landed, linked PRs, and the latest deployment when available.
Agent Experience Score — The session’s overall score (average of the three dimensions) and a per-dimension breakdown. Each dimension shows its individual score on a dial and the agent’s comment explaining why it gave that score—grounded in specific evidence from the session.
Session details — The contributor, AI tool and version, total message count (with a user vs. agent breakdown), and session start time.
Token usage — Total tokens consumed in the session, broken down into input, cached input, and output tokens. Available for Claude Code sessions.

Output

The Output tab connects the session to the work it produced. It shows:

Initiated from — The repositories and branches where the session’s commits landed.
Shipped output — One card per PR, including state, title, PR number, additions, deletions, and AI-generated percentage. Each PR card expands to show individual commits with per-commit metrics.
Latest deployment — The most recent successful deployment associated with any of the session’s PRs, with a count of additional deployments when more are available.

Transcript

The Transcript tab shows the full message history of the session—every user prompt and agent response in order, with timestamps, model used, and per-message token counts. Transcripts can be downloaded as CSV.

You can optionally limit the visibility of full session transcripts. Learn more in IC metrics.

How to use this data

The per-dimension breakdown is the most actionable part of this report. Each dimension points to a different category of improvement:

Low requirements — The initial prompt lacked a clear goal or sufficient context for the agent to get started effectively. Developers may need coaching on prompt engineering, or teams may benefit from standardized task templates. Consider whether requirements documents or issue descriptions provide enough context for an agent to work from the start.
Low steering — As the session progressed, the developer’s follow-up input added ambiguity rather than resolving it. That’s a signal to invest in training around how to redirect agents when they’re stuck—being specific about what to change rather than describing the problem abstractly.
Low scope — The session drifted from the initial goal, with the agent or developer pulling work into unrelated areas. This can indicate that the original task was too broadly defined, or that mid-session requests shifted the objective. Tighter scoping up front and resisting tangents during a session both help.

Use the breakdown by team and AI tool to identify whether friction is localized (one team’s codebase, one tool’s limitations) or systemic (organization-wide patterns).
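
The same localized-versus-systemic check can be sketched offline, assuming you have per-session dimension scores available as plain records. Everything in the example below is hypothetical: the record fields (`team`, `tool`, and the three dimension names), the sample values, and the rule that treats friction as systemic only when every group shares the same weakest dimension. It is an illustration of the reasoning, not how the report itself computes its breakdowns.

```python
from collections import defaultdict
from statistics import mean

DIMENSIONS = ("requirements", "steering", "scope")


def dimension_means(sessions, group_by):
    """Average each dimension score per group (for example per team or tool)."""
    grouped = defaultdict(list)
    for session in sessions:
        grouped[session[group_by]].append(session)
    return {
        group: {dim: mean(row[dim] for row in rows) for dim in DIMENSIONS}
        for group, rows in grouped.items()
    }


def weakest_dimension(means):
    """For each group, the dimension with the lowest average score."""
    return {
        group: min(DIMENSIONS, key=lambda dim: dims[dim])
        for group, dims in means.items()
    }


def looks_systemic(weakest):
    """Assumed heuristic: systemic if all groups share one weakest dimension."""
    return len(set(weakest.values())) == 1


# Hypothetical session records.
sessions = [
    {"team": "payments", "tool": "tool-a", "requirements": 2, "steering": 4, "scope": 4},
    {"team": "payments", "tool": "tool-b", "requirements": 3, "steering": 4, "scope": 5},
    {"team": "search", "tool": "tool-a", "requirements": 4, "steering": 4, "scope": 2},
    {"team": "search", "tool": "tool-a", "requirements": 5, "steering": 3, "scope": 2},
]

by_team = weakest_dimension(dimension_means(sessions, "team"))
```

In this sample, the payments team is weakest on requirements and the search team on scope, so `looks_systemic(by_team)` is false and the friction reads as localized. Passing `"tool"` instead of `"team"` groups the same records by AI tool.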

How evaluations are generated

After a session ends, DX evaluates the transcript server-side using a separate model, not the model (or models) used during the session. Because the Agent Experience dimensions measure how the conversation went rather than the content of the code, scoring works best on the complete transcript, which is only available once the session has ended.

Sessions with insufficient data are automatically skipped. The evaluation rubric is consistent across all tools, so scores are comparable regardless of which AI coding tool was used.