
Designing a Magenta Book evaluation

This vignette walks through the four canonical Magenta Book stages for a worked example: a hypothetical GBP 50m skills programme aimed at increasing employment among long-term unemployed claimants. We move from theory of change to evaluation plan to power calculation to confidence rating, all in one R session.

Stage 1: theory of change

The theory of change links inputs through to long-run impact. mb_theory_of_change() captures the five canonical Magenta Book levels plus assumptions and external factors.

toc <- mb_theory_of_change(
  inputs     = c("GBP 50m grant", "12 FTE programme team",
                 "Partnership with Jobcentre Plus"),
  activities = c("Design training curriculum",
                 "Deliver workshops in 50 sites",
                 "Provide ongoing mentoring"),
  outputs    = c("500 workshops delivered",
                 "8000 attendees",
                 "5000 completed mentoring blocks"),
  outcomes   = c("Improved employability skills",
                 "Increased job-search confidence",
                 "Higher application rates"),
  impact     = "Higher 12-month employment among long-term unemployed",
  assumptions = c(
    "Workshops cause skills uplift (not just selection of motivated attendees)",
    "Skills uplift translates into application behaviour",
    "Local labour markets absorb the additional applicants"
  ),
  external_factors = c(
    "Macro labour market remains broadly stable",
    "No competing employability programme launches in same areas"
  ),
  name = "Skills uplift programme"
)
toc
#> 
#> ── Theory of change: Skills uplift programme ───────────────────────────────────
#> Inputs: GBP 50m grant; 12 FTE programme team; Partnership with Jobcentre Plus
#> Activities: Design training curriculum; Deliver workshops in 50 sites; Provide
#> ongoing mentoring
#> Outputs: 500 workshops delivered; 8000 attendees; 5000 completed mentoring
#> blocks
#> Outcomes: Improved employability skills; Increased job-search confidence;
#> Higher application rates
#> Impact: Higher 12-month employment among long-term unemployed
#> Assumptions: Workshops cause skills uplift (not just selection of motivated
#> attendees); Skills uplift translates into application behaviour; Local labour
#> markets absorb the additional applicants
#> External factors: Macro labour market remains broadly stable; No competing
#> employability programme launches in same areas
#> Vintage: magentabook "0.1.0"

Pivoting to a logframe with indicators, means of verification, and risks:

mb_logframe(
  toc,
  indicators = list(
    outputs  = c("Workshops delivered", "Attendees per workshop"),
    outcomes = c("Skills score (post)", "Application count"),
    impact   = "Employment rate at 12 months"
  ),
  mov = list(
    outputs  = "Programme delivery log",
    outcomes = c("Pre/post survey", "DWP admin data"),
    impact   = "Linked HMRC PAYE records"
  ),
  risks = list(
    outputs  = "Attendance below planned levels",
    outcomes = "Self-report bias in skills score",
    impact   = "Macro shock confounds the estimate"
  )
)
#> 
#> ── Logframe: Skills uplift programme ───────────────────────────────────────────
#>                 level
#> inputs         inputs
#> activities activities
#> outputs       outputs
#> outcomes     outcomes
#> impact         impact
#>                                                                                         description
#> inputs                        GBP 50m grant; 12 FTE programme team; Partnership with Jobcentre Plus
#> activities     Design training curriculum; Deliver workshops in 50 sites; Provide ongoing mentoring
#> outputs                    500 workshops delivered; 8000 attendees; 5000 completed mentoring blocks
#> outcomes   Improved employability skills; Increased job-search confidence; Higher application rates
#> impact                                        Higher 12-month employment among long-term unemployed
#>                                              indicator
#> inputs                                            <NA>
#> activities                                        <NA>
#> outputs    Workshops delivered; Attendees per workshop
#> outcomes        Skills score (post); Application count
#> impact                    Employment rate at 12 months
#>                                        mov                               risk
#> inputs                                <NA>                               <NA>
#> activities                            <NA>                               <NA>
#> outputs             Programme delivery log    Attendance below planned levels
#> outcomes   Pre/post survey; DWP admin data   Self-report bias in skills score
#> impact            Linked HMRC PAYE records Macro shock confounds the estimate

The high-criticality assumptions belong in a separate register:

mb_assumptions(
  level = c("activities", "outcomes", "impact"),
  description = c(
    "Workshops are well-attended",
    "Skills uplift translates into job entry",
    "Employment rise persists at 12 months"
  ),
  evidence = c(
    "Pilot attendance was 80%",
    "Indirect: similar programmes show 0.3 SD effect",
    "Limited evidence on longer-run persistence"
  ),
  criticality = c("medium", "high", "high")
)
#> 
#> ── Assumption register (3 items) ───────────────────────────────────────────────
#>        level                             description
#> 1 activities             Workshops are well-attended
#> 2   outcomes Skills uplift translates into job entry
#> 3     impact   Employment rise persists at 12 months
#>                                          evidence criticality
#> 1                        Pilot attendance was 80%      medium
#> 2 Indirect: similar programmes show 0.3 SD effect        high
#> 3      Limited evidence on longer-run persistence        high

Stage 2: evaluation plan

Tag the evaluation questions by Magenta Book type:

qs <- mb_questions(
  text = c(
    "Did the programme cause higher 12-month employment",
    "How large is the effect, and for whom",
    "Was delivery faithful to the design",
    "What was the cost per additional job"
  ),
  type     = c("impact", "impact", "process", "economic"),
  priority = c("primary", "secondary", "secondary", "primary")
)
qs
#> 
#> ── Evaluation questions (4 items) ──────────────────────────────────────────────
#>                                                 text     type  priority
#> 1 Did the programme cause higher 12-month employment   impact   primary
#> 2              How large is the effect, and for whom   impact secondary
#> 3                Was delivery faithful to the design  process secondary
#> 4               What was the cost per additional job economic   primary

Pin down the counterfactual:

cf <- mb_counterfactual(
  definition  = "Eligible non-applicants matched on age, prior unemployment duration, and region",
  source      = "quasi-experimental",
  credibility = "Moderate; selection on observables only, but rich admin covariates available"
)
cf
#> 
#> ── Counterfactual ──────────────────────────────────────────────────────────────
#> Definition: Eligible non-applicants matched on age, prior unemployment
#> duration, and region
#> Source: quasi-experimental
#> Credibility: Moderate; selection on observables only, but rich admin covariates
#> available

Map stakeholders for governance:

mb_stakeholders(
  name = c("HM Treasury", "DWP", "Local authorities", "What Works Centre"),
  role = c("Funder", "Policy lead", "Delivery", "Synthesis"),
  raci = c("A", "R", "C", "I"),
  interest  = c(5, 5, 4, 3),
  influence = c(5, 5, 3, 2)
)
#> 
#> ── Stakeholders (4 items) ──────────────────────────────────────────────────────
#>                name        role raci interest influence
#> 1       HM Treasury      Funder    A        5         5
#> 2               DWP Policy lead    R        5         5
#> 3 Local authorities    Delivery    C        4         3
#> 4 What Works Centre   Synthesis    I        3         2

Bundle into a plan:

plan <- mb_evaluation_plan(
  scope = "GBP 50m programme, 50 sites, 2026-2029",
  questions = qs,
  methods = c(
    impact   = "Difference-in-differences with matched comparison group",
    process  = "Mixed-methods implementation review",
    economic = "Cost per job, with QALY-adjusted variant"
  ),
  timing = c(baseline = "2026-Q1", midline = "2027-Q4", endline = "2029-Q2"),
  governance = "Joint HMT / DWP steering group; peer review by What Works Centre",
  budget = 1.5e6
)
plan
#> 
#> ── Evaluation plan ─────────────────────────────────────────────────────────────
#> Scope: GBP 50m programme, 50 sites, 2026-2029
#> Questions: 4 (primary: 2)
#> Method (impact): Difference-in-differences with matched comparison group
#> Method (process): Mixed-methods implementation review
#> Method (economic): Cost per job, with QALY-adjusted variant
#> Timing: 2026-Q1; 2027-Q4; 2029-Q2
#> Governance: Joint HMT / DWP steering group; peer review by What Works Centre
#> Budget: "GBP 1.50m"
#> Vintage: magentabook "0.1.0"

Stage 3: power and sample size

The Magenta Book stresses that an evaluation is only worth running if it can detect effects of policy-relevant size. We size the study assuming a target detectable effect of 5 percentage points on the employment rate, baseline employment of 30 percent, and 80 percent power.

Naive (individual-level) sample size:

mb_sample_size(
  type = "proportion", p1 = 0.30, p2 = 0.35,
  power = 0.8, alpha = 0.05
)
#> [1] 1376
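
As a cross-check, the textbook pooled-variance formula for comparing two proportions can be computed by hand in base R. This is a sketch of the standard calculation, not a dump of mb_sample_size() internals; note that base R's power.prop.test() uses an unpooled variance convention and returns a slightly smaller n.

z_a <- qnorm(0.975); z_b <- qnorm(0.80)
p1 <- 0.30; p2 <- 0.35; pbar <- (p1 + p2) / 2
num <- z_a * sqrt(2 * pbar * (1 - pbar)) +
  z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
round(num^2 / (p2 - p1)^2)   # sample size per arm
#> [1] 1376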

But the programme is delivered in clusters (sites), so we need to inflate by the design effect. Jobcentre-level outcomes have an ICC around 0.04 (per the bundled DWP reference values):

mb_icc_reference("employment")
#>       domain   outcome unit_of_clustering icc_low icc_central icc_high
#> 8 employment job_entry          jobcentre    0.02        0.04     0.08
#> 9 employment      wage          jobcentre    0.03        0.06     0.10
#>       value_source
#> 8 central_estimate
#> 9 central_estimate
#>                                                          source
#> 8 DWP impact evaluations (synthesis across multiple programmes)
#> 9                                        DWP impact evaluations
#>                                                                                          notes
#> 8 Claimant-level outcomes within Jobcentre Plus offices; central value is researcher synthesis
#> 9              Claimant wage outcomes within Jobcentres; central value is researcher synthesis
mb_cluster_design(individuals_per_cluster = 50, icc = 0.04, n_clusters = 25)
#> $deff
#> [1] 2.96
#> 
#> $n_total_per_arm
#> [1] 1250
#> 
#> $n_effective_per_arm
#> [1] 422.2973
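
The numbers above follow directly from the Kish design-effect formula, deff = 1 + (m - 1) * ICC, where m is the cluster size. Reproducing them by hand:

m <- 50; icc <- 0.04; k <- 25
deff <- 1 + (m - 1) * icc   # 1 + 49 * 0.04
deff
#> [1] 2.96
m * k / deff                # effective sample size per arm
#> [1] 422.2973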

A design effect of 2.96 is a meaningful uplift: the naive per-arm sample size must be scaled by roughly a factor of three. Alternatively, a stepped-wedge design could accept a somewhat larger design effect in exchange for a staggered rollout that fits programme delivery:

mb_stepped_wedge(
  steps = 5, clusters_per_step = 5,
  individuals_per_cluster = 50, icc = 0.04
)
#> $deff_cluster
#> [1] 10.96
#> 
#> $correction_factor
#> [1] 0.3
#> 
#> $deff_sw
#> [1] 3.288
#> 
#> $n_total
#> [1] 1250
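
The printed values reconcile arithmetically: deff_cluster is consistent with applying the Kish formula to the 250 observations contributed per step (5 clusters of 50 individuals), and deff_sw is that figure scaled by the correction factor. The decomposition below is an inference from the printed numbers, not documented mb_stepped_wedge() internals:

1 + (5 * 50 - 1) * 0.04   # matches deff_cluster
#> [1] 10.96
10.96 * 0.3               # deff_cluster * correction_factor
#> [1] 3.288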

What is the smallest effect we can detect with the planned design?

mb_mde(
  n_per_group = 600, type = "proportion",
  baseline = 0.30, power = 0.8
)
#> [1] 0.07641078
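
A quick sanity check on this figure: the MDE for a proportion solves a fixed point, because the detectable effect feeds back into the treatment-arm variance. A simple iteration (an approximation; the exact variance convention inside mb_mde() may differ slightly) lands within a fraction of a percentage point of the printed value:

z <- qnorm(0.975) + qnorm(0.80)    # ~2.80 for alpha = 0.05, power = 0.8
p1 <- 0.30; n <- 600; mde <- 0.05  # starting guess
for (i in 1:20) {
  p2 <- p1 + mde
  mde <- z * sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n)
}
mde   # converges to about 0.076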

Stage 4: rate the evidence

Once the evaluation has run, score it on the Maryland SMS:

sms <- mb_sms_rate(
  level  = 4,
  study  = "Smith et al. (2029) Skills uplift evaluation",
  design = "Difference-in-differences with matched comparison",
  notes  = "Parallel trends supported by 4 pre-period observations; cluster-robust SEs"
)
sms
#> 
#> ── Maryland SMS Level 4: Strong ────────────────────────────────────────────────
#> Study: Smith et al. (2029) Skills uplift evaluation
#> Design: Difference-in-differences with matched comparison
#> Notes: Parallel trends supported by 4 pre-period observations; cluster-robust
#> SEs
#> Description: Comparison between treatment and comparison units accounting for
#> unobservable differences
#> Causal inference: Strong if identifying assumptions hold

Record a structured confidence rating:

conf_main <- mb_confidence(
  rating                 = "medium",
  question               = "Did the programme raise 12-month employment",
  evidence_strength      = "One Level 4 DiD (n = 12000); supportive Level 3 cohort study",
  methodological_quality = "Adequate; parallel trends plausible; some attrition concerns",
  generalisability       = "Established across 50 sites in two regions",
  rationale              = "Effect direction consistent across two studies but limited replication outside the programme footprint"
)
conf_main
#> 
#> ── Medium confidence ───────────────────────────────────────────────────────────
#> Question: Did the programme raise 12-month employment
#> Evidence strength: One Level 4 DiD (n = 12000); supportive Level 3 cohort study
#> Methodological quality: Adequate; parallel trends plausible; some attrition
#> concerns
#> Generalisability: Established across 50 sites in two regions
#> Rationale: Effect direction consistent across two studies but limited
#> replication outside the programme footprint
#> Decision implication: Indicative evidence; supports continued investment with
#> monitoring

conf_process <- mb_confidence(
  rating                 = "high",
  question               = "Was the programme implemented faithfully",
  evidence_strength      = "Mixed-methods process evaluation; 50-site fidelity audit",
  methodological_quality = "Strong; documented fidelity protocol with inter-rater reliability",
  generalisability       = "All sites covered",
  rationale              = "Comprehensive coverage; consistent fidelity scores"
)

mb_confidence_summary(conf_main, conf_process)
#> 
#> ── Confidence summary (2 ratings) ──────────────────────────────────────────────
#> high: 1
#> medium: 1
#> low: 0
#> 
#> ── Ratings ──
#> 
#>                                      question rating
#> 1 Did the programme raise 12-month employment medium
#> 2    Was the programme implemented faithfully   high
#>                                                                                                rationale
#> 1 Effect direction consistent across two studies but limited replication outside the programme footprint
#> 2                                                     Comprehensive coverage; consistent fidelity scores

Bringing it together

A single mb_report object aggregates everything:

report <- mb_evaluation_report(
  plan       = plan,
  toc        = toc,
  sms        = sms,
  confidence = list(conf_main, conf_process),
  name       = "Skills uplift evaluation"
)
report
#> ── Magenta Book evaluation report: Skills uplift evaluation ────────────────────
#> Theory of change: present
#> Plan: present
#> SMS ratings: 1
#> Confidence ratings: 2
#> Cost-effectiveness items: 0
#> Vintage: magentabook "0.1.0"

Export to LaTeX for a one-pager:

cat(mb_to_latex(report, caption = "Skills uplift evaluation summary"))
#> \begin{table}[h]
#> \centering
#> \begin{tabular}{ll}
#> \hline
#> Component & Value \\
#> \hline
#> Name & Skills uplift evaluation \\
#> Vintage & magentabook 0.1.0 \\
#> Has theory of change & yes \\
#> Has plan & yes \\
#> SMS ratings & 1 \\
#> Confidence ratings & 2 \\
#> Cost-effectiveness items & 0 \\
#> \hline
#> \end{tabular}\caption{Skills uplift evaluation summary}
#> 
#> \end{table}

Word and Excel exports are available via mb_to_word() and mb_to_excel() (both require optional packages: officer + flextable, and openxlsx respectively).

Reproducibility

Every result object stamps the package vintage. Bundled rubric and reference tables expose their source via mb_data_versions():

mb_data_versions()
#>             dataset
#> 1        sms_rubric
#> 2 confidence_rubric
#> 3     icc_reference
#> 4 question_taxonomy
#>                                                                                                                                                                                                                                                                                                                                                                                         source
#> 1                                                                                                                                                                                          Sherman, Gottfredson, MacKenzie, Eck, Reuter & Bushway (1997). Preventing Crime: What Works, What Doesn't, What's Promising. Numeric levels 1-5 are the original Maryland Scientific Methods Scale.
#> 2 Synthesised from What Works Centre confidence-rating traditions: Education Endowment Foundation (5 padlocks), Early Intervention Foundation (Foundation Standards), College of Policing (1-5 scale), and the Justice Data Lab (red / amber / green). Three-level high / medium / low structure adopted to align with HM Treasury Magenta Book (2020) supplementary value-for-money guidance.
#> 3                                                                                                                                                                                                      Hedges & Hedberg (2007); Adams, Gulliford, Ukoumunne, Eldridge, Chinn & Campbell (2004); Campbell, Mollison & Grimshaw (2000); EEF / DfE / DWP / MHCLG / MoJ impact-evaluation reports.
#> 4                                                                                                                                                                                                                      HM Treasury Magenta Book (2020) chapters on process, impact, and economic evaluation; supplementary Magenta Book guides on value for money and theory-based evaluation.
#>   last_updated
#> 1   2026-04-27
#> 2   2026-04-27
#> 3   2026-04-27
#> 4   2026-04-27
#>                                                                                                                                                                                                                                                                                                                                                 notes
#> 1                                                                                         Numeric levels 1-5 are direct from Sherman et al. (1997). Word labels (Weakest / Weak / Moderate / Strong / Strongest) follow What Works UK / Education Endowment Foundation convention. Design examples and typical-use columns are magentabook synthesis.
#> 2                                                                                                                                    Not a direct quotation from the Magenta Book. magentabook synthesis of cross-What-Works-Centre confidence-rating traditions. Three-level structure designed for Treasury / consultancy decision-grade reporting.
#> 3 Reference intra-class correlation coefficients across UK policy domains. Each row is tagged in the bundled CSV with value_source = 'table_quote' (direct extraction with table number) or 'central_estimate' (researcher synthesis within published range). Practitioners should compute domain-specific ICCs from baseline data wherever feasible.
#> 4                                                                                                                                                Magenta Book canonical evaluation question taxonomy with methods and chapter references. Sub-types (e.g. 'attribution', 'fidelity') are conventional categories used across HMG evaluation practice.
