This vignette walks through the four canonical Magenta Book stages using a worked example: a hypothetical GBP 50m skills programme aimed at increasing employment among long-term unemployed claimants. We move from theory of change to evaluation plan to power calculation to confidence rating, all in one R session.
The theory of change links inputs through to long-run impact. mb_theory_of_change() captures the five canonical Magenta Book levels plus assumptions and external factors.
toc <- mb_theory_of_change(
inputs = c("GBP 50m grant", "12 FTE programme team",
"Partnership with Jobcentre Plus"),
activities = c("Design training curriculum",
"Deliver workshops in 50 sites",
"Provide ongoing mentoring"),
outputs = c("500 workshops delivered",
"8000 attendees",
"5000 completed mentoring blocks"),
outcomes = c("Improved employability skills",
"Increased job-search confidence",
"Higher application rates"),
impact = "Higher 12-month employment among long-term unemployed",
assumptions = c(
"Workshops cause skills uplift (not just selection of motivated attendees)",
"Skills uplift translates into application behaviour",
"Local labour markets absorb the additional applicants"
),
external_factors = c(
"Macro labour market remains broadly stable",
"No competing employability programme launches in same areas"
),
name = "Skills uplift programme"
)
toc
#>
#> ── Theory of change: Skills uplift programme ───────────────────────────────────
#> Inputs: GBP 50m grant; 12 FTE programme team; Partnership with Jobcentre Plus
#> Activities: Design training curriculum; Deliver workshops in 50 sites; Provide
#> ongoing mentoring
#> Outputs: 500 workshops delivered; 8000 attendees; 5000 completed mentoring
#> blocks
#> Outcomes: Improved employability skills; Increased job-search confidence;
#> Higher application rates
#> Impact: Higher 12-month employment among long-term unemployed
#> Assumptions: Workshops cause skills uplift (not just selection of motivated
#> attendees); Skills uplift translates into application behaviour; Local labour
#> markets absorb the additional applicants
#> External factors: Macro labour market remains broadly stable; No competing
#> employability programme launches in same areas
#> Vintage: magentabook "0.1.0"Pivoting to a logframe with indicators, means of verification, and risks:
mb_logframe(
toc,
indicators = list(
outputs = c("Workshops delivered", "Attendees per workshop"),
outcomes = c("Skills score (post)", "Application count"),
impact = "Employment rate at 12 months"
),
mov = list(
outputs = "Programme delivery log",
outcomes = c("Pre/post survey", "DWP admin data"),
impact = "Linked HMRC PAYE records"
),
risks = list(
outputs = "Attendance below planned levels",
outcomes = "Self-report bias in skills score",
impact = "Macro shock confounds the estimate"
)
)
#>
#> ── Logframe: Skills uplift programme ───────────────────────────────────────────
#> level
#> inputs inputs
#> activities activities
#> outputs outputs
#> outcomes outcomes
#> impact impact
#> description
#> inputs GBP 50m grant; 12 FTE programme team; Partnership with Jobcentre Plus
#> activities Design training curriculum; Deliver workshops in 50 sites; Provide ongoing mentoring
#> outputs 500 workshops delivered; 8000 attendees; 5000 completed mentoring blocks
#> outcomes Improved employability skills; Increased job-search confidence; Higher application rates
#> impact Higher 12-month employment among long-term unemployed
#> indicator
#> inputs <NA>
#> activities <NA>
#> outputs Workshops delivered; Attendees per workshop
#> outcomes Skills score (post); Application count
#> impact Employment rate at 12 months
#> mov risk
#> inputs <NA> <NA>
#> activities <NA> <NA>
#> outputs Programme delivery log Attendance below planned levels
#> outcomes Pre/post survey; DWP admin data Self-report bias in skills score
#> impact Linked HMRC PAYE records Macro shock confounds the estimate
The high-criticality assumptions belong in a separate register:
mb_assumptions(
level = c("activities", "outcomes", "impact"),
description = c(
"Workshops are well-attended",
"Skills uplift translates into job entry",
"Employment rise persists at 12 months"
),
evidence = c(
"Pilot attendance was 80%",
"Indirect: similar programmes show 0.3 SD effect",
"Limited evidence on longer-run persistence"
),
criticality = c("medium", "high", "high")
)
#>
#> ── Assumption register (3 items) ───────────────────────────────────────────────
#> level description
#> 1 activities Workshops are well-attended
#> 2 outcomes Skills uplift translates into job entry
#> 3 impact Employment rise persists at 12 months
#> evidence criticality
#> 1 Pilot attendance was 80% medium
#> 2 Indirect: similar programmes show 0.3 SD effect high
#> 3 Limited evidence on longer-run persistence high
Tag the evaluation questions by Magenta Book type:
qs <- mb_questions(
text = c(
"Did the programme cause higher 12-month employment",
"How large is the effect, and for whom",
"Was delivery faithful to the design",
"What was the cost per additional job"
),
type = c("impact", "impact", "process", "economic"),
priority = c("primary", "secondary", "secondary", "primary")
)
qs
#>
#> ── Evaluation questions (4 items) ──────────────────────────────────────────────
#> text type priority
#> 1 Did the programme cause higher 12-month employment impact primary
#> 2 How large is the effect, and for whom impact secondary
#> 3 Was delivery faithful to the design process secondary
#> 4 What was the cost per additional job economic primary
Pin down the counterfactual:
cf <- mb_counterfactual(
definition = "Eligible non-applicants matched on age, prior unemployment duration, and region",
source = "quasi-experimental",
credibility = "Moderate; selection on observables only, but rich admin covariates available"
)
cf
#>
#> ── Counterfactual ──────────────────────────────────────────────────────────────
#> Definition: Eligible non-applicants matched on age, prior unemployment
#> duration, and region
#> Source: quasi-experimental
#> Credibility: Moderate; selection on observables only, but rich admin covariates
#> available
Map stakeholders for governance:
mb_stakeholders(
name = c("HM Treasury", "DWP", "Local authorities", "What Works Centre"),
role = c("Funder", "Policy lead", "Delivery", "Synthesis"),
raci = c("A", "R", "C", "I"),
interest = c(5, 5, 4, 3),
influence = c(5, 5, 3, 2)
)
#>
#> ── Stakeholders (4 items) ──────────────────────────────────────────────────────
#> name role raci interest influence
#> 1 HM Treasury Funder A 5 5
#> 2 DWP Policy lead R 5 5
#> 3 Local authorities Delivery C 4 3
#> 4 What Works Centre Synthesis I 3 2
Bundle into a plan:
plan <- mb_evaluation_plan(
scope = "GBP 50m programme, 50 sites, 2026-2029",
questions = qs,
methods = c(
impact = "Difference-in-differences with matched comparison group",
process = "Mixed-methods implementation review",
economic = "Cost per job, with QALY-adjusted variant"
),
timing = c(baseline = "2026-Q1", midline = "2027-Q4", endline = "2029-Q2"),
governance = "Joint HMT / DWP steering group; peer review by What Works Centre",
budget = 1.5e6
)
plan
#>
#> ── Evaluation plan ─────────────────────────────────────────────────────────────
#> Scope: GBP 50m programme, 50 sites, 2026-2029
#> Questions: 4 (primary: 2)
#> Method (impact): Difference-in-differences with matched comparison group
#> Method (process): Mixed-methods implementation review
#> Method (economic): Cost per job, with QALY-adjusted variant
#> Timing: 2026-Q1; 2027-Q4; 2029-Q2
#> Governance: Joint HMT / DWP steering group; peer review by What Works Centre
#> Budget: "GBP 1.50m"
#> Vintage: magentabook "0.1.0"The Magenta Book stresses that an evaluation is only worth running if it can detect effects of policy-relevant size. We size the study assuming a target detectable effect of 5 percentage points on the employment rate, baseline employment of 30 percent, and 80 percent power.
Naive (individual-level) sample size:
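The corresponding chunk is not shown here; as a minimal stand-in, base R's power.prop.test() (from the stats package, not magentabook) gives the unclustered two-arm requirement under a two-sided 5% test:

# Unclustered two-arm sample size: 30% vs 35% employment, 80% power,
# two-sided 5% significance. The returned n is per group (~1,376 per arm).
power.prop.test(p1 = 0.30, p2 = 0.35, power = 0.80)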
But the programme is delivered in clusters (sites), so we need to inflate by the design effect. Jobcentre-level outcomes have an ICC around 0.04 (per the bundled DWP reference values):
mb_icc_reference("employment")
#> domain outcome unit_of_clustering icc_low icc_central icc_high
#> 8 employment job_entry jobcentre 0.02 0.04 0.08
#> 9 employment wage jobcentre 0.03 0.06 0.10
#> value_source
#> 8 central_estimate
#> 9 central_estimate
#> source
#> 8 DWP impact evaluations (synthesis across multiple programmes)
#> 9 DWP impact evaluations
#> notes
#> 8 Claimant-level outcomes within Jobcentre Plus offices; central value is researcher synthesis
#> 9 Claimant wage outcomes within Jobcentres; central value is researcher synthesis
mb_cluster_design(individuals_per_cluster = 50, icc = 0.04, n_clusters = 25)
#> $deff
#> [1] 2.96
#>
#> $n_total_per_arm
#> [1] 1250
#>
#> $n_effective_per_arm
The design effect is a meaningful uplift: with deff of about 2.96, we would need roughly three times the naive N per arm. Alternatively, a stepped-wedge design could trade a larger total N for a staggered rollout that fits programme delivery:
mb_stepped_wedge(
steps = 5, clusters_per_step = 5,
individuals_per_cluster = 50, icc = 0.04
)
#> $deff_cluster
#> [1] 10.96
#>
#> $correction_factor
#> [1] 0.3
#>
#> $deff_sw
#> [1] 3.288
#>
#> $n_total
#> [1] 1250
What is the smallest effect we can detect with the planned design?
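magentabook may well provide its own helper for this; as a back-of-envelope normal-approximation sketch (an assumption, not package output), using the effective per-arm n of about 422 from the clustered design above and pooling variance at the 30% baseline:

# Back-of-envelope minimum detectable effect, normal approximation,
# pooling variance at the 30% baseline (a simplification).
n_eff <- 422.3   # effective n per arm from mb_cluster_design() above
p0    <- 0.30    # baseline employment rate
(qnorm(0.975) + qnorm(0.80)) * sqrt(2 * p0 * (1 - p0) / n_eff)
# ~0.088: the planned design detects ~9 pp, well above the 5 pp target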
Once the evaluation has run, score it on the Maryland SMS:
sms <- mb_sms_rate(
level = 4,
study = "Smith et al. (2029) Skills uplift evaluation",
design = "Difference-in-differences with matched comparison",
notes = "Parallel trends supported by 4 pre-period observations; cluster-robust SEs"
)
sms
#>
#> ── Maryland SMS Level 4: Strong ────────────────────────────────────────────────
#> Study: Smith et al. (2029) Skills uplift evaluation
#> Design: Difference-in-differences with matched comparison
#> Notes: Parallel trends supported by 4 pre-period observations; cluster-robust
#> SEs
#> Description: Comparison between treatment and comparison units accounting for
#> unobservable differences
#> Causal inference: Strong if identifying assumptions hold
Record a structured confidence rating:
conf_main <- mb_confidence(
rating = "medium",
question = "Did the programme raise 12-month employment",
evidence_strength = "One Level 4 DiD (n = 12000); supportive Level 3 cohort study",
methodological_quality = "Adequate; parallel trends plausible; some attrition concerns",
generalisability = "Established across 50 sites in two regions",
rationale = "Effect direction consistent across two studies but limited replication outside the programme footprint"
)
conf_main
#>
#> ── Medium confidence ───────────────────────────────────────────────────────────
#> Question: Did the programme raise 12-month employment
#> Evidence strength: One Level 4 DiD (n = 12000); supportive Level 3 cohort study
#> Methodological quality: Adequate; parallel trends plausible; some attrition
#> concerns
#> Generalisability: Established across 50 sites in two regions
#> Rationale: Effect direction consistent across two studies but limited
#> replication outside the programme footprint
#> Decision implication: Indicative evidence; supports continued investment with
#> monitoring
conf_process <- mb_confidence(
rating = "high",
question = "Was the programme implemented faithfully",
evidence_strength = "Mixed-methods process evaluation; 50-site fidelity audit",
methodological_quality = "Strong; documented fidelity protocol with inter-rater reliability",
generalisability = "All sites covered",
rationale = "Comprehensive coverage; consistent fidelity scores"
)
mb_confidence_summary(conf_main, conf_process)
#>
#> ── Confidence summary (2 ratings) ──────────────────────────────────────────────
#> high: 1
#> medium: 1
#> low: 0
#>
#> ── Ratings ──
#>
#> question rating
#> 1 Did the programme raise 12-month employment medium
#> 2 Was the programme implemented faithfully high
#> rationale
#> 1 Effect direction consistent across two studies but limited replication outside the programme footprint
#> 2 Comprehensive coverage; consistent fidelity scores
A single mb_report object aggregates everything:
report <- mb_evaluation_report(
plan = plan,
toc = toc,
sms = sms,
confidence = list(conf_main, conf_process),
name = "Skills uplift evaluation"
)
report
#> ── Magenta Book evaluation report: Skills uplift evaluation ────────────────────
#> Theory of change: present
#> Plan: present
#> SMS ratings: 1
#> Confidence ratings: 2
#> Cost-effectiveness items: 0
#> Vintage: magentabook "0.1.0"Export to LaTeX for a one-pager:
cat(mb_to_latex(report, caption = "Skills uplift evaluation summary"))
#> \begin{table}[h]
#> \centering
#> \begin{tabular}{ll}
#> \hline
#> Component & Value \\
#> \hline
#> Name & Skills uplift evaluation \\
#> Vintage & magentabook 0.1.0 \\
#> Has theory of change & yes \\
#> Has plan & yes \\
#> SMS ratings & 1 \\
#> Confidence ratings & 2 \\
#> Cost-effectiveness items & 0 \\
#> \hline
#> \end{tabular}\caption{Skills uplift evaluation summary}
#>
#> \end{table}
Word and Excel exports are available via mb_to_word() and mb_to_excel() (both require optional packages: officer + flextable, and openxlsx respectively).
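For instance (a hypothetical call pattern; the exact arguments are not shown in this vignette, so check each function's help page):

# Hypothetical usage -- argument names assumed, not confirmed
mb_to_word(report, path = "skills_uplift.docx")    # needs officer + flextable
mb_to_excel(report, path = "skills_uplift.xlsx")   # needs openxlsx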
Every result object stamps the package vintage. Bundled rubric and reference tables expose their source via mb_data_versions():
mb_data_versions()
#> dataset
#> 1 sms_rubric
#> 2 confidence_rubric
#> 3 icc_reference
#> 4 question_taxonomy
#> source
#> 1 Sherman, Gottfredson, MacKenzie, Eck, Reuter & Bushway (1997). Preventing Crime: What Works, What Doesn't, What's Promising. Numeric levels 1-5 are the original Maryland Scientific Methods Scale.
#> 2 Synthesised from What Works Centre confidence-rating traditions: Education Endowment Foundation (5 padlocks), Early Intervention Foundation (Foundation Standards), College of Policing (1-5 scale), and the Justice Data Lab (red / amber / green). Three-level high / medium / low structure adopted to align with HM Treasury Magenta Book (2020) supplementary value-for-money guidance.
#> 3 Hedges & Hedberg (2007); Adams, Gulliford, Ukoumunne, Eldridge, Chinn & Campbell (2004); Campbell, Mollison & Grimshaw (2000); EEF / DfE / DWP / MHCLG / MoJ impact-evaluation reports.
#> 4 HM Treasury Magenta Book (2020) chapters on process, impact, and economic evaluation; supplementary Magenta Book guides on value for money and theory-based evaluation.
#> last_updated
#> 1 2026-04-27
#> 2 2026-04-27
#> 3 2026-04-27
#> 4 2026-04-27
#> notes
#> 1 Numeric levels 1-5 are direct from Sherman et al. (1997). Word labels (Weakest / Weak / Moderate / Strong / Strongest) follow What Works UK / Education Endowment Foundation convention. Design examples and typical-use columns are magentabook synthesis.
#> 2 Not a direct quotation from the Magenta Book. magentabook synthesis of cross-What-Works-Centre confidence-rating traditions. Three-level structure designed for Treasury / consultancy decision-grade reporting.
#> 3 Reference intra-class correlation coefficients across UK policy domains. Each row is tagged in the bundled CSV with value_source = 'table_quote' (direct extraction with table number) or 'central_estimate' (researcher synthesis within published range). Practitioners should compute domain-specific ICCs from baseline data wherever feasible.
#> 4 Magenta Book canonical evaluation question taxonomy with methods and chapter references. Sub-types (e.g. 'attribution', 'fidelity') are conventional categories used across HMG evaluation practice.