What is birt?
birt (Bayesian IRT) is an R package for fitting Item Response Theory models using Bayesian estimation via CmdStan. It supports three dichotomous IRT models:
- Rasch (1PL): Every item has the same discrimination — items differ only in difficulty.
- 2PL: Each item gets its own discrimination parameter controlling how sharply it separates students of different ability.
- 3PL: Adds a guessing parameter representing the probability of a correct response by pure chance.
All models are estimated using Hamiltonian Monte Carlo via CmdStan, producing full posterior distributions for every parameter. This means you get credible intervals, convergence diagnostics, and posterior predictive checks — not just point estimates.
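Since every parameter comes back as posterior draws, a credible interval is nothing more than the central quantiles of those draws. A minimal base-R illustration (the draws here are simulated stand-ins, not actual birt output):

```r
# A 95% credible interval is simply the central quantiles of posterior draws.
# Fake "draws" for one difficulty parameter, standing in for MCMC output:
set.seed(42)
draws <- rnorm(4000, mean = 0.8, sd = 0.1)
quantile(draws, c(0.025, 0.975))
```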
Installation
birt requires a working CmdStan installation on your computer.
# Step 1: Install cmdstanr (the R interface to CmdStan)
install.packages("cmdstanr", repos = c(
"https://stan-dev.r-universe.dev/",
getOption("repos")
))
# Step 2: Install CmdStan itself (~5 minutes)
cmdstanr::install_cmdstan()
# Step 3: Install birt from GitHub
# install.packages("remotes")
remotes::install_github("Ndukaboika/birt")
Verify the installation:
cmdstanr::cmdstan_version()
library(birt)
The Models
All three models share a core idea: the probability of a correct response depends on the difference between a person’s ability and an item’s difficulty. They differ in how many item parameters they estimate.
Rasch Model (1PL)
P(y_jk = 1) = inv_logit(delta + alpha_j - beta_k)
Where:
- delta = overall mean ability
- alpha_j = how student j deviates from the mean
- beta_k = difficulty of item k
- Total ability: theta_j = delta + alpha_j
When ability equals difficulty (theta_j = beta_k), the student has a 50% chance of answering correctly.
2PL Model
P(y_jk = 1) = inv_logit(a_k * (delta + alpha_j - beta_k))
Adds discrimination a_k. When a_k > 1, the item separates students more sharply. When a_k < 1, the item is less discriminating. When a_k = 1, this reduces to the Rasch model.
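The response functions for all three models can be sketched in base R, where `plogis()` is the inverse logit; the 3PL adds the guessing floor described earlier. This is an illustration of the formulas, not the package's internal code:

```r
# Response probability under each model; theta is total ability (delta + alpha).
p_rasch <- function(theta, beta) plogis(theta - beta)
p_2pl   <- function(theta, beta, a) plogis(a * (theta - beta))
p_3pl   <- function(theta, beta, a, c) c + (1 - c) * plogis(a * (theta - beta))

# When ability equals difficulty, the Rasch model gives exactly 50%:
p_rasch(theta = 1, beta = 1)                 # 0.5

# Higher discrimination pushes probabilities away from 0.5 for the same gap:
p_2pl(theta = 2, beta = 1, a = 2)            # farther from 0.5 than with a = 1

# A guessing floor keeps low-ability students above c:
p_3pl(theta = -3, beta = 1, a = 1, c = 0.2)  # just above 0.2
```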
Prior Distributions
birt uses weakly informative default priors that work across a wide range of testing scenarios. The defaults are deliberately neutral — they make no strong assumptions about your test or students. Users with domain knowledge can override any prior at two levels: class-level (same prior for all items) or per-item (different prior for each individual item).
Default Priors
| Parameter | Default Prior | Class-level Argument | Per-item Arguments | 95% Range |
|---|---|---|---|---|
| delta (mean ability) | Normal(0, 1) | prior_delta = c(0, 1) | — | -2.0 to 2.0 |
| alpha (ability deviation) | Normal(0, 1.5) | prior_alpha_sd = 1.5 | — | -3.0 to 3.0 |
| beta (difficulty) | Normal(0, 1.5) | prior_beta = c(0, 1.5) | prior_beta_mean, prior_beta_sd | -3.0 to 3.0 |
| a (discrimination) | LogNormal(0, 0.5) | prior_a = c(0, 0.5) | prior_a_meanlog, prior_a_sdlog | 0.37 to 2.72 |
| c (guessing) | Beta(2, 8) | prior_c = c(2, 8) | prior_c_alpha, prior_c_beta | 0.03 to 0.45 |
These defaults are informed by recommendations from the Stan User’s Guide (Stan Development Team, 2024), Luo and Jiao (2018), and the edstan package (Furr, 2017). The ability prior also serves to identify the scale of the model.
Why These Defaults?
- delta ~ Normal(0, 1): Centered at zero — no assumption about whether students are above or below average difficulty. The data determines this.
- alpha ~ Normal(0, 1.5): Wide enough to accommodate very strong and very weak students. A deviation of 3 logits shifts P(correct) from 50% to about 95%.
- beta ~ Normal(0, 1.5): Same rationale. Covers the typical difficulty range of well-constructed test items.
- a ~ LogNormal(0, 0.5): Always positive (negative discrimination contradicts IRT assumptions), centered at 1.0 (the Rasch case).
- c ~ Beta(2, 8): Weakly informative with a mean around 0.2. Wide enough to accommodate different item formats without making strong assumptions.
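The ranges in the table above are approximate (roughly mean ± 2 SD). Exact central 95% intervals can be checked directly with base R quantile functions, independent of the package:

```r
# Exact 95% central intervals implied by the default priors.
round(qnorm(c(0.025, 0.975), mean = 0, sd = 1), 2)     # delta: -1.96 1.96
round(qnorm(c(0.025, 0.975), mean = 0, sd = 1.5), 2)   # alpha, beta: -2.94 2.94
round(qlnorm(c(0.025, 0.975), meanlog = 0, sdlog = 0.5), 2)  # a: 0.38 2.66
round(qbeta(c(0.025, 0.975), shape1 = 2, shape2 = 8), 2)     # c: slightly wider
                                                             # than the table's
                                                             # approximation
# Beta(2, 8) has mean 2 / (2 + 8) = 0.2, the chance rate assumed for guessing.
2 / (2 + 8)
```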
Customizing Priors: Class-Level
Set the same prior for all items of a parameter type:
# If you know students tend to perform above average
fit <- rasch_fit(data,
prior_delta = c(0.75, 0.5),
seed = 123
)
# Wider difficulty prior for all items
fit <- rasch_fit(data,
prior_beta = c(0, 3),
seed = 123
)
# 4-option multiple choice (guessing around 0.25)
fit3 <- threepl_fit(data,
prior_c = c(5, 15), # Beta(5,15), mean = 0.25
seed = 123
)
# 5-option multiple choice (guessing around 0.20)
fit3 <- threepl_fit(data,
prior_c = c(5, 20), # Beta(5,20), mean = 0.20
seed = 123
)
# Free-response items (very little guessing)
fit3 <- threepl_fit(data,
prior_c = c(1, 19), # Beta(1,19), mean = 0.05
seed = 123
)
# Wide priors for large samples (let data fully dominate)
fit <- rasch_fit(data,
prior_delta = c(0, 3),
prior_alpha_sd = 3,
prior_beta = c(0, 3),
seed = 123
)
Customizing Priors: Per-Item
Set a different prior for each individual item. Use this when you have specific knowledge about certain items from pilot testing, expert judgment, or previous administrations.
K <- ncol(data)
# Difficulty: item 3 is known to be hard, item 7 is known to be easy
b_mean <- rep(0, K) # default for all items
b_mean[3] <- 2.0 # item 3: prior centered at 2.0 (hard)
b_mean[7] <- -1.5 # item 7: prior centered at -1.5 (easy)
b_sd <- rep(1.5, K) # default uncertainty for all items
b_sd[c(3, 7)] <- 0.5 # tighter prior for items we know about
fit <- rasch_fit(data,
prior_beta_mean = b_mean,
prior_beta_sd = b_sd,
seed = 123
)
# Discrimination: item 5 is known to be poorly discriminating
a_meanlog <- rep(0, K) # default: centered at 1.0 for all items
a_meanlog[5] <- -0.5 # item 5: prior centered below 1.0
a_sdlog <- rep(0.5, K) # default uncertainty for all items
a_sdlog[5] <- 0.3 # tighter prior for item 5
fit2 <- twopl_fit(data,
prior_a_meanlog = a_meanlog,
prior_a_sdlog = a_sdlog,
seed = 123
)
# Guessing: mixed item formats on the same test
# Items 1-5 are 4-option MC, items 6-10 are 5-option MC
c_alpha <- rep(5, K)
c_beta <- c(rep(15, 5), rep(20, 5)) # mean 0.25 vs 0.20
fit3 <- threepl_fit(data,
prior_c_alpha = c_alpha,
prior_c_beta = c_beta,
seed = 123
)
Per-item arguments override class-level arguments. If you provide prior_beta_mean, the prior_beta argument is ignored for the mean. If you provide prior_beta_sd, the prior_beta argument is ignored for the SD.
Checking What Priors Were Used
The priors are stored in the fitted object as vectors (one value per item):
fit <- rasch_fit(data, seed = 123)
fit$priors
# Shows:
# $delta — c(0, 1)
# $alpha_sd — 1.5
# $beta_mean — c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0) one per item
# $beta_sd — c(1.5, 1.5, 1.5, ...) one per item
Quick Start with Simulated Data
Simulate Data
rasch_simulate() generates data with known true parameters so you can verify the model recovers the correct values.
library(birt)
sim <- rasch_simulate(
J = 300, # 300 students
K = 10, # 10 items
delta_true = 0.75, # mean ability
alpha_sd = 1, # spread of abilities
seed = 42 # reproducible
)
# The response matrix
head(sim$data)
# True values we want to recover
sim$beta # item difficulties
sim$delta # mean ability
Understanding the Output
The summary() output contains:
Mean Ability (delta): The estimated overall mean ability with a credible interval. Compare this to sim$delta to check recovery.
Item Difficulties (beta): One row per item with the posterior mean (point estimate), credible interval, and convergence diagnostics. Check that Rhat < 1.01 and ESS > 400 for all items.
Person Ability Summary: Overview of estimated abilities across all students.
Extract Parameters
# Item difficulties
item_params(fit)
# Person abilities (total: alpha + delta)
head(person_params(fit), 10)
# Mean ability
delta_param(fit)
# Change credible interval width
item_params(fit, prob = 0.90) # 90% interval
item_params(fit, prob = 0.99) # 99% interval
Item Fit Diagnostics
Outfit and infit are mean-square statistics measuring how well each item conforms to the model. Both should be near 1.0.
ifit <- item_fit(fit)
ifit
# Flag misfitting items
ifit[ifit$outfit > 1.3 | ifit$outfit < 0.7, ]
Interpretation:
- Outfit > 1.3: The item behaves erratically — unexpected responses from students far from the item’s difficulty level.
- Outfit < 0.7: The item is too predictable — possibly redundant with other items.
- Infit: Same interpretation but more sensitive to systematic patterns near the item’s difficulty and less affected by outliers.
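For intuition, both statistics can be sketched from the standard mean-square formulas with plug-in parameter values. This is a simplified illustration, not the package's implementation (birt's item_fit() works from posterior draws):

```r
# Outfit/infit mean-squares for one item, given responses y, abilities theta,
# and difficulty beta (plug-in values for illustration).
item_msq <- function(y, theta, beta) {
  p  <- plogis(theta - beta)        # Rasch P(correct) for each student
  w  <- p * (1 - p)                 # variance of each response
  z2 <- (y - p)^2 / w               # squared standardized residuals
  c(outfit = mean(z2),              # unweighted: sensitive to outliers
    infit  = sum(w * z2) / sum(w))  # information-weighted: sensitive near beta
}

set.seed(1)
theta <- rnorm(200)
y <- rbinom(200, 1, plogis(theta - 0.5))  # responses consistent with the model
item_msq(y, theta, beta = 0.5)            # both should come out near 1.0
```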
Person Fit Diagnostics
pfit <- person_fit(fit)
head(pfit)
# Unusual response patterns
pfit[pfit$outfit > 1.3, ]
High person outfit may indicate guessing, careless responding, or cheating.
Plots
Item Characteristic Curves
Each curve shows P(correct) as a function of ability. Harder items are shifted right. Shaded bands show 95% credible intervals.
Wright Map
Places students (histogram) and items (triangles) on the same logit scale. This reveals whether the test is well-targeted:
- Items covering the full student range = good targeting.
- All items to the left of students = test is too easy.
- Gaps with no items = poor measurement in that ability range.
plot(fit, type = "wright")
Test Information Function
Shows where on the ability scale the test is most precise. Higher information means more precise measurement.
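Under the Rasch model, an item's information at ability theta is P(1 - P), and the test information is the sum over items. A small base-R sketch with hypothetical difficulties (not package internals):

```r
# Test information for a Rasch test: sum over items of p * (1 - p).
test_info <- function(theta, beta) {
  sapply(theta, function(th) {
    p <- plogis(th - beta)
    sum(p * (1 - p))
  })
}

beta <- c(-1.5, -0.5, 0, 0.5, 1.5)  # hypothetical item difficulties
theta_grid <- seq(-4, 4, by = 0.1)
info <- test_info(theta_grid, beta)

# Information peaks where the items cluster; here near theta = 0.
theta_grid[which.max(info)]
```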
Trace Plots
Check MCMC convergence. Chains should overlap and look like “fuzzy caterpillars.” If chains are stuck in different places or trending, the model hasn’t converged.
plot(fit, type = "trace")
Parameter Recovery
With simulated data, verify the model recovers the truth:
items_est <- item_params(fit)
# Plot true vs estimated
plot(sim$beta, items_est$mean,
xlab = "True Difficulty",
ylab = "Estimated Difficulty",
pch = 19, main = "Parameter Recovery")
abline(0, 1, col = "red", lty = 2)
# Correlation (should be > 0.95)
cor(sim$beta, items_est$mean)
# Delta recovery
cat("True delta:", sim$delta, "\n")
cat("Estimated delta:", delta_param(fit)$mean, "\n")
Fitting the 2PL Model
Use the 2PL when you suspect items differ in how well they discriminate between high- and low-ability students.
Discrimination Parameters
discrim_params(fit2)
Interpretation:
- a_k ≈ 1: Similar to Rasch — average discrimination.
- a_k > 1: Highly discriminating — steep ICC, strongly separates students.
- a_k < 1: Poorly discriminating — flat ICC. Consider removing the item.
2PL Plots
Notice the different slopes in the ICCs — steeper curves correspond to higher discrimination:
Custom Priors for 2PL
# Class-level: wider discrimination prior for all items
fit2_wide <- twopl_fit(sim$data,
prior_a = c(0, 1),
seed = 123
)
# Per-item: item 5 is known to be poorly discriminating
K <- ncol(sim$data)
a_meanlog <- rep(0, K)
a_meanlog[5] <- -0.5
fit2_item <- twopl_fit(sim$data,
prior_a_meanlog = a_meanlog,
seed = 123
)
fit2_item$priors
Fitting the 3PL Model
Use the 3PL for multiple-choice tests where guessing is plausible. Requires 500+ students for stable estimation.
sim_large <- rasch_simulate(J = 500, K = 10, seed = 42)
fit3 <- threepl_fit(sim_large$data, seed = 123)
summary(fit3)
Guessing Parameters
guessing_params(fit3)
Interpretation:
- c_k ≈ 0.25: Typical for 4-option multiple choice.
- c_k ≈ 0: No guessing (expected for free-response items).
- c_k well above the chance rate: Unusually high — the item may have poor distractors.
3PL with Class-Level Guessing Priors
# 4-option multiple choice
fit3_mc4 <- threepl_fit(sim_large$data,
prior_c = c(5, 15), # Beta(5,15), mean = 0.25
seed = 123
)
# 5-option multiple choice
fit3_mc5 <- threepl_fit(sim_large$data,
prior_c = c(5, 20), # Beta(5,20), mean = 0.20
seed = 123
)
# Free-response items
fit3_fr <- threepl_fit(sim_large$data,
prior_c = c(1, 19), # Beta(1,19), mean = 0.05
seed = 123
)
3PL with Per-Item Guessing Priors
# Mixed test: items 1-5 are 4-option MC, items 6-10 are 5-option MC
K <- 10
c_alpha <- rep(5, K)
c_beta <- c(rep(15, 5), rep(20, 5)) # mean 0.25 vs 0.20
fit3_mixed <- threepl_fit(sim_large$data,
prior_c_alpha = c_alpha,
prior_c_beta = c_beta,
seed = 123
)
fit3_mixed$priors
Working with Real Data
Compare Rasch vs 2PL on Real Data
fit_alg <- rasch_fit(algebra, seed = 123)
fit_alg_2pl <- twopl_fit(algebra, seed = 123)
# Are discriminations similar or very different?
discrim_params(fit_alg_2pl)
# Compare ICCs
plot(fit_alg, type = "icc") # Rasch: parallel curves
plot(fit_alg_2pl, type = "icc") # 2PL: varying slopes
Per-Item Priors on Real Data
# Suppose from pilot testing you know items Alg3 and Alg8 are hard
K <- ncol(algebra)
b_mean <- rep(0, K)
b_mean[3] <- 1.5 # Alg3 is hard
b_mean[8] <- 2.0 # Alg8 is very hard
fit_alg_informed <- rasch_fit(algebra,
prior_beta_mean = b_mean,
seed = 123
)
# Compare default vs informed estimates
items_default <- item_params(fit_alg)
items_informed <- item_params(fit_alg_informed)
cbind(
item = items_default$item,
default = round(items_default$mean, 2),
informed = round(items_informed$mean, 2)
)
Sensitivity Analysis
A sensitivity analysis checks whether your results depend on the prior. Re-fit the model with different priors and compare. If estimates barely change, your conclusions are data-driven.
# Default weakly informative priors
fit_default <- rasch_fit(algebra, seed = 123)
# Tighter priors
fit_tight <- rasch_fit(algebra,
prior_alpha_sd = 0.5,
prior_beta = c(0, 0.5),
seed = 123
)
# Very wide priors
fit_wide <- rasch_fit(algebra,
prior_delta = c(0, 5),
prior_alpha_sd = 3,
prior_beta = c(0, 3),
seed = 123
)
# Compare item difficulty estimates
items_default <- item_params(fit_default)
items_tight <- item_params(fit_tight)
items_wide <- item_params(fit_wide)
# Correlations above 0.99 indicate results are robust to prior choice
cor(items_default$mean, items_tight$mean)
cor(items_default$mean, items_wide$mean)
# Visual comparison
plot(items_default$mean, items_wide$mean,
xlab = "Default Priors", ylab = "Wide Priors",
pch = 19, main = "Prior Sensitivity")
abline(0, 1, col = "red", lty = 2)
Advanced Usage
Controlling MCMC Sampling
# More iterations for better estimates
fit <- rasch_fit(data, iter_sampling = 2000, seed = 123)
# Fix divergent transitions
fit <- rasch_fit(data, adapt_delta = 0.95, seed = 123)
# Fewer chains for speed (not recommended for final analysis)
fit <- rasch_fit(data, chains = 2, parallel_chains = 2, seed = 123)
Choosing a Model
Start with Rasch. Simplest and most interpretable. If item fit statistics are acceptable (outfit/infit between 0.7 and 1.3), stop here.
Try 2PL if Rasch fit is poor or you have theoretical reasons to expect varying discrimination. Compare ICCs — if slopes differ meaningfully, the discrimination parameter is capturing real differences.
Try 3PL only with multiple-choice items, 500+ students, and clear evidence of guessing. If estimated guessing parameters are all near zero, the 2PL is sufficient.
Troubleshooting
Divergent Transitions
The sampler had trouble exploring the posterior. Try increasing adapt_delta:
fit <- rasch_fit(data, adapt_delta = 0.95, seed = 123)
Values between 0.9 and 0.99 usually help.
Low ESS
Not enough effective samples. Try more iterations:
fit <- rasch_fit(data, iter_sampling = 2000, seed = 123)
Slow Compilation
The Stan model compiles to C++ on first use (~30-60 seconds). It is cached afterward, so subsequent calls are fast.
Corrupt Database Error (R 4.5)
If you see “lazy-load database is corrupt”, reinstall with:
remove.packages("birt")
devtools::install(args = "--no-byte-compile")
References
- Luo, Y., & Jiao, H. (2018). Using the Stan program for Bayesian item response theory. Educational and Psychological Measurement, 78(3), 384-408.
- Stan Development Team (2024). Stan User’s Guide, Section 1.11: Item-Response Theory Models.
- Gelman, A., Jakulin, A., Pittau, M. G., & Su, Y. S. (2008). A weakly informative default prior distribution for logistic and other regression models. Annals of Applied Statistics, 2(4), 1360-1383.
- Culpepper, S. A. (2016). Revisiting the 4-parameter item response model: Bayesian estimation and application. Psychometrika, 81(4), 1142-1163.
- Furr, D. C. (2017). edstan: Stan models for item response theory. R package.