Missing Data Analysis

This page provides a guide to understanding missing data mechanisms and selecting the right statistical methods.

Missing Data Mechanisms

Understanding the mechanism behind missing data is crucial for selecting appropriate analytical methods.

MCAR

Missing Completely At Random

Missingness is independent of both observed and unobserved data. This is the easiest mechanism to handle statistically.

Example: In a study measuring blood pressure, some participants miss the follow-up visit because the clinic ran out of appointment slots that day. Their missing measurements are unrelated to their actual blood pressure or any other patient characteristic.

MAR

Missing At Random

Missingness depends on observed data but not on the missing values themselves. This mechanism requires methods that account for the observed predictors of missingness.

Example: In a study measuring cholesterol levels, older patients are less likely to attend follow-up blood tests. The missingness depends on the patients' age, which is observed, but not on the cholesterol values themselves.

MNAR

Missing Not At Random

Missingness depends on the unobserved values themselves, even after accounting for the observed data. This is the most challenging scenario and requires specialized methods.

Example: In a study tracking depression symptoms, patients with the most severe symptoms may be less likely to complete follow-up questionnaires. The missingness depends on symptom severity itself, which is unobserved for those patients, making the analysis more challenging.
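
The three mechanisms can be made concrete with a small simulation. The sketch below is illustrative only: the variables (age, y), the missingness probabilities, and the age cut-off of 50 are assumptions chosen for the demonstration, not values from any study. It deletes values under each mechanism and compares the mean of the remaining values with the full-data mean. It also applies a simple age-group adjustment to show that MAR bias can be corrected with observed data while MNAR bias cannot.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

n = 20_000
# Hypothetical data: age is always observed; y is the measurement that may go missing.
age = [random.gauss(50, 10) for _ in range(n)]
y = [100 + 0.8 * (a - 50) + random.gauss(0, 10) for a in age]

def delete(values, probs):
    """Replace each value with None according to its own missingness probability."""
    return [None if random.random() < p else v for v, p in zip(values, probs)]

def observed_mean(values):
    """Mean of the values that were not deleted."""
    return statistics.mean(v for v in values if v is not None)

def age_adjusted_mean(values):
    """Average the observed values within each age group, then weight each
    group by its full size (known, because age is never missing)."""
    total = 0.0
    for is_older in (True, False):
        group = [v for v, a in zip(values, age) if (a > 50) == is_older]
        total += observed_mean(group) * len(group)
    return total / len(values)

y_mcar = delete(y, [0.3] * n)                             # same chance for everyone
y_mar = delete(y, [0.6 if a > 50 else 0.1 for a in age])  # depends on observed age
y_mnar = delete(y, [0.6 if v > 100 else 0.1 for v in y])  # depends on y itself

full_mean = statistics.mean(y)
print(f"full data             : {full_mean:.2f}")
print(f"MCAR, observed mean   : {observed_mean(y_mcar):.2f}")
print(f"MAR,  observed mean   : {observed_mean(y_mar):.2f}")
print(f"MAR,  age-adjusted    : {age_adjusted_mean(y_mar):.2f}")
print(f"MNAR, observed mean   : {observed_mean(y_mnar):.2f}")
print(f"MNAR, age-adjusted    : {age_adjusted_mean(y_mnar):.2f}")
```

In this setup the MCAR mean stays close to the full-data mean, the unadjusted MAR mean drifts but is recovered by the age adjustment, and the MNAR mean remains off even after adjusting for age.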

Statistical Analysis Methods

Choose the right method based on your missing data mechanism and research context.

Listwise Deletion (Complete Case Analysis)

Listwise deletion, also known as complete case analysis, involves analysing only observations that have complete data on all variables included in the analysis. Any case with one or more missing values is excluded entirely.

Applicability by Missing Data Mechanism

✔ Missing Completely at Random (MCAR)
🟢 Recommended
Produces unbiased estimates, but reduces sample size and statistical power.
⚠ Missing at Random (MAR)
🟡 Use with caution
May introduce bias. Some estimates, such as regression coefficients, can remain valid when missingness depends only on covariates included in the model and not on the outcome.
✖ Missing Not at Random (MNAR)
🔴 Not recommended
Likely to produce biased estimates and should generally be avoided unless missingness is minimal.

Advantages

✔ Simple to implement and available in all statistical software
✔ Preserves observed relationships without introducing imputed values
✔ Appropriate when data are MCAR and sample size is sufficiently large
✔ No additional assumptions beyond the MCAR mechanism

Limitations

⚠ Can substantially reduce sample size, leading to loss of statistical power
⚠ Can produce biased estimates when data are MAR or MNAR
⚠ May result in unrepresentative samples if missingness is systematic
⚠ Inefficient use of available information

Implementation Tips

ℹ Most statistical software applies listwise deletion by default. Analysts should verify whether complete case analysis is being used implicitly. Monitor the number of excluded cases to assess the impact on sample size and potential bias.
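
As a minimal sketch of the monitoring step described above, the following Python snippet performs listwise deletion on a small made-up dataset and reports how many cases were excluded and which variables drive the loss. The records and variable names (age, sbp, chol) are hypothetical, and missing values are represented as None.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"id": 1, "age": 34,   "sbp": 118,  "chol": 5.1},
    {"id": 2, "age": 51,   "sbp": None, "chol": 6.0},
    {"id": 3, "age": 47,   "sbp": 131,  "chol": None},
    {"id": 4, "age": 62,   "sbp": 140,  "chol": 6.4},
    {"id": 5, "age": None, "sbp": 125,  "chol": 5.5},
    {"id": 6, "age": 58,   "sbp": None, "chol": None},
]
analysis_vars = ["age", "sbp", "chol"]

def listwise_delete(rows, variables):
    """Keep only rows that have an observed value on every analysis variable."""
    return [r for r in rows if all(r[v] is not None for v in variables)]

complete = listwise_delete(records, analysis_vars)
n_excluded = len(records) - len(complete)

# Count missing values per variable to see where the loss comes from.
missing_per_var = {v: sum(r[v] is None for r in records) for v in analysis_vars}

print(f"complete cases: {len(complete)} of {len(records)} ({n_excluded} excluded)")
print("missing per variable:", missing_per_var)
```

With pandas, the same row filter is available as DataFrame.dropna(subset=...). Either way, reporting the excluded count alongside the results makes the implicit deletion visible.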

Pairwise Deletion (Available Case Analysis)

Pairwise deletion, also known as available case analysis, uses all available data for each individual analysis. Statistics are computed using all cases that have observed values for the specific pair of variables involved, rather than requiring complete data across all variables.

Applicability by Missing Data Mechanism

✔ Missing Completely at Random (MCAR)
🟢 Acceptable
Retains more data than listwise deletion and produces unbiased estimates.
⚠ Missing at Random (MAR)
🟡 Use with caution
May introduce inconsistencies. Bias depends on the pattern of missingness.
✖ Missing Not at Random (MNAR)
🔴 Not recommended
Likely to produce biased and inconsistent estimates.

Advantages

✔ Uses more available data than listwise deletion
✔ Maintains larger sample sizes for individual analyses
✔ Can be more efficient when data are MCAR
✔ Simple to implement in many statistical software packages

Limitations

⚠ May produce correlation or covariance matrices that are not positive definite
⚠ Different analyses may rely on different subsets of data
⚠ Sample sizes vary across statistics, complicating interpretation
⚠ Can lead to logically inconsistent results
⚠ Standard errors are difficult to estimate correctly

Implementation Tips

ℹ Pairwise deletion must be explicitly specified in many software packages. For example, R's cor() and cov() functions accept use = "pairwise.complete.obs". Use caution when applying this approach in multivariate models, and clearly document which observations contribute to each analysis.
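
The sketch below illustrates two of the limitations listed above with a deliberately extreme, made-up dataset: each pair of variables is observed on a different subset of rows, and the resulting pairwise correlation matrix is not positive semidefinite. The data and helper names are assumptions for illustration.

```python
import math
from itertools import combinations

# Hypothetical data: each pair of variables is observed together on different rows.
data = {
    "x": [1, 2, 3, 4, 1, 2, 3, 4, None, None, None, None],
    "y": [1, 2, 3, 4, None, None, None, None, 1, 2, 3, 4],
    "z": [None, None, None, None, 1, 2, 3, 4, 4, 3, 2, 1],
}

def pearson(xs, ys):
    """Pearson correlation of two equally long sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)

def pairwise_corr(a, b):
    """Correlation using every row where both variables are observed, plus that row count."""
    pairs = [(u, v) for u, v in zip(data[a], data[b]) if u is not None and v is not None]
    xs, ys = zip(*pairs)
    return pearson(xs, ys), len(pairs)

results = {pair: pairwise_corr(*pair) for pair in combinations(data, 2)}
for (a, b), (r, n_pair) in results.items():
    print(f"r({a},{b}) = {r:+.2f} based on n = {n_pair}")

# Determinant of the 3x3 correlation matrix; a negative value means the
# matrix cannot be a valid (positive semidefinite) correlation matrix.
r_xy, r_xz, r_yz = (results[p][0] for p in [("x", "y"), ("x", "z"), ("y", "z")])
det = 1 + 2 * r_xy * r_xz * r_yz - r_xy**2 - r_xz**2 - r_yz**2
print("determinant:", det)

# Listwise deletion on the same data would leave no rows at all.
n_rows = len(data["x"])
n_complete = sum(all(col[i] is not None for col in data.values()) for i in range(n_rows))
print("complete cases:", n_complete)
```

Real data are rarely this extreme, but printing the per-pair sample size next to each statistic, as done here, is a simple way to document which observations contribute to each result.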

Mean or Median Imputation

Mean or median imputation replaces missing values with the mean or median of the observed values for that variable. It is a simple single imputation approach often used for exploratory analyses.

Applicability by Missing Data Mechanism

✔ Missing Completely at Random (MCAR)
🟡 Use with caution
Preserves the variable's mean but understates variability and weakens relationships with other variables.
⚠ Missing at Random (MAR)
🔴 Not recommended
Produces biased estimates and attenuated associations.
✖ Missing Not at Random (MNAR)
🔴 Not recommended
Likely to produce misleading results.

Advantages

✔ Very easy to implement
✔ Retains full sample size
✔ Useful for quick descriptive summaries

Limitations

⚠ Underestimates variability
⚠ Distorts correlations and regression coefficients
⚠ Treats imputed values as if they were observed
⚠ Generally unsuitable for inferential analysis

Implementation Tips

ℹ This method should be limited to exploratory analyses. Avoid using it for hypothesis testing or modelling unless clearly justified.
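
To see why mean imputation is unsuitable for inference, the following sketch simulates two correlated variables, deletes 40% of one of them completely at random, and fills the gaps with the observed mean. All numbers (sample size, correlation, missingness rate) are assumptions chosen for illustration.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

def pearson(xs, ys):
    """Pearson correlation of two equally long sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)

n = 5_000
x = [random.gauss(0, 1) for _ in range(n)]
y = [0.7 * xi + random.gauss(0, 0.7) for xi in x]  # x and y are positively correlated

# MCAR: each y value is missing with probability 0.4, regardless of x or y.
missing = [random.random() < 0.4 for _ in range(n)]
x_obs = [xi for xi, m in zip(x, missing) if not m]
y_obs = [yi for yi, m in zip(y, missing) if not m]

# Mean imputation: every missing y is replaced by the observed mean.
fill = statistics.mean(y_obs)
y_imputed = [fill if m else yi for yi, m in zip(y, missing)]

print(f"mean     observed {fill:.3f}  imputed {statistics.mean(y_imputed):.3f}")
print(f"variance observed {statistics.variance(y_obs):.3f}  "
      f"imputed {statistics.variance(y_imputed):.3f}")
print(f"corr     observed {pearson(x_obs, y_obs):.3f}  "
      f"imputed {pearson(x, y_imputed):.3f}")
```

The mean is unchanged by construction, while the variance and the correlation with x both shrink, because the imputed values add cases without adding any spread.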

Regression Imputation

Regression imputation replaces missing values using predictions from a regression model fitted to observed data. Each missing value is replaced with a single predicted value.

Applicability by Missing Data Mechanism

✔ Missing Completely at Random (MCAR)
🟢 Acceptable
Can recover mean structure but still underestimates uncertainty.
⚠ Missing at Random (MAR)
🟡 Use with caution
Depends strongly on correct model specification.
✖ Missing Not at Random (MNAR)
🔴 Not recommended
Likely to produce biased results.

Advantages

✔ Uses relationships between variables
✔ Retains full sample size
✔ Simple extension of standard regression models

Limitations

⚠ Underestimates variability and standard errors
⚠ Overstates precision of estimates
⚠ Does not reflect uncertainty in the imputed values

Implementation Tips

ℹ Regression imputation should generally be avoided for final inference. If used, clearly acknowledge its limitations and consider stochastic or multiple imputation alternatives.
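
The difference between deterministic and stochastic regression imputation can be sketched as follows. This is a simulation under assumed values (a single predictor x, a linear model, 40% of y missing completely at random), so the full data are known and the imputed variances can be compared against them. In real data that comparison is not available.

```python
import math
import random
import statistics

random.seed(2)  # fixed seed so the sketch is reproducible

n = 10_000
x = [random.gauss(0, 1) for _ in range(n)]
y_full = [1.0 + 1.0 * xi + random.gauss(0, 1.5) for xi in x]
missing = [random.random() < 0.4 for _ in range(n)]  # which y values are deleted

# Fit y ~ x by ordinary least squares using the complete cases only.
xc = [xi for xi, m in zip(x, missing) if not m]
yc = [yi for yi, m in zip(y_full, missing) if not m]
mx, my = statistics.mean(xc), statistics.mean(yc)
slope = sum((a - mx) * (b - my) for a, b in zip(xc, yc)) / sum((a - mx) ** 2 for a in xc)
intercept = my - slope * mx
sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(xc, yc))
residual_sd = math.sqrt(sse / (len(xc) - 2))

# Deterministic imputation: each missing y gets the fitted value on the regression line.
y_det = [(intercept + slope * xi) if m else yi
         for xi, yi, m in zip(x, y_full, missing)]

# Stochastic imputation: fitted value plus a random draw from the residual distribution.
y_sto = [(intercept + slope * xi + random.gauss(0, residual_sd)) if m else yi
         for xi, yi, m in zip(x, y_full, missing)]

print(f"variance of full data         : {statistics.variance(y_full):.3f}")
print(f"variance, deterministic imput.: {statistics.variance(y_det):.3f}")
print(f"variance, stochastic imput.   : {statistics.variance(y_sto):.3f}")
```

Adding residual noise restores the spread of the variable, but a single stochastic imputation still treats the filled-in values as known, which is the gap that multiple imputation addresses.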

Multiple Imputation (MI)

Multiple imputation replaces each missing value with several plausible values drawn from a predictive distribution. Analyses are performed on each imputed dataset and results are combined to reflect uncertainty due to missing data.

Applicability by Missing Data Mechanism

✔ Missing Completely at Random (MCAR)
🟢 Recommended
Produces unbiased estimates and valid inference.
✔ Missing at Random (MAR)
🟢 Recommended
Widely accepted as the preferred approach.
⚠ Missing Not at Random (MNAR)
🟡 Use with caution
Requires additional assumptions or sensitivity analyses.

Advantages

✔ Accounts for uncertainty in missing values
✔ Produces valid standard errors and confidence intervals
✔ Flexible and widely supported
✔ Suitable for complex models

Limitations

⚠ Requires careful model specification
⚠ Computationally more intensive
⚠ Can be challenging to implement correctly

Implementation Tips

ℹ Include all variables related to missingness and the analysis model in the imputation process. Always check convergence and perform diagnostics.
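
The step of combining results across imputed datasets is usually done with Rubin's rules. The sketch below implements the pooling for a single scalar parameter. The function name pool_rubin and the example numbers are hypothetical, and the imputation and per-dataset analyses are assumed to have been carried out already.

```python
import math
import statistics

def pool_rubin(estimates, variances):
    """Pool one parameter across m imputed datasets using Rubin's rules.

    estimates: point estimate from each imputed dataset
    variances: squared standard error from each imputed dataset
    Returns (pooled estimate, total variance, degrees of freedom).
    """
    m = len(estimates)
    q_bar = statistics.mean(estimates)        # pooled point estimate
    within = statistics.mean(variances)       # average within-imputation variance
    between = statistics.variance(estimates)  # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between    # total variance
    if between == 0:
        df = math.inf                         # no between-imputation variability
    else:
        df = (m - 1) * (1 + within / ((1 + 1 / m) * between)) ** 2
    return q_bar, total, df

# Hypothetical regression coefficients and squared standard errors from m = 5 imputations.
estimates = [0.52, 0.48, 0.55, 0.50, 0.47]
variances = [0.010, 0.011, 0.009, 0.010, 0.012]

q_bar, total, df = pool_rubin(estimates, variances)
print(f"pooled estimate = {q_bar:.3f}, SE = {math.sqrt(total):.3f}, df = {df:.1f}")
```

The between-imputation term is what carries the extra uncertainty due to missing data into the standard error. Established packages perform this pooling automatically, so a hand-written version like this is mainly useful for checking or for statistics a package does not pool.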

Maximum Likelihood (ML) and Full Information Maximum Likelihood (FIML)

Maximum likelihood based approaches estimate model parameters directly using all available data without explicitly imputing missing values. FIML is commonly used in structural equation and longitudinal models.

Applicability by Missing Data Mechanism

✔ Missing Completely at Random (MCAR)
🟢 Recommended
Produces unbiased and efficient estimates.
✔ Missing at Random (MAR)
🟢 Recommended
Performs well when model assumptions are met.
⚠ Missing Not at Random (MNAR)
🟡 Use with caution
Requires explicit modelling of missingness.

Advantages

✔ Uses all available information
✔ No need to create imputed datasets
✔ Statistically efficient

Limitations

⚠ Relies on correct model specification
⚠ Less flexible than multiple imputation in some settings
⚠ Not available for all types of analyses

Implementation Tips

ℹ Ensure the analysis model is correctly specified and assess model fit carefully. Report assumptions clearly.
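
The core idea of FIML is that each case contributes to the likelihood through whichever variables it has observed. The sketch below shows this for a bivariate normal model with hypothetical data. It only evaluates the observed-data log-likelihood at given parameter values; a real analysis would pass such a function to a numerical optimizer or rely on software that does so internally.

```python
import math

def normal_logpdf(v, mean, var):
    """Log density of a univariate normal distribution."""
    return -0.5 * (math.log(2 * math.pi * var) + (v - mean) ** 2 / var)

def bivariate_logpdf(x, y, mx, my, vx, vy, cxy):
    """Log density of a bivariate normal with means (mx, my), variances (vx, vy), covariance cxy."""
    det = vx * vy - cxy ** 2
    dx, dy = x - mx, y - my
    quad = (vy * dx * dx - 2 * cxy * dx * dy + vx * dy * dy) / det
    return -0.5 * (2 * math.log(2 * math.pi) + math.log(det) + quad)

def fiml_loglik(rows, mx, my, vx, vy, cxy):
    """Observed-data log-likelihood: each row uses only the variables it has."""
    total = 0.0
    for x, y in rows:
        if x is not None and y is not None:
            total += bivariate_logpdf(x, y, mx, my, vx, vy, cxy)
        elif x is not None:
            total += normal_logpdf(x, mx, vx)  # y missing: marginal density of x
        elif y is not None:
            total += normal_logpdf(y, my, vy)  # x missing: marginal density of y
        # rows with neither value observed carry no information
    return total

# Hypothetical data; None marks a missing value.
rows = [(1.0, 2.1), (2.0, 3.9), (3.0, None), (4.0, 8.2), (None, 5.0), (5.0, 9.8)]

plausible = dict(mx=3.0, my=6.0, vx=2.0, vy=8.0, cxy=3.8)
implausible = dict(mx=0.0, my=0.0, vx=1.0, vy=1.0, cxy=0.0)
print("log-likelihood, plausible parameters  :", fiml_loglik(rows, **plausible))
print("log-likelihood, implausible parameters:", fiml_loglik(rows, **implausible))
```

No row is discarded and no value is filled in: the incomplete rows still inform the means and variances of the variables they do contain.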

Sensitivity Analysis

Sensitivity analysis evaluates how results change under different assumptions about the missing data mechanism. It is particularly important when MNAR cannot be ruled out.

Applicability by Missing Data Mechanism

⚠ Missing Completely at Random (MCAR)
🟡 Optional
Not typically necessary under MCAR, but can provide additional confidence.
✔ Missing at Random (MAR)
🟢 Recommended
Good practice to verify that conclusions are robust to departures from MAR.
⬛ Missing Not at Random (MNAR)
🔵 Essential
Critical when MNAR is suspected. Helps quantify the impact of untestable assumptions.

Advantages

✔ Makes assumptions explicit
✔ Improves transparency and credibility
✔ Helps decision makers interpret uncertainty

Limitations

⚠ Does not identify the true missing data mechanism
⚠ Results depend on chosen scenarios

Implementation Tips

ℹ Predefine sensitivity scenarios where possible and report results alongside primary analyses rather than as an afterthought.
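
One common way to set up predefined scenarios is delta adjustment: values imputed under MAR are shifted by a fixed amount to represent a plausible departure toward MNAR, and the analysis is repeated for each shift. The sketch below is a minimal version for a mean. The scores, the imputed values, and the grid of deltas are all hypothetical.

```python
import statistics

def delta_adjusted_mean(observed, imputed, delta):
    """Mean of the outcome after shifting every imputed value by delta."""
    shifted = [v + delta for v in imputed]
    return statistics.mean(observed + shifted)

# Hypothetical symptom scores: observed values, and values imputed under MAR
# for participants who did not complete follow-up.
observed = [10.0, 12.0, 14.0, 16.0]
imputed_under_mar = [13.0, 13.0]

# Predefined scenarios: non-responders score 0, 2, 4 or 6 points higher
# than the MAR-based imputation suggests.
for delta in [0.0, 2.0, 4.0, 6.0]:
    estimate = delta_adjusted_mean(observed, imputed_under_mar, delta)
    print(f"delta = {delta:>3}: estimated mean = {estimate:.2f}")
```

Reporting the estimate across the whole grid, and noting the delta at which the study conclusion would change, lets readers judge how much the result depends on the untestable MAR assumption.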

Method	MCAR	MAR	MNAR	Key Notes
Listwise Deletion	🟢 Recommended	🟡 Caution	🔴 Not recommended	Simple but can greatly reduce sample size
Pairwise Deletion	🟢 Acceptable	🟡 Caution	🔴 Not recommended	Uses more data but may give inconsistent results
Mean or Median Imputation	🟡 Caution	🔴 Not recommended	🔴 Not recommended	Distorts variability and relationships
Regression Imputation	🟢 Acceptable	🟡 Caution	🔴 Not recommended	Underestimates uncertainty
Multiple Imputation	🟢 Recommended	🟢 Recommended	🟡 Caution	Preferred general purpose approach
Maximum Likelihood / FIML	🟢 Recommended	🟢 Recommended	🟡 Caution	Efficient if model is correctly specified
Sensitivity Analysis	🟡 Optional	🟢 Recommended	🔵 Essential	Assesses robustness to assumptions; critical under MNAR