
Review Process

Producing Study Ratings

Reviewing Eligible Studies

Trained reviewers evaluate randomized controlled trials (RCTs) and quasi-experimental designs (QEDs) identified for each prioritized model, assessing the research design and methodology of each study using a standard review protocol. To ensure the accuracy of reviews, each study is reviewed by two members of the review team. The first reviewer evaluates the study, assigns a study rating (see Study Ratings tab), and records the review results. A second reviewer examines the study and the results of the first review. If the second reviewer disagrees with any of the first reviewer’s decisions, the two reviewers discuss these differences to reach a consensus rating. The principal investigator or senior reviewer confirms all consensus rating decisions.

Some studies are missing information needed to determine the study rating, such as attrition rates or the baseline equivalence of the treatment and comparison groups. In these cases, the HomVEE review team sends queries to the authors to request the missing information. Authors are given one week to respond, although the team incorporates information sent after the deadline whenever feasible. If the authors do not respond or do not provide the necessary information, a HomVEE reviewer assigns a rating based on the available information.

If a study has a design for which HomVEE does not have existing standards, it will not be reviewed, nor will it contribute to the program model’s evidence base.


Study Ratings

We review and rate studies identified in the literature search that examine the impact of a home visiting model using quantitative data and statistical analyses. The study-level ratings (high, moderate, or low) measure how well the study design could produce unbiased estimates of model impacts. In brief, the high rating is reserved for random assignment studies with low attrition of sample members and no reassignment of sample members after the original random assignment, and for single case and regression discontinuity designs that meet the What Works Clearinghouse (WWC) design standards (Table 1). (The WWC, established by the Institute of Education Sciences in the U.S. Department of Education, reviews education research.) The moderate rating applies to random assignment studies that, because of flaws in the study design or analysis (for example, high sample attrition), do not meet all the criteria for the high rating; to matched comparison group designs; and to single case and regression discontinuity designs that meet WWC design standards with reservations. Studies that do not meet all of the criteria for either the high or moderate rating are assigned the low rating.

Table 1: Summary of Study Rating Criteria for the HomVEE Review

Criteria are listed by rating and by design type: randomized controlled trials and three quasi-experimental designs (matched comparison group, single case^b, and regression discontinuity^b).

High rating

Randomized controlled trials:
  • Random assignment
  • Meets WWC standards for acceptable rates of overall and differential attrition^a
  • No reassignment; analysis must be based on original assignment to study arms
  • No confounding factors; must have at least 2 participants in each study arm and no systematic differences in data collection methods
  • Baseline equivalence established on tested outcomes and demographic characteristics OR controls for these measures

Matched comparison group: Not applicable (these designs cannot receive a high rating).

Single case^b:
  • Timing of intervention is systematically manipulated
  • Outcomes meet WWC standards for interassessor agreement
  • At least three attempts to demonstrate an effect
  • At least five data points in relevant phases

Regression discontinuity^b:
  • Integrity of forcing variable is maintained
  • Meets WWC standards for low overall and differential attrition
  • The relationship between the outcome and the forcing variable is continuous
  • Meets WWC standards for functional form and bandwidth

Moderate rating

Randomized controlled trials:
  • Reassignment OR unacceptable rates of overall or differential attrition^a
  • Baseline equivalence established on tested outcomes and demographic characteristics AND controls for baseline measures of tested outcomes, if applicable^c
  • No confounding factors; must have at least 2 participants in each study arm and no systematic differences in data collection methods

Matched comparison group:
  • Baseline equivalence established on tested outcomes and demographic characteristics AND controls for baseline measures of tested outcomes, if applicable^c
  • No confounding factors; must have at least 2 participants in each study arm and no systematic differences in data collection methods

Single case^b:
  • Timing of intervention is systematically manipulated
  • Outcomes meet WWC standards for interassessor agreement
  • At least three attempts to demonstrate an effect
  • At least three data points in relevant phases

Regression discontinuity^b:
  • Integrity of forcing variable is maintained
  • Meets WWC standards for low attrition
  • Meets WWC standards for functional form and bandwidth

Low rating

All designs: Studies that do not meet the requirements for a high or moderate rating.

NOTE: “Or” implies that one of the criteria must be present to result in the specified rating.

^a The What Works Clearinghouse (WWC), established by the Institute of Education Sciences in the U.S. Department of Education, reviews education research (http://ies.ed.gov/ncee/wwc/). The WWC standard for attrition is transparent and statistically based, taking into account both overall attrition (the percentage of study participants lost from the total study sample) and differential attrition (the difference in attrition rates between the treatment and control groups).

^b For ease of presentation, some of the criteria are described very broadly. Additional details are available for single case design standards in Appendix F of the WWC version 2.1 standards (http://ies.ed.gov/ncee/wwc/Docs/referenceresources/wwc_procedures_v2_1_standards_handbook.pdf) and in a specific document about regression discontinuity designs (http://ies.ed.gov/ncee/wwc/Document/258).

^c The variables that must be used to establish equivalence depend on whether (1) it is possible to collect the measure at baseline or (2) it is difficult or impossible to collect the measure at baseline. See the section below on baseline equivalence for more details.
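
Read as a decision rule, the RCT column of Table 1 reduces to a short branch on confounding, attrition, reassignment, and baseline equivalence. The sketch below is only an illustration of that logic under simplified yes/no inputs; the function and field names are our own, and an actual HomVEE review involves reviewer judgment that no lookup can replace.

```python
# Illustrative decision rule for the RCT column of Table 1.
# Inputs are simplified yes/no judgments; real reviews are more nuanced.

def rate_rct(low_attrition: bool, reassigned: bool,
             confounded: bool, equivalence_shown: bool,
             controls_for_baseline: bool) -> str:
    """Return the HomVEE rating an RCT would receive under this sketch."""
    if confounded:
        return "low"  # a confounding factor caps any design at low
    if low_attrition and not reassigned and (equivalence_shown or controls_for_baseline):
        return "high"  # random assignment, low attrition, original assignment kept
    if equivalence_shown and controls_for_baseline:
        return "moderate"  # high attrition or reassignment, but equivalence holds
    return "low"

# An RCT with high attrition that establishes equivalence and controls for
# baseline outcomes tops out at moderate:
print(rate_rct(low_attrition=False, reassigned=False, confounded=False,
               equivalence_shown=True, controls_for_baseline=True))  # moderate
```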

Study Design

Below we provide more details about the study rating criteria for randomized controlled trials and matched comparison group quasi-experimental designs (QEDs) in the following categories: (1) study design, (2) attrition, (3) baseline equivalence, (4) reassignment, and (5) confounding factors. (For details about design standards for single case designs, please refer to the WWC Single-Case Study Design Technical Document in Appendix F of the WWC standards; more information about study design standards for regression discontinuity designs can be found in the WWC Standards for Regression Discontinuity Designs.) Authors may also wish to consult the HomVEE reporting guide for study authors, or the flowcharts that illustrate standards for randomized controlled trials and for matched comparison group design studies.

Studies that use random assignment create two or more groups that are, on average, similar to each other at the start of the study. These studies provide strong evidence that differences in outcomes between the treatment and control groups at the end of the study can be attributed to the intervention rather than to pre-existing differences between the groups (Shadish et al. 2002). Therefore, studies that randomly assign subjects can receive a high rating. (Designs based on functionally random assignment, that is, assignment to groups in ways that ensure any differences between and within groups are not systematic at the outset of the experiment, such as alternation based on last name, date of birth, or certain digits of an identification number, are also eligible for the high rating.)

Matched comparison designs with an external comparison group can achieve at best a moderate rating. In such studies, subjects are sorted into the study arms through a process other than random assignment; therefore, even if the treatment and comparison groups are well matched on observed characteristics, they may still differ on unmeasured characteristics, and it is impossible to rule out the possibility that the findings are attributable to those unmeasured differences. The moderate rating is also possible for random assignment designs that do not meet other criteria for the high rating (that is, the criteria related to attrition and reassignment), as explained in more detail below.

Designs without a comparison group (for example, pre-post designs) offer no way to assess what the sample’s outcomes would have been in the absence of the intervention. These designs cannot rule out that changes were caused by, for example, history (an event besides the treatment that could have produced the observed outcome) or maturation (participants’ natural changes over time that could have produced the outcome) (Shadish et al. 2002). Therefore, studies with these designs cannot meet the criteria for either the high or moderate rating.

Attrition

In random assignment studies, a loss of study participants can bias the impact estimates by creating differences in the characteristics of the treatment and control groups. If the people who remain in the treatment and comparison groups were initially different from one another, posttest outcomes could differ even in the absence of treatment (Shadish et al. 2002). The HomVEE review uses the WWC standard for attrition, which is transparent and statistically based, taking into account both overall attrition (the percentage of study participants lost from the total study sample) and differential attrition (the difference in attrition rates between the treatment and control groups). The standard recognizes an important trade-off between the two: for a given expected level of bias, studies with relatively low overall attrition can tolerate relatively high differential attrition, whereas studies with relatively high overall attrition require lower differential attrition (WWC Procedures and Standards Handbook v2.1, Assessing Attrition Bias). The WWC attrition standard classifies studies as having either “high” or “low” attrition based on the combination of overall and differential attrition (see Figure 1).

FIGURE 1. CUTOFFS FOR WWC ATTRITION STANDARDS
Tradeoffs Between Overall and Differential Attrition: Both overall and differential attrition contribute to the potential bias of the estimated effect. The WWC has developed a model of attrition bias to calculate the potential bias under assumptions about the relationship between response and the outcome of interest.

Note: The red area indicates combinations of overall and differential attrition that produce a rating of “high” attrition. The green area indicates combinations that produce a rating of “low” attrition.

Random assignment studies that meet the standard for low attrition are considered for the high study rating. Random assignment studies with high attrition are considered for the moderate study rating and must meet the other criteria for this rating.
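
To make the shape of that trade-off concrete, the sketch below applies a boundary of this general kind to a study’s attrition figures. The cutoff values here are placeholders invented for illustration; the actual boundary comes from the WWC attrition model and is shown in the WWC handbook, not from this table.

```python
# Minimal sketch of applying an attrition boundary of this shape.
# The cutoff values below are PLACEHOLDERS for illustration only; the real
# boundary is given by the WWC attrition model (see the WWC handbook).

ILLUSTRATIVE_BOUNDARY = {  # overall attrition (%) -> max tolerable differential (%)
    0: 11, 10: 10, 20: 9, 30: 7, 40: 5, 50: 3, 60: 1,
}

def attrition_is_low(overall_pct: float, differential_pct: float) -> bool:
    """Classify a study as 'low' attrition under the illustrative boundary."""
    # Use the nearest tabulated overall-attrition level at or above the study's,
    # so higher overall attrition maps to a stricter differential cutoff.
    for level in sorted(ILLUSTRATIVE_BOUNDARY):
        if overall_pct <= level:
            return differential_pct <= ILLUSTRATIVE_BOUNDARY[level]
    return False  # overall attrition beyond the tabulated range is 'high'

print(attrition_is_low(overall_pct=15, differential_pct=6))  # True under these placeholders
```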

For clustered random assignment designs, in which a cluster or group, such as a neighborhood, is assigned to the treatment or comparison condition, attrition is assessed at two levels: the cluster level and the individual level. Attrition must be low at both levels for a study to receive a high rating. If attrition is high at either or both levels, then baseline equivalence must be established, and the highest possible rating is moderate. These are the same attrition standards used by the WWC.

ATTRITION STANDARDS FOR CLUSTER RANDOMIZED TRIALS

Cluster-level attrition | Individual-level attrition | Highest possible HomVEE study rating
High | Low | Moderate, with evidence of baseline equivalence and controls for baseline measures of outcomes
High | High | Moderate, with evidence of baseline equivalence and controls for baseline measures of outcomes
Low | Low | High, with evidence of baseline equivalence or controls for baseline measures of outcomes as well as race/ethnicity and SES
Low | High | Moderate, with evidence of baseline equivalence and controls for baseline measures of outcomes
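
Because the table above is a pure lookup on the two attrition levels, it can be expressed directly in code. This is a minimal sketch with our own naming, reproducing the table verbatim:

```python
# The cluster-RCT attrition table as a lookup; keys are
# (cluster-level attrition, individual-level attrition).

HIGHEST_POSSIBLE_RATING = {
    ("high", "low"):  "moderate, with baseline equivalence and controls",
    ("high", "high"): "moderate, with baseline equivalence and controls",
    ("low", "low"):   "high, with baseline equivalence or controls",
    ("low", "high"):  "moderate, with baseline equivalence and controls",
}

print(HIGHEST_POSSIBLE_RATING[("low", "high")])
```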

The attrition standards do not apply to quasi-experimental comparison group studies. These studies are evaluated on the basis of the final analysis sample, from which there is no attrition (by definition).

Baseline Equivalence

A comparison group is intended to represent what would have happened to the treatment group in the absence of the treatment. To provide the strongest evidence of this counterfactual, the treatment and comparison groups should be as similar as possible at the study’s onset (that is, at baseline). When the treatment and comparison groups are dissimilar, the results cannot support causal conclusions about the differential effect of the treatment (Rubin 1997).

To consider studies using matched comparison group designs and RCTs with high attrition for a moderate rating, the HomVEE review requires that:

  1. The study establish baseline equivalence on (1) race and ethnicity, (2) socioeconomic status (SES), and (3) baseline measures of outcomes (when feasible); AND
  2. Baseline measures of outcomes also be used as controls in the impacts analysis.

(RCTs with low attrition may be considered for a high rating if they meet either of these criteria; Table 1 describes the other criteria for the high and moderate ratings.) Equivalence between the treatment and comparison groups must be established at baseline, that is, before the intervention being studied is provided to the program group. Establishing baseline equivalence supports the conclusion that the treatment, rather than pre-existing differences, led to any observed difference in outcomes (Shadish et al. 2002). Equivalence is established if there are no statistically significant differences (α = 0.05) on the specified variables at baseline, and it must be established on the sample used in the analysis. If baseline equivalence is established only for a subgroup, then for those results to count toward the determination of whether a program model meets the DHHS criteria, they must be replicated in the same domain in two or more studies with a high or moderate rating that use non-overlapping analytic samples (see DHHS Criteria for Evidence-Based Program Models).
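
As an illustration of the equivalence check at α = 0.05, the sketch below runs a two-sample t-test on hypothetical baseline scores. The data are invented, and the appropriate test in practice depends on the type of measure (for example, a chi-squared test would suit a categorical variable such as race/ethnicity).

```python
# Hypothetical check of baseline equivalence at alpha = 0.05 using a
# two-sample t-test; the arrays below are made-up baseline scores.
from scipy import stats

treatment_baseline = [98, 102, 95, 110, 101, 99, 104, 97]
comparison_baseline = [100, 96, 103, 108, 99, 101, 95, 105]

t_stat, p_value = stats.ttest_ind(treatment_baseline, comparison_baseline)
equivalent = p_value >= 0.05  # no significant difference -> equivalence holds
print(f"t = {t_stat:.2f}, p = {p_value:.3f}, baseline equivalence: {equivalent}")
```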

We require baseline equivalence on these demographic characteristics because they may be related to the outcome domains that are the focus of the HomVEE review. Research links SES to outcomes such as child health and child cognitive and social-emotional development (Bradley and Corwyn 2002). Similarly, outcomes may vary by participants’ race/ethnicity; for example, research shows that birth outcomes differ significantly across race/ethnicity groups (MacDorman 2011).

SES can be measured in multiple ways, but for this review we prefer to see equivalence on specific measures of economic well-being (income, earnings, or poverty levels according to federal thresholds) because of the body of research showing their association with child well-being, such as cognitive ability and achievement (for example, Duncan and Brooks-Gunn 1997; Fagan and Lee 2012). We also accept means-tested assistance (such as AFDC/TANF or food stamps/SNAP receipt), maternal education, and employment of at least one member of the household, provided at least two such alternative measures of SES are reported, because they are closely tied to the HomVEE-preferred measures of SES (income, earnings, poverty level). These measures are commonly used indicators of SES and are relevant to the population targeted by home visiting programs. For studies conducted outside the United States, other measures of economic well-being will be considered.

The HomVEE review also requires that all RCTs with high attrition and all matched comparison group design studies establish baseline equivalence on the same outcomes used to examine impacts at follow-up, whenever it is possible to collect baseline data on those measures. In some situations, however, researchers cannot examine the same variables at baseline and follow-up. For example, if an outcome of interest is children’s cognitive development, a study cannot collect baseline cognitive skills when program services start prenatally. Therefore, the variables that must be used to establish equivalence depend on whether (1) it is possible to collect the measure at baseline or (2) it is difficult or impossible to do so. We present our criteria for these two scenarios below.

  • Measures identical or sufficiently similar to the outcomes of interest can be collected at baseline. When possible, baseline equivalence should be established on the outcomes of interest, and the baseline measure of each outcome should be used as a control in the analysis of impacts. Controlling for the baseline outcome ensures that any marginal differences do not bias the impact estimates at follow-up. In addition, as described above, we also require that studies establish equivalence on (1) race/ethnicity and (2) socioeconomic status (SES).
  • Measures of the outcomes of interest cannot be collected at baseline. For some outcomes, it is not feasible to collect baseline measures, for example, children’s cognitive and behavioral outcomes when the baseline is conducted prenatally, or parenting outcomes when parents enroll in the study before their child is born. For these studies, baseline equivalence must be established on two demographic factors, as described above: (1) the parent’s or child’s race/ethnicity and (2) SES.

In addition to these requirements, project leadership has the discretion to determine other cases where baseline equivalence is insufficiently demonstrated. For example, some measures that combine a wide range of responses (such as all non-White persons) may be inappropriate. In addition, a study may present comparisons for other factors at baseline, such as family structure or maternal behaviors, which are not required to establish baseline equivalence for the purposes of the HomVEE review. If a study shows statistically significant differences on these variables, it may be downgraded (that is, no longer eligible to receive the highest rating for its design). The decision to downgrade depends on the magnitude of these differences and the variables under consideration. Project leadership makes the decision in these cases, and the rationale is thoroughly documented.

Although random assignment is expected to produce groups that are equivalent, on average, on measured and unmeasured characteristics, studies with this design that otherwise meet the criteria for the high rating occasionally show statistically significant differences on selected variables (that is, race and ethnicity, SES, or an outcome measured at baseline). If such a study shows that the treatment and comparison groups are not equivalent, or if it cannot be determined that the groups are similar on those factors, the variable(s) must be used as controls in the analysis of effects. Random assignment studies that do not control for statistically significant baseline differences can receive at most a moderate rating.

Reassignment

In random assignment studies, deviation from the original random assignment can also bias the impact estimates. For example, consider a study in which a program administrator reassigned families she felt could greatly benefit from the intervention from the comparison group to the intervention group. Such nonrandom selection could bias the treatment effect estimates or compromise baseline equivalence (Gartin 1995). Therefore, for an RCT to meet our criteria for the high rating, the analysis must be performed on the sample as originally assigned. Subjects may not be reassigned for reasons such as contamination, noncompliance, or level of exposure. RCTs that alter the original random assignment but otherwise meet the criteria for the high rating are considered for a moderate study rating, provided they meet the other criteria for that rating. Our criteria are similar to those developed by the WWC, which allows a study to be downgraded as a result of reassignment.

Confounding Factors

In certain cases, a component of the research design or methods lines up exactly with the intervention being tested, making it impossible to attribute an observed effect solely to the intervention. For example, if there is only one subject or group in the treatment or control condition, there is no way to distinguish the effects of the program model from the influence of the characteristics of that one subject or group. This would occur if one home visitor were assigned to all of the families in one of the study conditions. In this case, the effect of the particular home visitor could not be separated from the treatment effect. A confounding factor could also arise from systematic differences in the way data are collected from the treatment and comparison groups—for example, if program staff collected data from all subjects in the treatment group but an independent group of staff collected data from the control group. Because the effect of the confounding factor cannot be separated from the effect of the intervention, the study findings cannot be attributed to the intervention alone (Leon 1993).

Given the severe effect that such confounding factors can have on the quality of a study, studies receive a low rating when there is either (1) only one subject or group in the treatment or control condition or (2) a systematic difference in data collection procedures between the treatment and control groups. If during the review process we find other confounding factors that line up exactly with the intervention, project leadership will decide whether they affect the study rating.

Clustering

If the unit of assignment differs from the unit of analysis, the analysis must account for this clustering. If a correction is not made, the statistical significance of the findings may be overstated; that is, a finding classified as statistically significant may no longer be significant once properly adjusted. If the authors do not correct for clustering at the unit of assignment, HomVEE will make an adjustment when sufficient information is available. The default intraclass correlation used for these corrections is 0.10, based on a summary of behavioral and attitudinal outcomes (WWC Procedures and Standards Handbook v2.1, Appendix C, Clustering Correction of the Statistical Significance of Effects Estimated with Mismatched Analyses). If HomVEE does not have enough information to make the correction, the uncorrected outcomes are excluded from the review.
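
A correction of this general type shrinks the t-statistic, and reduces its degrees of freedom, as the intraclass correlation and average cluster size grow. The sketch below is our paraphrase of a correction of that form; the authoritative formulas and procedure are in the cited WWC appendix, and the function name and example values here are our own.

```python
# Sketch of a clustering correction of the general type the WWC handbook
# describes; consult Appendix C of the handbook for the exact procedure.
import math
from scipy import stats

def cluster_corrected_p(t: float, N: int, M: int, icc: float = 0.10) -> float:
    """Adjust a mismatched-analysis t-statistic for clustering.

    t   -- t-statistic from the individual-level (unadjusted) analysis
    N   -- total number of individuals
    M   -- number of clusters (units of assignment)
    icc -- intraclass correlation (the HomVEE default is 0.10)
    """
    n = N / M  # average cluster size
    # Shrink the t-statistic: the shrinkage factor is below 1 whenever icc > 0.
    c = ((N - 2) - 2 * (n - 1) * icc) / ((N - 2) * (1 + (n - 1) * icc))
    t_adj = t * math.sqrt(c)
    # Clustering also reduces the effective degrees of freedom.
    h = ((N - 2) - 2 * (n - 1) * icc) ** 2 / (
        (N - 2) * (1 - icc) ** 2
        + n * (N - 2 * n) * icc ** 2
        + 2 * (N - 2 * n) * icc * (1 - icc)
    )
    return 2 * stats.t.sf(abs(t_adj), df=h)  # two-sided p-value

# Example: t = 2.1 looks significant, but with 20 clusters of about 15 people
# each, the corrected p-value is well above 0.05.
print(round(cluster_corrected_p(t=2.1, N=300, M=20), 3))
```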


Describing Effects

Four categories are used to describe the current research findings within a HomVEE outcome domain for a selected home visiting program model. The categories take into account all studies of a program model that meet the HomVEE standards for a high or moderate rating. The four categories, illustrated in a short sketch after the list, are as follows:

  • Favorable. A statistically significant impact on an outcome measure in a direction that is beneficial for children and parents. The estimate itself may be positive or negative in sign; it is classified as “favorable” based on the end result. For example, a favorable impact could be an increase in children’s vocabulary or in parents’ daily reading to children, or a reduction in harsh parenting practices or maternal depression. These results are represented in the tables throughout the site in green font.
  • No effect. Findings for a program model that are not statistically significant. These results are represented in the tables throughout the site in black font.
  • Unfavorable or ambiguous. A statistically significant impact on an outcome measure in a direction that may indicate potential harm to children and/or parents. The estimate itself may be positive or negative in sign; it is classified as “unfavorable or ambiguous” based on the end result. Some outcomes are clearly unfavorable, while for others the desirable direction is less clear. For example, an increase in children’s behavior problems is clearly unfavorable, while an increase in the number of days mothers are hospitalized is more ambiguous: it may indicate that mothers have more health problems, but it could also indicate that mothers have increased access to needed health care due to their participation in a home visiting program. These results are represented in the tables throughout the site in red font.
  • Not measured. Current research (meeting HomVEE standards for a high or moderate rating) does not include any measures in this domain.
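
Applied mechanically, the first three categories depend only on statistical significance and the direction of benefit; “not measured” describes coverage rather than a finding, so it is omitted from this minimal sketch. The function and parameter names are our own.

```python
# Minimal sketch of the three finding categories; names are our own.

def describe_effect(p_value: float, estimate: float,
                    benefit_direction: int) -> str:
    """benefit_direction: +1 if an increase is beneficial, -1 if a decrease is."""
    if p_value >= 0.05:
        return "no effect"  # not statistically significant
    beneficial = estimate * benefit_direction > 0
    return "favorable" if beneficial else "unfavorable or ambiguous"

# A significant reduction in harsh parenting (a decrease is beneficial):
print(describe_effect(p_value=0.01, estimate=-0.3, benefit_direction=-1))  # favorable
```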

References

Bradley, R. H., and R. F. Corwyn. “Socioeconomic Status and Child Development.” Annual Review of Psychology, vol. 53, no. 1, 2002, pp. 371-399.

Duncan, G. J., and J. Brooks-Gunn (eds.). Consequences of Growing Up Poor. New York: Russell Sage Foundation, 1997.

Fagan, J., and Y. Lee. “Effects of Fathers’ and Mothers’ Cognitive Stimulation and Household Income on Toddlers’ Cognition: Variations by Family Structure and Child Risk.” Fathering, vol. 10, 2012, pp. 14-158.

Gartin, P. R. “Dealing with Design Failures in Randomized Field Experiments: Analytic Issues Regarding the Evaluation of Treatment Effects.” Journal of Research in Crime and Delinquency, vol. 32, no. 4, 1995, p. 425.

Leon, D. A. “Failed or Misleading Adjustment for Confounding.” The Lancet, vol. 342, no. 8869, 1993, pp. 479-481.

MacDorman, M. F. “Race and Ethnic Disparities in Fetal Mortality, Preterm Birth, and Infant Mortality in the United States: An Overview.” Seminars in Perinatology, vol. 34, no. 4, August 2011, pp. 200-208.

Rubin, D. B. “Estimating Causal Effects from Large Data Sets Using Propensity Scores.” Annals of Internal Medicine, vol. 127, no. 8, part 2, 1997, pp. 757-763.

Shadish, W. R., T. D. Cook, and D. T. Campbell. Experimental and Quasi-Experimental Designs for Generalized Causal Inference. New York: Houghton Mifflin Company, 2002.
