|
The Diogenes Effect:

Rebecca C. Greenberg and Herbert J. Walberg

A Greek, Diogenes died around 320 BC, probably at Corinth. He stressed stoic self-sufficiency, the rejection of luxury, and rigorous morals. Tradition attributes to him the search for an honest person conducted in broad daylight with a lighted lantern.

Adapted from Encyclopaedia Britannica, CD-ROM, 1998.

Widely circulated practitioner periodicals such as Education Week, Phi Delta Kappan, and Educational Leadership publish article after article describing programs that raise student achievement, particularly that of poor and minority children. Articles in revered journals such as the Educational Researcher and American Educational Research Journal also describe programs that apparently raise students' test scores. Yet analyses of the best indicator of U.S. students' achievement, the National Assessment of Educational Progress (NAEP), show no consistent upward trends during the past three decades. The latest international comparisons show that U.S. students fall further behind the longer they are in school, so that seniors wind up at the back of the pack. Employers and university staff continue to complain of the poor preparation of high school graduates.

What explains this paradox? We argue that few programs, particularly large-scale, multiple-site programs, are subject to rigorous evaluations. Moreover, the limited number of evaluations that appear to be rigorous are rarely conducted by independent investigators. Program developers themselves, who ardently believe in and strongly advocate the tenets of their programs, usually carry out or commission evaluations of their programs.

Evaluation Biases

Consciously or not, such strongly held beliefs affect the conduct and results of evaluation. Barber (1973) described a number of ways investigator bias affects research outcomes (summarized below). The general principle of "conflict of interest" is hardly news.
Aristotle warned us to consider the source, and the Romans asked, "Who benefits?" Such bias is well recognized in modern medical research. The "double-blind experiment" ensures that neither patients nor caregivers know which medicine is administered. This guards against the "placebo effect," in which belief in the efficacy of a treatment, rather than its physiological effects, produces a cure. In much evaluation research, placebo effects are built in rather than controlled.

Not only do strongly held beliefs directly affect evaluations, but large-scale government and foundation support of program development also creates pressures that are even more powerful. The federal government, for example, has spent more than $100 billion on Chapter 1/Title I for disadvantaged children alone. Although they may be at universities and other not-for-profit agencies, developers have strong financial interests in the success of their programs: Their jobs and perquisites depend on continued government and foundation funding, and they increasingly sell their materials and services to schools. In such an era, so-called independent evaluators are funded by the same government agencies and foundations that fund the programs. Having said the programs would succeed, can funding agency administrators easily return to Congress or their governing boards to admit they were wrong? Thus, even independent evaluators are pressured to find success--often through "soft" impressionistic evidence without control-group achievement comparisons.

The purpose of this chapter is to ask how general Venezky's findings about Success for All (SFA), documented in the previous chapter, are. We review the evaluation claims made by the chief developer and compare them to independent evaluations (Venezky, 1998; Jones, Gottfredson, & Gottfredson, 1997). In brief, we find that Slavin's claims for SFA are extraordinary, but the independent evaluations show negative effects of the program.
Contrary to the promise of its title and its expressed goal of performance at or near grade level, children participating in SFA fall increasingly behind national norms the longer they are in the program.

We also analyze similar claims for Reading Recovery (RR), a widely employed early intervention program, the evaluations of which also have built-in bias. Our analyses and examples suggest a general taxonomy of biases in program evaluation (shown in Table 1) that may be used to plan, conduct, and evaluate evaluations.

Developer's Evaluation

The chief developer of SFA, Robert Slavin, and his Johns Hopkins University colleague Olatokunbo Fashola review 13 school reform programs. Writing for Phi Delta Kappan, a widely circulated journal for educational practitioners, they acknowledge that Prospects, a large national study of Chapter 1/Title I education for disadvantaged students, "has called into question the effectiveness of the entire program" (p. 370). The title of their article directly reveals their purpose: "Schoolwide Reform Models: What Works." How can schools better spend the more than $7 billion given to them by the federal government under Chapter 1/Title I?

Fashola and Slavin recommend that educators ask two questions in deciding which program to select: Does it meet evaluation criteria for achievement? And has it been widely replicated or employed successfully at many schools? They provide answers to these questions for SFA and its extended form, Roots and Wings, both developed by Slavin and others at Johns Hopkins University, and for eleven other programs. Their review reveals only two programs for which affirmative answers can be given--the Johns Hopkins programs SFA and Roots and Wings. In accord with their assessment, they gave the two programs three full columns of print. The other eleven programs each got between about a fifth of a column and a full column.

Fashola and Slavin gave estimates of the effects of SFA on achievement, which are listed in Table 1.
The average of their estimates is 1.13, which is one of the highest effect sizes ever estimated in education research. Fashola and Slavin consider an effect size of .25 to be "educationally significant" (p. 372). Their smallest estimate of the effects of their program is twice this effect size.

Independent Evaluations

Other evaluations of SFA's effects differ sharply from Fashola and Slavin's. As shown in Table 1, the estimates of Jones, Gottfredson, and Gottfredson (1997) are considerably lower; only one is higher than the lowest of the developer's estimates. The average of their estimates is 4 percent of the size of the developer's. Of their ten estimates, only two are higher than .25. In fact, in five of the ten cases, students in control groups did better than those in SFA.

As shown in Table 1, Jones and others also compiled six estimated effects from other independent evaluations of SFA. In two cases, control groups did better; in two cases, the differences were not educationally significant by the Fashola-Slavin criterion of .25; and in two cases, or one third of the six, SFA did better than control groups.

Can SFA Meet its Distinctive Goal?

SFA developers and some independent evaluators have focussed on control-group comparisons. SFA's expressed goal, however, is to bring all children to or near grade level by third grade so they may progress normally in the later grades. Venezky points out, "According to the project's own reports, SFA has clearly not led to all students achieving at or near grade level by the end of grade three, even with only reading and language arts [which SFA emphasizes] included in the outcomes assessment" (p. 7).

Venezky carried out an independent evaluation in the Baltimore schools where SFA originated and would be expected to do exceptionally well.
Venezky, nonetheless, concludes, "Not only is the average SFA student failing to reach grade-level performance by the end of grade three, but even with further SFA instruction, continues to fall farther behind national norms" (p. 23). "After the early primary grades, SFA students begin to fall behind the average students nationally and by the end of fifth grade are almost 2.4 years behind. In addition, increasing time in SFA schools does not lead to increasing advantage in reading performance" (p. 24).

Thus, the SFA developers and the independent reviewers differ substantially in their evaluations of the effects of the program. What accounts for these differences? Social research and evaluation are often plagued with confounding variables that bias evaluations.

Factors Biasing Research Outcomes

Given these huge differences, we can now ask what general factors cause bias in the evaluation of educational programs. Evaluation is, of course, problematic. Humans are fallible and come with built-in prejudices; they are rarely able to make completely objective judgements. Prone to bias, evaluators make choices about design, instrumentation, and conclusions (Worthen & Sanders, 1987). To reduce the biases that may occur in a research or evaluation design, evaluators should be mindful of possible biases from the outset.

The taxonomy of biases in Table 2 provides a checklist combining the works of Barber (1973), Campbell and Stanley (1963), Worthen and Sanders (1987), and Fullan (1991). Campbell and Stanley and Barber have discussed issues of validity and reliability in research using the scientific method. Although they focus on experimental designs, their concepts are relevant to most evaluations using comparison groups. The use of comparison and control group designs implies the use of the scientific method to arrive at conclusions. Worthen and Sanders and Fullan offer helpful principles for designing and executing successful evaluations.
Our discussion of the biases in the taxonomy employs examples from the SFA and RR programs to illustrate how the biases can affect evaluations. When an evaluation fails to control for these biases, the effect of the stimulus or treatment can be confounded, and the results can be unduly favorable to new programs.

Biases in Design

Choice of Evaluator

Worthen and Sanders (1987) describe the uses of formative and summative evaluation. Internal staff members usually conduct formative evaluations to give program developers ongoing evaluative information for improvement. For potential users, external or independent evaluators should ordinarily conduct summative evaluations of a program's worth or merit at the end of implementation.

Independent evaluations provide more impartial and credible results. Because independent evaluators lack personal stakes, their evaluations are more objective. They should also be more replicable by other independent evaluators. Independent evaluators, moreover, are more likely to be specially trained and more experienced than program developers. They can offer a fresh outside perspective and allow the audience to see the forest and the trees. In addition, program participants may disclose information more readily to an outsider than to an internal evaluator who may breach confidentiality.

SFA developers carried out a formative evaluation but described it as if it were a summative evaluation for the public at large. As documented above, the independent evaluations contradict SFA's evaluation. A number of biases discussed below account for the discrepancy. To begin, SFA used its own reading assessments as evidence of success and as support for national use of the program. These assessments may not be impartial, for they are based on SFA's own criteria for achievement (Fashola & Slavin, 1998).
If SFA is an effective program, we should expect participants to meet criteria for achievement on various types of assessments in all the major subjects. Standardized tests are the most valid, comprehensive, and accepted measures. SFA students should be prepared to perform well not only on one type of assessment but across a variety of tests and other assessments. As discussed above, the results of SFA's internal evaluation were not replicated in the two independent evaluations by Venezky (1998) and Jones, Gottfredson, and Gottfredson (1997). Both of these independent evaluations yielded negative findings about the program.

Paradigm and Framework

The evaluator's paradigm includes basic assumptions about the program to be evaluated. The paradigm provides the framework for the questions to be asked, the relevant data, and the methods of data collection, analysis, and interpretation. As noted above, the choices are subjective and therefore may be biased, especially if the developer influences or conducts the evaluation. The selection of a paradigm, and consequently of questions and methodologies, can lead to inescapably biased conclusions.

Consider SFA: Slavin and the Johns Hopkins team designed SFA as a whole-school model centered on reading. The assumptions tested by the internal evaluator reflected this choice but overlooked possible negative effects on other subjects such as mathematics. Achieving success in one area should not occur at the expense of others.

Analysis and Data Interpretation

Evaluators need to make decisions about the collection, analysis, and interpretation of data. Internal evaluators, however, may be particularly prone to decisions that favor their own programs. Favorable parts of the available data, for example, may be chosen for analysis and interpretation while other parts are disregarded. When methods of analysis are not planned prior to collection, the evaluator may be tempted to examine data that look favorable.
This can happen when a great number of computer-generated statistical comparisons are conducted and only favorable results are selected for reporting. More importantly, an evaluator who does not arrive at results that confirm initial hypotheses may choose alternative methods to arrive at favorable conclusions without disclosing the unfavorable results.

One goal of SFA, for example, is to bring students' achievement to grade level. Generally, grade level refers to achievement on nationally standardized tests. Yet in their Kappan comparison of alternative programs, Fashola and Slavin chose not to report their analyses of standardized tests, which showed negative results for their program. They focussed instead on control-group comparisons on their own tests, which suggested far more favorable results. As Jones, Gottfredson, and Gottfredson (1997) suggest, analyses of gains or grade level are the more telling comparisons.

Some of SFA's evaluations, moreover, included very small samples, using only students who had been in the program from its first year in the school. Because schools in high-poverty areas have high rates of mobility, the number of students in the program decreases over the years, making the sample less representative of the population. Students remaining in the program are likely to come from more stable families, again biasing the results in favor of the SFA program.

History

It is necessary to consider events between observations that may have produced a change other than the one caused by the program at hand. Schools in South Carolina, for example, were turned upside down when Hurricane Hugo hit the area in 1989 (Education Week, February 4, 1998). Other, less extreme events occur as well. Venezky (1998) points out that some principals had been replaced in the schools he studied, which can affect a school's focus on existing programs, particularly when those programs conflict with the instructional philosophy of a new principal.
States are also constantly legislating new reforms, which can affect funding in these schools as well (Venezky, 1998). Realigning budgets and funding may shift focus to other programs. These are certainly not violations committed by programs, but they can affect program outcomes.

Maturation

Change may occur as a result of the developmental processes of the participants rather than the program itself. RR targets young children, and gains in learning may be due to students' natural development, which could confound the results. The SFA and RR evaluations that employed control groups have a strength in this respect, because the control groups would be subject to the same developmental processes.

Selection

Slavin and others, commendably, have conducted control-group evaluations since the program's inception twelve years ago. They have, however, compared unlike groups. To use SFA, 80% of the faculty must vote for it in a secret ballot. Control schools do not take a trial vote. Schools that achieve such a consensus are likely to have more uniform practices and philosophies than schools that do not.

A similar problem occurs in the RR program. For cross-classroom comparisons, RR used as controls classrooms whose teachers were not interested in participating in the program. Why were these teachers not interested? Their reasons may vary greatly across teachers and teaching practices, making sound comparisons difficult. In addition, students participating in RR received more instructional time than students in comparison groups.

Fashola and Slavin (1998) discuss the program's success in several bilingual schools, particularly for Cambodian students in Philadelphia. Yet short-term gains in beginning reading may be attributable to the natural process of learning English in schools and society rather than to the SFA program. Moreover, such students, who score poorly at the start, may be likely to "regress toward the mean."
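
Regression toward the mean can be illustrated with a minimal simulation. The sketch below uses no SFA or RR data. The number of students, the score distributions, and the rule of selecting the lowest 20 percent are hypothetical assumptions chosen only for illustration. Every simulated student has a stable ability plus independent test noise, and no program effect is present at all.

```python
import random
import statistics


def simulate_regression_to_mean(n_students=10000, bottom_fraction=0.2, seed=0):
    """Return mean pre-to-post gains for low scorers and for everyone else.

    All quantities are hypothetical; there is NO treatment effect in this model.
    """
    rng = random.Random(seed)
    # Stable "true" ability for each hypothetical student.
    ability = [rng.gauss(0.0, 1.0) for _ in range(n_students)]
    # Each test score is ability plus independent measurement noise.
    pretest = [a + rng.gauss(0.0, 1.0) for a in ability]
    posttest = [a + rng.gauss(0.0, 1.0) for a in ability]

    # Select the lowest pretest scorers, as a remedial program might.
    order = sorted(range(n_students), key=lambda i: pretest[i])
    cutoff = int(n_students * bottom_fraction)
    selected, rest = order[:cutoff], order[cutoff:]

    def mean_gain(indices):
        return statistics.fmean(posttest[i] - pretest[i] for i in indices)

    return mean_gain(selected), mean_gain(rest)


selected_gain, rest_gain = simulate_regression_to_mean()
print(f"Mean gain, lowest pretest scorers: {selected_gain:+.2f}")
print(f"Mean gain, everyone else:          {rest_gain:+.2f}")
```

Under these assumptions the lowest scorers show a positive average gain and the remaining students a small average loss, even though nothing was done to either group. A comparison between a program group chosen for low scores and a higher-scoring comparison group would therefore appear to favor the program.
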
Shanahan and Barr (1995) further note that because RR is costly, the students most likely to be selected for the program are those in greatest need, who score at the lowest end on tests at the beginning of the year. Scores in this type of sample will generally trend upward on retesting. These scores are compared with those of students who already score higher (otherwise they would have been selected for the program) and whose scores may go either up or down, leading to a misleading comparison. This bias could be reduced if students who were similar at the outset were chosen for comparison. Selecting the lowest students for the program and comparing them to the next lowest group in the classroom introduces bias, again favoring the new program.

In the RR program, students are selected on the basis of their performance relative to their classmates and the judgment of the teacher. This makes creating a comparison group even more difficult. Shanahan and Barr (1995) also point out the difficulties of comparing across RR sites. Some may have higher concentrations of disadvantaged students than others. Because results depend on students' initial achievement, inferences about effects may be questionable. Cross-site analyses could be particularly flawed when making international comparisons. For instance, in New Zealand, where RR was initially designed, schooling is centralized and children begin school younger. RR selection there may therefore be markedly different from that in the U.S. (Shanahan & Barr, 1995).

Multiple Treatment Interference

In what has been called "the Christmas Tree effect," simultaneous treatments may account for apparent program effects. Outside agencies, particularly the federal government, contribute substantial resources, grants, services, and programs to schools, especially urban schools. It may be misleading to attribute success to a single program.
SFA and RR, moreover, are costly programs, which may mean that some programs that are actually working well must be rejected or abandoned to pay for them.

Mortality

Significant numbers of students in SFA and RR do not complete the program because of transfers, absences, or assignment to special education. They are generally excluded from results, which again biases results in favor of the program. In RR, students may be removed if they are not performing well, which also biases the findings.

Biases in Implementation

Loose Protocol

A loose or ambiguous protocol is one that lacks uniform procedures for conducting the program. This not only makes overall conclusions problematic but also makes replication unlikely. In RR, for example, teachers recommend students relative to other students in the class and on the basis of their own judgment. Thus, RR does not employ uniform procedures across sites to make these referrals.

Implementer Attributes

Implementers in program and control groups may differ considerably. The 80 percent secret ballot requirement for SFA, for example, means that SFA faculties will differ systematically from control faculties. In addition, SFA's use of its own tests and test administrators may allow unconscious bias to affect the results.

Implementer Failure to Follow Protocol

When protocols are loose, implementers may engage in a variety of behaviors, some of which may enhance program effects and others of which may detract from them. Teachers in SFA, for example, received training during the initial stages of implementation, but training diminished over the years, particularly when new textbooks were introduced; teachers apparently received little support in adapting these new lessons and curriculum (Venezky, 1998). In this situation, teachers would have to make their own decisions, possibly causing implementation to be inconsistent across classrooms and sites.
This problem might have been alleviated had SFA invested less heavily in materials and more heavily in teacher expertise (Education Week, February 4, 1998).

Biases in Measurement Tools

Implementer Unintentional Expectancy

Programs and research studies are generally based on strong assumptions, prior research, and theoretical frameworks. Hence, implementers may have strong expectations of, and desires for, particular results. They would like to see the proposed hypothesis supported. Subjects may be sensitive to these expectations and desires as cues and respond accordingly. These strong desires may affect adherence to protocol and the recording of data. Implementers may interact differently with different groups of subjects on the basis of expected results.

Testing

Repeating similar or identical tests before, during, and after program units allows students to familiarize themselves with the tests. If the tested content or skills are also emphasized in the program, students may concentrate on them during their studies. As a result, success on subsequent tests may be explained by familiarity rather than by positive effects of the program. Control groups may not have this advantage. As mentioned above under Selection, students entering RR are the ones with the lowest scores, and their scores will most likely trend upward on a second administration of the test. In addition, SFA administers tests every eight weeks to track improvement. Higher scores may reflect students' learning the content and format of the test rather than better performance.

Instrumentation

As mentioned above, one goal of SFA is to bring students' performance to grade level. Venezky (1998) notes that the battery of reading tests used by SFA was not designed to be compared to national norms, since the tests are individual diagnostic tests. Shanahan and Barr (1995) raise the possibility that RR tests may not have equal intervals.
If, for example, the tests are too easy for control groups that initially score above RR students, those groups could not show gains. In addition, pre- and post-measures were not the same for RR groups and regular classroom students. Thus, tests with different properties may account for differences between RR and control students.

Biases in Reporting

Falsifying evidence is probably rare, but selective reporting occurs. The competition for funding can be fierce and may motivate internal evaluators to leave negative findings unreported. Fashola and Slavin (1998), for example, reported very positive achievement results for SFA in Kappan, a widely circulated journal for practitioners. Venezky (1998), however, points out that SFA's own internal reports show that SFA students were not performing at grade level by the end of third grade.

Conclusion

This chapter compiles, defines, and illustrates major biases or threats to methodological validity common to evaluations of educational programs. We have argued that several types of evaluations are strongly prone to such biases: those commissioned by government agencies and foundations that also fund the programs in question, and those conducted by the developers themselves.

Our suggestion to policy makers and practitioners is to exercise caution about adopting programs without well-designed, multiple, and favorable independent evaluations. This is hardly new advice, since the ancient Romans suggested that buyers beware, and medical researchers routinely protect studies from their own and others' conscious and unconscious biases. What is perhaps new is the view that even well-meaning education funders and program developers may be subject to similar biases. Diogenes, lend us your lantern.

Table 1. Factors Biasing Research Outcomes

References

Barber, T. X. (1973). Pitfalls in research: Nine investigator and experimenter effects. In R. M. W. Travers (Ed.), Handbook of Research on Teaching (pp. 382-404).

Campbell, D. T., & Stanley, J. C. (1963). Experimental and quasi-experimental designs for research on teaching. In N. L. Gage (Ed.), Handbook of Research on Teaching (pp. 171-246). Chicago: Rand McNally & Company.

Encyclopaedia Britannica. (1998). Chicago: Encyclopaedia Britannica, Inc.

Fashola, O. S., & Slavin, R. E. (1998, January). Schoolwide reform models: What works? Phi Delta Kappan, 370-379.

Fullan, M. G. (1991). The new meaning of educational change. New York, NY: Teachers College Press.

Jones, E. M., Gottfredson, G. D., & Gottfredson, D. C. (1998). Success for some: An evaluation of a "Success for All Program." College Park, MD: University of Maryland.

Olson, L. (1998, February 6). Will success spoil Success for All? Education Week, pp. 42-45.

Shanahan, T., & Barr, R. (1995). Reading Recovery: An independent evaluation of the effects of an early instructional intervention for at-risk learners. Reading Research Quarterly, 30(4), 958-996.

Venezky, R. L. (1998). An alternative perspective on Success for All. This volume.

Worthen, B. R., & Sanders, J. R. (1987). Educational evaluation: Alternative approaches and practical guidelines. White Plains, NY: Longman.
|