Analysis of LA Times series shows pitfalls of using test scores to evaluate teachers

February 10, 2011

(This commentary first appeared in TOP-Ed.)

Nearly half the rankings handed out to L.A. Unified teachers by the Los Angeles Times may be wrong. This is one of the conclusions reached by Derek Briggs and Ben Domingue of the University of Colorado at Boulder, who conducted a reanalysis of the data the Times used in its value-added analysis of teacher performance. Using very strong language for the semi-polite world of social science, they concluded that the newspaper’s teacher effectiveness ratings were “based on unreliable and invalid research.”

At issue here is the validity of the original research conducted by Richard Buddin, the wisdom and responsibility of the Times in publishing individual teacher names, and the newspaper’s response to the reanalysis. In a policy environment where value-added analysis is actively being considered as part of how teachers should be evaluated, the reanalysis results are a big deal.

The teacher rankings story was published as an exposé last August under the subhead “a Times analysis, using data largely ignored by LAUSD, looks at which educators help students learn, and which hold them back.” In the Briggs and Domingue reanalysis, the “which” sometimes got switched.

Why these differences? The answer is simple: not all value-added analyses are the same, and some are a lot better than others. The whole idea of value-added assessment is to statistically control for the elements of a student’s experience that are out of the teacher’s direct control. It’s obvious that a student’s achievement at the end of the 5th grade, for example, can’t all be attributed to that year’s teacher. Some of it results from the student’s family, some from prior teachers, and some is commonly attributed to economic and social circumstance. There are different ways of making these adjustments, and researchers who work with value-added analysis usually approach the task with a great deal of transparency because they know that the outcome depends on the method they use.

Here’s what we know about the two analyses: First, it makes a big difference who gets counted. Briggs/Domingue got different results even when they analyzed the Times data with the same technique used by Buddin, who works at the RAND Corporation but did the value-added analysis on his own time. As it turns out, each set of researchers had made different decisions about which teachers and which students to exclude from the study. Almost all studies exclude cases because data on some students are incomplete or a teacher may have had only a few students with valid test scores. The rules about who is counted and who is not turn out to have a strong influence on a teacher’s ranking.

Second, it makes a big difference what outside-the-classroom factors are included. Buddin’s technique included relatively few of these factors, fewer than the leading studies in the field use. In his model, a teacher’s value-added score was based on students’ California Standardized Test performance, adjusted for each student’s test performance the prior year, English language proficiency, eligibility for federal Title I (low-income) services, and whether he or she began school in LAUSD before or after kindergarten.
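
To make this concrete, here is a minimal sketch in Python of a regression of this general form. It is not Buddin’s actual specification; the data, column names, and grouping are invented for illustration, and a real value-added model involves many more decisions than this toy version shows.

```python
# Minimal illustration of a value-added style regression (not Buddin's
# actual specification). All data and column names here are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 600
df = pd.DataFrame({
    "score": rng.normal(350, 50, n),            # current-year test score
    "prior_score": rng.normal(345, 50, n),      # prior-year test score
    "english_learner": rng.integers(0, 2, n),   # English proficiency flag
    "title_i": rng.integers(0, 2, n),           # Title I (low-income) eligibility
    "early_lausd_entry": rng.integers(0, 2, n), # entered LAUSD by kindergarten
    "teacher": rng.integers(0, 30, n),          # teacher identifier
})

# Teacher indicators: each C(teacher) coefficient is that teacher's estimated
# contribution ("value added") after adjusting for the student-level controls,
# measured here relative to an omitted reference teacher.
base_model = smf.ols(
    "score ~ prior_score + english_learner + title_i + early_lausd_entry"
    " + C(teacher)",
    data=df,
).fit()

teacher_effects = base_model.params.filter(like="C(teacher)")
print(teacher_effects.sort_values().head())
```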

Briggs, who chairs the research and methodology program at the University of Colorado education school, and Domingue, a doctoral student, added a handful of new variables. In particular, they looked at what I call the “tracking effect,” whether a student came from a high-performing or low-performing classroom or school. The results led to starkly different conclusions about rankings for nearly half the teachers.
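
As an illustration of what adding such variables can look like, continuing the hypothetical sketch above (and again, not the authors’ actual model), one can construct classroom- and school-level averages of prior achievement and include them as additional controls:

```python
# Continuing the hypothetical sketch above: add peer-context controls of the
# kind Briggs and Domingue describe, here as classroom and school averages of
# prior-year achievement. The grouping variables are invented for illustration.
df["classroom"] = df["teacher"] * 2 + rng.integers(0, 2, n)  # two classes per teacher
df["school"] = rng.integers(0, 5, n)

df["class_mean_prior"] = df.groupby("classroom")["prior_score"].transform("mean")
df["school_mean_prior"] = df.groupby("school")["prior_score"].transform("mean")

richer_model = smf.ols(
    "score ~ prior_score + english_learner + title_i + early_lausd_entry"
    " + class_mean_prior + school_mean_prior + C(teacher)",
    data=df,
).fit()
```

With made-up data these added terms change little, but the point of the reanalysis is precisely that with real student records the estimates for individual teachers can move, sometimes enough to change a teacher’s ranking.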

They also found that when a standard statistical confidence interval was applied, between 43% and 52% of the teachers could not be distinguished from “average” effectiveness. As a consequence, the technique would be useless for ranking the broad swath of teachers in the middle for bonuses or salary bumps.
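
A rough sketch of that step, continuing the toy example above: compute a 95% confidence interval for each teacher effect and count how many intervals overlap zero. (In a real analysis the effects would be centered on the district average rather than measured against an omitted reference teacher.)

```python
# Continuing the sketch: flag teacher effects whose 95% confidence intervals
# overlap zero, i.e., teachers who cannot be statistically distinguished from
# the baseline in this toy setup.
ci = richer_model.conf_int(alpha=0.05).filter(like="C(teacher)", axis=0)
indistinguishable = (ci[0] < 0) & (ci[1] > 0)
print(f"{indistinguishable.mean():.0%} of teacher effects overlap zero")
```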

Briggs/Domingue also disputed Buddin’s assertion that traditional teacher qualifications have no effect on effectiveness rankings. In their analysis, experience and training count.

If the differences in results were just a disparity of opinion among statisticians, the question would be resolved in academic journals, not the media. But reporters Jason Song and Jason Felch, and the Times editorial board, have made the use of value-added assessment a matter of public debate. Secretary of Education Arne Duncan has endorsed the idea, and other newspapers are following the Times’ lead in demanding teacher-identified student test data from their school districts.

So, how the public policy process deals with criticism of value-added techniques is important. On Monday, Felch’s story about the statistical reanalysis appeared under a headline that said, “Separate study confirms many Los Angeles Times findings on teacher effectiveness.” It does not, and I do not understand how one could reasonably draw such a conclusion from the highly readable 32-page research report written by Briggs/Domingue. I invite you to read their report, a summary of it, and their response to Monday’s story. The only major point on which the two analyses agree is that there is wide variation between the most and least effective teachers in Los Angeles Unified.

Is that a surprise to anyone? The Times’ contribution has been to raise the issue of variation in teacher effectiveness politically, and, as I said last August, the original Song-Felch story was one of those rare “this changes everything” moments in journalism. But whatever contribution the Times made by raising the issue is undermined by its continued defense of the technique’s application in the face of a barrage of criticism from scholars and statisticians. The reanalysis grounds this criticism in the original data. Those who want to use value-added assessment to pay teachers or fire them need to acknowledge the technical difficulties involved.

The Times also raised the question of sponsorship of the Briggs/Domingue study by the National Education Policy Center, a reliably left-leaning think tank that has financial ties to teacher unions. Briggs was quick to claim his independence, saying that his work had received no interference at all from the Center: “I have no ideological axe to grind at all about the use of value-added assessment or the L.A. Times. I do feel passionately about the quality of the research.”

The substantive underlying question is whether value-added analysis can help improve schools or teaching in any practical way.

My response is that it can, but only if we moderate our language. As the final paragraphs of the Briggs/Domingue analysis note, we need to avoid two extreme positions.

We all need to be realistic about what the findings show and don’t show. Just because we have the statistical chops to perform an analysis, that doesn’t mean that it’s good or that it should rapidly be married to a pay and personnel system to “winnow away ineffective teachers while rewarding the effective ones.” Causal analysis may be the Holy Grail of statistics, but casual application of it is not.

At the very least, the reanalysis underscores the invalidity of extending causal analysis to individual teachers, and the folly of releasing teacher names and their rankings.

Charles Taylor Kerchner is Research Professor in the School of Educational Studies at Claremont Graduate University, and a specialist in educational organizations, educational policy, and teachers unions. In 2008, he and his colleagues completed a four-year study of education reform of the Los Angeles Unified School District. The results of that research can be found in The Transformation of Great American School Districts and in Learning from L.A.: Institutional Change in American Public Education, published by Harvard Education Press.

