Analysis of LA Times series shows pitfalls of using test scores to evaluate teachers

Charles Taylor Kerchner

February 10, 2011

(This commentary first appeared in TOP-Ed.)

Nearly half the rankings handed out to L.A. Unified teachers by the Los Angeles Times may be wrong. This is one of the conclusions reached by Derek Briggs and Ben Domingue of the University of Colorado at Boulder, who conducted a reanalysis of the data used by the Times in its value-added analysis of teacher performance. Using very strong language for the semi-polite world of social science, they concluded that the newspaper’s teacher effectiveness ratings were “based on unreliable and invalid research.”

At issue here is the validity of the original research conducted by Richard Buddin, the wisdom and responsibility of the Times in publishing individual teacher names, and the newspaper’s response to the reanalysis. In a policy environment where value-added analysis is actively being considered as part of how teachers should be evaluated, the reanalysis results are a big deal.

The teacher rankings story was published as an exposé last August under the subhead “a Times analysis, using data largely ignored by LAUSD, looks at which educators help students learn, and which hold them back.” In the Briggs and Domingue reanalysis, the which sometimes got switched:

More than half of the teachers (53.6%) had a different effectiveness rating in reading in the Briggs/Domingue alternative analysis than they did in Buddin’s original analysis. In math, about 40 percent of the teachers would have been in a different effectiveness ranking.

When extended to teachers who were labeled “ineffective” or “effective,” over 8 percent of those the Times identified as ineffective in teaching reading were identified as effective by the alternative analysis. Some 12.6% of the teachers graded as effective in the Times were found ineffective in the alternative analysis.

Why these differences? The answer is simple: not all value-added analyses are the same, and some are a lot better than others. The whole idea of value-added assessment is to statistically control for the elements of a student’s experience that are out of the teacher’s direct control. It’s obvious that a student’s achievement at the end of the 5th grade, for example, can’t all be attributed to that year’s teacher. Some of it results from the student’s family, some from prior teachers, and some is commonly attributed to economic and social circumstance. There are different ways of controlling for these factors, and researchers who work with value-added analysis usually approach this task with a great deal of transparency because they know that the outcome depends on the method they use.

Here’s what we know about the two analyses: First, it makes a big difference who gets counted. Briggs/Domingue got different results when they analyzed the Times data using the same technique used by Buddin, who works at the RAND Corporation but did the value-added analysis on his own time. As it turns out, each of the researchers had made different decisions about which teachers and which students to exclude from the study. Almost all studies exclude cases because data on some students are incomplete or the teacher may have had only a few students with valid test scores. The rules about who is counted and who is not turn out to be very influential to a teacher’s ranking.

Second, it makes a big difference what outside-the-classroom factors are included. Buddin’s technique included relatively few of these factors, fewer than used in the leading studies in the field. In his model, a teacher’s value-added score was based on student California Standardized Test performance moderated by a student’s test performance the prior year, English language proficiency, eligibility for federal Title I (low income) services, and whether he or she began school in LAUSD before or after kindergarten.
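To make the idea of "moderating" a test score concrete, here is a minimal sketch in Python. It is not Buddin's model or the Briggs/Domingue model: it assumes a single control (the prior-year score), a simple least-squares line, and made-up scores for two hypothetical teachers, "A" and "B". A teacher's value-added is taken to be the average amount by which his or her students beat or miss the score predicted from their prior-year results.

```python
from collections import defaultdict
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (one predictor)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def value_added(records):
    """records: list of (teacher, prior_score, current_score) tuples.

    Returns {teacher: mean residual of that teacher's students}, where the
    residual is the actual score minus the score predicted from the prior year.
    """
    intercept, slope = fit_line([r[1] for r in records], [r[2] for r in records])
    residuals = defaultdict(list)
    for teacher, prior, current in records:
        predicted = intercept + slope * prior
        residuals[teacher].append(current - predicted)
    return {teacher: mean(vals) for teacher, vals in residuals.items()}


# Hypothetical data: both teachers start with identical students, so any
# difference in the estimate comes from current-year scores alone.
records = [
    ("A", 300, 320), ("A", 340, 362), ("A", 380, 398),
    ("B", 300, 305), ("B", 340, 341), ("B", 380, 384),
]
scores = value_added(records)
print(scores)
```

Each additional control, such as English proficiency or Title I eligibility, would change the predicted scores and therefore the residuals, which is why the choice of controls can move a teacher from one rating to another.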

Briggs, who is the chair of the research and methodology program at the University of Colorado education school, and Domingue, a doctoral student, added a handful of new variables. Particularly, they looked at what I call the “tracking effect,” whether a student came from a high performing or low performing classroom or school. The results led to starkly different conclusions about rankings for nearly half the teachers.

They also found that when the standard statistical confidence level was put in place, between 43% and 52% of the teachers could not be distinguished from the category of “average” effectiveness. As a consequence, the technique would be useless in ranking the broad swath of teachers in the middle for bonuses or salary bumps.
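The logic behind that finding can be sketched in a few lines of Python. This is an illustration only, not the procedure in the report: it assumes a 95 percent confidence level with a normal approximation (the 1.96 multiplier), an "average" effect of zero, and invented estimates and standard errors for four hypothetical teachers.

```python
def classify(estimate, std_error, z=1.96):
    """Label a teacher by whether the confidence interval excludes zero.

    estimate:  value-added point estimate (0 means average)
    std_error: standard error of that estimate
    z:         multiplier for the interval; 1.96 assumes a 95% level
    """
    lower = estimate - z * std_error
    upper = estimate + z * std_error
    if lower > 0:
        return "above average"
    if upper < 0:
        return "below average"
    return "indistinguishable from average"


# Hypothetical (estimate, standard error) pairs.
teachers = {
    "A": (0.30, 0.10),
    "B": (0.12, 0.10),
    "C": (-0.08, 0.10),
    "D": (-0.25, 0.10),
}
labels = {name: classify(est, se) for name, (est, se) in teachers.items()}
for name, label in labels.items():
    print(name, label)
```

In this toy example teachers B and C have different point estimates and could land in different published categories, yet neither can be statistically separated from an average teacher.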

Briggs/Domingue also disputed Buddin’s assertion that traditional teacher qualifications have no effect on effectiveness rankings. In their analysis, experience and training count.

If the differences in results were just a disparity of opinion among statisticians, the question would be resolved in academic journals, not the media. But reporters Jason Song and Jason Felch, and the Times editorial board, have raised the use of value-added assessment to a matter of public debate. Secretary of Education Arne Duncan has endorsed the idea, and other newspapers are following the Times’ lead in demanding teacher-identified student test data from their school districts.

So, how the public policy process deals with criticism of value-added techniques is important. On Monday, Felch’s story about the statistical reanalysis appeared under a headline that said, “Separate study confirms many Los Angeles Times findings on teacher effectiveness.” It does not, and I do not understand how one could reasonably draw such a conclusion from the highly readable 32-page research report written by Briggs/Domingue. I invite you to read their report, a summary of it, and their response to Monday’s story. The only large point on which the two analyses agree is that there is big variation between the most and least effective teachers in Los Angeles Unified.

Is that a surprise to anyone? The Times’ contribution has been in raising the issue of variations in teacher effectiveness politically, and, as I said last August, the original Song-Felch story was one of those rare “this changes everything” moments in journalism. But whatever contribution raising the issue of variable teacher performance made is undermined by the continued defense of the technique’s application against a barrage of criticism from scholars and statisticians. The reanalysis grounds this criticism in the original data. Those who want to use value-added assessment to pay teachers or fire them need to acknowledge the technical difficulties involved.

The Times also raised the question of sponsorship of the Briggs/Domingue study by the National Education Policy Center, a reliably left-leaning think tank that has financial ties to teacher unions. Briggs was quick to claim his independence, saying that his work had received no interference at all from the Center: “I have no ideological axe to grind at all about the use of value-added assessment or the L.A. Times. I do feel passionately about the quality of the research.”

The substantive underlying question is whether value-added analysis can help improve schools or teaching in any practical way.

My response is that it can, but only if we moderate our language. As the final paragraphs of the Briggs/Domingue analysis note, we need to avoid two extreme positions:

Unless value-added analyses can be shown to be perfect they should not be used at all. Value-added measurements do not need to be perfect to be a great improvement over the techniques currently used in teacher evaluation;
Any critique of the method or its conclusion constitutes an endorsement of the status quo. There are many useful improvements to current data use practice. Some don’t use value-added analysis at all, and others use it in more sophisticated and less mechanical ways than the Times analysis.

We all need to be realistic about what the findings show and don’t show. Just because we have the statistical chops to perform an analysis, that doesn’t mean that it’s good or that it should rapidly be married to a pay and personnel system to “winnow away ineffective teachers while rewarding the effective ones.” Causal analysis may be the Holy Grail of statistics, but casual application of it is not.

At the very least, the reanalysis underscores the invalidity of extending causal analysis to individual teachers, and the folly of releasing teacher names and their rankings.

Charles Taylor Kerchner is Research Professor in the School of Educational Studies at Claremont Graduate University, and a specialist in educational organizations, educational policy, and teachers unions. In 2008, he and his colleagues completed a four-year study of education reform of the Los Angeles Unified School District. The results of that research can be found in The Transformation of Great American School Districts and in Learning from L.A.: Institutional Change in American Public Education, published by Harvard Education Press.


Stephen Cox 13 years ago
As a high school math teacher, I was wondering if the same “value-added” algorithm is used for all subjects? Does any of the research break up the data by subject? If I had to guess, I would think that the use of student test scores would be much more valid for math than for English. What do you think?
Ze'ev Wurman 13 years ago
A very nice reporting of the dissenting positions, yet I take exception to a key statement here.

“The results led to starkly different conclusions about rankings for nearly half the teachers.”

I am not sure at all I would call the differences “stark.” If one expects perfection in social sciences then certainly the somewhat differing results may bring a heartburn. Yet if one looks on the bigger picture, one that mostly wants to identify the very strong and very weak teachers, the results actually look eerily similar. In mathematics, 80% of those that the RAND analysis placed in the top 20% are also in the top 20% of the new analysis (and 98% in top 40%). Similarly, almost 80% in the RAND’s bottom 20% remain in the bottom 20% (and 98% in bottom 40%). No teachers from top 20% move to the bottom 20% in the NEPC analysis, and no teachers from the RAND’s bottom 20% move to NEPC’s top 20%. In reading the results are only slightly weaker (about 60% of top and bottom categories stay there, and 80-90% stay within top and bottom 40%, respectively) but no teachers from RAND’s bottom 20% end up in NEPC’s top 20%, and a tiny fraction, about 1%, of RAND’s top scorers end in the lowest NEPC category.

In other words, most of the bemoaned (and overall small) “movement” happens one quintile apart — few of the worst move to one better, few of the best move one rank down, some move from a bit above or below average to average. This sort of things. Which is exactly what is expected from social science effort to rank soft skills. But the VAM analysis does extremely well what is really important — identify the top teachers and, even more crucially, the bottom ones.

Anyone who has just a bit sense and minimal experience with how such things are done knows that the purpose of this is not to rank 100,000 teachers in a linear list or to try to adjust every teacher’s compensation on an infinitely granular scale. The purpose is to identify the most effective teachers and make sure they are recognized and rewarded to maximize their retention probability, and to try and do something about the least effective, from training and up to and inclusive of letting them go. And do little if anything else about the rest. That is what promoters of value-added measures want. For example, Rick Hanushek speaks about using them to get rid of the lowest 5%, not of using such measures to dictate every teacher’s salary. As a guide to who belongs to each of those groups, the current value added measures, both in their RAND/LATimes version and the NEPC version, seem perfectly satisfactory.

In fact I will say something more. If this is the best that a teacher-union sponsored research could come up with, I consider it a great victory for VA measures. Both studies match almost perfectly in their identification of the top and bottom quintiles, and if they were to aim at identifying the top and bottom 5% the results probably could be made even crispier. Based on these results VAM is ready for use — carefully — even today.
Peter Schrag 13 years ago
Congratulations on the Kerchner piece. He properly pointed out that the LATimes falsely claimed that the Colorado study confirmed its findings in the value added study of LA teachers. He didn’t mention that the Times broke the embargo on publication of the Colorado report. So double (or maybe triple) shame on the Times.
Ariel Paisley 13 years ago
Please visit the web-link below for the Washington Post article that I reference in this post.
http://www.washingtonpost.com/wp-dyn/content/article/2011/02/08/AR2011020804813.html

With the recent arbitrator’s ruling reversing the firing of 75 teachers by Michelle Rhee during her tenure as DC school chief (see web-link), it would be wise for California districts (and news agencies) to think twice before using test score data to challenge teacher competencies.

There is a tendency to misinterpret test data as an indicator of teacher proficiency; there are actually many additional variables that need to be considered to make sense of test data. Only a reductionist approach to teacher evaluation would attempt to use such test data to measure something as complex as effective teaching.

Reductionist methods have some value in the hard sciences where it is easier to control for unmeasured or unwanted variables, but they have questionable validity in the social sciences where complexly intertwined dimensions of human behavior and social conditions can’t easily be measured in isolation.

The agenda driving this test-based type of teacher evaluation is driven by politics, not by science. Any good scientist would quickly recognize this as an invalid use of data.
