Creating the Ratings

As Buddin began to work with the LAUSD information, he first created a pool of raw data: the standardized test results for students in grades two through five. The data covered some 603,500 elementary students taught by about 18,000 teachers in 520 schools. Buddin could not use it all, however: he was interested in the test results of students whose teachers taught English and math in grades three through five (testing started in second grade, and Buddin needed at least two consecutive years of test scores to conduct a value-added analysis). That gave him a pool of some 6,000 teachers.

The data, he found, included information on students' gender, age, poverty level, number of years in the LAUSD, and whether a student was a non-native English speaker (the Times requested additional demographic data on race and ethnicity, but the LAUSD refused due to privacy laws). The test results could be sorted by teacher as well as by school, type of school (standard or charter), and grade level.

Buddin used complex mathematical formulas and regression analysis to try to determine what effect individual teachers had on their students' learning over time. He posed three questions he hoped the data could help answer: how much did teacher quality vary from school to school and from teacher to teacher; what qualifications or background influenced teachers' success in the classroom; and how did traditional measures compare with value-added measures of teacher and school effectiveness? [8] He hoped in particular that his analysis would help document whether standard teacher credentials (advanced degrees, special training, or years of experience) correlated with the achievement of their students in the classroom.
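
(Buddin's actual model is not reproduced in the case. Purely as a rough illustration, the sketch below shows the general shape of a value-added calculation: regress current-year scores on prior-year scores and student characteristics, then treat each teacher's average residual as that teacher's estimated effect. The column names, coefficients, and simulated data are all assumptions.)

```python
# Illustrative value-added sketch (not Buddin's actual model).
# Assumption: each row has a student's current score, prior-year score,
# a poverty indicator, an English-learner indicator, and a teacher id.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
prior = rng.normal(50, 10, n)          # hypothetical prior-year scaled scores
poverty = rng.integers(0, 2, n)        # hypothetical poverty flag
ell = rng.integers(0, 2, n)            # hypothetical English-learner flag
teacher = rng.integers(0, 20, n)       # hypothetical teacher ids
true_effect = rng.normal(0, 2, 20)     # simulated teacher effects
current = (5 + 0.9 * prior - 1.5 * poverty - 1.0 * ell
           + true_effect[teacher] + rng.normal(0, 5, n))

# Regress current scores on prior scores and demographics, then attribute
# each teacher's mean residual to that teacher.
X = np.column_stack([np.ones(n), prior, poverty, ell])
beta, *_ = np.linalg.lstsq(X, current, rcond=None)
residual = current - X @ beta

value_added = {t: residual[teacher == t].mean() for t in range(20)}
for t, va in sorted(value_added.items(), key=lambda kv: kv[1]):
    print(f"teacher {t:2d}: estimated value-added {va:+.2f}")
```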

As insurance, Data Analysis Editor Doug Smith proposed (and Lauter approved) that he create a parallel system to double-check Buddin's results. While Smith's setup was not as sophisticated as Buddin's, it could analyze the LAUSD data. Smith amplifies:

We did a simple gain-score analysis. That is, we ranked the students in each grade level into percentiles, calculated each student's change in percentile from year to year, and summed those differences for each teacher's students. We then ranked the teachers into quintiles based on their average student gain scores. Then we repeated the process for schools. [9]
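
(The sketch below walks through the steps Smith describes, run against simulated data; the field names and numbers are illustrative assumptions, not the Times' code or records.)

```python
# A minimal sketch of the gain-score procedure Smith describes.
import numpy as np

def percentile_rank(scores):
    """Rank scores within a group on a 0-100 percentile scale."""
    order = scores.argsort().argsort()            # 0..n-1 rank of each score
    return 100.0 * order / max(len(scores) - 1, 1)

rng = np.random.default_rng(1)
n = 500
grade = rng.integers(3, 6, n)                     # grades 3-5
teacher = rng.integers(0, 25, n)
score_y1 = rng.normal(300, 40, n)                 # hypothetical year-1 scores
score_y2 = score_y1 + rng.normal(5, 25, n)        # hypothetical year-2 scores

# 1. Rank students into percentiles within each grade, for each year.
pct_y1, pct_y2 = np.empty(n), np.empty(n)
for g in np.unique(grade):
    idx = grade == g
    pct_y1[idx] = percentile_rank(score_y1[idx])
    pct_y2[idx] = percentile_rank(score_y2[idx])

# 2. Each student's gain is the change in percentile from year to year.
gain = pct_y2 - pct_y1

# 3. Average the gains of each teacher's students ...
teacher_gain = {t: gain[teacher == t].mean() for t in np.unique(teacher)}

# 4. ... and rank the teachers into quintiles on that average.
avg = np.array(list(teacher_gain.values()))
quintile = np.ceil(5 * (avg.argsort().argsort() + 1) / len(avg)).astype(int)
for t, q in zip(teacher_gain, quintile):
    print(f"teacher {t:2d}: mean gain {teacher_gain[t]:+6.2f}, quintile {q}")
```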


Smith and Buddin continually compared the two sets of results.

Crisis of Confidence. By early 2010, Buddin had some preliminary findings for particular teachers. To test these, reporter Felch in late January spent two days at the Carpenter Avenue Elementary School. He went specifically to check on one teacher who, according to Buddin's calculations, had a very low score. Felch wanted to see if Buddin's theoretical results matched the teacher's actual performance in the classroom. During his visit, Felch saw engaged students and what looked to him like a good teacher. He also spoke with the principal and other teachers; there was no sense that anything was amiss with the teacher in question. Felch recalls:

I came away with grave concerns about the quality of the data. I came back, and I said, "You know, guys, this ain't it. If this is what the data's telling us, I don't think it's really that valuable."

Felch sat down with Buddin and Smith to try to figure out what was wrong. Buddin pulled apart his complex statistical analysis, and quickly found a major error in arithmetic. As it turned out, fully one-third of the teachers had received erroneous scores, including the one Felch had visited. While a significant mistake, it was easy to fix.

Other fixes followed. For example, Smith and Buddin had made different decisions about which teachers and students to include or exclude. Buddin, coming from an academic background, typically worked with very large databases and was able, without distorting the results, to discard any data deemed possibly unreliable. So he had left out test results from charter schools. But Smith wanted as large a pool as possible in order to improve the accuracy of the analysis. While the Times had plenty of district-wide data, the amount of data linked to any individual teacher was small; just a few missing students could affect a teacher's rating significantly. So they agreed to include charter schools. In general, Smith restricted discards to demonstrably inaccurate data: for example, students who had two math scores for the same year, or a teacher who was listed at two schools.
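
(As an illustration of the kind of "demonstrably inaccurate" records involved, the sketch below drops duplicate same-year scores and teachers listed at two schools; the record layout and field names are assumptions, not the Times' actual data.)

```python
# Hedged sketch of the data-quality filters the section describes.
from collections import Counter

test_records = [
    {"student": "S1", "subject": "math", "year": 2008, "score": 350},
    {"student": "S1", "subject": "math", "year": 2008, "score": 362},  # duplicate year
    {"student": "S2", "subject": "math", "year": 2008, "score": 410},
]
teacher_records = [
    {"teacher": "T1", "school": "A"},
    {"teacher": "T1", "school": "B"},   # listed at two schools
    {"teacher": "T2", "school": "A"},
]

# Drop students with more than one score in the same subject and year.
counts = Counter((r["student"], r["subject"], r["year"]) for r in test_records)
clean_tests = [r for r in test_records
               if counts[(r["student"], r["subject"], r["year"])] == 1]

# Drop teachers listed at more than one school.
schools = Counter(r["teacher"] for r in teacher_records)
clean_teachers = [r for r in teacher_records if schools[r["teacher"]] == 1]

print(clean_tests)      # only S2's record survives
print(clean_teachers)   # only T2 survives
```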

There were other adjustments. Buddin and Felch (before Smith joined the project) had originally agreed to a standard for inclusion in the analysis: a teacher had to have taught at least 60 students, and those students had to have had at least two consecutive years of standardized test scores in math and English. For some reason (Smith suspects himself of carelessly saying "greater than 60" rather than "greater than or equal to 60"), Buddin included only teachers who had taught more than 60 students. That small difference had eliminated a surprising number of eligible teachers. When Buddin shifted to the "at least 60" standard, it meant more teachers in the database, and changed the ratings for many.
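
(In miniature, the off-by-one looks like this; the teacher labels and counts are invented for illustration.)

```python
# "> 60" silently drops every teacher with exactly 60 eligible students,
# while ">= 60" keeps them.
student_counts = {"T1": 59, "T2": 60, "T3": 61}   # hypothetical counts

strictly_more = [t for t, n in student_counts.items() if n > 60]
at_least      = [t for t, n in student_counts.items() if n >= 60]

print(strictly_more)  # ['T3']        -- T2 excluded
print(at_least)       # ['T2', 'T3']  -- T2 included
```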

In another example, in comparing notes on teachers for whom Buddin and Smith had different scores, they discovered that each had processed test results for English Language Learner (ELL) students in a different way. Buddin had boosted the ELL students' test scores in order to compensate for classes with numerous ELL students. Without the boost, it would have been unfair to compare teachers with no ELL students to those who had a significant number (the boost raised those teachers' scores). Smith realized that Buddin was correct on this point and changed his own methodology. Smith also discovered that he had mistakenly included in the pool some teachers who had transferred to middle school, beyond the scope of the study, and removed them.
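
(The case does not say how Buddin computed the boost. Purely as a toy illustration of the idea, the sketch below offsets ELL students' scores by the average ELL/non-ELL gap so that classes with many ELL students are not penalized; the mechanism and numbers are assumptions, not Buddin's method.)

```python
# Toy illustration of an ELL score adjustment (assumed mechanism).
import numpy as np

rng = np.random.default_rng(2)
ell = rng.integers(0, 2, 200)                       # hypothetical ELL flags
score = rng.normal(320, 30, 200) - 20 * ell         # simulated ELL score gap

gap = score[ell == 0].mean() - score[ell == 1].mean()
adjusted = score + gap * ell                        # boost ELL scores by the gap

print(f"raw gap: {gap:.1f}, adjusted gap: "
      f"{adjusted[ell == 0].mean() - adjusted[ell == 1].mean():.1f}")
```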

With these major problems resolved, the project looked like it had a good chance of working. So on February 14, 2010, Lauter and his team set a budget that included additional compensation for Buddin. Importantly, on the same day Lauter secured Editor Stanton's agreement that if the value-added analysis did not generate solid results, the Times would not feel obligated to run a story. That meant considerable risk: the paper might invest substantial resources and nevertheless hit a wall. But Stanton was willing to take the risk. They would publish only if the product was credible.

More Reality Checks. By late March, Buddin completed the teacher ratings. But how to test them in the real world? Editors Marquis and Shuster came up with a solid idea, says AME Lauter:

We started talking about how are we going to test this? What we early on decided was, if we could find teachers who were clearly disparate in their ratings [yet] who were teaching very similar kids at similar schools, that that would illustrate how the method works in a way that's very intuitively understandable for people… So we started looking for those sort of matched pairs of teachers.

Once they identified the pairs, reporters Felch and Song during April and May 2010 visited schools to observe and conduct interviews. They observed more than 50 elementary school teachers in over a dozen schools. They found that Buddin's ratings matched their reporting on the ground. Song, for one, was reassured. He had never been fully confident that value-added analysis would produce more than theoretical results: "I wasn't sure whether it was going to work, whether the value-added system even had any kind of relevance to reality." [10] He was glad to see it did.

Data Analysis Editor Smith, too, was satisfied that the value-added model was generating reliable results. But the ratings themselves disturbed him. Most teachers, about 80 percent, received average ratings; another 10 percent were rated highly effective. But fully 10 percent were rated highly ineffective. Smith and Buddin found that students of these teachers dropped seven to 15 percentage points in their test scores. Smith says:

I got sick in my stomach looking at what happened to the students who had the worst teachers. The differences were assaultive. I mean, it wasn't on a gray scale. These students that got the worst teachers were diving. [11]

With these results, AME Lauter and his group of editors and reporters believed more strongly than ever that it was their responsibility as journalists to get this information to the public. Still, the question remained: should they name individual teachers?



[8] For more detail, see: http://www.latimes.com/media/acrobat/2010-08/55538493.pdf

[9] Excerpt from Doug Smith's email to author, June 14, 2011.

[10] Author's interview with Jason Song on March 28, 2011, in Los Angeles. All further quotes from Song, unless otherwise attributed, are from this interview.

[11] Author's interview with Doug Smith on March 30, 2011, in Los Angeles. All further quotes from Smith, unless otherwise attributed, are from this interview.