May 17, 2002


“Accountability” Versus Science in the Bilingual Education Debate

By James Crawford

How should we judge the success of bilingual education, structured English immersion, and other programs for English language learners? Many members of the public, and most journalists, seem to rely primarily on standardized tests of student achievement. Thanks to the “accountability” movement, scores from a growing number of tests—reported by district, school, grade, and numerous demographic categories—are easily accessible via the Internet. For those interested in educational issues, the temptation to download and analyze these numbers can be irresistible. That’s especially true when the experts are divided about what works. It seems that research evidence can always be cited to support one conflicting theory or another. Frustrated laypersons tend to ask: Why not draw our own conclusions based on “real world” test results?

Following this logic, the Boston Globe recently editorialized about the need to “reform” bilingual education in Massachusetts. As proof it cited data from the MCAS, a state-mandated test, showing “that children with limited English proficiency failed at more than three times the rate of other students.” Such a disparity must mean that something is terribly wrong with the way these children are being taught, the newspaper concluded. What could be more obvious?

Unfortunately, the meaning of test scores is seldom transparent. In this case it is especially clouded by language, so much so that the Globe editors might want to reconsider their reasoning. The MCAS is a test designed to assess the academic skills of English speakers and it is administered entirely in English. Children who do not understand the language of the test will have trouble, to varying degrees, in showing what they have learned. If limited-English-proficient (LEP) students scored well on the MCAS, it would be reasonable to conclude that something was terribly wrong with the test, or that these students were no longer LEP. In other words, the MCAS is the wrong yardstick for measuring the academic achievement of English language learners and for evaluating programs that serve them.

In fairness, the Globe’s mistake is hardly unique. It is one of countless examples from the bilingual education debate, in which journalists (not to mention advocates) believe themselves qualified to practice social science without a license. People who would never presume to challenge the findings of medical researchers or physicists or even meteorologists seem to give little credence to the experts when it comes to education. If the proof of the pudding is in the eating, they reason, the proof of a pedagogical approach must be in the test-taking. Or, as one opponent of bilingual education puts it, “Reality trumps theory.”

What accounts for this trend and how is it likely to affect education policy? To answer these questions, we need to consider the political context.

Holding schools accountable for student performance sounds like a fine idea to the average taxpayer. So does relying on “scientifically based research” —rather than, say, the latest fad— to guide educational programs and policies. Such matters used to be seen as judgment calls, best left to professional educators and local school boards.

But in recent years, public trust in our educational system has been eroded. A steady stream of negative reports, flowing from policy centers and media outlets, has convinced a majority of Americans that the public schools are in trouble. This, in turn, has produced a bipartisan tide of reform, aimed at perceived mismanagement and misguided methodologies in the classroom, culminating in the No Child Left Behind Act passed earlier this year.

The Elementary and Secondary Education Act, which had defined (and limited) the federal role in this field since 1965, has been mostly dismantled. In its place, the new law creates a command structure for American education founded on the twin priorities of accountability and science. Along with a modest increase in federal funding, it authorizes an unprecedented expansion of federal power. A mandate for the annual testing of students in grades 3 through 8 is designed to provide the necessary leverage.

To enhance the accountability of state and local education agencies, the legislation introduces an array of management controls: planning, reporting, deadlines, assurances, high standards, measurable goals, progress indicators, financial rewards, corrective actions, and of course, sanctions. Success or failure under this regime will be gauged almost entirely on the basis of test scores, raw and unadjusted for social and resource inequities. Holding schools accountable will mean accepting no excuses. To legitimize this “high stakes” system, test results are given enormous weight and credibility.

Simultaneously, No Child Left Behind will require all recipients of federal funding —that is to say, virtually every school district in the United States— to employ scientific “research-based” instructional methodologies, classroom materials, academic assessments, teacher training, and remedial tutoring, as well as anti-drug, school safety, dropout prevention, gifted-and-talented, parent involvement, English language learner, and Indian education programs. Experts will need to back up their claims with hard evidence from scientific research before they will be authorized to design programs or train teachers. To qualify as scientific, research will have to be rigorous, empirical, systematic, objective, experimental, replicable, and peer-reviewed —in other words, highly controlled to ensure validity and relevance.

It would be hard to overstate the magnitude of these changes, at least on paper. From an educational perspective, one thing is clear: the goals of quick accountability and rigorous science are on a collision course, with the potential to do serious harm. Indeed, the crash is already under way.

A campaign to replace bilingual education with all-English “immersion” programs, approved by voters in California (1998) and Arizona (2000), continues to spread. Similar ballot initiatives are being organized this year in Massachusetts and Colorado, generating bitter debate and substantial media interest. One point of contention, not surprisingly, has been the impact of the first of these measures, California’s Proposition 227. As usual, test scores are at the center of the controversy.

Ron Unz, the Silicon Valley millionaire behind these initiatives, says his campaign has been vindicated by English learners’ performance on the Stanford 9 achievement test. As he argued in a debate at Harvard University last fall:

“The facts are now in. The largest controlled educational experiment in the history of the world took place a few years ago involving over a million students in California who were largely shifted away from bilingual education to English immersion... The average test scores of over a million immigrant students have gone up by 50 percent in less than three years. Those school districts that most strictly followed the initiative and got rid of their bilingual programs doubled their test scores in three years. Don’t believe me. Believe the New York Times, the Washington Post, CBS News—every major media source. The war is over. Or at least it should be over if academics were willing to look at the reality of the world rather than at their own research.”

In response, Harvard professor Catherine Snow noted that not a single expert in language education or psychometrics has endorsed this interpretation of the Stanford 9 results. Why, then, should we take the journalists’ word for it? Since Unz describes himself as “a theoretical physicist by training” (he dropped out of graduate school to pursue a political career), Snow wondered whether, on the basis of media reports, he also believes in “cold fusion.” Unz countered:

“I think academics should look at the reality of the world rather than at theories published in a lot of books, which may or may not be correct. ... Reality trumps theory. Theory cannot defeat reality. You really have to ask yourself whether you believe the reality of your own senses, the test scores of a million immigrant students, or four or five books written by some professors at Harvard.”

Reality versus theory, experience versus books, a million immigrant children versus a few elite academics... which side are you on? Unz may not have made it as a physicist, but he deserves a doctorate in demagoguery. Judging by the public reaction, his tactics seem to be working.

The most influential account appeared in the New York Times, which highlighted “striking rates” of improvement for English learners on the Stanford 9 “after Californians voted to end bilingual education.” In particular, the article contrasted scores in the Oceanside Unified School District, which Unz had hailed as a showcase for English immersion, with those in a neighboring district where bilingual programs continued in some schools.

“In Oceanside, the average score of third graders who primarily speak Spanish improved by 11 percentage points in reading over the last two years, to the 22nd percentile; in Vista, the gain was a more modest 5 percentage points, to the 18th percentile. In fifth grade in Oceanside, limited English speakers gained 10 percentage points in reading, with the average in the 19th percentile; in Vista, there was no increase, the average of limited English speakers staying flat, in the 12th percentile.”

One educational researcher, Kenji Hakuta of Stanford University, was quoted briefly by the Times, cautioning that no scientific conclusions about Proposition 227 could be drawn from these data. But it was Unz’s interpretation that received the lion’s share of attention. “The test scores these last two years have risen, and risen dramatically,” he said. “Something has gone tremendously right for immigrants being educated in California.”

As it happened, Hakuta and some colleagues had already conducted an extensive analysis of the California test results, which went unmentioned in the Times article. Their conclusion: “Scores rose for all students, and in no clear pattern that could be attributable to Proposition 227.”

Among numerous possible explanations for the pattern of rising scores, Hakuta cited a California initiative to reduce class size in the early grades, a movement toward higher standards and accountability, and more effective preparation for the Stanford 9 as teachers become more familiar with the test. This last factor may be especially important, since 1998—the year before Proposition 227 took effect—was the first year of California’s statewide testing program. As Krashen explains, “Typical test score inflation is about 1.5 to 2 points per year, which accounts for a great deal of the gains seen in grades 2-6 in California,” both for English learners and for English-proficient students.

But what about Oceanside? English learners’ scores did rise substantially there in the two years after the initiative passed. Could this mean that English immersion is working “miracles” in that district? Based on the Stanford 9 results, it is impossible to say. Numerous other explanations are equally plausible, however, and they have nothing to do with the dismantling of bilingual education.

First, the school district started from a dismal position in 1998, with reading scores sliding from the 12th percentile (grade 2) to the 6th percentile (grade 5) to the 2nd percentile (grade 10), well below statewide averages for English learners. With intensive test preparation, such results can be improved significantly. Second, Oceanside superintendent Ken Noonan has reported that, before Proposition 227, these students were taught entirely in Spanish for the first four years, sometimes longer. With no substantial exposure to English in the classroom, it’s no wonder they did so poorly on English-language tests.

Finally, there’s the statistical phenomenon of “regression to the mean.” As Hakuta notes, “Oceanside finally managed to drag its test scores from rock bottom up to the statewide average for EL students. This is not a story about excellence, hardly a miracle.”
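Regression to the mean is easy to demonstrate with a small simulation. The sketch below is purely illustrative and uses no California data: it assumes each hypothetical district has a stable underlying level plus random year-to-year noise, with no change at all in instruction. The function name, the score scale, and the noise levels are assumptions made for the example.

```python
import random
from statistics import mean

def simulate_bottom_decile(n_districts=1000, seed=0):
    """Return (year 1 mean, year 2 mean) for the districts that scored
    in the bottom tenth in year 1. All numbers are hypothetical."""
    rng = random.Random(seed)
    # Stable underlying level for each district (arbitrary score units).
    true_level = [rng.gauss(50, 10) for _ in range(n_districts)]
    # Each year's observed score is the true level plus independent noise.
    year1 = [t + rng.gauss(0, 10) for t in true_level]
    year2 = [t + rng.gauss(0, 10) for t in true_level]
    # Pick the districts with the lowest observed scores in year 1.
    cutoff = n_districts // 10
    bottom = sorted(range(n_districts), key=lambda i: year1[i])[:cutoff]
    return mean(year1[i] for i in bottom), mean(year2[i] for i in bottom)

bottom_year1, bottom_year2 = simulate_bottom_decile()
print(f"Bottom decile, year 1: {bottom_year1:.1f}")
print(f"Same districts, year 2: {bottom_year2:.1f}")
```

Under these assumptions the lowest-scoring districts post a sizable "gain" in the second year even though nothing about their programs has changed. They were selected partly because of bad luck in the first year, and bad luck does not usually repeat.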

In 2001, the reading scores for Oceanside’s English learners leveled off in grades 2 and 3 and declined in grades 4 through 9 and 11. The only “good news” was posted in grade 10: an increase from the 6th to the 7th percentile. What does this all mean? Probably not very much, just another regression to the mean, as Hakuta argues. But it does highlight another reason to dismiss the Stanford 9 scores for English learners as mostly meaningless. No confidence can be placed in these year-to-year comparisons because each year’s scores come from a different group of students, who may differ in ways that cannot be statistically controlled.

Moreover, the English learner category itself is constantly changing. Students enter and exit at varying rates, depending on how much English they arrive with and how long it takes them to be “redesignated” as fully fluent in the language. Naturally those who have acquired more English will do better on English-language achievement tests than those who have acquired less English. This means that when students are redesignated as fluent in English, their scores are no longer counted in the LEP group, usually lowering the overall average. In effect, a district’s successes in teaching English count against it when Stanford 9 scores are calculated, even if students are doing well. Because there is no statewide gauge of English proficiency (criteria and procedures vary considerably among California districts), test results for English learners can be easily manipulated. For example, to boost average scores on the Stanford 9, all a clever administrator would need to do is slow down the official redesignation of students as fluent in English, which would automatically retain many high-scorers in the LEP category. That stratagem could well account for Oceanside’s “striking” performance reported in 2000. That year only 4 percent of these students were deemed ready for the mainstream, about half the statewide average. Then, in 2001, test scores fell off as the district’s redesignation rate jumped to nearly 18 percent.
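The arithmetic behind this composition effect can be sketched in a few lines. The example is hypothetical: the 50 scores are invented, and it makes the simplifying assumption that the students redesignated as fluent are exactly the highest scorers. Only the two redesignation rates, 4 percent and 18 percent, come from the Oceanside figures cited above.

```python
from statistics import mean

def lep_group_average(scores, redesignation_rate):
    """Mean score of the students still counted as LEP after the top
    scorers (a share equal to redesignation_rate) are redesignated as
    fluent and dropped from the group. Simplifying assumption: the
    highest scorers are the ones who exit."""
    ranked = sorted(scores)
    n_exit = round(len(ranked) * redesignation_rate)
    remaining = ranked[:len(ranked) - n_exit]
    return mean(remaining)

# Invented scores for 50 English learners: 1, 2, ..., 50.
scores = list(range(1, 51))

slow = lep_group_average(scores, 0.04)  # 2 students exit  -> 24.5
fast = lep_group_average(scores, 0.18)  # 9 students exit  -> 21
print(f"LEP average at  4% redesignation: {slow}")
print(f"LEP average at 18% redesignation: {fast}")
```

The same 50 students, with the same scores, yield a reported LEP average of 24.5 or 21 depending only on how many high scorers are moved out of the category. No student has learned more or less in either case.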

To sum up, there are excellent reasons to suspect that rising scores for English learners in Oceanside and other California districts have more to do with extraneous factors than with what is happening in the classroom. But who knows? Available data are insufficient to prove, or disprove, any hypothesis about the impact of Proposition 227 on English learners’ achievement. What is needed is a truly controlled educational experiment that tracks the academic progress of individual children over several years. The California legislature has authorized a study along these lines, with interim results due to be reported soon. Unfortunately, in drawing conclusions about student outcomes, the researchers are relying on a single assessment tool: the Stanford 9.

None of this has inspired the New York Times or other media to reassess their verdict. According to the conventional wisdom, Proposition 227 remains a great success. Meanwhile, Ron Unz is urging voters and legislators elsewhere to follow California’s lead—in fact, to go even further—in mandating English-only instruction. Bilingual programs could be restricted or banned outright in numerous states, largely on the basis of claims about the Stanford 9. A high-stakes test indeed.

James Crawford, Independent Scholar. From the Language Policy Research Unit (LPRU) and the Education Policy Studies Laboratory (EPSL) at Arizona State University, April 30, 2002.
