Value Added Measures: What they are and what they’re not

Value Added Measures, also referred to as VAM, and sometimes referred to as Value Added Modeling or Value Added Assessments, has made it’s way to Seattle. It has been pushed by Bill Gates and other ed reformers who are not educated in the field of education or for that matter in mathematical formulas and their meanings.

In our state, this evaluation system was heralded by the League of Education Voters (LEV) and the Washington State PTA (WSPTA) as a way to judge “teacher effectiveness”, a big item on the agenda of the ed reformers. Legislation was pushed through by legislators with the lobbying efforts of LEV, Stand for Children (SFC) and WSPTA in Olympia and praised as a success by LEV when the legislation passed.

The reason for this push is to basically devalue the idea of seniority and place the emphasis of success or failure of our schools squarely on the shoulders of teachers rather than an entire set of circumstance that are not in their control. What it has done is dumb down the curriculum to the point where the focus in the classroom is on test preparation on a narrow scope of knowledge in the subjects of math and English.

These same zealots, including Mayor Bloomberg at the time, began to publish test score evaluations of teachers in newspapers in Los Angeles and New York with disastrous results. The publishing of these test scores was even applauded by the Secretary of Education, Arne Duncan. One teacher in New York City was called out as the worst teacher in the city with her photo published on the cover of one of these rags. She was highly regarded by her principal and colleagues but humiliated publicly. One young teacher in Los Angeles committed suicide after his test scores were published. His family and friends believe that much of it had to do with his evaluation based on student test scores. See A teacher pushed to the edge.

For other articles on the subject of these witch hunts see: Carolyn Abbott, The Worst 8th Grade Math Teacher In New York City, Victim Of Her Own Success, These Are The Worst Teachers In New York City and The True Story of Pascale Mauclair.

For a simple breakdown of VAM, I would recommend this video.

For a more scholarly description there is Mathematical Intimidation: Driven by the Data written by mathematician John Ewing:

Mathematicians occasionally worry about the misuse of their subject. G. H. Hardy famously wrote about mathematics used for war in his autobiography, A Mathematician’s Apology (and solidified his reputation as a foe of applied mathematics in doing so). More recently, groups of mathematicians tried to organize a boycott of the Star Wars [missile defense] project on the grounds that it was an abuse of mathematics. And even more recently some fretted about the role of mathematics in the financial meltdown.

But the most common misuse of mathematics is simpler, more pervasive, and (alas) more insidious: mathematics employed as a rhetorical weapon—an intellectual credential to convince the public that an idea or a process is “objective” and hence better than other competing ideas or processes. This is mathematical intimidation. It is especially persuasive because so many people are awed by mathematics and yet do not understand it—a dangerous combination.

The latest instance of the phenomenon is valued-added modeling (VAM), used to interpret test data. Value-added modeling pops up everywhere today, from newspapers to television to political campaigns. VAM is heavily promoted with unbridled and uncritical enthusiasm by the press, by politicians, and even by (some) educational experts, and it is touted as the modern, “scientific” way to measure educational success in everything from charter schools to individual teachers.

Yet most of those promoting value-added modeling are ill-equipped to judge either its effectiveness or its limitations. Some of those who are equipped make extravagant claims without much detail, reassuring us that someone has checked into our concerns and we shouldn’t worry. Value-added modeling is promoted because it has the right pedigree — because it is based on “sophisticated mathematics.”As a consequence, mathematics that ought to be used to illuminate ends up being used to intimidate. When that happens, mathematicians have a responsibility to speak out.

Background

Value-added models are all about tests—standardized tests that have become ubiquitous in K–12 education in the past few decades. These tests have been around for many years, but their scale, scope, and potential utility have changed dramatically.

Fifty years ago, at a few key points in their education, schoolchildren would bring home a piece of paper that showed academic achievement, usually with a percentile score showing where they landed among a large group. Parents could take pride in their child’s progress (or fret over its lack); teachers could sort students into those who excelled and those who needed remediation; students could make plans for higher education.

Today, tests have more consequences. “No Child Left Behind” mandated that tests in reading and mathematics be administered in grades 3–8. Often more tests are given in high school, including high-stakes tests for graduation.

With all that accumulating data, it was inevitable that people would want to use tests to evaluate everything educational—not merely teachers, schools, and entire states but also new curricula, teacher training programs, or teacher selection criteria. Are the new standards better than the old? Are experienced teachers better than novice? Do teachers need to know the content they teach?

Using data from tests to answer such questions is part of the current “student achievement” ethos—the belief that the goal of education is to produce high test scores. But it is also part of a broader trend in modern society to place a higher value on numerical (objective) measurements than verbal (subjective) evidence. But using tests to evaluate teachers, schools, or programs has many problems. (For a readable and comprehensive account, see [Koretz 2008].) Here are four of the most important problems, taken from a much longer list.

1. Influences. Test scores are affected by many factors, including the incoming levels of achievement, the influence of previous teachers, the attitudes of peers, and parental support. One cannot immediately separate the influence of a particular teacher or program among all those variables.

2. Polls. Like polls, tests are only samples. They cover only a small selection of material from a larger domain. A student’s score is meant to represent how much has been learned on all material, but tests (like polls) can be misleading.

3. Intangibles. Tests (especially multiple-choice tests) measure the learning of facts and procedures rather than the many other goals of teaching. Attitude, engagement, and the ability to learn further on one’s own are difficult to measure with tests. In some cases, these “intangible” goals may be more important than those measured by tests. (The father of modern standardized testing, E. F. Lindquist, wrote eloquently about this [Lindquist 1951]; a synopsis of his comments can be found in [Koretz 2008, 37].)

4. Inflation. Test scores can be increased without increasing student learning. This assertion has been convincingly demonstrated, but it is widely ignored by many in the education establishment [Koretz 2008, chap. 10]. In fact, the assertion should not be surprising. Every teacher knows that providing strategies for test-taking can improve student performance and that narrowing the curriculum to conform precisely to the test (“teaching to the test”) can have an even greater effect. The evidence shows that these effects can be substantial: One can dramatically increase test scores while at the same time actually decreasing student learning. “Test scores” are not the same as “student achievement.”

This last problem plays a larger role as the stakes increase. This is often referred to as Campbell’s Law: “The more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to measure” [Campbell 1976]. In its simplest form, this can mean that high-stakes tests are likely to induce some people (students, teachers, or administrators) to cheat … and they do [Gabriel 2010].

But the more common consequence of Campbell’s Law is a distortion of the education experience, ignoring things that are not tested (for example, student engagement and attitude) and concentrating on precisely those things that are.

Value-Added Models

In the past two decades, a group of statisticians has focused on addressing the first of these four problems. This was natural. Mathematicians routinely create models for complicated systems that are similar to a large collection of students and teachers with many factors affecting individual outcomes over time.

Here’s a typical, although simplified, example, called the “split-plot design.” You want to test fertilizer on a number of different varieties of some crop. You have many plots, each divided into subplots. After assigning particular varieties to each subplot and randomly assigning levels of fertilizer to each whole plot, you can then sit back and watch how the plants grow as you apply the fertilizer. The task is to determine the effect of the fertilizer on growth, distinguishing it from the effects from the different varieties. Statisticians have developed standard mathematical tools (mixed models) to do this.

Does this situation sound familiar? Varieties, plots, fertilizer …students, classrooms, teachers?

Dozens of similar situations arise in many areas, from agriculture to MRI analysis, always with the same basic ingredients—a mixture of fixed and random effects—and it is therefore not surprising that statisticians suggested using mixed models to analyze test data and determine “teacher effects.”

This is often explained to the public by analogy. One cannot accurately measure the quality of a teacher merely by looking at the scores on a single test at the end of a school year. If one teacher starts with all poorly prepared students, while another starts with all excellent, we would be misled by scores from a single test given to each class.

To account for such differences, we might use two tests, comparing scores from the end of one year to the next. The focus is on how much the scores increase rather than the scores themselves. That’s the basic idea behind “value added.” But value-added models (VAMs) are much more than merely comparing successive test scores.

Given many scores (say, grades 3–8) for many students with many teachers at many schools, one creates a mixed model for this complicated situation. The model is supposed to take into account all the factors that might influence test results — past history of the student, socioeconomic status, and so forth. The aim is to predict, based on all these past factors, the growth in test scores for students taught by a particular teacher. The actual change represents this more sophisticated “value added”— good when it’s larger than expected; bad when it’s smaller.

The best-known VAM, devised by William Sanders, is a mixed model (actually, several models), which is based on Henderson’s mixed-model equations, although mixed models originate much earlier [Sanders 1997]. One calculates (a huge computational effort!) the best linear unbiased predictors for the effects of teachers on scores. The precise details are unimportant here, but the process is similar to all mathematical modeling, with underlying assumptions and a number of choices in the model’s construction.

History

When value-added models were first conceived, even their most ardent supporters cautioned about their use [Sanders 1995, abstract]. They were a new tool that allowed us to make sense of mountains of data, using mathematics in the same way it was used to understand the growth of crops or the effects of a drug. But that tool was based on a statistical model, and inferences about individual teachers might not be valid, either because of faulty assumptions or because of normal (andexpected) variation.

Such cautions were qualified, however, and one can see the roots of the modern embrace of VAMs in two juxtaposed quotes from William Sanders, the father of the value-added movement, which appeared in an article in Teacher Magazine in the year 2000. The article’s author reiterates the familiar cautions about VAMs, yet in the next paragraphseems to forget them:

Sanders has always said that scores for individual teachers should not be released publicly. “That would be totally inappropriate,” he says. “This is about trying to improve our schools, not embarrassing teachers. If their scores were made available, it would create chaos because most parents would be trying to get their kids into the same classroom.”

Still, Sanders says, it’s critical that ineffective teachers be identified. “The evidence is overwhelming,” he says, “that if any child catches two very weak teachers in a row, unless there is a major intervention, that kid never recovers from it. And that’s something that as a society we can’t ignore” [Hill 2000].

Over the past decade, such cautions about VAM slowly evaporated, especially in the popular press. A 2004 article in The School Administrator complains that there have not been ways to evaluate teachers in the past but excitedly touts value added as a solution:

“Fortunately, significant help is available in the form of a relatively new tool known as value-added assessment. Because value-added isolates the impact of instruction on student learning, it provides detailed information at the classroom level. Its rich diagnostic data can be used to improve teaching and student learning. It can be the basis for a needed improvement in the calculation of adequate yearly progress. In time, once teachers and administrators grow comfortable with its fairness, value-added also may serve as the foundation for an accountability system at the level of individual educators [Hershberg 2004, 1].”

And newspapers such as The Los Angeles Times get their hands on seven years of test scores for students in the L.A. schools and then publish a series of exposés about teachers, based on a value-added analysis of test data, which was performed under contract [Felch 2010]. The article explains its methodology:

“The Times used a statistical approach known as value-added analysis, which rates teachers based on their students’ progress on standardized tests from year to year. Each student’s performance is compared with his or her own in past years, which largely controls for outside influences often blamed for academic failure: poverty, prior learning and other factors.

Though controversial among teachers and others, the method has been increasingly embraced by education leaders and policymakers across the country, including the Obama administration.”

It goes on to draw many conclusions, including:

“Many of the factors commonly assumed to be important to teachers’ effectiveness were not. Although teachers are paid more for experience, education and training, none of this had much bearing on whether they improved their students’ performance.”

The writer adds the now-common dismissal of any concerns:

“No one suggests using value-added analysis as the sole measure of a teacher. Many experts recommend that it count for half or less of a teacher’s overall evaluation.

“Nevertheless, value-added analysis offers the closest thing available to an objective assessment of teachers. And it might help in resolving the greater mystery of what makes for effective teaching, and whether such skills can be taught.”

The article goes on to do exactly what it says “no one suggests” — it measures teachers solely on the basis of their value-added scores.

What Might Be Wrong with VAM?

As the popular press promoted value-added models with ever-increasing zeal, there was a parallel, much less visible scholarly conversation about the limitations of value-added models. In 2003 a book with the title Evaluating Value-Added Models for Teacher Accountability laid out some of the problems and concluded:

“The research base is currently insufficient to support the use of VAM for high-stakes decisions. We have identified numerous possible sources of error in teacher effects and any attempt to use VAM estimates for high-stakes decisions must be informed by an understanding of these potential errors [McCaffrey 2003, xx].”

In the next few years, a number of scholarly papers and reports raising concerns were published, including papers with such titles as “The Promise and Peril of Using Valued-Added Modeling to Measure Teacher Effectiveness” [RAND, 2004], “Re-Examining the Role of Teacher Quality in the Educational Production Function” [Koedel 2007], and “Methodological Concerns about the Education Value-Added Assessment System” [Amrein-Beardsley 2008].

What were the concerns in these papers? Here is a sample that hints at the complexity of issues.

• In the real world of schools, data is frequently missing or corrupt. What if students are missing past test data? What if past data was recorded incorrectly (not rare in schools)? What if students transferred into the school from outside the system?

• The modern classroom is more variable than people imagine. What if students are team-taught? How do you apportion credit or blame among various teachers? Do teachers in one class (say mathematics) affect the learning in another (say science)?

• Every mathematical model in sociology has to make rules, and they sometimes seem arbitrary. For example, what if students move into a class during the year? (Rule: Include them if they are in class for 150 or more days.) What if we only have a couple years of test data, or possibly more than five years? (Rule: The range three to five years is fixed for all models.) What’s the rationale for these kinds of rules?

• Class sizes differ in modern schools, and the nature of the model means there will be more variability for small classes. (Think of a class of one student.) Adjusting for this will necessarily drive teacher effects for small classes toward the mean. How does one adjust sensibly?

• While the basic idea underlying value-added models is the same, there are in fact many models. Do different models applied to the same data sets produce the same results? Are value-added models “robust”?

•Since models are applied to longitudinal data sequentially, it is essential to ask whether the results are consistent year to year. Are the computed teacher effects comparable over successive years for individual teachers? Are value-added models “consistent”?

To read the complete paper, go to Mathematical Intimidation: Driven by the Data.

John Ewing is president of Math for America, a nonprofit organization dedicated to improving mathematics education in U.S. public high schools by recruiting, training and retaining great teachers. The article originally appeared in the May Notices of the American Mathematics Society.

Another take on the subject is from Diane Ravitch and titled The Problems With Value-Added Assessment.

Dora

3 thoughts on “Value Added Measures: What they are and what they’re not”

seattleducation2011 says:

June 6, 2012 at 7:16 PM

I have verified it. The school district principals recently had a training session on VAM.

It will be coming soon to your neighborhood school.

Dora

seattleducation2011 says:

June 6, 2012 at 9:48 AM

Thanks for the clarification. The reason that I thought it was VAM was that it was referred to by a principal within SPS. I will check my information.

Dora

chungatest says:

June 6, 2012 at 9:37 AM

Excellent background on VAM.

I don’t believe the Seattle eval process or WA evaluation legislation (based on TPEP program) use VAM. They use student growth measures, but not the statistical technique of VAM specifically. VAM is designed to specifically isolate the contribution of the teacher on student growth on test scores (it largely fails in this, but that’s what it’s designed to do). Seattle currently does and WA will (with the new legislation that recently passed) look at student growth on test scores, but will NOT use VAM to try to determine what impact the teacher had. Instead, they’ll presumably use evaluation guidelines and professional judgement. This is likely even WORSE than VAM since the growth measures are not even designed to determine the teacher’s impact and yet high-stakes are associated with it. For more on the difference between VAM and student growth measures, refer to the excellent article by Bruce Baker http://schoolfinance101.wordpress.com/2012/04/28/if-its-not-valid-reliability-doesnt-matter-so-much-more-on-vam-ing-sgp-ing-teacher-dismissal/