Monday, February 20, 2012

Value-added measures don't measure up for Evaluation


Value-added measures don't make a good foundation for a teacher evaluation system.

A comment on a Joanne Jacobs article:
VAM measures the amount of improvement your students make. There are a number of ways to do this. Some of the early VAM methods were highly unstable. More sophisticated methods do seem to hold up well from year to year and also correlate with positive long term outcomes such as lower teen pregnancy rates and better education and employment as adults.

And here I thought I was supposed to be teaching math.

I don’t believe VA is anything on which to base a bonus or a termination. “Seem to correlate” does not mean “cause” … and that’s for the best measurements.

What of all the poor ones? “Some of the early VAM methods were highly unstable.” ("Unstable" is a charitable term for "any resemblance to a consistent reality is neither implied nor intended.")

It means that the results cannot be trusted for grading the student who took the test (that's stated plainly and explicitly in the administrator's notes), and it means that the tests are even worse at evaluating the teacher who didn't take them.

There are many issues with any kind of testing. What exactly are we supposed to be teaching, and what results do we want out of it? What will we consider a success? Do the tests measure what we think they're measuring, and do the results reflect what the student actually knows?

I am given a curriculum that I am to follow. The test is written for a different curriculum. Don't judge me based on something you tell me not to use.

Then there's accuracy and repeatability. Use a ruler and you get the same height every time - that's data you can trust. Give students the same test a second time and they will score differently. Give the same essay to ten scorers and you'll get ten different scores. Read Making the Grades for a nasty dose of testing realism. Since the scoring of these tests is so "unstable", evaluations shouldn't be based on the results.
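To make the repeatability point concrete, here is a minimal sketch with made-up numbers - the rubric scale, the amount of rater noise, and the helper name scorer_reading are all assumptions for illustration, not figures from any real scoring study. The same essay, read by ten scorers, comes back with several different scores.

```python
import random

random.seed(1)

# One essay whose "true" quality is a 4 on a 6-point rubric,
# read by ten scorers who each see it a little differently.
TRUE_SCORE = 4

def scorer_reading(true_score: int) -> int:
    """One scorer's reading: the true score plus a point of rater noise, clamped to 1-6."""
    noisy = true_score + random.choice([-1, 0, 0, 1])
    return max(1, min(6, noisy))

scores = [scorer_reading(TRUE_SCORE) for _ in range(10)]
print(scores)                  # same essay, ten readings, several different scores
print(min(scores), max(scores))
```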

This graphic to the right cleverly pretends that measuring a child's height is exactly analogous to measuring his grade level.  Unfortunately, the accuracy possible in the one is not possible in the other. I would note with some amusement that the books he's standing on make even that height measurement into an exercise in systematic error.

Then the tests claim to be able to discern differences of a fraction of a grade level, but the random error in such a measurement is a full grade level or more. The test-to-test change on one of the best-known measurement systems, the SAT, can be as high as 100 points. They don't report a single score; they report a range (520-540). The scale itself spans only 600 points (200 to 800), and the variation is 100 points. Now imagine the variation on your typical state test.
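For a sense of scale, here is a back-of-the-envelope sketch. The standard error of measurement (SEM) and the observed score are assumed numbers for illustration, not figures from any testing company's technical manual; the point is only that a plausible band around one observed score can cover a fifth of the whole scale, which makes "fractions of a grade level" look like false precision.

```python
# Illustrative only -- sem and observed are assumptions, not published values.
sem = 30            # assumed standard error of measurement, in scale points
observed = 530      # one observed score on a 200-800 scale

low, high = observed - 2 * sem, observed + 2 * sem   # rough 95% band
print(f"observed {observed}, plausible band about {low}-{high}")      # 470-590

scale_span = 800 - 200
print(f"band width {high - low} points out of a {scale_span}-point scale")  # 120 of 600
```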

States routinely tell the testing company to instruct the scorers that the averages HAVE to fall in a certain range - any scoring that runs counter to that pre-determined result is wrong. As Todd Farley describes it, accuracy is a fantasy.

It doesn't make sense to evaluate me based on a test given to a fifteen-year-old kid who has only had me for a short while, who has failed again and again, who has attendance "issues", who's strung out on something ("self-medicated") - a test that pretends to accuracy but fails miserably at it and is rarely aligned to the curriculum I've been required to follow.

What about Value-Added?
Just the basic premise that you can differentiate teachers based on VAM is flawed. If I have a group of students that improves a lot this year but a different group that doesn't do as well next year, are we to assume that I've been slacking off and just need a goad, a little taste of the whip, to perform better? Or should we assume that my teaching is so variable that I can be bad, then great, then merely good?
If my students improve from a grade-level equivalency of 4 to 8 in one year (even though no test can honestly make that claim with any accuracy) and my colleague raises his students from 10.2 to 11.3, which of us has done a better job? I may have convinced them to work harder at the end of the year without actually doing much teaching.

If I have a class with “issues” and they only improve from 9.5 to 9.8, that might be a tremendous leap for them but it wouldn’t show that way to the outside observer.
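To see why the raw-gain comparison is slippery, here is a tiny sketch using the hypothetical classes above. The class labels and the decision to rank by raw gain alone are assumptions for illustration, not anyone's actual value-added formula.

```python
# Hypothetical classes from the examples above (grade-equivalent before/after).
classes = {
    "my class, low start":       (4.0, 8.0),
    "colleague, high start":     (10.2, 11.3),
    "my class with 'issues'":    (9.5, 9.8),
}

for name, (before, after) in classes.items():
    gain = after - before
    print(f"{name}: {before} -> {after}, raw gain {gain:+.1f}")

# Ranked by raw gain alone, the third class looks like a failure and the first
# like a miracle -- with no accounting for where the students started, what they
# were dealt, or whether the grade-equivalent scale even supports this arithmetic.
```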

I find it troubling that we have this blind trust in a standardized testing program.

What are they good for?

VA measures are useful to me in the classroom, provided I get them in a reasonable amount of time, disaggregated so I get detail instead of a vague "You Suck" or "You're Great", and with the high stakes left off.

Selling newspapers is not a good use.


1 comment:

  1. What I especially love is that, at least here in New York State, test scores travel with students. In other words, we are responsible for the test scores of students who arrived in our building a week before the exam - or even after.
