Doug McRae

The Smarter Balanced Assessment Consortium provided a sneak peek for their final computer-adaptive tests in early October, tests to be administered to roughly 25 percent of the country’s grade 3-8 and 11 students in spring 2015 to measure, initially, status and, eventually, growth in achievement on the new Common Core academic standards for English Language Arts and Mathematics. The peek reveals the prospective tests are a work in progress – tests that I believe won’t be ready for prime time until at least spring 2016.

The sneak peek was provided via the process Smarter Balanced is using to determine “cut scores” for test results, or essentially how many test questions a student must answer correctly to be labeled proficient.

Smarter Balanced officials have yet to determine how the achievement categories will be labeled. They have indicated they will have four achievement categories for their results, which for now are just labeled Category 1, 2, 3, 4. For this commentary, I will use the labels below basic, basic, proficient, advanced or even A, B, C, D as substitutes for the concept of achievement categories.

The Smarter Balanced process involves structured judgments for test questions they plan to use in the spring of 2015. Judgments were elicited from volunteers who signed up for a 3-hour online session to review actual test questions and provide a judgment on where a Category 3 or proficient “cut score” should be placed. The results of the online exercise were to be provided to more than 500 teachers and others nominated by 17 states to participate on “in-person” panels in mid-October to undergo formal cut-score-setting exercises for the 14 tests being developed by Smarter Balanced.

The Smarter Balanced process also involved two panels (one each for ELA and Math) to coordinate proposed cut scores across grade levels. Recommended cut scores were to be endorsed by Smarter Balanced member states on Nov. 6, but that approval step has been delayed.

We should be reminded that the actual Smarter Balanced tests for spring 2015 have not yet been finalized. Analyses from the Smarter Balanced field tests that students took in spring 2014, designed primarily to qualify test questions for use in final tests, have not yet been completed.

But the exercise that I participated in did provide a set of test questions that mirrored Smarter Balanced plans for their final tests, and a set of questions that mirrored the proposed balance between multiple-choice (and other test questions that can be scored electronically) and open-ended test questions that are needed to test many of the new Common Core academic standards in depth.

So, with care taken not to disclose any of the secure material involved in the online exercise, what were the observations of this experienced K-12 testing system designer?

I did the online exercise for grade 3 English Language Arts, and for this grade level and content area traditional multiple-choice questions dominated. In fact, 84 percent of the questions were either multiple-choice or “check-the-box” questions that could be electronically scored, and these questions were very similar or identical to traditional “bubble” tests. Only 16 percent of the questions were open-ended questions, which many observers say are needed to measure Common Core standards.

The online exercise used a set of test items with the questions arranged in sequence by order of difficulty, from easy questions to hard questions. The exercise asked the participant to identify the first item in the sequence that a Category 3 or B-minus student would have less than a 50 percent chance to answer correctly. I identified that item after reviewing about 25 percent of the items to be reviewed. If a Category 3 or proficient cut score is set at only 25 percent of the available items or score points for a test that has primarily multiple-choice questions, clearly that cut score invites a strategy of randomly marking the answer sheet. The odds are that if a student uses a random marking strategy, he or she will get a proficient score quite often. This circumstance would result in many random (or invalid and unreliable) scores from the test, and reduce the overall credibility of the entire testing program.
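
The random-marking concern above can be illustrated with a simple binomial calculation. The sketch below is only a rough illustration under stated assumptions: a hypothetical fixed-form test of 40 multiple-choice items, four answer choices per item, and a proficient cut set at 25 percent of the items. None of these figures comes from Smarter Balanced, and an adaptive test scored on a scale would behave differently.

```python
from math import ceil, comb


def prob_at_least(n_items: int, min_correct: int, p_guess: float) -> float:
    """Probability of at least min_correct right answers out of n_items
    when every answer is an independent random guess."""
    return sum(
        comb(n_items, k) * p_guess**k * (1 - p_guess) ** (n_items - k)
        for k in range(min_correct, n_items + 1)
    )


# Hypothetical figures for illustration only (not Smarter Balanced data).
N_ITEMS = 40         # assumed number of multiple-choice items
P_GUESS = 0.25       # assumed four answer choices per item
CUT_FRACTION = 0.25  # proficient cut placed at 25 percent of the items

min_correct = ceil(N_ITEMS * CUT_FRACTION)  # 10 items under these assumptions
p_proficient = prob_at_least(N_ITEMS, min_correct, P_GUESS)
print(f"Chance a pure guesser reaches the cut: {p_proficient:.2f}")
```

Under these assumptions the cut sits at the number of items a pure guesser is expected to get right, so a random marker would reach it in more than half of attempts.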

It troubled me greatly that many of the test questions later in the sequence appeared to be far easier than the item I identified as the item marking a Category 3 or proficient cut score, per the directions for the online exercise. I found at least a quarter of the remaining items to be easier, including a cluster of clearly easier items placed about 2/3 of the way into the entire sequence. This calls into question whether or not the sequence of test questions used by Smarter Balanced was indeed in difficulty order from easy to hard items. If the sequence used was not strictly ordered from easy to hard test questions, then the results of the entire exercise have to be called into serious question.

There were several additional concerns about the Smarter Balanced cut-score-setting exercise this October that are too technical for full discussion in this commentary. Briefly, the exercise appeared not to include any use of “consequence” data that typically is included in a robust cut-score-setting process. Consequence data is estimated information on what percent of students will fall in each performance category, given the cut scores being recommended. I also questioned whether the spring 2014 Smarter Balanced field test data were used to guide the exercise in any significant way. Indeed, since the 2014 Smarter Balanced field test was essentially an item-tryout exercise, an exercise designed to qualify test questions for use in final tests, it did not generate the type of data needed for final cut score determinations in a number of significant ways.
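
To make the idea of consequence data concrete, here is a minimal sketch of how the percent of students in each category could be tabulated for a proposed set of cut scores. The scores, the cut scores, and the function name `consequence_data` are made up for illustration; this is not the procedure Smarter Balanced uses.

```python
from bisect import bisect_right
from collections import Counter


def consequence_data(scores, cut_scores):
    """Return the percent of students in each achievement category.

    cut_scores holds the minimum score for Categories 2, 3 and 4, in
    ascending order; a score below the first cut falls in Category 1.
    """
    counts = Counter(bisect_right(cut_scores, s) + 1 for s in scores)
    n_categories = len(cut_scores) + 1
    return {
        category: 100 * counts[category] / len(scores)
        for category in range(1, n_categories + 1)
    }


# Made-up scores and cut scores, for illustration only.
scores = [12, 18, 22, 25, 27, 30, 31, 34, 38, 41]
cuts = [20, 28, 36]
print(consequence_data(scores, cuts))
# {1: 20.0, 2: 30.0, 3: 30.0, 4: 20.0}
```

A panel shown a table like this can see at once how moving a cut score shifts students between categories.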

Smarter Balanced calls their 2015 test administration an “operational” test. But any operational test needs more than qualified test questions to yield valid scores. It must also have valid scoring rules to generate meaningful scores for students, teachers and parents, and valid aggregate scores for schools, districts and important subgroups of students.

It is quite clear to me that the cut-score-setting exercises conducted by Smarter Balanced this month will not produce final or valid cut scores for timely use with spring 2015 Smarter Balanced tests. Spring 2015 tests will instead be benchmark tests (to use test development parlance), tests that yield data that then can be used to generate valid cut scores. That exercise will have to wait for September 2015 at the earliest. The Smarter Balanced website recognizes this by labeling the cut scores recommended in October 2014 as “preliminary” cut scores, to be validated by spring 2015 data.

California plans to use the cut scores recommended by the panels that met in October for disseminating millions of test scores in spring 2015. These plans are faced with the prospect that those scores will have to be “recalled” and replaced with true or valid scores just months after incorrect scores are disseminated. This is not a pretty picture for any large-scale statewide assessment program.

The bottom line: Smarter Balanced tests are still a work in progress. I think it will be spring 2016 before Smarter Balanced tests will be able to generate valid, meaningful test scores in a timely fashion for California students.

 • • •

Doug McRae is a retired educational measurement specialist who has served as an educational testing company executive in charge of design and development of K-12 tests widely used across the United States, as well as an adviser on the initial design and development of California’s STAR assessment system.

The opinions expressed in this commentary represent solely those of the author. EdSource welcomes commentaries representing diverse points of view. If you would like to submit a commentary, please contact us.

  1. Douglas Gray 9 years ago

    I do tech support for a local one-room school in Landaff, NH.
    The teacher is reporting that drag-and-drop does not work on new Acer desktop machines running Windows 8.1.
    I have searched the Internet and found no reference to that problem. Have you ever heard any comments on this subject, or do you know someone I could contact about this issue? Thanks.

    Replies

    • Doug McRae 9 years ago

      Douglas Gray — Your question is far too deep in the technology weeds for me to answer. Best communicate with other SBAC users for info in response to your question.

    • John Fensterwald 9 years ago

      Douglas: I can’t answer your question, but I am thrilled to have an EdSource reader from Landaff, which, in my 25 years in New Hampshire, I never visited (and had a hard time finding on a map).

      • Douglas Gray 9 years ago

        Landaff is in northern NH between Woodsville (zip 03785) and Littleton (zip 03561). It is a very small town with the same zip code as Lisbon, NH and Lyman, NH: 03585.

        • John C. Osborn 9 years ago

          Ex-Milford, NH resident here.

          • Doug McRae 9 years ago

            Hey, now that we’ve got a NH and ex-NH group in the room, can anyone say what NH is doing for reporting SBAC scores this spring? NH and VT were the states that did not endorse the SBAC cut scores for scoring SBAC tests last November, with the problem of how to report scores without using the SBAC cut scores unresolved at that time. NH and VT are pretty much joined at the hip (kinda like puzzle parts on a map) for their statewide assessment programs, so it would not surprise me if they are doing the same thing for reporting scores. Are they maybe reporting scale scores only, not achievement categories like level 1, 2, 3, 4 or basic, proficient, advanced? Or not reporting anything immediately after test administrations are complete and rather waiting until validated cut scores can be developed based on 2015 SBAC census data, thus delaying reporting until fall 2015? One comment that appeared in the media last November from an unidentified state testing director was “Setting cut scores based on field test item-tryout data is bizarre.” Does anyone know if that quote came from the NH or VT state testing director?

  2. Angelo 9 years ago

    I have a concern with the validity of the results of SBAC, especially the mathematics section. Students are taking the test before a good portion of the material is even taught. I know that there is a rather large time frame in which the test can be administered, but I am fairly certain that schools are testing before the curriculum is completely covered. One other concern is the freedom to sequence the standards however school districts / teachers see fit. So it would seem to me that all students are not equally prepared during testing.

    Replies

    • Doug McRae 9 years ago

      Angelo, I totally share your concern. In technical test-speak, the Smarter Balanced tests do not have “instructional validity,” and as such the results cannot be used to evaluate the results of instruction, since instruction on the new Common Core content standards has not yet been sufficiently implemented. Several weeks ago, SBE Pres Mike Kirst was quoted in the Sac Bee as believing that only about 1/3 of CA teachers are now adequately prepared to teach the common core. To a testing guy, that is a major, major red flag that CA is not ready to implement common core statewide assessments at the end of this school year, that such an exercise will not generate credible usable testing data and will be a waste of time and money. Doug

  3. Elizabeth MacArthur 9 years ago

    I agree completely with Doug McRae that the Smarter Balanced tests will not be ready for meaningful use this spring (2015). In fact, I wonder if they will ever be good tests. Here are two reasons for my hesitation. (There are many others, but I’ll save them for another time!)

    First, my daughter took the seventh-grade common core field test last spring. She told me that there were numerous problems with the test, which made her doubt that the results would be a good indication of students’ abilities. For example, if she couldn’t answer one math question, suddenly all the remaining math questions were exceedingly easy, far too easy for her. It is possible that the computer program isn’t smart enough to adjust questions appropriately to a student’s level. A paper test would not have this issue.

    Second, I tried to participate in the cut-score exercise that Doug mentions. I chose English Language Arts for eleventh grade, since I used to be a professor of literature in the UC. I found the test to be of very poor quality. Many items were badly written; often I wasn’t sure what I was supposed to do. If a person with a Ph.D. who spent years teaching reading and writing is not sure what a question means, quite likely the question is not well designed or well written. I was also disturbed by the extremely low level of the reading required for the test items. The Smarter Balanced assessments are supposed to reveal how well prepared students are for college, but none of the reading passages was at anywhere near the level required in college courses in the UC. In addition, I encountered a technical glitch that made it impossible to finish the practice test (I was unable to highlight a piece of text, which I had to do in order to answer a particular question, but because I had not answered this question, I could not move forward in the test, and had to leave it unfinished). I think it very likely that our students will encounter some similar difficulties in spring of 2015.

    In conclusion, then, I fear that the results of the spring 2015 Smarter Balanced tests will be misleading at best. Many students who are actually thriving in the common core curriculum might not be able to demonstrate that success on these deeply flawed assessments.

    Replies

    • John Fensterwald 9 years ago

      Elizabeth: Your daughter took a field test last spring, one purpose of which was to filter out poorly written questions, so let’s hope those will not be on future tests.

      The field test was not adaptive, so the level of difficulty of subsequent questions should not have been determined by previous answers. The first official Smarter Balanced test will be adaptive.

  4. James Realini 9 years ago

    Teaching to standard means that you have a rubric/road map to a summative assessment that tells you if the student(s) met the standard. Designing the assessments (both formative and summative) is an essential teacher task in developing the program of instruction that will lead to student success. Most teachers know this as the “backwards planning process.”
    Keeping the test design A SECRET and not allowing the teachers to understand how the Smarter Balanced Test Questions (and answers) relate to the standards places teachers and their students in the untenable position of “you teach, and we’ll let you know if we think you taught it the way we think it should have been taught.”
    Common Core Standards for ELA & Math are wonderful; the teaching strategy (which includes the assessment activities) for Smarter Balanced is dysfunctional.
    Also, this unidimensional Common Core focus on ELA & Math, albeit there are Literacy Standards for non-ELA/Math disciplines, fails to address the content necessary for education in Science, History, Fine Arts & Music. This is critically important because observation of this nation’s financing of education clearly demonstrates that monies will be allocated on Test Results in ELA & Math to the detriment of all other disciplines.
    The Common Core system of pedagogy is an unvetted program designed without true teacher input. Those in the ivory tower financed by philanthropic wealth have every right to dream, but the classroom teachers have the experience necessary to design a system that they know will work, but were not asked. If you are going to ignore countless years of experience then you surely must believe that no experience to teach is necessary.

  5. Don 9 years ago

    “As for the testing, I don’t think those states should be able to opt out.”

    Floyd, testing and Common Core are separate endeavors. When you say they shouldn’t be able to opt out, do you understand that states are not compelled to adopt Common Core (coerced, yes)? It is a voluntary effort, even if that’s a simplification of the many drivers that forced states’ hands. How can you be forced not to opt out of something you aren’t forced to opt into? And by whom? The federal government is prohibited from enforcing national standards. And nothing prohibits states from teaching courses of their own choosing. You are conflating testing, CCSS, and curriculum as well as instruction. I’m not sure what you’re talking about except that you obviously believe for some reason that national standards are the key to higher achievement. There’s not much to back that up, since some countries without them score higher and some with them score lower, if that horse race is of any importance.

    Replies

    • FloydThursby1941 9 years ago

      Don, do you think San Francisco, a City in which Asian, Latino and African American students trail their counterparts in LA, SD, Oakland, Sacramento and San Jose but which is only ahead of those (though maybe no longer San Diego) due to its high Asian percentage, should take kids out of a normal class in high school to force them to take an ethnic studies class which implies the reason for the achievement gap is first and foremost, past racism, and whites should be deeply resented due to past racism and it is important for other groups to emphasize their differences, their oppression, rather than try to fit in and assimilate and be productive?

      This is a bizarre class. And it’s about to get shoved down our throats and waste a year of a class. Instead of a third year of a language or science, or Public Speaking or Journalism, or AP US History, all of which are primarily taught from a liberal perspective, we’ll get an ethnic studies class.

      We should have national standards to avoid random localities from doing bizarre things like this. We need to focus on reading and math.

      I do think it’s good if we can compare States, families, teachers, ethnicities, cultural practices, income levels, etc. by various testing measures. It helps us learn to emulate the best and not the worst.

      Why should we teach children to be resentful? What is the point when 1 in 7 kids now is biracial?

  6. Don 9 years ago

    Doug, what do politicians have to gain from general confusion and disappointment following release of lower test scores without comparability data? Common sense and the recent historical example in NY would dictate considerable political fallout as a result of this action. This and other unexplained decisions, like the one to release not-ready-for-prime-time test results in 2015, lead me to question if there isn’t some other ulterior motive behind these curious decisions. Is this another Trojan horse inside the first?

    Replies

    • Doug McRae 9 years ago

      Don — My take is that for the most part, the political angle on large scale testing has its focus on the short term and doesn’t focus on longer term negatives. But, there are some who are simply anti-testing, anti-accountability and see things like lack of comparability info when switching tests as another way to discredit large scale testing in the hopes that it will just go away forever. Speculation on motivation is of course just speculation, and everyone is entitled to their own views, which is why an outlet like EdSource has the opinion traffic that it enjoys . . . . . with at times opinions not well grounded in thorough or accurate policy information.

      • Don 9 years ago

        Yes, but 2015 IS short term and 2016 and 2017 as well.

        Doug, of course you can get all kinds of opinions on EdSource, some more grounded in fact than others (as concerns testing policy in this case), but isn’t the issue at hand not the lack of grounding in fact of EdSource opinions but of California state education policy?

        • navigio 9 years ago

          There are all sorts of momentum drivers for imposing confusion on this issue, IMHO. Among them: more money for testing companies, more difficulty in evaluating teachers, hiding poor implementation (i.e., diverting attention from responsibility), and even just discouraging those pesky community members. There were many ways the CSTs could have been made more transparent but were not. That said, intentionality is a pretty subjective topic.

          • Manuel 9 years ago

            Yeah, “intentionality” is pretty subjective, especially when the test designers claim they had nothing, zero, zilch, nada, to do with how the test scores are manipulated by “folks at the higher pay grade.” So who are these folks?

            Shocking, I tell you. Shocking.

            Of course, why should the pesky community people be told anything? We don’t need to know anything and all we have to do is to trust those who know a thing or two about test designs because, hey, they have nothing to do with how the scores get manipulated.

            I’ll repeat it again: Shocking. Truly shocking.

            It might even shock the conscience of some judge out there… 😉

        • FloydThursby1941 9 years ago

          Realistically, they aren’t going to drop Common Core. It hurts more than it helps to constantly demand they do that. It’s better to try to make it work. Some of the problems are actually caused by those who don’t want it to work, I agree, who are anti-testing, anti-accountability, and anti-federalist. They want States Rights and local control. We tried that and it failed. Now we’re trying something else.

          • Don 9 years ago

            Floyd,

            An August article in U.S. News & World Report was entitled, “Common Core Support in Free Fall”. In CA a minority of Democrats have positive things to say about it. When people knew little about it support was higher. The more they know the more support drops out.

            Three states, Oklahoma, South Carolina, and Indiana, have officially repealed the Common Core standards. 30 states have introduced legislation to limit it or repeal it. A dozen or more states have withdrawn from SBAC or PARCC.

            When you say “realistically they aren’t going to drop Common Core”, I’m not sure if you understand how poorly things are going for CCS.

            People who want scripted, restrictive one-size-fits-all instruction and high-stakes summative testing support Common Core. That’s why teacher support has dropped precipitously. The long game for those that want to do away with CCS is to let it collapse.

          • navigio 9 years ago

            Common core is a content standard, distinct from all the other stuff people attribute to it. If CC is repealed in ca it will be for political reasons related to all that other stuff, not based on the standards themselves. Not sure that difference should matter except in that people should probably be more clear about what they are actually criticizing and rejecting (most of the time it is not the content standard themselves).

          • Don 9 years ago

            Ya know, Floyd, I always hear this meme that public education has failed.

            If I compare my own public school education starting in 1958 with the education my children have received since 2003, I am astonished at how much it has improved since then (please refrain from the obvious jokes here. I made up for it later), primarily in the quality of the instruction.

            Are you disappointed in the education your kids received at the same schools as mine?

            Plenty of kids are failing today, but this is largely to do with immigration from Latin America because the numbers would look very different otherwise.

            I’m not buying into the corporate reformers’ flaming of public education. While I’m not ideologically against charters, I am ideologically opposed to destructive propaganda that derides schools to promulgate the defunding of public education unless a certain set of reform measures come to the rescue. This isn’t to say some employment reform, for example, isn’t a good idea, but there are so many great things going on in schools nowadays; it’s just a travesty to refer to all this egalitarianism and progress as failure.

          • FloydThursby1941 9 years ago

            Two points. One is I agree, I’m overall very happy with the education my kids received at the schools we had in common, all 3 of them. I had a couple teachers and incidents I was unhappy with but that is inevitable. 90-95% of it is good. It is true, immigration from Latin America has made it very difficult on us statistically. This is our biggest challenge. The numbers are too big to accept as an anomaly. I think California is still thinking we’re in the ’50s, oh well, a few Latinos are getting Cs, no biggie, someone has to do that work. This is the majority. If we could get Latino performance to the Asian level we’d be one of the richest places in the world.

            But overall I do think schools have improved. We just ignore students. A diligent student at a horrible school can have a good future and a terrible student at a great school can have no future. We forget that with all the focus on instruction.

            As for the testing, I don’t think those states should be able to opt out. They should work towards a solution. Maybe we have to do it again but when 50 states and a district adopt one test, some states just have to accept that there may be a final decision they don’t feel is ideal. Maybe people in the South want religion in there, or don’t want evolution or global warming, and now some in SF want an ethnic studies course, in my view a mistake which diverts important time and resources. As the Rolling Stones once said, you can’t always get what you want, but if you try sometimes, you just might find, you get what you need….baby. Let’s all roll up our sleeves and try to mend it but don’t end it.

  7. Bob Valiant Ed.D. 9 years ago

    McRae only discusses the problems with Smarter Balanced that can be fixed over a period of time to meet the technical requirements. He does NOT discuss the fact that this type of test can never measure the true achievement of a child (let alone the teacher of the child) over the broad range of education for a full year of school experiences. Even so, he does say it is not ready for prime time and should be delayed. I say Smarter Balanced and PARCC should be scrapped and the money that would be spent on them put to good use in classrooms across the country.

    Replies

    • Manuel 9 years ago

      In all fairness, Doug presents himself first and foremost as a test designer, not as one who believes that the tests are true measures of a child’s academic achievement. Or at least I’ve never noticed him saying anything close to that.

      Yes, he believes that the SBAC tests are not yet up to his standards (and, presumably, most of the honest practitioners of this mostly arcane art). But he has stated elsewhere that your suggestion, Dr. Valiant, should not be followed. When I proposed this, he thought it equivalent to throwing the baby out with the bath water.

      I, of course, believe that this bath water is probably drowning too many babies every year. Ah, the things we do when pursuing an accountability model that eventually becomes just another “stack-and-rank” scheme.

      BTW, Doug did state below that tests are not there just to satisfy teachers only but that they have “multiple audiences and masters.” That’s a rather candid and very telling admission that speaks volumes about why tests such as these are imposed on the unwashed masses (and I say this because it is rumored that prestigious private schools do not force their students to take these tests).

    • Paul Muench 9 years ago

      Smarter Balanced does not test for achievement in physics, chemistry, biology, economics, psychology, political science, drama, music, painting, sculpting, Spanish, Chinese, German, French, etc. There was never any intent to do that. Smarter Balanced, like all other standardized tests, is a test of minimums and not a test of fulfillment. When you have an education system designed to meet minimums, you test for minimums. Does SB fool people into thinking it is more? So far I have not seen that.

      • Manuel 9 years ago

        Paul, I am under the perhaps mistaken impression that all such tests are meant to, as the Ed Code puts it in Ed Code Section 52050.5(b):

        It is in the interest of the people and the future of this state to ensure that each child in California receives a high quality education consistent with all statewide content and performance standards, as adopted by the State Board of Education, and with a meaningful assessment system and reporting program requirements.

        (emphasis mine.) If the tests measure only “minimums,” then that’s not a “meaningful assessment” and the intent of the Legislature is being ignored. Does this make the tests essentially illegal as they don’t carry out the intent?

        (BTW, I just noticed that the PSAA of 1999 does indeed allow for “sanctions for schools that are continuously low performing.” How could I have missed that? And who is going to define the sanctions? The SBoE? Following what criteria? Yours? Mine? Duncan’s? Ravitch’s?)

        • Paul Muench 9 years ago

          You’re right. I’m not limiting myself to the language and politics of the Ed Code in my use of the word minimums. My remark was in response to the “..true achievement of a child…” from Dr. Valiant. In which I gave some suggestions about the true potential of children.

      • navigio 9 years ago

        I’d reverse the minimums comment: when you place high stakes on the measuring of minimums, the educational system adapts itself to make those minimums the goal.

        • Paul Muench 9 years ago

          Perhaps high stakes tests made education worse in California. I dislike high stakes tests for reasons that are independent of that judgement. But the basic outline of education in California (and pretty much across the U.S.) has remained basically the same, so I’m confident my use of minimums would apply even before NCLB came into existence.

  8. Harold Capenter 9 years ago

    Here’s my assessment of the Smarter Balanced Test Cut Scores: 1) The cut scores are political and about the money. 2) Each time a child fails the test and has to retake it, it costs money, more money. 3) 62% “expected” to fail, now that’s a problem. 4) 38% pass rate is abysmal. Common Core Test and Standards are not about our children getting a good education; instead it’s about money. Fix that, then maybe you got me at hello.

  9. Steve Rees 9 years ago

    Thanks, Doug, for providing your independent judgment for the benefit of the rest of us. An informed opinion like yours, that rests on decades of knowledge and experience, is an opinion I value highly. Your comments also provide a helpful correction to the many misunderstandings of assessment fundamentals. I hope you continue to keep your eye on the ball, and call ’em as you see ’em.

  10. Gary Ravani 9 years ago

    Allow me to go one step further than Mr. McRae on this one. The representative of SBAC, Wilhoit, was quite clear on the point of creating “cut scores:” The process is fundamentally “judgmental” in Wilhoit’s words. The various panels and small groups looked at a series of questions and then placed “bookmarks” where they individually felt the cut score for four performance levels “ought” to be placed. Then they went through a variety of consensus-building, group-process activities to develop a consensus-based, but still very subjective, process to give a final (but still preliminary!) set of cut scores to be used to determine performance levels of students.

    Part of the process was for the in-person groups to be exposed to the opinions (and that’s what they were) of the on-line participants. It appears a number of people commenting participated in this.

    Anyone who advocates for using test scores for making high stakes decisions about students, schools, or teachers based on “OBJECTIVE” test data just needs to tune into the video of the SBE meeting and hear/see the presentation. No, test score data does not come from on high carved into stone tablets.

    This has always been the case for any of the tests. Some are better than others, mostly the individually administered ones, but in the end they are all based on fundamentally subjective criteria. First some group, sometimes from the business community or the political sector, decides what the content should be. Then a middle-class academic decides what the questions should be, then other groups also frequently middle-class academics, decide on “cut scores.” There is some statistical manipulation that enters in at various times to give it all an aura of “objectivity,” but it’s basically a very human endeavor full of all of the usual human quirks and biases.

    As Doug McRae points out, when properly put together, properly administered, and properly applied some useful applications of test data are possible. But, is it “Cambel’s Law,” that says the higher stakes you use in applying the results the higher the chance the results will be corrupted and cause more harm than good.

    And that has been the recent result of testing in the US for more than a decade and even longer in CA. (See the National Research Council on this.) Let’s try and do a better job with the tests and their uses this time around.

    The first step: again, as Doug and others have pointed out, training, materials, technology, and time to teach all vary dramatically across CA. A single test, used at this time, to begin to hold all districts accountable makes no sense whatsoever. The most common sense thing to do is to hold off on the “benchmarking” process until all districts are up to speed on the CCSS and SBAC process.

    Replies

    • Doug McRae 9 years ago

      Gary — Test scores can be objective with a subjective process for determining cut scores, even subjective scoring for some test questions / tasks, provided test administration and scoring are conducted with an acceptable degree of consistency. The objectivity comes from the test administration and test scoring process. I will, however, agree that test scores do not come from on high carved in stone tablets — at least they never have on my watch!

      I’m not familiar with your “Cambel’s Law.” Rather, an informal law I follow is “the higher the stakes, the greater the effort needed to prevent corruption that would cause more harm than good.” And I’d disagree with your characterization of recent K-12 testing efforts in the US and CA.

      But I will agree we need training, materials, technology, and most of all time to implement common core instruction, and we need instruction firmly in place (90% of districts?) before we use test data for high stakes purposes. And obviously I agree the best thing to do is to hold off benchmarking (or setting cut scores) until we have appropriate data upon which to benchmark. But I also think it is possible that might be done as early as fall 2015 using spring 2015 SB data, at least data from subparts of the SB summative assessment system for most of the targeted grades.

      • Gary Ravani 9 years ago

        The Inevitable Corruption of Indicators and Educators Through High-Stakes Testing
        Sharon L. Nichols University of Texas at San Antonio and
        David C. Berliner Arizona State University
        Executive Summary
        This research provides lengthy proof of a principle of social science known as Campbell’s law: “The more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor.” Applying this principle, this study finds that the over-reliance on high-stakes testing has serious negative repercussions that are present at every level of the public school system.

        Forgot an “l” on Cambell.

        I am aware of your position on testing, Doug. And I have my position based on 35 years in the classroom. Then we have the study I posted above, and then we have the findings of the National Research Council, which has asserted that the kind of test-based “accountability” we have had in place since 2002 in the US and (around) 1994 in CA has not only not resulted in increased learning, but has given us a false picture of that learning and has, via a narrowing of the curriculum, actually reduced the chance of improved learning.

        • Doug McRae 9 years ago

          Nichols and Berliner are well known advocates for a teacher perception on what is needed from statewide tests, rather than unbiased positions evaluating the merits of various approaches to statewide tests. So, their work should be read with that grain of salt. The NRC study falls in the bucket of studies that try to evaluate the effect of changes in the overall K-12 education system via test scores [in this case, evaluating the effects of statewide tests by using test scores . . . . ] and come up with a “no effect” conclusion. Not unlike the myriad studies on class size that come up with “no effect” conclusions. I wouldn’t put great stock in any of those “no effect” studies as definitive for any kind of conclusion other than author opinions only loosely connected to any data. As said in other places, however, there is room for reasonable folks with differing perspectives to disagree . . . . .

          • Gary Ravani 9 years ago

            Quite a point to make, Doug. Educational testing based on what teachers (aka, educators) perceive testing ought to be all about. What a concept!

          • Doug McRae 9 years ago

            Quick point to make, Gary. Educational testing has multiple audiences and masters, beyond teachers and educators who clearly deserve a major but not unilateral voice in large scale educational testing for K-12 education.

  11. Steven Shapiro 9 years ago

    I also participated in the online scoring exercise that Smarter Balanced ran in October. I did the High School Math test and like the author, I found that the sequence of the questions did not seem to progress from easiest to hardest. In fact, they covered a variety of math topics (algebra, geometry, statistics, etc.) that don’t really allow an easiest to hardest sequencing. A student might really understand and have mastery of geometry topics but be very weak in statistics. The way the “cut score” was requested (a single marker where level 3 begins) did not allow us to rate the relative complexity of each question.

    Replies

    • Doug McRae 9 years ago

      Steven — Good to know someone else also questioned whether the items were ordered from easy to hard from an experience with a different grade level and different content area. Anyone else who participated in the online exercise able to chime in with their experience? I know this isn’t a scientific site, but ad hoc data from a number of grade levels and content areas can be useful to ascertain whether or not the sequencing of items for the SB standards-setting efforts was indeed problematic or my commentary reflects only an isolated view.

  12. Don 9 years ago

    Doug, thank you for providing the much needed counterbalance to the hasty decision-making going on with regard to these assessments.

    Confucius said- “When it is obvious that the goals cannot be reached, don’t adjust the goals, adjust the action steps.”

    I was particularly struck by the point that you made here:

    “The odds are that if a student uses a random marking strategy, he or she will get a proficient score quite often. This circumstance would result in many random (or invalid and unreliable) scores from the test, and reduce the overall credibility of the entire testing program.”

    If it is true that students who perform lower tend to guess more often, it seems reasonable to conclude that test results at lower performing schools will incur this random advantage, particularly if it is also true that questions are not presented in hierarchical order of difficulty. The net effect would be an inflation of scores predominantly among students who historically test significantly lower, while students close to the border of proficiency would likely be most disadvantaged by such a policy.

    As you pointed out, guessing upsets the whole apple cart, but I believe it isn’t going to have an equal effect across the proficiency spectrum. What is your take on the specifics of this as it affects the advanced to far-below-basic spread?

    Replies

    • FloydThursby1941 9 years ago

      Come on, if you guess randomly you’ll do horribly. You won’t be rated as proficient. This is why I don’t agree with Don’s calls to delay a year. I know in a year you’ll be hammering them about something and making any statement you can to push for another delay, then another. You could not be rated as proficient by being random.

    • Doug McRae 9 years ago

      Don — Setting cut scores too close to a random cut score will advantage those below such a cut score and disadvantage those truly above such a cut score. To be fair to all students, cut scores cannot be set so low that random marking advantages or disadvantages any student or group of students.

      Floyd — It is simple math that if you have a test with 80 items involving 4 choices each, and you set a cut score at the random mark of 20 items correct, then many will get proficient ratings by just randomly marking the answer sheet. That’s just another way of saying a test maker should not (must not) set cut scores too close to the random range for a test. Unfortunately, flaws in the Smarter Balanced Oct 2014 exercise suggested this scenario was quite possible.
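
      A minimal sketch of the arithmetic in the comment above, assuming every item is an independent four-choice question answered purely at random (an idealization; the function name is illustrative and nothing here comes from Smarter Balanced). Under that assumption the number of correct answers follows a binomial distribution with n = 80 and p = 1/4, and the chance of reaching a cut score of 20 is the upper tail of that distribution:

```python
from math import comb


def prob_at_least(cut: int, n_items: int, p_correct: float) -> float:
    """Return P(X >= cut) for X ~ Binomial(n_items, p_correct)."""
    return sum(
        comb(n_items, k) * p_correct**k * (1 - p_correct) ** (n_items - k)
        for k in range(cut, n_items + 1)
    )


n_items, n_choices = 80, 4            # figures from the comment above
chance_score = n_items // n_choices   # expected number correct by guessing: 20

# Probability that a pure guesser reaches a cut score set at the chance score.
p_random = prob_at_least(chance_score, n_items, 1 / n_choices)
print(f"P(at least {chance_score} of {n_items} by guessing) = {p_random:.3f}")
```

      Because 20 is the expected score under random marking, roughly half of pure guessers would land at or above it. Passing a higher value as `cut` shows how quickly that share shrinks as the cut score moves away from the chance range.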

      • FloydThursby1941 9 years ago

        I’d have to do the math. I thought there were 5 choices. I rolled 100 dice several times and the highest count any one value ever got was 24. I think randomization would prove it highly unlikely to get from 25% to 80%, which is required for proficient. That would be going from 20 out of 80 to 64. I believe advanced requires at least 72. How many standard deviations would it have to be off for that to happen? I realize it is theoretically possible for a monkey to type a Pulitzer Prize winner by pecking at random keys on a computer, but the odds would be astronomical. The lottery is similar, highly unlikely, so it reliably pays a lot for our schools. Has a lousy student or even an above average student ever received a perfect SAT Score? We’re talking about a feat achieved by only the top 600 kids a year in America. I would assume winning the lottery would be more likely than an average student getting a perfect SAT Score. This must be studied.
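
        One way to put numbers on the standard-deviation question above, as a rough sketch under the idealized assumption that every answer is an independent random guess: the number correct on an 80-item test is then binomial, with mean np and standard deviation sqrt(np(1-p)). The helper name is illustrative, and both the four-choice and five-choice cases are shown because the comment is unsure which applies.

```python
from math import sqrt


def sds_above_chance(score: int, n_items: int, n_choices: int) -> float:
    """How many standard deviations `score` sits above the pure-guessing mean."""
    p = 1 / n_choices                     # chance of guessing one item correctly
    mean = n_items * p                    # expected number correct by guessing
    sd = sqrt(n_items * p * (1 - p))      # binomial standard deviation
    return (score - mean) / sd


# 64 of 80 is the 80%-correct figure mentioned in the comment above.
for n_choices in (4, 5):
    z = sds_above_chance(64, 80, n_choices)
    print(f"{n_choices} choices: 64 of 80 is {z:.1f} SDs above the chance score")
```

        Either way, 64 correct sits more than ten standard deviations above the guessing average, so random marking would essentially never reach it. The risk raised in this thread applies only when the cut score itself is set close to the chance score.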

        • Doug McRae 9 years ago

          Floyd — Where did you get 80% to be proficient or 92% to be advanced? The entire standards-setting exercise is meant to decide what the number of items correct should be to determine proficiency, and the number of items correct for determining advanced. If the recommendation is 20 items correct of 80 available items, then the recommended score for proficiency is 25% of the items correct, not 80%. Your 80% for proficient and 92% for advanced probably comes from A-B-C-D grading schemes, which have nothing to do with standards-setting for large scale K-12 tests . . . .

          • FloydThursby1941 9 years ago

            I’m just basing it on what I’ve seen for the STAR Tests and CTBS tests when I was growing up, assuming it will be close. I doubt they’ll make a test where most kids get over 70% wrong, even smart ones. I may be proven wrong. My kids generally got between 95 and 100% and were always well into advanced.

            They really ought to make more use of percentiles. It is very helpful to know what percentile your child is at, above or below average. Anyone under 50 should do a lot of work, top 30% will be college grads, top 5% the true leaders of society, or are on pace to be so. Percentiles are far more valuable than advanced/proficient. My fear is eventually everyone will be advanced and it will become meaningless.

          • Doug McRae 9 years ago

            Floyd — I’m pretty familiar with both STAR tests and CTBS, having been deeply involved in the development of both, and I can guarantee you that very very few kids get between 95 and 100 percent of the items correct on either of those sets of tests. Re use of percentiles . . . percentiles were the dominant score unit for norm-referenced tests such as CTBS through the end of the last century, but NRTs have been replaced by Standards-Based Tests (such as the STAR CSTs) in the past 10-15 years, and the dominant score unit is now the below basic, basic, proficient, advanced achievement category score that is the topic of my commentary and this discussion. NRTs went out of favor because their scores were relative among students rather than interpretable in terms of achievement on a fixed set of content standards. In any case, percentiles have been out of favor for public school statewide testing for the past 10-15 years, even though many parents still like the type of data those tests provided. The demise of NRTs and rise of SBTs could well be the topic for a book on large scale tests for K-12 schools in the US . . . . it is a pretty complex topic.

          • FloydThursby1941 9 years ago

            Interesting, Doug. I would always figure out the number they got wrong and it was under 5% from 2nd to 8th grade. In high school it was under 10%. My son got 100% on all 5 math sections one year and only got 1 total question wrong out of 150. However, because in San Francisco the test counts for Lowell, I would study for the test with my kids, have them do a prep book and correct it, and even in San Francisco, which is a city with very diligent parents, very few people actually do so. I encouraged it at our school and a few parents did.

            I feel leaving out percentiles is part of the self-esteem movement to make all kids feel good whether or not they should be feeling good. I subscribe to the Triple Package view that some insecurity is a good thing and that free self-esteem for kids makes them comfortable, cocky, and not work as hard. You have to feel you must earn your self-esteem, not that it is a given. You have to internalize drive and a desire to do one’s best and push to your limits. I believe this, not genetics, is the reason Asians are over 3.5 times as likely to make it to a top UC or the UC System, as whites. Some believe it’s just unpredictable, some kids have it, some don’t.

            Percentiles should be presented alongside the other information. You can get the percentiles but you have to call in and look it up.

            San Francisco is one of the few cities where the STAR test counts. I felt it was a useful test and should only have been abandoned once they were ready to give the other test.

            I remember as a kid I used to calculate the # I got wrong and it was under 10%, sometimes around 5, but the general range was 4-8, and for my kids 2-4. I just assumed they wouldn’t make a test with more wrong than right, but they may.

            It would be fascinating if all these numbers were available. I’d love to see across the board comparisons of non English Learners, or see how white kids from one state do vs. white kids from another, as immigration from Mexico and Latin America is often proffered as an excuse for California’s struggles, and Texas’s. Bush used it as an excuse for Texas’s low scores, so it’d be interesting to be able to play with the data. It’d also be interesting to see what percentage of California’s kids are above average nationwide. I can’t wait until every state has all the information out there so we can see scores by county, area code, zip code, ethnicity, income, parents’ education, parental involvement. All this information will help us and if we focus, if we make education our national mission and #1 priority, we can do in education what we’ve done in many other areas, be #1. I believe in America, but America has never made education a #1 priority.

          • Manuel 9 years ago

            Doug, you stated:

            “I can guarantee you that very very few kids get between 95 and 100 percent of the items correct on either of those sets of tests”

            Is “very very few” less than what percent? I would think that if it is more than 2% of the total cohort for each grade then it is not “very very few.” I say this because, for example, the 2012 ELA had 6.05% of the 2nd grade kids (462,538 total) getting 62 out of 65 right (that’s 95.38%), 4.01% of the 3rd graders (440,407 total) getting 62 out of 65 right, 3.31% of the 4th graders (413,446 total) getting 79 out of 83 right (95.18%), and 4.75% of the 5th graders (428,868 total) getting 71 out of 75 right (94.67%). I am curious as to what test designers such as yourself consider “very very few.”
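
            As a rough check of the arithmetic in the comment above, the sketch below recomputes the percent-correct figures and converts each quoted share into an approximate head count. The numbers are the ones quoted from the 2012 CST technical report; the helper names `percent_correct` and `approx_students` are made up for this illustration.

```python
# Figures as quoted in the comment (2012 CST ELA, grades 2-5).
# Each tuple: (grade, share of cohort in percent, cohort size,
#              items correct, items on the test)
rows = [
    (2, 6.05, 462_538, 62, 65),
    (3, 4.01, 440_407, 62, 65),
    (4, 3.31, 413_446, 79, 83),
    (5, 4.75, 428_868, 71, 75),
]


def percent_correct(correct, total):
    """Raw score expressed as a percentage of items, rounded to 2 places."""
    return round(100 * correct / total, 2)


def approx_students(share_percent, cohort):
    """Approximate number of students represented by a percentage share."""
    return round(cohort * share_percent / 100)


for grade, share, cohort, correct, total in rows:
    print(
        f"grade {grade}: {correct}/{total} = "
        f"{percent_correct(correct, total)}% correct, "
        f"about {approx_students(share, cohort)} students"
    )
```

            Note that the grade 5 figure (71 of 75) works out to just under 95 percent, so it sits on the edge of the 95-to-100 band under discussion.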

          • FloydThursby1941 9 years ago

            Thanks Manuel you make a good point. A child raised well, who studies for the test with the books available and makes school their main priority in life, should reach this level. It’s only rare because of the level of parenting in the State. I don’t even think I do nearly as much as I should or could, but I study weekends with my kids, prepare them for Kindergarten, and have them do a book a year to prep for the test. They go to typical public schools, slightly above average but save for the high school, not unique. That gets them into this percentile consistently. I bet most kids who are from the Triple Package cultures (Jewish, Chinese, Korean, Indian, Lebanese, Nigerian, Cuban, Persian, Mormon, Japanese, Kenyan) reach this level. Any child could if they work hard and have good parenting. 95% seems very rare to most but only because we don’t set high standards and do what it takes to meet them. Among parents on this board and who prioritize the importance of education, I bet this level is very common. And 1 in 20 to 1 in 30 is not very, very rare.

            Send us the test results. Let us decide if it’s valuable. Maybe send a disclaimer and try to make it better next year, but please send us the results. 95% is just a sign of responsible parenting and responsible behavior of children. We should know who is and who is not achieving this.

            And if it’s rare, show the percentiles. We can all know where we stack up with the percentiles, who needs to work harder and who should be proud.

          • Manuel 9 years ago

            Floyd, you are a one-trick pony and therefore not worth much of a reply. But you have given me an opportunity to send people to the mother lode: The percentages I quote come from the Technical Report on the 2012 CST administration, which is available at the CDE’s web site. The 2013 is there as well. Have fun.

            As for scoring at the 95%-correct-or-above level, you just don’t get it, do you?

            In my opinion, the CST data shows the tests are a mix of criterion-referenced questions and norm-referenced questions, despite the intentions of test designers such as Doug. The above technical reports are the only ones displaying raw scores next to their scaled values. Without knowing the test items themselves and just looking at the distributions, both raw and scaled, it is a clear fact that the cohorts taking the tests are not homogeneous as the distributions seem to be the sum of two distributions, each with its own mean and standard deviation. This should not be a surprise since, for example, 20% of 5th graders in 2013-14 are English learners. That alone invalidates the idea that this is a fair test for all.

            The fact that the distributions do not change significantly from year to year indicates that the test is designed to produce roughly the same results, else the mathematics used to analyze it won’t give you reproducible results. Under this scheme, the tests are a zero-sum game: if your kids are always at the top, then others must always be at the bottom, regardless of how well you train any of them (and therefore has nothing to do with responsible parenting: these kids are just well trained hamsters). And, no, I am not making this up as it is the basis of any good statistics course in any good college.

            Bottom line: if every one starts scoring too high and this skews the curve so its mean moves to the right, the curve gets scaled so as to get the average back to the left and the Gaussian is eventually recovered. That can be accomplished by changing the test items or changing the mathematical transformation used to scale the raw scores. The result is the same.

            One last thing: that’s why the SATs are “recalibrated” every so often. How else can one compare this year’s cohort to the one, say, from 1990?

            Have a nice day.

          • Don 9 years ago

            Manuel, to add to your excellent appraisal of the much under-appreciated limitations of the STAR test and the problems of comparability of disaggregated subgroup scores, I would also add the regression to the mean issue, using Floyd as an example.

            Floyd places a great deal of significance on the utility of standardized testing as a valid appraisal of a student’s overall worth, but shows no interest in the integrity of those tests as valid appraisals of said worth. That validity is all assumed on his part because he wants it to be valid, ergo it is valid. He then uses various methods to teach to the test and creates a testing error just as schools that teach to the test do as well. Using Floyd’s global assertion of student worth as measured by the test, what I think he has called a “neutral measure of human goodness” or some such silliness, he then holds his kids up as poster children for academic overachievement based upon test scores. However, in the big picture of overall holistic student performance (let’s say using the multi-faceted criteria of college admissions), his children will regress to the mean, particularly given the error factor he introduced as well as a history factor that seems to go with his style of parenting. Are his children’s high school grades commensurate with testing overperformance? No. Regression to the mean.

          • Doug McRae 9 years ago

            Manuel — I’d translate “very very few” to about 5% or less. Test makers generally try to have average scores in the 50-70% range of available score points, with very high scores in the 90-95% of the available score points to assure a test has a sufficient “ceiling” to adequately measure high achievers and low scores sufficiently above random to ensure a sufficient “floor” so as not to allow random marking to generate a misinterpreted invalid score. These are just some of the basics for large scale standardized tests for K-12 education, characteristics typically somewhat different than teacher made tests.

            To briefly comment on other views expressed yesterday on this commentary, test designers make no claims at all that their tests measure “true” achievement; rather, the goal is to provide a credible estimate for achievement for targeted academic areas that satisfy educational measurement professional standards for validity, reliability, and fairness. “True” achievement is way above pay grade for test makers (grin). And Smarter Balanced tests, like their predecessor STAR tests in CA as well as national norm-referenced tests from the 30’s thru the 90’s, are not intended to measure minimum skills. Rather, they attempt to measure a reasonably full range of the targeted academic achievement. There are tests like CAHSEE that are minimal skills tests, for CAHSEE the minimum skills needed for a CA high school diploma. But generally speaking Smarter Balanced tests are not minimal skills tests, and if Smarter Balanced tests are to serve as HS grad tests then the administration and/or scoring would have to be adjusted for the minimal skills purpose.

          • navigio 9 years ago

            my own comments surrounding minimums were meant to apply to two contexts:
            – how tested content is contrasted with non-tested content in policy decisions
            – how implied proficiency on an entity basis is nowhere near ‘acceptable’

            schools are closed or kept open, punished or lauded, based on–with the exception of science in 5th grade–test results from english and math ONLY for the first 8 to 10 years of a student’s school life. imho this is why we no longer have music, arts, pe or better science instruction for many elementary schools (unless parents pay for it) and even inconsistently teach social studies, history, second languages and computer skills, if at all, at those levels. it’s even why kids in some middle schools don’t have the option of taking a geometry class, for example. While there is surely a good argument for limiting instruction to these ‘core’ components, especially for students who are in danger of never achieving them, it must come at the expense of exposure to these other domains, and when given the choice in policy, we virtually always choose having better test scores via limiting a child’s exposure to the things we thought made them well-rounded human beings. Ironically, many of us who would do this at a policy level do not do so with our own children. Please insert my favorite john dewey quote here.

            Secondly, it is possible for a school to have an 800 API and still have over 60% of its students not proficient (granted that’s not a likely scenario in reality, but it is possible, and more likely scenarios do occur closer to 50%). If the state setting an 800 target is not an example of a policy-based minimum, I’m not sure what is. Our own district sends out press reports with lists that include the ‘800 club’ (lone api inherently discards any notion of ‘similar schools’) to reinforce the idea of schools that have ‘made it’.

            The issue of the content standards themselves is a tricky one. For some people they are a minimum. But in a sense, I expect that is merely a reflection of our notion of grades.

            Personally, I don’t understand how many times we think we have to measure our students to see how clearly they are distributed (and why).

          • Manuel 9 years ago

            Doug, thank you for the reply.

            It is interesting to note that 1 in 20 is “very very few” in your professional circle, something I would never have guessed, what with all that talk about the 1%. FWIW, if the distribution is forced to be a Gaussian, that translates to 1.65 standard deviations and above.
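
            The 1.65 figure can be checked with a short sketch. It assumes, as the comment does for the sake of argument, that scores follow a standard normal distribution; that is an assumption of the illustration, not a claim about how the tests are actually scaled. The variable names are made up here.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)

# z-score above which the top 5 percent of a normal distribution lies.
z_top_5_percent = standard_normal.inv_cdf(0.95)

# Share of a normal distribution lying above 1.65 standard deviations.
share_above_1_65 = 1 - standard_normal.cdf(1.65)

print(f"top 5% begins at z = {z_top_5_percent:.3f}")
print(f"share above z = 1.65: {share_above_1_65:.4f}")
```

            Under that assumption, “1 in 20” and “1.65 standard deviations and above” describe roughly the same slice of the distribution.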

            Also, thank you for clarifying that it is above your (and your colleagues’) pay grade to “measure” true academic achievement. It would be nice if our political class and other educrats were honest about this before bestowing the “failing” label on schools that, for a variety of reasons, have to carry the load of being to the left of the average in the Gaussians representing test scores only in ELA and math.

            Given that some speak of the SBAC (and the CSTs before them) as an indicator of being “on grade level” and may even like to use them to assign classroom marks, it seems to me that you and your colleagues need to speak up on the clear fact that these uses are not what the tests were designed for. I suppose that those who are not retired need to be careful in how they express themselves, but you are no longer under the same circumstances. You know what is in the tests and their limitations as your response clearly shows. I believe you have sufficient stature to more forcefully say that the political class is continuing to make serious errors in their blind support of testing for “stack-and-rank” purposes. (Yes, this commentary is a good first start!)

            If the political class (and in this I include those who call themselves reformers) continues to insist that tests are the only way to go, then it is time for those in the general statistical community (yes, those actually making a living doing statistics, not school testing) to start getting involved in what I consider a misuse of their field of expertise. For the record, I am not one, but I know enough statistics to know that the fluid flowing on my back is definitely not rain.

          • Doug McRae 9 years ago

            Manuel — Educational tests are not “forced Gaussian” (i.e., forced normal or bell-shaped curve), never have been, neither previous norm-referenced nor current standards-based tests, nor other K-12 widely used tests. The normal curve appearance of the score distributions you keep referring to is the work of folks at the higher pay grade, and us testing folks just work with what nature provides. There is no point continuing to respond to your allegations along these lines . . . .

            Many of your other comments deal with foibles of how public education is governed by political forces, rather than things that are inherent within tests or other educational resources. On those things, democracy provides free speech for everyone to voice their own opinions. I agree test makers and other educational measurement specialists should be as active as possible calling out poor use of large scale testing data, and misinformation about tests. I’m trying to do my part . . . .

          • navigio 9 years ago

            Doug, I will have to admit that the fact that the cde has avoided posting raw score distributions until the last year of the test is a curious behavior. I asked them for the other years and was told ‘no’. I don’t know why there is so much secrecy surrounding that, but I do think the comparison of those with the Gaussian(s) should at least be discussed (the distribution is very interesting, and more importantly, looks VERY different than the scaled versions.. oh well..).

            However, even independent of whether the scores are ‘forced’ to a Gaussian on a year-to-year basis, we do do this inherently when we move to a new test (or adjust the old one to a new or changing ‘reality’). By definition, being asked to analyze a new set of questions tied to a new set of expectations (content standards) above and below a grade-level expectation is, in essence, setting a new mean for that grade. I expect there is even value from a test-design and/or statistics standpoint to require that initial distribution to be a Gaussian (or political requirements for it not to be 😉 ). Personally, I think this is why we must move to new testing schemes or standards every so often, though I do understand in the test realm, there are other, real limits on their lifetimes. It’s probably also why new versions tend to try to figure out a way to avoid comparison to old ones. My own take is trying to compare them to previous versions is a necessity, if only in that we understand how valid such comparisons actually are… and how important teaching statistics in school is.. 😉

          • Manuel 9 years ago

            Doug, thanks for telling me that it isn’t the test designers who are responsible for the scaled scores to look like a Gaussian. I had, until now, thought that the testing folks worked with the ones doing the taffy pulling who, you tell us, get paid more than you all.

            Well, that explains it a lot.

            The corollary, of course, is why are the testing folks allowing their good work to be so modified that the end result is what it is. Here I’ve been laying it all at your doorstep and you now tell me that there’s a back shop where all this black magic is done. And that you are not even allowed in the door due to your low pay grade! Shocking!

            Seriously, you have never seen the distributions of raw and/or scaled reports? Wow…

          • Doug McRae 9 years ago

            Navigio — When a state moves to new tests based on new standards, indeed we do have a new average and new distributions of scores. But, good educational measurement practice calls for a comparability study between new and old, so that the first year the new scores are available the state can release the new results along with a comparison to the old scores — essentially an interpretation that says “Here are the new scores, but IF we were still using the old test then here is where we would be.” The comparison data takes the sting out of large decreases of scores from old to new, and/or takes the juice out of any large increases if those are the case, by providing an apples-to-apples comparison to the past.

            For CA, the SPI included comparability to old scores in his recommendation for a new statewide assessment system to the legislature in January 2013, in part based on discussion from the AB 312 advisory panel that met during 2012 to advise the SPI. That was the right thing to do, based on sound educational measurement practice. But, in early Sept 2013 when AB 484 was vastly revised during the final days of the 2013 legislative session, the comparability from old to new was eliminated and in fact prohibited by the revised language, and instead we got political rhetoric that it was time to make a “clean break” from the old test, that common core instruction could not be measured by the old test, with neither of these claims reflecting either good practice or an unbiased analysis of the facts.

            Two states have implemented new common core tests ahead of the consortia timelines: Kentucky and New York. Kentucky included comparability data from old to new, and when new results were released, they were able to tell schools and media and public how the new results compared to the old results. The apparent “decreases” in scores from old to new were heavily reported, but with legitimate comparison data available, overall there was no firestorm over the apparent “decreases” in scores. NY did not do comparability from old to new, and when the results of the new tests showed major apparent “decreases” in scores, NY had no explanation. NY educators were clobbered by the media and the public, and needless to say the new results were not warmly received.

            As I’ve indicated elsewhere, politics trumps psychometrics frequently, but psychometrics can return to bite politics over time — that is what happened in NY, and it doesn’t take a rocket surgeon to forecast CA will likely face the same music as NY in time . . . . .

  13. Monty Neill 9 years ago

    Well, SBAC went ahead and posted their cut scores. Here is what I pasted to the Ed Week page at http://www.edweek.org/ew/articles/2014/11/17/13sbac.h34.html:

    So, if this corresponds to reality, we are to think that only 11% of incoming college frosh take courses without remediation. Based on what I have seen for data on amounts of remediation, this is pure nonsense. Indeed, my off the cuff recall suggests that more students take courses without remediation than are projected to score level 3.

    Makes one wonder if the goal is to make sure schools look terrible. Or is it bad benchmarking?

    And at least one person who participated in the process concludes the levels are not ready for 2015 use. http://edsource.org/2014/smarter-balanced-tests-are-still-a-work-in-progress/69828#comment-15759

    Replies

    • navigio 9 years ago

      The 11th grade metric was level 3 or above so the implied non-remediation rates are about 30% to 40%, not 11% unless you were calculating that from something else.
      It’s also interesting how the grade level scale scores are effectively cumulative. I wonder if that is an attempt to define a continuous achievement level regardless of grade, ie to measure things like ‘months or years of learning’.

      • Monty Neill 9 years ago

        I was commenting on the advanced level; level 3 is as you say more inclusive – but still includes far fewer students than do in fact now attend college without remediation.

        • Monty Neill 9 years ago

          To be more precise, I was responding to this in Ed Week: “Level 4, the highest level of the 11th grade Smarter Balanced test, is meant to indicate readiness for entry-level, credit-bearing courses in college, and comes with an exemption from remedial coursework at many universities.” That identifies Level 4, not 3, as the cutoff for presumably not needing remediation.

  14. Monty Neill 9 years ago

    When I plowed through the proposals from SBAC and PARCC to the Department, it was clear to me that multiple-choice and short answer would continue to dominate – even with some serious weight given to the separate performance tasks. Nothing I have seen since has changed that conclusion, and Doug’s comments reinforce it. Over the years, we at FairTest have seen many claims that m-c and short answer items will – finally – measure higher-order skills. Repeatedly, that remains true only if the definition of ‘higher order’ is dumbed down. Indeed, same old, after $340 million of taxpayer funding (now to a continually declining number of states).

    If the cut off is 1/4 of the way through, that means separating out below basic and basic will be done on far fewer items. The reliability of those judgements will then be questionable. But in time the fates of children, educators and schools will rest on such spurious data – unless the rapidly rising resistance to the overuse and misuse of tests puts a halt to these schemes. (If you are interested in following developments in this resistance, subscribe to FairTest’s weekly Testing Resistance and Reform news, at http://fairtest.org/weekly-news-signup and see previous issues at http://www.fairtest.org/k-12/news.)
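
    One hedged way to put a number on the worry about fewer items is the Spearman-Brown formula from classical test theory, which predicts how score reliability changes when a test is shortened or lengthened. This is a technique brought in for illustration, not something the comment or Smarter Balanced cites, and it describes score reliability for a fixed-form test rather than classification accuracy on an adaptive test. The starting reliability of 0.90 is a hypothetical value, and `spearman_brown` is a name chosen here.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability after changing test length.

    reliability   -- reliability of the original test (0 to 1)
    length_factor -- new length divided by original length
                     (0.25 means one quarter of the items)
    """
    return (length_factor * reliability) / (
        1 + (length_factor - 1) * reliability
    )


# Hypothetical full-length reliability; not a published SBAC figure.
full_test_reliability = 0.90

# Reliability predicted if only a quarter of the items were used.
quarter_length = spearman_brown(full_test_reliability, 0.25)

print(f"full test: {full_test_reliability:.2f}")
print(f"quarter-length test: {quarter_length:.2f}")
```

    Under these assumptions the predicted reliability falls from 0.90 to roughly 0.69, which is the direction of the concern raised above, though the real size of the effect would depend on the actual items and scoring model.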

    The performance tasks

    Replies

    • Doug McRae 9 years ago

      Well, Monty, I’ve never been a fan of FairTest’s efforts to discredit large scale standardized tests . . . rather I’ve been a fan of well constructed large scale standardized tests for K-12 education. And I think I know a thing or two about how to construct such tests. But, I’m also willing to call out poor test development practices when I see them, and I welcome your support for also calling out poor practices when you see them. The Smarter Balanced standards-setting activity in October fails to meet reasonable professionalism on a number of dimensions. (1) They used data from many students who have not experienced instruction on common core standards [a quote from Meas Inc’s successful proposal to be the SB vendor for this activity is relevant here . . . Meas Inc said in their proposal “No matter how well the tests are constructed, no matter how well they are aligned to the common core, no matter how carefully we set the cut scores, if large numbers of students have not had the opportunity to learn the content of the tests, no cut score will be valid.”] SB’s cut score setting activities failed this fundamental condition for defensible cut scores. (2) They used item-tryout data to establish order of test question difficulty required for the bookmark process; item tryout data from matrix sampling does not yield comparable test question difficulties from item to item, a rather basic element needed to establish easy to hard item sequencing. Item-tryout data also does not generate defensible estimates for percents of kids in various achievement categories. (3) Until the SB press release today, no one following the action had any idea SB had any estimates for percents of kids in each achievement category — item-tryout data also cannot be used to generate credible estimates for percents in each achievement category. (4) Finally, if we dig into the SB press release on this topic today, we find a statement designed to defend their process — “the auditor and both advisory panels certified that SB conducted a valid process that is considered consistent with best practices in the field.” Notice they did not claim their valid process generated valid cut scores — one can use a valid process with inappropriate data (for ordering the items, for estimating percents for each achievement category) and the result is garbage in, garbage out independent of the valid blender in the bookmark process. At the end of the day, the SB achievement level setting process did not yield valid cut scores, and those cut scores will not yield valid test scores next spring. So, I welcome your support calling out the Smarter Balanced activity as an inadequate poor test development exercise . . . . and I hope for a better effort next year with full computer-adaptive SB test scores for a suitable sample of kids (preferably a census 100 percent sample).

      • Manuel 9 years ago

        Speaking of GIGO, Doug, and given that you know a thing or two on how to design standardized tests, would you expect any correlation between the scores a student gets in such tests and the classroom mark? To put it in the vernacular, would the distribution of scores for kids getting the highest possible mark in an elementary grade consist mostly of above proficient scores?

        If your answer is no, why not? Most parents would expect their A student to at least score proficient.

        If the answer is yes, email me and I’ll send you some very interesting (and legitimate!) data for a rather large district that proves the opposite.

  15. Paul Muench 9 years ago

    I remember 30% as the target for the percent of multiple choice questions on smarter balanced tests. Do I remember correctly? Is that still the target?

    Replies

    • Doug McRae 9 years ago

      Paul, I’m not sure there was a specific target for multiple choice questions in the SBAC blueprints. But I do recollect a description from SBAC honchos that they expected 70% of the final Smarter Balanced tests to be machine scored (including of course multiple choice questions and other questions that can be electronically scored) and thus part of the computer-adaptive portion of the test, while the remaining 30% would require at least some human scoring and thus would not be part of the computer-adaptive portion of the test. The grade 3 ELA test, of course, is only one of 14 SBAC tests, and it is quite feasible for SBAC to have one test with about 85 percent machine scored yet, over all grade levels and content areas, have only 70 percent machine scored. The finding that grade 3 ELA has predominantly machine scored questions piques an interest in the percentages of machine scored vs human scored questions across grade levels and content areas. That information may already be public via a dive into the published blueprints by grade by content area, so perhaps someone in the know may have a more detailed answer to your question.
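A minimal sketch of the arithmetic behind that point, assuming (this is an assumption, not something the blueprints are quoted as saying) that all 14 tests carry equal weight. Only the 85 percent and 70 percent figures come from the reply above; the function name is hypothetical. It computes the average machine-scored share the other 13 tests would need for the overall figure to land at 70 percent.

```python
def required_average_for_others(overall_pct, one_test_pct, n_tests):
    """Average machine-scored percent the remaining tests must have so that
    the equally weighted mean over all n_tests equals overall_pct."""
    return (overall_pct * n_tests - one_test_pct) / (n_tests - 1)


# Figures from the comment: one test at about 85 percent, 70 percent overall,
# 14 tests in total. Equal weighting across tests is an illustrative assumption.
others = required_average_for_others(70.0, 85.0, 14)
print(round(others, 1))  # 68.8
```

Under that assumption the other 13 tests would only need to average just under 69 percent machine scored, so a single test at 85 percent is entirely compatible with a 70 percent overall figure.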

      • Paul Muench 9 years ago

        Machine scoring could cover a broad range of capabilities. What I’m really interested in is how fruitful it would be for teachers to teach to the test and teach test taking. My assumption has always been that multiple choice questions make it fruitful for teachers to do that. Any idea of whether the machine scored questions provide such an opportunity?

        • Doug McRae 9 years ago

          Paul — Teaching-to-the-test is problematic regardless of testing formats — multiple choice vs machine scorable non-MC vs human scored open-ended formats. That’s ’cause teaching-to-the-test limits the range of instruction to the content actually on the test, rather than the full range required by the content being measured. The major if not only way for a test maker to really discourage teaching-to-the-test is not to reveal in advance exactly the content to be measured by the test (but only to reveal the full range of content that might be on the test) as well as not to reveal the formats to be used for the test. That’s hard to do — too much pushback from the trenches — but realistically it is what it is. Within the limits of MC and machine-scorable non-MC, it depends on how many choices there are for MC, and how the non-MC machine scorable questions are structured. For the reasonably simple non-MC machine scorable structures I saw on the Smarter Balanced online exercise in early October, I’d say most of those structures for grade 3 ELA were much the same as familiar MC test questions. But, it isn’t fair to generalize from grade 3 ELA to higher grade levels or math test questions, ’cause non-MC machine scorable questions for higher grade levels and/or math may well have more complex structures that would reduce (but not eliminate) attractions toward teaching-to-the-test [and/or random marking strategies].
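The remark that the payoff from random marking depends on how many choices a question offers can be sketched with a simple binomial model. This is an illustration only: it assumes every question is answered by pure guessing, independently, with the same number of equally likely options, and the item counts and thresholds below are hypothetical rather than taken from any Smarter Balanced test.

```python
from math import comb


def p_at_least(n_items, n_correct, n_choices):
    """Probability of getting at least n_correct of n_items right when each
    item is guessed independently among n_choices equally likely options."""
    p = 1.0 / n_choices
    return sum(
        comb(n_items, k) * p**k * (1.0 - p) ** (n_items - k)
        for k in range(n_correct, n_items + 1)
    )


# Hypothetical case: 10 questions, at least 5 correct by guessing alone.
four_options = p_at_least(10, 5, 4)   # familiar 4-option multiple choice
eight_options = p_at_least(10, 5, 8)  # a richer format with 8 possible responses
print(four_options > eight_options)   # True
```

The more possible responses a question structure allows, the less a guessing strategy pays off, which is consistent with the point that more complex machine-scorable formats reduce, but do not eliminate, the attraction of random marking.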