The Smarter Balanced Assessment Consortium provided a sneak peek at its final computer-adaptive tests in early October. The tests are to be administered to roughly 25 percent of the country’s students in grades 3-8 and 11 in spring 2015 to measure, initially, status and, eventually, growth in achievement on the new Common Core academic standards for English Language Arts and Mathematics. The peek reveals the prospective tests are a work in progress – tests that I believe won’t be ready for prime time until at least spring 2016.
The sneak peek was provided via the process Smarter Balanced is using to determine “cut scores” for test results, or essentially how many test questions a student must answer correctly to be labeled proficient.
Smarter Balanced officials have yet to determine how the achievement categories will be labeled. They have indicated the results will have four achievement categories, which for now are labeled simply Categories 1 through 4. For this commentary, I will use the labels below basic, basic, proficient and advanced (or even D, C, B and A) as substitutes for the concept of achievement categories.
The Smarter Balanced process involves structured judgments on the test questions planned for use in spring 2015. Judgments were elicited from volunteers who signed up for a 3-hour online session to review actual test questions and indicate where a Category 3, or proficient, “cut score” should be placed. The results of the online exercise were to be provided to more than 500 teachers and others, nominated by 17 states to participate on “in-person” panels in mid-October, who would conduct formal cut-score-setting exercises for the 14 tests being developed by Smarter Balanced.
The Smarter Balanced process also involved two panels (one each for ELA and Math) to coordinate proposed cut scores across grade levels. Recommended cut scores were to be endorsed by Smarter Balanced member states on Nov. 6, but that approval has been delayed.
We should be reminded that the actual Smarter Balanced tests for spring 2015 have not yet been finalized. Analyses from the Smarter Balanced field tests that students took in spring 2014, designed primarily to qualify test questions for use in final tests, have not yet been completed.
But the exercise I participated in did provide a set of test questions that mirrored Smarter Balanced plans for the final tests, including the proposed balance between multiple-choice questions (and other test questions that can be scored electronically) and the open-ended questions needed to test many of the new Common Core academic standards in depth.
So, with care taken not to disclose any of the secure material involved in the online exercise, what were the observations of this experienced K-12 testing system designer?
I did the online exercise for grade 3 English Language Arts, and for this grade level and content area traditional multiple-choice questions dominated. In fact, 84 percent of the questions were either multiple-choice or “check-the-box” questions that could be scored electronically, and these questions were very similar or identical to those on traditional “bubble” tests. Only 16 percent were open-ended questions, which many observers say are needed to measure Common Core standards.
The online exercise used a set of test items arranged in order of difficulty, from easy questions to hard questions. It asked the participant to identify the first item in the sequence that a Category 3, or B-minus, student would have less than a 50 percent chance of answering correctly. I identified that item after reviewing about 25 percent of the items. If a Category 3 or proficient cut score is set at only 25 percent of the available items or score points on a test made up primarily of multiple-choice questions, that cut score clearly invites a strategy of randomly marking the answer sheet. The odds are that a student using a random marking strategy will reach a proficient score quite often. This circumstance would produce many random (that is, invalid and unreliable) scores and reduce the credibility of the entire testing program.
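The random-guessing concern can be checked with a quick binomial calculation. The numbers below are hypothetical – a 40-item test of four-option multiple-choice questions, with the proficient cut set at 25 percent of the items (10 correct) – chosen only to mirror the scenario described above, not to describe the actual Smarter Balanced test.

```python
from math import comb

def p_at_least(n_items, cut, p_correct):
    """Probability that random guessing yields at least `cut` correct
    answers on `n_items` questions, each guessed correctly with
    probability `p_correct` (binomial tail probability)."""
    return sum(
        comb(n_items, k) * p_correct**k * (1 - p_correct)**(n_items - k)
        for k in range(cut, n_items + 1)
    )

# Hypothetical test: 40 four-option items, cut at 10 correct (25 percent)
print(p_at_least(40, 10, 0.25))
```

Under these assumptions a pure guesser clears the cut more than half the time, since guessing on four-option items yields 10 correct on average – exactly the point that a cut score placed that low invites random marking.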
It troubled me greatly that many of the test questions later in the sequence appeared to be far easier than the item I identified, per the directions for the online exercise, as marking a Category 3 or proficient cut score. I found at least a quarter of the remaining items to be easier, including a cluster of clearly easier items placed about two-thirds of the way into the sequence. This calls into question whether the sequence of test questions used by Smarter Balanced was indeed ordered from easy to hard. If it was not, then the results of the entire exercise are seriously suspect.
There were several additional concerns about the Smarter Balanced cut-score-setting exercise this October that are too technical for full discussion in this commentary. Briefly, the exercise appeared not to include any use of “consequence” data, which typically is included in a robust cut-score-setting process. Consequence data are estimates of what percent of students will fall in each performance category, given the cut scores being recommended. I also questioned whether the spring 2014 Smarter Balanced field test data were used to guide the exercise in any significant way. Indeed, since the 2014 field test was essentially an item-tryout exercise, designed to qualify test questions for use in final tests, it did not generate the type of data needed for final cut-score determinations in several significant respects.
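For readers unfamiliar with consequence data, the idea can be sketched in a few lines: given a distribution of student scores and a set of proposed cut scores, compute the percent of students who would land in each category. The scores and cut scores below are invented purely for illustration.

```python
from bisect import bisect_right

def consequence_data(scores, cuts):
    """Percent of students per category, given ascending cut scores.
    Each cut is the minimum score needed to enter the next category,
    so len(cuts) cuts define len(cuts) + 1 categories."""
    counts = [0] * (len(cuts) + 1)
    for s in scores:
        # Number of cuts at or below this score = category index
        counts[bisect_right(cuts, s)] += 1
    return [round(100 * c / len(scores), 1) for c in counts]

# Invented scores on a 0-60 scale; cuts for Categories 2, 3 and 4
scores = [12, 25, 31, 38, 44, 47, 52, 58, 20, 35]
print(consequence_data(scores, cuts=[20, 35, 50]))
```

A table like this, generated from real field-test data, is what lets panelists see whether a proposed cut score would, say, label half the state’s students below basic before the cut is finalized.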
Smarter Balanced calls its spring 2015 administration an “operational” test. But any operational test needs more than qualified test questions to yield valid scores. It must also have valid scoring rules to generate meaningful scores for students, teachers and parents, and valid aggregate scores for schools, districts and important subgroups of students.
It is quite clear to me that the cut-score-setting exercises conducted by Smarter Balanced this month will not produce final or valid cut scores for timely use with spring 2015 Smarter Balanced tests. Spring 2015 tests will instead be benchmark tests (to use test development parlance), tests that yield data that then can be used to generate valid cut scores. That exercise will have to wait for September 2015 at the earliest. The Smarter Balanced website recognizes this by labeling the cut scores recommended in October 2014 as “preliminary” cut scores, to be validated by spring 2015 data.
California plans to use the cut scores recommended by the panels that met in October to disseminate millions of test scores in spring 2015. These plans face the prospect that those scores will have to be “recalled” and replaced with true or valid scores just months after incorrect scores are disseminated. This is not a pretty picture for any large-scale statewide assessment program.
The bottom line: Smarter Balanced tests are still a work in progress. I think it will be spring 2016 before Smarter Balanced tests will be able to generate valid, meaningful test scores in a timely fashion for California students.
• • •
Doug McRae is a retired educational measurement specialist who has served as an educational testing company executive in charge of design and development of K-12 tests widely used across the United States, as well as an adviser on the initial design and development of California’s STAR assessment system.
The opinions expressed in this commentary represent solely those of the author. EdSource welcomes commentaries representing diverse points of view. If you would like to submit a commentary, please contact us.