Duncan MacQuarrie
Manager of Student Assessment
Joe Willhoft
Executive Director, Research & Evaluation
Tacoma Public Schools
Paper Presented at the Annual Spring Conference of the
Abstract
The performances of schools and districts on the Washington Assessment of Student Achievement (WASL) have become the primary achievement indicators in the state accountability system. In addition, the WASL scores will become the indicators under the Title I accountability requirements of the recently reauthorized Elementary and Secondary Education Act (ESEA). An important dimension of the use of test scores for such purposes is the validity of inferences about student learning that are based on performance standards (cut-scores). In order for interpretations of scores based on such standards to be valid, the standards must be reasonable and fair. This study reports data that raise questions about the reasonableness of the performance standards in reading and mathematics at the elementary, middle, and high school levels. The reading standard at 7th grade deviates markedly from those at 4th and 10th grade. The mathematics performance standards are consistently difficult across the grade levels. However, the difficulty of the mathematics standards at grades 4 and 10 is much greater than is the difficulty of the standards for reading at those grades. These patterns raise additional questions about the reasonableness of the WASL performance standards.
Validity
Evidence for WASL
Performance
Standards
The Washington Assessment of
Student Learning (WASL) is a standards-based assessment program designed to
measure important aspects of the Essential Academic Learning Requirements
(EALR),
The WASL assessments, administered in the spring, were introduced successively at grades 4, 7, and 10 over a three-year period (1997 to 1999). As standards-based or criterion-referenced tests, performance standards were established for each of the component tests (listening, reading, writing, and mathematics) in the respective initial operational year. In reading and mathematics, four performance categories were established using a modified form of the “bookmark procedure” developed by staff from CTB/McGraw-Hill (Lewis, Mitzel, & Green, 1996).
In 1999 the state established elementary school goal-setting criteria based on the 4th grade WASL reading test. This was expanded in 2000 to include requiring school and district improvement goals in grades 4, 7, and 10 for WASL reading and mathematics.
It has been nearly five years since the first operational form of the WASL was administered to 4th grade students statewide. There now exists multiple years of data regarding the performance of students on the various WASL tests across years and grade levels. In addition, there are several cohorts of students with both norm-referenced test scores and WASL scores as well as two cohorts with 4th and 7th grade WASL scores and one cohort with 7th and 10th grade WASL scores. These rich data sets provide an opportunity to investigate various characteristics of the reading and mathematics performance levels.
Several patterns have emerged in the state data that are generally reflected in district data. First, the percent of students meeting the performance standards in mathematics at grades 4, 7 and 10 is lower than the corresponding percent meeting the reading standard. In grades 4 and 10 the differences are quite large. Second, the rate of growth in the percent of students meeting the performance standards has decelerated dramatically in 4th grade, remained more or less flat at 7th grade, and been modest at 10th grade. (See Table 1.)
|
|
|
1997 |
1998 |
1999 |
2000 |
2001 |
|
4th
Grade |
|
|
|
|
|
|
|
|
|
47.9 |
55.6 |
59.1 |
65.8 |
66.1 |
|
|
Math |
21.4 |
31.2 |
37.3 |
41.8 |
43.4 |
|
7th
Grade |
|
|
|
|
|
|
|
|
|
|
38.4 |
40.8 |
41.5 |
39.8 |
|
|
Math |
|
20.1 |
24.2 |
28.2 |
27.4 |
|
10th
Grade |
|
|
|
|
|
|
|
|
|
|
|
51.4 |
59.8 |
62.4 |
|
|
Math |
|
|
33.0 |
35.0 |
38.9 |
The WASL performance has become the primary indicator in the public reporting of the health of the public schools (local media) and in the more formal reporting recommended by the Academic Achievement and Accountability Commission (A+ Commission). It has been acknowledged that the performance standards are challenging, and that their degree of difficulty reflects the significant changes in the school curriculum that they are intended to promote. In addition, considerable emphasis has been placed on the importance of making continuous improvement in the percent of students meeting the standards rather than the absolute performance in any particular year. It also has been recognized that changes in the educational system are complex and that only a sustained and long-term commitment will accomplish this goal.
One characteristic of a standards-based reporting system is the use of a limited number of performance categories. Students’ performances on the WASL reading and mathematics tests are classified into two primary categories: Met Standard and Did Not Meet Standard. Each of these primary categories is further divided into two categories. Students meeting the standard are classified as either “at the standard” (level 3) or “well above the standard” (level 4). Likewise, students who do not meet the standard are classified as either “below the standard” (level 2) or “well below the standard” (level 1).
Such systems inevitably produce groups of students scoring just one or two points below the critical score that would result in their being classified in the next higher category. School staff frequently are encouraged to review frequency distributions and to take solace that they really did better than the reports indicate because a fair number of their students were just one or two points away from reaching the next level. Such observations are frequently combined with encouragement to work a little harder next year with those students “close to the standard” so as to help them get that extra point or two. Of course hindsight is so much clearer. We can see exactly which students could have been the focus of our extra effort. However, is there a way of knowing which students might be right on that bubble next year?
At last year’s annual spring conference of the Washington Educational Research Association (WERA) there was a session titled “Some Statistical Techniques for Predicting WASL from ITBS.” In that session a description of the overlap in the coverage of the EALR by the Iowa Tests of Basic Skills (ITBS) and the WASL was given. In addition, the statistical relationship between the 3rd grade ITBS scores and the subsequent 4th grade WASL scores was described and a method was identified for evaluating the utility of predicting WASL scores from the prior year’s ITBS scores. During the discussion period following the presentations several members of the audience described how they were providing classroom teachers information about the expected performance of their fourth grade students on the WASL based on those students prior year’s ITBS performance. It was reported that teachers found such information to be helpful.
Following the conference, staff from the Research and Evaluation Office of the Tacoma Public Schools merged district student files containing the ITBS scores from 1999 and WASL scores from 2000. The resulting database contained over 2,000 matched cases with both 3rd grade ITBS scores and 4th grade WASL scores and these data were used to investigate various statistical prediction models that might be potentially useful for providing teachers with some idea of their students’ expected WASL scores. In addition, using a methodology similar to that described at last year’s conference for evaluating the utility of score prediction systems, a more direct portrayal was also constructed. In this method, for each value or range of values for the ITBS national percentile rank (NPR) scores, the percent of the students within that score range who subsequently met the standard was calculated. Then a bivariate plot was constructed in a coordinate system where the prior year’s ITBS scores were represented along the X or horizontal axis and the percent of students meeting the standard (or any other performance category) were represented along the Y or vertical axis. The percent of students meeting the standard within each ITBS score category was plotted to complete the portrayal.
Consider the simple example data in Table 2 and the following explanation.
Table 2.
Select ranges in an example data set.
|
ITBS NPR Range |
Frequency |
N Met WASL Standard |
Percent Met WASL Standard |
|
1 – 4 |
129 |
16 |
12.4 |
|
5 – 9 |
129 |
14 |
10.9 |
|
10 – 14 |
95 |
20 |
21.1 |
|
. |
. |
. |
. |
|
. |
. |
. |
. |
|
. |
. |
. |
. |
|
50 – 54 |
90 |
65 |
72.2 |
|
55 – 59 |
145 |
113 |
77.9 |
|
60 – 64 |
188 |
153 |
81.4 |
|
. |
. |
. |
. |
|
. |
. |
. |
. |
|
. |
. |
. |
. |
|
80 – 84 |
71 |
69 |
97.2 |
|
85 – 89 |
71 |
67 |
94.4 |
|
. |
. |
. |
. |
|
. |
. |
. |
. |
The construction of the type of table shown above is straightforward. Each row pertains to an ITBS NPR score range. The second column contains the total number of students who scored in that NPR range as 3rd graders and the third column reflects the number of those students who subsequently met the WASL standard in 4th grade. The fourth column expresses the number meeting the standard as a percent of all students in that NPR range (or, the third column divided by the second column, expressed as a percent). Once such a table has been completed a simple histogram can be developed by plotting the values from column four (percent meeting standard) along the vertical axis for each corresponding value in column one (NPR band) along the horizontal axis.
The initial idea was to use these analyses as the basis for developing very general criteria related to 3rd grade ITBS performance that teachers and principals could use to identify students that might potentially be close to the WASL “meets standard” cut score of 400 in reading and mathematics. Looking at the data it appeared something reasonable could be developed for WASL reading, but the possibilities for mathematics were not as clear. The data suggest that it isn’t until you look at students scoring above the 60th NPR on the ITBS total mathematics that you even find half of them subsequently meeting the WASL standard. That is, the students likely to be very close to the math standard, and who with a little extra help could get over the cut score, were likely to be those students with above average basic skills performance coming into 4th grade. Such a message did not seem to have the potential to be helpful, particularly when the comparable reading basic skills score was closer to the 35th NRP.
This pattern of considerable difference in the difficulty of the performance standards for reading and mathematics at grade 4 prompted a desire to investigate these relationships at grades 7 and 10. However, it was not until the 2001 WASL results were released in late summer that the two years of data necessary for the analyses were available. Norm-referenced tests were first administered in the grades preceding the 7th and 10th grade WASL grades in the spring of 2000 when 6th grade students were given another level of the ITBS and 9th graders were administered the Iowa Tests of Educational Development (ITED).
Through the cooperation of the Assessment unit of the Office of Superintendent of Public Instruction (OSPI), matched sets of student records containing WASL scores and prior year ITBS scores were obtained. These files represented data for students who had taken the WASL as 4th graders in the spring of 2000 and 2001, and for 7th graders who had taken the WASL in 2001. Matched sets of student records containing WASL scores and 9th grade ITED scores were also made available for students who had taken the 10th grade WASL in 2001. In addition, OPSI research staff provided matched sets of student records for students with both 4th and 7th grade WASL scores (7th graders of 2000 and 2001) and students with both 7th and 10th grade WASL scores (10th graders of 2001). These sets of matched records have allowed various relationships between norm-referenced and standards-based scores for the same students to be investigated and also the relationships between different grade levels of WASL scores for the same students.
Using the approach applied to the





grade pattern was dramatically different.
The differences in the relationships between the percent of students meeting the WASL performance standard and those students’ norm-referenced scores suggested looking more closely at the cut-points for the various performance levels. Recent work by Schwarz, Yen, and Schafer (2001) and Shepard and Peressini (2002) suggest that using existing normative data can help make clearer the degree of difficulty associated with the different performance levels of standards-based assessments. Shepard and her colleagues were asked by the Denver Area School Superintendents’ Council to “provide a more complete understanding of the content and difficulty” of the 10th grade mathematics assessment that is a component of the Colorado State Assessment Program (CSAP). In addition to the 10th grade standards-based assessment, the CSAP also incorporates the American College Testing (ACT) Program’s 10th grade PLAN assessment, a nationally-normed test administered earlier in the year. Scores from the PLAN and the CSAP are highly correlated (r = .81) and thus they rank individuals in about the same order. When tests order individuals in a similar manner, score points on one assessment can be linked to score points on the other. These researchers related the two sets of scores using an equipercentile linking approach and used this linking to add meaning to the various cut points of the CSAP.
Since the correlations between the WASL standard scores and the prior year’s ITBS/ITED scale scores are also strongly correlated (Table 3) equipercentile linking can also be used to add meaning to the cut scores that define the various levels of WASL reading and mathematics tests.
Table 3. Correlations between reading and mathematics WASL standard scores and prior year’s ITBS/ITED scale scores.
|
|
3rd/4th Grade |
6th/7th Grade |
9th/10th Grade |
|
.72 |
.76 |
.75 |
|
Mathematics |
.77 |
.83 |
.80 |
At each grade level, for the reading and mathematics tests, cumulative percent distributions were constructed for both the norm-referenced scale scores (ITBS or ITED) and for the WASL standard scores. These cumulative percent distributions actually represent “state percentile rank” distributions. The WASL standard scores (and their corresponding state percentile ranks) representing the first active score in each of the performance levels 2, 3, and 4 were linked to the ITBS/ITED scale scores that represented those same state percentile ranks. Then, using the NPR scores associated with the ITBS/ITED scale scores, a cautious interpretation of the difficulty of the WASL cut scores can be made in terms of the corresponding national normative performance. Table 4 shows the prior year’s ITBS/ITED NPR scores associated with the WASL cut score that defines “meets standard” (standard score of 400).
Table 4. Prior year’s NPR associated with WASL reading and mathematics performance standards.
|
Grade |
Subject |
NPR |
Grade Level Difference* |
|
|
|
40th |
About a month below |
|
4th |
Mathematics |
61st |
Half a year above |
|
|
|
63rd |
One year above |
|
7th |
Mathematics |
72nd |
Two years above |
|
|
|
43rd |
Half a year below |
|
10th |
Mathematics |
72nd |
Three years above |
*The grade level for which the NPR would be expected to be about average.
This analysis confirms the patterns found in the first analysis. That is, the mathematics performance standard is difficult and very similar at 7th and 10th grades and only slightly less difficult in 4th grade. The reading standard is much easier at 4th and 10th grades, while the 7th grade reading standard is much more difficult and looks to be very similar to the 4th grade math standard in degree of difficulty. The cut scores at the standard vary in difficulty. In addition, the cut scores for levels 2 and 4 (between “well below standard” and “below standard;” and between “well above standard” and “above standard’) also are quite different across the grades and between reading and mathematics. Table 5 shows the various cut scores, their linked NPR scores based on the prior year’s norm-referenced testing, and the estimated deviation from grade level for a student for whom that NPR would be typical. These linked values also demonstrate the considerable variability in all the cut scores across grade levels for both reading and mathematics.
The normative benchmarks for level 2 vary from the 6th percentile for 4th grade reading to the 20th percentile for 7th grade mathematics. Level 4 benchmarks range from a low of the 56th percentile for 10th grade reading to a high of the 86th percentile for 7th grade mathematics, although 5 of the 6 level 4 benchmarks are at the 80th percentile or above.
Table 5. Prior year’s NPR associated with WASL
reading and mathematics cut scores at different performance levels.
|
Grade |
Subject |
Level |
NPR |
Grade Level Difference* |
|
|
|
2 |
6th |
-1.6 |
|
|
|
3 |
40th |
-0.1 |
|
|
|
4 |
80th |
1.5 |
|
4th |
|
|
|
|
|
|
|
2 |
36th |
-0.2 |
|
|
Mathematics |
3 |
61st |
0.6 |
|
|
|
4 |
82nd |
1.7 |
|
|
|
2 |
20th |
-1.9 |
|
|
|
3 |
63rd |
1.1 |
|
|
|
4 |
84th |
3.1 |
|
7th |
|
|
|
|
|
|
|
2 |
58th |
0.9 |
|
|
Mathematics |
3 |
72nd |
2.0 |
|
|
|
4 |
85th |
3.7 |
|
|
|
2 |
15th |
-3.6 |
|
|
|
3 |
43rd |
-0.7 |
|
|
|
4 |
56th |
0.8 |
|
10th |
|
|
|
|
|
|
|
2 |
55th |
0.9 |
|
|
Mathematics |
3 |
72nd |
3.2 |
|
|