
crash course on test design

…relevant instruction as well as by the student's inclination to learn. It is fair to blame the school and the teachers--as well as the students--if students do badly on a test that covers material they have just completed or on a test that measures skills they have been taught for several years.

To oversimplify, the information from aptitude tests is "looking ahead" information: Do students have the knowledge or skills necessary to help them succeed in the future? The information from achievement tests is "looking back" information: Did the students learn what was taught? Of course, if students are not learning what the schools are teaching, and schools find this out in a reasonable time, they can institute procedures to remedy the situation. Consequently, one hears fewer charges of test bias leveled against achievement tests than against aptitude tests.

2. Type of student response. Responses basically are of two kinds: The student must recognize the appropriate answer to a question or problem, or the student must produce the answer. Recognition questions on most professionally produced tests are multiple-choice items. Teachers frequently use true/false or matching items as well. Contrary to popular opinion, multiple-choice questions can provide an effective measure of reasoning processes as well as straight memory. Look at Sample 2, Question C: The question does not simply ask, "What areas of the world have the highest population densities?" Rather, it presents a novel situation in which the student must infer that, of the choices offered, only population density (Choice A) is plausible.


The most frequently used production tests are short-answer or "fill-in-the-blank" tests and essay questions. An example of a short-answer test item: "The sum of scores divided by the number of scores is called the ______." The student could write in "mean," "arithmetic mean," or "average." You probably are familiar with essay questions from your own school days. For example: "What is meant by the statement that France, before 1789, was centralized but not unified?" Or: "Why did the Dreyfus affair become an issue of international significance?" Or, more typically: "Describe your summer vacation."


Of course, the better constructed the questions and the clearer the directions to students, the easier essay tests are to score. But even then, scoring is not simple. To avoid biased and idiosyncratic scoring, scorers need to have in advance a clear outline of just what components will be scored and how much weight will be given to each. Will essays, for example, be scored for style and mechanics as well as substance? A successful scoring technique used by professional test makers (but rarely by teachers) is called "holistic scoring." Raters agree on the characteristics that a top paper will have and choose samples that epitomize that standard; they do the same for papers at each of the lower ranks. Scores then are assigned to all papers by comparing them to the samples.

Production tests do have their problems: They can be tough to score; fewer such questions usually can be presented in a given period of time, so less comprehensive coverage of an area is possible. So why do testers use such questions at all? Because production tests measure characteristics that recognition tests cannot. If you want to find out how well students write, there's no substitute for asking them to write. To ask them questions about writing is not the same thing. You can think of many other examples: translating a passage from Cicero, speaking extemporaneously on a popular topic, analyzing a blood sample, editing a technical manuscript, making a soufflé, disassembling a rifle. This distinction between recognition and production tests is an important one. A student who might recognize appropriate laboratory procedures from word and picture descriptions, for example, might well add water to sulphuric acid, instead of vice versa, when left alone in the lab.

[Table: Samples of major types of test questions]

3. Test interpretation. Tests can be interpreted in two ways: by reference to the performance of other people (norm-referenced) and by reference to judgmental standards (criterion-referenced). In a norm-referenced test, for example, we would say that Betty's score is in the top quarter of scores of students in her graduating class, or above the mean for a national group of twelfth graders, or at the 60th percentile for seniors in the state (meaning that 60 percent of the state's seniors scored lower than she did). This example also indicates that, with respect to the skills measured, Betty had more competition in the state than in her own class (note that she stood higher in her class than she did in the state as a whole).
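To make the percentile idea concrete, here is a minimal sketch of how a percentile rank like Betty's could be computed; the helper function and the score distributions are hypothetical illustrations, not data from any actual test.

```python
def percentile_rank(score, group_scores):
    """Percent of the comparison group scoring below the given score."""
    below = sum(1 for s in group_scores if s < score)
    return 100.0 * below / len(group_scores)

# Hypothetical score distributions for Betty's class and for seniors statewide.
class_scores = [50, 52, 55, 58, 60, 61, 63, 64, 66, 68, 70, 72]
state_scores = [40, 45, 50, 52, 55, 58, 60, 61, 62, 64, 65, 66,
                68, 70, 72, 74, 76, 78, 80, 82]
betty = 68
print(percentile_rank(betty, class_scores))  # 75.0 -- top quarter of her class
print(percentile_rank(betty, state_scores))  # 60.0 -- 60th percentile statewide
```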

Criterion-referenced tests are interpreted according to judgments about what constitutes appropriate performance--highly satisfactory, adequate, or minimal--depending on the test and the purpose for which it was given. For example, a group of teachers might agree that a score of 60 was a passing score on Test X or that ten correct answers out of 12 items on a math quiz indicated acceptable mastery of a concept. An important thing to keep in mind is that there are no universal criteria for such tests: Sixty percent might be a passing score on one test but not on an easier test; ten correct answers out of 12 items might be considered mastery for one concept but not for another.

It is unfortunate that criterion-referenced and norm-referenced measurement often are viewed in sharp contrast, sometimes with the attitude that using one is "good" practice and the other "bad" practice. In fact, the two approaches frequently overlap and certainly are useful supplements to each other. Moreover, normative considerations usually underlie the choice of standards for criterion-referenced tests.



Whether test users rely more heavily on norm-referenced or on criterion-referenced interpretations of test performance, they still must rely to a great extent on judgment in setting standards or cutoff scores. In a school system experiencing a teacher surplus, for example, the accrediting agency might qualify only those teacher applicants who score above the 25th percentile on a test. In times of a teacher shortage the school system might lower the standard.

A mysterious phenomenon is that many teachers and parents take the 50th percentile, or the mean, as the implicit standard of adequate performance. They will bewail the fact that little Maria or Fred is below average without stopping to think that half the children, by definition, have to be. In fact, it is a mistake to think of "average" as representing a single score; it is better to think of "average" as a range of scores encompassing the middle third of the group.
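One hedged way to picture that "middle third" is to treat the band between roughly the 33rd and 67th percentiles as "average." The cut points and the scores below are illustrative assumptions, not a standard published rule.

```python
import statistics

def average_band(scores):
    """Rough 'average' range: the scores spanning the middle third of the group.

    Sketch only -- the band is taken as roughly the 33rd to 67th percentile.
    """
    ordered = sorted(scores)
    n = len(ordered)
    return ordered[n // 3], ordered[(2 * n) // 3]

scores = [38, 42, 45, 48, 50, 51, 52, 53, 55, 56, 58, 60, 62, 65, 70]
low, high = average_band(scores)
print(f"mean = {statistics.mean(scores):.1f}")  # a single number...
print(f"'average' band = {low} to {high}")      # ...versus a range of scores
```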

There are many different procedures that educators use to set standards or cutoff scores. These vary in formality and elaborateness. The burden is on educators, however, to make the standards they adopt explicit and to provide a clear rationale for them. This is of paramount importance if cutoff scores are to be used for such crucial decisions as graduation, promotion, or admission to a program of study.

4. Test developers. By far the largest number of tests are those developed by individual teachers. The critics of testing usually overlook this fact. For every S.A.T., there are thousands of midterm and final exams in mathematics and English; for every standardized elementary reading test, thousands of dittoed reading exercises are used in classrooms. Classroom tests (and other evaluations) are, and will continue to be, the major factors in promotion, graduation, recommendations for further education and employment, and students' self-appraisals of their academic fitness.

The two major types of outside groups that develop and supply tests are state departments of education and national testing agencies (Educational Testing Service; American College Testing Program; Psychological Corporation; Harcourt Brace Jovanovich; Houghton Mifflin; CTB/McGraw-Hill). Sometimes, states and national agencies work together, especially when state departments of education have limited technical or technological resources.

In the case of achievement tests, those developed and used to evaluate students at the local level and those selected or imposed from outside sources should be complementary. A mandated state basic skills test should not come as a shock to a student. If the state test is an appropriate measure of school learning, it should look like the exercises students already have seen--frequently--in their schools. A lack of congruency is a legitimate concern of education critics: Either schools have not prepared students properly, or the outside test is an unfair reflection of school objectives.

In the case of aptitude tests, you would not expect tests developed by outside sources to bear as close a relationship to the tests normally used in schools. (Teachers simply don't make up aptitude tests.) This is not a fault: For example, a student might encounter on an aptitude test a vocabulary word that is unfamiliar. What we hope is that from an accumulated knowledge of words and how they are used, the student might be able to deduce the meaning from the structure of the word and from the context.

By now, you should be suitably impressed by the number and variety of tests that can be devised from the four classifications: There are criterion-referenced achievement tests that are developed by teachers and require students to produce the answers (this category probably characterizes the largest number of educational tests given and taken in the world). There are criterion-referenced, multiple-choice tests that are developed by state departments of education to measure achievement in the basic skills. There are norm-referenced, multiple-choice, aptitude tests such as the ones designed by the Educational Testing Service for the College Board. (Tests of this type have been of greatest concern to school people in recent years, probably because they are the ones over which school people feel they have the least direct control.) In the appropriate circumstances, each type of test can be valuable.-Scarvia B. Anderson



4. The Pretest

Once questions are written, they are administered to a sample group similar to the people for whom the test is intended. The pretest tells test developers several things:

• The difficulty of each question.

• Whether individual questions are ambiguous or misleading.

• Whether items should be revised or discarded.

• Whether incorrect alternative answers should be replaced or revised.
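A rough sketch of the kind of tallying that could sit behind such a pretest analysis follows; the data layout, option labels, and the idea of flagging distractors nobody chose are assumptions made for illustration, not a description of any publisher's actual procedure.

```python
from collections import Counter

def pretest_item_stats(answer_sheets, key, options="ABCD"):
    """Per-item difficulty and answer-choice counts from pretest answer sheets.

    answer_sheets: list of dicts mapping item id -> option chosen.
    key: dict mapping item id -> correct option.  Illustrative sketch only.
    """
    stats = {}
    for item, correct in key.items():
        counts = Counter(sheet[item] for sheet in answer_sheets if item in sheet)
        answered = sum(counts.values())
        difficulty = counts[correct] / answered if answered else 0.0  # proportion correct
        dead = [o for o in options if o != correct and counts[o] == 0]  # distractors nobody chose
        stats[item] = {"difficulty": round(difficulty, 2),
                       "counts": dict(counts),
                       "unused_distractors": dead}
    return stats

# Hypothetical pretest responses for two items.
key = {"Q1": "B", "Q2": "D"}
sheets = [{"Q1": "B", "Q2": "D"}, {"Q1": "B", "Q2": "A"},
          {"Q1": "C", "Q2": "D"}, {"Q1": "B", "Q2": "D"}]
print(pretest_item_stats(sheets, key))
```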

5. Sensitivity Review

Before and after the pretest, each test is reviewed to ensure that the questions reflect the multicultural nature of our society and that appropriate, positive references are made to minorities and women. Each test item is reviewed to ensure that any word, phrase, or description that may be regarded as biased, sexist, or racist is removed. These judgments are made by "sensitivity" reviewers who are especially trained to identify and eliminate material that might be unfair or offensive to any group.

An Example

The Cycle Continues: Assemble the Test

11th Month

• Select Items and Assemble Test

• Review Test

12th & 13th Months

• Editor's Reviews

14th & 15th Months

• Sensitivity Review

• Test Production

• Planograph Answer Key

• Committee Review

• Official Answer Key

• Quality Control

16th & 17th Months

• Printing

18th Month

• Administer Test

• Preliminary Item Analysis

19th Month

• Report Scores

• Item Analysis

• Test Analysis Report


6. The Final Phase

In the final phase, test makers analyze the pretest data and choose appropriate questions of the proper difficulty to reflect the subject matter and to test specific skills. A proper distribution of content accounts for differences in school curricula as well as regional differences, rural-urban differences, and sex and ethnic differences.

After the test is assembled, it is reviewed by other specialists, committee members, and, sometimes, by outside experts. Each reviewer answers all questions independently and prepares a list of correct answers, which is then compared with the intended answers to verify agreement on the correct answer to each question. No test can go to press until the person responsible certifies that at least three different people have agreed on the correct answer to every question.
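That certification step amounts to a simple cross-check of independently prepared keys against the intended key. The sketch below is only a way of picturing it, with made-up reviewer data and an assumed agreement threshold; it is not ETS's actual tooling.

```python
def uncertified_items(intended_key, reviewer_keys, required_agreement=3):
    """List questions where fewer than `required_agreement` reviewers
    matched the intended answer.  An empty list means the key is certified.
    Illustrative sketch with assumed data structures."""
    problems = []
    for item, answer in intended_key.items():
        agreeing = sum(1 for rk in reviewer_keys if rk.get(item) == answer)
        if agreeing < required_agreement:
            problems.append((item, agreeing))
    return problems

intended = {"Q1": "C", "Q2": "A", "Q3": "B"}
reviewers = [
    {"Q1": "C", "Q2": "A", "Q3": "B"},
    {"Q1": "C", "Q2": "A", "Q3": "D"},   # this reviewer disagrees on Q3
    {"Q1": "C", "Q2": "A", "Q3": "B"},
]
print(uncertified_items(intended, reviewers))  # [('Q3', 2)] -- Q3 must be resolved
```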

The test is then sent to the external committee of examiners for review before going to press.

7. Test Validity and Disclosure

The test preparation process continues even after the test has been administered. Continuing research is necessary to ensure that the test is valid--that it does the job it is designed to do.

After the test is administered, but before final scoring takes place, preliminary statistical analysis of each question is carried out, based on several thousand answer sheets. The results are reviewed question-by-question. If a problem is found, corrective action, such as not scoring the question, is taken before final scoring and score reporting.
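The passage does not name the statistics involved. As one plausible illustration only, an analyst might flag items that almost no one answers correctly or whose scores correlate negatively with total scores, and withhold them from final scoring; the thresholds and the use of an item-total correlation below are assumptions.

```python
import statistics  # statistics.correlation requires Python 3.10+

def flag_items(score_matrix, min_difficulty=0.10, min_item_total_corr=0.0):
    """Return indices of items that look problematic before final scoring.

    score_matrix[i][j] is 1 if examinee i answered item j correctly, else 0.
    The statistics and thresholds are illustrative assumptions only.
    """
    totals = [sum(row) for row in score_matrix]
    flagged = []
    for j in range(len(score_matrix[0])):
        item = [row[j] for row in score_matrix]
        difficulty = statistics.mean(item)               # proportion correct
        try:
            corr = statistics.correlation(item, totals)  # item-total correlation
        except statistics.StatisticsError:               # no variance in item or totals
            corr = 0.0
        if difficulty < min_difficulty or corr < min_item_total_corr:
            flagged.append(j)
    return flagged  # candidates to withhold from scoring pending review

# Hypothetical mini-example: item 2 was answered correctly by no one.
sheets = [[1, 1, 0], [1, 0, 0], [0, 1, 0], [1, 1, 0], [1, 0, 0]]
print(flag_items(sheets))  # [2]
```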

Accuracy of test content is further strengthened by feedback from students to the test center supervisor, alerting test administrators to the possible existence of ambiguously worded items. This process has been extended through test disclosure, which enables students to examine the questions and answers after the test has been given. When test development errors come to light through these processes, appropriate corrective action is taken.

At every stage of the process, ETS does all that is possible to assure the accuracy and fairness of individual test questions and the examinations as a whole. It's a demanding and time-consuming process ... but it's worth the effort. ETS strives to set high standards in the field for quality tests.

For further information, contact Information Division, ETS, Princeton, NJ 08541.



Attachment D

QUESTIONING THE QUESTIONS

The Process for Challenging a Question on an ETS Test

1. Test-Taker Inquiry to ETS

Test-takers may inquire about a question on an ETS admissions test in several ways.


2. Each inquiry is reviewed by two or more ETS staff members who have major responsibility for the type of question in dispute. Additional reviews by subject matter specialists, such as university faculty in the relevant field, are also frequently obtained depending on the nature of the challenge.



3. A letter, either explaining the question's soundness or acknowledging a flaw, is then mailed to the test-taker. (If a flaw is acknowledged, the next step in the process is #8 below.)

4. Further Test-Taker Inquiry

If the test-taker is not satisfied with ETS' response, he or she may write ETS again to continue the challenge.

5. Additional Independent Review

Additional reviews of the disputed test question are obtained from a combination of university faculty and/or other appropriate experts outside ETS (such as secondary school teachers for the PSAT or SAT) and senior ETS staff who were not involved in developing the test in question. Without prior knowledge of the test question or correspondence related to it, the independent reviewers are asked to select an answer to the question and justify their choice. They are then given the correspondence between the test-taker and ETS and asked to explain in writing why their perceptions are or are not changed by those arguments.
