Chairman PERKINS. Thank you, Dr. Mehrens. We have now been joined by the distinguished chairman of our postsecondary committee. I understand Dr. Johnson has arrived and you are scheduled to be the next witness. Please take your place at the witness table. You may proceed as you wish. We welcome you.

STATEMENT OF SYLVIA JOHNSON, PROFESSOR OF RESEARCH METHODOLOGY, SCHOOL OF EDUCATION, DEPARTMENT OF PSYCHOEDUCATIONAL STUDIES, HOWARD UNIVERSITY

Ms. JOHNSON. Thank you.

I am glad to have the opportunity to testify before the subcommittee this morning. I will begin with some brief information about my background. I am an associate professor at Howard University in the School of Education, in the area of research methodology and statistics. I hold a Ph. D. degree from the University of Iowa; while at Iowa I was a special research assistant in testing programs, and my degree is in measurement and statistics. Some years ago I spent a graduate year at Michigan State University in the program where Dr. Mehrens teaches. I have a master's degree from Southern Illinois University, in the area which Mr. Simon represents, and an undergraduate degree from Howard University. I was asked to comment on some areas of measurement that have implications for the bill you are considering. Specifically, I was asked to comment on the use of tests for selection and on what tests can and cannot do. I would like to begin by considering some points that are perhaps less technical than others that might be raised, but I will certainly be glad to go into more technical points if desired.

I think we need to be aware of the nature of measurement. Some of the comments I will make here are related to a publication of the Institute for the Study of Educational Policy at Howard University which I wrote a couple of years ago while a senior fellow at the Institute. As to the nature of measurement: since we cannot measure minds in the same way we measure the length of a room, all educational and psychological measurement is done indirectly. We never measure psychological factors directly. By measuring performance or behavior on some task, we infer the level of achievement, intellectual functioning, or some other psychological construct. But the fact that measurement is indirect does not mean it is faulty. In fact, much physical measurement is done indirectly. Inferences are made because of certain theories about how the universe operates, and these inferences make it possible to measure everything from distances in the solar system to distances between and within atoms. Physical measurement, like psychological measurement, can result in errors; however, errors in psychological measurement may be particularly hard to identify and correct. For example, if I say that the combination of ingredients used in that wall will not harden when mixed together, and I attempt to test that theory by taking a stroll toward the adjoining office, I will meet with some solid evidence to refute the theory. The social scientist, however, lacks the benefit of such firm validation or invalidation and so carries a strong responsibility for careful, thoughtful analysis and empirical validation.

In testing, the behavior measure used as the basis for an inference is the test score. An inference is then made to a level of achievement, aptitude, personality, interest, or some other human characteristic which cannot be measured directly. The relationship between this test score and the characteristic we wish to infer may not be the same for all examinees, particularly for many members of minority groups who have been systematically excluded from educational benefits. As a result, test scores alone frequently are not a sound measure of aptitude or other psychological constructs for minority groups. Leaders of minority communities have seen numerous cases of discrepancy between ability to perform and measured aptitude, and many regard tests as useless. The reality is that tests are increasingly used in all forms of decisionmaking. Their uses and misuses need to be more fully understood by policymakers and community leaders.

Many factors operate to attenuate or lower test scores, and those factors tend to have the greatest influence on blacks and other minority applicants. These include factors which affect the actual performance of individuals on the tests, such as socioeconomic status, differences in educational opportunity, motivation, narrowness of content of the test, atmosphere of the testing situation, and the perceived relevance of the test to success. They also include factors that affect the test score more indirectly, such as the composition of the group used for item tryouts, the item selection and item analysis which precede standardization, the composition of the standardization groups, and the techniques and procedures employed in item construction. Also, the validity and appropriateness of tests often differ across applicants in relation to the same future performance or criterion. These factors attenuate scores and raise the validity problems involved in measuring scholastic aptitude.

Turning to the use of tests for selection: many tests, as you well know, are widely used for graduate and professional school admissions as well as for admission to colleges. Often people assume that a certain score, a certain cutting score, automatically eliminates those who are less able and chooses those who are better. A problem arises when it is assumed that if a certain cutting score on a test eliminates potentially unsuccessful applicants, then raising that score will automatically yield better applicants by eliminating more of the potentially unsuccessful. In fact, this is not always the case. That is, many factors need to be considered in college admission rather than a single criterion.

A rough analogy can be seen in the following situation.

Suppose one wishes to board a bus to travel some distance in a large city where there is an established bus fare. Let us assume the fare is the same whatever distance you wish to travel. If you have less than the established fare you will not get a ride; the exact fare is required. Having more money will not take you further under a fixed rate schedule. Whether you reach your destination does not relate to whether you have more than the basic fee in your pocket. It relates to other actions you may take and characteristics you may have, such as how well you check the route map and bus schedule, and whether you watch to see if you are nearing your destination. Your timely arrival also relates to external factors such as the traffic situation, possible accidents, et cetera.

Let us carry our analogy one step further. Imagine an entire college of individuals waiting to board this bus; some have the required bus fare and others do not. The bus fare will be one method of determining who rides the bus. In this situation there is a whole set of antecedent questions that determine who gets a chance to drop their fare in the box. How widely were the bus schedules and route information distributed? How many people can the bus carry? Who regularly rides this route, or knows other people who have taken this bus? Who gets in line first, and who is driving the bus? Once the bus is boarded, the same characteristics and external factors determine success in negotiating the system.

An educational example would be helpful here, but our image of the test, that is, the general public's image of the test, is as a mental yardstick, and that image makes acceptance of a more qualified view difficult. Let us conceive of the test instead as a yardstick that varies in appearance depending on the purpose for which it is used. A medical school aptitude test used to predict the criterion of success as a physician could be viewed as a yardstick with numbers clearly printed at intervals halfway up the stick and blank the rest of the way. The implication: this is a reasonable measure of what we are interested in up to a certain point; after that we need to look at other yardsticks. The same test might be a blank yardstick when used for another purpose, such as predicting success as a sculptor or musician. It might be a rubber yardstick if used to predict success in high school science courses: even though the type of content is appropriate, the inappropriate level would result in measurement with low reliability and validity.

The personal attributes and characteristics that are the focus of many tests are complex entities. Test developers and measurement researchers are aware of the multidimensional nature of intelligence and other aptitudes. Earlier writers noted that the general impression that intelligence is unidimensional is borne out by the common usage of the word: we speak constantly of individuals as having high or low intelligence, or having some particular degree of intelligence. This practice is part of our folklore, an example of our tendency to oversimplify complex concepts. Verbal tests with academic-type tasks are used to measure academic aptitude, general aptitude, and intelligence. The justification for this procedure is that when nonverbal and performance tasks are used, they normally correlate highly with verbal tasks; thus the more reliable, easier to administer, and cheaper verbal tasks are employed. However, the correlations between verbal and performance tasks are not equally high in all groups. They are most likely to be high for those persons who have had the greatest opportunity to develop their verbal skills maximally, those with the greatest environmental resources for the development of traditional verbal skills. The fact that other tasks may more accurately assess aptitude in minorities is often not considered.

So the first point we need to consider is that all psychological measurements are not the same. We measure characteristics of people by inference. The same test score does not mean the same thing for everyone; a 500 is not always a 500 is not always a 500, with apologies to the rose. Nontest information may give better clues to potential success for low-scoring students. However, we must recognize that nontest information can have the same weaknesses as tests in terms of validity and reliability, and again must be used carefully. This means the complexity of decisionmaking regarding the future of our young people must be appreciated and dealt with accordingly. There is no simple way to make admissions and placement decisions. That two groups differ substantially on the average on some test, such as the SAT, GRE, MCAT, et cetera, does not mean one group is brighter than the other. It may mean there are factors in the test, the tested, and the test taking that result in these differences; typical school and nonschool experiences of one group may better prepare them for the test material. It may even mean one group is better prepared, at one point in time, for certain educational experiences, but it does not mean that people in a lower scoring group are not capable of competent work in the proposed course of study.

Even though national legislation regarding tests has not been enacted, its proposal has created a climate within the testing industry that has resulted in creative approaches to openness in testing and increased tolerance for those within the industry who have previously advocated the use of tests for opening rather than closing options. Tests probably have the greatest utility in providing continual feedback regarding the effectiveness of educational programming for youngsters. In this use they should be as nonthreatening as possible: taken when the youngster is ready, often by computer, with corrections and reasons for wrong answers provided, and accompanied by instruction in the problem areas.
When used as rigid admissions criteria, tests only serve to reinforce the bars to entry into selective programs that have long existed in other ways, and they discourage creative ways of treating instructional material to make it more accessible to students from nontraditional backgrounds. I would like to thank you for your time this morning. I will be happy to comment further, and my written presentation will be submitted.

Chairman PERKINS. Your written presentation will be entered in the record without objection.

[The prepared statement of Sylvia Johnson follows:]

PREPARED STATEMENT OF SYLVIA T. JOHNSON, PH. D., DEPARTMENT OF PSYCHOEDUCATIONAL STUDIES, SCHOOL OF EDUCATION, HOWARD UNIVERSITY, WASHINGTON, D.C.

Good morning Mr. Chairman and committee members. I am glad to have the opportunity to speak with you today regarding the topic of the use of tests in selection for educational programs.

I will begin with a brief account of my background in this area. I have been a faculty member in the Department of Psychoeducational Studies, School of Education at Howard University since 1974. I teach graduate courses in applied statistics, research methodology, and measurement, and I supervise Ph. D. dissertations in educational psychology. I received my Ph. D. from The University of Iowa in Educational Statistics and Measurement. While at Iowa, I was a special research assistant, assigned to the professor who was also the director of Iowa Testing Programs. Several years earlier, I completed a year of advanced graduate study at Michigan State University in the area of Research Design and Development.

I received my master's degree from Southern Illinois University at Carbondale, and completed my undergraduate work in mathematics at Howard University. Three years ago, I spent a period of time as a senior visiting fellow at the Institute for the Study of Educational Policy, Howard University. While there, I authored a monograph entitled "The Measurement Mystique." Some of my testimony today is taken from "The Measurement Mystique." However, I do not speak for Howard University, or for the Institute.

When we consider the quality of measurement that we get from tests, it is important to remind ourselves of the nature of the process of measurement.

Since we cannot measure minds in the same way that we measure the length of a room, all educational and psychological measurement is done indirectly. By measuring performance or behavior on some task, we infer the level of achievement, intellectual functioning, or some other psychological construct.

Leaders in minority communities have seen numerous cases of discrepancy between actual ability to perform and measured aptitude, and many regard tests as useless. The reality is that tests are increasingly used in all forms of decision making, and that their uses and misuses need to be more fully understood by policy makers and community leaders.

How then are tests developed? There are several steps in the process of constructing a test. A test plan is developed which outlines specifications for the test questions, or for sets of questions. A question on a test is called a test item. The specifications may include the type of content the item should contain, the process to be involved in solving the item, the desired difficulty of the item, the format, and any other item characteristics considered important. After a set of items has been written, discussed, and edited, it should be tried out on subjects similar to those for whom the test is intended. This is not the norming, or standardization process, but is an important step, since it helps to determine which items will actually appear on the test. Here the hunches of writers and editors are literally "put to the test", by seeing how well available subjects actually perform.

Often the tryout items are administered in conjunction with an actual previously developed test which serves as a yardstick for judging the quality of the new items. If subjects who do well on the "old" test do well on the tryout item, and subjects who do poorly on the "old" test do poorly on a tryout item, we say that the item discriminates well, or has a high discrimination index. (This is discrimination in a descriptive sense, not in the sense of racial or sexual discrimination.) Such an item is said to "work," and is likely to become a part of the test being developed. On the other hand, if subjects who do poorly on the "old" test do well on the tryout item, we say that the item discriminates negatively, and the item is likely to be rejected, revised, or the "correct" answer changed. Often there is something wrong with such items in meaning, style, or some other factor.
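The tryout logic described above can be sketched numerically. A common way to quantify it is the point-biserial correlation between right/wrong responses on a tryout item and total scores on the "old" anchor test. Everything in this sketch is illustrative: the data, the `point_biserial` and `classify_item` names, and the 0.2 cutoff are assumptions for demonstration, not figures from the testimony.

```python
# Sketch of an item-discrimination check, per the tryout procedure
# described above. All data and the cutoff value are illustrative.
from math import sqrt

def point_biserial(item_correct, anchor_scores):
    """Correlation between 0/1 item responses and anchor-test totals."""
    n = len(item_correct)
    mean_x = sum(item_correct) / n
    mean_y = sum(anchor_scores) / n
    cov = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(item_correct, anchor_scores))
    var_x = sum((x - mean_x) ** 2 for x in item_correct)
    var_y = sum((y - mean_y) ** 2 for y in anchor_scores)
    return cov / sqrt(var_x * var_y)

def classify_item(r, cutoff=0.2):
    """Label an item by its discrimination index (cutoff is arbitrary)."""
    if r >= cutoff:
        return "keep"    # discriminates positively: likely retained
    if r <= -cutoff:
        return "reject"  # discriminates negatively: revise or drop
    return "flat"        # adds little information relative to the anchor

anchor = [10, 12, 15, 18, 20, 22, 25, 28]   # hypothetical "old" test totals
good_item = [0, 0, 0, 1, 0, 1, 1, 1]        # tends to track the anchor
flat_item = [1, 0, 0, 1, 1, 0, 0, 1]        # unrelated to the anchor

print(classify_item(point_biserial(good_item, anchor)))  # keep
print(classify_item(point_biserial(flat_item, anchor)))  # flat
```

Note that with a small or unrepresentative tryout group, the same arithmetic can mark as "flat" an item that discriminates well within a subgroup, which is the bias mechanism discussed below.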

Another category of tryout items may occur. These are items that do not discriminate in one direction or the other. That is, subjects who do well on the "old" test perform at a certain level on the tryout item, and subjects who do poorly on the "old" test perform at about the same level. Such items are omitted from the test by many test developers, since they do not appear to add to the information to be gained from the test. These may be called "flat" items, due to the shape of the item characteristic curve, a graph of item performance relative to total score performance.

This group of "flat" items can be quite an amalgam. It may include some poor or misunderstood items, but it may also include items with good discrimination power for subgroups for whom their content is appropriate, and low discrimination power for others. Items with greater validity for minority group members can easily be eliminated in this tryout process, particularly if the proportion of minorities in the tryout group is small. A test that has been well-constructed in some ways, such as careful attention to item format and content, may be seriously biased if the items are chosen after tryouts with groups containing few minorities. Some developers do not depend as heavily on tryout data in item selection, but place more emphasis on careful, logical selection by experienced editors and writers. Yet these item writers may be very similar in background and life experiences, as well as in training and academic experiences, and thus become an additional source of bias themselves.

Many factors operate to attenuate or lower test scores, and these factors tend to have their greatest effects on Blacks and other minority applicants. These include factors which affect the actual performance of individuals on the test, such as socioeconomic status, differences in educational opportunity, motivation, narrowness of content of the tests, atmosphere of the testing situation, and the perceived relevance of the test to success. They also include factors that affect the test score more indirectly, such as the composition of the group used for item tryouts and the item selection and analysis which preceded the actual standardization, the composition of the standardization or normative group, and the techniques and procedures employed in item construction.