What do Those Numbers Mean?

Summary of Test Statistics

Number of Examinees:

The number of people who responded to at least one item on the test.

Number of Items:

The number of items on a test is determined by the amount of material to be covered, the time limitations for testing, and the desired reliability of the test.

Mean Score:

This is simply the arithmetic average of the earned test scores.

Standard Deviation:

The standard deviation describes the amount of spread among the scores. The larger the standard deviation, the more spread out the scores, and the easier it is to discriminate among students at different score levels. Generally speaking, about two-thirds of the scores will fall within one standard deviation above and below the mean score, and about 95 percent within two standard deviations.
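As a small illustration of the mean and standard deviation described above, the sketch below computes both for a list of scores and reports the share of scores falling within one and two standard deviations of the mean. The function name and the sample scores are made up for illustration, and the use of the population standard deviation is an assumption; the report itself may divide by N - 1 instead.

```python
import statistics

def summarize_scores(scores):
    """Return (mean, sd, share within 1 SD, share within 2 SD) for a list of scores."""
    mean = statistics.fmean(scores)
    # Assumption: population standard deviation (divide by N).
    sd = statistics.pstdev(scores)
    within_one = sum(1 for s in scores if abs(s - mean) <= sd) / len(scores)
    within_two = sum(1 for s in scores if abs(s - mean) <= 2 * sd) / len(scores)
    return mean, sd, within_one, within_two

# Hypothetical scores for a 30-item test.
scores = [12, 15, 15, 17, 18, 18, 19, 20, 21, 23, 24, 26]
mean, sd, within_one, within_two = summarize_scores(scores)
print(f"mean={mean:.2f} sd={sd:.2f} within 1 SD={within_one:.0%} within 2 SD={within_two:.0%}")
```

The two-thirds and 95 percent figures are rules of thumb for roughly bell-shaped score distributions, so small or skewed groups may differ noticeably.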

Reliability:

The reliability coefficient produced is an alpha coefficient. This number gives an estimate of the internal consistency reliability of the test. Internal consistency reliability is a measure of the extent to which the ordering of students’ scores on this test would correspond to the ordering obtained if an equivalent form of the test were given to these same students. Reliability computed via coefficient alpha usually takes values from 0.00 to 1.00, with 1.00 indicating identical ordering between the test and the hypothetical equivalent form.

Coefficient alpha may also take values less than zero. This happens when the internal consistency among the set of questions on the test is very low. If this occurs on a set of your test questions, first check the processing characteristics listed on the first page of the report you were given. Check the answer key carefully, since mis-keying of several answers can often throw off this statistic. If nothing seems to be amiss, it may be helpful to contact a consultant at T & E about the matter.
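For readers who want to see where the number comes from, here is a minimal sketch of the usual textbook formula for coefficient alpha: k/(k - 1) times one minus the ratio of the summed item variances to the variance of the total scores. The function name and the data layout (one row of item scores per student) are assumptions made for illustration, not a description of the scoring program itself.

```python
import statistics

def coefficient_alpha(item_scores):
    """Coefficient alpha from a table of item scores.

    item_scores: one row per student, each row a list of per-item scores
    (for example 1 = correct, 0 = incorrect).
    """
    k = len(item_scores[0])  # number of items
    item_variances = [
        statistics.pvariance([row[j] for row in item_scores]) for j in range(k)
    ]
    total_variance = statistics.pvariance([sum(row) for row in item_scores])
    # Undefined (division by zero) if every student earns the same total score.
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

# Items that rank students the same way give a high alpha.
consistent = [[1, 1], [0, 0], [1, 1], [0, 0]]
# Items that tend to disagree with each other can push alpha below zero.
inconsistent = [[1, 0], [0, 1], [1, 0], [0, 1], [1, 1], [0, 0]]
print(coefficient_alpha(consistent), coefficient_alpha(inconsistent))
```

Because the formula uses a ratio of variances, it gives the same result whether the variances are computed with N or N - 1, as long as the same choice is used throughout.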

As a general rule, a reliability of 0.80 or higher is desirable for instructor-made tests. The higher the reliability estimated for the test, the more confident one may feel that the discriminations between students scoring at different score levels on the test are, in fact, stable differences. If the test has a lower reliability, one should use caution in trying to make discriminations between students such as might be done when assigning grades. This is especially true when scores are very close.

One way to increase reliability for a test is to increase the number of test items. This gives an instructor a more complete picture of how much the students have learned. However, 30 to 50 questions during a single class period used for testing is usually the maximum number that can be answered by about 90 percent of the students in the class. If the test is given to a very small group (e.g., fewer than 30 students), it may be useful to combine the results with those for other students taking the test to get a larger group and thus a more stable estimate of the test reliability.
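The text above does not say how much reliability improves as items are added. One commonly used approximation, not named in this document, is the Spearman-Brown formula, which assumes the added items are of about the same quality as the existing ones. The sketch below uses a hypothetical function name and example values.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability if the test length is multiplied by length_factor.

    Assumption: the new items behave like the existing ones
    (similar difficulty and discrimination).
    """
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

# Hypothetical example: a 25-item test with reliability 0.60 doubled to 50 items.
print(round(spearman_brown(0.60, 2), 2))
```

A prediction like this is only a planning aid; the reliability actually obtained depends on the items that are written.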

Standard Error of Measurement (SEM):

The SEM is an estimate of the amount of error present in a student’s score. About two-thirds of the time, a student’s true score should be expected to fall within 1 SEM unit above or below his/her observed test score. The smaller the SEM, the narrower this interval around the test score.

There is an inverse relationship between the SEM and reliability. Tests with higher reliability have smaller SEMs relative to the standard deviation of the test score. The smaller the SEM for a test (and, therefore, the higher the reliability), the greater one can depend on the ordering of scores to represent stable differences between students. The converse also holds. That is, the larger the SEM, the more error is likely present in the test scores for each student. Therefore, the less confident one should be about the stability of the ordering of students on the basis of their test scores.

If the coefficient alpha reliability estimate is negative, then the SEM will be larger than the standard deviation of the test scores. What this usually means is that the spread of scores among the students is due more to error in measuring their knowledge about the course content than to what the students actually know or have learned. When this occurs, the same actions noted above for a negative reliability estimate also apply here. It is generally most useful to call a consultant at T & E.
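The relationship among the SEM, the standard deviation, and reliability can be sketched with the standard formula SEM = SD x sqrt(1 - reliability). That the report uses exactly this formula is an assumption, although it is consistent with the statements above: higher reliability gives a smaller SEM, and a negative reliability gives an SEM larger than the standard deviation. The function names and numbers are illustrative only.

```python
import math

def standard_error_of_measurement(sd, reliability):
    """SEM from the score standard deviation and a reliability estimate."""
    return sd * math.sqrt(1 - reliability)

def score_band(score, sem):
    """Interval of 1 SEM below and above an observed score."""
    return (score - sem, score + sem)

# Hypothetical test with a standard deviation of 10 points.
for alpha in (0.90, 0.80, 0.50, -0.21):
    sem = standard_error_of_measurement(10, alpha)
    print(f"alpha={alpha:+.2f}  SEM={sem:.2f}  band for a score of 75: {score_band(75, sem)}")
```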

Mean Item Difficulty:

Item difficulty is defined as the percent of students correctly answering the item. The mean difficulty of the items on the test is the average percent correct across all questions contributing to the test or subtest score. The mean difficulty statistic can be useful in estimating how hard the test was relative to the ability level of the group. When coupled with the information from the Item Analysis, this statistic can give some indication of the extent to which the test difficulty might have influenced some of the other statistical indices on the test.
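A minimal sketch of these two definitions follows. It assumes each student's answers are stored as a string of chosen letters and that the key is a string of correct letters; both the layout and the function names are hypothetical. Difficulties are returned as proportions (0 to 1) rather than percents.

```python
def item_difficulties(responses, key):
    """Proportion of students answering each item correctly."""
    n_students = len(responses)
    return [
        sum(1 for answers in responses if answers[j] == key[j]) / n_students
        for j in range(len(key))
    ]

def mean_difficulty(difficulties):
    """Average proportion correct across all items."""
    return sum(difficulties) / len(difficulties)

# Hypothetical three-item test taken by four students.
key = "ABC"
responses = ["ABC", "ABD", "CBD", "ABC"]
difficulties = item_difficulties(responses, key)
print(difficulties, mean_difficulty(difficulties))
```

Note that under this definition a higher difficulty value means an easier item.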

Experience has shown that test reliability is higher when items of medium difficulty are predominant. In addition, the test tends to have a greater spread of scores across the range (i.e., a higher standard deviation of scores relative to the range of scores on the test). Medium item difficulty is defined as being halfway between the chance probability of successfully getting an item correct and 1.00. Experience has also shown that item difficulties slightly higher than medium difficulty (that is, items answered correctly by a slightly larger proportion of students) tend to maximize both test reliability and discrimination.
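The definition of medium difficulty can be worked out directly. The sketch below assumes that the chance probability for a multiple-choice item is one divided by the number of answer choices, which holds only if a guessing student picks among the choices at random; the function name is hypothetical.

```python
def medium_difficulty(n_choices):
    """Halfway point between the chance probability of a correct answer and 1.00."""
    chance = 1 / n_choices  # assumption: blind guessing among equally likely choices
    return (chance + 1.0) / 2

for n_choices in (2, 4, 5):
    print(f"{n_choices} choices: medium difficulty = {medium_difficulty(n_choices):.3f}")
```

For a four-choice item, for example, chance is 0.25, so medium difficulty is 0.625.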

Maximum Attainable Score:

This is the highest score possible on the test.

Maximum Attained Score:

This is the highest score earned by a student in the group.

Minimum Attainable Score:

This is the lowest possible score on the test. It will usually be zero, but if a correction for guessing has been applied to the test, this score could be negative.

Minimum Attained Score:

This is the lowest score earned on the test. The range is the difference between the maximum attained score and the minimum attained score.

Item Analysis

The use of an item analysis is recommended for most testing purposes. This report provides an instructor with an in-depth look at how well each of the choices on an item functioned and how well each item contributed to the overall estimation of the students’ course-related learning. An item analysis is typically done using the total score as the criterion for effectiveness of the individual items.

The item analysis report consists of four sections: the quintile score group table; the individual item correct response curves and item statistics; the item difficulty summary table; and the item discrimination summary table.

The quintile score groups are a means of dividing the set of students’ scores into a manageable number of groups for use in examining the performance of each item on the test. The intent of the item analysis is to provide a picture of how well students at each score level performed on a test item. Normally there are too many different scores in the range to provide a succinct view of the performance of each of the choices to the item. In order to shorten the report, quintiles are used in place of the full range of scores. The quintiles are formed by dividing the range of scores into five groups as equal in size as possible.
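One simple way to form five groups of nearly equal size is sketched below. It ranks students from highest to lowest score and assigns quintile 1 to the top group and quintile 5 to the bottom group. How the actual report breaks ties between students with identical scores is not stated in this document, so the tie handling here (ties may be split across adjacent groups) is an assumption, as is the function name.

```python
def quintile_groups(scores):
    """Return a quintile number (1 = highest scores, 5 = lowest) for each student."""
    n = len(scores)
    # Student indices ordered from highest score to lowest.
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    quintile = [0] * n
    for rank, student in enumerate(order):
        # Integer arithmetic splits the ranks into five groups whose sizes
        # differ by at most one student.
        quintile[student] = rank * 5 // n + 1
    return quintile

# Hypothetical total scores for ten students.
print(quintile_groups([50, 40, 30, 20, 10, 45, 35, 25, 15, 5]))
```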

The correct response curve is a graphical representation of the percentage of examinees in each quintile who answered correctly. An increase in the percentage of students answering correctly from the 5th to the 1st quintile indicates that the item discriminates between students and that students who score high on the criterion (usually the total score) also tend to answer this question correctly. If this is not the case, the item is probably not a good one for discriminating among students.
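The numbers behind a correct response curve can be sketched as follows. The function takes each student's quintile (1 = highest, 5 = lowest) and whether that student answered the item correctly, and returns the percent correct per quintile. The function name, the input layout, and the sample data are assumptions for illustration.

```python
def correct_response_curve(quintiles, correct):
    """Percent answering correctly in each quintile (None for an empty quintile)."""
    curve = {}
    for q in range(1, 6):
        members = [c for group, c in zip(quintiles, correct) if group == q]
        curve[q] = 100.0 * sum(members) / len(members) if members else None
    return curve

# Hypothetical item answered by ten students, two per quintile.
quintiles = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
correct = [True, True, True, True, True, False, True, False, False, False]
print(correct_response_curve(quintiles, correct))
```

In this made-up example the percent correct rises from the 5th quintile to the 1st, which is the pattern expected of an item that discriminates well.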

The individual item statistics and the matrix of responses are printed to the right of the correct response curve. The frequency of responses for each quintile score group is given in this table. For tests designed to rank order (or discriminate between) students, two things should be kept in mind when evaluating the incorrect choices to an item: (1) each choice should attract at least some of the students and (2) each should attract more students from the lower quintiles than from the higher ones.

The individual item statistics are given at the bottom of the matrix of response frequencies. The proportion of students who choose each alternative is printed below the column for each of the alternatives. The item difficulty is defined as the proportion selecting the correct alternative. If the purpose of the test is to discriminate among students, then items of middle or medium difficulty should be used. Items of medium difficulty or slightly higher will tend to do a better job of discriminating between students and will also tend to work together to provide a higher test reliability.

The last item statistic is the point biserial correlation (RPBI) given for each item choice. This statistic ranges from -1.00 to +1.00 and is an index of the relationship between total scores and whether or not a response was made to that choice. A positive RPBI for an answer option indicates a tendency for persons who select that choice to have high scores and for people who do not choose it to have low scores. The item discrimination index is defined as the RPBI for the correct choice to the question. An item which discriminates well has an RPBI for the correct response which is at least .30. RPBI coefficients for the incorrect choices should be negative. One should not be too concerned about RPBIs computed from groups of fewer than about 200 students, because the RPBI is not very stable for groups smaller than this. It may be useful to combine groups of students’ answers to the same test. In this way, a better picture emerges of the functioning of the item.
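The point biserial correlation is the ordinary Pearson correlation between total scores and a 0/1 indicator of whether a student selected a given choice, and it can be sketched that way. The function name is hypothetical, and whether the report's total score includes the item being analyzed is not stated in this document, so the sketch simply uses whatever totals it is given.

```python
import math

def point_biserial(chose_option, total_scores):
    """Correlation between selecting a choice (True/False) and total score.

    Returns None when the statistic is undefined, e.g. when every student
    (or no student) selected the choice, or when all totals are equal.
    """
    n = len(total_scores)
    x = [1.0 if chosen else 0.0 for chosen in chose_option]
    mean_x = sum(x) / n
    mean_y = sum(total_scores) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, total_scores))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in total_scores)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: the two highest scorers picked this choice.
print(point_biserial([True, True, False, False], [4, 3, 2, 1]))
```

Applied to the correct choice, this value is the item discrimination index; applied to an incorrect choice, a well-functioning item would be expected to give a negative value.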

The item difficulty and item discrimination summaries are given at the end of the item analysis report. These tables repeat information from the second section of the item analysis in a form similar to the frequency distribution of the test scores. Items which may be too easy to provide much information between students (PROP is greater than .90) are easily identified in the summary table. It is also easy to compare the item difficulties against the medium difficulty criterion to see which items may be detracting significantly from the reliability and discrimination of the test. In addition, the item discrimination summary can be used to quickly isolate those items which have poor discrimination indices and which should possibly be removed or rewritten.
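The screening described above can be sketched as a simple filter using the two thresholds mentioned in this document: a proportion correct (PROP) above .90 and an RPBI for the correct choice below .30. The function name, the data layout, and the item values are hypothetical.

```python
def flag_items(items, easy_cutoff=0.90, min_rpbi=0.30):
    """Return items worth a second look, with the reasons for each.

    items: mapping of item name -> (PROP, RPBI for the correct choice).
    """
    flags = {}
    for name, (prop, rpbi) in items.items():
        notes = []
        if prop > easy_cutoff:
            notes.append("too easy")
        if rpbi < min_rpbi:
            notes.append("low discrimination")
        if notes:
            flags[name] = notes
    return flags

# Hypothetical summary values for three items.
items = {"Q1": (0.95, 0.10), "Q2": (0.62, 0.45), "Q3": (0.55, 0.12)}
print(flag_items(items))
```

A flag is only a prompt to review the item; with small groups the RPBI is unstable, so a flagged item is not necessarily a poor one.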


Testing and Evaluation Services
