Open Access
USING THE FREE‐RESPONSE SCORING TOOL TO AUTOMATICALLY SCORE THE FORMULATING‐HYPOTHESES ITEM
Author(s) - Kaplan, Randy M.; Bennett, Randy Elliot
Publication year - 1994
Publication title - ETS Research Report Series
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.235
H-Index - 5
ISSN - 2330-8516
DOI - 10.1002/j.2333-8504.1994.tb01581.x
Subject(s) - matching (statistics), sample (material), psychology, item response theory, scale (ratio), test (biology), computer science, point (geometry), item bank, statistics, artificial intelligence, cognitive psychology, natural language processing, psychometrics, mathematics, developmental psychology, paleontology, chemistry, physics, geometry, chromatography, quantum mechanics, biology
Large‐scale institutional testing, and testing in general, are in a period of rapid change. Among the more obvious dimensions is the growing use of constructed‐response items and of computer‐based testing. This study explores the potential for using a computer‐based scoring procedure for the formulating‐hypotheses (F‐H) item. This item type presents a situation and asks the examinee to generate explanations for it. Each explanation is judged right or wrong, and the number of creditable explanations is summed to produce an item score. Scores were generated for 30 examinees' responses to each of eight items by a semantic pattern‐matching program and, independently, by five human raters. On its initial scoring run, the program agreed highly with the raters' mean item scores for some questions, and its concurrence improved substantially as modifications were made to the automatic scoring process. By the final run, correlations between the program and the raters on item scores ranged from .89 to .97, and mean human‐machine discrepancies ran from .6 to 1.1 on a 16‐point scale. At the individual‐hypothesis level, proportion agreement ranged from .80 to .94, which, given the large proportion of correct responses in the sample, was little better than chance. Also detected was a tendency for the program to erroneously classify wrong responses as correct. We conclude that F‐H items might be more effectively scored by a semiautomatic system that combines machine processing with a small number of human judges, and we present a preliminary configuration for such a process.
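The report does not reproduce the Free-Response Scoring Tool's matcher, so the sketch below is only a rough illustration of what pattern-based crediting of free responses involves: each creditable explanation is represented by keyword patterns, each explanation is credited at most once however many times it is paraphrased, and the item score is the count of distinct explanations credited. The item content, pattern set, and function names are all hypothetical.

```python
import re

# A hypothetical F-H item: "Enrollment at a college dropped. Why?"
# Each key names one creditable explanation; its pattern is a crude
# stand-in for the tool's much richer semantic lexicon.
CREDITABLE_PATTERNS = {
    "tuition_rose":    re.compile(r"\b(tuition|cost|fees?)\b.*\b(rose|increased|went up)\b"),
    "fewer_graduates": re.compile(r"\b(fewer|declin\w+)\b.*\bgraduates?\b"),
    "reputation_fell": re.compile(r"\b(reputation|ranking)\b.*\b(fell|declined|slipped)\b"),
}

def score_item(hypotheses):
    """Credit each distinct creditable explanation once; sum for the item score."""
    credited = set()
    for text in hypotheses:
        t = text.lower()
        for key, pattern in CREDITABLE_PATTERNS.items():
            if key not in credited and pattern.search(t):
                credited.add(key)  # a duplicate paraphrase earns no extra credit
                break
    return len(credited), credited

score, keys = score_item([
    "Tuition went up sharply last year.",
    "Maybe the cost of attending increased.",        # same explanation, no extra credit
    "There were fewer high school graduates around.",
])
print(score, sorted(keys))  # -> 2 ['fewer_graduates', 'tuition_rose']
```

Real semantic matching goes well beyond surface regular expressions; this toy shows only the scoring arithmetic of crediting and summing distinct explanations.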
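The caveat that .80 to .94 proportion agreement was "little better than chance" follows from the base rate: when most hypotheses in the sample are correct, two scorers agree frequently on their marginal rates alone. A chance-corrected statistic such as Cohen's kappa makes the point concrete; the counts below are invented, chosen only so the raw agreement falls inside the reported range.

```python
# Invented counts: 100 scored hypotheses, humans credit 85 of them.
# The machine agrees on all 85, agrees on 3 of the 15 wrong ones, and
# mislabels the remaining 12 wrong responses as correct (mirroring the
# over-crediting tendency the study reports).

def cohens_kappa(a, b):
    """Cohen's kappa: agreement corrected for chance from the marginals."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n   # observed agreement
    p_a, p_b = sum(a) / n, sum(b) / n             # "correct" rate per scorer
    p_e = p_a * p_b + (1 - p_a) * (1 - p_b)       # agreement expected by chance
    return (p_o - p_e) / (1 - p_e)

human   = [1] * 85 + [0] * 15
machine = [1] * 85 + [0] * 3 + [1] * 12

p_o = sum(h == m for h, m in zip(human, machine)) / len(human)
print(f"raw proportion agreement = {p_o:.2f}")    # 0.88, inside .80-.94
print(f"Cohen's kappa            = {cohens_kappa(human, machine):.2f}")  # ~0.30
```

Here a raw agreement of .88 collapses to a kappa near .30 once the 85% base rate of correct responses is accounted for, which is the sense in which high surface agreement can still be little better than chance.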
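Finally, the semiautomatic configuration the authors propose can be pictured as confidence-based triage: the machine keeps the responses it can classify confidently and routes the remainder to a small number of human judges. The report gives only a preliminary design, so the routing rule, threshold, and toy matcher below are assumptions, not the paper's configuration.

```python
# A speculative triage sketch; every name and value here is invented.

def route_responses(hypotheses, matcher, threshold=0.8):
    """Keep confident machine classifications; queue the rest for humans."""
    auto_scored, needs_human = [], []
    for text in hypotheses:
        label, confidence = matcher(text)   # hypothetical matcher interface
        if confidence >= threshold:
            auto_scored.append((text, label))
        else:
            needs_human.append(text)        # a human judge scores these
    return auto_scored, needs_human

def toy_matcher(text):
    # Stand-in scorer: full keyword hit -> confident credit, partial hit ->
    # uncertain, no hit -> confident non-credit (all values invented).
    t = text.lower()
    if "tuition" in t and "increase" in t:
        return 1, 0.95
    if "tuition" in t or "increase" in t:
        return 1, 0.55
    return 0, 0.90

auto, queue = route_responses(
    ["Tuition saw a big increase.", "Maybe an increase somewhere?", "It rains a lot there."],
    toy_matcher,
)
print(auto)   # confidently machine-scored responses
print(queue)  # ['Maybe an increase somewhere?'] goes to a human judge
```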
