Open Access
AN EVALUATION OF THREE APPROXIMATE ITEM RESPONSE THEORY MODELS FOR EQUATING TEST SCORES
Author(s) -
Marco Gary L.,
Wingersky Marilyn S.,
Douglass James B.
Publication year - 1985
Publication title -
ETS Research Report Series
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.235
H-Index - 5
ISSN - 2330-8516
DOI - 10.1002/j.2330-8516.1985.tb00131.x
Subject(s) - equating, item response theory, mathematics, statistics, rasch model, quantile, scale parameter, estimation theory, scaling, scale (ratio), psychometrics, physics, geometry, quantum mechanics
The primary purpose of this study was to determine the extent to which three item response theory (IRT) models could be used to approximate the three‐parameter logistic model in estimating item parameters and in equating test scores. These approximate models were less expensive to apply and in some cases used less data than the full‐blown three‐parameter model. The approximations to the three‐parameter model used in this study were (1) the Rasch one‐parameter model, as operationalized in the BICAL computer program, (2) an approximate three‐parameter logistic model based on grouped data divided into fifths and twentieths, and (3) a modified three‐parameter logistic model with fixed a's and c's. The LOGIST computer program was used to estimate parameters for the modified three‐parameter model; Quantile, a modified version of LOGIST that accepted coarsely grouped data, was used to estimate item parameters for the approximate three‐parameter model. In the case of the approximate models involving BICAL and LOGIST, results of separate item calibrations were used to place item parameter estimates on the same scale. In the case of the approximate model involving Quantile, a method of scaling the item parameter estimates indirectly through existing SAT scaled scores was used. The data for the study came from a recent study (Petersen, Cook, & Stocking, 1983) of scale stability for the Scholastic Aptitude Test. As in the previous study, this study involved the chain equating of a test to itself through five intermediary forms. The sample consisted of approximately 2,670 cases for each of the SAT forms used.
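For readers unfamiliar with the models named above, the following is a minimal sketch of the standard three‐parameter logistic item response function and of how the Rasch model and a fixed a and c variant can be read as restricted versions of it. It is not taken from the report or from BICAL, LOGIST, or Quantile: the function names, the scaling constant D = 1.7, and the default fixed values of a and c are illustrative assumptions.

```python
import math

# Assumption: the conventional scaling constant D = 1.7 for the logistic model.
D = 1.7


def p_3pl(theta, a, b, c, scale=D):
    """Probability of a correct response under the three-parameter logistic model.

    theta: examinee ability; a: discrimination; b: difficulty;
    c: lower asymptote (pseudo-guessing).
    """
    return c + (1.0 - c) / (1.0 + math.exp(-scale * a * (theta - b)))


def p_rasch(theta, b):
    """Rasch one-parameter model as a special case: common a, c = 0.

    Assumption: a = 1 and no D scaling, one common Rasch parameterization.
    """
    return p_3pl(theta, a=1.0, b=b, c=0.0, scale=1.0)


def p_fixed_ac(theta, b, a_fixed=1.0, c_fixed=0.2):
    """Modified 3PL with a and c held fixed, so only b is estimated per item.

    The default values of a_fixed and c_fixed are hypothetical placeholders.
    """
    return p_3pl(theta, a=a_fixed, b=b, c=c_fixed)
```

When ability equals difficulty, the Rasch probability is 0.5, while the 3PL probability is c + (1 - c)/2, which shows how the lower asymptote shifts the curve.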
The results of the study were as follows: (1) the item calibrations based on twentieths were closer to the true values and to LOGIST estimates than item calibrations based on fifths; (2) the equating results based on twentieths, however, were not more accurate generally than those based on fifths; (3) the three‐parameter model using coarse groupings yielded highly accurate score conversions in equating a test to itself, more accurate in fact than the full‐blown three‐parameter models studied by Petersen, Cook, and Stocking; and (4) all of the approximate models yielded very accurate equating results. A follow‐up analysis indicated that these unexpected equating results were due in large part to the indirect method used to place item parameter estimates on scale through existing score conversions derived from conventional equating methods. The success of the approximate models raises a question about the adequacy of equating a test to itself as a criterion for evaluating equating results. Further research is recommended before any of the approximate models are used operationally.
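The criterion of equating a test to itself can be illustrated with a small sketch: each link in the chain is a score conversion, the chain is their composition, and a perfect result would return every score unchanged. The code below assumes linear conversions purely for simplicity (IRT equating conversions are generally curvilinear), and the helper names and any slope or intercept values supplied to them are hypothetical, not values from the study.

```python
def linear(slope, intercept):
    """Return a hypothetical linear score conversion x -> slope * x + intercept."""
    return lambda x: slope * x + intercept


def chain(conversions):
    """Compose conversions in order: the output of one link feeds the next."""
    def composed(x):
        for convert in conversions:
            x = convert(x)
        return x
    return composed


def max_abs_discrepancy(conversion, scores):
    """Largest departure of the chained conversion from the identity x -> x."""
    return max(abs(conversion(x) - x) for x in scores)
```

A chain through five intermediary forms has six links (test to form 1, form 1 to form 2, and so on back to the test), so `chain` would be called with six conversions and `max_abs_discrepancy` evaluated over the raw score range. A small discrepancy shows only that the errors in the links cancel, which is one reason the adequacy of this criterion can be questioned.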
