
THE FEASIBILITY OF USING ITEM RESPONSE THEORY AS A PSYCHOMETRIC MODEL FOR THE GRE APTITUDE TEST
Author(s) -
Kingston, Neal M.
Dorans, Neil J.
Publication year - 1982
Publication title -
ETS Research Report Series
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.235
H-Index - 5
ISSN - 2330-8516
DOI - 10.1002/j.2333-8504.1982.tb01298.x
Subject(s) - item response theory, aptitude, psychology, local independence, classical test theory, psychometrics, independence (probability theory), factorial, test theory, statistics, factor analysis, item analysis, social psychology, econometrics, cognitive psychology, mathematics, developmental psychology, mathematical analysis
The feasibility of using item response theory as a psychometric model for the GRE Aptitude Test was addressed by assessing the reasonableness of the assumptions of item response theory for GRE item types and examinee populations. Items from four forms and four administrations of the GRE Aptitude Test were calibrated using the three-parameter logistic item response model (one form was given at two administrations and one administration used two forms; the exact relationships between forms and administrations are given in the Test Forms and Populations section of this report).

The unidimensionality assumption of item response theory was addressed in a variety of ways. Previous factor analytic research on the GRE Aptitude Test was reviewed to assess the dimensionality of the test and to extract information pertinent to the construction of sets of homogeneous items. Because this review identified two strong dimensions on the verbal scale, separate calibrations of discrete verbal items and of reading comprehension items were run in addition to calibrations of all verbal items.

Local independence of item responses is a consequence of the unidimensionality assumption. To test the weak form of the local independence condition, partial correlations among items, with ability partialled out, were computed both with and without a correction for guessing and were then factor analyzed. Violations of local independence were observed for both verbal and quantitative item types; these violations were largely consistent with expectations based on the factor analytic review.

Fit of the three-parameter logistic model to GRE Aptitude Test data was assessed by comparing estimated item-ability regressions, i.e., item response functions, with empirical item-ability regressions (see the first sketch below). The three-parameter model fit all verbal item types reasonably well. The fit to data interpretation items, regular math items, analytical reasoning items, and logical diagrams items also seemed acceptable. The model fit quantitative comparison items least well, and the analysis of explanations item type was also not fit well by the three-parameter logistic model.

The stability of item parameter estimates across different samples was assessed. Item difficulty estimates exhibited the greatest stability, followed by item discrimination estimates; the hard-to-estimate lower asymptote, or pseudoguessing parameter, exhibited the least temporal stability.

The sensitivity of item parameter estimates to the lack of unidimensionality that produced the local independence violations was also examined. The discrete verbal and all-verbal calibrations of discrete verbal items produced more similar estimates of item discrimination than did the reading comprehension and all-verbal calibrations of reading comprehension items, reflecting the larger correlations that overall verbal ability estimates had with discrete verbal ability estimates. Compared with item discrimination estimates, item difficulty estimates were much less sensitive to the homogeneity of the item set, and estimates of the lower asymptote were, for the most part, fairly robust to the homogeneity of the calibration set. Finally, the comparability of ability estimates based on homogeneous item sets (reading comprehension items or discrete verbal items) with estimates based on all verbal items was examined.
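The fit check described above compares an item's estimated item response function with an empirical item-ability regression. The following is a minimal Python sketch of that comparison, assuming the standard three-parameter logistic form with scaling constant D = 1.7; the item parameters, ability distribution, and binning scheme are illustrative assumptions, not values or procedures taken from the report.

```python
import numpy as np

def irf_3pl(theta, a, b, c, D=1.7):
    """Three-parameter logistic item response function:
    P(theta) = c + (1 - c) / (1 + exp(-D * a * (theta - b)))."""
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

def empirical_item_ability_regression(theta, responses, n_bins=10):
    """Observed proportion correct within ability intervals, i.e., the
    empirical item-ability regression the model curve is compared against."""
    edges = np.quantile(theta, np.linspace(0.0, 1.0, n_bins + 1))
    midpoints, proportions = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (theta >= lo) & (theta <= hi)
        if in_bin.any():
            midpoints.append(theta[in_bin].mean())
            proportions.append(responses[in_bin].mean())
    return np.array(midpoints), np.array(proportions)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a, b, c = 1.0, 0.2, 0.2            # hypothetical item parameters
    theta = rng.normal(size=5000)      # simulated examinee abilities
    responses = (rng.random(5000) < irf_3pl(theta, a, b, c)).astype(int)
    mids, observed = empirical_item_ability_regression(theta, responses)
    for m, p in zip(mids, observed):
        print(f"theta = {m:+.2f}: observed {p:.3f}, model {irf_3pl(m, a, b, c):.3f}")
```

Binning examinees on ability and comparing observed proportions correct with the fitted curve is one simple way to inspect model-data fit; the report's own fit procedures may differ in detail.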
Correlations among overall verbal ability estimates, discrete verbal ability estimates, and reading comprehension ability estimates provided evidence for the existence of two distinct, highly correlated verbal abilities that can be combined to produce a composite ability resembling the overall verbal ability defined by the calibration of all verbal items together.

Three equating methods were compared in this research: equipercentile equating, linear equating, and item response theory true score equating (see the second sketch below). Various data collection designs (for both IRT and non-IRT methods) and several item parameter linking procedures (for the IRT equatings) were employed. The equipercentile and linear equatings of the verbal scales were more similar to each other than they were to the IRT equatings. The degree of similarity among the scaled score distributions produced by the various equating methods, data collection designs, and linking procedures was greater for the verbal equatings than for either the quantitative or analytical equatings. In almost every comparison, the IRT methods produced quantitative scaled score means that were higher, and standard deviations that were lower, than those produced by the linear and equipercentile methods. The most notable finding in the analytical equatings was the sensitivity of the precalibration design (used in this study only for the IRT equating method) to practice effects on analytical items, particularly for the analysis of explanations item type. Because the precalibration design is the most administratively appealing data collection method for equating the GRE Aptitude Test in a test disclosure environment, this sensitivity might present a problem for any equating method.

In sum, the item response theory model and IRT true score equating, using the precalibration data collection design, appear most applicable to the verbal section; less applicable to the quantitative section because of possible dimensionality problems with data interpretation items and instances of nonmonotonicity for the quantitative comparison items; and least applicable to the analytical section because of severe practice effects associated with the analysis of explanations item type. Expected revisions of the analytical section, particularly the removal of the troublesome analysis of explanations item type, should enhance the fit and applicability of the three-parameter model to the analytical section. Planned revisions of the verbal section should not substantially affect the satisfactory fit of the model to verbal item types. The heterogeneous quantitative section might still present problems for item response theory. It must be remembered, however, that the same (and other) factors that affect IRT-based equatings may also affect other equating methods.
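For readers unfamiliar with IRT true score equating, the core computation can be sketched briefly: each form's test characteristic curve (the sum of its item response functions) maps ability to an expected number-correct true score, and a new-form true score is equated by inverting the new form's curve and evaluating the old form's curve at the resulting ability. Everything in the sketch below, including the 3PL form with D = 1.7 and the simulated item parameters, is an illustrative assumption rather than the report's data or exact procedure.

```python
import numpy as np

def irf_3pl(theta, a, b, c, D=1.7):
    """3PL probability of a correct response at ability theta."""
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

def test_characteristic_curve(theta_grid, item_params):
    """Expected number-correct (true) score at each ability on the grid."""
    return sum(irf_3pl(theta_grid, a, b, c) for a, b, c in item_params)

def irt_true_score_equate(new_params, old_params, true_scores_new):
    """For each new-form true score, find the ability that produces it,
    then read off the old form's true score at that ability."""
    grid = np.linspace(-4.0, 4.0, 801)
    tcc_new = test_characteristic_curve(grid, new_params)
    tcc_old = test_characteristic_curve(grid, old_params)
    theta_at_score = np.interp(true_scores_new, tcc_new, grid)  # invert new-form TCC
    return np.interp(theta_at_score, grid, tcc_old)             # evaluate old-form TCC

if __name__ == "__main__":
    rng = np.random.default_rng(1)

    def simulated_form(n_items):
        # Hypothetical (a, b, c) parameters; not values from this study.
        return [(rng.uniform(0.6, 1.6), rng.normal(), rng.uniform(0.1, 0.25))
                for _ in range(n_items)]

    new_form, old_form = simulated_form(40), simulated_form(40)
    raw_scores = np.arange(10, 40)
    equated = irt_true_score_equate(new_form, old_form, raw_scores)
    for score, eq in zip(raw_scores, equated):
        print(f"new-form true score {score:2d} -> old-form equivalent {eq:5.2f}")
```

In practice the two forms' item parameters must first be placed on a common scale through an item parameter linking procedure; that linking step, and the data collection design that supports it, is exactly what the report's precalibration and linking comparisons address.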