Automatic item generation explained

Automatic item generation (AIG), or automated item generation, is a process linking psychometrics with computer programming. It uses a computer algorithm to automatically create test items, which are the basic building blocks of a psychological test. The method was first described by John R. Bormuth^[1] in the 1960s but was not developed until recently. AIG uses a two-step process: first, a test specialist creates a template called an item model; then, a computer algorithm is developed to generate test items from it.^[2] Instead of a test specialist writing each individual item, computer algorithms generate families of items from a smaller set of parent item models.^[3] ^[4] More recently, neural networks, including large language models such as the GPT family, have been used successfully to generate items automatically.^[5] ^[6]
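
The two-step process can be illustrated with a minimal Python sketch. The item model, its slot names, and the `generate_items` helper are hypothetical and chosen only for illustration; they are not taken from any of the AIG programs cited in this article.

```python
from itertools import product

# Step 1: a hypothetical item model written by a test specialist.
# The stem has open slots; each slot has a list of permitted values.
ITEM_MODEL = {
    "stem": "A train travels at {speed} km/h for {hours} hours. How many km does it cover?",
    "slots": {"speed": [60, 80, 100], "hours": [2, 3]},
}

def generate_items(model):
    """Step 2: fill every combination of slot values and compute the answer key."""
    names = list(model["slots"])
    items = []
    for values in product(*(model["slots"][n] for n in names)):
        filled = dict(zip(names, values))
        items.append({
            "stem": model["stem"].format(**filled),
            "key": filled["speed"] * filled["hours"],  # correct answer
        })
    return items

items = generate_items(ITEM_MODEL)
print(len(items))        # 6 items from a single item model
print(items[0]["stem"])
```

A single item model with three speeds and two durations yields six items, each with its own key, without anyone writing the items individually.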

Context

In psychological testing, the responses of the test taker to test items provide objective measurement data for a variety of human characteristics.^[7] Characteristics measured by psychological and educational tests include academic abilities, school performance, intelligence, and motivation. These tests are frequently used to make decisions that have significant consequences for individuals or groups of individuals. Achieving measurement quality standards, such as test validity, is one of the most important objectives for psychologists and educators.^[8] AIG is an approach to test development that can be used to maintain and improve test quality economically in an environment where computerized testing has increased the need for large numbers of test items.

Benefits

AIG reduces the cost of producing standardized tests,^[9] as algorithms can generate many more items in a given amount of time than a human test specialist. It can quickly create parallel test forms, which expose different test takers to different groups of test items with the same level of complexity or difficulty, thus enhancing test security. When combined with computerized adaptive testing, AIG can generate new items, or select which already-generated items should be administered next, based on the test taker's ability during the administration of the test. AIG is also expected to produce items with a wide range of difficulty and fewer construction errors, and to permit higher comparability of items due to a more systematic definition of the prototypical item model.^[10] ^[11]
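
How an adaptive test might pick the next item from a generated pool can be sketched as follows. This is only one common selection rule (maximum item information under the Rasch model), assumed here for illustration; the text does not say which rule a given AIG system uses, and the pool, item identifiers, and difficulty values are made up.

```python
import math

def rasch_probability(theta, b):
    """Probability of a correct response for ability theta and item difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def item_information(theta, b):
    """Fisher information of a Rasch item; largest when b is close to theta."""
    p = rasch_probability(theta, b)
    return p * (1.0 - p)

def select_next_item(theta, pool, administered):
    """Pick the not-yet-administered item that is most informative at theta."""
    candidates = [item for item in pool if item["id"] not in administered]
    return max(candidates, key=lambda item: item_information(theta, item["b"]))

# Hypothetical pool of already-generated items with assumed difficulties.
pool = [
    {"id": "A", "b": -1.5},
    {"id": "B", "b": 0.2},
    {"id": "C", "b": 1.0},
    {"id": "D", "b": 2.4},
]
next_item = select_next_item(theta=0.9, pool=pool, administered={"B"})
print(next_item["id"])  # C
```

With a current ability estimate of 0.9, item C (difficulty 1.0) is selected because its difficulty lies closest to the estimate.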

Radicals, incidentals and isomorphs

Test development, including AIG, can be enriched by grounding it in a cognitive theory. Cognitive processes taken from a given theory are often matched with item features during item construction. The purpose is to predetermine a given psychometric parameter, such as item difficulty. Radicals are those structural elements that significantly affect item parameters and give the item certain cognitive requirements. One or more radicals of the item model can be manipulated to produce parent item models with different parameter levels (e.g., different difficulty levels). Each parent can then grow its own family by manipulating other elements that Irvine called incidentals. Incidentals are surface features that vary randomly from item to item within the same family. Items that have the same structure of radicals and differ only in incidentals are usually labeled isomorphs^[12] or clones.^[13] ^[14]

There are two kinds of item cloning. In the first, the item model consists of an item with one or more open places, and cloning is done by filling each place with an element selected from a list of possibilities. In the second, the item model is an intact item that is cloned by introducing transformations, for example changing the angle of an object in a spatial ability test.^[15] Variation in these surface characteristics should not significantly influence the test taker's responses, which is why incidentals are believed to produce only slight differences among the item parameters of the isomorphs.^[16]
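
The distinction between radicals and incidentals can be made concrete with a small Python sketch. Everything in it is an assumption for illustration: the radical (whether an addition requires carrying) is simply presumed to affect difficulty, and the names and objects stand in for incidentals.

```python
import random

# Hypothetical incidentals: surface features not intended to affect difficulty.
NAMES = ["Ana", "Ben", "Chloe"]
OBJECTS = ["apples", "pencils", "stickers"]

def needs_carry(a, b):
    """Radical in this sketch: does adding the units digits require carrying?"""
    return (a % 10) + (b % 10) >= 10

def generate_isomorph(requires_carry, rng):
    """Create one item of the family defined by the radical `requires_carry`."""
    # Resample operands until they match the radical of the parent item model.
    while True:
        a, b = rng.randint(11, 48), rng.randint(11, 48)
        if needs_carry(a, b) == requires_carry:
            break
    # Incidentals are drawn at random and vary freely within the family.
    name, obj = rng.choice(NAMES), rng.choice(OBJECTS)
    stem = f"{name} has {a} {obj} and gets {b} more. How many {obj} are there now?"
    return {"stem": stem, "a": a, "b": b, "key": a + b}

rng = random.Random(0)  # fixed seed so the sketch is reproducible
easy_family = [generate_isomorph(False, rng) for _ in range(3)]
hard_family = [generate_isomorph(True, rng) for _ in range(3)]
for item in easy_family + hard_family:
    print(item["stem"], "->", item["key"])
```

Items within each family share the same radical and differ only in surface features, so they would be treated as isomorphs; the two families correspond to two parent item models.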

Current developments

A number of item generators have been subjected to objective validation testing.

MathGen is a program that generates items to test mathematical achievement. In a 2018 article for the Journal of Educational Measurement, authors Embretson and Kingston conducted an extensive qualitative review and empirical try-outs to evaluate the qualitative and psychometric properties of generated items, concluding that the items were successful and that items generated from the same item structure had predictable psychometric properties.^[17] ^[18]

A test of melodic discrimination developed with the aid of the computational model Rachman-Jun 2015^[19] was administered to participants in a 2017 trial. According to the data collected by P.M. Harrison et al., the results demonstrated strong validity and reliability.^[20]

Ferreyra and Backhoff-Escudero^[21] generated two parallel versions of the Basic Competences Exam (Excoba), a general test of educational skills, using a program they developed called GenerEx. They then studied the internal structure as well as the psychometric equivalence of the created tests. Empirical results of psychometric quality are favorable overall, and the tests and items are consistent as measured by multiple psychometric indices.

Gierl and his colleagues^[22] ^[23] ^[24] ^[25] used an AIG program called the Item Generator (IGOR^[26]) to create multiple-choice items that test medical knowledge. IGOR-generated items, even when compared to manually-designed items, showed good psychometric properties.

Arendasy, Sommer, and Mayr^[27] used AIG to create verbal items to test verbal fluency in German and English, administering them to German- and English-speaking participants respectively. The computer-generated items showed acceptable psychometric properties. The sets of items administered to these two groups were based on a common set of interlanguage anchor items, which facilitated cross-lingual comparisons of performance.

Holling, Bertling, and Zeuch^[28] automatically generated probability word problems with expected difficulties. The items fit the Rasch^[29] model, and item difficulties could be explained by the linear logistic test model (LLTM^[30]) as well as by the random-effects LLTM. Holling, Blank, Kuchenbäcker, and Kuhn^[31] conducted a similar study with statistical word problems but without using AIG. Arendasy and his colleagues^[32] ^[33] presented studies on automatically generated algebra word problems and examined how a quality control framework for AIG can affect the measurement quality of items.
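
For orientation, the two models named above can be written in their standard textbook form. The notation is an assumption of this sketch rather than something taken from the cited studies: \theta_v is the ability of person v, \beta_i the difficulty of item i, q_{ij} the weight of design feature j in item i, \eta_j the difficulty contribution of that feature, and c a normalization constant.

```latex
% Rasch model: probability that person v answers item i correctly
P(X_{vi} = 1 \mid \theta_v, \beta_i)
  = \frac{\exp(\theta_v - \beta_i)}{1 + \exp(\theta_v - \beta_i)}

% LLTM: item difficulty decomposed into contributions of design features
\beta_i = \sum_{j=1}^{m} q_{ij}\,\eta_j + c
```

Under the LLTM, the difficulty of a newly generated item follows from the features used to build it, which is what makes the model attractive for checking rule-based item generation.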

Automatic generation of figural items

The Item Maker (IMak) is a program written in the R programming language for plotting figural analogy items. The psychometric properties of 23 IMak-generated items were found to be satisfactory, and item difficulty based on rule generation could be predicted by means of the linear logistic test model (LLTM).^[16]
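
As a rough illustration of how rule-based difficulty prediction with the LLTM works, the sketch below sums assumed rule contributions for each item. The rule names, parameter values, and items are hypothetical; they are not IMak's actual rules or estimates.

```python
# Hypothetical basic parameters (eta): difficulty contribution of each rule.
ETA = {"rotation": 0.8, "reflection": 0.5, "size_change": 0.3}

# Hypothetical design (rows of a Q-matrix): how often each rule is applied per item.
ITEMS = {
    "item_1": {"rotation": 1},
    "item_2": {"rotation": 1, "reflection": 1},
    "item_3": {"rotation": 2, "reflection": 1, "size_change": 1},
}

def predicted_difficulty(rule_counts, eta, intercept=0.0):
    """LLTM-style prediction: weighted sum of the contributions of the applied rules."""
    return intercept + sum(count * eta[rule] for rule, count in rule_counts.items())

predictions = {name: predicted_difficulty(q, ETA) for name, q in ITEMS.items()}
ranking = sorted(predictions, key=predictions.get)  # easiest to hardest
print(ranking)  # ['item_1', 'item_2', 'item_3']
```

In an empirical study, the rule contributions would be estimated from response data, and the predicted difficulties would be compared with the difficulties estimated under the Rasch model.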

MazeGen is another program coded with R that generates mazes automatically. The psychometric properties of 18 such mazes were found to be optimal, including Rasch model fit and the LLTM prediction of maze difficulty.^[34]

GeomGen is a program that generates figural matrices.^[35] A study which identified sources of measurement bias related to response elimination strategies for figural matrix items concluded that distractor salience favors the pursuit of response elimination strategies and that this knowledge could be incorporated into AIG to improve the construct validity of such items.^[36] The same group used AIG to study differential item functioning (DIF) and gender differences associated with mental rotation. They manipulated item design features that have exhibited gender DIF in previous studies, and they showed that the estimates of the effect size of gender differences were compromised by the presence of different kinds of gender DIF that could be related to specific item design features.^[37] ^[38]

Arendasy also used item response theory (IRT) to study possible violations of psychometric quality in automatically generated visuospatial reasoning items. For this purpose, he presented two programs: the already-mentioned GeomGen and the Endless Loop Generator (EsGen). He concluded that GeomGen was more suitable for AIG because IRT principles can be incorporated during item generation.^[39] In a parallel research project using GeomGen, Arendasy and Sommer^[40] found that varying the perceptual organization of items could influence the performance of respondents depending on their ability levels, and that it affected several psychometric quality indices. With these results, they questioned the unidimensionality assumption of figural matrix items in general.

MatrixDeveloper^[41] was used to generate twenty-five 4x4 square matrix items automatically. These items were administered to 169 individuals. According to research results, the items show a good Rasch model fit, and rule-based generation can explain the item difficulty.^[42]

The first known item matrix generator was designed by Embretson,^[43] and her automatically generated items demonstrated good psychometric properties, as shown by Embretson and Reise.^[44] She also proposed a model for adequate online item generation.

Notes and References

Bormuth, J. (1969). On a theory of achievement test items. Chicago, IL: University of Chicago Press.
Gierl, M.J., & Haladyna, T.M. (2012). Automatic item generation, theory and practice. New York, NY: Routledge Chapman & Hall.
Glas, C.A.W., van der Linden, W.J., & Geerlings, H. (2010). Estimation of the parameters in an item-cloning model for adaptive testing. In W.J. van der Linden, & C.A.W. Glas (Eds.). Elements of adaptive testing (pp. 289–314). DOI: 10.1007/978-0-387-85461-8_15.
Gierl, M.J., & Lai, H. (2012). The role of item models in automatic item generation. International Journal of testing, 12(3), 273–298. DOI: 10.1080/15305058.2011.635830.
von Davier, M. Automated Item Generation with Recurrent Neural Networks. Psychometrika 83, 847–857 (2018). https://doi.org/10.1007/s11336-018-9608-y
Yaneva, V., & von Davier, M. (Eds.). (2023). Advancing Natural Language Processing in Educational Assessment (1st ed.). Routledge. https://doi.org/10.4324/9781003278658
Van der Linden, W.J., & Hambleton, R.K. (1997). Item Response Theory: a brief history, common models, and extensions. In R.K. Hambleton, & W.J. van der Linden (Eds.). Handbook of modern Item Response Theory (pp. 1–31). New York: Springer.
Embretson, S.E. (1999). Issues in the measurement of cognitive abilities. In S.E. Embretson, & S.L. Hershberger (Eds.). The new rules of measurement (pp. 1–15). Mahwah: Lawrence Erlbaum Associates.
Rudner, L. (2010). Implementing the graduate management admission test computerized adaptive test. In W.J. van der Linden, and C.A.W. Glas (Eds.). Elements of adaptive testing (pp. 151–165). DOI: 10.1007/978-0-387-85461-8_15.
Irvine, S. (2002). The foundations of item generation for mass testing. In S.H. Irvine, & P.C. Kyllonen (Eds.). Item generation for test development (pp. 3–34). Mahwah: Lawrence Erlbaum Associates.
Lai, H., Alves, C., & Gierl, M.J. (2009). Using automatic item generation to address item demands for CAT. In D.J. Weiss (Ed.), Proceedings of the 2009 GMAC Conference on Computerized Adaptive Testing. Web: www.psych.umn.edu/psylabs/CATCentral.
Bejar, I. I. (2002). Generative testing: from conception to implementation in Item Generation for Test Development, eds. S. H. Irvine and P. C. Kyllonen (Mahwah, NJ: Lawrence Erlbaum Associates), 199–217.
Embretson, S.E. (1999). Generating items during testing: psychometric issues and models. Psychometrika, 64(4), 407–433.
Arendasy, M. E., and Sommer, M. (2012). Using automatic item generation to meet the increasing item demands of the high-stakes educational and occupational assessment. Learning and individual differences, 22, 112–117. doi: 10.1016/j.lindif.2011.11.005.
Glas, C. A. W., and van der Linden, W. J. (2003). Computerized adaptive testing with item cloning. Applied psychological measurement, 27, 247–261. doi: 10.1177/0146621603027004001.
Blum, D., & Holling, H. (2018). Automatic generation of figural analogies with the IMak package. Frontiers in Psychology, 9, 1286. DOI: 10.3389/fpsyg.2018.01286. PMID: 30127757. PMC: 6087760. The material was copied from this source, which is available under a Creative Commons Attribution 4.0 International License.
Embretson, S.E., & Kingston, N.M. (2018). Automatic item generation: a more efficient process for developing mathematics achievement items? Journal of educational measurement, 55(1), 112–131. DOI: 10.1111/jedm.12166
Willson, J., Morrison, K., & Embretson, S.E. (2014). Automatic item generator for mathematical achievement items: MathGen3.0. Technical report IES1005A-2014 for the Institute of Education Sciences Grant R305A100234. Atlanta, GA: Cognitive Measurement Laboratory, Georgia Institute of Technology.
Collins, T., Laney, R., Willis, A., & Garthwaite, P.H. (2016). Developing and evaluating computational models of music style. Artificial intelligence for engineering design, analysis, and manufacturing, 30, 16–43. DOI: 10.1017/S0890060414000687.
Harrison, P.M., Collins, T., & Müllensiefen, D. (2017). Applying modern psychometric techniques to melodic discrimination testing: item response theory, computerized adaptive testing, and automatic item generation. Scientific reports, 7(3618), 1–18.
Ferreyra, M.F., & Backhoff-Escudero, E. (2016). Validez del Generador Automático de Ítems del Examen de Competencias Básicas (Excoba). Relieve, 22(1), art. 2, 1–16. DOI: 10.7203/relieve.22.1.8048.
Gierl, M.J., Lai, H., Pugh, D., Touchie, C., Boulais, A.P., & De Champlain, A. (2016). Evaluating the psychometric characteristics of generated multiple-choice test items. Applied measurement in education, 29(3), 196–210. DOI: 10.1080/08957347.2016.1171768.
Lai, H., Gierl, M.J., Byrne, B.E., Spielman, A.I., & Waldschmidt, D.M. (2016). Three modeling applications to promote automatic item generation for examinations in dentistry. Journal of dental education, 80(3), 339–347.
Gierl, M.J., & Lai, H. (2013). Evaluating the quality of medical multiple-choice items created with automated processes. Medical education, 47, 726–733. DOI: 10.1111/medu.12202.
Gierl, M.J., Lai, H., & Turner, S.R. (2012). Using automatic item generation to create multiple-choice test items. Medical education, 46(8), 757–765. DOI: 10.1111/j.1365-2923.2012.04289.x.
Gierl, M.J., Zhou, J., & Alves, C. (2008). Developing a taxonomy of item model types to promote assessment engineering. Journal of technology, learning, and assessment, 7(2), 1–51.
Arendasy, M.E., Sommer, M., & Mayr, F. (2011). Using automatic item generation to simultaneously construct German and English versions of a Word Fluency Test. Journal of cross-cultural psychology, 43(3), 464–479. DOI: 10.1177/0022022110397360.
Holling, H., Bertling, J.P., & Zeuch, N. (2009). Automatic item generation of probability word problems. Studies in educational evaluation, 35(2–3), 71–76.
Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Chicago: University of Chicago Press.
Fischer, G.H. (1973). The linear logistic test model as an instrument of educational research. Acta Psychologica, 37, 359–374. DOI: 10.1016/0001-6918(73)90003-6.
Holling, H., Blank, H., Kuchenbäcker, K., & Kuhn, J.T. (2008). Rule-based item design of statistical word problems: a review and first implementation. Psychology science quarterly, 50(3), 363–378.
Arendasy, M.E., Sommer, M., Gittler, G., & Hergovich, A. (2006). Automatic generation of quantitative reasoning items. A pilot study. Journal of individual differences, 27(1), 2–14. DOI: 10.1027/1614-0001.27.1.2.
Arendasy, M.E., & Sommer, M. (2007). Using psychometric technology in educational assessment: the case of a schema-based isomorphic approach to the automatic generation of quantitative reasoning items. Learning and individual differences, 17(4), 366–383. DOI: 10.1016/j.lindif.2007.03.005.
Loe, B.S., & Rust, J. (2017). The perceptual maze test revisited: evaluating the difficulty of automatically generated mazes. Assessment, 1–16. DOI: 10.1177/1073191117746501.
Arendasy, M. (2002). Geom-Gen-Ein Itemgenerator für Matrizentestaufgaben. Vienna: Eigenverlag.
Arendasy, M.E., & Sommer, M. (2013). Reducing response elimination strategies enhances the construct validity of figural matrices. Intelligence, 41, 234–243. DOI: 10.1016/j.intell.2013.03.006.
Arendasy, M.E., & Sommer, M. (2010). Evaluating the contribution of different item features to the effect size of the gender difference in three-dimensional mental rotation using automatic item generation. Intelligence, 38(6), 574–581. DOI:10.1016/j.intell.2010.06.004.
Arendasy, M.E., Sommer, M., & Gittler, G. (2010). Combining automatic item generation and experimental designs to investigate the contribution of cognitive components to the gender difference in mental rotation. Intelligence, 38(5), 506–512. DOI:10.1016/j.intell.2010.06.006.
Arendasy, M. (2005). Automatic generation of Rasch-calibrated items: figural matrices test GEOM and Endless-Loops Test EC. International Journal of testing, 5(3), 197–224.
Arendasy, M.E., & Sommer, M. (2005). The effect of different types of perceptual manipulations on the dimensionality of automatic generated figural matrices. Intelligence, 33(3), 307–324. DOI: 10.1016/j.intell.2005.02.002.
Hofer, S. (2004). MatrixDeveloper. Münster, Germany: Psychological Institute IV. Westfälische Wilhelms-Universität.
Freund, P.A., Hofer, S., & Holling, H. (2008). Explaining and controlling for the psychometric properties of computer-generated figural matrix items. Applied psychological measurement, 32(3), 195–210. DOI: 10.1177/0146621607306972.
Embretson, S.E. (1998). A cognitive design system approach to generating valid tests: application to abstract reasoning. Psychological methods, 3(3), 380–396.
Embretson, S.E., & Reise, S.P. (2000). Item Response Theory for psychologists. Mahwah: Lawrence Erlbaum Associates.