Automatic item generation (AIG), or automated item generation, is a process linking psychometrics with computer programming. It uses a computer algorithm to automatically create test items, the basic building blocks of a psychological test. The method was first described by John R. Bormuth[1] in the 1960s but was not developed until recently. AIG uses a two-step process: first, a test specialist creates a template called an item model; then, a computer algorithm is developed to generate test items from it.[2] Instead of a test specialist writing each item individually, computer algorithms generate families of items from a smaller set of parent item models.[3] [4] More recently, neural networks, including large language models such as the GPT family, have been used successfully to generate items automatically.[5] [6]
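The two-step process can be illustrated with a minimal sketch. The item model, its slot names, and the `generate_items` helper below are hypothetical and are not taken from any published generator; the sketch only assumes that an item model can be written as a stem with open slots plus a rule for computing the answer key.

```python
from itertools import product

# Step 1 (test specialist): a hypothetical item model with two open slots.
ITEM_MODEL = {
    "stem": "A train travels {speed} km/h for {hours} hours. "
            "How many kilometres does it cover?",
    "slots": {"speed": [60, 80], "hours": [2, 3]},
    # Rule for the correct answer, written by the specialist.
    "key": lambda s: s["speed"] * s["hours"],
}

# Step 2 (algorithm): expand the model into every combination of slot values.
def generate_items(model):
    names = list(model["slots"])
    items = []
    for values in product(*(model["slots"][n] for n in names)):
        filling = dict(zip(names, values))
        items.append({
            "stem": model["stem"].format(**filling),
            "key": model["key"](filling),
        })
    return items

items = generate_items(ITEM_MODEL)
for item in items:
    print(item["stem"], "->", item["key"])
```

With two values per slot, this single model yields a family of four items; real generators typically add constraints so that implausible or duplicate combinations are excluded.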
In psychological testing, the responses of the test taker to test items provide objective measurement data for a variety of human characteristics.[7] Characteristics measured by psychological and educational tests include academic abilities, school performance, intelligence, and motivation, and these tests are frequently used to make decisions that have significant consequences for individuals or groups of individuals. Achieving measurement quality standards, such as test validity, is one of the most important objectives for psychologists and educators.[8] AIG is an approach to test development that can be used to maintain and improve test quality economically in the contemporary environment, where computerized testing has increased the need for large numbers of test items.
AIG reduces the cost of producing standardized tests,[9] as algorithms can generate many more items in a given amount of time than a human test specialist. It can quickly and easily create parallel test forms, which allow different test takers to be exposed to different groups of test items with the same level of complexity or difficulty, thus enhancing test security. When combined with computerized adaptive testing, AIG can generate new items, or select which already-generated items should be administered next, based on the test taker's ability during the administration of the test. AIG is also expected to produce items with a wide range of difficulty and fewer construction errors, and to permit higher comparability of items due to a more systematic definition of the prototypical item model.[10] [11]
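How an adaptive test might choose the next item from a generated pool can be sketched as follows. This is an illustration under stated assumptions, not the procedure of any particular testing system: it assumes a Rasch model, a known difficulty `b` for each generated item, and the common maximum-information selection rule. The pool, its identifiers, and the function names are hypothetical.

```python
import math

def rasch_probability(theta, b):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def item_information(theta, b):
    """Fisher information of a Rasch item at ability theta: p * (1 - p)."""
    p = rasch_probability(theta, b)
    return p * (1.0 - p)

def select_next_item(theta, pool, administered):
    """Pick the unused item that is most informative at the current ability estimate."""
    candidates = [item for item in pool if item["id"] not in administered]
    return max(candidates, key=lambda item: item_information(theta, item["b"]))

# Hypothetical pool of generated items with assumed difficulties.
pool = [
    {"id": "A", "b": -1.0},
    {"id": "B", "b": 0.2},
    {"id": "C", "b": 1.5},
]
next_item = select_next_item(theta=0.0, pool=pool, administered={"A"})
print(next_item["id"])
```

Under the Rasch model, information peaks where item difficulty equals the ability estimate, so the rule favours the remaining item whose difficulty is closest to the current estimate.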
Test development, including AIG, can be strengthened by grounding it in a cognitive theory. Cognitive processes taken from the theory are matched with item features during construction, with the aim of predetermining a psychometric parameter such as item difficulty. Radicals are the structural elements that significantly affect item parameters and give the item its cognitive requirements. One or more radicals of the item model can be manipulated to produce parent item models with different parameter levels (e.g., different difficulty levels). Each parent can then grow its own family by manipulating other elements that Irvine called incidentals. Incidentals are surface features that vary randomly from item to item within the same family. Items that have the same structure of radicals and differ only in incidentals are usually labeled isomorphs[12] or clones.[13] [14] There are two kinds of item cloning. In the first, the item model is an item with one or more open places, and cloning is done by filling each place with an element selected from a list of possibilities. In the second, the item model is an intact item that is cloned by applying transformations, for example changing the angle of an object in a spatial ability test.[15] Variation in these surface characteristics should not significantly influence the test taker's responses, which is why incidentals are believed to produce only slight differences among the item parameters of the isomorphs.[16]
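The distinction between radicals and incidentals can be made concrete with a small sketch of the first kind of cloning (filling open places). Everything here is hypothetical: the number of arithmetic steps is assumed to be the radical, while names, objects, and numbers drawn from a narrow range are assumed to be incidentals. Whether a given feature really behaves as an incidental is an empirical question.

```python
import random

# Radical (assumed): the number of arithmetic steps. Each value defines a parent.
PARENTS = {
    "one_step": "{name} has {a} {thing}s and buys {b} more. "
                "How many {thing}s does {name} have now?",
    "two_step": "{name} has {a} {thing}s, buys {b} more, and then gives away {c}. "
                "How many {thing}s does {name} have now?",
}

# Incidentals (assumed): surface features that should not change difficulty.
INCIDENTALS = {
    "name": ["Ana", "Ben", "Chloe"],
    "thing": ["apple", "pencil", "marble"],
}

def clone(parent_id, rng):
    """Produce one isomorph: keep the parent's radical, vary only surface features."""
    filling = {
        "name": rng.choice(INCIDENTALS["name"]),
        "thing": rng.choice(INCIDENTALS["thing"]),
        "a": rng.randint(10, 20),
        "b": rng.randint(2, 9),
        "c": rng.randint(1, 5),  # ignored by the one-step template
    }
    key = filling["a"] + filling["b"]
    if parent_id == "two_step":
        key -= filling["c"]
    return {"parent": parent_id, "stem": PARENTS[parent_id].format(**filling), "key": key}

rng = random.Random(0)  # seeded so the family is reproducible
family = [clone("two_step", rng) for _ in range(3)]
for item in family:
    print(item["stem"], "->", item["key"])
```

All members of `family` share the two-step structure and would be expected to have similar difficulty; switching the parent to `"one_step"` changes the radical and therefore starts a different family.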
A number of item generators have been subjected to objective validation testing.
MathGen is a program that generates items to test mathematical achievement. In a 2018 article for the Journal of Educational Measurement, authors Embretson and Kingston conducted an extensive qualitative review and empirical try-outs to evaluate the qualitative and psychometric properties of generated items, concluding that the items were successful and that items generated from the same item structure had predictable psychometric properties.[17] [18]
A test of melodic discrimination developed with the aid of the computational model Rachman-Jun 2015[19] was administered to participants in a 2017 trial. According to the data collected by P.M. Harrison et al., the test showed strong validity and reliability.[20]
Ferreyra and Backhoff-Escudero[21] generated two parallel versions of the Basic Competences Exam (Excoba), a general test of educational skills, using a program they developed called GenerEx. They then studied the internal structure as well as the psychometric equivalence of the created tests. Empirical results of psychometric quality are favorable overall, and the tests and items are consistent as measured by multiple psychometric indices.
Gierl and his colleagues[22] [23] [24] [25] used an AIG program called the Item Generator (IGOR[26]) to create multiple-choice items that test medical knowledge. IGOR-generated items, even when compared to manually-designed items, showed good psychometric properties.
Arendasy, Sommer, and Mayr[27] used AIG to create verbal items to test verbal fluency in German and English, administering them to German- and English-speaking participants respectively. The computer-generated items showed acceptable psychometric properties. The sets of items administered to these two groups were based on a common set of interlanguage anchor items, which facilitated cross-lingual comparisons of performance.
Holling, Bertling, and Zeuch[28] used probability theory to automatically generate mathematical word problems with expected difficulties. The items fit the Rasch[29] model, and item difficulties could be explained by the linear logistic test model (LLTM[30]) as well as by the Random-Effects LLTM. Holling, Blank, Kuchenbäcker, and Kuhn[31] conducted a similar study with statistical word problems but without using AIG. Arendasy and his colleagues[32] [33] presented studies on automatically generated algebra word problems and examined how a quality control framework of AIG can affect the measurement quality of items.
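For readers unfamiliar with these models, their standard forms are given below. The notation is a common textbook convention and is an assumption here, not taken from the cited studies: $\theta_v$ is the ability of person $v$, $\beta_i$ the difficulty of item $i$, $q_{ij}$ the weight of generative rule $j$ in item $i$, $\eta_j$ the difficulty contribution of rule $j$, and $c$ a normalization constant.

```latex
% Rasch model: probability that person v answers item i correctly
P(X_{vi} = 1 \mid \theta_v, \beta_i)
  = \frac{\exp(\theta_v - \beta_i)}{1 + \exp(\theta_v - \beta_i)}

% LLTM: item difficulty decomposed into contributions of the generative rules
\beta_i = \sum_{j=1}^{p} q_{ij}\,\eta_j + c

% Random-effects LLTM: adds an item-specific residual
\beta_i = \sum_{j=1}^{p} q_{ij}\,\eta_j + \varepsilon_i,
  \qquad \varepsilon_i \sim N(0, \sigma_\varepsilon^2)
```

In an AIG setting, the weights $q_{ij}$ are known in advance because the generator records which rules were used to build each item, which is what allows difficulty to be predicted before an item is administered.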
The Item Maker (IMak) is a program written in the R programming language for plotting figural analogy items. The psychometric properties of 23 IMak-generated items were found to be satisfactory, and item difficulty based on rule generation could be predicted by means of the linear logistic test model (LLTM).[16]
MazeGen is another program written in R that generates mazes automatically. The psychometric properties of 18 such mazes were found to be optimal, including Rasch model fit and the LLTM prediction of maze difficulty.[34]
GeomGen is a program that generates figural matrices.[35] A study which identified sources of measurement bias related to response elimination strategies for figural matrix items concluded that distractor salience favors the pursuit of response elimination strategies and that this knowledge could be incorporated into AIG to improve the construct validity of such items.[36] The same group used AIG to study differential item functioning (DIF) and gender differences associated with mental rotation. They manipulated item design features that have exhibited gender DIF in previous studies, and they showed that the estimates of the effect size of gender differences were compromised by the presence of different kinds of gender DIF that could be related to specific item design features.[37] [38]
Arendasy also used item response theory (IRT) to study possible violations of psychometric quality in automatically generated visuospatial reasoning items. For this purpose, he presented two programs: the previously mentioned GeomGen and the Endless Loop Generator (EsGen). He concluded that GeomGen was more suitable for AIG because IRT principles can be incorporated during item generation.[39] In a parallel research project using GeomGen, Arendasy and Sommer[40] found that varying the perceptual organization of items could influence the performance of respondents depending on their ability levels and that it affected several psychometric quality indices. With these results, they questioned the unidimensionality assumption of figural matrix items in general.
MatrixDeveloper[41] was used to generate twenty-five 4x4 square matrix items automatically. These items were administered to 169 individuals. According to research results, the items show a good Rasch model fit, and rule-based generation can explain the item difficulty.[42]
The first known item matrix generator was designed by Embretson,[43] and her automatically generated items demonstrated good psychometric properties, as shown by Embretson and Reise.[44] She also proposed a model for adequate online item generation.