Indonesian achievement scores for TIMSS 2003: Effects of changing the score estimation model
The Trends in International Mathematics and Science Study (TIMSS), like other international assessment studies, uses a complex scaling methodology to produce population-orientated scores for participating countries. Based on item response theory (IRT), the plausible value methodology combines test information with contextual variables. This procedure enables estimates to be produced for each student, provided at least some achievement or contextual information is available. Some researchers regard the combination of contextual information with achievement data to produce population measures as controversial. It is often argued that the assessment information dominates the scaling model and that the plausible-value estimates are superior to other IRT measures. In this study, Indonesian mathematics data from TIMSS 2003 are used to investigate the importance of assessment data in the student plausible values.
The scored mathematics data from TIMSS, published item parameters and commercial IRT software were used to produce maximum likelihood (MLE), Warm’s weighted likelihood (WLE) and expected a posteriori (EAP) estimates for each student. The EAP estimates were produced using several priors, including an uninformative prior and various informative normal priors. Under a number of conditions, the MLE and WLE procedures failed to give student estimates. In these cases, the plausible value methodology would draw on the contextual information.
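To make the difference between the three estimators concrete, the following is a minimal sketch and not the procedure used in the study. It assumes a two-parameter logistic (2PL) model for dichotomous items, a simple grid search in place of the commercial software, and made-up item parameters; the function name estimate_theta and all values are hypothetical. Under this simplified model the MLE is undefined for all-correct or all-incorrect response patterns, while the WLE and EAP remain finite; the sketch does not reproduce the conditions under which WLE failed in the study.

```python
import numpy as np

def estimate_theta(responses, a, b, prior_sd=None, grid=None):
    """Grid-based MLE, WLE and EAP ability estimates under an assumed 2PL model.

    responses : 0/1 item scores for one student
    a, b      : item discriminations and difficulties (hypothetical values)
    prior_sd  : None for a flat prior on the grid, else SD of a N(0, sd^2) prior
    """
    if grid is None:
        grid = np.linspace(-6.0, 6.0, 2401)
    x = np.asarray(responses, dtype=float)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)

    # Probability of a correct response: rows are grid points, columns are items.
    p = 1.0 / (1.0 + np.exp(-a * (grid[:, None] - b)))
    loglik = (x * np.log(p) + (1.0 - x) * np.log1p(-p)).sum(axis=1)
    info = (a ** 2 * p * (1.0 - p)).sum(axis=1)  # test information

    # MLE: no finite maximum exists for zero or perfect raw scores.
    raw = x.sum()
    mle = None if raw in (0, len(x)) else float(grid[np.argmax(loglik)])

    # WLE: for the 2PL model, Warm's correction is equivalent to
    # maximising the likelihood weighted by sqrt(information).
    wle = float(grid[np.argmax(loglik + 0.5 * np.log(info))])

    # EAP: posterior mean over the grid under the chosen prior.
    log_prior = 0.0 if prior_sd is None else -0.5 * (grid / prior_sd) ** 2
    log_post = loglik + log_prior
    w = np.exp(log_post - log_post.max())
    eap = float((grid * w).sum() / w.sum())

    return {"mle": mle, "wle": wle, "eap": eap}

# Hypothetical five-item test (parameters invented for illustration).
a_demo = [1.0, 1.0, 1.0, 1.0, 1.0]
b_demo = [-1.0, -0.5, 0.0, 0.5, 1.0]

all_wrong = estimate_theta([0, 0, 0, 0, 0], a_demo, b_demo)
four_right_flat = estimate_theta([1, 1, 1, 1, 0], a_demo, b_demo)
four_right_normal = estimate_theta([1, 1, 1, 1, 0], a_demo, b_demo, prior_sd=1.0)
```

In this toy example the all-incorrect pattern yields no MLE but a finite WLE, and the informative normal prior pulls the EAP towards the prior mean relative to the flat prior. With few items, as in the science-focused booklets, such extreme patterns become more likely.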
We report the weighted percentage of students receiving MLE and WLE scores for each TIMSS test booklet, and the MLE, WLE and various EAP averages, including those for key demographic groups. In mathematics-focused booklets, the percentage of students not receiving valid MLE scores was low, and it was even lower for WLE scores. However, for science-focused booklets with relatively few mathematics items, the percentage of students without scores became non-trivial. While many students with missing scores had raw scores below the level expected from guessing, some students had score patterns that were indicative of teaching and other effects.
