Item Analysis
Item Analysis allows us to observe the characteristics of a particular question (item). It can be used to ensure that questions are of an appropriate standard and to select items for inclusion in a test.

Introduction
Item Analysis describes the statistical analyses which allow measurement of the effectiveness of individual test items. An understanding of the factors which govern effectiveness (and a means of measuring them) can enable us to create more effective test questions and also regulate and standardise existing tests.

There are three main types of Item Analysis: Item Response Theory, Rasch Measurement and Classical Test Theory. Although Classical Test Theory and Rasch Measurement will be discussed, this document will concentrate primarily on Item Response Theory.

The Models

Classical Test Theory
Classical Test Theory (traditionally the main method used in the United Kingdom) uses two main statistics - Facility and Discrimination. The main problem with Classical Test Theory is that the conclusions drawn depend heavily on the sample used to collect the information: item and candidate statistics are inter-dependent.
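The two Classical Test Theory statistics can be computed directly from a table of scored responses. A minimal sketch, assuming a small made-up score matrix and using a simple upper-lower index for discrimination (other discrimination indices, such as the point-biserial correlation, are also common):

```python
# Hypothetical scored responses (1 = correct, 0 = incorrect);
# rows are candidates, columns are items.
responses = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 0, 0],
    [1, 1, 1],
]

def facility(item):
    """Facility: the proportion of candidates answering the item correctly."""
    scores = [row[item] for row in responses]
    return sum(scores) / len(scores)

def discrimination(item):
    """Discrimination (upper-lower index): difference in facility between
    the top and bottom halves of candidates ranked by total test score."""
    totals = [sum(row) for row in responses]
    order = sorted(range(len(responses)), key=lambda i: totals[i])
    half = len(responses) // 2
    lower = [responses[i][item] for i in order[:half]]
    upper = [responses[i][item] for i in order[-half:]]
    return sum(upper) / len(upper) - sum(lower) / len(lower)

print(facility(0))        # item 0 was answered correctly by 4 of 5 candidates
print(discrimination(2))  # item 2 separates strong from weak candidates
```

Note how both statistics depend entirely on this particular sample of candidates - re-run the same item with a stronger cohort and its facility rises - which is exactly the inter-dependence problem described above.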

Item Response Theory
Item Response Theory (IRT) assumes that there is a correlation between the score gained by a candidate for one item/test (measurable) and their overall ability on the latent trait which underlies test performance (which we want to discover). Critically, the 'characteristics' of an item are said to be independent of the ability of the candidates who were sampled.

Item Response Theory comes in three forms: IRT1, IRT2, and IRT3 reflecting the number of parameters considered in each case.

IRT can be used to create a unique plot for each item (the Item Characteristic Curve - ICC). The ICC is a plot of the probability that the item will be answered correctly against ability. The shape of the ICC reflects the influence of the three parameters:

Difficulty (b) - the ability level at which the curve is steepest; the only parameter in IRT1.
Discrimination (a) - how sharply the probability of success rises with ability; added in IRT2.
Guessing (c) - the lower asymptote, i.e. the probability that a candidate of very low ability answers correctly by chance; added in IRT3.
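The ICC is conventionally modelled with a logistic function. A minimal sketch of the three-parameter logistic (3PL) form, using the standard parameterisation in which difficulty, discrimination and guessing are the item parameters:

```python
import math

def icc(theta, a=1.0, b=0.0, c=0.0):
    """Three-parameter logistic (3PL) Item Characteristic Curve:
    P(correct | ability theta) = c + (1 - c) / (1 + exp(-a * (theta - b)))
    a: discrimination, b: difficulty, c: guessing (lower asymptote).
    Setting c = 0 gives the IRT2 model; fixing a as well gives IRT1."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# A candidate whose ability equals the item's difficulty answers a
# no-guessing item correctly with probability 0.5:
print(icc(theta=0.0, a=1.2, b=0.0, c=0.0))  # 0.5

# With a guessing parameter of 0.2, even very low-ability candidates
# retain roughly a 0.2 chance of success (the lower asymptote):
print(icc(theta=-10.0, a=1.2, b=0.0, c=0.2))
```

Plotting `icc` over a range of `theta` values reproduces the familiar S-shaped curve; the three parameters shift it (b), steepen it (a), and raise its floor (c).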

Of course, when you carry out a test for the first time you don't know the ICC of an item, because you don't know its difficulty (or its discrimination and guessing parameters). Instead, you estimate the parameters (using parameter estimation techniques) to find the values which best fit the data you observed.
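Parameter estimation is usually done by maximum likelihood. A deliberately crude sketch, assuming an IRT1 model, candidate abilities that are already known, and a simple grid search in place of the iterative methods real IRT software uses:

```python
import math

def p_correct(theta, b):
    """IRT1 (one-parameter logistic): probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def estimate_difficulty(abilities, responses):
    """Crude maximum-likelihood estimate of an item's difficulty b:
    pick the b on a grid that maximises the log-likelihood of the
    observed 0/1 responses at the (assumed known) candidate abilities."""
    def log_likelihood(b):
        ll = 0.0
        for theta, x in zip(abilities, responses):
            p = p_correct(theta, b)
            ll += math.log(p) if x else math.log(1.0 - p)
        return ll
    grid = [i / 100.0 for i in range(-300, 301)]  # candidate b in [-3, 3]
    return max(grid, key=log_likelihood)

# Made-up data: higher-ability candidates tend to answer correctly,
# so the estimated difficulty should fall near the 0/1 boundary.
abilities = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]
responses = [0, 0, 0, 1, 1, 1, 1]
print(estimate_difficulty(abilities, responses))
```

In practice abilities are not known either, so real estimation alternates between (or jointly solves for) item parameters and candidate abilities, but the fitting principle is the same.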

Using IRT models allows items to be characterised and ranked by their difficulty, and this can be exploited when generating Item Banks of equivalent questions. It is important to remember, though, that in IRT2 and IRT3 question difficulty rankings may vary over the ability range.
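The point about rankings varying over the ability range can be seen directly: once items differ in discrimination, their ICCs can cross. A sketch with two hypothetical two-parameter items whose difficulty ordering flips between low and high ability:

```python
import math

def icc2(theta, a, b):
    """Two-parameter logistic ICC (discrimination a, difficulty b)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical items: A is weakly discriminating and slightly easier,
# B is strongly discriminating and slightly harder, so their ICCs cross.
item_a = dict(a=0.5, b=-0.2)
item_b = dict(a=2.0, b=0.2)

for theta in (-2.0, 2.0):
    p_a = icc2(theta, **item_a)
    p_b = icc2(theta, **item_b)
    harder = "B" if p_b < p_a else "A"
    print(f"ability {theta:+.1f}: item {harder} is harder")
```

For low-ability candidates item B is the harder of the two, while for high-ability candidates item A is - so an Item Bank built on a single difficulty ranking is only safe under IRT1 (or Rasch), where ICCs never cross.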

Rasch Measurement
Rasch measurement is very similar to IRT1 in that it considers only one parameter (difficulty) and the ICC is calculated in the same way. When it comes to using these theories to categorise items, however, there is a significant difference. If you analyse a set of data with IRT1, you arrive at an ICC that fits the data observed. If you use Rasch measurement, extreme data (e.g. questions which are consistently well or poorly answered) are discarded and the model is fitted to the remaining data.