It is remarkable that five individuals could observe a complex human behavior (in this case a student’s attempt to accurately depict an experiment) and reach the same judgment of the student’s capabilities. Yet this reliability of judgment is exactly what every educational measurement aims for. With only two levels of attainment on this learning goal, agreement is easier to achieve than it would be with more choices. Even so, the probability of all five raters arriving at this particular judgment entirely by chance is small, less than 4%. [The probability of selecting one of two options strictly by chance is .5, and of doing so again is .5 x .5. For five independent judgments it is .5 x .5 x .5 x .5 x .5 = 0.03125, or about 3.1%.]
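The chance-agreement arithmetic above can be checked with a short calculation. This is a minimal sketch (the rater count of 5 and the two judgment categories are taken from the example above); note that it also distinguishes the probability of all raters agreeing on one particular judgment from the probability of their agreeing on any judgment, which is twice as large when there are two categories:

```python
def p_specific_agreement(raters: int, categories: int) -> float:
    """Probability that all raters independently select one
    particular category purely by chance."""
    return (1 / categories) ** raters

def p_any_agreement(raters: int, categories: int) -> float:
    """Probability that all raters agree on some category
    (whichever one it happens to be) purely by chance."""
    return categories * (1 / categories) ** raters

print(p_specific_agreement(5, 2))  # 0.03125 -- the figure cited above
print(p_any_agreement(5, 2))       # 0.0625
```

With more judgment categories the chance of accidental agreement shrinks rapidly, which is why finer-grained scales make perfect agreement a stronger signal of a working instrument.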
Perfect agreement provides evidence that our assessment instrument is working as we would like. Disagreement, as in the case of the judgments of Roland Questevarn’s response, provides an opportunity to improve some aspect of the assessment or evaluation process. Discussion of improvements of this kind takes place in the thread devoted to ‘Evaluating an Assessment Instrument’.
Attached is the reliability report, enhanced with rater comments, that was requested in the previous post.