August 29, 2018 at 9:19 pm #1163
Here is a small educational scientific adventure that we can take together.
Recently we have been using the Cubes and Liquids Assessment Activity to study how the concept of reliability of educational measures can become a practical tool in everyday teaching. This work has resulted in an updated scoring guide for Cubes and Liquids (C&L). Now we will consider together the results of a recent administration of C&L through the lens of this scoring guide.
There were two respondents to this assessment event. I will play the role of a teacher who has called some colleagues together to consider the assessment results. We have each independently rated the same two student responses, and one student has self-rated using the same scoring guide. Let’s look at the results together. Each of you as raters can access the results from your own point of view and compare your ratings with the teacher’s. However, only the teacher has access to the full perspective. I will share the teacher’s perspective with you using images from the ACASE Online Assessment Information System (AIS) as the conversation requires.
Questions about the assessment instrument itself, including the scoring guide, will be referred to another Conversation. In this conversation we will be strictly concerned with interpreting the data.
We will look at the results for each of the six learning goals that make up C&L in turn, and then at the general picture as well. Let’s start with the learning goal concerning the ability to distinguish observation from inference (O vs. I for short), beginning with what we call the Reliability Report. The teacher’s ratings in this report appear in the Standard column. I look forward to your interpretations, comments, questions, etc. concerning these results.
[Outside visitors to this conversation can find out more about what is happening here by visiting the Educational Science Adventures — Guest Introduction.]
August 31, 2018 at 3:10 pm #1170
Monica De Tuya (Moderator)
When I view this report, the first thing that strikes me is the 100% agreement the raters had for one of the students. I was not expecting that level of consensus, and I am curious what worked so well to achieve it. For the other student, everyone agrees except for one person: me! I see that comments are available for some of the ratings; would it be worthwhile/possible to view those within this conversation thread? That would add an element of information that might enhance the conversation.
September 4, 2018 at 4:03 pm #1171
It can indeed be considered remarkable that five individuals would observe a complex human behavior (in this case a student’s attempt to accurately depict an experiment) and make the same judgment as to the student’s capabilities. Yet this reliability of judgment is exactly what every educational measurement aims for. The fact that there are only two levels of attainment on this learning goal makes it easier to achieve agreement than it would be if there were more choices for judgment. Still, the probability of five such judgments coming to agreement entirely by chance is small: a bit over 6%. [The first rater can land on either of the two levels; each of the remaining four raters then matches that choice by chance with probability .5. The probability of all four matching is .5 x .5 x .5 x .5 = 0.0625.]
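The chance-agreement figure can be sanity-checked with a quick simulation. This is an illustrative Python sketch (not part of the AIS tooling), assuming each rater guesses uniformly and independently between the levels:

```python
import random

def simulate_agreement(n_raters, n_levels, trials=200_000):
    """Estimate, by simulation, the probability that n_raters guessing
    uniformly at random among n_levels all pick the same level."""
    hits = 0
    for _ in range(trials):
        first = random.randrange(n_levels)
        # Full agreement: every remaining rater matches the first rater's pick.
        if all(random.randrange(n_levels) == first for _ in range(n_raters - 1)):
            hits += 1
    return hits / trials

# Five raters, two levels of attainment: the simulated value should
# hover near the exact chance of full agreement, 0.5 ** 4 = 0.0625.
print(simulate_agreement(5, 2))
```

The simulation mirrors the bracketed arithmetic: the first judgment is free, and only the four follow-up judgments have to match it by chance.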
Perfect agreement provides evidence that our assessment instrument is working as we would like it to. Disagreement, as in the case of the judgments regarding Roland Questevarn’s response, can provide an opportunity to improve some aspect of the assessment or evaluation processes. Discussion of improvements of this kind takes place in the thread devoted to ‘Evaluating an Assessment Instrument’.
Attached is the reliability report, enhanced with the rater comments requested in the previous post.
September 17, 2018 at 11:27 pm #1183
I agree (@Monica), it is striking that so many people would agree on a complex task. Take the reliability report for “Technical Description.” This was a learning goal with four levels of attainment. For one student, all raters agree that the goal has been attained. This is remarkable in itself, in that the student must be very proficient in this area.
I stop short of saying, however, that the criteria for attainment of this learning goal must be well-defined, since looking at the other student’s report we see wide disparity between raters. What does this disparity mean in this case? Judging from the comments, there was confusion in discerning which words constituted a 0, 1, or 2. How do we interpret this student’s attainment of the learning goals, especially in light of the other student’s unanimous attainment?
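One way to put a number on the disparity between raters is pairwise percent agreement: the fraction of rater pairs who assigned the same level. A minimal Python sketch; the rating values below are invented for illustration and are not taken from the actual report:

```python
from itertools import combinations

def pairwise_percent_agreement(ratings):
    """Fraction of rater pairs that assigned the same level to a response."""
    pairs = list(combinations(ratings, 2))
    return sum(a == b for a, b in pairs) / len(pairs)

# Hypothetical ratings on a four-level (0-3) learning goal:
unanimous = [3, 3, 3, 3, 3]   # like the first student's unanimous report
disparate = [0, 1, 1, 2, 3]   # wide disparity, like the second student's

print(pairwise_percent_agreement(unanimous))  # 1.0
print(pairwise_percent_agreement(disparate))  # 0.1
```

With this measure, unanimity scores 1.0 while the spread-out ratings score near chance, which matches the intuition that the two reports tell very different stories about the criteria.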
November 7, 2018 at 12:34 am #1268
The reliability result for the ability to distinguish observation from inference is really good: we got high reliability among six raters and one self-rater. I fully agree with Paul’s point that the more levels of attainment a learning goal has, the harder it is to achieve a high percent agreement.
With that in mind, I looked at the ability of technical description, which Hunter shared with us. Overall, the percent agreement was lower than for the ability to distinguish observation from inference. However, I noticed that Paul, Mike, and Panpan still had relatively high percent agreement on this ability. So, intuitively, I think at least part of this is due to professional development, because Paul, Mike, and Panpan have had much more professional development in reliability evaluation than the others.
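Paul’s point about levels of attainment can be made concrete for chance agreement: with each added level, the probability that all raters agree purely by chance drops sharply. A small Python sketch of the exact formula, assuming raters guess uniformly and independently (so the rater after the first must each match the first rater’s choice):

```python
def chance_full_agreement(n_raters, n_levels):
    """Exact probability that all raters agree purely by chance,
    assuming each picks uniformly and independently among the levels."""
    return (1 / n_levels) ** (n_raters - 1)

# Five raters on a two-level goal (O vs. I) versus a four-level
# goal (Technical Description):
print(chance_full_agreement(5, 2))  # 0.0625
print(chance_full_agreement(5, 4))  # 0.00390625
```

So before any training effect, full agreement by chance is already sixteen times less likely on the four-level goal than on the two-level one, which is worth keeping in mind when comparing the two reports.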