
(Re)Visualizing Rater Agreement: Beyond Single-Parameter Measures

The Journal of Writing Analytics (2017)

Abstract
Technique Identification: A new graphical technique is presented for visualizing and assessing inter-rater agreement in discrete ordinal or categorical data, such as rubric ratings. To that end, a chance-corrected Kappa with two new features is derived. First, it interprets the ratings for each subject as vectors in order to visualize the data: two-dimensional vectors are created from a subject-rating summary table, sorted by their slopes, and plotted in that order to form a trajectory that displays all the data in context. Second, it presents a graph and accompanying statistics (Kappa, p-value) for each pair of ratings in an organized display, so that all useful comparisons of the data are visually displayed and statistically assessed. This information is presented on a logical grid, usually called facets. Kappa is calculated in the usual way, by comparing the actual results with an average of random rating assignments; this average also becomes a reference line on each graph as a visual cue. The statistical basis for the Kappa and the significance testing are derived, and the test assumptions are specified.

Value Contribution: The most commonly used statistics for inter-rater agreement, such as Cohen's Kappa or the intraclass correlation, give only a single parameter estimate of reliability from which to make judgments about ratings data. The technique presented here constructs graphs of all the data that allow visual inspection of the ratings against a reference curve representing chance matching. The detailed reports on inter-rater agreement can show how to fine-tune rating systems, for example by revealing which parts of an ordinal scale are working best. This solves a practical problem for researchers who rely on rating-type classification by showing which overall aspects of the rating system need to be improved, and it adds to the list of tools available for assessing rating reliability. In creating this approach to the analysis of rater data, human usability is emphasized: the use of geometry is designed to facilitate interpretability rather than being a mathematical derivation from first principles.

Technique Application: Two applications are given, both involving social meaning-making. The first uses data from wine judging to illustrate how the method can illuminate expertise in that domain; the results reproduce published findings that were based on a classical statistical method. The second uses data from a university assessment of student writing in which ratings on a developmental scale are assigned by course instructors to their students. The rating program is an example of social meaning-making that can generate larger data sets than are typical for classroom-based assessment programs. The analysis shows the strengths and weaknesses of the rating system in terms of reliability and demonstrates how that knowledge leads to improvements in assessment.

Directions for Further Research: An argument is made for a public library of inter-rater data for empirical use by researchers. The social aspects of rating are discussed, and the potential to derive new measures of inter-rater agreement from the meaning-making program that produces the data is illustrated.
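The trajectory construction described under Technique Identification can be sketched in a few lines of Python. The snippet below is a minimal illustration, not the authors' implementation: it pairs each subject's two ratings into a two-dimensional vector, sorts the vectors by slope, chains them head to tail into a trajectory, and overlays a chance-reference curve obtained by averaging trajectories of randomly permuted ratings. The two-rater setup, the permutation-based chance model, and all variable names are assumptions made for illustration only.

```python
import numpy as np
import matplotlib.pyplot as plt

def trajectory(pairs):
    """Chain per-subject 2-D rating vectors head to tail, sorted by slope."""
    v = np.asarray(pairs, dtype=float)
    order = np.argsort(v[:, 1] / v[:, 0])  # sort vectors by slope (rater B / rater A)
    return np.vstack([[0.0, 0.0], np.cumsum(v[order], axis=0)])

# Illustrative data: two raters scoring 12 subjects on a 1-4 ordinal scale.
rng = np.random.default_rng(0)
rater_a = rng.integers(1, 5, size=12)
rater_b = np.clip(rater_a + rng.integers(-1, 2, size=12), 1, 4)
observed = trajectory(np.column_stack([rater_a, rater_b]))

# Chance reference: average trajectory over random permutations of one rater's scores.
chance = np.mean(
    [trajectory(np.column_stack([rater_a, rng.permutation(rater_b)]))
     for _ in range(500)],
    axis=0)

plt.plot(observed[:, 0], observed[:, 1], marker="o", label="observed ratings")
plt.plot(chance[:, 0], chance[:, 1], linestyle="--", label="chance reference")
plt.xlabel("cumulative rater A score")
plt.ylabel("cumulative rater B score")
plt.legend()
plt.show()
```

In this sketch, agreement shows up as the observed trajectory hugging the diagonal more tightly than the chance-reference curve; a Kappa-style statistic could then be formed by comparing the observed and chance curves, in the spirit of the chance correction the abstract describes.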
Keywords
rater agreement, measures, single-parameter