Referring to late 19th and early 20th century school attendance centers, Thomas Guskey says:
...Although elementary teachers continued to use narrative reports to document student learning, high school teachers began using percentages and other similar markings to certify accomplishments in different subject areas (Kirschenbaum, Simon & Napier, 1971)
...But in 1912, a study by two Wisconsin researchers seriously challenged the reliability and accuracy of percentage grades.  Daniel Starch and Edward Charles Elliott found that 147 high school English teachers in different schools assigned widely different percentage grades to two identical student papers.  Scores on the first paper ranged from 64 to 98, and scores on the second paper ranged from 50 to 97.  One paper was given a failing mark by 15 percent of the teachers and a grade of over 90 by 12 percent of the teachers...With more than 30 different percentage grades assigned to a single paper and a range of more than 40 points, it is easy to see why this study created a stir among educators.
Starch and Elliott's study was immediately criticized by those who claimed that judging good writing is, after all, highly subjective.  But when the researchers repeated their study using geometry papers graded by 128 math teachers, they found even greater variation.
Via September 2013 Educational Leadership.

-----
(Kind of) Trying out the idea myself

Nearly eight months ago, I started thinking about inter-rater reliability in a typical percentage/points-based classroom vs. a 4-point standards-based grading classroom.  Over the ensuing weeks, I did a bit of informal research to satisfy my psychometric curiosity.

Forty (40) certified secondary math teachers in Iowa were given the following activity:  

Prompt:

Fictitious student responses:

Assuming each question is worth 10 points, how many points would you award each student?



Next, score each student response holistically.  How well does the student understand the concept (area of circle)? 

---
Reflection
  • When using a smaller scale (4 scoring possibilities vs. 10 scoring possibilities), mathematical logic kicks in: with fewer options to choose from, raters land on the same score more often.
  • Even when scoring possibilities are replaced with narratives, humans are still imperfect.
  • As an educational system, have our grading practices changed much since 1912?  
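The "mathematical logic" in the first point can be sketched with a little arithmetic. This is only a rough illustration, not an analysis of the teachers' actual responses: the score lists below are made up, and the function names `pairwise_agreement` and `chance_agreement` are my own. It assumes "consistency" means two raters giving exactly the same score, and it compares that against what two raters would do if they picked scores uniformly at random.

```python
from itertools import combinations


def pairwise_agreement(scores):
    """Share of rater pairs that gave exactly the same score."""
    pairs = list(combinations(scores, 2))
    if not pairs:
        raise ValueError("need at least two scores")
    return sum(a == b for a, b in pairs) / len(pairs)


def chance_agreement(num_options):
    """Probability that two raters choosing uniformly at random match."""
    return 1 / num_options


# Hypothetical scores from eight raters for one student response.
# These are invented for illustration; they are NOT the survey data.
ten_point_scores = [5, 6, 7, 7, 8, 6, 5, 7]
four_point_scores = [2, 2, 3, 3, 3, 2, 2, 3]

print("10-point agreement:", round(pairwise_agreement(ten_point_scores), 3))
print("4-point agreement: ", round(pairwise_agreement(four_point_scores), 3))
print("chance, 10 options:", chance_agreement(10))
print("chance, 4 options: ", chance_agreement(4))
```

Even pure guessing produces more matches on the shorter scale (1 in 4 vs. 1 in 10), so a fair comparison of the two grading systems would probably need to account for that baseline rather than rely on raw agreement alone.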
I am interested in your reflections on these ideas and this informal research as well.  Leave them in the comments below.