An underlying tenet of sabermetrics (hell, of all science) is to continually ask questions, to follow up on previous research, and to continue to update the analytic paradigm. Accountability is paramount, and though showing proof of one's work can present a humbling burden that forces the confrontation of potential holes in the system, here at Raising Aces we make it a point to have a transparent process that ups the communicative value of pitching evaluation.

I recently completed work for the 2015 Starting Pitcher Guide (order here!), capping off a winter spent in hibernation in my pitching-evaluation cave. Regular readers of this column are already familiar with the Mechanics Report Card, a grading system that I created to evaluate pitchers in a similar vein to the 20-80 scouting scale, and the SP Guide entailed more report cards than I have ever produced in one place. In total, 255 starting pitchers received report cards in the 2015 Guide, which marked a healthy increase over the previous two editions, and all told there were about 1,500 scores handed out for individual subjects. Such a healthy sample allowed me to cover the vast majority of MLB starters and provided an ample data set with which to judge the work.

The 20-80 scouting scale is based on a system in which 50 represents major-league average, with a standard deviation of 10 points, such that a 60 grade represents a skill that is one standard deviation better than average. It's a subjective scale that is better used for communication than statistical analysis, especially given the natural limitations in applying the rules of a normal curve to a non-normal data set, but these special conditions do not prevent us from running some quality assurance to test whether the system is communicating what was intended.
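To make the scale's arithmetic concrete, here is a minimal Python sketch that converts a grade into standard deviations from average and into a percentile. It assumes an idealized normal population with the nominal mean of 50 and SD of 10; the function names are illustrative, and real grades (handed out in discrete steps) will not follow this curve exactly.

```python
from statistics import NormalDist

# Idealized 20-80 scale: mean 50, standard deviation 10 (nominal values, not fitted data).
SCALE = NormalDist(mu=50, sigma=10)

def grade_to_z(grade: float) -> float:
    """Standard deviations above (+) or below (-) major-league average."""
    return (grade - SCALE.mean) / SCALE.stdev

def grade_to_percentile(grade: float) -> float:
    """Share (0-100) of an idealized normal population at or below this grade."""
    return 100 * SCALE.cdf(grade)

for g in (40, 50, 60, 70):
    print(g, grade_to_z(g), round(grade_to_percentile(g), 1))
```

Under these assumptions a 60 grade sits one standard deviation above average, which places it at roughly the 84th percentile.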

Before we get started, allow me to share some quick background info. The dimensions for the grades were covered in these two articles from the winter of 2012, and the analytic conditions have remained mostly the same. The one exception is the grade for release distance, which has since been removed from the report cards despite my continued obsession with the hidden benefits associated with a deeper release. The removal had more to do with the structure of the report cards themselves: release distance did not fit within the structure of four baseline grades (two for stability and two for power) that are joined by the one subject that rules them all, the all-encompassing mark for repetition. In contrast, release distance is vulnerable to a bevy of dependent mechanical variables.

The hope is that this exercise will be useful for multiple purposes, from providing context for the grades in the SP Guide to understanding my report cards from the past three years at Baseball Prospectus, as well as the future evolution of the 20-80 scale for pitching mechanics. Any scouting director can tell you about the habits of his own scouts—the guy who tends to over-grade power, the one who's obsessed with foot speed, or the guy who has yet to meet arm action that he really liked—and in this vein, I would like to reflect on my own performance this offseason to see if my grading system is consistent with the theoretical construct of the 20-80 scale. We all have tendencies, and the numerical representations can act to obscure the subjective evaluation that created the numbers, so I hope to learn something about myself as an evaluator in addition to the application of my grading system.

Readers will have to check out the 2015 Starting Pitcher Guide in order to attach the pitcher to the grades, as this is more of an exercise in theory and application of the Thorburn Report Cards, but hopefully it adds to the utility of these mechanical reports.

__Balance__

__Mean__: 53.9

__SD__: 9.5

It's to be expected for starting pitchers to score well on the balance measure, given the role that stability plays in supporting a pitcher's stamina, whereas the reliever cohort would likely fall short in comparison. That said, the 53.9 average feels a bit high, and yet it falls in line with a general trend that I noticed when grading pitchers over the winter (and during last season), that modern pitchers were emphasizing stability to a greater degree than in the past.

Anecdotal evidence aside, it could be a one-year blip or a reflection of bias from this evaluator, and the fact that there were far more 60 grades in the population than any other mark raises eyebrows. The standard deviation isn't too far from the subjective ideal but the shape of the curve is anything but normal, with a negative skew and a population drift towards the higher ratings. It is definitely something to keep an eye on for this season, and I might have to take a look at how I'm weighting the three planes of balance in coming up with a final tally; X-plane is side-to-side (first to third), Y-plane is up-and-down, and Z-plane is back-and-forth (rubber to plate).
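The shape described above (a mode at 60 with a negative skew) can be checked numerically for any list of grades. The sketch below is a minimal example; the `grades` list is invented purely for illustration and is not the Guide's data, and the skewness formula is the standard population (Fisher-Pearson) version.

```python
from statistics import fmean, multimode, pstdev

def skewness(values):
    """Population skewness: the mean of the cubed z-scores.
    Negative means a longer tail toward low grades."""
    m = fmean(values)
    s = pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

# HYPOTHETICAL grades, made up for this example (not from the SP Guide).
grades = [35, 40, 45, 45, 50, 50, 55, 55, 55, 60, 60, 60, 60, 65, 65]

print("mean:", round(fmean(grades), 1))
print("SD:", round(pstdev(grades), 1))
print("most common grade(s):", multimode(grades))
print("skewness:", round(skewness(grades), 2))
```

A negative result, paired with a most-common grade above the mean, is the numerical signature of a population that drifts toward the higher ratings with a tail of low marks.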

__Momentum__

__Mean__: 51.6

__SD__: 8.1

The momentum population falls closer to the theoretical mean than balance, with an overall shape that adheres more closely to the normal distribution, though the relatively modest standard deviation of 8.1 reflects the bigger issue. A tighter-wound distribution of data reflects the tendency for scores to fall in the middle of the curve, with fewer at the extremes.
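As a rough illustration of what a tighter spread does to the extremes, the sketch below compares the share of grades at 30 or lower and 70 or higher under the nominal SD of 10 against the momentum figures reported here (mean 51.6, SD 8.1). It assumes a continuous normal curve, which is only an approximation for grades handed out in discrete steps, and the cutoffs of 30 and 70 are chosen for illustration.

```python
from statistics import NormalDist

def share_at_extremes(sd, mean=50.0, low=30.0, high=70.0):
    """Fraction of an idealized normal population graded <= low or >= high."""
    dist = NormalDist(mu=mean, sigma=sd)
    return dist.cdf(low) + (1 - dist.cdf(high))

theoretical = share_at_extremes(10.0)          # nominal 20-80 scale
observed = share_at_extremes(8.1, mean=51.6)   # momentum mean and SD from this article

print(f"nominal scale: {theoretical:.1%} of grades at the extremes")
print(f"momentum:      {observed:.1%} of grades at the extremes")
```

Under these assumptions the narrower distribution leaves a noticeably smaller share of the population at the extremes, which is the clustering described above.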

Clustered data is expected for this measure—I know that I tend to drift more towards the middle on momentum grades and only veer outside the gates in extreme cases, because the common viewing angle that is afforded via television feeds is suboptimal for evaluating momentum. Ideally, I would have a side-view from behind the dugouts to make a proper analysis of momentum, but the centerfield camera leaves much to be desired and the non-strategic POV adds a layer of uncertainty to my evaluations that is manifested in more scores toward the middle.

__Torque__

__Mean__: 54.0

__SD__: 10.1

All of the subjects thus far have veered toward means that were above the theoretical average of 50, but torque is the most egregious, and the context of the category calls for potential changes to be made. The mean of 54.0 in the '15 SP Guide is the highest of any of the subjects under the microscope, and though the standard deviation falls right in line with theoretical expectations, the shape of the distribution reflects a prevailing tendency to dole out grades that are average or better.

Adding to the intrigue is that torque is expected to follow the opposite tendency from balance; that is, starting pitchers would be expected to fare worse than relievers. Relief pitchers tend to let 'er rip during their shorter stints, sacrificing stability in the name of power, and such a trend would suggest that the average torque among relievers is even higher. The examples can be seen at the high end of the scale, where there are zero starting pitchers with 80 torque, yet such elite-level separation can be found sporadically in big-league bullpens.

MLB as a whole is trending heavily toward increased velocity, such that the accepted range of what constitutes an “average” fastball has had to be adjusted in scouting circles to reflect the reality of today's gas-pumping population (not that all scouts have made such an adjustment). My personal assessments of torque likely need to be recalibrated accordingly, given the above numbers in addition to the connection between pitch-velocity and torque. The relative grades still hold, as nearly all of the pitchers who experienced a noticeable torque adjustment last season saw a corresponding impact to pitch-speed, but the evidence points to an outdated model of league-average torque and that aspect will be under construction this season.

__Posture__

__Mean__: 51.3

__SD__: 9.9

The grades for posture conform most closely to the expected distribution of the 20-80 scale, with a mean that is a shade above 50—at least part of which can be explained by the tendency for starting pitchers to have better stability—and a standard deviation that is within a hair of ideal. The extra dose of 55-grade posture marks the only blip on the spine-tilt radar.

It makes sense that posture would be the most stable of measures. It involves the fewest variables—it is simply the X-plane displacement of the head relative to the center of mass, taken at release point, aka glove-side lean—and can be easily isolated with dead-center cameras that minimize the parallax effect (Tampa and Minnesota work really well in the AL, St. Louis and Pittsburgh in the NL), allowing for direct comparisons from player to player. These elements make it the easiest grade to evaluate, and though many pitchers are inconsistent with the degree of spine-tilt throughout the season, the relative ease of analysis allows for more reliability when designating grades for posture.

__Repetition__

__Mean__: 51.8

__SD__: 10.2

The repetition grades follow an interesting pattern. The mean and standard deviation give the impression of a very well-calibrated system for repetition, but one statistical irregularity stands out along with one theoretical obstacle. First, the data: there is an odd dip in the frequency of 45-grade repetition, but the data points between 40 and 60 are otherwise pretty flat. This could be an anomaly or part of a trend, and is worth monitoring in the future. Repetition is another grade that would be expected to be higher with starters, given the demands of throwing 100-plus pitches versus just 15-20.

The theoretical obstacle is that repetition ought to be the toughest subject to evaluate. It involves every aspect of the delivery, encompassing all of the baseline grades, and it takes a longer time to analyze properly than any of the other scores. On the bright side, there are indicators of repetition in the form of pitch command (not control) and how far the baseball deviates from its intended location, but even with that caveat, I have to admit surprise that the overall distribution for repetition was so clean.

__Overall Grade__

__Mean__: 2.36 (C+)

__SD__: 0.68

The grades are designed to be centered around a “C” average, but once again the results shade to the high side of the theoretical mean. The letter grades involve a combination of the baseline scores, repetition, and other aspects, so it is the most likely to veer off-course. The extra B- grades likely reflect my optimism that these guys can fix at least some of their woes, as I admittedly bake in an extra dose of subjectivity with the overall grades and optimism is one of those ingredients. I found it interesting that the standard deviation on the overall grades was basically two-thirds of a full letter, centered around the mean of C+, but it makes sense given the distribution of scores.
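For readers who want to translate the numeric figures back into letters, here is a minimal sketch. It assumes the conventional 4.0 grade-point scale with plus and minus marks a third of a letter apart; the article does not spell out its exact mapping, so the table below is an assumption, though it is consistent with 2.36 landing on C+.

```python
# ASSUMPTION: conventional 4.0-scale grade points; the article's own mapping is not stated.
GRADE_POINTS = {
    "A": 4.0, "A-": 3.67, "B+": 3.33, "B": 3.0, "B-": 2.67,
    "C+": 2.33, "C": 2.0, "C-": 1.67, "D+": 1.33, "D": 1.0, "D-": 0.67, "F": 0.0,
}

def nearest_letter(points: float) -> str:
    """Return the letter grade whose grade-point value is closest to `points`."""
    return min(GRADE_POINTS, key=lambda letter: abs(GRADE_POINTS[letter] - points))

mean, sd = 2.36, 0.68  # overall-grade mean and SD reported above
print("mean:", nearest_letter(mean))
print("one SD below:", nearest_letter(mean - sd))
print("one SD above:", nearest_letter(mean + sd))
```

On that assumed scale, one standard deviation on either side of the mean spans roughly C- to B.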

***

__Conclusions__

*The fact that every measure shaded above the ideal 50 is likely a reflection of the data set (the worst pitchers were culled, only elite prospects included) as well as personal tendencies to score things on the high side.

*The momentum grades have low variance, due to the analytic restrictions imposed by the viewing angles.

*The grades for posture are very reliable and they adhere most closely to the theoretical construct of the 20-80 scale.

*Repetition grades followed a surprisingly “normal” pattern, and the multitude of interacting variables encourages further investigation.

*The balance grades were a bit high, even when considering the advantages that starters enjoy in this subject. Balance is one of the toughest aspects to measure, given the three planes of potential movement and the dynamic nature of the measurement, such that balance is assessed throughout the delivery. Balance will be under the microscope during the season to see if some improvements can be made to the methodology.

*The torque grades are very high, especially when considering that relievers would be expected to perform better. Given the rapidly evolving environment of velocity in today's game, I will likely need to recalibrate my measurement system in order to be more in line with the times.

*The overall grades turned out slightly better than intended—I guess I'm an easy grader, and hopefully the students will appreciate my slight departure from the standard bell curve.
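One rough way to gauge how far each subject's mean sits from the theoretical 50 is to express the gap in standard errors. The sketch below uses the means and SDs reported in this article and assumes that all 255 pitchers received a grade in every subject, which is not confirmed here. The graded pitchers are not a random sample, so treat the output as a descriptive yardstick rather than a formal significance test.

```python
from math import sqrt

N = 255                   # ASSUMPTION: one grade per pitcher in every subject
THEORETICAL_MEAN = 50.0   # nominal center of the 20-80 scale

# (mean, SD) for each subject, as reported in this article.
SUBJECTS = {
    "balance": (53.9, 9.5),
    "momentum": (51.6, 8.1),
    "torque": (54.0, 10.1),
    "posture": (51.3, 9.9),
    "repetition": (51.8, 10.2),
}

def z_vs_theoretical(mean: float, sd: float, n: int = N) -> float:
    """Gap between the observed mean and 50, measured in standard errors."""
    standard_error = sd / sqrt(n)
    return (mean - THEORETICAL_MEAN) / standard_error

# Print subjects from the largest gap to the smallest.
ranked = sorted(SUBJECTS, key=lambda name: z_vs_theoretical(*SUBJECTS[name]), reverse=True)
for name in ranked:
    print(f"{name:<11} {z_vs_theoretical(*SUBJECTS[name]):.1f} standard errors above 50")
```

With these inputs, posture shows the smallest gap, while balance and torque show the largest, in keeping with the conclusions above.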

"It's a subjective scale that is better used for communication than statistical analysis, especially given the natural limitations in applying the rules of a normal curve to a non-normal data set, but these special conditions do not prevent us from running some quality assurance to test whether the system is communicating what was intended."

The fact that this is a subjective scale that is centered on the statistical premise of the normal curve means that each evaluator has a personal system that more or less conforms to that shape, revealing the extent to which each grade is truly "average" in his/her mind (as well as the standard thresholds for 40, 60, 70, etc). My goal here was to better understand the extent to which I am communicating the grades as intended.

Please keep in mind that there is a statistical basis for this exercise, that being the data from high-speed motion capture that I collected for five years at the National Pitching Association. The majority of those measurements did meet the requirements for an approximately normal distribution, so it is notable (to me at least) when the MLB population appears to have drifted, shifting the population mean (e.g., in the case of torque).