This is the second in a series on how to use statistics to make better decisions. The first one described how to use stats for diagnostic purposes.
This article focuses on the types of data collected and how to measure the value of statistical analysis.
Objective and subjective data
There are lots of kinds of data. Collecting some statistics sounds easy – just count whatever you want to measure. Yet many data collections require judgment on the part of the analyst. For example, how does one decide that a pass was a drop as opposed to an incomplete pass that was uncatchable? Even statistics that are simply counted have judgment built in. One can count the number of penalties, but the penalty calls themselves have judgment built in.
Similarly, we remember multiple choice tests as objective and essays as subjective. Yet do you remember questions that had two correct answers or none? Those can be pretty subjective in practice. OTOH, the professor might have very specific criteria in mind for the essays. Were the essays grammatically correct without typographical errors? Did they actually cover a specific number of required areas in the answer? Were the points made well with appropriate logic? Grading essays can be very objective.
Qualitative versus quantitative
Qualitative data is routinely researched and used for analysis. Entire areas of marketing are based on qualitative data; polling, for example, is based on qualitative data. Polls often use Likert scales – typically very good, good, average, bad, and very bad.
Polling is based on qualitative data, yet it is pretty quantitative in nature. The tallied responses are equivalent to quantitative data. But if you ask 1,000 people a question, you can get close to 1,000 ideas of what the question is really asking. Trying to minimize this effect is what makes creating polls a hard business.
After all, how much better is very good as opposed to just good? At what point does the person answering the poll draw the line? Some people will have a strict interpretation and some a loose interpretation when answering a poll. One can count up the responses, but the responses themselves can be more subjective than the question of whether a pass was a drop or uncatchable.
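To make the counting concrete, here is a minimal sketch in Python with made-up responses. Note that mapping labels to numbers makes the data look quantitative, but the numbers inherit whatever line each respondent drew between "very good" and "good".

```python
from collections import Counter

# Hypothetical poll: 10 respondents rate something on a 5-point Likert scale.
responses = ["very good", "good", "good", "average", "good",
             "very good", "bad", "average", "good", "very bad"]

tally = Counter(responses)
print(tally["good"])  # 4

# Converting labels to numbers does not remove the subjectivity baked in.
scale = {"very bad": 1, "bad": 2, "average": 3, "good": 4, "very good": 5}
mean_score = sum(scale[r] for r in responses) / len(responses)
print(mean_score)  # 3.5
```

The tally is objective once the answers exist; the subjectivity lives in how each person chose an answer in the first place.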
Yet with a small number of raters doing qualitative analysis, as at PFF, using a defined methodology will decrease the variance among their responses. In fact, experience will make the process better as the raters find areas to improve.
Bill Parcells, not known as a stat guy, used to say to wait until the 4th game or so before making any kind of judgment. He made that observation based on his experience.
A sample with a limited number of entries is subject to extreme values, or outliers. If a player had a great first game, that may not be representative of how he does overall. We laugh at saying a guy who gets 2 sacks in the first game translates to 32 per year at that rate. Yet at about 60 plays per game, 4 games yields roughly 240 plays, and that ought to be sufficiently large to overcome small sample bias.
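The arithmetic behind the joke can be sketched quickly. The game-by-game numbers below are hypothetical, and a 16-game season is assumed to match the 32-sack figure.

```python
GAMES_PER_SEASON = 16  # assumed: the 2-sacks-becomes-32 joke implies 16 games

sacks_per_game = [2, 0, 1, 0]  # hypothetical first four games

# The absurd single-game extrapolation the article laughs at:
one_game_pace = sacks_per_game[0] * GAMES_PER_SEASON
print(one_game_pace)  # 32

# Averaging four games (roughly 240 plays) tempers the outlier:
four_game_pace = sum(sacks_per_game) / len(sacks_per_game) * GAMES_PER_SEASON
print(four_game_pace)  # 12.0
```

Four games is still a small sample, but the extra plays pull the estimate back toward something defensible.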
Factors in measuring statistical analysis
One measures statistical analysis in two ways – reliability and validity.
Reliability is the term for consistency. This is precision, not necessarily accuracy. The classic example is a tape measure. It may be off. If you measure your arm, it may show 34.2 inches when it really is 34.3 inches. Yet the key is whether you get the same results again and again under all conditions. If the tape were made of elastic and you got 34.1 one day and 33.9 the next day, it would be unreliable.
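The tape-measure example can be sketched in a few lines: the spread of repeated readings (here, the population standard deviation of hypothetical measurements) is a simple stand-in for unreliability.

```python
import statistics

# Two hypothetical tapes, each reading the same 34.3-inch arm five times.
steel_tape   = [34.2, 34.2, 34.2, 34.2, 34.2]  # off by 0.1 every time: precise, not accurate
elastic_tape = [34.1, 33.9, 34.4, 33.8, 34.3]  # stretches: inconsistent, hence unreliable

print(statistics.pstdev(steel_tape))    # 0.0 -- perfectly reliable (even though wrong)
print(statistics.pstdev(elastic_tape))  # noticeably above zero -- unreliable
```

Note that the steel tape scores as perfectly reliable even though every reading is wrong, which is exactly the precision-versus-accuracy distinction.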
Eyewitness accounts are notably unreliable due to:
a. the ability of the witness to see, such as in conditions of darkness or poor weather,
b. the distance the witness is from the scene,
c. the condition of the witness, such as being tired, ill, under undue stress, or under the influence of substances,
d. the skill, or lack thereof, of the witness, and
e. the bias or preconceived notions of the witness.
We see this in every courtroom on TV and in real life. The impeachment of a witness is a classic plot twist of many law/court shows. Two different witnesses often differ in both minor and major details. In fact, when all the witnesses are too close in all the details, it is a tipoff that they have colluded.
The most notable reason for overturning death penalty cases is mistaken identity. DNA evidence has exonerated many people convicted of murder who were falsely accused.
Are some observers better than others? Sure. The term of art is inter-rater reliability. Qualitative scoring is an issue, and some judgment is required. Yet if one has grading guidance and follows the methodology, reliability increases.
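A crude first cut at inter-rater reliability is simple percent agreement. The plays and grades below are hypothetical; more rigorous measures, such as Cohen's kappa, also correct for agreement that would happen by chance.

```python
# Two hypothetical raters grading the same ten pass plays on a shared rubric.
rater_a = ["drop", "catch", "catch", "uncatchable", "catch",
           "drop", "catch", "uncatchable", "catch", "drop"]
rater_b = ["drop", "catch", "drop", "uncatchable", "catch",
           "drop", "catch", "catch", "catch", "drop"]

agreements = sum(a == b for a, b in zip(rater_a, rater_b))
percent_agreement = 100 * agreements / len(rater_a)
print(percent_agreement)  # 80.0
```

A shared rubric pushes that number up; untrained observers eyeballing the same plays would land much lower.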
Politics is a major issue in Olympic judging. Note that the Olympics take the best and worst scores out of the equation and average the rest. The methodology limits, but does not eliminate, bad scoring.
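The trimming can be sketched in a few lines of Python; the judges' scores here are hypothetical, with one judge scoring politically low.

```python
def olympic_average(scores):
    """Drop the single best and worst scores, then average the rest."""
    if len(scores) < 3:
        raise ValueError("need at least three judges")
    trimmed = sorted(scores)[1:-1]
    return sum(trimmed) / len(trimmed)

judges = [9.5, 9.6, 9.4, 9.5, 7.0]  # the 7.0 is the biased outlier
print(olympic_average(judges))  # about 9.47, versus a plain mean of 9.0
```

One rogue judge drags a plain average from the mid-9s down to 9.0; trimming restores it. Two colluding judges at the same end, though, would still move the result, which is why trimming limits rather than eliminates bad scoring.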
Further, if you record the scores in a standard measure and publish those records routinely, reliability increases. For example, look at how many folks have an impression of Costa based solely, or primarily, on the earliest games of the year. His scores reflect a bad first third of last season. Yet if one looks at the scores over the entire season, trends emerge that an unbiased observer should note.
The same occurred with other players. The key is having a logical explanation for changes.
Bernie was injured and had surgery in the off-season. His opening games were bad. Yet by the third game he had improved significantly, and for the next several games he was playing at a HIGH LEVEL until he moved to center.
He was also bad at center, as shown by his scores. Yet that was the first time he had played center in an NFL game. Can he do better at center? Maybe. After all, Tyron was also terrible by his scores in his first games at LT but improved significantly.
One explanation for why the tackles did so poorly in the first three games is the change at center. The timing was off, yet as the team played together, the communication and adaptation improved. The rate of penalties decreased significantly.
Validity is actually measuring what you say you are measuring. I give the example of a driver's test. One actually drives a vehicle to test one's skill. Note how every 16-year-old has stories of some tester who made the driver do unnatural acts, and it was not fair. If true, that test is INVALID.
Yet it is more likely that the tester has a standardized course with specific things that are looked at, such as parallel parking within 10 inches of the curb without hitting the pylons. To that extent, the test is VALID. One has to see well to drive, so one tests one's vision. That is valid.
One also notes the color of one's eyes for ID purposes. Yet while we can measure the color of the eyes, that is invalid for licensing purposes. Validity is hard to measure. The standard way is to correlate it with other accepted results. PFF scores and rankings tend to agree with controversial actions of the team – RAT, Costa, and keeping Spencer among them – and to that extent, they seem valid.
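Checking a grading system against accepted results usually comes down to a correlation. The scores below are hypothetical, and Pearson's r is written out by hand to keep the sketch self-contained.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: +1 means the two sets of scores move in lockstep."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical: a new grading system's scores versus an accepted benchmark
# for the same eight players.
new_scores = [88, 75, 92, 60, 70, 85, 55, 95]
benchmark  = [85, 78, 90, 62, 68, 88, 58, 93]

r = pearson_r(new_scores, benchmark)
print(round(r, 2))  # close to 1.0: the new scores track the benchmark
```

A high r against results people already trust is evidence, not proof, of validity; a new system that disagreed with everything accepted would need a very good explanation.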