
Talking the Statistical Talk with Those Who Don’t Walk the Statistical Walk


When you work in data analysis, you quickly discover an irrefutable fact: a lot of people just can't stand statistics. Some people fear the math, some fear what the data might reveal, some people find it deadly dull, and others think it's bunk. Many don't even really know why they hate statistics—they just do. Always have, probably always will. 

Problem is, that means we who analyze data need to communicate our results to people who aren't merely inexperienced in statistics, but may have an actively antagonistic relationship with it. 

Given that situation, which side do you think needs to make concessions? 

The Difference Between Brilliance and Bull

When I was a kid, I used to see this T-shirt in the Spencer Gifts store in the Monroeville Mall: "If you can't dazzle them with brilliance, baffle them with bull---." I thought it was an amusing little saying then, but as I've grown older I've realized an underlying truth about it:

In terms of substance, brilliance and bull often are identical.

Whether what you're saying is viewed as one or the other depends on how well you're getting your ideas across. Given the ubiquity of the "Lies, damned lies, and statistics" quip, it would seem that most statistically-minded people aren't doing that well. We so often forget that most of the people we need to reach just don't get statistics. But when we do that, we're putting another layer on the Tower of Babel. 

In sharing the results of an analysis with people who aren’t statistics-savvy, we have two obligations. First, to make a concerted effort to convey clearly the results of our analysis, and its strengths and limitations. Most of us do this to some degree. But second, we should take every opportunity we can to demystify and humanize statistics, to help people appreciate not just the complexity but also the art that goes into analysis. To promote statistical literacy. I think most of us can do better in this regard.

Opening the Black Box

There is an impression among those not well versed in statistical methods that the discipline is something of a black box: statisticians know the magic buttons that will transform a spreadsheet full of data into something meaningful. 

A good statistician knows the formulas and methods inside out, and very smart ones expand the discipline with new techniques and applications. But an effective statistician is sensitive to the relationship between the language of statistics and the language the audience speaks, and able to bridge that gap.    

Statisticians who are trying to communicate about their work with the uninitiated are like ambassadors: they need to be completely cognizant of local knowledge, customs, and beliefs, and present their message in a way that will be understood by the recipients. 

In other words, unless we're speaking to a room full of other statisticians, we should stop talking like statisticians. 

What We Mean Is Not Necessarily What We Say

The language of statistics can seem particularly impenetrable and abstruse. That's hard to deny, given that the method we use to compare means is called “Analysis of Variance.” And when it comes to distributions, right-skewed data are clustered on the left side of a bar graph and left-skewed data are clustered on the right. Not exactly intuitive. That's why the Assistant in Minitab 17 uses plain language in its output and dialog boxes, and avoids confusing statistical jargon.

Indeed, some statistical language can seem like outright obfuscation, like the notion that a statistical test "failed to reject the null hypothesis." From an editorial viewpoint, "failing to reject the null hypothesis" would seem a needlessly roundabout equivalent of the word accept.

Of course from a statistical perspective, replacing "failure to reject" with "accept" would be very wrong. So we’re left with a phrase that’s precise, correct, and also confusing. It takes only seconds to compare "failing to reject the null" to a jury saying "not guilty." When evidence against the accused isn’t convincing, that doesn’t prove innocence. But how often is "failure to reject the null" presented to lay audiences with no explanation?

Another difficulty with statistical language, ironically, is that it includes so many common words. Unfortunately, their meanings in statistics are not the same as their common connotations, so when we use them in a statistical context, we often connote unintended ideas. Consider just a few of the terms that mean one thing to statisticians, and quite another to everyone else.

  • Significant. For most, this word equates to "important." Statisticians know that significant things may have no importance at all.
  • Normal. People take this to mean that something is ordinary or commonplace, not that it follows a Gaussian distribution.
  • Regression. To "regress" is to shrink or move backwards. Most people won’t relate that idea to estimating an output variable based on its inputs.
  • Average. People hear this not as a mathematical value but as a qualitative judgment, meaning common or fair. 
  • Error. Statisticians mean the measure of an estimate's precision, but people hear "mistake." 
  • Bias. For statisticians, it doesn't mean attitudinal prejudice, but rather a systematic difference between a measurement (such as a gauge reading) and a reference value.
  • Residual. People think residuals are leftovers, not the differences between observed and fitted values.
  • Power. A statistical test can be powerful without being influential. Seems like a contradiction, unless you know it refers to the probability of finding a significant (there we go again…) effect when it truly exists.   
  • Interaction. An act of communication for most, rather than the effect of one factor depending on the level of another.
  • Confidence. This word carries an emotional charge, and can leave a non-statistical audience thinking statistical confidence means the researchers really believe in their results. 

And the list goes on...statistical terms like sample, assumptions, stability, capability, success, failure, risk, representative, and uncertainty can all mean different things to the world outside the statistical circle.

Statisticians frequently lament the common misunderstandings and lack of statistical awareness among the population at large, but we are responsible for making our points clear and complete when we reach out to non-statistical audiences—and almost any audience we reach will have a non-statistical contingent. 

Making an effort to help the people we communicate with appreciate the technical meanings of these terms as we use them is an easy way to begin promoting higher levels of statistical literacy.


The Minitab Blog Quiz: Test Your Stat-Smarts!


How deeply has statistical content from Minitab blog posts (or other sources) seeped into your brain tissue? Rather than submit a biopsy specimen from your temporal lobe for analysis, take this short quiz to find out. Each question may have more than one correct answer. Good luck!

  1. Which of the following are famous figure skating pairs, and which are methods for testing whether your data follow a normal distribution?

    a. Belousova-Protopopov
    b. Anderson-Darling
    c. Kolmogorov-Smirnov
    d. Shen-Zhao
    e. Shapiro-Wilk
    f. Salé-Pelletier
    g. Ryan-Joiner

      Figure skaters are a, d, and f. Methods for testing normality are b, c, e, and g. To learn about the different methods for testing normality in Minitab, click here.

     
  2. A t-value is so-named because...

    a. Its value lies midway between the standard deviation(s) and the u-value coefficient (u).
    b. It was first calculated in Fisher’s famous “Lady Tasting Tea” experiment.
    c. It comes from a t-distribution.
    d. It’s the first letter of the last name of the statistician who first defined it.
    e. It was originally estimated by reading tea leaves.

      The correct answer is c. To find out what the t-value means, read this post.


     

  3. How do you pronounce µ, the mean of the population, in English?

    a. The way a cow sounds
    b. The way a kitten sounds
    c. The way a chicken sounds
    d. The way a sheep sounds
    e. The way a bullfrog sounds

      The correct answer is b. For the English pronunciation of µ and, more importantly, to understand how the population mean differs from the sample mean, read this post.

  4. What does it mean when we say a statistical test is “robust” to the assumption of normality?

    a. The test strongly depends on having data that follow normal distribution.
    b. The test can perform well even when the data do not strictly follow a normal distribution.
    c. The test cannot be used with data that follow a normal distribution.
    d. The test will never produce normal results.

      The correct answer is b. To find out which commonly used statistical tests are robust to the assumption of normality, see this post.

     
  5. A Multi-Vari chart is used to...

    a. Study patterns of variation from many possible causes.
    b. Display positional or cyclical variations in processes.
    c. Study variations within a subgroup, and between subgroups.
    d. Obtain an overall view of the factor effects.
    e. All of the above.
    f. Ha! There’s no such thing as a “Multi-Vari chart!”

      The correct answer is e (or, equivalently, a, b, c, and d). To learn how you can use a Multi-Vari chart, see this post.

     
  6. How can you identify a discrete distribution?

    a. Determine whether the probabilities of all outcomes sum to 1.
    b. Perform the Kelly-Banga Discreteness Test.
    c. Assess the kurtosis value for the distribution.
    d. You can’t—that’s why it’s discrete.


      The correct answer is a. To learn how to identify and use discrete distributions, see this post. For a general description of different data types, click here. If you incorrectly answered c, see this post.


     

  7. Which of these events can be modeled by a Poisson process?

    a. Getting pooped on by a bird
    b. Dying from a horse kick while serving in the Prussian army
    c. Tracking the location of an escaped zombie
    d. Blinks of a human eye over a 24-hour period
    e. None of the above.

      The correct answers are a, b, and c. To understand how the Poisson process is used to model rare events, see the following posts on Poisson and bird pooping, Poisson and escaped zombies, and Poisson and horse kicks.

     
  8. Why should you examine a Residuals vs. Order Plot when you perform a regression analysis?

    a. To identify non-random error, such as a time effect.
    b. To verify that the order of the residuals matches the order of data in the worksheet.
    c. Because a grumpy, finicky statistician said you have to.
    d. To verify that the residuals have constant variance.

     

    The correct answer is a. For examples of how to interpret the Residuals vs Order plot in regression, see the following posts on snakes and alcohol, independence of the residuals, and residuals in DOE.


     

  9. The Central Limit Theorem says that...

    a. If you take a large number of independent, random samples from a population, the distribution of the samples approaches a normal distribution.
    b. If you take a large number of independent, random samples from a population, the sample means will fall between well-defined confidence limits.
    c. If you take a large number of independent, random samples from a population, the distribution of the sample means approaches a normal distribution.
    d. If you take a large number of independent, random samples from a population, you must put them back immediately.
     

    The correct answer is c, although it is frequently misinterpreted as a. To better understand the central limit theorem, see this brief, introductory post on how it works, or this post that explains it with bunnies and dragons.
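    If you'd like to see the distinction between answers a and c for yourself, here is a minimal simulation sketch in Python (using NumPy and SciPy rather than Minitab; the exponential population and sample size are arbitrary choices for illustration):

```python
# Raw draws from a skewed population stay skewed, but the sample means of
# repeated samples are approximately normal, which is what answer c says.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
population = stats.expon(scale=2)                    # a clearly non-normal population

raw = population.rvs(size=100_000, random_state=rng)
means = population.rvs(size=(10_000, 50), random_state=rng).mean(axis=1)

print(stats.skew(raw))    # about 2: the raw data are strongly right-skewed
print(stats.skew(means))  # much closer to 0: the sample means look roughly normal
```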

     


  10. You notice an extreme outlier in your data. What do you do?

    a. Scream. Then try to hit it with a broom.
    b. Highlight the row in the worksheet and press [Delete]
    c. Multiply the outlier by e-1
    d. Try to figure out what’s going on
    e. Change the value to the sample mean
    f.  Nothing. You’ve got bigger problems in life.
     

    The correct answer is d. Unfortunately, a, b, and f are common responses in practice. To see how to use brushing in Minitab graphs to investigate outliers, see this post. To see how to handle extreme outliers in a capability analysis, click here. To read about when it is and isn't appropriate to delete data values, see this post. To see what it feels like, statistically and personally, to be an outlier, click here.


     

  11. Which of the following are true statements about the Box-Cox transformation?

    a. The Box-Cox transformation can be used with regression analysis.
    b. You can only use the Box-Cox transformation with positive data.
    c. The Box-Cox transformation is not as powerful as the Johnson transformation.
    d. The Box-Cox transformation transforms data into 3-dimensional cube space.
      a, b, and c are true statements. To see how the Box-Cox transformation uses a logarithmic function to transform non-normal data, see this post. For an example of how to use the Box-Cox transformation when performing a regression analysis, see this post. For a comparison of the Box-Cox and Johnson transformations, see this post.


     

  12. When would you use a paired t-test instead of a 2-sample t-test?

    a. When you don’t get significant results using a 2-sample t test.
    b. When you have dependent pairs of observations.
    c. When you want to compare data in adjacent columns of the worksheet.
    d. When you want to analyze the courtship behavior of exotic animals.
     

    The correct answer is b. For an explanation of the difference between a paired t test and a 2-sample t-test, click here.


     

  13. Which of these are common pitfalls to avoid when interpreting regression results?

    a. Extrapolating predictions beyond the range of values in the sample data.
    b. Confusing correlation with causation.
    c. Using uncooked spaghetti to model linear trends.
    d. Adding too much jitter to points on the scatterplot.
    e. Assuming the R-squared value must always be high.
    f. Treating the residuals as model errors.
    g. Holding the graph upside-down.
     

    The correct answers are a, b, and e. To see an amusing example of extrapolating beyond the range of sample data values, click here. To understand why correlation doesn't imply causation, see this post. For another example, using NFL data, click here, and for yet another, using NBA data, click here. To understand what R-squared is, see this post. To learn why a high R-squared is not always good, and a low R-squared is not always bad, see this post.


     

  14. Which of the following are terms associated with DOE (design of experiments), and which are terms associated with a BUCK? 

    a. Center point
    b. Crown tine
    c. Main effect
    d. Corner point
    e. Pedicle
    f.  Split plot
    g. Block
    h. Burr
    i. Main beam
    j. Run
     

    The design of experiments (DOE) terms are a, c, d, f, g, and j. The parts of a buck's antlers are b, e, h, and i. The Minitab blog contains many great posts on DOE, including several step-by-step examples that provide a clear, easy-to-understand synopsis of the process to follow when you create and analyze a designed experiment in Minitab. Click here to see a complete compilation of these DOE posts.


     

  15. Which of these are frequently cited as common statistical errors?

    a. Assuming that a small amount of random error is OK.
    b. Assuming that you've proven the null hypothesis when the p-value is greater than 0.05.
    c. Assuming that correlation implies causation.
    d. Assuming that statistical significance implies practical significance.
    e. Assuming that inferential statistics is a method of estimation.
    f. Assuming that statisticians are always right.
     

    The correct answers are b, c, and d. To see common statistical mistakes you should avoid, click here. And here.
  Looking for more information? Try the online Minitab Topic Library

For more information on the concepts covered in this quiz—as well as many other statistical concepts—check out the Minitab Topic Library.

On the Topic Library Overview page, click Menu to access the topic of your choice.
For example, for more information on interpreting residual plots in regression analysis, click Modeling Statistics > Regression and correlation > Residuals and residual plots.

How to Think Outside the Boxplot


There's nothing like a boxplot, aka box-and-whisker diagram, to get a quick snapshot of the distribution of your data. With a single glance, you can readily intuit its general shape, central tendency, and variability.

boxplot diagram

To easily compare the distribution of data between groups, display boxplots for the groups side by side. Visually compare the central value and spread of the distribution for each group and determine whether the data for each group are symmetric about the center. If you hold your pointer over a plot, Minitab displays the quartile values and other summary statistics for each group.

boxplot hover

The "stretch" of the box and whiskers in different directions can help you assess the symmetry of your data.

skewed boxplots

Sweet, isn't it?  This simple and elegant graphical display is just one of the many wonderful statistical contributions of John Tukey. But, like any graph, the boxplot has both strengths and limitations. Here are a few things to consider.

Be Wary of Sample Size Effects 

Consider the boxplots shown below for two groups of data, S4 and L4.

boxplot two groups

Eyeballing these plots, you couldn't be blamed for thinking that L4 has much greater variability than S4.

But guess what? Both data sets were generated by randomly sampling from a normal distribution with mean of 4 and a standard deviation of 1. That is, the data for both plots come from the same population.

Why the difference? The sample for L4 contains 100 data points. The sample for S4 contains only 4 data points. The small sample size shrinks the whiskers and gives the boxplot the illusion of decreased variability. In this way, if group sizes vary considerably, side-by-side boxplots can be easily misinterpreted.
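If you want to reproduce this demonstration outside Minitab, here is a rough sketch in Python (the seed and sample sizes are just illustrative):

```python
# Both groups come from the same Normal(4, 1) population, but the n = 4 sample
# can easily produce a much "tighter-looking" boxplot than the n = 100 sample.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
s4 = rng.normal(loc=4, scale=1, size=4)      # tiny sample
l4 = rng.normal(loc=4, scale=1, size=100)    # larger sample

plt.boxplot([s4, l4])
plt.xticks([1, 2], ["S4 (n = 4)", "L4 (n = 100)"])
plt.title("Same population, different sample sizes")
plt.show()
```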

How to See Sample Size Effects

Luckily, you can easily change the settings for a boxplot in Minitab to visually capture sample-size effects. Right-click the box and choose Edit Interquartile Range Box. Then click the Options tab and check the option to show the box widths proportional to the sample size.

options

Do that, and the side-by-side boxplots will clearly reflect sample size differences.

width proportional

Yes that looks weird. But it should look weird! For the sake of illustration, we're comparing a sample of 4 to a sample of 100, which is a weird thing to do.

In practice, you'd be likely to see less drastic—though not necessarily less important—differences in the box widths when groups are different sizes. The following side-by-side boxplots show groups with sample sizes that range from 25 to 100 observations.

side by side plot

Thinner boxes (Group F) indicate smaller samples and "thinner" evidence. Heftier boxes (Group A) indicate larger samples and more ample evidence. The group comparisons are less misleading now because the viewer can clearly see that sample sizes for the groups differ.

Small Samples Can Make Quartiles Meaningless

Another issue with using a boxplot with small samples is that the calculated quartiles can become meaningless. For example, if you have only 4 or 5 data values, it makes no sense to display an interquartile range that shows the "middle 50%" of your data, right?

Minitab display options for the boxplot can help illustrate the problem. Once again, consider the example with the groups S4 (N = 4) and L4 (N = 100), which were both sampled from a normal population with mean of 4 and standard deviation of 1.

boxplot two groups

To visualize the precision of the estimate of the median (the center line of the box), select the boxplots, then choose Editor > Add > Data Display. You'll see a list of items that you can add to the plot. Select the option to display a confidence interval for the median on the plot.

median ci option

Here's the result:

boxplot median ci

First look at the boxplot for L4 on the right. A small box is added to the plot inside the interquartile range box to show the 95% confidence interval for the median. For L4, the 95% confidence interval for the median is approximately (3.96, 4.35), which seems a fairly precise estimate for these data.

S4, on the left, is another story. The 95% confidence interval (3.65, 5.19) for the median is so wide that it completely obscures the whiskers on the plot. The boxplot looks like some kind of clunky, decapitated Transformer. That's what happens when the confidence interval for the median is larger than the interquartile range of the data. If your plot looks like that when you display the confidence interval for the median, it's a sign that your sample is probably too small to obtain meaningful quartile estimates.
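To see why tiny samples blow up the median interval, here is a rough sketch using the common "notch" approximation, median ± 1.57 × IQR / √n. Minitab's exact interval method may differ, but the √n in the denominator tells the same story:

```python
# The half-width of the approximate median interval shrinks with sqrt(n),
# so an n = 4 group gets a huge interval relative to an n = 100 group.
import numpy as np

def approx_median_ci(data):
    data = np.asarray(data)
    q1, median, q3 = np.percentile(data, [25, 50, 75])
    half_width = 1.57 * (q3 - q1) / np.sqrt(len(data))
    return median - half_width, median + half_width

rng = np.random.default_rng(0)
print(approx_median_ci(rng.normal(4, 1, size=4)))     # very wide interval
print(approx_median_ci(rng.normal(4, 1, size=100)))   # much narrower interval
```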

Case in Point: Boxplots and Politics

Like Ginger Rogers, I'm kind of writing this post backwards—although not in high heels. What got me thinking about these issues with the boxplot was a comment from a reader who suggested that my choice of a time series plot to represent the U.S. deficit data was politically biased. Here's the time series plot:

time series plot

Even though I deliberately refrained from interpreting this graph from a political standpoint (given the toxic political climate on the Internet, I didn't want to go there!), the reader felt that by choosing a time series plot for these data, I was attempting to cast Democratic administrations in a more favorable light. The reader asked me to instead consider side-by-side boxplots of the same data:

boxplot deficit

I appreciated the reader's suggestion in a general sense. After all, it's always a sound strategy to examine your data using a variety of graphical analyses.

But not every graph is appropriate for every set of data. And for these data, I'd argue that boxplots are not the best choice, regardless of whether you're a member of the Democratic, Republican, Objectivist, or Rent Is Too Damn High party.

For one thing, the sample sizes for each boxplot are much too small (between 4 and 8 data points, mostly), raising the issues previously discussed. But something else is amiss...

Context is everything...especially in statistics

In most cases, such as with typical process data, longer boxes and whiskers indicate greater variability, which is usually a "bad" thing. So when you eyeball the boxplots of %GDP deficits quickly, your eye is drawn to the longer boxes, such as the plot for the Truman administration. The implication is that the deficits were "bad" for those administrations.

But is variability a bad thing with a deficit? If a president inherits a huge deficit and quickly turns it into a huge surplus, that creates a great amount of variability—but it's good variability.

You could argue that the relative location of the center line (median) of the side-by-side plots provides a useful means of comparing "average" deficits for each administration. But really, with so few data values, the median value of each administration is just as easy to see in the time series plot. And the time series plot offers additional insight into overall  trends and individual values for each year.

Look what happens when you graph the same data values, but in a different time order, using time series plots and boxplots.

time series plot trends

boxplot of trends

Using a boxplot for these trend data is like putting on a blindfold. You want to choose a graphical display that illuminates information about your data rather than obscuring it.

In conclusion, a monkey wrench is a wonderful tool. Unless you try to use it as a can opener. Graphs are kind of like that, too.

Weight for It! A Healthy Application of the Central Limit Theorem


Like so many of us, I try to stay healthy by watching my weight. I thought it might be interesting to apply some statistical thinking to the idea of maintaining a healthy weight, and the central limit theorem could provide some particularly useful insights. I’ll start by making some simple (maybe even simplistic) assumptions about calorie intake and expenditure, and see where those lead. And then we can take a closer look at these assumptions to try to get a little closer to reality.

I should hasten to add that I’m not a dietitian—or any kind of health professional, for that matter. So take this discussion as an example of statistical thinking rather than a prescription for healthy living.

Some Basic Assumptions

Wearable fitness trackers like a FitBit or pedometer can give us data about the calories we burn, while a food journal or similar tool helps monitor how many calories we take in. The key assumption I am going to make for this discussion is that the calories I take in are roughly in balance with the calories I burn. Not that they balance exactly every day, but on average they tend to be in balance. Applying statistical thinking, I am going to assume my daily calorie balance is a random variable X with mean 0, which corresponds to perfect balance.

On days when I consume more calories than I burn, X is positive. On days when I burn more calories than I consume, X is negative. On a day when a coworker brings in doughnuts, X might be positive. On a day when I take a walk after dinner instead of watching TV, X might be negative. I will assume that when X is positive, the extra calories are stored as fat. On days when X is negative, I burn up stored fat to fuel my extra activity. I will assume each pound of body fat is the accumulation of 3500 extra calories.

The variation in X is represented by the variance, which is the mean squared deviation of X from its mean. The standard deviation is the square root of the variance. I will assume a standard deviation of 200 calories.

Each day there’s a new realization of X. If I assume each day’s X value is independent of that from the day before, then it’s like taking a random sample over time from the distribution of X. The central limit theorem assumes independence, so I’ll at least start off with that assumption. Later I’ll revisit my assumptions.

Based on all these assumptions, if I add up all the X’s over the next year (X1 + X2 + … + X365) that will tell me how much weight I will gain or lose. If the sum is positive, I gain at the rate of one pound for every 3500 calories. If the sum is negative, I lose at the same rate. So let’s apply some statistical theory to see what we can say about this sum.

First, the mean of the sum will be the sum of the means. That’s why I wanted to assume that my daily calorie balance has a mean of 0. Add up 365 zeroes, and you still have zero. Just like my daily calorie balance is a random variable with mean zero, so is my yearly calorie balance. So far so good!

Accounting for Variation

Next consider the variability. Variances also add: with the assumption of independence, the variance of the sum is the sum of the variances. I assumed a daily standard deviation of 200 calories; squaring that gives a daily variance of 40,000 calories squared. It’s weird to talk about squared calories, which is why I prefer to talk about the standard deviation, which is in units of calories. But standard deviations don’t sum nicely the way variances do. My yearly calorie balance will have a variance of 365 × 40,000 calories squared.

The standard deviation is the square root of this, or 200 times the square root of 365. The square root of 365 is about 19.1, so the standard deviation of my yearly calorie balance is about 19.1*200 = 3820. Is that good? Is that bad? Not sure, but this quantifies the intuitive but vague idea that my weight varies more from year to year than it does from day to day.
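Here is that arithmetic as a short sketch in plain Python (the 200-calorie daily standard deviation and the 3,500-calories-per-pound conversion are the assumptions stated above):

```python
import math

daily_sd = 200                        # assumed standard deviation of daily calorie balance
yearly_variance = 365 * daily_sd**2   # variances add under independence
yearly_sd = math.sqrt(yearly_variance)

print(yearly_sd)          # about 3,820 calories (200 * sqrt(365))
print(yearly_sd / 3500)   # about 1.09 pounds
```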

Enter the Central Limit Theorem

Now let’s bring the central limit theorem into the discussion. What can it add to what we have found already? The central limit theorem is about the distribution of the average of a large number of independent identically distributed random variables—such as our X. It says that for large enough samples, the average has an approximately normal distribution. And because the average is just the sum divided by the total number of Xs, which is 365 in our example, this also lets us use the normal distribution to get approximate probabilities for my weight change over the next year.

Let’s use Minitab’s Probability Distribution Plot to visualize this distribution. First let’s see the distribution of my yearly calorie balance using a mean of 0 and a standard deviation of 3820.

Distribution Plot

We can get the corresponding distribution in terms of pounds gained by dividing the mean and standard deviation by 3500.

Distribution Plot

The right tail of the distribution is labeled to show that under my assumptions I have about an 18% probability of gaining at least one pound. The distribution is symmetric about zero, so I have the same probability of losing at least one pound over the year. On the bright side, I have about a 64% probability of staying within one pound of my current weight as shown in this next graph.

Normal Distribution Plot

Before I revisit the assumptions, let’s project this process farther into the future. What does it imply for 10 years from now? What’s the distribution of the sum X1+X2+…+X3652 (I included a couple of leap years)? The mean will still be zero. The standard deviation will be 200 times the square root of 3652 or about 12,086. Dividing by 3500 pounds per calorie, we have a standard deviation of about 3.45. What’s the probability that I will have gained 5 pounds or more over the next 10 years?

Distribution Plot

It’s about 7.3%. That’s actually not too bad!
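These tail probabilities are easy to reproduce with SciPy's normal distribution, as in this sketch (the standard deviations come from the calculations above, converted to pounds at the assumed 3,500 calories per pound):

```python
from scipy.stats import norm

one_year = norm(loc=0, scale=200 * 365**0.5 / 3500)    # yearly weight change, in pounds
ten_year = norm(loc=0, scale=200 * 3652**0.5 / 3500)   # ten-year weight change, in pounds

print(one_year.sf(1))                      # ~0.18: gain at least 1 pound this year
print(one_year.cdf(1) - one_year.cdf(-1))  # ~0.64: stay within 1 pound
print(ten_year.sf(5))                      # ~0.073: gain at least 5 pounds in 10 years
```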

A More Realistic Look at the Assumptions

Now let’s revisit the assumptions. A key assumption is that my mean calorie imbalance is exactly zero. I’m thinking that’s easier said than done—after all, I’m not weighing my food and calculating calories. I am wearing a smart watch to keep track of my exercise calories, but that’s only a piece of the puzzle, and even there, it’s probably not accurate down to the exact calorie.

So let’s look at what happens if I’m off by a little. Suppose the mean of X is slightly positive, say 10 calories more in than out per day. Means add up, so over a year, the mean imbalance is 365 ×10 = 3650 calories. So on average I’ll gain a little more than a pound. Applying the central limit theorem again, what’s my probability of gaining a pound or more in a year?

Distribution Plot

As this graph shows, the probability is about 51.57% that I will gain at least one pound in a year.

What about 10 years? The average is now 36,520 calories, which translates to about 10.43 pounds. Now what’s the probability of gaining at least 5 pounds over the next 10 years?

Distribution Plot

That’s a probability of over 94% of gaining at least 5 pounds, with gains of around 10 pounds or more being very likely.

That’s a big difference due to a seemingly insignificant 10 calorie imbalance per day. Ten calories is about a minute of jumping rope or a dill pickle.
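The same normal-approximation sketch, with the mean shifted by 10 calories per day, reproduces those probabilities:

```python
from scipy.stats import norm

# Shift the mean by 10 calories per day; keep the assumed 200-calorie daily SD
# and 3,500 calories per pound.
biased_year = norm(loc=365 * 10 / 3500, scale=200 * 365**0.5 / 3500)
biased_decade = norm(loc=3652 * 10 / 3500, scale=200 * 3652**0.5 / 3500)

print(biased_year.sf(1))     # ~0.52: gain at least 1 pound this year
print(biased_decade.sf(5))   # ~0.94: gain at least 5 pounds over 10 years
```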

Considering Correlation?

I assumed that each day’s calorie balance was independent of every other day’s. What happens if there are correlations between days? I could get some positive correlation if I buy a whole cheesecake and take a few days to eat it. Or if I go on a hiking trip over a long weekend. On the other hand, I could get some negative correlation if I try to make up for yesterday’s overeating by eating less or exercising more than usual today.

If there are correlations between days, then in addition to summing variances, I have to include a contribution for each pair of days that are correlated. Positive correlations make the variance of the sum larger, but negative correlations make the variance of the sum smaller. So introducing some negative correlations on a regular basis would help reduce the fluctuations of my weight from its mean. But as we’ve seen, that’s no substitute for keeping the long-term mean as close to zero as possible. If I notice my weight trending too quickly in one direction or the other, I had better make a permanent adjustment to how much I eat, how much activity I get, or both.

A DOE in a Manufacturing Environment (Part 2)


In my last post, I discussed how a DOE was chosen to optimize a chemical-mechanical polishing process in the microelectronics industry. This important process improved the plant's final manufacturing yields. We selected an experimental design that let us study the effects of six process parameters in 16 runs.

Analyzing the Design

Now we'll examine the analysis of the DOE results after the actual tests have been performed. Our objective is to minimize the amount of variability (minimize the Std Dev response) to achieve better wafer uniformity. At the same time we would like to minimize cycle times by increasing the Removal Rate of the process (maximize the V A°/Min response).

Therefore, we are dealing with a multi-response DOE.

Design array

 
From the DOE data shown above, we first built a model for the Removal Rate (V A°/Min): Down Force, Carrier velocity, and Table velocity had a significant effect, as the Pareto chart below clearly shows. The Down Force*Carrier velocity and the Carrier velocity*Table velocity two-factor interactions were also significant. We then gradually eliminated the remaining, non-significant parameters from the model.
 

Removal Rate Pareto
  
A process expert later confirmed that such effects and interactions were logical and could have been expected from our current knowledge of this process. The graph below shows the main effect plots (Removal Rate response).
  Interactions

   
We had chosen a fractional DOE, so several two-factor interactions were confounded with one another. However, the Down Force*Carrier velocity and the Carrier velocity*Table velocity interactions made more sense from a process point of view; besides, these two interactions were associated with very significant main effects.

Interaction Plot for V A/Min

We built a second model for the standard deviation response. Standard deviations are bounded below by zero and tend to be right-skewed rather than normally distributed, so we applied a log transformation to the standard deviation data before modeling it. 

Carrier velocity, Pad and Table velocity as well as the Pad*Carrier velocity interaction had significant effects on the logarithm of the Standard deviation response, as shown in the Pareto graph below. Again this was quite logical and confirmed our process knowledge.

 Std Dev Pareto
   
The main factor effects are shown in the graph below (Log of Standard Deviation).
  
Main effects

The interaction effect is shown in the graph below (Logarithm of Standard Deviation response).

Interaction Plot for LN

We then compared the predictions from our two models to the real observations from the tests, in order to assess the validity of our model. We were interested in identifying observations that would not be consistent with our models. Outliers due to changes in the process environment during the tests often occur, and may bias statistical analyses.

The residual plots below represent the differences between real experiment values and predictions based on the model. Minitab provides four different ways to look at the residuals, in order to help assess whether these residuals are normally and randomly distributed.
  
Residuals for Std Dev
  
In these diagrams, the residuals seem to be generally random (shown by the right hand plots, which display the residuals against their fitted values and their observation order) and normally distributed (displayed by the probability plot on the left).
 
But the observation in row 16 (the blue point that has been selected and brushed in the plot below) looks suspicious as far as the residuals for the removal rate (V A°) response are concerned. It is positioned farther away from the other residual values. We used the brushing functionality in Minitab to get more information about this point (see the values in the table below). We then eliminated observation number 16, and reran the DOE analysis...but the final Removal Rate model was still very similar to the initial one.
  
Residuals for Removal Rate

From our two final models, we used the process optimization tool within Minitab to identify a global process optimum. Three terms were included in the Standard Deviation model (V Carrier: E, Pad: D, and V Table: F), and three terms plus some interactions were included in the Removal Rate (V A°) model (Down Force: A, V Carrier: E and V Table: F).
  
Finding the best compromise while considering conflicting objectives is not always easy, especially when different models contain the same parameters. Reducing the standard deviation to improve yields was more important than minimizing cycle times. Therefore, in the Minitab optimization tool, we assigned a score of 3 in terms of importance to the standard deviation response and a score of only 1 to the V A° (Removal Rate) response.

optimize responses
  
Define optimization

Optimization tool
 
The Minitab Optimization Tool results indicate that the Down Force needs to be maximized to increase the Removal Rate (reduce Cycle time), whereas Carrier and Table velocities as well as Pad need to be kept low in order to achieve a smoother, uniform surface (a small standard deviation).  
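For readers curious about how such a compromise is scored, here is a rough sketch of a weighted composite desirability calculation in Python. The target and bound values below are made up purely for illustration; they are not the actual process limits, and Minitab's Response Optimizer works from the fitted models rather than from single predicted values.

```python
# Individual desirabilities are scaled to [0, 1]; the composite is a weighted
# geometric mean, so the standard deviation response (importance 3) dominates.
def d_smaller_is_better(y, target, upper):
    if y <= target:
        return 1.0
    if y >= upper:
        return 0.0
    return (upper - y) / (upper - target)

def d_larger_is_better(y, lower, target):
    if y >= target:
        return 1.0
    if y <= lower:
        return 0.0
    return (y - lower) / (target - lower)

def composite(d_stdev, d_rate, importance_stdev=3, importance_rate=1):
    total = importance_stdev + importance_rate
    return (d_stdev**importance_stdev * d_rate**importance_rate) ** (1 / total)

# Hypothetical predicted responses at one candidate setting:
d1 = d_smaller_is_better(y=55, target=40, upper=90)        # standard deviation response
d2 = d_larger_is_better(y=2600, lower=2000, target=3000)   # removal rate response
print(composite(d1, d2))
```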

Confirmation tests run at the settings recommended by the optimization tool were consistent with the conclusions that had been drawn from the DOE.

Conclusion

This DOE proved to be a very effective method to identify the factors that had real effects by making them clearly emerge from the surrounding process noise. This analysis thus gave us both a very pragmatic and accurate approach to adjusting and optimizing the polishing process. The conclusions from the DOE could easily be extended to the operating process.

 

Best Way to Analyze Likert Item Data: Two Sample T-Test versus Mann-Whitney


Worksheet that shows Likert data

Five-point Likert scales are commonly associated with surveys and are used in a wide variety of settings. You’ve run into the Likert scale if you’ve ever been asked whether you strongly agree, agree, neither agree nor disagree, disagree, or strongly disagree about something. The worksheet to the right shows what five-point Likert data look like when you have two groups.

Because Likert item data are discrete, ordinal, and have a limited range, there’s been a longstanding dispute about the most valid way to analyze Likert data. The basic choice is between a parametric test and a nonparametric test. The pros and cons for each type of test are generally described as the following:

  • Parametric tests, such as the 2-sample t-test, assume a normal, continuous distribution. However, with a sufficient sample size, t-tests are robust to departures from normality.
  • Nonparametric tests, such as the Mann-Whitney test, do not assume a normal or a continuous distribution. However, there are concerns about a lower ability to detect a difference when one truly exists.

What’s the better choice? This is a real-world decision that users of statistical software have to make when they want to analyze Likert data.

Over the years, a number of studies have tried to answer this question. However, they’ve tended to look at a limited number of potential distributions for the Likert data, which causes the generalizability of the results to suffer. Thanks to increases in computing power, simulation studies can now thoroughly assess a wide range of distributions.

In this blog post, I highlight a simulation study conducted by de Winter and Dodou* that compares the capabilities of the two sample t-test and the Mann-Whitney test to analyze five-point Likert items for two groups. Is it better to use one analysis or the other?

The researchers identified a diverse set of 14 distributions that are representative of actual Likert data. The computer program drew independent pairs of samples to test all possible combinations of the 14 distributions. All in all, 10,000 random samples were generated for each of the 98 distribution combinations! The pairs of samples were analyzed using both the two sample t-test and the Mann-Whitney test to compare how well each test performs. The study also assessed different sample sizes.
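A stripped-down sketch of this kind of simulation looks like the following (this is not the authors' code; the Likert distribution, sample size, and number of simulations are arbitrary choices for illustration):

```python
# Estimate the Type I error rate of both tests when the two groups really do
# come from the same five-point Likert distribution.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
categories = np.array([1, 2, 3, 4, 5])
probs = np.array([0.1, 0.2, 0.4, 0.2, 0.1])   # one hypothetical Likert distribution

n_sims, n_per_group, alpha = 5_000, 30, 0.05
t_rejects = mw_rejects = 0
for _ in range(n_sims):
    g1 = rng.choice(categories, size=n_per_group, p=probs)
    g2 = rng.choice(categories, size=n_per_group, p=probs)   # same distribution
    t_rejects += stats.ttest_ind(g1, g2).pvalue < alpha
    mw_rejects += stats.mannwhitneyu(g1, g2, alternative="two-sided").pvalue < alpha

print(t_rejects / n_sims, mw_rejects / n_sims)   # both should land near 0.05
```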

The results show that for all pairs of distributions the Type I (false positive) error rates are very close to the target amounts. In other words, if you use either analysis and your results are statistically significant, you don’t need to be overly concerned about a false positive.

The results also show that for most pairs of distributions, the difference between the statistical power of the two tests is trivial. In other words, if a difference truly exists at the population level, either analysis is equally likely to detect it. The concerns about the Mann-Whitney test having less power in this context appear to be unfounded.

I do have one caveat. There are a few pairs of specific distributions where there is a power difference between the two tests. If you perform both tests on the same data and they disagree (one is significant and the other is not), you can look at a table in the article to help you determine whether a difference in statistical power might be an issue. This power difference affects only a small minority of the cases.

Generally speaking, the choice between the two analyses is a tie. If you need to compare two groups of five-point Likert data, it usually doesn’t matter which analysis you use. Both tests almost always provide the same protection against false negatives, and both provide the same protection against false positives. These patterns hold true for sample sizes of 10, 30, and 200 per group.

*de Winter, J.C.F. and D. Dodou (2010), Five-Point Likert Items: t test versus Mann-Whitney-Wilcoxon, Practical Assessment, Research and Evaluation, 15(11).

What Are Degrees of Freedom in Statistics?


About a year ago, a reader asked if I could try to explain degrees of freedom in statistics. Since then, I’ve been circling around that request very cautiously, like it’s some kind of wild beast that I’m not sure I can safely wrestle to the ground.

Degrees of freedom aren’t easy to explain. They come up in many different contexts in statistics—some advanced and complicated. In mathematics, they're technically defined as the dimension of the domain of a random vector.

But we won't get into that. Because degrees of freedom are generally not something you need to understand to perform a statistical analysis—unless you’re a research statistician, or someone studying statistical theory.

And yet, enquiring minds want to know. So for the adventurous and the curious, here are some examples that provide a basic gist of their meaning in statistics.

The Freedom to Vary

First, forget about statistics. Imagine you’re a fun-loving person who loves to wear hats. You couldn't care less what a degree of freedom is. You believe that variety is the spice of life.

Unfortunately, you have constraints. You have only 7 hats. Yet you want to wear a different hat every day of the week.

7 hats

On the first day, you can wear any of the 7 hats. On the second day, you can choose from the 6 remaining hats, on day 3 you can choose from 5 hats, and so on.

When day 6 rolls around, you still have a choice between 2 hats that you haven’t worn yet that week. But after you choose your hat for day 6, you have no choice for the hat that you wear on Day 7. You must wear the one remaining hat. You had 7-1 = 6 days of “hat” freedom—in which the hat you wore could vary!

That’s kind of the idea behind degrees of freedom in statistics. Degrees of freedom are often broadly defined as the number of "observations" (pieces of information) in the data that are free to vary when estimating statistical parameters.

Degrees of Freedom: 1-Sample t test

Now imagine you're not into hats. You're into data analysis.

You have a data set with 10 values. If you’re not estimating anything, each value can take on any number, right? Each value is completely free to vary.

But suppose you want to test the population mean with a sample of 10 values, using a 1-sample t test. You now have a constraint—the estimation of the mean. What is that constraint, exactly? By definition of the mean, the following relationship must hold: The sum of all values in the data must equal n x mean, where n is the number of values in the data set.

So if a data set has 10 values, the sum of the 10 values must equal the mean x 10. If the mean of the 10 values is 3.5 (you could pick any number), this constraint requires that the sum of the 10 values must equal 10 x 3.5 = 35.

With that constraint, the first value in the data set is free to vary. Whatever value it is, it’s still possible for the sum of all 10 numbers to have a value of 35. The second value is also free to vary, because whatever value you choose, it still allows for the possibility that the sum of all the values is 35.

In fact, the first 9 values could be anything, including these two examples:

34, -8.3, -555, -92, -1, 0, 1, -22, 99
0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9

But to have all 10 values sum to 35, and have a mean of 3.5, the 10th value cannot vary. It must be a specific number:

34, -8.3, -555, -92, -1, 0, 1, -22, 99  -----> 10th value must be 579.3
0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9 ----> 10th value must be 30.5

Therefore, you have 10 - 1 = 9 degrees of freedom. It doesn’t matter what sample size you use, or what mean value you use—the last value in the sample is not free to vary. You end up with n - 1 degrees of freedom, where n is the sample size.
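A tiny sketch in Python makes the same point: choose the first nine values however you like, and the tenth is forced by the constraint that the sum must be 35.

```python
free_values = [34, -8.3, -555, -92, -1, 0, 1, -22, 99]   # nine freely chosen values
required_sum = 10 * 3.5                                   # n x mean = 35
forced_tenth = required_sum - sum(free_values)
print(forced_tenth)   # the one value that is NOT free to vary
```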

Another way to say this is that the number of degrees of freedom equals the number of "observations" minus the number of required relations among the observations (e.g., the number of parameter estimates). For a 1-sample t-test, one degree of freedom is spent estimating the mean, and the remaining n - 1 degrees of freedom estimate variability.

The degrees of freedom then define the specific t-distribution that’s used to calculate the p-values and t-values for the t-test.

t dist

Notice that for small sample sizes (n), which correspond with smaller degrees of freedom (n - 1 for the 1-sample t test), the t-distribution has fatter tails. This is because the t distribution was specially designed to provide more conservative test results when analyzing small samples (such as in the brewing industry).  As the sample size (n) increases, the number of degrees of freedom increases, and the t-distribution approaches a normal distribution.

Degrees of Freedom: Chi-Square Test of Independence

Let's look at another context. A chi-square test of independence is used to determine whether two categorical variables are dependent. For this test, the degrees of freedom are the number of cells in the two-way table of the categorical variables that can vary, given the constraints of the row and column marginal totals. So each "observation" in this case is a frequency in a cell.

Consider the simplest example: a 2 x 2 table, with two categories and two levels for each category:

 

             Category A          Total
Category B       ?                   6
                                    15
Total           10        11        21

It doesn't matter what values you use for the row and column marginal totals. Once those values are set, there's only one cell value that can vary (here, shown with the question mark—but it could be any one of the four cells). Once you enter a number for one cell, the numbers for all the other cells are predetermined by the row and column totals. They're not free to vary. So the chi-square test for independence has only 1 degree of freedom for a 2 x 2 table.
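Here is a small sketch of that bookkeeping, using the marginal totals from the table above: pick any value for the one free cell, and every other cell is determined.

```python
row_totals = [6, 15]
col_totals = [10, 11]

a = 4                       # the one free cell (row 1, column 1); any value works
b = row_totals[0] - a       # row 1, column 2 is forced
c = col_totals[0] - a       # row 2, column 1 is forced
d = row_totals[1] - c       # row 2, column 2 is forced (equivalently col_totals[1] - b)
print([[a, b], [c, d]])     # only 1 degree of freedom for a 2 x 2 table
```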

Similarly, a 3 x 2 table has 2 degrees of freedom, because only two of the cells can vary for a given set of marginal totals.

 

             Category A               Total
Category B       ?         ?             15
                                         15
Total           10        11       9     30

If you experimented with different sized tables, eventually you’d find a general pattern. For a table with r rows and c columns, the number of cells that can vary is (r-1)(c-1). And that’s the formula for the degrees of freedom for the chi-square test of independence!

The degrees of freedom then define the chi-square distribution used to evaluate independence for the test.

chi square

The chi-square distribution is positively skewed. As the degrees of freedom increases, it approaches the normal curve.

Degrees of Freedom: Regression

Degrees of freedom is more involved in the context of regression. Rather than risk losing the one remaining reader still reading this post (hi, Mom!), I'll  cut to the chase. 

Recall that degrees of freedom generally equals the number of observations (or pieces of information) minus the number of parameters estimated. When you perform regression, a parameter is estimated for every term in the model, and each one consumes a degree of freedom. Therefore, including excessive terms in a multiple regression model reduces the degrees of freedom available to estimate the parameters' variability. In fact, if the amount of data isn't sufficient for the number of terms in your model, there may not even be enough degrees of freedom (DF) for the error term, and no p-values or F-values can be calculated at all. You'll get output something like this:

regression output

If this happens, you either need to collect more data (to increase the degrees of freedom) or drop terms from your model (to reduce the number of degrees of freedom required). So degrees of freedom does have real, tangible effects on your data analysis, despite existing in the netherworld of the domain of a random vector.
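The bookkeeping behind that output is simple enough to sketch (the counts below are hypothetical, not taken from the output shown):

```python
# Error DF = observations - estimated parameters (terms plus the constant).
n_observations = 10
n_terms = 9                    # predictors/terms in the model
n_parameters = n_terms + 1     # plus the constant
error_df = n_observations - n_parameters
print(error_df)                # 0 -> nothing left over to estimate the error term
```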

Follow-up

This post provides a basic, informal introduction to degrees of freedom in statistics. If you want to further your conceptual understanding of degrees of freedom, check out this classic paper in the Journal of Educational Psychology by Dr. Helen Walker, an associate professor of education at Columbia who was the first female president of the American Statistical Association. Another good general reference is Pandey, S., and Bright, C. L. (2008), Social Work Research, Vol. 32, No. 2, available here.

Understanding t-Tests: t-values and t-distributions


T-tests are handy hypothesis tests in statistics when you want to compare means. You can compare a sample mean to a hypothesized or target value using a one-sample t-test. You can compare the means of two groups with a two-sample t-test. If you have two groups with paired observations (e.g., before and after measurements), use the paired t-test.

Output that shows a t-value

How do t-tests work? How do t-values fit in? In this series of posts, I’ll answer these questions by focusing on concepts and graphs rather than equations and numbers. After all, a key reason to use statistical software like Minitab is so you don’t get bogged down in the calculations and can instead focus on understanding your results.

In this post, I will explain t-values, t-distributions, and how t-tests use them to calculate probabilities and assess hypotheses.

What Are t-Values?

T-tests are called t-tests because the test results are all based on t-values. T-values are an example of what statisticians call test statistics. A test statistic is a standardized value that is calculated from sample data during a hypothesis test. The procedure that calculates the test statistic compares your data to what is expected under the null hypothesis.

Each type of t-test uses a specific procedure to boil all of your sample data down to one value, the t-value. The calculations behind t-values compare your sample mean(s) to the null hypothesis and incorporate both the sample size and the variability in the data. A t-value of 0 indicates that the sample results exactly equal the null hypothesis. As the difference between the sample data and the null hypothesis increases, the absolute value of the t-value increases.
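For the one-sample case, for example, the calculation boils down to a familiar signal-to-noise ratio. Here is a minimal sketch (the data values are made up; Minitab does this calculation for you):

```python
# t = (sample mean - null value) / (standard error of the mean)
import math

def one_sample_t(sample, null_mean):
    n = len(sample)
    mean = sum(sample) / n
    variance = sum((x - mean) ** 2 for x in sample) / (n - 1)
    standard_error = math.sqrt(variance / n)
    return (mean - null_mean) / standard_error

print(one_sample_t([4.2, 5.1, 3.8, 4.9, 5.4, 4.6], null_mean=4.0))
```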

Assume that we perform a t-test and it calculates a t-value of 2 for our sample data. What does that even mean? I might as well have told you that our data equal 2 fizbins! We don’t know if that’s common or rare when the null hypothesis is true.

By itself, a t-value of 2 doesn’t really tell us anything. T-values are not in the units of the original data, or anything else we’d be familiar with. We need a larger context in which we can place individual t-values before we can interpret them. This is where t-distributions come in.

What Are t-Distributions?

When you perform a t-test for a single study, you obtain a single t-value. However, if we drew multiple random samples of the same size from the same population and performed the same t-test, we would obtain many t-values and we could plot a distribution of all of them. This type of distribution is known as a sampling distribution.

Fortunately, the properties of t-distributions are well understood in statistics, so we can plot them without having to collect many samples! A specific t-distribution is defined by its degrees of freedom (DF), a value closely related to sample size. Therefore, different t-distributions exist for every sample size. You can graph t-distributions using Minitab’s probability distribution plots.

T-distributions assume that you draw repeated random samples from a population where the null hypothesis is true. You place the t-value from your study in the t-distribution to determine how consistent your results are with the null hypothesis.

Plot of t-distribution

The graph above shows a t-distribution that has 20 degrees of freedom, which corresponds to a sample size of 21 in a one-sample t-test. It is a symmetric, bell-shaped distribution that is similar to the normal distribution, but with thicker tails. This graph plots the probability density function (PDF), which describes the likelihood of each t-value.

The peak of the graph is right at zero, which indicates that obtaining a sample value close to the null hypothesis is the most likely. That makes sense because t-distributions assume that the null hypothesis is true. T-values become less likely as you get further away from zero in either direction. In other words, when the null hypothesis is true, you are less likely to obtain a sample that is very different from the null hypothesis.

Our t-value of 2 indicates a positive difference between our sample data and the null hypothesis. The graph shows that there is a reasonable probability of obtaining a t-value from -2 to +2 when the null hypothesis is true. Our t-value of 2 is an unusual value, but we don’t know exactly how unusual. Our ultimate goal is to determine whether our t-value is unusual enough to warrant rejecting the null hypothesis. To do that, we'll need to calculate the probability.

Using t-Values and t-Distributions to Calculate Probabilities

The foundation behind any hypothesis test is being able to take the test statistic from a specific sample and place it within the context of a known probability distribution. For t-tests, if you take a t-value and place it in the context of the correct t-distribution, you can calculate the probabilities associated with that t-value.

A probability allows us to determine how common or rare our t-value is under the assumption that the null hypothesis is true. If the probability is low enough, we can conclude that the effect observed in our sample is inconsistent with the null hypothesis. The evidence in the sample data is strong enough to reject the null hypothesis for the entire population.

Before we calculate the probability associated with our t-value of 2, there are two important details to address.

First, we’ll actually use the t-values of +2 and -2 because we’ll perform a two-tailed test. A two-tailed test is one that can test for differences in both directions. For example, a two-tailed 2-sample t-test can determine whether the difference between group 1 and group 2 is statistically significant in either the positive or negative direction. A one-tailed test can only assess one of those directions.

Second, we can only calculate a non-zero probability for a range of t-values. As you’ll see in the graph below, a range of t-values corresponds to a proportion of the total area under the distribution curve, which is the probability. The probability for any specific point value is zero because it does not produce an area under the curve.

With these points in mind, we’ll shade the area of the curve that has t-values greater than 2 and t-values less than -2.

T-distribution with a shaded area that represents a probability

The graph displays the probability of observing a difference from the null hypothesis that is at least as extreme as the difference in our sample data, assuming that the null hypothesis is actually true. Each of the shaded regions has a probability of 0.02963, for a total probability of 0.05926. In other words, when the null hypothesis is true, the t-value falls within these regions nearly 6% of the time.

This probability has a name that you might have heard of—it’s called the p-value!  While the probability of our t-value falling within these regions is fairly low, it’s not low enough to reject the null hypothesis using the common significance level of 0.05.
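
If you'd like to verify that probability yourself, a couple of lines of Python with scipy (again, just one convenient option) reproduce the shaded area:

    # Two-tailed p-value for t = 2 with 20 degrees of freedom
    from scipy.stats import t

    t_value, df = 2, 20
    p_value = 2 * t.sf(abs(t_value), df)  # sf is the survival function, P(T > t); doubled for two tails
    print(p_value)                        # about 0.059, matching the total shaded area above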

Learn how to correctly interpret the p-value.

t-Distributions and Sample Size

As mentioned above, t-distributions are defined by the DF, which are closely associated with sample size. As the DF increases, the probability density in the tails decreases and the distribution becomes more tightly clustered around the central value. The graph below depicts t-distributions with 5 and 30 degrees of freedom.

Comparison of t-distributions with different degrees of freedom

The t-distribution with fewer degrees of freedom has thicker tails. This occurs because the t-distribution is designed to reflect the added uncertainty associated with analyzing small samples. In other words, if you have a small sample, the probability that the sample statistic will be further away from the null hypothesis is greater even when the null hypothesis is true.

Small samples are more likely to produce unusual sample statistics, which affects the probability associated with any given t-value. For 5 and 30 degrees of freedom, a t-value of 2 in a two-tailed test has p-values of 10.2% and 5.4%, respectively. Large samples are better!
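
The same quick check, sketched again with scipy purely for illustration, shows how the two-tailed p-value for a t-value of 2 shrinks as the degrees of freedom grow:

    # Two-tailed p-values for t = 2 at 5 and 30 degrees of freedom
    from scipy.stats import t

    for df in (5, 30):
        p = 2 * t.sf(2, df)
        print(df, round(p, 3))  # roughly 0.102 for df = 5 and 0.054 for df = 30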

I’ve explained how t-values and t-distributions work together to produce probabilities. In my next post, I’ll show how each type of t-test works.


Exploring Healthcare Data, Part 1


Working with healthcare-related data often feels different from working with manufacturing data. After all, the common thread among healthcare quality improvement professionals is the motivation to preserve and improve the lives of patients. Whether you're collecting data on the number of patient falls, patient length-of-stay, bed unavailability, wait times, hospital-acquired infections, or readmissions, human lives are at stake. And so collecting and analyzing data—and trusting your results—in a healthcare setting feels even more critical.

Because delivering quality care efficiently is of utmost importance in the healthcare industry, understanding your process, collecting data around that process, and knowing what analysis to perform is key. Using data to drive decisions in your organization builds awareness of your process, reveals opportunities to improve patient care and cut costs, and ultimately results in better business and better care.

So, in the interest of using data to draw insights and make decisions that have positive impacts, I’d like to offer several tips for exploring and visualizing your healthcare data in a way that will prepare you for a formal analysis. For instance, graphing your data and examining descriptive statistics such as means and medians can tell you a lot about how your data are distributed and can help you visualize relationships between variables. These preliminary explorations can also reveal unusual observations in your data that should be investigated before you perform a more sophisticated statistical analysis, allowing you to take action quickly when a process, outcome, or adverse event needs attention.

In the first part of this series, I’ll offer two tips on exploring and visualizing data with graphs, brushing, and conditional formatting. In part 2, I’ll offer three more tips focusing on data manipulation and obtaining descriptive statistics.

If you’d like to follow along, you can download and explore the data yourself! If you don’t yet have Minitab 17, you can download the free, 30-day trial.

A Case Study: Ensuring Sound Sanitization Procedures

Let’s look at a case study where a hospital was seeking to examine—and ultimately improve—their room cleaning procedures.

The presence of adenosine triphosphate (ATP) on a surface indicates that bacteria are present. Hospitals can use ATP detection systems to ensure the effectiveness of their sanitization efforts and identify improvement opportunities.

Staff at your hospital used ATP swab tests to test 8 surfaces in 10 different hospital rooms across 5 departments, and recorded the results in a data sheet. ATP measurements below 400 units ‘pass’ the swab test, while measurements greater than or equal to 400 units ‘fail’ the swab test and require further investigation.

Here is a screenshot of part of the worksheet:

health care data

Tip #1: Evaluate the shape of your data

You can use a histogram to graph all eight surfaces that were tested in separate panels of the same graph. This helps you observe and compare the distribution of data across each touch point.

If you’ve downloaded the data, you can use the ATP Unstacked.MTW worksheet to create this same histogram by navigating to Graph > Histogram > Simple. In the Graph Variables window, select Door Knob, Light Switch, Bed Rails, Call Button, Phone, Bedside Table, Chair, and IV Pole. Click on the Multiple Graphs subdialog and select In separate panels of the same graph under Show Graph Variables. Click OK through all dialogs.

health care data - histogram 

These histograms reveal that:

  • For all test areas, the distribution is asymmetrical with some extreme outliers.
  • Data are all right-skewed.
  • Data do not appear to be normally distributed.

Tip #2: Identify and investigate outliers

An individual value plot can be used to graph the ATP measurements collected across all eight surfaces. Identifying the outliers is quite easy with this plot.

And again, you can use the ATP Unstacked.MTW worksheet to create an individual value plot that looks just like mine. Navigate to Graph > Individual Value Plots > Multiple Y’s > Simple, and choose Door Knob, Light Switch, Bed Rails, Call Button, Phone, Bedside Table, Chair, and IV Pole as Graph variables. Click OK.

health care data - individual value plot

This individual value plot reveals that:

  • Extreme outliers are present for ATP measurements on Bed Rails, Call Button, Phone, and Bedside Table.
  • These extreme values are influencing the mean ATP measured for each surface.
  • It may be more helpful to analyze differences in medians since the means are skewed by these outliers (judging by the histogram and individual value plot).

Once the outliers are identified, you can investigate them with Minitab’s brushing tool to uncover more insights by right-clicking anywhere in the individual value plot and selecting Brush. Setting ID variables also helps to reveal information about other variables associated with these outliers. To do this, right-click in the graph again and select Set ID Variables. Enter Room as the Variable and click OK. Click and drag the cursor to form a rectangle around the outliers as shown below.

health care data - brushing

Brushing can provide actionable insights:

  • Brushing the extreme outliers on the individual value plot and setting ID variables reveals the room numbers associated with high ATP measurements.
  • Quickly identifying rooms where surfaces have high levels of ATP enables faster follow-up and investigation on specific surfaces in specific rooms.

Finally, you can use conditional formatting and other cell properties to investigate and make notes about the outliers. To look at outliers across all surfaces tested, highlight columns C2 through C9, right-click in the worksheet, and select Conditional Formatting > Statistical > Outlier. Alternatively, you can highlight only the extreme outliers by right-clicking in the worksheet, selecting Conditional Formatting > Highlight Cell > Greater Than and entering 2000 (a value we know extreme outliers are above based on the individual value plot).

To make notes about individual outliers, right-click on the cell containing the extreme value, select Cell Properties > Comment, and enter your cell comment.

health care data - conditional formatting

Conditional formats and cell properties offer:

  • Quick insight into surfaces and rooms with high ATP measurements.
  • More efficient investigation of problem areas in order to make process improvements.

Visualizations that Lead to Actionable Insights

By exploring and visualizing your data in these preliminary ways, you can learn a lot before even running a formal analysis. The data are not normally distributed but are highly skewed by several extreme outliers, which greatly influence the mean ATP measurement recorded for each surface. The histograms are helpful evidence that comparing medians instead of means may be a more effective way to determine whether statistically significant differences exist across surfaces, and investigating the outliers both graphically and in the worksheet offers further support for analyzing differences in medians. It is also obvious that bed rails, call buttons, phones, and bedside tables are highly contaminated surfaces—one might surmise this is because these touch points are closest to sick patients, who come into contact with them frequently.

You can use these insights to focus your initial process improvement efforts on the most problematic touch points and hospital rooms. In part 2 of this blog post, I’ll share some tips for manipulating data, extracting even more information from it, and displaying descriptive statistics about contamination levels.

Exploring Healthcare Data, Part 2


In the first part of this series, we looked at a case study where staff at a hospital used ATP swab tests to test 8 surfaces for bacteria in 10 different hospital rooms across 5 departments. ATP measurements below 400 units pass the swab test, while measurements greater than or equal to 400 units fail the swab test and require further investigation.

I offered two tips on exploring and visualizing data using graphs, brushing, and conditional formatting.

  1. Evaluate the shape of your data.
  2. Identify and investigate outliers.

By performing these preliminary explorations on the swab test data, we discovered that the mean ATP measurement would not be effective for testing whether surfaces showed statistically significant differences in contamination levels. This was due to the data being highly skewed by extreme outliers.

We then identified where in the hospital these unusually high ATP measurements occurred. These findings provide valuable information for appropriately focusing process improvement efforts on particular hospital rooms, departments, and surfaces within those rooms.

Now that we've seen how much some simple exploration and visualization tools can reveal, let's run through three more tools that will help you explore your own healthcare data in order to draw actionable insights.

If you’d like to follow along and didn't already download the data from the first post, you can download and explore the data yourself! If you don’t yet have Minitab 17, you can download the free, 30-day trial.

Tip #3: Manipulate the data

The swab test data the hospital staff collected and recorded is unstacked—this simply means that all response measurements are contained in multiple columns rather than stacked together in one column. To do additional data visualization and a more formal analysis, you need to reconfigure or manipulate how the data is arranged. We can accomplish this by stacking rows.

The ATP Stacked.MTW worksheet in the downloadable Minitab project file above already has the data reshaped for you. But you can manipulate the data on your own using the ATP Unstacked.MTW worksheet. Just navigate to Data > Stack > Rows, and complete the dialog as shown:

health care data - stack rows to prepare for analysis

Stacking all rows of your data and storing the associated column subscripts (or column names) in a separate column will result in all ATP measurements stacked into one column, a separate column containing categories for Surfaces, and another column containing the Room Number.

With stacked data, you are properly set up to perform formal analyses in Minitab—this is an important step as you work with your data, as most Minitab analyses require columns of stacked data. We won’t tackle a formal analysis here, but rest assured that you are set up to do so!
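
If a copy of the data lives in a spreadsheet or CSV rather than a Minitab worksheet, the same reshaping can be sketched in Python with pandas. The file and column names below are assumptions based on the worksheet described above:

    # Stack (melt) the wide ATP data into one column, keeping the room number
    import pandas as pd

    wide = pd.read_csv("atp_unstacked.csv")  # hypothetical export of the unstacked worksheet
    surfaces = ["Door Knob", "Light Switch", "Bed Rails", "Call Button",
                "Phone", "Bedside Table", "Chair", "IV Pole"]
    stacked = wide.melt(id_vars="Room", value_vars=surfaces,
                        var_name="Surface", value_name="ATP")
    print(stacked.head())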

Tip #4: Extract information from your original data set

Once your data are stacked, you can use functions available in Calc > Calculator and Data > Recode to leverage information intrinsic to your original data to create new variables to explore and analyze.

For instance, we know the first character of each room number denotes the department. You can use the ‘left’ function in Calc > Calculator to extract the left-most character from the Room column, and store the result in a new column labeled Department. You can do this by filling out the Calculator dialog as shown:

manipulating health care data

You also know that ATP measurements below 400 ‘pass’ the ATP swab test. Recoding ranges of ATP values to text to indicate which values ‘Pass’ and which values ‘Fail’ can be useful when visualizing the data. You can do this by filling out the Data > Recode > To Text dialog as shown:

health care data dialog box
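
For anyone working from a script instead of the Minitab dialogs, both steps have a simple equivalent in pandas; the file and column names are again my assumptions:

    # Derive Department from the first character of Room, and recode ATP as Pass/Fail at 400 units
    import pandas as pd

    stacked = pd.read_csv("atp_stacked.csv")  # hypothetical export of the stacked worksheet
    stacked["Department"] = stacked["Room"].astype(str).str[0]
    stacked["Recoded ATP"] = stacked["ATP"].apply(
        lambda atp: "Fail" if atp >= 400 else "Pass")  # missing values would need extra handling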

Finally, you can use this newly extracted data to create a stacked bar chart showing the counts of measurements that failed, passed, or were missing from the ATP swab test across Department and the recoded ATP. Using the ATP Stacked.MTW worksheet, navigate to Graph > Bar Chart > Stack. Verify that the Bars represent drop-down shows the default selection, Counts of unique values. Click OK. Select Department and Recoded ATP as Categorical variables, and click OK.

Minitab produces the following graph:
 

Health care ATP swab test data

The bar chart reveals that:

  • Department 4 has the highest count of ATP measurements that failed the swab test.
  • The sanitation team should consider focusing initial efforts in department 4 as the investigation of problems with room-cleaning procedures continues.

Tip #5: Obtain important statistics that describe your data

Now that we’ve manipulated the data in a way that prepares us for more formal analyses, identified which department contains the most contaminated surfaces, and compared the portion of measurements in each department that passed or failed the ATP swab test, we can display descriptive statistics to get an idea of how mean or median bacteria levels differed or varied across surfaces and across departments.

Using the ATP Stacked.MTW worksheet, navigate to Stat > Basic Statistics > Display Descriptive Statistics. Enter ATP as the Variable, Department as the By variable, and click OK. Press Ctrl + E to re-enter the Display Descriptive Statistics dialog, and replace Department with Surface as the By variable. Click OK.  The following output displays in Minitab’s Session Window.

Health care data descriptive statistics

Health care data swab tests descriptive statistics

The descriptive statistics reveal helpful information:

  • These statistics allow for easy comparison of mean and median ATP measurements as well as the variation of ATP measurements, either by department or by surface.
     
  • Notice that mean ATP measurements are much higher than median ATP measurements for both sets of descriptive statistics. This is because the data are right-skewed. Certain analyses that assume you have normally distributed data—such as t-tests to compare means—might not be the best tool to formally analyze this data. Comparing medians might offer more insight.
     
  • Both sets of descriptive statistics highlight which departments and surfaces to focus on for investigation and process improvement efforts. For instance, department 4 has the highest median ATP presence, while Bed Rails, Phone, and Call Button—the touch points closest to a sick patient in a hospital bed—appear to be the most problematic surfaces to sanitize. Process improvement efforts can begin with this information.

What Else Can You Do with Your Data?

What you’ve seen in this two-part blog post is just the beginning. But consider how much of this initial exploration is actionable! By having this foundation for visualizing and manipulating your data, you’ll be well on your way to investigating and testing root causes, and more efficiently performing analyses that yield trustworthy results.

If you’re interested in how other healthcare organizations use Minitab for quality improvement, check out our case studies.

Tests of 2 Standard Deviations? Side Effects May Include Paradoxical Dissociations


Once upon a time, when people wanted to compare the standard deviations of two samples, they had two handy tests available, the F-test and Levene's test.

Statistical lore has it that the F-test is so named because it so frequently fails you.1 Although the F-test is suitable for data that are normally distributed, its sensitivity to departures from normality limits when and where it can be used.

Levene’s test was developed as an antidote to the F-test's extreme sensitivity to nonnormality. However, Levene's test is sometimes accompanied by a troubling side effect: paradoxical dissociations. To see what I mean, take a look at these results from an actual test of 2 standard deviations that I actually ran in Minitab 16 using actual data that I actually made up:

Ratio of the standard deviations in Release 16

Nothing surprising so far. The ratio of the standard deviations from samples 1 and 2 (s1/s2) is 1.414 / 1.575 = 0.898. This ratio is our best "point estimate" for the ratio of the standard deviations from populations 1 and 2 (Ps1/Ps2).

Note that the ratio is less than 1, which suggests that Ps2 is greater than Ps1. 

Now, let's have a look at the confidence interval (CI) for the population ratio. The CI gives us a range of likely values for the ratio of Ps1/Ps2. The CI below labeled "Continuous" is the one calculated using Levene's method:

Confidence interval for the ratio in Release 16

What in Gauss' name is going on here?!? The range of likely values for Ps1/Ps2—1.046 to 1.566—doesn't include the point estimate of 0.898?!? In fact, the CI suggests that Ps1/Ps2 is greater than 1. Which suggests that Ps1 is actually greater than Ps2.

But the point estimate suggests the exact opposite! Which suggests that something odd is going on here. Or that I might be losing my mind (which wouldn't be that odd). Or both.

As it turns out, the very elements that make Levene's test robust to departures from normality also leave the test susceptible to paradoxical dissociations like this one. You see, Levene's test isn't actually based on the standard deviation. Instead, the test is based on a statistic called the mean absolute deviation from the median, or MADM. The MADM is much less affected by nonnormality and outliers than is the standard deviation. And even though the MADM and the standard deviation of a sample can be very different, the ratio of MADM1/MADM2 is nevertheless a good approximation for the ratio of Ps1/Ps2. 

However, in extreme cases, outliers can affect the sample standard deviations so much that s1/s2 can fall completely outside of Levene's CI. And that's when you're left with an awkward and confusing case of paradoxical dissociation. 
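
A tiny numeric sketch in Python, using made-up numbers of my own, shows how a single outlier can pull the two ratios apart:

    # An outlier inflates the sample standard deviation far more than it inflates
    # the mean absolute deviation from the median (MADM)
    import numpy as np

    sample1 = np.array([9, 10, 10, 11, 12, 10, 9, 11, 10, 48])  # one wild outlier
    sample2 = np.array([8, 12, 9, 13, 10, 11, 12, 9, 11, 10])

    def madm(x):
        return np.mean(np.abs(x - np.median(x)))

    print(np.std(sample1, ddof=1) / np.std(sample2, ddof=1))  # ratio based on standard deviations
    print(madm(sample1) / madm(sample2))                      # Levene-style ratio based on MADM

The two ratios can land far apart, and that gap is exactly what produces a paradoxical dissociation.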

Fortunately (and this may be the first and last time that you'll ever hear this next phrase), our statisticians have made things a lot less awkward. One of the brave folks in Minitab's R&D department toiled against all odds, and at considerable personal peril to solve this enigma. The result, which has been incorporated into Minitab 17, is an effective, elegant, and non-enigmatic test that we call Bonett's test.

Confidence interval in Release 17

Like Levene's test, Bonett's test can be used with nonnormal data. But unlike Levene's test, Bonett's test is actually based on the actual standard deviations of the actual samples. Which means that Bonett's test is not subject to the same awkward and confusing paradoxical dissociations that can accompany Levene's test. And I don't know about you, but I try to avoid paradoxical dissociations whenever I can. (Especially as I get older, ... I just don't bounce back the way I used to.) 

When you compare two standard deviations in Minitab 17, you get a handy graphical report that quickly and clearly summarizes the results of your test, including the point estimate and the CI from Bonett's test. Which means no more awkward and confusing paradoxical dissociations.

Summary plot in Release 17

------------------------------------------------------------

 

1 So, that bit about the name of the F-test—I kind of made that up. Fortunately, there is a better source of information for the genuinely curious. Our white paper, Bonett's Method, includes all kinds of details about these tests and comparisons between the CIs calculated with each. Enjoy.

 

 

 

Understanding Bootstrapping and the Central Limit Theorem


For hundreds of years, people have been improving their situation by pulling themselves up by their bootstraps. Well, now you can improve your statistical knowledge by pulling yourself up by your bootstraps. Minitab Express has 7 different bootstrapping analyses that can help you better understand the sampling distribution of your data.

A sampling distribution describes the likelihood of obtaining each possible value of a statistic from a random sample of a population—in other words, what proportion of all random samples of that size will give that value. Bootstrapping is a method that estimates the sampling distribution by taking multiple samples with replacement from a single random sample. These repeated samples are called resamples. Each resample is the same size as the original sample.

The original sample represents the population from which it was drawn. Therefore, the resamples from this original sample represent what we would get if we took many samples from the population. The bootstrap distribution of a statistic, based on the resamples, represents the sampling distribution of the statistic.

Bootstrapping and Running Backs  

For example, let’s estimate the sampling distribution of the number of yards per carry for Penn State’s star running back Saquon Barkley. Going through all 182 of his carries from last season seems daunting, so instead I took a random sample of 49 carries and recorded the number of yards he gained for each one. If you want to follow along, you can get the data I used here.

Repeated sampling with replacement from these 49 carries mimics what the population might look like. To take a resample, one of the carries is randomly selected from the original sample, the number of yards gained is recorded, and then that observation is put back into the sample. This is done 49 times (the size of the original sample) to complete a single resample.

To obtain a single resample, in Minitab Express go to STATISTICS > Resampling > Bootstrapping > 1-Sample Mean. Enter the column of data in Sample, and enter 1 for number of resamples. The following individual plot represents a single bootstrap sample taken from the original sample.

Note: Because Minitab Express randomly selects the bootstrap sample, your results will be different.

Individual Value Plot

The resample is done by sampling with replacement, so the bootstrap sample will usually not be the same as the original sample. To create a bootstrap distribution, you take many resamples. The following histogram shows the bootstrap distribution for 1,000 resamples of our original sample of 49 carries.

Bootstrap Histogram

The bootstrap distribution is centered at approximately 5.5, which is an estimate of the population mean for Barkley’s yards per carry. The middle 95% of values from the bootstrapping distribution provide a 95% confidence interval for the population mean. The red reference lines represent the interval, so we can be 95% confident the population mean of Barkley’s yards per carry is between approximately 3.4 and 7.8.
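
If you'd like to see the mechanics without Minitab Express, here is a bare-bones version of the same idea in Python; the file and column names are my assumptions, and the yards-per-carry values would come from the data linked above:

    # Bootstrap a 95% confidence interval for mean yards per carry
    import numpy as np
    import pandas as pd

    carries = pd.read_csv("barkley_carries.csv")["Yards"].to_numpy()  # hypothetical file and column
    rng = np.random.default_rng()

    boot_means = [rng.choice(carries, size=len(carries), replace=True).mean()
                  for _ in range(1000)]  # 1,000 resamples, each the size of the original sample

    print(np.percentile(boot_means, [2.5, 97.5]))  # middle 95% of the bootstrap distribution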

Bootstrapping and the Central Limit Theorem

The central limit theorem is a fundamental theorem of probability and statistics. The theorem states that the distribution of the mean of a random sample from a population with finite variance is approximately normally distributed when the sample size is large, regardless of the shape of the population's distribution. Bootstrapping can be used to easily understand how the central limit theorem works.

For example, consider the distribution of the data for Saquon Barkley’s yards per carry.

Histogram

It’s pretty obvious that the data are nonnormal. But now we’ll create a bootstrap distribution of the means of 10 resamples.  

Bootstrap Histogram

The distribution of the means is very different from the distribution of the original data. It looks much closer to a normal distribution. This resemblance increases as the number of resamples increases. With 1,000 resamples, the distribution of the mean of the resamples is approximately normal.

Bootstrap Histogram

Note: Bootstrapping is only available in Minitab Express, which is an introductory statistics package meant for students and university professors.

See How Easily You Can Do a Box-Cox Transformation in Regression


Translink ticket vending machine, found at all train stations in south-east Queensland

For one reason or another, the response variable in a regression analysis might not satisfy one or more of the assumptions of ordinary least squares regression. The residuals might follow a skewed distribution or the residuals might curve as the predictions increase. A common solution when problems arise with the assumptions of ordinary least squares regression is to transform the response variable so that the data do meet the assumptions. Minitab makes the transformation simple by including the Box-Cox button. Try it for yourself and see how easy it is!

The government in Queensland, Australia shares data about the number of complaints about its public transportation service. 

I’m going to use the data set titled “Patronage and Complaints.” I’ll analyze the data a bit more thoroughly later, but for now I want to focus on the transformation. The variables in this data set are the date, the number of passenger trips, the number of complaints about a frequent rider card, and the number of other customer complaints. I'm using the range of the data from the week ending July 7th, 2012 to December 22nd 2013.  I’m excluding the data for the last week of 2012 because ridership is so much lower compared to other weeks.

If you want to follow along, you can download my Minitab data sheet. If you don't already have it, you can download Minitab and use it free for 30 days

Let’s say that we want to use the number of complaints about the frequent rider card as the response variable. The number of other complaints and the date are the predictors. The resulting normal probability plot of the residuals shows an s-curve.

The residuals do not appear normal.

Because we see this pattern, we’d like to go ahead and do the Box-Cox transformation. Try this:

  1. Choose Stat > Regression > Regression > Fit Regression Model.
  2. In Responses, enter the column with the number of complaints on the go card.
  3. In Continuous Predictors, enter the columns that contain the other customer complaints and the date.
  4. Click Options.
  5. Under Box-Cox transformation, select Optimal λ.
  6. Click OK.
  7. Click Graphs.
  8. Select Individual plots and check Normal plot of residuals.
  9. Click OK twice.

The residuals are more normal.

The probability plot that results is more linear, although it still shows outlying observations where the number of complaints in the response is very high or very low relative to the number of other complaints. You'll still want to check the other regression assumptions, such as homoscedasticity.
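
If you prefer to script your analyses, the same idea can be sketched with scipy and statsmodels. This is only an illustration of fitting a regression to a Box-Cox-transformed response, not a reproduction of Minitab's output, and the file and column names are placeholders:

    # Find an optimal Box-Cox lambda for the response, then refit the regression
    import pandas as pd
    import statsmodels.api as sm
    from scipy import stats

    data = pd.read_csv("patronage_and_complaints.csv")      # hypothetical export of the data sheet
    y = data["GoCardComplaints"]                             # placeholder column names
    X = sm.add_constant(data[["OtherComplaints", "Week"]])   # Week assumed to be a numeric index, not a raw date

    y_transformed, lam = stats.boxcox(y)  # Box-Cox requires a strictly positive response
    model = sm.OLS(y_transformed, X).fit()
    print(lam)                            # the optimal lambda
    print(model.summary())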

So there it is, everything that you need to know to use a Box-Cox transformation on the response in a regression model. Easy, right? Ready for some more? Check out more of the analysis steps that Minitab makes easy.

  The image of the Translink vending machine is by Brad Wood and is licensed for reuse under this Creative Commons License.

 

Those 10 Simple Rules for Using Statistics? They're Not Just for Research


Earlier this month, PLOS.org published an article titled "Ten Simple Rules for Effective Statistical Practice." The 10 rules are good reading for anyone who draws conclusions and makes decisions based on data, whether you're trying to extend the boundaries of scientific knowledge or make good decisions for your business. 

Carnegie Mellon University's Robert E. Kass and several co-authors devised the rules in response to the increased pressure on scientists and researchers—many, if not most, of whom are not statisticians—to present accurate findings based on sound statistical methods. 

Since the paper and the discussions it has prompted focus on scientists and researchers, it seems worthwhile to consider how the rules might apply to quality practitioners and business decision-makers as well. In this post, I'll share the 10 rules, some with a few modifications to make them more applicable to the wider population of people who use data to inform their decisions.

1. Statistical Methods Should Enable Data to Answer Scientific Specific Questions

As the article points out, new or infrequent users of statistics tend to emphasize finding the "right" method to use—often focusing on the structure or format of their data, rather than thinking about how the data might answer an important question. But choosing a method based on the data is putting the cart before the horse. Instead, we should start by clearly identifying the question we're trying to answer. Then we can look for a method that uses the data to answer it. If you haven't already collected your data, so much the better—you have the opportunity to identify and obtain the data you'll need.

2. Signals Always Come With Noise

If you're familiar with control charts used in statistical process control (SPC) or the Control phase of a Six Sigma DMAIC project, you know that they let you distinguish process variation that matters (special-cause variation) from normal process variation that doesn't need investigation or correction.

control chart
Control charts are one common tool used to distinguish "noise" from "signal." 

The same concept applies here: whenever we gather and analyze data, some of what we see in the results will be due to inherent variability. Measures of probability for analyses, such as confidence intervals, are important because they help us understand and account for this "noise." 

3. Plan Ahead, Really Ahead

Say you're starting a DMAIC project. Carefully considering and developing good questions right at the start of a project—the DEFINE stage—will help you make sure that you're getting the right data in the MEASURE stage. That, in turn, should result in a much smoother and stress-free ANALYZE phase—and probably more successful IMPROVE and CONTROL phases, too. The alternative? You'll have to complete the ANALYZE phase with the data you have, not the data you wish you had. 

4. Worry About Data Quality

"Can you trust your data?" My Six Sigma instructor asked us that question so many times, it still flashes through my mind every time I open Minitab. That's good, because he was absolutely right: if you can't trust your data, you shouldn't do anything with it. Many people take it for granted that the data they get is precise and accurate, especially when using automated measuring instruments and similar technology. But how do you know they're measuring precisely and accurately? How do you know your instruments are calibrated properly? If you didn't test it, you don't know. And if you don't know, you can't trust your data. Fortunately, with measurement system analysis methods like gage R&R and attribute agreement analysis, we never have to trust data quality to blind faith.

5. Statistical Analysis Is More Than a Set of Computations

Statistical techniques are often referred to as "tools," and that's a very apt metaphor. A saw, a plane, and a router all cut wood, but they aren't interchangeable—the end product defines which tool is appropriate for a job. Similarly, you might apply ANOVA, regression, or time series analysis to the same data set, but the right tool depends on what you want to understand. To extend the metaphor further, just as we have circular saws, jigsaws, and miter saws for very specific tasks, each family of statistical methods also includes specialized tools designed to handle particular situations. The point is that we select a tool to assist our analysis, not to define it. 

6. Keep it Simple

Many processes are inherently messy. If you've got dozens of input variables and multiple outcomes, analyzing them could require many steps, transformations, and some thorny calculations. Sometimes that degree of complexity is required. But a more complicated analysis isn't always better—in fact, overcomplicating it may make your results less clear and less reliable. It also potentially makes the analysis more difficult than necessary. You may not need a complex process model that includes 15 factors if you can improve your output by optimizing the three or four most important inputs. If you need to improve a process that includes many inputs, a short screening experiment can help you identify which factors are most critical, and which are not so important.

7. Provide Assessments of Variability

No model is perfect. No analysis accounts for all of the observed variation. Every analysis includes a degree of uncertainty. Thus, no statistical finding is 100% certain, and that degree of uncertainty needs to be considered when using statistical results to make decisions. If you're the decision-maker, be sure that you understand the risks of reaching a wrong conclusion based on the analysis at hand. If you're sharing your results with stakeholders and executives, especially if they aren't statistically inclined, make sure you've communicated that degree of risk to them by offering and explaining confidence intervals, margins of error, or other appropriate measures of uncertainty. 

8. Check Your Assumptions

Different statistical methods are based on different assumptions about the data being analyzed. For instance, many common analyses assume that your data follow a normal distribution. You can check most of these assumptions very quickly using functions like a normality test in your statistical software, but it's easy to forget (or ignore) these steps and dive right into your analysis. However, failing to verify those assumptions can yield results that aren't reliable and shouldn't be used to inform decisions, so don't skip that step. If you're not sure about the assumptions for a statistical analysis, Minitab's Assistant menu explains them, and can even flag violations of the assumptions before you draw the wrong conclusion from an errant analysis. 

9. When Possible, Replicate Verify Success!

In science, replication of a study—ideally by another, independent scientist—is crucial. It indicates that the first researcher's findings weren't a fluke, and provides more evidence in support of the given hypothesis. Similarly, when a quality project results in great improvements, we can't take it for granted those benefits are going to be sustained—they need to be verified and confirmed over time. Control charts are probably the most common tool for making sure a project's benefits endure, but depending on the process and the nature of the improvements, hypothesis tests, capability analysis, and other methods also can come into play.  

10. Make Your Analysis Reproducible Share How You Did It

In the original 10 Simple Rules article, the authors suggest scientists share their data and explain how they analyzed it so that others can make sure they get the same results. This idea doesn't translate so neatly to the business world, where your data may be proprietary or private for other reasons. But just as science benefits from transparency, the quality profession benefits when we share as much information as we can about our successes. Of course you can't share your company's secret-sauce formulas with competitors—but if you solved a quality challenge in your organization, chances are your experience could help someone facing a similar problem. If a peer in another organization already solved a problem like the one you're struggling with now, wouldn't you like to see if a similar approach might work for you? Organizations like ASQ and forums like iSixSigma.com help quality practitioners network and share their successes so we can all get better at what we do. And here at Minitab, we love sharing case studies and examples of how people have solved problems using data analysis, too. 

How do you think these rules apply to the world of quality and business decision-making? What are your guidelines when it comes to analyzing data? 

 

Using Marginal Plots, aka "Stuffed-Crust Charts"


In my last post, we took the red pill and dove deep into the unarguably fascinating and uncompromisingly compelling world of the matrix plot. I've stuffed this post with information about a topic of marginal interest...the marginal plot.

Margins are important. Back in my English composition days, I recall that margins were particularly prized for the inverse linear relationship they maintained with the number of words that one had to string together to complete an assignment. Mathematically, that relationship looks something like this:

Bigger margins = fewer words

In stark contrast to my concept of margins as information-free zones, the marginal plot actually utilizes the margins of a scatterplot to provide timely and important information about your data. Think of the marginal plot as the stuffed-crust pizza of the graph world. Only, instead of extra cheese, you get to bite into extra data. And instead of filling your stomach with carbs and cholesterol, you're filling your brain with data and knowledge. And instead of arriving late and cold because the delivery driver stopped off to canoodle with his girlfriend on his way to your house (even though he's just not sure if the relationship is really working out: she seems distant lately and he's not sure if it's the constant cologne of consumables about him, or the ever-present film of pizza grease on his car seats, on his clothes, in his ears?)

...anyway, unlike a cold, late pizza, marginal plots are always fresh and hot, because you bake them yourself, in Minitab Statistical Software.

I tossed some randomly-generated data around and came up with this half-baked example. Like the pepperonis on a hastily prepared pie, the points on this plot are mostly piled in the middle, with only a few slices venturing to the edges. In fact, some of those points might be outliers. 

Scatterplot of C1 vs C2

If only there were an easy, interesting, and integrated way to assess the data for outliers when we make a scatterplot.

Boxplots are a useful way to look for outliers. You could make separate boxplots of each variable, like so:

Boxplot of C1  Boxplot of C2

It's fairly easy to relate the boxplot of C1 to the values plotted on the y-axis of the scatterplot. But it's a little harder to relate the boxplot of C2 to the scatterplot, because the y-axis on the boxplot corresponds to the x-axis on the scatterplot. You can transpose the scales on the boxplot to make the comparison a little easier. Just double-click one of the axes and select Transpose value and category scales:

Boxplot of C2, Transposed

That's a little better. The only thing that would be even better is if you could put each boxplot right up against the scatterplot...if you could stuff the crust of the scatterplot with boxplots, so to speak. Well, guess what? You can! Just choose Graph > Marginal Plot > With Boxplots, enter the variables and click OK

Marginal Plot of C1 vs C2

Not only are the boxplots nestled right up next to the scatterplot, but they also share the same axes as the scatterplot. For example, the outlier (asterisk) on the boxplot of C2 corresponds to the point directly below it on the scatterplot. Looks like that point could be an outlier, so you might want to investigate further. 

Marginal plots can also help alert you to other important complexities in your data. Here's another half-baked example. Unlike our pizza delivery guy's relationship with his girlfriend, it looks like the relationship between the fake response and the fake predictor represented in this scatterplot really is working out: 

Scatterplot of Fake Response vs Fake Predictor 

In fact, if you use Stat > Regression > Fitted Line Plot, the fitted line appears to fit the data nicely. And the regression analysis is highly significant:

Fitted Line_ Fake Response versus Fake Predictor

Regression Analysis: Fake Response versus Fake Predictor

The regression equation is
Fake Response = 2.151 + 0.7723 Fake Predictor

S = 2.12304   R-Sq = 50.3%   R-Sq(adj) = 49.7%

Analysis of Variance

Source      DF       SS       MS      F      P
Regression   1  356.402  356.402  79.07  0.000
Error       78  351.568    4.507
Total       79  707.970

But wait. If you create a marginal plot instead, you can augment your exploration of these data with histograms and/or dotplots, as I have done below. Looks like there's trouble in paradise:

Marginal Plot of Fake Response vs Fake Predictor, with Histograms Marginal Plot of Fake Response vs Fake Predictor, with Dotplots

Like the poorly made pepperoni pizza, the points on our plot are distributed unevenly. There appear to be two clumps of points. The distribution of values for the fake predictor is bimodal: that is, it has two distinct peaks. The distribution of values for the response may also be bimodal.

Why is this important? Because the two clumps of toppings may suggest that you have more than one metaphorical cook in the metaphorical pizza kitchen. For example, it could be that Wendy, who is left handed, started placing the pepperonis carefully on the pie and then got called away, leaving Jimmy, who is right handed, to quickly and carelessly complete the covering of cured meats. In other words, it could be that the two clumps of points represent two very different populations. 

When I tossed and stretched the data for this example, I took random samples from two different populations. I used 40 random observations from a normal distribution with a mean of 8 and a standard deviation of 1.5, and 40 random observations from a normal distribution with a mean of 13 and a standard deviation of 1.75. The two clumps of data are truly from two different populations. To illustrate, I separated the two populations into two different groups in this scatterplot: 

 Scatterplot with Groups

This is a classic conundrum that can occur when you do a regression analysis. The regression line tries to pass through the center of the data. And because there are two clumps of data, the line tries to pass through the center of each clump. This looks like a relationship between the response and the predictor, but it's just an illusion. If you separate the clumps and analyze each population separately, you discover that there is no relationship at all: 

Fitted Line_ Fake Response 1 versus Fake Predictor 1

Regression Analysis: Fake Response 1 versus Fake Predictor 1

The regression equation is
Fake Response 1 = 9.067 - 0.1600 Fake Predictor 1

S = 1.64688   R-Sq = 1.5%   R-Sq(adj) = 0.0%

Analysis of Variance

Source      DF       SS       MS     F      P
Regression   1    1.609  1.60881  0.59  0.446
Error       38  103.064  2.71221
Total       39  104.673

Fitted Line_ Fake Response 2 versus Fake Predictor 2

Regression Analysis: Fake Response 2 versus Fake Predictor 2

The regression equation is
Fake Response 2 = 12.09 + 0.0532 Fake Predictor 2

S = 1.62074   R-Sq = 0.3%   R-Sq(adj) = 0.0%

Analysis of Variance

Source      DF      SS       MS     F      P
Regression   1   0.291  0.29111  0.11  0.741
Error       38  99.818  2.62679
Total       39  100.109
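
You can reproduce the illusion with a short simulation. Here is a Python sketch using the same population parameters described above (40 observations from each of two normal populations); it is my own rough equivalent of the example, not the original worksheet:

    # Two clumps of unrelated data can produce a "significant" pooled regression
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    x1, y1 = rng.normal(8, 1.5, 40), rng.normal(8, 1.5, 40)      # population 1
    x2, y2 = rng.normal(13, 1.75, 40), rng.normal(13, 1.75, 40)  # population 2

    pooled = stats.linregress(np.concatenate([x1, x2]), np.concatenate([y1, y2]))
    print(pooled.pvalue)                    # tiny: the pooled "relationship" looks highly significant
    print(stats.linregress(x1, y1).pvalue)  # usually well above 0.05: no relationship within a clump
    print(stats.linregress(x2, y2).pvalue)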

If only our unfortunate pizza delivery technician could somehow use a marginal plot to help him assess the state of his own relationship. But alas, I don't think a marginal plot is going to help with that particular analysis. Where is that guy anyway? I'm getting hungry. 


High Cpk and a Funny-Looking Histogram: Is My Process Really that Amazing?


Here is a scenario involving process capability that we’ve seen from time to time in Minitab's technical support department. I’m sharing the details in this post so that you’ll know where to look if you encounter a similar situation.

You need to run a capability analysis. You generate the output using Minitab Statistical Software. When you look at the results, the Cpk is huge and the histogram in the output looks strange:

What’s going on here? The Cpk seems unrealistic at 42.68, the "within" fit line is tall and narrow, and the bars on the histogram are all smashed down. Yet if we use the exact same data to make a histogram using the Graph menu, we see that things don’t look so bad:

So what explains the odd output for the capability analysis?

Notice that the ‘within subgroup’ variation in the capability output is represented by the tall dashed line in the middle of the histogram.  This is the StDev (Within) shown on the left side of the graph. The within subgroup variation of 0.0777 is very small relative to the overall standard deviation. 

So what is causing the within subgroup variation to be so small? Another graph in Minitab can give us the answer: The Capability Sixpack. In the case above, the subgroup size was 1 and Minitab’s Capability Sixpack in Stat > Quality Tools > Capability Sixpack > Normal will plot the data on a control chart for individual observations, an I-chart:

Hmmm...this could be why, in Minitab training, our instructors recommend using the Capability Sixpack first.

In the Capability Sixpack above, we can see that the individually plotted values on the I-chart show an upward trend, and it appears that the process is not stable and in control (as it should be for data used in a capability analysis).  A closer look at the data in the worksheet clearly reveals that the data was sorted in ascending order:

Because the within-subgroup variation for data not collected in subgroups is estimated from the moving ranges (the average distance between consecutive points), sorting the data makes the within-subgroup variation very small. With very little within-subgroup variation, we get the very tall, narrow fit line that is ‘smashing down’ the bars on the histogram. We can see this by creating a histogram in the Graph menu and forcing Minitab to use a very small standard deviation (by default, this graph uses the overall standard deviation, which is what Ppk is based on). Choose Graph > Histogram > Simple, enter the data, click Data View, choose the Distribution tab, check Fit distribution, enter 0.0777 for the Historical StDev, then click OK. Now we get:

Mystery solved!  And if you still don’t believe me, we can get a better looking capability histogram by randomizing the data first (Calc > Random Data > Sample From Columns):

Now if we run the capability analysis using the randomized data in C2 we see:

A note of caution: I’m not suggesting that the data for a capability analysis should be randomized. The moral of the story is that the data in the worksheet should be entered in the order it was collected so that it is representative of the normal variation in the process (i.e., the data should not be sorted). 
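
To see the moving-range mechanics at work, here is a small Python sketch. It uses the common average-moving-range/1.128 estimate of the within standard deviation for individual observations, which is a reasonable stand-in for what the capability analysis does when the subgroup size is 1:

    # Sorting individuals data shrinks the moving-range estimate of the within StDev
    import numpy as np

    rng = np.random.default_rng(3)
    x = rng.normal(100, 5, 100)  # made-up individuals data

    def within_sd(values):
        moving_ranges = np.abs(np.diff(values))  # distances between consecutive points
        return moving_ranges.mean() / 1.128      # average moving range / d2 (d2 = 1.128 for n = 2)

    print(within_sd(x))           # close to the true standard deviation of 5
    print(within_sd(np.sort(x)))  # much smaller: consecutive sorted values barely differ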

Too bad our Cpk doesn’t look as amazing as it did before…now it's time to get to work with Minitab to improve our Cpk!

Process Capability Statistics: Cpk vs. Ppk


Back when I used to work in Minitab Tech Support, customers often asked me, “What’s the difference between Cpk and Ppk?” It’s a good question, especially since many practitioners default to using Cpk while overlooking Ppk altogether. It’s like the '80s pop duo Wham!, where Cpk is George Michael and Ppk is that other guy.

Poofy hairdos styled with mousse, shoulder pads, and leg warmers aside, let’s start by defining rational subgroups and then explore the difference between Cpk and Ppk.

Rational Subgroups

A rational subgroup is a group of measurements produced under the same set of conditions. Subgroups are meant to represent a snapshot of your process. Therefore, the measurements that make up a subgroup should be taken from a similar point in time. For example, if you sample 5 items every hour, your subgroup size would be 5.

Formulas, Definitions, Etc.

The goal of capability analysis is to ensure that a process is capable of meeting customer specifications, and we use capability statistics such as Cpk and Ppk to make that assessment. If we look at the formulas for Cpk and Ppk for normal (distribution) process capability, we can see they are nearly identical:
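
In the usual notation, with Xbar for the process mean and USL and LSL for the upper and lower specification limits:

    Cpk = min[ (USL − Xbar) / (3 × within StDev),  (Xbar − LSL) / (3 × within StDev) ]
    Ppk = min[ (USL − Xbar) / (3 × overall StDev), (Xbar − LSL) / (3 × overall StDev) ]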

The only difference lies in the denominator for the Upper and Lower statistics: Cpk is calculated using the WITHIN standard deviation, while Ppk uses the OVERALL standard deviation. Without boring you with the details surrounding the formulas for the standard deviations, think of the within standard deviation as the average of the subgroup standard deviations, while the overall standard deviation represents the variation of all the data. This means that:

Cpk:
  • Only accounts for the variation WITHIN the subgroups
  • Does not account for the shift and drift between subgroups
  • Is sometimes referred to as the potential capability because it represents the potential your process has at producing parts within spec, presuming there is no variation between subgroups (i.e. over time)
Ppk:
  • Accounts for the OVERALL variation of all measurements taken
  • Theoretically includes both the variation within subgroups and also the shift and drift between them
  • Is where you are at the end of the proverbial day

Examples of the Difference Between Cpk and Ppk

For illustration, let's consider a data set where 5 measurements were taken every day for 10 days.

Example 1 - Similar Cpk and Ppk

similar Cpk and Ppk

As the graph on the left side shows, there is not a lot of shift and drift between subgroups compared to the variation within the subgroups themselves. Therefore, the within and overall standard deviations are similar, which means Cpk and Ppk are similar, too (at 1.13 and 1.07, respectively).

Example 2 - Different Cpk and Ppk

different Cpk and Ppk

In this example, I used the same data and subgroup size, but I shifted the data around, moving it into different subgroups. (Of course we would never want to move data into different subgroups in practice – I’ve just done it here to illustrate a point.)

Since we used the same data, the overall standard deviation and Ppk did not change. But that’s where the similarities end.

Look at the Cpk statistic. It’s 3.69, which is much better than the 1.13 we got before. Looking at the subgroups plot, can you tell why Cpk increased? The graph shows that the points within each subgroup are much closer together than before. Earlier I mentioned that we can think of the within standard deviation as the average of the subgroup standard deviations. So less variability within each subgroup equals a smaller within standard deviation. And that gives us a higher Cpk.
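
If you want to see that arithmetic in action, here is a rough Python sketch. It estimates the within standard deviation by pooling the subgroup standard deviations, which is a simplification of what Minitab actually does, and the data and specification limits are made up:

    # Cpk vs. Ppk for data collected in subgroups of 5
    import numpy as np

    rng = np.random.default_rng(7)
    data = rng.normal(10, 1, size=(10, 5))  # 10 days x 5 measurements per day (made-up process)
    lsl, usl = 6, 14                        # made-up specification limits

    mean = data.mean()
    overall_sd = data.std(ddof=1)                                # all 50 values together
    within_sd = np.sqrt(np.mean(data.std(axis=1, ddof=1) ** 2))  # pooled within-subgroup StDev

    cpk = min((usl - mean) / (3 * within_sd), (mean - lsl) / (3 * within_sd))
    ppk = min((usl - mean) / (3 * overall_sd), (mean - lsl) / (3 * overall_sd))
    print(cpk, ppk)  # similar here, because these subgroups don't shift or drift over time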

To Ppk or Not to Ppk

And here is where the danger lies in only reporting Cpk and forgetting about Ppk like it’s George Michael’s lesser-known bandmate (no offense to whoever he may be). We can see from the examples above that Cpk only tells us part of the story, so the next time you examine process capability, consider both your Cpk and your Ppk. And if the process is stable with little variation over time, the two statistics should be about the same anyway.

(Note: It is possible, and okay, to get a Ppk that is larger than Cpk, especially with a subgroup size of 1, but I’ll leave that explanation for another day.)

To Infinity and Beyond with the Geometric Distribution


See if this sounds fair to you. I flip a coin.

pennies

Heads: You win $1.
Tails: You pay me $1.

You may not like games of chance, but you have to admit it seems like a fair game. At least, assuming the coin is a normal, balanced coin, and assuming I’m not a sleight-of-hand magician who can control the coin.

How about this next game?

You pay me $2 to play.
I flip a coin over and over until it comes up heads.
Your winnings are the total number of flips.

So if the first flip comes up heads, you only get back $1. That’s a net loss of $1. If it comes up tails on the first flip and heads on the second flip, you get back $2, and we’re even. If it comes up tails on the first two flips and then heads on the third flip, you get $3, for a net profit of $1. If it takes more flips than that, your profit is greater.

It’s not quite as obvious in this case, but this would be considered a fair game if each coin flip has an equal chance of heads or tails. That’s because the expected value (or mean) number of flips is 2. The total number of flips follows a geometric distribution with parameter p = ½, and the expected value is 1/p.

geometric distribution with p=0.5

Now it gets really interesting. What about this next game?

You pay me $x dollars to play.
I flip a coin over and over until it comes up heads.
Your winnings start at $1, but double with every flip.

This is a lot like the previous game with two important differences. First, I haven’t told you how much you have to pay to play. I just called it x dollars. Second, the winnings grow faster with the number of flips. It starts off the same, with $1 for one flip and $2 for two flips. But then it goes to $4, then $8, then $16. If the first head comes up on the eighth flip, you win $128. And it just keeps getting better from there.

So what’s a fair price to play this game? Well, let’s consider the expected value of the winnings. It’s $∞. You read that right. It’s infinity dollars! So you shouldn’t be too worried about what x is, right? No matter what price I set, you should be eager to pay it. Right? Right?
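
Here's the quick calculation behind that claim. The probability that the first head appears on flip k is (1/2)^k, and the payoff in that case is 2^(k−1) dollars, so the expected winnings are

    E(winnings) = (1/2)(1) + (1/4)(2) + (1/8)(4) + ... = 1/2 + 1/2 + 1/2 + ... = ∞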

I’m going to go out on a limb and guess that maybe you would not just let me name my price. Sure, you’d admit that it’s worth more than $2. But would you pay $10? $50? $100,000,000? If the fair price is the expected winnings, then any of these prices should be reasonable. But I’m guessing you would draw the line somewhere short of $100, and maybe even less than $10.

This fascinating little conundrum is known as the Saint Petersburg paradox. Wikipedia tells me it goes by that name because it was addressed in the Commentaries of the Imperial Academy of Science of Saint Petersburg back in 1738 by that pioneer of probability theory, Daniel Bernoulli.

The paradox is that while theory tells us that no price is too high to pay to play this game, nobody is willing to pay very much at all to play it.

What’s more, even if you decide what you're willing to pay, you won't find any casinos that even offer this game, because the ultimate outcome is just as unpredictable for the house as it is for the player.

The paradox has been discussed from various angles over the years. One reason I find it so interesting is that it forces me to think carefully about things that are easy to take for granted about the mean of a distribution.

Such as…

The mean is a measure of central tendency.

This is one of the first things we learn in statistics. The mean is in some sense a central value, which the data tends to vary around. It’s the balancing point of the distribution. But when the mean is infinite, this interpretation goes out the window. Now every possible value in the distribution is less than the mean. That’s not very central!

The sample mean approaches the population mean.

One of the most powerful results in statistics is the law of large numbers. Roughly speaking, it tells us that as your sample size grows, you can expect the sample average to approach the mean of the distribution you are sampling from. I think this is a good reason to treat the mean winnings as the fair price for playing the game. If you play repeatedly at the fair price your average profit approaches zero. But here’s the catch: the law of large numbers assumes the mean of the distribution is finite. So we lose one of the key justifications of treating the mean as the fair price when it’s infinite.

The central limit theorem.

Another extremely important result in statistics is the central limit theorem, which I wrote about in a previous blog post. It tells us that the average of a large sample has an approximate normal distribution centered at the population mean, with a standard deviation that shrinks as the sample size grows. But the central limit theorem requires not only a finite mean but also a finite standard deviation. I’m sorry to tell you that if the mean of the distribution is infinite, then so is the standard deviation. So not only do we lack a finite mean for our average winnings to gravitate toward, we also don’t have a nicely behaved standard deviation to narrow down the variability of our winnings.

Let’s end by using Minitab to simulate these two games where the payoff is tied to the number of flips until heads comes up. I generated 10,000 random values from the geometric distribution. The two graphs show the running average of the winnings in the two games. In the first case, we have expected winnings of 2, and we see the average stabilizes near 2 pretty quickly. 

Time Series Plot of Expected Game Winnings - $2

In the second case, we have infinite expected winnings, and the average does not stabilize.

Time Series Plot of Infinite Expected Game Winnings

If you'd like to do some simulation on this paradox yourself, here's how to do it in Minitab. First, use Calc > Make Patterned Data > Simple Set of Numbers... to make a column with the numbers 1 to 10,000. Next, open Calc > Random Data > Geometric... to create a separate column of 10,000 random data points from the geometric distribution, using .5 as the Event Probability. 

Now we can compute the running average of the random geometric data in C2 with Minitab's Calculator, using the PARS function. PARS is short for “partial sum.” In each row it stores the sum of the data up to and including that row. To get the running average of a game where the expected winnings are $2, divide the partial sums by C1, which just contains the row numbers:

calculator with formula for running average

The computation for the game with infinite mean is the same, except that the winnings double with each additional flip. Therefore, we take the partial sums of 2**(C2 – 1), which is 2 raised to the power (C2 – 1), instead of just C2, and divide each by C1. That formula is entered in the calculator as shown below:

calculator with running average formula for game with infinite winnings

Finally, select Stat > Time Series > Time Series Plot... and plot the running averages of the games with expected winnings of $2 and $∞.
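If you prefer scripting to menus, here is a rough equivalent of the same simulation in Python rather than Minitab. It's just a sketch, and it assumes numpy and matplotlib are installed:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng()
n = 10_000

# Number of flips until the first head: geometric with p = 0.5
flips = rng.geometric(p=0.5, size=n)

# Game 1: winnings equal the number of flips, so the expected value is 2
running_avg_1 = np.cumsum(flips) / np.arange(1, n + 1)

# Game 2: winnings are 2^(flips - 1), so the expected value is infinite
running_avg_2 = np.cumsum(2.0 ** (flips - 1)) / np.arange(1, n + 1)

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(8, 6))
ax1.plot(running_avg_1)
ax1.set_title("Running average, expected winnings = $2")
ax2.plot(running_avg_2)
ax2.set_title("Running average, infinite expected winnings")
plt.tight_layout()
plt.show()
```

If your run looks like mine, the top panel settles near 2 almost immediately, while the bottom one keeps lurching upward every time a long run of tails produces a huge payout.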

So, how much would you pay to play this game? 

Data Not Normal? Try Letting It Be, with a Nonparametric Hypothesis Test


So the data you nurtured, that you worked so hard to format and make useful, failed the normality test.

not-normal

Time to face the truth: despite your best efforts, that data set is never going to measure up to the assumption you may have been trained to fervently look for.

Your data's lack of normality seems to make it poorly suited for analysis. Now what?

Take it easy. Don't get uptight. Just let your data be what they are, go to the Stat menu in Minitab Statistical Software, and choose "Nonparametrics."

nonparametrics menu

If you're stymied by your data's lack of normality, nonparametric statistics might help you find answers. And if the word "nonparametric" looks like five syllables' worth of trouble, don't be intimidated—it's just a big word that usually refers to "tests that don't assume your data follow a normal distribution."

In fact, nonparametric statistics don't assume your data follow any particular distribution at all. The following table lists common parametric tests, their equivalent nonparametric tests, and the main characteristics of each.

correspondence table for parametric and nonparametric tests

Nonparametric analyses free your data from the straitjacket of the normality assumption. So choosing a nonparametric analysis is sort of like removing your data from a stifling, conformist environment, and putting it into a judgment-free, groovy idyll, where your data set can just be what it is, with no hassles about its unique and beautiful shape. How cool is that, man? Can you dig it?

Of course, it's not quite that carefree. Just like the 1960s encompassed both Woodstock and Altamont, so nonparametric tests offer both compelling advantages and serious limitations.

Advantages of Nonparametric Tests

Both parametric and nonparametric tests draw inferences about populations based on samples, but parametric tests focus on population parameters such as the mean and the standard deviation, and they make assumptions about your data—for example, that it follows a normal distribution and that samples include a minimum number of data points.

In contrast, nonparametric tests are unaffected by the distribution of your data. Nonparametric tests also accommodate many conditions that parametric tests do not handle, including small sample sizes, ordered outcomes, and outliers.

Consequently, they can be used in a wider range of situations and with more types of data than traditional parametric tests. Many people also feel that nonparametric analyses are more intuitive.

Drawbacks of Nonparametric Tests

But nonparametric tests are not completely free from assumptions—they do require data to be an independent random sample, for example.

And nonparametric tests aren't a cure-all. For starters, they typically have less statistical power than their parametric equivalents. Power is the probability that you will correctly reject the null hypothesis when it is false. That means you have an increased chance of making a Type II error with these tests.

In practical terms, that means nonparametric tests are less likely to detect an effect or association when one really exists.

So if you want to draw conclusions with the same confidence level you'd get using an equivalent parametric test, you will need larger sample sizes. 
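To see that power difference in action, here is a small simulation sketch in Python (scipy and numpy assumed; the sample size of 20 per group and the mean shift of 0.5 are arbitrary choices for illustration) comparing how often a 2-sample t-test and the Mann-Whitney test detect a real difference between two normal populations:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims, n, shift, alpha = 2_000, 20, 0.5, 0.05
t_hits = mw_hits = 0

for _ in range(n_sims):
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(shift, 1.0, n)   # a real difference of 0.5 exists
    if stats.ttest_ind(a, b).pvalue < alpha:
        t_hits += 1                 # t-test detected the difference
    if stats.mannwhitneyu(a, b).pvalue < alpha:
        mw_hits += 1                # Mann-Whitney detected the difference

print(f"t-test power:       {t_hits / n_sims:.2f}")
print(f"Mann-Whitney power: {mw_hits / n_sims:.2f}")
```

With normal data like these, the t-test usually comes out a bit ahead; with heavy-tailed or skewed data, the gap narrows and can even reverse.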

Nonparametric tests are not a one-size-fits-all solution for non-normal data, but they can yield good answers in situations where parametric statistics just won't work.

Is Parametric or Nonparametric the Right Choice for You?

I've briefly outlined differences between parametric and nonparametric hypothesis tests, looked at which tests are equivalent, and considered some of their advantages and disadvantages. If you're waiting for me to tell you which direction you should choose...well, all I can say is, "It depends..." But I can give you some established rules of thumb to consider when you're looking at the specifics of your situation.

Keep in mind that a lack of normality does not immediately disqualify your data from a parametric test. What's your sample size? As long as a certain minimum sample size is met, most parametric tests are robust to departures from normality. For example, the Assistant in Minitab (which uses Welch's t-test) points out that while the 2-sample t-test is based on the assumption that the data are normally distributed, this assumption is not critical when the sample sizes are at least 15. And Bonett's 2-sample standard deviation test performs well for nonnormal data even when sample sizes are as small as 20.

In addition, while they may not require normal data, many nonparametric tests have other assumptions that you can’t disregard. For example, the Kruskal-Wallis test assumes your samples come from populations that have similar shapes and equal variances. And the 1-sample Wilcoxon test does not assume a particular population distribution, but it does assume the distribution is symmetrical. 
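If you want to experiment with these tests outside of Minitab, here is a minimal Python sketch using scipy; the data values are invented purely for illustration:

```python
from scipy import stats

# Kruskal-Wallis: compares several groups; assumes similarly shaped distributions
group1 = [2.9, 3.0, 2.5, 2.6, 3.2]
group2 = [3.8, 2.7, 4.0, 2.4, 3.5]
group3 = [2.8, 3.4, 3.7, 2.2, 2.0]
print(stats.kruskal(group1, group2, group3))

# 1-sample Wilcoxon: tests a median against a hypothesized value;
# assumes the underlying distribution is symmetric
sample = [5.1, 4.8, 6.2, 5.5, 4.9, 5.7, 6.0, 5.3]
hypothesized_median = 5.0
diffs = [x - hypothesized_median for x in sample]
print(stats.wilcoxon(diffs))
```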

In most cases, your choice between parametric and nonparametric tests ultimately comes down to sample size, and whether the center of your data's distribution is better reflected by the mean or the median.

  • If the mean accurately represents the center of your distribution and your sample size is large enough, a parametric test offers you better accuracy and more power. 
  • If your sample size is small, you'll likely need to go with a nonparametric test. But if the median better represents the center of your distribution, a nonparametric test may be a better option even for a large sample.

 

Common Assumptions about Data (Part 1: Random Samples and Statistical Independence)


horse before the cart road sign

Statistical inference uses data from a sample of individuals to reach conclusions about the whole population. It’s a very powerful tool. But as the saying goes, “With great power comes great responsibility!” When attempting to make inferences from sample data, you must check your assumptions. Violating any of these assumptions can result in false positives or false negatives, thus invalidating your results. In other words, you run the risk that your results are wrong, that your conclusions are wrong, and hence that the solutions you implement won’t solve the problem (unless you’re really lucky!).

You’ve heard the joke about what happens when you assume? For this post, let’s instead ask, “What happens when you fail to check your assumptions?” After all, we’re human—and humans assume things all the time. Suppose, for example, I want to schedule a phone meeting with you and I’m in the U.S. Eastern time zone. It’s easy for me to assume that everyone is in the same time zone, but you might really be in California, or Australia. What would happen if I called a meeting at 2:00 p.m. but didn’t specify the time zone? Unless you checked, you might be early or late to the meeting, or miss it entirely!

The good news is that when it comes to the assumptions in statistical analysis, Minitab has your back. Minitab 17 has even more features to help you verify and validate the assumptions your analysis requires before you finalize your conclusions. When you use the Assistant in Minitab, the software will identify the appropriate assumptions for your analysis, provide guidance to help you develop robust data collection plans, check the assumptions when you analyze your data, and report the results in an easy-to-understand Report Card and Diagnostic Report.

The common data assumptions are: Random Samples, Independence, Normality, Equal Variance, Stability, and an accurate, precise Measurement System. In this post, we’ll address Random Samples and Statistical Independence.

What Is the Assumption of Random Samples?

A sample is random when each data point in your population has an equal chance of being included in the sample; selection of any individual happens by chance rather than by choice. This reduces the chance that differences in materials or conditions strongly bias your results. Random samples are more likely to be representative of the population, so you can be more confident in the statistical inferences you draw from them.

There is no test that assures random sampling has occurred. Following good sampling techniques will help to ensure your samples are random. Here are some common approaches to making sure a sample is randomly created:

  • Using a random number table or feature in Minitab (Figure 1).
  • Systematic selection (every nth unit or at specific times during the day).
  • Sequential selection (taken in sequence for destructive testing, etc.).
  • Avoiding the use of judgement or convenience to select samples.

Minitab dialog boxes

Figure 1. Random Data Generator in Minitab 17

Non-random samples introduce bias and can result in incorrect interpretations.
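As a simple illustration of the idea, here is a minimal Python sketch of drawing a simple random sample; the lot of serialized parts is hypothetical:

```python
import random

random.seed(42)  # seed only so the example is reproducible

# Hypothetical population: 500 serialized parts from a production lot
population = [f"PART-{i:04d}" for i in range(1, 501)]

# Draw a simple random sample of 30 parts: every part has an equal
# chance of selection, so judgment and convenience never enter into it
sample = random.sample(population, k=30)
print(sample[:5])
```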

What Is the Assumption of Statistical Independence?

Statistical independence is a critical assumption for many statistical tests, such as the 2-sample t test and ANOVA. Independence means the value of one observation does not influence or affect the value of other observations. Independent data items are not connected with one another in any way (unless you account for it in your model). This includes the observations in both the “between” and “within” groups in your sample. Non-independent observations introduce bias and can make your statistical test give too many false positives.  

Following good sampling techniques will help to ensure your samples are independent. Common sources of non-independence include:

  • Observations that are close together in time.
  • Observations that are close together in space or nested.
  • Observations that are somehow related.

Minitab can test for independence between two categorical variables using the Chi-Square Test for Association, which determines whether the distribution of observations for one variable is the same across all categories of the second variable.
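For instance, the same kind of chi-square test of association can be run in Python with scipy; the defect counts below are invented for illustration:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts: defect type (rows) by production shift (columns)
observed = np.array([
    [12, 18, 10],
    [ 8, 14, 16],
])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi-square = {chi2:.2f}, p-value = {p_value:.3f}, dof = {dof}")
```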

The Real Reason You Need to Check the Assumptions

You will be putting a lot of time and effort into collecting and analyzing data. After all the work you put into the analysis, you want to be able to reach correct conclusions. You want to be confident that you can tell whether observed differences between data samples are simply due to chance, or whether the populations are indeed different!

It’s easy to put the cart before the horse and just plunge into the data collection and analysis, but it’s much wiser to take the time to understand which data assumptions apply to the statistical tests you will be using, and plan accordingly.

In my next blog post, I will review the Normality and Equal Variance assumptions.  
