How to Perform a Correlation Analysis

9 hours 53 minutes
Video Transcription
Hi, guys. Welcome back to your Lean Six Sigma Green Belt. I'm Katherine McIver, and in today's module we're actually gonna finish up regression analysis by performing the correlation analysis. You'll notice that I call it regression and I call it correlation. Much like everything else in Lean and Six Sigma, because we borrow so much from statisticians, we use the terms interchangeably.
So regression is a little bit of an older term, used really to discuss the very specifics of how the values move together, whereas correlation is looking more at how the values are correlated with each other. At the end of the day, it's the same thing. So in this module, I want you to be able to finish up your correlation analysis.
I want you to recognize why we do a scatter plot when we could so easily just do a correlation analysis, and I want you to understand the measures of magnitude. Your regression analysis, or your correlation analysis, is going to be your number one hypothesis tool as a Green Belt. So I want you to be able to feel really comfortable speaking from the results of your testing.
So we did a scatter plot in our last module, and it was really interesting because it graphically, or visually, showed us the direction of our relationship. And we like this because humans, again, are very visual people and we're very intuitive. That sort of graphic representation tells a story. It helps us look at something and understand.
So I really love this comic because what you are looking at on the left is something that wouldn't be unheard of. So let's kind of go through.
You're looking at an R squared, which we're gonna look at in a minute, that is very low. It's not really very diagnostic. But when you see the scatters and you see how far away from your trendline they are, that gives you a sense of the variance in your data.
So when we looked at the OkCupid data, we saw that there was some variance. Your plots were pretty far apart, but there was still an identifiable relationship. When we looked at our presidents' date of birth and the year they were elected, we learned that there's not really a lot of variance. That was a very tight line. So if you think about that from a normal distribution curve, you would see a very tall bell curve with not a lot of spread. What we're looking at here on the left, with our R squared being 0.6, is a very low distribution curve with a huge amount of spread, because those points show you how far away from the average, or your trendline, they are.
That being said, your scatter plot is another way to visualize your data. It initially gives you a sense of variance. What you're also going to be looking at is your goodness of fit. How close are the dots to your trendline? That's another kind of gut-instinct measure of variance in your process. Also, how many dots are far away from your trendline? Do you see clustering patterns?
At the end of the day, while we want to mathematically show the magnitude of the relationship, which will be the rest of the module, we can look at this and probably get a gut instinct that we're not going to need to do the rest of the work to finish it, because this isn't a really indicative graph where we're like, "Yup, there is definitely a relationship there."
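The gut-instinct check described here, how close the dots are to the trendline and how many sit far away, can also be sketched in a few lines of Python. This is a minimal illustration and not part of the course material: the function names, the made-up data, and the cutoff of two spreads for "far away" are all assumptions.

```python
import math

def trendline_residuals(xs, ys):
    """Fit a least-squares trendline; return each point's vertical distance from it."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    # Residual = actual y minus the y the trendline predicts at that x.
    return [y - (mean_y + slope * (x - mean_x)) for x, y in zip(xs, ys)]

def spread(residuals):
    """Root-mean-square distance of the points from the trendline."""
    return math.sqrt(sum(r ** 2 for r in residuals) / len(residuals))

# Hypothetical data: the same upward trend with small and with large scatter.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
tight_ys = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9, 14.2, 15.8, 18.1, 19.9]
loose_ys = [5.0, -1.0, 12.0, 2.0, 13.0, 9.0, 20.0, 10.0, 21.0, 17.0]

tight = trendline_residuals(xs, tight_ys)
loose = trendline_residuals(xs, loose_ys)

print("tight spread:", round(spread(tight), 2))
print("loose spread:", round(spread(loose), 2))

# Assumed cutoff: call a dot "far away" if it sits more than 2 spreads from the line.
far_away = [i for i, r in enumerate(loose) if abs(r) > 2 * spread(loose)]
print("far-away points in the loose data:", far_away)
```

The smaller the spread, the tighter the dots hug the trendline, which is the same thing the eye picks up from the scatter plot.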
So correlation analysis is that second piece of regression analysis, where you have your scatter plot and your correlation. It is a quantifiable way to measure the magnitude of the relationship. So remember, when we're talking about hypothesis testing from a Green Belt perspective, what you are interested in is: is there a relationship between your X and your Y variables, or what you can control and what your process output is? It doesn't account for causality, and I keep mentioning this more as a CYA than anything else. As you get more and more familiar with your process, you'll be able to tease out your variables more easily. But if this is something where you're new in a department, or you don't really know all of the ins and outs of the specific process, you always want to leave the opportunity for there to be another variable that is actually the driving force. So while we say it doesn't account for causality, remember, we had our statistician who refused to be named that said correlation really strongly hints at it.
Correlation really does strongly hint at it.
We do correlation analysis because it tells us which variables we can change that will impact our outputs, or Y variables. So if you think back to our root cause, remember, all of the work that we did in our root cause analysis creates those variables that we are looking to change.
If we find that there is no relationship between what we believe to be the root cause and the actual measurement of our project objective, then we know that that's not a solution we want to pursue. So correlation analysis helps us pursue the correct solution, or at least get at solutions that we believe will have more magnitude towards our problem statement or project objective. If you fail to reject the null hypothesis (and remember, that could be a false negative), you will be looking for another variable.
So remember, I mentioned that all of your root causes are your variables. It doesn't account for causality. Failing to reject the null hypothesis means that you are looking for that hidden variable. So that's where your Type I and Type II errors play into this.
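To make "fail to reject the null hypothesis" concrete, here is a small Python sketch. It uses a permutation test, which is a different technique from the Excel trendline approach this module relies on, and it is swapped in only because it needs nothing beyond the standard library. The data, the function names, the number of shuffles, and the 0.05 cutoff are all assumptions for illustration.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def permutation_p_value(xs, ys, n_shuffles=2000, seed=0):
    """Share of random re-pairings whose correlation is at least as strong as the real one."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    at_least_as_strong = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # break any real X-to-Y pairing
        if abs(pearson_r(xs, shuffled)) >= observed:
            at_least_as_strong += 1
    # Add one to both counts so the estimate is never exactly zero.
    return (at_least_as_strong + 1) / (n_shuffles + 1)

# Hypothetical process data: one candidate root cause that tracks the output, one that does not.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
related_ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]
unrelated_ys = [5, 1, 4, 8, 2, 7, 3, 6]

p_related = permutation_p_value(xs, related_ys)
p_unrelated = permutation_p_value(xs, unrelated_ys)

ALPHA = 0.05  # assumed significance cutoff
for name, p in [("related", p_related), ("unrelated", p_unrelated)]:
    verdict = "reject the null" if p < ALPHA else "fail to reject the null"
    print(name, round(p, 3), verdict)
```

A "fail to reject" result on a variable that really does drive the output would be the false negative (Type II error) mentioned above, which is why a small sample is a reason to keep looking rather than to close the question.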
So correlation analysis in Excel is so easy. It's kind of like you need the "that was simple" button, or whatever other term you go with for "so easy."
So if you recognize our graph, this is our presidents' birth year and year elected graph, exactly the one that we created for our scatter plot module. All we're going to do is go to Format Trendline in Excel, and towards the bottom of our options we're going to see "Display R-squared value on chart."
Your R squared value is the measure of the relationship. It goes hand in hand with Pearson's coefficient, r: for a straight trendline like this one, R squared is Pearson's coefficient squared. What this tells you is the goodness of fit between your X and your Y variables.
The tighter your points are to your trendline, like looking at birth year and year elected, the higher your R squared value is. It gives you a measure of goodness of fit. Pearson's coefficient also tells you whether that relationship is positive or negative; R squared by itself is never negative, so for direction you look at which way the trendline slopes.
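For readers working outside Excel, the same numbers can be reproduced with a short Python sketch. This is an illustration under stated assumptions, not the module's workflow: the helper names are hypothetical, and the six birth-year and election-year pairs are a small sample picked for the example, not the full chart from the module. It computes Pearson's coefficient directly and R squared from the trendline's leftover error.

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient, between -1 and 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def trendline_r_squared(xs, ys):
    """R squared of the least-squares trendline: 1 - leftover error / total variation."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    ss_res = sum((y - (mean_y + slope * (x - mean_x))) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Small illustrative sample (birth year, year first elected president):
# Lincoln, F. D. Roosevelt, Kennedy, Reagan, Clinton, Obama.
birth_years = [1809, 1882, 1917, 1911, 1946, 1961]
elected_years = [1860, 1932, 1960, 1980, 1992, 2008]

r = pearson_r(birth_years, elected_years)
r2 = trendline_r_squared(birth_years, elected_years)

print("Pearson's r:", round(r, 3))
print("R squared:  ", round(r2, 3))
print("r squared:  ", round(r * r, 3))  # matches R squared for a straight-line fit
```

Because R squared is a square, it comes out the same for r = 0.9 and r = -0.9, so the sign has to come from r itself or from the slope of the trendline.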
So when we are looking at our Pearson's coefficient, remember, you've probably heard your family say along the way, or someone mention, a one-to-one relationship. If you are looking at a Pearson's coefficient of one, you are looking at a one-to-one relationship, or perfectly positive.
What that means is when you move X by one unit, Y will move in a corresponding positive way by one unit. If you are at negative one, you are perfectly negative. Same thing: change your X by one unit, and your Y will change by one unit in the opposite direction.
And I say unit very, very mindfully, because not all of our measures are going to be perfect one-to-one relationships where you have 10 X and 10 Y. You may have 10 X and 27 Y, which means that when you move X by one unit, your Y may move by 2.7, or you may have a Y intercept, so a baseline that you're always working with.
What's important in here is your units of measure. So remember our measurement scale module. If one tick on our X axis equates to five, but one tick on our Y axis equates to 200, when we move five, we will see that corresponding 200. So unit to unit, not actual measure.
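The point about units can be checked numerically. The sketch below uses hypothetical data and a hypothetical helper: a line where Y moves 2.7 for every 1 of X, with a baseline of 5, still gives a Pearson's coefficient of 1, and rescaling the ticks (5 per X tick, 200 per Y tick) leaves the coefficient unchanged. In other words, the coefficient measures how consistently the points follow a straight line, not how steep the line is in raw numbers.

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient, between -1 and 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Y moves 2.7 per unit of X, on top of a baseline (Y intercept) of 5.
ys_rising = [2.7 * x + 5 for x in xs]
# Same size of movement, opposite direction.
ys_falling = [-2.7 * x + 5 for x in xs]

r_rising = pearson_r(xs, ys_rising)
r_falling = pearson_r(xs, ys_falling)

# Re-express the same data on different scales: 5 per X tick, 200 per Y tick.
xs_rescaled = [5 * x for x in xs]
ys_rescaled = [200 * y for y in ys_rising]
r_rescaled = pearson_r(xs_rescaled, ys_rescaled)

print("rising: ", round(r_rising, 6))
print("falling:", round(r_falling, 6))
print("rescaled:", round(r_rescaled, 6))
```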
So you rarely see perfects. What you will see is strongs, moderates, and weaks. A strong is 0.8 to 1, or negative 1 to negative 0.8.
You'll notice that if you think about our presidential matchup, that was a 0.98, so it is a very strong positive. We can feel very confident that the relationship between the year elected and their date of birth is very, very strongly correlated. Which means that we know that somebody is not going to be elected before they're born, and if we have them born in 50 years, they're probably not going to be elected tomorrow. So what that relationship gives you is a sense of confidence in your solution.
If you look at your moderates, so your 0.5 to 0.8 or your negative 0.8 to negative 0.5, what you are looking at is that there is some relationship. The effect of your activity is not going to be as strong, but these are still effective solutions.
This is where the majority of the solutions that I work with live, where you have a 0.5 relationship and you're like, "OK, this is good, we can work with this. Let's use the solution." 0.2 to 0.5, or negative 0.5 to negative 0.2: these are gonna be your weak ones. These are kind of your crapshoots. This is where you're gonna live in your operationally significant, so not statistically significant.
You're gonna want to be in moderates to be statistically significant, rather than just operationally significant. And then remember, when you're reading this, keep in mind what your goal is. If bigger is better, you're gonna want to be in the positives. If smaller is better, you're gonna want to be in the negatives. So that's how you, as a Green Belt, are going to do your hypothesis testing for that relationship, which will help you select which variables you're going to develop solutions for.
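The strong, moderate, and weak bands can be written down as a small lookup. This is a sketch, not an official Lean Six Sigma tool: the function name is hypothetical, and how the exact boundary values (0.8, 0.5, 0.2) are assigned is an assumption, since the bands as described share their endpoints.

```python
def describe_relationship(r):
    """Label a Pearson's coefficient using the magnitude bands from this module."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("Pearson's coefficient must be between -1 and 1")
    size = abs(r)
    if size >= 0.8:        # 0.8 to 1, or -1 to -0.8
        strength = "strong"
    elif size >= 0.5:      # 0.5 to 0.8, or -0.8 to -0.5
        strength = "moderate"
    elif size >= 0.2:      # 0.2 to 0.5, or -0.5 to -0.2
        strength = "weak"
    else:                  # closer to zero than 0.2 either way
        return "no relationship (likely noise)"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction}"

for value in (0.98, -0.6, 0.3, 0.1):
    print(value, "->", describe_relationship(value))
```

Whether a positive or a negative label is the one you want depends on the goal: positive when bigger is better, negative when smaller is better.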
With that, it's been a while since we've had a pop quiz. So as we're looking at this, what is the magnitude of the relationship between the year a president was born and the year he was elected to office? I really hope you guys were paying attention to our last slide and you can say strongly positive, because it's not quite a perfect one.
So today we wrapped up regression analysis. We did correlation analysis, and we talked about the magnitude of relationship and what that means for you for statistical significance: moderate and better is going to be statistically significant. Weak isn't going to be very statistically significant, but it can be operationally significant. And anything closer to zero than that means there's no relationship, and what you're looking at is probably going to just be noise in the process.
So with that, that wraps up our hypothesis testing. We have done root cause analysis, and now we're testing whether or not those root causes actually are drivers in our process. We're going to switch over to something completely different when we look at theory of constraints, so I will see you guys there.