
Inferential Statistics for Data Science

 

Preface

Statistics is broadly divided into two branches: descriptive and inferential. Today we will discuss the latter.

What is inferential statistics?

Inferential statistics is a branch of statistics that involves using sample data to make inferences or draw conclusions about a larger population. It is used to estimate population parameters, such as means, variances, or proportions, based on a subset of the data.

Inferential statistics involves statistical techniques such as hypothesis testing, confidence intervals, and regression analysis. These techniques are used to test the validity of assumptions made about the population, based on the sample data. The results of inferential statistics can be used to make decisions, draw conclusions, and make predictions about the population.

In contrast to descriptive statistics, which is used to summarize and describe data, inferential statistics is concerned with making inferences and predictions about the population based on sample data. It is widely used in many fields, including business, economics, social sciences, and medicine, among others.

Why do we need a hypothesis in inferential statistics?

A hypothesis is a tentative explanation or prediction for a phenomenon or set of phenomena. It is an educated guess based on prior knowledge, observation, or theory, which can be tested through further investigation or experimentation.

In scientific research, a hypothesis is often formulated as an if-then statement, which outlines the expected relationship between variables. For example, “If the amount of sunlight increases, then the rate of photosynthesis in plants will also increase.”

Hypotheses are essential in the scientific method because they provide a framework for investigation and allow researchers to make predictions and test their ideas. If the hypothesis is supported by the evidence, it may become a theory, which is a well-supported and widely accepted explanation for a particular phenomenon.

It’s important to note that a hypothesis is not a definitive statement of fact, but rather a proposal that can be tested and potentially refuted through empirical evidence.

What is the null hypothesis?

In statistical hypothesis testing, the null hypothesis is a statement that assumes there is no significant difference or relationship between two or more variables. It is often denoted as H0 and is the starting point for the hypothesis testing process.

The null hypothesis is typically formulated as the opposite of the research or alternative hypothesis, which is the hypothesis that the researcher is trying to support or reject. For example, if the research hypothesis is that a new drug is more effective than an existing drug, the null hypothesis would be that there is no significant difference between the two drugs.

During hypothesis testing, statistical methods are used to determine whether the data support the null hypothesis or provide evidence to reject it in favour of the research hypothesis. If the data strongly support the research hypothesis, the null hypothesis can be rejected, and the researcher can conclude that there is a significant difference or relationship between the variables being studied.

It’s important to note that failing to reject the null hypothesis does not necessarily mean that the null hypothesis is true, but rather that there is insufficient evidence to support the research hypothesis. Therefore, further research and testing may be necessary to draw definitive conclusions.

What is the alternative hypothesis?

In statistical hypothesis testing, the alternative hypothesis is a statement that contradicts or opposes the null hypothesis. It represents the hypothesis that the researcher is trying to support, and is denoted as Ha.

The alternative hypothesis can take different forms, depending on the research question and the variables being studied. For example, if the research question is whether a new drug is more effective than an existing drug, the alternative hypothesis might be that the new drug is significantly more effective than the existing drug.

During hypothesis testing, statistical methods are used to determine whether the data supports the alternative hypothesis or the null hypothesis. If the data provide strong evidence in favour of the alternative hypothesis, the null hypothesis can be rejected, and the researcher can conclude that there is a significant difference or relationship between the variables being studied.

It’s important to note that the alternative hypothesis is not always the opposite of the null hypothesis, but rather represents a specific hypothesis that the researcher is trying to support. The choice of the alternative hypothesis depends on the research question and the goals of the study.

What is a p-value and what is its importance?

In statistical hypothesis testing, the p-value is the probability of obtaining a test statistic as extreme as, or more extreme than, the observed one, assuming that the null hypothesis is true. The p-value is used as a criterion for deciding whether to reject the null hypothesis or not.

In simpler terms, the p-value tells us how likely it is that our observed results occurred by chance alone. If the p-value is very small (typically less than 0.05 or 0.01), then we reject the null hypothesis and conclude that there is strong evidence to support the alternative hypothesis. Conversely, if the p-value is large, then we fail to reject the null hypothesis and conclude that there is not enough evidence to support the alternative hypothesis.

For example, if we conduct a hypothesis test to determine if a new drug is effective in reducing blood pressure, our null hypothesis might be that the drug has no effect, and the alternative hypothesis might be that the drug does have an effect. If the calculated p-value is 0.02, this means that there is a 2% chance of observing the results we did if the null hypothesis is true. If our significance level is 0.05, then we would reject the null hypothesis and conclude that the drug is effective in reducing blood pressure.
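As a minimal sketch of this idea, the two-tailed p-value for a test statistic can be computed from the standard normal CDF using only the Python standard library. The function name `two_tailed_p` is ours, introduced for illustration, not a library API.

```python
from statistics import NormalDist

def two_tailed_p(z: float) -> float:
    """Two-tailed p-value for a z statistic under the standard normal."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

# A z statistic of about 1.96 corresponds to p ≈ 0.05,
# the conventional significance threshold.
print(round(two_tailed_p(1.96), 3))  # → 0.05
```

The further the test statistic falls into the tails of the null distribution, the smaller the p-value becomes.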

Designing and performing experiments typically involves several key steps:

  1. Identify the research question or problem: This step involves clearly defining the research question or problem that the experiment aims to address.
  2. Formulate a hypothesis: The next step is to develop a hypothesis, which is a tentative explanation for the observed phenomenon that can be tested through experimentation.
  3. Design the experiment: This step involves developing a detailed plan for the experiment, including selecting the sample size, defining the experimental conditions, and identifying the variables to be measured.
  4. Collect data: In this step, data is collected by conducting the experiment and recording the results. It’s important to ensure that the data is collected accurately and consistently.
  5. Analyze the data: Once the data is collected, it must be analyzed to determine whether the results support or reject the hypothesis. This typically involves using statistical methods to analyze the data and test for significant differences between the groups being studied.
  6. Draw conclusions: Based on the results of the analysis, conclusions are drawn about the hypothesis, and the implications of the results are considered. If the hypothesis is supported, the conclusions may be used to inform further research or practical applications.
  7. Communicate the results: Finally, the results of the experiment are communicated to others, either through publication in a scientific journal or presentation at a conference or other venue. It’s important to clearly and accurately communicate the methods used, the results obtained, and the conclusions drawn from the experiment.

What is a confidence interval?

A confidence interval is a range of values that is likely to contain the true value of a population parameter, such as a population mean or proportion, based on a sample of data.

In other words, a confidence interval provides a measure of the uncertainty associated with a sample estimate of a population parameter. A wider interval indicates greater uncertainty, while a narrower interval indicates greater precision.

Confidence intervals are typically expressed as a range of values with an associated level of confidence. For example, a 95% confidence interval for a population mean would indicate that if we were to repeatedly sample from the population and construct a confidence interval for each sample, we would expect that 95% of those intervals would contain the true population mean.

The width of the confidence interval depends on several factors, including the sample size, the level of confidence, and the variability of the data. A larger sample size, a higher level of confidence, and lower variability typically result in a narrower confidence interval.

Confidence intervals can be calculated using statistical software or by hand using formulas that depend on the sample size and level of confidence. They are commonly used in research studies to provide an estimate of the precision of sample estimates of population parameters and to make inferences about the population based on the sample data.
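As a sketch of the calculation, a large-sample 95% confidence interval for a population mean can be built from the sample mean, the standard error, and the appropriate normal quantile. This uses only the standard library; the sample data and the function name `confidence_interval` are illustrative assumptions, and for small samples a t-based interval would be more appropriate.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def confidence_interval(data, confidence=0.95):
    """Large-sample z-based confidence interval for the population mean."""
    n = len(data)
    m = mean(data)
    se = stdev(data) / sqrt(n)                      # standard error of the mean
    z = NormalDist().inv_cdf((1 + confidence) / 2)  # e.g. ≈1.96 for 95%
    return m - z * se, m + z * se

sample = [400, 398, 402, 405, 395, 401, 399, 403, 397, 400]
low, high = confidence_interval(sample)
print(f"95% CI: ({low:.2f}, {high:.2f})")
```

Note how a larger `n` shrinks the standard error and therefore narrows the interval, matching the discussion above.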

What is the significance level or p-value?

The significance level (often denoted α, commonly 0.05) is a threshold chosen before the test is run, while the p-value is a probability computed from the data. Together they measure the strength of evidence against the null hypothesis in a statistical test.

In hypothesis testing, we begin with a null hypothesis, which is the assumption that there is no significant difference between the groups being compared or that the observed effect is due to chance. The alternative hypothesis is the opposite of the null hypothesis, and it represents the possibility of a real difference or effect.

The p-value is the probability of observing a test statistic as extreme as, or more extreme than, the one observed, assuming that the null hypothesis is true. If the p-value is very small, typically less than 0.05, we reject the null hypothesis in favour of the alternative hypothesis, concluding that the observed effect is unlikely to be due to chance alone.

Conversely, if the p-value is large, typically greater than 0.05, we fail to reject the null hypothesis and conclude that the observed effect is likely due to chance or random variation. The exact threshold for determining statistical significance depends on the study design and the level of risk associated with making a type I error (rejecting a true null hypothesis).

In summary, the significance value or p-value is a measure of the strength of evidence against the null hypothesis in a statistical test, and it helps to determine whether the observed effect is statistically significant or due to chance.

Four statistical tests are commonly used to perform such experiments:

  1. z-test
  2. t-test
  3. Chi-Square test
  4. ANOVA

Here are the details of each of them with examples below:

z-test

The z-test is a statistical test that is used to determine whether two population means are significantly different from each other, assuming that the population standard deviation is known. It is a parametric test that can be used for large sample sizes.

Here’s an example of how the z-test can be used:

Suppose a company wants to determine if the average weight of its cereal boxes is the same as the weight stated on the packaging. The company randomly selects a sample of 50 cereal boxes and weighs them, finding that the average weight is 400 grams; the population standard deviation is known to be 10 grams. The weight stated on the packaging is 395 grams.

To perform a z-test, we first need to calculate the test statistic, which is the number of standard errors between the sample mean and the population mean under the null hypothesis. The formula for the z-test statistic is:

z = (x̄ − μ) / (σ / √n)

where:
x̄ is the sample mean
μ is the population mean under the null hypothesis (i.e., 395 grams)
σ is the population standard deviation (i.e., 10 grams)
n is the sample size (i.e., 50)

Plugging in the values, we get:

z = (400 − 395) / (10 / √50) ≈ 3.54

Next, we need to determine the p-value associated with the test statistic. This can be done using a standard normal distribution table or a statistical software program. Assuming a two-tailed test at a significance level of 0.05, we find that the p-value is less than 0.05, indicating that the sample mean is significantly different from the population mean.

Therefore, we can reject the null hypothesis and conclude that the average weight of the cereal boxes is different from the weight stated on the packaging. The company may need to adjust its manufacturing process to ensure that the weight of the cereal boxes matches the weight stated on the packaging.
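The cereal-box example above can be sketched in a few lines of Python using only the standard library. The function name `z_test` is ours, introduced for illustration.

```python
from math import sqrt
from statistics import NormalDist

def z_test(sample_mean, pop_mean, pop_sd, n):
    """One-sample z-test: returns the z statistic and two-tailed p-value."""
    z = (sample_mean - pop_mean) / (pop_sd / sqrt(n))
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

# Cereal-box example from the text: x̄ = 400, μ = 395, σ = 10, n = 50
z, p = z_test(400, 395, 10, 50)
print(f"z = {z:.2f}, p = {p:.4f}")  # z ≈ 3.54, p well below 0.05
```

Since the p-value is far below the 0.05 significance level, the code reaches the same conclusion as the worked example.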

t-test

The t-test is a statistical test that is used to determine whether two groups of data are significantly different from each other in terms of their means. It is a parametric test that can be used for small sample sizes when the population standard deviation is unknown.

Here’s an example of how the t-test can be used:

Suppose a researcher wants to determine whether there is a significant difference in the mean height of male and female students in a college. The researcher randomly selects a sample of 10 male students and a sample of 10 female students and measures their heights. Suppose the male sample has a mean height of 175 cm, the female sample has a mean height of 165 cm, and the pooled standard deviation works out to 6.3 cm.

To perform a t-test, we first need to calculate the test statistic, which is the number of standard errors between the sample means under the null hypothesis. The formula for the t-test statistic is:

t = (x̄1 − x̄2) / (s_p × √(1/n1 + 1/n2))

where:
x̄1 and x̄2 are the sample means of the two groups
s_p is the pooled standard deviation, calculated as s_p = √[((n1 − 1)s1² + (n2 − 1)s2²) / (n1 + n2 − 2)]
n1 and n2 are the sample sizes of the two groups

Plugging in the values, we get:

t = (175 − 165) / (6.3 × √(1/10 + 1/10)) ≈ 3.55

Next, we need to determine the degrees of freedom and look up the critical value of t from a t-distribution table or a statistical software program. Assuming a two-tailed test at a significance level of 0.05 and 18 degrees of freedom (df = n1 + n2 − 2), we find that the critical value of t is ±2.101.

Finally, we compare the calculated t-value to the critical value of t. If the calculated t-value is greater than the critical value of t, we reject the null hypothesis and conclude that there is a significant difference between the means of the two groups. If the calculated t-value is less than the critical value of t, we fail to reject the null hypothesis and conclude that there is no significant difference between the means of the two groups.

In this example, the calculated t-value of 3.55 is greater than the critical value of 2.101, indicating that the difference in mean height between male and female students is statistically significant. Therefore, we can reject the null hypothesis and conclude that there is a significant difference in the mean height of male and female students in the college.
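Recomputing the statistic directly from the summary values in the example is a quick way to check the arithmetic. This sketch takes the group means, pooled standard deviation, and sample sizes as given above; the helper name `t_from_summary` is ours.

```python
from math import sqrt

def t_from_summary(m1, m2, s_p, n1, n2):
    """Two-sample t statistic from summary statistics (pooled SD given)."""
    return (m1 - m2) / (s_p * sqrt(1 / n1 + 1 / n2))

# Values from the example: male mean 175 cm, female mean 165 cm,
# pooled SD 6.3 cm, 10 students per group
t = t_from_summary(175, 165, 6.3, 10, 10)
print(f"t = {t:.2f}")  # compare against the critical value ±2.101 (df = 18)
```

With raw data instead of summary statistics, the pooled standard deviation would first be computed from the two sample variances as shown in the formula above.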

Chi-Square test

The chi-square test is a statistical test used to determine whether there is a significant association between two categorical variables. It is used to test whether there is a difference between the expected frequencies and the observed frequencies in a contingency table.

Here’s an example of how the chi-square test can be used:

Suppose a researcher wants to investigate whether there is an association between gender and smoking habits. The researcher selects a random sample of 200 individuals from a population and asks them whether they smoke or not. Suppose that of the 100 males, 30 smoke and 70 do not, while of the 100 females, 20 smoke and 80 do not.

To perform a chi-square test, we first need to calculate the test statistic, which is the sum of the squared differences between the observed and expected frequencies. The expected frequencies are calculated assuming that there is no association between the two variables and are calculated as:

expected frequency = (row total * column total) / grand total

Using these counts, each gender's expected frequencies are (100 × 50) / 200 = 25 smokers and (100 × 150) / 200 = 75 non-smokers.

The formula for calculating the chi-square test statistic is:

χ² = Σ[(O − E)² / E]

where:
O is the observed frequency
E is the expected frequency

Plugging in the values, we get:

χ² = [(30 − 25)² / 25] + [(70 − 75)² / 75] + [(20 − 25)² / 25] + [(80 − 75)² / 75] ≈ 2.67

Next, we need to determine the degrees of freedom and look up the critical value of chi-square from a chi-square distribution table or a statistical software program. At a significance level of 0.05 and 1 degree of freedom (df = (r − 1) × (c − 1)), we find that the critical value of chi-square is 3.84.

Finally, we compare the calculated chi-square value to the critical value of chi-square. If the calculated chi-square value is greater than the critical value of chi-square, we reject the null hypothesis and conclude that there is a significant association between the two variables. If the calculated chi-square value is less than the critical value of chi-square, we fail to reject the null hypothesis and conclude that there is no significant association between the two variables.

In this example, the calculated chi-square value of 2.67 is less than the critical value of 3.84, indicating that there is no significant association between gender and smoking habits. Therefore, we fail to reject the null hypothesis and conclude that there is no evidence to suggest that gender is associated with smoking habits in the population.
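The chi-square statistic for this example can be recomputed from the observed and expected counts in a couple of lines. The helper name `chi_square` is introduced here for illustration; it sums (O − E)²/E over the four cells of the contingency table.

```python
def chi_square(observed, expected):
    """Chi-square statistic: sum of (O − E)² / E over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Observed and expected counts from the example,
# ordered as male smokers, male non-smokers, female smokers, female non-smokers
observed = [30, 70, 20, 80]
expected = [25, 75, 25, 75]
chi2 = chi_square(observed, expected)
print(f"chi-square = {chi2:.2f}")  # compare against the critical value 3.84 (df = 1)
```

In practice the expected counts would be derived from the row and column totals, as in the formula above, rather than hard-coded.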

ANOVA

ANOVA (Analysis of Variance) is a statistical test that is used to analyze the difference between two or more groups of data. It measures whether the means of different groups are significantly different from each other or not.

For example, let’s say we want to determine if there is a significant difference in the mean height of three different plant species. We randomly select 10 plants from each species and measure their height in centimetres. The data is as follows:

Species A: 32, 35, 36, 37, 41, 43, 44, 45, 46, 48
Species B: 30, 32, 34, 36, 37, 39, 40, 41, 42, 43
Species C: 28, 30, 31, 33, 34, 35, 36, 38, 39, 41

To conduct an ANOVA test on this data, we first calculate the sum of squares for each group, as well as the total sum of squares. We then use these values to calculate the F-statistic, which measures the ratio of the variance between groups to the variance within groups. If the F-statistic is greater than the critical value, we can conclude that there is a significant difference in the means of the groups.

Here are the steps to conduct the ANOVA test:

  1. Calculate the mean height for each species:

Mean A = (32+35+36+37+41+43+44+45+46+48) / 10 = 40.7
Mean B = (30+32+34+36+37+39+40+41+42+43) / 10 = 37.4
Mean C = (28+30+31+33+34+35+36+38+39+41) / 10 = 34.5

2. Calculate the total sum of squares (SST):

SST = Σ(xi − x̄)² = (32 − 37.53)² + (35 − 37.53)² + … + (41 − 37.53)² = 779.47

where x̄ = 37.53 is the grand mean of all 30 observations.

3. Calculate the sum of squares between groups (SSB):

SSB = nA × (MA − M)² + nB × (MB − M)² + nC × (MC − M)² = 10 × (40.7 − 37.53)² + 10 × (37.4 − 37.53)² + 10 × (34.5 − 37.53)² = 192.47

4. Calculate the sum of squares within groups (SSW):

SSW = Σ(xi − Mi)² = (32 − 40.7)² + (35 − 40.7)² + … + (41 − 34.5)² = 587.00

5. Calculate the degrees of freedom for each component:

df(SST) = N − 1 = 29
df(SSB) = k − 1 = 2
df(SSW) = N − k = 27

6. Calculate the mean square for each component:

MSB = SSB / df(SSB) = 192.47 / 2 = 96.23
MSW = SSW / df(SSW) = 587.00 / 27 = 21.74

7. Calculate the F-statistic

To calculate the F-statistic in an ANOVA test, you need to divide the mean square between groups (MSB) by the mean square within groups (MSW):

F = MSB / MSW

Using the example data above, the F-statistic can be calculated as follows:

F = MSB / MSW
= 96.23 / 21.74
≈ 4.43

To determine whether this F-value is statistically significant, you would need to compare it to the critical F-value for the chosen significance level (alpha) and degrees of freedom, which is about 3.35 for df = 2 and 27 at alpha = 0.05. This critical value can be looked up in an F-distribution table or calculated using statistical software. If the calculated F-value is larger than the critical F-value, then we reject the null hypothesis and conclude that there is a significant difference between the means of the groups.
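The seven steps above can be collapsed into a short one-way ANOVA sketch that recomputes the F-statistic for the plant-height data using only the standard library. The function name `one_way_anova` is ours, introduced for illustration.

```python
from statistics import mean

def one_way_anova(*groups):
    """One-way ANOVA: returns (F statistic, df between, df within)."""
    all_data = [x for g in groups for x in g]
    grand_mean = mean(all_data)
    k, n = len(groups), len(all_data)
    # Between-group and within-group sums of squares
    ssb = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ssw = sum((x - mean(g)) ** 2 for g in groups for x in g)
    msb = ssb / (k - 1)   # mean square between groups
    msw = ssw / (n - k)   # mean square within groups
    return msb / msw, k - 1, n - k

a = [32, 35, 36, 37, 41, 43, 44, 45, 46, 48]
b = [30, 32, 34, 36, 37, 39, 40, 41, 42, 43]
c = [28, 30, 31, 33, 34, 35, 36, 38, 39, 41]
f_stat, df1, df2 = one_way_anova(a, b, c)
print(f"F({df1}, {df2}) = {f_stat:.2f}")
```

In a real analysis one would typically reach for a library routine such as SciPy's `f_oneway` rather than hand-rolling the sums of squares, but the explicit version mirrors the step-by-step calculation above.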

I hope this helps clarify the concepts of inferential statistics. Let me know if you have any suggestions.
