Data Science and Statistics



This article explains the statistical ideas used in data science. It is worth noting that data science is not a new idea, and most statisticians are capable of working as data scientists. 

Data science borrows many principles from statistics, because statistics is one of the best tools for processing and interpreting data. Statistical techniques can help you extract a great deal of information from a data set. 

You should study everything you can about statistics if you want to understand data science and become an expert in the area. 

While there are numerous areas in statistics that a data scientist should be familiar with, the following are the most crucial: 


  1. Descriptive statistics, which summarize and describe a data set. 
  2. Inferential statistics, which are used to make inferences about a population from the data. 

Descriptive statistics is the practice of describing or presenting data in a way that makes it easier to understand. 

It summarizes the data quantitatively using numerical measures or graphs. 


The following are some of the subjects you should study: 


  1. Normal Distribution 
  2. Central Tendency 
  3. Variability 
  4. Kurtosis 



Normal Distribution 


A normal distribution, also known as a Gaussian distribution, is a continuous distribution often used in statistics. A data set that follows a normal distribution spreads across a graph as a bell-shaped curve. 

In a normal distribution, the data points peak at the center of the bell-shaped curve, which represents the mean of the data set. 

As data points move away from the mean, they fall toward the tails of the curve. Many inference techniques assume normality, so you should check that your data is approximately normally distributed before making inferences from it.
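
As a rough illustration (assuming Python with NumPy and SciPy, neither of which the text itself specifies), you can generate normally distributed data and check its shape:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(seed=42)

    # Draw 10,000 points from a normal distribution with mean 50, std dev 5.
    data = rng.normal(loc=50, scale=5, size=10_000)

    # The sample mean and standard deviation should be close to 50 and 5.
    print(f"mean = {data.mean():.2f}, std = {data.std():.2f}")

    # D'Agostino-Pearson normality test: a large p-value means the data
    # is consistent with a normal distribution.
    statistic, p_value = stats.normaltest(data)
    print(f"normality test p-value = {p_value:.3f}")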


Central Tendency 


Measures of central tendency help determine the center of a data set. The three most widely used measures are the mean, median, and mode. The mean, or arithmetic average, represents the central value of a distribution. 


The following formula may be used to compute the data set's mean or average: 

mean = (sum of all data points) / (number of data points) 

Another measure is the median, which is the midpoint of the data set when the points are sorted in ascending order. 

With an odd number of values, the midpoint is simply the middle value; with an even number of data points, you take the average of the two values in the middle of the data set. 

The last measure is the mode, whose value is the data point that appears most often in the data set. 
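
Here is a minimal sketch of all three measures using Python's standard library (the data values are made up for illustration):

    import statistics

    data = [2, 3, 3, 5, 7, 9, 9, 9, 11]

    # Mean: sum of all data points divided by the number of data points.
    mean = sum(data) / len(data)

    # Median: the middle value once the points are sorted ascending;
    # with an even count, the average of the two middle values is used.
    median = statistics.median(data)

    # Mode: the value that appears most often in the data set.
    mode = statistics.mode(data)

    print(mean, median, mode)  # 6.44..., 7, 9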


Variability 


Variability measures how far the data points in a data set lie from the average, or mean, of those points. 

It also reflects how much the data points differ from one another. Variability can be viewed and assessed using measures of spread such as the range, variance, and standard deviation. 

The range is the difference between the data set's lowest and highest values. 
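
These three measures of spread can be sketched in Python as follows (again with made-up values):

    import statistics

    data = [4, 8, 6, 5, 3, 7, 9]

    # Range: difference between the highest and lowest values.
    value_range = max(data) - min(data)  # 9 - 3 = 6

    # Sample variance: average squared distance from the mean.
    variance = statistics.variance(data)

    # Standard deviation: the square root of the variance,
    # expressed in the same units as the data itself.
    std_dev = statistics.stdev(data)

    print(value_range, variance, std_dev)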


Skewness and Kurtosis 


The skewness of a data set helps you figure out how symmetrical it is. If the data is spread evenly, it takes the shape of a symmetric bell curve and is not skewed. 

If the distribution has a longer tail on the right side, the data is positively skewed; a longer tail on the left means it is negatively skewed. In either case, the bulk of the data sits on one side of the measures of central tendency. 

Kurtosis is a measure that describes the tails of a distribution. By plotting the data points on a graph and comparing the tails against the center of the distribution, you can tell whether the data is light-tailed or heavy-tailed. 
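
Assuming SciPy is available, both quantities can be computed directly; the two synthetic data sets below are chosen to contrast a symmetric shape with a right-tailed one:

    import numpy as np
    from scipy.stats import skew, kurtosis

    rng = np.random.default_rng(seed=0)

    symmetric = rng.normal(size=5_000)          # roughly bell-shaped
    right_tailed = rng.exponential(size=5_000)  # long tail on the right

    # Skewness near 0 means symmetric; positive means a longer right tail.
    print(skew(symmetric), skew(right_tailed))

    # SciPy reports excess kurtosis by default: 0 for a normal
    # distribution, positive for heavy tails, negative for light tails.
    print(kurtosis(symmetric), kurtosis(right_tailed))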


Statistical Inference 


  • Descriptive statistics give information about the data itself, while inferential statistics are used to draw insights from it. 
  • Inferential statistics is concerned with drawing conclusions about a large population from a small sample of data. 


Assume you're trying to figure out how many individuals in Africa have received the polio vaccine. 

This analysis can be carried out in two ways: ask every person in Africa whether they have received the polio vaccine, or take a sample of people from across the continent, make sure they come from different regions, and extrapolate the results to the entire continent. 

The first approach is difficult, if not impossible, to carry out: you cannot walk across the entire continent asking every person whether they have been vaccinated. 

The second technique is preferable, since it allows you to draw conclusions from the sample you've chosen and extrapolate the results to the larger population. 
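
A toy simulation of the second approach (the population size and vaccination rate below are invented purely for illustration):

    import numpy as np

    rng = np.random.default_rng(seed=1)

    # Hypothetical population: True means vaccinated, False means not.
    true_rate = 0.72  # assumed for the example, not a real figure
    population = rng.random(1_000_000) < true_rate

    # Survey a random sample instead of every individual.
    sample = rng.choice(population, size=2_000, replace=False)

    # The sample proportion serves as the estimate for the population.
    print(f"estimated rate = {sample.mean():.3f} (true rate = {true_rate})")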



The following are some inferential statistics tools:




The Central Limit Theorem 


"The average of the sample equals the average of the total population," says the central limit theorem. This demonstrates that the sample and population have the same features and measurements of the data's central tendency, such as standard deviation. 

This implies that the more data points you collect, the closer the distribution of sample averages gets to a normal curve. 

If you wish to apply the central limit theorem, you must first grasp the idea of a confidence interval, which gives a rough estimate of the range in which the population's mean value lies. 

The interval is constructed by adding and subtracting a margin of error around the sample mean. This margin of error is determined by multiplying the standard error of the mean by the z-score corresponding to the chosen confidence level. 
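
Putting this together, a 95% confidence interval for the population mean can be sketched as follows (the sample values are made up, and 1.96 is the z-score for a 95% confidence level):

    import math
    import statistics

    sample = [52, 48, 50, 47, 53, 51, 49, 50, 54, 46]

    n = len(sample)
    sample_mean = statistics.mean(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)

    # Margin of error = z-score for the confidence level * standard error.
    z = 1.96
    margin_of_error = z * standard_error

    # The interval is the sample mean plus or minus the margin of error.
    low = sample_mean - margin_of_error
    high = sample_mean + margin_of_error
    print(f"95% CI for the population mean: ({low:.2f}, {high:.2f})")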


Testing Hypotheses 


Hypothesis testing is the process of testing an assumption you make about a data set. With this kind of testing, you can evaluate your hypothesis on a smaller sample before generalizing the findings. 

The null hypothesis is the claim you actually test, and it is compared against the alternative hypothesis to see whether it holds. 

Consider the following scenario: you're conducting a survey to find out who smokes and who doesn't, as well as how smoking relates to cancer. 

When conducting this survey, you assume that the number of cancer patients who smoke is equal to the number of cancer patients who do not smoke. This is your null hypothesis, and you test it to decide whether it can be rejected. 

The alternative hypothesis is that the number of cancer patients who smoke is higher than the number who do not. Using the data and evidence you collect, you can then test whether the null hypothesis holds or should be rejected in favor of the alternative. 
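
One common way to test a hypothesis like this is a chi-square test on a contingency table; the counts below are invented for illustration, and SciPy is assumed:

    from scipy.stats import chi2_contingency

    # Rows: smokers, non-smokers. Columns: cancer, no cancer.
    # (Made-up counts.)
    observed = [[90, 910],
                [60, 940]]

    chi2, p_value, dof, expected = chi2_contingency(observed)

    # A small p-value (e.g. below 0.05) is evidence against the null
    # hypothesis that cancer rates are the same in both groups.
    print(f"chi-square = {chi2:.2f}, p-value = {p_value:.4f}")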


ANOVA (Analysis of Variance) 


ANOVA is another statistical technique used to test hypotheses across several groups of data. It helps determine whether the groups under consideration have the same average. 

Compared with running many separate pairwise tests, ANOVA keeps the overall error rate low. The test is computed using the F-ratio. 

The F-ratio is the ratio of the mean square between groups to the mean square within groups: 

F = (mean square between groups) / (mean square within groups) 


The procedure for carrying out an ANOVA is as follows: 


1. Write down the hypotheses and explain why they are needed. Every study should include a null and an alternative hypothesis. 

2. The null hypothesis states that the averages of all the groups are the same. 

3. The alternative hypothesis states that at least one group's average is different.
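
Assuming SciPy, a one-way ANOVA over three made-up groups looks like this; f_oneway returns the F-ratio together with its p-value:

    from scipy.stats import f_oneway

    # Three groups of made-up measurements.
    group_a = [23, 25, 21, 22, 24]
    group_b = [30, 28, 27, 31, 29]
    group_c = [24, 26, 22, 25, 23]

    # f_oneway computes the F-ratio: the mean square between groups
    # divided by the mean square within groups.
    f_ratio, p_value = f_oneway(group_a, group_b, group_c)

    # A small p-value suggests at least one group's average differs.
    print(f"F = {f_ratio:.2f}, p = {p_value:.4f}")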