BASIC STATISTICS CONCEPTS
We live in a world where data rules, and almost every industry wants to use data to solve business problems. It is the Data Scientist who makes sense of unstructured data.
Statistics is a broad field, and it’s one of the must-have skills on a Data Science career path. A handful of its concepts come up again and again in data analysis.
I’ve listed the essential ones that every Data Scientist must know. Go ahead, have a look.
- Statistical features
- Probability Distribution
- Dimensionality Reduction
- Bayesian Statistics
Let’s go through each of them.
Statistical features, such as the mean, median, minimum, maximum, and percentiles, are the measures data scientists use during data exploration when studying any dataset.
A Basic Box Plot:
When looking at the basic box plot, the minimum and maximum values represent the lower and upper ends of the data range. The line inside the box marks the median, which is preferred over the mean when the dataset contains outliers, because it is far less affected by them.
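To see why the median copes better with outliers, here is a minimal sketch using Python's standard library. The numbers are invented for illustration only.

```python
from statistics import mean, median

# Hypothetical sample: five typical values, then the same values plus one outlier.
values = [10, 12, 13, 14, 15]
with_outlier = values + [200]

mean_before, median_before = mean(values), median(values)
mean_after, median_after = mean(with_outlier), median(with_outlier)

# The single outlier drags the mean far upward, while the median barely moves.
print("mean:  ", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
```

The mean more than triples once the outlier is added, while the median shifts by only half a unit.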
A box plot neatly illustrates what we can do with basic statistical features:
- A short box indicates that many of the data points are similar, as there are many values in a small range.
- A median closer to the bottom of the box tells us that most of the data have lower values, whereas a median closer to the top means that most of the data have higher values.
- In other words, the data is skewed if the median isn’t in the middle of the box.
- Long whiskers indicate that the variance and the standard deviation are high. If the data varies widely in only one direction, the long whisker appears only on that side of the box.
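The readings above can be checked numerically. As a rough sketch, the five values a box plot draws can be computed with Python's standard library. The sample data and the function name `five_number_summary` are made up for illustration, and the whiskers are assumed to reach the minimum and maximum, as described earlier.

```python
from statistics import quantiles

def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    q1, q2, q3 = quantiles(data, n=4, method="inclusive")
    return min(data), q1, q2, q3, max(data)

# Hypothetical sample with a long tail of large values.
sample = [1, 2, 2, 3, 3, 3, 4, 5, 9, 15]
low, q1, med, q3, high = five_number_summary(sample)

box_height = q3 - q1        # interquartile range: a small value means a short box
lower_part = med - q1       # distance from the bottom of the box to the median
upper_part = q3 - med       # distance from the median to the top of the box
lower_whisker = q1 - low    # whisker lengths, assuming they reach min and max
upper_whisker = high - q3

print(low, q1, med, q3, high)
print("median nearer the bottom of the box:", lower_part < upper_part)
print("upper whisker longer:", upper_whisker > lower_whisker)
```

In this sample the median sits nearer the bottom of the box and the upper whisker is much longer, which matches the skewed-data case in the list.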
A probability distribution describes how likely each possible value of a random variable is.
Suppose you draw a random sample and measure the heights of the subjects. Those measurements form a distribution of heights.
This type of distribution is useful when you need to know which outcomes are most likely, the spread of potential values, and the likelihood of different results.
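As an illustration of the heights example, the sketch below simulates a sample and turns it into a simple empirical distribution. The heights are generated, not real measurements, and the normal model (mean 170 cm, standard deviation 10 cm) and the 10 cm bins are assumptions made only for this example.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical heights in cm, drawn from an assumed normal model.
heights = [random.gauss(170, 10) for _ in range(1000)]

# Group heights into 10 cm bins and convert counts to relative frequencies.
bins = Counter(int(h // 10) * 10 for h in heights)
distribution = {b: count / len(heights) for b, count in sorted(bins.items())}

# The bin with the highest relative frequency is the most likely outcome.
most_likely_bin = max(distribution, key=distribution.get)

for start, share in distribution.items():
    print(f"{start}-{start + 10} cm: {share:.3f}")
```

The relative frequencies add up to 1, and reading them off answers the three questions above: the most likely range, the spread, and the likelihood of each range.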
In a Uniform Distribution, every value within a defined range is equally likely, and values outside that range have zero probability. It can be thought of as an “on or off” distribution, much like a categorical variable with two categories: 0 or the value.
A categorical variable can have multiple values other than 0. However, it can still be visualized in the same way, as a piecewise function of multiple uniform distributions.
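A minimal sketch of the density of a continuous uniform distribution follows. The function name `uniform_pdf` and the range 2 to 6 are hypothetical choices for illustration.

```python
def uniform_pdf(x, low, high):
    """Density of a continuous uniform distribution on [low, high]."""
    if high <= low:
        raise ValueError("high must be greater than low")
    # Constant density inside the range ("on"), zero outside it ("off").
    return 1.0 / (high - low) if low <= x <= high else 0.0

inside = uniform_pdf(3, 2, 6)    # any point inside the range gets the same density
outside = uniform_pdf(7, 2, 6)   # points outside the range get zero
print(inside, outside)
```

The density inside the range is 1 divided by the width of the range, so the total area under it is 1.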
A Normal Distribution, commonly referred to as a Gaussian Distribution, is fully specified by its mean and standard deviation. The important distinction from other distributions (such as Poisson) is that it is symmetric: the spread is the same on both sides of the mean.
Therefore, when data follows a Gaussian Distribution, the mean and standard deviation tell us both the average of the dataset and how widely it is spread.
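The symmetry can be checked with `statistics.NormalDist` from Python's standard library. This is only a sketch, and the mean of 170 and standard deviation of 10 are assumed values.

```python
from statistics import NormalDist

# Assumed parameters for illustration: mean 170, standard deviation 10.
dist = NormalDist(mu=170, sigma=10)

# Symmetry: points equally far below and above the mean have the same density.
left = dist.pdf(160)
right = dist.pdf(180)

# Share of values that fall within one standard deviation of the mean.
within_one_sd = dist.cdf(180) - dist.cdf(160)

print(left, right)
print(round(within_one_sd, 4))
```

Roughly 68% of the values lie within one standard deviation of the mean, whatever mean and standard deviation are chosen.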
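A Poisson distribution, which models counts of events, resembles a Normal Distribution with an added factor of skewness. When the skewness is low, it spreads fairly evenly on both sides, much like the Normal. When the skewness is high, the data is widely spread in one direction and highly concentrated in the other.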
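For reference, the Poisson probability of observing k events at an average rate lam is lam^k * e^(-lam) / k!, and its skewness is 1 / sqrt(lam). The sketch below computes both. The function names and the rate values are chosen here for illustration.

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k events when the average rate is lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

def poisson_skewness(lam):
    """Skewness of a Poisson distribution with rate lam."""
    return 1 / math.sqrt(lam)

# A low rate gives a strongly skewed distribution; a high rate looks closer to Normal.
for lam in (1.0, 4.0, 100.0):
    print(lam, round(poisson_skewness(lam), 3))

# Probabilities of 0 to 5 events at an assumed rate of 2 events per interval.
probabilities = [poisson_pmf(k, 2.0) for k in range(6)]
print([round(p, 3) for p in probabilities])
```

As the rate grows, the skewness shrinks toward zero, which is why a Poisson distribution with a large rate looks much like a Normal one.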
In dimensionality reduction, the following two approaches are used:
Feature Selection: Tries to find a subset of the original set of variables, or features, so that a smaller set can be used to model the problem.
Feature selection is usually done in one of three ways: filter, wrapper, or embedded methods.
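One simple way to pick a subset of features is to drop columns that barely vary. The sketch below assumes that approach (a variance threshold, often classed as a filter method). It is just one option, and the function name, column names, and threshold are hypothetical.

```python
from statistics import pvariance

def select_by_variance(columns, threshold):
    """Keep the named columns whose population variance exceeds threshold."""
    return {
        name: values
        for name, values in columns.items()
        if pvariance(values) > threshold
    }

# Hypothetical dataset stored column-wise.
data = {
    "age": [23, 35, 47, 52, 61],
    "height_cm": [170, 165, 180, 175, 160],
    "constant_flag": [1, 1, 1, 1, 1],  # never changes, so it carries no information
}

selected = select_by_variance(data, threshold=0.0)
print(list(selected))
```

The constant column is dropped because it cannot help distinguish one row from another.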
Feature Extraction: Reduces data in a high-dimensional space to a lower-dimensional space, that is, a space with fewer dimensions.
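Principal component analysis (PCA) is one widely used feature extraction technique. The text above does not name a specific method, so treat it as an assumed example. The sketch reduces 2-D points to 1-D by projecting them onto the direction of greatest variance, using the closed-form solution for a 2x2 covariance matrix. The function name and sample points are hypothetical.

```python
import math
from statistics import mean

def pca_2d_to_1d(points):
    """Project 2-D points onto their first principal component."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = mean(xs), mean(ys)
    n = len(points)
    # Entries of the 2x2 sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    # Angle of the leading eigenvector of a symmetric 2x2 matrix.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    ux, uy = math.cos(theta), math.sin(theta)
    # Coordinate of each centered point along that direction.
    return [(x - mx) * ux + (y - my) * uy for x, y in points]

# Hypothetical points lying exactly on the line y = 2x.
points = [(0, 0), (1, 2), (2, 4), (3, 6)]
projected = pca_2d_to_1d(points)
print([round(v, 3) for v in projected])
```

Because these points lie on a straight line, one dimension keeps all of the information: distances between the projected values equal the distances between the original points.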
Bayesian Statistics describes uncertainty with the help of conditional probability. Decisions are made by combining the probability of two things: the information already available and the new evidence.
Bayes’ Theorem, which is an essential part of Bayesian Statistics, is defined as follows.
P(a|b) = P(b|a) × P(a) / P(b)   [P(b) ≠ 0]
‘a’ is the event whose probability we want to find.
‘b’ is the event that has occurred already.
P(a|b) – Probability of a given b
P(b|a) – Probability of b given a
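A small worked example may help. The sketch below applies the theorem to a made-up screening test: the event a is "has the condition" and the event b is "tests positive". All the probabilities are hypothetical, and the function name `bayes_posterior` is chosen here for illustration.

```python
def bayes_posterior(p_b_given_a, p_a, p_b):
    """Return the probability of a given b; P(b) must be non-zero."""
    if p_b == 0:
        raise ValueError("P(b) must be non-zero")
    return p_b_given_a * p_a / p_b

# Hypothetical numbers, not real data.
p_a = 0.01              # prior: 1% of people have the condition
p_b_given_a = 0.95      # the test is positive for 95% of those who have it
p_b_given_not_a = 0.05  # the test is also positive for 5% of those who do not

# Overall probability of a positive test (law of total probability).
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

posterior = bayes_posterior(p_b_given_a, p_a, p_b)
print(round(posterior, 3))
```

Even with a fairly accurate test, the probability of having the condition after a positive result is only about 16%, because the condition is rare to begin with.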
Bayesian Statistics is used in various sectors (finance, healthcare, insurance, etc.).
Recommended books to get you started:
- The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani, and Jerome Friedman
- Think Stats by Allen B. Downey
- Introduction to Bayesian Statistics by William M. Bolstad
- Statistics for Data Science by James D. Miller