Data Science 2: Data Preprocessing using Scikit-Learn
What is Data Preprocessing?
Data preprocessing is a data mining technique that involves transforming raw data into an understandable format. Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many errors. Data preprocessing is a proven method of resolving such issues.
Steps for Data Preprocessing
- Data Encoding
- Normalization
- Standardization
- Imputation of missing values
- Discretization
Dataset
The Iris dataset includes three iris species with 50 samples each, as well as several properties of each flower. One species is linearly separable from the other two; the other two are not linearly separable from each other.
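The Iris dataset described above ships with scikit-learn and can be loaded directly; a minimal sketch:

```python
from sklearn.datasets import load_iris

# Load the bundled Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

print(X.shape)            # 150 samples, 4 features
print(iris.target_names)  # the three species
```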
Data Encoding
Encoding is the transformation of categorical variables into binary or numerical counterparts. An example is treating gender as 1 for male and 0 for female. Many modeling methods require categorical variables to be encoded.
Label Encoding
Label encoding converts string labels into numeric form so that they become machine-readable. Machine learning algorithms can then better decide how to operate on those labels. It is an important preprocessing step for structured datasets in supervised learning.
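A short sketch of label encoding with scikit-learn's `LabelEncoder`, using a toy list of species names (note that the classes are assigned integers in sorted order):

```python
from sklearn.preprocessing import LabelEncoder

# Toy species labels to illustrate label encoding
species = ["setosa", "versicolor", "virginica", "setosa"]

le = LabelEncoder()
encoded = le.fit_transform(species)

print(encoded)      # integer codes, one per label
print(le.classes_)  # classes are stored in sorted order
```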
One Hot Encoding
One-hot encoding is another method of preparing categorical data for an algorithm. With one-hot encoding, each categorical value becomes a new column, and a binary value of 1 or 0 is assigned to those columns, so each category is represented as a binary vector.
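The idea can be sketched with scikit-learn's `OneHotEncoder` on a toy column of colors. The encoder expects a 2-D input and returns a sparse matrix by default, so `.toarray()` is used here to show the dense result:

```python
from sklearn.preprocessing import OneHotEncoder

# Toy categorical column (2-D input is required)
colors = [["red"], ["green"], ["blue"], ["green"]]

ohe = OneHotEncoder()
onehot = ohe.fit_transform(colors).toarray()

print(ohe.categories_)  # categories in sorted order: blue, green, red
print(onehot)           # one binary column per category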
Normalization
- Normalization is used to scale the values of an attribute so that they fall within a smaller range, such as -1.0 to 1.0 or 0.0 to 1.0. It is generally useful for classification algorithms. (This is distinct from database normalization, which organizes tables and their relationships to eliminate redundancy and inconsistent dependencies.)
- Normalization is generally required when attributes are on different scales; otherwise, an equally important attribute can be diluted in effectiveness because another attribute has values on a larger scale.
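Scaling an attribute into the 0.0 to 1.0 range can be sketched with scikit-learn's `MinMaxScaler` on a toy single-feature array:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy feature with values on a larger scale
X = np.array([[1.0], [5.0], [10.0]])

# Default feature_range is (0, 1)
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled.ravel())  # min maps to 0.0, max maps to 1.0
```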
Data Standardization
- Data standardization is the process of bringing data into a uniform format that allows analysts and others to research, analyze, and utilize the data. In statistics, standardization refers to the process of putting different variables on the same scale in order to compare scores between different types of variables.
- Standardization comes into the picture when features of the input dataset have large differences between their ranges, or simply when they are measured in different units. These differences in the ranges of the initial features cause trouble for many machine learning models.
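Putting features measured in different units onto the same scale can be sketched with scikit-learn's `StandardScaler`, which rescales each column to zero mean and unit variance; the two columns below are toy values standing in for differently-scaled measurements:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two toy features measured in very different units
X = np.array([[170.0, 70000.0],
              [160.0, 50000.0],
              [180.0, 60000.0]])

scaler = StandardScaler()
X_std = scaler.fit_transform(X)

print(X_std.mean(axis=0))  # approximately 0 per column
print(X_std.std(axis=0))   # 1 per column
```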
Handling Missing Values
Missing values need to be handled carefully because they reduce the quality of our performance metrics and predictions. No model can handle these NULL or NaN values on its own, so we need to deal with them. First, we need to check whether we have null values in our dataset. We can do that using the isnull() method.
Handling missing values is one of the greatest challenges faced by analysts, because making the right decision on how to handle them produces robust data models.
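The two steps above, checking with `isnull()` and then filling the gaps, can be sketched with a toy pandas frame and scikit-learn's `SimpleImputer` (here using the mean of the column as the replacement strategy):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with one missing value
df = pd.DataFrame({"sepal_length": [5.1, np.nan, 6.3]})

print(df.isnull().sum())  # count of NaNs per column

# Replace NaNs with the column mean
imputer = SimpleImputer(strategy="mean")
filled = imputer.fit_transform(df)

print(filled.ravel())  # NaN replaced by the mean of 5.1 and 6.3
```

Other strategies such as `"median"`, `"most_frequent"`, or `"constant"` may fit better depending on the distribution of the feature.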
Data Discretization
Data discretization is the process of converting continuous data into discrete buckets by grouping it. Discretized data is also easier to maintain, and training a model on discrete data can be faster and more effective than training on the continuous original.
There are three types of discretization transforms available in scikit-learn:
1. Quantile Discretization Transform
2. Uniform Discretization Transform
3. KMeans Discretization Transform
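All three transforms are provided by scikit-learn's `KBinsDiscretizer`, selected via its `strategy` parameter (`'quantile'`, `'uniform'`, or `'kmeans'`). A minimal sketch of the uniform variant on toy values:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Toy continuous feature
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])

# Uniform strategy: equal-width bins; 'quantile' and 'kmeans'
# are the other two strategies mentioned above
disc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
X_binned = disc.fit_transform(X)

print(X_binned.ravel())  # each value mapped to its bin index
```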
GitHub Link