The fundamental purpose of this post is to explain the PCA algorithm step by step, in a way that makes it easy to understand what PCA actually does and how we can use it in a project or algorithm. Before proceeding, here is a quick overview of what we cover in this post:
What is PCA?
What is a Variance?
Why is Normalization Necessary for PCA?
Practical Examples of PCA
Code in Python
PCA is an unsupervised machine learning algorithm. It is mainly used for dimensionality reduction in datasets consisting of many variables that are highly or lightly correlated with each other, while retaining as much of the variation present in the dataset as possible. It is also a great exploratory data analysis tool for building predictive models.
It is often said that the more data we have, the more accurate our results will be. That is true as far as it goes, but quantity alone is not enough: we also need high-quality data to get better results. Dimensionality reduction is a technique where we reduce the number of columns, or features, in the dataset based on their relevance to the problem; the less relevant a feature is, the more likely it is to be removed from the dataset.
PCA performs a linear transformation on the data so that most of the variance or information in your high-dimensional dataset is captured by the first few principal components. The first principal component will capture the most variance, followed by the second principal component, and so on.
Each principal component is a linear combination of the original variables. Because all the principal components are orthogonal to each other, there is no redundant information. The total variance in the data is therefore the sum of the variances of the individual components, so you can decide how many principal components to keep according to the cumulative variance 'explained' by them.
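As a minimal sketch of how this choice can be made (using scikit-learn, with the built-in wine data as a stand-in for your own feature matrix), the cumulative explained variance ratio tells you how many components to keep:
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in feature matrix for illustration; substitute your own data
X = load_wine().data

# Standardize first, then fit PCA keeping all components
X_scaled = StandardScaler().fit_transform(X)
pca = PCA().fit(X_scaled)

# Smallest number of components whose cumulative explained variance reaches 95%
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.argmax(cumulative >= 0.95)) + 1
print(n_components, cumulative[n_components - 1])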
Some techniques other than PCA for dimensionality reduction include LDA, which stands for Linear Discriminant Analysis. (Related Blog: Introduction to Linear Discriminant Analysis in Supervised Learning)
In machine learning, variance is one of the most important factors that directly affect the accuracy of the output. When a model becomes too sensitive to the independent variables, it tries to find a relationship between every feature, which gives rise to problems like overfitting, or high variance. High variance lets too much noise into the model, and the results suffer. When we use principal component analysis for dimensionality reduction, the problem of overfitting gets addressed at the same time.
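As a rough illustration of this point, here is a minimal sketch using a synthetic scikit-learn dataset (all parameters here are chosen purely for illustration). It compares logistic regression on all twenty features with logistic regression on five principal components; on data with many redundant, noisy features, the reduced model often generalizes as well or better:
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 20 features, only 5 of which are truly informative
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_redundant=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Logistic regression on all features
full = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000, random_state=0))
full.fit(X_tr, y_tr)

# Logistic regression on five principal components
reduced = make_pipeline(StandardScaler(), PCA(n_components=5),
                        LogisticRegression(max_iter=1000, random_state=0))
reduced.fit(X_tr, y_tr)

print("all features:", full.score(X_te, y_te))
print("5 components:", reduced.score(X_te, y_te))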
Understanding PCA variance
Normalization is necessary to put every variable in proportion with the others. Models whose features are not scaled consistently tend to perform poorly compared to those whose features are scaled well, and PCA cannot extract meaningful components when two variables differ greatly in scale.
Consider two columns, one showing distance in metres and the other showing the same distance in kilometres: 1000 in the first column equals 1 in the second, but our model is unaware of this, so how could it find the accurate relationship between them? Scaling, or normalization, is therefore a necessary condition for our model to perform well. ( Also read: Introduction to Statistical Data Analysis )
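As a tiny sketch of this point (the numbers here are made up for illustration), standardizing both columns puts the metre and kilometre measurements on the same scale:
import numpy as np
from sklearn.preprocessing import StandardScaler

# The same three distances recorded twice: in metres and in kilometres
distances = np.array([[1000.0, 1.0],
                      [2000.0, 2.0],
                      [3000.0, 3.0]])

# After standardization both columns are identical, so the unit gap disappears
print(StandardScaler().fit_transform(distances))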
You have a dataset that includes measurements for different sensors on an engine (temperatures, pressures, emissions, and so on). While much of the data comes from a healthy engine, the sensors have also captured data from the engine when it needs maintenance. You cannot see any obvious abnormalities by looking at any individual sensor. However, by applying PCA, you can transform this data so that most variations in the sensor measurements are captured by a small number of principal components. It is easier to distinguish between a healthy and unhealthy engine by inspecting these principal components than by looking at the raw sensor data.
You have a dataset that includes measurements for different variables on wine (alcohol, ash, magnesium, and so on). You cannot see any obvious abnormalities by looking at any individual variable. However, by applying PCA, you can transform this data so that most variation in the measurements is captured by a small number of principal components. It is easier to distinguish between the different classes of wine by inspecting these principal components than by looking at the raw variable data.
Before implementing the PCA algorithm in Python, you first have to download the wine dataset. The source below contains the wine dataset file, so download it first to proceed.
Source: Wine.csv
First of all, we have to import some libraries and read the file with the help of pandas.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
from matplotlib.colors import ListedColormap

# Read the wine dataset and preview the first five rows
dataset = pd.read_csv('Wine.csv')
dataset.head()
Wine Dataset
Now that we have loaded the dataset into a pandas DataFrame, we split it into training and test sets, with the test set taking 0.2 of the data and the remainder used for training.
# Split into independent and dependent variables
x = dataset.iloc[:, 0:13].values
y = dataset.iloc[:, 13].values

# Split the dataset into a training set and a test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
The next step is to apply feature scaling to the training and test sets with the help of StandardScaler.
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
# Fit the scaler on the training set only, then apply it to both sets
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)
Next, we apply the PCA algorithm with two components, fit logistic regression to the training set, and predict the results.
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
# Project the scaled data onto the first two principal components
x_train = pca.fit_transform(x_train)
x_test = pca.transform(x_test)
explained_variance = pca.explained_variance_ratio_
# Fit logistic regression to the training set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=0)
classifier.fit(x_train, y_train)

# Predict the test set results
y_pred = classifier.predict(x_test)
print("accuracy score:", accuracy_score(y_test, y_pred))
As we can see, our accuracy score comes out to about 0.972, approximately 97%, which is good for predicting the test set results. After predicting, we visualize our training set results using the two components.
# Visualising the Training set results
X_set, y_set = x_train, y_train
X1, X2 = np.meshgrid(np.arange(start=X_set[:, 0].min() - 1,
                               stop=X_set[:, 0].max() + 1,
                               step=0.01),
                     np.arange(start=X_set[:, 1].min() - 1,
                               stop=X_set[:, 1].max() + 1,
                               step=0.01))
# Colour the decision regions predicted by the classifier
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green', 'blue')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
# Plot each class of the training set in its own colour
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c=ListedColormap(('red', 'green', 'blue'))(i), label=j)
plt.title('PCA using Logistic Regression (Training set)')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.legend()
plt.show()
PCA using logistic regression (training set)
Having visualized the results on the training set, we now do the same for the test set and check the accuracy using two components; a minimal sketch of the test-set plot follows below. Since we kept two components, the first principal component captures the largest share of the variation between the features, and the second captures the next largest share.
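For completeness, here is a minimal sketch of the test-set visualization; it reuses the variables defined above and simply feeds the test data into the same plotting code:
# Visualising the Test set results (same plotting code, test data instead)
X_set, y_set = x_test, y_test
X1, X2 = np.meshgrid(np.arange(start=X_set[:, 0].min() - 1,
                               stop=X_set[:, 0].max() + 1,
                               step=0.01),
                     np.arange(start=X_set[:, 1].min() - 1,
                               stop=X_set[:, 1].max() + 1,
                               step=0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green', 'blue')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c=ListedColormap(('red', 'green', 'blue'))(i), label=j)
plt.title('PCA using Logistic Regression (Test set)')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.legend()
plt.show()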
In the principal component space, you should be able to see your objects cluster in a meaningful way. In this blog, we covered a basic introduction to the PCA algorithm. More blogs are on the way, where you will learn PCA and other machine learning algorithms in depth. Keep reading and exploring Analytics Steps. Till then, Happy Reading!