In 2000, Enron was one of the largest companies in the United States. By 2002, it had collapsed into bankruptcy due to widespread corporate fraud. In the resulting federal investigation, a significant amount of typically confidential information entered the public record, including tens of thousands of emails and detailed financial data of top Enron executives. These email and financial datasets were made available to the public for research and analysis and can be downloaded from https://www.cs.cmu.edu/~./enron/.
The goal of this project is to use machine learning to build a POI (Person of Interest) identifier based on the financial and email data made public. Here, a 'person of interest' is someone who was charged with, or otherwise implicated in, the fraud at Enron.
The overall work done for this project can be divided into four parts, following the usual machine learning workflow:
1. Exploring the Enron dataset: data cleaning, outlier removal, and analysis.
2. Feature processing of the Enron dataset: creation, scaling, selection, and transformation of features.
3. Choosing the algorithm(s): multiple classification models are trained and tuned.
4. Evaluation: validation and an overall performance check.
Question 1: Summarize for us the goal of this project and how machine learning is useful in trying to accomplish it. As part of your answer, give some background on the dataset and how it can be used to answer the project question. Were there any outliers in the data when you got it, and how did you handle those?
The goal of the project was to identify Enron employees who may have committed fraud based on the public Enron financial and email dataset while exploring different machine learning algorithms and addressing various feature selection methods.
The dataset has a total of 146 data points, 18 of which are POIs. There are 20 features for each person: 14 financial features and 6 email features. These features are analyzed and then fed into classification models, which are then validated and compared to select the optimal classifier.
Outliers were removed with the help of visualization of variables. This has been described in the section titled 'Outlier Investigation & Analyzing the Features'.
import pickle
import numpy as np
import pandas as pd
from time import time
from feature_format import featureFormat, targetFeatureSplit
from tester import dump_classifier_and_data
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import GridSearchCV
from matplotlib import pyplot as plt
import seaborn as sns
sns.set_style('white')
import warnings
warnings.filterwarnings('ignore')
# Load the dataset
with open("final_project_dataset.pkl", "rb") as data_file:
data_dict = pickle.load(data_file)
The pickled data is converted to a pandas dataframe for easy analysis of the dataset.
# Converting the given pickled Enron data to a pandas dataframe.
enron_df = pd.DataFrame.from_records(list(data_dict.values()))
# Set the index of df to be the employees series:
employees = pd.Series(list(data_dict.keys()))
enron_df.set_index(employees, inplace=True)
enron_df.head()
print ("Size of the enron dataframe: ", enron_df.shape)
print ("Number of data points (people) in the dataset: ", len(enron_df))
print ("Number of Features in the Enron Dataset: ", len(enron_df.columns))
# Counting the number of POIs and non-POIs in the given dataset.
poi_count = enron_df.groupby('poi').size()
print ("Total number of POI's in the given dataset: ", poi_count.iloc[1])
print ("Total number of non-POI's in the given dataset: ", poi_count.iloc[0])
On loading the data into a DataFrame, the columns have string/object data types.
enron_df.dtypes
# Coercing the string values in the dataframe into numeric types (floats)
# for analysis; entries that cannot be parsed (e.g. the 'NaN' strings)
# become np.nan.
enron_df_new = enron_df.apply(lambda x: pd.to_numeric(x, errors='coerce')).copy()
enron_df_new.head()
# Dropping column 'email_address' as it is not required in analysis.
enron_df_new.drop('email_address', axis = 1, inplace = True)
# Checking the changed shape of df.
enron_df_new.shape
The features can be categorized as follows.

Financial features (in US dollars):
- salary
- deferral_payments
- total_payments
- loan_advances
- bonus
- restricted_stock_deferred
- deferred_income
- total_stock_value
- expenses
- exercised_stock_options
- other
- long_term_incentive
- restricted_stock
- director_fees

Email features (mostly counts of emails):
- to_messages
- email_address
- from_poi_to_this_person
- from_messages
- from_this_person_to_poi
- shared_receipt_with_poi

POI label (boolean):
- poi
### Bonus and Salary
Drawing a scatterplot of Bonus vs Salary of Enron employees.
plt.scatter(enron_df_new['salary'][enron_df_new['poi'] == True],
enron_df_new['bonus'][enron_df_new['poi'] == True],
color = 'r', label = 'POI')
plt.scatter(enron_df_new['salary'][enron_df_new['poi'] == False],
enron_df_new['bonus'][enron_df_new['poi'] == False],
color = 'b', label = 'Not-POI')
plt.xlabel("Salary")
plt.ylabel("Bonus")
plt.title("Scatterplot of Salary vs Bonus w.r.t POI")
plt.legend(loc='upper left')
plt.show()
It can be noted from the above figure that one non-POI point has very high values of salary and bonus. Checking for the concerned point.
# Finding the data point with the maximum salary (idxmax returns the row label)
enron_df_new['salary'].idxmax()
# Deleting the row 'Total' from the dataframe
enron_df_new.drop('TOTAL', axis = 0, inplace = True)
# Drawing scatterplot with the modified dataframe
plt.scatter(enron_df_new['salary'][enron_df_new['poi'] == True],
enron_df_new['bonus'][enron_df_new['poi'] == True],
color = 'r', label = 'POI')
plt.scatter(enron_df_new['salary'][enron_df_new['poi'] == False],
enron_df_new['bonus'][enron_df_new['poi'] == False],
color = 'b', label = 'Not-POI')
plt.xlabel("Salary")
plt.ylabel("Bonus")
plt.title("Scatterplot of Salary vs Bonus w.r.t POI")
plt.legend(loc='upper left')
plt.show()
From the above figure, it is observed that the data becomes more spread out and more comprehensible after the outlier removal. It's also observed that the bonuses of POIs are higher than those of non-POIs.
As the POIs were taking larger amounts of money as a bonus, in addition to their high salaries, it can be expected that the ratio of bonus to salary will be higher for POIs than for non-POIs. Hence, a new feature called bonus-to-salary_ratio is created in the hope that it may aid POI identification in the later parts of this project.
enron_df_new['bonus-to-salary_ratio'] = enron_df_new['bonus']/enron_df_new['salary']
From the enron61702insiderpay.pdf provided by findlaw.com, a data point named 'THE TRAVEL AGENCY IN THE PARK' was observed. It is known that Enron made up transactions with bogus companies and people. On inspecting the features of this data point, it can be considered an outlier, with very low values in all features except 'other' and 'total_payments'. Hence, it is removed.
# Features of the index 'THE TRAVEL AGENCY IN THE PARK'
enron_df_new.loc['THE TRAVEL AGENCY IN THE PARK']
# Deleting the row with index 'THE TRAVEL AGENCY IN THE PARK'
enron_df_new.drop('THE TRAVEL AGENCY IN THE PARK', axis = 0, inplace = True)
### deferred_income, deferral_payments and total_payments
According to http://www.businessdictionary.com/,
Deferred payment is "a loan arrangement in which the borrower is allowed to start making payments at some specified time in the future. Deferred payment arrangements are often used in retail settings where a person buys and receives an item with a commitment to begin making payments at a future date."
Deferred income (also known as deferred revenue, unearned revenue, or unearned income) is, in accrual accounting, money received for goods or services which have not yet been delivered. According to the revenue recognition principle, it is recorded as a liability until delivery is made, at which time it is converted into revenue.
As the Enron scam involved many undisclosed assets and cheated the public by selling assets to shell companies at the end of each month and repurchasing them at the start of the next month to hide the accounting losses, there is a chance that a lot of the company's deferred revenue was used by the POIs.
enron_df_new['deferred_income'].describe()
The deferred_income feature has mostly negative values as it is the money which has to be returned by the company.
# Finding out the integer index locations of POIs and non-POIs.
poi_rs = []
non_poi_rs = []
for i in range(len(enron_df_new['poi'])):
    if enron_df_new['poi'].iloc[i] == True:
        poi_rs.append(i+1)
    else:
        non_poi_rs.append(i+1)
print("Length of POI list: ", len(poi_rs))
print("Length of non-POI list: ", len(non_poi_rs))
Drawing a scatterplot of employees with deferred income.
# Since 'deferred_income' is negative, for intuitive understanding,
# a positive version of the variable is created for visualization.
enron_df_new['deferred_income_p'] = enron_df_new['deferred_income'] * -1
plt.scatter(non_poi_rs,
enron_df_new['deferred_income_p'][enron_df_new['poi'] == False],
color = 'b', label = 'Not-POI')
plt.scatter(poi_rs,
enron_df_new['deferred_income_p'][enron_df_new['poi'] == True],
color = 'r', label = 'POI')
plt.xlabel('Employees')
plt.ylabel('deferred_income')
plt.title("Scatterplot of Employees with deferred income")
plt.legend(loc='upper right')
plt.show()
The above scatterplot is not very helpful in either detecting outliers or finding patterns, as some POIs as well as non-POIs have high values of deferred income. However, a vague trend does suggest that POIs have higher deferred income.
Creating a scatterplot of total_payments vs deferral_payments w.r.t. POI.
# Scatterplot of total_payments vs deferral_payments w.r.t POI
plt.scatter(enron_df_new['total_payments'][enron_df_new['poi'] == False],
enron_df_new['deferral_payments'][enron_df_new['poi'] == False],
color = 'b', label = 'Not-POI')
plt.scatter(enron_df_new['total_payments'][enron_df_new['poi'] == True],
enron_df_new['deferral_payments'][enron_df_new['poi'] == True],
color = 'r', label = 'POI')
plt.xlabel('Total_payments')
plt.ylabel('deferral_payments')
plt.title("Scatterplot of total_payments vs deferral_payments w.r.t POI")
plt.legend(loc='upper right')
plt.show()
From the above scatterplot, it can be observed that the majority of POIs have meager deferral_payments compared to non-POIs. We can also observe two outliers: the one with a high value of total_payments is a POI, and the other, with a high value of deferral_payments, is a non-POI. Hence, the non-POI outlier is removed.
# Finding the non-POI employee with the maximum 'deferral_payments' (idxmax returns the row label)
enron_df_new['deferral_payments'].idxmax()
# Removing the non-POI employee having maximum 'deferral_payments'
enron_df_new.drop('FREVERT MARK A', axis = 0, inplace = True)
### long_term_incentive
Making a scatterplot to check the long_term_incentive of different Enron employees.
# Finding out the integer index locations of POIs and non-POIs
poi_rs = []
non_poi_rs = []
for i in range(len(enron_df_new['poi'])):
    if enron_df_new['poi'].iloc[i] == True:
        poi_rs.append(i+1)
    else:
        non_poi_rs.append(i+1)
# Making a scatterplot
plt.scatter(non_poi_rs,
enron_df_new['long_term_incentive'][enron_df_new['poi'] == False],
color = 'b', label = 'Not-POI')
plt.scatter(poi_rs,
enron_df_new['long_term_incentive'][enron_df_new['poi'] == True],
color = 'r', label = 'POI')
plt.xlabel('Employees')
plt.ylabel('long_term_incentive')
plt.title("Scatterplot of Employee Number with long_term_incentive")
plt.legend(loc='upper left')
plt.show()
One employee has a very high value of long_term_incentive, so this point is considered an outlier and removed.
enron_df_new['long_term_incentive'].idxmax()
enron_df_new.drop('MARTIN AMANDA K', axis = 0, inplace = True)
### restricted_stock and restricted_stock_deferred
# Scatterplot of restricted_stock vs restricted_stock_deferred w.r.t. POI
plt.scatter(enron_df_new['restricted_stock'][enron_df_new['poi'] == False],
enron_df_new['restricted_stock_deferred'][enron_df_new['poi'] == False],
color = 'b', label = 'Not-POI')
plt.scatter(enron_df_new['restricted_stock'][enron_df_new['poi'] == True],
enron_df_new['restricted_stock_deferred'][enron_df_new['poi'] == True],
color = 'r', label = 'POI')
plt.xlabel('restricted_stock')
plt.ylabel('restricted_stock_deferred')
plt.title("Scatterplot of restricted_stock vs 'restricted_stock_deferred' w.r.t POI")
plt.legend(loc='upper right')
plt.show()
enron_df_new['restricted_stock_deferred'].idxmax()
There is an outlier in the feature restricted_stock_deferred. Taking a quick look at its values, most are zeros and the remaining few are negative. The outlier found here is the Enron employee 'BHATNAGAR SANJAY', who is not a POI. Hence, this data point is removed. There is no notable observation on the other axis of this graph.
enron_df_new.drop('BHATNAGAR SANJAY', axis = 0, inplace = True)
### from_poi_to_this_person and from_this_person_to_poi
Given that the dataset also covers emails, it can be expected that, to pull off such a big scam, the POIs would have communicated frequently with one another via email. Hence, by checking the number of emails exchanged between POIs and an employee, we can gauge that employee's involvement in the scam.
plt.scatter(enron_df_new['from_poi_to_this_person'][enron_df_new['poi'] == False],
enron_df_new['from_this_person_to_poi'][enron_df_new['poi'] == False],
color = 'b', label = 'Not-POI')
plt.scatter(enron_df_new['from_poi_to_this_person'][enron_df_new['poi'] == True],
enron_df_new['from_this_person_to_poi'][enron_df_new['poi'] == True],
color = 'r', label = 'POI')
plt.xlabel('from_poi_to_this_person')
plt.ylabel('from_this_person_to_poi')
plt.title("Scatterplot of count of from and to mails between poi and this_person w.r.t POI")
plt.legend(loc='upper right')
plt.show()
This scatterplot shows the relationship between the counts of emails sent to and from different employees of Enron. I think a different feature showing the proportion of mail exchanged between employees and POIs will be more helpful in finding the POIs. As POIs are more likely to communicate with other POIs than with non-POIs, two new features are created.
### fraction_mail_from_poi and fraction_mail_to_poi
enron_df_new['fraction_mail_from_poi'] = enron_df_new['from_poi_to_this_person']/enron_df_new['from_messages']
enron_df_new['fraction_mail_to_poi'] = enron_df_new['from_this_person_to_poi']/enron_df_new['to_messages']
# Scatterplot of fraction of mails from and to between poi and this_person w.r.t POI
plt.scatter(enron_df_new['fraction_mail_from_poi'][enron_df_new['poi'] == False],
enron_df_new['fraction_mail_to_poi'][enron_df_new['poi'] == False],
color = 'b', label = 'Not-POI')
plt.scatter(enron_df_new['fraction_mail_from_poi'][enron_df_new['poi'] == True],
enron_df_new['fraction_mail_to_poi'][enron_df_new['poi'] == True],
color = 'r', label = 'POI')
plt.xlabel('fraction_mail_from_poi')
plt.ylabel('fraction_mail_to_poi')
plt.title("Scatterplot of fraction of mails between poi and this_person w.r.t POI")
plt.legend(loc='upper right')
plt.show()
From the above figure, POI and non-POI points can be distinguished much more clearly. The red dots representing POIs are more distinct, have higher values, and are more separated from the non-POI blue points.
Question 2: What features did you end up using in your POI identifier, and what selection process did you use to pick them? Did you have to do any scaling? Why or why not? As part of the assignment, you should attempt to engineer your own feature that does not come ready-made in the dataset – explain what feature you tried to make, and the rationale behind it.
A pipeline was created, and SelectKBest was tried over a range of feature counts with the different algorithms. Most of the algorithms performed best with 9 features, as determined by GridSearchCV.
Feature Preprocessing (including feature scaling) was done in this section.
Additional features were created during the exploratory data analysis: 'fraction_mail_from_poi', 'fraction_mail_to_poi' and 'bonus-to-salary_ratio'.
The features selected for the classifier using SelectKBest are described in this section.
# Clean all infinite values, which arise when a person's from_messages
# or to_messages is 0 (division by zero above).
enron_df_new = enron_df_new.replace([np.inf, -np.inf], 0)
enron_df_new = enron_df_new.fillna(0)
# Converting the above modified dataframe to a dictionary
enron_dict = enron_df_new.to_dict('index')
print("Features of modified data_dictionary:-")
print("Total number of datapoints: ",len(enron_dict))
print("Total number of features: ",len(enron_dict['METTS MARK']))
# Store to my_dataset for easy export below.
dataset = enron_dict
Out of all the features available (given and created above), the following 23 would be used. They can be divided into financial features, email features, and the target label.
15 Financial Features:
['salary', 'bonus', 'long_term_incentive', 'bonus-to-salary_ratio', 'expenses', 'restricted_stock_deferred', 'restricted_stock', 'deferred_income', 'total_payments', 'other', 'loan_advances', 'director_fees', 'exercised_stock_options', 'deferral_payments', 'total_stock_value']
7 Email Features: ['fraction_mail_from_poi', 'fraction_mail_to_poi', 'from_poi_to_this_person', 'from_this_person_to_poi', 'to_messages', 'from_messages', 'shared_receipt_with_poi']
POI: which is the target variable.
# Features_list is a list of strings, each of which is a feature name.
# The first feature must be "poi" (target variable).
features_list = ['poi', 'salary', 'bonus', 'long_term_incentive', 'bonus-to-salary_ratio', 'deferral_payments', 'expenses',
'restricted_stock_deferred', 'restricted_stock', 'deferred_income','fraction_mail_from_poi', 'total_payments',
'other', 'fraction_mail_to_poi', 'from_poi_to_this_person', 'from_this_person_to_poi', 'to_messages',
'from_messages', 'shared_receipt_with_poi', 'loan_advances', 'director_fees', 'exercised_stock_options',
'total_stock_value']
# Extract features and labels from dataset for local testing
data = featureFormat(dataset, features_list, sort_keys = True)
labels, features = targetFeatureSplit(data)
1. Feature Scaling: MinMaxScaler is used, which scales and translates each feature individually so that it lies within a given range on the training set, here between zero and one. Scaling is applied only with algorithms that involve distance measures (such as KNN), where features on larger scales would otherwise dominate (a short illustrative sketch follows this list).
2. Feature Selection: Feature selection/dimensionality reduction on sample sets is essential to improve estimators' accuracy, boost performance, and simplify the model. In this project, SelectKBest is used to find the 'k' best (highest-scoring) features. It takes as input a scoring function that returns univariate scores and p-values. Here, f_classif is used as the scoring function; it computes the ANOVA F-value between labels and features for classification tasks (a toy sketch also follows this list).
3. Pipeline: A Pipeline sequentially applies feature-processing steps such as scaling, selection, and classification. Combined with sklearn's GridSearchCV module, the search over a range of parameter values for every step of the pipeline is automated.
4. Principal Component Analysis (PCA): PCA was tried, but it did not improve f1, precision, or recall for the selected classification algorithms. Hence, it was not used and has not been described below to keep the notebook to the point.
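To make the scaling step concrete, here is a minimal standalone sketch with made-up numbers (not taken from the Enron data); each value x of a feature is mapped to (x - min) / (max - min):
from sklearn.preprocessing import MinMaxScaler
import numpy as np
# Hypothetical, illustrative values for a single salary-like feature.
demo_values = np.array([[50000.0], [250000.0], [1000000.0]])
demo_scaler = MinMaxScaler()
print(demo_scaler.fit_transform(demo_values).ravel())
# -> [0.         0.21052632 1.        ]
Similarly, here is a standalone sketch of SelectKBest with f_classif on toy data (none of the names below refer to the Enron features); the selector scores every feature and keeps the k highest-scoring ones:
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
# Toy data: 10 features, of which only 3 are informative.
X_toy, y_toy = make_classification(n_samples=100, n_features=10,
                                   n_informative=3, random_state=42)
skb_demo = SelectKBest(f_classif, k=3)
X_kept = skb_demo.fit_transform(X_toy, y_toy)
print(X_kept.shape)       # (100, 3): only the 3 best features remain
print(skb_demo.scores_)   # one ANOVA F-value per original feature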
# Split data into training and testing datasets
from sklearn.model_selection import train_test_split
features_train, features_test, \
labels_train, labels_test = train_test_split(features, labels,
                                             test_size=0.3, random_state=42)
# Stratified ShuffleSplit cross-validator
from sklearn.model_selection import StratifiedShuffleSplit
sss = StratifiedShuffleSplit(n_splits=100, test_size=0.3, random_state = 42)
# Importing modules for feature scaling and selection
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
# Defining features to be used via the pipeline
## 1. Feature scaling
scaler = MinMaxScaler()
## 2. Feature Selection
skb = SelectKBest(f_classif)
For this project, the following algorithms were selected.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
# Classifier 1: Logistic Regression
lr_clf = LogisticRegression()
pipeline = Pipeline(steps=[("SKB", skb), ("LogisticRegression", lr_clf)])
param_grid = {"SKB__k": range(9, 10),
'LogisticRegression__tol': [1e-2, 1e-3, 1e-4],
'LogisticRegression__penalty': ['l1', 'l2']
}
grid = GridSearchCV(pipeline, param_grid, verbose = 0, cv = sss, scoring = 'f1')
t0 = time()
# clf = clf.fit(features_train, labels_train)
grid.fit(features, labels)
print("Training Time: ", round(time()-t0, 3), "s")
# Best algorithm
clf = grid.best_estimator_
t0 = time()
# Refit the best algorithm:
clf.fit(features_train, labels_train)
prediction = clf.predict(features_test)
print("Testing time: ", round(time()-t0, 3), "s")
# Evaluation Measures
print("Accuracy of the Logistic Regression classifier: ", accuracy_score(labels_test, prediction))
print("Precision of the Logistic Regression classifier: ", precision_score(labels_test, prediction))
print("Recall of the Logistic Regression classifier: ", recall_score(labels_test, prediction))
print("f1-score of the Logistic Regression classifier: ", f1_score(labels_test, prediction))
# Classifier 2: KNN Classifier
clf_knn = KNeighborsClassifier()
sss = StratifiedShuffleSplit(n_splits=10, test_size=0.3, random_state = 42)
pipeline = Pipeline(steps = [("scaling", scaler), ("SKB", skb), ("knn",clf_knn)])
param_grid = {"SKB__k":[3,4,5,6,7,8,9,10,11,12,13,14,15, 16, 17, 18],
"knn__n_neighbors": [3,4,5,6,7,8,9,11,12,13,15],
}
grid = GridSearchCV(pipeline, param_grid, verbose = 0, cv = sss, scoring = 'f1')
t0 = time()
# clf = clf.fit(features_train, labels_train)
grid.fit(features, labels)
print("Training time: ", round(time()-t0, 3), "s")
# Best Algorithm
clf = grid.best_estimator_
t0 = time()
# Refit the best algorithm:
clf.fit(features_train, labels_train)
prediction = clf.predict(features_test)
print("Testing time: ", round(time()-t0, 3), "s")
# Evaluation measures
print("Accuracy of the KNN classifier: ", accuracy_score(labels_test, prediction))
print("Precision of the KNN classifier: ", precision_score(labels_test, prediction))
print("Recall of the KNN classifier: ", recall_score(labels_test, prediction))
print("f1-score of the KNN classifier: ", f1_score(labels_test, prediction))
## Classifier 3: Gaussian Naive Bayes (GaussianNB) classifier
clf_gnb = GaussianNB()
pipeline = Pipeline(steps = [("SKB", skb), ("NaiveBayes", clf_gnb)])
param_grid = {"SKB__k":[3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19]}
grid = GridSearchCV(pipeline, param_grid, verbose = 0, cv = sss, scoring = 'f1')
t0 = time()
grid.fit(features, labels)
print("Training time: ", round(time()-t0, 3), "s")
# Best Algorithm
clf = grid.best_estimator_
t0 = time()
# Refit the best algorithm:
clf.fit(features_train, labels_train)
prediction = clf.predict(features_test)
print("Testing time: ", round(time()-t0, 3), "s")
print("Accuracy of GaussianNB classifer is : ", accuracy_score(labels_test, prediction))
print("Precision of GaussianNB classifer is : ", precision_score(prediction, labels_test))
print("Recall of GaussianNB classifer is : ", recall_score(prediction, labels_test))
print("f1-score of GaussianNB classifer is : ", f1_score(prediction, labels_test))
Question 3: What algorithm did you end up using? What other one(s) did you try? How did model performance differ between algorithms?
In total, three algorithms were tried: Gaussian Naïve Bayes, Logistic Regression, and KNN (K-Nearest Neighbors). Gaussian Naïve Bayes was the best-performing model based on the f1-score and the minimum requirement of 0.33 for precision and recall. KNN came close, but Gaussian NB was chosen because of its higher precision.
Question 4: What does it mean to tune the parameters of an algorithm, and what can happen if you do not do this well? How did you tune the parameters of your particular algorithm?
The process of tuning the parameters involves setting the algorithm's parameters to optimal values that enable us to complete the machine learning task in the best possible way.
Not tuning correctly results in sub-optimal or poor performance of the algorithm and can make the whole machine learning task very time-consuming. Also, no algorithm comes pre-tuned to a particular dataset. Therefore, it is recommended to tune the algorithm iteratively until we obtain an evaluation we are satisfied with.
This project utilized three algorithms and used the GridSearchCV function to obtain the best parameters for them. Since Gaussian Naïve Bayes has no parameters to tune, none have been specified for it. However, for completeness, the tuning parameters for KNN are shown below.
param_grid = {"SKB__k":[3,4,5,6,7,8,9,10,11,12,13,14,15, 16, 17, 18],
"knn__n_neighbors": [3,4,5,6,7,8,9,11,12,13,15],
}
grid = GridSearchCV(pipeline, param_grid, ... scoring = 'f1')
### SelectKBest for GaussianNB
# Obtaining the boolean list showing selected features
features_selected_bool = grid.best_estimator_.named_steps['SKB'].get_support()
# Finding the features selected by SelectKBest
features_selected_list = [x for x, y in zip(features_list[1:], features_selected_bool) if y]
print("Total number of features selected by SelectKBest algorithm: ", len(features_selected_list))
# Finding the score of features
feature_scores = grid.best_estimator_.named_steps['SKB'].scores_
# Finding the score of features selected by selectKBest
feature_selected_scores = feature_scores[features_selected_bool]
# Creating a pandas dataframe and arranging the features based on their scores and ranking them
imp_features_df = pd.DataFrame({'Features_Selected':features_selected_list, 'Features_score':feature_selected_scores})
imp_features_df.sort_values('Features_score', ascending = False, inplace = True)
Rank = pd.Series(list(range(1, len(features_selected_list)+1)))
imp_features_df.set_index(Rank, inplace = True)
print("The following table shows the feature selected along with its corresponding scores.")
imp_features_df
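With the best estimator identified, the classifier, dataset, and feature list can be exported for tester.py using the dump_classifier_and_data helper imported at the top of the notebook (a one-line sketch; the pickle file names are fixed by the Udacity project's tester script):
# Export the tuned classifier, dataset, and feature list
# so that tester.py can reproduce the evaluation reported below.
dump_classifier_and_data(clf, dataset, features_list)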
Question 5: What is validation, and what’s a classic mistake you can make if you do it wrong? How did you validate your analysis?
Validation is performed to ensure that the machine learning model we have selected generalizes well. A classic mistake is over-fitting, where the model performs very well on the training dataset but significantly worse on the cross-validation and testing datasets, as illustrated by the sketch below.
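As a concrete illustration (a minimal sketch reusing the train/test split created earlier, with an illustrative n_neighbors=1 that is not one of the tuned models above):
from sklearn.neighbors import KNeighborsClassifier
# n_neighbors=1 memorizes the training set: training accuracy is
# (near-)perfect, while held-out accuracy is typically much lower.
overfit_clf = KNeighborsClassifier(n_neighbors=1)
overfit_clf.fit(features_train, labels_train)
print("Train accuracy: ", overfit_clf.score(features_train, labels_train))
print("Test accuracy: ", overfit_clf.score(features_test, labels_test))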
To overcome this mistake, we can perform cross-validation on the dataset. Although we could use the simple train_test_split technique, a better fit for our project is StratifiedShuffleSplit.
StratifiedShuffleSplit is used when there are few observations in the dataset being analyzed. This technique randomly shuffles the data while creating testing and training subsets, and the stratification also handles class imbalances in the data. This is important, especially since there are very few POIs in the data.
StratifiedShuffleSplit creates train/validation subsets (as per the code above, it will create 100 of them). Internally, GridSearchCV fits the models on the 100 training subsets and validates them on the 100 validation subsets.
In this project, training was set up to optimize precision and recall. Hence, the f1-score was used as the key measure of the algorithms' performance, since f1_score considers both precision and recall.
The metrics have been summarized below as they are later used to draw inferences from the study.
Accuracy is the ratio of correctly predicted observations to the total observations.
$Accuracy = \frac{TP + TN}{TP+FP+FN+TN}$
Precision is the ratio of correctly predicted positive observations to the total predicted positive observations.
$Precision = (\frac{TP}{TP + FP})$
Recall is the ratio of correctly predicted positive observations to all observations in the actual class.
$Recall = (\frac{TP}{TP + FN})$
F1-score is the harmonic mean of Precision and Recall. Therefore, this score takes both false positives and false negatives into account.
$F1 = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}$
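To make these formulas concrete, here is a worked example with purely hypothetical confusion-matrix counts (illustrative only, not this project's actual results):
# Hypothetical confusion-matrix counts: TP, FP, FN, TN (illustrative only).
TP, FP, FN, TN = 4, 2, 6, 30
accuracy = (TP + TN) / float(TP + FP + FN + TN)     # 34/42 = 0.81
precision = TP / float(TP + FP)                     # 4/6  = 0.67
recall = TP / float(TP + FN)                        # 4/10 = 0.40
f1 = 2 * precision * recall / (precision + recall)  # 0.50
print(accuracy, precision, recall, f1)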
Question 6: Give at least 2 evaluation metrics and your average performance for each of them. Explain an interpretation of your metrics that says something human-understandable about your algorithm’s performance.
Metric values obtained after running the tester.py file:

| Algorithm used | Accuracy | Precision | Recall | f1 score |
|---|---|---|---|---|
| Gaussian Naive Bayes | 0.852 | 0.480 | 0.387 | 0.428 |
Accuracy, although a crucial metric, can be misleading when dealing with imbalanced classes, in other words, when the data is skewed towards one class. This is the case with the Enron dataset: since there are many more non-POIs than POIs, simply guessing the more common class label for every point (not a very insightful strategy) still gets decent accuracy, as the baseline sketch below shows.
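For example, a minimal sketch reusing the cleaned dataframe from the outlier-removal steps above computes the accuracy of a degenerate 'classifier' that always predicts non-POI:
# Baseline: always predict the majority class (non-POI).
n_total = len(enron_df_new)
n_poi = int(enron_df_new['poi'].sum())
baseline_accuracy = (n_total - n_poi) / float(n_total)
print("Majority-class baseline accuracy: ", round(baseline_accuracy, 3))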
The precision obtained implies that whenever a POI gets flagged in the test set, there is a lot of confidence that it is likely to be a real POI and not a false alarm. On the other hand, the tradeoff is that sometimes real POIs are missed, since the classifier is effectively reluctant to flag edge cases.
A recall of 38.7% might seem low, but this metric is particularly insightful for the Enron case. Since we are dealing with a criminal situation, we want our classifier to err on the side of guessing guilty, applying higher levels of scrutiny, so that as many true POIs as possible get flagged, perhaps at the cost of identifying some innocent people along the way. Boosting the recall metric pushes the classifier towards correctly identifying every single POI; the tradeoff is that the algorithm will be biased towards overdoing it.