{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Identify Fraud from Enron Email Dataset\n", "\n", "In 2000, Enron was one of the largest companies in the United States. By 2002, it had collapsed into bankruptcy due to widespread corporate fraud. In the resulting federal investigation, a significant amount of typically confidential information entered the public record, including tens of thousands of emails and detailed financial data of top Enron executives. The Enron datasets, comprising emails and financial data, were made available to the public for research and analysis and can be downloaded from https://www.cs.cmu.edu/~./enron/.\n", "\n", "The goal of this project is to use *machine learning* to build a POI (Person of Interest) identifier based on the financial and email data made public. Here, 'person of interest' refers to a person charged with a crime in connection with the Enron scandal. \n", "\n", "The overall work done for this project can be divided into four parts, as is common in machine learning:\n", " \n", " 1. **Exploring the Enron Dataset:** Involves data cleaning, outlier removal, and analysis.\n", " \n", " 2. **Feature Processing of the Enron Dataset:** Includes creating, scaling, selecting, and transforming features. \n", " \n", " 3. **Choosing the Algorithm(s):** Multiple classification models are trained and tuned.\n", " \n", " 4. **Evaluation:** Involves validation and an overall performance check." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question 1: Summarize for us the goal of this project and how machine learning is useful in trying to accomplish it. As part of your answer, give some background on the dataset and how it can be used to answer the project question. 
Were there any outliers in the data when you got it, and how did you handle those?**\n", "\n", "The goal of the project was to identify Enron employees who may have committed fraud based on the public Enron financial and email dataset while exploring different machine learning algorithms and addressing various feature selection methods. \n", "\n", "The dataset had a total of 146 data points, and 18 of them were POIs in the original dataset. There are 20 features for each person in the dataset: 14 financial features and 6 email features. These features are analyzed and then fed into classification models. The classification models are then validated and compared to select the optimal classifier.\n", "\n", "Outliers were identified by visualizing the variables and then removed. This is described in the section titled 'Outlier Investigation & Analyzing the Features'." ] }, { "cell_type": "code", "execution_count": 222, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import pickle\n", "\n", "import numpy as np\n", "import pandas as pd\n", "from time import time\n", "\n", "from feature_format import featureFormat, targetFeatureSplit\n", "from tester import dump_classifier_and_data\n", "\n", "from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score\n", "from sklearn.model_selection import GridSearchCV\n", "\n", "from matplotlib import pyplot as plt\n", "import seaborn as sns\n", "sns.set_style('white')\n", "\n", "import warnings\n", "warnings.filterwarnings('ignore')\n", "\n", "# Load the dataset\n", "with open(\"final_project_dataset.pkl\", \"rb\") as data_file:\n", " data_dict = pickle.load(data_file)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## I. Exploring the Enron Dataset\n", "\n", "- The pickled Enron data is loaded as a `pandas` dataframe for easy analysis of the dataset.\n", "- The key, i.e., the Enron employee's name, is used as the index of the pandas dataframe." 
] }, { "cell_type": "code", "execution_count": 110, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "

| | bonus | deferral_payments | deferred_income | director_fees | email_address | exercised_stock_options | expenses | from_messages | from_poi_to_this_person | from_this_person_to_poi | ... | long_term_incentive | other | poi | restricted_stock | restricted_stock_deferred | salary | shared_receipt_with_poi | to_messages | total_payments | total_stock_value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| METTS MARK | 600000 | NaN | NaN | NaN | mark.metts@enron.com | NaN | 94299 | 29 | 38 | 1 | ... | NaN | 1740 | False | 585062 | NaN | 365788 | 702 | 807 | 1061827 | 585062 |
| BAXTER JOHN C | 1200000 | 1295738 | -1386055 | NaN | NaN | 6680544 | 11200 | NaN | NaN | NaN | ... | 1586055 | 2660303 | False | 3942714 | NaN | 267102 | NaN | NaN | 5634343 | 10623258 |
| ELLIOTT STEVEN | 350000 | NaN | -400729 | NaN | steven.elliott@enron.com | 4890344 | 78552 | NaN | NaN | NaN | ... | NaN | 12961 | False | 1788391 | NaN | 170941 | NaN | NaN | 211725 | 6678735 |
| CORDES WILLIAM R | NaN | NaN | NaN | NaN | bill.cordes@enron.com | 651850 | NaN | 12 | 10 | 0 | ... | NaN | NaN | False | 386335 | NaN | NaN | 58 | 764 | NaN | 1038185 |
| HANNON KEVIN P | 1500000 | NaN | -3117011 | NaN | kevin.hannon@enron.com | 5538001 | 34039 | 32 | 32 | 21 | ... | 1617011 | 11350 | True | 853064 | NaN | 243293 | 1035 | 1045 | 288682 | 6391065 |

5 rows × 21 columns
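
A minimal sketch of how a dictionary keyed by employee name could be turned into a table like the preview above. The two records in `sample_dict` are a hypothetical miniature (values copied from the rows shown); the real `data_dict` comes from the pickle file, and the exact loading code used for this project is not shown here, so `DataFrame.from_dict` is an assumption.

```python
import pandas as pd

# Hypothetical miniature of data_dict: employee name -> feature dictionary.
sample_dict = {
    "METTS MARK": {"salary": 365788, "bonus": 600000, "poi": False},
    "HANNON KEVIN P": {"salary": 243293, "bonus": 1500000, "poi": True},
}

# orient="index" turns each outer key (the employee name) into a row label,
# and each inner key (the feature name) into a column.
sample_df = pd.DataFrame.from_dict(sample_dict, orient="index")

print(sample_df.shape)
print(sample_df.loc["HANNON KEVIN P", "poi"])
```

With the name as the index, a single person can be looked up directly with `.loc`, which is convenient when inspecting suspected outliers.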

| | bonus | deferral_payments | deferred_income | director_fees | email_address | exercised_stock_options | expenses | from_messages | from_poi_to_this_person | from_this_person_to_poi | ... | long_term_incentive | other | poi | restricted_stock | restricted_stock_deferred | salary | shared_receipt_with_poi | to_messages | total_payments | total_stock_value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| METTS MARK | 600000.0 | NaN | NaN | NaN | NaN | NaN | 94299.0 | 29.0 | 38.0 | 1.0 | ... | NaN | 1740.0 | False | 585062.0 | NaN | 365788.0 | 702.0 | 807.0 | 1061827.0 | 585062.0 |
| BAXTER JOHN C | 1200000.0 | 1295738.0 | -1386055.0 | NaN | NaN | 6680544.0 | 11200.0 | NaN | NaN | NaN | ... | 1586055.0 | 2660303.0 | False | 3942714.0 | NaN | 267102.0 | NaN | NaN | 5634343.0 | 10623258.0 |
| ELLIOTT STEVEN | 350000.0 | NaN | -400729.0 | NaN | NaN | 4890344.0 | 78552.0 | NaN | NaN | NaN | ... | NaN | 12961.0 | False | 1788391.0 | NaN | 170941.0 | NaN | NaN | 211725.0 | 6678735.0 |
| CORDES WILLIAM R | NaN | NaN | NaN | NaN | NaN | 651850.0 | NaN | 12.0 | 10.0 | 0.0 | ... | NaN | NaN | False | 386335.0 | NaN | NaN | 58.0 | 764.0 | NaN | 1038185.0 |
| HANNON KEVIN P | 1500000.0 | NaN | -3117011.0 | NaN | NaN | 5538001.0 | 34039.0 | 32.0 | 32.0 | 21.0 | ... | 1617011.0 | 11350.0 | True | 853064.0 | NaN | 243293.0 | 1035.0 | 1045.0 | 288682.0 | 6391065.0 |

5 rows × 21 columns
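
In the preview above, the numeric columns appear as floats and `email_address` shows only `NaN`, which is consistent with coercing every column to a numeric type. The sketch below illustrates one way this could be done; it assumes missing entries are stored as the string `"NaN"`, and the frame `raw` and the list `cols_to_coerce` are hypothetical names, not the notebook's own code.

```python
import pandas as pd

# Hypothetical rows; missing values are assumed to be the string "NaN".
raw = pd.DataFrame(
    {
        "salary": [365788, "NaN", 243293],
        "bonus": [600000, "NaN", 1500000],
        "email_address": [
            "mark.metts@enron.com",
            "bill.cordes@enron.com",
            "kevin.hannon@enron.com",
        ],
        "poi": [False, False, True],
    },
    index=["METTS MARK", "CORDES WILLIAM R", "HANNON KEVIN P"],
)

# Columns to coerce; "poi" is left untouched so it stays boolean.
cols_to_coerce = ["salary", "bonus", "email_address"]

clean = raw.copy()
for col in cols_to_coerce:
    # errors="coerce" turns anything unparseable (including addresses) into NaN.
    clean[col] = pd.to_numeric(clean[col], errors="coerce")

# Count missing values per column after the conversion.
missing = clean.isna().sum()
print(missing)
```

Counting missing values per feature this way is a quick check of which columns are too sparse to be useful before any feature selection.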

| | Features_Selected | Features_score |
|---|---|---|
| 1 | deferred_income | 13.287587 |
| 2 | bonus | 12.438591 |
| 3 | salary | 12.225775 |
| 4 | exercised_stock_options | 11.166453 |
| 5 | fraction_mail_from_poi | 10.598733 |
| 6 | total_stock_value | 10.191784 |
| 7 | long_term_incentive | 10.164526 |
| 8 | bonus-to-salary_ratio | 9.869367 |
| 9 | total_payments | 9.361047 |
| 10 | other | 9.141458 |
| 11 | shared_receipt_with_poi | 8.649023 |
| 12 | loan_advances | 7.658627 |
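
A ranked table of selected features and scores like the one above can be produced with univariate feature selection. The sketch below uses scikit-learn's `SelectKBest` with the `f_classif` scoring function on synthetic data. Both the scoring function and the value of `k` are assumptions, since the selection code used for this project is not shown here, and the feature names `informative`, `noise_a`, and `noise_b` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in for the Enron features: 30 non-POIs and 10 POIs.
rng = np.random.RandomState(42)
n_samples = 40
labels = np.array([0] * 30 + [1] * 10)
features = pd.DataFrame(
    {
        # Shifted by the label, so it separates the two classes well.
        "informative": labels * 2.0 + rng.normal(0, 0.5, n_samples),
        # Unrelated to the label.
        "noise_a": rng.normal(0, 1, n_samples),
        "noise_b": rng.normal(0, 1, n_samples),
    }
)

# Score every feature against the label and keep the k best (k is assumed).
selector = SelectKBest(score_func=f_classif, k=2)
selector.fit(features, labels)

# Build a ranked table of the selected features and their scores.
scores = pd.DataFrame(
    {
        "Features_Selected": list(features.columns),
        "Features_score": selector.scores_,
    }
)
scores = scores[selector.get_support()]
scores = scores.sort_values("Features_score", ascending=False)
scores.index = range(1, len(scores) + 1)
print(scores)
```

Univariate scores rate each feature in isolation, so correlated features (for example, `total_stock_value` and `exercised_stock_options`) can both rank highly even if they carry overlapping information.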