
Email Insights from Data Science – Part 2


Analyzing Email Contents

In Part 1 of this series I demonstrated a method for extracting email contents from a proprietary repository in preparation for analysis and further data exploration.  In this part I will focus on analysis and rating of the extracted information to determine usability for building a supervised modeling dataset.

Currently, the data we retrieved from the Enron repository is still in a raw, but mostly clean and filtered, form.  This means the dataset is unstructured and not yet focused on the tasks we are solving for.  Since our goals are to classify the email contents to determine overall company sentiment (negative/positive) and alignment with company objectives, I’ll need to transform the unstructured texts into a supervised dataset that we will use to train a recurrent network.

Before we start topic modeling, we need to understand the scope of the dataset, completeness of the data and how skewed various features may be in order to determine the best steps to align the data for modeling.

Feature Possibilities

As I mentioned earlier, I am focused on just a few of the features that can be taken from the email content.  Several features that fall outside the scope of this article, along with the insights they could provide, include the following.

  • Sender/Receiver roles.  Understanding the role and level of importance an email sender has on an organization is a primary class for modeling communications.  This information is most efficiently obtained from Human Resources, ERP and Finance platforms or Sales/Vendor contact management systems.  In the event these systems are unavailable, an organization’s structure can be inferred from email content directly to establish a solid baseline for further communication analysis.
  • Internal/External origination.  By including origination details within a prediction model, we can gain insights into quite a few different aspects of a company.  Some of these include vendor relationships and reliance (i.e. how dependent the company is on vendor support, where ideas originate, tone of relationships, negotiation predictions), client interaction frequency and tone (i.e. client relationship management, how well embedded clients are with the company’s products and services, the overall tone of the client base, etc.), industry influence and employee interactions, outside sales targets, and internal email management (i.e. spam and email security).
  • Mobile/Desktop/Tablet.  Knowing the device an email is sent from can provide insights into technology acceptance, schedules, and job types.  It may also indicate dedication and leadership qualities when paired with other features.
  • Group inclusion.  The number of email groups an individual is involved with may indicate importance within an organization.  It may also reveal improper access to information and security issues.
  • Chain length.  How long an email chain becomes and the frequency of short versus long email chains may shed light on repetitive or consistent issues, improper use of email as a communication medium or help indicate how significant or severe an issue has become (i.e. does the “To” address list continue to grow, how often is it forwarded, total number of recipients, frequency of CC/BCC inclusions, the role of recipients, etc.).
  • Grammar/Punctuation & Formal/Informal tone.  The way individuals communicate can say a lot about a company’s culture, the depth of knowledge within its ranks and the image portrayed to clients, vendors and competitors.  Analyzing speech patterns can reveal areas for improvement within the workforce or leadership ranks.

Analysis of Email Contents

Prior to the complex application development steps, I perform a number of data inspections to frame the scope and diversity of the data I’m working with and to provide insight into any further data preparation tasks that may be required.

To accomplish this I use a number of different techniques, ranging from basic summary counts to probability distribution testing and scoring, frequency distributions and covariance rating.  For this exercise I’ll show a few of the views that I found to be relevant to the task.

More Cleaning.  First off, in the previous cleaning step the email addresses were put into the data extract as a raw array of strings.  Additional inspection has uncovered a few invalid conditions within a significant number of samples that need to be corrected before moving forward.

As you can see from this tiny example set, there are a few with erroneous information.

Example corrupted email addresses

A quick iteration through each address array is all that is needed to put everything in order.  The following code snippet was used to clean up the address fields.  Since the incoming address arrays are strings (an artifact of saving and loading with the Pandas to_csv and read_csv routines), I needed to remove the unnecessary characters before iterating through each address.

				
    def _address_clean(self, addr):
        ''' Additional email address cleaning '''
        addr = re.sub(r'e-mail <(.*)>',r'\1',addr)
        addr = re.sub(r' +', '', addr)
        addr = re.sub(r'/o.*=', '', addr)
        addr = re.sub(r'"', '', addr)
        return addr

    def _fix_email_addresses(self, type, df):
        ''' Split email address array strings into usable arrays '''
        split_embeds = lambda x: x.replace('[','').replace(']','').replace('\'','').split(',')
        addrs = [split_embeds(s) for s in tqdm(df[type].values)]
        u_addrs = [[self._address_clean(y) for y in x] for x in tqdm(addrs)]
        df[type] = u_addrs
        return df

Base Statistics

Now that the email addresses have been properly prepared, we can move on to calculating some base statistics, including email address counts, the starting/ending date range for the dataset, and Z-scores.
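For reference, here is a condensed sketch of those counts using the same column names as the Part 1 extract (the full unique_address_count routine is in the source listing at the end of this post).

import pandas as pd

def unique_address_counts(email_df):
    ''' Condensed version of the unique address counts and date range '''
    unique_from = {a for addrs in email_df['From_Address'] for a in addrs}
    unique_to = {a for addrs in email_df['To_Address'] for a in addrs}
    return pd.DataFrame([{
        'Total_Emails': len(email_df),
        'From_Addresses': len(unique_from),
        'To_Addresses': len(unique_to),
        'Oldest': email_df['DateTime'].min(),
        'Newest': email_df['DateTime'].max(),
    }])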

Unique email address counts and ranges

Looking at the unique address counts we can determine the total number of relevant emails for processing and the appropriate compute resources needed.  Since I filtered out system generated and external email addresses we now know that 2,100 employees sent emails within this study.  Even though the email repository only contains 150 employee accounts there were obviously many more employees at Enron during this timeframe sending emails throughout the company.

We also now understand that of those 2,100 senders there were 18,188 unique receivers from an estimated 29,000 total employees.  So less than 10% of the total employees are represented as communicators…with roughly two-thirds of the total employees being influenced by this small subset.

We can also assume that, because the 150 target accounts belong to senior managers, the themes in this sample of email content will most likely be skewed towards a business focus, regardless of whether the content originates from or is sent to a senior manager.

And without knowing the titles and responsibilities of these 2,100 senders this would not make the best dataset for predictions since we have less than 10% of the employees represented and no idea how these 2,100 employees are distributed across the various departments and job functions within Enron.  This list is likely skewed in a manner that will not represent the company as a whole very well…good thing this is just an example, otherwise we would need to find a better data source.

For this exercise let’s assume the distribution of these 2,100 senders equally represents each level, department, gender, age, etc. within the organization.

Note. Did you notice the CC and BCC counts match?  At first I thought this was an error in my routines, but after inspecting the data further the number is accurate, so for this exercise we will drop BCC and assume that either the systems that generated this dataset were in error or the curators of the dataset made a mistake.  In a real study we would need to resolve this, as I believe the use of BCC can show leadership hierarchy or lack thereof.

Data Time Span

Next we can see the range of time this dataset represents spans from 10/30/1998 to 07/12/2002; roughly four years.  However, looking at the email distribution by date histogram shows the data is skewed towards the range 07/01/2000 – 03/01/2002. For a real implementation, the dataset would be trimmed to a more effective range.
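As a sketch, trimming would be a simple mask on the parsed DateTime column; the cutoff dates below are illustrative rather than final values.

import pandas as pd

def trim_date_range(email_df, start='2000-07-01', end='2002-03-01'):
    ''' Keep only emails within the effective date window (illustrative bounds) '''
    start_ts = pd.Timestamp(start, tz='UTC')
    end_ts = pd.Timestamp(end, tz='UTC')
    mask = (email_df['DateTime'] >= start_ts) & (email_df['DateTime'] <= end_ts)
    return email_df.loc[mask]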

Also of note is the significant spike in activity around September of 2001, which is undoubtedly directly correlated with the World Trade Center attack.  This is a good example of classification opportunities that were not expected as part of the project.  In a production scenario I would attempt to classify external events of this nature to track the long-term performance impact on the organization as well as gauge the behavior patterns of the workforce.

Email Distribution Histogram by Date/Time

Z-Scores

Calculating Z-Scores for the email date/times shows a similar (and expected) distribution pattern when compared with the general date/time histogram.  

The Z-Score tells us the number of standard deviations a particular data sample is away from the distribution mean.  So by looking at the Z-Score output and the date/time histogram above we can visually see where the mean is within the date/time range to help us determine how “normal” this distribution is and whether adjustments will need to be made.
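For a single value the calculation is simply z = (x − mean) / standard deviation.  Below is a minimal sketch against the email timestamps, mirroring the datetime_zscore routine in the full listing.

import pandas as pd
from scipy.stats import zscore

def datetime_zscores(email_df):
    ''' Z-score each email timestamp: (x - mean) / std '''
    return pd.Series(zscore(email_df['DateTime_TS']), index=email_df.index, name='DateTime_Z')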

In this case the distribution is mostly Gaussian-shaped with the above noted exceptions.

Z-Scores Distribution for Email Date/Time Range (Note the x-axis is reversed newest to oldest)

Another helpful visualization is the boxplot, which will show the outliers for the Z-Scores if any exist.  In this case we’ll define outliers as any value that falls outside the standard interquartile-range fences (1.5 × IQR beyond the quartiles).
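As a sketch, the same outliers can be pulled out directly with the boxplot whisker rule applied to the Z-score series.

def iqr_outliers(z_scores):
    ''' Values beyond 1.5 * IQR from the quartiles (the boxplot whisker rule) '''
    q1, q3 = z_scores.quantile(0.25), z_scores.quantile(0.75)
    iqr = q3 - q1
    return z_scores[(z_scores < q1 - 1.5 * iqr) | (z_scores > q3 + 1.5 * iqr)]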

As shown below, there are a few on the negative side of the distribution, which correlates with the negative tail shown on the histogram above (i.e. values less than approximately -3).

Boxplot of Email Date/Time Z-Score Distribution

Normally I would generate histograms and variance reports for each data element to ensure a proper understanding of the distributions.  For me, this information is more easily consumed in tabular reporting form (i.e. I can consolidate and summarize large, multi-dimensional datasets in a smaller report), but visualization comes in handy when strange data patterns emerge.

Email Address Frequencies

The next area to look at is the frequency of email addresses within the dataset.  We looked at the unique email address counts before; which helped us understand the scope of users involved.  Now I want to know, out of those 2,100 senders and 18,188 receivers, what is the volume (or distribution) of senders and receivers.  Is our dataset evenly distributed across all senders and receivers or is it skewed in favor of a subset of users?

There are three different categories of email addresses in our dataset: senders (“From Address”), receivers (“To Address”) and copy recipients (“CC Address”).  Within each category a user can be the original sender, a responder, a forwarder, or someone who deleted the email.
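A condensed sketch of the counting logic is below; the full _frequency_addresses_groupby routine in the source listing also breaks the counts down by day, after-hours and forwarding flags.

import pandas as pd

def address_frequencies(email_df, addr_col='From_Address'):
    ''' Explode the address lists and count emails per address and per action '''
    pairs = [(addr, src) for addrs, src in email_df[[addr_col, 'Source']].values for addr in addrs]
    df = pd.DataFrame(pairs, columns=['user', 'source'])
    counts = df.groupby(['user', 'source']).size().unstack(fill_value=0)
    counts['total'] = counts.sum(axis=1)
    return counts.sort_values('total', ascending=False)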

Starting with the senders group, based upon the outputs (images below) we can see the number of active users drops off sharply with only a small proportion of the total users contributing to the majority of emails.  This may or may not be a concern if this distribution represents an accurate view of typical communications within the organization.  Given this dataset is a small subset of the overall user base I do not believe this will be a good representation of the company for this exercise.

This output reads as “total emails sent by user and of those emails, X number of emails were original, Y number were responses and Z were deleted (either original or responses)”.

I’m not showing all of the outputs here, but for the receiver group, the distribution is a little better with a higher number of users receiving emails, but still skewed towards a subset of the total users.  And as expected, the CC class is very slanted towards a small set of users.

From Address Email Counts
From Address Email Count Distributions

Subject Token Frequency

The tokens of each email subject potentially hold additional insights into company operations.  Generally the subject context will roughly match the body content, but not always.  Just so we understand the distribution of words contained in this dataset, I performed a quick token count and observed the expected bias of the token distribution towards words centered around the business, with the majority of terms not occurring frequently.
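A quick version of that count, skipping the obvious reply/forward markers and filler words, might look like this (the full listing uses a Pandas groupby instead).

import collections

def subject_token_counts(email_df, skip=('Re:', 'RE:', 'FW:', '-', '[]', 'for', 'of', 'and', 'to', 'on', 'the', 'in', '&')):
    ''' Simple subject line token frequency count '''
    counter = collections.Counter()
    for subject in email_df['Subject'].fillna(''):
        counter.update(t for t in subject.split() if t not in skip)
    return counter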

To identify culture-specific terms we would need to remove the higher-occurring tokens until we achieved a more distributed view.  For this exercise I will not be using the subject line and will instead focus on the email body contents.

Subject Line Token Count Distribution

Email Action Frequency

Sometimes it is helpful to understand how a particular user base functions and in the case of email analysis some details can be gleaned from interpreting the frequency of events.

In this case we know we have email senders, email responses and emails being deleted.  The proportion of these events occurring over time may indicate changes in the corporate landscape.  For example, an increase in deleted email could indicate a change in policy or potential legal concerns.  An increase in email responses could indicate an increase in dedication, sense of urgency or operational issues.

For this exercise the focus is on sentiment and professionalism so we will not be using this information, but I did include a quick view so we have a baseline.  To make this effective it needs to include date and time components as well so changes can be tracked over time.

I also like to understand the probability of a user base performing certain actions.  In this case the user base deletes 17% of their emails and responds to about 28%.  Now we don’t know from this information which types of emails are being deleted or responded to…so that information would be helpful to include if we have the necessary access.
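The marginal probabilities themselves are just normalized counts of the “source” action column.

def action_probabilities(email_df):
    ''' Proportion of each email action across the dataset '''
    # for this dataset: deleted is roughly 0.17 and responded roughly 0.28
    return email_df['Source'].value_counts(normalize=True)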

It is interesting that the probability of an email being deleted after it is sent is roughly 10%.  The caveat here is that this probability is based upon the entire dataset and not on an individual’s actions…or in other words, the deleted email could be one the user received as well as one they sent or responded to.

Email Action Frequency

Time Frequencies

Knowing when emails are sent can provide considerable insight into the work ethic of the employee base.  It is also a good indicator of culture and can be used to manage morale and HR expectations.  Companies with many “after normal work hours” emails may want to understand why this is occurring and if/how it can be corrected.  Some conditions that come to mind include technical support (i.e. too many trouble tickets from system issues or workforce quality concerns), too few resources for a given job function forcing after hours work, a culture heavily slanted towards performance at all costs, and benign causes like global work forces.  The same can be said for emails sent on weekends versus weekdays.

For this project, the time distribution is based upon local time so we can get a true sense of when emails are being sent.  From the graphs below we can see quite a few emails sent in the early morning hours, with the majority between 5am and 11am.  In this case we know Enron is a global company, so this is expected if some or all of the workforce is adjusting their hours to compensate for the time differences.  For a real-world project, email time should be adjusted to represent the normal work hours for each region, which would require having additional details about each employee.
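The hour and day counts behind the graphs are one-liners once the hour has been extracted from the parsed DateTime column (see frequency_time and frequency_day_of_week in the full listing).

def hourly_counts(email_df):
    ''' Emails per hour of day (local time) '''
    return email_df.groupby('DateTime_HOUR').size()

def day_of_week_counts(email_df):
    ''' Emails per day of week (0 = Monday) '''
    return email_df.groupby('Day').size()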

 

Hourly Email Distribution

For day of week we can see the majority of emails are sent during the work week and not the weekend (0=Monday).  An anecdotal observation would be that, for this dataset, the workforce is most active on Monday and Tuesday, with activity declining at an increasing rate towards Friday.

Day of Week Distribution

Body Content Length

The length of email content can provide clues regarding how email is used as a communication tool within an organization.  Shorter emails will tend to be oriented more towards events and updates, where longer content may indicate information sharing and detailed planning.  More technical organizations may also lean towards longer emails, since technical staff tend to be detail oriented.

Email contents are also a potential source of evidence (i.e. how we were able to obtain this private email repository from Enron) with many organizations having learned this truth and adopted policies around retention and information sharing.

Since this dataset is from the late ’90s to early 2000s it should be a relatively accurate indicator of how email was used within the company; either as an operational workflow tool or for detailed information sharing.  As an energy company, Enron was not what I would call a technology-leaning organization, so email lengths should be on the shorter side.

To find this information we’ll use K-Means clustering to group email lengths into 20 different bucket sizes.  From the detail below we can see that the assumption of shorter email lengths held true, with 43,205 emails falling in a cluster centered at 81.3 characters.  From the graph we can visually see that the shorter the email the more occurrences there are.
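A condensed sketch of the clustering step is below; the full length_body_tokens routine in the source listing also builds the summary table and plot.

import numpy as np
import sklearn.cluster

def body_length_clusters(email_df, n_clusters=20):
    ''' Bucket email body lengths with K-Means and count emails per bucket '''
    lengths = np.array([[len(body)] for body in email_df['Body']])
    km = sklearn.cluster.KMeans(n_clusters=n_clusters).fit(lengths)
    labels, counts = np.unique(km.labels_, return_counts=True)
    centers = km.cluster_centers_.ravel().round(1)
    return sorted(zip(centers[labels], counts))   # (cluster center length, email count)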

For our purposes the 318 emails in the smallest 13 classes should be removed as they do not represent the majority of emails.

 

Email Body Length Statistics

Content Token Frequency

Another basic statistic that we should know about is the effective vocabulary size and frequency distribution of the content tokens.  For this exercise we need to know how extensive the email content vocabulary is so we can select the appropriate tools for processing.  A large vocabulary or one made up of multiple languages will require more extensive compute resources than a small, single language word set.

You can use CountVectorizer available within the Scikit-Learn library for counting token frequency.  It is also easy to write a quick iterative function manually.
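A sketch of the CountVectorizer approach is below (the full listing goes the manual-dictionary route in manual_document_word_count).

from sklearn.feature_extraction.text import CountVectorizer

def content_token_counts(email_df):
    ''' Total occurrence count for each token across all email bodies '''
    vectorizer = CountVectorizer(lowercase=True)
    matrix = vectorizer.fit_transform(email_df['Body'].fillna(''))
    totals = matrix.sum(axis=0).A1                        # column sums = per-token totals
    return {token: int(totals[idx]) for token, idx in vectorizer.vocabulary_.items()}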

After removing a significant number of outliers, the following graphs show a distribution of mostly low frequency words with the exception of roughly 2,000 common words.  Since I’m more interested in the words less related to everyday business I’ve kept the lower bounds and trimmed from the common end; with an assumption that the more common words are either stop words of little value or everyday business terms.

 

Email Content Token Frequency Details
Email Content Token Frequency

Address Relationship Graph

Another interesting way to identify potential modeling data is through graphing.  In this scenario it would be helpful to know the ratio of user relationships within the organization based upon the dataset.  We can map the “from” email addresses with the “to” email addresses to find the proportion of high association users with low association users.  This can provide a view into leadership, influence and hierarchy.

For this effort I used the NetworkX python graph library to build a directed graph.  Once the graph was assembled I simply called the .info, .density and .degree_histogram functions to display the details below.
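A minimal sketch of the graph construction is shown below; the full graph_from_to_addresses routine also tracks an edge count attribute and writes the graph out to GraphML.

import networkx as nx

def build_from_to_graph(email_df):
    ''' One directed edge per (sender, recipient) pair '''
    graph = nx.DiGraph()
    for from_addrs, to_addrs in email_df[['From_Address', 'To_Address']].values:
        sender = from_addrs[0]
        for recipient in to_addrs:
            graph.add_edge(sender, recipient)   # nodes are added implicitly
    print(nx.density(graph))
    print(nx.degree_histogram(graph))
    return graph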

From this information we can discern several pieces of information about how our users connect.  

First off, if you recall the outputs from the unique address section earlier, there were 2,100 “from” addresses and 18,188 “to” addresses.  Our graph contains 18,303 nodes (covering both “from” and “to” addresses), so not all of our senders also appear as receivers; 115 of them only show up on the sending side.

The average number of connections for a user is 3.68 or in other words, on average (based upon this dataset), each user knows 3.68 other users.

Our graph density is very low, meaning the number of potential connections is much higher than the actual connections.  In this case, based upon the number of vertices (nodes) in our graph, the maximum number of connections (unique node pairs) is 167,490,753.  Our graph has 67,382 edges, so this user base is not very well connected when compared to the potential.

It is to be expected, especially in large corporations, that not every employee communicates with every other employee, but this information, shown over time and associated with other events like leadership training and culture changes, can be a good indicator of how well corporate changes are accepted.

The final data points involve the indexed degrees vector.  This array shows the number of nodes associated with each degree of connection (i.e. number of edges).  For example, there are 11,829 nodes with only one edge connecting them to another node.  Starting at 0, each position in the array corresponds to a number of edge connections; so 2,033 nodes have 2 connections, 952 have 3 connections, etc.

By looking at this data we can see many nodes deeper in the list with a high number of connections; these could be valid, highly connected users, or they could indicate system email accounts or departmental accounts used for communicating to large groups.

From Address - To Address Graph Details

Using graph techniques we can dig deeper into the network to determine which users could be ranked “influential” or “leaders” and use this information to classify each data sample for supervised learning later in the modeling process. 

The graph can also be used to identify anomalies that may have slipped through the filtering process. 

Correlation and Multicollinearity

It’s a little early in the process to check correlation and multicollinearity since we have not identified our target labels yet, but it may still be helpful to understand how some of our features are related and potentially remove those that are redundant or likely to interfere with model inference.  Machine learning regularization techniques minimize multicollinearity to some degree, but I still check for extreme situations or opportunities to remove features and reduce compute time.

To perform this test we will generate a correlation matrix and use a technique called variance inflation factor to determine how our features relate and which features may have a negative influence on model training.

For this stage in the data pipeline, the features at hand include “day of the week”, “hour of day”, “month”, the “outside hours” boolean, the email “source” and the “forwarded” boolean.  Some of our data is not in numeric form so we’ll need to perform a quick transform to classify the non-numeric values before calculating.

Variance Inflation Factor. The first calculation we will perform is VIF or variance inflation factor.  VIF determines the level of “influence” a particular feature has with relation to other features.  The higher the inflation factor the more correlated the feature will be with at least one other feature.   Very high outcomes (i.e. the values are relative, but a general rule is greater than 10.0) are cause for further investigation to determine if the relationship will affect modeling later in the process.  Correcting these occurrences involves removing one of the offending features or making adjustments to separate the relationship.
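The calculation itself is a single statsmodels call per feature column; here is a sketch assuming the features have already been converted to numeric form.

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_report(features):
    ''' One variance inflation factor per (numeric) feature column '''
    return pd.DataFrame({
        'feature': features.columns,
        'vif': [variance_inflation_factor(features.values, i) for i in range(features.shape[1])],
    })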

The test results for this dataset show low collinearity, so at this point I will not need to make any adjustments.

Variance Inflation Factor and Correlation Matrix

The correlation matrix shows us similar information as the VIF.  Positive values represent a relationship where both variables increase together.  Negative values represent the reverse condition, where an increase in one variable corresponds to a decline in the other (and vice versa).  The larger the magnitude of the value, the stronger the relationship.  Values close to zero signify the least amount of correlation.

Based upon this dataset, none of the features have a strong correlation.  There is a mild relationship between “hour of day”, “outside hours” and “source”.  This can be interpreted several ways, but I believe it makes sense that certain email actions may occur differently throughout the day.  Early in the day there may be more original emails drafted and sent versus later in the day when more emails are being responded to.

Heatmaps. Another way to look at correlation is through a visual heatmap.  The image below shows a simple matrix of the relationships along with a handy color gradient to easily gauge where variables have the most extreme influence.
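Generating the heatmap is only a couple of lines with seaborn once the correlation matrix is in hand.

import matplotlib.pyplot as plt
import seaborn as sns

def correlation_heatmap(features):
    ''' Heatmap of the feature correlation matrix '''
    sns.heatmap(features.corr(), cmap='coolwarm', center=0.0, annot=True)
    plt.show()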

Correlation Heatmap

Conclusion

This article focused on the initial analysis of the Enron email repository to determine the steps that may be needed to further clean the data before building the supervised training set.

Several techniques were used, including basic summation, frequency distributions, variance inflation factoring, correlation matrices, graphing networks and visualizations.

It was determined that the Enron dataset is not very well suited to representing the company as a whole with regard to employee morale/sentiment and alignment, but it is the dataset available so we will continue to use it.  Of the analysis steps performed, the body content length clustering showed that 318 emails should be dropped from the exercise since their content length was outside the normal range for this user base.

The next post will continue the series and dive into building the supervised learning data from the unstructured email content using the topic modeling techniques Latent Dirichlet Allocation and Non-Negative Matrix Factorization.

If there are questions about the code or processing logic for this blog series or your company is in need of assistance implementing an AI or machine learning solution, please feel free to contact me at mike@avemacconsulting.com and I will help answer your questions or set up a quick call to discuss your project.

Source Code

For those interested, I have included the complete source code used to perform the analysis functions for this article.

All of the code for this series can be accessed from my Github repository at:

github.com/Mike-Schmidt-Avemac/ai-email-insights.

				
					'''
MIT License

Copyright (c) 2021 Avemac Systems LLC

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
'''

#!/usr/bin/python3 -W ignore::DeprecationWarning
import sys

if not sys.warnoptions:
    import warnings
    warnings.simplefilter("ignore")
import string
from time import time
import datetime as dt
import pandas as pd
import numpy as np
import collections
import re
from tqdm import tqdm
import sklearn.cluster
import matplotlib.pyplot as plt
import networkx as nx
from scipy.stats import zscore
from statsmodels.stats.outliers_influence import variance_inflation_factor
import seaborn as sns

pd.set_option('display.max_rows', 100)
pd.set_option('display.min_rows', 20)
pd.set_option('display.max_colwidth', 100)

class BasicEmailStats():
    ''' Generate simple statistics about email dataset'''

    def __init__(self, config):
        self.data_dir = config['data_dir']
        self.manual_word_counts_fn = config['manual_word_counts_fn']
        self.plot_save_dir = config['plot_image_save_directory']
        raw_emails = pd.read_csv(self.data_dir + config['email_extracted_fn'])
        raw_emails.fillna(value="[]", inplace=True)

        self.email_df = self._fix_email_addresses('From_Address', raw_emails)
        self.email_df = self._fix_email_addresses('To_Address', raw_emails)
        self.email_df = self._fix_email_addresses('Cc_Address', raw_emails)
        self.email_df = self._fix_email_addresses('Bcc_Address', raw_emails)

        self.email_df['DateTime'] = self.email_df['DateTime'].apply(lambda x: dt.datetime.strptime(x, '%Y-%m-%d %H:%M:%S%z'))
        self.email_df['DateTime_TS'] = self.email_df['DateTime'].apply(lambda x: x.timestamp())
        self.email_df['DateTime_HOUR'] = self.email_df['DateTime'].apply(lambda x: x.hour)
        self.email_df['DateTime_MONTH'] = self.email_df['DateTime'].apply(lambda x: x.month)

        print(f'\n--- Init Complete\n\n')

        return

    def cliporpad(self, text:str, clen):
        return text.ljust(clen)[0:clen]

    def xy_gen_plot(self, X, X_label, Y, Y_label, title=None, aug_plots=None, aug_plot_labels=None, spot=None, spot_label=None, save=False, img_fn='generic_plot.png'):
        '''generic graph plot - support X, Y and augmented plot points'''

        fig, (p1) = plt.subplots(1, 1)

        if title is not None:
            p1.set_title(title)

        p1.set_ylim([min(Y)-(max(Y)*0.1), max(Y)+(max(Y)*0.1)])
        p1.set_xlim([min(X)-(max(X)*0.1), max(X)+(max(X)*0.1)])
        p1.plot(X, Y, 'o-', color='blue')

        if aug_plots is not None:
            for x in range(len(aug_plots)):
                p1.plot(aug_plots[x][0], aug_plots[x][1], label=aug_plot_labels[x])

        if spot is not None:
            p1.plot(spot[0], spot[1], 'bo')
            p1.annotate(spot_label, xy=(spot[0]+0.2, spot[1]-0.2))

        p1.set_xlabel(X_label)
        p1.set_ylabel(Y_label)
        p1.legend()
        p1.grid(True)
        fig.tight_layout()

        if save: 
            plt.savefig(self.plot_save_dir + img_fn, format='png')

        plt.show()
        return

    def df_plot(self, ds, columns, types=['hist'], bins=20):
        '''Pandas based plots'''
        for t in types:
            if t == 'box':
                ds.boxplot(column=columns)
            elif t == 'hist':
                ds.hist(column=columns, bins=bins)
            elif t == 'density':
                ds.plot.kde()
        plt.show()
        return

    def _address_clean(self, addr):
        ''' Additional email address cleaning '''
        addr = re.sub(r'e-mail <(.*)>',r'\1',addr)
        addr = re.sub(r' +', '', addr)
        addr = re.sub(r'/o.*=', '', addr)
        addr = re.sub(r'"', '', addr)
        return addr

    def _fix_email_addresses(self, type, df):
        ''' Split email address array strings into usable arrays '''
        split_embeds = lambda x: x.replace('[','').replace(']','').replace('\'','').split(',')
        addrs = [split_embeds(s) for s in tqdm(df[type].values)]
        u_addrs = [[self._address_clean(y) for y in x] for x in tqdm(addrs)]
        df[type] = u_addrs
        return df

    def unique_address_count(self):
        ''' Find unique email address counts '''
        from_addrs = list(set([y for x in self.email_df['From_Address'] for y in x]))
        to_addrs = list(set([y for x in self.email_df['To_Address'] for y in x]))
        cc_addrs = list(set([y for x in self.email_df['Cc_Address'] for y in x]))
        bcc_addrs = list(set([y for x in self.email_df['Bcc_Address'] for y in x]))

        addrs_df = pd.DataFrame([len(self.email_df)], columns=['Total_Emails'])
        addrs_df['From_Addresses'] = [len(from_addrs)]
        addrs_df['To_Addresses'] = [len(to_addrs)]
        addrs_df['Cc_Addresses'] = [len(cc_addrs)]
        addrs_df['Bcc_Addresses'] = [len(bcc_addrs)]

        print(f'\n--- Unique Email Address Counts\n\n{addrs_df.head(20)}\n')

        return addrs_df
    
    def _frequency_addresses_groupby(self, type):
        columns = [type, 'Source','Day','Outside_Hours','Forwarded']
        # using list comprehension to build address dataframe from each address array within each email
        addrs = pd.DataFrame([[y, x[1], x[2], x[3], x[4]] for x in self.email_df[columns].values for y in x[0]], columns=columns)

        group = addrs.groupby(by=[type]) # group by address only so all data elements are captured within one reference point

        # since the grouping is by address type and we're looking for counts of multiple data elements
        # we'll just spin through the group details and manually sum each data element by the element type

        frequencies = []
        for grp in tqdm(group.groups):
            details = group.get_group(grp)
            total = len(details)
            sources = collections.Counter(details.Source.values)
            days = collections.Counter(details.Day.values)
            hours = collections.Counter(details.Outside_Hours.values)
            forwards = collections.Counter(details.Forwarded.values)

            # build a cum row for each group 
            frequency = {}
            frequency['user'] = grp
            frequency['total'] = total
            for k,v in sources.items(): frequency['sources_'+k] = v
            for k,v in days.items(): frequency['days_'+str(k)] = v
            for k,v in hours.items(): frequency['after_hours_'+str(k).lower()] = v
            for k,v in forwards.items(): frequency['forwards_'+str(k).lower()] = v
            frequencies.append(frequency)

        df = pd.DataFrame(frequencies).fillna(0.0).sort_values(by='total', ascending=False)
        return df[sorted(df.columns, reverse=True)]

    def _frequency_addresses_groupby_sum(self, group_details):
        columns = sorted(group_details.columns, reverse=True)
        columns.remove('user')
        sums = {c:group_details[c].sum() for c in columns}
        return pd.DataFrame([sums])

    def side_by_side_histogram(self, df, bins=30):
        fig, axs = plt.subplots(1, 4, sharey=True, tight_layout=True)
        fig.text(0.5, 0.01, 'email count', ha='center')
        fig.text(0.01, 0.5, 'total users', va='center', rotation='vertical')
        axs[0].hist(df['total'], bins=bins); axs[0].title.set_text('total')
        axs[1].hist(df['sources_sent'], bins=bins); axs[1].title.set_text('sources_sent')
        axs[2].hist(df['sources_responded'], bins=bins); axs[2].title.set_text('sources_responded')
        axs[3].hist(df['sources_deleted'], bins=bins); axs[3].title.set_text('sources_deleted')
        plt.show()
        return

    def frequency_addresses(self):
        ''' Find unique email address frequency by type, action, etc. '''
        from_addrs_grp = self._frequency_addresses_groupby('From_Address')
        from_addrs_grp_sum = self._frequency_addresses_groupby_sum(from_addrs_grp)
        print(f'\n--- From Email Address Counts\n\n{from_addrs_grp.head(20)}\n\n{from_addrs_grp_sum.head(20)}')
        self.side_by_side_histogram(from_addrs_grp)

        to_addrs_grp = self._frequency_addresses_groupby('To_Address')
        to_addrs_grp_sum = self._frequency_addresses_groupby_sum(to_addrs_grp)
        print(f'\n--- To Email Address Counts\n\n{to_addrs_grp.head(20)}\n\n{to_addrs_grp_sum.head(20)}')
        self.side_by_side_histogram(to_addrs_grp)

        cc_addrs_grp = self._frequency_addresses_groupby('Cc_Address')
        cc_addrs_grp_sum = self._frequency_addresses_groupby_sum(cc_addrs_grp)
        print(f'\n--- Cc Email Address Counts\n\n{cc_addrs_grp.head(20)}\n\n{cc_addrs_grp_sum.head(20)}')
        self.side_by_side_histogram(cc_addrs_grp)

        return
    
    def frequency_subject_line(self):
        ''' Find subject line word frequency - Could use TF-IDF or simple dictionary for this - using Pandas groupby as example '''
        word_map = []
        for x in tqdm(self.email_df['Subject']):
            tokens = x.split()
            for token in tokens:
                if token not in ['Re:','RE:','FW:','-','[]','for','of','and','to','on','the','in','&']:
                    word_map.append({'word':token})
        word_map = pd.DataFrame(word_map)
        group = word_map.groupby(by=['word'])
        counts = group.size().to_frame(name='count').sort_values(by='count', ascending=False)
        print(f'\n--- Subject Line Token Frequency\n\n{counts.head(20)}')
        counts.hist(column='count', bins=100); plt.show()
        return

    def frequency_actions(self):
        ''' Email action frequency counts - using simple dictionary '''

        # define counts and probabilities
        df = self.email_df[['Source']]
        group = df.groupby('Source')
        actions = group.size()
        probs = actions.div(len(df))
        actions['P(sent)'] = probs['sent']
        actions['P(deleted)'] = probs['deleted']
        actions['P(responded)'] = probs['responded']
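        # note: the conditional entries below are approximated as simple products
        # of the marginal probabilities (i.e. treating the actions as independent)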
        actions['P(deleted|sent)'] = actions['P(deleted)'] * actions['P(sent)']
        actions['P(deleted|responded)'] = actions['P(deleted)'] * actions['P(responded)']
        actions['P(responded|sent)'] = actions['P(responded)'] * actions['P(sent)']
        print(f'\n--- Email Action Frequency\n\n{actions.to_frame().T.head(20)}')

        return

    def frequency_time(self):
        ''' Email frequency by hour of day '''
        group = self.email_df.groupby('DateTime_HOUR')
        counts = group.size()
        print(f'\n--- Email By Hour of Day Frequency\n\n{counts.to_frame().T.head(20)}')

        self.df_plot(self.email_df, columns=['DateTime_HOUR'], types=['box','hist'], bins=24)
        return

    def frequency_day_of_week(self):
        ''' Email frequency by day of week '''
        group = self.email_df[['Day']].groupby('Day')
        counts = group.size()
        print(f'\n--- Email By Day of Week Frequency\n\n{counts.to_frame().T.head(20)}')

        self.df_plot(self.email_df, columns=['Day'], types=['box','hist'], bins=7)
        return
        
    def length_body_tokens(self):
        ''' Cluster body content lengths - Use sklearn.cluster.KMeans '''
        lengths = []
        for x in tqdm(self.email_df['Body']):
            lengths.append([len(x)])
        estimator = sklearn.cluster.KMeans(n_clusters=20).fit(lengths)
        cluster_frequency = np.unique(estimator.labels_, return_counts=True)
        cluster_centers = [round(x[0],1) for x in estimator.cluster_centers_]
        df = pd.DataFrame([{'cluster_id':x,'cluster_count':cluster_frequency[1][x],'cluster_center':cluster_centers[x]} for x in cluster_frequency[0]]).sort_values(by='cluster_center')
        print(f'\n--- Body Content Segmentation By Character Length\n\n{df.head(20)}')

        self.xy_gen_plot(df.cluster_count, 'Email Group Count', df.cluster_center, 'Email Length (Center)', title='Email Body Content Length')
        return

    def graph_from_to_addresses(self):
        ''' Routine to generate a graph of email address relationships '''
        graph = nx.DiGraph()
        for from_a,to_arr in tqdm(self.email_df[['From_Address','To_Address']].values):

            if not graph.has_node(from_a[0]):
                graph.add_node(from_a[0])

            for to_a in to_arr:
                if not graph.has_node(to_a):
                    graph.add_node(to_a)
                if not graph.has_edge(from_a[0], to_a):
                    graph.add_edge(from_a[0], to_a, count=1)
                else:
                    graph.edges[from_a[0], to_a]['count'] = graph.edges[from_a[0], to_a]['count'] + 1

        nx.write_graphml(graph, self.plot_save_dir + 'from_to_addresses.graphml')
        #nx.draw_circular(graph)
        print(f'\n--- Graph From Addresses - To Addresses Info\n\n{nx.info(graph)}')
        print(f'\n--- Graph From Addresses - To Addresses Density\n\n{nx.density(graph)}')
        print(f'\n--- Graph From Addresses - To Addresses Degrees\n\n{nx.degree_histogram(graph)}')

        plt.show()

        return

    def datetime_zscore(self):
        ''' Quick check of probability for email date/time range '''
        zscores = self.email_df['DateTime_TS'].to_frame().apply(zscore)
        print(f'\n--- DateTime Z-Score, Oldest is {zscores["DateTime_TS"].max()} and Newest is {zscores["DateTime_TS"].min()}\n\n')
        zscores['DateTime_TS'].hist(bins=20); plt.show()
        zscores.boxplot(column='DateTime_TS'); plt.show()
        return

    def range_date_time(self):
        ''' Graph the email date/time range '''
        print(f'\n--- DateTime Min \'{self.email_df.DateTime.min()}\' & Max \'{self.email_df.DateTime.max()}\' Range\n\n')
        self.email_df['DateTime'].hist(bins=20); plt.show()
        return

    def manual_document_word_count(self):
        ''' Routine to investigate email content token frequencies '''
        words = {}
        punct_pattern = re.compile("[" + re.escape(string.punctuation) + "0-9" + "]")
        for x in tqdm(self.email_df['Body']):
            for y in re.sub(punct_pattern, "", x).lower().split(' '):
                count = words[y]+1 if y in words.keys() else 1
                words.update({y:count})
        print(f'\n--- Raw Word Count - dictionary len is {len(words)}, min count of {min(words.values())}, max count of {max(words.values())}\n')

        words = dict(filter(lambda item: item[1] > 5 and item[1] < 1000, words.items())) # roughly trim outliers
        print(f'\n--- Trimmed Word Count - dictionary len is {len(words)}, min count of {min(words.values())}, max count of {max(words.values())}\n')
        values = list(words.values())

        self.xy_gen_plot(np.arange(start=0,stop=len(words), dtype=int), 'vocab', sorted(values), 'count')
        self.df_plot(pd.DataFrame(values, columns=['count']), columns='count', bins=10)
        return words

    def _multicollinearity_transform(self, x):
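        ''' Map boolean and categorical features to integer codes for the VIF/correlation tests '''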
        if x.name == 'Outside_Hours':
            x = x.apply(lambda n: 1 if n else 0)
        elif x.name == 'Source':
            amap = {'deleted':0,'responded':1,'sent':2,'received':3}
            x = x.apply(lambda n: amap[n])
        elif x.name == 'Forwarded':
            x = x.apply(lambda n: 1 if n else 0)

        return x

    def multicollinearity_test(self):
        # vif check
        columns = ['Day','DateTime_HOUR','DateTime_MONTH','Outside_Hours','Source','Forwarded']
        vif_df = self.email_df[columns].apply(self._multicollinearity_transform)

        vdf = pd.DataFrame()
        vdf['features'] = columns
        vdf['vif'] = [variance_inflation_factor(vif_df.values, i) for i in range(len(vif_df.columns))]
        print(f'\n--- Variance Inflation Factor\n{vdf}\n')

        # correlation matrix and heatmap
        print(f'\n--- Correlation Matrix\n{vif_df.corr()}\n')
        fig, ax = plt.subplots(1)
        fig.subplots_adjust(bottom=0.3,left=0.3)
        axs = sns.heatmap(vif_df.corr(), ax=ax); plt.show()

        return

config = {
    'email_extracted_fn': 'extracted_emails.pd',
    'data_dir': '/email_analysis_blog/data/',
    'plot_image_save_directory': '/email_analysis_blog/plots/',
    'manual_word_counts_fn': 'email_content_word_counts.csv',
}

ebs = BasicEmailStats(config)

x = ebs.unique_address_count()
x = ebs.range_date_time()
x = ebs.datetime_zscore()
x = ebs.frequency_addresses()
x = ebs.frequency_subject_line()
x = ebs.frequency_actions()
x = ebs.frequency_time()
x = ebs.frequency_day_of_week()
x = ebs.length_body_tokens()
x = ebs.manual_document_word_count()
x = ebs.graph_from_to_addresses()
x = ebs.multicollinearity_test()

exit()
				
			
