
Email Insights from Data Science – Part 1


Identifying Behavioral Insights

I’ve always been interested in the dynamics of corporations and how decisions, personality types and business performance impact culture, morale, work ethic and collaboration. I believe most executives understand how larger events ripple through an organization, but it’s the subtleties that go unseen for long periods of time that can truly derail progress, and those subtleties are the topic of this series of posts.

Of course the opportunity to measure these phenomena has been available for quite some time, but implementation methods have traditionally been extensive and complex, with results that were highly customized and rule-based. With the breakthroughs in Data Science and Machine Learning, specifically Natural Language Processing (NLP), the level of effort is now much better aligned with the expected benefits; not to mention the general accessibility of data through industry-wide acceptance of XaaS API interfaces and the relative ease of NLP feature modeling.

This type of employee analysis raises ethical concerns, of course. Privacy violations, discrimination and bias must be avoided, and management must be cognizant of the negative effects this type of information could have within an organization. Even though most corporate employees understand that information sent and received on company equipment and networks belongs to the company (and the amount of information available to analyze is therefore reduced), I believe there are still insights to be gained from benign day-to-day interactions and the word forms used to interact with internal and external resources. That being said, it is easy to understand how this type of analysis could be used for unethical and nefarious purposes.

In this multi-part series I will explain a number of methods and techniques available for inspecting email contents (also useful with other communication platforms) for insights into employee tone, dedication, alignment and leadership.

Data Exploration

The first step to extracting insights from corporate communications is to assess available data, determine target goals and set realistic expectations regarding accuracy and prediction results.

Communication data for this exercise is limited to email systems an organization controls directly or has unrestricted access to through API integrations or export tools. Other platforms for consideration could include document shares, project management tools like Jira or Trello, public or corporate social media, support platforms like Zendesk, internal messaging platforms and customer service transcripts.

Besides the raw communication details, there may be a need to further classify this information with user demographics. This information can be obtained from human resources and payroll systems (i.e. title, tenure, age, gender, number of direct reports, years of experience, performance reviews, salary, etc.). Once again, this information needs to be treated anonymously to prevent bias and discrimination. I do not advocate leveraging this information for individual interpretation, nor do I advocate management decisions based solely on predictions classified by age, race or gender. Including demographic information, however, can provide positive insights into differences in interaction, culture and understanding that can be evaluated for opportunities to improve collaboration and teamwork, rather than targeting employees that don’t fit as well as others.

It is also important to develop a plan for who sees this information and what insights are included.  Typically these details should be restricted to executive management as a way to evaluate employee alignment and to gauge employee reaction to business changes.  With proper communication and training regarding the intent of these analyses, the details may be made available to lower management teams to support better interactions with their organizations, but care must be taken to avoid manipulation and bias.

For this series, we will focus on employee tone (positive or negative) and work/life balance (degree of professional versus personal communications).  The range of analysis is limited only by time and objectives.  Other areas to consider are alignment (for or against change), dedication (work ethic, commitment to quality) and leadership (influence).

Model Expectations

At the end of the day, a well-implemented machine learning model is still just a probability engine, albeit a sophisticated one, with the primary benefits being precision, range and inference. In other words, the technology will guess answers more closely, process complex multivariate datasets dynamically, and predict an outcome without having seen the question before.

Expecting a statistical model to be correct 100% of the time is not realistic. It is also not realistic to accept incorrect outcomes more often than would be expected from an educated manager’s predictions. Somewhere in between lies the sweet spot, with the goal being to push accuracy as close to 100% as possible while still maintaining a substantial amount of inference capability. The point here is to use these models as tools for decision making and not as absolute governors.

Data Selection

Since this post is an example of how to extract information from email contents, I need a sample dataset from the public domain. For this I will use the Enron email export (the cleaned 2015 version without attachments) made public during the 2001 federal investigation and available from the Carnegie Mellon University Computer Science repository.

This repository is not the best data source for this exercise, since it is restricted to senior managers only and has been cleansed several times since the original release, but it should suffice to depict the steps necessary to extract relevant features from private email systems.

Data Structure

The data repository is organized by email user account, with each user having a physical directory containing sub-directories for “inbox”, “sent”, “deleted” and custom filters/labels. We will focus on the “sent” (i.e. originated emails), “sent_items” (i.e. responded-to emails) and “deleted_items” folders.

After downloading the archive (on a Linux platform), simply run “tar -xzvf enron_mail_20150507.tar.gz” to unpack.

The archive structure looks like the following image with each numbered file representing an individual email or email chain.  This is raw data so a number of preparation steps will be needed before we can begin analyzing the information.

Tar archive file structure example.
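
Since the original image may not be available here, the following is a simplified sketch of the unpacked layout (user directory names are illustrative; the corpus contains roughly 150 of them):

    maildir/
        allen-p/
            inbox/
                1.
                2.
                ...
            sent/
            sent_items/
            deleted_items/
        arnold-j/
            ...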

This format is specific to the export and preparation decisions made by the handlers of this information over several iterations and is unique to this dataset. Other email systems will organize data in different ways and will require source-specific collection and processing logic.

Since the structure of the Enron repository is fairly simple, I’ll use basic, low-level ETL (Extract-Transform-Load) for this example; in other words, I will custom-code an ETL function in Python rather than leveraging a commercial solution like CloverETL, Pentaho, Talend or AWS Glue.

Structure Analysis

After downloading the email archive and unpacking it on my Linux workstation, I begin the initial analysis of the data structure and formatting. It’s a good idea to become familiar with how the data you’re working with is organized and, in many cases when working with external data, to reverse engineer how the data was encoded and identify the major patterns for initial segmentation.

In this instance, each employee’s email account is located in a user-specific file directory with consistent sub-folders for received, sent, replied-to and deleted mail. There are other folders available with additional information, but for this study we’ll focus on the primary email actions.

Within each of the targeted sub-folders will be a list of numbered files that do not appear to correlate in any way with numbered files in other folders (i.e. the “1.” file in the inbox folder is not the same as the “1.” file in the sent folder).
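
A quick way to confirm this layout is to count the files per sub-folder name across all user accounts. The following is a rough survey sketch (using the same base path as the config at the end of this post; directories nested inside folders are counted as entries too, so treat the numbers as approximate):

import glob
import os
from collections import Counter

# count files per sub-folder name (inbox, sent, sent_items, ...) across all users
counts = Counter(os.path.basename(os.path.dirname(fn))
                 for fn in glob.glob('/email_analysis_blog/data/maildir/*/*/*'))
print(counts.most_common(10))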

Each file contains a set of email header fields and a content payload (i.e. the email body).  Within the email body there are a number of patterns and artifacts to be dealt with as a result of handling by disparate email platforms, automated processes and the state of the email.  We will address these concerns during the extraction phase. 

The following is an example email showing the header fields available for analysis. For our purposes, the email “body”, “to”, “from”, “cc”, “bcc”, “datetime” and “subject” fields are all we are interested in. If organizational structure prediction were one of our targets, the “X-” fields would be valuable. Or if this archive contained attachments, the ability to analyze intellectual property (IP) rule adherence within the email content along with attachment content would be a possibility.

Example email structure and header fields.
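
Since the image may not render here, the following is a reconstructed sample of the header layout (names and identifiers are fictitious, but the field names match the corpus format):

Message-ID: <1234567.1075855377439.JavaMail.evans@thyme>
Date: Mon, 14 May 2001 16:39:00 -0700 (PDT)
From: jane.doe@enron.com
To: john.smith@enron.com
Subject: Re: Q2 forecast
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Jane Doe
X-To: John Smith
X-cc:
X-bcc:
X-Folder: \John_Smith_Jan2002\Smith, John\Inbox
X-Origin: Smith-J
X-FileName: jsmith (Non-Privileged).pst

John - the revised numbers look solid. Let's discuss tomorrow morning.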

Overall the structure is relatively consistent and requires only a minimal amount of manipulation.

Data Extraction

To model positive/negative employee tone, professional/personal content, alignment, dedication and leadership traits we will need to extract some data elements in their current form and infer some features based on this data.

Tone and content type will be based solely on the language used within the email body.  Alignment would also rely upon the email body contents.  Predicting dedication scores is not only based upon content, but also when an email is sent (i.e. during or after business hours, on a weekday or weekend).  Leadership traits are also based upon language, but can also be indicated by frequency of inclusion in CC and BCC fields.  Whether or not an employee forwards emails can also contribute to leadership scores.
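
To make that concrete, the following illustrative aggregations show how such signals might be computed once the DataFrame exists (email_df is produced by the extraction code later in this post; the column names match the fields extracted below):

import pandas as pd

# how often each address appears on a CC line -- a rough inclusion/influence signal
cc_counts = email_df['Cc_Address'].dropna().explode().str.strip().value_counts()

# per-sender forwarding and after-hours rates for leadership/dedication features
fwd_rate = email_df.groupby('From_Address')['Forwarded'].mean()
after_hours_rate = email_df.groupby('From_Address')['Outside_Hours'].mean()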

To get at this information a number of steps will be required; namely functions to filter, clean, parse, deduplicate and save the extracted information for the analysis phase of the pipeline.

Parsing

The first step is to retrieve the data and parse it into meaningful values. To accomplish this I use the Python glob library and the email.parser module. The glob function easily creates a list of filenames based upon directory paths, and the email parser is robust enough, for this exercise, to pull the information from each email in a succinct manner.

    def _parse_emails(self, base_dir, limit=sys.maxsize):
        ''' Loop through all of the email files and extract/infer features '''

        email_sources = glob.glob(base_dir + '*/deleted_items/*')
        email_sources.extend(glob.glob(base_dir + '*/sent_items/*'))
        email_sources.extend(glob.glob(base_dir + '*/sent/*'))

        parser = ep.Parser()
        emails = []

        for email_fn in tqdm(email_sources[0:limit]):

            # retrieve email content
            try:
                with open(email_fn, 'r') as f:
                    email = parser.parse(f)
            except Exception:
                continue # encoder error, skip

            user_name = self._extract_user_name(email_fn)

            # skip external and system generated emails
            if self._is_external_origination(email.get('From')): continue
            if self._is_system_generated(email.get('From')): continue

            # extract fields
            fields = {}
            date_time = dt.datetime.strptime(email.get('Date')[:-6], '%a, %d %b %Y %H:%M:%S %z')
            fields['DateTime'] = date_time
            fields['Day'] = date_time.weekday()
            tz = dt.timezone(date_time.utcoffset())
            day_start = dt.datetime(date_time.year, date_time.month, date_time.day, 7, 0, 0, tzinfo=tz)
            day_end = dt.datetime(date_time.year, date_time.month, date_time.day, 18, 0, 0, tzinfo=tz)
            fields['Outside_Hours'] = date_time < day_start or date_time > day_end
            fields['From_Address'] = email.get('From')
            fields['To_Address'] = [x for x in email.get('To').replace('\n','').replace('\t','').split(',')] if email.get('To') is not None else None
            fields['Cc_Address'] = [x for x in email.get('Cc').replace('\n','').replace('\t','').split(',')] if email.get('Cc') is not None else None
            fields['Bcc_Address'] = [x for x in email.get('Bcc').replace('\n','').replace('\t','').split(',')] if email.get('Bcc') is not None else None
            fields['Subject'] = email.get('Subject')
            fields['Forwarded'] = 'Fwd' in email.get('Subject') or 'FW' in email.get('Subject') or 'Forwarded' in email.get_payload()
            fields['Source'] = self._determine_email_action(email_fn)
            fields['Body'] = self._clean(email.get_payload())

            if len(fields['Body']) <= 1: continue # skip empty emails

            emails.append(fields)

        # deduplicate content
        df = pd.DataFrame(emails).drop_duplicates(subset='Body').reset_index(drop=True)
        print('--- Found %d emails out of %d possible' % (len(df), limit))
            
        return df

Filtering

For this effort we are not interested in emails that originate from outside the company, nor in emails generated automatically by applications, spam, or mail sent from groups or departments. Empty emails are also of no value. We are focused on employee traits and behaviors, so information not stemming from employees can be ignored. Note that excluding external emails may impact dedication scores, since non-work-related email content, games and subscriptions can be construed as indicators of varying levels of dedication.

Upon casual inspection of the data, a number of internal system and department emails were observed. Rather than manually identifying these email addresses for removal, I use an automated approach to “guess” which addresses are not related to an actual user: I leverage the NLTK names corpus to compare each “from” address component to a human name, and discard emails whose “from” address does not contain some form of human name. This logic could also have been reversed to use POS (part-of-speech) tagging to identify address parts that are not proper nouns, as sketched after the code below. Note that complex names not of Western origin may not be fully represented in this corpus; in a real-world situation a better names database should be acquired.

    def _is_external_origination(self, from_addr):
        ''' Determine if email originated external to the company '''
        addr_parts = from_addr.split('@')
        return False if len(addr_parts) == 2 and addr_parts[1].lower() == 'enron.com' else True

    def _is_system_generated(self, from_addr):
        ''' Determine if email was system generated. Note - doesn't work well with complex names '''
        parts = re.sub(r'@.*', '', from_addr).split('.')
        proper_nouns = [1 if x in self._name_list else 0 for x in parts]
        return True if sum(proper_nouns) == 0 else False

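For completeness, here is a sketch of the reversed POS-tagging approach mentioned above. This is hypothetical and untested against the full corpus; it assumes nltk has been imported and the punkt and averaged_perceptron_tagger resources have been downloaded:

    def _is_system_generated_pos(self, from_addr):
        ''' Hypothetical variant - flag the address when POS tagging finds no proper noun '''
        local_part = re.sub(r'@.*', '', from_addr).replace('.', ' ').title()
        tags = nltk.pos_tag(nltk.word_tokenize(local_part))
        return not any(tag.startswith('NNP') for _, tag in tags)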

Cleaning

There are a number of standard steps taken to clean email content, including removing unneeded punctuation, newlines, whitespace and invalid characters.

For this dataset there is also a need to remove historic content from the email body (i.e. original email thread messages, forwarded content, inline attachments and embedded content from previous email actions). Different systems create different artifacts, so this step can involve a number of different techniques.

I prefer to use the Python re library for data cleaning as it provides an almost unlimited ability to define custom expressions in a concise manner.

    def _clean(self, payload):
        ''' Remove unwanted information from email body '''

        text = payload
        marks = ['`', '&', '*', '+', '/', '<', '=', '>', '[', '\\', ']', '^', '_', '{', '|', '}', '~', '»', '«'] 
        punct_pattern = re.compile("[" + re.escape("".join(marks)) + "]")

        text = re.sub(r'\n', " ", text) # remove newlines
        text = re.sub(r'\r', " ", text) # remove carriage returns
        text = re.sub(r'\t', " ", text) # replace tabs 
        text = re.sub(r'=[0-9][0-9]', '', text) # remove parsing artifacts
        text = re.sub(r'[^\040-\176]+', '', text) # remove invalid characters
        text = re.sub(punct_pattern, "", text) # remove unneeded punct

        text = re.sub(r'-----Original Message-----.*', '', text)
        text = re.sub(r'From: .*', '', text)
        text = re.sub(r'----- Forwarded by.*', '', text)
        text = re.sub(r'---------------------- Forwarded by.*', '', text)
        text = re.sub(r'--------- Inline attachment follows.*', '', text)

        text = re.sub(r'  +', ' ', text) # cleanup whitespace
        return text

Deduplication

There will be duplicated emails within this dataset (i.e. multiple users receiving the same email), so to ensure the extracted data is not skewed we will remove emails that contain the same body content after cleaning. With Pandas this is done easily with the “.drop_duplicates” function.

        # deduplicate content
        df = pd.DataFrame(emails).drop_duplicates(subset='Body').reset_index(drop=True)

Saving

We could continue to analyze the dataframe within this application, but it’s best to break the processing into checkpoints to avoid unnecessary reprocessing in the event an error is discovered. In this instance, saving the formatted data is done very simply with the Pandas “.to_csv” function. One thing to note about the Pandas save routines is that objects and arrays are saved as strings, so when re-importing the data in later steps care must be taken to convert them back to their proper forms, as sketched after the code below.

email_df = EmailExtraction(config).email_data
email_df.to_csv(config['data_dir'] + config['email_extracted_fn']) #save to file for next step in pipeline
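For reference, a minimal re-import sketch for the next step in the pipeline (assuming the same config dict): the list columns come back from CSV as strings and must be converted with ast.literal_eval, and the timestamps must be re-parsed:

import ast
import pandas as pd

email_df = pd.read_csv(config['data_dir'] + config['email_extracted_fn'], index_col=0)

# list columns were serialized as strings by to_csv -- convert them back
for col in ['To_Address', 'Cc_Address', 'Bcc_Address']:
    email_df[col] = email_df[col].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else None)

# timestamps were serialized as strings as well
email_df['DateTime'] = pd.to_datetime(email_df['DateTime'])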

Performance

The overall size of the repository for the directories selected is 146,589 emails out of the approximately 550,000 total emails.  The elapsed time to process is roughly 3 minutes using this single-threaded process.  

For larger implementations it is recommended that the dataset be segmented into smaller chunks and the code modified to process in parallel.  This could be accomplished using Python and forking multiple processes.  One could also use Cython multithreading or port the code to a different language more inherently capable of parallel processing like R, Go, C++ or Java.  Depending upon the technology employed, a multi-threaded approach may require merging the independently processed data streams back into a single repository and then deduplicating before moving on to the next step in the pipeline.
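
As a sketch of the multi-process option in Python (parse_chunk is a hypothetical top-level wrapper around the parsing logic shown earlier; the independently parsed frames are merged and deduplicated at the end):

from multiprocessing import Pool
import pandas as pd

def parse_parallel(email_sources, workers=8):
    ''' Parse file chunks in separate processes, then merge and deduplicate '''
    chunks = [email_sources[i::workers] for i in range(workers)]
    with Pool(workers) as pool:
        frames = pool.map(parse_chunk, chunks)  # parse_chunk: hypothetical per-chunk parser
    return pd.concat(frames).drop_duplicates(subset='Body').reset_index(drop=True)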

Results

Out of the roughly 147K eligible emails, the process filtered out 61,641 as unsuitable for our needs, leaving 84,948 valid emails for analysis and modeling.

Conclusion

In this post I explained the steps needed to extract and clean corporate email content for further analysis and eventual modeling. The required data elements were identified, the data structures investigated, and the necessary ETL application developed to accurately extract the information and store it for the subsequent workflow step.

The next post will pick up where this article ended and will be focused on analysis of the data to determine the proper techniques for modeling.

If there are questions about the code or processing logic for this blog series, or your company is in need of assistance implementing an AI or machine learning solution, please feel free to contact me at mike@avemacconsulting.com and I will help answer your questions or set up a quick call to discuss your project.

Source Code

The following is the complete source code I created to extract and format the email data for this article.

All of the code for this series can be accessed from my Github repository at:

github.com/Mike-Schmidt-Avemac/ai-email-insights.

#!/usr/bin/python3 -W ignore::DeprecationWarning

'''
MIT License

Copyright (c) 2021 Avemac Systems LLC

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
'''

import sys
if not sys.warnoptions:
    import warnings
    warnings.simplefilter("ignore")
import datetime as dt
import pandas as pd
import numpy as np
import glob
import re
import email.parser as ep
from tqdm import tqdm
from nltk.corpus import names

pd.set_option('display.max_rows', 100)
pd.set_option('display.min_rows', 20)
pd.set_option('display.max_colwidth', 100)

class EmailExtraction():
    ''' Process raw emails into data frame'''

    def __init__(self, config):
        ''' Initialize structures and parse emails upon instantiation '''

        # create a proper names list to estimate system email addresses from actual user email addresses
        print('--- Process proper names list')
        self._name_list = [name.lower() for name in tqdm(names.words())]

        # parse emails, keeping only the information we're interested in
        print('--- Parse raw email data')
        self.email_data = self._parse_emails(config['email_base_dir'], limit=config['email_limit'])

        return 
    
    def _extract_user_name(self, email_fn):
        ''' Pulls the account user name from the physical directory structure '''
        segments = email_fn.split('/')
        return segments[-3]

    def _determine_email_action(self, email_fn):
        ''' Looking at the physical email location, determine the action '''
        action = None

        # infer the email action from the sub-directory containing the file
        if 'deleted_items' in email_fn:
            action = 'deleted'
        elif 'sent_items' in email_fn:
            action = 'responded'
        elif 'sent' in email_fn:
            action = 'sent'
        else:
            action = 'received'

        return action

    def _clean(self, payload):
        ''' Remove unwanted information from email body '''

        text = payload
        marks = ['`', '&', '*', '+', '/', '<', '=', '>', '[', '\\', ']', '^', '_', '{', '|', '}', '~', '»', '«'] 
        punct_pattern = re.compile("[" + re.escape("".join(marks)) + "]")

        text = re.sub(r'\n', " ", text) # remove newlines
        text = re.sub(r'\r', " ", text) # remove carriage returns
        text = re.sub(r'\t', " ", text) # replace tabs 
        text = re.sub(r'=[0-9][0-9]', '', text) # remove parsing artifacts
        text = re.sub(r'[^\040-\176]+', '', text) # remove invalid characters
        text = re.sub(punct_pattern, "", text) # remove unneeded punct

        text = re.sub(r'-----Original Message-----.*', '', text)
        text = re.sub(r'From: .*', '', text)
        text = re.sub(r'----- Forwarded by.*', '', text)
        text = re.sub(r'---------------------- Forwarded by.*', '', text)
        text = re.sub(r'--------- Inline attachment follows.*', '', text)

        text = re.sub(r'  +', ' ', text) # cleanup whitespace
        return text

    def _is_external_origination(self, from_addr):
        ''' Determine if email originated external to the company '''
        addr_parts = from_addr.split('@')
        return False if len(addr_parts) == 2 and addr_parts[1].lower() == 'enron.com' else True

    def _is_system_generated(self, from_addr):
        ''' Determine if email was system generated. Note - doesn't work well with complex names '''
        parts = re.sub(r'@.*', '', from_addr).split('.')
        proper_nouns = [1 if x in self._name_list else 0 for x in parts]
        return True if sum(proper_nouns) == 0 else False

    def _parse_emails(self, base_dir, limit=sys.maxsize):
        ''' Loop through all of the email files and extract/infer features '''

        email_sources = glob.glob(base_dir + '*/deleted_items/*')
        email_sources.extend(glob.glob(base_dir + '*/sent_items/*'))
        email_sources.extend(glob.glob(base_dir + '*/sent/*'))

        parser = ep.Parser()
        emails = []

        for email_fn in tqdm(email_sources[0:limit]):

            # retrieve email content
            try:
                with open(email_fn, 'r') as f:
                    email = parser.parse(f)
            except Exception:
                continue # encoder error, skip

            user_name = self._extract_user_name(email_fn)

            # skip external and system generated emails
            if self._is_external_origination(email.get('From')): continue
            if self._is_system_generated(email.get('From')): continue

            # extract fields
            fields = {}
            date_time = dt.datetime.strptime(email.get('Date')[:-6], '%a, %d %b %Y %H:%M:%S %z')
            fields['DateTime'] = date_time
            fields['Day'] = date_time.weekday()
            tz = dt.timezone(date_time.utcoffset())
            day_start = dt.datetime(date_time.year, date_time.month, date_time.day, 7, 0, 0, tzinfo=tz)
            day_end = dt.datetime(date_time.year, date_time.month, date_time.day, 18, 0, 0, tzinfo=tz)
            fields['Outside_Hours'] = date_time < day_start or date_time > day_end
            fields['From_Address'] = email.get('From')
            fields['To_Address'] = [x for x in email.get('To').replace('\n','').replace('\t','').split(',')] if email.get('To') is not None else None
            fields['Cc_Address'] = [x for x in email.get('Cc').replace('\n','').replace('\t','').split(',')] if email.get('Cc') is not None else None
            fields['Bcc_Address'] = [x for x in email.get('Bcc').replace('\n','').replace('\t','').split(',')] if email.get('Bcc') is not None else None
            fields['Subject'] = email.get('Subject')
            fields['Forwarded'] = 'Fwd' in email.get('Subject') or 'FW' in email.get('Subject') or 'Forwarded' in email.get_payload()
            fields['Source'] = self._determine_email_action(email_fn)
            fields['Body'] = self._clean(email.get_payload())

            if len(fields['Body']) <= 1: continue # skip empty emails

            emails.append(fields)

        # deduplicate content
        df = pd.DataFrame(emails).drop_duplicates(subset='Body').reset_index(drop=True)
        print('--- Found %d emails out of %d possible' % (len(df), limit))
            
        return df

config = {
    'email_base_dir': '/email_analysis_blog/data/maildir/',
    'email_limit': 50000,
    'email_extracted_fn': 'extracted_emails.pd',
    'data_dir': '/email_analysis_blog/data/',
}

email_df = EmailExtraction(config).email_data
email_df.to_csv(config['data_dir'] + config['email_extracted_fn']) #save to file for next step in pipeline
print('--- Sample parsed data')
print(email_df)
exit()
