
Email Insights from Data Science – Part 3


Unsupervised Topic Modeling

In Part 2 of this series the focus was on analyzing the Enron email dataset for anomalies, outliers, and the transformations needed to make the data suitable for processing.  The required modifications turned out to be benign: we only needed to filter out the emails with very large content and drop the BCC address column.

For this post the intent is to implement several different unsupervised topic modeling and classification techniques to generate labels and transform our raw dataset into one suitable for inference model training.  Since this exercise will produce multiple outputs for the same purpose, depending upon the viability of the data, I may employ an ensemble-like approach in the following post to determine a final classification for each of the categories.

Our categories for topic modeling will include “negative / neutral / positive” sentiment and “personal / business” alignment.  These categories provide a measurement of motivation and dedication characteristics.

Creating Supervised Training Data

To clarify the meaning of "unsupervised" learning: it is the process of finding meaningful insights and relationships in data without the need for human labeling; in other words, automated learning with reduced bias.

Supervised learning requires a human to evaluate data elements and relationships and then apply a label to them.  For example, an individual could review each email in our set of 83,691 emails and assign each a sentiment and an alignment score, which we could then use to train a traditional machine learning model.  Besides the obvious effort involved, one of the main problems with this approach is bias.  Having just one human evaluate an email provides a single opinion aligned with that individual's beliefs and knowledge.  To eliminate bias, more opinions are needed for each element.

Unsupervised techniques apply consistent logic to each evaluation and reduce bias.  Of course bias can still be a problem if the technique itself is biased, but at the very least the bias is consistent.

The methods to implement unsupervised learning are numerous.  For example, in Natural Language Processing the transformer modeling approach creates knowledge by comparing a document’s text with itself, one token (or n-gram tokens) at a time.  The larger the network and the more information available, the stronger the insights.  This approach works well for text generation and sequence-to-sequence similarity predictions.

Since we are attempting to classify emails into different topics, I’ll be using matrix decomposition algorithms with a bag-of-words approach for this article.  There are many other ways to implement these functions and I will summarize a few of them at the end of this post.

I will be using the scikit-learn library for the algorithms needed in this implementation.  The decomposition functions take a vocabulary token frequency matrix as input, which we will create with the CountVectorizer and TfidfVectorizer routines.  For the clustering (decomposition) step, the LatentDirichletAllocation and NMF (Non-Negative Matrix Factorization) methods will be employed.
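As a quick orientation, here is a minimal sketch of the vectorize-then-decompose pipeline used throughout this post (a simplified illustration with placeholder documents, not the exact configuration used later):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

documents = ["quarterly results look great", "golf on friday?"]  # placeholder email bodies

# bag-of-words token frequency matrix
tf = CountVectorizer(stop_words='english').fit_transform(documents)

# decompose the frequency matrix into topic groups
lda = LatentDirichletAllocation(n_components=2, random_state=1).fit(tf)

# per-document topic distribution; argmax picks the dominant topic for each email
doc_topics = lda.transform(tf)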

Objectives

To summarize the sentiment classification objective: a fixed sentiment vocabulary is used to filter words from the email content into a token frequency matrix, the frequency matrix is used to group the emails into topics, and each topic group is then scored using a weighted algorithm (e.g. a softmax or probability weighting).

For the alignment objective, classification labels will be created using a similar approach, but instead of a fixed vocabulary we will generate a list of words to exclude from the email term frequency matrix.  Using this reduced vocabulary frequency matrix, topics will be deduced with decomposition techniques, ending with a scoring function that leverages cosine similarity to auto-label each topic.

Preparation

Feature extraction and labeling require a number of steps to be performed before processing can begin.

From the previous analysis phase we identified the BCC address element as non-essential, so that feature will be removed from the dataframe.  We also determined that approximately 350 emails were too large for our purposes and needed to be filtered out.  Finally, I noticed an artifact from a previous step had crept into the dataset (an unused Pandas index), so that has been removed as well.

Several of the sentiment calculation routines rely upon a fixed vocabulary of words to identify the "polarity" of an email.  Each of these vocabulary files will need to be loaded and structured appropriately for later use.

To reduce term variations (e.g. "process" versus "processes"), a lemmatizer will be applied to each term to consolidate related tokens and improve topic quality.  For performance purposes the lemmatizer is initialized once at the class level.

        raw_emails = pd.read_csv(self.data_dir + config['email_extracted_fn'])
        raw_emails.fillna(value="[]", inplace=True)

        self.email_df = self._fix_email_addresses('From_Address', raw_emails)
        self.email_df = self._fix_email_addresses('To_Address', raw_emails)
        self.email_df = self._fix_email_addresses('Cc_Address', raw_emails)
        self.email_df.pop('Bcc_Address') # from analysis phase, Bcc is duplicated - remove
        self.email_df.pop('Unnamed: 0') # artifact from analysis phase - remove

        # from the analysis phase - remove samples with content length greater than 6000 characters
        self.email_df = self.email_df[self.email_df['Body'].map(len) <= 6000].reset_index(drop=True)

        # build positive/negative sentiment dictionaries
        self.sentiment_d = self._create_sentiment_dictionary(self.data_dir, config['negative_sentiment_fn'], config['positive_sentiment_fn'])

        # common lemmatizer
        self.lemmatizer = WordNetLemmatizer()

Sentiment Vocabulary Method 1

The first method for calculating sentiment within the email dataset involves a fixed vocabulary of words, each tagged simply as negative or positive.  We will rely solely upon the frequency of each token occurrence within a given topic to weight the sentiment outcome.  I'm using a vocabulary developed by Minqing Hu and Bing Liu and available at http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html.

This particular vocabulary is skewed to the negative with 4,814 negative keywords and 2,036 positive keywords.  In a real-world implementation this offset should be investigated further to ensure the imbalance is appropriate (i.e. those are all of the positive words available rather than only the words found in the study).  For this project, since the number of positive words is adequately large, we will assume that if the email content is truly positive it will be well represented by this sentiment vocabulary.

Uses sentiment lexicon from http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html.

Minqing Hu and Bing Liu. "Mining and Summarizing Customer Reviews." Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2004), Aug 22-25, 2004, Seattle, Washington, USA

Processing

To implement this approach we will use the sentiment vocabulary to generate a token frequency matrix from the email content.  I'll accomplish this by using the CountVectorizer function from scikit-learn with the sentiment vocabulary acting as the fixed vocabulary for the algorithm.  This ensures the frequency matrix will only include the words we are interested in.

With the frequency matrix in hand, we pass it as input to the LatentDirichletAllocation (LDA) routine, which evaluates the matrix for relationships within each email and across the entire dataset.  I set the number of output topics to 100 after a bit of trial and error: running the routine a few times and evaluating the topic scores to find the point where the scores reflect meaningful insights rather than default/random outcomes.
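Condensed from the sentiment_analysis routine in the full source listing at the end of this post, the vectorization and decomposition steps for this method look roughly like this:

# fixed vocabulary built from the Hu/Liu negative/positive word lists
vocab = {term: i for i, term in enumerate(self.sentiment_d.keys())}

tfc = CountVectorizer(max_features=20000, strip_accents='unicode', analyzer='word',
                      token_pattern=r"[a-zA-Z][a-z][a-z][a-z]+", vocabulary=vocab)
tf = tfc.fit_transform(self.email_df['Body'].to_list())

# group the sentiment-term frequencies into 100 topics
lda = LatentDirichletAllocation(n_components=100, max_iter=3, learning_method='online',
                                learning_offset=50.0, random_state=1).fit(tf)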

The topic groups produced by LDA include an "importance" rating for each word in the vocabulary.  Extracting features from this structure requires an algorithm specific to the task.  For this exercise I'm using the top 15 subtopics per topic.  Each subtopic has a relevance score attached, and in my experience with this dataset the first 1-5 subtopics exhibit the most influence on the sentiment outcome.

Note: Before evaluating each subtopic, I scaled the weighted scores to a range of 0.0 <= score <= 50.0 in order to provide consistent results for scoring.

Using the scaled scores from the top 15 subtopics, the vector is summed and then evaluated for a 'negative', 'positive' or 'neutral' outcome.  I selected a range of -15.0 < neutral < 15.0 to separate the classes, based simply upon an approximately even split of the general scaled range of -50.0 <= sum <= 50.0.

Note: The total summarized score could be outside of this range if multiple subtopics have similar frequencies for a given sample.  In a real implementation, further evaluation of the best scoring method is needed.
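The scaling and thresholding logic, condensed from the _classify_sentiment routine in the full listing below, amounts to the following:

# min/max scale the raw LDA word weights for the top 15 subtopics into the 0.0 - 50.0 range
s_minmax_score = lambda a, lo, hi: ((a - a.min(axis=0)) / (a.max(axis=0) - a.min(axis=0))) * (hi - lo) + lo
subscores = s_minmax_score(np.array(scores[0:15]), 0.0, 50.0)
subsentiments = [self.sentiment_d[feature_names[x]] for x in range(15)]

# negative terms subtract their scaled weight, positive terms add theirs
weighted = sum(subscores[x] if subsentiments[x] == 'pos' else -subscores[x] for x in range(15))
label = 'neg' if weighted < -15.0 else 'pos' if weighted > 15.0 else 'neu'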

Outcome

The results from running method 1 for sentiment classification showed negative-leaning topic classifications but a positive-leaning email distribution.  Out of a possible 100 topics, over half (53) were scored as negative.  But when the topics were applied back to each email, there were fewer negative emails than positive emails in the actual dataset.  The ratio of topic sentiment does not directly correlate with the actual classification ratio.

In my opinion the actual outcome ratio is expected, especially given the source of the emails used in this study.  A wider range of employee data samples would most likely shift the distribution toward one polarity or the other, but it is possible this outcome genuinely represents the company.

Note: It would also be helpful to group these outcomes by year and month to see the polarity change over time. In a production implementation this would be the approach taken.

Sentiment Model 1 Detail Results

Note: This method produced a high number of "unknown" outcomes, which means this set of emails did not have an associated topic.  For processing purposes, these emails can be considered "neutral" since the frequency distribution routine did not find any sentiment keywords for this subset.

Sentiment Model 1 Results

As can be seen from the subtopic list, the scores appear to be accurate, especially given that the first 1-5 tokens have the most influence over the score.  Let's try another vocabulary to see if the results are similar.

Sentiment Vocabulary Method 2

This sentiment calculation uses a similar process to method 1 with the following exceptions.

  • The sentiment vocabulary is from the AFINN project.
  • The AFINN vocabulary provides a more robust polarity range of [-5, 5] rather than a simple binary score.

As with the first method, this routine uses a fixed sentiment vocabulary to generate a token frequency matrix of sentiment terms, which is then used to generate a topic matrix of frequency scores.  The topic matrix is then further evaluated to create the sentiment outcome for each email.

The AFINN vocabulary, in contrast with the previous sentiment vocabulary, is skewed to the positive with 2,476 positive keywords and 1,600 negative keywords.  This dataset is also much smaller at roughly half the size.

The most significant difference with this sentiment vocabulary is the scoring method.  The included terms were hand-scored over a range that provides more resolution as to the polarity of the text.  This additional attribute means we will be able to weight the output scores based upon frequency and also term polarity.

Uses AFINN from http://corpustext.com/reference/sentiment_afinn.html

Finn Årup Nielsen. "A new ANEW: Evaluation of a word list for sentiment analysis in microblogs." Proceedings of the ESWC2011 Workshop on 'Making Sense of Microposts': Big things come in small packages. CEUR Workshop Proceedings 718, 93-98. May 2011.

Processing

For this approach we will use the same CountVectorizer and LatentDirichletAllocation implementation as the first method, with the exception of swapping out the sentiment vocabulary.  We'll also keep the topic count at 100 components.

Topic scoring will also use a scaled range of 0.0 <= score <= 50.0 before calculating the outcome.

The scoring itself is a simple algorithm: the probability distribution of the frequency scores is multiplied by the sentiment score vector to produce a frequency/polarity score matrix, which is then summed to produce the weighted sentiment score.

The range for this score is much smaller than the previous method, so a neutral range of -1.5 < neutral < 1.5 was used to separate the classes.  Note: For a real implementation, further evaluation of the best scoring range is needed.

terms = dict(map(lambda x: x, zip(subtopics, subscores)))
sublabel_weighted_score = self.ac.calculate_sentiment_weighted(terms=terms)
sublabel_weighted = 'neg' if sublabel_weighted_score < -1.5 else 'pos' if sublabel_weighted_score > 1.5 else 'neu'

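The calculate_sentiment_weighted call above belongs to the author's AutoClassify utility, which is not included in the source listing below.  As a rough illustration of the frequency/polarity weighting described above (my own sketch and an assumption, not the actual AutoClassify implementation), the idea reduces to something like:

def sentiment_weighted_sketch(terms, afinn_scores):
    ''' Hypothetical helper: terms = {word: scaled topic frequency}, afinn_scores = {word: AFINN valence in [-5, 5]} '''
    total = sum(terms.values()) or 1.0
    probs = {word: freq / total for word, freq in terms.items()}            # frequencies -> probability distribution
    return sum(p * afinn_scores.get(word, 0) for word, p in probs.items())  # expected polarity of the topic
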
Outcome

The results from this implementation are quite different from method 1.  The topics are rated more heavily towards the neutral range, with an almost even split of negative and positive topics.

When applied back to the original emails, the number of negatively associated emails dropped significantly while the neutral count went up considerably.

Note: This disparity of outcomes may be due directly to the positively skewed sentiment vocabulary, the limited size of the vocabulary, and/or the neutral scoring range.  Further investigation is needed to validate the results.

Sentiment Model 2 Original Results

Offhand, the outcome does not intuitively seem accurate.  Given the orientation of the sentiment terms within the AFINN vocabulary and the valence scoring method, the increase in neutral scores could be accurate, but the small number of negative outcomes does not fit expectations.

To get a more confident result, I adjusted the neutral scoring range from -1.5/1.5 to -0.5 < score < 0.5.  This adjustment produced results more in line with expectations, although I will still caution that in a production setting this needs further statistical evaluation.

Sentiment Model 2 Adjusted Results

As can be seen from the new results, the proportion of negative and positive outcomes has increased and the neutral count is closer to expectations.

These results better reflect the outcome from method 1 as well, which provides some confidence in the results.  Overall, though, I would say this sentiment vocabulary is lacking for email classification.

Sentiment Model 2 Results

Sentiment Vocabulary Method 3 - Vader

For sentiment classification method 3 I’m using the Vader algorithm as implemented by the NLTK team.

This technique is a rule-based solution and quite different from the decomposition and grouping approach used in methods 1 and 2.  The researchers who created Vader had its 7,500+ sentiment tokens hand-rated by an unbiased group of human volunteers to determine a polarity score from -4 (very negative) to 4 (very positive) for each.

https://www.nltk.org/api/nltk.sentiment.html?highlight=vader#module-nltk.sentiment.vader

Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.

Implementing the Vader polarity scoring method is very easy with a Pandas dataframe organized with one email document per row.  Simply “apply” the function to the body column within the dataframe to capture a valence score for each email.  The scores can then be transformed again to generate the class labels ‘neg’, ‘neu’ and ‘pos’.

from nltk.sentiment import vader

vsa = vader.SentimentIntensityAnalyzer()
vader_df = pd.DataFrame(self.email_df['Body'].apply(vsa.polarity_scores).to_list())
self.email_df['Class_Sentiment_Vader'] = vader_df['compound'].apply(lambda x: 'neg' if x < -0.20 else 'pos' if x > 0.20 else 'neu')

The Vader algorithm returns scores for "negative", "positive", "neutral" and "compound".  The compound score is a normalized summation of all the token sentiment scores.  Since the compound score returned by the algorithm falls between -1.0 and 1.0, we need a way to determine the effective ranges for negative, positive and neutral.  We could pick arbitrary cutoff points at roughly even intervals, like "neg <= -0.25, -0.25 < neu < 0.25, pos >= 0.25", but based upon the distribution of scores from the actual email contents the neutral cutoffs are better approximated at around -0.20 and 0.20.
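A quick way to pick those cutoffs is to inspect the compound score distribution directly (the full source below does the same kind of check with a histogram and kernel density plot); something along these lines:

# look at the spread of compound scores before committing to class boundaries
print(vader_df['compound'].describe())
print(vader_df['compound'].quantile([0.10, 0.25, 0.50, 0.75, 0.90]))
vader_df['compound'].hist(bins=10); plt.show()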

Vader Polarity Scoring Histogram

Looking at the histogram of scores, we can tell the algorithm calculated a positive leaning overall result for our dataset with a considerable number of “very positive” results.

The upward frequency of increasingly positive scores is suspicious.  This doesn’t mean the results are inaccurate, just that further investigation is required to substantiate the outcomes.  It could very well be that the internal environment at Enron was extremely positive…but that seems unlikely.

More likely, within corporate environments especially, there is a common misuse of positively-associated terms and punctuation that manifests from culture and institutionalized etiquette.  Rule-based algorithms like Vader may have a difficult time detecting this type of nuance.

Note: Properly understanding language use within an organization requires an algorithm that first learns cultural subtleties through NLP techniques, which can then be used to standardize communication content and remove company-specific bias.

Outcome

Assuming for this exercise that the Vader algorithm is unbiased and accurate, let's run the process and see how the outputs stack up against methods 1 and 2.

Vader Polarity Scoring Results

As can be seen from the summary, the output is very heavy on the neutral scores.  The negative results seem aligned, but the neutral results don't match the distribution seen in the histogram.

The reason for the mismatch is the source of information.  The histogram was created using the “compound” value while the sentiment totals seen here are the “negative”, “neutral”, and “positive” values returned by Vader.

The actual label totals, using the compound value, were 6,476 negative emails, 22,317 neutral and 54,898 positive, which matches the histogram.  These results are a mix of the outputs seen from method 1 and method 2, but most closely match method 2.
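In code terms the two summaries come from different parts of the Vader output (both appear in the full listing below): the table above sums the per-token proportion columns, while the histogram and label totals derive from the compound column.

# per-token proportions returned by Vader - the basis of the summary table above
print(vader_df[['neg', 'neu', 'pos']].sum())

# labels derived from the normalized compound score - the basis of the histogram and label totals
print(self.email_df['Class_Sentiment_Vader'].value_counts())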

Results Over Time

Vader Sentiment Results Graph

One of the most effective ways to visualize sentiment changes within an organization is a graph by year and month.  The graph below is large and may be difficult to see properly (open the image in a separate tab for an enlarged view), but it neatly shows the changes in sentiment within the company over time.
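The grouping itself is a one-liner once a year/month key exists; the full source below builds a dt_y_m column and box-plots the compound score against it:

# group Vader compound scores by a year*100+month key and plot the monthly spread
vader_df['dt_y_m'] = self.email_df['DateTime'].apply(lambda x: (x.year * 100) + x.month)
vader_df.boxplot(column='compound', by='dt_y_m'); plt.show()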

If you recall, the majority of our data starts in the 2000s, so the first 10 months or so of the graph do not have enough data for a reliable prediction.  The remainder appears to be relatively accurate, with smaller fluctuations up until roughly June of 2001, when a steady downward trend begins.  June is about the time industry analysts started to question Enron's viability, and the SEC stepped in soon after.  Note that the graph also shows an uptick towards positive around February, when Jeffrey Skilling took over as CEO; internally that event seems to have inspired the company for a brief time.

Based upon this graph I would say there is some level of accuracy to the Vader results, since we can tie specific dates to definitive changes in sentiment.  At the very least, used consistently, this approach will provide insight into significant changes within an organization.  Much more fine-tuning and analysis would be needed to dial in subtle changes.

Auto Classification Method 1

We are now ready to begin working on general classification; specifically to identify policy alignment with regards to personal/business oriented emails.

To determine alignment labels we will need a different approach to identify the classes: filter out the words we do not want to work with, associate the remaining words into topics by word frequency, and then score each topic based upon its subtopics.

This implementation will use the same count vectorization and Latent Dirichlet Allocation methods we used for the sentiment classifications, except this time the count vectorization step will be performed twice.  The first pass will be used to identify the words we wish to filter out, and the second pass will apply that list to the email content.

Note: The filtering (or stopwords) process will also create a full list of all word token counts and save that matrix to a file for manual analysis.

For this exercise I'm leveraging the NLTK words library to generate the stopwords list.  Using CountVectorizer, the frequency matrix is created and then iterated through to compare each word with the NLTK dictionary.  I also want to filter out all word types except those I'm interested in, so the NLTK part-of-speech tagging library will be employed to remove all word types except nouns and verbs.

Note: I used CountVectorizer as a tokenizer here rather than as a frequency measure.  The reason is to apply the initial stopwords list and lemmatizer in the same way as the actual classification step.

    def _stop_word_support_function(self, macro_filter=0.5, micro_filter=10):
        ''' Routine to collect word count list and pos word filter list for developing stop word vocabulary'''

        print(f'\n--- Stop Word Support Function Start')
        start = time()

        # create corpus from email content
        contents = self.email_df['Body'].to_list()
        vocab = None

        sw_list = [name.lower() for name in names.words()]
        sw_list.extend([stopword.lower() for stopword in stopwords.words()])

        tokenizer = self._body_content_analysis_tokenizer

        # fetch counts of word phrases
        tfc = CountVectorizer(max_df=macro_filter, min_df=micro_filter, max_features=20000, strip_accents='unicode', analyzer='word', 
                              token_pattern=r"[a-zA-Z][a-z][a-z][a-z]+", ngram_range=(1,1), stop_words=sw_list, vocabulary=vocab,
                              tokenizer=tokenizer)
        tf = tfc.fit_transform(contents)
        tf_a = tf.toarray()
        tf_fns = np.array(tfc.get_feature_names())

        # form/sort topic/sub-topic dataframe and save to file for manual analysis due to high variability
        # will use manual inspection to develop an algorithm for filtering list to a functional vocabulary
        sums = np.sum(tf_a, axis=0)
        dense_word_matrix = []
        exclude_word_matrix = ['wa']
        word_filter = lambda word: pos_tag([word])[0][1][0:2] not in ['NN','VB']
        word_list = words.words()

        for x in tqdm(range(len(sums))):
            phrase = {'phrase':tf_fns[x], 'count':sums[x]}
            dense_word_matrix.append(phrase)

            # collect words to filter - add to stop_words
            oov = True if tf_fns[x] not in word_list else False
            try:
                if oov or word_filter(tf_fns[x]):
                    exclude_word_matrix.append(tf_fns[x])
            except Exception as err:
                exclude_word_matrix.append(tf_fns[x])

        print(f'\n--- Stop Word Support Function Complete in ({time()-start} seconds)\n')

        sums_df = pd.DataFrame(dense_word_matrix).sort_values(by='count',ascending=False).reset_index(drop=True)
        sums_df.to_csv(self.data_dir+self.manual_word_counts_fn, index=False) # save to file for manual inspection
        print(f'\n--- Word Matrix\n\n{sums_df.head(20)}')

        word_filter_df = pd.DataFrame(exclude_word_matrix, columns=['word']).sort_values(by='word').reset_index(drop=True)
        word_filter_df.to_csv(self.data_dir+self.word_filter_list_fn, index=False) # save to file for manual inspection
        print(f'\n--- Word Filter\n\n{word_filter_df.head(20)}')

        return word_filter_df['word'].to_list()

Word Filter List Results

Processing

With the customized stopwords list we can move on to generating the topic frequency matrix.  LDA organizes topics in a strict manner compared with other algorithms, so I selected a topic/group count of 100 to ensure enough data for accurate results.  Too many topic groups will dilute the dataset and the algorithm will return random connections; too few and the data points become intertwined and difficult to classify.

Classifying each of the generated topics requires enumerating through the topics and determining the appropriate class based upon the following high-level steps.

  • The subtopic list is sorted in descending order by frequency to align the most relevant terms at the top.
  • The top N subtopics are extracted for weight calculation.  In this instance I selected the top 20 associated words.
  • Using those subtopics, the routine classifies each topic according to the target labels.  In this case I'm using the words "fun" and "work" to delineate work-oriented emails from personal ones.  I chose these labels based upon their mathematical distance from one another; class labels too far apart will skew results, as will words too closely related.
 
The secret sauce for auto classification in this exercise is an implementation based upon cosine similarity.  The algorithm I developed is fairly simple for this example, but depending upon the level of detail required, this selection process can be very complicated and based upon multiple semantic “levels”, morphology and customized vocabularies.
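The classify_terms call in the listing below comes from the author's AutoClassify utility, which is not reproduced in this post.  As an illustration only (assuming pre-trained word vectors loaded through the gensim downloader, and not the actual AutoClassify implementation), the weighted cosine-similarity idea looks roughly like this:

import gensim.downloader as api

wv = api.load('glove-wiki-gigaword-100')  # any pre-trained word embedding would work here

def classify_terms_sketch(classes, terms):
    ''' Hypothetical helper: classes = candidate labels, e.g. ['fun', 'work']; terms = {subtopic_word: topic weight} '''
    best_label, best_score = 'unknown', 0.0
    for label in classes:
        # weight each subtopic's cosine similarity to the label by its topic score
        score = sum(weight * wv.similarity(label, word)
                    for word, weight in terms.items() if word in wv and label in wv)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# e.g. classify_terms_sketch(['fun', 'work'], {'meeting': 12.3, 'contract': 8.1, 'golf': 1.2})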
 
Note: The number of classes can be any reasonable number.  The limit is mostly compute related, but at some point too many classes will dilute the data and create inaccurate results.
Note: The top 20 associated words were considered, but like the sentiment analysis the top 1-5 words carry the most weight.
Note: This process can be used for other labeling exercises like detecting emails oriented towards [‘finance’, ‘legal’, ‘compliance’] or [‘client’,’vendor’,’employee’].
Note: I chose document-frequency boundaries for the count vectorization step of 50% of emails as the maximum and 10 emails as the minimum.  This means any term that occurs in more than 50% of the emails, or in fewer than 10 emails, will not be used.  The idea here is that terms occurring too frequently will most likely be associated with common concepts we are not interested in, and terms occurring infrequently will not represent the company overall.
    ##############################
    # Topic Classification Method
    ##############################

    def _classify_topic_method(self, model, features, nbr_top_tokens=20, title='', classes=[], method='prob'):
        print(f'\n--- Classify Topic Words Method - {title}\n\n')
        topics = []
        for idx, component in enumerate(model.components_): # iterate through all of the topics
            feature_index = component.argsort()[::-1]
            feature_names = [features[x] for x in feature_index]
            scores = [component[x] for x in feature_index]

            subtopics = []
            subtopics_scores = []
            for x in range(0, nbr_top_tokens): # iterate through the number of requested subtopics and calculate sentiment scores
                if x < len(feature_index): 
                    subtopics.append(feature_names[x])
                    subtopics_scores.append(scores[x])

            topic = {}
            topic['topics'] = subtopics
            topic['topics_scores'] = subtopics_scores

            # find the class label for these topic terms
            terms = {subtopics[x]:subtopics_scores[x] for x in range(len(subtopics))}
            label = self.ac.classify_terms(classes=classes, terms=terms, method=method, use_weights=True if method != 'none' else False) # use AutoClassify to determine label
            topic['label'] = label

            # save
            topics.append(topic)

            # display topic plus support subtopic words
            print(f'{self.cliporpad(str(idx), 3)} {self.cliporpad(label, 15)} / {subtopics_scores[0]} = {" ".join(subtopics)}')

        # convert to dataframe and save for analysis
        df = pd.DataFrame(topics)
        columns = ['label','topics']
        print(f'\n{df[columns].sort_values(by="label", ascending=False).head(200)}\n\n')
        df.to_csv(self.data_dir+self.topic_gradients_fn, index=False)

        return df

    def _add_label_to_dataframe(self, dcmp, topics, columns):
    #def _build_supervised_dataframe(self, lda, topics, contents):
        ''' Using the unsupervised sentiment analysis data, create a supervised learning dataset '''
        assert len(dcmp) == len(self.email_df), 'Length of decomposition matrix should match email dataframe length'

        values = []
        for x in tqdm(range(len(dcmp))):
            tidx = np.argmax(dcmp[x]) if np.argmax(dcmp[x]) > np.argmin(dcmp[x]) else -1
            value = topics.at[tidx, columns[0]] if tidx >= 0 else 'unknown'
            values.append(value)

        self.email_df[columns[1]] = pd.Series(values)
        return

    def _body_content_analysis_tokenizer(self, text, max_length=20):
        words = re.findall(r"[a-zA-Z][a-z][a-z][a-z]+", text)
        arr = [self.lemmatizer.lemmatize(w) for w in words if len(w) <= max_length] # could also use spacy here
        return arr

    def topic_classification(self, macro_filter=0.5, micro_filter=10, vectorizer='CountVectorizer', decomposition='LatentDirichletAllocation', classes=[], n_components=100, subtopics=15, method='prob', mode=1):
        '''
            General topic classification using various techniques.

            Content frequency distributions -> CountVectorizer & TFIDFVectorizer 
            Decomposition ->
                LDA - LatentDirichletAllocation
                LSA - TruncatedSVD
                NMF - Non-Negative Matrix Factorization

            Classification ->
                AutoClassify (developed by Avemac Systems LLC)
        '''

        # create corpus from email content
        contents = self.email_df['Body'].to_list()

        # custom stop words - not needed if using a fixed vocabulary
        sw_list = [name.lower() for name in names.words()]
        sw_list.extend([stopword.lower() for stopword in stopwords.words()])
        sw_list.extend(self._stop_word_support_function(macro_filter=macro_filter, micro_filter=micro_filter))
        sw_list.extend(['pirnie','skean','sithe','staab','montjoy','lawner','brawner']) # a few names that made it through the filters

        # fixed vocabulary of keywords
        vocab = None

        # lemmatizer
        tokenizer = self._body_content_analysis_tokenizer

        # fetch counts of word phrases
        print(f'\n--- Starting Email Content Analysis')

        start = time()

        if vectorizer == 'CountVectorizer':
            tfc = CountVectorizer(max_df=macro_filter, min_df=micro_filter, max_features=20000, strip_accents='unicode', analyzer='word', 
                                 token_pattern=r"[a-zA-Z][a-z][a-z][a-z]+", ngram_range=(1,1), stop_words=sw_list, vocabulary=vocab, tokenizer=tokenizer)
            title = 'Count'
        else:
            tfc = TfidfVectorizer(max_df=macro_filter, min_df=micro_filter, max_features=20000, strip_accents='unicode', analyzer='word', 
                                 token_pattern=r"[a-zA-Z][a-z][a-z][a-z]+", ngram_range=(1,1), stop_words=sw_list, vocabulary=vocab, tokenizer=tokenizer,
                                 use_idf=1, smooth_idf=1, sublinear_tf=1)
            title = 'TFIDF'

        tf = tfc.fit_transform(contents)
        tf_a = tf.toarray()
        tf_fns = np.array(tfc.get_feature_names())
        print(f'--- Content Frequency Analysis ({time()-start} seconds)\n')

        start = time()

        if decomposition == 'LatentDirichletAllocation':
            dcmp = LatentDirichletAllocation(n_components=n_components, max_iter=3, learning_method='online', learning_offset=50.0, random_state=1).fit(tf)
            title += ' - LDA'
        elif decomposition == 'TruncatedSVD':
            dcmp = TruncatedSVD(n_components=n_components, n_iter=100, random_state=1).fit(tf)
            title += ' - LSA'
        else:
            dcmp = NMF(n_components=n_components, random_state=1, beta_loss='kullback-leibler', solver='mu', max_iter=1000, alpha=0.1, l1_ratio=0.5).fit(tf)
            title += ' - NMF'

        print(f'--- Decomposition Analysis ({n_components} components in {time()-start} seconds)\n')

        results = self._classify_topic_method(dcmp, tf_fns, nbr_top_tokens=subtopics, title=title, classes=classes, method=method)

        # add class label to training dataframe
        column = 'Class_Alignment_'+str(mode)
        self._add_label_to_dataframe(dcmp.transform(tf), results, ('label',column))
        print(f'\n{self.email_df[[column]].groupby(by=column).size()}\n')
        return

Outcome

The results from the algorithm seem okay.  The topics were classified 25% personal and 75% business oriented.  Without an established/verified baseline, internal knowledge of the company and confirmation of the tagging results it is difficult to determine the accuracy of this outcome with any level of certainty.

Anecdotally we can generalize that corporate environments have a significant amount of non-company related communications (especially during the early 2000’s when social technologies were becoming mainstream).  Depending upon the culture of the company a 3-to-1 ratio may be perfectly acceptable.

Looking at the actual results when applied to each email in the dataset, only 10% were labeled as personal, 83% as professional and 7% as unknown.  This feels like a better ratio and more likely to be accurate.  

Note: So far during this exercise we have seen consistent behavior between topic classification and actual labeling: the topic ratio does not indicate the ratio of labeled records.  Just because the topics are labeled a certain way says nothing about the number of emails associated with each label.

Auto Classification Model 1 Results

In order to gain a higher degree of confidence in the outcome, a visual inspection of each topic and associated label is necessary.  This step could be automated further, but for this post I’ll just do a quick check and see if the labels seem appropriate.

Looking at the first 30 topics I have a concern that not all of the labels are as accurate as I would like.  There are at least two topics that I would declare mislabeled.  

To correct this, given that the first 1-5 subtopics are the most influential, the first option is to remove the leading subtopic from consideration when it is not a critical term (i.e. one very specific to the target labels); add the term to the stopwords list and rerun the process.  This will change the term frequencies and hopefully allow the routine to group more common terms together.

Another option is to adjust the number of topics produced.  Looking at the lead subtopic score, all of the subtopics have a calculated frequency (i.e. not a random outcome).  This means there are more potential topics to be discovered so the topic range could be increased a bit to see if the terms better distribute themselves.  Sometimes it is better to generate fewer topics so that can be tried as well.

Auto Classification Model 1 Labels

Ultimately the objective is to find the right balance of topics and terms so the labels produced are logical and optimal.

Auto Classification Method 2

For the second classification attempt we will modify method 1 just a bit to see if we obtain better results.  Instead of using LatentDirichletAllocation to create the topic frequency matrix we will leverage the Non-Negative Matrix Factorization algorithm.

Due to the way NMF calculates term frequencies, the number of topics associated with the same term is higher than with LDA.  To compensate for this and to ensure topics are specific enough for our labeling task, I adjusted the number of topics generated from 100 to 200.

Outcome

The results are similar to model 1, but appear to be more aligned with expectations.  Instead of a roughly 3-to-1 ratio of professional to personal topic labels, the ratio is now 17-to-1; but as we have observed, the ratio of topic labels does not correlate directly with how the labels are applied to the input samples.  This routine classified 8% of the actual emails as personal and 85% as professional.  The number of unknown samples stayed roughly the same.

Auto Classification Model 2 Results

Upon visual inspection the associated subtopics are better aligned with the generated label.  There are still some associations that are suspect, but the semantics are more subtle and harder to differentiate.  Overall I would rate this method more accurate than the first attempt.

Auto Classification Method 3

For the final classification attempt, I modified method 2 slightly, replacing CountVectorizer with TFIDF as the mechanism for creating the word frequency distribution matrix.  All other factors I left the same.

TfidfVectorizer produces a matrix of token “relevance” or “importance” rather than basic counts.  This provides the NMF algorithm with a little more information to work with when determining how to reduce the email content into the topics we will use to generate class labels.
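For reference, the three alignment runs differ only in the arguments passed to topic_classification (hypothetical calls based on the method signature shown earlier; config is assumed to be the same configuration dictionary used to construct the class):

um = UnsupervisedModeling(config)

# method 1 - CountVectorizer + LDA, 100 topics
um.topic_classification(vectorizer='CountVectorizer', decomposition='LatentDirichletAllocation',
                        classes=['fun', 'work'], n_components=100, mode=1)

# method 2 - CountVectorizer + NMF, 200 topics
um.topic_classification(vectorizer='CountVectorizer', decomposition='NMF',
                        classes=['fun', 'work'], n_components=200, mode=2)

# method 3 - TfidfVectorizer + NMF, 200 topics
um.topic_classification(vectorizer='TfidfVectorizer', decomposition='NMF',
                        classes=['fun', 'work'], n_components=200, mode=3)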

The images below are a look inside the data structure of both algorithms.  It is easy to see the integer counts produced by CountVectorizer versus the weights produced by TfidfVectorizer.

CountVectorizer Frequency Counts
TfidfVectorizer Relevance Scores

Outcome

The classification results from TFIDF appear to be dialed in a little better than the count-based approach used in method 2, but the overall labeling distribution is about the same mix as model 1.  Tweaking each approach for optimal balance and then blending the outputs may provide even better results.

Auto Classification Results

Final Assessment

Now that individual processing is complete we have an opportunity to view the results of each function together. 

Normally one approach is used to solve a particular problem, but I find it is advantageous to try different methods if time permits.  Having multiple outcomes also provides a form of “checks-n-balances” insight into the viability of one approach versus another versus our own expectations.

Sentiment Classification

The routines to calculate sentiment varied in implementation the most, so it is no surprise the results were also varied.

One improvement I did not discuss before is the iteration parameter for the LDA algorithm.  The LDA approach uses iterative randomization during processing, so, as with typical model training, the more passes over the data the more optimized the outcome.  For LDA I used a low number of iterations (max_iter=3) to keep processing time to a minimum.  Increasing the number of iterations would certainly change the outcome for these routines.

The sentiment lists I leveraged for these tests were both skewed in one manner or another…being either positive/negative leaning, smaller than expected, oriented towards specific parts of speech or developed for tasks not focused on corporate email classification.  I believe developing a customized sentiment vocabulary tuned to this dataset would improve predictions.

Another approach not discussed in this article involves a multi-token sequence implementation.  Rather than a simple bag-of-words frequency approach, the algorithm could gain additional insights through multi-word phrases.  This does involve a bit more work to prepare the data, and in my experience multiple passes are required, but this method may improve labeling accuracy.
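In scikit-learn terms this is largely a vectorizer setting; a sketch (not something run for this post) would simply widen the n-gram range:

from sklearn.feature_extraction.text import CountVectorizer

contents = self.email_df['Body'].to_list()  # same corpus as the single-token runs

# capture two- and three-word phrases in addition to single tokens
ngram_tf = CountVectorizer(ngram_range=(1, 3), max_features=20000, stop_words='english').fit_transform(contents)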

Of course there are other sentiment classification choices that can be deployed for these tasks as well.

Final Aggregate Results

Alignment Classification

Regarding the alignment classification scoring, the results are roughly the same between methods 1 and 3 with regard to labeling counts.  The actual scoring results vary between the two, however, which confirms that the logic of the two methods differs.

Looking at the subtopics grouped by each algorithm, it is difficult to pick one over the other in terms of conciseness, uniformity and semantics.  Both NMF and LDA produce logical groupings that are easily understood.  If I had to pick one over the other for corporate email classification, I find that LDA produces subtopic groupings that are slightly more focused within the corporate domain and that the technique is better equipped for this task.

In the next post I may take some time to score the outputs of each algorithm and determine if blending the results would provide an additional uptick in accuracy, but as it stands right now I will utilize the LDA outputs.

Output Sample

It is always a good idea to review the labels assigned to at least a portion of the dataset to get an intuitive feel for the accuracy of the results.  Typically I will create some form of scoring to get an accurate gauge of results over the entire dataset.

For this exercise I have pulled a small sample to display.  Comparing the sentiment and classification results with the email content quickly shows generally accurate results with a few expected inaccuracies.  It is encouraging to see one algorithm score correctly while another does not, and vice versa; this implies a blended approach may improve accuracy.

The biggest drawback seen from the data is the bias associated with frequency-based solutions such as this.  For example, the sample that includes the text "I am interested in purchasing a Huglu shotgun" is ranked twice as a "work" topic and only once as "fun".  This is because the term "purchasing" is much more heavily used in a corporate environment than the word "shotgun".  The same can be said for the line "I thought you would be interested in the weather forecast for the Wedding"; the term "forecast" is a much stronger influence than the word "Wedding".

Note: Developing customized sentiment logic that pre-compiles/pre-weights noun phrases pulled from the content and balances the weights to remove label bias would help improve this condition.

Labeled Email Record Samples

Overall the results for the unsupervised classification phase of this exercise are promising.  Each technique produced different results, but given the necessary adjustments I believe each method would create an adequate labeling function for the effort of determining company sentiment and work/life alignment.

Other Methods

The lexicon and word embedding approach shown in this article is one of many available solutions for unsupervised data classification.  My implementation used cosine similarity to rank subtopics by mathematical distance from selected labels.  While effective, there are other methods to consider that produce results specific to a given task, including:

  • Sentiment specific word embedding (SSWE). From the paper "Learning Sentiment-Specific Word Embedding for Twitter Sentiment Classification".
  • Weighted text feature modeling (WTFM). From the paper “Tweet Sentiment Analysis by Incorporating Sentiment-Specific Word Embedding and Weighted Text Features”.
  • Emolex. The NRC Emotion Lexicon is a list of English words and their associations with eight basic emotions and two sentiments. Available at https://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm.
  • Auto Topic Modeling.  My own concoction for auto classifying text.  This process entails building a multi-label model from content classified generically using a series of the closest associated terms.  Over time the labels are ranked for probability and used to create an inference model.
  • Autoencoder Dimensionality Reduction.  Using an autoencoder transformer approach, build a network that reduces the number of input token dimensions to the desired number of token/topic associations.
  • N-Grams.  Instead of single-token frequency distributions, parse content for multi-token sequences of words, characters, stems and syllables.  Use multiple combinations and masking techniques.
  • Custom Word Embeddings.  Leverage the Gensim library to build a custom word embedding vocabulary from text content.  This technique may require a large corpus to boost accuracy.  Developers should also consider building custom word embeddings using PyTorch or Tensorflow tools.  Utilizing customized word associations removes some of the artifacts instilled in general embeddings like GloVe or Word2Vec, which are built from sources that may not relate well to your corpus.
  • Sentence/Document Aggregated.  Combine content token word embeddings into normalized embeddings.
  • Cosine Hierarchy.  Use a tiered approach to classification by structuring labels into a semantic hierarchy.

Conclusion

For this article the focus was to produce a viable labeled training dataset for inference modeling and that goal was achieved.   With this dataset we can now create an accurate model for classifying email content from raw unstructured content without the need for manual labeling of the training data.  

Several different approaches were presented to identify the sentiment of the dataset and to label each data element with either a "personal" or "professional" purpose.  Overall the results in their current form are accurate enough to continue with the exercise, using either the standalone results or combining them via an ensemble algorithm.  In a production environment I would spend additional time fine-tuning the algorithms and would leverage other techniques to improve upon the outcomes.

The next post will be the last in this series and will show implementation examples using PyTorch recurrent and transformer modeling concepts.  Time permitting I may also port the routines to Tensorflow as well.

If there are questions about the code or processing logic for this blog series or your company is in need of assistance implementing an AI or machine learning solution, please feel free to contact me at mike@avemacconsulting.com and I will help answer your questions or setup a quick call to discuss your project.

Source Code

If interested, I have included the source code developed for this exercise.  Hopefully the implementation will help kickstart an unsupervised classification project or generate some new ideas. 

All of the code for this series can be accessed from my Github repository at:

github.com/Mike-Schmidt-Avemac/ai-email-insights.

					'''
MIT License

Copyright (c) 2021 Avemac Systems LLC

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
'''

#!/usr/bin/python3 -W ignore::DeprecationWarning
import sys
from textwrap import indent
from unicodedata import decomposition

if not sys.warnoptions:
    import warnings
    warnings.simplefilter("ignore")
from time import time
import datetime as dt
import pandas as pd
import numpy as np
import re
from tqdm import tqdm
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD, NMF
import matplotlib.pyplot as plt
from nltk.corpus import stopwords, names, words
from nltk.sentiment import vader
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag

from utils.auto_classify import AutoClassify

pd.set_option('display.max_rows', 100)
pd.set_option('display.min_rows', 20)
pd.set_option('display.max_colwidth', 100)

class UnsupervisedModeling():
    ''' Class for topic modeling unstructured data into labeled training data '''

    def __init__(self, config):
        self.data_dir = config['data_dir']
        self.word_filter_list_fn = config['word_filter_list_fn']
        self.manual_word_counts_fn = config['manual_word_counts_fn']
        self.topic_gradients_fn = config['topic_gradients_fn']
        self.plot_save_dir = config['plot_image_save_directory']
        self.sentiment_fn = config['sentiment_fn']

        raw_emails = pd.read_csv(self.data_dir + config['email_extracted_fn'])
        raw_emails.fillna(value="[]", inplace=True)

        self.email_df = self._fix_email_addresses('From_Address', raw_emails)
        self.email_df = self._fix_email_addresses('To_Address', raw_emails)
        self.email_df = self._fix_email_addresses('Cc_Address', raw_emails)
        self.email_df.pop('Bcc_Address') # from analysis phase, Bcc is duplicated - remove
        self.email_df.pop('Unnamed: 0') # artifact from analysis phase - remove

        self.email_df['DateTime'] = self.email_df['DateTime'].apply(lambda x: dt.datetime.strptime(x, '%Y-%m-%d %H:%M:%S%z'))
        self.email_df['DateTime_TS'] = self.email_df['DateTime'].apply(lambda x: x.timestamp())
        self.email_df['DateTime_HOUR'] = self.email_df['DateTime'].apply(lambda x: x.hour)
        self.email_df['DateTime_MONTH'] = self.email_df['DateTime'].apply(lambda x: x.month)

        # from the analysis phase - remove samples with content length greater than 6000 characters
        self.email_df = self.email_df[self.email_df['Body'].map(len) <= 6000].reset_index(drop=True)

        # build positive/negative sentiment dictionaries
        self.sentiment_d = self._create_sentiment_dictionary(self.data_dir, config['negative_sentiment_fn'], config['positive_sentiment_fn'])

        # common lemmatizer
        self.lemmatizer = WordNetLemmatizer()

        # auto classifier
        self.ac = AutoClassify(config=config)

        print(f'\n--- Init Complete\n\n')
        return

    def _address_clean(self, addr):
        ''' Additional email address cleaning '''
        addr = re.sub(r'e-mail <(.*)>',r'\1',addr)
        addr = re.sub(r' +', '', addr)
        addr = re.sub(r'/o.*=', '', addr)
        addr = re.sub(r'"', '', addr)
        return addr

    def _fix_email_addresses(self, type, df):
        ''' Split email address array strings into usable arrays '''
        split_embeds = lambda x: x.replace('[','').replace(']','').replace('\'','').split(',')
        addrs = [split_embeds(s) for s in tqdm(df[type].values)]
        u_addrs = [[self._address_clean(y) for y in x] for x in tqdm(addrs)]
        df[type] = u_addrs
        return df

    def _create_sentiment_dictionary(self, data_dir, negative_fn, positive_fn):
        negative_df = pd.read_csv(data_dir+negative_fn, names=['term'], header=None, comment=';')
        positive_df = pd.read_csv(data_dir+positive_fn, names=['term'], header=None, comment=';')

        sentiment = {x:'neg' for x in negative_df['term']}
        sentiment.update({x:'pos' for x in positive_df['term']})
        return sentiment

    def cliporpad(self, text:str, clen):
        return text.ljust(clen)[0:clen]

    ###########################
    # Sentiment Classification 
    ###########################
    def _classify_sentiment(self, model, features, nbr_top_tokens=20, title='', mode=1):
        ''' Extract topics from LDA results and calculate sentiment labels '''

        print(f'\n--- Classify Sentiment - {title}\n\n')
        topics = []
        s_weighted_score = lambda a,s,t: sum([s[x] for x in range(len(a)) if a[x] == t])
        s_minmax_score = lambda a,min,max: ((a - a.min(axis=0)) / (a.max(axis=0) - a.min(axis=0))) * (max - min) + min
        s_stdize_score = lambda a: (a - a.mean(axis=0)) / a.std(axis=0)

        for idx, component in enumerate(model.components_): # iterate through all of the topics
            feature_index = component.argsort()[::-1]
            feature_names = [features[x] for x in feature_index]
            scores = [component[x] for x in feature_index]

            feature_len = nbr_top_tokens if nbr_top_tokens <= len(feature_index) else len(feature_index)

            subtopics = feature_names[0: feature_len]
            subscores = s_minmax_score(np.array(scores[0: feature_len]), 0.0, 50.0) # range bound scores
            if mode == 1:
                subsentiments = [self.sentiment_d[feature_names[x]] for x in range(feature_len)]
                sublabel_weighted_score = -1*s_weighted_score(subsentiments, subscores, 'neg') + s_weighted_score(subsentiments, subscores, 'pos') 
                sublabel_weighted = 'neg' if sublabel_weighted_score < -15.0 else 'pos' if sublabel_weighted_score > 15.0 else 'neu'
            elif mode == 2:
                terms = dict(map(lambda x: x, zip(subtopics, subscores)))
                sublabel_weighted_score = self.ac.calculate_sentiment_weighted(terms=terms)
                sublabel_weighted = 'neg' if sublabel_weighted_score < -0.5 else 'pos' if sublabel_weighted_score > 0.5 else 'neu'
            else:
                sublabel_weighted_score = 0.0
                sublabel_weighted = ''

            topic = {}
            topic['topics'] = subtopics
            topic['scores'] = subscores
            topic['weighted_score'] = sublabel_weighted_score
            topic['label'] = sublabel_weighted
            topics.append(topic)

            # display topic plus support subtopic words
            print(f'{self.cliporpad(str(idx), 3)} {topic["label"]} / {component[feature_index[0]]} = {" ".join(topic["topics"])}')

        # convert to dataframe
        df = pd.DataFrame(topics)
        print(f'\n{df.sort_values(by="label", ascending=False).head(200)}\n\n')

        # group to show overall sentiment totals
        print(f'{df[["label"]].groupby(by="label").size()}\n')

        return df

    def sentiment_analysis(self, macro_filter=0.8, micro_filter=10, n_components=100, mode=1):
        ''' Analyze email content using CV/LDA in preparation for sentiment calculation '''

        # create corpus from email content
        contents = self.email_df['Body'].to_list()

        # custom stop words - not needed if using a fixed vocabulary
        sw_list = None

        # fixed vocabulary of negative/positive keywords
        if mode == 1:
            vocab = {x:i for i,x in enumerate(self.sentiment_d.keys())}
        elif mode == 2:
            vocab = self.ac.get_sentiment_vocab()
        else:
            vocab = None

        # lemmatize if not using fixed vocabulary of negative/positive keywords
        tokenizer = None

        # fetch counts of word phrases
        start = time()
        tfc = CountVectorizer(max_df=macro_filter, min_df=micro_filter, max_features=20000, strip_accents='unicode', analyzer='word', 
                              token_pattern=r"[a-zA-Z][a-z][a-z][a-z]+", ngram_range=(1,1), stop_words=sw_list, vocabulary=vocab,
                              tokenizer=tokenizer)
        tf = tfc.fit_transform(contents)
        tf_a = tf.toarray()
        tf_fns = np.array(tfc.get_feature_names())
        print(f'--- Email Content Analysis w/ Sentiment Vocab ({time()-start} seconds)\n')

        start = time()
        lda = LatentDirichletAllocation(n_components=n_components, max_iter=3, learning_method='online', learning_offset=50.0, random_state=1).fit(tf)
        print(f'--- LDA Analysis ({n_components} components in {time()-start} seconds)\n')

        results = self._classify_sentiment(lda, tf_fns, nbr_top_tokens=15, title='CV - LDA Model Sentiment '+str(mode), mode=mode)

        # add class label to training dataframe
        column = 'Class_Sentiment_'+str(mode)
        self._add_label_to_dataframe(lda.transform(tf), results, ('label',column))
        print(f'\n{self.email_df[[column]].groupby(by=column).size()}\n')

        return

    ####################################
    # Sentiment Classification w/ VADER
    ####################################
    def sentiment_analysis_vader(self):
        '''
            Using VADER from https://www.nltk.org/api/nltk.sentiment.html?highlight=vader#module-nltk.sentiment.vader

            Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. 
            Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.
        '''
        # analyze and create supervised training dataset from email content
        vsa = vader.SentimentIntensityAnalyzer()

        start = time()
        vader_df = pd.DataFrame(self.email_df['Body'].apply(vsa.polarity_scores).to_list())
        vader_df['body'] = self.email_df['Body']
        #vader_df['dt'] = self.email_df['DateTime'].apply(lambda x: dt.datetime.strptime(x, '%Y-%m-%d %H:%M:%S%z'))
        vader_df['dt_y_m'] = self.email_df['DateTime'].apply(lambda x: (x.year*100)+x.month)
        print(f'\nVader Polarity Scoring ({time()-start} seconds)\n')

        print(f'\nVader Dataframe Statistics\n{vader_df.describe()}\n')

        sums = vader_df[['neg','neu','pos']].sum()
        print(f'\nVader Sentiment Totals\n{sums.head(20)}\n')

        vader_df['compound'].hist(bins=10); plt.show()
        vader_df.boxplot(column='compound', by='dt_y_m'); plt.show()
        vader_df['compound'].plot.kde(); plt.show()

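        # VADER's compound score is normalized to [-1, 1]; the +/-0.20 cutoffs below are stricter
        # than the commonly cited +/-0.05 thresholds, so more borderline emails land in the neutral class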
        # add vader sentiment class to dataframe
        self.email_df['Class_Sentiment_Vader'] = vader_df['compound'].apply(lambda x: 'neg' if x < -0.20 else 'pos' if x > 0.20 else 'neu')

        return vader_df

    ##############################
    # Topic Classification Method 
    ##############################

    def _classify_topic_method(self, model, features, nbr_top_tokens=20, title='', classes=[], method='prob'):
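        ''' Convert each decomposition component into a labeled topic: take the top-n weighted
            terms per component, then let AutoClassify map those terms onto one of the supplied classes '''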
        print(f'\n--- Classify Topic Words Method - {title}\n\n')
        topics = []
        for idx, component in enumerate(model.components_): # iterate through all of the topics
            feature_index = component.argsort()[::-1]
            feature_names = [features[x] for x in feature_index]
            scores = [component[x] for x in feature_index]

            subtopics = []
            subtopics_scores = []
            for x in range(0, nbr_top_tokens): # iterate through the number of requested subtopics and calculate sentiment scores
                if x < len(feature_index): 
                    subtopics.append(feature_names[x])
                    subtopics_scores.append(scores[x])

            topic = {}
            topic['topics'] = subtopics
            topic['topics_scores'] = subtopics_scores

            # find the class label for these topic terms
            terms = {subtopics[x]:subtopics_scores[x] for x in range(len(subtopics))}
            label = self.ac.classify_terms(classes=classes, terms=terms, method=method, use_weights=True if method != 'none' else False) # use AutoClassify to determine label
            topic['label'] = label

            # save
            topics.append(topic)

            # display topic plus support subtopic words
            print(f'{self.cliporpad(str(idx), 3)} {self.cliporpad(label, 15)} / {subtopics_scores[0]} = {" ".join(subtopics)}')

        # convert to dataframe and save for analysis
        df = pd.DataFrame(topics)
        columns = ['label','topics']
        print(f'\n{df[columns].sort_values(by="label", ascending=False).head(200)}\n\n')
        df.to_csv(self.data_dir+self.topic_gradients_fn, index=False)

        # group to show overall sentiment totals
        print(f'{df[["label"]].groupby(by="label").size()}\n')
        return df

    def _stop_word_support_function(self, macro_filter=0.5, micro_filter=10):
        ''' Routine to collect word count list and pos word filter list for developing stop word vocabulary'''

        print(f'\n--- Stop Word Support Function Start')
        start = time()

        # create corpus from email content
        contents = self.email_df['Body'].to_list()
        vocab = None

        sw_list = [name.lower() for name in names.words()]
        sw_list.extend([stopword.lower() for stopword in stopwords.words()])

        tokenizer = self._body_content_analysis_tokenizer

        # fetch counts of word phrases
        tfc = CountVectorizer(max_df=macro_filter, min_df=micro_filter, max_features=20000, strip_accents='unicode', analyzer='word', 
                              token_pattern=r"[a-zA-Z][a-z][a-z][a-z]+", ngram_range=(1,1), stop_words=sw_list, vocabulary=vocab,
                              tokenizer=tokenizer)
        tf = tfc.fit_transform(contents)
        tf_a = tf.toarray()
        tf_fns = np.array(tfc.get_feature_names())

        # form/sort topic/sub-topic dataframe and save to file for manual analysis due to high variability
        # will use manual inspection to develop an algorithm for filtering list to a functional vocabulary
        sums = np.sum(tf_a, axis=0)
        dense_word_matrix = []
        exclude_word_matrix = ['wa']
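        # exclusion rules: drop any token not found in the NLTK words corpus, plus any token whose
        # part-of-speech tag is not a noun (NN*) or verb (VB*)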
        word_filter = lambda word: pos_tag([word])[0][1][0:2] not in ['NN','VB']
        word_list = set(words.words()) # set lookup keeps the per-token membership check fast

        for x in tqdm(range(len(sums))):
            phrase = {'phrase':tf_fns[x], 'count':sums[x]}
            dense_word_matrix.append(phrase)

            # collect words to filter - add to stop_words
            oov = tf_fns[x] not in word_list
            try:
                if oov or word_filter(tf_fns[x]):
                    exclude_word_matrix.append(tf_fns[x])
            except Exception as err:
                exclude_word_matrix.append(tf_fns[x])

        print(f'\n--- Stop Word Support Function Complete in ({time()-start} seconds)\n')

        sums_df = pd.DataFrame(dense_word_matrix).sort_values(by='count',ascending=False).reset_index(drop=True)
        sums_df.to_csv(self.data_dir+self.manual_word_counts_fn, index=False) # save to file for manual inspection
        print(f'\n--- Word Matrix\n\n{sums_df.head(20)}')

        word_filter_df = pd.DataFrame(exclude_word_matrix, columns=['word']).sort_values(by='word').reset_index(drop=True)
        word_filter_df.to_csv(self.data_dir+self.word_filter_list_fn, index=False) # save to file for manual inspection
        print(f'\n--- Word Filter\n\n{word_filter_df.head(20)}')

        return word_filter_df['word'].to_list()

    def _add_label_to_dataframe(self, dcmp, topics, columns):
    #def _build_supervised_dataframe(self, lda, topics, contents):
        ''' Using the unsupervised sentiment analysis data, create a supervised learning dataset '''
        assert len(dcmp) == len(self.email_df), 'Length of decomposition matrix should match email dataframe length'

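        # dcmp is the document-topic matrix produced by the decomposition's transform(); the dominant
        # topic for each email decides which topic label is copied onto that row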
        values = []
        for x in tqdm(range(len(dcmp))):
            tidx = int(np.argmax(dcmp[x])) if dcmp[x].max() > dcmp[x].min() else -1 # no dominant topic when the distribution is flat
            value = topics.at[tidx, columns[0]] if tidx >= 0 else 'unknown'
            values.append(value)

        self.email_df[columns[1]] = pd.Series(values)
        return

    def _body_content_analysis_tokenizer(self, text, max_length=20):
        tokens = re.findall(r"[a-zA-Z][a-z][a-z][a-z]+", text)
        arr = [self.lemmatizer.lemmatize(w) for w in tokens if len(w) <= max_length] # could also use spacy here
        return arr

    def topic_classification(self, macro_filter=0.5, micro_filter=10, vectorizer='CountVectorizer', decomposition='LatentDirichletAllocation', classes=[], n_components=100, subtopics=15, method='prob', mode=1):
        '''
            General topic classification using various techniques.

            Content frequency distributions -> CountVectorizer & TfidfVectorizer
            Decomposition ->
                LDA - LatentDirichletAllocation
                LSA - TruncatedSVD
                NMF - Non-Negative Matrix Factorization

            Classification ->
                AutoClassify (developed by Avemac Systems LLC)
        '''
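        # Pipeline: assemble stop words (names, NLTK stop words and the generated filter list),
        # vectorize the email bodies, decompose into n_components topics, label each topic with
        # AutoClassify, then write the per-email label into a Class_Alignment_<mode> column.
        # Note: because a custom lemmatizing tokenizer is supplied, scikit-learn's vectorizers
        # use it in place of the token_pattern argument.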

        # create corpus from email content
        contents = self.email_df['Body'].to_list()

        # custom stop words - not needed if using a fixed vocabulary
        sw_list = [name.lower() for name in names.words()]
        sw_list.extend([stopword.lower() for stopword in stopwords.words()])
        sw_list.extend(self._stop_word_support_function(macro_filter=macro_filter, micro_filter=micro_filter))
        sw_list.extend(['pirnie','skean','sithe','staab','montjoy','lawner','brawner']) # a few names that made it through the filters

        # fixed vocabulary of keywords
        vocab = None

        # lemmatizer
        tokenizer = self._body_content_analysis_tokenizer

        # fetch counts of word phrases
        print(f'\n--- Starting Email Content Analysis')

        start = time()

        if vectorizer == 'CountVectorizer':
            tfc = CountVectorizer(max_df=macro_filter, min_df=micro_filter, max_features=20000, strip_accents='unicode', analyzer='word', 
                                 token_pattern=r"[a-zA-Z][a-z][a-z][a-z]+", ngram_range=(1,1), stop_words=sw_list, vocabulary=vocab, tokenizer=tokenizer)
            title = 'Count'
        else:
            tfc = TfidfVectorizer(max_df=macro_filter, min_df=micro_filter, max_features=20000, strip_accents='unicode', analyzer='word', 
                                 token_pattern=r"[a-zA-Z][a-z][a-z][a-z]+", ngram_range=(1,1), stop_words=sw_list, vocabulary=vocab, tokenizer=tokenizer,
                                 use_idf=1, smooth_idf=1, sublinear_tf=1)
            title = 'TFIDF'

        tf = tfc.fit_transform(contents)
        tf_a = tf.toarray()
        tf_fns = np.array(tfc.get_feature_names())
        print(f'--- Content Frequency Analysis ({time()-start} seconds)\n')

        start = time()

        if decomposition == 'LatentDirichletAllocation':
            dcmp = LatentDirichletAllocation(n_components=n_components, max_iter=3, learning_method='online', learning_offset=50.0, random_state=1).fit(tf)
            title += ' - LDA'
        elif decomposition == 'TruncatedSVD':
            dcmp = TruncatedSVD(n_components=n_components, n_iter=100, random_state=1).fit(tf)
            title += ' - LSA'
        else:
            dcmp = NMF(n_components=n_components, random_state=1, beta_loss='kullback-leibler', solver='mu', max_iter=1000, alpha=0.1, l1_ratio=0.5).fit(tf)
            title += ' - NMF'

        print(f'--- Decomposition Analysis ({n_components} components in {time()-start} seconds)\n')

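        # all three decomposition options expose components_ (per-topic term weights) and
        # transform() (per-document topic weights), so the labeling steps below are identical
        # regardless of which technique was selected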
        results = self._classify_topic_method(dcmp, tf_fns, nbr_top_tokens=subtopics, title=title, classes=classes, method=method)

        # add class label to training dataframe
        column = 'Class_Alignment_'+str(mode)
        self._add_label_to_dataframe(dcmp.transform(tf), results, ('label',column))
        print(f'\n{self.email_df[[column]].groupby(by=column).size()}\n')
        return


#########################
# Main
#########################

config = {
    'email_extracted_fn': 'extracted_emails.pd',
    'data_dir': '/proto/learning/avemac/email_analysis_blog/data/',
    'plot_image_save_directory': '/proto/learning/avemac/email_analysis_blog/plots/',
    'custom_stop_words_fn': 'custom_stop_words.txt',
    'negative_sentiment_fn': 'negative-words.txt',
    'positive_sentiment_fn': 'positive-words.txt',
    'sentiment_fn': 'sentiment.txt',
    'supervised_dataset_fn': 'supervised_email_train.csv',
    'word_filter_list_fn': 'word_filter_list.csv',
    'manual_word_counts_fn': 'content_word_counts.csv',
    'topic_gradients_fn': 'topic_gradients.csv',
}

usm = UnsupervisedModeling(config)

# Body content sentiment analysis method 1
'''
    Using sentiment lexicon from http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html

    Minqing Hu and Bing Liu. "Mining and Summarizing Customer Reviews." 
    Proceedings of the ACM SIGKDD International Conference on Knowledge 
    Discovery and Data Mining (KDD-2004), Aug 22-25, 2004, Seattle, Washington, USA.
'''
x = usm.sentiment_analysis(mode=1)

# Body content sentiment analysis method 2
'''
    Using AFINN from http://corpustext.com/reference/sentiment_afinn.html

    Finn Årup Nielsen. "A new ANEW: Evaluation of a word list for sentiment analysis in microblogs."
    Proceedings of the ESWC2011 Workshop on 'Making Sense of Microposts': Big things come in small packages.
    CEUR Workshop Proceedings 718, 93-98. May 2011.

    Using AutoClassify for automatic topic labeling (Developed by Avemac Systems LLC)
'''
x = usm.sentiment_analysis(mode=2)

# Body content sentiment analysis with Vader
x = usm.sentiment_analysis_vader()

# Body content analysis - CountVectorize/LDA with AutoClassify
x = usm.topic_classification(macro_filter=0.5, vectorizer='CountVectorizer', decomposition='LatentDirichletAllocation', classes=['fun','work'], n_components=100, subtopics=20, method='softmax', mode=1)

# Body content analysis - CountVectorizer/NMF with AutoClassify
x = usm.topic_classification(macro_filter=0.5, vectorizer='CountVectorizer', decomposition='NMF', classes=['fun','work'], n_components=200, subtopics=20, method='softmax', mode=2)

# Body content analysis - TFIDF/NMF with AutoClassify
x = usm.topic_classification(macro_filter=0.5, vectorizer='TfidfVectorizer', decomposition='NMF', classes=['fun','work'], n_components=200, subtopics=20, method='softmax', mode=3)
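# Each run above writes its own Class_Alignment_<mode> column, so the three vectorizer/decomposition
# combinations can be compared side by side downstream.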

##################
# Post Processing
##################

# Save resulting dataframe for later supervised modeling
usm.email_df.to_csv(config['data_dir']+config['supervised_dataset_fn'], index=False)

# Aggregate view
print(f'\n---Aggregate Class Results')
for column in [x for x in usm.email_df.columns if 'Class_' in x]:
    agg = usm.email_df[[column]].groupby(by=column).size().to_dict()
    print(f'{usm.cliporpad(column, 25)} {", ".join("=".join((k,str(v))) for (k,v) in agg.items())}')
print(f'\n')

exit()