January 14, 2022
Analyzing Messages in Real-Time with SmartDispatch
Background
In the paper "Creating Simplicity in the Workplace with NLP" (Paterson 2021), I talk about using Natural Language Processing (NLP) to route, or dispatch, incoming emails to their appropriate recipients. We decided that an interactive demonstration of our SmartDispatch system would be more useful to our customers than an essay alone. This paper will walk you through how I went about building:
A dataset to use for NLP Machine Learning Model Training
A Deep Learning Neural Network to train our model with
A PyTorch Model Artifact to use for topic predictions
A PyTorch Model Artifact to use for sentiment analysis
A RESTful API built in FastAPI to deploy the models
A front-end interface to display this functionality built in Streamlit
The Failed Dataset
I first wanted to find a good sampling of business emails in order to train the model. Since our clients are mostly small and medium-sized businesses, whose email history doesn't necessarily carry labels or tags tying messages to internal teams or departments, I wasn't necessarily looking for a labelled dataset.
What are labels?
By “labelled” dataset, I’m referring to a set of emails or text messages that are each labelled with their topic or business department or function. With these two data points for each observation, the message and the label, I can train a model to know what marketing emails look like versus what an IT Helpdesk email looks like. A functioning model would then be able to predict the email or text message’s topic from a list of several options.
But those Emails…
In a cursory search, I quickly found the Enron Email dataset via the Kaggle website. This set of over half a million unlabelled emails became the foundation of our research. However, it proved problematic.
A big problem was that the emails in this corpus were predominantly interpersonal communications between colleagues. Since our end-goal is to create an email interpreter that will surmise the topic, tone or sentiment, and possibly the level of importance of an email as it enters the general inbound queue for a customer service team, we needed to be able to find clear differences in the text that could be coded.
But it’s a computer?
There's a saying in ML that if a human can do it, then a computer can do it too. The flip side doesn't hold in every case, but it does here: if a human cannot draw a distinction between two written messages, then our current NLP capabilities in 2022 cannot do so either (i.e., if I can't understand your text message, then Artificial Intelligence can't either). Thus, rather than clear signals to indicate meaning, these interpersonal emails only added noise to our model.
Another big challenge with using this unlabelled dataset was our strategy for labelling it. Obviously, I couldn't read 500K emails and hand-label each as one of 5 categories; that would take an inordinate amount of time. We thought about bringing on interns from the local university, but we decided that simply reading and labelling emails wasn't a good use of anyone's time, student or professional.
The LDA Model
There exist a number of unsupervised learning methods to deal with such a challenge. The most popular NLP solution for this right now is LDA, which stands for Latent Dirichlet Allocation. LDA is effective at creating clusters within a corpus of text by finding similarities in the vectorized text and then grouping like documents together into categories.
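To make that concrete, here is a minimal sketch of LDA topic clustering using scikit-learn. This is not the exact code we ran; the sample emails are made up, and in practice the input would be the full corpus of roughly 500K email bodies.

```python
# A minimal LDA topic-clustering sketch with scikit-learn (illustrative, not our production code).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

emails = [
    "Please review the attached gas contract before the regulatory filing deadline.",
    "Are we still on for lunch Thursday? Karen is flying in from Houston.",
    "The sales team needs updated pricing curves for the west desk.",
]  # in practice: ~500K raw email bodies

# Convert raw text to token counts, dropping very common English words
vectorizer = CountVectorizer(stop_words="english", max_features=10_000)
doc_term = vectorizer.fit_transform(emails)

# Ask LDA for a fixed number of latent topics (we tried 3, then 5)
lda = LatentDirichletAllocation(n_components=5, random_state=42)
topic_weights = lda.fit_transform(doc_term)  # shape: (n_docs, n_topics)

# The "label" for each document is its highest-weighted topic
labels = topic_weights.argmax(axis=1)
print(labels)
```

The human work starts after this step: someone still has to read a sample from each cluster and decide what the topic actually is.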
A Cluster What-Now?
Clustering is a process by which the computer looks at similarities between data observations, in this case emails or text messages, and groups masses of them together in like-clusters. We do a similar thing in bookstores when we put the gardening books in one section and the sci-fi in another, and the poetry in yet another part of the bookstore. You can also think of how kids group up on a playground, or how in a city of a hundred thousand people you’ll find that the different bars each attract their own particular type of person. It’s like that.
Ready-made microservices
AWS has an LDA model at the ready in their microservice called Amazon Comprehend. While I later coded up my own LDA in order to have more autonomy over the fine-tuning of the model, I started off by asking the Comprehend application to label our 500K emails first into 3 subjects and then into 5. I actually repeated this step a few times as each individual pass of the clustering model can result in slightly different groupings of the text.
The data scientist can choose the number of clusters, but actually interpreting and labelling them takes a solid, focused human eye reading dozens of emails to work out how the model has grouped the texts. In one trial I clearly discerned a "Regulatory" category, a "Sales" category, a "Meetings/Travel" category, and a separate "Legal" category, leaving everything else labelled "Interpersonal". As sure as I was at the time, there is no guarantee that your first or second or twelfth attempt will result in a reliably labelled dataset.
Death by Confirmation Bias
Often in our trials, when it appeared that we had successfully labelled the emails with this approach, we would find after a day's rest that our best clusters were the result of chance and confirmation bias in the sub-sample I reviewed, and that the conclusions I drew didn't necessarily hold for the whole cluster.
To put it another way, I had seen the face of Keanu Reeves on the side of a building in San Francisco and thought I'd discovered the Matrix (the movie, not the nightclub). There were no clear labels for these clusters; I was only willing them to appear in front of me. Just as our eyes and our minds can play tricks on us (there is no spoon), our human need to be successful creators can rival our efforts to be good scientists (if the mixed metaphors were too much: there were no clear labels for the clusters in my trials after all).
Alternative Dataset
Given this realization, the second phase of our process involved shifting gears. This is where I found our winning dataset: using the Python Requests library, I created a labelled dataset through an API that gives access to social media posts.
Leaning on Past Experience
One of my prior projects in NLP involved using a free public API to pull posts and comments from social media sites in order to build myself a labelled dataset. Since I had this experience already, I went ahead and built a program that would hit the API and pull down about 80K posts and comments from 16 different labelled threads. I then labelled each with one of five topics, depending on which thread the message came from: “MachineLearning”, “FrontEnd”, “BackEnd”, “Marketing”, or “Finance”.
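The post doesn't name the specific API, so the endpoint, parameters, and field names below are placeholders; this is only a hedged sketch of what that pull looks like, mapping each labelled thread onto one of the five topics. The thread names beyond the four mentioned above are also assumptions.

```python
# Hypothetical sketch of the data pull; the real endpoint, params, and field
# names depend on the social media API actually used and are placeholders here.
import requests

THREAD_TO_TOPIC = {
    "MachineLearning": "MachineLearning",
    "reactjs": "FrontEnd",
    "javascript": "FrontEnd",
    "webdev": "FrontEnd",
    "django": "BackEnd",          # assumed thread name
    "marketing": "Marketing",     # assumed thread name
    "personalfinance": "Finance", # assumed thread name
    # ...16 threads in total, mapped onto the 5 topics
}

def pull_posts(thread: str, limit: int = 500) -> list[dict]:
    """Pull recent posts/comments from one labelled thread (placeholder URL)."""
    resp = requests.get(
        "https://api.example.com/posts",            # placeholder endpoint
        params={"thread": thread, "limit": limit},
        timeout=30,
    )
    resp.raise_for_status()
    return [
        {"label": THREAD_TO_TOPIC[thread], "message": post["text"]}
        for post in resp.json()["posts"]
    ]

dataset = []
for thread in THREAD_TO_TOPIC:
    dataset.extend(pull_posts(thread))
```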
Good Data In -> Good Predictions Out
This proved to make a much more accurate predictor in the end. Essentially, we want to use natural, conversational language to train a model on the sequences of words that tend to be members of each of the aforementioned classes. For example, the goal is to train it to “know” the difference between the following:
{ 'label': 'MachineLearning',
  'message': "We're building a model in PyTorch to predict the topic of emails" }

VERSUS

{ 'label': 'Marketing',
  'message': "We're building our emails with drawings modeled on the Olympic Torch" }
Both of these messages contain the keywords “We’re”, “building”, “model”, “emails”, and “Torch”, yet each is clearly part of a different silo of the business.
To achieve our goal, we didn't necessarily need to start with actual emails as our dataset. Instead, I could build a labelled dataset of key topics from an already-labelled source of natural, conversational language. Since the social media threads were already labelled with words such as "reactjs", "javascript", or "webdev", I could group several of them together under a single label, in this case "FrontEnd".
The NLP Transformer Model using Deep Learning
Finally, I took the 80K labelled entries, known as "documents" in the parlance of NLP, and used them to fine-tune a pre-trained Transformer model known as BERT. While our model predicts the 5 topics with 80% accuracy and 80% average precision, I could greatly improve it by scraping more data, evening out the sample imbalance, and adjusting other hyperparameters in the training job.
We built a 5-class classification model by utilizing a process in Machine Learning known as Transfer Learning to fine-tune a pre-trained model. In our case, we utilized BERT, a model that Google pre-trained on over 3 billion words from a couple of sources, including Wikipedia, to learn word embeddings and relationships through bi-directional encoding using Transformers (Devlin, Chang, Lee, and Toutanova, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", 2019, https://arxiv.org/pdf/1810.04805v2.pdf).
With the help of the Huggingface Transformers library, we built out a custom library in PyTorch that I named Hermes, after the Messenger of the Gods. As mentioned above, we began work with an LDA model in order to create labels for data that was previously unlabelled, and tried training on that newly-labelled dataset first. This proved problematic for Hermes: the results were never better than 42% accuracy, which is barely above a baseline model, or null model. To clarify, in data science a baseline (null) model for a classification task is one that simply predicts the most common class every time, so its accuracy is just that class's share of the data.
Put more simply, if I have 3 equally common topics, "A", "B", and "C", but my model always predicts that everything I put in is topic "B", then my model will be correct about 33% of the time. If I have 5 equally common topics and predict the same one each time, then my model will have 20% accuracy.
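If you want to see that arithmetic in code, here is a tiny sketch of a majority-class (null) baseline on a toy label column; the labels here are made up.

```python
# Null-model accuracy: always predict the most common label.
from collections import Counter

labels = ["A", "B", "C", "B", "B", "A"]              # toy target column
counts = Counter(labels)
majority_label, majority_count = counts.most_common(1)[0]
baseline_accuracy = majority_count / len(labels)
print(majority_label, round(baseline_accuracy, 2))   # "B", 0.5
```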
Since the first dataset, the Enron Emails, did not offer enough differences in their embeddings to Fine-Tune a pre-trained model with any reliable accuracy, we turned to our alternative dataset, the social media posts and comments described above.
Using my HermesDataset object, a custom-built PyTorch Dataset (if you're geeky, it inherits from torch.utils.data.Dataset), and the HermesPredictor library, I was able to create a custom PyTorch model artifact that predicted with reliable accuracy.
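The HermesDataset and HermesPredictor code isn't shown in this post, so the sketch below is only a guess at the general shape of that pipeline using the Huggingface Transformers library and native PyTorch: a Dataset that tokenizes messages, and a short fine-tuning loop over a 5-label BERT head. The class and variable names are illustrative.

```python
# Illustrative sketch only; HermesDataset/HermesPredictor themselves are not
# public, so this shows the general pattern with Huggingface + PyTorch.
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import BertTokenizerFast, BertForSequenceClassification

class MessageDataset(Dataset):
    """Wraps (message, label) pairs as tensors BERT can consume."""
    def __init__(self, messages, labels, tokenizer, max_len=128):
        self.enc = tokenizer(messages, truncation=True, padding="max_length",
                             max_length=max_len, return_tensors="pt")
        self.labels = torch.tensor(labels)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return {"input_ids": self.enc["input_ids"][idx],
                "attention_mask": self.enc["attention_mask"][idx],
                "labels": self.labels[idx]}

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=5)

train_ds = MessageDataset(["We're building a model in PyTorch..."], [0], tokenizer)
loader = DataLoader(train_ds, batch_size=16, shuffle=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for batch in loader:                      # one epoch shown; real training runs longer
    optimizer.zero_grad()
    out = model(**batch)                  # loss computed internally from "labels"
    out.loss.backward()
    optimizer.step()
```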
There are many like it but this one is mine
My model, called HermesModel_80.pt, can ingest a vectorized message and predict that message's topic from the 5 options with 80% accuracy and 80% average precision. Pretty cool, huh? After training the model on a GPU instance through a Deep Learning-optimized Amazon SageMaker Notebook Instance (p3.xlarge), I saved my model artifact to an S3 bucket so that I could access it later on a much smaller instance kernel and use it for inference.
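Here is a hedged sketch of that save-then-reload workflow; the bucket and key names are placeholders, and the model construction is only there to make the snippet self-contained.

```python
# Sketch of persisting the artifact to S3 and reloading it for CPU inference;
# bucket and key names are placeholders, not the ones actually used.
import boto3
import torch
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=5)

# On the GPU training instance (after fine-tuning): save weights and push to S3
torch.save(model.state_dict(), "HermesModel_80.pt")
boto3.client("s3").upload_file("HermesModel_80.pt", "my-model-bucket",
                               "hermes/HermesModel_80.pt")

# Later, on a small CPU inference instance: pull the artifact back and load it
boto3.client("s3").download_file("my-model-bucket", "hermes/HermesModel_80.pt",
                                 "HermesModel_80.pt")
model.load_state_dict(torch.load("HermesModel_80.pt", map_location="cpu"))
model.eval()
```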
While a fully-functioning model deployed inside a customer's email queue could be trained to well above 90% accuracy using a larger training set, more hyperparameter tuning, and better control of sample imbalance in training, I'm very happy with the resulting application. I also took a sentiment analysis dataset from prior scholastic work and ran it through the HermesPredictor library in the same manner to create a 90%-accurate sentiment analysis model called HermesSentiment_90.pt.
Amazon Built-in Options
Amazon has recently created its own wrapper for fine-tuning pre-trained Transformer models using the Huggingface Transformers and other libraries. You can now use the SageMaker SDK, or the Amazon SageMaker console, to fine-tune and deploy a pre-trained BERT model from the Huggingface library on your own dataset.
Unfortunately for me, in the short time I had to work on this project I did not find a good resource for building a data pipeline from raw text to a PyTorch Dataset object that could be fed into the DataLoader in a format readable by the BERT model. Thus, I chose to use native PyTorch outside of the SageMaker SDK this time, though I still relied on Amazon SageMaker Notebook instances for the CUDA GPUs required by PyTorch.
The API
In order for us to put this model into production, it was necessary to create a RESTful API which could serve our model. According to Wikipedia, "The REST architectural style emphasizes the scalability of interactions between components, … to facilitate caching components to reduce user-perceived latency, enforce security, and encapsulate legacy systems". In practice, it allows us to run inference on our model in the cloud, which frees up processing time and reduces latency on the client side.
Jumpin Jack FastAPI
For this purpose I chose FastAPI, as it is a 100% Python solution. A lot of the documentation around RESTful APIs on the internet is written with Node.js or JavaScript programmers in mind, and for good reason: JavaScript is the ubiquitous language of our modern internet. However, as a Machine Learning Engineer who knows very little about the Document Object Model (that's what our Front-End Devs are good at!), I appreciate a framework that lets me build the API in very short time so that my model can be utilized faster in a production capacity. Think of a much faster time to POC; you can bring in the paid web devs if the model is promoted to MVP.
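The route names, request schema, and label list below are illustrative rather than the production SmartDispatch API, but this is roughly what serving the topic model behind FastAPI can look like.

```python
# Minimal FastAPI sketch for serving topic predictions; route names, the label
# list, and the artifact loading are illustrative, not the deployed service.
import torch
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import BertTokenizerFast, BertForSequenceClassification

TOPICS = ["MachineLearning", "FrontEnd", "BackEnd", "Marketing", "Finance"]

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=5)
model.load_state_dict(torch.load("HermesModel_80.pt", map_location="cpu"))
model.eval()

app = FastAPI(title="SmartDispatch demo")

class Message(BaseModel):
    text: str

@app.post("/predict")
def predict(msg: Message):
    inputs = tokenizer(msg.text, truncation=True, padding=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return {"topic": TOPICS[int(logits.argmax(dim=-1))]}

# Run locally with, e.g.:  uvicorn app:app --reload
```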
Why So Fast?
FastAPI is also fairly lightweight and can run locally or in the cloud, serving inference from both models with minimal latency. We could increase the response speed of our downstream application by creating separate APIs for each model, possibly even a third API to handle the text encoding, and running them from separate servers or separate virtual machines in Kubernetes or Docker containers. But our goal for this exercise is to create a demonstration application, so for now a two- or three-second lag time requires no further re-tooling.
Also, the majority of our use cases for this tech will live under the hood of a customer's enterprise email system, although by putting our model behind a FastAPI architecture we could repurpose it to work in chatbots, document review systems, or even internal help queues that review documentation for easy solutions.
The GUI Application
Finally, in order for a human user to interact with our API, I created a front-end application, or what is called a Graphical User Interface (GUI), using the Streamlit library in Python. Streamlit wraps some pre-made HTML and JavaScript so that I can write the entire front-end application in Python. As the developer, I only need to know Python to use Streamlit.
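A demo front end in Streamlit can be just a few lines; the snippet below calls the illustrative FastAPI endpoint sketched above, so the URL and response fields are assumptions rather than the deployed SmartDispatch service.

```python
# Sketch of a Streamlit front end that calls the prediction API above;
# the URL and response fields match the illustrative FastAPI sketch.
import requests
import streamlit as st

st.title("SmartDispatch demo")
message = st.text_area("Paste an email or message:")

if st.button("Predict topic") and message:
    resp = requests.post("http://localhost:8000/predict", json={"text": message})
    resp.raise_for_status()
    st.write("Predicted topic:", resp.json()["topic"])
```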
As a Machine Learning Engineer, this allows me to iterate quickly, to develop my application on my own machine without extra overhead, and to produce this application without taxing the design or development team. Once a client asks us to build them a custom solution, the front-end team can then go to work marrying my API and Data Science model to the customer's color schemes and look-and-feel, and embedding it naturally into their existing software if they want a customer-facing interface.
My Streamlit application can also be served from the same server or virtual machine as the API; however, doing so can add latency, since a single server runs the code for the front-end app, the API, and inference against the endpoint sequentially.
Conclusion
In this paper I’ve walked you through my process to
Create and label a sample dataset
Train and save a deployable Machine Learning Model artifact in PyTorch
Build and deploy a RESTful API in FastAPI to serve the model
Build and deploy a Front-end application to showcase this to our customers
Whether you’re a little nerdish, or a full-silicon-jacket nerd like me, I hope that you’ve enjoyed this read and that you’ll consider the expertise that Cloud Brigade can lend to help your business move into a place of modernization through the embrace of AI.
October 27, 2020
Deadliest Counties Have Fewest White People
A recent study shows that the US counties with the highest number of COVID-19 deaths per capita contain the lowest percentages of white people as a share of the population.
Using Data Science, a small team of software and machine learning engineers at Cloud Brigade in Santa Cruz, California have discovered that the single strongest indicator correlating United States counties with high rates of death per-capita as a result of COVID-19 is a high percentage of non-white residents in the population.
Looking at about three dozen different factors and employing various machine learning models such as Linear Regression, Gradient Boosting Regressor, and KMeans Clustering, the team was able to reasonably predict the levels of mortality and cases per capita based on the degree of correlation of each factor, or feature.
The team factored in overall population, jobs per-capita, SNAP benefit recipients in a county, the number of people on Public Assistance, the number of Qualified Medicare Beneficiaries, the percentages of households spending more than 35 percent of income on their housing, the percentage of available low-income housing in a county, per-capita income, median household income, median family income, percentages of African American residents, Asian American residents, Latinx residents, Native American residents, and white residents, and other factors.
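As a rough illustration of the regression step described above, here is a minimal sketch of fitting a gradient boosting model to county-level features to predict deaths per capita. The file name and column names are placeholders for the merged dataset, not the team's actual code.

```python
# Illustrative sketch of the regression step; the real feature set was larger and
# the file/column names below are placeholders for the county-level data described above.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

counties = pd.read_csv("county_features.csv")   # hypothetical merged dataset
features = ["population", "jobs_per_capita", "snap_recipients",
            "per_capita_income", "pct_white", "pct_black",
            "pct_latinx", "pct_native_american"]

X_train, X_test, y_train, y_test = train_test_split(
    counties[features], counties["deaths_per_100k"], random_state=42)

gbr = GradientBoostingRegressor(random_state=42).fit(X_train, y_train)
print(gbr.score(X_test, y_test))                # R^2 on held-out counties

# Feature importances hint at which factors drive the prediction
print(sorted(zip(gbr.feature_importances_, features), reverse=True)[:5])
```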
The results were clear as demonstrated by this graphic:
The map on the left shows white population as a percentage of the county total, with the darkest blue being nearly 100% white and the lightest blue containing the smallest share of white residents. On the right, a nearly identical heatmap with a different color scheme shows COVID-19's deadliest counties in deep red and the counties with the fewest deaths per capita in yellow, with the orange colors designating the steps in between.
The maps clearly illustrate Native American Reservations in the west and upper-mid-west, areas in the Mississippi Delta and rural Georgia that have large majorities of African American residents, as well as majority Latinx areas such as south Texas, Arizona, and New Mexico; all of which are experiencing the highest death rates from COVID-19. You also can see that the multi-cultural areas around the San Francisco Bay such as Oakland, San Jose, and San Francisco do not have high levels of deaths per-capita displayed in the map on the right.
The team did not set out to find these details when they began the project. Unfortunately, this is what the data show us. This study does not claim a cause for this effect, though there are obvious connections to per-capita income, shown in more detail at https://datademo.me. Here's a look at the top 25 deadliest counties:
You can view the whole dashboard at https://DataDemo.me, where you can hover over each of the above rows and see its racial make-up. This bar chart is color coded by per-capita income, with the blue bars signifying counties whose per-capita income is above the national median and orange-red denoting those below the median.
There are also new questions left as a result of this study, such as why do these mostly rural counties have a higher rate of death per-capita than tightly-populated Manhattan or Chicago? Is it a result of access to medical care in rural hospitals? Is it something else entirely?
The only question the team had at the start of the project was which factors would turn out to be correlated across disparate counties. In the spring of 2020, as COVID-19 was quickly spreading across the world, the machine learning crew set out to build models to learn about this new disease. In the simplest terms, they wanted to find out whether gathering copious amounts of data and running them through an unsupervised machine learning model could teach us new things about the virus.
Unsupervised Learning is one of the three major types of Machine Learning along with Supervised and Reinforcement Learning. The purpose of the Unsupervised Learning model is to take datasets that are larger than a human eye can comb through easily, and to let the computer’s mathematical modeling algorithms attempt to cluster like datapoints with others. In short, the computer takes thousands (or hundreds of thousands) of rows of data, each row containing tens (or hundreds) of attributes, and it creates single points in multi-dimensional space for each of the rows of data. It then identifies clusters of these new data points and groups them together. This type of model is employed to detect fraud in online banking and ecommerce. After the model clusters the data, it is up to the researcher to then look at each of the clusters and identify why these things have been clustered together. Sound a bit esoteric? Think of it another way.
Imagine that you volunteer to help at a clothing drive and three thousand people drop off bags of donated clothing. If these people all dumped the bags into piles of loose clothing and walked off, it would be impossible to tell which of the piles had the most blue jeans, the most socks, the most bathing suits, et cetera. You would have an impossible time sorting everything out. But what if there was a way to scan each clothing bag, create some index number based on the contents, and then put groups of bags together with other bags that had similar contents (and index numbers) before you even opened them up? That’s the unsupervised model. Now you, the volunteer, can go to the first cluster, open up the bags and say “OK, these are the ones with lots of blue jeans,” and “those over there are the ones with lots of green hoodies.”
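To make the analogy concrete, here is a toy sketch of that idea with scikit-learn's KMeans. The data is entirely made up (each row is one "bag" described by counts of item types), not the study's county data.

```python
# Toy illustration of unsupervised clustering (made-up data, not the study's):
# each row is one "bag" described by counts of item types; KMeans groups similar bags.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# columns: [jeans, socks, bathing_suits, hoodies]
bags = np.vstack([
    rng.poisson([8, 2, 0, 1], size=(50, 4)),   # jeans-heavy bags
    rng.poisson([1, 9, 1, 2], size=(50, 4)),   # sock-heavy bags
    rng.poisson([0, 1, 7, 1], size=(50, 4)),   # bathing-suit-heavy bags
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(bags)
print(kmeans.labels_[:10])        # cluster id assigned to each bag
print(kmeans.cluster_centers_)    # average contents of each cluster
```

The researcher's job is then to open up each cluster and name it, exactly as the volunteer does with the sorted bags.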
And that is what was done in this study. The researchers collated their data from the US Census Bureau, US Bureau of Economic Analysis, the US Department of Agriculture, and various COVID-19 data sources such as the Google Open COVID-19 Dataset, the New York Times, European Centre for Disease Prevention and Control, Global Health Data from the World Bank, and OpenStreetMap.
This group is now working on a timeseries forecasting model that incorporates the above data with weather data, updated local restrictions and masking policies, hospital admits and available hospital bed data, as well as considering the available medical treatments and vaccines to predict at the county level how the virus will spread and how deadly future outbreaks will become. You can see a dashboard of the work that they have done at https://DataDemo.me.
Matt Paterson is a Machine Learning Engineer at Cloud Brigade in Santa Cruz, California. He has been working in various Data Analysis and Financial Forecasting roles using various levels of technology from SQL and Excel to Python, Scikit-Learn, Tensorflow and Django since 2013. You can reach out to him directly at hello@hireMattPaterson.com
August 30, 2020
I found 34 new planets
Identifying planets orbiting nearby stars using a Recurrent Neural Network
—MATT PATERSON — hello@hireMattPaterson.com
I discovered some new planets on Friday...sort of. Using a Python Jupyter notebook, I employed Tensorflow and Keras to produce a recurrent neural network that predicted the existence of 34 previously unknown exoplanets in our interstellar neighborhood from a subset of about 2200 observations. Since I don’t trust the results completely, I wanted to share this project at its midpoint in hopes that other researchers or hobbyists might have some insight to lend, or that we can compare our predictions.
You can view my ‘quick and dirty’ models here
or
you can view my ‘deeper dive’ models here
Two weeks ago I set out to create a convolutional neural network that would be able to identify the presence or absence of a planet orbiting a nearby star using an image of that star through the lens of the Kepler space telescope, or some other telescope. It was an exciting idea, but as it would turn out, there are not easily accessible image datasets of nearby stars with the resolution needed for the model I was working on. So, as I have had to do previously in the business world, I pivoted slightly and successfully created a recurrent neural network instead. My neural network analyzes the public Kepler Objects of Interest dataset and accurately predicts the presence of previously undiscovered planets.
This article will be updated on or around Labor Day Weekend 2020 with new findings and better visualizations
I created several models before employing the neural network, including a simple logistic regression model, a random forest classifier (both in Python, Scikit-Learn), and a feed-forward neural network with dropout (Keras). The recurrent neural network was the fourth model, which I tried after seeing a 96% accuracy score from my random forest, and as Charles Rice of General Assembly put it, “Deploying a neural net on a dataset this small is a bit like sandblasting a soup cracker, but it does seem to confirm the strength of the random forest [classifier].”
The exoplanet identification model in any form is a binary classification problem. Machine Learning models attempt to numerically predict the unknown, and typically a classification model will predict either a 1 (yes, positive) in the case that we are looking at an exoplanet orbiting a nearby star, or it predicts a 0 (no, negative) in the case that we are not looking at an exoplanet. To oversimplify, a logistic regression uses a sigmoid function to do this, rendering all results between 0 and 1 and assigning the class of the closest integer.
The Random Forest Classifier is an ensemble model that takes the best results of many randomly created decision trees that employ the Gini impurity criterion (there are other splitting criteria in use, but I employed Gini). Over many iterations, the model learns which features from the original dataset, or more appropriately which combination of features, best indicate the class of the observation.
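For readers who want a starting point, here is a minimal sketch of the two simpler classifiers on the Kepler Objects of Interest table. The file name and feature columns are placeholders for the cleaned KOI data, and the target simply encodes "CONFIRMED" versus "FALSE POSITIVE".

```python
# Sketch of the two simpler models; the KOI file and feature columns below are
# placeholders for the cleaned Kepler Objects of Interest table described here.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

koi = pd.read_csv("kepler_objects_of_interest.csv")          # hypothetical path
koi = koi[koi["disposition"].isin(["CONFIRMED", "FALSE POSITIVE"])]
y = (koi["disposition"] == "CONFIRMED").astype(int)           # 1 = exoplanet
X = koi[["orbital_period", "transit_depth", "transit_duration",
         "planet_radius", "stellar_temp"]]                    # placeholder features

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=500, random_state=42).fit(X_train, y_train)

print("logistic regression:", logreg.score(X_test, y_test))
print("random forest:      ", forest.score(X_test, y_test))
```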
I was surprised that my accuracy scores on the Random Forest Classifier and the Recurrent Neural Network were over 96% and 99% respectively, and I truly thought that I was doing something wrong since I hadn't done much model tuning at that point. My data's baseline class balance was 35% 'Confirmed' (exoplanets) and 65% 'False Positive' (not exoplanets). The misleading use of the term 'False Positive' aside, I improved the accuracy of predictions from 65% (if we guessed every observation to NOT be an exoplanet) to over 96% simply by putting my baseline data into each of these models. That means that either the NASA scientists did all of the hard work on cleaning the datasets (I still had to clean the data, but again, the point is I didn't do much to it), or that my models actually don't work at all.
For now, as I said, I’m only halfway through this particular project. While I am actively tuning my model, running feature engineering and trying different combinations of features, I would love to hear what has worked for you in your attempts to predict the exoplanet existence, and more importantly what hasn’t worked.
Thank you for reading part one of this blog. Feel free to email me directly with your suggestions, and please come back next week to see the results of this project!
August 13, 2020
Covid-19’s Deadliest Counties
The Risk of Overburdening our Medical Infrastructure
—MATT PATERSON — hello@hireMattPaterson.com
The Novel Coronavirus will kill over 185,000 people in the United States by the beginning of September, 2020 (just a couple weeks from this writing), according to the Institute for Health Metrics and Evaluation (IHME), with the highest concentration of deaths per capita found in rural Georgia and the Mississippi Delta.
As part of our studies in Data Science, and to practice using Linear Regression and the KMeans algorithm at General Assembly, Matt Burrell, Eric Laverdiere, and I set out to study how the impact of COVID-19 on our healthcare infrastructure might cause an overburdening when a natural disaster such as a hurricane or wildfire occurs. We drew data from the COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University, the IHME, Google's BigQuery (a SQL database), the USGS, the US Census Bureau, and NOAA. We used pure data to evaluate the risks and created a dashboard in Tableau to show our findings (you can also view our GitHub repo).
Our findings were interesting, though short of our intended scope. One of the largest barriers we faced was the quality and consistency of hospitalization data regarding COVID-19. While the sources were solid and accessible, they were not complete. There is even an article from July 21 in the New York Times stating that the federal government is specifically not releasing the collected hospitalization data at the county level. I was able to use the state-level data for the available states to create a linear regression model that could predict the number of people hospitalized. That model had a Root Mean Squared Error of 1,060.29; in other words, I could predict, give or take about 1,060 people, how many a state would see in the hospital as a result of COVID-19. With this, we can reverse-engineer an estimated number of cases in a state using the deaths per hospitalization, the number of deaths outside of the hospital, and historical data for other respiratory illnesses regarding the estimated percentage of cases that are hospitalized over a given time period. Despite the relatively low error estimate, I didn't think this statewide model was particularly useful for our purposes, since it works at the state level rather than the county level, making staffing forecasts and extra disaster-relief planning impossible.
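For readers who want the shape of that regression step, here is a hedged sketch; the data file and feature columns are placeholders rather than the exact inputs we used, and the RMSE comment simply references the figure reported above.

```python
# Sketch of the state-level hospitalization regression; the real feature columns
# and data files are not listed in the post, so the names here are placeholders.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

states = pd.read_csv("state_level_covid.csv")                 # hypothetical merged data
X = states[["confirmed_cases", "deaths", "population", "median_age"]]
y = states["currently_hospitalized"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
linreg = LinearRegression().fit(X_train, y_train)

rmse = np.sqrt(mean_squared_error(y_test, linreg.predict(X_test)))
print(f"RMSE: {rmse:,.2f} hospitalized people")   # the post reports roughly 1,060
```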
The next big pain point in the data is the number of "confirmed cases" of COVID-19. Confirmed-case data, which in and of itself is scary (we passed 5 million Americans confirmed to have tested positive for COVID-19 in the second week of August), has become a political football regardless of the color of your party pennant. This is true for many reasons, such as the variety and reliability of the individual tests. In other words, are people receiving a higher number of false negatives or false positives from each brand of test? Availability and accessibility of testing are not constant either. A person showing symptoms of COVID might not be able to get a test in every city, and healthcare workers in many counties are denied testing until they show severe symptoms and have already tried waiting out the illness at home in self-quarantine. This also calls into question the quality of the testing sample: are the people being tested showing symptoms at all, and if so, are they only the most drastic cases? In some areas a person with no symptoms cannot get a test, while in other areas people are tested multiple times per week regardless of their temperature. Is this consistent from one county to another or from one state to another? No, it is not. There could be millions of positive cases out there showing no symptoms, and there is no objective way to infer where they are geographically, or where the opposite might be true.
Our study sought to avoid all of those political arguments. We wanted to know what the data had to say; we didn't want to tell the data a story when it has a story of its own. It should be noted that testing is very important to controlling the spread of the disease, as is evidenced by many prominent people having to be tested before coming into contact with other prominent people. But for a purely statistical look at the actual size of the spread of COVID-19 on a local and national level, we used the number of deaths from COVID-19 per capita. This was the only currently-available datapoint that was consistent and reported without caveats.
The map below shows every US county, over 3,000 of them. Each county is represented by a small dot, and the dot is color coded according to the scale in the upper right of the figure. You will notice that counties with high per-capita rates of death are colored red, such as Hancock County, Georgia with 402 deaths per 100K (the deadliest in the country as of July 31, 2020), and that counties such as Potter County, Pennsylvania that have 0.0 deaths per 100K are colored blue. The colors range to different shades of orange in the middle.
What was quite surprising was that rather than places like Detroit, Boston, and New York City being the deadliest per capita, we see four out of the five deadliest counties in the United States were in rural Georgia. We can also see that the Bayou, the Mississippi Delta region, and parts of Alabama and the Florida Panhandle are equally high on the deaths-per-capita range.
Here is a list of the 25 deadliest counties in the United States as of July 31, 2020, taking the total number of COVID deaths divided by the total population of the county times 100,000, to yield the number of deaths per 100K people:
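The arithmetic behind that ranking is simple; here is a small sketch of it, with the file and column names as placeholders for the compiled county data.

```python
# The per-100K arithmetic behind the ranking (file and column names are placeholders).
import pandas as pd

counties = pd.read_csv("county_covid_deaths.csv")   # hypothetical: deaths + population
counties["deaths_per_100k"] = counties["covid_deaths"] / counties["population"] * 100_000
top_25 = counties.sort_values("deaths_per_100k", ascending=False).head(25)
print(top_25[["county", "state", "deaths_per_100k"]])
```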
It should be noted that Hancock County, Georgia has around 10K residents and, as of the last census, was around 78% African American. At numbers two, three, and five are Randolph, Terrell, and Early County, Georgia. With populations around 8K, 10K, and 11K respectively, and with populations that are about 63%, 62%, and 50% African American respectively, this very shallow dive into the deaths-per-capita numbers suggests that something very dangerous is happening in rural Georgia.
Equally alarming, the number four county, McKinley County, New Mexico, is about 75% Native American. At roughly 14 people per square mile, this county is also quite rural by any definition. It is intuitively easy to see how a dense city such as New York or Boston could have a high rate of deaths due to COVID during cold winter months, when people pack indoors and rely on crowded subway cars and buses to commute to work if they aren't privileged enough to have work-from-home situations, but why these rural populations? And is it a coincidence that they are made up of minority populations? Again, this study does not have numbers on access to hospitals or medical care in those areas, it does not have empirical data concerning what type of work these populations have or do not have, whether they are able to stay home and isolate or not, or whether these particular counties are themselves more susceptible to respiratory illness due to pollution or other factors. We are purely dealing with the number of people who have died in a population as a percentage of the total, and what additional strain this could add to our healthcare system should a hurricane, tornado, or other natural disaster present itself in the midst of this pandemic.
Wanting to make things easier to view from a wider, more regional standpoint, we broke the three thousand counties into three hundred regional clusters using a KMeans algorithm, and then plotted them in the map below.
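As a sketch of that step, assuming the clustering was done on county centroid coordinates, the code below groups roughly 3,000 counties into 300 regional clusters and averages deaths per 100K within each. The file and column names are placeholders, not the team's actual pipeline.

```python
# Sketch of grouping ~3,000 counties into 300 regional clusters with KMeans;
# the latitude/longitude columns are placeholders for the county centroids used.
import pandas as pd
from sklearn.cluster import KMeans

counties = pd.read_csv("county_covid_deaths.csv")    # hypothetical, as above
kmeans = KMeans(n_clusters=300, n_init=10, random_state=42)
counties["region"] = kmeans.fit_predict(counties[["latitude", "longitude"]])

# Average deaths per 100K for each regional cluster (what the map centroids show)
regional = counties.groupby("region")["deaths_per_100k"].mean()
print(regional.sort_values(ascending=False).head())
```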
Here you can see the average regional deaths per 100K if you go to the Tableau API and hover your mouse over any of the cluster centroids. What really jumps out in this regional clustering is the severity of the deaths per capita in Northeastern Arizona as well as neighboring New Mexico on Native American Reservations.
This regional cluster map also shows that New York, Northern New Jersey, and Southern Connecticut do have the highest regional average in the country after those Native American Reservations, but it should be noted that The Bronx was number 7 on the deadliest-counties list, with most of the top 25 being rural counties.
The lack of time-series sensitivity in this data hides the dangers faced in the rural South. We know that New York, Boston, and Detroit all faced a serious battle with COVID-19 and lost a lot of lives to the disease in April and May of 2020, contributing to an outsized proportion of the first 100K deaths (over a third). However, as the weather heats up and populations in the southern states look to the comfort of air conditioning, their proximity to one another might be part of the danger they face in spite of their lack of normal population density. We would be able to make these inferences better with a time-series model such as ARIMA, and may tackle that in the coming weeks.
Finally, talking about the weather, we looked at several natural disaster situations. Our Tableau API goes into detail on the historical data over the last twenty years regarding earthquakes, drought, wildfire, tornadoes, and hurricanes. This being August, it is currently both tornado season in the Midwest, Texas, and the South, and hurricane season in the Southeast and the Gulf Coast. You’ll notice in the map below that we identify the areas with the most property damage over the last two decades, and they are in the same areas that are currently seeing the death toll from COVID-19 spiking. Here is where the medical infrastructure question comes in.
If a hurricane like Katrina were to hit New Orleans and the Mississippi Delta this year, or to rip through Georgia and the Carolinas, we would have a situation where people evacuating flooding and storm damage might be asked to hunker down together in close quarters in high school gymnasiums and indoor sports facilities. These temporary shelters would then become Coronavirus petri dishes, effectively creating an environment with people sleeping within six feet of complete strangers, when the best qualities of our humanity, compassion and love and empathy, would be the very things that exacerbate this pandemic. Hospitals already seeing an influx of patients due to COVID would be taxed beyond their capacity; add to that any trauma victims from the storms themselves, and it's a very fast slide into a medical infrastructure overload.
Again, our study did not look into qualitative political factors. The model does not consider that the Governor of Georgia invited people to return to the barber shops and nail salons as early as April while other parts of the country wore masks and maintained social distance and reduced the number of people they were in close contact with. There is neither a control group within such an area nor the pure data to compare it. However, I am sure that given the time, either one, some, or all of us in this project would like to investigate the timeseries data around COVID to show how the pure numbers around this virus have changed in each county in relation to steps that local officials took in its handling.
While our model avoids this data, I would like to simply link and share this graphic that the IHME created that does have applicable data and models how the use of masks and other precautions could prevent between 100,000 and 160,000 additional deaths by December 1, 2020:
I for one would be very excited to do a project with the IHME that allowed me access to the hospitalizations data, including admits, discharges and deaths, as well as which tests were available to whom and in what quantities and geographic areas.
The main findings that we came to in this study, however, were that the existence of this data does allow FEMA or other NGO or government relief agencies a head start on where we need to send additional medical resources ahead of the next possible disasters. It also is a great starting point in investigating why certain rural populations that have been historically subjugated in North America are seeing such high rates of death—around forty times higher than the national median rates of death per 100K.
In summary, be it a wildfire in Arizona or Southern California, or a Blizzard in the Northeast, or a Hurricane in the South in the next six weeks, we will see a situation where a normally occurring extreme-weather event will exacerbate our COVID troubles, and the question is how do we prepare ourselves for that to minimize the damage to our communities.
If you’d like to hear more or see what kind of other work I am doing, view my resume (using the button below, or the button next to my recommendations linked at the top of this page) and please shoot an email to hello@hireMattPaterson.com.
Matt Paterson is a Machine Learning Engineer in Santa Cruz, California and is currently a Data Science Fellow at General Assembly.
The Tableau map linked in this article was created by Eric Laverdiere, laverdiere.eric@gmail.com. He used earthquake data that he cleaned and analyzed after pulling it from the USGS. He also used data provided by Matthew Burrell, matthew_burrell@outlook.com, that can be found at the NOAA, and COVID-19 data compiled and cleaned by Matt Paterson, hello@HireMattPaterson.com. The population numbers were pulled from the United States Census Bureau and the COVID data was found in several repos, including the following:
- COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University
- https://github.com/CSSEGISandData/COVID-19
- https://github.com/nytimes/covid-19-data
- Google BigQuery
- https://covid19.healthdata.org/united-states-of-america