According to the Centers for Disease Control and Prevention, 1 in 4 deaths in the United States is the result of heart disease (Source: https://www.cdc.gov/heartdisease/facts.htm).
My goal is to create a machine learning algorithm that will predict whether or not a person will have a heart attack within the next 2 years. The input for my machine learning algorithm will be 2 years' worth of ICD-9 diagnosis codes. ICD is an acronym for International Classification of Diseases. ICD codes are the standard codes for recording diagnoses that a patient receives from medical professionals. Though ICD-10 is the standard today, ICD-9 was the standard when this data was collected.
I will be using health insurance claims data for my project. Therefore, I don't have access to specific blood pressure readings or cholesterol levels. However, I will show that it is still possible to create a respectable classifier using diagnosis codes alone.
Through my research group, I have access to medical claims data for about 500,000 individuals over the course of 8 years. I will refer to the source of this data as Metro to keep it confidential. Due to HIPAA restrictions, I cannot post any rows of the dataset. However, I will present some of its basic characteristics.
This table has some high-level information regarding each of the enrolled members in the dataset. Each row is a person.
- DI_Indv_Sys_Id - Key used to join on members across the tables.
- Bth_Dt - Birth date; if age is relevant, this can be used to calculate it.
- TotalMM - Total months of being a member.
- MM_#### - Number of months of being a member in year ####.

The next table contains all the information regarding the individual claims that were filed for each person. Each row is a health insurance claim. There are approximately 46,000,000 claims.
- DI_Indv_Sys_Id - Key used to join on members across the tables.
- Diag_Cd - Primary diagnosis code.
- Diag_#_Fst4_Cd - The first four digits in the diagnosis code for the #-th diagnosis. There are up to 3 diagnoses listed per claim.
- Diag_#_Fst4_Desc - Description for the diagnosis code of the #-th diagnosis.
- Srvc_Dt - Date of the insurance claim.

To answer the questions I have, I do not need access to the entire database. Therefore, I chose to create a separate database with just the data I knew I was going to need.
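Since I cannot share the raw tables, here is a minimal sketch of how the two tables fit together, assuming they have been exported to CSV files (the file names below are placeholders, not the real Metro extracts):

```python
import pandas as pd

# Minimal sketch only -- "members.csv" and "claims.csv" are placeholder names.
members = pd.read_csv("members.csv")   # one row per enrolled member
claims = pd.read_csv("claims.csv")     # one row per insurance claim

# Parse the service date so claims can later be filtered by time window.
claims["Srvc_Dt"] = pd.to_datetime(claims["Srvc_Dt"])

# DI_Indv_Sys_Id is the key shared by both tables.
claims = claims.merge(members[["DI_Indv_Sys_Id", "Bth_Dt"]],
                      on="DI_Indv_Sys_Id", how="left")
```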
Here are the steps I followed in extracting the subset of relevant data. I will begin by describing how I extracted the data for all individuals that had a heart attack.
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
To start, I wanted to see which diagnoses occur most often within the 2 years preceding a heart attack.
The initial_pos.py script executed below accomplishes two things: it extracts the claims for every member who had a heart attack, and it tallies the diagnoses those members received most often within the 2 years preceding the heart attack.
run initial_pos.py
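The full extraction lives in initial_pos.py; as a rough sketch of the idea, assuming heart attacks are flagged by ICD-9 codes beginning with 410 (acute myocardial infarction) and reusing the claims table loaded above, the logic looks roughly like this:

```python
import pandas as pd

# Rough sketch of the idea behind initial_pos.py (not the script itself).
diag_cols = ["Diag_1_Fst4_Cd", "Diag_2_Fst4_Cd", "Diag_3_Fst4_Cd"]

# Flag claims that carry an acute-MI code in any of the three diagnosis slots.
is_mi = (claims[diag_cols]
         .apply(lambda col: col.astype(str).str.startswith("410"))
         .any(axis=1))

# Date of each member's first heart attack.
first_mi = (claims.loc[is_mi]
            .groupby("DI_Indv_Sys_Id", as_index=False)["Srvc_Dt"].min()
            .rename(columns={"Srvc_Dt": "first_mi_dt"}))

# Keep each such member's claims from the 2 years preceding that date.
pos = claims.merge(first_mi, on="DI_Indv_Sys_Id")
window = ((pos["Srvc_Dt"] < pos["first_mi_dt"]) &
          (pos["Srvc_Dt"] >= pos["first_mi_dt"] - pd.DateOffset(years=2)))

# Tally the most frequent truncated diagnosis codes in that window. This counts
# claim occurrences, a rough proxy for the per-member counts in initial_pos.py.
focus_diags = pos.loc[window, diag_cols].stack().value_counts().head(20)
print(focus_diags)
```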
For reference purposes, here are the descriptions of each of these top 20 ICD-9 codes. Recall that I am using the 4-character truncation of ICD-9 codes, so some of these descriptions will sound a bit broad. This is a consequence of the hierarchical structure of ICD-9 codes.
| # | ICD-9 | Description | # | ICD-9 | Description |
|---|---|---|---|---|---|
| 1 | 401.1 | Benign essential hypertension | 11 | 719.4 | Pain in joint | 
| 2 | 401.9 | Unspecified essential hypertension | 12 | 272.0 | Pure hypercholesterolemia | 
| 3 | 786.0 | Dyspnea and respiratory abnormalities | 13 | V04.8 | Need for prophylactic vaccination against other viral diseases | 
| 4 | 786.5 | Chest pain | 14 | V72.8 | Other specified examinations | 
| 5 | 272.4 | Other and unspecified hyperlipidemia | 15 | 786.2 | Cough | 
| 6 | 250.0 | Diabetes mellitus without mention of complication | 16 | 729.5 | Pain in limb | 
| 7 | 414.0 | Coronary atherosclerosis | 17 | 428.0 | Congestive heart failure, unspecified | 
| 8 | V58.6 | Long-term (current) medication use | 18 | 466.0 | Acute bronchitis | 
| 9 | 780.7 | Malaise and fatigue | 19 | V72.3 | Gynecological examination | 
| 10 | 789.0 | Abdominal pain | 20 | 272.2 | Mixed hyperlipidemia | 
At face value, it seems very unlikely that a "Cough" (#15) or a "Gynecological examination" (#19) will be a strong predictor of a heart attack. However, there is an outside chance that these contribute to a latent variable that is relevant. Rather than picking and choosing the diagnoses that seem most predictive, I chose to include all of the top 20 diagnoses as the features of my data points. In my code, I will call these top 20 diagnoses focus_diags. If some of them turn out not to be relevant, the classifier should be capable of weeding them out.
I needed to put a lot of thought into how to create a subset of the Metro data for the members that didn't have a heart attack. I could have chosen a few thousand people at random, but then my classifier might have ended up as simple as "If the individual has hypertension or chest pain, predict an oncoming heart attack."
Therefore, I chose to be a bit more selective. I made a subset of 2,000 individuals that had experienced at least one of the focus_diags within the latter 2 years of their data. Choosing the negative samples this way makes the classifier's task much more difficult: it must differentiate between heart-attack positive and heart-attack negative individuals even when the diagnoses are (at least somewhat) similar.
The code to initialize the negative samples for my dataset can be found in initial_neg.py.
run initial_neg.py
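The actual selection happens in initial_neg.py; the sketch below illustrates one way to express it, reusing the objects from the earlier sketches and treating members with no 410.x code as heart-attack negative:

```python
import random

import pandas as pd

# Sketch of the negative-sample selection; reuses claims, diag_cols, is_mi,
# and focus_diags from the sketches above.
mi_members = set(claims.loc[is_mi, "DI_Indv_Sys_Id"])

# Restrict each member's claims to the last 2 years of their data.
last_dt = claims.groupby("DI_Indv_Sys_Id")["Srvc_Dt"].transform("max")
recent = claims[claims["Srvc_Dt"] >= last_dt - pd.DateOffset(years=2)]

# Members with at least one focus diagnosis in that window and no heart attack.
has_focus = recent[diag_cols].isin(set(focus_diags.index)).any(axis=1)
candidates = set(recent.loc[has_focus, "DI_Indv_Sys_Id"]) - mi_members

# Draw 2,000 of them for the negative class.
random.seed(0)
neg_ids = random.sample(sorted(candidates), 2000)
```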
The chart above shows the number of individuals exhibiting a given diagnosis within a 2-year time window. As was intended, the distribution of diagnoses is similar enough that it is difficult to determine right away what the key features will be.
In the visualization above, I used 1,100 negative samples (equal to the number of positive samples). For my training data, I will use 2,000 negative samples instead.
As I mentioned above, I decided to use the top 20 diagnoses as the features of my dataset. To make the data compatible with the classifiers I decided to use, I summed the number of times an individual was diagnosed with each of the top 20 diagnoses. I also included the birth year of each patient as an additional feature.
The code used to prepare the final dataset can be found in finalize_dataset.py.
run finalize_dataset.py
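finalize_dataset.py holds the real feature construction; the sketch below shows the counting approach, where subset_claims is a hypothetical table containing the claims for the selected positive and negative members within their 2-year windows:

```python
import pandas as pd

# Sketch of the feature construction. diag_cols, focus_diags, members, and
# mi_members come from the sketches above; subset_claims is hypothetical.
diag_long = (subset_claims[["DI_Indv_Sys_Id", "Srvc_Dt"] + diag_cols]
             .melt(id_vars=["DI_Indv_Sys_Id", "Srvc_Dt"], value_name="diag")
             .dropna(subset=["diag"]))

# Count how many times each member received each of the 20 focus diagnoses.
diag_long = diag_long[diag_long["diag"].isin(focus_diags.index)]
X = (diag_long.groupby(["DI_Indv_Sys_Id", "diag"]).size()
     .unstack(fill_value=0)
     .reindex(columns=focus_diags.index, fill_value=0))

# Add birth year as an extra feature.
birth_year = pd.to_datetime(members.set_index("DI_Indv_Sys_Id")["Bth_Dt"]).dt.year
X["birth_year"] = birth_year

# Label: 1 for the heart attack group, 0 otherwise.
y = X.index.isin(mi_members).astype(int)
```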
Now, with the final dataset initialized, I will look to see if there are any other trends I can detect. The additional_viz.py script below generates a couple more charts for this purpose.
run additional_viz.py
There are two surface-level observations I would like to make based on these visualizations.
| ICD-9 | Description | 
|---|---|
| 401.1 | Benign essential hypertension | 
| 250.0 | Diabetes mellitus without mention of complication | 
Considering hypertension and diabetes are both long-term health conditions, I think it makes sense that they would stand out in both of the charts above.
I decided to apply a number of different classifiers to my dataset, including several tree-based methods and an LSTM (the full list can be found in the code referenced below).
All of the code used to train, tune, and test these machine learning algorithms can be found in apply_ml_algs.py.
It is also worth noting that I had to adjust my data a bit to be able to train the LSTM. Because LSTMs make predictions based on sequential data, the count-based features I prepared above are not usable here; instead, the LSTM accepts a sequence of diagnosis codes for each individual. The code for preparing this data can also be found in apply_ml_algs.py.
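As an illustration only, here is roughly what the sequence preparation and a small Keras LSTM could look like; the layer sizes, sequence length, and training settings are placeholders rather than the configuration in apply_ml_algs.py, and I restrict the sequences to the focus diagnoses for simplicity:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Illustration only; reuses diag_long, focus_diags, and mi_members from above.
code_index = {code: i + 1 for i, code in enumerate(focus_diags.index)}  # 0 = padding

# Chronologically ordered sequence of focus-diagnosis codes per member.
sequences = (diag_long.sort_values("Srvc_Dt")
             .groupby("DI_Indv_Sys_Id")["diag"]
             .apply(lambda s: [code_index[c] for c in s]))

max_len = 50  # pad/truncate every member's history to a fixed length
X_seq = pad_sequences(sequences.tolist(), maxlen=max_len)
y_seq = sequences.index.isin(mi_members).astype(int)

# A small LSTM over the embedded code sequence.
model = Sequential([
    Embedding(input_dim=len(code_index) + 1, output_dim=8, mask_zero=True),
    LSTM(16),
    Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
model.fit(X_seq, y_seq, epochs=10, batch_size=64, validation_split=0.2)
```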
run apply_ml_algs.py
It is worth noting that 10 trials isn't enough to make any definite statements, but my initial results suggest that tree-based methods and the LSTM are among the top performing classifiers. More trials and further parameter tuning would, of course, be ideal. However, due to my limited computational resources, this was not feasible.
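To make the evaluation procedure concrete, here is a sketch of averaging accuracy over 10 random train/test splits, with a scikit-learn random forest standing in for the tree-based methods (the exact models and splits in apply_ml_algs.py may differ):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Sketch of a repeated-trial evaluation over the X, y features built earlier.
accuracies = []
for trial in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=trial)
    clf = RandomForestClassifier(n_estimators=200, random_state=trial)
    clf.fit(X_tr, y_tr)
    accuracies.append(accuracy_score(y_te, clf.predict(X_te)))

print(f"mean accuracy over 10 trials: {np.mean(accuracies):.3f}")
```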
I had hypothesized that the LSTM would be the top performer. In some of my tests it did outperform all the other models, but only slightly. With limited computational resources, it is difficult to determine the optimal topology for an LSTM.
I set out to build a classifier that predicts whether or not somebody will have a heart attack within the next 2 years. My best classifier averaged about 86.5% accuracy.
The next step would be to get hold of a larger dataset. Having only 1,100 patients who had had a heart attack makes me a little skeptical of my results. Another next step would be to run these models on a system with more computational power. The LSTM was particularly computationally intensive; with greater resources, it would be much easier to zero in on a better topology.
All things considered, this is a very respectable start.