According to the Centers for Disease Control and Prevention, 1 in 4 deaths in the United States is the result of heart disease (Source: https://www.cdc.gov/heartdisease/facts.htm).
My goal is to create a machine learning algorithm that will predict whether or not a person will have a heart attack within the next 2 years. The input for my machine learning algorithm will be 2 years' worth of ICD-9 diagnosis codes. ICD is an acronym for International Classification of Diseases. ICD codes are the standard codes for recording diagnoses that a patient receives from medical professionals. Though ICD-10 is the standard today, ICD-9 was the standard when this data was collected.
I will be using health insurance claims data for my project. Therefore, I don't have access to specific blood pressure readings or cholesterol levels. However, I will show that it is still possible to create a respectable classifier using diagnosis codes alone.
Through my research group, I have access to medical claims data for about 500,000 individuals over the course of 8 years. I will refer to the source of this data as Metro to keep it confidential. Due to HIPAA restrictions, I cannot post any rows of the dataset. However, I will present some of its basic characteristics.
This table has some high-level information regarding each of the enrolled members in the dataset. Each row is a person.
- DI_Indv_Sys_Id - Key used to join on members across the tables.
- Bth_Dt - Birth date; if age is relevant, this can be used to calculate it.
- TotalMM - Total months of being a member.
- MM_#### - Number of months of being a member in year ####.

The next table contains all the information regarding the individual claims that were filed for each person. Each row is a health insurance claim. There are approximately 46,000,000 claims.
- DI_Indv_Sys_Id - Key used to join on members across the tables.
- Diag_Cd - Primary diagnosis code.
- Diag_#_Fst4_Cd - The first four digits in the diagnosis code for the #-th diagnosis. There are up to 3 diagnoses listed per claim.
- Diag_#_Fst4_Desc - Description for the diagnosis code of the #-th diagnosis.
- Srvc_Dt - Date of the insurance claim.

To answer the questions I have, I do not need access to the entire database. Therefore, I chose to create a separate database with just the data I knew I was going to need.
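Since I cannot share the raw tables, here is a minimal sketch of how the two tables fit together, assuming they have been exported to CSV files (the file names below are placeholders, not the real Metro extracts):

```python
import pandas as pd

# Minimal sketch only -- "members.csv" and "claims.csv" are placeholder names.
members = pd.read_csv("members.csv")   # one row per enrolled member
claims = pd.read_csv("claims.csv")     # one row per insurance claim

# Parse the service date so claims can later be filtered by time window.
claims["Srvc_Dt"] = pd.to_datetime(claims["Srvc_Dt"])

# DI_Indv_Sys_Id is the key shared by both tables.
claims = claims.merge(members[["DI_Indv_Sys_Id", "Bth_Dt"]],
                      on="DI_Indv_Sys_Id", how="left")
```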
Here are the steps I followed in extracting the subset of relevant data. I will begin by describing how I extracted the data for all individuals that had a heart attack.
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
To start, I wanted to see which diagnoses occur most often within the 2 years preceding a heart attack.
The initial_pos.py script executed below accomplishes two things: it extracts the claims for every member who had a heart attack, and it tallies the diagnoses those members received most often within the 2 years preceding the heart attack.
run initial_pos.py
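The full extraction lives in initial_pos.py; as a rough sketch of the idea, assuming heart attacks are flagged by ICD-9 codes beginning with 410 (acute myocardial infarction) and reusing the claims table loaded above, the logic looks roughly like this:

```python
import pandas as pd

# Rough sketch of the idea behind initial_pos.py (not the script itself).
diag_cols = ["Diag_1_Fst4_Cd", "Diag_2_Fst4_Cd", "Diag_3_Fst4_Cd"]

# Flag claims that carry an acute-MI code in any of the three diagnosis slots.
is_mi = (claims[diag_cols]
         .apply(lambda col: col.astype(str).str.startswith("410"))
         .any(axis=1))

# Date of each member's first heart attack.
first_mi = (claims.loc[is_mi]
            .groupby("DI_Indv_Sys_Id", as_index=False)["Srvc_Dt"].min()
            .rename(columns={"Srvc_Dt": "first_mi_dt"}))

# Keep each such member's claims from the 2 years preceding that date.
pos = claims.merge(first_mi, on="DI_Indv_Sys_Id")
window = ((pos["Srvc_Dt"] < pos["first_mi_dt"]) &
          (pos["Srvc_Dt"] >= pos["first_mi_dt"] - pd.DateOffset(years=2)))

# Tally the most frequent truncated diagnosis codes in that window. This counts
# claim occurrences, a rough proxy for the per-member counts in initial_pos.py.
focus_diags = pos.loc[window, diag_cols].stack().value_counts().head(20)
print(focus_diags)
```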
For reference purposes, here are the descriptions of each of these top 20 ICD-9 codes. Recall that I am using the 4-character truncation of ICD-9 codes, so some of these descriptions will sound a bit broad. This is a consequence of the hierarchical structure of ICD-9 codes.
| # | ICD-9 | Description | # | ICD-9 | Description |
|---|---|---|---|---|---|
| 1 | 401.1 | Benign essential hypertension | 11 | 719.4 | Pain in joint | 
| 2 | 401.9 | Unspecified essential hypertension | 12 | 272.0 | Pure hypercholesterolemia | 
| 3 | 786.0 | Dyspnea and respiratory abnormalities | 13 | V04.8 | Need for prophylactic vaccination against other viral diseases | 
| 4 | 786.5 | Chest pain | 14 | V72.8 | Other specified examinations | 
| 5 | 272.4 | Other and unspecified hyperlipidemia | 15 | 786.2 | Cough | 
| 6 | 250.0 | Diabetes mellitus without mention of complication | 16 | 729.5 | Pain in limb | 
| 7 | 414.0 | Coronary atherosclerosis | 17 | 428.0 | Congestive heart failure, unspecified | 
| 8 | V58.6 | Long-term (current) medication use | 18 | 466.0 | Acute bronchitis | 
| 9 | 780.7 | Malaise and fatigue | 19 | V72.3 | Gynecological examination | 
| 10 | 789.0 | Abdominal pain | 20 | 272.2 | Mixed hyperlipidemia | 
At face value, it seems very unlikely that a "Cough" (#15) or a "Gynecological examination" (#19) will be a strong predictor of a heart attack. However, there is an outside chance that these contribute to a latent variable that is relevant. Rather than picking and choosing the diagnoses that seem most predictive, I chose to include all of the top 20 diagnoses as the features of my data points. In my code, I will call these top 20 diagnoses focus_diags. If some of them turn out not to be relevant, the classifier should be capable of weeding them out.
I needed to put a lot of thought into how to create a subset of the Metro data for the members that didn't have a heart attack. I could have chosen a few thousand people at random, but then my classifier might have ended up as simple as "If the individual has hypertension or chest pain, predict an oncoming heart attack."
Therefore, I chose to be a bit more selective. I made a subset of 2,000 individuals that had experienced at least one of the focus_diags within the latter 2 years of their data. Choosing the negative samples this way makes the classifier's task much more difficult: it must differentiate between heart-attack positive and heart-attack negative individuals even when the diagnoses are (at least somewhat) similar.
The code to initialize the negative samples for my dataset can be found in initial_neg.py.
run initial_neg.py
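The actual selection happens in initial_neg.py; the sketch below illustrates one way to express it, reusing the objects from the earlier sketches and treating members with no 410.x code as heart-attack negative:

```python
import random

import pandas as pd

# Sketch of the negative-sample selection; reuses claims, diag_cols, is_mi,
# and focus_diags from the sketches above.
mi_members = set(claims.loc[is_mi, "DI_Indv_Sys_Id"])

# Restrict each member's claims to the last 2 years of their data.
last_dt = claims.groupby("DI_Indv_Sys_Id")["Srvc_Dt"].transform("max")
recent = claims[claims["Srvc_Dt"] >= last_dt - pd.DateOffset(years=2)]

# Members with at least one focus diagnosis in that window and no heart attack.
has_focus = recent[diag_cols].isin(set(focus_diags.index)).any(axis=1)
candidates = set(recent.loc[has_focus, "DI_Indv_Sys_Id"]) - mi_members

# Draw 2,000 of them for the negative class.
random.seed(0)
neg_ids = random.sample(sorted(candidates), 2000)
```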
The chart above shows the number of individuals exhibiting a given diagnosis within a 2-year time window. As was intended, the distribution of diagnoses is similar enough that it is difficult to determine right away what the key features will be.
In the visualization above, I used 1,100 negative samples (equal to the number of positive samples). For my training data, I will use 2,000 negative samples instead.
As I mentioned above, I decided to use the top 20 diagnoses as the features of my dataset. To make the data compatible with the classifiers I decided to use, I summed the number of times an individual was diagnosed with each of the top 20 diagnoses. I also included the birth year of each patient as an additional feature.
The code used to prepare the final dataset can be found in finalize_dataset.py.
run finalize_dataset.py
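finalize_dataset.py holds the real feature construction; the sketch below shows the counting approach, where subset_claims is a hypothetical table containing the claims for the selected positive and negative members within their 2-year windows:

```python
import pandas as pd

# Sketch of the feature construction. diag_cols, focus_diags, members, and
# mi_members come from the sketches above; subset_claims is hypothetical.
diag_long = (subset_claims[["DI_Indv_Sys_Id", "Srvc_Dt"] + diag_cols]
             .melt(id_vars=["DI_Indv_Sys_Id", "Srvc_Dt"], value_name="diag")
             .dropna(subset=["diag"]))

# Count how many times each member received each of the 20 focus diagnoses.
diag_long = diag_long[diag_long["diag"].isin(focus_diags.index)]
X = (diag_long.groupby(["DI_Indv_Sys_Id", "diag"]).size()
     .unstack(fill_value=0)
     .reindex(columns=focus_diags.index, fill_value=0))

# Add birth year as an extra feature.
birth_year = pd.to_datetime(members.set_index("DI_Indv_Sys_Id")["Bth_Dt"]).dt.year
X["birth_year"] = birth_year

# Label: 1 for the heart attack group, 0 otherwise.
y = X.index.isin(mi_members).astype(int)
```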
Now, with the final dataset initialized, I will look to see if there are any other trends I can detect. The additional_viz.py script below generates a couple more charts for this purpose.
run additional_viz.py
There are two surface-level observations I would like to make based on these visualizations.
| ICD-9 | Description | 
|---|---|
| 401.1 | Benign essential hypertension | 
| 250.0 | Diabetes mellitus without mention of complication | 
Considering hypertension and diabetes are both long-term health conditions, I think it makes sense that they would stand out in both of the charts above.
I decided to apply a number of different classifiers to my dataset, including several tree-based methods and an LSTM (the full list can be found in the code referenced below).
All of the code used to train, tune, and test these machine learning algorithms can be found in apply_ml_algs.py.
It is also worth noting that I had to adjust my data a bit to be able to train the LSTM. Because LSTMs make predictions based on sequential data, the count-based features I prepared above are not usable here; instead, the LSTM accepts a sequence of diagnosis codes for each individual. The code for preparing this data can also be found in apply_ml_algs.py.
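As an illustration only, here is roughly what the sequence preparation and a small Keras LSTM could look like; the layer sizes, sequence length, and training settings are placeholders rather than the configuration in apply_ml_algs.py, and I restrict the sequences to the focus diagnoses for simplicity:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Illustration only; reuses diag_long, focus_diags, and mi_members from above.
code_index = {code: i + 1 for i, code in enumerate(focus_diags.index)}  # 0 = padding

# Chronologically ordered sequence of focus-diagnosis codes per member.
sequences = (diag_long.sort_values("Srvc_Dt")
             .groupby("DI_Indv_Sys_Id")["diag"]
             .apply(lambda s: [code_index[c] for c in s]))

max_len = 50  # pad/truncate every member's history to a fixed length
X_seq = pad_sequences(sequences.tolist(), maxlen=max_len)
y_seq = sequences.index.isin(mi_members).astype(int)

# A small LSTM over the embedded code sequence.
model = Sequential([
    Embedding(input_dim=len(code_index) + 1, output_dim=8, mask_zero=True),
    LSTM(16),
    Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
model.fit(X_seq, y_seq, epochs=10, batch_size=64, validation_split=0.2)
```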
run apply_ml_algs.py
It is worth noting that 10 trials isn't enough to make any definite statements, but my initial results suggest that tree-based methods and the LSTM are among the top performing classifiers. More trials and further parameter tuning would, of course, be ideal. However, due to my limited computational resources, this was not feasible.
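To make the evaluation procedure concrete, here is a sketch of averaging accuracy over 10 random train/test splits, with a scikit-learn random forest standing in for the tree-based methods (the exact models and splits in apply_ml_algs.py may differ):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Sketch of a repeated-trial evaluation over the X, y features built earlier.
accuracies = []
for trial in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=trial)
    clf = RandomForestClassifier(n_estimators=200, random_state=trial)
    clf.fit(X_tr, y_tr)
    accuracies.append(accuracy_score(y_te, clf.predict(X_te)))

print(f"mean accuracy over 10 trials: {np.mean(accuracies):.3f}")
```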
I had hypothesized that the LSTM would be the top performer. In some of my tests it did outperform all the other models, but only slightly. With limited computational resources, it is difficult to determine the optimal topology for an LSTM.
I set out to build a classifier that predicts whether or not somebody will have a heart attack within the next 2 years. My best classifier averaged about 86.5% accuracy.
The next step would be to get hold of a larger dataset. Having only 1,100 patients who had had a heart attack makes me a little skeptical of my results. Another next step would be to run these models on a system with more computational power. The LSTM was particularly computationally intensive; with greater resources, it would be much easier to zero in on a better topology.
All things considered, this is a very respectable start.