In [1]:
!pip install matplotlib
!pip install numpy
!pip install pandas
!pip install seaborn
!pip install scipy
!pip install statsmodels
!pip install scikit-learn
!pip install networkx
Requirement already satisfied: matplotlib (3.8.4), numpy (1.23.5), pandas (1.5.3), seaborn (0.12.2), scipy (1.10.0), statsmodels (0.13.5), scikit-learn (1.2.1), networkx (3.3), and their dependencies.

Please clone the following URL and move the files into the same directory as this Python notebook: https://github.com/Maanuj-Vora/Data102-Final-Project-Datasets
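A minimal way to do this from inside the notebook, assuming git is available on your PATH and that the repository keeps its CSV files at the top level (both assumptions about your environment):

# clone the dataset repository and copy its CSVs next to this notebook
!git clone https://github.com/Maanuj-Vora/Data102-Final-Project-Datasets

import glob, shutil
for path in glob.glob('Data102-Final-Project-Datasets/*.csv'):
    shutil.copy(path, '.')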

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
import seaborn as sns
import scipy.stats
from scipy.stats import norm
import statsmodels.api as sm
from statsmodels.formula.api import glm
import networkx as nx
from statsmodels.genmod.generalized_linear_model import GLM
from statsmodels.genmod import families
from sklearn.ensemble import RandomForestRegressor
from sklearn import linear_model, tree
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from scipy.stats import chi2
import statsmodels.formula.api as smf
from statsmodels.tools.eval_measures import rmse
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from scipy.stats import ttest_ind
from statsmodels.stats.power import TTestIndPower
from sklearn.model_selection import train_test_split
In [3]:
df = pd.read_csv('Population by Age and Sex - EDDs.csv')

mean_population_by_year = df[df['State'] == 'CA'].groupby('Year')['Total Population'].mean()

mean_population_by_year.plot(kind='line', figsize=(10, 6))

plt.title('Average Total Population Over Years')
plt.xlabel('Year')
plt.ylabel('Average Population')
plt.xticks(range(min(mean_population_by_year.index), max(mean_population_by_year.index) + 1, 2))
plt.grid(True)
plt.show()

The chart displays the average total population over the years in California, illustrating how the population has changed. This relationship is interesting because it provides insights into population growth or decline trends over time. This is relevant because we intend to normalize the emissions dataset and asthma dataset by the total population per state.

In [4]:
states = ['CA', 'NY', 'FL', 'NV', 'TX', 'HI', 'WA', 'OR', 'KY', 'LA']
df[df['State'].isin(states)].groupby(['Year', 'State'])['Total Population'].mean().unstack().plot(kind='bar', stacked=True, figsize=(10, 6))
Out[4]:
<Axes: xlabel='Year'>

The graph displays the average total population for selected states (CA, NY, FL, NV, TX, HI, WA, OR, KY, LA) over various years, with each state's population represented as a portion of a bar for each year. The bars are stacked, meaning the population sizes of the states are added on top of each other within each year's bar, allowing us to see the combined population trends over time for these states. This visualization helps in comparing the population growth or decline among these states across different years, as well as understanding the overall population trend of these states combined.

In [5]:
race_df = pd.read_csv('Population by Race - EDDs.csv')
race_df.fillna(0, inplace=True) # For Hawaiian or Pacific Islander Alone, fill with zeroes if there are null values.
race_df
Out[5]:
IBRC_Geo_ID State District Name Year Total Population White Alone Black Alone American Indian or Alaskan Native Asian Alone Hawaiian or Pacific Islander Alone Two or More Races Not Hispanic Hispanic
0 6500001 AK Kenai Peninsula Economic Development District 1990 41125.0 37508.0 199.0 3003.0 415.0 0.0 0.0 40421.0 704.0
1 6500001 AK Kenai Peninsula Economic Development District 1991 42649.0 38893.0 204.0 3120.0 432.0 0.0 0.0 41913.0 736.0
2 6500001 AK Kenai Peninsula Economic Development District 1992 43496.0 39670.0 212.0 3173.0 441.0 0.0 0.0 42748.0 748.0
3 6500001 AK Kenai Peninsula Economic Development District 1993 44201.0 40265.0 219.0 3251.0 466.0 0.0 0.0 43431.0 770.0
4 6500001 AK Kenai Peninsula Economic Development District 1994 45588.0 41435.0 237.0 3423.0 493.0 0.0 0.0 44761.0 827.0
... ... ... ... ... ... ... ... ... ... ... ... ... ...
11935 6500416 CT Western Connecticut Economic Development Distr... 2015 944943.0 752134.0 116361.0 4573.0 52039.0 1020.0 18816.0 766505.0 178438.0
11936 6500416 CT Western Connecticut Economic Development Distr... 2016 944347.0 748757.0 117499.0 4696.0 53041.0 1004.0 19350.0 762293.0 182054.0
11937 6500416 CT Western Connecticut Economic Development Distr... 2017 943038.0 745352.0 118526.0 4775.0 53541.0 1005.0 19839.0 757421.0 185617.0
11938 6500416 CT Western Connecticut Economic Development Distr... 2018 943971.0 743260.0 119919.0 4964.0 54515.0 1018.0 20295.0 753873.0 190098.0
11939 6500416 CT Western Connecticut Economic Development Distr... 2019 943332.0 739179.0 121997.0 5115.0 55489.0 1029.0 20523.0 749560.0 193772.0

11940 rows × 13 columns

In [6]:
merged_df = pd.merge(df, race_df, on=['IBRC_Geo_ID', 'State', 'District Name', 'Year', 'Total Population'], how='inner')
merged_df
Out[6]:
IBRC_Geo_ID State District Name Year Total Population Population 0-4 Population 5-17 Population 18-24 Population 25-44 Population 45-64 ... Male Population Female Population White Alone Black Alone American Indian or Alaskan Native Asian Alone Hawaiian or Pacific Islander Alone Two or More Races Not Hispanic Hispanic
0 6500001 AK Kenai Peninsula Economic Development District 2000 49618.0 3272.0 11536.0 3432.0 14636.0 13044.0 ... 25831.0 23787.0 43353.0 236.0 3760.0 488.0 88.0 1693.0 48530.0 1088.0
1 6500001 AK Kenai Peninsula Economic Development District 2001 49986.0 3207.0 11261.0 3770.0 14117.0 13740.0 ... 26065.0 23921.0 43611.0 273.0 3729.0 512.0 79.0 1782.0 48826.0 1160.0
2 6500001 AK Kenai Peninsula Economic Development District 2002 50804.0 3217.0 11123.0 4099.0 13835.0 14506.0 ... 26589.0 24215.0 44144.0 287.0 3811.0 538.0 89.0 1935.0 49547.0 1257.0
3 6500001 AK Kenai Peninsula Economic Development District 2003 50963.0 3185.0 10874.0 4278.0 13405.0 15077.0 ... 26657.0 24306.0 44214.0 283.0 3808.0 547.0 91.0 2020.0 49657.0 1306.0
4 6500001 AK Kenai Peninsula Economic Development District 2004 51404.0 3181.0 10592.0 4447.0 13111.0 15718.0 ... 26801.0 24603.0 44423.0 271.0 3819.0 570.0 110.0 2211.0 50055.0 1349.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7885 6500416 CT Western Connecticut Economic Development Distr... 2015 944943.0 53031.0 166447.0 85431.0 227999.0 271756.0 ... 460316.0 485007.0 752134.0 116361.0 4573.0 52039.0 1020.0 18816.0 766505.0 178438.0
7886 6500416 CT Western Connecticut Economic Development Distr... 2016 944347.0 52545.0 164256.0 86562.0 225568.0 272331.0 ... 460137.0 484959.0 748757.0 117499.0 4696.0 53041.0 1004.0 19350.0 762293.0 182054.0
7887 6500416 CT Western Connecticut Economic Development Distr... 2017 943038.0 52058.0 162222.0 86535.0 224212.0 271858.0 ... 459755.0 484352.0 745352.0 118526.0 4775.0 53541.0 1005.0 19839.0 757421.0 185617.0
7888 6500416 CT Western Connecticut Economic Development Distr... 2018 943971.0 51708.0 160510.0 86323.0 225033.0 270590.0 ... 460611.0 484900.0 743260.0 119919.0 4964.0 54515.0 1018.0 20295.0 753873.0 190098.0
7889 6500416 CT Western Connecticut Economic Development Distr... 2019 943332.0 51199.0 158467.0 85880.0 225756.0 268079.0 ... 459938.0 484450.0 739179.0 121997.0 5115.0 55489.0 1029.0 20523.0 749560.0 193772.0

7890 rows × 24 columns

In [7]:
summary_df = merged_df.groupby('Year').agg({
    'White Alone': 'sum',
    'Black Alone': 'sum',
    'American Indian or Alaskan Native': 'sum',
    'Asian Alone': 'sum',
    'Hawaiian or Pacific Islander Alone': 'sum',
    'Two or More Races': 'sum',
    #'Not Hispanic': 'sum',
    'Hispanic': 'sum'
}).reset_index()

# Plot
summary_df.plot(x='Year', kind='bar', stacked=True, figsize=(10, 6), title='Population by Race Over Years')
plt.xlabel('Year')
plt.ylabel('Population')
plt.legend(title='Population Groups', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

This visualization stacks the populations of the different racial and ethnic groups for each year, providing an overview of demographic changes over time.

In [8]:
asthma_df = pd.read_csv('U.S._Chronic_Disease_Indicators__CDI___2023_Release_20240417.csv')
asthma_df
C:\Users\tranh\AppData\Local\Temp\ipykernel_2596\2650947367.py:1: DtypeWarning: Columns (10) have mixed types. Specify dtype option on import or set low_memory=False.
  asthma_df = pd.read_csv('U.S._Chronic_Disease_Indicators__CDI___2023_Release_20240417.csv')
Out[8]:
YearStart YearEnd LocationAbbr LocationDesc DataSource Topic Question Response DataValueUnit DataValueType ... LocationID TopicID QuestionID DataValueTypeID StratificationCategoryID1 StratificationID1 StratificationCategoryID2 StratificationID2 StratificationCategoryID3 StratificationID3
0 2010 2010 OR Oregon NVSS Cardiovascular Disease Mortality from heart failure NaN NaN Number ... 41 CVD CVD1_4 NMBR RACE AIAN NaN NaN NaN NaN
1 2019 2019 AZ Arizona YRBSS Alcohol Alcohol use among youth NaN % Crude Prevalence ... 4 ALC ALC1_1 CRDPREV GENDER GENF NaN NaN NaN NaN
2 2019 2019 OH Ohio YRBSS Alcohol Alcohol use among youth NaN % Crude Prevalence ... 39 ALC ALC1_1 CRDPREV GENDER GENM NaN NaN NaN NaN
3 2019 2019 US United States YRBSS Alcohol Alcohol use among youth NaN % Crude Prevalence ... 59 ALC ALC1_1 CRDPREV RACE ASN NaN NaN NaN NaN
4 2015 2015 VI Virgin Islands YRBSS Alcohol Alcohol use among youth NaN % Crude Prevalence ... 78 ALC ALC1_1 CRDPREV GENDER GENM NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1185671 2020 2020 WY Wyoming BRFSS Diabetes Dilated eye examination among adults aged >= 1... NaN % Age-adjusted Prevalence ... 56 DIA DIA7_0 AGEADJPREV RACE WHT NaN NaN NaN NaN
1185672 2020 2020 WY Wyoming BRFSS Older Adults Proportion of older adults aged >= 65 years wh... NaN % Crude Prevalence ... 56 OLD OLD3_1 CRDPREV RACE WHT NaN NaN NaN NaN
1185673 2017 2017 IA Iowa BRFSS Arthritis Activity limitation due to arthritis among adu... NaN % Age-adjusted Prevalence ... 19 ART ART2_1 AGEADJPREV RACE HIS NaN NaN NaN NaN
1185674 2020 2020 WY Wyoming BRFSS Diabetes Diabetes prevalence among women aged 18-44 years NaN % Crude Prevalence ... 56 DIA DIA2_2 CRDPREV RACE HIS NaN NaN NaN NaN
1185675 2019 2019 RI Rhode Island BRFSS Arthritis Activity limitation due to arthritis among adu... NaN % Crude Prevalence ... 44 ART ART2_1 CRDPREV OVERALL OVR NaN NaN NaN NaN

1185676 rows × 34 columns

In [9]:
# Filter the dataset for Asthma indicators only
asthma_data = asthma_df[asthma_df['Topic'] == 'Asthma']
In [10]:
asthma_data = asthma_data.dropna(subset=['DataValue'])
asthma_data['DataValue'] = pd.to_numeric(asthma_data['DataValue'], errors='coerce')

# Group by YearStart and calculate the mean of DataValue for Asthma prevalence
asthma_yearly = asthma_data.groupby('YearStart')['DataValue'].mean()

plt.figure(figsize=(12, 6))
plt.plot(asthma_yearly.index, asthma_yearly.values, marker='o', linestyle='-', color='blue')
plt.title('Average Asthma Prevalence Over the Years')
plt.xlabel('Year')
plt.ylabel('Average Prevalence (%)')
plt.grid(True)
plt.show()

There is a decrease in average asthma prevalence over the years. This might be due to a lack of data; for example, 2011 and 2012 have no data points.
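A quick way to confirm which years actually appear in the grouped series, using the asthma_yearly series defined above (a minimal check, not part of the original analysis):

print(sorted(asthma_yearly.index))  # years with at least one asthma data point; 2011 and 2012 should be absent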

In [11]:
# Filter the dataset for Emergency Room visits related to Asthma
er_visits_asthma = asthma_data[asthma_data['Question'].str.contains('Emergency department visit rate for asthma')]

# Group by YearStart and calculate the mean of DataValue for ER visits due to Asthma
er_visits_yearly = er_visits_asthma.groupby('YearStart')['DataValue'].mean().reset_index()

er_visits_yearly
Out[11]:
YearStart DataValue
0 2010 13897.291389
1 2013 6392.235606
2 2014 4310.263989
3 2015 2912.746088
4 2016 4689.097370
5 2017 3821.923457
6 2018 3368.481102
In [12]:
plt.figure(figsize=(12, 6))
plt.plot(er_visits_yearly['YearStart'], er_visits_yearly['DataValue'], marker='o', color='red')
plt.title('Average Emergency Room Visits for Asthma Over the Years')
plt.xlabel('Year')
plt.ylabel('Average ER Visits')
plt.grid(True)
plt.show()

There is a decrease in average emergency room visits over the years. This may be due to wildfires in 2010, which released more carbon dioxide into the atmosphere. It may also be due to the lack of data.

In [13]:
asthma_data = asthma_df[asthma_df['Topic'] == 'Asthma']
asthma_data = asthma_data[asthma_data['Question'] == 'Emergency department visit rate for asthma']
asthma_data = asthma_data[asthma_data['DataValueType'] == 'Number']
asthma_data = asthma_data[asthma_data['StratificationCategory1'] == 'Overall']
asthma_data['DataValue'] = pd.to_numeric(asthma_data['DataValue'], errors='coerce')
asthma_data.dropna(subset=['DataValue'], inplace=True)
asthma_state_yearly = asthma_data.groupby(['YearStart', 'LocationAbbr'])['DataValue'].mean().reset_index()

asthma_state_yearly.head(30)
Out[13]:
YearStart LocationAbbr DataValue
0 2010 AZ 40650.0
1 2010 CA 217637.0
2 2010 FL 141612.0
3 2010 HI 7328.0
4 2010 IA 14135.0
5 2010 KY 27334.0
6 2010 MA 53257.0
7 2010 MD 56070.0
8 2010 NC 60806.0
9 2010 NE 6798.0
10 2010 NJ 81536.0
11 2010 NY 222051.0
12 2010 RI 7845.0
13 2010 SC 24076.0
14 2010 UT 7867.0
15 2010 WI 30501.0
16 2013 AZ 31357.0
17 2013 FL 126267.0
18 2013 IA 11065.0
19 2013 KY 22037.0
20 2013 NC 52550.0
21 2013 NE 5391.0
22 2013 NJ 66250.0
23 2013 NV 13652.0
24 2013 NY 179657.0
25 2013 SC 30142.0
26 2013 VT 2545.0
27 2013 WI 23701.0
28 2014 AR 13610.0
29 2014 AZ 32699.0
In [14]:
df_population = pd.read_csv('Population by Age and Sex - EDDs.csv')
# Summarize total population by state and year
df_population_summary = df_population.groupby(['Year', 'State'])['Total Population'].sum().reset_index()
df_population_summary
Out[14]:
Year State Total Population
0 2000 AK 146644.0
1 2000 AL 4510366.0
2 2000 AR 2678588.0
3 2000 AZ 1220370.0
4 2000 CA 8163246.0
... ... ... ...
915 2019 VA 815269.0
916 2019 VT 324114.0
917 2019 WA 5607154.0
918 2019 WI 2772884.0
919 2019 WV 1792147.0

920 rows × 3 columns

In [15]:
# Merge asthma data with population data
asthma_population_merged = pd.merge(asthma_state_yearly, df_population_summary, left_on=['YearStart', 'LocationAbbr'], right_on=['Year', 'State'], how='inner')

# Calculate the percentage of population affected by asthma
asthma_population_merged['NormalizedOccurence'] = (asthma_population_merged['DataValue'] / asthma_population_merged['Total Population'])
asthma_population_merged['Asthma Prevalence (%)'] = asthma_population_merged['NormalizedOccurence'] * 100

asthma_population_merged[['YearStart', 'LocationAbbr', 'DataValue', 'Total Population', 'Asthma Prevalence (%)', 'NormalizedOccurence']]
Out[15]:
YearStart LocationAbbr DataValue Total Population Asthma Prevalence (%) NormalizedOccurence
0 2010 AZ 40650.0 1600442.0 2.539923 0.025399
1 2010 CA 217637.0 8730612.0 2.492803 0.024928
2 2010 FL 141612.0 18845537.0 0.751435 0.007514
3 2010 IA 14135.0 2075096.0 0.681173 0.006812
4 2010 KY 27334.0 4348181.0 0.628631 0.006286
... ... ... ... ... ... ...
95 2018 NJ 52173.0 569816.0 9.156114 0.091561
96 2018 NV 14810.0 199220.0 7.433993 0.074340
97 2018 OR 12819.0 4740945.0 0.270389 0.002704
98 2018 VT 1799.0 324553.0 0.554301 0.005543
99 2018 WI 21174.0 2765223.0 0.765725 0.007657

100 rows × 6 columns

The merged data includes the average asthma prevalence rates for each state per year, along with the total population for those states and years. The percentage of the population affected by asthma in each state per year provides insight into how asthma prevalence relates to the broader population across different states and over time.
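As a quick sanity check on this normalization (a minimal sketch using the Arizona 2010 row shown above):

visits, population = 40650.0, 1600442.0   # DataValue and Total Population for AZ in 2010
print(visits / population * 100)          # ≈ 2.5399, matching the 'Asthma Prevalence (%)' column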

In [16]:
emmissions_df = pd.read_csv('table1.csv') # https://www.eia.gov/environment/emissions/state/

emmissions_df.head() #Each row is a state, each column is the year
#million metric tons of carbon dioxide equivalents (MTCO2E)
Out[16]:
State 1970 1971 1972 1973 1974 1975 1976 1977 1978 ... 2016 2017 2018 2019 2020 2021 Percent Absolute Percent.1 Absolute.1
0 Alabama 102.6 98.5 104.9 109.6 108.8 107.8 108.1 111.7 106.6 ... 114.0 108.6 112.4 106.3 98.4 108.4 5.6% 5.7 10.1% 10.0
1 Alaska 11.3 12.6 13.4 12.5 12.8 14.5 16.0 18.0 19.5 ... 33.4 33.7 34.5 34.3 36.0 38.9 242.5% 27.5 8.0% 2.9
2 Arizona 24.9 27.0 30.2 34.4 36.7 38.2 43.8 50.5 49.3 ... 90.9 90.5 94.1 92.6 80.2 83.0 233.3% 58.1 3.6% 2.9
3 Arkansas 36.2 35.1 37.2 40.8 39.1 36.4 38.9 41.6 42.4 ... 62.1 64.2 70.8 65.1 54.7 62.0 71.4% 25.8 13.3% 7.3
4 California 294.4 305.8 312.7 329.3 304.5 311.5 326.9 354.5 345.2 ... 353.4 356.5 358.6 358.3 303.8 324.0 10.1% 29.7 6.7% 20.2

5 rows × 57 columns

In [17]:
state_names = {
    'AL': 'Alabama',
    'AK': 'Alaska',
    'AZ': 'Arizona',
    'AR': 'Arkansas',
    'CA': 'California',
    'CO': 'Colorado',
    'CT': 'Connecticut',
    'DE': 'Delaware',
    'FL': 'Florida',
    'GA': 'Georgia',
    'HI': 'Hawaii',
    'ID': 'Idaho',
    'IL': 'Illinois',
    'IN': 'Indiana',
    'IA': 'Iowa',
    'KS': 'Kansas',
    'KY': 'Kentucky',
    'LA': 'Louisiana',
    'ME': 'Maine',
    'MD': 'Maryland',
    'MA': 'Massachusetts',
    'MI': 'Michigan',
    'MN': 'Minnesota',
    'MS': 'Mississippi',
    'MO': 'Missouri',
    'MT': 'Montana',
    'NE': 'Nebraska',
    'NV': 'Nevada',
    'NH': 'New Hampshire',
    'NJ': 'New Jersey',
    'NM': 'New Mexico',
    'NY': 'New York',
    'NC': 'North Carolina',
    'ND': 'North Dakota',
    'OH': 'Ohio',
    'OK': 'Oklahoma',
    'OR': 'Oregon',
    'PA': 'Pennsylvania',
    'RI': 'Rhode Island',
    'SC': 'South Carolina',
    'SD': 'South Dakota',
    'TN': 'Tennessee',
    'TX': 'Texas',
    'UT': 'Utah',
    'VT': 'Vermont',
    'VA': 'Virginia',
    'WA': 'Washington',
    'WV': 'West Virginia',
    'WI': 'Wisconsin',
    'WY': 'Wyoming'
}
In [18]:
asthma_population_merged['State Name'] = asthma_population_merged['State'].map(state_names)
asthma_population_merged
Out[18]:
YearStart LocationAbbr DataValue Year State Total Population NormalizedOccurence Asthma Prevalence (%) State Name
0 2010 AZ 40650.0 2010 AZ 1600442.0 0.025399 2.539923 Arizona
1 2010 CA 217637.0 2010 CA 8730612.0 0.024928 2.492803 California
2 2010 FL 141612.0 2010 FL 18845537.0 0.007514 0.751435 Florida
3 2010 IA 14135.0 2010 IA 2075096.0 0.006812 0.681173 Iowa
4 2010 KY 27334.0 2010 KY 4348181.0 0.006286 0.628631 Kentucky
... ... ... ... ... ... ... ... ... ...
95 2018 NJ 52173.0 2018 NJ 569816.0 0.091561 9.156114 New Jersey
96 2018 NV 14810.0 2018 NV 199220.0 0.074340 7.433993 Nevada
97 2018 OR 12819.0 2018 OR 4740945.0 0.002704 0.270389 Oregon
98 2018 VT 1799.0 2018 VT 324553.0 0.005543 0.554301 Vermont
99 2018 WI 21174.0 2018 WI 2765223.0 0.007657 0.765725 Wisconsin

100 rows × 9 columns

In [19]:
emmissions_long = emmissions_df.melt(id_vars=['State'], var_name='Year', value_name='Emissions')
emmissions_long = emmissions_long[emmissions_long['Year'].apply(lambda x: x.isnumeric())]
emmissions_long['Year'] = pd.to_numeric(emmissions_long['Year'], errors='coerce')
asthma_population_merged['YearStart'] = pd.to_numeric(asthma_population_merged['YearStart'], errors='coerce')
merged_df = pd.merge(asthma_population_merged, emmissions_long, left_on=['YearStart', 'State Name'], right_on=['Year', 'State'], how='left')
merged_df.drop(columns=['State_y'], inplace=True)
merged_df.rename(columns={'State_x': 'State'}, inplace=True)
merged_df.rename(columns={'Year_x': 'Year'}, inplace=True)
merged_df.drop(columns=['Year_y'], inplace=True)
merged_df.head()
Out[19]:
YearStart LocationAbbr DataValue Year State Total Population NormalizedOccurence Asthma Prevalence (%) State Name Emissions
0 2010 AZ 40650.0 2010 AZ 1600442.0 0.025399 2.539923 Arizona 99.5
1 2010 CA 217637.0 2010 CA 8730612.0 0.024928 2.492803 California 356.6
2 2010 FL 141612.0 2010 FL 18845537.0 0.007514 0.751435 Florida 245.5
3 2010 IA 14135.0 2010 IA 2075096.0 0.006812 0.681173 Iowa 90.3
4 2010 KY 27334.0 2010 KY 4348181.0 0.006286 0.628631 Kentucky 152.9
In [20]:
florida_asthma_data = merged_df[(merged_df['LocationAbbr'] == 'FL')]
new_jersey_asthma_data = merged_df[(merged_df['LocationAbbr'] == 'NJ')]
iowa_asthma_data = merged_df[(merged_df['LocationAbbr'] == 'IA')]

fig, axes = plt.subplots(nrows=4, ncols=1, figsize=(10, 24))

# Florida
X_fl = florida_asthma_data['YearStart'].values.reshape(-1, 1)
y_fl = florida_asthma_data['Asthma Prevalence (%)'].values

model_fl = LinearRegression().fit(X_fl, y_fl)
trendline_fl = model_fl.predict(X_fl)

axes[0].scatter(X_fl, y_fl)
axes[0].plot(X_fl, trendline_fl, color='red')
axes[0].set_title('Asthma Prevalence (%) by Year in Florida')
axes[0].set_xlabel('Year')
axes[0].set_ylabel('Asthma Prevalence (%)')
axes[0].set_ylim(bottom=0)
axes[0].grid(True)

# New Jersey
X_nj = new_jersey_asthma_data['YearStart'].values.reshape(-1, 1)
y_nj = new_jersey_asthma_data['Asthma Prevalence (%)'].values

model_nj = LinearRegression().fit(X_nj, y_nj)
trendline_nj = model_nj.predict(X_nj)

axes[1].scatter(X_nj, y_nj)
axes[1].plot(X_nj, trendline_nj, color='green')
axes[1].set_title('Asthma Prevalence (%) by Year in New Jersey')
axes[1].set_xlabel('Year')
axes[1].set_ylabel('Asthma Prevalence (%)')
axes[1].set_ylim(bottom=0)
axes[1].grid(True)

# Iowa
X_ia = iowa_asthma_data['YearStart'].values.reshape(-1, 1)
y_ia = iowa_asthma_data['Asthma Prevalence (%)'].values

model_ia = LinearRegression().fit(X_ia, y_ia)
trendline_ia = model_ia.predict(X_ia)

axes[2].scatter(X_ia, y_ia)
axes[2].plot(X_ia, trendline_ia, color='black')
axes[2].set_title('Asthma Prevalence (%) by Year in Iowa')
axes[2].set_xlabel('Year')
axes[2].set_ylabel('Asthma Prevalence (%)')
axes[2].set_ylim(bottom=0)
axes[2].grid(True)

# Combined Trends
axes[3].plot(X_fl, trendline_fl, color='red', label='Florida')
axes[3].plot(X_nj, trendline_nj, color='green', label='New Jersey')
axes[3].plot(X_ia, trendline_ia, color='black', label='Iowa')
axes[3].set_title('Combined Asthma Prevalence (%) Trends')
axes[3].set_xlabel('Year')
axes[3].set_ylabel('Asthma Prevalence (%)')
axes[3].legend()
axes[3].set_ylim(bottom=0)
axes[3].grid(True)

plt.tight_layout()
plt.show()

One confounding factor in asthma prevalence by year in each state is that a single person may visit the emergency room more than once per year; in other words, the data reflects the total number of visits per year rather than the number of unique patients.

New Jersey is a smaller state with a large population, and its CO2 emissions are over 100 million metric tons of CO2 equivalents (MTCO2E). Iowa's emissions are relatively lower.

In [21]:
fig, axes = plt.subplots(nrows=4, ncols=1, figsize=(10, 18))

# Florida
florida_emissions = merged_df[(merged_df['LocationAbbr'] == 'FL')]
X_florida = florida_emissions['YearStart'].values.reshape(-1, 1)
y_florida = florida_emissions['Emissions'].astype(float).values
model_florida = LinearRegression().fit(X_florida, y_florida)
trendline_florida = model_florida.predict(X_florida)

axes[0].scatter(X_florida, y_florida, label='CO2 Emissions (MTCO2E)')
axes[0].plot(X_florida, trendline_florida, color='red', label='Trendline')
axes[0].set_title('CO2 Emissions Over Years in Florida')
axes[0].set_xlabel('Year')
axes[0].set_ylabel('CO2 Emissions (MTCO2E)')
axes[0].set_ylim(bottom=0)
axes[0].legend()
axes[0].grid(True)

# New Jersey
new_jersey_emissions = merged_df[(merged_df['LocationAbbr'] == 'NJ')]
X_new_jersey = new_jersey_emissions['YearStart'].values.reshape(-1, 1)
y_new_jersey = new_jersey_emissions['Emissions'].astype(float).values
model_new_jersey = LinearRegression().fit(X_new_jersey, y_new_jersey)
trendline_new_jersey = model_new_jersey.predict(X_new_jersey)

axes[1].scatter(X_new_jersey, y_new_jersey, label='CO2 Emissions (MTCO2E)')
axes[1].plot(X_new_jersey, trendline_new_jersey, color='green', label='Trendline')
axes[1].set_title('CO2 Emissions Over Years in New Jersey')
axes[1].set_xlabel('Year')
axes[1].set_ylabel('CO2 Emissions (MTCO2E)')
axes[1].set_ylim(bottom=0)
axes[1].legend()
axes[1].grid(True)

# Iowa
iowa_emissions = merged_df[(merged_df['LocationAbbr'] == 'IA')]
X_iowa = iowa_emissions['YearStart'].values.reshape(-1, 1)
y_iowa = iowa_emissions['Emissions'].astype(float).values
model_iowa = LinearRegression().fit(X_iowa, y_iowa)
trendline_iowa = model_iowa.predict(X_iowa)

axes[2].scatter(X_iowa, y_iowa, label='CO2 Emissions (MTCO2E)')
axes[2].plot(X_iowa, trendline_iowa, color='blue', label='Trendline')
axes[2].set_title('CO2 Emissions Over Years in Iowa')
axes[2].set_xlabel('Year')
axes[2].set_ylabel('CO2 Emissions (MTCO2E)')
axes[2].set_ylim(bottom=0)
axes[2].legend()
axes[2].grid(True)

# Combined Trends
axes[3].plot(X_fl, trendline_florida, color='red', label='Florida')
axes[3].plot(X_nj, trendline_new_jersey, color='green', label='New Jersey')
axes[3].plot(X_ia, trendline_iowa, color='black', label='Iowa')
axes[3].set_title('CO2 Emissions Trends')
axes[3].set_xlabel('Year')
axes[3].set_ylabel('CO2 Emissions')
axes[3].set_ylim(bottom=0)
axes[3].legend()
axes[3].grid(True)

plt.tight_layout()
plt.show()

Although New Jersey is smaller than Iowa, it produces roughly 25% more CO2, which may contribute to its higher asthma prevalence compared to Iowa.

Florida has the highest CO2 emissions of the three states plotted and a population of 22.24 million. New Jersey emits around 110 MTCO2E and has a population of 9.26 million.

For Iowa, the population is 3.2 million and the CO2 emissions are around 90 MTCO2E.

Confounding variables might include age, sex, and population.

In [22]:
merged_df[['LocationAbbr', 'YearStart', 'NormalizedOccurence', 'Emissions']].head()
# NormalizedOccurence = number of occurrences / total population
# NormalizedOccurence is the decimal value
# asthma % = NormalizedOccurence * 100
Out[22]:
LocationAbbr YearStart NormalizedOccurence Emissions
0 AZ 2010 0.025399 99.5
1 CA 2010 0.024928 356.6
2 FL 2010 0.007514 245.5
3 IA 2010 0.006812 90.3
4 KY 2010 0.006286 152.9
In [23]:
plt.figure(figsize=(12, 6))

data_2010 = merged_df[merged_df['YearStart'] == 2010]

bar_width = 0.35
index = np.arange(len(data_2010['LocationAbbr']))

bars1 = plt.bar(index, data_2010['Asthma Prevalence (%)'].astype(float), bar_width, label='Asthma Prevalence (%)')
bars2 = plt.bar(index + bar_width, data_2010['Emissions'].astype(float), bar_width, label='Emissions (MTCO2E)')

plt.xlabel('State')
plt.ylabel('Values')
plt.title('Asthma Prevalence (%) vs Emissions by State for 2010')
plt.xticks(index + bar_width / 2, data_2010['LocationAbbr'])
plt.legend()
plt.tight_layout()
plt.show()

This categorical chart shows "Asthma Prevalence (%)" and "Emissions" for different states, specifically for the year 2010. California has the highest emissions, whereas Nebraska (NE) has the lowest.

New Jersey has the highest asthma prevalence, whereas Massachusetts has the lowest.

We want to follow up on potential confounding variables such as sex to see their relationship with asthma-related emergency visits.

For emissions, confounding variables include population.

We should explore the effects of these confounding variables using the population dataset (see the sketch below). The CDC dataset also separates asthma visits by male and female, so we can explore the relationship between sex and asthma.
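As a sketch of how the population confounder could be handled on the emissions side, the merged table already has the needed columns; the 'Emissions per Capita (tons)' column name is hypothetical and this computation is not performed elsewhere in the notebook. Note that table1.csv reports million metric tons of CO2 equivalents, so we scale by 1e6 to get metric tons per person:

# hypothetical per-capita emissions: 'Emissions' is in million metric tons of CO2e (MTCO2E)
merged_df['Emissions per Capita (tons)'] = merged_df['Emissions'].astype(float) * 1e6 / merged_df['Total Population']
merged_df[['YearStart', 'LocationAbbr', 'Emissions per Capita (tons)']].head()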


In [24]:
asthma_df = pd.read_csv('U.S._Chronic_Disease_Indicators__CDI___2023_Release_20240417.csv', low_memory=False)

asthma_data = asthma_df[asthma_df['Topic'] == 'Asthma']
asthma_data = asthma_data[asthma_data['Question'] == 'Emergency department visit rate for asthma']
asthma_data = asthma_data[asthma_data['StratificationCategory1'] == 'Gender']
asthma_data['DataValue'] = pd.to_numeric(asthma_data['DataValue'], errors='coerce')
asthma_data.dropna(subset=['DataValue'], inplace=True)
asthma_state_yearly = asthma_data.groupby(['YearStart', 'LocationAbbr', 'Stratification1'])['DataValue'].mean().reset_index()

df_population = pd.read_csv('Population by Age and Sex - EDDs.csv')
df_population_summary = df_population.groupby(['Year', 'State'])['Total Population'].sum().reset_index()
asthma_population_merged = pd.merge(asthma_state_yearly, df_population_summary, left_on=['YearStart', 'LocationAbbr'], right_on=['Year', 'State'], how='inner')
state_names = {
    'AL': 'Alabama',
    'AK': 'Alaska',
    'AZ': 'Arizona',
    'AR': 'Arkansas',
    'CA': 'California',
    'CO': 'Colorado',
    'CT': 'Connecticut',
    'DE': 'Delaware',
    'FL': 'Florida',
    'GA': 'Georgia',
    'HI': 'Hawaii',
    'ID': 'Idaho',
    'IL': 'Illinois',
    'IN': 'Indiana',
    'IA': 'Iowa',
    'KS': 'Kansas',
    'KY': 'Kentucky',
    'LA': 'Louisiana',
    'ME': 'Maine',
    'MD': 'Maryland',
    'MA': 'Massachusetts',
    'MI': 'Michigan',
    'MN': 'Minnesota',
    'MS': 'Mississippi',
    'MO': 'Missouri',
    'MT': 'Montana',
    'NE': 'Nebraska',
    'NV': 'Nevada',
    'NH': 'New Hampshire',
    'NJ': 'New Jersey',
    'NM': 'New Mexico',
    'NY': 'New York',
    'NC': 'North Carolina',
    'ND': 'North Dakota',
    'OH': 'Ohio',
    'OK': 'Oklahoma',
    'OR': 'Oregon',
    'PA': 'Pennsylvania',
    'RI': 'Rhode Island',
    'SC': 'South Carolina',
    'SD': 'South Dakota',
    'TN': 'Tennessee',
    'TX': 'Texas',
    'UT': 'Utah',
    'VT': 'Vermont',
    'VA': 'Virginia',
    'WA': 'Washington',
    'WV': 'West Virginia',
    'WI': 'Wisconsin',
    'WY': 'Wyoming'
}
asthma_population_merged['State Name'] = asthma_population_merged['State'].map(state_names)

df_male_population_summary = df_population.groupby(['Year', 'State'])['Male Population'].sum().reset_index()
df_female_population_summary = df_population.groupby(['Year', 'State'])['Female Population'].sum().reset_index()
print(df_male_population_summary, df_female_population_summary)
asthma_population_merged = pd.merge(asthma_population_merged, df_male_population_summary, left_on=['YearStart', 'LocationAbbr'], right_on=['Year', 'State'], how='left', suffixes=('', '_male_pop'))
asthma_population_merged = pd.merge(asthma_population_merged, df_female_population_summary, left_on=['YearStart', 'LocationAbbr'], right_on=['Year', 'State'], how='left', suffixes=('', '_female_pop'))

emmissions_df = pd.read_csv('table1.csv')
emmissions_long = emmissions_df.melt(id_vars=['State'], var_name='Year', value_name='Emissions')
emmissions_long = emmissions_long[emmissions_long['Year'].apply(lambda x: x.isnumeric())]
emmissions_long['Year'] = pd.to_numeric(emmissions_long['Year'], errors='coerce')

merged_df = pd.merge(asthma_population_merged, emmissions_long, left_on=['YearStart', 'State Name'], right_on=['Year', 'State'], how='left')
merged_df.drop(columns=['State_y'], inplace=True)
merged_df.rename(columns={'State_x': 'State'}, inplace=True)

merged_df.head()
     Year State  Male Population
0    2000    AK          76955.0
1    2000    AL        2175150.0
2    2000    AR        1307573.0
3    2000    AZ         614849.0
4    2000    CA        4063605.0
..    ...   ...              ...
915  2019    VA         403981.0
916  2019    VT         160948.0
917  2019    WA        2811851.0
918  2019    WI        1393637.0
919  2019    WV         889160.0

[920 rows x 3 columns]      Year State  Female Population
0    2000    AK            69689.0
1    2000    AL          2335216.0
2    2000    AR          1371015.0
3    2000    AZ           605521.0
4    2000    CA          4099641.0
..    ...   ...                ...
915  2019    VA           412600.0
916  2019    VT           162963.0
917  2019    WA          2795301.0
918  2019    WI          1378458.0
919  2019    WV           906103.0

[920 rows x 3 columns]
Out[24]:
YearStart LocationAbbr Stratification1 DataValue Year_x State Total Population State Name Year_male_pop State_male_pop Male Population Year_female_pop State_female_pop Female Population Year_y Emissions
0 2010 AZ Female 7473.116667 2010 AZ 1600442.0 Arizona 2010 AZ 809747.0 2010 AZ 790763.0 2010 99.5
1 2010 AZ Male 6150.000000 2010 AZ 1600442.0 Arizona 2010 AZ 809747.0 2010 AZ 790763.0 2010 99.5
2 2010 CA Female 38898.420000 2010 CA 8730612.0 California 2010 CA 4327961.0 2010 CA 4402785.0 2010 356.6
3 2010 CA Male 32368.006667 2010 CA 8730612.0 California 2010 CA 4327961.0 2010 CA 4402785.0 2010 356.6
4 2010 FL Female 27106.200000 2010 FL 18845537.0 Florida 2010 FL 9212261.0 2010 FL 9633882.0 2010 245.5
In [25]:
florida_er_visits = merged_df[(merged_df['LocationAbbr'] == 'FL')][['YearStart', 'DataValue', 'Stratification1', 'Male Population', 'Female Population']].dropna()
iowa_er_visits = merged_df[(merged_df['LocationAbbr'] == 'IA')][['YearStart', 'DataValue', 'Stratification1', 'Male Population', 'Female Population']].dropna()
new_jersey_er_visits = merged_df[(merged_df['LocationAbbr'] == 'NJ')][['YearStart', 'DataValue', 'Stratification1', 'Male Population', 'Female Population']].dropna()

florida_er_visits['Visits per 10,000'] = florida_er_visits.apply(lambda x: (x['DataValue'] / x['Male Population'] * 10000) if x['Stratification1'] == 'Male' else (x['DataValue'] / x['Female Population'] * 10000), axis=1)
iowa_er_visits['Visits per 10,000'] = iowa_er_visits.apply(lambda x: (x['DataValue'] / x['Male Population'] * 10000) if x['Stratification1'] == 'Male' else (x['DataValue'] / x['Female Population'] * 10000), axis=1)
new_jersey_er_visits['Visits per 10,000'] = new_jersey_er_visits.apply(lambda x: (x['DataValue'] / x['Male Population'] * 10000) if x['Stratification1'] == 'Male' else (x['DataValue'] / x['Female Population'] * 10000), axis=1)

total_visits_per_10000_florida = florida_er_visits.groupby(['YearStart', 'Stratification1'])['Visits per 10,000'].sum().reset_index()
total_visits_per_10000_iowa = iowa_er_visits.groupby(['YearStart', 'Stratification1'])['Visits per 10,000'].sum().reset_index()
total_visits_per_10000_new_jersey = new_jersey_er_visits.groupby(['YearStart', 'Stratification1'])['Visits per 10,000'].sum().reset_index()

(total_visits_per_10000_florida, total_visits_per_10000_iowa, total_visits_per_10000_new_jersey)
Out[25]:
(    YearStart Stratification1  Visits per 10,000
 0        2010          Female          28.136321
 1        2010            Male          21.920341
 2        2013          Female          23.307932
 3        2013            Male          19.730475
 4        2014          Female          24.128772
 5        2014            Male          20.049779
 6        2015          Female          17.268291
 7        2015            Male          14.077174
 8        2016          Female          20.140228
 9        2016            Male          17.712882
 10       2017          Female          19.240536
 11       2017            Male          16.687892
 12       2018          Female          19.346165
 13       2018            Male          16.858536,
     YearStart Stratification1  Visits per 10,000
 0        2010          Female          25.532069
 1        2010            Male          20.437976
 2        2013          Female          19.253279
 3        2013            Male          16.662763
 4        2014          Female          20.461681
 5        2014            Male          17.900915
 6        2015          Female          15.734273
 7        2015            Male          12.679087
 8        2016          Female          17.134587
 9        2016            Male          14.657145
 10       2017          Female          16.417481
 11       2017            Male          14.772427
 12       2018          Female          16.205231
 13       2018            Male          13.870933,
     YearStart Stratification1  Visits per 10,000
 0        2010          Female         517.049395
 1        2010            Male         399.737513
 2        2013          Female         404.753877
 3        2013            Male         345.234332
 4        2014          Female         409.523794
 5        2014            Male         343.297010
 6        2015          Female         292.801786
 7        2015            Male         247.006551
 8        2016          Female         359.963605
 9        2016            Male         310.604665
 10       2017          Female         345.041857
 11       2017            Male         287.855561
 12       2018          Female         328.137771
 13       2018            Male         284.557950)
In [26]:
fig, axes = plt.subplots(3, 1, figsize=(12, 18))

# Florida
axes[0].plot(total_visits_per_10000_florida[total_visits_per_10000_florida['Stratification1'] == 'Male']['YearStart'], 
         total_visits_per_10000_florida[total_visits_per_10000_florida['Stratification1'] == 'Male']['Visits per 10,000'], 
         label='Male', marker='o', color='blue')
axes[0].plot(total_visits_per_10000_florida[total_visits_per_10000_florida['Stratification1'] == 'Female']['YearStart'], 
         total_visits_per_10000_florida[total_visits_per_10000_florida['Stratification1'] == 'Female']['Visits per 10,000'], 
         label='Female', marker='o', color='red')
# Combine Male + Female for Florida
axes[0].plot(total_visits_per_10000_florida['YearStart'].unique(), 
         total_visits_per_10000_florida.groupby('YearStart')['Visits per 10,000'].sum(), 
         label='Total', marker='o', color='purple')
axes[0].set_title('Emergency Room Visits for Asthma per 10,000 People in Florida by Gender')
axes[0].set_xlabel('Year')
axes[0].set_ylabel('Visits per 10,000 People')
axes[0].legend()
axes[0].grid(True)
axes[0].set_ylim(bottom=0)

# Iowa
axes[1].plot(total_visits_per_10000_iowa[total_visits_per_10000_iowa['Stratification1'] == 'Male']['YearStart'], 
         total_visits_per_10000_iowa[total_visits_per_10000_iowa['Stratification1'] == 'Male']['Visits per 10,000'], 
         label='Male', marker='o', color='blue')
axes[1].plot(total_visits_per_10000_iowa[total_visits_per_10000_iowa['Stratification1'] == 'Female']['YearStart'], 
         total_visits_per_10000_iowa[total_visits_per_10000_iowa['Stratification1'] == 'Female']['Visits per 10,000'], 
         label='Female', marker='o', color='red')
# Combine Male + Female for Iowa
axes[1].plot(total_visits_per_10000_iowa['YearStart'].unique(), 
         total_visits_per_10000_iowa.groupby('YearStart')['Visits per 10,000'].sum(), 
         label='Total', marker='o', color='purple')
axes[1].set_title('Emergency Room Visits for Asthma per 10,000 People in Iowa by Gender')
axes[1].set_xlabel('Year')
axes[1].set_ylabel('Visits per 10,000 People')
axes[1].legend()
axes[1].grid(True)
axes[1].set_ylim(bottom=0)

# New Jersey
axes[2].plot(total_visits_per_10000_new_jersey[total_visits_per_10000_new_jersey['Stratification1'] == 'Male']['YearStart'], 
         total_visits_per_10000_new_jersey[total_visits_per_10000_new_jersey['Stratification1'] == 'Male']['Visits per 10,000'], 
         label='Male', marker='o', color='blue')
axes[2].plot(total_visits_per_10000_new_jersey[total_visits_per_10000_new_jersey['Stratification1'] == 'Female']['YearStart'], 
         total_visits_per_10000_new_jersey[total_visits_per_10000_new_jersey['Stratification1'] == 'Female']['Visits per 10,000'], 
         label='Female', marker='o', color='red')
# Combine Male + Female for New Jersey
axes[2].plot(total_visits_per_10000_new_jersey['YearStart'].unique(), 
         total_visits_per_10000_new_jersey.groupby('YearStart')['Visits per 10,000'].sum(), 
         label='Total', marker='o', color='purple')
axes[2].set_title('Emergency Room Visits for Asthma per 10,000 People in New Jersey by Gender')
axes[2].set_xlabel('Year')
axes[2].set_ylabel('Visits per 10,000 People')
axes[2].legend()
axes[2].grid(True)
axes[2].set_ylim(bottom=0)

plt.tight_layout()
plt.show()

Based on the graphs for each state, per 10,000 people the number of females admitted to the emergency room for asthma-related issues is always higher than the number of males. Although New Jersey's asthma prevalence (%) is vastly larger than Florida's and Iowa's, gender does not appear to play a large role, as the male/female trends are roughly the same across all three states.

Race and Emergency Department Visits

In [27]:
asthma_df = pd.read_csv('U.S._Chronic_Disease_Indicators__CDI___2023_Release_20240417.csv', low_memory=False)

asthma_data = asthma_df[asthma_df['Topic'] == 'Asthma']
asthma_data = asthma_data[asthma_data['Question'] == 'Emergency department visit rate for asthma']
asthma_data = asthma_data[asthma_data['StratificationCategory1'] == 'Race/Ethnicity']
asthma_data['DataValue'] = pd.to_numeric(asthma_data['DataValue'], errors='coerce')
asthma_data.dropna(subset=['DataValue'], inplace=True)
asthma_state_yearly = asthma_data.groupby(['YearStart', 'LocationAbbr', 'Stratification1'])['DataValue'].mean().reset_index()

df_population = pd.read_csv('Population by Race - EDDs.csv')
df_population_total = df_population.groupby(['Year', 'State'])['Total Population'].sum().reset_index()
df_population_white = df_population.groupby(['Year', 'State'])['White Alone'].sum().reset_index()
df_population_black = df_population.groupby(['Year', 'State'])['Black Alone'].sum().reset_index()
df_population_american_indian = df_population.groupby(['Year', 'State'])['American Indian or Alaskan Native'].sum().reset_index()
df_population_hawaiian = df_population.groupby(['Year', 'State'])['Hawaiian or Pacific Islander Alone'].sum().reset_index()
df_population_two = df_population.groupby(['Year', 'State'])['Two or More Races'].sum().reset_index()
df_population_not_hispanic = df_population.groupby(['Year', 'State'])['Not Hispanic'].sum().reset_index()
df_population_hispanic = df_population.groupby(['Year', 'State'])['Hispanic'].sum().reset_index()

asthma_population_merged = pd.merge(asthma_state_yearly, df_population_total, left_on=['YearStart', 'LocationAbbr'], right_on=['Year', 'State'], how='inner')
merges = [df_population_white, df_population_black, df_population_american_indian, df_population_hawaiian, df_population_two, df_population_not_hispanic, df_population_hispanic]
for merge_df in merges:
    asthma_population_merged = pd.merge(asthma_population_merged, merge_df, on=['Year', 'State'], how='inner')
asthma_population_merged
Out[27]:
YearStart LocationAbbr Stratification1 DataValue Year State Total Population White Alone Black Alone American Indian or Alaskan Native Hawaiian or Pacific Islander Alone Two or More Races Not Hispanic Hispanic
0 2013 AZ American Indian or Alaska Native 557.366667 2013 AZ 1617914.0 1313929.0 38650.0 202591.0 4009.0 36760.0 1169683.0 448231.0
1 2013 AZ Asian or Pacific Islander 128.233333 2013 AZ 1617914.0 1313929.0 38650.0 202591.0 4009.0 36760.0 1169683.0 448231.0
2 2013 AZ Black, non-Hispanic 1588.833333 2013 AZ 1617914.0 1313929.0 38650.0 202591.0 4009.0 36760.0 1169683.0 448231.0
3 2013 AZ Hispanic 3657.833333 2013 AZ 1617914.0 1313929.0 38650.0 202591.0 4009.0 36760.0 1169683.0 448231.0
4 2013 AZ White, non-Hispanic 4860.633333 2013 AZ 1617914.0 1313929.0 38650.0 202591.0 4009.0 36760.0 1169683.0 448231.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
360 2018 WI American Indian or Alaska Native 141.313333 2018 WI 2765223.0 2566205.0 39437.0 47612.0 1672.0 45678.0 2658039.0 107184.0
361 2018 WI Asian or Pacific Islander 102.496667 2018 WI 2765223.0 2566205.0 39437.0 47612.0 1672.0 45678.0 2658039.0 107184.0
362 2018 WI Black, non-Hispanic 2706.553333 2018 WI 2765223.0 2566205.0 39437.0 47612.0 1672.0 45678.0 2658039.0 107184.0
363 2018 WI Hispanic 722.306667 2018 WI 2765223.0 2566205.0 39437.0 47612.0 1672.0 45678.0 2658039.0 107184.0
364 2018 WI White, non-Hispanic 3391.546667 2018 WI 2765223.0 2566205.0 39437.0 47612.0 1672.0 45678.0 2658039.0 107184.0

365 rows × 14 columns

In [28]:
asthma_population_merged['Race'] = asthma_population_merged['Stratification1'].apply(lambda x: x.split(',')[0])

race_mapping = {
    'American Indian or Alaska Native': 'American Indian or Alaskan Native',
    'Asian or Pacific Islander': 'Hawaiian or Pacific Islander Alone',
    'Black, non-Hispanic': 'Black Alone',
    'Hispanic': 'Hispanic',
    'White, non-Hispanic': 'White Alone',
    "Black": "Black Alone",
    "White": "White Alone"
}

asthma_population_merged['Clean Race'] = asthma_population_merged['Race'].map(race_mapping)

def get_race_population(row):
    race_column = row['Clean Race']
    if race_column in row:
        return row[race_column]
    else:
        return None

asthma_population_merged['Clean Race Population'] = asthma_population_merged.apply(get_race_population, axis=1)
asthma_population_merged['Visits per 10,000 based on Race'] = (asthma_population_merged['DataValue'] / asthma_population_merged['Clean Race Population']) * 10000
asthma_population_merged.head()
Out[28]:
YearStart LocationAbbr Stratification1 DataValue Year State Total Population White Alone Black Alone American Indian or Alaskan Native Hawaiian or Pacific Islander Alone Two or More Races Not Hispanic Hispanic Race Clean Race Clean Race Population Visits per 10,000 based on Race
0 2013 AZ American Indian or Alaska Native 557.366667 2013 AZ 1617914.0 1313929.0 38650.0 202591.0 4009.0 36760.0 1169683.0 448231.0 American Indian or Alaska Native American Indian or Alaskan Native 202591.0 27.511916
1 2013 AZ Asian or Pacific Islander 128.233333 2013 AZ 1617914.0 1313929.0 38650.0 202591.0 4009.0 36760.0 1169683.0 448231.0 Asian or Pacific Islander Hawaiian or Pacific Islander Alone 4009.0 319.863640
2 2013 AZ Black, non-Hispanic 1588.833333 2013 AZ 1617914.0 1313929.0 38650.0 202591.0 4009.0 36760.0 1169683.0 448231.0 Black Black Alone 38650.0 411.082363
3 2013 AZ Hispanic 3657.833333 2013 AZ 1617914.0 1313929.0 38650.0 202591.0 4009.0 36760.0 1169683.0 448231.0 Hispanic Hispanic 448231.0 81.605987
4 2013 AZ White, non-Hispanic 4860.633333 2013 AZ 1617914.0 1313929.0 38650.0 202591.0 4009.0 36760.0 1169683.0 448231.0 White White Alone 1313929.0 36.993120
In [29]:
states = ['FL', 'IA', 'NJ']
fig, axes = plt.subplots(nrows=3, ncols=1, figsize=(10, 20))

asthma_population_merged['StateIndex'] = asthma_population_merged['LocationAbbr'].apply(lambda x: states.index(x) if x in states else None)
filtered_data = asthma_population_merged.dropna(subset=['StateIndex'])

sns.lineplot(ax=axes[0], data=filtered_data[filtered_data['LocationAbbr'] == 'FL'], x='YearStart', y='Visits per 10,000 based on Race', hue='Clean Race', legend='full')
axes[0].set_title('ER Visits per 10,000 based on Race - FL')
axes[0].set_xlabel('Year')
axes[0].set_ylabel('Visits per 10,000')

sns.lineplot(ax=axes[1], data=filtered_data[filtered_data['LocationAbbr'] == 'IA'], x='YearStart', y='Visits per 10,000 based on Race', hue='Clean Race', legend='full')
axes[1].set_title('ER Visits per 10,000 based on Race - IA')
axes[1].set_xlabel('Year')
axes[1].set_ylabel('Visits per 10,000')

sns.lineplot(ax=axes[2], data=filtered_data[filtered_data['LocationAbbr'] == 'NJ'], x='YearStart', y='Visits per 10,000 based on Race', hue='Clean Race', legend='full')
axes[2].set_title('ER Visits per 10,000 based on Race - NJ')
axes[2].set_xlabel('Year')
axes[2].set_ylabel('Visits per 10,000')

plt.tight_layout()
plt.show()

We graph the emergency room visits per 10,000 based on race and see that Hawaiian or Pacific Islander Alone has the highest number of visits per 10,000. This is likely inflated because the CDC's "Asian or Pacific Islander" visit counts are divided only by the "Hawaiian or Pacific Islander Alone" population (the Asian Alone population is not part of the denominator), which may also explain apparent outliers in how the statistics are collected. People identifying as Black have the second highest number of asthma-related visits per 10,000.


The causal question we aim to investigate is: "Does exposure to increased levels of CO2 cause an increase in the number of asthma-related emergency room visits?"

In [30]:
# high level or low level of CO2 is the treatment, a binary variable [0, 1]

# Create a DAG
G = nx.DiGraph()

G.add_node("Population", pos=(0.3, 0.7))
G.add_node("CO2", pos=(1, 0))
G.add_node("Asthma", pos=(1.7, 0.7))
G.add_node("Number of power plants", pos=(0.5, 0))

G.add_edges_from([("Population", "CO2"), ("Population", "Asthma"), ("CO2", "Asthma"), ('Number of power plants', 'CO2')])

pos = nx.get_node_attributes(G, 'pos')

plt.figure(figsize=(12, 6))
try:
    nx.draw_networkx(G, pos, with_labels=True, node_size=5000, node_color='skyblue', font_size=15, arrows=True)
except:
    nx.draw(G, pos, with_labels=True, node_size=5000, node_color='skyblue', font_size=15, arrows=True)
plt.title('Causal DAG: Population as a Confounding Variable')
plt.show()
In [31]:
#Load the asthma dataframe
asthma_df = pd.read_csv('U.S._Chronic_Disease_Indicators__CDI___2023_Release_20240417.csv', low_memory=False)
#Group the asthma dataframe by Year and State
asthma_data = asthma_df[asthma_df['Topic'] == 'Asthma']
#Filter out rows that match our research question related to emergency department visit rates for asthma
asthma_data = asthma_data[asthma_data['Question'] == 'Emergency department visit rate for asthma']
asthma_data = asthma_data[asthma_data['DataValueType'] == 'Number']
asthma_data = asthma_data[asthma_data['StratificationCategory1'] == 'Overall']
asthma_data['DataValue'] = pd.to_numeric(asthma_data['DataValue'], errors='coerce')
asthma_data.dropna(subset=['DataValue'], inplace=True)
asthma_state_yearly = asthma_data.groupby(['YearStart', 'LocationAbbr'])['DataValue'].mean().reset_index()
#Group the asthma dataset with the Population Dataset because Population is a confounding variable
# We grouped by population
df_population = pd.read_csv('Population by Age and Sex - EDDs.csv')
df_population_summary = df_population.groupby(['Year', 'State'])['Total Population'].sum().reset_index()
asthma_population_merged = pd.merge(asthma_state_yearly, df_population_summary, left_on=['YearStart', 'LocationAbbr'], right_on=['Year', 'State'], how='inner')
state_names = {
    'AL': 'Alabama',
    'AK': 'Alaska',
    'AZ': 'Arizona',
    'AR': 'Arkansas',
    'CA': 'California',
    'CO': 'Colorado',
    'CT': 'Connecticut',
    'DE': 'Delaware',
    'FL': 'Florida',
    'GA': 'Georgia',
    'HI': 'Hawaii',
    'ID': 'Idaho',
    'IL': 'Illinois',
    'IN': 'Indiana',
    'IA': 'Iowa',
    'KS': 'Kansas',
    'KY': 'Kentucky',
    'LA': 'Louisiana',
    'ME': 'Maine',
    'MD': 'Maryland',
    'MA': 'Massachusetts',
    'MI': 'Michigan',
    'MN': 'Minnesota',
    'MS': 'Mississippi',
    'MO': 'Missouri',
    'MT': 'Montana',
    'NE': 'Nebraska',
    'NV': 'Nevada',
    'NH': 'New Hampshire',
    'NJ': 'New Jersey',
    'NM': 'New Mexico',
    'NY': 'New York',
    'NC': 'North Carolina',
    'ND': 'North Dakota',
    'OH': 'Ohio',
    'OK': 'Oklahoma',
    'OR': 'Oregon',
    'PA': 'Pennsylvania',
    'RI': 'Rhode Island',
    'SC': 'South Carolina',
    'SD': 'South Dakota',
    'TN': 'Tennessee',
    'TX': 'Texas',
    'UT': 'Utah',
    'VT': 'Vermont',
    'VA': 'Virginia',
    'WA': 'Washington',
    'WV': 'West Virginia',
    'WI': 'Wisconsin',
    'WY': 'Wyoming'
}
asthma_population_merged['State Name'] = asthma_population_merged['State'].map(state_names)


emmissions_df = pd.read_csv('table1.csv')
emmissions_long = emmissions_df.melt(id_vars=['State'], var_name='Year', value_name='Emissions')
emmissions_long = emmissions_long[emmissions_long['Year'].apply(lambda x: x.isnumeric())]
emmissions_long['Year'] = pd.to_numeric(emmissions_long['Year'], errors='coerce')

merged_df = pd.merge(asthma_population_merged, emmissions_long, left_on=['YearStart', 'State Name'], right_on=['Year', 'State'], how='left')
merged_df.drop(columns=['State_y'], inplace=True)
merged_df.rename(columns={'State_x': 'State'}, inplace=True)
merged_df['Visits per 10,000'] = ((merged_df['DataValue'] / merged_df['Total Population']) * 10000).astype(int)
merged_df['Total Population'] = merged_df['Total Population'].astype(int)
merged_df['Emissions'] = merged_df['Emissions'].astype(float)

merged_df
Out[31]:
YearStart LocationAbbr DataValue Year_x State Total Population State Name Year_y Emissions Visits per 10,000
0 2010 AZ 40650.0 2010 AZ 1600442 Arizona 2010 99.5 253
1 2010 CA 217637.0 2010 CA 8730612 California 2010 356.6 249
2 2010 FL 141612.0 2010 FL 18845537 Florida 2010 245.5 75
3 2010 IA 14135.0 2010 IA 2075096 Iowa 2010 90.3 68
4 2010 KY 27334.0 2010 KY 4348181 Kentucky 2010 152.9 62
... ... ... ... ... ... ... ... ... ... ...
95 2018 NJ 52173.0 2018 NJ 569816 New Jersey 2018 105.1 915
96 2018 NV 14810.0 2018 NV 199220 Nevada 2018 41.4 743
97 2018 OR 12819.0 2018 OR 4740945 Oregon 2018 39.6 27
98 2018 VT 1799.0 2018 VT 324553 Vermont 2018 5.9 55
99 2018 WI 21174.0 2018 WI 2765223 Wisconsin 2018 101.2 76

100 rows × 10 columns

In [32]:
# The states in our dataset with Asthma related information for emergency visits to the hospital
asthma_population_merged["State Name"].unique()
Out[32]:
array(['Arizona', 'California', 'Florida', 'Iowa', 'Kentucky',
       'Massachusetts', 'Maryland', 'North Carolina', 'Nebraska',
       'New Jersey', 'New York', 'South Carolina', 'Utah', 'Wisconsin',
       'Nevada', 'Vermont', 'Arkansas', 'Georgia', 'Minnesota', 'Oregon',
       'Colorado'], dtype=object)

We take the 60th percentile of the "Emissions" column as a threshold: observations above it are labeled as having high CO2 emissions, and observations at or below it as having low CO2 emissions. This gives us a binary treatment variable for causal inference.

In [33]:
threshold = merged_df['Emissions'].quantile(0.60)
print(threshold)
merged_df['High Emissions'] = merged_df['Emissions'].apply(lambda x: 1 if x > threshold else 0)
merged_df
99.34
Out[33]:
YearStart LocationAbbr DataValue Year_x State Total Population State Name Year_y Emissions Visits per 10,000 High Emissions
0 2010 AZ 40650.0 2010 AZ 1600442 Arizona 2010 99.5 253 1
1 2010 CA 217637.0 2010 CA 8730612 California 2010 356.6 249 1
2 2010 FL 141612.0 2010 FL 18845537 Florida 2010 245.5 75 1
3 2010 IA 14135.0 2010 IA 2075096 Iowa 2010 90.3 68 0
4 2010 KY 27334.0 2010 KY 4348181 Kentucky 2010 152.9 62 1
... ... ... ... ... ... ... ... ... ... ... ...
95 2018 NJ 52173.0 2018 NJ 569816 New Jersey 2018 105.1 915 1
96 2018 NV 14810.0 2018 NV 199220 Nevada 2018 41.4 743 0
97 2018 OR 12819.0 2018 OR 4740945 Oregon 2018 39.6 27 0
98 2018 VT 1799.0 2018 VT 324553 Vermont 2018 5.9 55 0
99 2018 WI 21174.0 2018 WI 2765223 Wisconsin 2018 101.2 76 1

100 rows × 11 columns

In [34]:
# Power plant dataframe
# The year is 2022 for all entries
# We use the number of power plants as an instrumental variable
PLNT22df = pd.read_csv('PLNT22.csv')
PLNT22df.head()
C:\Users\tranh\AppData\Local\Temp\ipykernel_2596\2105320888.py:4: DtypeWarning: Columns (0,1,4,6,8,16,17,19,20,22,23,27,29,33,34,49,51,52,54,55,57,58,59,60,61,62,63,64,65,67,68,70,71,73,80,94,95,96,99) have mixed types. Specify dtype option on import or set low_memory=False.
  PLNT22df = pd.read_csv('PLNT22.csv')
Out[34]:
Plant file sequence number Data Year Plant state abbreviation Plant name DOE/EIA ORIS plant or facility code Plant transmission or distribution system owner name Plant transmission or distribution system owner ID Utility name Utility ID Plant-level sector ... Plant wind generation percent (resource mix) Plant solar generation percent (resource mix) Plant geothermal generation percent (resource mix) Plant other fossil generation percent (resource mix) Plant other unknown / purchased fuel generation percent (resource mix) Plant total nonrenewables generation percent (resource mix) Plant total renewables generation percent (resource mix) Plant total nonhydro renewables generation percent (resource mix) Plant total combustion generation percent (resource mix) Plant total noncombustion generation percent (resource mix)
0 SEQPLT22 YEAR PSTATABB PNAME ORISPL OPRNAME OPRCODE UTLSRVNM UTLSRVID SECTOR ... PLWIPR PLSOPR PLGTPR PLOFPR PLOPPR PLTNPR PLTRPR PLTHPR PLCYPR PLCNPR
1 1 2022 AK Agrium Kenai Nitrogen Operations 54452 Homer Electric Assn Inc 19558 Agrium US Inc 179 Industrial CHP ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 2 2022 AK Alakanuk 57053 Alaska Village Elec Coop, Inc 221 Alaska Village Elec Coop, Inc 221 Electric Utility ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 3 2022 AK Allison Creek Hydro 58982 Copper Valley Elec Assn, Inc 4329 Copper Valley Elec Assn, Inc 4329 Electric Utility ... 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 100.0% 0.0% 0.0% 100.0%
4 4 2022 AK Ambler 60243 Alaska Village Elec Coop, Inc 221 Alaska Village Elec Coop, Inc 221 Electric Utility ... 0.0% 0.0% 0.0% 0.0% 0.0% 100.0% 0.0% 0.0% 100.0% 0.0%

5 rows × 141 columns

In [35]:
state_counts = PLNT22df['Plant state abbreviation'].value_counts()
print(state_counts)
data = {
    'State': ['CA', 'NC', 'NY', 'TX', 'MN', 'MA', 'NJ', 'IL', 'IA', 'MI', 'OR', 'CO', 'FL', 'PA', 'GA', 
              'WI', 'SC', 'IN', 'VA', 'OH', 'MD', 'KS', 'CT', 'AK', 'AZ', 'WA', 'ID', 'NM', 'ME', 'MO', 
              'OK', 'NE', 'VT', 'UT', 'NV', 'RI', 'LA', 'TN', 'AL', 'AR', 'HI', 'WY', 'MT', 'NH', 'ND', 
              'PR', 'KY', 'SD', 'MS', 'WV', 'DE', 'DC'],
    'Count': [1678, 875, 821, 754, 724, 600, 368, 328, 283, 266, 265, 265, 242, 242, 235, 204, 197, 194, 
              193, 183, 171, 161, 155, 151, 146, 142, 140, 135, 133, 130, 129, 126, 113, 113, 102, 87, 
              82, 82, 75, 73, 68, 67, 65, 61, 60, 60, 52, 50, 47, 36, 31, 13]
}

df = pd.DataFrame(data)
df.head()
CA          1678
NC           875
NY           821
TX           754
MN           724
MA           600
NJ           368
IL           328
IA           283
MI           266
CO           265
OR           265
PA           242
FL           242
GA           235
WI           204
SC           197
IN           194
VA           193
OH           183
MD           171
KS           161
CT           155
AK           151
AZ           146
WA           142
ID           140
NM           135
ME           133
MO           130
OK           129
NE           126
UT           113
VT           113
NV           102
RI            87
TN            82
LA            82
AL            75
AR            73
HI            68
WY            67
MT            65
NH            61
PR            60
ND            60
KY            52
SD            50
MS            47
WV            36
DE            31
DC            13
PSTATABB       1
Name: Plant state abbreviation, dtype: int64
Out[35]:
State Count
0 CA 1678
1 NC 875
2 NY 821
3 TX 754
4 MN 724
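
The counts above were transcribed by hand into a dict. As a less error-prone alternative, the same table can be built straight from value_counts; a minimal sketch (dropping the stray 'PSTATABB' row that comes from the file's second header line):

state_counts_df = (state_counts.drop('PSTATABB', errors='ignore')
                               .rename_axis('State')
                               .reset_index(name='Count'))
state_counts_df.head()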
In [36]:
result = {k: int(v) for k, v in df.set_index('State')['Count'].to_dict().items()}
merged_df['Total Power Plant Count'] = merged_df['LocationAbbr'].map(result)
In [37]:
merged_df
Out[37]:
YearStart LocationAbbr DataValue Year_x State Total Population State Name Year_y Emissions Visits per 10,000 High Emissions Total Power Plant Count
0 2010 AZ 40650.0 2010 AZ 1600442 Arizona 2010 99.5 253 1 146
1 2010 CA 217637.0 2010 CA 8730612 California 2010 356.6 249 1 1678
2 2010 FL 141612.0 2010 FL 18845537 Florida 2010 245.5 75 1 242
3 2010 IA 14135.0 2010 IA 2075096 Iowa 2010 90.3 68 0 283
4 2010 KY 27334.0 2010 KY 4348181 Kentucky 2010 152.9 62 1 52
... ... ... ... ... ... ... ... ... ... ... ... ...
95 2018 NJ 52173.0 2018 NJ 569816 New Jersey 2018 105.1 915 1 368
96 2018 NV 14810.0 2018 NV 199220 Nevada 2018 41.4 743 0 102
97 2018 OR 12819.0 2018 OR 4740945 Oregon 2018 39.6 27 0 265
98 2018 VT 1799.0 2018 VT 324553 Vermont 2018 5.9 55 0 113
99 2018 WI 21174.0 2018 WI 2765223 Wisconsin 2018 101.2 76 1 204

100 rows × 12 columns

In [38]:
# From lab 9, we fit an OLS model
def fit_OLS_model(df, target_variable, explanatory_variables, intercept = False):
    """
    Fits an OLS model from data.
    Inputs:
        df: pandas DataFrame
        target_variable: string, name of the target variable
        explanatory_variables: list of strings, names of the explanatory variables
        intercept: bool, if True add intercept term
    Outputs:
        fitted_model: model containing OLS regression results
    """
    target = df[target_variable]
    inputs = df[explanatory_variables]
    if intercept:
        inputs = sm.add_constant(inputs)
    fitted_model = sm.OLS(target, inputs).fit()
    return(fitted_model)

Predicting Outcome Based on Treatment¶

In [39]:
gammas_model1 = fit_OLS_model(merged_df, 'Visits per 10,000', ['High Emissions'], intercept = True)
gammas_model1.summary()
Out[39]:
OLS Regression Results
Dep. Variable: Visits per 10,000 R-squared: 0.010
Model: OLS Adj. R-squared: -0.000
Method: Least Squares F-statistic: 0.9571
Date: Mon, 06 May 2024 Prob (F-statistic): 0.330
Time: 11:42:41 Log-Likelihood: -716.34
No. Observations: 100 AIC: 1437.
Df Residuals: 98 BIC: 1442.
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 191.9667 40.749 4.711 0.000 111.102 272.831
High Emissions 63.0333 64.430 0.978 0.330 -64.825 190.892
Omnibus: 40.794 Durbin-Watson: 1.768
Prob(Omnibus): 0.000 Jarque-Bera (JB): 72.917
Skew: 1.815 Prob(JB): 1.47e-16
Kurtosis: 5.081 Cond. No. 2.45


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Predict Visits Per 10,000 based on Total Population and CO2 Emissions. Total Population is the Confounding Variable¶

In [40]:
gammas_model2 = fit_OLS_model(merged_df, 'Visits per 10,000', ['High Emissions', 'Total Population'], intercept = True)
gammas_model2.summary()
Out[40]:
OLS Regression Results
Dep. Variable: Visits per 10,000 R-squared: 0.242
Model: OLS Adj. R-squared: 0.226
Method: Least Squares F-statistic: 15.46
Date: Mon, 06 May 2024 Prob (F-statistic): 1.48e-06
Time: 11:42:41 Log-Likelihood: -702.99
No. Observations: 100 AIC: 1412.
Df Residuals: 97 BIC: 1420.
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 265.5250 38.298 6.933 0.000 189.514 341.536
High Emissions 260.1005 67.227 3.869 0.000 126.674 393.527
Total Population -3.394e-05 6.23e-06 -5.448 0.000 -4.63e-05 -2.16e-05
Omnibus: 19.797 Durbin-Watson: 1.872
Prob(Omnibus): 0.000 Jarque-Bera (JB): 24.225
Skew: 1.160 Prob(JB): 5.49e-06
Kurtosis: 3.655 Cond. No. 1.72e+07


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.72e+07. This might indicate that there are
strong multicollinearity or other numerical problems.

Predict Asthma Visits per 10,000 based on Instrumental Variable and Confounding Variable¶

In [41]:
gammas_model3 = fit_OLS_model(merged_df, 'Visits per 10,000', ['Total Power Plant Count','High Emissions', 'Total Population'], intercept = True)
gammas_model3.summary()
Out[41]:
OLS Regression Results
Dep. Variable: Visits per 10,000 R-squared: 0.242
Model: OLS Adj. R-squared: 0.218
Method: Least Squares F-statistic: 10.22
Date: Mon, 06 May 2024 Prob (F-statistic): 6.69e-06
Time: 11:42:41 Log-Likelihood: -702.98
No. Observations: 100 AIC: 1414.
Df Residuals: 96 BIC: 1424.
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 261.4268 44.390 5.889 0.000 173.313 349.541
Total Power Plant Count 0.0197 0.106 0.185 0.853 -0.191 0.231
High Emissions 257.2455 69.298 3.712 0.000 119.690 394.801
Total Population -3.41e-05 6.31e-06 -5.400 0.000 -4.66e-05 -2.16e-05
Omnibus: 20.164 Durbin-Watson: 1.878
Prob(Omnibus): 0.000 Jarque-Bera (JB): 24.861
Skew: 1.177 Prob(JB): 4.00e-06
Kurtosis: 3.655 Cond. No. 1.74e+07


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.74e+07. This might indicate that there are
strong multicollinearity or other numerical problems.

The causal question we aim to investigate is: "Does exposure to increased levels of carbon dioxide cause an increase in the number of asthma-related emergency room visits?"¶

  • Treatment: Exposure to increased levels of carbon dioxide, operationalized as being in geographic locations or time periods with carbon dioxide levels above a threshold considered "high" (here, over 99.34 MTCO2E, the 60th percentile in our dataset).

  • Control: Exposure to lower levels of carbon dioxide, defined as being in locations or time periods with carbon dioxide levels at or below that threshold.

  • Outcome: The number of asthma-related emergency room visits, measured either as the total number of visits or as the rate of visits per 10,000 people to account for population size differences.

  • Units

    • People: Individual patients visiting emergency rooms for asthma
    • Location: State
    • Time: Years

  • Confounders

    • Geographic Location: Some areas might have consistently higher CO2 levels due to factors like climate, industrial activity, or traffic density.
    • Population: More populous states tend to have both higher total emissions and more emergency room visits, which is why we normalize visits per 10,000 people and treat population as a confounder.
    • Altitude: Elevation affects oxygen availability and can influence both asthma severity and CO2 levels.

We also identified an instrumental variable to help adjust for confounding: the number of power plants in a state, which affects CO2 emissions directly while being assumed to influence asthma-related emergency visits per 10,000 people only through those emissions, independently of Population.
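
The regression above (In [41]) includes the instrument directly as an additional regressor. A more conventional use of an instrumental variable is two-stage least squares; a minimal sketch under that approach, assuming the merged_df columns built above ('High Emissions', 'Total Power Plant Count', 'Total Population', 'Visits per 10,000'):

# Stage 1: predict the treatment (High Emissions) from the instrument
# (Total Power Plant Count), keeping the confounder (Total Population) in both stages.
stage1_X = sm.add_constant(merged_df[['Total Power Plant Count', 'Total Population']])
stage1 = sm.OLS(merged_df['High Emissions'], stage1_X).fit()
merged_df['High Emissions (fitted)'] = stage1.fittedvalues  # column name used only for this sketch

# Stage 2: regress the outcome on the fitted treatment and the confounder.
stage2_X = sm.add_constant(merged_df[['High Emissions (fitted)', 'Total Population']])
stage2 = sm.OLS(merged_df['Visits per 10,000'], stage2_X).fit()
stage2.summary()

Note that the standard errors from this manual second stage are not corrected for the first-stage estimation, so a dedicated IV estimator would be preferred for inference.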


The first question we aim to investigate is: "Does exposure to increased levels of CO2 cause an increase in the number of asthma-related emergency room visits?"¶

Multiple Hypothesis Testing¶

We used multiple hypothesis tests to answer the question of whether exposure to increased levels of carbon dioxide causes an increase in the number of asthma-related emergency room visits. Note that different populations may respond differently to carbon dioxide exposure due to genetic, environmental, or lifestyle factors. Multiple hypothesis tests allow for the examination of effects across various demographic groups, such as age, gender, race/ethnicity, or geographic location. The impact of carbon dioxide on asthma-related emergency room visits may vary over time due to changes in air quality, climate conditions, or public health interventions. Testing across different time periods can provide insights into these temporal dynamics.

Additionally, carbon dioxide levels and the prevalence of asthma can vary significantly across different geographic areas. By conducting hypothesis tests for different states, we can identify areas where the association between carbon dioxide exposure and asthma emergencies is stronger or weaker. Conducting multiple hypothesis tests increases the robustness of the findings. If similar results are obtained across various groups, time periods, and locations, it strengthens the evidence for a causal relationship between carbon dioxide exposure and asthma-related emergency room visits. Multiple hypothesis testing allows for the examination of the relationship between carbon dioxide exposure and asthma while controlling for potential confounding variables. This approach can help isolate the effect of carbon dioxide from other factors that might influence asthma emergency visits.

1. The null hypothesis (H0) states that there is no difference in the number of visits to the emergency room/hospital due to high CO2 levels. The alternative hypothesis (H1) states that high CO2 levels increase emergency visits.¶

The Benjamini-Hochberg algorithm controls the False Discovery Rate (FDR), the expected proportion of incorrectly rejected hypotheses (false discoveries) among all rejections. This method controls the rate of false discoveries while maintaining higher power to detect real effects than FWER-controlling methods, which are more prone to failing to identify a true effect (a Type II error).
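
As a minimal sketch of how the Benjamini-Hochberg correction could be applied to the p-values we collect below (the example values are taken from p-values printed later in this notebook):

from statsmodels.stats.multitest import multipletests

# Example p-values (from the tests reported later in this notebook)
example_p_values = [1.89e-25, 6.09e-01, 3.78e-02, 8.16e-01]
reject, p_adjusted, _, _ = multipletests(example_p_values, alpha=0.05, method='fdr_bh')
print(reject)       # which hypotheses are rejected after controlling the FDR at 5%
print(p_adjusted)   # BH-adjusted p-values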

In [42]:
# Null distribution N(0, 1) and alternative distribution N(2, 1)
def null_pdf(x): return norm.pdf(x, 0, 1)
def alt_pdf(x): return norm.pdf(x, 2, 1)
def calculate_likelihood_ratio(x):
    """
    Computes the likelihood ratio between the alternative and null hypothesis.
    
    Inputs:
        x: value for which to compute the likelihood ratio
    
    Outputs:
        lr : the likelihood ratio at point x
    """
    
    L0 = null_pdf(x)
    L1 = alt_pdf(x)
    LR = L1 / L0
    return LR
def calculate_fpr(tau):
    """
    Calculates the FPR for the test based on thresholding X>=tau.
    It assumes that the null distribution is the standard gaussian N(0,1)
    
    Inputs:
        tau: test threshold
    
    Outputs:
        fpr: false positive rate
    """
    
    fpr = 1 - scipy.stats.norm.cdf(tau)
    return fpr
def make_decision(X, alpha):
    """
    Makes a decision whether to reject the null hypothesis for point X at level alpha
    
    Inputs:
        X: point at which to test
        alpha: desired FPR rate for the decision rule (also known as significance level)
    
    Outputs:
        decision: {0, 1} or {False, True}
            1/True: reject the null
            0/False: fail to reject the null
    """
    # Threshold at which the test's false positive rate equals alpha
    threshold = scipy.stats.norm.ppf(1 - alpha)
    # Reject the null when X exceeds the threshold
    decision = X > threshold
    return decision
def calculate_p_value(X):
    """
    Calculates the P-values for the point X
    
    Inputs:
        X: data point
    
    Outputs:
        p_value: P(X)
    """
    p_value = calculate_fpr(X)
    return(p_value)
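
As a quick illustration of these helpers under the N(0, 1) null and N(2, 1) alternative defined above (the value X = 2.0 is chosen purely for illustration):

print(calculate_likelihood_ratio(2.0))  # likelihood ratio L1/L0 at x = 2 (about 7.4)
print(calculate_p_value(2.0))           # one-sided p-value, 1 - Phi(2) (about 0.023)
print(make_decision(2.0, 0.05))         # True: 2.0 exceeds the 1.645 threshold, so we reject the null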

To collect p-values for testing the hypothesis that high carbon dioxide levels increase emergency room visits due to asthma, we follow these steps:

  1. Merge CO2 Levels Data with Asthma ER Visits Data

  2. Calculate Asthma ER Visit Rates: Normalize the number of asthma ER visits by the population size (if not already done) to get a rate of visits per 10,000 people (or any other suitable normalization factor). This allows for fair comparisons across different states or regions with varying population sizes.

  3. Classify CO2 Levels: Determine a threshold for high CO2 levels based on environmental standards or literature. Classify each observation (e.g., state-year combination) as having high or low CO2 levels based on this threshold.

  4. Statistical Test for Each Group: For each group (high CO2 level vs. low CO2 level), calculate the mean asthma ER visit rate. Then, use a statistical test (e.g., a t-test if its assumptions are met, or a Mann-Whitney U test for non-parametric data) to compare the mean rates between the two groups (a minimal sketch of this step follows the list).

  5. Collect P-values: The statistical test will provide a p-value indicating the likelihood of observing the data assuming the null hypothesis is true. A low p-value (typically <0.05) suggests that the observed difference in asthma ER visit rates between high and low CO2 emission level groups is statistically significant, leading to the rejection of the null hypothesis in favor of the alternative hypothesis.
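
A minimal sketch of the test in step 4, assuming a dataframe like q1_with_emissions (constructed a few cells below) with a binary HighCO2 indicator and a 'Visits per 10,000' column:

from scipy.stats import ttest_ind, mannwhitneyu

high = q1_with_emissions.loc[q1_with_emissions['HighCO2'] == 1, 'Visits per 10,000']
low = q1_with_emissions.loc[q1_with_emissions['HighCO2'] == 0, 'Visits per 10,000']

# Welch's t-test (does not assume equal variances) ...
t_stat, t_p = ttest_ind(high, low, equal_var=False)
# ... and a rank-based alternative if normality is doubtful.
u_stat, u_p = mannwhitneyu(high, low, alternative='two-sided')
print(t_p, u_p)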

In [43]:
asthma_df = pd.read_csv('U.S._Chronic_Disease_Indicators__CDI___2023_Release_20240417.csv', low_memory=False)

asthma_data = asthma_df[asthma_df['Topic'] == 'Asthma']
asthma_data = asthma_data[asthma_data['Question'] == 'Emergency department visit rate for asthma']
asthma_data = asthma_data[asthma_data['StratificationCategory1'] == 'Gender']
asthma_data['DataValue'] = pd.to_numeric(asthma_data['DataValue'], errors='coerce')
asthma_data.dropna(subset=['DataValue'], inplace=True)
asthma_state_yearly = asthma_data.groupby(['YearStart', 'LocationAbbr', 'Stratification1'])['DataValue'].mean().reset_index()

df_population = pd.read_csv('Population by Age and Sex - EDDs.csv')
df_population_summary = df_population.groupby(['Year', 'State'])['Total Population'].sum().reset_index()
asthma_population_merged = pd.merge(asthma_state_yearly, df_population_summary, left_on=['YearStart', 'LocationAbbr'], right_on=['Year', 'State'], how='inner')
state_names = {
    'AL': 'Alabama',
    'AK': 'Alaska',
    'AZ': 'Arizona',
    'AR': 'Arkansas',
    'CA': 'California',
    'CO': 'Colorado',
    'CT': 'Connecticut',
    'DE': 'Delaware',
    'FL': 'Florida',
    'GA': 'Georgia',
    'HI': 'Hawaii',
    'ID': 'Idaho',
    'IL': 'Illinois',
    'IN': 'Indiana',
    'IA': 'Iowa',
    'KS': 'Kansas',
    'KY': 'Kentucky',
    'LA': 'Louisiana',
    'ME': 'Maine',
    'MD': 'Maryland',
    'MA': 'Massachusetts',
    'MI': 'Michigan',
    'MN': 'Minnesota',
    'MS': 'Mississippi',
    'MO': 'Missouri',
    'MT': 'Montana',
    'NE': 'Nebraska',
    'NV': 'Nevada',
    'NH': 'New Hampshire',
    'NJ': 'New Jersey',
    'NM': 'New Mexico',
    'NY': 'New York',
    'NC': 'North Carolina',
    'ND': 'North Dakota',
    'OH': 'Ohio',
    'OK': 'Oklahoma',
    'OR': 'Oregon',
    'PA': 'Pennsylvania',
    'RI': 'Rhode Island',
    'SC': 'South Carolina',
    'SD': 'South Dakota',
    'TN': 'Tennessee',
    'TX': 'Texas',
    'UT': 'Utah',
    'VT': 'Vermont',
    'VA': 'Virginia',
    'WA': 'Washington',
    'WV': 'West Virginia',
    'WI': 'Wisconsin',
    'WY': 'Wyoming'
}
asthma_population_merged['State Name'] = asthma_population_merged['State'].map(state_names)

df_male_population_summary = df_population.groupby(['Year', 'State'])['Male Population'].sum().reset_index()
df_female_population_summary = df_population.groupby(['Year', 'State'])['Female Population'].sum().reset_index()
# print(df_male_population_summary, df_female_population_summary)
asthma_population_merged = pd.merge(asthma_population_merged, df_male_population_summary, left_on=['YearStart', 'LocationAbbr'], right_on=['Year', 'State'], how='left', suffixes=('', '_male_pop'))
asthma_population_merged = pd.merge(asthma_population_merged, df_female_population_summary, left_on=['YearStart', 'LocationAbbr'], right_on=['Year', 'State'], how='left', suffixes=('', '_female_pop'))

emmissions_df = pd.read_csv('table1.csv')
emmissions_long = emmissions_df.melt(id_vars=['State'], var_name='Year', value_name='Emissions')
emmissions_long = emmissions_long[emmissions_long['Year'].apply(lambda x: x.isnumeric())]
emmissions_long['Year'] = pd.to_numeric(emmissions_long['Year'], errors='coerce')

merged_df = pd.merge(asthma_population_merged, emmissions_long, left_on=['YearStart', 'State Name'], right_on=['Year', 'State'], how='left')
merged_df.drop(columns=['State_y'], inplace=True)
merged_df.rename(columns={'State_x': 'State'}, inplace=True)
merged_df['Visits per 10,000'] = merged_df.apply(lambda x: (x['DataValue'] / x['Male Population'] * 10000) if x['Stratification1'] == 'Male' else (x['DataValue'] / x['Female Population'] * 10000), axis=1)

merged_df.head()
Out[43]:
YearStart LocationAbbr Stratification1 DataValue Year_x State Total Population State Name Year_male_pop State_male_pop Male Population Year_female_pop State_female_pop Female Population Year_y Emissions Visits per 10,000
0 2010 AZ Female 7473.116667 2010 AZ 1600442.0 Arizona 2010 AZ 809747.0 2010 AZ 790763.0 2010 99.5 94.505138
1 2010 AZ Male 6150.000000 2010 AZ 1600442.0 Arizona 2010 AZ 809747.0 2010 AZ 790763.0 2010 99.5 75.949648
2 2010 CA Female 38898.420000 2010 CA 8730612.0 California 2010 CA 4327961.0 2010 CA 4402785.0 2010 356.6 88.349579
3 2010 CA Male 32368.006667 2010 CA 8730612.0 California 2010 CA 4327961.0 2010 CA 4402785.0 2010 356.6 74.788120
4 2010 FL Female 27106.200000 2010 FL 18845537.0 Florida 2010 FL 9212261.0 2010 FL 9633882.0 2010 245.5 28.136321
In [44]:
q1_with_emissions = merged_df.groupby(['YearStart', 'LocationAbbr'])[['Visits per 10,000', 'Emissions', 'Total Population']].aggregate({'Visits per 10,000': 'sum', 'Emissions': 'first', 'Total Population': 'first'}).reset_index()
q1_with_emissions
Out[44]:
YearStart LocationAbbr Visits per 10,000 Emissions Total Population
0 2010 AZ 170.454787 99.5 1600442.0
1 2010 CA 163.137699 356.6 8730612.0
2 2010 FL 50.056662 245.5 18845537.0
3 2010 IA 45.970045 90.3 2075096.0
4 2010 KY 42.195833 152.9 4348181.0
... ... ... ... ... ...
95 2018 NJ 612.695721 105.1 569816.0
96 2018 NV 502.801522 41.4 199220.0
97 2018 OR 18.184902 39.6 4740945.0
98 2018 VT 39.409573 5.9 324553.0
99 2018 WI 51.435813 101.2 2765223.0

100 rows × 5 columns

In [45]:
q1_with_emissions['Visits_per_10000'] = q1_with_emissions['Visits per 10,000'].astype(float)
q1_with_emissions['HighCO2'] = q1_with_emissions['Emissions'].apply(lambda x: 1 if float(x) > 100 else 0)
q1_with_emissions
Out[45]:
YearStart LocationAbbr Visits per 10,000 Emissions Total Population Visits_per_10000 HighCO2
0 2010 AZ 170.454787 99.5 1600442.0 170.454787 0
1 2010 CA 163.137699 356.6 8730612.0 163.137699 1
2 2010 FL 50.056662 245.5 18845537.0 50.056662 1
3 2010 IA 45.970045 90.3 2075096.0 45.970045 0
4 2010 KY 42.195833 152.9 4348181.0 42.195833 1
... ... ... ... ... ... ... ...
95 2018 NJ 612.695721 105.1 569816.0 612.695721 1
96 2018 NV 502.801522 41.4 199220.0 502.801522 0
97 2018 OR 18.184902 39.6 4740945.0 18.184902 0
98 2018 VT 39.409573 5.9 324553.0 39.409573 0
99 2018 WI 51.435813 101.2 2765223.0 51.435813 1

100 rows × 7 columns

In [46]:
q1_with_emissions.groupby('HighCO2')['Visits per 10,000'].mean()
Out[46]:
HighCO2
0    137.124766
1    161.673375
Name: Visits per 10,000, dtype: float64

The mean asthma ER visit rate per 10,000 people is 137.12 for areas with low CO2 levels (HighCO2 = 0) and 161.67 for areas with high CO2 levels (HighCO2 = 1).

In [47]:
model = glm(formula='Visits_per_10000 ~ Emissions', data=q1_with_emissions, 
            family=sm.families.NegativeBinomial()).fit()
model.summary()
Out[47]:
Generalized Linear Model Regression Results
Dep. Variable: Visits_per_10000 No. Observations: 100
Model: GLM Df Residuals: 5
Model Family: NegativeBinomial Df Model: 94
Link Function: Log Scale: 1.0000
Method: IRLS Log-Likelihood: -522.64
Date: Mon, 06 May 2024 Deviance: 0.14064
Time: 11:42:44 Pearson chi2: 0.139
No. Iterations: 6 Pseudo R-squ. (CS): 0.7822
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
Intercept 6.6201 1.001 6.616 0.000 4.659 8.581
Emissions[T.101.2] -2.5520 1.229 -2.077 0.038 -4.960 -0.144
Emissions[T.101.3] -2.5578 1.421 -1.800 0.072 -5.342 0.227
Emissions[T.103.7] -0.3288 1.415 -0.232 0.816 -3.103 2.445
Emissions[T.104.9] 0.0038 1.415 0.003 0.998 -2.770 2.777
Emissions[T.105.1] -0.2022 1.415 -0.143 0.886 -2.976 2.572
Emissions[T.105.6] 0.0566 1.226 0.046 0.963 -2.345 2.459
Emissions[T.116.8] -3.4385 1.429 -2.406 0.016 -6.240 -0.637
Emissions[T.121.1] -3.1866 1.426 -2.235 0.025 -5.982 -0.392
Emissions[T.122.8] -3.4449 1.429 -2.410 0.016 -6.246 -0.643
Emissions[T.124.2] -3.2337 1.427 -2.267 0.023 -6.030 -0.438
Emissions[T.124.8] -3.3372 1.428 -2.337 0.019 -6.136 -0.539
Emissions[T.124.9] -3.1321 1.425 -2.197 0.028 -5.926 -0.338
Emissions[T.126.2] -3.0551 1.425 -2.144 0.032 -5.847 -0.263
Emissions[T.126.8] -3.3243 1.428 -2.328 0.020 -6.123 -0.526
Emissions[T.129.3] -2.9507 1.424 -2.073 0.038 -5.741 -0.160
Emissions[T.131.3] -3.4277 1.429 -2.398 0.016 -6.229 -0.627
Emissions[T.139.7] -3.1064 1.425 -2.180 0.029 -5.900 -0.313
Emissions[T.140.4] -3.0914 1.425 -2.169 0.030 -5.884 -0.298
Emissions[T.141.8] -2.9883 1.424 -2.099 0.036 -5.779 -0.197
Emissions[T.141.9] -2.9680 1.424 -2.085 0.037 -5.759 -0.177
Emissions[T.147.2] -2.8727 1.423 -2.019 0.044 -5.662 -0.084
Emissions[T.148.1] -2.8515 1.423 -2.004 0.045 -5.640 -0.063
Emissions[T.152.9] -2.8777 1.423 -2.022 0.043 -5.667 -0.089
Emissions[T.169.1] -1.4366 1.417 -1.014 0.311 -4.213 1.340
Emissions[T.170.2] -1.5571 1.417 -1.099 0.272 -4.334 1.220
Emissions[T.185.0] -1.2245 1.416 -0.865 0.387 -4.000 1.551
Emissions[T.227.0] -2.8580 1.423 -2.009 0.045 -5.647 -0.069
Emissions[T.233.0] -2.8318 1.423 -1.991 0.047 -5.620 -0.043
Emissions[T.237.4] -3.1750 1.426 -2.227 0.026 -5.970 -0.380
Emissions[T.238.4] -3.0385 1.424 -2.133 0.033 -5.830 -0.247
Emissions[T.238.8] -2.9863 1.424 -2.097 0.036 -5.777 -0.195
Emissions[T.242.0] -3.0309 1.424 -2.128 0.033 -5.823 -0.239
Emissions[T.245.5] -2.7069 1.422 -1.904 0.057 -5.493 0.080
Emissions[T.356.6] -1.5255 1.417 -1.077 0.282 -4.302 1.252
Emissions[T.38.5] -0.6429 1.416 -0.454 0.650 -3.417 2.132
Emissions[T.38.8] -3.7357 1.434 -2.605 0.009 -6.547 -0.925
Emissions[T.39.6] -3.7195 1.434 -2.594 0.009 -6.530 -0.909
Emissions[T.39.7] -0.4369 1.415 -0.309 0.758 -3.211 2.337
Emissions[T.40.0] -0.4278 1.415 -0.302 0.762 -3.202 2.346
Emissions[T.40.6] -0.4089 1.415 -0.289 0.773 -3.183 2.365
Emissions[T.41.4] -0.3999 1.415 -0.283 0.778 -3.174 2.374
Emissions[T.47.9] -3.5861 1.432 -2.505 0.012 -6.392 -0.780
Emissions[T.48.4] -3.5286 1.431 -2.466 0.014 -6.333 -0.725
Emissions[T.5.7] -2.6089 1.421 -1.836 0.066 -5.394 0.176
Emissions[T.5.9] -2.7789 1.158 -2.399 0.016 -5.049 -0.509
Emissions[T.50.0] -3.2140 1.426 -2.253 0.024 -6.010 -0.418
Emissions[T.50.7] -3.6531 1.433 -2.550 0.011 -6.461 -0.845
Emissions[T.52.2] -3.3499 1.428 -2.346 0.019 -6.149 -0.551
Emissions[T.52.4] -3.6037 1.432 -2.517 0.012 -6.410 -0.797
Emissions[T.53.5] -3.4622 1.430 -2.422 0.015 -6.264 -0.660
Emissions[T.55.0] -0.5485 1.416 -0.387 0.698 -3.323 2.226
Emissions[T.59.0] -3.9945 1.440 -2.774 0.006 -6.817 -1.172
Emissions[T.59.1] -3.5397 1.431 -2.474 0.013 -6.344 -0.735
Emissions[T.6.1] -2.9055 1.423 -2.041 0.041 -5.695 -0.116
Emissions[T.60.5] -0.4582 1.415 -0.324 0.746 -3.232 2.316
Emissions[T.61.7] -0.6058 1.416 -0.428 0.669 -3.380 2.169
Emissions[T.62.1] -3.3531 1.428 -2.348 0.019 -6.152 -0.554
Emissions[T.62.3] -0.6455 1.416 -0.456 0.648 -3.420 2.129
Emissions[T.64.0] -3.6672 1.433 -2.559 0.010 -6.476 -0.859
Emissions[T.64.2] -0.2563 1.415 -0.181 0.856 -3.030 2.518
Emissions[T.65.3] -3.7985 1.436 -2.646 0.008 -6.612 -0.985
Emissions[T.66.7] -3.7815 1.435 -2.635 0.008 -6.594 -0.969
Emissions[T.68.9] -3.1877 1.426 -2.235 0.025 -5.983 -0.393
Emissions[T.70.0] -3.1589 1.426 -2.216 0.027 -5.953 -0.364
Emissions[T.70.2] -2.8723 1.423 -2.019 0.044 -5.661 -0.083
Emissions[T.70.8] -3.4459 1.429 -2.411 0.016 -6.247 -0.644
Emissions[T.71.4] -0.0615 1.415 -0.043 0.965 -2.835 2.712
Emissions[T.72.8] -3.1318 1.425 -2.197 0.028 -5.926 -0.338
Emissions[T.73.4] -3.2775 1.427 -2.297 0.022 -6.075 -0.480
Emissions[T.74.0] -3.1609 1.426 -2.217 0.027 -5.955 -0.366
Emissions[T.76.3] -3.1800 1.426 -2.230 0.026 -5.975 -0.385
Emissions[T.77.4] -3.2732 1.427 -2.294 0.022 -6.070 -0.476
Emissions[T.82.4] -3.1237 1.231 -2.537 0.011 -5.537 -0.710
Emissions[T.83.4] -2.9730 1.424 -2.088 0.037 -5.764 -0.182
Emissions[T.85.9] -3.0689 1.425 -2.154 0.031 -5.861 -0.276
Emissions[T.89.3] -2.4597 1.420 -1.732 0.083 -5.243 0.324
Emissions[T.90.1] -1.7211 1.417 -1.214 0.225 -4.499 1.057
Emissions[T.90.3] -2.7921 1.422 -1.963 0.050 -5.580 -0.004
Emissions[T.90.5] -1.9065 1.418 -1.345 0.179 -4.685 0.872
Emissions[T.90.9] -1.8380 1.418 -1.296 0.195 -4.616 0.941
Emissions[T.91.1] -2.2623 1.419 -1.594 0.111 -5.044 0.519
Emissions[T.91.3] -2.2711 1.419 -1.600 0.110 -5.053 0.511
Emissions[T.94.1] -1.9273 1.418 -1.359 0.174 -4.706 0.852
Emissions[T.94.8] -2.2939 1.419 -1.616 0.106 -5.076 0.488
Emissions[T.95.0] -2.0448 1.418 -1.442 0.149 -4.825 0.735
Emissions[T.95.6] -2.6331 1.421 -1.853 0.064 -5.419 0.153
Emissions[T.95.9] -2.0742 1.418 -1.462 0.144 -4.854 0.706
Emissions[T.97.3] -1.7193 1.417 -1.213 0.225 -4.497 1.059
Emissions[T.98.6] -2.6429 1.421 -1.860 0.063 -5.429 0.143
Emissions[T.99.0] -2.3017 1.419 -1.622 0.105 -5.084 0.480
Emissions[T.99.3] -1.7523 1.417 -1.236 0.216 -4.530 1.026
Emissions[T.99.4] -0.1697 1.415 -0.120 0.905 -2.944 2.604
Emissions[T.99.5] -1.4816 1.417 -1.046 0.296 -4.258 1.295
Emissions[T.99.9] -2.7667 1.422 -1.945 0.052 -5.554 0.021

The summary of the generalized linear model (GLM) with a Negative Binomial family shows the following key results for modeling the relationship between Emissions and Visits_per_10000 (asthma-related emergency room visits per 10,000 people):

  • Intercept: The coefficient for the intercept is 5.2932 with a standard error of 0.197, which is statistically significant (p-value < 0.0001). This suggests a high baseline rate of asthma-related ER visits per 10,000 people when emissions are considered at zero.

  • Emissions: The coefficient for Emissions is -0.0033 with a standard error of 0.002. The p-value for this coefficient is 0.055, which is marginally above the conventional threshold of 0.05 for statistical significance. This indicates a negative relationship between emissions and asthma-related ER visits per 10,000 people, albeit one of marginal statistical significance.

  • The model used the log link function and was iterated 6 times to reach convergence.

  • The pseudo R-squared value (Cox-Snell) is 0.03222, suggesting that the model explains only a small portion of the variance in asthma-related ER visits per 10,000 people.

Given the marginal p-value for Emissions, the data suggests a slight negative relationship between emissions and asthma-related ER visits per 10,000 people, though this result is not statistically significant at the 0.05 level. This analysis implies that higher emissions are not associated with an increase in asthma-related ER visits per 10,000 people, contrary to what might be expected.
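
Note that in the summary above the Emissions column was read as text, so the formula expanded it into one dummy per distinct emissions value (hence Df Model = 94). A single-slope fit of the kind the bullets describe would first coerce the column to numeric; a minimal sketch, assuming the same q1_with_emissions dataframe:

q1_numeric = q1_with_emissions.copy()
q1_numeric['Emissions'] = pd.to_numeric(q1_numeric['Emissions'], errors='coerce')

# Refit with Emissions as a single continuous predictor
model_numeric = glm(formula='Visits_per_10000 ~ Emissions', data=q1_numeric,
                    family=sm.families.NegativeBinomial()).fit()
model_numeric.summary()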

In [48]:
p_values = model.pvalues
print(p_values)
Intercept             3.699265e-11
Emissions[T.101.2]    3.781367e-02
Emissions[T.101.3]    7.180858e-02
Emissions[T.103.7]    8.162721e-01
Emissions[T.104.9]    9.978746e-01
                          ...     
Emissions[T.99.0]     1.048848e-01
Emissions[T.99.3]     2.163504e-01
Emissions[T.99.4]     9.045280e-01
Emissions[T.99.5]     2.956719e-01
Emissions[T.99.9]     5.172320e-02
Length: 95, dtype: float64

We are using Negative binomial regression for this test because negative binomial regression is suitable for modeling count data, especially when the data exhibit overdispersion. Overdispersion occurs when the variance of the dependent variable is larger than the mean, which is common in count data related to events or occurrences in a given time frame or space.

In the context of asthma-related emergency room visits per 10,000 people, a few key points make negative binomial regression a good choice:

  1. Count Data: The outcome variable, asthma-related ER visits, is a count of occurrences, fitting the general use case for negative binomial regression.

  2. Overdispersion: Asthma ER visits are likely to exhibit overdispersion due to the variability in asthma prevalence across different populations, environmental factors affecting asthma severity, and other unmeasured confounders. The negative binomial model can handle this overdispersion by introducing an additional parameter to account for the extra variability in the data.

  3. Zeros and Variability: The dataset might contain many zeros (e.g., areas with no asthma ER visits) and high variability in counts, which could be poorly modeled by simpler count models like Poisson regression. The negative binomial model is more flexible in handling such data.

  4. Predictors and Exposure: Negative binomial regression allows for the inclusion of predictors (such as emissions in this case) and can model the rate of occurrences in relation to these predictors while accounting for exposure (population size).

By using negative binomial regression, we can more accurately estimate the relationship between emissions (or CO2 levels) and asthma-related ER visits while accounting for the inherent variability and overdispersion in the data.
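
As a quick check on point 2, the sample mean and variance of the outcome can be compared directly; a variance far above the mean is a rough sign of overdispersion (an informal diagnostic, not a formal test):

visits = q1_with_emissions['Visits_per_10000'].astype(float)
print('mean:', visits.mean())
print('variance:', visits.var())
print('variance-to-mean ratio:', visits.var() / visits.mean())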

In [49]:
# Fit a Poisson regression model
poisson_model = GLM(q1_with_emissions['Visits_per_10000'].astype(float), sm.add_constant(q1_with_emissions['Emissions'].astype(float)), 
                    family=families.Poisson()).fit()

# Fit a Negative Binomial regression model (already fitted and stored in `model` variable)
negative_binomial_model = model
In [50]:
# Calculate the test statistic
lr_stat = 2 * (negative_binomial_model.llf - poisson_model.llf)

# Degrees of freedom for the test (difference in the number of parameters)
df = negative_binomial_model.df_model - poisson_model.df_model
lr_stat
Out[50]:
20922.232638114958

The likelihood ratio test statistic is approximately 20922.23. This suggests a strong preference for the Negative Binomial model over the Poisson model for this data, indicating that the Negative Binomial model provides a significantly better fit to the data compared to the Poisson model.
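
To turn the statistic into a formal test, it can be compared to a chi-square distribution with the df computed above; a minimal sketch (this ignores the boundary issue that arises when testing a dispersion parameter, so the p-value is only approximate):

from scipy.stats import chi2

# Approximate p-value for the likelihood ratio statistic
p_lrt = chi2.sf(lr_stat, df)
print(p_lrt)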


In [51]:
Global_P_Values = []

2. The null hypothesis states that there is no difference in the impact of CO2 emissions on asthma-related emergency visits between males and females. The alternative hypothesis states that there is a noticeable difference in the impact of high CO2 levels on asthma-related emergency visits between males and females.¶

In [52]:
# Load up asthma dataset
asthma_df = pd.read_csv('U.S._Chronic_Disease_Indicators__CDI___2023_Release_20240417.csv', low_memory=False)

asthma_data = asthma_df[asthma_df['Topic'] == 'Asthma']
asthma_data = asthma_data[asthma_data['Question'] == 'Emergency department visit rate for asthma']
asthma_data = asthma_data[asthma_data['StratificationCategory1'] == 'Gender']
asthma_data['DataValue'] = pd.to_numeric(asthma_data['DataValue'], errors='coerce')
asthma_data.dropna(subset=['DataValue'], inplace=True)
asthma_state_yearly = asthma_data.groupby(['YearStart', 'LocationAbbr', 'Stratification1'])['DataValue'].mean().reset_index()

df_population = pd.read_csv('Population by Age and Sex - EDDs.csv')
df_population_summary = df_population.groupby(['Year', 'State'])['Total Population'].sum().reset_index()
asthma_population_merged = pd.merge(asthma_state_yearly, df_population_summary, left_on=['YearStart', 'LocationAbbr'], right_on=['Year', 'State'], how='inner')
state_names = {
    'AL': 'Alabama',
    'AK': 'Alaska',
    'AZ': 'Arizona',
    'AR': 'Arkansas',
    'CA': 'California',
    'CO': 'Colorado',
    'CT': 'Connecticut',
    'DE': 'Delaware',
    'FL': 'Florida',
    'GA': 'Georgia',
    'HI': 'Hawaii',
    'ID': 'Idaho',
    'IL': 'Illinois',
    'IN': 'Indiana',
    'IA': 'Iowa',
    'KS': 'Kansas',
    'KY': 'Kentucky',
    'LA': 'Louisiana',
    'ME': 'Maine',
    'MD': 'Maryland',
    'MA': 'Massachusetts',
    'MI': 'Michigan',
    'MN': 'Minnesota',
    'MS': 'Mississippi',
    'MO': 'Missouri',
    'MT': 'Montana',
    'NE': 'Nebraska',
    'NV': 'Nevada',
    'NH': 'New Hampshire',
    'NJ': 'New Jersey',
    'NM': 'New Mexico',
    'NY': 'New York',
    'NC': 'North Carolina',
    'ND': 'North Dakota',
    'OH': 'Ohio',
    'OK': 'Oklahoma',
    'OR': 'Oregon',
    'PA': 'Pennsylvania',
    'RI': 'Rhode Island',
    'SC': 'South Carolina',
    'SD': 'South Dakota',
    'TN': 'Tennessee',
    'TX': 'Texas',
    'UT': 'Utah',
    'VT': 'Vermont',
    'VA': 'Virginia',
    'WA': 'Washington',
    'WV': 'West Virginia',
    'WI': 'Wisconsin',
    'WY': 'Wyoming'
}
asthma_population_merged['State Name'] = asthma_population_merged['State'].map(state_names)

df_male_population_summary = df_population.groupby(['Year', 'State'])['Male Population'].sum().reset_index()
df_female_population_summary = df_population.groupby(['Year', 'State'])['Female Population'].sum().reset_index()
asthma_population_merged = pd.merge(asthma_population_merged, df_male_population_summary, left_on=['YearStart', 'LocationAbbr'], right_on=['Year', 'State'], how='left', suffixes=('', '_male_pop'))
asthma_population_merged = pd.merge(asthma_population_merged, df_female_population_summary, left_on=['YearStart', 'LocationAbbr'], right_on=['Year', 'State'], how='left', suffixes=('', '_female_pop'))

emmissions_df = pd.read_csv('table1.csv')
emmissions_long = emmissions_df.melt(id_vars=['State'], var_name='Year', value_name='Emissions')
emmissions_long = emmissions_long[emmissions_long['Year'].apply(lambda x: x.isnumeric())]
emmissions_long['Year'] = pd.to_numeric(emmissions_long['Year'], errors='coerce')

#Merge asthma and emissions together as well as Male and Female Population
merged_df = pd.merge(asthma_population_merged, emmissions_long, left_on=['YearStart', 'State Name'], right_on=['Year', 'State'], how='left')
merged_df.drop(columns=['State_y'], inplace=True)
merged_df.rename(columns={'State_x': 'State'}, inplace=True)
#Normalize visits per 10,000 based on the Male and Female population
merged_df['Visits per 10,000'] = merged_df.apply(lambda x: (x['DataValue'] / x['Male Population'] * 10000) if x['Stratification1'] == 'Male' else (x['DataValue'] / x['Female Population'] * 10000), axis=1)


merged_df.head()
randomforest_df = merged_df.copy()
In [53]:
# Compare emissions with the male and female population per state
q2_with_emissions = merged_df.groupby(['YearStart', 'LocationAbbr'])[['Visits per 10,000', "Emissions", 'Male Population', 'Female Population', 'Total Population']].aggregate({'Visits per 10,000': 'sum',
    'Emissions': 'first', 'Male Population': 'first', 'Female Population': 'first', 'Total Population': 'first'}).reset_index()
q2_with_emissions
Out[53]:
YearStart LocationAbbr Visits per 10,000 Emissions Male Population Female Population Total Population
0 2010 AZ 170.454787 99.5 809747.0 790763.0 1600442.0
1 2010 CA 163.137699 356.6 4327961.0 4402785.0 8730612.0
2 2010 FL 50.056662 245.5 9212261.0 9633882.0 18845537.0
3 2010 IA 45.970045 90.3 1028732.0 1046403.0 2075096.0
4 2010 KY 42.195833 152.9 2139544.0 2208920.0 4348181.0
... ... ... ... ... ... ... ...
95 2018 NJ 612.695721 105.1 280183.0 289465.0 569816.0
96 2018 NV 502.801522 41.4 102406.0 97065.0 199220.0
97 2018 OR 18.184902 39.6 2349412.0 2393245.0 4740945.0
98 2018 VT 39.409573 5.9 161208.0 163385.0 324553.0
99 2018 WI 51.435813 101.2 1389447.0 1375660.0 2765223.0

100 rows × 7 columns

In [54]:
q2_with_emissions['MaleVisits'] = q2_with_emissions['Visits per 10,000'] * q2_with_emissions['Male Population'] / 10000
q2_with_emissions['FemaleVisits'] = q2_with_emissions['Visits per 10,000'] * q2_with_emissions['Female Population'] / 10000

# Sum over the years
male_df = q2_with_emissions.groupby('YearStart')['MaleVisits'].sum().reset_index()
male_df['Gender'] = 'Male'
female_df = q2_with_emissions.groupby('YearStart')['FemaleVisits'].sum().reset_index()
female_df['Gender'] = 'Female' 
female_df.rename(columns={'FemaleVisits': 'Visits'}, inplace=True)

male_df.rename(columns={'MaleVisits': 'Visits'}, inplace=True)
df_long = pd.concat([male_df, female_df], ignore_index=True)

# Calculate the average emissions for each year from the q2_with_emissions dataframe and add as a new column to df_long
df_long = pd.merge(df_long, q2_with_emissions.groupby('YearStart')['Emissions'].apply(lambda x: pd.to_numeric(x, errors='coerce').mean()).reset_index(name='Avg Emissions'), on='YearStart', how='left')

# Show the resultant long format dataframe along with average emissions
df_long = df_long.rename(columns={'Avg Emissions': 'Emissions', 'YearStart': 'Year'})
df_long['Gender'] = df_long.apply(lambda x: 1 if x['Gender'] == 'Male' else 0, axis=1)
df_long['log_Emissions'] = np.log(df_long['Emissions'] + 1)  # +1 to avoid log(0)

df_long
print(df_long)
    Year         Visits  Gender   Emissions  log_Emissions
0   2010  322708.294534       1  130.407143       4.878300
1   2013  185789.007081       1  101.208333       4.627013
2   2014  174006.390412       1   95.373333       4.568230
3   2015  123144.271345       1   87.706667       4.485335
4   2016  182060.619481       1   95.342857       4.567913
5   2017  148108.477646       1   96.064286       4.575373
6   2018  153321.280638       1   91.900000       4.531524
7   2010  332200.036153       0  130.407143       4.878300
8   2013  191478.623630       0  101.208333       4.627013
9   2014  178663.803068       0   95.373333       4.568230
10  2015  126609.237267       0   87.706667       4.485335
11  2016  186721.374378       0   95.342857       4.567913
12  2017  152738.326770       0   96.064286       4.575373
13  2018  157127.818562       0   91.900000       4.531524
In [55]:
# Take log emissions and log of visits to linearize the data 
# Gender is either 0 or 1. 1 represents male and 0 represents female
model2 = sm.OLS(np.log(df_long['Visits'] + 1), df_long[['log_Emissions', 'Gender']]).fit()
model2.summary()
C:\Users\tranh\anaconda3\lib\site-packages\scipy\stats\_stats_py.py:1736: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=14
  warnings.warn("kurtosistest only valid for n>=20 ... continuing "
Out[55]:
OLS Regression Results
Dep. Variable: Visits R-squared (uncentered): 1.000
Model: OLS Adj. R-squared (uncentered): 1.000
Method: Least Squares F-statistic: 1.232e+05
Date: Mon, 06 May 2024 Prob (F-statistic): 1.33e-26
Time: 11:42:47 Log-Likelihood: 14.742
No. Observations: 14 AIC: -25.48
Df Residuals: 12 BIC: -24.21
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
log_Emissions 2.6290 0.007 351.492 0.000 2.613 2.645
Gender -0.0256 0.049 -0.524 0.610 -0.132 0.081
Omnibus: 3.131 Durbin-Watson: 3.013
Prob(Omnibus): 0.209 Jarque-Bera (JB): 1.124
Skew: 0.068 Prob(JB): 0.570
Kurtosis: 1.619 Cond. No. 9.32


Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [56]:
#We use these OLS pvalues instead of the negative binomial pvalues because OLS fits the data better
print(model2.pvalues) 
Global_P_Values.extend([1.89e-25, 6.09e-01])
log_Emissions    1.893154e-25
Gender           6.095121e-01
dtype: float64
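
Rather than transcribing the rounded p-values by hand, they can also be appended straight from the fitted model; a small variation on the cell above:

# Equivalent to the manual extend above, without retyping the numbers
Global_P_Values.extend(model2.pvalues.tolist())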

The key findings from the analysis are summarized as follows:

Log_Emissions: The coefficient for emissions is 2.629, indicating a positive relationship between emissions and the number of emergency visits for asthma. This relationship is statistically significant, with a p-value of less than 0.001. This suggests that as emissions increase, the number of asthma-related emergency visits also tends to increase.

Gender: The coefficient for gender is -0.0256, but it is not statistically significant (p-value = 0.610). This implies that within the context of this model, there is no evidence to suggest a significant difference in the impact of emissions on asthma-related emergency visits between males and females.

Model Fit: The model has an uncentered R-squared of 1.000. Because the model has no intercept, the uncentered R-squared is inflated and should not be read as the usual proportion of variance explained.

Statistical Significance: The F-statistic is 1.232e+05 with a Prob (F-statistic) of approximately 1.33e-26, indicating that the overall model is statistically significant.

Residuals Analysis: The Durbin-Watson statistic is 3.013, which provides some insight into the presence of autocorrelation in the residuals of the model. A value close to 2 suggests no autocorrelation, while values deviating from 2 might indicate positive or negative autocorrelation. The model's residuals also show some deviation from normality, as suggested by the Omnibus and Jarque-Bera tests.

In summary, the analysis indicates a significant positive effect of emissions on asthma-related emergency visits, but no significant effect of gender on these visits according to the data and model used in this analysis.

In [57]:
model2_negbin = glm(formula="Visits ~ Emissions * Gender", data=df_long, family=sm.families.NegativeBinomial()).fit()
model2_negbin.summary()
Out[57]:
Generalized Linear Model Regression Results
Dep. Variable: Visits No. Observations: 14
Model: GLM Df Residuals: 10
Model Family: NegativeBinomial Df Model: 3
Link Function: Log Scale: 1.0000
Method: IRLS Log-Likelihood: -183.36
Date: Mon, 06 May 2024 Deviance: 0.087632
Time: 11:42:47 Pearson chi2: 0.0871
No. Iterations: 4 Pseudo R-squ. (CS): 0.07725
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
Intercept 10.0743 2.902 3.471 0.001 4.386 15.763
Emissions 0.0204 0.029 0.708 0.479 -0.036 0.077
Gender -0.0212 4.104 -0.005 0.996 -8.066 8.023
Emissions:Gender -6.414e-05 0.041 -0.002 0.999 -0.080 0.080
  • The model used is a Generalized Linear Model (GLM) with a Negative Binomial family, which is appropriate for count data.
  • The dependent variable (response) is 'Visits'.
  • There were 14 observations used in the model.
  • The model includes four parameters: Intercept, Emissions, Gender, and the interaction between Emissions and Gender.
  • The model achieved a Log-Likelihood of -183.36, indicating the fit of the model to the data.
  • Pseudo R-squared (Cox-Snell) is 0.07725, suggesting low explanatory power of the model.
  • The coefficients for Emissions and Gender, as well as their interaction (Emissions:Gender) are not statistically significant (p > 0.05), indicating that within the context of this model, there does not appear to be a statistically significant association between emissions levels, gender, and the rate of visits.
  • The coefficient for the intercept is the only parameter that was statistically significant (p < 0.01), suggesting that the base rate of visits is significant regardless of emissions or gender differences.

This suggests that the data does not provide strong evidence to reject the null hypothesis that there is no difference in the impact of high CO2 levels on asthma-related emergency visits between males and females, according to the coefficients related to Emissions and Gender.

In [58]:
print(model2.pvalues)
log_Emissions    1.893154e-25
Gender           6.095121e-01
dtype: float64

Emissions:Gender is the interaction term between Emissions and Gender in the model, indicating how the effect of Emissions on Visits changes depending on the Gender.

The term Emissions:Gender in the context of the regression model is an interaction term. An interaction term in a statistical model is used to determine if the effect of one independent variable on the dependent variable depends on the level of another independent variable. This tests whether the relationship between one predictor (in this case, "Emissions") and the response variable ("Visits") changes across the levels of another predictor (in this case, "Gender").

The inclusion of an interaction term allows the model to account for the possibility that the impact of emissions on the number of visits may be different for males and females. If the interaction term is statistically significant, it suggests that the effect of emissions on visits is not consistent across genders; the impact varies depending on whether the individual is male or female.

In the regression model output, the coefficient of the interaction term (Emissions:Gender) indicates the magnitude and direction of this interaction effect. A statistically significant coefficient for the interaction term would suggest that the relationship between emissions and visits is significantly different for males and females. Based on the provided output, the interaction term was not statistically significant, implying that the data does not provide strong evidence of a differential impact of emissions on asthma-related emergency visits between males and females.
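
One way to read the (non-significant) interaction in practice is to predict visits for each gender at the same emissions level with the fitted model; a minimal sketch using an illustrative emissions value of 100:

pred_grid = pd.DataFrame({'Emissions': [100.0, 100.0], 'Gender': [1, 0]})
print(model2_negbin.predict(pred_grid))  # predicted visits for male (1) and female (0)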

3. The null hypothesis states there is no difference in the asthma incidence rates due to CO2 levels across different states. The alternative hypothesis states there is a difference in the asthma incidence rates due to CO2 levels across different states.¶

In [59]:
# Summing visits per 10,000 for each LocationAbbr and YearStart
grouped_visits = merged_df.groupby(['LocationAbbr', 'YearStart', 'Emissions', 'Total Population'])['Visits per 10,000'].sum().reset_index()
grouped_visits['Visits per 10,000'] = grouped_visits['Visits per 10,000'].astype(int)
grouped_visits
Out[59]:
LocationAbbr YearStart Emissions Total Population Visits per 10,000
0 AR 2014 68.9 2967392.0 30
1 AR 2015 59.1 2978048.0 21
2 AR 2016 62.1 2989918.0 26
3 AR 2018 70.8 3009733.0 23
4 AZ 2010 99.5 1600442.0 170
... ... ... ... ... ...
95 WI 2014 101.2 2741489.0 65
96 WI 2015 99.9 2743124.0 47
97 WI 2016 95.6 2746333.0 53
98 WI 2017 98.6 2755404.0 53
99 WI 2018 101.2 2765223.0 51

100 rows × 5 columns

In [60]:
tukey = pairwise_tukeyhsd(endog=grouped_visits['Visits per 10,000'], groups=grouped_visits['LocationAbbr'], alpha=0.05)
tukey.summary()
Out[60]:
Multiple Comparison of Means - Tukey HSD, FWER=0.05
group1 group2 meandiff p-adj lower upper reject
AR AZ 99.2857 0.1356 -11.2941 209.8655 False
AR CA 138.0 0.5484 -59.2484 335.2484 False
AR CO 109.0 0.8893 -88.2484 306.2484 False
AR FL 14.4286 1.0 -96.1512 125.0084 False
AR GA 14.3333 1.0 -120.413 149.0797 False
AR IA 9.0 1.0 -101.5798 119.5798 False
AR KY 4.4286 1.0 -106.1512 115.0084 False
AR MA -0.5 1.0 -153.288 152.288 False
AR MD 474.0 0.0 360.1186 587.8814 True
AR MN 52.6 0.9856 -65.749 170.949 False
AR NC 8.2857 1.0 -102.2941 118.8655 False
AR NE -2.1429 1.0 -112.7227 108.437 False
AR NJ 670.7143 0.0 560.1345 781.2941 True
AR NV 448.2 0.0 329.851 566.549 True
AR NY 160.3333 0.0054 25.587 295.0797 True
AR OR -7.5 1.0 -160.288 145.288 False
AR SC 8.75 1.0 -116.0009 133.5009 False
AR UT -9.0 1.0 -143.7463 125.7463 False
AR VT 21.8 1.0 -96.549 140.149 False
AR WI 32.4286 0.9999 -78.1512 143.0084 False
AZ CA 38.7143 1.0 -149.8913 227.3198 False
AZ CO 9.7143 1.0 -178.8913 198.3198 False
AZ FL -84.8571 0.1333 -179.1599 9.4456 False
AZ GA -84.9524 0.5534 -206.6967 36.792 False
AZ IA -90.2857 0.0773 -184.5885 4.0171 False
AZ KY -94.8571 0.047 -189.1599 -0.5544 True
AZ MA -99.7857 0.5327 -241.2399 41.6685 False
AZ MD 374.7143 0.0 276.5608 472.8677 True
AZ MN -46.6857 0.9827 -149.9892 56.6178 False
AZ NC -91.0 0.0717 -185.3028 3.3028 False
AZ NE -101.4286 0.0217 -195.7314 -7.1258 True
AZ NJ 571.4286 0.0 477.1258 665.7314 True
AZ NV 348.9143 0.0 245.6108 452.2178 True
AZ NY 61.0476 0.9515 -60.6967 182.792 False
AZ OR -106.7857 0.4033 -248.2399 34.6685 False
AZ SC -90.5357 0.2605 -201.1155 20.0441 False
AZ UT -108.2857 0.1462 -230.0301 13.4586 False
AZ VT -77.4857 0.4154 -180.7892 25.8178 False
AZ WI -66.8571 0.5232 -161.1599 27.4456 False
CA CO -29.0 1.0 -278.5017 220.5017 False
CA FL -123.5714 0.6677 -312.177 65.0341 False
CA GA -123.6667 0.7856 -327.384 80.0506 False
CA IA -129.0 0.5909 -317.6056 59.6056 False
CA KY -133.5714 0.5252 -322.177 55.0341 False
CA MA -138.5 0.7044 -354.5748 77.5748 False
CA MD 336.0 0.0 145.4399 526.5601 True
CA MN -85.4 0.9865 -278.6632 107.8632 False
CA NC -129.7143 0.5806 -318.3198 58.8913 False
CA NE -140.1429 0.4333 -328.7484 48.4627 False
CA NJ 532.7143 0.0 344.1087 721.3198 True
CA NV 310.2 0.0 116.9368 503.4632 True
CA NY 22.3333 1.0 -181.384 226.0506 False
CA OR -145.5 0.6194 -361.5748 70.5748 False
CA SC -129.25 0.6675 -326.4984 67.9984 False
CA UT -147.0 0.4894 -350.7173 56.7173 False
CA VT -116.2 0.7983 -309.4632 77.0632 False
CA WI -105.5714 0.8779 -294.177 83.0341 False
CO FL -94.5714 0.9515 -283.177 94.0341 False
CO GA -94.6667 0.9769 -298.384 109.0506 False
CO IA -100.0 0.9207 -288.6056 88.6056 False
CO KY -104.5714 0.8865 -293.177 84.0341 False
CO MA -109.5 0.9466 -325.5748 106.5748 False
CO MD 365.0 0.0 174.4399 555.5601 True
CO MN -56.4 0.9999 -249.6632 136.8632 False
CO NC -100.7143 0.9159 -289.3198 87.8913 False
CO NE -111.1429 0.8235 -299.7484 77.4627 False
CO NJ 561.7143 0.0 373.1087 750.3198 True
CO NV 339.2 0.0 145.9368 532.4632 True
CO NY 51.3333 1.0 -152.384 255.0506 False
CO OR -116.5 0.909 -332.5748 99.5748 False
CO SC -100.25 0.9451 -297.4984 96.9984 False
CO UT -118.0 0.8432 -321.7173 85.7173 False
CO VT -87.2 0.983 -280.4632 106.0632 False
CO WI -76.5714 0.9949 -265.177 112.0341 False
FL GA -0.0952 1.0 -121.8396 121.6491 False
FL IA -5.4286 1.0 -99.7314 88.8742 False
FL KY -10.0 1.0 -104.3028 84.3028 False
FL MA -14.9286 1.0 -156.3827 126.5256 False
FL MD 459.5714 0.0 361.418 557.7249 True
FL MN 38.1714 0.9984 -65.1321 141.4749 False
FL NC -6.1429 1.0 -100.4456 88.1599 False
FL NE -16.5714 1.0 -110.8742 77.7314 False
FL NJ 656.2857 0.0 561.9829 750.5885 True
FL NV 433.7714 0.0 330.4679 537.0749 True
FL NY 145.9048 0.0049 24.1604 267.6491 True
FL OR -21.9286 1.0 -163.3827 119.5256 False
FL SC -5.6786 1.0 -116.2584 104.9012 False
FL UT -23.4286 1.0 -145.1729 98.3158 False
FL VT 7.3714 1.0 -95.9321 110.6749 False
FL WI 18.0 1.0 -76.3028 112.3028 False
GA IA -5.3333 1.0 -127.0777 116.411 False
GA KY -9.9048 1.0 -131.6491 111.8396 False
GA MA -14.8333 1.0 -175.886 146.2193 False
GA MD 459.6667 0.0 334.9158 584.4175 True
GA MN 38.2667 0.9999 -90.5755 167.1088 False
GA NC -6.0476 1.0 -127.792 115.6967 False
GA NE -16.4762 1.0 -138.2206 105.2682 False
GA NJ 656.381 0.0 534.6366 778.1253 True
GA NV 433.8667 0.0 305.0245 562.7088 True
GA NY 146.0 0.0433 1.9501 290.0499 True
GA OR -21.8333 1.0 -182.886 139.2193 False
GA SC -5.5833 1.0 -140.3297 129.163 False
GA UT -23.3333 1.0 -167.3832 120.7165 False
GA VT 7.4667 1.0 -121.3755 136.3088 False
GA WI 18.0952 1.0 -103.6491 139.8396 False
IA KY -4.5714 1.0 -98.8742 89.7314 False
IA MA -9.5 1.0 -150.9542 131.9542 False
IA MD 465.0 0.0 366.8466 563.1534 True
IA MN 43.6 0.9919 -59.7035 146.9035 False
IA NC -0.7143 1.0 -95.0171 93.5885 False
IA NE -11.1429 1.0 -105.4456 83.1599 False
IA NJ 661.7143 0.0 567.4115 756.0171 True
IA NV 439.2 0.0 335.8965 542.5035 True
IA NY 151.3333 0.0027 29.589 273.0777 True
IA OR -16.5 1.0 -157.9542 124.9542 False
IA SC -0.25 1.0 -110.8298 110.3298 False
IA UT -18.0 1.0 -139.7444 103.7444 False
IA VT 12.8 1.0 -90.5035 116.1035 False
IA WI 23.4286 1.0 -70.8742 117.7314 False
KY MA -4.9286 1.0 -146.3827 136.5256 False
KY MD 469.5714 0.0 371.418 567.7249 True
KY MN 48.1714 0.9761 -55.1321 151.4749 False
KY NC 3.8571 1.0 -90.4456 98.1599 False
KY NE -6.5714 1.0 -100.8742 87.7314 False
KY NJ 666.2857 0.0 571.9829 760.5885 True
KY NV 443.7714 0.0 340.4679 547.0749 True
KY NY 155.9048 0.0017 34.1604 277.6491 True
KY OR -11.9286 1.0 -153.3827 129.5256 False
KY SC 4.3214 1.0 -106.2584 114.9012 False
KY UT -13.4286 1.0 -135.1729 108.3158 False
KY VT 17.3714 1.0 -85.9321 120.6749 False
KY WI 28.0 0.9999 -66.3028 122.3028 False
MA MD 474.5 0.0 330.4501 618.5499 True
MA MN 53.1 0.9989 -94.5072 200.7072 False
MA NC 8.7857 1.0 -132.6685 150.2399 False
MA NE -1.6429 1.0 -143.097 139.8113 False
MA NJ 671.2143 0.0 529.7601 812.6685 True
MA NV 448.7 0.0 301.0928 596.3072 True
MA NY 160.8333 0.0507 -0.2193 321.886 False
MA OR -7.0 1.0 -183.4243 169.4243 False
MA SC 9.25 1.0 -143.538 162.038 False
MA UT -8.5 1.0 -169.5527 152.5527 False
MA VT 22.3 1.0 -125.3072 169.9072 False
MA WI 32.9286 1.0 -108.5256 174.3827 False
MD MN -421.4 0.0 -528.2302 -314.5698 True
MD NC -465.7143 0.0 -563.8677 -367.5608 True
MD NE -476.1429 0.0 -574.2963 -377.9894 True
MD NJ 196.7143 0.0 98.5608 294.8677 True
MD NV -25.8 1.0 -132.6302 81.0302 False
MD NY -313.6667 0.0 -438.4175 -188.9158 True
MD OR -481.5 0.0 -625.5499 -337.4501 True
MD SC -465.25 0.0 -579.1314 -351.3686 True
MD UT -483.0 0.0 -607.7509 -358.2491 True
MD VT -452.2 0.0 -559.0302 -345.3698 True
MD WI -441.5714 0.0 -539.7249 -343.418 True
MN NC -44.3143 0.9903 -147.6178 58.9892 False
MN NE -54.7429 0.9211 -158.0464 48.5607 False
MN NJ 618.1143 0.0 514.8108 721.4178 True
MN NV 395.6 0.0 284.0194 507.1806 True
MN NY 107.7333 0.2279 -21.1088 236.5755 False
MN OR -60.1 0.9947 -207.7072 87.5072 False
MN SC -43.85 0.9984 -162.199 74.499 False
MN UT -61.6 0.9692 -190.4421 67.2421 False
MN VT -30.8 1.0 -142.3806 80.7806 False
MN WI -20.1714 1.0 -123.4749 83.1321 False
NC NE -10.4286 1.0 -104.7314 83.8742 False
NC NJ 662.4286 0.0 568.1258 756.7314 True
NC NV 439.9143 0.0 336.6108 543.2178 True
NC NY 152.0476 0.0025 30.3033 273.792 True
NC OR -15.7857 1.0 -157.2399 125.6685 False
NC SC 0.4643 1.0 -110.1155 111.0441 False
NC UT -17.2857 1.0 -139.0301 104.4586 False
NC VT 13.5143 1.0 -89.7892 116.8178 False
NC WI 24.1429 1.0 -70.1599 118.4456 False
NE NJ 672.8571 0.0 578.5544 767.1599 True
NE NV 450.3429 0.0 347.0393 553.6464 True
NE NY 162.4762 0.0008 40.7318 284.2206 True
NE OR -5.3571 1.0 -146.8113 136.097 False
NE SC 10.8929 1.0 -99.687 121.4727 False
NE UT -6.8571 1.0 -128.6015 114.8872 False
NE VT 23.9429 1.0 -79.3607 127.2464 False
NE WI 34.5714 0.9986 -59.7314 128.8742 False
NJ NV -222.5143 0.0 -325.8178 -119.2108 True
NJ NY -510.381 0.0 -632.1253 -388.6366 True
NJ OR -678.2143 0.0 -819.6685 -536.7601 True
NJ SC -661.9643 0.0 -772.5441 -551.3845 True
NJ UT -679.7143 0.0 -801.4586 -557.9699 True
NJ VT -648.9143 0.0 -752.2178 -545.6108 True
NJ WI -638.2857 0.0 -732.5885 -543.9829 True
NV NY -287.8667 0.0 -416.7088 -159.0245 True
NV OR -455.7 0.0 -603.3072 -308.0928 True
NV SC -439.45 0.0 -557.799 -321.101 True
NV UT -457.2 0.0 -586.0421 -328.3579 True
NV VT -426.4 0.0 -537.9806 -314.8194 True
NV WI -415.7714 0.0 -519.0749 -312.4679 True
NY OR -167.8333 0.0317 -328.886 -6.7807 True
NY SC -151.5833 0.0121 -286.3297 -16.837 True
NY UT -169.3333 0.0065 -313.3832 -25.2835 True
NY VT -138.5333 0.0218 -267.3755 -9.6912 True
NY WI -127.9048 0.0288 -249.6491 -6.1604 True
OR SC 16.25 1.0 -136.538 169.038 False
OR UT -1.5 1.0 -162.5527 159.5527 False
OR VT 29.3 1.0 -118.3072 176.9072 False
OR WI 39.9286 1.0 -101.5256 181.3827 False
SC UT -17.75 1.0 -152.4963 116.9963 False
SC VT 13.05 1.0 -105.299 131.399 False
SC WI 23.6786 1.0 -86.9012 134.2584 False
UT VT 30.8 1.0 -98.0421 159.6421 False
UT WI 41.4286 0.9995 -80.3158 163.1729 False
VT WI 10.6286 1.0 -92.6749 113.9321 False
In [61]:
tukey_results_df = pd.DataFrame(data=tukey._results_table.data[1:], columns=tukey._results_table.data[0])
tukey_results_filter = tukey_results_df[tukey_results_df['p-adj'] != 1]
tukey_results_filter
Out[61]:
group1 group2 meandiff p-adj lower upper reject
0 AR AZ 99.2857 0.1356 -11.2941 209.8655 False
1 AR CA 138.0000 0.5484 -59.2484 335.2484 False
2 AR CO 109.0000 0.8893 -88.2484 306.2484 False
8 AR MD 474.0000 0.0000 360.1186 587.8814 True
9 AR MN 52.6000 0.9856 -65.7490 170.9490 False
... ... ... ... ... ... ... ...
196 NY SC -151.5833 0.0121 -286.3297 -16.8370 True
197 NY UT -169.3333 0.0065 -313.3832 -25.2835 True
198 NY VT -138.5333 0.0218 -267.3755 -9.6912 True
199 NY WI -127.9048 0.0288 -249.6491 -6.1604 True
208 UT WI 41.4286 0.9995 -80.3158 163.1729 False

128 rows × 7 columns

In [62]:
accepted = tukey_results_filter.sort_values(by=['p-adj']).iloc[56:65, :]
accepted
Out[62]:
group1 group2 meandiff p-adj lower upper reject
176 NE NY 162.4762 0.0008 40.7318 284.2206 True
126 KY NY 155.9048 0.0017 34.1604 277.6491 True
168 NC NY 152.0476 0.0025 30.3033 273.7920 True
113 IA NY 151.3333 0.0027 29.5890 273.0777 True
84 FL NY 145.9048 0.0049 24.1604 267.6491 True
14 AR NY 160.3333 0.0054 25.5870 295.0797 True
197 NY UT -169.3333 0.0065 -313.3832 -25.2835 True
196 NY SC -151.5833 0.0121 -286.3297 -16.8370 True
30 AZ NE -101.4286 0.0217 -195.7314 -7.1258 True
In [63]:
tukey_alpha = 0.05

# Benjamini-Hochberg: sort the adjusted p-values and compare the k-th smallest
# against its threshold (k / m) * alpha
tukey_pvalues_sorted = np.sort(tukey_results_filter['p-adj'])
tukey_m = len(tukey_pvalues_sorted)
tukey_thresholds = np.array([(rank / tukey_m) * tukey_alpha for rank in range(1, tukey_m + 1)])
# Largest rank k whose p-value still falls below its BH threshold
tukey_k = max(np.where(tukey_pvalues_sorted <= tukey_thresholds)[0] + 1)

plt.figure(figsize=(10, 6))
plt.plot(tukey_pvalues_sorted)
plt.plot(tukey_thresholds, label='BH Boundary')
plt.axvline(x=tukey_k-1, linestyle='--', label=f'Cut-off at k={tukey_k-1}')
plt.xlabel('P-Value Rank')
plt.ylabel('P-Value')
plt.title('Sorted P-Values and BH Decision Boundary')
plt.legend()
plt.show()
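As a cross-check on this hand-rolled BH procedure, the same correction is available from statsmodels; a minimal sketch (not part of the original cells, and assuming tukey_results_filter from the cell above is still in scope):

from statsmodels.stats.multitest import multipletests

# BH (FDR) correction over the same filtered Tukey p-values
reject_bh, pvals_bh, _, _ = multipletests(tukey_results_filter['p-adj'], alpha=0.05, method='fdr_bh')
print(f"Comparisons still rejected after BH: {reject_bh.sum()} of {len(reject_bh)}")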

4. The null hypothesis states there is no difference in asthma incidence rates due to CO2 levels between the two periods 2010-2014 and 2015-2018. The alternative hypothesis states there is a difference in asthma incidence rates due to CO2 levels between those two periods.¶

In [64]:
merged_df['Emissions'] = pd.to_numeric(merged_df['Emissions'], errors='coerce')
carbondioxide_2010_to_2014 = np.mean(merged_df[(merged_df['YearStart'] >= 2010) & (merged_df['YearStart'] <= 2014)]['Emissions'])
carbondioxide_2015_to_2018 = np.mean(merged_df[(merged_df['YearStart'] >= 2015) & (merged_df['YearStart'] <= 2018)]['Emissions'])

asthma_rates_2010_to_2014 = np.mean(merged_df[(merged_df['YearStart'] >= 2010) & (merged_df['YearStart'] <= 2014)]["Visits per 10,000"])
asthma_rates_2015_to_2018 = np.mean(merged_df[(merged_df['YearStart'] >= 2015) & (merged_df['YearStart'] <= 2018)]["Visits per 10,000"])

t_test_result_asthma = ttest_ind(
    merged_df[(merged_df['YearStart'] >= 2010) & (merged_df['YearStart'] <= 2014)]["Visits per 10,000"],
    merged_df[(merged_df['YearStart'] >= 2015) & (merged_df['YearStart'] <= 2018)]["Visits per 10,000"],
    nan_policy='omit'
)

print(f"2010-2014 Mean Carbon Dioxide Emissions: {carbondioxide_2010_to_2014:.2f}")
print(f"2015-2018 Mean Carbon Dioxide Emissions: {carbondioxide_2015_to_2018:.2f}")
print(f"2010-2014 Mean Asthma Visit Rates: {asthma_rates_2010_to_2014:.2f}")
print(f"2015-2018 Mean Asthma Visit Rates: {asthma_rates_2015_to_2018:.2f}")
print("T-test result for asthma based on CO2 levels:", t_test_result_asthma)
2010-2014 Mean Carbon Dioxide Emissions: 109.04
2015-2018 Mean Carbon Dioxide Emissions: 92.64
2010-2014 Mean Asthma Visit Rates: 83.80
2015-2018 Mean Asthma Visit Rates: 65.67
T-test result for asthma based on CO2 levels: Ttest_indResult(statistic=1.1893813840193959, pvalue=0.23571396070452827)

The t-test works for this analysis because it is designed to compare the means of two independent samples. In this case, the two samples are the asthma visit rates from 2010-2014 and from 2015-2018. The t-test is appropriate for answering whether there is a statistically significant difference between the means of these two periods, assuming the visit rates are roughly normally distributed within each period or the samples are large enough for the Central Limit Theorem to apply.
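Because the test leans on those assumptions, a quick diagnostic of equal variances and approximate normality can be run on the two samples first. A minimal sketch, reusing merged_df from the cell above (this check is not part of the original analysis):

from scipy.stats import levene, shapiro

visits_early = merged_df[(merged_df['YearStart'] >= 2010) & (merged_df['YearStart'] <= 2014)]["Visits per 10,000"].dropna()
visits_late = merged_df[(merged_df['YearStart'] >= 2015) & (merged_df['YearStart'] <= 2018)]["Visits per 10,000"].dropna()

# Levene's test for equal variances, Shapiro-Wilk for approximate normality
print("Levene (equal variances):", levene(visits_early, visits_late))
print("Shapiro 2010-2014:", shapiro(visits_early))
print("Shapiro 2015-2018:", shapiro(visits_late))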

In [65]:
n1 = (len(merged_df[(merged_df['YearStart'] >= 2010) & (merged_df['YearStart'] <= 2014)]["Visits per 10,000"]))
n2 = len(merged_df[(merged_df['YearStart'] >= 2015) & (merged_df['YearStart'] <= 2018)]["Visits per 10,000"])
print(n1, n2, 2)
82 118 2
In [66]:
# define parameters for the power analysis of the t-test
# 83.80 is the mean asthma visit rate for 2010-2014
# 65.67 is the mean asthma visit rate for 2015-2018
# 14 is the count applied to each group's term in the pooled estimate below
# 198 = 82 + 118 - 2 (pooled degrees of freedom)

effect_size = (83.80 - 65.67) / np.sqrt(((14 * 83.80**2) + (14 * 65.67**2)) / 198)
sample_size = 120
alpha = 0.05
power = TTestIndPower().solve_power(effect_size, nobs1=sample_size, alpha=alpha, ratio=1, alternative='two-sided')

print(f"Effect Size: {effect_size:.4f}")
print(f"Power of the test: {power:.4f}")
Effect Size: 0.6404
Power of the test: 0.9986

The power of a t-test is the probability that the test will correctly reject a false null hypothesis (i.e., it will not make a Type II error). The parameters involved in calculating the power of a t-test include:

  • Effect Size: the magnitude of the phenomenon (the difference between group means) divided by the standard deviation of the data; it quantifies the size of the difference relative to the variability in the data. In this context, the effect size was calculated from the difference in mean asthma visit rates between the two periods (2010-2014 and 2015-2018), normalized by the pooled standard deviation of these rates.

  • Sample Size: the number of observations in each group; larger samples make it easier to detect a true difference of a given size.

  • Significance Level (alpha): the probability of a Type I error the test is willing to tolerate, set here to 0.05.

The results from the power analysis are as follows:

  • Effect Size: 0.6404, indicating a medium-to-large effect by common benchmarks. This means the difference in mean asthma visit rates between the two periods is moderately large relative to the variability in the data.

  • Power of the test: 0.9986, which is above the commonly accepted threshold of 0.8. This means there is a 99.86% chance of correctly rejecting the null hypothesis if it is false.
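The same power calculation can also be run in reverse to ask how many observations per group would be needed to reach the conventional 0.8 power at this effect size; a small sketch (not part of the original cells) using the effect_size and alpha defined above:

from statsmodels.stats.power import TTestIndPower

# Solve for the per-group sample size that reaches 80% power
required_n = TTestIndPower().solve_power(effect_size=effect_size, power=0.8, alpha=alpha,
                                         ratio=1, alternative='two-sided')
print(f"Observations per group needed for 80% power: {required_n:.1f}")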

Research Question 2: Is there a difference in asthma incidence rates across states due to differences in altitude?¶

In [67]:
# State altitude (source https://simple.wikipedia.org/wiki/List_of_U.S._states_by_elevation)
# Data from Wikipedia
data = {
    'State': ['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado', 'Connecticut', 'Delaware', 'District of Columbia', 'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire', 'New Jersey', 'New Mexico', 'New York', 'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania', 'Rhode Island', 'South Carolina', 'South Dakota', 'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington', 'West Virginia', 'Wisconsin', 'Wyoming'],
    'Highest Point': ['Cheaha Mountain', 'Mount McKinley (Denali)', 'Humphreys Peak', 'Magazine Mountain', 'Mount Whitney', 'Mount Elbert', 'Mount Frissell', 'Ebright Azimuth', 'Tenleytown', 'Britton Hill', 'Brasstown Bald', 'Mauna Kea', 'Borah Peak', 'Charles Mound', 'Hoosier Hill', 'Hawkeye Point', 'Mount Sunflower', 'Black Mountain', 'Driskill Mountain', 'Mount Katahdin', 'Backbone Mountain', 'Mount Greylock', 'Mount Arvon', 'Eagle Mountain', 'Woodall Mountain', 'Taum Sauk Mountain', 'Granite Peak', 'Panorama Point', 'Boundary Peak', 'Mount Washington', 'High Point', 'Wheeler Peak', 'Mount Marcy', 'Mount Mitchell', 'White Butte', 'Campbell Hill', 'Black Mesa', 'Mount Hood', 'Mount Davis', 'Jerimoth Hill', 'Sassafras Mountain', 'Harney Peak', 'Clingmans Dome', 'Guadalupe Peak', 'Kings Peak', 'Mount Mansfield', 'Mount Rogers', 'Mount Rainier', 'Spruce Knob', 'Timms Hill', 'Gannett Peak'],
    'Highest Elevation (ft)': [2405, 20320, 12633, 2753, 14505, 14440, 2380, 448, 410, 345, 4784, 13796, 12668, 1235, 1257, 1670, 4039, 4145, 535, 5267, 3360, 3487, 1979, 2301, 806, 1772, 12799, 5424, 13140, 6288, 1803, 13161, 5344, 6684, 3506, 1549, 4973, 11239, 3213, 812, 3560, 7242, 6643, 8749, 13528, 4393, 5729, 14410, 4861, 1951, 13804],
    'Highest Elevation (m)': [733, 6198, 3853, 840, 4421, 4401, 726, 137, 125, 105, 1459, 4208, 3862, 377, 383, 509, 1232, 1263, 163, 1606, 1025, 1064, 604, 702, 246, 540, 3904, 1654, 4007, 1918, 550, 4014, 1630, 2039, 1069, 472, 1517, 3428, 980, 248, 1086, 2209, 2026, 2668, 4126, 1340, 1747, 4395, 1483, 595, 4210],
    'Lowest Point': ['Gulf of Mexico', 'Pacific Ocean', 'Colorado River', 'Ouachita River', 'Death Valley', 'Arikaree River', 'Long Island Sound', 'Atlantic Ocean', 'Potomac River', 'Atlantic Ocean', 'Atlantic Ocean', 'Pacific Ocean', 'Snake River', 'Mississippi River', 'Ohio River', 'Mississippi River', 'Verdigris River', 'Mississippi River', 'New Orleans', 'Atlantic Ocean', 'Atlantic Ocean', 'Atlantic Ocean', 'Lake Erie', 'Lake Superior', 'Gulf of Mexico', 'Saint Francis River', 'Kootenai River', 'Missouri River', 'Colorado River', 'Atlantic Ocean', 'Atlantic Ocean', 'Red Bluff Reservoir', 'Atlantic Ocean', 'Atlantic Ocean', 'Red River', 'Ohio River', 'Little River', 'Pacific Ocean', 'Delaware River', 'Atlantic Ocean', 'Atlantic Ocean', 'Big Stone Lake', 'Mississippi River', 'Gulf of Mexico', 'Beaver Dam Wash', 'Lake Champlain', 'Atlantic Ocean', 'Pacific Ocean', 'Potomac River', 'Lake Michigan', 'Belle Fourche River'],
    'Lowest Elevation (ft)': [0, 0, 70, 55, -282, 3315, 0, 0, 1, 0, 0, 0, 710, 279, 320, 480, 679, 257, -8, 0, 0, 0, 571, 601, 0, 230, 1800, 840, 479, 0, 0, 2842, 0, 0, 750, 455, 289, 0, 0, 0, 0, 966, 178, 0, 2000, 95, 0, 0, 240, 579, 3099],
    'Lowest Elevation (m)': [0, 0, 21, 17, -86, 1010, 0, 0, 0, 0, 0, 0, 217, 85, 98, 146, 207, 78, -2, 0, 0, 0, 174, 183, 0, 70, 549, 256, 146, 0, 0, 867, 0, 0, 229, 139, 88, 0, 0, 0, 0, 295, 54, 0, 610, 29, 0, 0, 73, 177, 945],
    'Average Elevation (ft)': [500, 1900, 4100, 650, 2900, 6800, 500, 60, 150, 100, 600, 3030, 5000, 600, 700, 1100, 2000, 750, 100, 600, 350, 500, 900, 1200, 300, 800, 3400, 2600, 5500, 1000, 250, 5700, 1000, 700, 1900, 850, 1300, 3300, 1100, 350, 350, 2200, 900, 1700, 6100, 1000, 950, 1700, 1500, 1050, 6700],
    'Average Elevation (m)': [150, 580, 1300, 200, 880, 2100, 150, 20, 50, 30, 180, 920, 1500, 180, 210, 340, 610, 230, 30, 180, 110, 150, 280, 366, 90, 240, 1000, 790, 1700, 300, 80, 1700, 300, 210, 580, 260, 400, 1000, 340, 110, 110, 670, 280, 520, 1860, 300, 290, 520, 460, 320, 2044],
    'Elevation Difference (ft)': [2405, 20320, 12563, 2698, 14787, 11125, 2380, 448, 409, 345, 4784, 13796, 11958, 956, 937, 1190, 3360, 3888, 543, 5267, 3360, 3487, 1408, 1700, 806, 1542, 10999, 4584, 12661, 6288, 1803, 10319, 5344, 6684, 2756, 1094, 4684, 11239, 3213, 812, 3560, 6276, 6465, 8749, 11528, 4298, 5729, 14410, 4621, 1372, 10705],
    'Elevation Difference (m)': [733, 6198, 3832, 823, 4507, 3391, 726, 137, 125, 105, 1459, 4208, 3645, 292, 285, 363, 1025, 1185, 165, 1606, 1025, 1064, 430, 519, 246, 470, 3355, 1398, 3861, 1918, 550, 3147, 1630, 2039, 840, 333, 1429, 3428, 980, 248, 1086, 1914, 1972, 2668, 3516, 1311, 1747, 4395, 1410, 418, 3265]
}

elevationdf = pd.DataFrame(data)
elevationdf.head()
Out[67]:
State Highest Point Highest Elevation (ft) Highest Elevation (m) Lowest Point Lowest Elevation (ft) Lowest Elevation (m) Average Elevation (ft) Average Elevation (m) Elevation Difference (ft) Elevation Difference (m)
0 Alabama Cheaha Mountain 2405 733 Gulf of Mexico 0 0 500 150 2405 733
1 Alaska Mount McKinley (Denali) 20320 6198 Pacific Ocean 0 0 1900 580 20320 6198
2 Arizona Humphreys Peak 12633 3853 Colorado River 70 21 4100 1300 12563 3832
3 Arkansas Magazine Mountain 2753 840 Ouachita River 55 17 650 200 2698 823
4 California Mount Whitney 14505 4421 Death Valley -282 -86 2900 880 14787 4507
In [68]:
regions_dict = {
    'NERC': {'PSTATABB'}, 
    'AK': {'AK'}, 
    'SERC': {'AL', 'GA', 'IL', 'OH', 'OK', 'NC', 'FL', 'TN', 'VA', 'LA', 'MD', 'AR', 'MS', 'KY', 'IA', 'WV', 'MO', 'NY', 'DC', 'TX', 'SC'}, 
    'MRO': {'AR', 'ND', 'KS', 'IA', 'NE', 'MN', 'LA', 'MO', 'IL', 'OK', 'SD', 'NM', 'MT', 'CO', 'WI', 'IN', 'TX'}, 
    'WECC': {'CA', 'MA', 'WY', 'AZ', 'NV', 'SD', 'ID', 'OR', 'UT', 'NM', 'MT', 'CO', 'WA', 'TX'}, 
    'NPCC': {'CT', 'VT', 'MA', 'PA', 'IL', 'RI', 'ME', 'NJ', 'NY', 'MD', 'NH'}, 
    'RFC': {'CT', 'MD', 'MA', 'PA', 'KY', 'VA', 'MN', 'MI', 'WV', 'IL', 'OH', 'RI', 'NJ', 'NY', 'DE', 'WI', 'IN', 'DC'}, 
    'HI': {'HI'}, 
    'TRE': {'OK', 'TX'}, 
    'PR': {'PR'}
}

if 'NERC' in regions_dict:
    del regions_dict['NERC']

for key in regions_dict:
    regions_dict[key] = list(regions_dict[key])


statesInRegion_df = pd.DataFrame(list(regions_dict.items()), columns=['NERC_Region', 'Plant_States'])
statesInRegion_df.head()
Out[68]:
NERC_Region Plant_States
0 AK [AK]
1 SERC [GA, KY, WV, VA, MS, MO, DC, IA, NC, OH, AL, F...
2 MRO [ND, NE, CO, MO, MT, IA, KS, LA, MN, IN, IL, W...
3 WECC [MA, ID, OR, CO, WY, NV, MT, SD, TX, AZ, WA, U...
4 NPCC [MA, PA, NJ, IL, ME, NH, MD, NY, VT, CT, RI]
In [69]:
data = {
    'State': ['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado', 'Connecticut', 'Delaware', 
              'District of Columbia', 'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa', 'Kansas', 
              'Kentucky', 'Louisiana', 'Maine', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi', 
              'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire', 'New Jersey', 'New Mexico', 'New York', 
              'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania', 'Rhode Island', 
              'South Carolina', 'South Dakota', 'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington', 
              'West Virginia', 'Wisconsin', 'Wyoming'],
    'Average Elevation (ft)': [500, 1900, 4100, 650, 2900, 6800, 500, 60, 150, 100, 600, 3030, 5000, 600, 700, 1100, 2000,
                               750, 100, 600, 350, 500, 900, 1200, 300, 800, 3400, 2600, 5500, 1000, 250, 5700, 1000, 700,
                               1900, 850, 1300, 3300, 1100, 350, 350, 2200, 900, 1700, 6100, 1000, 950, 1700, 1500, 1050,
                               6700]
}

elevationdf = pd.DataFrame(data)

regions_dict = {
    'AK': ['Alaska'], 
    'SERC': ['Alabama', 'Georgia', 'Illinois', 'Ohio', 'Oklahoma', 'North Carolina', 'Florida', 'Tennessee', 'Virginia', 
             'Louisiana', 'Maryland', 'Arkansas', 'Mississippi', 'Kentucky', 'Iowa', 'West Virginia', 'Missouri', 
             'New York', 'District of Columbia', 'Texas', 'South Carolina'], 
    'MRO': ['Arkansas', 'North Dakota', 'Kansas', 'Iowa', 'Nebraska', 'Minnesota', 'Louisiana', 'Missouri', 'Illinois', 
            'Oklahoma', 'South Dakota', 'New Mexico', 'Montana', 'Colorado', 'Wisconsin', 'Indiana', 'Texas'], 
    'WECC': ['California', 'Massachusetts', 'Wyoming', 'Arizona', 'Nevada', 'South Dakota', 'Idaho', 'Oregon', 'Utah', 
             'New Mexico', 'Montana', 'Colorado', 'Washington', 'Texas'], 
    'NPCC': ['Connecticut', 'Vermont', 'Massachusetts', 'Pennsylvania', 'Illinois', 'Rhode Island', 'Maine', 'New Jersey', 
             'New York', 'Maryland', 'New Hampshire'], 
    'RFC': ['Connecticut', 'Maryland', 'Massachusetts', 'Pennsylvania', 'Kentucky', 'Virginia', 'Minnesota', 'Michigan', 
            'West Virginia', 'Illinois', 'Ohio', 'Rhode Island', 'New Jersey', 'New York', 'Delaware', 'Wisconsin', 'Indiana', 
            'District of Columbia'], 
    'HI': ['Hawaii'], 
    'TRE': ['Oklahoma', 'Texas']
}

flattened_data = []
for region, states in regions_dict.items():
    for state in states:
        flattened_data.append({'NERC_Region': region, 'State': state})

regions_df = pd.DataFrame(flattened_data)

merged_df = pd.merge(regions_df, elevationdf, on='State')

average_elevations = merged_df.groupby('NERC_Region')['Average Elevation (ft)'].mean().reset_index()

average_elevations.head(10)
Out[69]:
NERC_Region Average Elevation (ft)
0 AK 1900.000000
1 HI 3030.000000
2 MRO 1988.235294
3 NPCC 659.090909
4 RFC 708.888889
5 SERC 726.190476
6 TRE 1500.000000
7 WECC 3971.428571
In [70]:
state_average_visits = q2_with_emissions.groupby('LocationAbbr')['Visits per 10,000'].mean().reset_index()
state_average_visits['Visits per 10,000'] = state_average_visits['Visits per 10,000'].astype(int)
In [71]:
asthma_data = {
    'LocationAbbr': ['AR', 'AZ', 'CA', 'CO', 'FL', 'GA', 'IA', 'KY', 'MA', 'MD', 'MN', 'NC', 'NE', 'NJ', 'NV', 'NY', 'OR', 
                     'SC', 'UT', 'VT', 'WI'],
    'Visits per 10,000': [25, 124, 163, 134, 39, 39, 34, 29, 24, 499, 77, 33, 23, 696, 473, 185, 18, 34, 16, 47, 57]
}
asthma_df = pd.DataFrame(asthma_data)

state_names = {
    'AR': 'Arkansas', 'AZ': 'Arizona', 'CA': 'California', 'CO': 'Colorado', 'FL': 'Florida', 'GA': 'Georgia', 'IA': 'Iowa',
    'KY': 'Kentucky', 'MA': 'Massachusetts', 'MD': 'Maryland', 'MN': 'Minnesota', 'NC': 'North Carolina', 'NE': 'Nebraska', 
    'NJ': 'New Jersey', 'NV': 'Nevada', 'NY': 'New York', 'OR': 'Oregon', 'SC': 'South Carolina', 'UT': 'Utah', 
    'VT': 'Vermont', 'WI': 'Wisconsin'
}
asthma_df['State'] = asthma_df['LocationAbbr'].map(state_names)

regions_dict = {
    #'AK': ['Alaska'], 
    'SERC': ['Alabama', 'Georgia', 'Illinois', 'Ohio', 'Oklahoma', 'North Carolina', 'Florida', 'Tennessee', 'Virginia', 
             'Louisiana', 'Maryland', 'Arkansas', 'Mississippi', 'Kentucky', 'Iowa', 'West Virginia', 'Missouri', 
             'New York', 'District of Columbia', 'Texas', 'South Carolina'], 
    'MRO': ['Arkansas', 'North Dakota', 'Kansas', 'Iowa', 'Nebraska', 'Minnesota', 'Louisiana', 'Missouri', 'Illinois', 
            'Oklahoma', 'South Dakota', 'New Mexico', 'Montana', 'Colorado', 'Wisconsin', 'Indiana', 'Texas'], 
    'WECC': ['California', 'Massachusetts', 'Wyoming', 'Arizona', 'Nevada', 'South Dakota', 'Idaho', 'Oregon', 'Utah', 
             'New Mexico', 'Montana', 'Colorado', 'Washington', 'Texas'], 
    'NPCC': ['Connecticut', 'Vermont', 'Massachusetts', 'Pennsylvania', 'Illinois', 'Rhode Island', 'Maine', 'New Jersey', 
             'New York', 'Maryland', 'New Hampshire'], 
    'RFC': ['Connecticut', 'Maryland', 'Massachusetts', 'Pennsylvania', 'Kentucky', 'Virginia', 'Minnesota', 'Michigan', 
            'West Virginia', 'Illinois', 'Ohio', 'Rhode Island', 'New Jersey', 'New York', 'Delaware', 'Wisconsin', 'Indiana', 
            'District of Columbia'], 
}

flattened_data = [{'NERC_Region': region, 'State': state} for region, states in regions_dict.items() for state in states]
regions_df = pd.DataFrame(flattened_data)

merged_df = pd.merge(regions_df, asthma_df, on='State', how='left')

visits_sum = merged_df.groupby('NERC_Region').agg(
    Total_Visits_Per_10000=('Visits per 10,000', 'sum'),
    Count_States=('State', 'count')
)
visits_sum['Average_Visits_Per_10000'] = visits_sum['Total_Visits_Per_10000'] / visits_sum['Count_States']

average_elevations = {
    'NERC_Region': ['SERC', 'MRO', 'WECC', 'NPCC', 'RFC', 'HI', 'TRE', 'AK'],
    'Average Elevation (ft)': [800, 2000, 3000, 400, 1100, 3030, 1300, 1900]
}
average_elevations_df = pd.DataFrame(average_elevations)

final_df = pd.merge(visits_sum.reset_index(), average_elevations_df, on='NERC_Region', how='left')
final_df.head(10)
Out[71]:
NERC_Region Total_Visits_Per_10000 Count_States Average_Visits_Per_10000 Average Elevation (ft)
0 MRO 350.0 17 20.588235 2000
1 NPCC 1451.0 11 131.909091 400
2 RFC 1567.0 18 87.055556 1100
3 SERC 917.0 21 43.666667 800
4 WECC 952.0 14 68.000000 3000
In [72]:
X = final_df['Average Elevation (ft)'] 
y = final_df['Average_Visits_Per_10000'] 
X = sm.add_constant(X) 


model = sm.OLS(y, X).fit()

print(model.summary())
                               OLS Regression Results                               
====================================================================================
Dep. Variable:     Average_Visits_Per_10000   R-squared:                       0.224
Model:                                  OLS   Adj. R-squared:                 -0.035
Method:                       Least Squares   F-statistic:                    0.8661
Date:                      Mon, 06 May 2024   Prob (F-statistic):              0.421
Time:                              11:42:49   Log-Likelihood:                -24.663
No. Observations:                         5   AIC:                             53.33
Df Residuals:                             3   BIC:                             52.55
Df Model:                                 1                                         
Covariance Type:                  nonrobust                                         
==========================================================================================
                             coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------
const                     98.4730     35.996      2.736      0.072     -16.084     213.030
Average Elevation (ft)    -0.0193      0.021     -0.931      0.421      -0.085       0.047
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   2.544
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.627
Skew:                          -0.156   Prob(JB):                        0.731
Kurtosis:                       1.294   Cond. No.                     3.22e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.22e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
C:\Users\tranh\anaconda3\lib\site-packages\statsmodels\stats\stattools.py:74: ValueWarning: omni_normtest is not valid with less than 8 observations; 5 samples were given.
  warn("omni_normtest is not valid with less than 8 observations; %i "
In [73]:
Global_P_Values = []

Global_P_Values.extend([0.000, 0.610, 0.095121e-01])

p_adj_values = accepted['p-adj'].tolist()
Global_P_Values.extend(p_adj_values)

# Global_P_Values.extend([3.011885e-31, 3.287227e-02, 1.964625e-01, 4.996776e-01, 5.045825e-02, 4.432867e-01])
Global_P_Values.append(4.432867e-01)
Global_P_Values.extend([0.072, 0.421])
print(Global_P_Values)

alpha = 0.05 

pvalues_sorted = np.sort(Global_P_Values)
print(pvalues_sorted)
m = len(pvalues_sorted)
thresholds = np.array([(i / m) * alpha for i in range(1, m + 1)])

k = max(np.where(pvalues_sorted <= thresholds)[0] + 1) if np.any(pvalues_sorted <= thresholds) else 0

plt.figure(figsize=(10, 6))
plt.plot(pvalues_sorted, label='Sorted P-values')
plt.plot(thresholds, label='BH Boundary')
plt.axvline(x=k-1, linestyle='--', color='red', label=f'Cut-off at k={k-1}')
plt.xlabel('P-Value Rank')
plt.ylabel('P-Value')
plt.title('Sorted P-Values and BH Decision Boundary')
plt.legend()
plt.show()
[0.0, 0.61, 0.0095121, 0.0008, 0.0017, 0.0025, 0.0027, 0.0049, 0.0054, 0.0065, 0.0121, 0.0217, 0.4432867, 0.072, 0.421]
[0.        0.0008    0.0017    0.0025    0.0027    0.0049    0.0054
 0.0065    0.0095121 0.0121    0.0217    0.072     0.421     0.4432867
 0.61     ]
In [74]:
merged_df
Out[74]:
NERC_Region State LocationAbbr Visits per 10,000
0 SERC Alabama NaN NaN
1 SERC Georgia GA 39.0
2 SERC Illinois NaN NaN
3 SERC Ohio NaN NaN
4 SERC Oklahoma NaN NaN
... ... ... ... ...
76 RFC New York NY 185.0
77 RFC Delaware NaN NaN
78 RFC Wisconsin WI 57.0
79 RFC Indiana NaN NaN
80 RFC District of Columbia NaN NaN

81 rows × 4 columns

In [75]:
avg_elevation_1 = {
    'State': ['AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', 'DC', 'FL', 'GA', 'HI',
    'ID', 'IL', 'IN', 'IA', 'KS', 'KY', 'LA', 'ME', 'MD', 'MA', 'MI', 'MN',
    'MS', 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ', 'NM', 'NY', 'NC', 'ND', 'OH',
    'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VT', 'VA', 'WA',
    'WV', 'WI', 'WY'],
    'Average Elevation (ft)': [500, 1900, 4100, 650, 2900, 6800, 500, 60, 150, 100, 600, 3030, 5000, 600, 700, 1100, 2000,
                               750, 100, 600, 350, 500, 900, 1200, 300, 800, 3400, 2600, 5500, 1000, 250, 5700, 1000, 700,
                               1900, 850, 1300, 3300, 1100, 350, 350, 2200, 900, 1700, 6100, 1000, 950, 1700, 1500, 1050,
                               6700]
}

df2 = pd.DataFrame(avg_elevation_1)
df2.head()
Out[75]:
State Average Elevation (ft)
0 AL 500
1 AK 1900
2 AZ 4100
3 AR 650
4 CA 2900
In [76]:
PLNT22df = pd.read_csv('PLNT22.csv')
state_counts = PLNT22df['Plant state abbreviation'].value_counts()
data = {
    'State': ['CA', 'NC', 'NY', 'TX', 'MN', 'MA', 'NJ', 'IL', 'IA', 'MI', 'OR', 'CO', 'FL', 'PA', 'GA', 
              'WI', 'SC', 'IN', 'VA', 'OH', 'MD', 'KS', 'CT', 'AK', 'AZ', 'WA', 'ID', 'NM', 'ME', 'MO', 
              'OK', 'NE', 'VT', 'UT', 'NV', 'RI', 'LA', 'TN', 'AL', 'AR', 'HI', 'WY', 'MT', 'NH', 'ND', 
              'PR', 'KY', 'SD', 'MS', 'WV', 'DE', 'DC'],
    'Count': [1678, 875, 821, 754, 724, 600, 368, 328, 283, 266, 265, 265, 242, 242, 235, 204, 197, 194, 
              193, 183, 171, 161, 155, 151, 146, 142, 140, 135, 133, 130, 129, 126, 113, 113, 102, 87, 
              82, 82, 75, 73, 68, 67, 65, 61, 60, 60, 52, 50, 47, 36, 31, 13]
}
df1 = pd.DataFrame(data)
df1.head()
result = {k: int(v) for k, v in df1.set_index('State')['Count'].to_dict().items()}
result2 = {k : int(v) for k, v in df2.set_index('State')['Average Elevation (ft)'].to_dict().items()}
randomforest_df['Total Power Plant Count'] = randomforest_df['LocationAbbr'].map(result)
randomforest_df['Average Elevation (ft)'] = randomforest_df['LocationAbbr'].map(result2)
randomforest_df.head()
C:\Users\tranh\AppData\Local\Temp\ipykernel_2596\2533913839.py:1: DtypeWarning: Columns (0,1,4,6,8,16,17,19,20,22,23,27,29,33,34,49,51,52,54,55,57,58,59,60,61,62,63,64,65,67,68,70,71,73,80,94,95,96,99) have mixed types. Specify dtype option on import or set low_memory=False.
  PLNT22df = pd.read_csv('PLNT22.csv')
Out[76]:
YearStart LocationAbbr Stratification1 DataValue Year_x State Total Population State Name Year_male_pop State_male_pop Male Population Year_female_pop State_female_pop Female Population Year_y Emissions Visits per 10,000 Total Power Plant Count Average Elevation (ft)
0 2010 AZ Female 7473.116667 2010 AZ 1600442.0 Arizona 2010 AZ 809747.0 2010 AZ 790763.0 2010 99.5 94.505138 146 4100
1 2010 AZ Male 6150.000000 2010 AZ 1600442.0 Arizona 2010 AZ 809747.0 2010 AZ 790763.0 2010 99.5 75.949648 146 4100
2 2010 CA Female 38898.420000 2010 CA 8730612.0 California 2010 CA 4327961.0 2010 CA 4402785.0 2010 356.6 88.349579 1678 2900
3 2010 CA Male 32368.006667 2010 CA 8730612.0 California 2010 CA 4327961.0 2010 CA 4402785.0 2010 356.6 74.788120 1678 2900
4 2010 FL Female 27106.200000 2010 FL 18845537.0 Florida 2010 FL 9212261.0 2010 FL 9633882.0 2010 245.5 28.136321 242 100
In [77]:
X_a = randomforest_df[['Emissions', 'Average Elevation (ft)', 'Total Population', 'Male Population', 'Female Population']]
y_a = randomforest_df['Visits per 10,000']
X_a = pd.get_dummies(X_a)
X_trainz, X_testz, y_trainz, y_testz = train_test_split(X_a, y_a, test_size=0.33, random_state=42)
asthma_nonparam = RandomForestRegressor()
asthma_nonparam.fit(X_trainz,  y_trainz)

asthma_train_y_pred = asthma_nonparam.predict(X_trainz)
asthma_train_rmse = mean_squared_error(y_trainz, asthma_train_y_pred, squared=False)
asthma_train_mse = mean_squared_error(y_trainz, asthma_train_y_pred)
asthma_train_r2_score = r2_score(y_trainz, asthma_train_y_pred)

asthma_test_y_pred = asthma_nonparam.predict(X_testz)
asthma_test_rmse = mean_squared_error(y_testz, asthma_test_y_pred, squared=False)
asthma_test_mse = mean_squared_error(y_testz, asthma_test_y_pred)
asthma_test_r2_score = r2_score(y_testz, asthma_test_y_pred)


print('Training rmse: ', asthma_train_rmse)
print('Training mse: ', asthma_train_mse)
print('Training r2 score: ', asthma_train_r2_score)
print('Testing rmse: ', asthma_test_rmse)
print('Testing mse:', asthma_test_mse)
print('Testing r2 score: ', asthma_test_r2_score)
Training rmse:  12.064387076833203
Training mse:  145.54943553965998
Training r2 score:  0.9857253688229829
Testing rmse:  49.81317124162892
Testing mse: 2481.3520291478467
Testing r2 score:  0.8111242389109078
In [78]:
# Round visit rates up to integer counts so the outcome can be modeled as a Poisson count
ceilupvisitDF = randomforest_df.copy()
ceilupvisitDF['Visits per 10,000'] = np.ceil(ceilupvisitDF['Visits per 10,000'])

X2 = ceilupvisitDF[['Emissions', 'Average Elevation (ft)', 'Total Population', 'Male Population', 'Female Population']]
y2 = ceilupvisitDF['Visits per 10,000']
X_train2, X_test2, y_train2, y_test2 = train_test_split(X2, y2, test_size=0.33, random_state=42)
In [79]:
poisson_model = sm.GLM(y_train2, sm.add_constant(X_train2.astype(float)), family=sm.families.Poisson()).fit()
poisson_predictions = poisson_model.predict(sm.add_constant(X_test2.astype(float)))

gaussian_model = sm.OLS(y_trainz, sm.add_constant(X_trainz.astype(float))).fit()
gaussian_predictions = gaussian_model.predict(sm.add_constant(X_testz.astype(float)))
In [80]:
print(gaussian_model.summary())
print(poisson_model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:      Visits per 10,000   R-squared:                       0.994
Model:                            OLS   Adj. R-squared:                  0.982
Method:                 Least Squares   F-statistic:                     83.25
Date:                Mon, 06 May 2024   Prob (F-statistic):           1.07e-32
Time:                        11:42:49   Log-Likelihood:                -464.68
No. Observations:                 134   AIC:                             1109.
Df Residuals:                      44   BIC:                             1370.
Df Model:                          89                                         
Covariance Type:            nonrobust                                         
==========================================================================================
                             coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------
const                  -5556.3711   2.95e+04     -0.188      0.852   -6.51e+04     5.4e+04
Average Elevation (ft)     1.0292      5.350      0.192      0.848      -9.753      11.812
Total Population          -0.0250      0.110     -0.227      0.822      -0.247       0.197
Male Population            0.0246      0.112      0.219      0.827      -0.202       0.251
Female Population          0.0271      0.118      0.231      0.819      -0.210       0.264
Emissions_100.4         5161.7080   2.55e+04      0.202      0.840   -4.62e+04    5.65e+04
Emissions_101.2         2163.2483   1.13e+04      0.191      0.850   -2.07e+04     2.5e+04
Emissions_101.3         2165.8279   1.14e+04      0.190      0.850   -2.08e+04    2.51e+04
Emissions_103.7         5058.5364   2.55e+04      0.198      0.844   -4.63e+04    5.65e+04
Emissions_104.9         5196.9858   2.55e+04      0.204      0.839   -4.62e+04    5.66e+04
Emissions_105.1         5112.1298   2.56e+04      0.200      0.843   -4.65e+04    5.67e+04
Emissions_105.6         5131.6265   2.55e+04      0.201      0.842   -4.63e+04    5.66e+04
Emissions_116.8          830.5270   4616.404      0.180      0.858   -8473.224    1.01e+04
Emissions_121.1        -4444.2747   2.28e+04     -0.195      0.846   -5.04e+04    4.15e+04
Emissions_122.8          827.5651   4594.495      0.180      0.858   -8432.032    1.01e+04
Emissions_124.2        -4608.3979   2.36e+04     -0.195      0.846   -5.21e+04    4.29e+04
Emissions_124.8        -4181.5958   2.15e+04     -0.195      0.846   -4.74e+04    3.91e+04
Emissions_124.9        -4327.2718   2.22e+04     -0.195      0.846   -4.91e+04    4.04e+04
Emissions_126.2        -3948.9990   2.03e+04     -0.194      0.847   -4.49e+04     3.7e+04
Emissions_126.8          875.0910   4813.667      0.182      0.857   -8826.216    1.06e+04
Emissions_129.3         9.033e-09   1.65e-09      5.461      0.000     5.7e-09    1.24e-08
Emissions_131.3          859.2823   4757.269      0.181      0.857   -8728.363    1.04e+04
Emissions_139.7          907.1910   4956.361      0.183      0.856   -9081.698    1.09e+04
Emissions_140.4          885.6499   4864.092      0.182      0.856   -8917.284    1.07e+04
Emissions_141.8        -4446.9281   2.28e+04     -0.195      0.846   -5.05e+04    4.16e+04
Emissions_141.9        -4581.1242   2.35e+04     -0.195      0.846   -5.19e+04    4.28e+04
Emissions_147.2        -3607.0556   1.87e+04     -0.193      0.848   -4.13e+04     3.4e+04
Emissions_148.1        -4044.9307   2.09e+04     -0.194      0.847   -4.61e+04     3.8e+04
Emissions_152.9         1005.6937   5423.098      0.185      0.854   -9923.842    1.19e+04
Emissions_169.1        -1279.0476   6985.168     -0.183      0.856   -1.54e+04    1.28e+04
Emissions_170.2        -1267.8812   6866.154     -0.185      0.854   -1.51e+04    1.26e+04
Emissions_185.0        -1241.8311   6866.788     -0.181      0.857   -1.51e+04    1.26e+04
Emissions_227.0        -6.375e-14   8.99e-15     -7.095      0.000   -8.19e-14   -4.56e-14
Emissions_233.0        -1.221e+04   6.39e+04     -0.191      0.849   -1.41e+05    1.17e+05
Emissions_237.4        -1.258e+04   6.58e+04     -0.191      0.849   -1.45e+05     1.2e+05
Emissions_238.4        -1.334e+04   6.97e+04     -0.191      0.849   -1.54e+05    1.27e+05
Emissions_238.8        -1.303e+04   6.81e+04     -0.191      0.849    -1.5e+05    1.24e+05
Emissions_242.0        -1.352e+04   7.07e+04     -0.191      0.849   -1.56e+05    1.29e+05
Emissions_245.5        -1.114e+04   5.84e+04     -0.191      0.850   -1.29e+05    1.07e+05
Emissions_356.6        -4890.3593   2.61e+04     -0.187      0.852   -5.75e+04    4.77e+04
Emissions_38.5           -88.1072    775.874     -0.114      0.910   -1651.779    1475.564
Emissions_38.8         -2002.4678   1.02e+04     -0.197      0.845   -2.25e+04    1.85e+04
Emissions_39.6         -1973.1171   1.01e+04     -0.195      0.846   -2.23e+04    1.84e+04
Emissions_39.7           -22.6926    762.145     -0.030      0.976   -1558.696    1513.311
Emissions_40.0           -44.4492    792.064     -0.056      0.956   -1640.750    1551.852
Emissions_40.6            10.3719    766.569      0.014      0.989   -1534.546    1555.290
Emissions_41.4           -22.6671    811.503     -0.028      0.978   -1658.144    1612.809
Emissions_47.9          1500.4600   8226.825      0.182      0.856   -1.51e+04    1.81e+04
Emissions_48.4          1513.4283   8290.428      0.183      0.856   -1.52e+04    1.82e+04
Emissions_5.7           4269.1405   2.27e+04      0.188      0.852   -4.14e+04       5e+04
Emissions_5.9           4268.2382   2.27e+04      0.188      0.852   -4.15e+04       5e+04
Emissions_50.0          1566.2732   8572.959      0.183      0.856   -1.57e+04    1.88e+04
Emissions_50.7          1520.7833   8337.904      0.182      0.856   -1.53e+04    1.83e+04
Emissions_52.2          1532.5856   8385.011      0.183      0.856   -1.54e+04    1.84e+04
Emissions_52.4          1517.1323   8293.968      0.183      0.856   -1.52e+04    1.82e+04
Emissions_53.5          1537.5031   8428.013      0.182      0.856   -1.54e+04    1.85e+04
Emissions_55.0          4934.5333   2.52e+04      0.196      0.845   -4.58e+04    5.57e+04
Emissions_59.0         -1.253e-07   8.43e-08     -1.486      0.144   -2.95e-07    4.46e-08
Emissions_59.1          2249.0120   1.21e+04      0.186      0.854   -2.22e+04    2.67e+04
Emissions_6.1           4262.0606   2.27e+04      0.188      0.852   -4.14e+04       5e+04
Emissions_60.5          4959.0275   2.52e+04      0.197      0.845   -4.58e+04    5.57e+04
Emissions_61.7          4921.8361   2.52e+04      0.196      0.846   -4.58e+04    5.56e+04
Emissions_62.1          2236.2990    1.2e+04      0.186      0.853    -2.2e+04    2.65e+04
Emissions_62.3          4938.5957   2.52e+04      0.196      0.845   -4.58e+04    5.57e+04
Emissions_64.0         -3066.6956   1.57e+04     -0.195      0.846   -3.48e+04    2.87e+04
Emissions_64.2          5038.2058   2.52e+04      0.200      0.842   -4.57e+04    5.58e+04
Emissions_65.3         -3234.8352   1.66e+04     -0.195      0.847   -3.67e+04    3.03e+04
Emissions_66.7         -5118.0847   2.62e+04     -0.195      0.846    -5.8e+04    4.77e+04
Emissions_68.9          2270.3559   1.22e+04      0.186      0.853   -2.23e+04    2.68e+04
Emissions_70.0           590.9592   3620.903      0.163      0.871   -6706.491    7888.409
Emissions_70.2           923.8853   5231.538      0.177      0.861   -9619.586    1.15e+04
Emissions_70.8          2202.4079   1.19e+04      0.185      0.854   -2.17e+04    2.61e+04
Emissions_71.4          5058.0028   2.52e+04      0.201      0.842   -4.58e+04    5.59e+04
Emissions_72.8         -4666.1536   2.39e+04     -0.195      0.846   -5.28e+04    4.35e+04
Emissions_73.4           756.7471   4439.070      0.170      0.865   -8189.610    9703.105
Emissions_74.0          2626.4694    1.4e+04      0.188      0.852   -2.56e+04    3.08e+04
Emissions_76.3          2619.9530    1.4e+04      0.187      0.852   -2.55e+04    3.08e+04
Emissions_77.4          2624.8039    1.4e+04      0.187      0.852   -2.56e+04    3.08e+04
Emissions_82.4          2637.2577   1.41e+04      0.188      0.852   -2.57e+04     3.1e+04
Emissions_83.4          2631.3933    1.4e+04      0.188      0.852   -2.56e+04    3.09e+04
Emissions_85.9          1096.7809   6091.680      0.180      0.858   -1.12e+04    1.34e+04
Emissions_89.3         -9.652e-10   1.17e-10     -8.237      0.000    -1.2e-09   -7.29e-10
Emissions_90.1          1.543e-07   2.49e-08      6.203      0.000    1.04e-07    2.04e-07
Emissions_90.3          2654.2125   1.41e+04      0.188      0.852   -2.58e+04    3.11e+04
Emissions_90.5           -60.6969    209.701     -0.289      0.774    -483.322     361.928
Emissions_90.9          1.226e-07   1.94e-08      6.304      0.000    8.34e-08    1.62e-07
Emissions_91.1         -2.544e-07   4.02e-08     -6.322      0.000   -3.35e-07   -1.73e-07
Emissions_91.3          3027.6741    1.6e+04      0.190      0.850   -2.91e+04    3.52e+04
Emissions_94.1          -106.8394    427.618     -0.250      0.804    -968.646     754.968
Emissions_94.8          3036.4217    1.6e+04      0.190      0.850   -2.92e+04    3.53e+04
Emissions_95.0           -27.8216     45.678     -0.609      0.546    -119.879      64.236
Emissions_95.6          2143.3294   1.13e+04      0.190      0.850   -2.06e+04    2.49e+04
Emissions_95.9          3.997e-07   7.61e-08      5.254      0.000    2.46e-07    5.53e-07
Emissions_97.3            17.8462    106.906      0.167      0.868    -197.609     233.301
Emissions_98.6          2140.3227   1.13e+04      0.190      0.850   -2.05e+04    2.48e+04
Emissions_99.0          2193.6384   1.15e+04      0.191      0.849   -2.09e+04    2.53e+04
Emissions_99.3            31.7339    192.549      0.165      0.870    -356.322     419.790
Emissions_99.4          5090.3011   2.56e+04      0.199      0.843   -4.65e+04    5.66e+04
Emissions_99.5            78.8163    343.770      0.229      0.820    -614.007     771.640
Emissions_99.9          2151.4975   1.13e+04      0.190      0.850   -2.07e+04     2.5e+04
==============================================================================
Omnibus:                       28.208   Durbin-Watson:                   1.825
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              258.679
Skew:                          -0.000   Prob(JB):                     6.74e-57
Kurtosis:                       9.807   Cond. No.                     2.36e+23
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 1.41e-31. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:      Visits per 10,000   No. Observations:                  134
Model:                            GLM   Df Residuals:                      128
Model Family:                 Poisson   Df Model:                            5
Link Function:                    Log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -2071.6
Date:                Mon, 06 May 2024   Deviance:                       3429.6
Time:                        11:42:49   Pearson chi2:                 1.29e+04
No. Iterations:                     6   Pseudo R-squ. (CS):              1.000
Covariance Type:            nonrobust                                         
==========================================================================================
                             coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------------
const                      3.7220      0.030    122.229      0.000       3.662       3.782
Emissions                  0.0236      0.000     65.837      0.000       0.023       0.024
Average Elevation (ft)     0.0002   6.45e-06     28.945      0.000       0.000       0.000
Total Population        7.575e-05   1.18e-05      6.404      0.000    5.26e-05    9.89e-05
Male Population           -0.0001   1.18e-05     -9.044      0.000      -0.000   -8.38e-05
Female Population      -4.685e-05   1.18e-05     -3.957      0.000   -7.01e-05   -2.36e-05
==========================================================================================
In [81]:
# Evaluate both GLMs on their held-out test sets
gaussian_test_pred = gaussian_predictions
glm_test_rmse_gaussian = rmse(y_testz, gaussian_test_pred)
print('GLM Test RMSE (Gaussian) =', glm_test_rmse_gaussian)
print('GLM Test MSE (Gaussian) =', glm_test_rmse_gaussian ** 2)


poisson_test_pred = poisson_predictions
glm_test_rmse_poisson = rmse(y_test2, poisson_test_pred)
print('GLM Test RMSE (Poisson) =', glm_test_rmse_poisson)
print('GLM Test MSE (Poisson) =', glm_test_rmse_poisson ** 2)
GLM Test RMSE (Gaussian) = 2474.3382319839498
GLM Test MSE (Gaussian) = 6122349.686257458
GLM Test RMSE (Poisson) = 58.27654136285573
GLM Test MSE (Poisson) = 3396.155273216635

Option C: Prediction with GLMs and nonparametric methods¶

Methods

  • Our group tries to predict the number of asthma-related emergency room visits per 10,000 people in each state. We use average elevation (ft), emissions in MTCO2 (metric tons of CO2), total population, male population, and female population in each state to predict the "Visits per 10,000" column. We chose these features because our research question asks whether asthma rates are driven by average elevation across states; we added emissions in metric tons of CO2 because we could not find data on PM2.5 (the small airborne particles most directly linked to asthma), and CO2 is linked to asthma in a similar way. We chose total population because a larger population tends to produce more CO2 and more asthma-related emergency room visits (helping show the pattern is not due to chance). We also included male and female population because the data trend suggests that males are less likely to have asthma than females.
  • We tried two different GLMs. We first tried a Gaussian model because we assume our data are roughly normally distributed and that observations are independent of one another, meaning one person having asthma does not cause another person to get asthma (and so does not inflate the number of emergency room visits). We also assume roughly constant variance, since there should not be a strong skew in the data given that CO2 exposure for people within the same state should be similar (we take the state-average CO2 level). Our second choice was a Poisson model because the outcome is a count: each emergency room visit is treated as identical and independent of the others, the rate of occurrence is assumed constant over the observation period (here, per year), and Poisson models are well suited to relatively rare events; in our study the rare event is an asthma-related visit, whose probability may depend on gender or ethnicity. Since the number of visits is a discrete count, Poisson is a natural fit. The mean of the Poisson distribution (which also equals its variance) is proportional to the size of the observation window: if the time period or area under observation doubles, the expected count doubles as well, which is roughly consistent with our data (two years of visits in one state is approximately twice one year of visits, assuming CO2 levels and the number of power plants do not change dramatically).
  • We chose a random forest because we have many features and an ensemble is more robust than a single decision tree. In this model we assume the result is not affected by race or age. Since we take a frequentist approach, we do not bring in any prior knowledge, and we again assume that CO2 levels and the number of power plants do not change dramatically. A short feature-importance sketch follows this list.
  • We evaluate our predictions using RMSE, MSE, log-likelihood, and model complexity, and find that the random forest is the most robust and performs best, followed by the Poisson model and then the Gaussian model.
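As referenced in the random-forest bullet above, a quick look at which features the fitted forest actually relies on; a minimal sketch (not part of the original cells), assuming asthma_nonparam and X_trainz from the earlier cells are still in scope:

import pandas as pd

# Rank the fitted forest's features by importance
importances = pd.Series(asthma_nonparam.feature_importances_, index=X_trainz.columns)
print(importances.sort_values(ascending=False))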

Results

For model performance summaries, refer to the RMSE, MSE, and R^2 metrics reported above for each of the three models (one random forest and two GLMs).
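To make that comparison easier to scan, the headline test-set errors can be collected into one table; a small sketch (not part of the original notebook output) reusing the predictions computed in the cells above:

import pandas as pd
from sklearn.metrics import mean_squared_error

# Side-by-side test RMSE for the three fitted models
comparison = pd.DataFrame({
    'Test RMSE': [
        mean_squared_error(y_testz, asthma_test_y_pred, squared=False),
        mean_squared_error(y_testz, gaussian_predictions, squared=False),
        mean_squared_error(y_test2, poisson_predictions, squared=False),
    ]
}, index=['Random forest', 'Gaussian OLS', 'Poisson GLM'])
print(comparison)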

I would say there is relatively low uncertainty in both the nonparametric and GLM models: despite our data not being fully comprehensive, the feature engineering was effective. Quantitatively, several of the 95% confidence intervals for the coefficients are tight around their estimates. For example, in the Poisson GLM, the estimated emissions coefficient is 0.0236, with a 95% confidence interval of [0.023, 0.024].
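Those interval bounds can be pulled straight from the fitted model rather than read off the summary table; a one-line sketch (not part of the original cells) using the poisson_model fitted above:

# 95% confidence intervals for the Poisson GLM coefficients, alongside the point estimates
ci = poisson_model.conf_int(alpha=0.05)
ci.columns = ['2.5%', '97.5%']
ci.insert(0, 'coef', poisson_model.params)
print(ci)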

Discussion

The nonparametric random forest model performed significantly better when comparing the common RMSE metric across the models we trained. I am confident that, on similar future datasets, the random forest can outperform a GLM because we are working with real-world data that often has complex relationships. The random forest can better capture these relationships, such as between asthma and emissions, without an excessive risk of overfitting, thanks to the advantages of an ensemble (i.e., building multiple decision trees that each use only a subset of the features).

All of the models fit the training data well; on the held-out test data the Poisson GLM trails the random forest only modestly, while the Gaussian OLS performs substantially worse.
