In [1]:
!pip install matplotlib
!pip install numpy
!pip install pandas
!pip install seaborn
!pip install scipy
!pip install statsmodels
!pip install scikit-learn
!pip install networkx
Requirement already satisfied: matplotlib (3.8.4), numpy (1.23.5), pandas (1.5.3), seaborn (0.12.2), scipy (1.10.0), statsmodels (0.13.5), scikit-learn (1.2.1), networkx (3.3), and their dependencies.

Please clone the following URL and move the files into the same directory as this Python notebook: https://github.com/Maanuj-Vora/Data102-Final-Project-Datasets
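A minimal way to do this from inside the notebook, assuming git is available on your PATH and that the repository keeps its CSV files at the top level (both assumptions about your environment):

# clone the dataset repository and copy its CSVs next to this notebook
!git clone https://github.com/Maanuj-Vora/Data102-Final-Project-Datasets

import glob, shutil
for path in glob.glob('Data102-Final-Project-Datasets/*.csv'):
    shutil.copy(path, '.')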

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
import seaborn as sns
import scipy.stats
from scipy.stats import norm
import statsmodels.api as sm
from statsmodels.formula.api import glm
import networkx as nx
from statsmodels.genmod.generalized_linear_model import GLM
from statsmodels.genmod import families
from sklearn.ensemble import RandomForestRegressor
from sklearn import linear_model, tree
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from scipy.stats import chi2
import statsmodels.formula.api as smf
from statsmodels.tools.eval_measures import rmse
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from scipy.stats import ttest_ind
from statsmodels.stats.power import TTestIndPower
from sklearn.model_selection import train_test_split
In [3]:
df = pd.read_csv('Population by Age and Sex - EDDs.csv')

mean_population_by_year = df[df['State'] == 'CA'].groupby('Year')['Total Population'].mean()

mean_population_by_year.plot(kind='line', figsize=(10, 6))

plt.title('Average Total Population Over Years')
plt.xlabel('Year')
plt.ylabel('Average Population')
plt.xticks(range(min(mean_population_by_year.index), max(mean_population_by_year.index) + 1, 2))
plt.grid(True)
plt.show()

The chart displays the average total population over the years in California, illustrating how the population has changed. This relationship is interesting because it provides insights into population growth or decline trends over time. This is relevant because we intend to normalize the emissions dataset and asthma dataset by the total population per state.

In [4]:
states = ['CA', 'NY', 'FL', 'NV', 'TX', 'HI', 'WA', 'OR', 'KY', 'LA']
df[df['State'].isin(states)].groupby(['Year', 'State'])['Total Population'].mean().unstack().plot(kind='bar', stacked=True, figsize=(10, 6))
Out[4]:
<Axes: xlabel='Year'>

The graph displays the average total population for selected states (CA, NY, FL, NV, TX, HI, WA, OR, KY, LA) over various years, with each state's population represented as a portion of a bar for each year. The bars are stacked, meaning the population sizes of the states are added on top of each other within each year's bar, allowing us to see the combined population trends over time for these states. This visualization helps in comparing the population growth or decline among these states across different years, as well as understanding the overall population trend of these states combined.

In [5]:
race_df = pd.read_csv('Population by Race - EDDs.csv')
race_df.fillna(0, inplace=True) # For Hawaiian or Pacific Islander Alone, fill with zeroes if there are null values.
race_df
Out[5]:
IBRC_Geo_ID State District Name Year Total Population White Alone Black Alone American Indian or Alaskan Native Asian Alone Hawaiian or Pacific Islander Alone Two or More Races Not Hispanic Hispanic
0 6500001 AK Kenai Peninsula Economic Development District 1990 41125.0 37508.0 199.0 3003.0 415.0 0.0 0.0 40421.0 704.0
1 6500001 AK Kenai Peninsula Economic Development District 1991 42649.0 38893.0 204.0 3120.0 432.0 0.0 0.0 41913.0 736.0
2 6500001 AK Kenai Peninsula Economic Development District 1992 43496.0 39670.0 212.0 3173.0 441.0 0.0 0.0 42748.0 748.0
3 6500001 AK Kenai Peninsula Economic Development District 1993 44201.0 40265.0 219.0 3251.0 466.0 0.0 0.0 43431.0 770.0
4 6500001 AK Kenai Peninsula Economic Development District 1994 45588.0 41435.0 237.0 3423.0 493.0 0.0 0.0 44761.0 827.0
... ... ... ... ... ... ... ... ... ... ... ... ... ...
11935 6500416 CT Western Connecticut Economic Development Distr... 2015 944943.0 752134.0 116361.0 4573.0 52039.0 1020.0 18816.0 766505.0 178438.0
11936 6500416 CT Western Connecticut Economic Development Distr... 2016 944347.0 748757.0 117499.0 4696.0 53041.0 1004.0 19350.0 762293.0 182054.0
11937 6500416 CT Western Connecticut Economic Development Distr... 2017 943038.0 745352.0 118526.0 4775.0 53541.0 1005.0 19839.0 757421.0 185617.0
11938 6500416 CT Western Connecticut Economic Development Distr... 2018 943971.0 743260.0 119919.0 4964.0 54515.0 1018.0 20295.0 753873.0 190098.0
11939 6500416 CT Western Connecticut Economic Development Distr... 2019 943332.0 739179.0 121997.0 5115.0 55489.0 1029.0 20523.0 749560.0 193772.0

11940 rows × 13 columns

In [6]:
merged_df = pd.merge(df, race_df, on=['IBRC_Geo_ID', 'State', 'District Name', 'Year', 'Total Population'], how='inner')
merged_df
Out[6]:
IBRC_Geo_ID State District Name Year Total Population Population 0-4 Population 5-17 Population 18-24 Population 25-44 Population 45-64 ... Male Population Female Population White Alone Black Alone American Indian or Alaskan Native Asian Alone Hawaiian or Pacific Islander Alone Two or More Races Not Hispanic Hispanic
0 6500001 AK Kenai Peninsula Economic Development District 2000 49618.0 3272.0 11536.0 3432.0 14636.0 13044.0 ... 25831.0 23787.0 43353.0 236.0 3760.0 488.0 88.0 1693.0 48530.0 1088.0
1 6500001 AK Kenai Peninsula Economic Development District 2001 49986.0 3207.0 11261.0 3770.0 14117.0 13740.0 ... 26065.0 23921.0 43611.0 273.0 3729.0 512.0 79.0 1782.0 48826.0 1160.0
2 6500001 AK Kenai Peninsula Economic Development District 2002 50804.0 3217.0 11123.0 4099.0 13835.0 14506.0 ... 26589.0 24215.0 44144.0 287.0 3811.0 538.0 89.0 1935.0 49547.0 1257.0
3 6500001 AK Kenai Peninsula Economic Development District 2003 50963.0 3185.0 10874.0 4278.0 13405.0 15077.0 ... 26657.0 24306.0 44214.0 283.0 3808.0 547.0 91.0 2020.0 49657.0 1306.0
4 6500001 AK Kenai Peninsula Economic Development District 2004 51404.0 3181.0 10592.0 4447.0 13111.0 15718.0 ... 26801.0 24603.0 44423.0 271.0 3819.0 570.0 110.0 2211.0 50055.0 1349.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7885 6500416 CT Western Connecticut Economic Development Distr... 2015 944943.0 53031.0 166447.0 85431.0 227999.0 271756.0 ... 460316.0 485007.0 752134.0 116361.0 4573.0 52039.0 1020.0 18816.0 766505.0 178438.0
7886 6500416 CT Western Connecticut Economic Development Distr... 2016 944347.0 52545.0 164256.0 86562.0 225568.0 272331.0 ... 460137.0 484959.0 748757.0 117499.0 4696.0 53041.0 1004.0 19350.0 762293.0 182054.0
7887 6500416 CT Western Connecticut Economic Development Distr... 2017 943038.0 52058.0 162222.0 86535.0 224212.0 271858.0 ... 459755.0 484352.0 745352.0 118526.0 4775.0 53541.0 1005.0 19839.0 757421.0 185617.0
7888 6500416 CT Western Connecticut Economic Development Distr... 2018 943971.0 51708.0 160510.0 86323.0 225033.0 270590.0 ... 460611.0 484900.0 743260.0 119919.0 4964.0 54515.0 1018.0 20295.0 753873.0 190098.0
7889 6500416 CT Western Connecticut Economic Development Distr... 2019 943332.0 51199.0 158467.0 85880.0 225756.0 268079.0 ... 459938.0 484450.0 739179.0 121997.0 5115.0 55489.0 1029.0 20523.0 749560.0 193772.0

7890 rows × 24 columns

In [7]:
summary_df = merged_df.groupby('Year').agg({
    'White Alone': 'sum',
    'Black Alone': 'sum',
    'American Indian or Alaskan Native': 'sum',
    'Asian Alone': 'sum',
    'Hawaiian or Pacific Islander Alone': 'sum',
    'Two or More Races': 'sum',
    #'Not Hispanic': 'sum',
    'Hispanic': 'sum'
}).reset_index()

# Plot
summary_df.plot(x='Year', kind='bar', stacked=True, figsize=(10, 6), title='Population by Race Over Years')
plt.xlabel('Year')
plt.ylabel('Population')
plt.legend(title='Population Groups', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

This visualization stacks the populations of the different racial and ethnic groups for each year, providing an overview of demographic changes over time.

In [8]:
asthma_df = pd.read_csv('U.S._Chronic_Disease_Indicators__CDI___2023_Release_20240417.csv')
asthma_df
C:\Users\tranh\AppData\Local\Temp\ipykernel_2596\2650947367.py:1: DtypeWarning: Columns (10) have mixed types. Specify dtype option on import or set low_memory=False.
  asthma_df = pd.read_csv('U.S._Chronic_Disease_Indicators__CDI___2023_Release_20240417.csv')
Out[8]:
YearStart YearEnd LocationAbbr LocationDesc DataSource Topic Question Response DataValueUnit DataValueType ... LocationID TopicID QuestionID DataValueTypeID StratificationCategoryID1 StratificationID1 StratificationCategoryID2 StratificationID2 StratificationCategoryID3 StratificationID3
0 2010 2010 OR Oregon NVSS Cardiovascular Disease Mortality from heart failure NaN NaN Number ... 41 CVD CVD1_4 NMBR RACE AIAN NaN NaN NaN NaN
1 2019 2019 AZ Arizona YRBSS Alcohol Alcohol use among youth NaN % Crude Prevalence ... 4 ALC ALC1_1 CRDPREV GENDER GENF NaN NaN NaN NaN
2 2019 2019 OH Ohio YRBSS Alcohol Alcohol use among youth NaN % Crude Prevalence ... 39 ALC ALC1_1 CRDPREV GENDER GENM NaN NaN NaN NaN
3 2019 2019 US United States YRBSS Alcohol Alcohol use among youth NaN % Crude Prevalence ... 59 ALC ALC1_1 CRDPREV RACE ASN NaN NaN NaN NaN
4 2015 2015 VI Virgin Islands YRBSS Alcohol Alcohol use among youth NaN % Crude Prevalence ... 78 ALC ALC1_1 CRDPREV GENDER GENM NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1185671 2020 2020 WY Wyoming BRFSS Diabetes Dilated eye examination among adults aged >= 1... NaN % Age-adjusted Prevalence ... 56 DIA DIA7_0 AGEADJPREV RACE WHT NaN NaN NaN NaN
1185672 2020 2020 WY Wyoming BRFSS Older Adults Proportion of older adults aged >= 65 years wh... NaN % Crude Prevalence ... 56 OLD OLD3_1 CRDPREV RACE WHT NaN NaN NaN NaN
1185673 2017 2017 IA Iowa BRFSS Arthritis Activity limitation due to arthritis among adu... NaN % Age-adjusted Prevalence ... 19 ART ART2_1 AGEADJPREV RACE HIS NaN NaN NaN NaN
1185674 2020 2020 WY Wyoming BRFSS Diabetes Diabetes prevalence among women aged 18-44 years NaN % Crude Prevalence ... 56 DIA DIA2_2 CRDPREV RACE HIS NaN NaN NaN NaN
1185675 2019 2019 RI Rhode Island BRFSS Arthritis Activity limitation due to arthritis among adu... NaN % Crude Prevalence ... 44 ART ART2_1 CRDPREV OVERALL OVR NaN NaN NaN NaN

1185676 rows × 34 columns

In [9]:
# Filter the dataset for Asthma indicators only
asthma_data = asthma_df[asthma_df['Topic'] == 'Asthma']
In [10]:
asthma_data = asthma_data.dropna(subset=['DataValue'])
asthma_data['DataValue'] = pd.to_numeric(asthma_data['DataValue'], errors='coerce')

# Group by YearStart and calculate the mean of DataValue for Asthma prevalence
asthma_yearly = asthma_data.groupby('YearStart')['DataValue'].mean()

plt.figure(figsize=(12, 6))
plt.plot(asthma_yearly.index, asthma_yearly.values, marker='o', linestyle='-', color='blue')
plt.title('Average Asthma Prevalence Over the Years')
plt.xlabel('Year')
plt.ylabel('Average Prevalence (%)')
plt.grid(True)
plt.show()

There is a decrease in average asthma prevalence over the years. This might be due to a lack of data; for example, 2011 and 2012 have no data points.
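A quick way to confirm which years actually appear in the grouped series, using the asthma_yearly series defined above (a minimal check, not part of the original analysis):

print(sorted(asthma_yearly.index))  # years with at least one asthma data point; 2011 and 2012 should be absent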

In [11]:
# Filter the dataset for Emergency Room visits related to Asthma
er_visits_asthma = asthma_data[asthma_data['Question'].str.contains('Emergency department visit rate for asthma')]

# Group by YearStart and calculate the mean of DataValue for ER visits due to Asthma
er_visits_yearly = er_visits_asthma.groupby('YearStart')['DataValue'].mean().reset_index()

er_visits_yearly
Out[11]:
YearStart DataValue
0 2010 13897.291389
1 2013 6392.235606
2 2014 4310.263989
3 2015 2912.746088
4 2016 4689.097370
5 2017 3821.923457
6 2018 3368.481102
In [12]:
plt.figure(figsize=(12, 6))
plt.plot(er_visits_yearly['YearStart'], er_visits_yearly['DataValue'], marker='o', color='red')
plt.title('Average Emergency Room Visits for Asthma Over the Years')
plt.xlabel('Year')
plt.ylabel('Average ER Visits')
plt.grid(True)
plt.show()

There is a decrease in average emergency room visits over the years. This may be due to wildfires in 2010, which released more carbon dioxide into the atmosphere. It may also be due to the lack of data.

In [13]:
asthma_data = asthma_df[asthma_df['Topic'] == 'Asthma']
asthma_data = asthma_data[asthma_data['Question'] == 'Emergency department visit rate for asthma']
asthma_data = asthma_data[asthma_data['DataValueType'] == 'Number']
asthma_data = asthma_data[asthma_data['StratificationCategory1'] == 'Overall']
asthma_data['DataValue'] = pd.to_numeric(asthma_data['DataValue'], errors='coerce')
asthma_data.dropna(subset=['DataValue'], inplace=True)
asthma_state_yearly = asthma_data.groupby(['YearStart', 'LocationAbbr'])['DataValue'].mean().reset_index()

asthma_state_yearly.head(30)
Out[13]:
YearStart LocationAbbr DataValue
0 2010 AZ 40650.0
1 2010 CA 217637.0
2 2010 FL 141612.0
3 2010 HI 7328.0
4 2010 IA 14135.0
5 2010 KY 27334.0
6 2010 MA 53257.0
7 2010 MD 56070.0
8 2010 NC 60806.0
9 2010 NE 6798.0
10 2010 NJ 81536.0
11 2010 NY 222051.0
12 2010 RI 7845.0
13 2010 SC 24076.0
14 2010 UT 7867.0
15 2010 WI 30501.0
16 2013 AZ 31357.0
17 2013 FL 126267.0
18 2013 IA 11065.0
19 2013 KY 22037.0
20 2013 NC 52550.0
21 2013 NE 5391.0
22 2013 NJ 66250.0
23 2013 NV 13652.0
24 2013 NY 179657.0
25 2013 SC 30142.0
26 2013 VT 2545.0
27 2013 WI 23701.0
28 2014 AR 13610.0
29 2014 AZ 32699.0
In [14]:
df_population = pd.read_csv('Population by Age and Sex - EDDs.csv')
# Summarize total population by state and year
df_population_summary = df_population.groupby(['Year', 'State'])['Total Population'].sum().reset_index()
df_population_summary
Out[14]:
Year State Total Population
0 2000 AK 146644.0
1 2000 AL 4510366.0
2 2000 AR 2678588.0
3 2000 AZ 1220370.0
4 2000 CA 8163246.0
... ... ... ...
915 2019 VA 815269.0
916 2019 VT 324114.0
917 2019 WA 5607154.0
918 2019 WI 2772884.0
919 2019 WV 1792147.0

920 rows × 3 columns

In [15]:
# Merge asthma data with population data
asthma_population_merged = pd.merge(asthma_state_yearly, df_population_summary, left_on=['YearStart', 'LocationAbbr'], right_on=['Year', 'State'], how='inner')

# Calculate the percentage of population affected by asthma
asthma_population_merged['NormalizedOccurence'] = (asthma_population_merged['DataValue'] / asthma_population_merged['Total Population'])
asthma_population_merged['Asthma Prevalence (%)'] = asthma_population_merged['NormalizedOccurence'] * 100

asthma_population_merged[['YearStart', 'LocationAbbr', 'DataValue', 'Total Population', 'Asthma Prevalence (%)', 'NormalizedOccurence']]
Out[15]:
YearStart LocationAbbr DataValue Total Population Asthma Prevalence (%) NormalizedOccurence
0 2010 AZ 40650.0 1600442.0 2.539923 0.025399
1 2010 CA 217637.0 8730612.0 2.492803 0.024928
2 2010 FL 141612.0 18845537.0 0.751435 0.007514
3 2010 IA 14135.0 2075096.0 0.681173 0.006812
4 2010 KY 27334.0 4348181.0 0.628631 0.006286
... ... ... ... ... ... ...
95 2018 NJ 52173.0 569816.0 9.156114 0.091561
96 2018 NV 14810.0 199220.0 7.433993 0.074340
97 2018 OR 12819.0 4740945.0 0.270389 0.002704
98 2018 VT 1799.0 324553.0 0.554301 0.005543
99 2018 WI 21174.0 2765223.0 0.765725 0.007657

100 rows × 6 columns

The merged data includes the average asthma prevalence rates for each state per year, along with the total population for those states and years. The percentage of the population affected by asthma in each state per year provides insight into how asthma prevalence relates to the broader population across different states and over time.
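As a quick sanity check on this normalization (a minimal sketch using the Arizona 2010 row shown above):

visits, population = 40650.0, 1600442.0   # DataValue and Total Population for AZ in 2010
print(visits / population * 100)          # ≈ 2.5399, matching the 'Asthma Prevalence (%)' column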

In [16]:
emmissions_df = pd.read_csv('table1.csv') # https://www.eia.gov/environment/emissions/state/

emmissions_df.head() #Each row is a state, each column is the year
#million metric tons of carbon dioxide equivalents (MTCO2E)
Out[16]:
State 1970 1971 1972 1973 1974 1975 1976 1977 1978 ... 2016 2017 2018 2019 2020 2021 Percent Absolute Percent.1 Absolute.1
0 Alabama 102.6 98.5 104.9 109.6 108.8 107.8 108.1 111.7 106.6 ... 114.0 108.6 112.4 106.3 98.4 108.4 5.6% 5.7 10.1% 10.0
1 Alaska 11.3 12.6 13.4 12.5 12.8 14.5 16.0 18.0 19.5 ... 33.4 33.7 34.5 34.3 36.0 38.9 242.5% 27.5 8.0% 2.9
2 Arizona 24.9 27.0 30.2 34.4 36.7 38.2 43.8 50.5 49.3 ... 90.9 90.5 94.1 92.6 80.2 83.0 233.3% 58.1 3.6% 2.9
3 Arkansas 36.2 35.1 37.2 40.8 39.1 36.4 38.9 41.6 42.4 ... 62.1 64.2 70.8 65.1 54.7 62.0 71.4% 25.8 13.3% 7.3
4 California 294.4 305.8 312.7 329.3 304.5 311.5 326.9 354.5 345.2 ... 353.4 356.5 358.6 358.3 303.8 324.0 10.1% 29.7 6.7% 20.2

5 rows × 57 columns

In [17]:
state_names = {
    'AL': 'Alabama',
    'AK': 'Alaska',
    'AZ': 'Arizona',
    'AR': 'Arkansas',
    'CA': 'California',
    'CO': 'Colorado',
    'CT': 'Connecticut',
    'DE': 'Delaware',
    'FL': 'Florida',
    'GA': 'Georgia',
    'HI': 'Hawaii',
    'ID': 'Idaho',
    'IL': 'Illinois',
    'IN': 'Indiana',
    'IA': 'Iowa',
    'KS': 'Kansas',
    'KY': 'Kentucky',
    'LA': 'Louisiana',
    'ME': 'Maine',
    'MD': 'Maryland',
    'MA': 'Massachusetts',
    'MI': 'Michigan',
    'MN': 'Minnesota',
    'MS': 'Mississippi',
    'MO': 'Missouri',
    'MT': 'Montana',
    'NE': 'Nebraska',
    'NV': 'Nevada',
    'NH': 'New Hampshire',
    'NJ': 'New Jersey',
    'NM': 'New Mexico',
    'NY': 'New York',
    'NC': 'North Carolina',
    'ND': 'North Dakota',
    'OH': 'Ohio',
    'OK': 'Oklahoma',
    'OR': 'Oregon',
    'PA': 'Pennsylvania',
    'RI': 'Rhode Island',
    'SC': 'South Carolina',
    'SD': 'South Dakota',
    'TN': 'Tennessee',
    'TX': 'Texas',
    'UT': 'Utah',
    'VT': 'Vermont',
    'VA': 'Virginia',
    'WA': 'Washington',
    'WV': 'West Virginia',
    'WI': 'Wisconsin',
    'WY': 'Wyoming'
}
In [18]:
asthma_population_merged['State Name'] = asthma_population_merged['State'].map(state_names)
asthma_population_merged
Out[18]:
YearStart LocationAbbr DataValue Year State Total Population NormalizedOccurence Asthma Prevalence (%) State Name
0 2010 AZ 40650.0 2010 AZ 1600442.0 0.025399 2.539923 Arizona
1 2010 CA 217637.0 2010 CA 8730612.0 0.024928 2.492803 California
2 2010 FL 141612.0 2010 FL 18845537.0 0.007514 0.751435 Florida
3 2010 IA 14135.0 2010 IA 2075096.0 0.006812 0.681173 Iowa
4 2010 KY 27334.0 2010 KY 4348181.0 0.006286 0.628631 Kentucky
... ... ... ... ... ... ... ... ... ...
95 2018 NJ 52173.0 2018 NJ 569816.0 0.091561 9.156114 New Jersey
96 2018 NV 14810.0 2018 NV 199220.0 0.074340 7.433993 Nevada
97 2018 OR 12819.0 2018 OR 4740945.0 0.002704 0.270389 Oregon
98 2018 VT 1799.0 2018 VT 324553.0 0.005543 0.554301 Vermont
99 2018 WI 21174.0 2018 WI 2765223.0 0.007657 0.765725 Wisconsin

100 rows × 9 columns

In [19]:
emmissions_long = emmissions_df.melt(id_vars=['State'], var_name='Year', value_name='Emissions')
emmissions_long = emmissions_long[emmissions_long['Year'].apply(lambda x: x.isnumeric())]
emmissions_long['Year'] = pd.to_numeric(emmissions_long['Year'], errors='coerce')
asthma_population_merged['YearStart'] = pd.to_numeric(asthma_population_merged['YearStart'], errors='coerce')
merged_df = pd.merge(asthma_population_merged, emmissions_long, left_on=['YearStart', 'State Name'], right_on=['Year', 'State'], how='left')
merged_df.drop(columns=['State_y'], inplace=True)
merged_df.rename(columns={'State_x': 'State'}, inplace=True)
merged_df.rename(columns={'Year_x': 'Year'}, inplace=True)
merged_df.drop(columns=['Year_y'], inplace=True)
merged_df.head()
Out[19]:
YearStart LocationAbbr DataValue Year State Total Population NormalizedOccurence Asthma Prevalence (%) State Name Emissions
0 2010 AZ 40650.0 2010 AZ 1600442.0 0.025399 2.539923 Arizona 99.5
1 2010 CA 217637.0 2010 CA 8730612.0 0.024928 2.492803 California 356.6
2 2010 FL 141612.0 2010 FL 18845537.0 0.007514 0.751435 Florida 245.5
3 2010 IA 14135.0 2010 IA 2075096.0 0.006812 0.681173 Iowa 90.3
4 2010 KY 27334.0 2010 KY 4348181.0 0.006286 0.628631 Kentucky 152.9
In [20]:
florida_asthma_data = merged_df[(merged_df['LocationAbbr'] == 'FL')]
new_jersey_asthma_data = merged_df[(merged_df['LocationAbbr'] == 'NJ')]
iowa_asthma_data = merged_df[(merged_df['LocationAbbr'] == 'IA')]

fig, axes = plt.subplots(nrows=4, ncols=1, figsize=(10, 24))

# Florida
X_fl = florida_asthma_data['YearStart'].values.reshape(-1, 1)
y_fl = florida_asthma_data['Asthma Prevalence (%)'].values

model_fl = LinearRegression().fit(X_fl, y_fl)
trendline_fl = model_fl.predict(X_fl)

axes[0].scatter(X_fl, y_fl)
axes[0].plot(X_fl, trendline_fl, color='red')
axes[0].set_title('Asthma Prevalence (%) by Year in Florida')
axes[0].set_xlabel('Year')
axes[0].set_ylabel('Asthma Prevalence (%)')
axes[0].set_ylim(bottom=0)
axes[0].grid(True)

# New Jersey
X_nj = new_jersey_asthma_data['YearStart'].values.reshape(-1, 1)
y_nj = new_jersey_asthma_data['Asthma Prevalence (%)'].values

model_nj = LinearRegression().fit(X_nj, y_nj)
trendline_nj = model_nj.predict(X_nj)

axes[1].scatter(X_nj, y_nj)
axes[1].plot(X_nj, trendline_nj, color='green')
axes[1].set_title('Asthma Prevalence (%) by Year in New Jersey')
axes[1].set_xlabel('Year')
axes[1].set_ylabel('Asthma Prevalence (%)')
axes[1].set_ylim(bottom=0)
axes[1].grid(True)

# Iowa
X_ia = iowa_asthma_data['YearStart'].values.reshape(-1, 1)
y_ia = iowa_asthma_data['Asthma Prevalence (%)'].values

model_ia = LinearRegression().fit(X_ia, y_ia)
trendline_ia = model_ia.predict(X_ia)

axes[2].scatter(X_ia, y_ia)
axes[2].plot(X_ia, trendline_ia, color='black')
axes[2].set_title('Asthma Prevalence (%) by Year in Iowa')
axes[2].set_xlabel('Year')
axes[2].set_ylabel('Asthma Prevalence (%)')
axes[2].set_ylim(bottom=0)
axes[2].grid(True)

# Combined Trends
axes[3].plot(X_fl, trendline_fl, color='red', label='Florida')
axes[3].plot(X_nj, trendline_nj, color='green', label='New Jersey')
axes[3].plot(X_ia, trendline_ia, color='black', label='Iowa')
axes[3].set_title('Combined Asthma Prevalence (%) Trends')
axes[3].set_xlabel('Year')
axes[3].set_ylabel('Asthma Prevalence (%)')
axes[3].legend()
axes[3].set_ylim(bottom=0)
axes[3].grid(True)

plt.tight_layout()
plt.show()

One confounding factor in asthma prevalence by year in each state is that a single person may visit the emergency room more than once per year; in other words, the data reflects the total number of visits per year rather than the number of unique patients.

New Jersey is a smaller state with a large population, and its CO2 emissions are over 100 million metric tons of CO2 equivalents (MTCO2E). Iowa's emissions are relatively lower.

In [21]:
fig, axes = plt.subplots(nrows=4, ncols=1, figsize=(10, 18))

# Florida
florida_emissions = merged_df[(merged_df['LocationAbbr'] == 'FL')]
X_florida = florida_emissions['YearStart'].values.reshape(-1, 1)
y_florida = florida_emissions['Emissions'].astype(float).values
model_florida = LinearRegression().fit(X_florida, y_florida)
trendline_florida = model_florida.predict(X_florida)

axes[0].scatter(X_florida, y_florida, label='CO2 Emissions (MTCO2E)')
axes[0].plot(X_florida, trendline_florida, color='red', label='Trendline')
axes[0].set_title('CO2 Emissions Over Years in Florida')
axes[0].set_xlabel('Year')
axes[0].set_ylabel('CO2 Emissions (MTCO2E)')
axes[0].set_ylim(bottom=0)
axes[0].legend()
axes[0].grid(True)

# New Jersey
new_jersey_emissions = merged_df[(merged_df['LocationAbbr'] == 'NJ')]
X_new_jersey = new_jersey_emissions['YearStart'].values.reshape(-1, 1)
y_new_jersey = new_jersey_emissions['Emissions'].astype(float).values
model_new_jersey = LinearRegression().fit(X_new_jersey, y_new_jersey)
trendline_new_jersey = model_new_jersey.predict(X_new_jersey)

axes[1].scatter(X_new_jersey, y_new_jersey, label='CO2 Emissions (MTCO2E)')
axes[1].plot(X_new_jersey, trendline_new_jersey, color='green', label='Trendline')
axes[1].set_title('CO2 Emissions Over Years in New Jersey')
axes[1].set_xlabel('Year')
axes[1].set_ylabel('CO2 Emissions (MTCO2E)')
axes[1].set_ylim(bottom=0)
axes[1].legend()
axes[1].grid(True)

# Iowa
iowa_emissions = merged_df[(merged_df['LocationAbbr'] == 'IA')]
X_iowa = iowa_emissions['YearStart'].values.reshape(-1, 1)
y_iowa = iowa_emissions['Emissions'].astype(float).values
model_iowa = LinearRegression().fit(X_iowa, y_iowa)
trendline_iowa = model_iowa.predict(X_iowa)

axes[2].scatter(X_iowa, y_iowa, label='CO2 Emissions (MTCO2E)')
axes[2].plot(X_iowa, trendline_iowa, color='blue', label='Trendline')
axes[2].set_title('CO2 Emissions Over Years in Iowa')
axes[2].set_xlabel('Year')
axes[2].set_ylabel('CO2 Emissions (MTCO2E)')
axes[2].set_ylim(bottom=0)
axes[2].legend()
axes[2].grid(True)

# Combined Trends
axes[3].plot(X_fl, trendline_florida, color='red', label='Florida')
axes[3].plot(X_nj, trendline_new_jersey, color='green', label='New Jersey')
axes[3].plot(X_ia, trendline_iowa, color='black', label='Iowa')
axes[3].set_title('CO2 Emissions Trends')
axes[3].set_xlabel('Year')
axes[3].set_ylabel('CO2 Emissions')
axes[3].set_ylim(bottom=0)
axes[3].legend()
axes[3].grid(True)

plt.tight_layout()
plt.show()

Although New Jersey is smaller than Iowa, it produces roughly 25% more CO2, which may contribute to its higher asthma prevalence compared to Iowa.

Florida has the highest CO2 emissions of the three states plotted and a population of 22.24 million. New Jersey emits around 110 MTCO2E and has a population of 9.26 million.

For Iowa, the population is 3.2 million and the CO2 emissions are around 90 MTCO2E.

Confounding variables might include age, sex, and population.

In [22]:
merged_df[['LocationAbbr', 'YearStart', 'NormalizedOccurence', 'Emissions']].head()
# NormalizedOccurence = number of occurrences / total population
# NormalizedOccurence is the decimal value
# asthma % = NormalizedOccurence * 100
Out[22]:
LocationAbbr YearStart NormalizedOccurence Emissions
0 AZ 2010 0.025399 99.5
1 CA 2010 0.024928 356.6
2 FL 2010 0.007514 245.5
3 IA 2010 0.006812 90.3
4 KY 2010 0.006286 152.9
In [23]:
plt.figure(figsize=(12, 6))

data_2010 = merged_df[merged_df['YearStart'] == 2010]

bar_width = 0.35
index = np.arange(len(data_2010['LocationAbbr']))

bars1 = plt.bar(index, data_2010['Asthma Prevalence (%)'].astype(float), bar_width, label='Asthma Prevalence (%)')
bars2 = plt.bar(index + bar_width, data_2010['Emissions'].astype(float), bar_width, label='Emissions (MTCO2E)')

plt.xlabel('State')
plt.ylabel('Values')
plt.title('Asthma Prevalence (%) vs Emissions by State for 2010')
plt.xticks(index + bar_width / 2, data_2010['LocationAbbr'])
plt.legend()
plt.tight_layout()
plt.show()

This categorical chart shows "Asthma Prevalence (%)" and "Emissions" for different states, specifically for the year 2010. California has the highest emissions, whereas Nebraska (NE) has the lowest.

New Jersey has the highest asthma prevalence, whereas Massachusetts has the lowest.

We want to follow up on potential confounding variables such as sex to see their relationship with asthma-related emergency visits.

For emissions, confounding variables include population.

We should explore the effects of these confounding variables using the population dataset (see the sketch below). The CDC dataset also separates asthma visits by male and female, so we can explore the relationship between sex and asthma.
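As a sketch of how the population confounder could be handled on the emissions side, the merged table already has the needed columns; the 'Emissions per Capita (tons)' column name is hypothetical and this computation is not performed elsewhere in the notebook. Note that table1.csv reports million metric tons of CO2 equivalents, so we scale by 1e6 to get metric tons per person:

# hypothetical per-capita emissions: 'Emissions' is in million metric tons of CO2e (MTCO2E)
merged_df['Emissions per Capita (tons)'] = merged_df['Emissions'].astype(float) * 1e6 / merged_df['Total Population']
merged_df[['YearStart', 'LocationAbbr', 'Emissions per Capita (tons)']].head()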


In [24]:
asthma_df = pd.read_csv('U.S._Chronic_Disease_Indicators__CDI___2023_Release_20240417.csv', low_memory=False)

asthma_data = asthma_df[asthma_df['Topic'] == 'Asthma']
asthma_data = asthma_data[asthma_data['Question'] == 'Emergency department visit rate for asthma']
asthma_data = asthma_data[asthma_data['StratificationCategory1'] == 'Gender']
asthma_data['DataValue'] = pd.to_numeric(asthma_data['DataValue'], errors='coerce')
asthma_data.dropna(subset=['DataValue'], inplace=True)
asthma_state_yearly = asthma_data.groupby(['YearStart', 'LocationAbbr', 'Stratification1'])['DataValue'].mean().reset_index()

df_population = pd.read_csv('Population by Age and Sex - EDDs.csv')
df_population_summary = df_population.groupby(['Year', 'State'])['Total Population'].sum().reset_index()
asthma_population_merged = pd.merge(asthma_state_yearly, df_population_summary, left_on=['YearStart', 'LocationAbbr'], right_on=['Year', 'State'], how='inner')
state_names = {
    'AL': 'Alabama',
    'AK': 'Alaska',
    'AZ': 'Arizona',
    'AR': 'Arkansas',
    'CA': 'California',
    'CO': 'Colorado',
    'CT': 'Connecticut',
    'DE': 'Delaware',
    'FL': 'Florida',
    'GA': 'Georgia',
    'HI': 'Hawaii',
    'ID': 'Idaho',
    'IL': 'Illinois',
    'IN': 'Indiana',
    'IA': 'Iowa',
    'KS': 'Kansas',
    'KY': 'Kentucky',
    'LA': 'Louisiana',
    'ME': 'Maine',
    'MD': 'Maryland',
    'MA': 'Massachusetts',
    'MI': 'Michigan',
    'MN': 'Minnesota',
    'MS': 'Mississippi',
    'MO': 'Missouri',
    'MT': 'Montana',
    'NE': 'Nebraska',
    'NV': 'Nevada',
    'NH': 'New Hampshire',
    'NJ': 'New Jersey',
    'NM': 'New Mexico',
    'NY': 'New York',
    'NC': 'North Carolina',
    'ND': 'North Dakota',
    'OH': 'Ohio',
    'OK': 'Oklahoma',
    'OR': 'Oregon',
    'PA': 'Pennsylvania',
    'RI': 'Rhode Island',
    'SC': 'South Carolina',
    'SD': 'South Dakota',
    'TN': 'Tennessee',
    'TX': 'Texas',
    'UT': 'Utah',
    'VT': 'Vermont',
    'VA': 'Virginia',
    'WA': 'Washington',
    'WV': 'West Virginia',
    'WI': 'Wisconsin',
    'WY': 'Wyoming'
}
asthma_population_merged['State Name'] = asthma_population_merged['State'].map(state_names)

df_male_population_summary = df_population.groupby(['Year', 'State'])['Male Population'].sum().reset_index()
df_female_population_summary = df_population.groupby(['Year', 'State'])['Female Population'].sum().reset_index()
print(df_male_population_summary, df_female_population_summary)
asthma_population_merged = pd.merge(asthma_population_merged, df_male_population_summary, left_on=['YearStart', 'LocationAbbr'], right_on=['Year', 'State'], how='left', suffixes=('', '_male_pop'))
asthma_population_merged = pd.merge(asthma_population_merged, df_female_population_summary, left_on=['YearStart', 'LocationAbbr'], right_on=['Year', 'State'], how='left', suffixes=('', '_female_pop'))

emmissions_df = pd.read_csv('table1.csv')
emmissions_long = emmissions_df.melt(id_vars=['State'], var_name='Year', value_name='Emissions')
emmissions_long = emmissions_long[emmissions_long['Year'].apply(lambda x: x.isnumeric())]
emmissions_long['Year'] = pd.to_numeric(emmissions_long['Year'], errors='coerce')

merged_df = pd.merge(asthma_population_merged, emmissions_long, left_on=['YearStart', 'State Name'], right_on=['Year', 'State'], how='left')
merged_df.drop(columns=['State_y'], inplace=True)
merged_df.rename(columns={'State_x': 'State'}, inplace=True)

merged_df.head()
     Year State  Male Population
0    2000    AK          76955.0
1    2000    AL        2175150.0
2    2000    AR        1307573.0
3    2000    AZ         614849.0
4    2000    CA        4063605.0
..    ...   ...              ...
915  2019    VA         403981.0
916  2019    VT         160948.0
917  2019    WA        2811851.0
918  2019    WI        1393637.0
919  2019    WV         889160.0

[920 rows x 3 columns]      Year State  Female Population
0    2000    AK            69689.0
1    2000    AL          2335216.0
2    2000    AR          1371015.0
3    2000    AZ           605521.0
4    2000    CA          4099641.0
..    ...   ...                ...
915  2019    VA           412600.0
916  2019    VT           162963.0
917  2019    WA          2795301.0
918  2019    WI          1378458.0
919  2019    WV           906103.0

[920 rows x 3 columns]
Out[24]:
YearStart LocationAbbr Stratification1 DataValue Year_x State Total Population State Name Year_male_pop State_male_pop Male Population Year_female_pop State_female_pop Female Population Year_y Emissions
0 2010 AZ Female 7473.116667 2010 AZ 1600442.0 Arizona 2010 AZ 809747.0 2010 AZ 790763.0 2010 99.5
1 2010 AZ Male 6150.000000 2010 AZ 1600442.0 Arizona 2010 AZ 809747.0 2010 AZ 790763.0 2010 99.5
2 2010 CA Female 38898.420000 2010 CA 8730612.0 California 2010 CA 4327961.0 2010 CA 4402785.0 2010 356.6
3 2010 CA Male 32368.006667 2010 CA 8730612.0 California 2010 CA 4327961.0 2010 CA 4402785.0 2010 356.6
4 2010 FL Female 27106.200000 2010 FL 18845537.0 Florida 2010 FL 9212261.0 2010 FL 9633882.0 2010 245.5
In [25]:
florida_er_visits = merged_df[(merged_df['LocationAbbr'] == 'FL')][['YearStart', 'DataValue', 'Stratification1', 'Male Population', 'Female Population']].dropna()
iowa_er_visits = merged_df[(merged_df['LocationAbbr'] == 'IA')][['YearStart', 'DataValue', 'Stratification1', 'Male Population', 'Female Population']].dropna()
new_jersey_er_visits = merged_df[(merged_df['LocationAbbr'] == 'NJ')][['YearStart', 'DataValue', 'Stratification1', 'Male Population', 'Female Population']].dropna()

florida_er_visits['Visits per 10,000'] = florida_er_visits.apply(lambda x: (x['DataValue'] / x['Male Population'] * 10000) if x['Stratification1'] == 'Male' else (x['DataValue'] / x['Female Population'] * 10000), axis=1)
iowa_er_visits['Visits per 10,000'] = iowa_er_visits.apply(lambda x: (x['DataValue'] / x['Male Population'] * 10000) if x['Stratification1'] == 'Male' else (x['DataValue'] / x['Female Population'] * 10000), axis=1)
new_jersey_er_visits['Visits per 10,000'] = new_jersey_er_visits.apply(lambda x: (x['DataValue'] / x['Male Population'] * 10000) if x['Stratification1'] == 'Male' else (x['DataValue'] / x['Female Population'] * 10000), axis=1)

total_visits_per_10000_florida = florida_er_visits.groupby(['YearStart', 'Stratification1'])['Visits per 10,000'].sum().reset_index()
total_visits_per_10000_iowa = iowa_er_visits.groupby(['YearStart', 'Stratification1'])['Visits per 10,000'].sum().reset_index()
total_visits_per_10000_new_jersey = new_jersey_er_visits.groupby(['YearStart', 'Stratification1'])['Visits per 10,000'].sum().reset_index()

(total_visits_per_10000_florida, total_visits_per_10000_iowa, total_visits_per_10000_new_jersey)
Out[25]:
(    YearStart Stratification1  Visits per 10,000
 0        2010          Female          28.136321
 1        2010            Male          21.920341
 2        2013          Female          23.307932
 3        2013            Male          19.730475
 4        2014          Female          24.128772
 5        2014            Male          20.049779
 6        2015          Female          17.268291
 7        2015            Male          14.077174
 8        2016          Female          20.140228
 9        2016            Male          17.712882
 10       2017          Female          19.240536
 11       2017            Male          16.687892
 12       2018          Female          19.346165
 13       2018            Male          16.858536,
     YearStart Stratification1  Visits per 10,000
 0        2010          Female          25.532069
 1        2010            Male          20.437976
 2        2013          Female          19.253279
 3        2013            Male          16.662763
 4        2014          Female          20.461681
 5        2014            Male          17.900915
 6        2015          Female          15.734273
 7        2015            Male          12.679087
 8        2016          Female          17.134587
 9        2016            Male          14.657145
 10       2017          Female          16.417481
 11       2017            Male          14.772427
 12       2018          Female          16.205231
 13       2018            Male          13.870933,
     YearStart Stratification1  Visits per 10,000
 0        2010          Female         517.049395
 1        2010            Male         399.737513
 2        2013          Female         404.753877
 3        2013            Male         345.234332
 4        2014          Female         409.523794
 5        2014            Male         343.297010
 6        2015          Female         292.801786
 7        2015            Male         247.006551
 8        2016          Female         359.963605
 9        2016            Male         310.604665
 10       2017          Female         345.041857
 11       2017            Male         287.855561
 12       2018          Female         328.137771
 13       2018            Male         284.557950)
In [26]:
fig, axes = plt.subplots(3, 1, figsize=(12, 18))

# Florida
axes[0].plot(total_visits_per_10000_florida[total_visits_per_10000_florida['Stratification1'] == 'Male']['YearStart'], 
         total_visits_per_10000_florida[total_visits_per_10000_florida['Stratification1'] == 'Male']['Visits per 10,000'], 
         label='Male', marker='o', color='blue')
axes[0].plot(total_visits_per_10000_florida[total_visits_per_10000_florida['Stratification1'] == 'Female']['YearStart'], 
         total_visits_per_10000_florida[total_visits_per_10000_florida['Stratification1'] == 'Female']['Visits per 10,000'], 
         label='Female', marker='o', color='red')
# Combine Male + Female for Florida
axes[0].plot(total_visits_per_10000_florida['YearStart'].unique(), 
         total_visits_per_10000_florida.groupby('YearStart')['Visits per 10,000'].sum(), 
         label='Total', marker='o', color='purple')
axes[0].set_title('Emergency Room Visits for Asthma per 10,000 People in Florida by Gender')
axes[0].set_xlabel('Year')
axes[0].set_ylabel('Visits per 10,000 People')
axes[0].legend()
axes[0].grid(True)
axes[0].set_ylim(bottom=0)

# Iowa
axes[1].plot(total_visits_per_10000_iowa[total_visits_per_10000_iowa['Stratification1'] == 'Male']['YearStart'], 
         total_visits_per_10000_iowa[total_visits_per_10000_iowa['Stratification1'] == 'Male']['Visits per 10,000'], 
         label='Male', marker='o', color='blue')
axes[1].plot(total_visits_per_10000_iowa[total_visits_per_10000_iowa['Stratification1'] == 'Female']['YearStart'], 
         total_visits_per_10000_iowa[total_visits_per_10000_iowa['Stratification1'] == 'Female']['Visits per 10,000'], 
         label='Female', marker='o', color='red')
# Combine Male + Female for Iowa
axes[1].plot(total_visits_per_10000_iowa['YearStart'].unique(), 
         total_visits_per_10000_iowa.groupby('YearStart')['Visits per 10,000'].sum(), 
         label='Total', marker='o', color='purple')
axes[1].set_title('Emergency Room Visits for Asthma per 10,000 People in Iowa by Gender')
axes[1].set_xlabel('Year')
axes[1].set_ylabel('Visits per 10,000 People')
axes[1].legend()
axes[1].grid(True)
axes[1].set_ylim(bottom=0)

# New Jersey
axes[2].plot(total_visits_per_10000_new_jersey[total_visits_per_10000_new_jersey['Stratification1'] == 'Male']['YearStart'], 
         total_visits_per_10000_new_jersey[total_visits_per_10000_new_jersey['Stratification1'] == 'Male']['Visits per 10,000'], 
         label='Male', marker='o', color='blue')
axes[2].plot(total_visits_per_10000_new_jersey[total_visits_per_10000_new_jersey['Stratification1'] == 'Female']['YearStart'], 
         total_visits_per_10000_new_jersey[total_visits_per_10000_new_jersey['Stratification1'] == 'Female']['Visits per 10,000'], 
         label='Female', marker='o', color='red')
# Combine Male + Female for New Jersey
axes[2].plot(total_visits_per_10000_new_jersey['YearStart'].unique(), 
         total_visits_per_10000_new_jersey.groupby('YearStart')['Visits per 10,000'].sum(), 
         label='Total', marker='o', color='purple')
axes[2].set_title('Emergency Room Visits for Asthma per 10,000 People in New Jersey by Gender')
axes[2].set_xlabel('Year')
axes[2].set_ylabel('Visits per 10,000 People')
axes[2].legend()
axes[2].grid(True)
axes[2].set_ylim(bottom=0)

plt.tight_layout()
plt.show()

Based on the graphs for each state, per 10,000 people the number of females admitted to the emergency room for asthma-related issues is always higher than the number of males. Although New Jersey's asthma prevalence (%) is vastly larger than Florida's and Iowa's, gender does not appear to play a large role, as the male/female trends are roughly the same across all three states.

Race and Emergency Department Visits

In [27]:
asthma_df = pd.read_csv('U.S._Chronic_Disease_Indicators__CDI___2023_Release_20240417.csv', low_memory=False)

asthma_data = asthma_df[asthma_df['Topic'] == 'Asthma']
asthma_data = asthma_data[asthma_data['Question'] == 'Emergency department visit rate for asthma']
asthma_data = asthma_data[asthma_data['StratificationCategory1'] == 'Race/Ethnicity']
asthma_data['DataValue'] = pd.to_numeric(asthma_data['DataValue'], errors='coerce')
asthma_data.dropna(subset=['DataValue'], inplace=True)
asthma_state_yearly = asthma_data.groupby(['YearStart', 'LocationAbbr', 'Stratification1'])['DataValue'].mean().reset_index()

df_population = pd.read_csv('Population by Race - EDDs.csv')
df_population_total = df_population.groupby(['Year', 'State'])['Total Population'].sum().reset_index()
df_population_white = df_population.groupby(['Year', 'State'])['White Alone'].sum().reset_index()
df_population_black = df_population.groupby(['Year', 'State'])['Black Alone'].sum().reset_index()
df_population_american_indian = df_population.groupby(['Year', 'State'])['American Indian or Alaskan Native'].sum().reset_index()
df_population_hawaiian = df_population.groupby(['Year', 'State'])['Hawaiian or Pacific Islander Alone'].sum().reset_index()
df_population_two = df_population.groupby(['Year', 'State'])['Two or More Races'].sum().reset_index()
df_population_not_hispanic = df_population.groupby(['Year', 'State'])['Not Hispanic'].sum().reset_index()
df_population_hispanic = df_population.groupby(['Year', 'State'])['Hispanic'].sum().reset_index()

asthma_population_merged = pd.merge(asthma_state_yearly, df_population_total, left_on=['YearStart', 'LocationAbbr'], right_on=['Year', 'State'], how='inner')
merges = [df_population_white, df_population_black, df_population_american_indian, df_population_hawaiian, df_population_two, df_population_not_hispanic, df_population_hispanic]
for merge_df in merges:
    asthma_population_merged = pd.merge(asthma_population_merged, merge_df, on=['Year', 'State'], how='inner')
asthma_population_merged
Out[27]:
YearStart LocationAbbr Stratification1 DataValue Year State Total Population White Alone Black Alone American Indian or Alaskan Native Hawaiian or Pacific Islander Alone Two or More Races Not Hispanic Hispanic
0 2013 AZ American Indian or Alaska Native 557.366667 2013 AZ 1617914.0 1313929.0 38650.0 202591.0 4009.0 36760.0 1169683.0 448231.0
1 2013 AZ Asian or Pacific Islander 128.233333 2013 AZ 1617914.0 1313929.0 38650.0 202591.0 4009.0 36760.0 1169683.0 448231.0
2 2013 AZ Black, non-Hispanic 1588.833333 2013 AZ 1617914.0 1313929.0 38650.0 202591.0 4009.0 36760.0 1169683.0 448231.0
3 2013 AZ Hispanic 3657.833333 2013 AZ 1617914.0 1313929.0 38650.0 202591.0 4009.0 36760.0 1169683.0 448231.0
4 2013 AZ White, non-Hispanic 4860.633333 2013 AZ 1617914.0 1313929.0 38650.0 202591.0 4009.0 36760.0 1169683.0 448231.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
360 2018 WI American Indian or Alaska Native 141.313333 2018 WI 2765223.0 2566205.0 39437.0 47612.0 1672.0 45678.0 2658039.0 107184.0
361 2018 WI Asian or Pacific Islander 102.496667 2018 WI 2765223.0 2566205.0 39437.0 47612.0 1672.0 45678.0 2658039.0 107184.0
362 2018 WI Black, non-Hispanic 2706.553333 2018 WI 2765223.0 2566205.0 39437.0 47612.0 1672.0 45678.0 2658039.0 107184.0
363 2018 WI Hispanic 722.306667 2018 WI 2765223.0 2566205.0 39437.0 47612.0 1672.0 45678.0 2658039.0 107184.0
364 2018 WI White, non-Hispanic 3391.546667 2018 WI 2765223.0 2566205.0 39437.0 47612.0 1672.0 45678.0 2658039.0 107184.0

365 rows × 14 columns

In [28]:
asthma_population_merged['Race'] = asthma_population_merged['Stratification1'].apply(lambda x: x.split(',')[0])

race_mapping = {
    'American Indian or Alaska Native': 'American Indian or Alaskan Native',
    'Asian or Pacific Islander': 'Hawaiian or Pacific Islander Alone',
    'Black, non-Hispanic': 'Black Alone',
    'Hispanic': 'Hispanic',
    'White, non-Hispanic': 'White Alone',
    "Black": "Black Alone",
    "White": "White Alone"
}

asthma_population_merged['Clean Race'] = asthma_population_merged['Race'].map(race_mapping)

def get_race_population(row):
    race_column = row['Clean Race']
    if race_column in row:
        return row[race_column]
    else:
        return None

asthma_population_merged['Clean Race Population'] = asthma_population_merged.apply(get_race_population, axis=1)
asthma_population_merged['Visits per 10,000 based on Race'] = (asthma_population_merged['DataValue'] / asthma_population_merged['Clean Race Population']) * 10000
asthma_population_merged.head()
Out[28]:
YearStart LocationAbbr Stratification1 DataValue Year State Total Population White Alone Black Alone American Indian or Alaskan Native Hawaiian or Pacific Islander Alone Two or More Races Not Hispanic Hispanic Race Clean Race Clean Race Population Visits per 10,000 based on Race
0 2013 AZ American Indian or Alaska Native 557.366667 2013 AZ 1617914.0 1313929.0 38650.0 202591.0 4009.0 36760.0 1169683.0 448231.0 American Indian or Alaska Native American Indian or Alaskan Native 202591.0 27.511916
1 2013 AZ Asian or Pacific Islander 128.233333 2013 AZ 1617914.0 1313929.0 38650.0 202591.0 4009.0 36760.0 1169683.0 448231.0 Asian or Pacific Islander Hawaiian or Pacific Islander Alone 4009.0 319.863640
2 2013 AZ Black, non-Hispanic 1588.833333 2013 AZ 1617914.0 1313929.0 38650.0 202591.0 4009.0 36760.0 1169683.0 448231.0 Black Black Alone 38650.0 411.082363
3 2013 AZ Hispanic 3657.833333 2013 AZ 1617914.0 1313929.0 38650.0 202591.0 4009.0 36760.0 1169683.0 448231.0 Hispanic Hispanic 448231.0 81.605987
4 2013 AZ White, non-Hispanic 4860.633333 2013 AZ 1617914.0 1313929.0 38650.0 202591.0 4009.0 36760.0 1169683.0 448231.0 White White Alone 1313929.0 36.993120
In [29]:
states = ['FL', 'IA', 'NJ']
fig, axes = plt.subplots(nrows=3, ncols=1, figsize=(10, 20))

asthma_population_merged['StateIndex'] = asthma_population_merged['LocationAbbr'].apply(lambda x: states.index(x) if x in states else None)
filtered_data = asthma_population_merged.dropna(subset=['StateIndex'])

sns.lineplot(ax=axes[0], data=filtered_data[filtered_data['LocationAbbr'] == 'FL'], x='YearStart', y='Visits per 10,000 based on Race', hue='Clean Race', legend='full')
axes[0].set_title('ER Visits per 10,000 based on Race - FL')
axes[0].set_xlabel('Year')
axes[0].set_ylabel('Visits per 10,000')

sns.lineplot(ax=axes[1], data=filtered_data[filtered_data['LocationAbbr'] == 'IA'], x='YearStart', y='Visits per 10,000 based on Race', hue='Clean Race', legend='full')
axes[1].set_title('ER Visits per 10,000 based on Race - IA')
axes[1].set_xlabel('Year')
axes[1].set_ylabel('Visits per 10,000')

sns.lineplot(ax=axes[2], data=filtered_data[filtered_data['LocationAbbr'] == 'NJ'], x='YearStart', y='Visits per 10,000 based on Race', hue='Clean Race', legend='full')
axes[2].set_title('ER Visits per 10,000 based on Race - NJ')
axes[2].set_xlabel('Year')
axes[2].set_ylabel('Visits per 10,000')

plt.tight_layout()
plt.show()

We graph the emergency room visits per 10,000 based on race and see that Hawaiian or Pacific Islander Alone has the highest number of visits per 10,000. This is likely inflated because the CDC's "Asian or Pacific Islander" visit counts are divided only by the "Hawaiian or Pacific Islander Alone" population (the Asian Alone population is not part of the denominator), which may also explain apparent outliers in how the statistics are collected. People identifying as Black have the second highest number of asthma-related visits per 10,000.


The causal question we aim to investigate is: "Does exposure to increased levels of CO2 cause an increase in the number of asthma-related emergency room visits?"

In [30]:
# high level or low level of CO2 is the treatment, a binary variable [0, 1]

# Create a DAG
G = nx.DiGraph()

G.add_node("Population", pos=(0.3, 0.7))
G.add_node("CO2", pos=(1, 0))
G.add_node("Asthma", pos=(1.7, 0.7))
G.add_node("Number of power plants", pos=(0.5, 0))

G.add_edges_from([("Population", "CO2"), ("Population", "Asthma"), ("CO2", "Asthma"), ('Number of power plants', 'CO2')])

pos = nx.get_node_attributes(G, 'pos')

plt.figure(figsize=(12, 6))
try:
    nx.draw_networkx(G, pos, with_labels=True, node_size=5000, node_color='skyblue', font_size=15, arrows=True)
except:
    nx.draw(G, pos, with_labels=True, node_size=5000, node_color='skyblue', font_size=15, arrows=True)
plt.title('Causal DAG: Population as a Confounding Variable')
plt.show()
In [31]:
#Load the asthma dataframe
asthma_df = pd.read_csv('U.S._Chronic_Disease_Indicators__CDI___2023_Release_20240417.csv', low_memory=False)
#Group the asthma dataframe by Year and State
asthma_data = asthma_df[asthma_df['Topic'] == 'Asthma']
#Filter out rows that match our research question related to emergency department visit rates for asthma
asthma_data = asthma_data[asthma_data['Question'] == 'Emergency department visit rate for asthma']
asthma_data = asthma_data[asthma_data['DataValueType'] == 'Number']
asthma_data = asthma_data[asthma_data['StratificationCategory1'] == 'Overall']
asthma_data['DataValue'] = pd.to_numeric(asthma_data['DataValue'], errors='coerce')
asthma_data.dropna(subset=['DataValue'], inplace=True)
asthma_state_yearly = asthma_data.groupby(['YearStart', 'LocationAbbr'])['DataValue'].mean().reset_index()
#Group the asthma dataset with the Population Dataset because Population is a confounding variable
# We grouped by population
df_population = pd.read_csv('Population by Age and Sex - EDDs.csv')
df_population_summary = df_population.groupby(['Year', 'State'])['Total Population'].sum().reset_index()
asthma_population_merged = pd.merge(asthma_state_yearly, df_population_summary, left_on=['YearStart', 'LocationAbbr'], right_on=['Year', 'State'], how='inner')
state_names = {
    'AL': 'Alabama',
    'AK': 'Alaska',
    'AZ': 'Arizona',
    'AR': 'Arkansas',
    'CA': 'California',
    'CO': 'Colorado',
    'CT': 'Connecticut',
    'DE': 'Delaware',
    'FL': 'Florida',
    'GA': 'Georgia',
    'HI': 'Hawaii',
    'ID': 'Idaho',
    'IL': 'Illinois',
    'IN': 'Indiana',
    'IA': 'Iowa',
    'KS': 'Kansas',
    'KY': 'Kentucky',
    'LA': 'Louisiana',
    'ME': 'Maine',
    'MD': 'Maryland',
    'MA': 'Massachusetts',
    'MI': 'Michigan',
    'MN': 'Minnesota',
    'MS': 'Mississippi',
    'MO': 'Missouri',
    'MT': 'Montana',
    'NE': 'Nebraska',
    'NV': 'Nevada',
    'NH': 'New Hampshire',
    'NJ': 'New Jersey',
    'NM': 'New Mexico',
    'NY': 'New York',
    'NC': 'North Carolina',
    'ND': 'North Dakota',
    'OH': 'Ohio',
    'OK': 'Oklahoma',
    'OR': 'Oregon',
    'PA': 'Pennsylvania',
    'RI': 'Rhode Island',
    'SC': 'South Carolina',
    'SD': 'South Dakota',
    'TN': 'Tennessee',
    'TX': 'Texas',
    'UT': 'Utah',
    'VT': 'Vermont',
    'VA': 'Virginia',
    'WA': 'Washington',
    'WV': 'West Virginia',
    'WI': 'Wisconsin',
    'WY': 'Wyoming'
}
asthma_population_merged['State Name'] = asthma_population_merged['State'].map(state_names)


emmissions_df = pd.read_csv('table1.csv')
emmissions_long = emmissions_df.melt(id_vars=['State'], var_name='Year', value_name='Emissions')
emmissions_long = emmissions_long[emmissions_long['Year'].apply(lambda x: x.isnumeric())]
emmissions_long['Year'] = pd.to_numeric(emmissions_long['Year'], errors='coerce')

merged_df = pd.merge(asthma_population_merged, emmissions_long, left_on=['YearStart', 'State Name'], right_on=['Year', 'State'], how='left')
merged_df.drop(columns=['State_y'], inplace=True)
merged_df.rename(columns={'State_x': 'State'}, inplace=True)
merged_df['Visits per 10,000'] = ((merged_df['DataValue'] / merged_df['Total Population']) * 10000).astype(int)
merged_df['Total Population'] = merged_df['Total Population'].astype(int)
merged_df['Emissions'] = merged_df['Emissions'].astype(float)

merged_df
Out[31]:
YearStart LocationAbbr DataValue Year_x State Total Population State Name Year_y Emissions Visits per 10,000
0 2010 AZ 40650.0 2010 AZ 1600442 Arizona 2010 99.5 253
1 2010 CA 217637.0 2010 CA 8730612 California 2010 356.6 249
2 2010 FL 141612.0 2010 FL 18845537 Florida 2010 245.5 75
3 2010 IA 14135.0 2010 IA 2075096 Iowa 2010 90.3 68
4 2010 KY 27334.0 2010 KY 4348181 Kentucky 2010 152.9 62
... ... ... ... ... ... ... ... ... ... ...
95 2018 NJ 52173.0 2018 NJ 569816 New Jersey 2018 105.1 915
96 2018 NV 14810.0 2018 NV 199220 Nevada 2018 41.4 743
97 2018 OR 12819.0 2018 OR 4740945 Oregon 2018 39.6 27
98 2018 VT 1799.0 2018 VT 324553 Vermont 2018 5.9 55
99 2018 WI 21174.0 2018 WI 2765223 Wisconsin 2018 101.2 76

100 rows × 10 columns

In [32]:
# The states in our dataset with Asthma related information for emergency visits to the hospital
asthma_population_merged["State Name"].unique()
Out[32]:
array(['Arizona', 'California', 'Florida', 'Iowa', 'Kentucky',
       'Massachusetts', 'Maryland', 'North Carolina', 'Nebraska',
       'New Jersey', 'New York', 'South Carolina', 'Utah', 'Wisconsin',
       'Nevada', 'Vermont', 'Arkansas', 'Georgia', 'Minnesota', 'Oregon',
       'Colorado'], dtype=object)

We take the 60th percentile of the "Emissions" column as a threshold: observations above it are labeled as having high CO2 emissions, and observations at or below it as having low CO2 emissions. This gives us a binary treatment variable for causal inference.

In [33]:
threshold = merged_df['Emissions'].quantile(0.60)
print(threshold)
merged_df['High Emissions'] = merged_df['Emissions'].apply(lambda x: 1 if x > threshold else 0)
merged_df
99.34
Out[33]:
YearStart LocationAbbr DataValue Year_x State Total Population State Name Year_y Emissions Visits per 10,000 High Emissions
0 2010 AZ 40650.0 2010 AZ 1600442 Arizona 2010 99.5 253 1
1 2010 CA 217637.0 2010 CA 8730612 California 2010 356.6 249 1
2 2010 FL 141612.0 2010 FL 18845537 Florida 2010 245.5 75 1
3 2010 IA 14135.0 2010 IA 2075096 Iowa 2010 90.3 68 0
4 2010 KY 27334.0 2010 KY 4348181 Kentucky 2010 152.9 62 1
... ... ... ... ... ... ... ... ... ... ... ...
95 2018 NJ 52173.0 2018 NJ 569816 New Jersey 2018 105.1 915 1
96 2018 NV 14810.0 2018 NV 199220 Nevada 2018 41.4 743 0
97 2018 OR 12819.0 2018 OR 4740945 Oregon 2018 39.6 27 0
98 2018 VT 1799.0 2018 VT 324553 Vermont 2018 5.9 55 0
99 2018 WI 21174.0 2018 WI 2765223 Wisconsin 2018 101.2 76 1

100 rows × 11 columns

In [34]:
# Power plant dataframe
# The year is 2022 for all entries
# We use the number of power plants as an instrumental variable
PLNT22df = pd.read_csv('PLNT22.csv')
PLNT22df.head()
C:\Users\tranh\AppData\Local\Temp\ipykernel_2596\2105320888.py:4: DtypeWarning: Columns (0,1,4,6,8,16,17,19,20,22,23,27,29,33,34,49,51,52,54,55,57,58,59,60,61,62,63,64,65,67,68,70,71,73,80,94,95,96,99) have mixed types. Specify dtype option on import or set low_memory=False.
  PLNT22df = pd.read_csv('PLNT22.csv')
Out[34]:
Plant file sequence number Data Year Plant state abbreviation Plant name DOE/EIA ORIS plant or facility code Plant transmission or distribution system owner name Plant transmission or distribution system owner ID Utility name Utility ID Plant-level sector ... Plant wind generation percent (resource mix) Plant solar generation percent (resource mix) Plant geothermal generation percent (resource mix) Plant other fossil generation percent (resource mix) Plant other unknown / purchased fuel generation percent (resource mix) Plant total nonrenewables generation percent (resource mix) Plant total renewables generation percent (resource mix) Plant total nonhydro renewables generation percent (resource mix) Plant total combustion generation percent (resource mix) Plant total noncombustion generation percent (resource mix)
0 SEQPLT22 YEAR PSTATABB PNAME ORISPL OPRNAME OPRCODE UTLSRVNM UTLSRVID SECTOR ... PLWIPR PLSOPR PLGTPR PLOFPR PLOPPR PLTNPR PLTRPR PLTHPR PLCYPR PLCNPR
1 1 2022 AK Agrium Kenai Nitrogen Operations 54452 Homer Electric Assn Inc 19558 Agrium US Inc 179 Industrial CHP ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 2 2022 AK Alakanuk 57053 Alaska Village Elec Coop, Inc 221 Alaska Village Elec Coop, Inc 221 Electric Utility ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 3 2022 AK Allison Creek Hydro 58982 Copper Valley Elec Assn, Inc 4329 Copper Valley Elec Assn, Inc 4329 Electric Utility ... 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 100.0% 0.0% 0.0% 100.0%
4 4 2022 AK Ambler 60243 Alaska Village Elec Coop, Inc 221 Alaska Village Elec Coop, Inc 221 Electric Utility ... 0.0% 0.0% 0.0% 0.0% 0.0% 100.0% 0.0% 0.0% 100.0% 0.0%

5 rows × 141 columns

In [35]:
state_counts = PLNT22df['Plant state abbreviation'].value_counts()
print(state_counts)
data = {
    'State': ['CA', 'NC', 'NY', 'TX', 'MN', 'MA', 'NJ', 'IL', 'IA', 'MI', 'OR', 'CO', 'FL', 'PA', 'GA', 
              'WI', 'SC', 'IN', 'VA', 'OH', 'MD', 'KS', 'CT', 'AK', 'AZ', 'WA', 'ID', 'NM', 'ME', 'MO', 
              'OK', 'NE', 'VT', 'UT', 'NV', 'RI', 'LA', 'TN', 'AL', 'AR', 'HI', 'WY', 'MT', 'NH', 'ND', 
              'PR', 'KY', 'SD', 'MS', 'WV', 'DE', 'DC'],
    'Count': [1678, 875, 821, 754, 724, 600, 368, 328, 283, 266, 265, 265, 242, 242, 235, 204, 197, 194, 
              193, 183, 171, 161, 155, 151, 146, 142, 140, 135, 133, 130, 129, 126, 113, 113, 102, 87, 
              82, 82, 75, 73, 68, 67, 65, 61, 60, 60, 52, 50, 47, 36, 31, 13]
}

df = pd.DataFrame(data)
df.head()
CA          1678
NC           875
NY           821
TX           754
MN           724
MA           600
NJ           368
IL           328
IA           283
MI           266
CO           265
OR           265
PA           242
FL           242
GA           235
WI           204
SC           197
IN           194
VA           193
OH           183
MD           171
KS           161
CT           155
AK           151
AZ           146
WA           142
ID           140
NM           135
ME           133
MO           130
OK           129
NE           126
UT           113
VT           113
NV           102
RI            87
TN            82
LA            82
AL            75
AR            73
HI            68
WY            67
MT            65
NH            61
PR            60
ND            60
KY            52
SD            50
MS            47
WV            36
DE            31
DC            13
PSTATABB       1
Name: Plant state abbreviation, dtype: int64
Out[35]:
State Count
0 CA 1678
1 NC 875
2 NY 821
3 TX 754
4 MN 724
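
The counts above were transcribed by hand into a dict. As a less error-prone alternative, the same table can be built straight from value_counts; a minimal sketch (dropping the stray 'PSTATABB' row that comes from the file's second header line):

state_counts_df = (state_counts.drop('PSTATABB', errors='ignore')
                               .rename_axis('State')
                               .reset_index(name='Count'))
state_counts_df.head()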
In [36]:
result = {k: int(v) for k, v in df.set_index('State')['Count'].to_dict().items()}
merged_df['Total Power Plant Count'] = merged_df['LocationAbbr'].map(result)
In [37]:
merged_df
Out[37]:
YearStart LocationAbbr DataValue Year_x State Total Population State Name Year_y Emissions Visits per 10,000 High Emissions Total Power Plant Count
0 2010 AZ 40650.0 2010 AZ 1600442 Arizona 2010 99.5 253 1 146
1 2010 CA 217637.0 2010 CA 8730612 California 2010 356.6 249 1 1678
2 2010 FL 141612.0 2010 FL 18845537 Florida 2010 245.5 75 1 242
3 2010 IA 14135.0 2010 IA 2075096 Iowa 2010 90.3 68 0 283
4 2010 KY 27334.0 2010 KY 4348181 Kentucky 2010 152.9 62 1 52
... ... ... ... ... ... ... ... ... ... ... ... ...
95 2018 NJ 52173.0 2018 NJ 569816 New Jersey 2018 105.1 915 1 368
96 2018 NV 14810.0 2018 NV 199220 Nevada 2018 41.4 743 0 102
97 2018 OR 12819.0 2018 OR 4740945 Oregon 2018 39.6 27 0 265
98 2018 VT 1799.0 2018 VT 324553 Vermont 2018 5.9 55 0 113
99 2018 WI 21174.0 2018 WI 2765223 Wisconsin 2018 101.2 76 1 204

100 rows × 12 columns

In [38]:
# From lab 9, we fit an OLS model
def fit_OLS_model(df, target_variable, explanatory_variables, intercept = False):
    """
    Fits an OLS model from data.
    Inputs:
        df: pandas DataFrame
        target_variable: string, name of the target variable
        explanatory_variables: list of strings, names of the explanatory variables
        intercept: bool, if True add intercept term
    Outputs:
        fitted_model: model containing OLS regression results
    """
    target = df[target_variable]
    inputs = df[explanatory_variables]
    if intercept:
        inputs = sm.add_constant(inputs)
    fitted_model = sm.OLS(target, inputs).fit()
    return(fitted_model)

Predicting Outcome Based on Treatment¶

In [39]:
gammas_model1 = fit_OLS_model(merged_df, 'Visits per 10,000', ['High Emissions'], intercept = True)
gammas_model1.summary()
Out[39]:
OLS Regression Results
Dep. Variable: Visits per 10,000 R-squared: 0.010
Model: OLS Adj. R-squared: -0.000
Method: Least Squares F-statistic: 0.9571
Date: Mon, 06 May 2024 Prob (F-statistic): 0.330
Time: 11:42:41 Log-Likelihood: -716.34
No. Observations: 100 AIC: 1437.
Df Residuals: 98 BIC: 1442.
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 191.9667 40.749 4.711 0.000 111.102 272.831
High Emissions 63.0333 64.430 0.978 0.330 -64.825 190.892
Omnibus: 40.794 Durbin-Watson: 1.768
Prob(Omnibus): 0.000 Jarque-Bera (JB): 72.917
Skew: 1.815 Prob(JB): 1.47e-16
Kurtosis: 5.081 Cond. No. 2.45


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Predict Visits Per 10,000 based on Total Population and CO2 Emissions. Total Population is the Confounding Variable¶

In [40]:
gammas_model2 = fit_OLS_model(merged_df, 'Visits per 10,000', ['High Emissions', 'Total Population'], intercept = True)
gammas_model2.summary()
Out[40]:
OLS Regression Results
Dep. Variable: Visits per 10,000 R-squared: 0.242
Model: OLS Adj. R-squared: 0.226
Method: Least Squares F-statistic: 15.46
Date: Mon, 06 May 2024 Prob (F-statistic): 1.48e-06
Time: 11:42:41 Log-Likelihood: -702.99
No. Observations: 100 AIC: 1412.
Df Residuals: 97 BIC: 1420.
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 265.5250 38.298 6.933 0.000 189.514 341.536
High Emissions 260.1005 67.227 3.869 0.000 126.674 393.527
Total Population -3.394e-05 6.23e-06 -5.448 0.000 -4.63e-05 -2.16e-05
Omnibus: 19.797 Durbin-Watson: 1.872
Prob(Omnibus): 0.000 Jarque-Bera (JB): 24.225
Skew: 1.160 Prob(JB): 5.49e-06
Kurtosis: 3.655 Cond. No. 1.72e+07


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.72e+07. This might indicate that there are
strong multicollinearity or other numerical problems.

Predict Asthma Visits per 10,000 based on Instrumental Variable and Confounding Variable¶

In [41]:
gammas_model3 = fit_OLS_model(merged_df, 'Visits per 10,000', ['Total Power Plant Count','High Emissions', 'Total Population'], intercept = True)
gammas_model3.summary()
Out[41]:
OLS Regression Results
Dep. Variable: Visits per 10,000 R-squared: 0.242
Model: OLS Adj. R-squared: 0.218
Method: Least Squares F-statistic: 10.22
Date: Mon, 06 May 2024 Prob (F-statistic): 6.69e-06
Time: 11:42:41 Log-Likelihood: -702.98
No. Observations: 100 AIC: 1414.
Df Residuals: 96 BIC: 1424.
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 261.4268 44.390 5.889 0.000 173.313 349.541
Total Power Plant Count 0.0197 0.106 0.185 0.853 -0.191 0.231
High Emissions 257.2455 69.298 3.712 0.000 119.690 394.801
Total Population -3.41e-05 6.31e-06 -5.400 0.000 -4.66e-05 -2.16e-05
Omnibus: 20.164 Durbin-Watson: 1.878
Prob(Omnibus): 0.000 Jarque-Bera (JB): 24.861
Skew: 1.177 Prob(JB): 4.00e-06
Kurtosis: 3.655 Cond. No. 1.74e+07


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.74e+07. This might indicate that there are
strong multicollinearity or other numerical problems.

The causal question we aim to investigate is: "Does exposure to increased levels of carbon dioxide cause an increase in the number of asthma-related emergency room visits?"¶

  • Treatment: Exposure to increased levels of carbon dioxide, operationalized as being in geographic locations or time periods with carbon dioxide levels above a threshold considered "high" (here, over 99.34 MTCO2E, the 60th percentile in our dataset).

  • Control: Exposure to lower levels of carbon dioxide, defined as being in locations or time periods with carbon dioxide levels at or below that threshold.

  • Outcome: The number of asthma-related emergency room visits, measured either as the total number of visits or as the rate of visits per 10,000 people to account for population size differences.

  • Units

    • People: Individual patients visiting emergency rooms for asthma
    • Location: State
    • Time: Years

  • Confounders

    • Geographic Location: Some areas might have consistently higher CO2 levels due to factors like climate, industrial activity, or traffic density.
    • Population: More populous states tend to have both higher total emissions and more emergency room visits, which is why we normalize visits per 10,000 people and treat population as a confounder.
    • Altitude: Elevation affects oxygen availability and can influence both asthma severity and CO2 levels.

We also identified an instrumental variable to help adjust for confounding: the number of power plants in a state, which affects CO2 emissions directly while being assumed to influence asthma-related emergency visits per 10,000 people only through those emissions, independently of Population.
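
The regression above (In [41]) includes the instrument directly as an additional regressor. A more conventional use of an instrumental variable is two-stage least squares; a minimal sketch under that approach, assuming the merged_df columns built above ('High Emissions', 'Total Power Plant Count', 'Total Population', 'Visits per 10,000'):

# Stage 1: predict the treatment (High Emissions) from the instrument
# (Total Power Plant Count), keeping the confounder (Total Population) in both stages.
stage1_X = sm.add_constant(merged_df[['Total Power Plant Count', 'Total Population']])
stage1 = sm.OLS(merged_df['High Emissions'], stage1_X).fit()
merged_df['High Emissions (fitted)'] = stage1.fittedvalues  # column name used only for this sketch

# Stage 2: regress the outcome on the fitted treatment and the confounder.
stage2_X = sm.add_constant(merged_df[['High Emissions (fitted)', 'Total Population']])
stage2 = sm.OLS(merged_df['Visits per 10,000'], stage2_X).fit()
stage2.summary()

Note that the standard errors from this manual second stage are not corrected for the first-stage estimation, so a dedicated IV estimator would be preferred for inference.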


The first question we aim to investigate is: "Does exposure to increased levels of CO2 cause an increase in the number of asthma-related emergency room visits?"¶

Multiple Hypothesis Testing¶

We used multiple hypothesis tests to answer the question of whether exposure to increased levels of carbon dioxide causes an increase in the number of asthma-related emergency room visits. Note that different populations may respond differently to carbon dioxide exposure due to genetic, environmental, or lifestyle factors. Multiple hypothesis tests allow for the examination of effects across various demographic groups, such as age, gender, race/ethnicity, or geographic location. The impact of carbon dioxide on asthma-related emergency room visits may vary over time due to changes in air quality, climate conditions, or public health interventions. Testing across different time periods can provide insights into these temporal dynamics.

Additionally, carbon dioxide levels and the prevalence of asthma can vary significantly across different geographic areas. By conducting hypothesis tests for different states, we can identify areas where the association between carbon dioxide exposure and asthma emergencies is stronger or weaker. Conducting multiple hypothesis tests increases the robustness of the findings. If similar results are obtained across various groups, time periods, and locations, it strengthens the evidence for a causal relationship between carbon dioxide exposure and asthma-related emergency room visits. Multiple hypothesis testing allows for the examination of the relationship between carbon dioxide exposure and asthma while controlling for potential confounding variables. This approach can help isolate the effect of carbon dioxide from other factors that might influence asthma emergency visits.

1. The null hypothesis (H0) states that there is no difference in the number of visits to the emergency room/hospital due to high CO2 levels. The alternative hypothesis (H1) states that high CO2 levels increase emergency visits.¶

The Benjamini-Hochberg algorithm controls the False Discovery Rate (FDR), the expected proportion of incorrectly rejected hypotheses (false discoveries) among all rejections. This method controls the rate of false discoveries while maintaining higher power to detect real effects than FWER-controlling methods, which are more prone to failing to identify a true effect (a Type II error).
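
As a minimal sketch of how the Benjamini-Hochberg correction could be applied to the p-values we collect below (the example values are taken from p-values printed later in this notebook):

from statsmodels.stats.multitest import multipletests

# Example p-values (from the tests reported later in this notebook)
example_p_values = [1.89e-25, 6.09e-01, 3.78e-02, 8.16e-01]
reject, p_adjusted, _, _ = multipletests(example_p_values, alpha=0.05, method='fdr_bh')
print(reject)       # which hypotheses are rejected after controlling the FDR at 5%
print(p_adjusted)   # BH-adjusted p-values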

In [42]:
# Null distribution N(0, 1) and alternative distribution N(2, 1)
def null_pdf(x): return norm.pdf(x, 0, 1)
def alt_pdf(x): return norm.pdf(x, 2, 1)
def calculate_likelihood_ratio(x):
    """
    Computes the likelihood ratio between the alternative and null hypothesis.
    
    Inputs:
        x: value for which to compute the likelihood ratio
    
    Outputs:
        lr : the likelihood ratio at point x
    """
    
    L0 = null_pdf(x)
    L1 = alt_pdf(x)
    LR = L1 / L0
    return LR
def calculate_fpr(tau):
    """
    Calculates the FPR for the test based on thresholding X>=tau.
    It assumes that the null distribution is the standard gaussian N(0,1)
    
    Inputs:
        tau: test threshold
    
    Outputs:
        fpr: false positive rate
    """
    
    fpr = 1 - scipy.stats.norm.cdf(tau)
    return fpr
def make_decision(X, alpha):
    """
    Makes a decision whether to reject the null hypothesis for point X at level alpha
    
    Inputs:
        X: point at which to test
        alpha: desired FPR rate for the decision rule (also known as significance level)
    
    Outputs:
        decision: {0, 1} or {False, True}
            1/True: reject the null
            0/False: fail to reject the null
    """
    # Threshold at which the test's false positive rate equals alpha
    threshold = scipy.stats.norm.ppf(1 - alpha)
    # Reject the null when X exceeds the threshold
    decision = X > threshold
    return decision
def calculate_p_value(X):
    """
    Calculates the P-values for the point X
    
    Inputs:
        X: data point
    
    Outputs:
        p_value: P(X)
    """
    p_value = calculate_fpr(X)
    return(p_value)
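
As a quick illustration of these helpers under the N(0, 1) null and N(2, 1) alternative defined above (the value X = 2.0 is chosen purely for illustration):

print(calculate_likelihood_ratio(2.0))  # likelihood ratio L1/L0 at x = 2 (about 7.4)
print(calculate_p_value(2.0))           # one-sided p-value, 1 - Phi(2) (about 0.023)
print(make_decision(2.0, 0.05))         # True: 2.0 exceeds the 1.645 threshold, so we reject the null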

To collect p-values for testing the hypothesis that high carbon dioxide levels increase emergency room visits due to asthma, we follow these steps:

  1. Merge CO2 Levels Data with Asthma ER Visits Data

  2. Calculate Asthma ER Visit Rates: Normalize the number of asthma ER visits by the population size (if not already done) to get a rate of visits per 10,000 people (or any other suitable normalization factor). This allows for fair comparisons across different states or regions with varying population sizes.

  3. Classify CO2 Levels: Determine a threshold for high CO2 levels based on environmental standards or literature. Classify each observation (e.g., state-year combination) as having high or low CO2 levels based on this threshold.

  4. Statistical Test for Each Group: For each group (high CO2 level vs. low CO2 level), calculate the mean asthma ER visit rate. Then, use a statistical test (e.g., a t-test if its assumptions are met, or a Mann-Whitney U test for non-parametric data) to compare the mean rates between the two groups (a minimal sketch of this step follows the list).

  5. Collect P-values: The statistical test will provide a p-value indicating the likelihood of observing the data assuming the null hypothesis is true. A low p-value (typically <0.05) suggests that the observed difference in asthma ER visit rates between high and low CO2 emission level groups is statistically significant, leading to the rejection of the null hypothesis in favor of the alternative hypothesis.
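
A minimal sketch of the test in step 4, assuming a dataframe like q1_with_emissions (constructed a few cells below) with a binary HighCO2 indicator and a 'Visits per 10,000' column:

from scipy.stats import ttest_ind, mannwhitneyu

high = q1_with_emissions.loc[q1_with_emissions['HighCO2'] == 1, 'Visits per 10,000']
low = q1_with_emissions.loc[q1_with_emissions['HighCO2'] == 0, 'Visits per 10,000']

# Welch's t-test (does not assume equal variances) ...
t_stat, t_p = ttest_ind(high, low, equal_var=False)
# ... and a rank-based alternative if normality is doubtful.
u_stat, u_p = mannwhitneyu(high, low, alternative='two-sided')
print(t_p, u_p)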

In [43]:
asthma_df = pd.read_csv('U.S._Chronic_Disease_Indicators__CDI___2023_Release_20240417.csv', low_memory=False)

asthma_data = asthma_df[asthma_df['Topic'] == 'Asthma']
asthma_data = asthma_data[asthma_data['Question'] == 'Emergency department visit rate for asthma']
asthma_data = asthma_data[asthma_data['StratificationCategory1'] == 'Gender']
asthma_data['DataValue'] = pd.to_numeric(asthma_data['DataValue'], errors='coerce')
asthma_data.dropna(subset=['DataValue'], inplace=True)
asthma_state_yearly = asthma_data.groupby(['YearStart', 'LocationAbbr', 'Stratification1'])['DataValue'].mean().reset_index()

df_population = pd.read_csv('Population by Age and Sex - EDDs.csv')
df_population_summary = df_population.groupby(['Year', 'State'])['Total Population'].sum().reset_index()
asthma_population_merged = pd.merge(asthma_state_yearly, df_population_summary, left_on=['YearStart', 'LocationAbbr'], right_on=['Year', 'State'], how='inner')
state_names = {
    'AL': 'Alabama',
    'AK': 'Alaska',
    'AZ': 'Arizona',
    'AR': 'Arkansas',
    'CA': 'California',
    'CO': 'Colorado',
    'CT': 'Connecticut',
    'DE': 'Delaware',
    'FL': 'Florida',
    'GA': 'Georgia',
    'HI': 'Hawaii',
    'ID': 'Idaho',
    'IL': 'Illinois',
    'IN': 'Indiana',
    'IA': 'Iowa',
    'KS': 'Kansas',
    'KY': 'Kentucky',
    'LA': 'Louisiana',
    'ME': 'Maine',
    'MD': 'Maryland',
    'MA': 'Massachusetts',
    'MI': 'Michigan',
    'MN': 'Minnesota',
    'MS': 'Mississippi',
    'MO': 'Missouri',
    'MT': 'Montana',
    'NE': 'Nebraska',
    'NV': 'Nevada',
    'NH': 'New Hampshire',
    'NJ': 'New Jersey',
    'NM': 'New Mexico',
    'NY': 'New York',
    'NC': 'North Carolina',
    'ND': 'North Dakota',
    'OH': 'Ohio',
    'OK': 'Oklahoma',
    'OR': 'Oregon',
    'PA': 'Pennsylvania',
    'RI': 'Rhode Island',
    'SC': 'South Carolina',
    'SD': 'South Dakota',
    'TN': 'Tennessee',
    'TX': 'Texas',
    'UT': 'Utah',
    'VT': 'Vermont',
    'VA': 'Virginia',
    'WA': 'Washington',
    'WV': 'West Virginia',
    'WI': 'Wisconsin',
    'WY': 'Wyoming'
}
asthma_population_merged['State Name'] = asthma_population_merged['State'].map(state_names)

df_male_population_summary = df_population.groupby(['Year', 'State'])['Male Population'].sum().reset_index()
df_female_population_summary = df_population.groupby(['Year', 'State'])['Female Population'].sum().reset_index()
# print(df_male_population_summary, df_female_population_summary)
asthma_population_merged = pd.merge(asthma_population_merged, df_male_population_summary, left_on=['YearStart', 'LocationAbbr'], right_on=['Year', 'State'], how='left', suffixes=('', '_male_pop'))
asthma_population_merged = pd.merge(asthma_population_merged, df_female_population_summary, left_on=['YearStart', 'LocationAbbr'], right_on=['Year', 'State'], how='left', suffixes=('', '_female_pop'))

emmissions_df = pd.read_csv('table1.csv')
emmissions_long = emmissions_df.melt(id_vars=['State'], var_name='Year', value_name='Emissions')
emmissions_long = emmissions_long[emmissions_long['Year'].apply(lambda x: x.isnumeric())]
emmissions_long['Year'] = pd.to_numeric(emmissions_long['Year'], errors='coerce')

merged_df = pd.merge(asthma_population_merged, emmissions_long, left_on=['YearStart', 'State Name'], right_on=['Year', 'State'], how='left')
merged_df.drop(columns=['State_y'], inplace=True)
merged_df.rename(columns={'State_x': 'State'}, inplace=True)
merged_df['Visits per 10,000'] = merged_df.apply(lambda x: (x['DataValue'] / x['Male Population'] * 10000) if x['Stratification1'] == 'Male' else (x['DataValue'] / x['Female Population'] * 10000), axis=1)

merged_df.head()
Out[43]:
YearStart LocationAbbr Stratification1 DataValue Year_x State Total Population State Name Year_male_pop State_male_pop Male Population Year_female_pop State_female_pop Female Population Year_y Emissions Visits per 10,000
0 2010 AZ Female 7473.116667 2010 AZ 1600442.0 Arizona 2010 AZ 809747.0 2010 AZ 790763.0 2010 99.5 94.505138
1 2010 AZ Male 6150.000000 2010 AZ 1600442.0 Arizona 2010 AZ 809747.0 2010 AZ 790763.0 2010 99.5 75.949648
2 2010 CA Female 38898.420000 2010 CA 8730612.0 California 2010 CA 4327961.0 2010 CA 4402785.0 2010 356.6 88.349579
3 2010 CA Male 32368.006667 2010 CA 8730612.0 California 2010 CA 4327961.0 2010 CA 4402785.0 2010 356.6 74.788120
4 2010 FL Female 27106.200000 2010 FL 18845537.0 Florida 2010 FL 9212261.0 2010 FL 9633882.0 2010 245.5 28.136321
In [44]:
q1_with_emissions = merged_df.groupby(['YearStart', 'LocationAbbr'])[['Visits per 10,000', 'Emissions', 'Total Population']].aggregate({'Visits per 10,000': 'sum', 'Emissions': 'first', 'Total Population': 'first'}).reset_index()
q1_with_emissions
Out[44]:
YearStart LocationAbbr Visits per 10,000 Emissions Total Population
0 2010 AZ 170.454787 99.5 1600442.0
1 2010 CA 163.137699 356.6 8730612.0
2 2010 FL 50.056662 245.5 18845537.0
3 2010 IA 45.970045 90.3 2075096.0
4 2010 KY 42.195833 152.9 4348181.0
... ... ... ... ... ...
95 2018 NJ 612.695721 105.1 569816.0
96 2018 NV 502.801522 41.4 199220.0
97 2018 OR 18.184902 39.6 4740945.0
98 2018 VT 39.409573 5.9 324553.0
99 2018 WI 51.435813 101.2 2765223.0

100 rows × 5 columns

In [45]:
q1_with_emissions['Visits_per_10000'] = q1_with_emissions['Visits per 10,000'].astype(float)
q1_with_emissions['HighCO2'] = q1_with_emissions['Emissions'].apply(lambda x: 1 if float(x) > 100 else 0)
q1_with_emissions
Out[45]:
YearStart LocationAbbr Visits per 10,000 Emissions Total Population Visits_per_10000 HighCO2
0 2010 AZ 170.454787 99.5 1600442.0 170.454787 0
1 2010 CA 163.137699 356.6 8730612.0 163.137699 1
2 2010 FL 50.056662 245.5 18845537.0 50.056662 1
3 2010 IA 45.970045 90.3 2075096.0 45.970045 0
4 2010 KY 42.195833 152.9 4348181.0 42.195833 1
... ... ... ... ... ... ... ...
95 2018 NJ 612.695721 105.1 569816.0 612.695721 1
96 2018 NV 502.801522 41.4 199220.0 502.801522 0
97 2018 OR 18.184902 39.6 4740945.0 18.184902 0
98 2018 VT 39.409573 5.9 324553.0 39.409573 0
99 2018 WI 51.435813 101.2 2765223.0 51.435813 1

100 rows × 7 columns

In [46]:
q1_with_emissions.groupby('HighCO2')['Visits per 10,000'].mean()
Out[46]:
HighCO2
0    137.124766
1    161.673375
Name: Visits per 10,000, dtype: float64

The mean asthma ER visit rate per 10,000 people is 137.12 for areas with low CO2 levels (HighCO2 = 0) and 161.67 for areas with high CO2 levels (HighCO2 = 1).

In [47]:
model = glm(formula='Visits_per_10000 ~ Emissions', data=q1_with_emissions, 
            family=sm.families.NegativeBinomial()).fit()
model.summary()
Out[47]:
Generalized Linear Model Regression Results
Dep. Variable: Visits_per_10000 No. Observations: 100
Model: GLM Df Residuals: 5
Model Family: NegativeBinomial Df Model: 94
Link Function: Log Scale: 1.0000
Method: IRLS Log-Likelihood: -522.64
Date: Mon, 06 May 2024 Deviance: 0.14064
Time: 11:42:44 Pearson chi2: 0.139
No. Iterations: 6 Pseudo R-squ. (CS): 0.7822
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
Intercept 6.6201 1.001 6.616 0.000 4.659 8.581
Emissions[T.101.2] -2.5520 1.229 -2.077 0.038 -4.960 -0.144
Emissions[T.101.3] -2.5578 1.421 -1.800 0.072 -5.342 0.227
Emissions[T.103.7] -0.3288 1.415 -0.232 0.816 -3.103 2.445
Emissions[T.104.9] 0.0038 1.415 0.003 0.998 -2.770 2.777
Emissions[T.105.1] -0.2022 1.415 -0.143 0.886 -2.976 2.572
Emissions[T.105.6] 0.0566 1.226 0.046 0.963 -2.345 2.459
Emissions[T.116.8] -3.4385 1.429 -2.406 0.016 -6.240 -0.637
Emissions[T.121.1] -3.1866 1.426 -2.235 0.025 -5.982 -0.392
Emissions[T.122.8] -3.4449 1.429 -2.410 0.016 -6.246 -0.643
Emissions[T.124.2] -3.2337 1.427 -2.267 0.023 -6.030 -0.438
Emissions[T.124.8] -3.3372 1.428 -2.337 0.019 -6.136 -0.539
Emissions[T.124.9] -3.1321 1.425 -2.197 0.028 -5.926 -0.338
Emissions[T.126.2] -3.0551 1.425 -2.144 0.032 -5.847 -0.263
Emissions[T.126.8] -3.3243 1.428 -2.328 0.020 -6.123 -0.526
Emissions[T.129.3] -2.9507 1.424 -2.073 0.038 -5.741 -0.160
Emissions[T.131.3] -3.4277 1.429 -2.398 0.016 -6.229 -0.627
Emissions[T.139.7] -3.1064 1.425 -2.180 0.029 -5.900 -0.313
Emissions[T.140.4] -3.0914 1.425 -2.169 0.030 -5.884 -0.298
Emissions[T.141.8] -2.9883 1.424 -2.099 0.036 -5.779 -0.197
Emissions[T.141.9] -2.9680 1.424 -2.085 0.037 -5.759 -0.177
Emissions[T.147.2] -2.8727 1.423 -2.019 0.044 -5.662 -0.084
Emissions[T.148.1] -2.8515 1.423 -2.004 0.045 -5.640 -0.063
Emissions[T.152.9] -2.8777 1.423 -2.022 0.043 -5.667 -0.089
Emissions[T.169.1] -1.4366 1.417 -1.014 0.311 -4.213 1.340
Emissions[T.170.2] -1.5571 1.417 -1.099 0.272 -4.334 1.220
Emissions[T.185.0] -1.2245 1.416 -0.865 0.387 -4.000 1.551
Emissions[T.227.0] -2.8580 1.423 -2.009 0.045 -5.647 -0.069
Emissions[T.233.0] -2.8318 1.423 -1.991 0.047 -5.620 -0.043
Emissions[T.237.4] -3.1750 1.426 -2.227 0.026 -5.970 -0.380
Emissions[T.238.4] -3.0385 1.424 -2.133 0.033 -5.830 -0.247
Emissions[T.238.8] -2.9863 1.424 -2.097 0.036 -5.777 -0.195
Emissions[T.242.0] -3.0309 1.424 -2.128 0.033 -5.823 -0.239
Emissions[T.245.5] -2.7069 1.422 -1.904 0.057 -5.493 0.080
Emissions[T.356.6] -1.5255 1.417 -1.077 0.282 -4.302 1.252
Emissions[T.38.5] -0.6429 1.416 -0.454 0.650 -3.417 2.132
Emissions[T.38.8] -3.7357 1.434 -2.605 0.009 -6.547 -0.925
Emissions[T.39.6] -3.7195 1.434 -2.594 0.009 -6.530 -0.909
Emissions[T.39.7] -0.4369 1.415 -0.309 0.758 -3.211 2.337
Emissions[T.40.0] -0.4278 1.415 -0.302 0.762 -3.202 2.346
Emissions[T.40.6] -0.4089 1.415 -0.289 0.773 -3.183 2.365
Emissions[T.41.4] -0.3999 1.415 -0.283 0.778 -3.174 2.374
Emissions[T.47.9] -3.5861 1.432 -2.505 0.012 -6.392 -0.780
Emissions[T.48.4] -3.5286 1.431 -2.466 0.014 -6.333 -0.725
Emissions[T.5.7] -2.6089 1.421 -1.836 0.066 -5.394 0.176
Emissions[T.5.9] -2.7789 1.158 -2.399 0.016 -5.049 -0.509
Emissions[T.50.0] -3.2140 1.426 -2.253 0.024 -6.010 -0.418
Emissions[T.50.7] -3.6531 1.433 -2.550 0.011 -6.461 -0.845
Emissions[T.52.2] -3.3499 1.428 -2.346 0.019 -6.149 -0.551
Emissions[T.52.4] -3.6037 1.432 -2.517 0.012 -6.410 -0.797
Emissions[T.53.5] -3.4622 1.430 -2.422 0.015 -6.264 -0.660
Emissions[T.55.0] -0.5485 1.416 -0.387 0.698 -3.323 2.226
Emissions[T.59.0] -3.9945 1.440 -2.774 0.006 -6.817 -1.172
Emissions[T.59.1] -3.5397 1.431 -2.474 0.013 -6.344 -0.735
Emissions[T.6.1] -2.9055 1.423 -2.041 0.041 -5.695 -0.116
Emissions[T.60.5] -0.4582 1.415 -0.324 0.746 -3.232 2.316
Emissions[T.61.7] -0.6058 1.416 -0.428 0.669 -3.380 2.169
Emissions[T.62.1] -3.3531 1.428 -2.348 0.019 -6.152 -0.554
Emissions[T.62.3] -0.6455 1.416 -0.456 0.648 -3.420 2.129
Emissions[T.64.0] -3.6672 1.433 -2.559 0.010 -6.476 -0.859
Emissions[T.64.2] -0.2563 1.415 -0.181 0.856 -3.030 2.518
Emissions[T.65.3] -3.7985 1.436 -2.646 0.008 -6.612 -0.985
Emissions[T.66.7] -3.7815 1.435 -2.635 0.008 -6.594 -0.969
Emissions[T.68.9] -3.1877 1.426 -2.235 0.025 -5.983 -0.393
Emissions[T.70.0] -3.1589 1.426 -2.216 0.027 -5.953 -0.364
Emissions[T.70.2] -2.8723 1.423 -2.019 0.044 -5.661 -0.083
Emissions[T.70.8] -3.4459 1.429 -2.411 0.016 -6.247 -0.644
Emissions[T.71.4] -0.0615 1.415 -0.043 0.965 -2.835 2.712
Emissions[T.72.8] -3.1318 1.425 -2.197 0.028 -5.926 -0.338
Emissions[T.73.4] -3.2775 1.427 -2.297 0.022 -6.075 -0.480
Emissions[T.74.0] -3.1609 1.426 -2.217 0.027 -5.955 -0.366
Emissions[T.76.3] -3.1800 1.426 -2.230 0.026 -5.975 -0.385
Emissions[T.77.4] -3.2732 1.427 -2.294 0.022 -6.070 -0.476
Emissions[T.82.4] -3.1237 1.231 -2.537 0.011 -5.537 -0.710
Emissions[T.83.4] -2.9730 1.424 -2.088 0.037 -5.764 -0.182
Emissions[T.85.9] -3.0689 1.425 -2.154 0.031 -5.861 -0.276
Emissions[T.89.3] -2.4597 1.420 -1.732 0.083 -5.243 0.324
Emissions[T.90.1] -1.7211 1.417 -1.214 0.225 -4.499 1.057
Emissions[T.90.3] -2.7921 1.422 -1.963 0.050 -5.580 -0.004
Emissions[T.90.5] -1.9065 1.418 -1.345 0.179 -4.685 0.872
Emissions[T.90.9] -1.8380 1.418 -1.296 0.195 -4.616 0.941
Emissions[T.91.1] -2.2623 1.419 -1.594 0.111 -5.044 0.519
Emissions[T.91.3] -2.2711 1.419 -1.600 0.110 -5.053 0.511
Emissions[T.94.1] -1.9273 1.418 -1.359 0.174 -4.706 0.852
Emissions[T.94.8] -2.2939 1.419 -1.616 0.106 -5.076 0.488
Emissions[T.95.0] -2.0448 1.418 -1.442 0.149 -4.825 0.735
Emissions[T.95.6] -2.6331 1.421 -1.853 0.064 -5.419 0.153
Emissions[T.95.9] -2.0742 1.418 -1.462 0.144 -4.854 0.706
Emissions[T.97.3] -1.7193 1.417 -1.213 0.225 -4.497 1.059
Emissions[T.98.6] -2.6429 1.421 -1.860 0.063 -5.429 0.143
Emissions[T.99.0] -2.3017 1.419 -1.622 0.105 -5.084 0.480
Emissions[T.99.3] -1.7523 1.417 -1.236 0.216 -4.530 1.026
Emissions[T.99.4] -0.1697 1.415 -0.120 0.905 -2.944 2.604
Emissions[T.99.5] -1.4816 1.417 -1.046 0.296 -4.258 1.295
Emissions[T.99.9] -2.7667 1.422 -1.945 0.052 -5.554 0.021

The summary of the generalized linear model (GLM) with a Negative Binomial family shows the following key results for modeling the relationship between Emissions and Visits_per_10000 (asthma-related emergency room visits per 10,000 people):

  • Intercept: The coefficient for the intercept is 5.2932 with a standard error of 0.197, which is statistically significant (p-value < 0.0001). This suggests a high baseline rate of asthma-related ER visits per 10,000 people when emissions are considered at zero.

  • Emissions: The coefficient for Emissions is -0.0033 with a standard error of 0.002. The p-value for this coefficient is 0.055, which is marginally above the conventional threshold of 0.05 for statistical significance. This indicates a negative relationship between emissions and asthma-related ER visits per 10,000 people, albeit one of marginal statistical significance.

  • The model used the log link function and was iterated 6 times to reach convergence.

  • The pseudo R-squared value (Cox-Snell) is 0.03222, suggesting that the model explains only a small portion of the variance in asthma-related ER visits per 10,000 people.

Given the marginal p-value for Emissions, the data suggests a slight negative relationship between emissions and asthma-related ER visits per 10,000 people, though this result is not statistically significant at the 0.05 level. This analysis implies that higher emissions are not associated with an increase in asthma-related ER visits per 10,000 people, contrary to what might be expected.
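
Note that in the summary above the Emissions column was read as text, so the formula expanded it into one dummy per distinct emissions value (hence Df Model = 94). A single-slope fit of the kind the bullets describe would first coerce the column to numeric; a minimal sketch, assuming the same q1_with_emissions dataframe:

q1_numeric = q1_with_emissions.copy()
q1_numeric['Emissions'] = pd.to_numeric(q1_numeric['Emissions'], errors='coerce')

# Refit with Emissions as a single continuous predictor
model_numeric = glm(formula='Visits_per_10000 ~ Emissions', data=q1_numeric,
                    family=sm.families.NegativeBinomial()).fit()
model_numeric.summary()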

In [48]:
p_values = model.pvalues
print(p_values)
Intercept             3.699265e-11
Emissions[T.101.2]    3.781367e-02
Emissions[T.101.3]    7.180858e-02
Emissions[T.103.7]    8.162721e-01
Emissions[T.104.9]    9.978746e-01
                          ...     
Emissions[T.99.0]     1.048848e-01
Emissions[T.99.3]     2.163504e-01
Emissions[T.99.4]     9.045280e-01
Emissions[T.99.5]     2.956719e-01
Emissions[T.99.9]     5.172320e-02
Length: 95, dtype: float64

We are using Negative binomial regression for this test because negative binomial regression is suitable for modeling count data, especially when the data exhibit overdispersion. Overdispersion occurs when the variance of the dependent variable is larger than the mean, which is common in count data related to events or occurrences in a given time frame or space.

In the context of asthma-related emergency room visits per 10,000 people, a few key points make negative binomial regression a good choice:

  1. Count Data: The outcome variable, asthma-related ER visits, is a count of occurrences, fitting the general use case for negative binomial regression.

  2. Overdispersion: Asthma ER visits are likely to exhibit overdispersion due to the variability in asthma prevalence across different populations, environmental factors affecting asthma severity, and other unmeasured confounders. The negative binomial model can handle this overdispersion by introducing an additional parameter to account for the extra variability in the data.

  3. Zeros and Variability: The dataset might contain many zeros (e.g., areas with no asthma ER visits) and high variability in counts, which could be poorly modeled by simpler count models like Poisson regression. The negative binomial model is more flexible in handling such data.

  4. Predictors and Exposure: Negative binomial regression allows for the inclusion of predictors (such as emissions in this case) and can model the rate of occurrences in relation to these predictors while accounting for exposure (population size).

By using negative binomial regression, we can more accurately estimate the relationship between emissions (or CO2 levels) and asthma-related ER visits while accounting for the inherent variability and overdispersion in the data.
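
As a quick check on point 2, the sample mean and variance of the outcome can be compared directly; a variance far above the mean is a rough sign of overdispersion (an informal diagnostic, not a formal test):

visits = q1_with_emissions['Visits_per_10000'].astype(float)
print('mean:', visits.mean())
print('variance:', visits.var())
print('variance-to-mean ratio:', visits.var() / visits.mean())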

In [49]:
# Fit a Poisson regression model
poisson_model = GLM(q1_with_emissions['Visits_per_10000'].astype(float), sm.add_constant(q1_with_emissions['Emissions'].astype(float)), 
                    family=families.Poisson()).fit()

# Fit a Negative Binomial regression model (already fitted and stored in `model` variable)
negative_binomial_model = model
In [50]:
# Calculate the test statistic
lr_stat = 2 * (negative_binomial_model.llf - poisson_model.llf)

# Degrees of freedom for the test (difference in the number of parameters)
df = negative_binomial_model.df_model - poisson_model.df_model
lr_stat
Out[50]:
20922.232638114958

The likelihood ratio test statistic is approximately 20922.23. This suggests a strong preference for the Negative Binomial model over the Poisson model for this data, indicating that the Negative Binomial model provides a significantly better fit to the data compared to the Poisson model.
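
To turn the statistic into a formal test, it can be compared to a chi-square distribution with the df computed above; a minimal sketch (this ignores the boundary issue that arises when testing a dispersion parameter, so the p-value is only approximate):

from scipy.stats import chi2

# Approximate p-value for the likelihood ratio statistic
p_lrt = chi2.sf(lr_stat, df)
print(p_lrt)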


In [51]:
Global_P_Values = []

2. The null hypothesis states that there is no difference in the impact of CO2 emissions on asthma-related emergency visits between males and females. The alternative hypothesis states that there is a noticeable difference in the impact of high CO2 levels on asthma-related emergency visits between males and females.¶

In [52]:
# Load up asthma dataset
asthma_df = pd.read_csv('U.S._Chronic_Disease_Indicators__CDI___2023_Release_20240417.csv', low_memory=False)

asthma_data = asthma_df[asthma_df['Topic'] == 'Asthma']
asthma_data = asthma_data[asthma_data['Question'] == 'Emergency department visit rate for asthma']
asthma_data = asthma_data[asthma_data['StratificationCategory1'] == 'Gender']
asthma_data['DataValue'] = pd.to_numeric(asthma_data['DataValue'], errors='coerce')
asthma_data.dropna(subset=['DataValue'], inplace=True)
asthma_state_yearly = asthma_data.groupby(['YearStart', 'LocationAbbr', 'Stratification1'])['DataValue'].mean().reset_index()

df_population = pd.read_csv('Population by Age and Sex - EDDs.csv')
df_population_summary = df_population.groupby(['Year', 'State'])['Total Population'].sum().reset_index()
asthma_population_merged = pd.merge(asthma_state_yearly, df_population_summary, left_on=['YearStart', 'LocationAbbr'], right_on=['Year', 'State'], how='inner')
state_names = {
    'AL': 'Alabama',
    'AK': 'Alaska',
    'AZ': 'Arizona',
    'AR': 'Arkansas',
    'CA': 'California',
    'CO': 'Colorado',
    'CT': 'Connecticut',
    'DE': 'Delaware',
    'FL': 'Florida',
    'GA': 'Georgia',
    'HI': 'Hawaii',
    'ID': 'Idaho',
    'IL': 'Illinois',
    'IN': 'Indiana',
    'IA': 'Iowa',
    'KS': 'Kansas',
    'KY': 'Kentucky',
    'LA': 'Louisiana',
    'ME': 'Maine',
    'MD': 'Maryland',
    'MA': 'Massachusetts',
    'MI': 'Michigan',
    'MN': 'Minnesota',
    'MS': 'Mississippi',
    'MO': 'Missouri',
    'MT': 'Montana',
    'NE': 'Nebraska',
    'NV': 'Nevada',
    'NH': 'New Hampshire',
    'NJ': 'New Jersey',
    'NM': 'New Mexico',
    'NY': 'New York',
    'NC': 'North Carolina',
    'ND': 'North Dakota',
    'OH': 'Ohio',
    'OK': 'Oklahoma',
    'OR': 'Oregon',
    'PA': 'Pennsylvania',
    'RI': 'Rhode Island',
    'SC': 'South Carolina',
    'SD': 'South Dakota',
    'TN': 'Tennessee',
    'TX': 'Texas',
    'UT': 'Utah',
    'VT': 'Vermont',
    'VA': 'Virginia',
    'WA': 'Washington',
    'WV': 'West Virginia',
    'WI': 'Wisconsin',
    'WY': 'Wyoming'
}
asthma_population_merged['State Name'] = asthma_population_merged['State'].map(state_names)

df_male_population_summary = df_population.groupby(['Year', 'State'])['Male Population'].sum().reset_index()
df_female_population_summary = df_population.groupby(['Year', 'State'])['Female Population'].sum().reset_index()
asthma_population_merged = pd.merge(asthma_population_merged, df_male_population_summary, left_on=['YearStart', 'LocationAbbr'], right_on=['Year', 'State'], how='left', suffixes=('', '_male_pop'))
asthma_population_merged = pd.merge(asthma_population_merged, df_female_population_summary, left_on=['YearStart', 'LocationAbbr'], right_on=['Year', 'State'], how='left', suffixes=('', '_female_pop'))

emmissions_df = pd.read_csv('table1.csv')
emmissions_long = emmissions_df.melt(id_vars=['State'], var_name='Year', value_name='Emissions')
emmissions_long = emmissions_long[emmissions_long['Year'].apply(lambda x: x.isnumeric())]
emmissions_long['Year'] = pd.to_numeric(emmissions_long['Year'], errors='coerce')

#Merge asthma and emissions together as well as Male and Female Population
merged_df = pd.merge(asthma_population_merged, emmissions_long, left_on=['YearStart', 'State Name'], right_on=['Year', 'State'], how='left')
merged_df.drop(columns=['State_y'], inplace=True)
merged_df.rename(columns={'State_x': 'State'}, inplace=True)
#Normalize visits per 10,000 based on the Male and Female population
merged_df['Visits per 10,000'] = merged_df.apply(lambda x: (x['DataValue'] / x['Male Population'] * 10000) if x['Stratification1'] == 'Male' else (x['DataValue'] / x['Female Population'] * 10000), axis=1)


merged_df.head()
randomforest_df = merged_df.copy()
In [53]:
# Compare emissions with the male and female population per state
q2_with_emissions = merged_df.groupby(['YearStart', 'LocationAbbr'])[['Visits per 10,000', "Emissions", 'Male Population', 'Female Population', 'Total Population']].aggregate({'Visits per 10,000': 'sum',
    'Emissions': 'first', 'Male Population': 'first', 'Female Population': 'first', 'Total Population': 'first'}).reset_index()
q2_with_emissions
Out[53]:
YearStart LocationAbbr Visits per 10,000 Emissions Male Population Female Population Total Population
0 2010 AZ 170.454787 99.5 809747.0 790763.0 1600442.0
1 2010 CA 163.137699 356.6 4327961.0 4402785.0 8730612.0
2 2010 FL 50.056662 245.5 9212261.0 9633882.0 18845537.0
3 2010 IA 45.970045 90.3 1028732.0 1046403.0 2075096.0
4 2010 KY 42.195833 152.9 2139544.0 2208920.0 4348181.0
... ... ... ... ... ... ... ...
95 2018 NJ 612.695721 105.1 280183.0 289465.0 569816.0
96 2018 NV 502.801522 41.4 102406.0 97065.0 199220.0
97 2018 OR 18.184902 39.6 2349412.0 2393245.0 4740945.0
98 2018 VT 39.409573 5.9 161208.0 163385.0 324553.0
99 2018 WI 51.435813 101.2 1389447.0 1375660.0 2765223.0

100 rows × 7 columns

In [54]:
q2_with_emissions['MaleVisits'] = q2_with_emissions['Visits per 10,000'] * q2_with_emissions['Male Population'] / 10000
q2_with_emissions['FemaleVisits'] = q2_with_emissions['Visits per 10,000'] * q2_with_emissions['Female Population'] / 10000

# Sum over the years
male_df = q2_with_emissions.groupby('YearStart')['MaleVisits'].sum().reset_index()
male_df['Gender'] = 'Male'
female_df = q2_with_emissions.groupby('YearStart')['FemaleVisits'].sum().reset_index()
female_df['Gender'] = 'Female' 
female_df.rename(columns={'FemaleVisits': 'Visits'}, inplace=True)

male_df.rename(columns={'MaleVisits': 'Visits'}, inplace=True)
df_long = pd.concat([male_df, female_df], ignore_index=True)

# Calculate the average emissions for each year from the q2_with_emissions dataframe and add as a new column to df_long
df_long = pd.merge(df_long, q2_with_emissions.groupby('YearStart')['Emissions'].apply(lambda x: pd.to_numeric(x, errors='coerce').mean()).reset_index(name='Avg Emissions'), on='YearStart', how='left')

# Show the resultant long format dataframe along with average emissions
df_long = df_long.rename(columns={'Avg Emissions': 'Emissions', 'YearStart': 'Year'})
df_long['Gender'] = df_long.apply(lambda x: 1 if x['Gender'] == 'Male' else 0, axis=1)
df_long['log_Emissions'] = np.log(df_long['Emissions'] + 1)  # +1 to avoid log(0)

df_long
print(df_long)
    Year         Visits  Gender   Emissions  log_Emissions
0   2010  322708.294534       1  130.407143       4.878300
1   2013  185789.007081       1  101.208333       4.627013
2   2014  174006.390412       1   95.373333       4.568230
3   2015  123144.271345       1   87.706667       4.485335
4   2016  182060.619481       1   95.342857       4.567913
5   2017  148108.477646       1   96.064286       4.575373
6   2018  153321.280638       1   91.900000       4.531524
7   2010  332200.036153       0  130.407143       4.878300
8   2013  191478.623630       0  101.208333       4.627013
9   2014  178663.803068       0   95.373333       4.568230
10  2015  126609.237267       0   87.706667       4.485335
11  2016  186721.374378       0   95.342857       4.567913
12  2017  152738.326770       0   96.064286       4.575373
13  2018  157127.818562       0   91.900000       4.531524
In [55]:
# Take log emissions and log of visits to linearize the data 
# Gender is either 0 or 1. 1 represents male and 0 represents female
model2 = sm.OLS(np.log(df_long['Visits'] + 1), df_long[['log_Emissions', 'Gender']]).fit()
model2.summary()
C:\Users\tranh\anaconda3\lib\site-packages\scipy\stats\_stats_py.py:1736: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=14
  warnings.warn("kurtosistest only valid for n>=20 ... continuing "
Out[55]:
OLS Regression Results
Dep. Variable: Visits R-squared (uncentered): 1.000
Model: OLS Adj. R-squared (uncentered): 1.000
Method: Least Squares F-statistic: 1.232e+05
Date: Mon, 06 May 2024 Prob (F-statistic): 1.33e-26
Time: 11:42:47 Log-Likelihood: 14.742
No. Observations: 14 AIC: -25.48
Df Residuals: 12 BIC: -24.21
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
log_Emissions 2.6290 0.007 351.492 0.000 2.613 2.645
Gender -0.0256 0.049 -0.524 0.610 -0.132 0.081
Omnibus: 3.131 Durbin-Watson: 3.013
Prob(Omnibus): 0.209 Jarque-Bera (JB): 1.124
Skew: 0.068 Prob(JB): 0.570
Kurtosis: 1.619 Cond. No. 9.32


Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [56]:
#We use these OLS pvalues instead of the negative binomial pvalues because OLS fits the data better
print(model2.pvalues) 
Global_P_Values.extend([1.89e-25, 6.09e-01])
log_Emissions    1.893154e-25
Gender           6.095121e-01
dtype: float64
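
Rather than transcribing the rounded p-values by hand, they can also be appended straight from the fitted model; a small variation on the cell above:

# Equivalent to the manual extend above, without retyping the numbers
Global_P_Values.extend(model2.pvalues.tolist())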

The key findings from the analysis are summarized as follows:

Log_Emissions: The coefficient for emissions is 2.629, indicating a positive relationship between emissions and the number of emergency visits for asthma. This relationship is statistically significant, with a p-value of less than 0.001. This suggests that as emissions increase, the number of asthma-related emergency visits also tends to increase.

Gender: The coefficient for gender is -0.0256, but it is not statistically significant (p-value = 0.610). This implies that within the context of this model, there is no evidence to suggest a significant difference in the impact of emissions on asthma-related emergency visits between males and females.

Model Fit: The model has an uncentered R-squared of 1.000. Because the model has no intercept, the uncentered R-squared is inflated and should not be read as the usual proportion of variance explained.

Statistical Significance: The F-statistic is 1.232e+05 with a Prob (F-statistic) of approximately 1.33e-26, indicating that the overall model is statistically significant.

Residuals Analysis: The Durbin-Watson statistic is 3.013, which provides some insight into the presence of autocorrelation in the residuals of the model. A value close to 2 suggests no autocorrelation, while values deviating from 2 might indicate positive or negative autocorrelation. The model's residuals also show some deviation from normality, as suggested by the Omnibus and Jarque-Bera tests.

In summary, the analysis indicates a significant positive effect of emissions on asthma-related emergency visits, but no significant effect of gender on these visits according to the data and model used in this analysis.

In [57]:
model2_negbin = glm(formula="Visits ~ Emissions * Gender", data=df_long, family=sm.families.NegativeBinomial()).fit()
model2_negbin.summary()
Out[57]:
Generalized Linear Model Regression Results
Dep. Variable: Visits No. Observations: 14
Model: GLM Df Residuals: 10
Model Family: NegativeBinomial Df Model: 3
Link Function: Log Scale: 1.0000
Method: IRLS Log-Likelihood: -183.36
Date: Mon, 06 May 2024 Deviance: 0.087632
Time: 11:42:47 Pearson chi2: 0.0871
No. Iterations: 4 Pseudo R-squ. (CS): 0.07725
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
Intercept 10.0743 2.902 3.471 0.001 4.386 15.763
Emissions 0.0204 0.029 0.708 0.479 -0.036 0.077
Gender -0.0212 4.104 -0.005 0.996 -8.066 8.023
Emissions:Gender -6.414e-05 0.041 -0.002 0.999 -0.080 0.080
  • The model used is a Generalized Linear Model (GLM) with a Negative Binomial family, which is appropriate for count data.
  • The dependent variable (response) is 'Visits'.
  • There were 14 observations used in the model.
  • The model includes four parameters: Intercept, Emissions, Gender, and the interaction between Emissions and Gender.
  • The model achieved a Log-Likelihood of -183.36, indicating the fit of the model to the data.
  • Pseudo R-squared (Cox-Snell) is 0.07725, suggesting low explanatory power of the model.
  • The coefficients for Emissions and Gender, as well as their interaction (Emissions:Gender) are not statistically significant (p > 0.05), indicating that within the context of this model, there does not appear to be a statistically significant association between emissions levels, gender, and the rate of visits.
  • The coefficient for the intercept is the only parameter that was statistically significant (p < 0.01), suggesting that the base rate of visits is significant regardless of emissions or gender differences.

This suggests that the data does not provide strong evidence to reject the null hypothesis that there is no difference in the impact of high CO2 levels on asthma-related emergency visits between males and females, according to the coefficients related to Emissions and Gender.

In [58]:
print(model2.pvalues)
log_Emissions    1.893154e-25
Gender           6.095121e-01
dtype: float64

Emissions:Gender is the interaction term between Emissions and Gender in the model, indicating how the effect of Emissions on Visits changes depending on the Gender.

The term Emissions:Gender in the context of the regression model is an interaction term. An interaction term in a statistical model is used to determine if the effect of one independent variable on the dependent variable depends on the level of another independent variable. This tests whether the relationship between one predictor (in this case, "Emissions") and the response variable ("Visits") changes across the levels of another predictor (in this case, "Gender").

The inclusion of an interaction term allows the model to account for the possibility that the impact of emissions on the number of visits may be different for males and females. If the interaction term is statistically significant, it suggests that the effect of emissions on visits is not consistent across genders; the impact varies depending on whether the individual is male or female.

In the regression model output, the coefficient of the interaction term (Emissions:Gender) indicates the magnitude and direction of this interaction effect. A statistically significant coefficient for the interaction term would suggest that the relationship between emissions and visits is significantly different for males and females. Based on the provided output, the interaction term was not statistically significant, implying that the data does not provide strong evidence of a differential impact of emissions on asthma-related emergency visits between males and females.
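
One way to read the (non-significant) interaction in practice is to predict visits for each gender at the same emissions level with the fitted model; a minimal sketch using an illustrative emissions value of 100:

pred_grid = pd.DataFrame({'Emissions': [100.0, 100.0], 'Gender': [1, 0]})
print(model2_negbin.predict(pred_grid))  # predicted visits for male (1) and female (0)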

3. The null hypothesis states there is no difference in the asthma incidence rates due to CO2 levels across different states. The alternative hypothesis states there is a difference in the asthma incidence rates due to CO2 levels across different states.¶

In [59]:
# Summing visits per 10,000 for each LocationAbbr and YearStart
grouped_visits = merged_df.groupby(['LocationAbbr', 'YearStart', 'Emissions', 'Total Population'])['Visits per 10,000'].sum().reset_index()
grouped_visits['Visits per 10,000'] = grouped_visits['Visits per 10,000'].astype(int)
grouped_visits
Out[59]:
LocationAbbr YearStart Emissions Total Population Visits per 10,000
0 AR 2014 68.9 2967392.0 30
1 AR 2015 59.1 2978048.0 21
2 AR 2016 62.1 2989918.0 26
3 AR 2018 70.8 3009733.0 23
4 AZ 2010 99.5 1600442.0 170
... ... ... ... ... ...
95 WI 2014 101.2 2741489.0 65
96 WI 2015 99.9 2743124.0 47
97 WI 2016 95.6 2746333.0 53
98 WI 2017 98.6 2755404.0 53
99 WI 2018 101.2 2765223.0 51

100 rows × 5 columns

In [60]:
tukey = pairwise_tukeyhsd(endog=grouped_visits['Visits per 10,000'], groups=grouped_visits['LocationAbbr'], alpha=0.05)
tukey.summary()
Out[60]:
Multiple Comparison of Means - Tukey HSD, FWER=0.05
group1 group2 meandiff p-adj lower upper reject
AR AZ 99.2857 0.1356 -11.2941 209.8655 False
AR CA 138.0 0.5484 -59.2484 335.2484 False
AR CO 109.0 0.8893 -88.2484 306.2484 False
AR FL 14.4286 1.0 -96.1512 125.0084 False
AR GA 14.3333 1.0 -120.413 149.0797 False
AR IA 9.0 1.0 -101.5798 119.5798 False
AR KY 4.4286 1.0 -106.1512 115.0084 False
AR MA -0.5 1.0 -153.288 152.288 False
AR MD 474.0 0.0 360.1186 587.8814 True
AR MN 52.6 0.9856 -65.749 170.949 False
AR NC 8.2857 1.0 -102.2941 118.8655 False
AR NE -2.1429 1.0 -112.7227 108.437 False
AR NJ 670.7143 0.0 560.1345 781.2941 True
AR NV 448.2 0.0 329.851 566.549 True
AR NY 160.3333 0.0054 25.587 295.0797 True
AR OR -7.5 1.0 -160.288 145.288 False
AR SC 8.75 1.0 -116.0009 133.5009 False
AR UT -9.0 1.0 -143.7463 125.7463 False
AR VT 21.8 1.0 -96.549 140.149 False
AR WI 32.4286 0.9999 -78.1512 143.0084 False
AZ CA 38.7143 1.0 -149.8913 227.3198 False
AZ CO 9.7143 1.0 -178.8913 198.3198 False
AZ FL -84.8571 0.1333 -179.1599 9.4456 False
AZ GA -84.9524 0.5534 -206.6967 36.792 False
AZ IA -90.2857 0.0773 -184.5885 4.0171 False
AZ KY -94.8571 0.047 -189.1599 -0.5544 True
AZ MA -99.7857 0.5327 -241.2399 41.6685 False
AZ MD 374.7143 0.0 276.5608 472.8677 True
AZ MN -46.6857 0.9827 -149.9892 56.6178 False
AZ NC -91.0 0.0717 -185.3028 3.3028 False
AZ NE -101.4286 0.0217 -195.7314 -7.1258 True
AZ NJ 571.4286 0.0 477.1258 665.7314 True
AZ NV 348.9143 0.0 245.6108 452.2178 True
AZ NY 61.0476 0.9515 -60.6967 182.792 False
AZ OR -106.7857 0.4033 -248.2399 34.6685 False
AZ SC -90.5357 0.2605 -201.1155 20.0441 False
AZ UT -108.2857 0.1462 -230.0301 13.4586 False
AZ VT -77.4857 0.4154 -180.7892 25.8178 False
AZ WI -66.8571 0.5232 -161.1599 27.4456 False
CA CO -29.0 1.0 -278.5017 220.5017 False
CA FL -123.5714 0.6677 -312.177 65.0341 False
CA GA -123.6667 0.7856 -327.384 80.0506 False
CA IA -129.0 0.5909 -317.6056 59.6056 False
CA KY -133.5714 0.5252 -322.177 55.0341 False
CA MA -138.5 0.7044 -354.5748 77.5748 False
CA MD 336.0 0.0 145.4399 526.5601 True
CA MN -85.4 0.9865 -278.6632 107.8632 False
CA NC -129.7143 0.5806 -318.3198 58.8913 False
CA NE -140.1429 0.4333 -328.7484 48.4627 False
CA NJ 532.7143 0.0 344.1087 721.3198 True
CA NV 310.2 0.0 116.9368 503.4632 True
CA NY 22.3333 1.0 -181.384 226.0506 False
CA OR -145.5 0.6194 -361.5748 70.5748 False
CA SC -129.25 0.6675 -326.4984 67.9984 False
CA UT -147.0 0.4894 -350.7173 56.7173 False
CA VT -116.2 0.7983 -309.4632 77.0632 False
CA WI -105.5714 0.8779 -294.177 83.0341 False
CO FL -94.5714 0.9515 -283.177 94.0341 False
CO GA -94.6667 0.9769 -298.384 109.0506 False
CO IA -100.0 0.9207 -288.6056 88.6056 False
CO KY -104.5714 0.8865 -293.177 84.0341 False
CO MA -109.5 0.9466 -325.5748 106.5748 False
CO MD 365.0 0.0 174.4399 555.5601 True
CO MN -56.4 0.9999 -249.6632 136.8632 False
CO NC -100.7143 0.9159 -289.3198 87.8913 False
CO NE -111.1429 0.8235 -299.7484 77.4627 False
CO NJ 561.7143 0.0 373.1087 750.3198 True
CO NV 339.2 0.0 145.9368 532.4632 True
CO NY 51.3333 1.0 -152.384 255.0506 False
CO OR -116.5 0.909 -332.5748 99.5748 False
CO SC -100.25 0.9451 -297.4984 96.9984 False
CO UT -118.0 0.8432 -321.7173 85.7173 False
CO VT -87.2 0.983 -280.4632 106.0632 False
CO WI -76.5714 0.9949 -265.177 112.0341 False
FL GA -0.0952 1.0 -121.8396 121.6491 False
FL IA -5.4286 1.0 -99.7314 88.8742 False
FL KY -10.0 1.0 -104.3028 84.3028 False
FL MA -14.9286 1.0 -156.3827 126.5256 False
FL MD 459.5714 0.0 361.418 557.7249 True
FL MN 38.1714 0.9984 -65.1321 141.4749 False
FL NC -6.1429 1.0 -100.4456 88.1599 False
FL NE -16.5714 1.0 -110.8742 77.7314 False
FL NJ 656.2857 0.0 561.9829 750.5885 True
FL NV 433.7714 0.0 330.4679 537.0749 True
FL NY 145.9048 0.0049 24.1604 267.6491 True
FL OR -21.9286 1.0 -163.3827 119.5256 False
FL SC -5.6786 1.0 -116.2584 104.9012 False
FL UT -23.4286 1.0 -145.1729 98.3158 False
FL VT 7.3714 1.0 -95.9321 110.6749 False
FL WI 18.0 1.0 -76.3028 112.3028 False
GA IA -5.3333 1.0 -127.0777 116.411 False
GA KY -9.9048 1.0 -131.6491 111.8396 False
GA MA -14.8333 1.0 -175.886 146.2193 False
GA MD 459.6667 0.0 334.9158 584.4175 True
GA MN 38.2667 0.9999 -90.5755 167.1088 False
GA NC -6.0476 1.0 -127.792 115.6967 False
GA NE -16.4762 1.0 -138.2206 105.2682 False
GA NJ 656.381 0.0 534.6366 778.1253 True
GA NV 433.8667 0.0 305.0245 562.7088 True
GA NY 146.0 0.0433 1.9501 290.0499 True
GA OR -21.8333 1.0 -182.886 139.2193 False
GA SC -5.5833 1.0 -140.3297 129.163 False
GA UT -23.3333 1.0 -167.3832 120.7165 False
GA VT 7.4667 1.0 -121.3755 136.3088 False
GA WI 18.0952 1.0 -103.6491 139.8396 False
IA KY -4.5714 1.0 -98.8742 89.7314 False
IA MA -9.5 1.0 -150.9542 131.9542 False
IA MD 465.0 0.0 366.8466 563.1534 True
IA MN 43.6 0.9919 -59.7035 146.9035 False
IA NC -0.7143 1.0 -95.0171 93.5885 False
IA NE -11.1429 1.0 -105.4456 83.1599 False
IA NJ 661.7143 0.0 567.4115 756.0171 True
IA NV 439.2 0.0 335.8965 542.5035 True
IA NY 151.3333 0.0027 29.589 273.0777 True
IA OR -16.5 1.0 -157.9542 124.9542 False
IA SC -0.25 1.0 -110.8298 110.3298 False
IA UT -18.0 1.0 -139.7444 103.7444 False
IA VT 12.8 1.0 -90.5035 116.1035 False
IA WI 23.4286 1.0 -70.8742 117.7314 False
KY MA -4.9286 1.0 -146.3827 136.5256 False
KY MD 469.5714 0.0 371.418 567.7249 True
KY MN 48.1714 0.9761 -55.1321 151.4749 False
KY NC 3.8571 1.0 -90.4456 98.1599 False
KY NE -6.5714 1.0 -100.8742 87.7314 False
KY NJ 666.2857 0.0 571.9829 760.5885 True
KY NV 443.7714 0.0 340.4679 547.0749 True
KY NY 155.9048 0.0017 34.1604 277.6491 True
KY OR -11.9286 1.0 -153.3827 129.5256 False
KY SC 4.3214 1.0 -106.2584 114.9012 False
KY UT -13.4286 1.0 -135.1729 108.3158 False
KY VT 17.3714 1.0 -85.9321 120.6749 False
KY WI 28.0 0.9999 -66.3028 122.3028 False
MA MD 474.5 0.0 330.4501 618.5499 True
MA MN 53.1 0.9989 -94.5072 200.7072 False
MA NC 8.7857 1.0 -132.6685 150.2399 False
MA NE -1.6429 1.0 -143.097 139.8113 False
MA NJ 671.2143 0.0 529.7601 812.6685 True
MA NV 448.7 0.0 301.0928 596.3072 True
MA NY 160.8333 0.0507 -0.2193 321.886 False
MA OR -7.0 1.0 -183.4243 169.4243 False
MA SC 9.25 1.0 -143.538 162.038 False
MA UT -8.5 1.0 -169.5527 152.5527 False
MA VT 22.3 1.0 -125.3072 169.9072 False
MA WI 32.9286 1.0 -108.5256 174.3827 False
MD MN -421.4 0.0 -528.2302 -314.5698 True
MD NC -465.7143 0.0 -563.8677 -367.5608 True
MD NE -476.1429 0.0 -574.2963 -377.9894 True
MD NJ 196.7143 0.0 98.5608 294.8677 True
MD NV -25.8 1.0 -132.6302 81.0302 False
MD NY -313.6667 0.0 -438.4175 -188.9158 True
MD OR -481.5 0.0 -625.5499 -337.4501 True
MD SC -465.25 0.0 -579.1314 -351.3686 True
MD UT -483.0 0.0 -607.7509 -358.2491 True
MD VT -452.2 0.0 -559.0302 -345.3698 True
MD WI -441.5714 0.0 -539.7249 -343.418 True
MN NC -44.3143 0.9903 -147.6178 58.9892 False
MN NE -54.7429 0.9211 -158.0464 48.5607 False
MN NJ 618.1143 0.0 514.8108 721.4178 True
MN NV 395.6 0.0 284.0194 507.1806 True
MN NY 107.7333 0.2279 -21.1088 236.5755 False
MN OR -60.1 0.9947 -207.7072 87.5072 False
MN SC -43.85 0.9984 -162.199 74.499 False
MN UT -61.6 0.9692 -190.4421 67.2421 False
MN VT -30.8 1.0 -142.3806 80.7806 False
MN WI -20.1714 1.0 -123.4749 83.1321 False
NC NE -10.4286 1.0 -104.7314 83.8742 False
NC NJ 662.4286 0.0 568.1258 756.7314 True
NC NV 439.9143 0.0 336.6108 543.2178 True
NC NY 152.0476 0.0025 30.3033 273.792 True
NC OR -15.7857 1.0 -157.2399 125.6685 False
NC SC 0.4643 1.0 -110.1155 111.0441 False
NC UT -17.2857 1.0 -139.0301 104.4586 False
NC VT 13.5143 1.0 -89.7892 116.8178 False
NC WI 24.1429 1.0 -70.1599 118.4456 False
NE NJ 672.8571 0.0 578.5544 767.1599 True
NE NV 450.3429 0.0 347.0393 553.6464 True
NE NY 162.4762 0.0008 40.7318 284.2206 True
NE OR -5.3571 1.0 -146.8113 136.097 False
NE SC 10.8929 1.0 -99.687 121.4727 False
NE UT -6.8571 1.0 -128.6015 114.8872 False
NE VT 23.9429 1.0 -79.3607 127.2464 False
NE WI 34.5714 0.9986 -59.7314 128.8742 False
NJ NV -222.5143 0.0 -325.8178 -119.2108 True
NJ NY -510.381 0.0 -632.1253 -388.6366 True
NJ OR -678.2143 0.0 -819.6685 -536.7601 True
NJ SC -661.9643 0.0 -772.5441 -551.3845 True
NJ UT -679.7143 0.0 -801.4586 -557.9699 True
NJ VT -648.9143 0.0 -752.2178 -545.6108 True
NJ WI -638.2857 0.0 -732.5885 -543.9829 True
NV NY -287.8667 0.0 -416.7088 -159.0245 True
NV OR -455.7 0.0 -603.3072 -308.0928 True
NV SC -439.45 0.0 -557.799 -321.101 True
NV UT -457.2 0.0 -586.0421 -328.3579 True
NV VT -426.4 0.0 -537.9806 -314.8194 True
NV WI -415.7714 0.0 -519.0749 -312.4679 True
NY OR -167.8333 0.0317 -328.886 -6.7807 True
NY SC -151.5833 0.0121 -286.3297 -16.837 True
NY UT -169.3333 0.0065 -313.3832 -25.2835 True
NY VT -138.5333 0.0218 -267.3755 -9.6912 True
NY WI -127.9048 0.0288 -249.6491 -6.1604 True
OR SC 16.25 1.0 -136.538 169.038 False
OR UT -1.5 1.0 -162.5527 159.5527 False
OR VT 29.3 1.0 -118.3072 176.9072 False
OR WI 39.9286 1.0 -101.5256 181.3827 False
SC UT -17.75 1.0 -152.4963 116.9963 False
SC VT 13.05 1.0 -105.299 131.399 False
SC WI 23.6786 1.0 -86.9012 134.2584 False
UT VT 30.8 1.0 -98.0421 159.6421 False
UT WI 41.4286 0.9995 -80.3158 163.1729 False
VT WI 10.6286 1.0 -92.6749 113.9321 False
In [61]:
tukey_results_df = pd.DataFrame(data=tukey._results_table.data[1:], columns=tukey._results_table.data[0])
tukey_results_filter = tukey_results_df[tukey_results_df['p-adj'] != 1]
tukey_results_filter
Out[61]:
group1 group2 meandiff p-adj lower upper reject
0 AR AZ 99.2857 0.1356 -11.2941 209.8655 False
1 AR CA 138.0000 0.5484 -59.2484 335.2484 False
2 AR CO 109.0000 0.8893 -88.2484 306.2484 False
8 AR MD 474.0000 0.0000 360.1186 587.8814 True
9 AR MN 52.6000 0.9856 -65.7490 170.9490 False
... ... ... ... ... ... ... ...
196 NY SC -151.5833 0.0121 -286.3297 -16.8370 True
197 NY UT -169.3333 0.0065 -313.3832 -25.2835 True
198 NY VT -138.5333 0.0218 -267.3755 -9.6912 True
199 NY WI -127.9048 0.0288 -249.6491 -6.1604 True
208 UT WI 41.4286 0.9995 -80.3158 163.1729 False

128 rows × 7 columns

In [62]:
accepted = tukey_results_filter.sort_values(by=['p-adj']).iloc[56:65, :]
accepted
Out[62]:
group1 group2 meandiff p-adj lower upper reject
176 NE NY 162.4762 0.0008 40.7318 284.2206 True
126 KY NY 155.9048 0.0017 34.1604 277.6491 True
168 NC NY 152.0476 0.0025 30.3033 273.7920 True
113 IA NY 151.3333 0.0027 29.5890 273.0777 True
84 FL NY 145.9048 0.0049 24.1604 267.6491 True
14 AR NY 160.3333 0.0054 25.5870 295.0797 True
197 NY UT -169.3333 0.0065 -313.3832 -25.2835 True
196 NY SC -151.5833 0.0121 -286.3297 -16.8370 True
30 AZ NE -101.4286 0.0217 -195.7314 -7.1258 True
In [63]:
tukey_alpha = 0.05

# Benjamini-Hochberg: sort the adjusted p-values and compare the k-th smallest
# against its threshold (k / m) * alpha
tukey_pvalues_sorted = np.sort(tukey_results_filter['p-adj'])
tukey_m = len(tukey_pvalues_sorted)
tukey_thresholds = np.array([(rank / tukey_m) * tukey_alpha for rank in range(1, tukey_m + 1)])
# Largest rank k whose p-value still falls below its BH threshold
tukey_k = max(np.where(tukey_pvalues_sorted <= tukey_thresholds)[0] + 1)

plt.figure(figsize=(10, 6))
plt.plot(tukey_pvalues_sorted)
plt.plot(tukey_thresholds, label='BH Boundary')
plt.axvline(x=tukey_k-1, linestyle='--', label=f'Cut-off at k={tukey_k-1}')
plt.xlabel('P-Value Rank')
plt.ylabel('P-Value')
plt.title('Sorted P-Values and BH Decision Boundary')
plt.legend()
plt.show()
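As a cross-check on this hand-rolled BH procedure, the same correction is available from statsmodels; a minimal sketch (not part of the original cells, and assuming tukey_results_filter from the cell above is still in scope):

from statsmodels.stats.multitest import multipletests

# BH (FDR) correction over the same filtered Tukey p-values
reject_bh, pvals_bh, _, _ = multipletests(tukey_results_filter['p-adj'], alpha=0.05, method='fdr_bh')
print(f"Comparisons still rejected after BH: {reject_bh.sum()} of {len(reject_bh)}")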

4. The null hypothesis states there is no difference in asthma incidence rates due to CO2 levels between the two periods 2010-2014 and 2015-2018. The alternative hypothesis states there is a difference in asthma incidence rates due to CO2 levels between those two periods.¶

In [64]:
merged_df['Emissions'] = pd.to_numeric(merged_df['Emissions'], errors='coerce')
carbondioxide_2010_to_2014 = np.mean(merged_df[(merged_df['YearStart'] >= 2010) & (merged_df['YearStart'] <= 2014)]['Emissions'])
carbondioxide_2015_to_2018 = np.mean(merged_df[(merged_df['YearStart'] >= 2015) & (merged_df['YearStart'] <= 2018)]['Emissions'])

asthma_rates_2010_to_2014 = np.mean(merged_df[(merged_df['YearStart'] >= 2010) & (merged_df['YearStart'] <= 2014)]["Visits per 10,000"])
asthma_rates_2015_to_2018 = np.mean(merged_df[(merged_df['YearStart'] >= 2015) & (merged_df['YearStart'] <= 2018)]["Visits per 10,000"])

t_test_result_asthma = ttest_ind(
    merged_df[(merged_df['YearStart'] >= 2010) & (merged_df['YearStart'] <= 2014)]["Visits per 10,000"],
    merged_df[(merged_df['YearStart'] >= 2015) & (merged_df['YearStart'] <= 2018)]["Visits per 10,000"],
    nan_policy='omit'
)

print(f"2010-2014 Mean Carbon Dioxide Emissions: {carbondioxide_2010_to_2014:.2f}")
print(f"2015-2018 Mean Carbon Dioxide Emissions: {carbondioxide_2015_to_2018:.2f}")
print(f"2010-2014 Mean Asthma Visit Rates: {asthma_rates_2010_to_2014:.2f}")
print(f"2015-2018 Mean Asthma Visit Rates: {asthma_rates_2015_to_2018:.2f}")
print("T-test result for asthma based on CO2 levels:", t_test_result_asthma)
2010-2014 Mean Carbon Dioxide Emissions: 109.04
2015-2018 Mean Carbon Dioxide Emissions: 92.64
2010-2014 Mean Asthma Visit Rates: 83.80
2015-2018 Mean Asthma Visit Rates: 65.67
T-test result for asthma based on CO2 levels: Ttest_indResult(statistic=1.1893813840193959, pvalue=0.23571396070452827)

The t-test works for this analysis because it is designed to compare the means of two independent samples. In this case, the two samples are the asthma visit rates from 2010-2014 and from 2015-2018. The t-test is appropriate for answering whether there is a statistically significant difference between the means of these two periods, assuming the visit rates are roughly normally distributed within each period or the samples are large enough for the Central Limit Theorem to apply.
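Because the test leans on those assumptions, a quick diagnostic of equal variances and approximate normality can be run on the two samples first. A minimal sketch, reusing merged_df from the cell above (this check is not part of the original analysis):

from scipy.stats import levene, shapiro

visits_early = merged_df[(merged_df['YearStart'] >= 2010) & (merged_df['YearStart'] <= 2014)]["Visits per 10,000"].dropna()
visits_late = merged_df[(merged_df['YearStart'] >= 2015) & (merged_df['YearStart'] <= 2018)]["Visits per 10,000"].dropna()

# Levene's test for equal variances, Shapiro-Wilk for approximate normality
print("Levene (equal variances):", levene(visits_early, visits_late))
print("Shapiro 2010-2014:", shapiro(visits_early))
print("Shapiro 2015-2018:", shapiro(visits_late))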

In [65]:
n1 = (len(merged_df[(merged_df['YearStart'] >= 2010) & (merged_df['YearStart'] <= 2014)]["Visits per 10,000"]))
n2 = len(merged_df[(merged_df['YearStart'] >= 2015) & (merged_df['YearStart'] <= 2018)]["Visits per 10,000"])
print(n1, n2, 2)
82 118 2
In [66]:
# define parameters for the power analysis of the t-test
# 83.80 is the mean asthma visit rate for 2010-2014
# 65.67 is the mean asthma visit rate for 2015-2018
# 14 is the count applied to each group's term in the pooled estimate below
# 198 = 82 + 118 - 2 (pooled degrees of freedom)

effect_size = (83.80 - 65.67) / np.sqrt(((14 * 83.80**2) + (14 * 65.67**2)) / 198)
sample_size = 120
alpha = 0.05
power = TTestIndPower().solve_power(effect_size, nobs1=sample_size, alpha=alpha, ratio=1, alternative='two-sided')

print(f"Effect Size: {effect_size:.4f}")
print(f"Power of the test: {power:.4f}")
Effect Size: 0.6404
Power of the test: 0.9986

The power of a t-test is the probability that the test will correctly reject a false null hypothesis (i.e., it will not make a Type II error). The parameters involved in calculating the power of a t-test include:

  • Effect Size: the magnitude of the phenomenon (the difference between group means) divided by the standard deviation of the data; it quantifies the size of the difference relative to the variability in the data. In this context, the effect size was calculated from the difference in mean asthma visit rates between the two periods (2010-2014 and 2015-2018), normalized by the pooled standard deviation of these rates.

  • Sample Size: the number of observations in each group; larger samples make it easier to detect a true difference of a given size.

  • Significance Level (alpha): the probability of a Type I error the test is willing to tolerate, set here to 0.05.

The results from the power analysis are as follows:

  • Effect Size: 0.6404, indicating a medium-to-large effect by common benchmarks. This means the difference in mean asthma visit rates between the two periods is moderately large relative to the variability in the data.

  • Power of the test: 0.9986, which is above the commonly accepted threshold of 0.8. This means there is a 99.86% chance of correctly rejecting the null hypothesis if it is false.
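The same power calculation can also be run in reverse to ask how many observations per group would be needed to reach the conventional 0.8 power at this effect size; a small sketch (not part of the original cells) using the effect_size and alpha defined above:

from statsmodels.stats.power import TTestIndPower

# Solve for the per-group sample size that reaches 80% power
required_n = TTestIndPower().solve_power(effect_size=effect_size, power=0.8, alpha=alpha,
                                         ratio=1, alternative='two-sided')
print(f"Observations per group needed for 80% power: {required_n:.1f}")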

Research Question 2: Is there a difference in asthma incidence rates across states due to differences in altitude?¶

In [67]:
# State altitude (source https://simple.wikipedia.org/wiki/List_of_U.S._states_by_elevation)
# Data from Wikipedia
data = {
    'State': ['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado', 'Connecticut', 'Delaware', 'District of Columbia', 'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire', 'New Jersey', 'New Mexico', 'New York', 'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania', 'Rhode Island', 'South Carolina', 'South Dakota', 'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington', 'West Virginia', 'Wisconsin', 'Wyoming'],
    'Highest Point': ['Cheaha Mountain', 'Mount McKinley (Denali)', 'Humphreys Peak', 'Magazine Mountain', 'Mount Whitney', 'Mount Elbert', 'Mount Frissell', 'Ebright Azimuth', 'Tenleytown', 'Britton Hill', 'Brasstown Bald', 'Mauna Kea', 'Borah Peak', 'Charles Mound', 'Hoosier Hill', 'Hawkeye Point', 'Mount Sunflower', 'Black Mountain', 'Driskill Mountain', 'Mount Katahdin', 'Backbone Mountain', 'Mount Greylock', 'Mount Arvon', 'Eagle Mountain', 'Woodall Mountain', 'Taum Sauk Mountain', 'Granite Peak', 'Panorama Point', 'Boundary Peak', 'Mount Washington', 'High Point', 'Wheeler Peak', 'Mount Marcy', 'Mount Mitchell', 'White Butte', 'Campbell Hill', 'Black Mesa', 'Mount Hood', 'Mount Davis', 'Jerimoth Hill', 'Sassafras Mountain', 'Harney Peak', 'Clingmans Dome', 'Guadalupe Peak', 'Kings Peak', 'Mount Mansfield', 'Mount Rogers', 'Mount Rainier', 'Spruce Knob', 'Timms Hill', 'Gannett Peak'],
    'Highest Elevation (ft)': [2405, 20320, 12633, 2753, 14505, 14440, 2380, 448, 410, 345, 4784, 13796, 12668, 1235, 1257, 1670, 4039, 4145, 535, 5267, 3360, 3487, 1979, 2301, 806, 1772, 12799, 5424, 13140, 6288, 1803, 13161, 5344, 6684, 3506, 1549, 4973, 11239, 3213, 812, 3560, 7242, 6643, 8749, 13528, 4393, 5729, 14410, 4861, 1951, 13804],
    'Highest Elevation (m)': [733, 6198, 3853, 840, 4421, 4401, 726, 137, 125, 105, 1459, 4208, 3862, 377, 383, 509, 1232, 1263, 163, 1606, 1025, 1064, 604, 702, 246, 540, 3904, 1654, 4007, 1918, 550, 4014, 1630, 2039, 1069, 472, 1517, 3428, 980, 248, 1086, 2209, 2026, 2668, 4126, 1340, 1747, 4395, 1483, 595, 4210],
    'Lowest Point': ['Gulf of Mexico', 'Pacific Ocean', 'Colorado River', 'Ouachita River', 'Death Valley', 'Arikaree River', 'Long Island Sound', 'Atlantic Ocean', 'Potomac River', 'Atlantic Ocean', 'Atlantic Ocean', 'Pacific Ocean', 'Snake River', 'Mississippi River', 'Ohio River', 'Mississippi River', 'Verdigris River', 'Mississippi River', 'New Orleans', 'Atlantic Ocean', 'Atlantic Ocean', 'Atlantic Ocean', 'Lake Erie', 'Lake Superior', 'Gulf of Mexico', 'Saint Francis River', 'Kootenai River', 'Missouri River', 'Colorado River', 'Atlantic Ocean', 'Atlantic Ocean', 'Red Bluff Reservoir', 'Atlantic Ocean', 'Atlantic Ocean', 'Red River', 'Ohio River', 'Little River', 'Pacific Ocean', 'Delaware River', 'Atlantic Ocean', 'Atlantic Ocean', 'Big Stone Lake', 'Mississippi River', 'Gulf of Mexico', 'Beaver Dam Wash', 'Lake Champlain', 'Atlantic Ocean', 'Pacific Ocean', 'Potomac River', 'Lake Michigan', 'Belle Fourche River'],
    'Lowest Elevation (ft)': [0, 0, 70, 55, -282, 3315, 0, 0, 1, 0, 0, 0, 710, 279, 320, 480, 679, 257, -8, 0, 0, 0, 571, 601, 0, 230, 1800, 840, 479, 0, 0, 2842, 0, 0, 750, 455, 289, 0, 0, 0, 0, 966, 178, 0, 2000, 95, 0, 0, 240, 579, 3099],
    'Lowest Elevation (m)': [0, 0, 21, 17, -86, 1010, 0, 0, 0, 0, 0, 0, 217, 85, 98, 146, 207, 78, -2, 0, 0, 0, 174, 183, 0, 70, 549, 256, 146, 0, 0, 867, 0, 0, 229, 139, 88, 0, 0, 0, 0, 295, 54, 0, 610, 29, 0, 0, 73, 177, 945],
    'Average Elevation (ft)': [500, 1900, 4100, 650, 2900, 6800, 500, 60, 150, 100, 600, 3030, 5000, 600, 700, 1100, 2000, 750, 100, 600, 350, 500, 900, 1200, 300, 800, 3400, 2600, 5500, 1000, 250, 5700, 1000, 700, 1900, 850, 1300, 3300, 1100, 350, 350, 2200, 900, 1700, 6100, 1000, 950, 1700, 1500, 1050, 6700],
    'Average Elevation (m)': [150, 580, 1300, 200, 880, 2100, 150, 20, 50, 30, 180, 920, 1500, 180, 210, 340, 610, 230, 30, 180, 110, 150, 280, 366, 90, 240, 1000, 790, 1700, 300, 80, 1700, 300, 210, 580, 260, 400, 1000, 340, 110, 110, 670, 280, 520, 1860, 300, 290, 520, 460, 320, 2044],
    'Elevation Difference (ft)': [2405, 20320, 12563, 2698, 14787, 11125, 2380, 448, 409, 345, 4784, 13796, 11958, 956, 937, 1190, 3360, 3888, 543, 5267, 3360, 3487, 1408, 1700, 806, 1542, 10999, 4584, 12661, 6288, 1803, 10319, 5344, 6684, 2756, 1094, 4684, 11239, 3213, 812, 3560, 6276, 6465, 8749, 11528, 4298, 5729, 14410, 4621, 1372, 10705],
    'Elevation Difference (m)': [733, 6198, 3832, 823, 4507, 3391, 726, 137, 125, 105, 1459, 4208, 3645, 292, 285, 363, 1025, 1185, 165, 1606, 1025, 1064, 430, 519, 246, 470, 3355, 1398, 3861, 1918, 550, 3147, 1630, 2039, 840, 333, 1429, 3428, 980, 248, 1086, 1914, 1972, 2668, 3516, 1311, 1747, 4395, 1410, 418, 3265]
}

elevationdf = pd.DataFrame(data)
elevationdf.head()
Out[67]:
State Highest Point Highest Elevation (ft) Highest Elevation (m) Lowest Point Lowest Elevation (ft) Lowest Elevation (m) Average Elevation (ft) Average Elevation (m) Elevation Difference (ft) Elevation Difference (m)
0 Alabama Cheaha Mountain 2405 733 Gulf of Mexico 0 0 500 150 2405 733
1 Alaska Mount McKinley (Denali) 20320 6198 Pacific Ocean 0 0 1900 580 20320 6198
2 Arizona Humphreys Peak 12633 3853 Colorado River 70 21 4100 1300 12563 3832
3 Arkansas Magazine Mountain 2753 840 Ouachita River 55 17 650 200 2698 823
4 California Mount Whitney 14505 4421 Death Valley -282 -86 2900 880 14787 4507
In [68]:
regions_dict = {
    'NERC': {'PSTATABB'}, 
    'AK': {'AK'}, 
    'SERC': {'AL', 'GA', 'IL', 'OH', 'OK', 'NC', 'FL', 'TN', 'VA', 'LA', 'MD', 'AR', 'MS', 'KY', 'IA', 'WV', 'MO', 'NY', 'DC', 'TX', 'SC'}, 
    'MRO': {'AR', 'ND', 'KS', 'IA', 'NE', 'MN', 'LA', 'MO', 'IL', 'OK', 'SD', 'NM', 'MT', 'CO', 'WI', 'IN', 'TX'}, 
    'WECC': {'CA', 'MA', 'WY', 'AZ', 'NV', 'SD', 'ID', 'OR', 'UT', 'NM', 'MT', 'CO', 'WA', 'TX'}, 
    'NPCC': {'CT', 'VT', 'MA', 'PA', 'IL', 'RI', 'ME', 'NJ', 'NY', 'MD', 'NH'}, 
    'RFC': {'CT', 'MD', 'MA', 'PA', 'KY', 'VA', 'MN', 'MI', 'WV', 'IL', 'OH', 'RI', 'NJ', 'NY', 'DE', 'WI', 'IN', 'DC'}, 
    'HI': {'HI'}, 
    'TRE': {'OK', 'TX'}, 
    'PR': {'PR'}
}

if 'NERC' in regions_dict:
    del regions_dict['NERC']

for key in regions_dict:
    regions_dict[key] = list(regions_dict[key])


statesInRegion_df = pd.DataFrame(list(regions_dict.items()), columns=['NERC_Region', 'Plant_States'])
statesInRegion_df.head()
Out[68]:
NERC_Region Plant_States
0 AK [AK]
1 SERC [GA, KY, WV, VA, MS, MO, DC, IA, NC, OH, AL, F...
2 MRO [ND, NE, CO, MO, MT, IA, KS, LA, MN, IN, IL, W...
3 WECC [MA, ID, OR, CO, WY, NV, MT, SD, TX, AZ, WA, U...
4 NPCC [MA, PA, NJ, IL, ME, NH, MD, NY, VT, CT, RI]
In [69]:
data = {
    'State': ['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado', 'Connecticut', 'Delaware', 
              'District of Columbia', 'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa', 'Kansas', 
              'Kentucky', 'Louisiana', 'Maine', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi', 
              'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire', 'New Jersey', 'New Mexico', 'New York', 
              'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania', 'Rhode Island', 
              'South Carolina', 'South Dakota', 'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington', 
              'West Virginia', 'Wisconsin', 'Wyoming'],
    'Average Elevation (ft)': [500, 1900, 4100, 650, 2900, 6800, 500, 60, 150, 100, 600, 3030, 5000, 600, 700, 1100, 2000,
                               750, 100, 600, 350, 500, 900, 1200, 300, 800, 3400, 2600, 5500, 1000, 250, 5700, 1000, 700,
                               1900, 850, 1300, 3300, 1100, 350, 350, 2200, 900, 1700, 6100, 1000, 950, 1700, 1500, 1050,
                               6700]
}

elevationdf = pd.DataFrame(data)

regions_dict = {
    'AK': ['Alaska'], 
    'SERC': ['Alabama', 'Georgia', 'Illinois', 'Ohio', 'Oklahoma', 'North Carolina', 'Florida', 'Tennessee', 'Virginia', 
             'Louisiana', 'Maryland', 'Arkansas', 'Mississippi', 'Kentucky', 'Iowa', 'West Virginia', 'Missouri', 
             'New York', 'District of Columbia', 'Texas', 'South Carolina'], 
    'MRO': ['Arkansas', 'North Dakota', 'Kansas', 'Iowa', 'Nebraska', 'Minnesota', 'Louisiana', 'Missouri', 'Illinois', 
            'Oklahoma', 'South Dakota', 'New Mexico', 'Montana', 'Colorado', 'Wisconsin', 'Indiana', 'Texas'], 
    'WECC': ['California', 'Massachusetts', 'Wyoming', 'Arizona', 'Nevada', 'South Dakota', 'Idaho', 'Oregon', 'Utah', 
             'New Mexico', 'Montana', 'Colorado', 'Washington', 'Texas'], 
    'NPCC': ['Connecticut', 'Vermont', 'Massachusetts', 'Pennsylvania', 'Illinois', 'Rhode Island', 'Maine', 'New Jersey', 
             'New York', 'Maryland', 'New Hampshire'], 
    'RFC': ['Connecticut', 'Maryland', 'Massachusetts', 'Pennsylvania', 'Kentucky', 'Virginia', 'Minnesota', 'Michigan', 
            'West Virginia', 'Illinois', 'Ohio', 'Rhode Island', 'New Jersey', 'New York', 'Delaware', 'Wisconsin', 'Indiana', 
            'District of Columbia'], 
    'HI': ['Hawaii'], 
    'TRE': ['Oklahoma', 'Texas']
}

flattened_data = []
for region, states in regions_dict.items():
    for state in states:
        flattened_data.append({'NERC_Region': region, 'State': state})

regions_df = pd.DataFrame(flattened_data)

merged_df = pd.merge(regions_df, elevationdf, on='State')

average_elevations = merged_df.groupby('NERC_Region')['Average Elevation (ft)'].mean().reset_index()

average_elevations.head(10)
Out[69]:
NERC_Region Average Elevation (ft)
0 AK 1900.000000
1 HI 3030.000000
2 MRO 1988.235294
3 NPCC 659.090909
4 RFC 708.888889
5 SERC 726.190476
6 TRE 1500.000000
7 WECC 3971.428571
In [70]:
state_average_visits = q2_with_emissions.groupby('LocationAbbr')['Visits per 10,000'].mean().reset_index()
state_average_visits['Visits per 10,000'] = state_average_visits['Visits per 10,000'].astype(int)
In [71]:
asthma_data = {
    'LocationAbbr': ['AR', 'AZ', 'CA', 'CO', 'FL', 'GA', 'IA', 'KY', 'MA', 'MD', 'MN', 'NC', 'NE', 'NJ', 'NV', 'NY', 'OR', 
                     'SC', 'UT', 'VT', 'WI'],
    'Visits per 10,000': [25, 124, 163, 134, 39, 39, 34, 29, 24, 499, 77, 33, 23, 696, 473, 185, 18, 34, 16, 47, 57]
}
asthma_df = pd.DataFrame(asthma_data)

state_names = {
    'AR': 'Arkansas', 'AZ': 'Arizona', 'CA': 'California', 'CO': 'Colorado', 'FL': 'Florida', 'GA': 'Georgia', 'IA': 'Iowa',
    'KY': 'Kentucky', 'MA': 'Massachusetts', 'MD': 'Maryland', 'MN': 'Minnesota', 'NC': 'North Carolina', 'NE': 'Nebraska', 
    'NJ': 'New Jersey', 'NV': 'Nevada', 'NY': 'New York', 'OR': 'Oregon', 'SC': 'South Carolina', 'UT': 'Utah', 
    'VT': 'Vermont', 'WI': 'Wisconsin'
}
asthma_df['State'] = asthma_df['LocationAbbr'].map(state_names)

regions_dict = {
    #'AK': ['Alaska'], 
    'SERC': ['Alabama', 'Georgia', 'Illinois', 'Ohio', 'Oklahoma', 'North Carolina', 'Florida', 'Tennessee', 'Virginia', 
             'Louisiana', 'Maryland', 'Arkansas', 'Mississippi', 'Kentucky', 'Iowa', 'West Virginia', 'Missouri', 
             'New York', 'District of Columbia', 'Texas', 'South Carolina'], 
    'MRO': ['Arkansas', 'North Dakota', 'Kansas', 'Iowa', 'Nebraska', 'Minnesota', 'Louisiana', 'Missouri', 'Illinois', 
            'Oklahoma', 'South Dakota', 'New Mexico', 'Montana', 'Colorado', 'Wisconsin', 'Indiana', 'Texas'], 
    'WECC': ['California', 'Massachusetts', 'Wyoming', 'Arizona', 'Nevada', 'South Dakota', 'Idaho', 'Oregon', 'Utah', 
             'New Mexico', 'Montana', 'Colorado', 'Washington', 'Texas'], 
    'NPCC': ['Connecticut', 'Vermont', 'Massachusetts', 'Pennsylvania', 'Illinois', 'Rhode Island', 'Maine', 'New Jersey', 
             'New York', 'Maryland', 'New Hampshire'], 
    'RFC': ['Connecticut', 'Maryland', 'Massachusetts', 'Pennsylvania', 'Kentucky', 'Virginia', 'Minnesota', 'Michigan', 
            'West Virginia', 'Illinois', 'Ohio', 'Rhode Island', 'New Jersey', 'New York', 'Delaware', 'Wisconsin', 'Indiana', 
            'District of Columbia'], 
}

flattened_data = [{'NERC_Region': region, 'State': state} for region, states in regions_dict.items() for state in states]
regions_df = pd.DataFrame(flattened_data)

merged_df = pd.merge(regions_df, asthma_df, on='State', how='left')

visits_sum = merged_df.groupby('NERC_Region').agg(
    Total_Visits_Per_10000=('Visits per 10,000', 'sum'),
    Count_States=('State', 'count')
)
visits_sum['Average_Visits_Per_10000'] = visits_sum['Total_Visits_Per_10000'] / visits_sum['Count_States']

average_elevations = {
    'NERC_Region': ['SERC', 'MRO', 'WECC', 'NPCC', 'RFC', 'HI', 'TRE', 'AK'],
    'Average Elevation (ft)': [800, 2000, 3000, 400, 1100, 3030, 1300, 1900]
}
average_elevations_df = pd.DataFrame(average_elevations)

final_df = pd.merge(visits_sum.reset_index(), average_elevations_df, on='NERC_Region', how='left')
final_df.head(10)
Out[71]:
NERC_Region Total_Visits_Per_10000 Count_States Average_Visits_Per_10000 Average Elevation (ft)
0 MRO 350.0 17 20.588235 2000
1 NPCC 1451.0 11 131.909091 400
2 RFC 1567.0 18 87.055556 1100
3 SERC 917.0 21 43.666667 800
4 WECC 952.0 14 68.000000 3000
In [72]:
X = final_df['Average Elevation (ft)'] 
y = final_df['Average_Visits_Per_10000'] 
X = sm.add_constant(X) 


model = sm.OLS(y, X).fit()

print(model.summary())
                               OLS Regression Results                               
====================================================================================
Dep. Variable:     Average_Visits_Per_10000   R-squared:                       0.224
Model:                                  OLS   Adj. R-squared:                 -0.035
Method:                       Least Squares   F-statistic:                    0.8661
Date:                      Mon, 06 May 2024   Prob (F-statistic):              0.421
Time:                              11:42:49   Log-Likelihood:                -24.663
No. Observations:                         5   AIC:                             53.33
Df Residuals:                             3   BIC:                             52.55
Df Model:                                 1                                         
Covariance Type:                  nonrobust                                         
==========================================================================================
                             coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------
const                     98.4730     35.996      2.736      0.072     -16.084     213.030
Average Elevation (ft)    -0.0193      0.021     -0.931      0.421      -0.085       0.047
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   2.544
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.627
Skew:                          -0.156   Prob(JB):                        0.731
Kurtosis:                       1.294   Cond. No.                     3.22e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.22e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
C:\Users\tranh\anaconda3\lib\site-packages\statsmodels\stats\stattools.py:74: ValueWarning: omni_normtest is not valid with less than 8 observations; 5 samples were given.
  warn("omni_normtest is not valid with less than 8 observations; %i "
In [73]:
Global_P_Values = []

Global_P_Values.extend([0.000, 0.610, 0.095121e-01])

p_adj_values = accepted['p-adj'].tolist()
Global_P_Values.extend(p_adj_values)

# Global_P_Values.extend([3.011885e-31, 3.287227e-02, 1.964625e-01, 4.996776e-01, 5.045825e-02, 4.432867e-01])
Global_P_Values.append(4.432867e-01)
Global_P_Values.extend([0.072, 0.421])
print(Global_P_Values)

alpha = 0.05 

pvalues_sorted = np.sort(Global_P_Values)
print(pvalues_sorted)
m = len(pvalues_sorted)
thresholds = np.array([(i / m) * alpha for i in range(1, m + 1)])

k = max(np.where(pvalues_sorted <= thresholds)[0] + 1) if np.any(pvalues_sorted <= thresholds) else 0

plt.figure(figsize=(10, 6))
plt.plot(pvalues_sorted, label='Sorted P-values')
plt.plot(thresholds, label='BH Boundary')
plt.axvline(x=k-1, linestyle='--', color='red', label=f'Cut-off at k={k-1}')
plt.xlabel('P-Value Rank')
plt.ylabel('P-Value')
plt.title('Sorted P-Values and BH Decision Boundary')
plt.legend()
plt.show()
[0.0, 0.61, 0.0095121, 0.0008, 0.0017, 0.0025, 0.0027, 0.0049, 0.0054, 0.0065, 0.0121, 0.0217, 0.4432867, 0.072, 0.421]
[0.        0.0008    0.0017    0.0025    0.0027    0.0049    0.0054
 0.0065    0.0095121 0.0121    0.0217    0.072     0.421     0.4432867
 0.61     ]
In [74]:
merged_df
Out[74]:
NERC_Region State LocationAbbr Visits per 10,000
0 SERC Alabama NaN NaN
1 SERC Georgia GA 39.0
2 SERC Illinois NaN NaN
3 SERC Ohio NaN NaN
4 SERC Oklahoma NaN NaN
... ... ... ... ...
76 RFC New York NY 185.0
77 RFC Delaware NaN NaN
78 RFC Wisconsin WI 57.0
79 RFC Indiana NaN NaN
80 RFC District of Columbia NaN NaN

81 rows × 4 columns

In [75]:
avg_elevation_1 = {
    'State': ['AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', 'DC', 'FL', 'GA', 'HI',
    'ID', 'IL', 'IN', 'IA', 'KS', 'KY', 'LA', 'ME', 'MD', 'MA', 'MI', 'MN',
    'MS', 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ', 'NM', 'NY', 'NC', 'ND', 'OH',
    'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VT', 'VA', 'WA',
    'WV', 'WI', 'WY'],
    'Average Elevation (ft)': [500, 1900, 4100, 650, 2900, 6800, 500, 60, 150, 100, 600, 3030, 5000, 600, 700, 1100, 2000,
                               750, 100, 600, 350, 500, 900, 1200, 300, 800, 3400, 2600, 5500, 1000, 250, 5700, 1000, 700,
                               1900, 850, 1300, 3300, 1100, 350, 350, 2200, 900, 1700, 6100, 1000, 950, 1700, 1500, 1050,
                               6700]
}

df2 = pd.DataFrame(avg_elevation_1)
df2.head()
Out[75]:
State Average Elevation (ft)
0 AL 500
1 AK 1900
2 AZ 4100
3 AR 650
4 CA 2900
In [76]:
PLNT22df = pd.read_csv('PLNT22.csv')
state_counts = PLNT22df['Plant state abbreviation'].value_counts()
data = {
    'State': ['CA', 'NC', 'NY', 'TX', 'MN', 'MA', 'NJ', 'IL', 'IA', 'MI', 'OR', 'CO', 'FL', 'PA', 'GA', 
              'WI', 'SC', 'IN', 'VA', 'OH', 'MD', 'KS', 'CT', 'AK', 'AZ', 'WA', 'ID', 'NM', 'ME', 'MO', 
              'OK', 'NE', 'VT', 'UT', 'NV', 'RI', 'LA', 'TN', 'AL', 'AR', 'HI', 'WY', 'MT', 'NH', 'ND', 
              'PR', 'KY', 'SD', 'MS', 'WV', 'DE', 'DC'],
    'Count': [1678, 875, 821, 754, 724, 600, 368, 328, 283, 266, 265, 265, 242, 242, 235, 204, 197, 194, 
              193, 183, 171, 161, 155, 151, 146, 142, 140, 135, 133, 130, 129, 126, 113, 113, 102, 87, 
              82, 82, 75, 73, 68, 67, 65, 61, 60, 60, 52, 50, 47, 36, 31, 13]
}
df1 = pd.DataFrame(data)
df1.head()
result = {k: int(v) for k, v in df1.set_index('State')['Count'].to_dict().items()}
result2 = {k : int(v) for k, v in df2.set_index('State')['Average Elevation (ft)'].to_dict().items()}
randomforest_df['Total Power Plant Count'] = randomforest_df['LocationAbbr'].map(result)
randomforest_df['Average Elevation (ft)'] = randomforest_df['LocationAbbr'].map(result2)
randomforest_df.head()
C:\Users\tranh\AppData\Local\Temp\ipykernel_2596\2533913839.py:1: DtypeWarning: Columns (0,1,4,6,8,16,17,19,20,22,23,27,29,33,34,49,51,52,54,55,57,58,59,60,61,62,63,64,65,67,68,70,71,73,80,94,95,96,99) have mixed types. Specify dtype option on import or set low_memory=False.
  PLNT22df = pd.read_csv('PLNT22.csv')
Out[76]:
YearStart LocationAbbr Stratification1 DataValue Year_x State Total Population State Name Year_male_pop State_male_pop Male Population Year_female_pop State_female_pop Female Population Year_y Emissions Visits per 10,000 Total Power Plant Count Average Elevation (ft)
0 2010 AZ Female 7473.116667 2010 AZ 1600442.0 Arizona 2010 AZ 809747.0 2010 AZ 790763.0 2010 99.5 94.505138 146 4100
1 2010 AZ Male 6150.000000 2010 AZ 1600442.0 Arizona 2010 AZ 809747.0 2010 AZ 790763.0 2010 99.5 75.949648 146 4100
2 2010 CA Female 38898.420000 2010 CA 8730612.0 California 2010 CA 4327961.0 2010 CA 4402785.0 2010 356.6 88.349579 1678 2900
3 2010 CA Male 32368.006667 2010 CA 8730612.0 California 2010 CA 4327961.0 2010 CA 4402785.0 2010 356.6 74.788120 1678 2900
4 2010 FL Female 27106.200000 2010 FL 18845537.0 Florida 2010 FL 9212261.0 2010 FL 9633882.0 2010 245.5 28.136321 242 100
In [77]:
X_a = randomforest_df[['Emissions', 'Average Elevation (ft)', 'Total Population', 'Male Population', 'Female Population']]
y_a = randomforest_df['Visits per 10,000']
X_a = pd.get_dummies(X_a)
X_trainz, X_testz, y_trainz, y_testz = train_test_split(X_a, y_a, test_size=0.33, random_state=42)
asthma_nonparam = RandomForestRegressor()
asthma_nonparam.fit(X_trainz,  y_trainz)

asthma_train_y_pred = asthma_nonparam.predict(X_trainz)
asthma_train_rmse = mean_squared_error(y_trainz, asthma_train_y_pred, squared=False)
asthma_train_mse = mean_squared_error(y_trainz, asthma_train_y_pred)
asthma_train_r2_score = r2_score(y_trainz, asthma_train_y_pred)

asthma_test_y_pred = asthma_nonparam.predict(X_testz)
asthma_test_rmse = mean_squared_error(y_testz, asthma_test_y_pred, squared=False)
asthma_test_mse = mean_squared_error(y_testz, asthma_test_y_pred)
asthma_test_r2_score = r2_score(y_testz, asthma_test_y_pred)


print('Training rmse: ', asthma_train_rmse)
print('Training mse: ', asthma_train_mse)
print('Training r2 score: ', asthma_train_r2_score)
print('Testing rmse: ', asthma_test_rmse)
print('Testing mse:', asthma_test_mse)
print('Testing r2 score: ', asthma_test_r2_score)
Training rmse:  12.064387076833203
Training mse:  145.54943553965998
Training r2 score:  0.9857253688229829
Testing rmse:  49.81317124162892
Testing mse: 2481.3520291478467
Testing r2 score:  0.8111242389109078
In [78]:
# Round visit rates up to integer counts so the outcome can be modeled as a Poisson count
ceilupvisitDF = randomforest_df.copy()
ceilupvisitDF['Visits per 10,000'] = np.ceil(ceilupvisitDF['Visits per 10,000'])

X2 = ceilupvisitDF[['Emissions', 'Average Elevation (ft)', 'Total Population', 'Male Population', 'Female Population']]
y2 = ceilupvisitDF['Visits per 10,000']
X_train2, X_test2, y_train2, y_test2 = train_test_split(X2, y2, test_size=0.33, random_state=42)
In [79]:
poisson_model = sm.GLM(y_train2, sm.add_constant(X_train2.astype(float)), family=sm.families.Poisson()).fit()
poisson_predictions = poisson_model.predict(sm.add_constant(X_test2.astype(float)))

gaussian_model = sm.OLS(y_trainz, sm.add_constant(X_trainz.astype(float))).fit()
gaussian_predictions = gaussian_model.predict(sm.add_constant(X_testz.astype(float)))
In [80]:
print(gaussian_model.summary())
print(poisson_model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:      Visits per 10,000   R-squared:                       0.994
Model:                            OLS   Adj. R-squared:                  0.982
Method:                 Least Squares   F-statistic:                     83.25
Date:                Mon, 06 May 2024   Prob (F-statistic):           1.07e-32
Time:                        11:42:49   Log-Likelihood:                -464.68
No. Observations:                 134   AIC:                             1109.
Df Residuals:                      44   BIC:                             1370.
Df Model:                          89                                         
Covariance Type:            nonrobust                                         
==========================================================================================
                             coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------
const                  -5556.3711   2.95e+04     -0.188      0.852   -6.51e+04     5.4e+04
Average Elevation (ft)     1.0292      5.350      0.192      0.848      -9.753      11.812
Total Population          -0.0250      0.110     -0.227      0.822      -0.247       0.197
Male Population            0.0246      0.112      0.219      0.827      -0.202       0.251
Female Population          0.0271      0.118      0.231      0.819      -0.210       0.264
Emissions_100.4         5161.7080   2.55e+04      0.202      0.840   -4.62e+04    5.65e+04
Emissions_101.2         2163.2483   1.13e+04      0.191      0.850   -2.07e+04     2.5e+04
Emissions_101.3         2165.8279   1.14e+04      0.190      0.850   -2.08e+04    2.51e+04
Emissions_103.7         5058.5364   2.55e+04      0.198      0.844   -4.63e+04    5.65e+04
Emissions_104.9         5196.9858   2.55e+04      0.204      0.839   -4.62e+04    5.66e+04
Emissions_105.1         5112.1298   2.56e+04      0.200      0.843   -4.65e+04    5.67e+04
Emissions_105.6         5131.6265   2.55e+04      0.201      0.842   -4.63e+04    5.66e+04
Emissions_116.8          830.5270   4616.404      0.180      0.858   -8473.224    1.01e+04
Emissions_121.1        -4444.2747   2.28e+04     -0.195      0.846   -5.04e+04    4.15e+04
Emissions_122.8          827.5651   4594.495      0.180      0.858   -8432.032    1.01e+04
Emissions_124.2        -4608.3979   2.36e+04     -0.195      0.846   -5.21e+04    4.29e+04
Emissions_124.8        -4181.5958   2.15e+04     -0.195      0.846   -4.74e+04    3.91e+04
Emissions_124.9        -4327.2718   2.22e+04     -0.195      0.846   -4.91e+04    4.04e+04
Emissions_126.2        -3948.9990   2.03e+04     -0.194      0.847   -4.49e+04     3.7e+04
Emissions_126.8          875.0910   4813.667      0.182      0.857   -8826.216    1.06e+04
Emissions_129.3         9.033e-09   1.65e-09      5.461      0.000     5.7e-09    1.24e-08
Emissions_131.3          859.2823   4757.269      0.181      0.857   -8728.363    1.04e+04
Emissions_139.7          907.1910   4956.361      0.183      0.856   -9081.698    1.09e+04
Emissions_140.4          885.6499   4864.092      0.182      0.856   -8917.284    1.07e+04
Emissions_141.8        -4446.9281   2.28e+04     -0.195      0.846   -5.05e+04    4.16e+04
Emissions_141.9        -4581.1242   2.35e+04     -0.195      0.846   -5.19e+04    4.28e+04
Emissions_147.2        -3607.0556   1.87e+04     -0.193      0.848   -4.13e+04     3.4e+04
Emissions_148.1        -4044.9307   2.09e+04     -0.194      0.847   -4.61e+04     3.8e+04
Emissions_152.9         1005.6937   5423.098      0.185      0.854   -9923.842    1.19e+04
Emissions_169.1        -1279.0476   6985.168     -0.183      0.856   -1.54e+04    1.28e+04
Emissions_170.2        -1267.8812   6866.154     -0.185      0.854   -1.51e+04    1.26e+04
Emissions_185.0        -1241.8311   6866.788     -0.181      0.857   -1.51e+04    1.26e+04
Emissions_227.0        -6.375e-14   8.99e-15     -7.095      0.000   -8.19e-14   -4.56e-14
Emissions_233.0        -1.221e+04   6.39e+04     -0.191      0.849   -1.41e+05    1.17e+05
Emissions_237.4        -1.258e+04   6.58e+04     -0.191      0.849   -1.45e+05     1.2e+05
Emissions_238.4        -1.334e+04   6.97e+04     -0.191      0.849   -1.54e+05    1.27e+05
Emissions_238.8        -1.303e+04   6.81e+04     -0.191      0.849    -1.5e+05    1.24e+05
Emissions_242.0        -1.352e+04   7.07e+04     -0.191      0.849   -1.56e+05    1.29e+05
Emissions_245.5        -1.114e+04   5.84e+04     -0.191      0.850   -1.29e+05    1.07e+05
Emissions_356.6        -4890.3593   2.61e+04     -0.187      0.852   -5.75e+04    4.77e+04
Emissions_38.5           -88.1072    775.874     -0.114      0.910   -1651.779    1475.564
Emissions_38.8         -2002.4678   1.02e+04     -0.197      0.845   -2.25e+04    1.85e+04
Emissions_39.6         -1973.1171   1.01e+04     -0.195      0.846   -2.23e+04    1.84e+04
Emissions_39.7           -22.6926    762.145     -0.030      0.976   -1558.696    1513.311
Emissions_40.0           -44.4492    792.064     -0.056      0.956   -1640.750    1551.852
Emissions_40.6            10.3719    766.569      0.014      0.989   -1534.546    1555.290
Emissions_41.4           -22.6671    811.503     -0.028      0.978   -1658.144    1612.809
Emissions_47.9          1500.4600   8226.825      0.182      0.856   -1.51e+04    1.81e+04
Emissions_48.4          1513.4283   8290.428      0.183      0.856   -1.52e+04    1.82e+04
Emissions_5.7           4269.1405   2.27e+04      0.188      0.852   -4.14e+04       5e+04
Emissions_5.9           4268.2382   2.27e+04      0.188      0.852   -4.15e+04       5e+04
Emissions_50.0          1566.2732   8572.959      0.183      0.856   -1.57e+04    1.88e+04
Emissions_50.7          1520.7833   8337.904      0.182      0.856   -1.53e+04    1.83e+04
Emissions_52.2          1532.5856   8385.011      0.183      0.856   -1.54e+04    1.84e+04
Emissions_52.4          1517.1323   8293.968      0.183      0.856   -1.52e+04    1.82e+04
Emissions_53.5          1537.5031   8428.013      0.182      0.856   -1.54e+04    1.85e+04
Emissions_55.0          4934.5333   2.52e+04      0.196      0.845   -4.58e+04    5.57e+04
Emissions_59.0         -1.253e-07   8.43e-08     -1.486      0.144   -2.95e-07    4.46e-08
Emissions_59.1          2249.0120   1.21e+04      0.186      0.854   -2.22e+04    2.67e+04
Emissions_6.1           4262.0606   2.27e+04      0.188      0.852   -4.14e+04       5e+04
Emissions_60.5          4959.0275   2.52e+04      0.197      0.845   -4.58e+04    5.57e+04
Emissions_61.7          4921.8361   2.52e+04      0.196      0.846   -4.58e+04    5.56e+04
Emissions_62.1          2236.2990    1.2e+04      0.186      0.853    -2.2e+04    2.65e+04
Emissions_62.3          4938.5957   2.52e+04      0.196      0.845   -4.58e+04    5.57e+04
Emissions_64.0         -3066.6956   1.57e+04     -0.195      0.846   -3.48e+04    2.87e+04
Emissions_64.2          5038.2058   2.52e+04      0.200      0.842   -4.57e+04    5.58e+04
Emissions_65.3         -3234.8352   1.66e+04     -0.195      0.847   -3.67e+04    3.03e+04
Emissions_66.7         -5118.0847   2.62e+04     -0.195      0.846    -5.8e+04    4.77e+04
Emissions_68.9          2270.3559   1.22e+04      0.186      0.853   -2.23e+04    2.68e+04
Emissions_70.0           590.9592   3620.903      0.163      0.871   -6706.491    7888.409
Emissions_70.2           923.8853   5231.538      0.177      0.861   -9619.586    1.15e+04
Emissions_70.8          2202.4079   1.19e+04      0.185      0.854   -2.17e+04    2.61e+04
Emissions_71.4          5058.0028   2.52e+04      0.201      0.842   -4.58e+04    5.59e+04
Emissions_72.8         -4666.1536   2.39e+04     -0.195      0.846   -5.28e+04    4.35e+04
Emissions_73.4           756.7471   4439.070      0.170      0.865   -8189.610    9703.105
Emissions_74.0          2626.4694    1.4e+04      0.188      0.852   -2.56e+04    3.08e+04
Emissions_76.3          2619.9530    1.4e+04      0.187      0.852   -2.55e+04    3.08e+04
Emissions_77.4          2624.8039    1.4e+04      0.187      0.852   -2.56e+04    3.08e+04
Emissions_82.4          2637.2577   1.41e+04      0.188      0.852   -2.57e+04     3.1e+04
Emissions_83.4          2631.3933    1.4e+04      0.188      0.852   -2.56e+04    3.09e+04
Emissions_85.9          1096.7809   6091.680      0.180      0.858   -1.12e+04    1.34e+04
Emissions_89.3         -9.652e-10   1.17e-10     -8.237      0.000    -1.2e-09   -7.29e-10
Emissions_90.1          1.543e-07   2.49e-08      6.203      0.000    1.04e-07    2.04e-07
Emissions_90.3          2654.2125   1.41e+04      0.188      0.852   -2.58e+04    3.11e+04
Emissions_90.5           -60.6969    209.701     -0.289      0.774    -483.322     361.928
Emissions_90.9          1.226e-07   1.94e-08      6.304      0.000    8.34e-08    1.62e-07
Emissions_91.1         -2.544e-07   4.02e-08     -6.322      0.000   -3.35e-07   -1.73e-07
Emissions_91.3          3027.6741    1.6e+04      0.190      0.850   -2.91e+04    3.52e+04
Emissions_94.1          -106.8394    427.618     -0.250      0.804    -968.646     754.968
Emissions_94.8          3036.4217    1.6e+04      0.190      0.850   -2.92e+04    3.53e+04
Emissions_95.0           -27.8216     45.678     -0.609      0.546    -119.879      64.236
Emissions_95.6          2143.3294   1.13e+04      0.190      0.850   -2.06e+04    2.49e+04
Emissions_95.9          3.997e-07   7.61e-08      5.254      0.000    2.46e-07    5.53e-07
Emissions_97.3            17.8462    106.906      0.167      0.868    -197.609     233.301
Emissions_98.6          2140.3227   1.13e+04      0.190      0.850   -2.05e+04    2.48e+04
Emissions_99.0          2193.6384   1.15e+04      0.191      0.849   -2.09e+04    2.53e+04
Emissions_99.3            31.7339    192.549      0.165      0.870    -356.322     419.790
Emissions_99.4          5090.3011   2.56e+04      0.199      0.843   -4.65e+04    5.66e+04
Emissions_99.5            78.8163    343.770      0.229      0.820    -614.007     771.640
Emissions_99.9          2151.4975   1.13e+04      0.190      0.850   -2.07e+04     2.5e+04
==============================================================================
Omnibus:                       28.208   Durbin-Watson:                   1.825
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              258.679
Skew:                          -0.000   Prob(JB):                     6.74e-57
Kurtosis:                       9.807   Cond. No.                     2.36e+23
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 1.41e-31. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:      Visits per 10,000   No. Observations:                  134
Model:                            GLM   Df Residuals:                      128
Model Family:                 Poisson   Df Model:                            5
Link Function:                    Log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -2071.6
Date:                Mon, 06 May 2024   Deviance:                       3429.6
Time:                        11:42:49   Pearson chi2:                 1.29e+04
No. Iterations:                     6   Pseudo R-squ. (CS):              1.000
Covariance Type:            nonrobust                                         
==========================================================================================
                             coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------------
const                      3.7220      0.030    122.229      0.000       3.662       3.782
Emissions                  0.0236      0.000     65.837      0.000       0.023       0.024
Average Elevation (ft)     0.0002   6.45e-06     28.945      0.000       0.000       0.000
Total Population        7.575e-05   1.18e-05      6.404      0.000    5.26e-05    9.89e-05
Male Population           -0.0001   1.18e-05     -9.044      0.000      -0.000   -8.38e-05
Female Population      -4.685e-05   1.18e-05     -3.957      0.000   -7.01e-05   -2.36e-05
==========================================================================================
In [81]:
# Evaluate both GLMs on their held-out test sets
gaussian_test_pred = gaussian_predictions
glm_test_rmse_gaussian = rmse(y_testz, gaussian_test_pred)
print('GLM Test RMSE (Gaussian) =', glm_test_rmse_gaussian)
print('GLM Test MSE (Gaussian) =', glm_test_rmse_gaussian ** 2)


poisson_test_pred = poisson_predictions
glm_test_rmse_poisson = rmse(y_test2, poisson_test_pred)
print('GLM Test RMSE (Poisson) =', glm_test_rmse_poisson)
print('GLM Test MSE (Poisson) =', glm_test_rmse_poisson ** 2)
GLM Test RMSE (Gaussian) = 2474.3382319839498
GLM Test MSE (Gaussian) = 6122349.686257458
GLM Test RMSE (Poisson) = 58.27654136285573
GLM Test MSE (Poisson) = 3396.155273216635

Option C: Prediction with GLMs and nonparametric methods¶

Methods

  • Our group tries to predict the number of asthma-related emergency room visits per 10,000 people in each state. We use average elevation (ft), emissions in MTCO2 (metric tons of CO2), total population, male population, and female population in each state to predict the "Visits per 10,000" column. We chose these features because our research question asks whether asthma rates are driven by average elevation across states; we added emissions in metric tons of CO2 because we could not find data on PM2.5 (the small airborne particles most directly linked to asthma), and CO2 is linked to asthma in a similar way. We chose total population because a larger population tends to produce more CO2 and more asthma-related emergency room visits (helping show the pattern is not due to chance). We also included male and female population because the data trend suggests that males are less likely to have asthma than females.
  • We tried two different GLMs. We first tried a Gaussian model because we assume our data are roughly normally distributed and that observations are independent of one another, meaning one person having asthma does not cause another person to get asthma (and so does not inflate the number of emergency room visits). We also assume roughly constant variance, since there should not be a strong skew in the data given that CO2 exposure for people within the same state should be similar (we take the state-average CO2 level). Our second choice was a Poisson model because the outcome is a count: each emergency room visit is treated as identical and independent of the others, the rate of occurrence is assumed constant over the observation period (here, per year), and Poisson models are well suited to relatively rare events; in our study the rare event is an asthma-related visit, whose probability may depend on gender or ethnicity. Since the number of visits is a discrete count, Poisson is a natural fit. The mean of the Poisson distribution (which also equals its variance) is proportional to the size of the observation window: if the time period or area under observation doubles, the expected count doubles as well, which is roughly consistent with our data (two years of visits in one state is approximately twice one year of visits, assuming CO2 levels and the number of power plants do not change dramatically).
  • We chose a random forest because we have many features and an ensemble is more robust than a single decision tree. In this model we assume the result is not affected by race or age. Since we take a frequentist approach, we do not bring in any prior knowledge, and we again assume that CO2 levels and the number of power plants do not change dramatically. A short feature-importance sketch follows this list.
  • We evaluate our predictions using RMSE, MSE, log-likelihood, and model complexity, and find that the random forest is the most robust and performs best, followed by the Poisson model and then the Gaussian model.
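As referenced in the random-forest bullet above, a quick look at which features the fitted forest actually relies on; a minimal sketch (not part of the original cells), assuming asthma_nonparam and X_trainz from the earlier cells are still in scope:

import pandas as pd

# Rank the fitted forest's features by importance
importances = pd.Series(asthma_nonparam.feature_importances_, index=X_trainz.columns)
print(importances.sort_values(ascending=False))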

Results

For model performance summaries, refer to the RMSE, MSE, and R^2 metrics reported above for each of the three models (one random forest and two GLMs).
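To make that comparison easier to scan, the headline test-set errors can be collected into one table; a small sketch (not part of the original notebook output) reusing the predictions computed in the cells above:

import pandas as pd
from sklearn.metrics import mean_squared_error

# Side-by-side test RMSE for the three fitted models
comparison = pd.DataFrame({
    'Test RMSE': [
        mean_squared_error(y_testz, asthma_test_y_pred, squared=False),
        mean_squared_error(y_testz, gaussian_predictions, squared=False),
        mean_squared_error(y_test2, poisson_predictions, squared=False),
    ]
}, index=['Random forest', 'Gaussian OLS', 'Poisson GLM'])
print(comparison)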

I would say there is relatively low uncertainty in both the nonparametric and GLM models: despite our data not being fully comprehensive, the feature engineering was effective. Quantitatively, several of the 95% confidence intervals for the coefficients are tight around their estimates. For example, in the Poisson GLM, the estimated emissions coefficient is 0.0236, with a 95% confidence interval of [0.023, 0.024].
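Those interval bounds can be pulled straight from the fitted model rather than read off the summary table; a one-line sketch (not part of the original cells) using the poisson_model fitted above:

# 95% confidence intervals for the Poisson GLM coefficients, alongside the point estimates
ci = poisson_model.conf_int(alpha=0.05)
ci.columns = ['2.5%', '97.5%']
ci.insert(0, 'coef', poisson_model.params)
print(ci)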

Discussion

The nonparametric random forest model performed significantly better when comparing the common RMSE metric across the models we trained. I am confident that, on similar future datasets, the random forest can outperform a GLM because we are working with real-world data that often has complex relationships. The random forest can better capture these relationships, such as between asthma and emissions, without an excessive risk of overfitting, thanks to the advantages of an ensemble (i.e., building multiple decision trees that each use only a subset of the features).

All of the models fit the training data well; on the held-out test data the Poisson GLM trails the random forest only modestly, while the Gaussian OLS performs substantially worse.
