!pip install matplotlib
!pip install numpy
!pip install pandas
!pip install seaborn
!pip install scipy
!pip install statsmodels
!pip install scikit-learn
!pip install networkx
Please clone the repository at the following URL and move the files into the same directory as this Python notebook: https://github.com/Maanuj-Vora/Data102-Final-Project-Datasets
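If git is available on the machine running this notebook, the repository can also be cloned from a cell (a minimal sketch; the commented copy command is an assumption for a Windows environment and may need adjusting):
!git clone https://github.com/Maanuj-Vora/Data102-Final-Project-Datasets
# Then move/copy the CSV files next to this notebook, e.g. on Windows:
# !copy Data102-Final-Project-Datasets\*.csv .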
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
import seaborn as sns
import scipy.stats
from scipy.stats import norm
import statsmodels.api as sm
from statsmodels.formula.api import glm
import networkx as nx
from statsmodels.genmod.generalized_linear_model import GLM
from statsmodels.genmod import families
from sklearn.ensemble import RandomForestRegressor
from sklearn import linear_model, tree
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from scipy.stats import chi2
import statsmodels.formula.api as smf
from statsmodels.tools.eval_measures import rmse
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from scipy.stats import ttest_ind
from statsmodels.stats.power import TTestIndPower
from sklearn.model_selection import train_test_split
# Load the EDD population dataset and compute California's average total population by year
df = pd.read_csv('Population by Age and Sex - EDDs.csv')
mean_population_by_year = df[df['State'] == 'CA'].groupby('Year')['Total Population'].mean()
mean_population_by_year.plot(kind='line', figsize=(10, 6))
plt.title('Average Total Population Over Years')
plt.xlabel('Year')
plt.ylabel('Average Population')
plt.xticks(range(min(mean_population_by_year.index), max(mean_population_by_year.index) + 1, 2))
plt.grid(True)
plt.show()
The chart displays the average total population over the years in California, illustrating how the population has changed. This relationship is interesting because it provides insights into population growth or decline trends over time. This is relevant because we intend to normalize the emissions dataset and asthma dataset by the total population per state.
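As a rough sketch of that normalization (counts_df and its columns below are hypothetical placeholders; the actual merges are done later in this notebook), a state-year count would be divided by the matching total population:
# Sketch only: total population per state and year, to be used as the denominator
pop_by_state_year = df.groupby(['Year', 'State'])['Total Population'].sum().reset_index()
# Assuming a hypothetical counts_df with columns ['Year', 'State', 'Count']:
# normalized = counts_df.merge(pop_by_state_year, on=['Year', 'State'])
# normalized['Rate per 10,000'] = normalized['Count'] / normalized['Total Population'] * 10000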
states = ['CA', 'NY', 'FL', 'NV', 'TX', 'HI', 'WA', 'OR', 'KY', 'LA']
df[df['State'].isin(states)].groupby(['Year', 'State'])['Total Population'].mean().unstack().plot(kind='bar', stacked=True, figsize=(10, 6))
The graph displays the average total population for selected states (CA, NY, FL, NV, TX, HI, WA, OR, KY, LA) over various years, with each state's population represented as a portion of a bar for each year. The bars are stacked, meaning the population sizes of the states are added on top of each other within each year's bar, allowing us to see the combined population trends over time for these states. This visualization helps in comparing the population growth or decline among these states across different years, as well as understanding the overall population trend of these states combined.
race_df = pd.read_csv('Population by Race - EDDs.csv')
race_df.fillna(0, inplace=True) # For Hawaiian or Pacific Islander Alone, fill with zeroes if there are null values.
race_df
IBRC_Geo_ID | State | District Name | Year | Total Population | White Alone | Black Alone | American Indian or Alaskan Native | Asian Alone | Hawaiian or Pacific Islander Alone | Two or More Races | Not Hispanic | Hispanic | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 6500001 | AK | Kenai Peninsula Economic Development District | 1990 | 41125.0 | 37508.0 | 199.0 | 3003.0 | 415.0 | 0.0 | 0.0 | 40421.0 | 704.0 |
1 | 6500001 | AK | Kenai Peninsula Economic Development District | 1991 | 42649.0 | 38893.0 | 204.0 | 3120.0 | 432.0 | 0.0 | 0.0 | 41913.0 | 736.0 |
2 | 6500001 | AK | Kenai Peninsula Economic Development District | 1992 | 43496.0 | 39670.0 | 212.0 | 3173.0 | 441.0 | 0.0 | 0.0 | 42748.0 | 748.0 |
3 | 6500001 | AK | Kenai Peninsula Economic Development District | 1993 | 44201.0 | 40265.0 | 219.0 | 3251.0 | 466.0 | 0.0 | 0.0 | 43431.0 | 770.0 |
4 | 6500001 | AK | Kenai Peninsula Economic Development District | 1994 | 45588.0 | 41435.0 | 237.0 | 3423.0 | 493.0 | 0.0 | 0.0 | 44761.0 | 827.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
11935 | 6500416 | CT | Western Connecticut Economic Development Distr... | 2015 | 944943.0 | 752134.0 | 116361.0 | 4573.0 | 52039.0 | 1020.0 | 18816.0 | 766505.0 | 178438.0 |
11936 | 6500416 | CT | Western Connecticut Economic Development Distr... | 2016 | 944347.0 | 748757.0 | 117499.0 | 4696.0 | 53041.0 | 1004.0 | 19350.0 | 762293.0 | 182054.0 |
11937 | 6500416 | CT | Western Connecticut Economic Development Distr... | 2017 | 943038.0 | 745352.0 | 118526.0 | 4775.0 | 53541.0 | 1005.0 | 19839.0 | 757421.0 | 185617.0 |
11938 | 6500416 | CT | Western Connecticut Economic Development Distr... | 2018 | 943971.0 | 743260.0 | 119919.0 | 4964.0 | 54515.0 | 1018.0 | 20295.0 | 753873.0 | 190098.0 |
11939 | 6500416 | CT | Western Connecticut Economic Development Distr... | 2019 | 943332.0 | 739179.0 | 121997.0 | 5115.0 | 55489.0 | 1029.0 | 20523.0 | 749560.0 | 193772.0 |
11940 rows × 13 columns
merged_df = pd.merge(df, race_df, on=['IBRC_Geo_ID', 'State', 'District Name', 'Year', 'Total Population'], how='inner')
merged_df
IBRC_Geo_ID | State | District Name | Year | Total Population | Population 0-4 | Population 5-17 | Population 18-24 | Population 25-44 | Population 45-64 | ... | Male Population | Female Population | White Alone | Black Alone | American Indian or Alaskan Native | Asian Alone | Hawaiian or Pacific Islander Alone | Two or More Races | Not Hispanic | Hispanic | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 6500001 | AK | Kenai Peninsula Economic Development District | 2000 | 49618.0 | 3272.0 | 11536.0 | 3432.0 | 14636.0 | 13044.0 | ... | 25831.0 | 23787.0 | 43353.0 | 236.0 | 3760.0 | 488.0 | 88.0 | 1693.0 | 48530.0 | 1088.0 |
1 | 6500001 | AK | Kenai Peninsula Economic Development District | 2001 | 49986.0 | 3207.0 | 11261.0 | 3770.0 | 14117.0 | 13740.0 | ... | 26065.0 | 23921.0 | 43611.0 | 273.0 | 3729.0 | 512.0 | 79.0 | 1782.0 | 48826.0 | 1160.0 |
2 | 6500001 | AK | Kenai Peninsula Economic Development District | 2002 | 50804.0 | 3217.0 | 11123.0 | 4099.0 | 13835.0 | 14506.0 | ... | 26589.0 | 24215.0 | 44144.0 | 287.0 | 3811.0 | 538.0 | 89.0 | 1935.0 | 49547.0 | 1257.0 |
3 | 6500001 | AK | Kenai Peninsula Economic Development District | 2003 | 50963.0 | 3185.0 | 10874.0 | 4278.0 | 13405.0 | 15077.0 | ... | 26657.0 | 24306.0 | 44214.0 | 283.0 | 3808.0 | 547.0 | 91.0 | 2020.0 | 49657.0 | 1306.0 |
4 | 6500001 | AK | Kenai Peninsula Economic Development District | 2004 | 51404.0 | 3181.0 | 10592.0 | 4447.0 | 13111.0 | 15718.0 | ... | 26801.0 | 24603.0 | 44423.0 | 271.0 | 3819.0 | 570.0 | 110.0 | 2211.0 | 50055.0 | 1349.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
7885 | 6500416 | CT | Western Connecticut Economic Development Distr... | 2015 | 944943.0 | 53031.0 | 166447.0 | 85431.0 | 227999.0 | 271756.0 | ... | 460316.0 | 485007.0 | 752134.0 | 116361.0 | 4573.0 | 52039.0 | 1020.0 | 18816.0 | 766505.0 | 178438.0 |
7886 | 6500416 | CT | Western Connecticut Economic Development Distr... | 2016 | 944347.0 | 52545.0 | 164256.0 | 86562.0 | 225568.0 | 272331.0 | ... | 460137.0 | 484959.0 | 748757.0 | 117499.0 | 4696.0 | 53041.0 | 1004.0 | 19350.0 | 762293.0 | 182054.0 |
7887 | 6500416 | CT | Western Connecticut Economic Development Distr... | 2017 | 943038.0 | 52058.0 | 162222.0 | 86535.0 | 224212.0 | 271858.0 | ... | 459755.0 | 484352.0 | 745352.0 | 118526.0 | 4775.0 | 53541.0 | 1005.0 | 19839.0 | 757421.0 | 185617.0 |
7888 | 6500416 | CT | Western Connecticut Economic Development Distr... | 2018 | 943971.0 | 51708.0 | 160510.0 | 86323.0 | 225033.0 | 270590.0 | ... | 460611.0 | 484900.0 | 743260.0 | 119919.0 | 4964.0 | 54515.0 | 1018.0 | 20295.0 | 753873.0 | 190098.0 |
7889 | 6500416 | CT | Western Connecticut Economic Development Distr... | 2019 | 943332.0 | 51199.0 | 158467.0 | 85880.0 | 225756.0 | 268079.0 | ... | 459938.0 | 484450.0 | 739179.0 | 121997.0 | 5115.0 | 55489.0 | 1029.0 | 20523.0 | 749560.0 | 193772.0 |
7890 rows × 24 columns
summary_df = merged_df.groupby('Year').agg({
'White Alone': 'sum',
'Black Alone': 'sum',
'American Indian or Alaskan Native': 'sum',
'Asian Alone': 'sum',
'Hawaiian or Pacific Islander Alone': 'sum',
'Two or More Races': 'sum',
#'Not Hispanic': 'sum',
'Hispanic': 'sum'
}).reset_index()
# Plot
summary_df.plot(x='Year', kind='bar', stacked=True, figsize=(10, 6), title='Population by Race Over Years')
plt.xlabel('Year')
plt.ylabel('Population')
plt.legend(title='Population Groups', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()
This visualization stacks the populations of the different racial and ethnic groups for each year, providing a comprehensive overview of the demographic changes over time.
asthma_df = pd.read_csv('U.S._Chronic_Disease_Indicators__CDI___2023_Release_20240417.csv')
asthma_df
YearStart | YearEnd | LocationAbbr | LocationDesc | DataSource | Topic | Question | Response | DataValueUnit | DataValueType | ... | LocationID | TopicID | QuestionID | DataValueTypeID | StratificationCategoryID1 | StratificationID1 | StratificationCategoryID2 | StratificationID2 | StratificationCategoryID3 | StratificationID3 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2010 | 2010 | OR | Oregon | NVSS | Cardiovascular Disease | Mortality from heart failure | NaN | NaN | Number | ... | 41 | CVD | CVD1_4 | NMBR | RACE | AIAN | NaN | NaN | NaN | NaN |
1 | 2019 | 2019 | AZ | Arizona | YRBSS | Alcohol | Alcohol use among youth | NaN | % | Crude Prevalence | ... | 4 | ALC | ALC1_1 | CRDPREV | GENDER | GENF | NaN | NaN | NaN | NaN |
2 | 2019 | 2019 | OH | Ohio | YRBSS | Alcohol | Alcohol use among youth | NaN | % | Crude Prevalence | ... | 39 | ALC | ALC1_1 | CRDPREV | GENDER | GENM | NaN | NaN | NaN | NaN |
3 | 2019 | 2019 | US | United States | YRBSS | Alcohol | Alcohol use among youth | NaN | % | Crude Prevalence | ... | 59 | ALC | ALC1_1 | CRDPREV | RACE | ASN | NaN | NaN | NaN | NaN |
4 | 2015 | 2015 | VI | Virgin Islands | YRBSS | Alcohol | Alcohol use among youth | NaN | % | Crude Prevalence | ... | 78 | ALC | ALC1_1 | CRDPREV | GENDER | GENM | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1185671 | 2020 | 2020 | WY | Wyoming | BRFSS | Diabetes | Dilated eye examination among adults aged >= 1... | NaN | % | Age-adjusted Prevalence | ... | 56 | DIA | DIA7_0 | AGEADJPREV | RACE | WHT | NaN | NaN | NaN | NaN |
1185672 | 2020 | 2020 | WY | Wyoming | BRFSS | Older Adults | Proportion of older adults aged >= 65 years wh... | NaN | % | Crude Prevalence | ... | 56 | OLD | OLD3_1 | CRDPREV | RACE | WHT | NaN | NaN | NaN | NaN |
1185673 | 2017 | 2017 | IA | Iowa | BRFSS | Arthritis | Activity limitation due to arthritis among adu... | NaN | % | Age-adjusted Prevalence | ... | 19 | ART | ART2_1 | AGEADJPREV | RACE | HIS | NaN | NaN | NaN | NaN |
1185674 | 2020 | 2020 | WY | Wyoming | BRFSS | Diabetes | Diabetes prevalence among women aged 18-44 years | NaN | % | Crude Prevalence | ... | 56 | DIA | DIA2_2 | CRDPREV | RACE | HIS | NaN | NaN | NaN | NaN |
1185675 | 2019 | 2019 | RI | Rhode Island | BRFSS | Arthritis | Activity limitation due to arthritis among adu... | NaN | % | Crude Prevalence | ... | 44 | ART | ART2_1 | CRDPREV | OVERALL | OVR | NaN | NaN | NaN | NaN |
1185676 rows × 34 columns
# Filter the dataset for Asthma indicators only
asthma_data = asthma_df[asthma_df['Topic'] == 'Asthma'].copy()
# Convert DataValue to numeric first, then drop rows without a usable value
asthma_data['DataValue'] = pd.to_numeric(asthma_data['DataValue'], errors='coerce')
asthma_data = asthma_data.dropna(subset=['DataValue'])
# Group by YearStart and calculate the mean of DataValue for Asthma prevalence
asthma_yearly = asthma_data.groupby('YearStart')['DataValue'].mean()
plt.figure(figsize=(12, 6))
plt.plot(asthma_yearly.index, asthma_yearly.values, marker='o', linestyle='-', color='blue')
plt.title('Average Asthma Prevalence Over the Years')
plt.xlabel('Year')
plt.ylabel('Average Prevalence (%)')
plt.grid(True)
plt.show()
There is a decrease in average asthma prevalence over the years, but this may partly reflect missing data; for example, 2011 and 2012 have no data points.
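To check whether the decline reflects missing data rather than a real trend, we can count the usable asthma records per year (a quick diagnostic sketch using the filtered asthma_data from above):
# Diagnostic: number of asthma records with a numeric DataValue per year
print(asthma_data.groupby('YearStart')['DataValue'].count())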
# Filter the dataset for Emergency Room visits related to Asthma
er_visits_asthma = asthma_data[asthma_data['Question'].str.contains('Emergency department visit rate for asthma')]
# Group by YearStart and calculate the mean of DataValue for ER visits due to Asthma
er_visits_yearly = er_visits_asthma.groupby('YearStart')['DataValue'].mean().reset_index()
er_visits_yearly
YearStart | DataValue | |
---|---|---|
0 | 2010 | 13897.291389 |
1 | 2013 | 6392.235606 |
2 | 2014 | 4310.263989 |
3 | 2015 | 2912.746088 |
4 | 2016 | 4689.097370 |
5 | 2017 | 3821.923457 |
6 | 2018 | 3368.481102 |
plt.figure(figsize=(12, 6))
plt.plot(er_visits_yearly['YearStart'], er_visits_yearly['DataValue'], marker='o', color='red')
plt.title('Average Emergency Room Visits for Asthma Over the Years')
plt.xlabel('Year')
plt.ylabel('Average ER Visits')
plt.grid(True)
plt.show()
There is a decrease in average emergency room visits over the years. The elevated 2010 value may be related to wildfires that year, which put more smoke and carbon dioxide into the atmosphere. The decline may also be due to a lack of data.
asthma_data = asthma_df[asthma_df['Topic'] == 'Asthma']
asthma_data = asthma_data[asthma_data['Question'] == 'Emergency department visit rate for asthma']
asthma_data = asthma_data[asthma_data['DataValueType'] == 'Number']
asthma_data = asthma_data[asthma_data['StratificationCategory1'] == 'Overall'].copy()
asthma_data['DataValue'] = pd.to_numeric(asthma_data['DataValue'], errors='coerce')
asthma_data.dropna(subset=['DataValue'], inplace=True)
asthma_state_yearly = asthma_data.groupby(['YearStart', 'LocationAbbr'])['DataValue'].mean().reset_index()
asthma_state_yearly.head(30)
YearStart | LocationAbbr | DataValue | |
---|---|---|---|
0 | 2010 | AZ | 40650.0 |
1 | 2010 | CA | 217637.0 |
2 | 2010 | FL | 141612.0 |
3 | 2010 | HI | 7328.0 |
4 | 2010 | IA | 14135.0 |
5 | 2010 | KY | 27334.0 |
6 | 2010 | MA | 53257.0 |
7 | 2010 | MD | 56070.0 |
8 | 2010 | NC | 60806.0 |
9 | 2010 | NE | 6798.0 |
10 | 2010 | NJ | 81536.0 |
11 | 2010 | NY | 222051.0 |
12 | 2010 | RI | 7845.0 |
13 | 2010 | SC | 24076.0 |
14 | 2010 | UT | 7867.0 |
15 | 2010 | WI | 30501.0 |
16 | 2013 | AZ | 31357.0 |
17 | 2013 | FL | 126267.0 |
18 | 2013 | IA | 11065.0 |
19 | 2013 | KY | 22037.0 |
20 | 2013 | NC | 52550.0 |
21 | 2013 | NE | 5391.0 |
22 | 2013 | NJ | 66250.0 |
23 | 2013 | NV | 13652.0 |
24 | 2013 | NY | 179657.0 |
25 | 2013 | SC | 30142.0 |
26 | 2013 | VT | 2545.0 |
27 | 2013 | WI | 23701.0 |
28 | 2014 | AR | 13610.0 |
29 | 2014 | AZ | 32699.0 |
df_population = pd.read_csv('Population by Age and Sex - EDDs.csv')
# Summarize total population by state and year
df_population_summary = df_population.groupby(['Year', 'State'])['Total Population'].sum().reset_index()
df_population_summary
Year | State | Total Population | |
---|---|---|---|
0 | 2000 | AK | 146644.0 |
1 | 2000 | AL | 4510366.0 |
2 | 2000 | AR | 2678588.0 |
3 | 2000 | AZ | 1220370.0 |
4 | 2000 | CA | 8163246.0 |
... | ... | ... | ... |
915 | 2019 | VA | 815269.0 |
916 | 2019 | VT | 324114.0 |
917 | 2019 | WA | 5607154.0 |
918 | 2019 | WI | 2772884.0 |
919 | 2019 | WV | 1792147.0 |
920 rows × 3 columns
# Merge asthma data with population data
asthma_population_merged = pd.merge(asthma_state_yearly, df_population_summary, left_on=['YearStart', 'LocationAbbr'], right_on=['Year', 'State'], how='inner')
# Calculate the percentage of population affected by asthma
asthma_population_merged['NormalizedOccurence'] = (asthma_population_merged['DataValue'] / asthma_population_merged['Total Population'])
asthma_population_merged['Asthma Prevalence (%)'] = asthma_population_merged['NormalizedOccurence'] * 100
asthma_population_merged[['YearStart', 'LocationAbbr', 'DataValue', 'Total Population', 'Asthma Prevalence (%)', 'NormalizedOccurence']]
YearStart | LocationAbbr | DataValue | Total Population | Asthma Prevalence (%) | NormalizedOccurence | |
---|---|---|---|---|---|---|
0 | 2010 | AZ | 40650.0 | 1600442.0 | 2.539923 | 0.025399 |
1 | 2010 | CA | 217637.0 | 8730612.0 | 2.492803 | 0.024928 |
2 | 2010 | FL | 141612.0 | 18845537.0 | 0.751435 | 0.007514 |
3 | 2010 | IA | 14135.0 | 2075096.0 | 0.681173 | 0.006812 |
4 | 2010 | KY | 27334.0 | 4348181.0 | 0.628631 | 0.006286 |
... | ... | ... | ... | ... | ... | ... |
95 | 2018 | NJ | 52173.0 | 569816.0 | 9.156114 | 0.091561 |
96 | 2018 | NV | 14810.0 | 199220.0 | 7.433993 | 0.074340 |
97 | 2018 | OR | 12819.0 | 4740945.0 | 0.270389 | 0.002704 |
98 | 2018 | VT | 1799.0 | 324553.0 | 0.554301 | 0.005543 |
99 | 2018 | WI | 21174.0 | 2765223.0 | 0.765725 | 0.007657 |
100 rows × 6 columns
The merged data includes the average asthma prevalence rates for each state per year, along with the total population for those states and years. The percentage of the population affected by asthma in each state per year provides insight into how asthma prevalence relates to the broader population across different states and over time.
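As a quick sanity check of the calculation, the first row above (Arizona, 2010) gives 40650 / 1600442 ≈ 0.0254, i.e. about 2.54%, matching the 'Asthma Prevalence (%)' column:
# Sanity check of the first row (Arizona, 2010)
print(40650.0 / 1600442.0 * 100)  # ≈ 2.5399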
emmissions_df = pd.read_csv('table1.csv') # https://www.eia.gov/environment/emissions/state/
emmissions_df.head() #Each row is a state, each column is the year
#million metric tons of carbon dioxide equivalents (MTCO2E)
State | 1970 | 1971 | 1972 | 1973 | 1974 | 1975 | 1976 | 1977 | 1978 | ... | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | Percent | Absolute | Percent.1 | Absolute.1 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Alabama | 102.6 | 98.5 | 104.9 | 109.6 | 108.8 | 107.8 | 108.1 | 111.7 | 106.6 | ... | 114.0 | 108.6 | 112.4 | 106.3 | 98.4 | 108.4 | 5.6% | 5.7 | 10.1% | 10.0 |
1 | Alaska | 11.3 | 12.6 | 13.4 | 12.5 | 12.8 | 14.5 | 16.0 | 18.0 | 19.5 | ... | 33.4 | 33.7 | 34.5 | 34.3 | 36.0 | 38.9 | 242.5% | 27.5 | 8.0% | 2.9 |
2 | Arizona | 24.9 | 27.0 | 30.2 | 34.4 | 36.7 | 38.2 | 43.8 | 50.5 | 49.3 | ... | 90.9 | 90.5 | 94.1 | 92.6 | 80.2 | 83.0 | 233.3% | 58.1 | 3.6% | 2.9 |
3 | Arkansas | 36.2 | 35.1 | 37.2 | 40.8 | 39.1 | 36.4 | 38.9 | 41.6 | 42.4 | ... | 62.1 | 64.2 | 70.8 | 65.1 | 54.7 | 62.0 | 71.4% | 25.8 | 13.3% | 7.3 |
4 | California | 294.4 | 305.8 | 312.7 | 329.3 | 304.5 | 311.5 | 326.9 | 354.5 | 345.2 | ... | 353.4 | 356.5 | 358.6 | 358.3 | 303.8 | 324.0 | 10.1% | 29.7 | 6.7% | 20.2 |
5 rows × 57 columns
state_names = {
'AL': 'Alabama',
'AK': 'Alaska',
'AZ': 'Arizona',
'AR': 'Arkansas',
'CA': 'California',
'CO': 'Colorado',
'CT': 'Connecticut',
'DE': 'Delaware',
'FL': 'Florida',
'GA': 'Georgia',
'HI': 'Hawaii',
'ID': 'Idaho',
'IL': 'Illinois',
'IN': 'Indiana',
'IA': 'Iowa',
'KS': 'Kansas',
'KY': 'Kentucky',
'LA': 'Louisiana',
'ME': 'Maine',
'MD': 'Maryland',
'MA': 'Massachusetts',
'MI': 'Michigan',
'MN': 'Minnesota',
'MS': 'Mississippi',
'MO': 'Missouri',
'MT': 'Montana',
'NE': 'Nebraska',
'NV': 'Nevada',
'NH': 'New Hampshire',
'NJ': 'New Jersey',
'NM': 'New Mexico',
'NY': 'New York',
'NC': 'North Carolina',
'ND': 'North Dakota',
'OH': 'Ohio',
'OK': 'Oklahoma',
'OR': 'Oregon',
'PA': 'Pennsylvania',
'RI': 'Rhode Island',
'SC': 'South Carolina',
'SD': 'South Dakota',
'TN': 'Tennessee',
'TX': 'Texas',
'UT': 'Utah',
'VT': 'Vermont',
'VA': 'Virginia',
'WA': 'Washington',
'WV': 'West Virginia',
'WI': 'Wisconsin',
'WY': 'Wyoming'
}
asthma_population_merged['State Name'] = asthma_population_merged['State'].map(state_names)
asthma_population_merged
YearStart | LocationAbbr | DataValue | Year | State | Total Population | NormalizedOccurence | Asthma Prevalence (%) | State Name | |
---|---|---|---|---|---|---|---|---|---|
0 | 2010 | AZ | 40650.0 | 2010 | AZ | 1600442.0 | 0.025399 | 2.539923 | Arizona |
1 | 2010 | CA | 217637.0 | 2010 | CA | 8730612.0 | 0.024928 | 2.492803 | California |
2 | 2010 | FL | 141612.0 | 2010 | FL | 18845537.0 | 0.007514 | 0.751435 | Florida |
3 | 2010 | IA | 14135.0 | 2010 | IA | 2075096.0 | 0.006812 | 0.681173 | Iowa |
4 | 2010 | KY | 27334.0 | 2010 | KY | 4348181.0 | 0.006286 | 0.628631 | Kentucky |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
95 | 2018 | NJ | 52173.0 | 2018 | NJ | 569816.0 | 0.091561 | 9.156114 | New Jersey |
96 | 2018 | NV | 14810.0 | 2018 | NV | 199220.0 | 0.074340 | 7.433993 | Nevada |
97 | 2018 | OR | 12819.0 | 2018 | OR | 4740945.0 | 0.002704 | 0.270389 | Oregon |
98 | 2018 | VT | 1799.0 | 2018 | VT | 324553.0 | 0.005543 | 0.554301 | Vermont |
99 | 2018 | WI | 21174.0 | 2018 | WI | 2765223.0 | 0.007657 | 0.765725 | Wisconsin |
100 rows × 9 columns
# Reshape the emissions table from wide (one column per year) to long format
emmissions_long = emmissions_df.melt(id_vars=['State'], var_name='Year', value_name='Emissions')
# Keep only the numeric year columns (drop the 'Percent'/'Absolute' change columns)
emmissions_long = emmissions_long[emmissions_long['Year'].apply(lambda x: x.isnumeric())]
emmissions_long['Year'] = pd.to_numeric(emmissions_long['Year'], errors='coerce')
asthma_population_merged['YearStart'] = pd.to_numeric(asthma_population_merged['YearStart'], errors='coerce')
merged_df = pd.merge(asthma_population_merged, emmissions_long, left_on=['YearStart', 'State Name'], right_on=['Year', 'State'], how='left')
merged_df.drop(columns=['State_y'], inplace=True)
merged_df.rename(columns={'State_x': 'State'}, inplace=True)
merged_df.rename(columns={'Year_x': 'Year'}, inplace=True)
merged_df.drop(columns=['Year_y'], inplace=True)
merged_df.head()
YearStart | LocationAbbr | DataValue | Year | State | Total Population | NormalizedOccurence | Asthma Prevalence (%) | State Name | Emissions | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 2010 | AZ | 40650.0 | 2010 | AZ | 1600442.0 | 0.025399 | 2.539923 | Arizona | 99.5 |
1 | 2010 | CA | 217637.0 | 2010 | CA | 8730612.0 | 0.024928 | 2.492803 | California | 356.6 |
2 | 2010 | FL | 141612.0 | 2010 | FL | 18845537.0 | 0.007514 | 0.751435 | Florida | 245.5 |
3 | 2010 | IA | 14135.0 | 2010 | IA | 2075096.0 | 0.006812 | 0.681173 | Iowa | 90.3 |
4 | 2010 | KY | 27334.0 | 2010 | KY | 4348181.0 | 0.006286 | 0.628631 | Kentucky | 152.9 |
florida_asthma_data = merged_df[(merged_df['LocationAbbr'] == 'FL')]
new_jersey_asthma_data = merged_df[(merged_df['LocationAbbr'] == 'NJ')]
iowa_asthma_data = merged_df[(merged_df['LocationAbbr'] == 'IA')]
fig, axes = plt.subplots(nrows=4, ncols=1, figsize=(10, 24))
# Florida
X_fl = florida_asthma_data['YearStart'].values.reshape(-1, 1)
y_fl = florida_asthma_data['Asthma Prevalence (%)'].values
model_fl = LinearRegression().fit(X_fl, y_fl)
trendline_fl = model_fl.predict(X_fl)
axes[0].scatter(X_fl, y_fl)
axes[0].plot(X_fl, trendline_fl, color='red')
axes[0].set_title('Asthma Prevalence (%) by Year in Florida')
axes[0].set_xlabel('Year')
axes[0].set_ylabel('Asthma Prevalence (%)')
axes[0].set_ylim(bottom=0)
axes[0].grid(True)
# New Jersey
X_nj = new_jersey_asthma_data['YearStart'].values.reshape(-1, 1)
y_nj = new_jersey_asthma_data['Asthma Prevalence (%)'].values
model_nj = LinearRegression().fit(X_nj, y_nj)
trendline_nj = model_nj.predict(X_nj)
axes[1].scatter(X_nj, y_nj)
axes[1].plot(X_nj, trendline_nj, color='green')
axes[1].set_title('Asthma Prevalence (%) by Year in New Jersey')
axes[1].set_xlabel('Year')
axes[1].set_ylabel('Asthma Prevalence (%)')
axes[1].set_ylim(bottom=0)
axes[1].grid(True)
# Iowa
X_ia = iowa_asthma_data['YearStart'].values.reshape(-1, 1)
y_ia = iowa_asthma_data['Asthma Prevalence (%)'].values
model_ia = LinearRegression().fit(X_ia, y_ia)
trendline_ia = model_ia.predict(X_ia)
axes[2].scatter(X_ia, y_ia)
axes[2].plot(X_ia, trendline_ia, color='black')
axes[2].set_title('Asthma Prevalence (%) by Year in Iowa')
axes[2].set_xlabel('Year')
axes[2].set_ylabel('Asthma Prevalence (%)')
axes[2].set_ylim(bottom=0)
axes[2].grid(True)
# Combined Trends
axes[3].plot(X_fl, trendline_fl, color='red', label='Florida')
axes[3].plot(X_nj, trendline_nj, color='green', label='New Jersey')
axes[3].plot(X_ia, trendline_ia, color='black', label='Iowa')
axes[3].set_title('Combined Asthma Prevalence (%) Trends')
axes[3].set_xlabel('Year')
axes[3].set_ylabel('Asthma Prevalence (%)')
axes[3].legend()
axes[3].set_ylim(bottom=0)
axes[3].grid(True)
plt.tight_layout()
plt.show()
One confounding factor in the asthma prevalence by year for each state is that a single person may visit the emergency room more than once per year; in other words, the data does not capture unique visits per year but rather the total number of visits per year.
New Jersey is a geographically small state with a large population. Its CO2 emissions are over 100 million metric tons of CO2 equivalent (MTCO2E), while Iowa's emissions are comparatively lower.
fig, axes = plt.subplots(nrows=4, ncols=1, figsize=(10, 18))
# Florida
florida_emissions = merged_df[(merged_df['LocationAbbr'] == 'FL')]
X_florida = florida_emissions['YearStart'].values.reshape(-1, 1)
y_florida = florida_emissions['Emissions'].astype(float).values
model_florida = LinearRegression().fit(X_florida, y_florida)
trendline_florida = model_florida.predict(X_florida)
axes[0].scatter(X_florida, y_florida, label='CO2 Emissions (MTCO2E)')
axes[0].plot(X_florida, trendline_florida, color='red', label='Trendline')
axes[0].set_title('CO2 Emissions Over Years in Florida')
axes[0].set_xlabel('Year')
axes[0].set_ylabel('CO2 Emissions (MTCO2E)')
axes[0].set_ylim(bottom=0)
axes[0].legend()
axes[0].grid(True)
# New Jersey
new_jersey_emissions = merged_df[(merged_df['LocationAbbr'] == 'NJ')]
X_new_jersey = new_jersey_emissions['YearStart'].values.reshape(-1, 1)
y_new_jersey = new_jersey_emissions['Emissions'].astype(float).values
model_new_jersey = LinearRegression().fit(X_new_jersey, y_new_jersey)
trendline_new_jersey = model_new_jersey.predict(X_new_jersey)
axes[1].scatter(X_new_jersey, y_new_jersey, label='CO2 Emissions (MTCO2E)')
axes[1].plot(X_new_jersey, trendline_new_jersey, color='green', label='Trendline')
axes[1].set_title('CO2 Emissions Over Years in New Jersey')
axes[1].set_xlabel('Year')
axes[1].set_ylabel('CO2 Emissions (MTCO2E)')
axes[1].set_ylim(bottom=0)
axes[1].legend()
axes[1].grid(True)
# Iowa
iowa_emissions = merged_df[(merged_df['LocationAbbr'] == 'IA')]
X_iowa = iowa_emissions['YearStart'].values.reshape(-1, 1)
y_iowa = iowa_emissions['Emissions'].astype(float).values
model_iowa = LinearRegression().fit(X_iowa, y_iowa)
trendline_iowa = model_iowa.predict(X_iowa)
axes[2].scatter(X_iowa, y_iowa, label='CO2 Emissions (MTCO2E)')
axes[2].plot(X_iowa, trendline_iowa, color='blue', label='Trendline')
axes[2].set_title('CO2 Emissions Over Years in Iowa')
axes[2].set_xlabel('Year')
axes[2].set_ylabel('CO2 Emissions (MTCO2E)')
axes[2].set_ylim(bottom=0)
axes[2].legend()
axes[2].grid(True)
# Combined Trends
axes[3].plot(X_fl, trendline_florida, color='red', label='Florida')
axes[3].plot(X_nj, trendline_new_jersey, color='green', label='New Jersey')
axes[3].plot(X_ia, trendline_iowa, color='black', label='Iowa')
axes[3].set_title('CO2 Emissions Trends')
axes[3].set_xlabel('Year')
axes[3].set_ylabel('CO2 Emissions')
axes[3].set_ylim(bottom=0)
axes[3].legend()
axes[3].grid(True)
plt.tight_layout()
plt.show()
New Jersey is smaller than Iowa yet produces roughly 25% more CO2, which may contribute to its higher asthma prevalence compared to Iowa.
Florida has the highest CO2 emissions of the three states graphed and has a population of 22.24 million. New Jersey emits around 110 MTCO2E and has a population of 9.26 million.
For Iowa, the population is 3.2 million and the CO2 emissions are around 90 MTCO2E.
Confounding variables might include age, sex, and population.
merged_df[['LocationAbbr', 'YearStart', 'NormalizedOccurence', 'Emissions']].head()
# NormalizedOccurence = number of occurrences / total population
# NormalizedOccurence is the decimal value
# asthma % = NormalizedOccurence * 100
LocationAbbr | YearStart | NormalizedOccurence | Emissions | |
---|---|---|---|---|
0 | AZ | 2010 | 0.025399 | 99.5 |
1 | CA | 2010 | 0.024928 | 356.6 |
2 | FL | 2010 | 0.007514 | 245.5 |
3 | IA | 2010 | 0.006812 | 90.3 |
4 | KY | 2010 | 0.006286 | 152.9 |
plt.figure(figsize=(12, 6))
data_2010 = merged_df[merged_df['YearStart'] == 2010]
bar_width = 0.35
index = np.arange(len(data_2010['LocationAbbr']))
bars1 = plt.bar(index, data_2010['Asthma Prevalence (%)'].astype(float), bar_width, label='Asthma Prevalence (%)')
bars2 = plt.bar(index + bar_width, data_2010['Emissions'].astype(float), bar_width, label='Emissions (MTCO2E)')
plt.xlabel('State')
plt.ylabel('Values')
plt.title('Asthma Prevalence (%) vs Emissions by State for 2010')
plt.xticks(index + bar_width / 2, data_2010['LocationAbbr'])
plt.legend()
plt.tight_layout()
plt.show()
This categorical chart shows "Asthma Prevalence (%)" and "Emissions" for different states, specifically for the year 2010. California has the highest emissions, whereas Nebraska (NE) has the lowest.
New Jersey has the highest asthma prevalence, whereas Massachusetts has the lowest.
We want to follow up on potential confounding variables such as Sex to see their relationship with emergency visits relating to Asthma.
For emissions, confounding variables include population.
We should explore the effects of these confounding variables given the population dataset. The CDC dataset also separates asthma visits by males and females, so we can explore the relationship between sex and asthma.
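Before filtering on those stratifications, a quick diagnostic (sketch) confirms which stratification categories the CDC file actually provides for the Asthma topic:
# Which stratification categories are available for the Asthma topic?
print(asthma_df[asthma_df['Topic'] == 'Asthma']['StratificationCategory1'].unique())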
asthma_df = pd.read_csv('U.S._Chronic_Disease_Indicators__CDI___2023_Release_20240417.csv', low_memory=False)
asthma_data = asthma_df[asthma_df['Topic'] == 'Asthma']
asthma_data = asthma_data[asthma_data['Question'] == 'Emergency department visit rate for asthma']
asthma_data = asthma_data[asthma_data['StratificationCategory1'] == 'Gender'].copy()
asthma_data['DataValue'] = pd.to_numeric(asthma_data['DataValue'], errors='coerce')
asthma_data.dropna(subset=['DataValue'], inplace=True)
asthma_state_yearly = asthma_data.groupby(['YearStart', 'LocationAbbr', 'Stratification1'])['DataValue'].mean().reset_index()
df_population = pd.read_csv('Population by Age and Sex - EDDs.csv')
df_population_summary = df_population.groupby(['Year', 'State'])['Total Population'].sum().reset_index()
asthma_population_merged = pd.merge(asthma_state_yearly, df_population_summary, left_on=['YearStart', 'LocationAbbr'], right_on=['Year', 'State'], how='inner')
state_names = {
'AL': 'Alabama',
'AK': 'Alaska',
'AZ': 'Arizona',
'AR': 'Arkansas',
'CA': 'California',
'CO': 'Colorado',
'CT': 'Connecticut',
'DE': 'Delaware',
'FL': 'Florida',
'GA': 'Georgia',
'HI': 'Hawaii',
'ID': 'Idaho',
'IL': 'Illinois',
'IN': 'Indiana',
'IA': 'Iowa',
'KS': 'Kansas',
'KY': 'Kentucky',
'LA': 'Louisiana',
'ME': 'Maine',
'MD': 'Maryland',
'MA': 'Massachusetts',
'MI': 'Michigan',
'MN': 'Minnesota',
'MS': 'Mississippi',
'MO': 'Missouri',
'MT': 'Montana',
'NE': 'Nebraska',
'NV': 'Nevada',
'NH': 'New Hampshire',
'NJ': 'New Jersey',
'NM': 'New Mexico',
'NY': 'New York',
'NC': 'North Carolina',
'ND': 'North Dakota',
'OH': 'Ohio',
'OK': 'Oklahoma',
'OR': 'Oregon',
'PA': 'Pennsylvania',
'RI': 'Rhode Island',
'SC': 'South Carolina',
'SD': 'South Dakota',
'TN': 'Tennessee',
'TX': 'Texas',
'UT': 'Utah',
'VT': 'Vermont',
'VA': 'Virginia',
'WA': 'Washington',
'WV': 'West Virginia',
'WI': 'Wisconsin',
'WY': 'Wyoming'
}
asthma_population_merged['State Name'] = asthma_population_merged['State'].map(state_names)
df_male_population_summary = df_population.groupby(['Year', 'State'])['Male Population'].sum().reset_index()
df_female_population_summary = df_population.groupby(['Year', 'State'])['Female Population'].sum().reset_index()
print(df_male_population_summary, df_female_population_summary)
asthma_population_merged = pd.merge(asthma_population_merged, df_male_population_summary, left_on=['YearStart', 'LocationAbbr'], right_on=['Year', 'State'], how='left', suffixes=('', '_male_pop'))
asthma_population_merged = pd.merge(asthma_population_merged, df_female_population_summary, left_on=['YearStart', 'LocationAbbr'], right_on=['Year', 'State'], how='left', suffixes=('', '_female_pop'))
emmissions_df = pd.read_csv('table1.csv')
emmissions_long = emmissions_df.melt(id_vars=['State'], var_name='Year', value_name='Emissions')
emmissions_long = emmissions_long[emmissions_long['Year'].apply(lambda x: x.isnumeric())]
emmissions_long['Year'] = pd.to_numeric(emmissions_long['Year'], errors='coerce')
merged_df = pd.merge(asthma_population_merged, emmissions_long, left_on=['YearStart', 'State Name'], right_on=['Year', 'State'], how='left')
merged_df.drop(columns=['State_y'], inplace=True)
merged_df.rename(columns={'State_x': 'State'}, inplace=True)
merged_df.head()
Year | State | Male Population | |
---|---|---|---|
0 | 2000 | AK | 76955.0 |
1 | 2000 | AL | 2175150.0 |
2 | 2000 | AR | 1307573.0 |
3 | 2000 | AZ | 614849.0 |
4 | 2000 | CA | 4063605.0 |
... | ... | ... | ... |
915 | 2019 | VA | 403981.0 |
916 | 2019 | VT | 160948.0 |
917 | 2019 | WA | 2811851.0 |
918 | 2019 | WI | 1393637.0 |
919 | 2019 | WV | 889160.0 |
[920 rows x 3 columns]
Year | State | Female Population | |
---|---|---|---|
0 | 2000 | AK | 69689.0 |
1 | 2000 | AL | 2335216.0 |
2 | 2000 | AR | 1371015.0 |
3 | 2000 | AZ | 605521.0 |
4 | 2000 | CA | 4099641.0 |
... | ... | ... | ... |
915 | 2019 | VA | 412600.0 |
916 | 2019 | VT | 162963.0 |
917 | 2019 | WA | 2795301.0 |
918 | 2019 | WI | 1378458.0 |
919 | 2019 | WV | 906103.0 |
[920 rows x 3 columns]
YearStart | LocationAbbr | Stratification1 | DataValue | Year_x | State | Total Population | State Name | Year_male_pop | State_male_pop | Male Population | Year_female_pop | State_female_pop | Female Population | Year_y | Emissions | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2010 | AZ | Female | 7473.116667 | 2010 | AZ | 1600442.0 | Arizona | 2010 | AZ | 809747.0 | 2010 | AZ | 790763.0 | 2010 | 99.5 |
1 | 2010 | AZ | Male | 6150.000000 | 2010 | AZ | 1600442.0 | Arizona | 2010 | AZ | 809747.0 | 2010 | AZ | 790763.0 | 2010 | 99.5 |
2 | 2010 | CA | Female | 38898.420000 | 2010 | CA | 8730612.0 | California | 2010 | CA | 4327961.0 | 2010 | CA | 4402785.0 | 2010 | 356.6 |
3 | 2010 | CA | Male | 32368.006667 | 2010 | CA | 8730612.0 | California | 2010 | CA | 4327961.0 | 2010 | CA | 4402785.0 | 2010 | 356.6 |
4 | 2010 | FL | Female | 27106.200000 | 2010 | FL | 18845537.0 | Florida | 2010 | FL | 9212261.0 | 2010 | FL | 9633882.0 | 2010 | 245.5 |
florida_er_visits = merged_df[(merged_df['LocationAbbr'] == 'FL')][['YearStart', 'DataValue', 'Stratification1', 'Male Population', 'Female Population']].dropna()
iowa_er_visits = merged_df[(merged_df['LocationAbbr'] == 'IA')][['YearStart', 'DataValue', 'Stratification1', 'Male Population', 'Female Population']].dropna()
new_jersey_er_visits = merged_df[(merged_df['LocationAbbr'] == 'NJ')][['YearStart', 'DataValue', 'Stratification1', 'Male Population', 'Female Population']].dropna()
florida_er_visits['Visits per 10,000'] = florida_er_visits.apply(lambda x: (x['DataValue'] / x['Male Population'] * 10000) if x['Stratification1'] == 'Male' else (x['DataValue'] / x['Female Population'] * 10000), axis=1)
iowa_er_visits['Visits per 10,000'] = iowa_er_visits.apply(lambda x: (x['DataValue'] / x['Male Population'] * 10000) if x['Stratification1'] == 'Male' else (x['DataValue'] / x['Female Population'] * 10000), axis=1)
new_jersey_er_visits['Visits per 10,000'] = new_jersey_er_visits.apply(lambda x: (x['DataValue'] / x['Male Population'] * 10000) if x['Stratification1'] == 'Male' else (x['DataValue'] / x['Female Population'] * 10000), axis=1)
total_visits_per_10000_florida = florida_er_visits.groupby(['YearStart', 'Stratification1'])['Visits per 10,000'].sum().reset_index()
total_visits_per_10000_iowa = iowa_er_visits.groupby(['YearStart', 'Stratification1'])['Visits per 10,000'].sum().reset_index()
total_visits_per_10000_new_jersey = new_jersey_er_visits.groupby(['YearStart', 'Stratification1'])['Visits per 10,000'].sum().reset_index()
(total_visits_per_10000_florida, total_visits_per_10000_iowa, total_visits_per_10000_new_jersey)
Visits per 10,000 by year and gender for Florida, Iowa, and New Jersey:
YearStart | Stratification1 | Florida | Iowa | New Jersey
---|---|---|---|---
2010 | Female | 28.136321 | 25.532069 | 517.049395
2010 | Male | 21.920341 | 20.437976 | 399.737513
2013 | Female | 23.307932 | 19.253279 | 404.753877
2013 | Male | 19.730475 | 16.662763 | 345.234332
2014 | Female | 24.128772 | 20.461681 | 409.523794
2014 | Male | 20.049779 | 17.900915 | 343.297010
2015 | Female | 17.268291 | 15.734273 | 292.801786
2015 | Male | 14.077174 | 12.679087 | 247.006551
2016 | Female | 20.140228 | 17.134587 | 359.963605
2016 | Male | 17.712882 | 14.657145 | 310.604665
2017 | Female | 19.240536 | 16.417481 | 345.041857
2017 | Male | 16.687892 | 14.772427 | 287.855561
2018 | Female | 19.346165 | 16.205231 | 328.137771
2018 | Male | 16.858536 | 13.870933 | 284.557950
fig, axes = plt.subplots(3, 1, figsize=(12, 18))
# Florida
axes[0].plot(total_visits_per_10000_florida[total_visits_per_10000_florida['Stratification1'] == 'Male']['YearStart'],
total_visits_per_10000_florida[total_visits_per_10000_florida['Stratification1'] == 'Male']['Visits per 10,000'],
label='Male', marker='o', color='blue')
axes[0].plot(total_visits_per_10000_florida[total_visits_per_10000_florida['Stratification1'] == 'Female']['YearStart'],
total_visits_per_10000_florida[total_visits_per_10000_florida['Stratification1'] == 'Female']['Visits per 10,000'],
label='Female', marker='o', color='red')
# Combine Male + Female for Florida
axes[0].plot(total_visits_per_10000_florida['YearStart'].unique(),
total_visits_per_10000_florida.groupby('YearStart')['Visits per 10,000'].sum(),
label='Total', marker='o', color='purple')
axes[0].set_title('Emergency Room Visits for Asthma per 10,000 People in Florida by Gender')
axes[0].set_xlabel('Year')
axes[0].set_ylabel('Visits per 10,000 People')
axes[0].legend()
axes[0].grid(True)
axes[0].set_ylim(bottom=0)
# Iowa
axes[1].plot(total_visits_per_10000_iowa[total_visits_per_10000_iowa['Stratification1'] == 'Male']['YearStart'],
total_visits_per_10000_iowa[total_visits_per_10000_iowa['Stratification1'] == 'Male']['Visits per 10,000'],
label='Male', marker='o', color='blue')
axes[1].plot(total_visits_per_10000_iowa[total_visits_per_10000_iowa['Stratification1'] == 'Female']['YearStart'],
total_visits_per_10000_iowa[total_visits_per_10000_iowa['Stratification1'] == 'Female']['Visits per 10,000'],
label='Female', marker='o', color='red')
# Combine Male + Female for Iowa
axes[1].plot(total_visits_per_10000_iowa['YearStart'].unique(),
total_visits_per_10000_iowa.groupby('YearStart')['Visits per 10,000'].sum(),
label='Total', marker='o', color='purple')
axes[1].set_title('Emergency Room Visits for Asthma per 10,000 People in Iowa by Gender')
axes[1].set_xlabel('Year')
axes[1].set_ylabel('Visits per 10,000 People')
axes[1].legend()
axes[1].grid(True)
axes[1].set_ylim(bottom=0)
# New Jersey
axes[2].plot(total_visits_per_10000_new_jersey[total_visits_per_10000_new_jersey['Stratification1'] == 'Male']['YearStart'],
total_visits_per_10000_new_jersey[total_visits_per_10000_new_jersey['Stratification1'] == 'Male']['Visits per 10,000'],
label='Male', marker='o', color='blue')
axes[2].plot(total_visits_per_10000_new_jersey[total_visits_per_10000_new_jersey['Stratification1'] == 'Female']['YearStart'],
total_visits_per_10000_new_jersey[total_visits_per_10000_new_jersey['Stratification1'] == 'Female']['Visits per 10,000'],
label='Female', marker='o', color='red')
# Combine Male + Female for New Jersey
axes[2].plot(total_visits_per_10000_new_jersey['YearStart'].unique(),
total_visits_per_10000_new_jersey.groupby('YearStart')['Visits per 10,000'].sum(),
label='Total', marker='o', color='purple')
axes[2].set_title('Emergency Room Visits for Asthma per 10,000 People in New Jersey by Gender')
axes[2].set_xlabel('Year')
axes[2].set_ylabel('Visits per 10,000 People')
axes[2].legend()
axes[2].grid(True)
axes[2].set_ylim(bottom=0)
plt.tight_layout()
plt.show()
Based on the graphs for each state, per 10,000 people the number of females admitted to the emergency room for asthma-related issues is consistently higher than the number of males. Although New Jersey's Asthma Prevalence (%) is vastly larger than Florida's and Iowa's, gender does not appear to play a large role in that difference, as the male/female trends are roughly the same across all three states.
asthma_df = pd.read_csv('U.S._Chronic_Disease_Indicators__CDI___2023_Release_20240417.csv', low_memory=False)
asthma_data = asthma_df[asthma_df['Topic'] == 'Asthma']
asthma_data = asthma_data[asthma_data['Question'] == 'Emergency department visit rate for asthma']
asthma_data = asthma_data[asthma_data['StratificationCategory1'] == 'Race/Ethnicity'].copy()
asthma_data['DataValue'] = pd.to_numeric(asthma_data['DataValue'], errors='coerce')
asthma_data.dropna(subset=['DataValue'], inplace=True)
asthma_state_yearly = asthma_data.groupby(['YearStart', 'LocationAbbr', 'Stratification1'])['DataValue'].mean().reset_index()
df_population = pd.read_csv('Population by Race - EDDs.csv')
df_population_total = df_population.groupby(['Year', 'State'])['Total Population'].sum().reset_index()
df_population_white = df_population.groupby(['Year', 'State'])['White Alone'].sum().reset_index()
df_population_black = df_population.groupby(['Year', 'State'])['Black Alone'].sum().reset_index()
df_population_american_indian = df_population.groupby(['Year', 'State'])['American Indian or Alaskan Native'].sum().reset_index()
df_population_hawaiian = df_population.groupby(['Year', 'State'])['Hawaiian or Pacific Islander Alone'].sum().reset_index()
df_population_two = df_population.groupby(['Year', 'State'])['Two or More Races'].sum().reset_index()
df_population_not_hispanic = df_population.groupby(['Year', 'State'])['Not Hispanic'].sum().reset_index()
df_population_hispanic = df_population.groupby(['Year', 'State'])['Hispanic'].sum().reset_index()
asthma_population_merged = pd.merge(asthma_state_yearly, df_population_total, left_on=['YearStart', 'LocationAbbr'], right_on=['Year', 'State'], how='inner')
merges = [df_population_white, df_population_black, df_population_american_indian, df_population_hawaiian, df_population_two, df_population_not_hispanic, df_population_hispanic]
for merge_df in merges:
asthma_population_merged = pd.merge(asthma_population_merged, merge_df, on=['Year', 'State'], how='inner')
asthma_population_merged
YearStart | LocationAbbr | Stratification1 | DataValue | Year | State | Total Population | White Alone | Black Alone | American Indian or Alaskan Native | Hawaiian or Pacific Islander Alone | Two or More Races | Not Hispanic | Hispanic | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2013 | AZ | American Indian or Alaska Native | 557.366667 | 2013 | AZ | 1617914.0 | 1313929.0 | 38650.0 | 202591.0 | 4009.0 | 36760.0 | 1169683.0 | 448231.0 |
1 | 2013 | AZ | Asian or Pacific Islander | 128.233333 | 2013 | AZ | 1617914.0 | 1313929.0 | 38650.0 | 202591.0 | 4009.0 | 36760.0 | 1169683.0 | 448231.0 |
2 | 2013 | AZ | Black, non-Hispanic | 1588.833333 | 2013 | AZ | 1617914.0 | 1313929.0 | 38650.0 | 202591.0 | 4009.0 | 36760.0 | 1169683.0 | 448231.0 |
3 | 2013 | AZ | Hispanic | 3657.833333 | 2013 | AZ | 1617914.0 | 1313929.0 | 38650.0 | 202591.0 | 4009.0 | 36760.0 | 1169683.0 | 448231.0 |
4 | 2013 | AZ | White, non-Hispanic | 4860.633333 | 2013 | AZ | 1617914.0 | 1313929.0 | 38650.0 | 202591.0 | 4009.0 | 36760.0 | 1169683.0 | 448231.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
360 | 2018 | WI | American Indian or Alaska Native | 141.313333 | 2018 | WI | 2765223.0 | 2566205.0 | 39437.0 | 47612.0 | 1672.0 | 45678.0 | 2658039.0 | 107184.0 |
361 | 2018 | WI | Asian or Pacific Islander | 102.496667 | 2018 | WI | 2765223.0 | 2566205.0 | 39437.0 | 47612.0 | 1672.0 | 45678.0 | 2658039.0 | 107184.0 |
362 | 2018 | WI | Black, non-Hispanic | 2706.553333 | 2018 | WI | 2765223.0 | 2566205.0 | 39437.0 | 47612.0 | 1672.0 | 45678.0 | 2658039.0 | 107184.0 |
363 | 2018 | WI | Hispanic | 722.306667 | 2018 | WI | 2765223.0 | 2566205.0 | 39437.0 | 47612.0 | 1672.0 | 45678.0 | 2658039.0 | 107184.0 |
364 | 2018 | WI | White, non-Hispanic | 3391.546667 | 2018 | WI | 2765223.0 | 2566205.0 | 39437.0 | 47612.0 | 1672.0 | 45678.0 | 2658039.0 | 107184.0 |
365 rows × 14 columns
asthma_population_merged['Race'] = asthma_population_merged['Stratification1'].apply(lambda x: x.split(',')[0])
race_mapping = {
'American Indian or Alaska Native': 'American Indian or Alaskan Native',
'Asian or Pacific Islander': 'Hawaiian or Pacific Islander Alone',
'Black, non-Hispanic': 'Black Alone',
'Hispanic': 'Hispanic',
'White, non-Hispanic': 'White Alone',
"Black": "Black Alone",
"White": "White Alone"
}
asthma_population_merged['Clean Race'] = asthma_population_merged['Race'].map(race_mapping)
def get_race_population(row):
    # Return the population of the row's mapped race group, if that column exists in the row
    race_column = row['Clean Race']
    if race_column in row:
        return row[race_column]
    else:
        return None
asthma_population_merged['Clean Race Population'] = asthma_population_merged.apply(get_race_population, axis=1)
asthma_population_merged['Visits per 10,000 based on Race'] = (asthma_population_merged['DataValue'] / asthma_population_merged['Clean Race Population']) * 10000
asthma_population_merged.head()
YearStart | LocationAbbr | Stratification1 | DataValue | Year | State | Total Population | White Alone | Black Alone | American Indian or Alaskan Native | Hawaiian or Pacific Islander Alone | Two or More Races | Not Hispanic | Hispanic | Race | Clean Race | Clean Race Population | Visits per 10,000 based on Race | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2013 | AZ | American Indian or Alaska Native | 557.366667 | 2013 | AZ | 1617914.0 | 1313929.0 | 38650.0 | 202591.0 | 4009.0 | 36760.0 | 1169683.0 | 448231.0 | American Indian or Alaska Native | American Indian or Alaskan Native | 202591.0 | 27.511916 |
1 | 2013 | AZ | Asian or Pacific Islander | 128.233333 | 2013 | AZ | 1617914.0 | 1313929.0 | 38650.0 | 202591.0 | 4009.0 | 36760.0 | 1169683.0 | 448231.0 | Asian or Pacific Islander | Hawaiian or Pacific Islander Alone | 4009.0 | 319.863640 |
2 | 2013 | AZ | Black, non-Hispanic | 1588.833333 | 2013 | AZ | 1617914.0 | 1313929.0 | 38650.0 | 202591.0 | 4009.0 | 36760.0 | 1169683.0 | 448231.0 | Black | Black Alone | 38650.0 | 411.082363 |
3 | 2013 | AZ | Hispanic | 3657.833333 | 2013 | AZ | 1617914.0 | 1313929.0 | 38650.0 | 202591.0 | 4009.0 | 36760.0 | 1169683.0 | 448231.0 | Hispanic | Hispanic | 448231.0 | 81.605987 |
4 | 2013 | AZ | White, non-Hispanic | 4860.633333 | 2013 | AZ | 1617914.0 | 1313929.0 | 38650.0 | 202591.0 | 4009.0 | 36760.0 | 1169683.0 | 448231.0 | White | White Alone | 1313929.0 | 36.993120 |
states = ['FL', 'IA', 'NJ']
fig, axes = plt.subplots(nrows=3, ncols=1, figsize=(10, 20))
asthma_population_merged['StateIndex'] = asthma_population_merged['LocationAbbr'].apply(lambda x: states.index(x) if x in states else None)
filtered_data = asthma_population_merged.dropna(subset=['StateIndex'])
sns.lineplot(ax=axes[0], data=filtered_data[filtered_data['LocationAbbr'] == 'FL'], x='YearStart', y='Visits per 10,000 based on Race', hue='Clean Race', legend='full')
axes[0].set_title('ER Visits per 10,000 based on Race - FL')
axes[0].set_xlabel('Year')
axes[0].set_ylabel('Visits per 10,000')
sns.lineplot(ax=axes[1], data=filtered_data[filtered_data['LocationAbbr'] == 'IA'], x='YearStart', y='Visits per 10,000 based on Race', hue='Clean Race', legend='full')
axes[1].set_title('ER Visits per 10,000 based on Race - IA')
axes[1].set_xlabel('Year')
axes[1].set_ylabel('Visits per 10,000')
sns.lineplot(ax=axes[2], data=filtered_data[filtered_data['LocationAbbr'] == 'NJ'], x='YearStart', y='Visits per 10,000 based on Race', hue='Clean Race', legend='full')
axes[2].set_title('ER Visits per 10,000 based on Race - NJ')
axes[2].set_xlabel('Year')
axes[2].set_ylabel('Visits per 10,000')
plt.tight_layout()
plt.show()
We graph the emergency room visits per 10,000 based on race and see that Hawaiian or Pacific Islander Alone has the highest number of visits per 10,000, which may suggest outliers in the data or in how the statistics are collected in the dataset. Black individuals have the second highest number of asthma-related visits per 10,000.
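One thing worth checking is the denominator behind that group: the CDC stratification 'Asian or Pacific Islander' is mapped above to the much smaller 'Hawaiian or Pacific Islander Alone' population column, which may inflate the per-10,000 rate (a quick diagnostic sketch):
# Diagnostic: inspect the small denominators behind the 'Hawaiian or Pacific Islander Alone' rates
check = asthma_population_merged[asthma_population_merged['Clean Race'] == 'Hawaiian or Pacific Islander Alone']
print(check[['YearStart', 'LocationAbbr', 'DataValue', 'Clean Race Population', 'Visits per 10,000 based on Race']].head())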
# high level or low level of CO2 is the treatment, a binary variable [0, 1]
# Create a DAG
G = nx.DiGraph()
G.add_node("Population", pos=(0.3, 0.7))
G.add_node("CO2", pos=(1, 0))
G.add_node("Asthma", pos=(1.7, 0.7))
G.add_node("Number of power plants", pos=(0.5, 0))
G.add_edges_from([("Population", "CO2"), ("Population", "Asthma"), ("CO2", "Asthma"), ('Number of power plants', 'CO2')])
pos = nx.get_node_attributes(G, 'pos')
plt.figure(figsize=(12, 6))
try:
nx.draw_networkx(G, pos, with_labels=True, node_size=5000, node_color='skyblue', font_size=15, arrows=True)
except Exception:
nx.draw(G, pos, with_labels=True, node_size=5000, node_color='skyblue', font_size=15, arrows=True)
plt.title('Causal DAG: Population as a Confounding Variable')
plt.show()
#Load the asthma dataframe
asthma_df = pd.read_csv('U.S._Chronic_Disease_Indicators__CDI___2023_Release_20240417.csv', low_memory=False)
#Group the asthma dataframe by Year and State
asthma_data = asthma_df[asthma_df['Topic'] == 'Asthma']
#Filter out rows that match our research question related to emergency department visit rates for asthma
asthma_data = asthma_data[asthma_data['Question'] == 'Emergency department visit rate for asthma']
asthma_data = asthma_data[asthma_data['DataValueType'] == 'Number']
asthma_data = asthma_data[asthma_data['StratificationCategory1'] == 'Overall'].copy()
asthma_data['DataValue'] = pd.to_numeric(asthma_data['DataValue'], errors='coerce')
asthma_data.dropna(subset=['DataValue'], inplace=True)
asthma_state_yearly = asthma_data.groupby(['YearStart', 'LocationAbbr'])['DataValue'].mean().reset_index()
#Group the asthma dataset with the Population Dataset because Population is a confounding variable
# We grouped by population
df_population = pd.read_csv('Population by Age and Sex - EDDs.csv')
df_population_summary = df_population.groupby(['Year', 'State'])['Total Population'].sum().reset_index()
asthma_population_merged = pd.merge(asthma_state_yearly, df_population_summary, left_on=['YearStart', 'LocationAbbr'], right_on=['Year', 'State'], how='inner')
state_names = {
'AL': 'Alabama',
'AK': 'Alaska',
'AZ': 'Arizona',
'AR': 'Arkansas',
'CA': 'California',
'CO': 'Colorado',
'CT': 'Connecticut',
'DE': 'Delaware',
'FL': 'Florida',
'GA': 'Georgia',
'HI': 'Hawaii',
'ID': 'Idaho',
'IL': 'Illinois',
'IN': 'Indiana',
'IA': 'Iowa',
'KS': 'Kansas',
'KY': 'Kentucky',
'LA': 'Louisiana',
'ME': 'Maine',
'MD': 'Maryland',
'MA': 'Massachusetts',
'MI': 'Michigan',
'MN': 'Minnesota',
'MS': 'Mississippi',
'MO': 'Missouri',
'MT': 'Montana',
'NE': 'Nebraska',
'NV': 'Nevada',
'NH': 'New Hampshire',
'NJ': 'New Jersey',
'NM': 'New Mexico',
'NY': 'New York',
'NC': 'North Carolina',
'ND': 'North Dakota',
'OH': 'Ohio',
'OK': 'Oklahoma',
'OR': 'Oregon',
'PA': 'Pennsylvania',
'RI': 'Rhode Island',
'SC': 'South Carolina',
'SD': 'South Dakota',
'TN': 'Tennessee',
'TX': 'Texas',
'UT': 'Utah',
'VT': 'Vermont',
'VA': 'Virginia',
'WA': 'Washington',
'WV': 'West Virginia',
'WI': 'Wisconsin',
'WY': 'Wyoming'
}
asthma_population_merged['State Name'] = asthma_population_merged['State'].map(state_names)
emmissions_df = pd.read_csv('table1.csv')
emmissions_long = emmissions_df.melt(id_vars=['State'], var_name='Year', value_name='Emissions')
emmissions_long = emmissions_long[emmissions_long['Year'].apply(lambda x: x.isnumeric())]
emmissions_long['Year'] = pd.to_numeric(emmissions_long['Year'], errors='coerce')
merged_df = pd.merge(asthma_population_merged, emmissions_long, left_on=['YearStart', 'State Name'], right_on=['Year', 'State'], how='left')
merged_df.drop(columns=['State_y'], inplace=True)
merged_df.rename(columns={'State_x': 'State'}, inplace=True)
merged_df['Visits per 10,000'] = ((merged_df['DataValue'] / merged_df['Total Population']) * 10000).astype(int)
merged_df['Total Population'] = merged_df['Total Population'].astype(int)
merged_df['Emissions'] = merged_df['Emissions'].astype(float)
merged_df
YearStart | LocationAbbr | DataValue | Year_x | State | Total Population | State Name | Year_y | Emissions | Visits per 10,000 | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 2010 | AZ | 40650.0 | 2010 | AZ | 1600442 | Arizona | 2010 | 99.5 | 253 |
1 | 2010 | CA | 217637.0 | 2010 | CA | 8730612 | California | 2010 | 356.6 | 249 |
2 | 2010 | FL | 141612.0 | 2010 | FL | 18845537 | Florida | 2010 | 245.5 | 75 |
3 | 2010 | IA | 14135.0 | 2010 | IA | 2075096 | Iowa | 2010 | 90.3 | 68 |
4 | 2010 | KY | 27334.0 | 2010 | KY | 4348181 | Kentucky | 2010 | 152.9 | 62 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
95 | 2018 | NJ | 52173.0 | 2018 | NJ | 569816 | New Jersey | 2018 | 105.1 | 915 |
96 | 2018 | NV | 14810.0 | 2018 | NV | 199220 | Nevada | 2018 | 41.4 | 743 |
97 | 2018 | OR | 12819.0 | 2018 | OR | 4740945 | Oregon | 2018 | 39.6 | 27 |
98 | 2018 | VT | 1799.0 | 2018 | VT | 324553 | Vermont | 2018 | 5.9 | 55 |
99 | 2018 | WI | 21174.0 | 2018 | WI | 2765223 | Wisconsin | 2018 | 101.2 | 76 |
100 rows × 10 columns
# The states in our dataset with Asthma related information for emergency visits to the hospital
asthma_population_merged["State Name"].unique()
array(['Arizona', 'California', 'Florida', 'Iowa', 'Kentucky', 'Massachusetts', 'Maryland', 'North Carolina', 'Nebraska', 'New Jersey', 'New York', 'South Carolina', 'Utah', 'Wisconsin', 'Nevada', 'Vermont', 'Arkansas', 'Georgia', 'Minnesota', 'Oregon', 'Colorado'], dtype=object)
We take the 60th percentile of the "Emissions" column as a threshold: everything above it is treated as high CO2 emissions, and everything at or below it as low CO2 emissions. We do this because we want the treatment variable to be binary for causal inference.
threshold = merged_df['Emissions'].quantile(0.60)
print(threshold)
merged_df['High Emissions'] = merged_df['Emissions'].apply(lambda x: 1 if x > threshold else 0)
merged_df
99.34
YearStart | LocationAbbr | DataValue | Year_x | State | Total Population | State Name | Year_y | Emissions | Visits per 10,000 | High Emissions | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2010 | AZ | 40650.0 | 2010 | AZ | 1600442 | Arizona | 2010 | 99.5 | 253 | 1 |
1 | 2010 | CA | 217637.0 | 2010 | CA | 8730612 | California | 2010 | 356.6 | 249 | 1 |
2 | 2010 | FL | 141612.0 | 2010 | FL | 18845537 | Florida | 2010 | 245.5 | 75 | 1 |
3 | 2010 | IA | 14135.0 | 2010 | IA | 2075096 | Iowa | 2010 | 90.3 | 68 | 0 |
4 | 2010 | KY | 27334.0 | 2010 | KY | 4348181 | Kentucky | 2010 | 152.9 | 62 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
95 | 2018 | NJ | 52173.0 | 2018 | NJ | 569816 | New Jersey | 2018 | 105.1 | 915 | 1 |
96 | 2018 | NV | 14810.0 | 2018 | NV | 199220 | Nevada | 2018 | 41.4 | 743 | 0 |
97 | 2018 | OR | 12819.0 | 2018 | OR | 4740945 | Oregon | 2018 | 39.6 | 27 | 0 |
98 | 2018 | VT | 1799.0 | 2018 | VT | 324553 | Vermont | 2018 | 5.9 | 55 | 0 |
99 | 2018 | WI | 21174.0 | 2018 | WI | 2765223 | Wisconsin | 2018 | 101.2 | 76 | 1 |
100 rows × 11 columns
# Power plant dataframe
# The year is 2022 for all entries
# We use the number of power plants as an instrumental variable
PLNT22df = pd.read_csv('PLNT22.csv')
PLNT22df.head()
C:\Users\tranh\AppData\Local\Temp\ipykernel_2596\2105320888.py:4: DtypeWarning: Columns (0,1,4,6,8,16,17,19,20,22,23,27,29,33,34,49,51,52,54,55,57,58,59,60,61,62,63,64,65,67,68,70,71,73,80,94,95,96,99) have mixed types. Specify dtype option on import or set low_memory=False. PLNT22df = pd.read_csv('PLNT22.csv')
Plant file sequence number | Data Year | Plant state abbreviation | Plant name | DOE/EIA ORIS plant or facility code | Plant transmission or distribution system owner name | Plant transmission or distribution system owner ID | Utility name | Utility ID | Plant-level sector | ... | Plant wind generation percent (resource mix) | Plant solar generation percent (resource mix) | Plant geothermal generation percent (resource mix) | Plant other fossil generation percent (resource mix) | Plant other unknown / purchased fuel generation percent (resource mix) | Plant total nonrenewables generation percent (resource mix) | Plant total renewables generation percent (resource mix) | Plant total nonhydro renewables generation percent (resource mix) | Plant total combustion generation percent (resource mix) | Plant total noncombustion generation percent (resource mix) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | SEQPLT22 | YEAR | PSTATABB | PNAME | ORISPL | OPRNAME | OPRCODE | UTLSRVNM | UTLSRVID | SECTOR | ... | PLWIPR | PLSOPR | PLGTPR | PLOFPR | PLOPPR | PLTNPR | PLTRPR | PLTHPR | PLCYPR | PLCNPR |
1 | 1 | 2022 | AK | Agrium Kenai Nitrogen Operations | 54452 | Homer Electric Assn Inc | 19558 | Agrium US Inc | 179 | Industrial CHP | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 2 | 2022 | AK | Alakanuk | 57053 | Alaska Village Elec Coop, Inc | 221 | Alaska Village Elec Coop, Inc | 221 | Electric Utility | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 3 | 2022 | AK | Allison Creek Hydro | 58982 | Copper Valley Elec Assn, Inc | 4329 | Copper Valley Elec Assn, Inc | 4329 | Electric Utility | ... | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 100.0% | 0.0% | 0.0% | 100.0% |
4 | 4 | 2022 | AK | Ambler | 60243 | Alaska Village Elec Coop, Inc | 221 | Alaska Village Elec Coop, Inc | 221 | Electric Utility | ... | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 100.0% | 0.0% | 0.0% | 100.0% | 0.0% |
5 rows × 141 columns
state_counts = PLNT22df['Plant state abbreviation'].value_counts()
print(state_counts)
data = {
'State': ['CA', 'NC', 'NY', 'TX', 'MN', 'MA', 'NJ', 'IL', 'IA', 'MI', 'OR', 'CO', 'FL', 'PA', 'GA',
'WI', 'SC', 'IN', 'VA', 'OH', 'MD', 'KS', 'CT', 'AK', 'AZ', 'WA', 'ID', 'NM', 'ME', 'MO',
'OK', 'NE', 'VT', 'UT', 'NV', 'RI', 'LA', 'TN', 'AL', 'AR', 'HI', 'WY', 'MT', 'NH', 'ND',
'PR', 'KY', 'SD', 'MS', 'WV', 'DE', 'DC'],
'Count': [1678, 875, 821, 754, 724, 600, 368, 328, 283, 266, 265, 265, 242, 242, 235, 204, 197, 194,
193, 183, 171, 161, 155, 151, 146, 142, 140, 135, 133, 130, 129, 126, 113, 113, 102, 87,
82, 82, 75, 73, 68, 67, 65, 61, 60, 60, 52, 50, 47, 36, 31, 13]
}
df = pd.DataFrame(data)
df.head()
CA 1678 NC 875 NY 821 TX 754 MN 724 MA 600 NJ 368 IL 328 IA 283 MI 266 CO 265 OR 265 PA 242 FL 242 GA 235 WI 204 SC 197 IN 194 VA 193 OH 183 MD 171 KS 161 CT 155 AK 151 AZ 146 WA 142 ID 140 NM 135 ME 133 MO 130 OK 129 NE 126 UT 113 VT 113 NV 102 RI 87 TN 82 LA 82 AL 75 AR 73 HI 68 WY 67 MT 65 NH 61 PR 60 ND 60 KY 52 SD 50 MS 47 WV 36 DE 31 DC 13 PSTATABB 1 Name: Plant state abbreviation, dtype: int64
State | Count | |
---|---|---|
0 | CA | 1678 |
1 | NC | 875 |
2 | NY | 821 |
3 | TX | 754 |
4 | MN | 724 |
result = {k: int(v) for k, v in df.set_index('State')['Count'].to_dict().items()}
merged_df['Total Power Plant Count'] = merged_df['LocationAbbr'].map(result)
merged_df
YearStart | LocationAbbr | DataValue | Year_x | State | Total Population | State Name | Year_y | Emissions | Visits per 10,000 | High Emissions | Total Power Plant Count | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2010 | AZ | 40650.0 | 2010 | AZ | 1600442 | Arizona | 2010 | 99.5 | 253 | 1 | 146 |
1 | 2010 | CA | 217637.0 | 2010 | CA | 8730612 | California | 2010 | 356.6 | 249 | 1 | 1678 |
2 | 2010 | FL | 141612.0 | 2010 | FL | 18845537 | Florida | 2010 | 245.5 | 75 | 1 | 242 |
3 | 2010 | IA | 14135.0 | 2010 | IA | 2075096 | Iowa | 2010 | 90.3 | 68 | 0 | 283 |
4 | 2010 | KY | 27334.0 | 2010 | KY | 4348181 | Kentucky | 2010 | 152.9 | 62 | 1 | 52 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
95 | 2018 | NJ | 52173.0 | 2018 | NJ | 569816 | New Jersey | 2018 | 105.1 | 915 | 1 | 368 |
96 | 2018 | NV | 14810.0 | 2018 | NV | 199220 | Nevada | 2018 | 41.4 | 743 | 0 | 102 |
97 | 2018 | OR | 12819.0 | 2018 | OR | 4740945 | Oregon | 2018 | 39.6 | 27 | 0 | 265 |
98 | 2018 | VT | 1799.0 | 2018 | VT | 324553 | Vermont | 2018 | 5.9 | 55 | 0 | 113 |
99 | 2018 | WI | 21174.0 | 2018 | WI | 2765223 | Wisconsin | 2018 | 101.2 | 76 | 1 | 204 |
100 rows × 12 columns
# From lab 9, we fit an OLS model
def fit_OLS_model(df, target_variable, explanatory_variables, intercept = False):
"""
Fits an OLS model from data.
Inputs:
df: pandas DataFrame
target_variable: string, name of the target variable
explanatory_variables: list of strings, names of the explanatory variables
intercept: bool, if True add intercept term
Outputs:
fitted_model: model containing OLS regression results
"""
target = df[target_variable]
inputs = df[explanatory_variables]
if intercept:
inputs = sm.add_constant(inputs)
fitted_model = sm.OLS(target, inputs).fit()
return(fitted_model)
gammas_model1 = fit_OLS_model(merged_df, 'Visits per 10,000', ['High Emissions'], intercept = True)
gammas_model1.summary()
Dep. Variable: | Visits per 10,000 | R-squared: | 0.010 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | -0.000 |
Method: | Least Squares | F-statistic: | 0.9571 |
Date: | Mon, 06 May 2024 | Prob (F-statistic): | 0.330 |
Time: | 11:42:41 | Log-Likelihood: | -716.34 |
No. Observations: | 100 | AIC: | 1437. |
Df Residuals: | 98 | BIC: | 1442. |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
const | 191.9667 | 40.749 | 4.711 | 0.000 | 111.102 | 272.831 |
High Emissions | 63.0333 | 64.430 | 0.978 | 0.330 | -64.825 | 190.892 |
Omnibus: | 40.794 | Durbin-Watson: | 1.768 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 72.917 |
Skew: | 1.815 | Prob(JB): | 1.47e-16 |
Kurtosis: | 5.081 | Cond. No. | 2.45 |
gammas_model2 = fit_OLS_model(merged_df, 'Visits per 10,000', ['High Emissions', 'Total Population'], intercept = True)
gammas_model2.summary()
Dep. Variable: | Visits per 10,000 | R-squared: | 0.242 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.226 |
Method: | Least Squares | F-statistic: | 15.46 |
Date: | Mon, 06 May 2024 | Prob (F-statistic): | 1.48e-06 |
Time: | 11:42:41 | Log-Likelihood: | -702.99 |
No. Observations: | 100 | AIC: | 1412. |
Df Residuals: | 97 | BIC: | 1420. |
Df Model: | 2 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
const | 265.5250 | 38.298 | 6.933 | 0.000 | 189.514 | 341.536 |
High Emissions | 260.1005 | 67.227 | 3.869 | 0.000 | 126.674 | 393.527 |
Total Population | -3.394e-05 | 6.23e-06 | -5.448 | 0.000 | -4.63e-05 | -2.16e-05 |
Omnibus: | 19.797 | Durbin-Watson: | 1.872 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 24.225 |
Skew: | 1.160 | Prob(JB): | 5.49e-06 |
Kurtosis: | 3.655 | Cond. No. | 1.72e+07 |
gammas_model3 = fit_OLS_model(merged_df, 'Visits per 10,000', ['Total Power Plant Count','High Emissions', 'Total Population'], intercept = True)
gammas_model3.summary()
Dep. Variable: | Visits per 10,000 | R-squared: | 0.242 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.218 |
Method: | Least Squares | F-statistic: | 10.22 |
Date: | Mon, 06 May 2024 | Prob (F-statistic): | 6.69e-06 |
Time: | 11:42:41 | Log-Likelihood: | -702.98 |
No. Observations: | 100 | AIC: | 1414. |
Df Residuals: | 96 | BIC: | 1424. |
Df Model: | 3 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
const | 261.4268 | 44.390 | 5.889 | 0.000 | 173.313 | 349.541 |
Total Power Plant Count | 0.0197 | 0.106 | 0.185 | 0.853 | -0.191 | 0.231 |
High Emissions | 257.2455 | 69.298 | 3.712 | 0.000 | 119.690 | 394.801 |
Total Population | -3.41e-05 | 6.31e-06 | -5.400 | 0.000 | -4.66e-05 | -2.16e-05 |
Omnibus: | 20.164 | Durbin-Watson: | 1.878 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 24.861 |
Skew: | 1.177 | Prob(JB): | 4.00e-06 |
Kurtosis: | 3.655 | Cond. No. | 1.74e+07 |
-Treatment: Exposure to increased levels of carbon dioxide, operationalized as being in geographic locations or during time periods with carbon dioxide levels above a threshold considered "high" (here, over 99.34 MTCO2E, the 60th percentile in our dataset).
-Control: Exposure to lower levels of carbon dioxide, defined as being in locations or during time periods with carbon dioxide levels at or below that threshold.
-Outcome: The number of asthma-related emergency room visits, measured either as the total number of visits or as the rate of visits per 10,000 people to account for population size differences.
-Units: State-year observations (each row is a state in a given year).
We identified an instrumental variable to help adjust for confounding: the number of power plants in a state, which affects CO2 emissions directly and is plausibly independent of population and of asthma-related emergency visits per 10,000 people except through its effect on emissions. A minimal two-stage least squares sketch using this instrument is shown below.
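As a rough illustration of how the instrument could be used, the sketch below runs a manual two-stage least squares: the first stage regresses the High Emissions indicator on the power-plant count (with population as an exogenous control), and the second stage regresses visits on the fitted treatment. This is only a minimal sketch assuming merged_df already contains the columns built above ('High Emissions', 'Total Power Plant Count', 'Total Population', 'Visits per 10,000'); the naive second-stage standard errors are not corrected for the two-step estimation.
import statsmodels.api as sm
# Stage 1: predict the treatment (High Emissions) from the instrument
# (Total Power Plant Count) and the confounder (Total Population).
first_stage_X = sm.add_constant(merged_df[['Total Power Plant Count', 'Total Population']])
first_stage = sm.OLS(merged_df['High Emissions'], first_stage_X).fit()
merged_df['High Emissions (fitted)'] = first_stage.fittedvalues
# Stage 2: regress the outcome on the fitted treatment and the confounder.
second_stage_X = sm.add_constant(merged_df[['High Emissions (fitted)', 'Total Population']])
second_stage = sm.OLS(merged_df['Visits per 10,000'], second_stage_X).fit()
print(second_stage.params)  # the coefficient on the fitted treatment is the 2SLS estimate
# Note: second-stage standard errors are not adjusted for the generated regressor.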
We used multiple hypothesis tests to answer the question of whether exposure to increased levels of carbon dioxide causes an increase in the number of asthma-related emergency room visits. Note that different populations may respond differently to carbon dioxide exposure due to genetic, environmental, or lifestyle factors. Multiple hypothesis tests allow for the examination of effects across various demographic groups, such as age, gender, race/ethnicity, or geographic location. The impact of carbon dioxide on asthma-related emergency room visits may vary over time due to changes in air quality, climate conditions, or public health interventions. Testing across different time periods can provide insights into these temporal dynamics.
Additionally, carbon dioxide levels and the prevalence of asthma can vary significantly across different geographic areas. By conducting hypothesis tests for different states, we can identify areas where the association between carbon dioxide exposure and asthma emergencies is stronger or weaker. Conducting multiple hypothesis tests increases the robustness of the findings. If similar results are obtained across various groups, time periods, and locations, it strengthens the evidence for a causal relationship between carbon dioxide exposure and asthma-related emergency room visits. Multiple hypothesis testing allows for the examination of the relationship between carbon dioxide exposure and asthma while controlling for potential confounding variables. This approach can help isolate the effect of carbon dioxide from other factors that might influence asthma emergency visits.
The Benjamini-Hochberg algorithm controls the False Discovery Rate (FDR), the expected proportion of incorrectly rejected hypotheses (false discoveries) among all rejections. This method lets us control the rate of false discoveries while maintaining higher power to detect real effects than FWER-controlling methods, which are more prone to failing to identify a true effect (a Type II error). A short sketch of the procedure follows.
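A minimal sketch of applying Benjamini-Hochberg with statsmodels, assuming we have already collected the per-group p-values into a list (p_values_list here is a hypothetical placeholder):
from statsmodels.stats.multitest import multipletests
# Hypothetical list of p-values collected from the per-group hypothesis tests.
p_values_list = [0.001, 0.020, 0.040, 0.300, 0.700]
# method='fdr_bh' applies the Benjamini-Hochberg step-up procedure at FDR level alpha.
reject, p_adjusted, _, _ = multipletests(p_values_list, alpha=0.05, method='fdr_bh')
print(reject)      # which hypotheses are rejected while controlling the FDR at 5%
print(p_adjusted)  # BH-adjusted p-values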
def null_pdf(x): return norm.pdf(x, 0, 1)  # null distribution: N(0, 1)
def alt_pdf(x): return norm.pdf(x, 2, 1)  # alternative distribution: N(2, 1)
def calculate_likelihood_ratio(x):
"""
Computes the likelihood ratio between the alternative and null hypothesis.
Inputs:
x: value for which to compute the likelihood ratio
Outputs:
lr : the likelihood ratio at point x
"""
L0 = null_pdf(x)
L1 = alt_pdf(x)
LR = L1 / L0  # likelihood ratio of the alternative to the null
return LR
# FPR of a one-sided threshold test under the standard normal null
def calculate_fpr(tau):
"""
Calculates the FPR for the test based on thresholding X>=tau.
It assumes that the null distribution is the standard gaussian N(0,1)
Inputs:
tau: test threshold
Outputs:
fpr: false positive rate
"""
fpr = 1 - scipy.stats.norm.cdf(tau)
return fpr
# Decision rule: reject the null when X exceeds the (1 - alpha) quantile of the null
def make_decision(X, alpha):
"""
Makes a decision whether to reject the null hypothesis for point X at level alpha
Inputs:
X: point at which to test
alpha: desired FPR rate for the decision rule (also known as significance level)
Outputs:
decision: {0, 1} or {False, True}
1/True: reject the null
0/False: fail to reject the null
"""
threshold = scipy.stats.norm.ppf(1 - alpha)  # threshold at which the FPR of the test equals alpha
decision = X > threshold  # True (1) means reject the null
return decision
# P-value: probability, under the null, of observing a value at least as extreme as X
def calculate_p_value(X):
"""
Calculates the P-values for the point X
Inputs:
X: data point
Outputs:
p_value: P(X)
"""
p_value = calculate_fpr(X)
return(p_value)
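A quick usage illustration of the helpers above (same standard-normal null): for an observed point X = 2.5 at level alpha = 0.05 we get a small p-value and a rejection.
from scipy.stats import norm
X, alpha = 2.5, 0.05
p_value = 1 - norm.cdf(X)         # same quantity as calculate_p_value(X)
reject = X > norm.ppf(1 - alpha)  # same decision as make_decision(X, alpha)
print(p_value, reject)            # roughly 0.0062, True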
To collect p-values for testing the hypothesis that high carbon dioxide levels increase emergency room visits due to asthma, we follow these steps:
Merge CO2 Levels Data with Asthma ER Visits Data
Calculate Asthma ER Visit Rates: Normalize the number of asthma ER visits by the population size (if not already done) to get a rate of visits per 10,000 people (or any other suitable normalization factor). This allows for fair comparisons across different states or regions with varying population sizes.
Classify CO2 Levels: Determine a threshold for high CO2 levels based on environmental standards or literature. Classify each observation (e.g., state-year combination) as having high or low CO2 levels based on this threshold.
Statistical Test for Each Group: For each group (high CO2 level vs. low CO2 level), calculate the mean asthma ER visit rate. Then, use a statistical test (e.g., t-test if assumptions are met, or Mann-Whitney U test for non-parametric data) to compare the mean rates between the two groups.
Collect P-values: The statistical test will provide a p-value indicating the likelihood of observing the data assuming the null hypothesis is true. A low p-value (typically <0.05) suggests that the observed difference in asthma ER visit rates between high and low CO2 emission level groups is statistically significant, leading to the rejection of the null hypothesis in favor of the alternative hypothesis.
asthma_df = pd.read_csv('U.S._Chronic_Disease_Indicators__CDI___2023_Release_20240417.csv', low_memory=False)
asthma_data = asthma_df[asthma_df['Topic'] == 'Asthma']
asthma_data = asthma_data[asthma_data['Question'] == 'Emergency department visit rate for asthma']
asthma_data = asthma_data[asthma_data['StratificationCategory1'] == 'Gender']
asthma_data['DataValue'] = pd.to_numeric(asthma_data['DataValue'], errors='coerce')
asthma_data.dropna(subset=['DataValue'], inplace=True)
asthma_state_yearly = asthma_data.groupby(['YearStart', 'LocationAbbr', 'Stratification1'])['DataValue'].mean().reset_index()
df_population = pd.read_csv('Population by Age and Sex - EDDs.csv')
df_population_summary = df_population.groupby(['Year', 'State'])['Total Population'].sum().reset_index()
asthma_population_merged = pd.merge(asthma_state_yearly, df_population_summary, left_on=['YearStart', 'LocationAbbr'], right_on=['Year', 'State'], how='inner')
state_names = {
'AL': 'Alabama',
'AK': 'Alaska',
'AZ': 'Arizona',
'AR': 'Arkansas',
'CA': 'California',
'CO': 'Colorado',
'CT': 'Connecticut',
'DE': 'Delaware',
'FL': 'Florida',
'GA': 'Georgia',
'HI': 'Hawaii',
'ID': 'Idaho',
'IL': 'Illinois',
'IN': 'Indiana',
'IA': 'Iowa',
'KS': 'Kansas',
'KY': 'Kentucky',
'LA': 'Louisiana',
'ME': 'Maine',
'MD': 'Maryland',
'MA': 'Massachusetts',
'MI': 'Michigan',
'MN': 'Minnesota',
'MS': 'Mississippi',
'MO': 'Missouri',
'MT': 'Montana',
'NE': 'Nebraska',
'NV': 'Nevada',
'NH': 'New Hampshire',
'NJ': 'New Jersey',
'NM': 'New Mexico',
'NY': 'New York',
'NC': 'North Carolina',
'ND': 'North Dakota',
'OH': 'Ohio',
'OK': 'Oklahoma',
'OR': 'Oregon',
'PA': 'Pennsylvania',
'RI': 'Rhode Island',
'SC': 'South Carolina',
'SD': 'South Dakota',
'TN': 'Tennessee',
'TX': 'Texas',
'UT': 'Utah',
'VT': 'Vermont',
'VA': 'Virginia',
'WA': 'Washington',
'WV': 'West Virginia',
'WI': 'Wisconsin',
'WY': 'Wyoming'
}
asthma_population_merged['State Name'] = asthma_population_merged['State'].map(state_names)
df_male_population_summary = df_population.groupby(['Year', 'State'])['Male Population'].sum().reset_index()
df_female_population_summary = df_population.groupby(['Year', 'State'])['Female Population'].sum().reset_index()
# print(df_male_population_summary, df_female_population_summary)
asthma_population_merged = pd.merge(asthma_population_merged, df_male_population_summary, left_on=['YearStart', 'LocationAbbr'], right_on=['Year', 'State'], how='left', suffixes=('', '_male_pop'))
asthma_population_merged = pd.merge(asthma_population_merged, df_female_population_summary, left_on=['YearStart', 'LocationAbbr'], right_on=['Year', 'State'], how='left', suffixes=('', '_female_pop'))
emmissions_df = pd.read_csv('table1.csv')
emmissions_long = emmissions_df.melt(id_vars=['State'], var_name='Year', value_name='Emissions')
emmissions_long = emmissions_long[emmissions_long['Year'].apply(lambda x: x.isnumeric())]
emmissions_long['Year'] = pd.to_numeric(emmissions_long['Year'], errors='coerce')
merged_df = pd.merge(asthma_population_merged, emmissions_long, left_on=['YearStart', 'State Name'], right_on=['Year', 'State'], how='left')
merged_df.drop(columns=['State_y'], inplace=True)
merged_df.rename(columns={'State_x': 'State'}, inplace=True)
merged_df['Visits per 10,000'] = merged_df.apply(lambda x: (x['DataValue'] / x['Male Population'] * 10000) if x['Stratification1'] == 'Male' else (x['DataValue'] / x['Female Population'] * 10000), axis=1)
merged_df.head()
YearStart | LocationAbbr | Stratification1 | DataValue | Year_x | State | Total Population | State Name | Year_male_pop | State_male_pop | Male Population | Year_female_pop | State_female_pop | Female Population | Year_y | Emissions | Visits per 10,000 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2010 | AZ | Female | 7473.116667 | 2010 | AZ | 1600442.0 | Arizona | 2010 | AZ | 809747.0 | 2010 | AZ | 790763.0 | 2010 | 99.5 | 94.505138 |
1 | 2010 | AZ | Male | 6150.000000 | 2010 | AZ | 1600442.0 | Arizona | 2010 | AZ | 809747.0 | 2010 | AZ | 790763.0 | 2010 | 99.5 | 75.949648 |
2 | 2010 | CA | Female | 38898.420000 | 2010 | CA | 8730612.0 | California | 2010 | CA | 4327961.0 | 2010 | CA | 4402785.0 | 2010 | 356.6 | 88.349579 |
3 | 2010 | CA | Male | 32368.006667 | 2010 | CA | 8730612.0 | California | 2010 | CA | 4327961.0 | 2010 | CA | 4402785.0 | 2010 | 356.6 | 74.788120 |
4 | 2010 | FL | Female | 27106.200000 | 2010 | FL | 18845537.0 | Florida | 2010 | FL | 9212261.0 | 2010 | FL | 9633882.0 | 2010 | 245.5 | 28.136321 |
q1_with_emissions = merged_df.groupby(['YearStart', 'LocationAbbr'])[['Visits per 10,000', 'Emissions', 'Total Population']].aggregate({'Visits per 10,000': 'sum', 'Emissions': 'first', 'Total Population': 'first'}).reset_index()
q1_with_emissions
YearStart | LocationAbbr | Visits per 10,000 | Emissions | Total Population | |
---|---|---|---|---|---|
0 | 2010 | AZ | 170.454787 | 99.5 | 1600442.0 |
1 | 2010 | CA | 163.137699 | 356.6 | 8730612.0 |
2 | 2010 | FL | 50.056662 | 245.5 | 18845537.0 |
3 | 2010 | IA | 45.970045 | 90.3 | 2075096.0 |
4 | 2010 | KY | 42.195833 | 152.9 | 4348181.0 |
... | ... | ... | ... | ... | ... |
95 | 2018 | NJ | 612.695721 | 105.1 | 569816.0 |
96 | 2018 | NV | 502.801522 | 41.4 | 199220.0 |
97 | 2018 | OR | 18.184902 | 39.6 | 4740945.0 |
98 | 2018 | VT | 39.409573 | 5.9 | 324553.0 |
99 | 2018 | WI | 51.435813 | 101.2 | 2765223.0 |
100 rows × 5 columns
q1_with_emissions['Visits_per_10000'] = q1_with_emissions['Visits per 10,000'].astype(float)
q1_with_emissions['HighCO2'] = q1_with_emissions['Emissions'].apply(lambda x: 1 if float(x) > 100 else 0)
q1_with_emissions
YearStart | LocationAbbr | Visits per 10,000 | Emissions | Total Population | Visits_per_10000 | HighCO2 | |
---|---|---|---|---|---|---|---|
0 | 2010 | AZ | 170.454787 | 99.5 | 1600442.0 | 170.454787 | 0 |
1 | 2010 | CA | 163.137699 | 356.6 | 8730612.0 | 163.137699 | 1 |
2 | 2010 | FL | 50.056662 | 245.5 | 18845537.0 | 50.056662 | 1 |
3 | 2010 | IA | 45.970045 | 90.3 | 2075096.0 | 45.970045 | 0 |
4 | 2010 | KY | 42.195833 | 152.9 | 4348181.0 | 42.195833 | 1 |
... | ... | ... | ... | ... | ... | ... | ... |
95 | 2018 | NJ | 612.695721 | 105.1 | 569816.0 | 612.695721 | 1 |
96 | 2018 | NV | 502.801522 | 41.4 | 199220.0 | 502.801522 | 0 |
97 | 2018 | OR | 18.184902 | 39.6 | 4740945.0 | 18.184902 | 0 |
98 | 2018 | VT | 39.409573 | 5.9 | 324553.0 | 39.409573 | 0 |
99 | 2018 | WI | 51.435813 | 101.2 | 2765223.0 | 51.435813 | 1 |
100 rows × 7 columns
q1_with_emissions.groupby('HighCO2')['Visits per 10,000'].mean()
HighCO2 0 137.124766 1 161.673375 Name: Visits per 10,000, dtype: float64
The mean asthma ER visit rate per 10,000 people is 137.12 for areas with low CO2 levels (HighCO2 = 0) and 161.67 for areas with high CO2 levels (HighCO2 = 1).
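To attach a p-value to this difference in means, a minimal sketch comparing the two groups with a Welch t-test and a non-parametric Mann-Whitney U test, using the q1_with_emissions frame built above:
from scipy.stats import ttest_ind, mannwhitneyu
high = q1_with_emissions.loc[q1_with_emissions['HighCO2'] == 1, 'Visits per 10,000']
low = q1_with_emissions.loc[q1_with_emissions['HighCO2'] == 0, 'Visits per 10,000']
# Welch t-test (does not assume equal variances) and a rank-based alternative.
t_stat, t_p = ttest_ind(high, low, equal_var=False)
u_stat, u_p = mannwhitneyu(high, low, alternative='two-sided')
print(t_p, u_p)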
model = glm(formula='Visits_per_10000 ~ Emissions', data=q1_with_emissions,
family=sm.families.NegativeBinomial()).fit()
model.summary()
Dep. Variable: | Visits_per_10000 | No. Observations: | 100 |
---|---|---|---|
Model: | GLM | Df Residuals: | 5 |
Model Family: | NegativeBinomial | Df Model: | 94 |
Link Function: | Log | Scale: | 1.0000 |
Method: | IRLS | Log-Likelihood: | -522.64 |
Date: | Mon, 06 May 2024 | Deviance: | 0.14064 |
Time: | 11:42:44 | Pearson chi2: | 0.139 |
No. Iterations: | 6 | Pseudo R-squ. (CS): | 0.7822 |
Covariance Type: | nonrobust |
coef | std err | z | P>|z| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 6.6201 | 1.001 | 6.616 | 0.000 | 4.659 | 8.581 |
Emissions[T.101.2] | -2.5520 | 1.229 | -2.077 | 0.038 | -4.960 | -0.144 |
Emissions[T.101.3] | -2.5578 | 1.421 | -1.800 | 0.072 | -5.342 | 0.227 |
Emissions[T.103.7] | -0.3288 | 1.415 | -0.232 | 0.816 | -3.103 | 2.445 |
Emissions[T.104.9] | 0.0038 | 1.415 | 0.003 | 0.998 | -2.770 | 2.777 |
Emissions[T.105.1] | -0.2022 | 1.415 | -0.143 | 0.886 | -2.976 | 2.572 |
Emissions[T.105.6] | 0.0566 | 1.226 | 0.046 | 0.963 | -2.345 | 2.459 |
Emissions[T.116.8] | -3.4385 | 1.429 | -2.406 | 0.016 | -6.240 | -0.637 |
Emissions[T.121.1] | -3.1866 | 1.426 | -2.235 | 0.025 | -5.982 | -0.392 |
Emissions[T.122.8] | -3.4449 | 1.429 | -2.410 | 0.016 | -6.246 | -0.643 |
Emissions[T.124.2] | -3.2337 | 1.427 | -2.267 | 0.023 | -6.030 | -0.438 |
Emissions[T.124.8] | -3.3372 | 1.428 | -2.337 | 0.019 | -6.136 | -0.539 |
Emissions[T.124.9] | -3.1321 | 1.425 | -2.197 | 0.028 | -5.926 | -0.338 |
Emissions[T.126.2] | -3.0551 | 1.425 | -2.144 | 0.032 | -5.847 | -0.263 |
Emissions[T.126.8] | -3.3243 | 1.428 | -2.328 | 0.020 | -6.123 | -0.526 |
Emissions[T.129.3] | -2.9507 | 1.424 | -2.073 | 0.038 | -5.741 | -0.160 |
Emissions[T.131.3] | -3.4277 | 1.429 | -2.398 | 0.016 | -6.229 | -0.627 |
Emissions[T.139.7] | -3.1064 | 1.425 | -2.180 | 0.029 | -5.900 | -0.313 |
Emissions[T.140.4] | -3.0914 | 1.425 | -2.169 | 0.030 | -5.884 | -0.298 |
Emissions[T.141.8] | -2.9883 | 1.424 | -2.099 | 0.036 | -5.779 | -0.197 |
Emissions[T.141.9] | -2.9680 | 1.424 | -2.085 | 0.037 | -5.759 | -0.177 |
Emissions[T.147.2] | -2.8727 | 1.423 | -2.019 | 0.044 | -5.662 | -0.084 |
Emissions[T.148.1] | -2.8515 | 1.423 | -2.004 | 0.045 | -5.640 | -0.063 |
Emissions[T.152.9] | -2.8777 | 1.423 | -2.022 | 0.043 | -5.667 | -0.089 |
Emissions[T.169.1] | -1.4366 | 1.417 | -1.014 | 0.311 | -4.213 | 1.340 |
Emissions[T.170.2] | -1.5571 | 1.417 | -1.099 | 0.272 | -4.334 | 1.220 |
Emissions[T.185.0] | -1.2245 | 1.416 | -0.865 | 0.387 | -4.000 | 1.551 |
Emissions[T.227.0] | -2.8580 | 1.423 | -2.009 | 0.045 | -5.647 | -0.069 |
Emissions[T.233.0] | -2.8318 | 1.423 | -1.991 | 0.047 | -5.620 | -0.043 |
Emissions[T.237.4] | -3.1750 | 1.426 | -2.227 | 0.026 | -5.970 | -0.380 |
Emissions[T.238.4] | -3.0385 | 1.424 | -2.133 | 0.033 | -5.830 | -0.247 |
Emissions[T.238.8] | -2.9863 | 1.424 | -2.097 | 0.036 | -5.777 | -0.195 |
Emissions[T.242.0] | -3.0309 | 1.424 | -2.128 | 0.033 | -5.823 | -0.239 |
Emissions[T.245.5] | -2.7069 | 1.422 | -1.904 | 0.057 | -5.493 | 0.080 |
Emissions[T.356.6] | -1.5255 | 1.417 | -1.077 | 0.282 | -4.302 | 1.252 |
Emissions[T.38.5] | -0.6429 | 1.416 | -0.454 | 0.650 | -3.417 | 2.132 |
Emissions[T.38.8] | -3.7357 | 1.434 | -2.605 | 0.009 | -6.547 | -0.925 |
Emissions[T.39.6] | -3.7195 | 1.434 | -2.594 | 0.009 | -6.530 | -0.909 |
Emissions[T.39.7] | -0.4369 | 1.415 | -0.309 | 0.758 | -3.211 | 2.337 |
Emissions[T.40.0] | -0.4278 | 1.415 | -0.302 | 0.762 | -3.202 | 2.346 |
Emissions[T.40.6] | -0.4089 | 1.415 | -0.289 | 0.773 | -3.183 | 2.365 |
Emissions[T.41.4] | -0.3999 | 1.415 | -0.283 | 0.778 | -3.174 | 2.374 |
Emissions[T.47.9] | -3.5861 | 1.432 | -2.505 | 0.012 | -6.392 | -0.780 |
Emissions[T.48.4] | -3.5286 | 1.431 | -2.466 | 0.014 | -6.333 | -0.725 |
Emissions[T.5.7] | -2.6089 | 1.421 | -1.836 | 0.066 | -5.394 | 0.176 |
Emissions[T.5.9] | -2.7789 | 1.158 | -2.399 | 0.016 | -5.049 | -0.509 |
Emissions[T.50.0] | -3.2140 | 1.426 | -2.253 | 0.024 | -6.010 | -0.418 |
Emissions[T.50.7] | -3.6531 | 1.433 | -2.550 | 0.011 | -6.461 | -0.845 |
Emissions[T.52.2] | -3.3499 | 1.428 | -2.346 | 0.019 | -6.149 | -0.551 |
Emissions[T.52.4] | -3.6037 | 1.432 | -2.517 | 0.012 | -6.410 | -0.797 |
Emissions[T.53.5] | -3.4622 | 1.430 | -2.422 | 0.015 | -6.264 | -0.660 |
Emissions[T.55.0] | -0.5485 | 1.416 | -0.387 | 0.698 | -3.323 | 2.226 |
Emissions[T.59.0] | -3.9945 | 1.440 | -2.774 | 0.006 | -6.817 | -1.172 |
Emissions[T.59.1] | -3.5397 | 1.431 | -2.474 | 0.013 | -6.344 | -0.735 |
Emissions[T.6.1] | -2.9055 | 1.423 | -2.041 | 0.041 | -5.695 | -0.116 |
Emissions[T.60.5] | -0.4582 | 1.415 | -0.324 | 0.746 | -3.232 | 2.316 |
Emissions[T.61.7] | -0.6058 | 1.416 | -0.428 | 0.669 | -3.380 | 2.169 |
Emissions[T.62.1] | -3.3531 | 1.428 | -2.348 | 0.019 | -6.152 | -0.554 |
Emissions[T.62.3] | -0.6455 | 1.416 | -0.456 | 0.648 | -3.420 | 2.129 |
Emissions[T.64.0] | -3.6672 | 1.433 | -2.559 | 0.010 | -6.476 | -0.859 |
Emissions[T.64.2] | -0.2563 | 1.415 | -0.181 | 0.856 | -3.030 | 2.518 |
Emissions[T.65.3] | -3.7985 | 1.436 | -2.646 | 0.008 | -6.612 | -0.985 |
Emissions[T.66.7] | -3.7815 | 1.435 | -2.635 | 0.008 | -6.594 | -0.969 |
Emissions[T.68.9] | -3.1877 | 1.426 | -2.235 | 0.025 | -5.983 | -0.393 |
Emissions[T.70.0] | -3.1589 | 1.426 | -2.216 | 0.027 | -5.953 | -0.364 |
Emissions[T.70.2] | -2.8723 | 1.423 | -2.019 | 0.044 | -5.661 | -0.083 |
Emissions[T.70.8] | -3.4459 | 1.429 | -2.411 | 0.016 | -6.247 | -0.644 |
Emissions[T.71.4] | -0.0615 | 1.415 | -0.043 | 0.965 | -2.835 | 2.712 |
Emissions[T.72.8] | -3.1318 | 1.425 | -2.197 | 0.028 | -5.926 | -0.338 |
Emissions[T.73.4] | -3.2775 | 1.427 | -2.297 | 0.022 | -6.075 | -0.480 |
Emissions[T.74.0] | -3.1609 | 1.426 | -2.217 | 0.027 | -5.955 | -0.366 |
Emissions[T.76.3] | -3.1800 | 1.426 | -2.230 | 0.026 | -5.975 | -0.385 |
Emissions[T.77.4] | -3.2732 | 1.427 | -2.294 | 0.022 | -6.070 | -0.476 |
Emissions[T.82.4] | -3.1237 | 1.231 | -2.537 | 0.011 | -5.537 | -0.710 |
Emissions[T.83.4] | -2.9730 | 1.424 | -2.088 | 0.037 | -5.764 | -0.182 |
Emissions[T.85.9] | -3.0689 | 1.425 | -2.154 | 0.031 | -5.861 | -0.276 |
Emissions[T.89.3] | -2.4597 | 1.420 | -1.732 | 0.083 | -5.243 | 0.324 |
Emissions[T.90.1] | -1.7211 | 1.417 | -1.214 | 0.225 | -4.499 | 1.057 |
Emissions[T.90.3] | -2.7921 | 1.422 | -1.963 | 0.050 | -5.580 | -0.004 |
Emissions[T.90.5] | -1.9065 | 1.418 | -1.345 | 0.179 | -4.685 | 0.872 |
Emissions[T.90.9] | -1.8380 | 1.418 | -1.296 | 0.195 | -4.616 | 0.941 |
Emissions[T.91.1] | -2.2623 | 1.419 | -1.594 | 0.111 | -5.044 | 0.519 |
Emissions[T.91.3] | -2.2711 | 1.419 | -1.600 | 0.110 | -5.053 | 0.511 |
Emissions[T.94.1] | -1.9273 | 1.418 | -1.359 | 0.174 | -4.706 | 0.852 |
Emissions[T.94.8] | -2.2939 | 1.419 | -1.616 | 0.106 | -5.076 | 0.488 |
Emissions[T.95.0] | -2.0448 | 1.418 | -1.442 | 0.149 | -4.825 | 0.735 |
Emissions[T.95.6] | -2.6331 | 1.421 | -1.853 | 0.064 | -5.419 | 0.153 |
Emissions[T.95.9] | -2.0742 | 1.418 | -1.462 | 0.144 | -4.854 | 0.706 |
Emissions[T.97.3] | -1.7193 | 1.417 | -1.213 | 0.225 | -4.497 | 1.059 |
Emissions[T.98.6] | -2.6429 | 1.421 | -1.860 | 0.063 | -5.429 | 0.143 |
Emissions[T.99.0] | -2.3017 | 1.419 | -1.622 | 0.105 | -5.084 | 0.480 |
Emissions[T.99.3] | -1.7523 | 1.417 | -1.236 | 0.216 | -4.530 | 1.026 |
Emissions[T.99.4] | -0.1697 | 1.415 | -0.120 | 0.905 | -2.944 | 2.604 |
Emissions[T.99.5] | -1.4816 | 1.417 | -1.046 | 0.296 | -4.258 | 1.295 |
Emissions[T.99.9] | -2.7667 | 1.422 | -1.945 | 0.052 | -5.554 | 0.021 |
The summary of the generalized linear model (GLM) with a Negative Binomial family shows the following key results for modeling the relationship between Emissions and Visits_per_10000 (asthma-related emergency room visits per 10,000 people). Note that in the summary above, Emissions entered the model as a categorical variable (one dummy term per distinct value, visible as Emissions[T.xxx]) because the column holds strings; the interpretation below refers to a fit with Emissions treated as a single numeric predictor.
Intercept: The coefficient for the intercept is 5.2932 with a standard error of 0.197, which is statistically significant (p-value < 0.0001). This suggests a high baseline rate of asthma-related ER visits per 10,000 people when emissions are at zero.
Emissions: The coefficient for Emissions is -0.0033 with a standard error of 0.002. The p-value for this coefficient is 0.055, which is marginally above the conventional 0.05 threshold for statistical significance. This indicates a negative relationship between emissions and asthma-related ER visits per 10,000 people, albeit with only marginal statistical significance.
The model uses the log link function and converged after 6 iterations.
The pseudo R-squared value (Cox-Snell) is 0.03222, suggesting that the model explains only a small portion of the variance in asthma-related ER visits per 10,000 people.
Given the marginal p-value for Emissions, the data suggest a slight negative relationship between emissions and asthma-related ER visits per 10,000 people, though this result is not statistically significant at the 0.05 level. This analysis implies that higher emissions are not associated with an increase in asthma-related ER visits per 10,000 people, contrary to what might be expected.
p_values = model.pvalues
print(p_values)
Intercept 3.699265e-11 Emissions[T.101.2] 3.781367e-02 Emissions[T.101.3] 7.180858e-02 Emissions[T.103.7] 8.162721e-01 Emissions[T.104.9] 9.978746e-01 ... Emissions[T.99.0] 1.048848e-01 Emissions[T.99.3] 2.163504e-01 Emissions[T.99.4] 9.045280e-01 Emissions[T.99.5] 2.956719e-01 Emissions[T.99.9] 5.172320e-02 Length: 95, dtype: float64
We are using Negative binomial regression for this test because negative binomial regression is suitable for modeling count data, especially when the data exhibit overdispersion. Overdispersion occurs when the variance of the dependent variable is larger than the mean, which is common in count data related to events or occurrences in a given time frame or space.
In the context of asthma-related emergency room visits per 10,000 people, a few key points make negative binomial regression a good choice:
Count Data: The outcome variable, asthma-related ER visits, is a count of occurrences, fitting the general use case for negative binomial regression.
Overdispersion: Asthma ER visits are likely to exhibit overdispersion due to the variability in asthma prevalence across different populations, environmental factors affecting asthma severity, and other unmeasured confounders. The negative binomial model can handle this overdispersion by introducing an additional parameter to account for the extra variability in the data.
Zeros and Variability: The dataset might contain many zeros (e.g., areas with no asthma ER visits) and high variability in counts, which could be poorly modeled by simpler count models like Poisson regression. The negative binomial model is more flexible in handling such data.
Predictors and Exposure: Negative binomial regression allows for the inclusion of predictors (such as emissions in this case) and can model the rate of occurrences in relation to these predictors while accounting for exposure (population size).
By using negative binomial regression, we can more accurately estimate the relationship between emissions (or CO2 levels) and asthma-related ER visits while accounting for the inherent variability and overdispersion in the data.
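For comparison, a minimal sketch of the same Negative Binomial regression with Emissions cast to a numeric predictor, so it enters as a single slope rather than one dummy per distinct value; column names follow the q1_with_emissions frame built above.
import statsmodels.api as sm
nb_data = q1_with_emissions.copy()
nb_data['Emissions_num'] = nb_data['Emissions'].astype(float)
X = sm.add_constant(nb_data[['Emissions_num']])
nb_numeric = sm.GLM(nb_data['Visits_per_10000'], X,
                    family=sm.families.NegativeBinomial()).fit()
print(nb_numeric.summary())
# If raw visit counts were modeled instead of rates, exposure could be handled by
# passing offset=np.log(nb_data['Total Population']) to sm.GLM.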
# Fit a Poisson regression model
poisson_model = GLM(q1_with_emissions['Visits_per_10000'].astype(float), sm.add_constant(q1_with_emissions['Emissions'].astype(float)),
family=families.Poisson()).fit()
# Fit a Negative Binomial regression model (already fitted and stored in `model` variable)
negative_binomial_model = model
# Calculate the test statistic
lr_stat = 2 * (negative_binomial_model.llf - poisson_model.llf)
# Degrees of freedom for the test (difference in the number of parameters)
df = negative_binomial_model.df_model - poisson_model.df_model
lr_stat
20922.232638114958
The likelihood ratio test statistic is approximately 20922.23, indicating that the Negative Binomial model provides a significantly better fit to this data than the Poisson model.
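A minimal follow-up sketch converting the likelihood-ratio statistic above into a p-value using a chi-squared reference distribution, with df equal to the difference in model degrees of freedom computed above; because the two fits also differ in how Emissions is encoded, this is only a rough comparison.
from scipy.stats import chi2
# lr_stat and df were computed in the cell above.
lr_p_value = chi2.sf(lr_stat, df)
print(lr_p_value)  # essentially zero, consistent with preferring the Negative Binomial fit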
Global_P_Values = []
# Load up asthma dataset
asthma_df = pd.read_csv('U.S._Chronic_Disease_Indicators__CDI___2023_Release_20240417.csv', low_memory=False)
asthma_data = asthma_df[asthma_df['Topic'] == 'Asthma']
asthma_data = asthma_data[asthma_data['Question'] == 'Emergency department visit rate for asthma']
asthma_data = asthma_data[asthma_data['StratificationCategory1'] == 'Gender']
asthma_data['DataValue'] = pd.to_numeric(asthma_data['DataValue'], errors='coerce')
asthma_data.dropna(subset=['DataValue'], inplace=True)
asthma_state_yearly = asthma_data.groupby(['YearStart', 'LocationAbbr', 'Stratification1'])['DataValue'].mean().reset_index()
df_population = pd.read_csv('Population by Age and Sex - EDDs.csv')
df_population_summary = df_population.groupby(['Year', 'State'])['Total Population'].sum().reset_index()
asthma_population_merged = pd.merge(asthma_state_yearly, df_population_summary, left_on=['YearStart', 'LocationAbbr'], right_on=['Year', 'State'], how='inner')
state_names = {
'AL': 'Alabama',
'AK': 'Alaska',
'AZ': 'Arizona',
'AR': 'Arkansas',
'CA': 'California',
'CO': 'Colorado',
'CT': 'Connecticut',
'DE': 'Delaware',
'FL': 'Florida',
'GA': 'Georgia',
'HI': 'Hawaii',
'ID': 'Idaho',
'IL': 'Illinois',
'IN': 'Indiana',
'IA': 'Iowa',
'KS': 'Kansas',
'KY': 'Kentucky',
'LA': 'Louisiana',
'ME': 'Maine',
'MD': 'Maryland',
'MA': 'Massachusetts',
'MI': 'Michigan',
'MN': 'Minnesota',
'MS': 'Mississippi',
'MO': 'Missouri',
'MT': 'Montana',
'NE': 'Nebraska',
'NV': 'Nevada',
'NH': 'New Hampshire',
'NJ': 'New Jersey',
'NM': 'New Mexico',
'NY': 'New York',
'NC': 'North Carolina',
'ND': 'North Dakota',
'OH': 'Ohio',
'OK': 'Oklahoma',
'OR': 'Oregon',
'PA': 'Pennsylvania',
'RI': 'Rhode Island',
'SC': 'South Carolina',
'SD': 'South Dakota',
'TN': 'Tennessee',
'TX': 'Texas',
'UT': 'Utah',
'VT': 'Vermont',
'VA': 'Virginia',
'WA': 'Washington',
'WV': 'West Virginia',
'WI': 'Wisconsin',
'WY': 'Wyoming'
}
asthma_population_merged['State Name'] = asthma_population_merged['State'].map(state_names)
df_male_population_summary = df_population.groupby(['Year', 'State'])['Male Population'].sum().reset_index()
df_female_population_summary = df_population.groupby(['Year', 'State'])['Female Population'].sum().reset_index()
asthma_population_merged = pd.merge(asthma_population_merged, df_male_population_summary, left_on=['YearStart', 'LocationAbbr'], right_on=['Year', 'State'], how='left', suffixes=('', '_male_pop'))
asthma_population_merged = pd.merge(asthma_population_merged, df_female_population_summary, left_on=['YearStart', 'LocationAbbr'], right_on=['Year', 'State'], how='left', suffixes=('', '_female_pop'))
emmissions_df = pd.read_csv('table1.csv')
emmissions_long = emmissions_df.melt(id_vars=['State'], var_name='Year', value_name='Emissions')
emmissions_long = emmissions_long[emmissions_long['Year'].apply(lambda x: x.isnumeric())]
emmissions_long['Year'] = pd.to_numeric(emmissions_long['Year'], errors='coerce')
#Merge asthma and emissions together as well as Male and Female Population
merged_df = pd.merge(asthma_population_merged, emmissions_long, left_on=['YearStart', 'State Name'], right_on=['Year', 'State'], how='left')
merged_df.drop(columns=['State_y'], inplace=True)
merged_df.rename(columns={'State_x': 'State'}, inplace=True)
#Normalize visits per 10,000 based on the Male and Female population
merged_df['Visits per 10,000'] = merged_df.apply(lambda x: (x['DataValue'] / x['Male Population'] * 10000) if x['Stratification1'] == 'Male' else (x['DataValue'] / x['Female Population'] * 10000), axis=1)
merged_df.head()
randomforest_df = merged_df.copy()
# Compare emissions with the male and female population per state
q2_with_emissions = merged_df.groupby(['YearStart', 'LocationAbbr'])[['Visits per 10,000', "Emissions", 'Male Population', 'Female Population', 'Total Population']].aggregate({'Visits per 10,000': 'sum',
'Emissions': 'first', 'Male Population': 'first', 'Female Population': 'first', 'Total Population': 'first'}).reset_index()
q2_with_emissions
YearStart | LocationAbbr | Visits per 10,000 | Emissions | Male Population | Female Population | Total Population | |
---|---|---|---|---|---|---|---|
0 | 2010 | AZ | 170.454787 | 99.5 | 809747.0 | 790763.0 | 1600442.0 |
1 | 2010 | CA | 163.137699 | 356.6 | 4327961.0 | 4402785.0 | 8730612.0 |
2 | 2010 | FL | 50.056662 | 245.5 | 9212261.0 | 9633882.0 | 18845537.0 |
3 | 2010 | IA | 45.970045 | 90.3 | 1028732.0 | 1046403.0 | 2075096.0 |
4 | 2010 | KY | 42.195833 | 152.9 | 2139544.0 | 2208920.0 | 4348181.0 |
... | ... | ... | ... | ... | ... | ... | ... |
95 | 2018 | NJ | 612.695721 | 105.1 | 280183.0 | 289465.0 | 569816.0 |
96 | 2018 | NV | 502.801522 | 41.4 | 102406.0 | 97065.0 | 199220.0 |
97 | 2018 | OR | 18.184902 | 39.6 | 2349412.0 | 2393245.0 | 4740945.0 |
98 | 2018 | VT | 39.409573 | 5.9 | 161208.0 | 163385.0 | 324553.0 |
99 | 2018 | WI | 51.435813 | 101.2 | 1389447.0 | 1375660.0 | 2765223.0 |
100 rows × 7 columns
q2_with_emissions['MaleVisits'] = q2_with_emissions['Visits per 10,000'] * q2_with_emissions['Male Population'] / 10000
q2_with_emissions['FemaleVisits'] = q2_with_emissions['Visits per 10,000'] * q2_with_emissions['Female Population'] / 10000
# Sum over the years
male_df = q2_with_emissions.groupby('YearStart')['MaleVisits'].sum().reset_index()
male_df['Gender'] = 'Male'
female_df = q2_with_emissions.groupby('YearStart')['FemaleVisits'].sum().reset_index()
female_df['Gender'] = 'Female'
female_df.rename(columns={'FemaleVisits': 'Visits'}, inplace=True)
male_df.rename(columns={'MaleVisits': 'Visits'}, inplace=True)
df_long = pd.concat([male_df, female_df], ignore_index=True)
# Calculate the average emissions for each year from the q2_with_emissions dataframe and add as a new column to df_long
df_long = pd.merge(df_long, q2_with_emissions.groupby('YearStart')['Emissions'].apply(lambda x: pd.to_numeric(x, errors='coerce').mean()).reset_index(name='Avg Emissions'), on='YearStart', how='left')
# Show the resultant long format dataframe along with average emissions
df_long = df_long.rename(columns={'Avg Emissions': 'Emissions', 'YearStart': 'Year'})
df_long['Gender'] = df_long.apply(lambda x: 1 if x['Gender'] == 'Male' else 0, axis=1)
df_long['log_Emissions'] = np.log(df_long['Emissions'] + 1) # +1 to avoid log(0)
df_long
print(df_long)
Year Visits Gender Emissions log_Emissions 0 2010 322708.294534 1 130.407143 4.878300 1 2013 185789.007081 1 101.208333 4.627013 2 2014 174006.390412 1 95.373333 4.568230 3 2015 123144.271345 1 87.706667 4.485335 4 2016 182060.619481 1 95.342857 4.567913 5 2017 148108.477646 1 96.064286 4.575373 6 2018 153321.280638 1 91.900000 4.531524 7 2010 332200.036153 0 130.407143 4.878300 8 2013 191478.623630 0 101.208333 4.627013 9 2014 178663.803068 0 95.373333 4.568230 10 2015 126609.237267 0 87.706667 4.485335 11 2016 186721.374378 0 95.342857 4.567913 12 2017 152738.326770 0 96.064286 4.575373 13 2018 157127.818562 0 91.900000 4.531524
# Take log emissions and log of visits to linearize the data
# Gender is either 0 or 1. 1 represents male and 0 represents female
model2 = sm.OLS(np.log(df_long['Visits'] + 1), df_long[['log_Emissions', 'Gender']]).fit()
model2.summary()
C:\Users\tranh\anaconda3\lib\site-packages\scipy\stats\_stats_py.py:1736: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=14 warnings.warn("kurtosistest only valid for n>=20 ... continuing "
Dep. Variable: | Visits | R-squared (uncentered): | 1.000 |
---|---|---|---|
Model: | OLS | Adj. R-squared (uncentered): | 1.000 |
Method: | Least Squares | F-statistic: | 1.232e+05 |
Date: | Mon, 06 May 2024 | Prob (F-statistic): | 1.33e-26 |
Time: | 11:42:47 | Log-Likelihood: | 14.742 |
No. Observations: | 14 | AIC: | -25.48 |
Df Residuals: | 12 | BIC: | -24.21 |
Df Model: | 2 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
log_Emissions | 2.6290 | 0.007 | 351.492 | 0.000 | 2.613 | 2.645 |
Gender | -0.0256 | 0.049 | -0.524 | 0.610 | -0.132 | 0.081 |
Omnibus: | 3.131 | Durbin-Watson: | 3.013 |
---|---|---|---|
Prob(Omnibus): | 0.209 | Jarque-Bera (JB): | 1.124 |
Skew: | 0.068 | Prob(JB): | 0.570 |
Kurtosis: | 1.619 | Cond. No. | 9.32 |
# We use these OLS p-values instead of the negative binomial p-values because OLS fits the data better
print(model2.pvalues)
Global_P_Values.extend([1.89e-25, 6.09e-01])
log_Emissions 1.893154e-25 Gender 6.095121e-01 dtype: float64
The OLS model regresses the log of visits (plus one) on log emissions and a gender indicator. The key findings from the analysis are summarized as follows:
Log_Emissions: The coefficient for emissions is 2.629, indicating a positive relationship between emissions and the number of emergency visits for asthma. This relationship is statistically significant, with a p-value of less than 0.001. This suggests that as emissions increase, the number of asthma-related emergency visits also tends to increase.
Gender: The coefficient for gender is -0.0256, but it is not statistically significant (p-value = 0.610). This implies that within the context of this model, there is no evidence to suggest a significant difference in the impact of emissions on asthma-related emergency visits between males and females.
Model Fit: The uncentered R-squared is 1.000, indicating that the model reproduces the log-visit values almost exactly; note that the uncentered R-squared is reported because the model has no intercept, and it tends to overstate fit.
Statistical Significance: The F-statistic is 1.232e+05 with a Prob (F-statistic) of approximately 1.33e-26, suggesting that the overall model is statistically significant.
Residuals Analysis: The Durbin-Watson statistic is 3.013. A value close to 2 suggests no autocorrelation, and values deviating from 2 can indicate positive or negative autocorrelation; a value above 2, as here, hints at mild negative autocorrelation. The Omnibus and Jarque-Bera tests (p = 0.209 and 0.570, respectively) do not indicate a significant departure of the residuals from normality.
In summary, the analysis indicates a significant positive effect of emissions on asthma-related emergency visits, but no significant effect of gender on these visits according to the data and model used in this analysis.
model2_negbin = glm(formula="Visits ~ Emissions * Gender", data=df_long, family=sm.families.NegativeBinomial()).fit()
model2_negbin.summary()
Dep. Variable: | Visits | No. Observations: | 14 |
---|---|---|---|
Model: | GLM | Df Residuals: | 10 |
Model Family: | NegativeBinomial | Df Model: | 3 |
Link Function: | Log | Scale: | 1.0000 |
Method: | IRLS | Log-Likelihood: | -183.36 |
Date: | Mon, 06 May 2024 | Deviance: | 0.087632 |
Time: | 11:42:47 | Pearson chi2: | 0.0871 |
No. Iterations: | 4 | Pseudo R-squ. (CS): | 0.07725 |
Covariance Type: | nonrobust |
coef | std err | z | P>|z| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 10.0743 | 2.902 | 3.471 | 0.001 | 4.386 | 15.763 |
Emissions | 0.0204 | 0.029 | 0.708 | 0.479 | -0.036 | 0.077 |
Gender | -0.0212 | 4.104 | -0.005 | 0.996 | -8.066 | 8.023 |
Emissions:Gender | -6.414e-05 | 0.041 | -0.002 | 0.999 | -0.080 | 0.080 |
This suggests that the data does not provide strong evidence to reject the null hypothesis that there is no difference in the impact of high CO2 levels on asthma-related emergency visits between males and females, according to the coefficients related to Emissions and Gender.
print(model2.pvalues)
log_Emissions 1.893154e-25 Gender 6.095121e-01 dtype: float64
Emissions:Gender is the interaction term between Emissions and Gender in the model, indicating how the effect of Emissions on Visits changes depending on the Gender.
The term Emissions:Gender in the context of the regression model is an interaction term. An interaction term in a statistical model is used to determine if the effect of one independent variable on the dependent variable depends on the level of another independent variable. This tests whether the relationship between one predictor (in this case, "Emissions") and the response variable ("Visits") changes across the levels of another predictor (in this case, "Gender").
The inclusion of an interaction term allows the model to account for the possibility that the impact of emissions on the number of visits may be different for males and females. If the interaction term is statistically significant, it suggests that the effect of emissions on visits is not consistent across genders; the impact varies depending on whether the individual is male or female.
In the regression model output, the coefficient of the interaction term (Emissions:Gender) indicates the magnitude and direction of this interaction effect. A statistically significant coefficient for the interaction term would suggest that the relationship between emissions and visits is significantly different for males and females. Based on the provided output, the interaction term was not statistically significant, implying that the data does not provide strong evidence of a differential impact of emissions on asthma-related emergency visits between males and females.
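As a small illustration of how the interaction term enters the fitted model, the sketch below assembles the linear predictor for each gender at a chosen (hypothetical) emissions level using the coefficients from model2_negbin fit above; because of the log link, predictions are exponentiated back to the visit scale.
import numpy as np
params = model2_negbin.params  # Intercept, Emissions, Gender, Emissions:Gender
emissions_level = 100.0        # hypothetical emissions level chosen for illustration
for gender in (0, 1):  # 0 = female, 1 = male
    linpred = (params['Intercept']
               + params['Emissions'] * emissions_level
               + params['Gender'] * gender
               + params['Emissions:Gender'] * emissions_level * gender)
    print(gender, np.exp(linpred))  # predicted visits under the log link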
# Summing visits per 10,000 for each LocationAbbr and YearStart
grouped_visits = merged_df.groupby(['LocationAbbr', 'YearStart', 'Emissions', 'Total Population'])['Visits per 10,000'].sum().reset_index()
grouped_visits['Visits per 10,000'] = grouped_visits['Visits per 10,000'].astype(int)
grouped_visits
LocationAbbr | YearStart | Emissions | Total Population | Visits per 10,000 | |
---|---|---|---|---|---|
0 | AR | 2014 | 68.9 | 2967392.0 | 30 |
1 | AR | 2015 | 59.1 | 2978048.0 | 21 |
2 | AR | 2016 | 62.1 | 2989918.0 | 26 |
3 | AR | 2018 | 70.8 | 3009733.0 | 23 |
4 | AZ | 2010 | 99.5 | 1600442.0 | 170 |
... | ... | ... | ... | ... | ... |
95 | WI | 2014 | 101.2 | 2741489.0 | 65 |
96 | WI | 2015 | 99.9 | 2743124.0 | 47 |
97 | WI | 2016 | 95.6 | 2746333.0 | 53 |
98 | WI | 2017 | 98.6 | 2755404.0 | 53 |
99 | WI | 2018 | 101.2 | 2765223.0 | 51 |
100 rows × 5 columns
tukey = pairwise_tukeyhsd(endog=grouped_visits['Visits per 10,000'], groups=grouped_visits['LocationAbbr'], alpha=0.05)
tukey.summary()
group1 | group2 | meandiff | p-adj | lower | upper | reject |
---|---|---|---|---|---|---|
AR | AZ | 99.2857 | 0.1356 | -11.2941 | 209.8655 | False |
AR | CA | 138.0 | 0.5484 | -59.2484 | 335.2484 | False |
AR | CO | 109.0 | 0.8893 | -88.2484 | 306.2484 | False |
AR | FL | 14.4286 | 1.0 | -96.1512 | 125.0084 | False |
AR | GA | 14.3333 | 1.0 | -120.413 | 149.0797 | False |
AR | IA | 9.0 | 1.0 | -101.5798 | 119.5798 | False |
AR | KY | 4.4286 | 1.0 | -106.1512 | 115.0084 | False |
AR | MA | -0.5 | 1.0 | -153.288 | 152.288 | False |
AR | MD | 474.0 | 0.0 | 360.1186 | 587.8814 | True |
AR | MN | 52.6 | 0.9856 | -65.749 | 170.949 | False |
AR | NC | 8.2857 | 1.0 | -102.2941 | 118.8655 | False |
AR | NE | -2.1429 | 1.0 | -112.7227 | 108.437 | False |
AR | NJ | 670.7143 | 0.0 | 560.1345 | 781.2941 | True |
AR | NV | 448.2 | 0.0 | 329.851 | 566.549 | True |
AR | NY | 160.3333 | 0.0054 | 25.587 | 295.0797 | True |
AR | OR | -7.5 | 1.0 | -160.288 | 145.288 | False |
AR | SC | 8.75 | 1.0 | -116.0009 | 133.5009 | False |
AR | UT | -9.0 | 1.0 | -143.7463 | 125.7463 | False |
AR | VT | 21.8 | 1.0 | -96.549 | 140.149 | False |
AR | WI | 32.4286 | 0.9999 | -78.1512 | 143.0084 | False |
AZ | CA | 38.7143 | 1.0 | -149.8913 | 227.3198 | False |
AZ | CO | 9.7143 | 1.0 | -178.8913 | 198.3198 | False |
AZ | FL | -84.8571 | 0.1333 | -179.1599 | 9.4456 | False |
AZ | GA | -84.9524 | 0.5534 | -206.6967 | 36.792 | False |
AZ | IA | -90.2857 | 0.0773 | -184.5885 | 4.0171 | False |
AZ | KY | -94.8571 | 0.047 | -189.1599 | -0.5544 | True |
AZ | MA | -99.7857 | 0.5327 | -241.2399 | 41.6685 | False |
AZ | MD | 374.7143 | 0.0 | 276.5608 | 472.8677 | True |
AZ | MN | -46.6857 | 0.9827 | -149.9892 | 56.6178 | False |
AZ | NC | -91.0 | 0.0717 | -185.3028 | 3.3028 | False |
AZ | NE | -101.4286 | 0.0217 | -195.7314 | -7.1258 | True |
AZ | NJ | 571.4286 | 0.0 | 477.1258 | 665.7314 | True |
AZ | NV | 348.9143 | 0.0 | 245.6108 | 452.2178 | True |
AZ | NY | 61.0476 | 0.9515 | -60.6967 | 182.792 | False |
AZ | OR | -106.7857 | 0.4033 | -248.2399 | 34.6685 | False |
AZ | SC | -90.5357 | 0.2605 | -201.1155 | 20.0441 | False |
AZ | UT | -108.2857 | 0.1462 | -230.0301 | 13.4586 | False |
AZ | VT | -77.4857 | 0.4154 | -180.7892 | 25.8178 | False |
AZ | WI | -66.8571 | 0.5232 | -161.1599 | 27.4456 | False |
CA | CO | -29.0 | 1.0 | -278.5017 | 220.5017 | False |
CA | FL | -123.5714 | 0.6677 | -312.177 | 65.0341 | False |
CA | GA | -123.6667 | 0.7856 | -327.384 | 80.0506 | False |
CA | IA | -129.0 | 0.5909 | -317.6056 | 59.6056 | False |
CA | KY | -133.5714 | 0.5252 | -322.177 | 55.0341 | False |
CA | MA | -138.5 | 0.7044 | -354.5748 | 77.5748 | False |
CA | MD | 336.0 | 0.0 | 145.4399 | 526.5601 | True |
CA | MN | -85.4 | 0.9865 | -278.6632 | 107.8632 | False |
CA | NC | -129.7143 | 0.5806 | -318.3198 | 58.8913 | False |
CA | NE | -140.1429 | 0.4333 | -328.7484 | 48.4627 | False |
CA | NJ | 532.7143 | 0.0 | 344.1087 | 721.3198 | True |
CA | NV | 310.2 | 0.0 | 116.9368 | 503.4632 | True |
CA | NY | 22.3333 | 1.0 | -181.384 | 226.0506 | False |
CA | OR | -145.5 | 0.6194 | -361.5748 | 70.5748 | False |
CA | SC | -129.25 | 0.6675 | -326.4984 | 67.9984 | False |
CA | UT | -147.0 | 0.4894 | -350.7173 | 56.7173 | False |
CA | VT | -116.2 | 0.7983 | -309.4632 | 77.0632 | False |
CA | WI | -105.5714 | 0.8779 | -294.177 | 83.0341 | False |
CO | FL | -94.5714 | 0.9515 | -283.177 | 94.0341 | False |
CO | GA | -94.6667 | 0.9769 | -298.384 | 109.0506 | False |
CO | IA | -100.0 | 0.9207 | -288.6056 | 88.6056 | False |
CO | KY | -104.5714 | 0.8865 | -293.177 | 84.0341 | False |
CO | MA | -109.5 | 0.9466 | -325.5748 | 106.5748 | False |
CO | MD | 365.0 | 0.0 | 174.4399 | 555.5601 | True |
CO | MN | -56.4 | 0.9999 | -249.6632 | 136.8632 | False |
CO | NC | -100.7143 | 0.9159 | -289.3198 | 87.8913 | False |
CO | NE | -111.1429 | 0.8235 | -299.7484 | 77.4627 | False |
CO | NJ | 561.7143 | 0.0 | 373.1087 | 750.3198 | True |
CO | NV | 339.2 | 0.0 | 145.9368 | 532.4632 | True |
CO | NY | 51.3333 | 1.0 | -152.384 | 255.0506 | False |
CO | OR | -116.5 | 0.909 | -332.5748 | 99.5748 | False |
CO | SC | -100.25 | 0.9451 | -297.4984 | 96.9984 | False |
CO | UT | -118.0 | 0.8432 | -321.7173 | 85.7173 | False |
CO | VT | -87.2 | 0.983 | -280.4632 | 106.0632 | False |
CO | WI | -76.5714 | 0.9949 | -265.177 | 112.0341 | False |
FL | GA | -0.0952 | 1.0 | -121.8396 | 121.6491 | False |
FL | IA | -5.4286 | 1.0 | -99.7314 | 88.8742 | False |
FL | KY | -10.0 | 1.0 | -104.3028 | 84.3028 | False |
FL | MA | -14.9286 | 1.0 | -156.3827 | 126.5256 | False |
FL | MD | 459.5714 | 0.0 | 361.418 | 557.7249 | True |
FL | MN | 38.1714 | 0.9984 | -65.1321 | 141.4749 | False |
FL | NC | -6.1429 | 1.0 | -100.4456 | 88.1599 | False |
FL | NE | -16.5714 | 1.0 | -110.8742 | 77.7314 | False |
FL | NJ | 656.2857 | 0.0 | 561.9829 | 750.5885 | True |
FL | NV | 433.7714 | 0.0 | 330.4679 | 537.0749 | True |
FL | NY | 145.9048 | 0.0049 | 24.1604 | 267.6491 | True |
FL | OR | -21.9286 | 1.0 | -163.3827 | 119.5256 | False |
FL | SC | -5.6786 | 1.0 | -116.2584 | 104.9012 | False |
FL | UT | -23.4286 | 1.0 | -145.1729 | 98.3158 | False |
FL | VT | 7.3714 | 1.0 | -95.9321 | 110.6749 | False |
FL | WI | 18.0 | 1.0 | -76.3028 | 112.3028 | False |
GA | IA | -5.3333 | 1.0 | -127.0777 | 116.411 | False |
GA | KY | -9.9048 | 1.0 | -131.6491 | 111.8396 | False |
GA | MA | -14.8333 | 1.0 | -175.886 | 146.2193 | False |
GA | MD | 459.6667 | 0.0 | 334.9158 | 584.4175 | True |
GA | MN | 38.2667 | 0.9999 | -90.5755 | 167.1088 | False |
GA | NC | -6.0476 | 1.0 | -127.792 | 115.6967 | False |
GA | NE | -16.4762 | 1.0 | -138.2206 | 105.2682 | False |
GA | NJ | 656.381 | 0.0 | 534.6366 | 778.1253 | True |
GA | NV | 433.8667 | 0.0 | 305.0245 | 562.7088 | True |
GA | NY | 146.0 | 0.0433 | 1.9501 | 290.0499 | True |
GA | OR | -21.8333 | 1.0 | -182.886 | 139.2193 | False |
GA | SC | -5.5833 | 1.0 | -140.3297 | 129.163 | False |
GA | UT | -23.3333 | 1.0 | -167.3832 | 120.7165 | False |
GA | VT | 7.4667 | 1.0 | -121.3755 | 136.3088 | False |
GA | WI | 18.0952 | 1.0 | -103.6491 | 139.8396 | False |
IA | KY | -4.5714 | 1.0 | -98.8742 | 89.7314 | False |
IA | MA | -9.5 | 1.0 | -150.9542 | 131.9542 | False |
IA | MD | 465.0 | 0.0 | 366.8466 | 563.1534 | True |
IA | MN | 43.6 | 0.9919 | -59.7035 | 146.9035 | False |
IA | NC | -0.7143 | 1.0 | -95.0171 | 93.5885 | False |
IA | NE | -11.1429 | 1.0 | -105.4456 | 83.1599 | False |
IA | NJ | 661.7143 | 0.0 | 567.4115 | 756.0171 | True |
IA | NV | 439.2 | 0.0 | 335.8965 | 542.5035 | True |
IA | NY | 151.3333 | 0.0027 | 29.589 | 273.0777 | True |
IA | OR | -16.5 | 1.0 | -157.9542 | 124.9542 | False |
IA | SC | -0.25 | 1.0 | -110.8298 | 110.3298 | False |
IA | UT | -18.0 | 1.0 | -139.7444 | 103.7444 | False |
IA | VT | 12.8 | 1.0 | -90.5035 | 116.1035 | False |
IA | WI | 23.4286 | 1.0 | -70.8742 | 117.7314 | False |
KY | MA | -4.9286 | 1.0 | -146.3827 | 136.5256 | False |
KY | MD | 469.5714 | 0.0 | 371.418 | 567.7249 | True |
KY | MN | 48.1714 | 0.9761 | -55.1321 | 151.4749 | False |
KY | NC | 3.8571 | 1.0 | -90.4456 | 98.1599 | False |
KY | NE | -6.5714 | 1.0 | -100.8742 | 87.7314 | False |
KY | NJ | 666.2857 | 0.0 | 571.9829 | 760.5885 | True |
KY | NV | 443.7714 | 0.0 | 340.4679 | 547.0749 | True |
KY | NY | 155.9048 | 0.0017 | 34.1604 | 277.6491 | True |
KY | OR | -11.9286 | 1.0 | -153.3827 | 129.5256 | False |
KY | SC | 4.3214 | 1.0 | -106.2584 | 114.9012 | False |
KY | UT | -13.4286 | 1.0 | -135.1729 | 108.3158 | False |
KY | VT | 17.3714 | 1.0 | -85.9321 | 120.6749 | False |
KY | WI | 28.0 | 0.9999 | -66.3028 | 122.3028 | False |
MA | MD | 474.5 | 0.0 | 330.4501 | 618.5499 | True |
MA | MN | 53.1 | 0.9989 | -94.5072 | 200.7072 | False |
MA | NC | 8.7857 | 1.0 | -132.6685 | 150.2399 | False |
MA | NE | -1.6429 | 1.0 | -143.097 | 139.8113 | False |
MA | NJ | 671.2143 | 0.0 | 529.7601 | 812.6685 | True |
MA | NV | 448.7 | 0.0 | 301.0928 | 596.3072 | True |
MA | NY | 160.8333 | 0.0507 | -0.2193 | 321.886 | False |
MA | OR | -7.0 | 1.0 | -183.4243 | 169.4243 | False |
MA | SC | 9.25 | 1.0 | -143.538 | 162.038 | False |
MA | UT | -8.5 | 1.0 | -169.5527 | 152.5527 | False |
MA | VT | 22.3 | 1.0 | -125.3072 | 169.9072 | False |
MA | WI | 32.9286 | 1.0 | -108.5256 | 174.3827 | False |
MD | MN | -421.4 | 0.0 | -528.2302 | -314.5698 | True |
MD | NC | -465.7143 | 0.0 | -563.8677 | -367.5608 | True |
MD | NE | -476.1429 | 0.0 | -574.2963 | -377.9894 | True |
MD | NJ | 196.7143 | 0.0 | 98.5608 | 294.8677 | True |
MD | NV | -25.8 | 1.0 | -132.6302 | 81.0302 | False |
MD | NY | -313.6667 | 0.0 | -438.4175 | -188.9158 | True |
MD | OR | -481.5 | 0.0 | -625.5499 | -337.4501 | True |
MD | SC | -465.25 | 0.0 | -579.1314 | -351.3686 | True |
MD | UT | -483.0 | 0.0 | -607.7509 | -358.2491 | True |
MD | VT | -452.2 | 0.0 | -559.0302 | -345.3698 | True |
MD | WI | -441.5714 | 0.0 | -539.7249 | -343.418 | True |
MN | NC | -44.3143 | 0.9903 | -147.6178 | 58.9892 | False |
MN | NE | -54.7429 | 0.9211 | -158.0464 | 48.5607 | False |
MN | NJ | 618.1143 | 0.0 | 514.8108 | 721.4178 | True |
MN | NV | 395.6 | 0.0 | 284.0194 | 507.1806 | True |
MN | NY | 107.7333 | 0.2279 | -21.1088 | 236.5755 | False |
MN | OR | -60.1 | 0.9947 | -207.7072 | 87.5072 | False |
MN | SC | -43.85 | 0.9984 | -162.199 | 74.499 | False |
MN | UT | -61.6 | 0.9692 | -190.4421 | 67.2421 | False |
MN | VT | -30.8 | 1.0 | -142.3806 | 80.7806 | False |
MN | WI | -20.1714 | 1.0 | -123.4749 | 83.1321 | False |
NC | NE | -10.4286 | 1.0 | -104.7314 | 83.8742 | False |
NC | NJ | 662.4286 | 0.0 | 568.1258 | 756.7314 | True |
NC | NV | 439.9143 | 0.0 | 336.6108 | 543.2178 | True |
NC | NY | 152.0476 | 0.0025 | 30.3033 | 273.792 | True |
NC | OR | -15.7857 | 1.0 | -157.2399 | 125.6685 | False |
NC | SC | 0.4643 | 1.0 | -110.1155 | 111.0441 | False |
NC | UT | -17.2857 | 1.0 | -139.0301 | 104.4586 | False |
NC | VT | 13.5143 | 1.0 | -89.7892 | 116.8178 | False |
NC | WI | 24.1429 | 1.0 | -70.1599 | 118.4456 | False |
NE | NJ | 672.8571 | 0.0 | 578.5544 | 767.1599 | True |
NE | NV | 450.3429 | 0.0 | 347.0393 | 553.6464 | True |
NE | NY | 162.4762 | 0.0008 | 40.7318 | 284.2206 | True |
NE | OR | -5.3571 | 1.0 | -146.8113 | 136.097 | False |
NE | SC | 10.8929 | 1.0 | -99.687 | 121.4727 | False |
NE | UT | -6.8571 | 1.0 | -128.6015 | 114.8872 | False |
NE | VT | 23.9429 | 1.0 | -79.3607 | 127.2464 | False |
NE | WI | 34.5714 | 0.9986 | -59.7314 | 128.8742 | False |
NJ | NV | -222.5143 | 0.0 | -325.8178 | -119.2108 | True |
NJ | NY | -510.381 | 0.0 | -632.1253 | -388.6366 | True |
NJ | OR | -678.2143 | 0.0 | -819.6685 | -536.7601 | True |
NJ | SC | -661.9643 | 0.0 | -772.5441 | -551.3845 | True |
NJ | UT | -679.7143 | 0.0 | -801.4586 | -557.9699 | True |
NJ | VT | -648.9143 | 0.0 | -752.2178 | -545.6108 | True |
NJ | WI | -638.2857 | 0.0 | -732.5885 | -543.9829 | True |
NV | NY | -287.8667 | 0.0 | -416.7088 | -159.0245 | True |
NV | OR | -455.7 | 0.0 | -603.3072 | -308.0928 | True |
NV | SC | -439.45 | 0.0 | -557.799 | -321.101 | True |
NV | UT | -457.2 | 0.0 | -586.0421 | -328.3579 | True |
NV | VT | -426.4 | 0.0 | -537.9806 | -314.8194 | True |
NV | WI | -415.7714 | 0.0 | -519.0749 | -312.4679 | True |
NY | OR | -167.8333 | 0.0317 | -328.886 | -6.7807 | True |
NY | SC | -151.5833 | 0.0121 | -286.3297 | -16.837 | True |
NY | UT | -169.3333 | 0.0065 | -313.3832 | -25.2835 | True |
NY | VT | -138.5333 | 0.0218 | -267.3755 | -9.6912 | True |
NY | WI | -127.9048 | 0.0288 | -249.6491 | -6.1604 | True |
OR | SC | 16.25 | 1.0 | -136.538 | 169.038 | False |
OR | UT | -1.5 | 1.0 | -162.5527 | 159.5527 | False |
OR | VT | 29.3 | 1.0 | -118.3072 | 176.9072 | False |
OR | WI | 39.9286 | 1.0 | -101.5256 | 181.3827 | False |
SC | UT | -17.75 | 1.0 | -152.4963 | 116.9963 | False |
SC | VT | 13.05 | 1.0 | -105.299 | 131.399 | False |
SC | WI | 23.6786 | 1.0 | -86.9012 | 134.2584 | False |
UT | VT | 30.8 | 1.0 | -98.0421 | 159.6421 | False |
UT | WI | 41.4286 | 0.9995 | -80.3158 | 163.1729 | False |
VT | WI | 10.6286 | 1.0 | -92.6749 | 113.9321 | False |
tukey_results_df = pd.DataFrame(data=tukey._results_table.data[1:], columns=tukey._results_table.data[0])
tukey_results_filter = tukey_results_df[tukey_results_df['p-adj'] != 1]
tukey_results_filter
group1 | group2 | meandiff | p-adj | lower | upper | reject | |
---|---|---|---|---|---|---|---|
0 | AR | AZ | 99.2857 | 0.1356 | -11.2941 | 209.8655 | False |
1 | AR | CA | 138.0000 | 0.5484 | -59.2484 | 335.2484 | False |
2 | AR | CO | 109.0000 | 0.8893 | -88.2484 | 306.2484 | False |
8 | AR | MD | 474.0000 | 0.0000 | 360.1186 | 587.8814 | True |
9 | AR | MN | 52.6000 | 0.9856 | -65.7490 | 170.9490 | False |
... | ... | ... | ... | ... | ... | ... | ... |
196 | NY | SC | -151.5833 | 0.0121 | -286.3297 | -16.8370 | True |
197 | NY | UT | -169.3333 | 0.0065 | -313.3832 | -25.2835 | True |
198 | NY | VT | -138.5333 | 0.0218 | -267.3755 | -9.6912 | True |
199 | NY | WI | -127.9048 | 0.0288 | -249.6491 | -6.1604 | True |
208 | UT | WI | 41.4286 | 0.9995 | -80.3158 | 163.1729 | False |
128 rows × 7 columns
accepted = tukey_results_filter.sort_values(by=['p-adj']).iloc[56:65, :]
accepted
group1 | group2 | meandiff | p-adj | lower | upper | reject | |
---|---|---|---|---|---|---|---|
176 | NE | NY | 162.4762 | 0.0008 | 40.7318 | 284.2206 | True |
126 | KY | NY | 155.9048 | 0.0017 | 34.1604 | 277.6491 | True |
168 | NC | NY | 152.0476 | 0.0025 | 30.3033 | 273.7920 | True |
113 | IA | NY | 151.3333 | 0.0027 | 29.5890 | 273.0777 | True |
84 | FL | NY | 145.9048 | 0.0049 | 24.1604 | 267.6491 | True |
14 | AR | NY | 160.3333 | 0.0054 | 25.5870 | 295.0797 | True |
197 | NY | UT | -169.3333 | 0.0065 | -313.3832 | -25.2835 | True |
196 | NY | SC | -151.5833 | 0.0121 | -286.3297 | -16.8370 | True |
30 | AZ | NE | -101.4286 | 0.0217 | -195.7314 | -7.1258 | True |
tukey_alpha = 0.05
tukey_pvalues_sorted = np.sort(tukey_results_filter['p-adj'])
tukey_m = len(tukey_pvalues_sorted)
tukey_thresholds = np.array([(tukey_k / tukey_m) * tukey_alpha for tukey_k in range(1, tukey_m + 1)])
tukey_k = max(np.where(tukey_pvalues_sorted <= tukey_thresholds)[0] + 1)
plt.figure(figsize=(10, 6))
plt.plot(tukey_pvalues_sorted)
plt.plot(tukey_thresholds, label='BH Boundary')
plt.axvline(x=tukey_k-1, linestyle='--', label=f'Cut-off at k={tukey_k-1}')
plt.xlabel('P-Value Rank')
plt.ylabel('P-Value')
plt.title('Sorted P-Values and BH Decision Boundary')
plt.legend()
plt.show()
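As a sanity check on the hand-rolled BH cut-off above, we could compare against statsmodels' built-in multiple-testing correction; a minimal sketch, assuming the tukey_results_filter frame from the cells above:
from statsmodels.stats.multitest import multipletests
# Cross-check the manual Benjamini-Hochberg procedure against statsmodels' implementation.
# `reject_bh` flags the comparisons that survive FDR control at alpha = 0.05,
# and `pvals_corrected` holds the BH-adjusted p-values.
reject_bh, pvals_corrected, _, _ = multipletests(
    tukey_results_filter['p-adj'], alpha=0.05, method='fdr_bh'
)
print('Comparisons rejected under BH:', reject_bh.sum())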
merged_df['Emissions'] = pd.to_numeric(merged_df['Emissions'], errors='coerce')
carbondioxide_2010_to_2014 = np.mean(merged_df[(merged_df['YearStart'] >= 2010) & (merged_df['YearStart'] <= 2014)]['Emissions'])
carbondioxide_2015_to_2018 = np.mean(merged_df[(merged_df['YearStart'] >= 2015) & (merged_df['YearStart'] <= 2018)]['Emissions'])
asthma_rates_2010_to_2014 = np.mean(merged_df[(merged_df['YearStart'] >= 2010) & (merged_df['YearStart'] <= 2014)]["Visits per 10,000"])
asthma_rates_2015_to_2018 = np.mean(merged_df[(merged_df['YearStart'] >= 2015) & (merged_df['YearStart'] <= 2018)]["Visits per 10,000"])
t_test_result_asthma = ttest_ind(
merged_df[(merged_df['YearStart'] >= 2010) & (merged_df['YearStart'] <= 2014)]["Visits per 10,000"],
merged_df[(merged_df['YearStart'] >= 2015) & (merged_df['YearStart'] <= 2018)]["Visits per 10,000"],
nan_policy='omit'
)
print(f"2010-2014 Mean Carbon Dioxide Emissions: {carbondioxide_2010_to_2014:.2f}")
print(f"2015-2018 Mean Carbon Dioxide Emissions: {carbondioxide_2015_to_2018:.2f}")
print(f"2010-2014 Mean Asthma Visit Rates: {asthma_rates_2010_to_2014:.2f}")
print(f"2015-2018 Mean Asthma Visit Rates: {asthma_rates_2015_to_2018:.2f}")
print("T-test result for asthma based on CO2 levels:", t_test_result_asthma)
2010-2014 Mean Carbon Dioxide Emissions: 109.04
2015-2018 Mean Carbon Dioxide Emissions: 92.64
2010-2014 Mean Asthma Visit Rates: 83.80
2015-2018 Mean Asthma Visit Rates: 65.67
T-test result for asthma based on CO2 levels: Ttest_indResult(statistic=1.1893813840193959, pvalue=0.23571396070452827)
The t-test works for this analysis because it is designed to compare the means of two independent samples. In this case, the two samples are the asthma visit rates from 2010 to 2014 and from 2015 to 2018, two periods with different mean CO2 emission levels. The t-test is appropriate for answering whether there is a statistically significant difference between the mean visit rates of these two periods, assuming an approximately normal distribution of visit rates within each period or large enough sample sizes for the Central Limit Theorem to apply.
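Since this argument leans on approximate normality (or the Central Limit Theorem), it is cheap to also run Welch's version of the test, which drops the equal-variance assumption; a minimal sketch using the same merged_df columns as above:
from scipy.stats import ttest_ind
# Asthma visit rates for the two periods, as used in the pooled t-test above.
early = merged_df[(merged_df['YearStart'] >= 2010) & (merged_df['YearStart'] <= 2014)]["Visits per 10,000"]
late = merged_df[(merged_df['YearStart'] >= 2015) & (merged_df['YearStart'] <= 2018)]["Visits per 10,000"]
# Welch's t-test (equal_var=False) does not assume the two groups share a variance,
# which is a safer default when the group sizes differ (82 vs. 118 here).
welch_result = ttest_ind(early, late, equal_var=False, nan_policy='omit')
print("Welch t-test:", welch_result)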
n1 = (len(merged_df[(merged_df['YearStart'] >= 2010) & (merged_df['YearStart'] <= 2014)]["Visits per 10,000"]))
n2 = len(merged_df[(merged_df['YearStart'] >= 2015) & (merged_df['YearStart'] <= 2018)]["Visits per 10,000"])
print(n1, n2, 2)
82 118 2
# define parameters for the test power analysis
#83.80 is the mean asthma visit rates for 2010-2014
#65.67 is the mean asthma visit rates for 2015-2018
#14 is the number of observations in each group
#198 is by taking 82 + 118 - 2
effect_size = (83.80 - 65.67) / np.sqrt(((14 * 83.80**2) + (14 * 65.67**2)) / 198)
sample_size = 120
alpha = 0.05
power = TTestIndPower().solve_power(effect_size, nobs1=sample_size, alpha=alpha, ratio=1, alternative='two-sided')
print(f"Effect Size: {effect_size:.4f}")
print(f"Power of the test: {power:.4f}")
Effect Size: 0.6404
Power of the test: 0.9986
The power of a t-test is the probability that the test will correctly reject a false null hypothesis (i.e., that it will not make a Type II error). The parameters involved in calculating the power of a t-test include the effect size, the sample size in each group, and the significance level (alpha).
The results from the power analysis are as follows:
Effect Size: 0.6404, indicating a medium to large effect size based on common benchmarks. This means the difference in mean asthma visit rates between the two periods is moderately large relative to the variability in the data.
Power of the test: 0.9986, which is above the commonly accepted threshold of 0.8. This means there is a 99.86% chance of correctly rejecting the null hypothesis if it is false.
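Using the same TTestIndPower helper the other way around, we can also ask how many observations per group would be needed to hit the conventional 80% power at this effect size; a minimal sketch reusing the effect_size computed above (the 80% target is just the usual benchmark, not something fixed by our data):
from statsmodels.stats.power import TTestIndPower
# Solve for the per-group sample size required to reach 80% power
# at the effect size estimated above (nobs1 is left unspecified so it is the unknown).
required_n = TTestIndPower().solve_power(
    effect_size=effect_size, power=0.8, alpha=0.05, ratio=1, alternative='two-sided'
)
print(f"Observations per group needed for 80% power: {required_n:.1f}")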
# State altitude (source https://simple.wikipedia.org/wiki/List_of_U.S._states_by_elevation)
# Data from Wikipedia
data = {
'State': ['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado', 'Connecticut', 'Delaware', 'District of Columbia', 'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire', 'New Jersey', 'New Mexico', 'New York', 'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania', 'Rhode Island', 'South Carolina', 'South Dakota', 'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington', 'West Virginia', 'Wisconsin', 'Wyoming'],
'Highest Point': ['Cheaha Mountain', 'Mount McKinley (Denali)', 'Humphreys Peak', 'Magazine Mountain', 'Mount Whitney', 'Mount Elbert', 'Mount Frissell', 'Ebright Azimuth', 'Tenleytown', 'Britton Hill', 'Brasstown Bald', 'Mauna Kea', 'Borah Peak', 'Charles Mound', 'Hoosier Hill', 'Hawkeye Point', 'Mount Sunflower', 'Black Mountain', 'Driskill Mountain', 'Mount Katahdin', 'Backbone Mountain', 'Mount Greylock', 'Mount Arvon', 'Eagle Mountain', 'Woodall Mountain', 'Taum Sauk Mountain', 'Granite Peak', 'Panorama Point', 'Boundary Peak', 'Mount Washington', 'High Point', 'Wheeler Peak', 'Mount Marcy', 'Mount Mitchell', 'White Butte', 'Campbell Hill', 'Black Mesa', 'Mount Hood', 'Mount Davis', 'Jerimoth Hill', 'Sassafras Mountain', 'Harney Peak', 'Clingmans Dome', 'Guadalupe Peak', 'Kings Peak', 'Mount Mansfield', 'Mount Rogers', 'Mount Rainier', 'Spruce Knob', 'Timms Hill', 'Gannett Peak'],
'Highest Elevation (ft)': [2405, 20320, 12633, 2753, 14505, 14440, 2380, 448, 410, 345, 4784, 13796, 12668, 1235, 1257, 1670, 4039, 4145, 535, 5267, 3360, 3487, 1979, 2301, 806, 1772, 12799, 5424, 13140, 6288, 1803, 13161, 5344, 6684, 3506, 1549, 4973, 11239, 3213, 812, 3560, 7242, 6643, 8749, 13528, 4393, 5729, 14410, 4861, 1951, 13804],
'Highest Elevation (m)': [733, 6198, 3853, 840, 4421, 4401, 726, 137, 125, 105, 1459, 4208, 3862, 377, 383, 509, 1232, 1263, 163, 1606, 1025, 1064, 604, 702, 246, 540, 3904, 1654, 4007, 1918, 550, 4014, 1630, 2039, 1069, 472, 1517, 3428, 980, 248, 1086, 2209, 2026, 2668, 4126, 1340, 1747, 4395, 1483, 595, 4210],
'Lowest Point': ['Gulf of Mexico', 'Pacific Ocean', 'Colorado River', 'Ouachita River', 'Death Valley', 'Arikaree River', 'Long Island Sound', 'Atlantic Ocean', 'Potomac River', 'Atlantic Ocean', 'Atlantic Ocean', 'Pacific Ocean', 'Snake River', 'Mississippi River', 'Ohio River', 'Mississippi River', 'Verdigris River', 'Mississippi River', 'New Orleans', 'Atlantic Ocean', 'Atlantic Ocean', 'Atlantic Ocean', 'Lake Erie', 'Lake Superior', 'Gulf of Mexico', 'Saint Francis River', 'Kootenai River', 'Missouri River', 'Colorado River', 'Atlantic Ocean', 'Atlantic Ocean', 'Red Bluff Reservoir', 'Atlantic Ocean', 'Atlantic Ocean', 'Red River', 'Ohio River', 'Little River', 'Pacific Ocean', 'Delaware River', 'Atlantic Ocean', 'Atlantic Ocean', 'Big Stone Lake', 'Mississippi River', 'Gulf of Mexico', 'Beaver Dam Wash', 'Lake Champlain', 'Atlantic Ocean', 'Pacific Ocean', 'Potomac River', 'Lake Michigan', 'Belle Fourche River'],
'Lowest Elevation (ft)': [0, 0, 70, 55, -282, 3315, 0, 0, 1, 0, 0, 0, 710, 279, 320, 480, 679, 257, -8, 0, 0, 0, 571, 601, 0, 230, 1800, 840, 479, 0, 0, 2842, 0, 0, 750, 455, 289, 0, 0, 0, 0, 966, 178, 0, 2000, 95, 0, 0, 240, 579, 3099],
'Lowest Elevation (m)': [0, 0, 21, 17, -86, 1010, 0, 0, 0, 0, 0, 0, 217, 85, 98, 146, 207, 78, -2, 0, 0, 0, 174, 183, 0, 70, 549, 256, 146, 0, 0, 867, 0, 0, 229, 139, 88, 0, 0, 0, 0, 295, 54, 0, 610, 29, 0, 0, 73, 177, 945],
'Average Elevation (ft)': [500, 1900, 4100, 650, 2900, 6800, 500, 60, 150, 100, 600, 3030, 5000, 600, 700, 1100, 2000, 750, 100, 600, 350, 500, 900, 1200, 300, 800, 3400, 2600, 5500, 1000, 250, 5700, 1000, 700, 1900, 850, 1300, 3300, 1100, 350, 350, 2200, 900, 1700, 6100, 1000, 950, 1700, 1500, 1050, 6700],
'Average Elevation (m)': [150, 580, 1300, 200, 880, 2100, 150, 20, 50, 30, 180, 920, 1500, 180, 210, 340, 610, 230, 30, 180, 110, 150, 280, 366, 90, 240, 1000, 790, 1700, 300, 80, 1700, 300, 210, 580, 260, 400, 1000, 340, 110, 110, 670, 280, 520, 1860, 300, 290, 520, 460, 320, 2044],
'Elevation Difference (ft)': [2405, 20320, 12563, 2698, 14787, 11125, 2380, 448, 409, 345, 4784, 13796, 11958, 956, 937, 1190, 3360, 3888, 543, 5267, 3360, 3487, 1408, 1700, 806, 1542, 10999, 4584, 12661, 6288, 1803, 10319, 5344, 6684, 2756, 1094, 4684, 11239, 3213, 812, 3560, 6276, 6465, 8749, 11528, 4298, 5729, 14410, 4621, 1372, 10705],
'Elevation Difference (m)': [733, 6198, 3832, 823, 4507, 3391, 726, 137, 125, 105, 1459, 4208, 3645, 292, 285, 363, 1025, 1185, 165, 1606, 1025, 1064, 430, 519, 246, 470, 3355, 1398, 3861, 1918, 550, 3147, 1630, 2039, 840, 333, 1429, 3428, 980, 248, 1086, 1914, 1972, 2668, 3516, 1311, 1747, 4395, 1410, 418, 3265]
}
elevationdf = pd.DataFrame(data)
elevationdf.head()
State | Highest Point | Highest Elevation (ft) | Highest Elevation (m) | Lowest Point | Lowest Elevation (ft) | Lowest Elevation (m) | Average Elevation (ft) | Average Elevation (m) | Elevation Difference (ft) | Elevation Difference (m) | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Alabama | Cheaha Mountain | 2405 | 733 | Gulf of Mexico | 0 | 0 | 500 | 150 | 2405 | 733 |
1 | Alaska | Mount McKinley (Denali) | 20320 | 6198 | Pacific Ocean | 0 | 0 | 1900 | 580 | 20320 | 6198 |
2 | Arizona | Humphreys Peak | 12633 | 3853 | Colorado River | 70 | 21 | 4100 | 1300 | 12563 | 3832 |
3 | Arkansas | Magazine Mountain | 2753 | 840 | Ouachita River | 55 | 17 | 650 | 200 | 2698 | 823 |
4 | California | Mount Whitney | 14505 | 4421 | Death Valley | -282 | -86 | 2900 | 880 | 14787 | 4507 |
regions_dict = {
'NERC': {'PSTATABB'},
'AK': {'AK'},
'SERC': {'AL', 'GA', 'IL', 'OH', 'OK', 'NC', 'FL', 'TN', 'VA', 'LA', 'MD', 'AR', 'MS', 'KY', 'IA', 'WV', 'MO', 'NY', 'DC', 'TX', 'SC'},
'MRO': {'AR', 'ND', 'KS', 'IA', 'NE', 'MN', 'LA', 'MO', 'IL', 'OK', 'SD', 'NM', 'MT', 'CO', 'WI', 'IN', 'TX'},
'WECC': {'CA', 'MA', 'WY', 'AZ', 'NV', 'SD', 'ID', 'OR', 'UT', 'NM', 'MT', 'CO', 'WA', 'TX'},
'NPCC': {'CT', 'VT', 'MA', 'PA', 'IL', 'RI', 'ME', 'NJ', 'NY', 'MD', 'NH'},
'RFC': {'CT', 'MD', 'MA', 'PA', 'KY', 'VA', 'MN', 'MI', 'WV', 'IL', 'OH', 'RI', 'NJ', 'NY', 'DE', 'WI', 'IN', 'DC'},
'HI': {'HI'},
'TRE': {'OK', 'TX'},
'PR': {'PR'}
}
if 'NERC' in regions_dict:
del regions_dict['NERC']
for key in regions_dict:
regions_dict[key] = list(regions_dict[key])
statesInRegion_df = pd.DataFrame(list(regions_dict.items()), columns=['NERC_Region', 'Plant_States'])
statesInRegion_df.head()
NERC_Region | Plant_States | |
---|---|---|
0 | AK | [AK] |
1 | SERC | [GA, KY, WV, VA, MS, MO, DC, IA, NC, OH, AL, F... |
2 | MRO | [ND, NE, CO, MO, MT, IA, KS, LA, MN, IN, IL, W... |
3 | WECC | [MA, ID, OR, CO, WY, NV, MT, SD, TX, AZ, WA, U... |
4 | NPCC | [MA, PA, NJ, IL, ME, NH, MD, NY, VT, CT, RI] |
data = {
'State': ['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado', 'Connecticut', 'Delaware',
'District of Columbia', 'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa', 'Kansas',
'Kentucky', 'Louisiana', 'Maine', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi',
'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire', 'New Jersey', 'New Mexico', 'New York',
'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania', 'Rhode Island',
'South Carolina', 'South Dakota', 'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington',
'West Virginia', 'Wisconsin', 'Wyoming'],
'Average Elevation (ft)': [500, 1900, 4100, 650, 2900, 6800, 500, 60, 150, 100, 600, 3030, 5000, 600, 700, 1100, 2000,
750, 100, 600, 350, 500, 900, 1200, 300, 800, 3400, 2600, 5500, 1000, 250, 5700, 1000, 700,
1900, 850, 1300, 3300, 1100, 350, 350, 2200, 900, 1700, 6100, 1000, 950, 1700, 1500, 1050,
6700]
}
elevationdf = pd.DataFrame(data)
regions_dict = {
'AK': ['Alaska'],
'SERC': ['Alabama', 'Georgia', 'Illinois', 'Ohio', 'Oklahoma', 'North Carolina', 'Florida', 'Tennessee', 'Virginia',
'Louisiana', 'Maryland', 'Arkansas', 'Mississippi', 'Kentucky', 'Iowa', 'West Virginia', 'Missouri',
'New York', 'District of Columbia', 'Texas', 'South Carolina'],
'MRO': ['Arkansas', 'North Dakota', 'Kansas', 'Iowa', 'Nebraska', 'Minnesota', 'Louisiana', 'Missouri', 'Illinois',
'Oklahoma', 'South Dakota', 'New Mexico', 'Montana', 'Colorado', 'Wisconsin', 'Indiana', 'Texas'],
'WECC': ['California', 'Massachusetts', 'Wyoming', 'Arizona', 'Nevada', 'South Dakota', 'Idaho', 'Oregon', 'Utah',
'New Mexico', 'Montana', 'Colorado', 'Washington', 'Texas'],
'NPCC': ['Connecticut', 'Vermont', 'Massachusetts', 'Pennsylvania', 'Illinois', 'Rhode Island', 'Maine', 'New Jersey',
'New York', 'Maryland', 'New Hampshire'],
'RFC': ['Connecticut', 'Maryland', 'Massachusetts', 'Pennsylvania', 'Kentucky', 'Virginia', 'Minnesota', 'Michigan',
'West Virginia', 'Illinois', 'Ohio', 'Rhode Island', 'New Jersey', 'New York', 'Delaware', 'Wisconsin', 'Indiana',
'District of Columbia'],
'HI': ['Hawaii'],
'TRE': ['Oklahoma', 'Texas']
}
flattened_data = []
for region, states in regions_dict.items():
for state in states:
flattened_data.append({'NERC_Region': region, 'State': state})
regions_df = pd.DataFrame(flattened_data)
merged_df = pd.merge(regions_df, elevationdf, on='State')
average_elevations = merged_df.groupby('NERC_Region')['Average Elevation (ft)'].mean().reset_index()
average_elevations.head(10)
NERC_Region | Average Elevation (ft) | |
---|---|---|
0 | AK | 1900.000000 |
1 | HI | 3030.000000 |
2 | MRO | 1988.235294 |
3 | NPCC | 659.090909 |
4 | RFC | 708.888889 |
5 | SERC | 726.190476 |
6 | TRE | 1500.000000 |
7 | WECC | 3971.428571 |
state_average_visits = q2_with_emissions.groupby('LocationAbbr')['Visits per 10,000'].mean().reset_index()
state_average_visits['Visits per 10,000'] = state_average_visits['Visits per 10,000'].astype(int)
asthma_data = {
'LocationAbbr': ['AR', 'AZ', 'CA', 'CO', 'FL', 'GA', 'IA', 'KY', 'MA', 'MD', 'MN', 'NC', 'NE', 'NJ', 'NV', 'NY', 'OR',
'SC', 'UT', 'VT', 'WI'],
'Visits per 10,000': [25, 124, 163, 134, 39, 39, 34, 29, 24, 499, 77, 33, 23, 696, 473, 185, 18, 34, 16, 47, 57]
}
asthma_df = pd.DataFrame(asthma_data)
state_names = {
'AR': 'Arkansas', 'AZ': 'Arizona', 'CA': 'California', 'CO': 'Colorado', 'FL': 'Florida', 'GA': 'Georgia', 'IA': 'Iowa',
'KY': 'Kentucky', 'MA': 'Massachusetts', 'MD': 'Maryland', 'MN': 'Minnesota', 'NC': 'North Carolina', 'NE': 'Nebraska',
'NJ': 'New Jersey', 'NV': 'Nevada', 'NY': 'New York', 'OR': 'Oregon', 'SC': 'South Carolina', 'UT': 'Utah',
'VT': 'Vermont', 'WI': 'Wisconsin'
}
asthma_df['State'] = asthma_df['LocationAbbr'].map(state_names)
regions_dict = {
#'AK': ['Alaska'],
'SERC': ['Alabama', 'Georgia', 'Illinois', 'Ohio', 'Oklahoma', 'North Carolina', 'Florida', 'Tennessee', 'Virginia',
'Louisiana', 'Maryland', 'Arkansas', 'Mississippi', 'Kentucky', 'Iowa', 'West Virginia', 'Missouri',
'New York', 'District of Columbia', 'Texas', 'South Carolina'],
'MRO': ['Arkansas', 'North Dakota', 'Kansas', 'Iowa', 'Nebraska', 'Minnesota', 'Louisiana', 'Missouri', 'Illinois',
'Oklahoma', 'South Dakota', 'New Mexico', 'Montana', 'Colorado', 'Wisconsin', 'Indiana', 'Texas'],
'WECC': ['California', 'Massachusetts', 'Wyoming', 'Arizona', 'Nevada', 'South Dakota', 'Idaho', 'Oregon', 'Utah',
'New Mexico', 'Montana', 'Colorado', 'Washington', 'Texas'],
'NPCC': ['Connecticut', 'Vermont', 'Massachusetts', 'Pennsylvania', 'Illinois', 'Rhode Island', 'Maine', 'New Jersey',
'New York', 'Maryland', 'New Hampshire'],
'RFC': ['Connecticut', 'Maryland', 'Massachusetts', 'Pennsylvania', 'Kentucky', 'Virginia', 'Minnesota', 'Michigan',
'West Virginia', 'Illinois', 'Ohio', 'Rhode Island', 'New Jersey', 'New York', 'Delaware', 'Wisconsin', 'Indiana',
'District of Columbia'],
}
flattened_data = [{'NERC_Region': region, 'State': state} for region, states in regions_dict.items() for state in states]
regions_df = pd.DataFrame(flattened_data)
merged_df = pd.merge(regions_df, asthma_df, on='State', how='left')
visits_sum = merged_df.groupby('NERC_Region').agg(
Total_Visits_Per_10000=('Visits per 10,000', 'sum'),
Count_States=('State', 'count')
)
visits_sum['Average_Visits_Per_10000'] = visits_sum['Total_Visits_Per_10000'] / visits_sum['Count_States']
average_elevations = {
'NERC_Region': ['SERC', 'MRO', 'WECC', 'NPCC', 'RFC', 'HI', 'TRE', 'AK'],
'Average Elevation (ft)': [800, 2000, 3000, 400, 1100, 3030, 1300, 1900]
}
average_elevations_df = pd.DataFrame(average_elevations)
final_df = pd.merge(visits_sum.reset_index(), average_elevations_df, on='NERC_Region', how='left')
final_df.head(10)
NERC_Region | Total_Visits_Per_10000 | Count_States | Average_Visits_Per_10000 | Average Elevation (ft) | |
---|---|---|---|---|---|
0 | MRO | 350.0 | 17 | 20.588235 | 2000 |
1 | NPCC | 1451.0 | 11 | 131.909091 | 400 |
2 | RFC | 1567.0 | 18 | 87.055556 | 1100 |
3 | SERC | 917.0 | 21 | 43.666667 | 800 |
4 | WECC | 952.0 | 14 | 68.000000 | 3000 |
X = final_df['Average Elevation (ft)']
y = final_df['Average_Visits_Per_10000']
X = sm.add_constant(X)
model = sm.OLS(y, X).fit()
print(model.summary())
OLS Regression Results
Dep. Variable: Average_Visits_Per_10000   R-squared: 0.224
Model: OLS                                Adj. R-squared: -0.035
Method: Least Squares                     F-statistic: 0.8661
Date: Mon, 06 May 2024                    Prob (F-statistic): 0.421
Time: 11:42:49                            Log-Likelihood: -24.663
No. Observations: 5                       AIC: 53.33
Df Residuals: 3                           BIC: 52.55
Df Model: 1                               Covariance Type: nonrobust

                          coef    std err        t      P>|t|     [0.025     0.975]
const                  98.4730     35.996    2.736      0.072    -16.084    213.030
Average Elevation (ft) -0.0193      0.021   -0.931      0.421     -0.085      0.047

Omnibus: nan              Durbin-Watson: 2.544
Prob(Omnibus): nan        Jarque-Bera (JB): 0.627
Skew: -0.156              Prob(JB): 0.731
Kurtosis: 1.294           Cond. No. 3.22e+03

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.22e+03. This might indicate that there are strong multicollinearity or other numerical problems.
C:\Users\tranh\anaconda3\lib\site-packages\statsmodels\stats\stattools.py:74: ValueWarning: omni_normtest is not valid with less than 8 observations; 5 samples were given. warn("omni_normtest is not valid with less than 8 observations; %i "
Global_P_Values = []
Global_P_Values.extend([0.000, 0.610, 0.095121e-01])
p_adj_values = accepted['p-adj'].tolist()
Global_P_Values.extend(p_adj_values)
# Global_P_Values.extend([3.011885e-31, 3.287227e-02, 1.964625e-01, 4.996776e-01, 5.045825e-02, 4.432867e-01])
Global_P_Values.append(4.432867e-01)
Global_P_Values.extend([0.072, 0.421])
print(Global_P_Values)
alpha = 0.05
pvalues_sorted = np.sort(Global_P_Values)
print(pvalues_sorted)
m = len(pvalues_sorted)
thresholds = np.array([(i / m) * alpha for i in range(1, m + 1)])
k = max(np.where(pvalues_sorted <= thresholds)[0] + 1) if np.any(pvalues_sorted <= thresholds) else 0
plt.figure(figsize=(10, 6))
plt.plot(pvalues_sorted, label='Sorted P-values')
plt.plot(thresholds, label='BH Boundary')
plt.axvline(x=k-1, linestyle='--', color='red', label=f'Cut-off at k={k-1}')
plt.xlabel('P-Value Rank')
plt.ylabel('P-Value')
plt.title('Sorted P-Values and BH Decision Boundary')
plt.legend()
plt.show()
[0.0, 0.61, 0.0095121, 0.0008, 0.0017, 0.0025, 0.0027, 0.0049, 0.0054, 0.0065, 0.0121, 0.0217, 0.4432867, 0.072, 0.421]
[0.        0.0008    0.0017    0.0025    0.0027    0.0049    0.0054    0.0065    0.0095121 0.0121    0.0217    0.072    0.421    0.4432867 0.61]
merged_df
NERC_Region | State | LocationAbbr | Visits per 10,000 | |
---|---|---|---|---|
0 | SERC | Alabama | NaN | NaN |
1 | SERC | Georgia | GA | 39.0 |
2 | SERC | Illinois | NaN | NaN |
3 | SERC | Ohio | NaN | NaN |
4 | SERC | Oklahoma | NaN | NaN |
... | ... | ... | ... | ... |
76 | RFC | New York | NY | 185.0 |
77 | RFC | Delaware | NaN | NaN |
78 | RFC | Wisconsin | WI | 57.0 |
79 | RFC | Indiana | NaN | NaN |
80 | RFC | District of Columbia | NaN | NaN |
81 rows × 4 columns
avg_elevation_1 = {
'State': ['AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', 'DC', 'FL', 'GA', 'HI',
'ID', 'IL', 'IN', 'IA', 'KS', 'KY', 'LA', 'ME', 'MD', 'MA', 'MI', 'MN',
'MS', 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ', 'NM', 'NY', 'NC', 'ND', 'OH',
'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VT', 'VA', 'WA',
'WV', 'WI', 'WY'],
'Average Elevation (ft)': [500, 1900, 4100, 650, 2900, 6800, 500, 60, 150, 100, 600, 3030, 5000, 600, 700, 1100, 2000,
750, 100, 600, 350, 500, 900, 1200, 300, 800, 3400, 2600, 5500, 1000, 250, 5700, 1000, 700,
1900, 850, 1300, 3300, 1100, 350, 350, 2200, 900, 1700, 6100, 1000, 950, 1700, 1500, 1050,
6700]
}
df2 = pd.DataFrame(avg_elevation_1)
df2.head()
State | Average Elevation (ft) | |
---|---|---|
0 | AL | 500 |
1 | AK | 1900 |
2 | AZ | 4100 |
3 | AR | 650 |
4 | CA | 2900 |
PLNT22df = pd.read_csv('PLNT22.csv')
state_counts = PLNT22df['Plant state abbreviation'].value_counts()
data = {
'State': ['CA', 'NC', 'NY', 'TX', 'MN', 'MA', 'NJ', 'IL', 'IA', 'MI', 'OR', 'CO', 'FL', 'PA', 'GA',
'WI', 'SC', 'IN', 'VA', 'OH', 'MD', 'KS', 'CT', 'AK', 'AZ', 'WA', 'ID', 'NM', 'ME', 'MO',
'OK', 'NE', 'VT', 'UT', 'NV', 'RI', 'LA', 'TN', 'AL', 'AR', 'HI', 'WY', 'MT', 'NH', 'ND',
'PR', 'KY', 'SD', 'MS', 'WV', 'DE', 'DC'],
'Count': [1678, 875, 821, 754, 724, 600, 368, 328, 283, 266, 265, 265, 242, 242, 235, 204, 197, 194,
193, 183, 171, 161, 155, 151, 146, 142, 140, 135, 133, 130, 129, 126, 113, 113, 102, 87,
82, 82, 75, 73, 68, 67, 65, 61, 60, 60, 52, 50, 47, 36, 31, 13]
}
df1 = pd.DataFrame(data)
df1.head()
result = {k: int(v) for k, v in df1.set_index('State')['Count'].to_dict().items()}
result2 = {k : int(v) for k, v in df2.set_index('State')['Average Elevation (ft)'].to_dict().items()}
randomforest_df['Total Power Plant Count'] = randomforest_df['LocationAbbr'].map(result)
randomforest_df['Average Elevation (ft)'] = randomforest_df['LocationAbbr'].map(result2)
randomforest_df.head()
C:\Users\tranh\AppData\Local\Temp\ipykernel_2596\2533913839.py:1: DtypeWarning: Columns (0,1,4,6,8,16,17,19,20,22,23,27,29,33,34,49,51,52,54,55,57,58,59,60,61,62,63,64,65,67,68,70,71,73,80,94,95,96,99) have mixed types. Specify dtype option on import or set low_memory=False. PLNT22df = pd.read_csv('PLNT22.csv')
YearStart | LocationAbbr | Stratification1 | DataValue | Year_x | State | Total Population | State Name | Year_male_pop | State_male_pop | Male Population | Year_female_pop | State_female_pop | Female Population | Year_y | Emissions | Visits per 10,000 | Total Power Plant Count | Average Elevation (ft) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2010 | AZ | Female | 7473.116667 | 2010 | AZ | 1600442.0 | Arizona | 2010 | AZ | 809747.0 | 2010 | AZ | 790763.0 | 2010 | 99.5 | 94.505138 | 146 | 4100 |
1 | 2010 | AZ | Male | 6150.000000 | 2010 | AZ | 1600442.0 | Arizona | 2010 | AZ | 809747.0 | 2010 | AZ | 790763.0 | 2010 | 99.5 | 75.949648 | 146 | 4100 |
2 | 2010 | CA | Female | 38898.420000 | 2010 | CA | 8730612.0 | California | 2010 | CA | 4327961.0 | 2010 | CA | 4402785.0 | 2010 | 356.6 | 88.349579 | 1678 | 2900 |
3 | 2010 | CA | Male | 32368.006667 | 2010 | CA | 8730612.0 | California | 2010 | CA | 4327961.0 | 2010 | CA | 4402785.0 | 2010 | 356.6 | 74.788120 | 1678 | 2900 |
4 | 2010 | FL | Female | 27106.200000 | 2010 | FL | 18845537.0 | Florida | 2010 | FL | 9212261.0 | 2010 | FL | 9633882.0 | 2010 | 245.5 | 28.136321 | 242 | 100 |
X_a = randomforest_df[['Emissions', 'Average Elevation (ft)', 'Total Population', 'Male Population', 'Female Population']]
y_a = randomforest_df['Visits per 10,000']
X_a = pd.get_dummies(X_a)
X_trainz, X_testz, y_trainz, y_testz = train_test_split(X_a, y_a, test_size=0.33, random_state=42)
asthma_nonparam = RandomForestRegressor()
asthma_nonparam.fit(X_trainz, y_trainz)
asthma_train_y_pred = asthma_nonparam.predict(X_trainz)
asthma_train_rmse = mean_squared_error(y_trainz, asthma_train_y_pred, squared=False)
asthma_train_mse = mean_squared_error(y_trainz, asthma_train_y_pred)
asthma_train_r2_score = r2_score(y_trainz, asthma_train_y_pred)
asthma_test_y_pred = asthma_nonparam.predict(X_testz)
asthma_test_rmse = mean_squared_error(y_testz, asthma_test_y_pred, squared=False)
asthma_test_mse = mean_squared_error(y_testz, asthma_test_y_pred)
asthma_test_r2_score = r2_score(y_testz, asthma_test_y_pred)
print('Training rmse: ', asthma_train_rmse)
print('Training mse: ', asthma_train_mse)
print('Training r2 score: ', asthma_train_r2_score)
print('Testing rmse: ', asthma_test_rmse)
print('Testing mse:', asthma_test_mse)
print('Testing r2 score: ', asthma_test_r2_score)
Training rmse: 12.064387076833203
Training mse: 145.54943553965998
Training r2 score: 0.9857253688229829
Testing rmse: 49.81317124162892
Testing mse: 2481.3520291478467
Testing r2 score: 0.8111242389109078
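Because the random forest exposes impurity-based feature importances, we can also inspect which engineered features drive the predicted visit rates; a short, descriptive-only sketch using the fitted asthma_nonparam and X_trainz from above:
# Rank the dummy-encoded features by the impurity-based importance computed during fitting;
# this is descriptive, not inferential.
importances = pd.Series(asthma_nonparam.feature_importances_, index=X_trainz.columns)
print(importances.sort_values(ascending=False).head(10))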
ceilupvisitDF = randomforest_df.copy()
ceilupvisitDF['Visits per 10,000'] = np.ceil(ceilupvisitDF['Visits per 10,000'])
X2 = ceilupvisitDF[['Emissions', 'Average Elevation (ft)', 'Total Population', 'Male Population', 'Female Population']]
y2 = ceilupvisitDF['Visits per 10,000']
X_train2, X_test2, y_train2, y_test2 = train_test_split(X2, y2, test_size=0.33, random_state=42)
poisson_model = sm.GLM(y_train2, sm.add_constant(X_train2.astype(float)), family=sm.families.Poisson()).fit()
poisson_predictions = poisson_model.predict(sm.add_constant(X_test2.astype(float)))
gaussian_model = sm.OLS(y_trainz, sm.add_constant(X_trainz.astype(float))).fit()
gaussian_predictions = gaussian_model.predict(sm.add_constant(X_testz.astype(float)))
print(gaussian_model.summary())
print(poisson_model.summary())
OLS Regression Results ============================================================================== Dep. Variable: Visits per 10,000 R-squared: 0.994 Model: OLS Adj. R-squared: 0.982 Method: Least Squares F-statistic: 83.25 Date: Mon, 06 May 2024 Prob (F-statistic): 1.07e-32 Time: 11:42:49 Log-Likelihood: -464.68 No. Observations: 134 AIC: 1109. Df Residuals: 44 BIC: 1370. Df Model: 89 Covariance Type: nonrobust ========================================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------------------ const -5556.3711 2.95e+04 -0.188 0.852 -6.51e+04 5.4e+04 Average Elevation (ft) 1.0292 5.350 0.192 0.848 -9.753 11.812 Total Population -0.0250 0.110 -0.227 0.822 -0.247 0.197 Male Population 0.0246 0.112 0.219 0.827 -0.202 0.251 Female Population 0.0271 0.118 0.231 0.819 -0.210 0.264 Emissions_100.4 5161.7080 2.55e+04 0.202 0.840 -4.62e+04 5.65e+04 Emissions_101.2 2163.2483 1.13e+04 0.191 0.850 -2.07e+04 2.5e+04 Emissions_101.3 2165.8279 1.14e+04 0.190 0.850 -2.08e+04 2.51e+04 Emissions_103.7 5058.5364 2.55e+04 0.198 0.844 -4.63e+04 5.65e+04 Emissions_104.9 5196.9858 2.55e+04 0.204 0.839 -4.62e+04 5.66e+04 Emissions_105.1 5112.1298 2.56e+04 0.200 0.843 -4.65e+04 5.67e+04 Emissions_105.6 5131.6265 2.55e+04 0.201 0.842 -4.63e+04 5.66e+04 Emissions_116.8 830.5270 4616.404 0.180 0.858 -8473.224 1.01e+04 Emissions_121.1 -4444.2747 2.28e+04 -0.195 0.846 -5.04e+04 4.15e+04 Emissions_122.8 827.5651 4594.495 0.180 0.858 -8432.032 1.01e+04 Emissions_124.2 -4608.3979 2.36e+04 -0.195 0.846 -5.21e+04 4.29e+04 Emissions_124.8 -4181.5958 2.15e+04 -0.195 0.846 -4.74e+04 3.91e+04 Emissions_124.9 -4327.2718 2.22e+04 -0.195 0.846 -4.91e+04 4.04e+04 Emissions_126.2 -3948.9990 2.03e+04 -0.194 0.847 -4.49e+04 3.7e+04 Emissions_126.8 875.0910 4813.667 0.182 0.857 -8826.216 1.06e+04 Emissions_129.3 9.033e-09 1.65e-09 5.461 0.000 5.7e-09 1.24e-08 Emissions_131.3 859.2823 4757.269 0.181 0.857 -8728.363 1.04e+04 Emissions_139.7 907.1910 4956.361 0.183 0.856 -9081.698 1.09e+04 Emissions_140.4 885.6499 4864.092 0.182 0.856 -8917.284 1.07e+04 Emissions_141.8 -4446.9281 2.28e+04 -0.195 0.846 -5.05e+04 4.16e+04 Emissions_141.9 -4581.1242 2.35e+04 -0.195 0.846 -5.19e+04 4.28e+04 Emissions_147.2 -3607.0556 1.87e+04 -0.193 0.848 -4.13e+04 3.4e+04 Emissions_148.1 -4044.9307 2.09e+04 -0.194 0.847 -4.61e+04 3.8e+04 Emissions_152.9 1005.6937 5423.098 0.185 0.854 -9923.842 1.19e+04 Emissions_169.1 -1279.0476 6985.168 -0.183 0.856 -1.54e+04 1.28e+04 Emissions_170.2 -1267.8812 6866.154 -0.185 0.854 -1.51e+04 1.26e+04 Emissions_185.0 -1241.8311 6866.788 -0.181 0.857 -1.51e+04 1.26e+04 Emissions_227.0 -6.375e-14 8.99e-15 -7.095 0.000 -8.19e-14 -4.56e-14 Emissions_233.0 -1.221e+04 6.39e+04 -0.191 0.849 -1.41e+05 1.17e+05 Emissions_237.4 -1.258e+04 6.58e+04 -0.191 0.849 -1.45e+05 1.2e+05 Emissions_238.4 -1.334e+04 6.97e+04 -0.191 0.849 -1.54e+05 1.27e+05 Emissions_238.8 -1.303e+04 6.81e+04 -0.191 0.849 -1.5e+05 1.24e+05 Emissions_242.0 -1.352e+04 7.07e+04 -0.191 0.849 -1.56e+05 1.29e+05 Emissions_245.5 -1.114e+04 5.84e+04 -0.191 0.850 -1.29e+05 1.07e+05 Emissions_356.6 -4890.3593 2.61e+04 -0.187 0.852 -5.75e+04 4.77e+04 Emissions_38.5 -88.1072 775.874 -0.114 0.910 -1651.779 1475.564 Emissions_38.8 -2002.4678 1.02e+04 -0.197 0.845 -2.25e+04 1.85e+04 Emissions_39.6 -1973.1171 1.01e+04 -0.195 0.846 -2.23e+04 1.84e+04 Emissions_39.7 -22.6926 762.145 -0.030 0.976 -1558.696 1513.311 Emissions_40.0 -44.4492 792.064 
-0.056 0.956 -1640.750 1551.852 Emissions_40.6 10.3719 766.569 0.014 0.989 -1534.546 1555.290 Emissions_41.4 -22.6671 811.503 -0.028 0.978 -1658.144 1612.809 Emissions_47.9 1500.4600 8226.825 0.182 0.856 -1.51e+04 1.81e+04 Emissions_48.4 1513.4283 8290.428 0.183 0.856 -1.52e+04 1.82e+04 Emissions_5.7 4269.1405 2.27e+04 0.188 0.852 -4.14e+04 5e+04 Emissions_5.9 4268.2382 2.27e+04 0.188 0.852 -4.15e+04 5e+04 Emissions_50.0 1566.2732 8572.959 0.183 0.856 -1.57e+04 1.88e+04 Emissions_50.7 1520.7833 8337.904 0.182 0.856 -1.53e+04 1.83e+04 Emissions_52.2 1532.5856 8385.011 0.183 0.856 -1.54e+04 1.84e+04 Emissions_52.4 1517.1323 8293.968 0.183 0.856 -1.52e+04 1.82e+04 Emissions_53.5 1537.5031 8428.013 0.182 0.856 -1.54e+04 1.85e+04 Emissions_55.0 4934.5333 2.52e+04 0.196 0.845 -4.58e+04 5.57e+04 Emissions_59.0 -1.253e-07 8.43e-08 -1.486 0.144 -2.95e-07 4.46e-08 Emissions_59.1 2249.0120 1.21e+04 0.186 0.854 -2.22e+04 2.67e+04 Emissions_6.1 4262.0606 2.27e+04 0.188 0.852 -4.14e+04 5e+04 Emissions_60.5 4959.0275 2.52e+04 0.197 0.845 -4.58e+04 5.57e+04 Emissions_61.7 4921.8361 2.52e+04 0.196 0.846 -4.58e+04 5.56e+04 Emissions_62.1 2236.2990 1.2e+04 0.186 0.853 -2.2e+04 2.65e+04 Emissions_62.3 4938.5957 2.52e+04 0.196 0.845 -4.58e+04 5.57e+04 Emissions_64.0 -3066.6956 1.57e+04 -0.195 0.846 -3.48e+04 2.87e+04 Emissions_64.2 5038.2058 2.52e+04 0.200 0.842 -4.57e+04 5.58e+04 Emissions_65.3 -3234.8352 1.66e+04 -0.195 0.847 -3.67e+04 3.03e+04 Emissions_66.7 -5118.0847 2.62e+04 -0.195 0.846 -5.8e+04 4.77e+04 Emissions_68.9 2270.3559 1.22e+04 0.186 0.853 -2.23e+04 2.68e+04 Emissions_70.0 590.9592 3620.903 0.163 0.871 -6706.491 7888.409 Emissions_70.2 923.8853 5231.538 0.177 0.861 -9619.586 1.15e+04 Emissions_70.8 2202.4079 1.19e+04 0.185 0.854 -2.17e+04 2.61e+04 Emissions_71.4 5058.0028 2.52e+04 0.201 0.842 -4.58e+04 5.59e+04 Emissions_72.8 -4666.1536 2.39e+04 -0.195 0.846 -5.28e+04 4.35e+04 Emissions_73.4 756.7471 4439.070 0.170 0.865 -8189.610 9703.105 Emissions_74.0 2626.4694 1.4e+04 0.188 0.852 -2.56e+04 3.08e+04 Emissions_76.3 2619.9530 1.4e+04 0.187 0.852 -2.55e+04 3.08e+04 Emissions_77.4 2624.8039 1.4e+04 0.187 0.852 -2.56e+04 3.08e+04 Emissions_82.4 2637.2577 1.41e+04 0.188 0.852 -2.57e+04 3.1e+04 Emissions_83.4 2631.3933 1.4e+04 0.188 0.852 -2.56e+04 3.09e+04 Emissions_85.9 1096.7809 6091.680 0.180 0.858 -1.12e+04 1.34e+04 Emissions_89.3 -9.652e-10 1.17e-10 -8.237 0.000 -1.2e-09 -7.29e-10 Emissions_90.1 1.543e-07 2.49e-08 6.203 0.000 1.04e-07 2.04e-07 Emissions_90.3 2654.2125 1.41e+04 0.188 0.852 -2.58e+04 3.11e+04 Emissions_90.5 -60.6969 209.701 -0.289 0.774 -483.322 361.928 Emissions_90.9 1.226e-07 1.94e-08 6.304 0.000 8.34e-08 1.62e-07 Emissions_91.1 -2.544e-07 4.02e-08 -6.322 0.000 -3.35e-07 -1.73e-07 Emissions_91.3 3027.6741 1.6e+04 0.190 0.850 -2.91e+04 3.52e+04 Emissions_94.1 -106.8394 427.618 -0.250 0.804 -968.646 754.968 Emissions_94.8 3036.4217 1.6e+04 0.190 0.850 -2.92e+04 3.53e+04 Emissions_95.0 -27.8216 45.678 -0.609 0.546 -119.879 64.236 Emissions_95.6 2143.3294 1.13e+04 0.190 0.850 -2.06e+04 2.49e+04 Emissions_95.9 3.997e-07 7.61e-08 5.254 0.000 2.46e-07 5.53e-07 Emissions_97.3 17.8462 106.906 0.167 0.868 -197.609 233.301 Emissions_98.6 2140.3227 1.13e+04 0.190 0.850 -2.05e+04 2.48e+04 Emissions_99.0 2193.6384 1.15e+04 0.191 0.849 -2.09e+04 2.53e+04 Emissions_99.3 31.7339 192.549 0.165 0.870 -356.322 419.790 Emissions_99.4 5090.3011 2.56e+04 0.199 0.843 -4.65e+04 5.66e+04 Emissions_99.5 78.8163 343.770 0.229 0.820 -614.007 771.640 Emissions_99.9 2151.4975 1.13e+04 0.190 0.850 
-2.07e+04 2.5e+04 ============================================================================== Omnibus: 28.208 Durbin-Watson: 1.825 Prob(Omnibus): 0.000 Jarque-Bera (JB): 258.679 Skew: -0.000 Prob(JB): 6.74e-57 Kurtosis: 9.807 Cond. No. 2.36e+23 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The smallest eigenvalue is 1.41e-31. This might indicate that there are strong multicollinearity problems or that the design matrix is singular. Generalized Linear Model Regression Results ============================================================================== Dep. Variable: Visits per 10,000 No. Observations: 134 Model: GLM Df Residuals: 128 Model Family: Poisson Df Model: 5 Link Function: Log Scale: 1.0000 Method: IRLS Log-Likelihood: -2071.6 Date: Mon, 06 May 2024 Deviance: 3429.6 Time: 11:42:49 Pearson chi2: 1.29e+04 No. Iterations: 6 Pseudo R-squ. (CS): 1.000 Covariance Type: nonrobust ========================================================================================== coef std err z P>|z| [0.025 0.975] ------------------------------------------------------------------------------------------ const 3.7220 0.030 122.229 0.000 3.662 3.782 Emissions 0.0236 0.000 65.837 0.000 0.023 0.024 Average Elevation (ft) 0.0002 6.45e-06 28.945 0.000 0.000 0.000 Total Population 7.575e-05 1.18e-05 6.404 0.000 5.26e-05 9.89e-05 Male Population -0.0001 1.18e-05 -9.044 0.000 -0.000 -8.38e-05 Female Population -4.685e-05 1.18e-05 -3.957 0.000 -7.01e-05 -2.36e-05 ==========================================================================================
glm_gaussian_pred = gaussian_predictions
glm_gaussian_test_rmse = rmse(y_testz, glm_gaussian_pred)
print('GLM Test RMSE (Gaussian) =', glm_gaussian_test_rmse)
print('GLM Test MSE (Gaussian) =', glm_gaussian_test_rmse ** 2)
glm_poisson_pred = poisson_predictions
glm_poisson_test_rmse = rmse(y_test2, glm_poisson_pred)
print('GLM Test RMSE (Poisson) =', glm_poisson_test_rmse)
print('GLM Test MSE (Poisson) =', glm_poisson_test_rmse ** 2)
GLM Test RMSE (Gaussian) = 2474.3382319839498
GLM Test MSE (Gaussian) = 6122349.686257458
GLM Test RMSE (Poisson) = 58.27654136285573
GLM Test MSE (Poisson) = 3396.155273216635
Methods
Results
For model performance summaries, refer to the RMSE, MSE, and R^2 metrics for each of the three models (one random forest and two GLMs).
I would say there is relatively low uncertainty in both the nonparametric and GLM models because, despite our data not being 100% thorough, we performed solid feature engineering. Quantitatively, we can observe this because some of the 95% confidence intervals for the coefficients are tight around the estimated values. For example, in the Poisson-link GLM, the estimated Emissions coefficient is 0.0236, and the 95% confidence interval bounds are [0.023, 0.024].
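Those interval bounds can also be pulled straight from the fitted model rather than read off the printed summary; a minimal sketch, assuming the poisson_model fitted above:
# 95% confidence intervals for the Poisson GLM coefficients;
# conf_int() returns a DataFrame with one row per parameter and lower/upper bounds.
poisson_ci = poisson_model.conf_int(alpha=0.05)
poisson_ci.columns = ['lower', 'upper']
print(poisson_ci.loc['Emissions'])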
Discussion
The nonparametric random forest model performed noticeably better when comparing the common RMSE metric across the models we trained. I am confident that, on similar future datasets, the random forest would continue to outperform a GLM because we are working with real-world data that often has complex, nonlinear relationships. The random forest can map out these relationships, such as the one between asthma visits and emissions, without an outsized risk of overfitting thanks to the ensemble design (i.e., averaging many decision trees, each trained on a bootstrap sample and considering only a random subset of features at each split).
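One way to back this claim up on our own data would be a cross-validated comparison of the two model families on the same design matrix; a rough, illustrative-only sketch assuming the X_a and y_a matrices defined earlier (the emissions dummies make the plain linear model fragile, so the absolute numbers matter less than the gap):
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
# 5-fold cross-validated RMSE for both model families on the same features.
# sklearn reports negative MSE, so flip the sign before taking the square root.
for name, model in [('Random forest', RandomForestRegressor(random_state=42)),
                    ('Linear model', LinearRegression())]:
    scores = cross_val_score(model, X_a, y_a, cv=5, scoring='neg_mean_squared_error')
    print(f"{name}: mean CV RMSE = {np.sqrt(-scores).mean():.2f}")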
The random forest and the Poisson GLM both fit the data well, especially the training data, with the Poisson GLM only marginally worse on the held-out test set; the Gaussian GLM, which ended up dummy-encoding emissions, generalizes far more poorly.