ML and statistical classification models

1. Case Description

This project aims to develop a classification model using supervised learning techniques to identify energy-efficient buildings in New York City. The dataset used for this project is provided by NYC Open Data and consists of reports on more than 29,000 buildings and 254 features for the calendar year 2019. The identification of energy-efficient buildings is crucial for achieving NYC’s ambitious goal of 100% carbon-free electricity and reducing emissions by 2040. This model will aid in benchmarking and achieving decarbonization goals.

library(tidyverse)
library(ggplot2)
library(DataExplorer)
library(caret)
library(forcats)
library(factoextra)
library(cluster)
library(mclust)
library(GGally)
library(plotly)
library(sf)
library(rnaturalearth)
library(rnaturalearthdata)
library(ggthemes)
library(pROC)
library(sjPlot)

data = read.csv("Energy_and_Water_Data_Disclosure_for_Local_Law_84_2020__Data_for_Calendar_Year_2019_.csv")

2. Cleaning and Feature Engineering

We started cleaning the dataset and making some feature selection, in order to choose the relevant variables for our analysis. During the process, we changed the data type of all variables to integer, and the NAs were dropped.

In addition, we do some feature engineering. In order to build comparable variables that are computed as total volumes, we divided them by the size of the buildings. As a result, we achieve a subset of 13 variables and more than 2500 observations. We didn’t include more features because many of the original variables were disaggregated measures of those that we chose.

Our final features are listed below:

Energy-Star Score
Occupancy per m^2: People/ft^2
Weather Normalized Source Energy Use Intensity (EUI): kBtu/ft^2
Fuel Oil Use: kBtu/ft^2
Weather Normalized Site Natural Gas Use: therms/ft^2
Weather Normalized Site Grid Purchased Electricity: kWh/ft^2
Electricity Use, generated from Onsite Renewable Systems: kWh/ft^2
Total GHG Emissions: Metric Tons CO2e/ft^2
Water Use Intensity All Water Sources (WUI): gal/ft^2
Gross Floor Area: ft^2
Use type
Longitude
Latitude

building_type = data %>% select(Property.Id, Largest.Property.Use.Type)

clean_ny = data %>% 
  # feature selection
  select(c(Largest.Property.Use.Type...Gross.Floor.Area..ft.., 
          ENERGY.STAR.Score, Occupancy, 
          Weather.Normalized.Source.EUI..kBtu.ft..,
           Weather.Normalized.Source.Energy.Use..kBtu., Fuel.Oil..2.Use..kBtu.,
           Weather.Normalized.Site.Natural.Gas.Use..therms.,
           Weather.Normalized.Site.Electricity..kWh.,
           Electricity.Use...Grid.Purchase.and.Generated.from.Onsite.Renewable.Systems..kWh.,
           Total.GHG.Emissions..Metric.Tons.CO2e.,  Water.Use..All.Water.Sources...kgal.,
           Water.Use.Intensity..All.Water.Sources...gal.ft.., Property.Id,
           Latitude,
           Longitude
           )) %>% 
  mutate(across(where(is.character), as.integer)) %>% 
  # Discarting NAs
  drop_na() %>% 
  #Renaming variables
  rename(area = 1,
         energy_score = 2,
         occupancy = 3,
         eui = 4,
         energy_use = 5,
         fueloil_use = 6,
         gas_use = 7,
         electricity_grid_use = 8,
         renewables_use = 9,
         ghg_emissions = 10,
         water_use = 11,
         wui = 12
         ) %>% 
  # Feature engineering
  mutate(
    occupancy_pa = occupancy/area,
    fueloil_use_pa = fueloil_use/area,
    gas_use_pa = gas_use/area,
    electricity_useGP_pa = electricity_grid_use/area,
    renewables_use_pa = renewables_use/area,
    ghg_emissions_pa = ghg_emissions/area,
  ) %>% 
  select(!c(energy_use, water_use, occupancy, fueloil_use,gas_use,electricity_grid_use,renewables_use, ghg_emissions)) %>% 
  left_join(building_type, by = 'Property.Id' ) %>% 
  rename(type = Largest.Property.Use.Type) %>% 
  mutate(type = as.factor(type)) %>% 
  distinct()

head(clean_ny)

area	energy_score	eui	wui	Property.Id	Latitude	Longitude
308789	38	469	79	28404	40.87239	-73.91267
885474	66	147	4	1143922	40.70455	-74.00664
1852980	92	105	11	1143934	40.72066	-74.01010
1414803	67	160	14	1409171	40.74670	-73.94348
437571	96	80	4	1409173	40.74757	-73.94352
171725	90	79	48	1633012	40.83590	-73.89046
occupancy_pa	fueloil_use_pa	gas_use_pa	electricity_useGP_pa	renewables_use_pa	ghg_emissions_pa	type
3.238457e-04	0.00000000	1.619908740	31.366687	31.750652	0.017212401	Hospital
6.776032e-05	0.00000000	0.000000000	11.655381	11.655385	0.005346289	Office
5.396712e-05	0.03068355	0.190909238	8.750489	8.750488	0.003669225	Office
7.068122e-05	0.00000000	0.002825835	16.994128	17.150692	0.004970303	Office
2.285343e-04	1.16847323	0.071743786	9.215922	9.215924	0.003128635	Office
5.823264e-04	5.18971029	0.369474450	3.739147	3.745523	0.003371670	K-12 School

plot_missing(clean_ny)

missing_values

3. Exploratory Analysis

We review if the Energy Score changes depending on type of the houses and their sizes, Specifically, we found that multifamily houses and midsize buildings tend to have a lower energy score compared to other types and sizes of buildings.

clean_ny <- clean_ny %>%
  mutate(type = if_else(type %in% c("Multifamily Housing"), 
                                    "Multifamily Housing",
                      if_else(type %in% c("K-12 School"), "K-12 School",
                              if_else(type %in% c("Office"), "Office", 
                              "Others"))))   %>% 
  mutate(area = case_when(
    area <= 50000 ~ "midsize",
    area > 50000 & area <= 500000 ~ "large",
    area > 500000 ~ "very large",
  )) %>% mutate(area=as.factor(area))

clean_ny %>% ggplot(aes(type, energy_score)) + 
  geom_boxplot(fill="#336699") +
  labs(title= "Buildings by type", y="Energy Score", x="") +
  theme_classic()

type_buildings

clean_ny %>% ggplot(aes(area, energy_score)) + 
  geom_boxplot(fill="#336699") +
  labs(title= "Buildings by size",  y="Energy Score", x="") +
  theme_classic()

size_buildings

Removing outliers

The boxplots depicted below indicate the presence of numerous outliers in the majority of variables. To address this, we utilize the Interquartile Range Rule and remove the outliers from the dataset.


boxplot(clean_ny$occupancy_pa, xlab="Occupancy per m2", col="lightblue")
boxplot(clean_ny$eui, xlab="Energy Use Intensity", col="lightblue")
boxplot(clean_ny$ghg_emissions_pa, xlab="GHG Emissions per m2", col="lightblue")
boxplot(clean_ny$fueloil_use_pa, xlab="Fuel Oil Use per m2", col="lightblue")
boxplot(clean_ny$gas_use_pa, xlab="Gas Use per m2", col="lightblue")
boxplot(clean_ny$electricity_useGP_pa, xlab="Electricity Use - GP per m2", col="lightblue")
boxplot(clean_ny$renewables_use_pa, xlab="Renewable Energy Use per m2", col="lightblue")
boxplot(clean_ny$wui, xlab="Water Use Intensity per m2", col="lightblue")

boxplots

detect_outliers = function(x){
  q1 = quantile(x, probs=.25)
  q3 = quantile(x, probs=.75)
  iqr = q3-q1
  x > (q3 + (iqr*1.5)) | x < (q1-(iqr*1.5) )
}

rmv_outliers = function(dataframe, columns=names(dataframe)) {
    for (col in columns) {
        dataframe <- dataframe[!detect_outliers(dataframe[[col]]), ]
    }
    dataframe
}

buildings_clean = rmv_outliers(clean_ny, c("occupancy_pa", "eui",
          "fueloil_use_pa","gas_use_pa","electricity_useGP_pa",
          "renewables_use_pa","ghg_emissions_pa","wui" ))

# Select continuous variables
ny_buildings = buildings_clean %>% select(!c(Property.Id,type, Latitude, 
Longitude, area))
plot_scatterplot(ny_buildings, by="eui")

scatterplot

Correlations

The analysis of the correlations reveals that most of our predictors are negatively correlated with the energy score, suggesting as hypothesis that higher resource consumption is associated with lower energy efficiency. We also detected multicollinearity between the variables of renewable energy use and electricity use from the grid. These variables measure similar characteristics in terms of energy consumption and have a perfect correlation, leading us to discard the renewable energy variable.

plot_correlation(ny_buildings %>% select(!eui))

correlations

Target variable

ny_buildings = buildings_clean %>% select(!c(Property.Id))
ny_buildings = ny_buildings %>% mutate(
  efficiency = as.factor(ifelse(energy_score > 75, "Yes", "No"))
) %>% select(!energy_score)

3. Modeling

Train-test split

set.seed(163)
# split data
in_train <- createDataPartition(ny_buildings$efficiency, p = 0.75, list = FALSE)  
training <- ny_buildings[in_train,]
testing <- ny_buildings[-in_train,]

Cross validation (10 folds)

ctrl = trainControl(
  method = "repeatedcv",
  number = 10,
  classProbs = T,
  summaryFunction = twoClassSummary,
  verboseIter = T
  )

Training Logistic Regression

We chose to begin with a logistic regression model because it enables us to model the probability of a binary target variable, and its results are highly interpretable.

logistic = train(
  efficiency ~ 
    area+wui+occupancy_pa+fueloil_use_pa+gas_use_pa+electricity_useGP_pa+
    ghg_emissions_pa+type,
  data = training,
  method= "glm",
  family = "binomial",
  metric = "ROC",
  preProc= c('scale', 'center'),
  tuneLength= 10,
  trControl = ctrl)

#model summary
tab_model(logistic$finalModel)

logistic

Out of the total variables considered, we found 7 variables to be statistically significant in predicting energy efficiency. Among these variables, GHG emissions and gas use were found to have a strong negative association with the probability of a building being energy-efficient.

Evaluation

log_prediction <- predict(logistic,testing,type="prob") 
prediction = as.factor(ifelse(log_prediction[,2]>0.5, "Yes", "No"))

confusionMatrix(prediction, testing$efficiency)

metrics_log

roc(testing$efficiency, log_prediction[,2])$auc

Area under the curve: 0.9229

The model’s performance metrics are satisfactory, but we would like to explore an alternative machine learning model to investigate if other non-linear variables, such as longitude and latitude, could affect the results.

Fit Random Forest

rfFit = train(
  efficiency ~ 
    area+wui+occupancy_pa+fueloil_use_pa+gas_use_pa+electricity_useGP_pa+
    ghg_emissions_pa+type+Longitude+Latitude,
  data = training,
  method= "rf",
  preProc= c('scale', 'center'),
  tuneLength= 10,
  metric = "ROC",
  trControl = ctrl
)

rfProb = predict(rfFit, testing, type = "prob")
prediction = as.factor(ifelse(rfProb[,2]>0.5, "Yes", "No"))

confusionMatrix(prediction, testing$efficiency)

metrics_rf

After comparing the results of the logistic regression and the machine learning model, we found that the predictions are not significantly affected by the inclusion of non-linear variables such as longitude and latitude. However, we still wanted to assess the importance of variables for prediction and evaluate the AUC to gain a better understanding of the model’s performance.

plot(varImp(rfFit, scale = F), scales = list(y = list(cex = .95)))

importance

roc(testing$efficiency, rfProb[,2])$auc

Area under the curve: 0.9426

4. Conclussions

We have confirmed that buildings that consume more resources are less likely to be energy-efficient, and our predictions have been highly accurate. Our models have provided different approaches to the problem at hand. The logistic regression model has shown the clear effect of predictors on our response variable. The random forest model increased the AUC, but didn’t significantly improve other performance metrics. Finally, we have visualized the geographical distribution of energy-efficient buildings in NYC.

world <- ne_countries(scale = "medium", returnclass = "sf")

sites <- data.frame(lon = testing$Longitude, lat= testing$Latitude, efficiency=prediction)

ggplot(data = world) +
    geom_sf() +
    geom_point(data = sites, aes(x = lon, y = lat, color=factor(efficiency)), 
    size = 1) +
    coord_sf(xlim = c(-73.5, -74.5), ylim = c(40.4, 41), expand = FALSE) +
  ggtitle("Energy Efficient New York buildings") +
  labs(color = "Energy Efficiency")

map

GitHub