ML and statistical classification models

1. Case Description

This project aims to develop a classification model using supervised learning techniques to identify energy-efficient buildings in New York City. The dataset used for this project is provided by NYC Open Data and consists of reports on more than 29,000 buildings and 254 features for the calendar year 2019. The identification of energy-efficient buildings is crucial for achieving NYC’s ambitious goal of 100% carbon-free electricity and reducing emissions by 2040. This model will aid in benchmarking and achieving decarbonization goals.

library(tidyverse)
library(ggplot2)
library(DataExplorer)
library(caret)
library(forcats)
library(factoextra)
library(cluster)
library(mclust)
library(GGally)
library(plotly)
library(sf)
library(rnaturalearth)
library(rnaturalearthdata)
library(ggthemes)
library(pROC)
library(sjPlot)
data = read.csv("Energy_and_Water_Data_Disclosure_for_Local_Law_84_2020__Data_for_Calendar_Year_2019_.csv")

2. Cleaning and Feature Engineering

We started cleaning the dataset and making some feature selection, in order to choose the relevant variables for our analysis. During the process, we changed the data type of all variables to integer, and the NAs were dropped.

In addition, we do some feature engineering. In order to build comparable variables that are computed as total volumes, we divided them by the size of the buildings. As a result, we achieve a subset of 13 variables and more than 2500 observations. We didn’t include more features because many of the original variables were disaggregated measures of those that we chose.

Our final features are listed below:

building_type = data %>% select(Property.Id, Largest.Property.Use.Type)

clean_ny = data %>% 
  # feature selection
  select(c(Largest.Property.Use.Type...Gross.Floor.Area..ft.., 
          ENERGY.STAR.Score, Occupancy, 
          Weather.Normalized.Source.EUI..kBtu.ft..,
           Weather.Normalized.Source.Energy.Use..kBtu., Fuel.Oil..2.Use..kBtu.,
           Weather.Normalized.Site.Natural.Gas.Use..therms.,
           Weather.Normalized.Site.Electricity..kWh.,
           Electricity.Use...Grid.Purchase.and.Generated.from.Onsite.Renewable.Systems..kWh.,
           Total.GHG.Emissions..Metric.Tons.CO2e.,  Water.Use..All.Water.Sources...kgal.,
           Water.Use.Intensity..All.Water.Sources...gal.ft.., Property.Id,
           Latitude,
           Longitude
           )) %>% 
  mutate(across(where(is.character), as.integer)) %>% 
  # Discarting NAs
  drop_na() %>% 
  #Renaming variables
  rename(area = 1,
         energy_score = 2,
         occupancy = 3,
         eui = 4,
         energy_use = 5,
         fueloil_use = 6,
         gas_use = 7,
         electricity_grid_use = 8,
         renewables_use = 9,
         ghg_emissions = 10,
         water_use = 11,
         wui = 12
         ) %>% 
  # Feature engineering
  mutate(
    occupancy_pa = occupancy/area,
    fueloil_use_pa = fueloil_use/area,
    gas_use_pa = gas_use/area,
    electricity_useGP_pa = electricity_grid_use/area,
    renewables_use_pa = renewables_use/area,
    ghg_emissions_pa = ghg_emissions/area,
  ) %>% 
  select(!c(energy_use, water_use, occupancy, fueloil_use,gas_use,electricity_grid_use,renewables_use, ghg_emissions)) %>% 
  left_join(building_type, by = 'Property.Id' ) %>% 
  rename(type = Largest.Property.Use.Type) %>% 
  mutate(type = as.factor(type)) %>% 
  distinct()

head(clean_ny)
area energy_score eui wui Property.Id Latitude Longitude
308789 38 469 79 28404 40.87239 -73.91267
885474 66 147 4 1143922 40.70455 -74.00664
1852980 92 105 11 1143934 40.72066 -74.01010
1414803 67 160 14 1409171 40.74670 -73.94348
437571 96 80 4 1409173 40.74757 -73.94352
171725 90 79 48 1633012 40.83590 -73.89046
occupancy_pa fueloil_use_pa gas_use_pa electricity_useGP_pa renewables_use_pa ghg_emissions_pa type
3.238457e-04 0.00000000 1.619908740 31.366687 31.750652 0.017212401 Hospital
6.776032e-05 0.00000000 0.000000000 11.655381 11.655385 0.005346289 Office
5.396712e-05 0.03068355 0.190909238 8.750489 8.750488 0.003669225 Office
7.068122e-05 0.00000000 0.002825835 16.994128 17.150692 0.004970303 Office
2.285343e-04 1.16847323 0.071743786 9.215922 9.215924 0.003128635 Office
5.823264e-04 5.18971029 0.369474450 3.739147 3.745523 0.003371670 K-12 School
plot_missing(clean_ny)

missing_values

3. Exploratory Analysis

We review if the Energy Score changes depending on type of the houses and their sizes, Specifically, we found that multifamily houses and midsize buildings tend to have a lower energy score compared to other types and sizes of buildings.

clean_ny <- clean_ny %>%
  mutate(type = if_else(type %in% c("Multifamily Housing"), 
                                    "Multifamily Housing",
                      if_else(type %in% c("K-12 School"), "K-12 School",
                              if_else(type %in% c("Office"), "Office", 
                              "Others"))))   %>% 
  mutate(area = case_when(
    area <= 50000 ~ "midsize",
    area > 50000 & area <= 500000 ~ "large",
    area > 500000 ~ "very large",
  )) %>% mutate(area=as.factor(area))

clean_ny %>% ggplot(aes(type, energy_score)) + 
  geom_boxplot(fill="#336699") +
  labs(title= "Buildings by type", y="Energy Score", x="") +
  theme_classic()

type_buildings

clean_ny %>% ggplot(aes(area, energy_score)) + 
  geom_boxplot(fill="#336699") +
  labs(title= "Buildings by size",  y="Energy Score", x="") +
  theme_classic()

size_buildings

Removing outliers

The boxplots depicted below indicate the presence of numerous outliers in the majority of variables. To address this, we utilize the Interquartile Range Rule and remove the outliers from the dataset.


boxplot(clean_ny$occupancy_pa, xlab="Occupancy per m2", col="lightblue")
boxplot(clean_ny$eui, xlab="Energy Use Intensity", col="lightblue")
boxplot(clean_ny$ghg_emissions_pa, xlab="GHG Emissions per m2", col="lightblue")
boxplot(clean_ny$fueloil_use_pa, xlab="Fuel Oil Use per m2", col="lightblue")
boxplot(clean_ny$gas_use_pa, xlab="Gas Use per m2", col="lightblue")
boxplot(clean_ny$electricity_useGP_pa, xlab="Electricity Use - GP per m2", col="lightblue")
boxplot(clean_ny$renewables_use_pa, xlab="Renewable Energy Use per m2", col="lightblue")
boxplot(clean_ny$wui, xlab="Water Use Intensity per m2", col="lightblue")

boxplots

detect_outliers = function(x){
  q1 = quantile(x, probs=.25)
  q3 = quantile(x, probs=.75)
  iqr = q3-q1
  x > (q3 + (iqr*1.5)) | x < (q1-(iqr*1.5) )
}

rmv_outliers = function(dataframe, columns=names(dataframe)) {
    for (col in columns) {
        dataframe <- dataframe[!detect_outliers(dataframe[[col]]), ]
    }
    dataframe
}

buildings_clean = rmv_outliers(clean_ny, c("occupancy_pa", "eui",
          "fueloil_use_pa","gas_use_pa","electricity_useGP_pa",
          "renewables_use_pa","ghg_emissions_pa","wui" ))
# Select continuous variables
ny_buildings = buildings_clean %>% select(!c(Property.Id,type, Latitude, 
Longitude, area))
plot_scatterplot(ny_buildings, by="eui")

scatterplot

Correlations

The analysis of the correlations reveals that most of our predictors are negatively correlated with the energy score, suggesting as hypothesis that higher resource consumption is associated with lower energy efficiency. We also detected multicollinearity between the variables of renewable energy use and electricity use from the grid. These variables measure similar characteristics in terms of energy consumption and have a perfect correlation, leading us to discard the renewable energy variable.

plot_correlation(ny_buildings %>% select(!eui))

correlations

Target variable

ny_buildings = buildings_clean %>% select(!c(Property.Id))
ny_buildings = ny_buildings %>% mutate(
  efficiency = as.factor(ifelse(energy_score > 75, "Yes", "No"))
) %>% select(!energy_score)

3. Modeling

Train-test split

set.seed(163)
# split data
in_train <- createDataPartition(ny_buildings$efficiency, p = 0.75, list = FALSE)  
training <- ny_buildings[in_train,]
testing <- ny_buildings[-in_train,]

Cross validation (10 folds)

ctrl = trainControl(
  method = "repeatedcv",
  number = 10,
  classProbs = T,
  summaryFunction = twoClassSummary,
  verboseIter = T
  )

Training Logistic Regression

We chose to begin with a logistic regression model because it enables us to model the probability of a binary target variable, and its results are highly interpretable.

logistic = train(
  efficiency ~ 
    area+wui+occupancy_pa+fueloil_use_pa+gas_use_pa+electricity_useGP_pa+
    ghg_emissions_pa+type,
  data = training,
  method= "glm",
  family = "binomial",
  metric = "ROC",
  preProc= c('scale', 'center'),
  tuneLength= 10,
  trControl = ctrl)

#model summary
tab_model(logistic$finalModel)

logistic

Out of the total variables considered, we found 7 variables to be statistically significant in predicting energy efficiency. Among these variables, GHG emissions and gas use were found to have a strong negative association with the probability of a building being energy-efficient.

Evaluation

log_prediction <- predict(logistic,testing,type="prob") 
prediction = as.factor(ifelse(log_prediction[,2]>0.5, "Yes", "No"))

confusionMatrix(prediction, testing$efficiency)

metrics_log

roc(testing$efficiency, log_prediction[,2])$auc

Area under the curve: 0.9229

The model’s performance metrics are satisfactory, but we would like to explore an alternative machine learning model to investigate if other non-linear variables, such as longitude and latitude, could affect the results.

Fit Random Forest

rfFit = train(
  efficiency ~ 
    area+wui+occupancy_pa+fueloil_use_pa+gas_use_pa+electricity_useGP_pa+
    ghg_emissions_pa+type+Longitude+Latitude,
  data = training,
  method= "rf",
  preProc= c('scale', 'center'),
  tuneLength= 10,
  metric = "ROC",
  trControl = ctrl
)
rfProb = predict(rfFit, testing, type = "prob")
prediction = as.factor(ifelse(rfProb[,2]>0.5, "Yes", "No"))

confusionMatrix(prediction, testing$efficiency)

metrics_rf

After comparing the results of the logistic regression and the machine learning model, we found that the predictions are not significantly affected by the inclusion of non-linear variables such as longitude and latitude. However, we still wanted to assess the importance of variables for prediction and evaluate the AUC to gain a better understanding of the model’s performance.

plot(varImp(rfFit, scale = F), scales = list(y = list(cex = .95)))

importance

roc(testing$efficiency, rfProb[,2])$auc

Area under the curve: 0.9426

4. Conclussions

We have confirmed that buildings that consume more resources are less likely to be energy-efficient, and our predictions have been highly accurate. Our models have provided different approaches to the problem at hand. The logistic regression model has shown the clear effect of predictors on our response variable. The random forest model increased the AUC, but didn’t significantly improve other performance metrics. Finally, we have visualized the geographical distribution of energy-efficient buildings in NYC.

world <- ne_countries(scale = "medium", returnclass = "sf")

sites <- data.frame(lon = testing$Longitude, lat= testing$Latitude, efficiency=prediction)

ggplot(data = world) +
    geom_sf() +
    geom_point(data = sites, aes(x = lon, y = lat, color=factor(efficiency)), 
    size = 1) +
    coord_sf(xlim = c(-73.5, -74.5), ylim = c(40.4, 41), expand = FALSE) +
  ggtitle("Energy Efficient New York buildings") +
  labs(color = "Energy Efficiency")

map

GitHub