Predicting Household Utility Usage with Generalized Boosted Regression Modeling (GBM)

August 2024 (381 Words, 3 Minutes)

Overview

The goal of this project is to predict household energy consumption for the following year, for both the treatment and control groups, using gradient boosting.

Domain Expertise

This project deals with standard economic time series data. With a background in economics, I am familiar with this type of data and analysis.

Data Provenance

The professor provided us with an uncleaned dataset of monthly electricity consumption for 23,456 households for the years 2010-2011.

We were provided the following variables for household characteristics:

hh_id: The unique id for each household
year: The year the measurement was taken. Either 2010 or 2011.
month: The month the measurement was taken. An integer from 4 to 8 (April to August).
zipcode: The anonymized zip code where the household is located.
control: Whether the household and month combination is part of the control group. 1 if yes, 0 if no.
treatment: Whether the household and month combination is part of the treatment group. 1 if yes, 0 if no.
children: Whether the household has children or not. 1 if yes, 0 if no.
hhsize2-5plus: Whether the household falls in the given size group, by number of people in the home (2 through 5+). 1 if yes, 0 if no.
income2-9: Whether the household falls in the given income category: <$20k, $20-30k, $30-40k, $40-50k, $50-75k, $75-100k, $100-125k, >$125k, respectively. 1 if yes, 0 if no.
owner: Whether the resident owns the home. 1 if yes, 0 if no.

And the following variables for the observed energy consumption for each household:

lusage: Measured in log(kWh). The log of monthly electricity consumption for the given household.
lusage1-6: Measured in log(kWh). The log of monthly electricity consumption for each of the months April-September 2009 (i.e., the pre-sample period).

Technical Stack

This project was completed in R, using the gbm package for gradient boosting.

Methods

The data was first cleaned and preprocessed: rows containing NA values were removed, the dummy variables were converted to factors, and outliers were removed.
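
The cleaning steps above can be sketched roughly as follows. This is a minimal illustration on made-up data, not the project's actual script: the toy data frame, the choice of dummy columns, and the 1.5 * IQR outlier rule are all assumptions, since the write-up does not say how outliers were identified.

```r
# Hypothetical toy data standing in for the course dataset (values are made up).
set.seed(1)
n <- 200
households <- data.frame(
  hh_id    = seq_len(n),
  children = rbinom(n, 1, 0.4),
  owner    = rbinom(n, 1, 0.6),
  lusage   = rnorm(n, mean = 6.4, sd = 0.6)
)
households$lusage[c(3, 17)] <- NA  # inject missing values
households$lusage[5] <- 15         # inject an implausible reading

# 1. Drop every row that contains at least one NA.
cleaned <- na.omit(households)

# 2. Convert the 0/1 dummy columns to factors.
dummy_cols <- c("children", "owner")
cleaned[dummy_cols] <- lapply(cleaned[dummy_cols], factor, levels = c(0, 1))

# 3. Remove outliers in lusage using an ASSUMED 1.5 * IQR rule.
q <- quantile(cleaned$lusage, c(0.25, 0.75))
fence <- 1.5 * (q[2] - q[1])
keep <- cleaned$lusage >= q[1] - fence & cleaned$lusage <= q[2] + fence
cleaned <- cleaned[keep, ]
```

A different outlier rule (for example, a z-score threshold) would slot into step 3 without changing the other steps.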

The monthly household energy consumption, lusage, is the response and is modeled on all variables except hh_id, year, and month. We model each group (treatment and control) separately using gradient boosting with a Gaussian distribution, fitting a total of 5000 trees with an interaction depth of 3.
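
As a rough sketch of what such a fit could look like with the gbm package, the example below trains on made-up data with a reduced predictor set. The toy data frame, the helper name fit_group, and the smaller tree count used in the demonstration call are assumptions for illustration. Only the Gaussian distribution, the 5000 trees, and the interaction depth of 3 come from the description above; other settings are left at the package defaults, which may differ from what the project used.

```r
library(gbm)

# Hypothetical toy data; the real dataset has many more predictors.
set.seed(42)
n <- 500
toy <- data.frame(
  treatment = rbinom(n, 1, 0.5),
  children  = factor(rbinom(n, 1, 0.4)),
  owner     = factor(rbinom(n, 1, 0.6)),
  lusage1   = rnorm(n, 6.3, 0.5)
)
toy$lusage <- 0.8 * toy$lusage1 + 1.2 + rnorm(n, sd = 0.2)

# fit_group is a hypothetical helper: it fits one Gaussian GBM to one group.
# Defaults mirror the settings described in the text (5000 trees, depth 3).
fit_group <- function(df, n_trees = 5000, depth = 3) {
  gbm(
    lusage ~ children + owner + lusage1,
    data = df,
    distribution = "gaussian",
    n.trees = n_trees,
    interaction.depth = depth
  )
}

# Fit the treatment group on its own; 200 trees keeps this demo fast.
treated <- toy[toy$treatment == 1, ]
fit <- fit_group(treated, n_trees = 200)
pred <- predict(fit, newdata = treated, n.trees = 200)
```

The control group would be handled the same way by subsetting on treatment == 0 and calling the helper again.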

Results

The observed values of lusage for the 2011 treatment group range from 4.516 to 8.059, with a mean of 6.373. The values predicted by the gbm model span a slightly wider range, from 3.927 to 8.123, with a similar mean of 6.346.
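
A comparison like the one above can be produced with a few lines of base R. The sketch below uses made-up vectors and two hypothetical helpers, compare_distributions and rmse. The root mean squared error is not reported in this write-up; it is included only as an assumption about a useful per-household check, since matching ranges and means do not by themselves show that individual predictions are close.

```r
# Summarize the minimum, mean, and maximum of observed and predicted values.
compare_distributions <- function(observed, predicted) {
  data.frame(
    series = c("observed", "predicted"),
    min    = c(min(observed), min(predicted)),
    mean   = c(mean(observed), mean(predicted)),
    max    = c(max(observed), max(predicted))
  )
}

# Root mean squared error between paired observed and predicted values.
rmse <- function(observed, predicted) {
  sqrt(mean((observed - predicted)^2))
}

# Made-up values in log(kWh), for illustration only.
observed  <- c(5.9, 6.4, 6.1, 7.0)
predicted <- c(6.0, 6.3, 6.3, 6.8)

summary_table <- compare_distributions(observed, predicted)
error <- rmse(observed, predicted)
print(summary_table)
print(error)
```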

Conclusion

The predicted range and mean for the 2011 treatment group closely match the observed values (a mean of 6.346 against 6.373). The fit is not perfect, as expected: the predicted range is somewhat wider than the observed one. Still, the model offers a reasonable prediction of a household's energy consumption for the following year.

Other potential confounding variables, including government-mandated water limits, were not accounted for and may help improve future predictions.