Shockingly Predictable: Forecasting Outage Duration

by Lauren May (laumay@umich.edu) and Julia Rehring (rehring@umich.edu)


Introduction

In this project, we studied power outages across the United States. We specifically focused on analyzing the duration of power outages and identifying the factors that led to longer outages. Our ultimate goal was to predict outage duration based on a variety of factors including weather, location, cause, and time of year.


Data Cleaning

We preprocessed the outage data to make it more workable. Specifically:

After cleaning, the dataset looks like this:

YEAR  MONTH  U.S._STATE  POSTAL.CODE  NERC.REGION  CLIMATE.REGION      ANOMALY.LEVEL  OUTAGE.START         OUTAGE.RESTORATION
2011  7      Minnesota   MN           MRO          East North Central  -0.3           2011-07-01 17:00:00  2011-07-03 20:00:00
2014  5      Minnesota   MN           MRO          East North Central  -0.1           2014-05-11 18:38:00  2014-05-11 18:39:00
2010  10     Minnesota   MN           MRO          East North Central  -1.5           2010-10-26 20:00:00  2010-10-28 22:00:00
2012  6      Minnesota   MN           MRO          East North Central  -0.1           2012-06-19 04:30:00  2012-06-20 23:00:00
2015  7      Minnesota   MN           MRO          East North Central  1.2            2015-07-18 02:00:00  2015-07-19 07:00:00
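
As an illustration of how the two timestamp columns relate to outage duration, here is a minimal sketch that parses them and takes the difference in minutes. The two rows are copied from the table above; the column name DURATION_MIN is hypothetical and is not part of the dataset, and this is not necessarily how the duration values used later were derived.

```python
import pandas as pd

# Two rows copied from the cleaned table; only the timestamp columns are needed.
outages = pd.DataFrame({
    "OUTAGE.START": ["2011-07-01 17:00:00", "2014-05-11 18:38:00"],
    "OUTAGE.RESTORATION": ["2011-07-03 20:00:00", "2014-05-11 18:39:00"],
})

# Parse the strings into datetime64 values so they can be subtracted.
for col in ["OUTAGE.START", "OUTAGE.RESTORATION"]:
    outages[col] = pd.to_datetime(outages[col])

# DURATION_MIN is a hypothetical helper column: elapsed time in minutes.
outages["DURATION_MIN"] = (
    outages["OUTAGE.RESTORATION"] - outages["OUTAGE.START"]
).dt.total_seconds() / 60

print(outages["DURATION_MIN"].tolist())
```

The first outage spans 2 days and 3 hours (3060 minutes), while the second lasts a single minute, which hints at how widely durations vary.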

Univariate Analysis

To explore which factors might influence outage duration, we first analyzed the distributions of various features.

Outages by Month

There is a clear pattern: more outages occur in summer months, with a smaller peak in early winter. This suggested a possible link between seasonal weather events and outages.

Outage Cause Distribution

As suspected, severe weather is the most common cause—nearly twice as frequent as the second most common cause, intentional attacks.

Climate Anomaly (ONI) Distribution

The Oceanic Niño Index (ONI) helps classify climate conditions as El Niño, La Niña, or Neutral, based on thresholds of ±0.5°C. We explored its distribution to assess its potential relationship with outages.
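
The ±0.5°C rule can be sketched as a small helper. This is an illustrative assumption rather than the code used in the project: the function name, the label strings, and the choice to treat values exactly at the threshold as El Niño or La Niña are all ours.

```python
def classify_oni(anomaly_level: float, threshold: float = 0.5) -> str:
    """Label a single ONI value using a symmetric threshold (assumed inclusive)."""
    if anomaly_level >= threshold:
        return "El Niño"
    if anomaly_level <= -threshold:
        return "La Niña"
    return "Neutral"


# ANOMALY.LEVEL values taken from the sample rows shown earlier.
for value in [-1.5, -0.3, 1.2]:
    print(value, classify_oni(value))
```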

Urban Population Percentage Distribution

The distribution shows a higher urban population percentage along the coasts and in the West. This could indicate the level of infrastructure in place in a given area, and thus how quickly an outage could be restored.

Bivariate Analysis

Next, we explored relationships between outage duration and other features.

Outage Duration by State

The northeastern U.S., particularly West Virginia, tends to experience longer outages. These are also areas where urban population density was lower, which supports the hypothesis that geographic and possibly infrastructural factors play a role.

Outage Duration by Cause

Surprisingly, fuel supply emergencies lead to the longest outages, followed by severe weather events.

Average Outage Duration by Month

The longest average outages occur in September. This could be linked to hurricane season or irregular fuel supply events.

Average Outage Duration by Urban Population Percentage

Interestingly, areas with a higher urban population percentage tend to have longer outages. The northeastern U.S. also appears to have both a higher percentage of urban population and longer outages.

Interesting Aggregates

In order to further explore the data, we aggregated it in different ways to gain some insights.

Average Outage Duration by Anomaly Level and Climate Region

As demonstrated by the pivot table above, longer outages tend to occur during the normal ranges of the ONI level. It also demonstrates that the Northeast region tends to have longer outages as a whole.
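
A pivot table of this kind can be built with pandas. The sketch below uses a small made-up frame, not the real data, and assumes the anomaly level has already been bucketed into a CLIMATE.CATEGORY column with values such as "normal", "warm", and "cold"; the exact bucketing used for the table above is not shown here.

```python
import pandas as pd

# Toy rows for illustration only; the durations are invented.
toy = pd.DataFrame({
    "CLIMATE.REGION": ["Northeast", "Northeast", "Northeast", "Southeast", "Southeast"],
    "CLIMATE.CATEGORY": ["normal", "normal", "cold", "normal", "warm"],
    "OUTAGE.DURATION": [3000.0, 5000.0, 1000.0, 800.0, 1200.0],
})

# Rows are regions, columns are anomaly buckets, cells are mean durations.
duration_pivot = toy.pivot_table(
    index="CLIMATE.REGION",
    columns="CLIMATE.CATEGORY",
    values="OUTAGE.DURATION",
    aggfunc="mean",
)
print(duration_pivot)
```

Region and bucket combinations with no outages show up as NaN, which is worth keeping in mind when comparing cells backed by very different numbers of outages.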

Anomaly Levels by Month

MONTH  avg_anomaly  avg_duration  outage_count
1      -0.332576    2590.48       132
2      -0.300763    2054.53       131
3      -0.120652    1947.72       92
4      -0.0598131   1493.86       107
5      -0.126271    1704.39       118
6      -0.072043    1948.4        186
7      -0.0445714   1680.69       175
8      -0.209272    2428.48       151
9      -0.213043    4294.52       92
10     0.0138889    3600.94       108
11     -0.00724638  1728.16       69
12     0.087037     3293.79       108

The above table gives insight into how anomaly levels change from month to month and how that may correspond to longer or shorter outages.
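
A per-month summary with these three columns can be produced with a single grouped aggregation. The following is a sketch on invented rows; the output column names match the table, but the source columns used for each aggregate are an assumption.

```python
import pandas as pd

# Invented rows; only the shape of the computation matters here.
toy = pd.DataFrame({
    "MONTH": [1, 1, 9, 9, 9],
    "ANOMALY.LEVEL": [-0.3, -0.5, -0.2, 0.1, -0.5],
    "OUTAGE.DURATION": [2000.0, 3000.0, 6000.0, 3000.0, 3000.0],
})

# Named aggregation: each output column is (source column, aggregate function).
by_month = toy.groupby("MONTH").agg(
    avg_anomaly=("ANOMALY.LEVEL", "mean"),
    avg_duration=("OUTAGE.DURATION", "mean"),
    outage_count=("OUTAGE.DURATION", "size"),
)
print(by_month)
```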

Outage Duration by Cause Category and Customers Affected

CAUSE.CATEGORY                 OUTAGE.DURATION  CUSTOMERS.AFFECTED
equipment failure              399.13           2.83979e+06
fuel supply emergency          8395.23          1
intentional attack             429.98           356315
islanding                      200.545          209749
public appeal                  1468.45          159994
severe weather                 3704.41          1.31566e+08
system operability disruption  728.87           1.70555e+07

This table suggests that the long average duration for fuel supply emergencies may be driven by outliers, since the table lists only 1 customer affected for that category. Severe weather, on the other hand, affects by far the most customers.

Our Prediction Problem

How long will a power outage last, given various attributes like month, climate region, and population demographics? How well can we develop a regression model to predict this duration?

Baseline Model

To predict power outage duration, we created a baseline model using a Linear Regression pipeline. This model incorporated three key features:

MONTH (categorical, one-hot encoded)

CLIMATE.REGION (categorical, one-hot encoded)

ANOMALY.LEVEL (numerical, with a transformation to absolute value)

We used scikit-learn’s Pipeline and ColumnTransformer to preprocess the data and train the model. The categorical columns were encoded using OneHotEncoder, and the anomaly level was passed through a FunctionTransformer to take its absolute value. We trained the model on a train/test split and evaluated its performance using mean squared error (MSE) and mean absolute error (MAE).
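
The pipeline described above can be sketched as follows. The data here is synthetic (random values with the same column names), so the printed errors say nothing about the real results; the step names, the 75/25 split, and the random seeds are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

# Synthetic stand-in for the cleaned outage data.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "MONTH": rng.integers(1, 13, size=n),
    "CLIMATE.REGION": rng.choice(["Northeast", "Southeast", "West"], size=n),
    "ANOMALY.LEVEL": rng.normal(0.0, 0.8, size=n),
})
toy["OUTAGE.DURATION"] = rng.exponential(2000.0, size=n)

# One-hot encode the categorical columns; take |ANOMALY.LEVEL| for the numeric one.
preprocess = ColumnTransformer([
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["MONTH", "CLIMATE.REGION"]),
    ("abs_anomaly", FunctionTransformer(np.abs), ["ANOMALY.LEVEL"]),
])
baseline = Pipeline([("preprocess", preprocess), ("regress", LinearRegression())])

X = toy.drop(columns="OUTAGE.DURATION")
y = toy["OUTAGE.DURATION"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

baseline.fit(X_train, y_train)
test_pred = baseline.predict(X_test)
test_mse = mean_squared_error(y_test, test_pred)
test_mae = mean_absolute_error(y_test, test_pred)
print(f"test MSE: {test_mse:.3e}, test MAE: {test_mae:.2f}")
```

Setting handle_unknown="ignore" keeps prediction from failing if a month or region appears in the test split but not in the training split.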

We chose MSE because it is a standard metric for evaluating regression models and would later be used in our cross-validation process in Part 5. However, since MSE is sensitive to outliers and we have some very large outage durations, we also report mean absolute error (MAE) to represent the model's performance.
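
To see why a few very long outages affect MSE more than MAE, consider a toy set of four residuals where one is ten times larger than the rest. The numbers are invented for illustration.

```python
# Prediction errors in minutes; the last one plays the role of an extreme outage.
residuals = [500.0, -500.0, 500.0, 5000.0]

mae = sum(abs(r) for r in residuals) / len(residuals)
mse = sum(r ** 2 for r in residuals) / len(residuals)

# Fraction of each metric's total that comes from the single large residual.
outlier_share_mae = abs(residuals[-1]) / sum(abs(r) for r in residuals)
outlier_share_mse = residuals[-1] ** 2 / sum(r ** 2 for r in residuals)

print(f"MAE = {mae}, outlier share = {outlier_share_mae:.0%}")
print(f"MSE = {mse}, outlier share = {outlier_share_mse:.0%}")
```

The large residual accounts for about 77% of the absolute error but about 97% of the squared error, so MSE is dominated by that one point.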

Baseline Model Performance:

TRAINING MSE  TESTING MSE  TESTING MAE
1.49e+07      1.27e+07     2404.05

This simple model provided a helpful benchmark, but we suspected that additional features and more flexible transformations could yield better results.

Final Model

For our final model, we used a more advanced approach by incorporating:

Additional categorical features: CAUSE.CATEGORY, CLIMATE.CATEGORY

A scaled numeric feature: POPPCT_URBAN

Polynomial expansion of ANOMALY.LEVEL to capture potential non-linear effects

We also introduced regularization by switching to Ridge Regression, which helps control overfitting. To optimize our pipeline, we used GridSearchCV to tune two hyperparameters:

Best parameters found via GridSearch:

This outcome suggests that linear terms provided the best generalization, but the addition of regularization and more features helped improve performance over the baseline.
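
A Ridge pipeline tuned with GridSearchCV might look like the sketch below. It runs on synthetic data, and the two tuned hyperparameters (the polynomial degree for ANOMALY.LEVEL and the Ridge alpha), the grid values, and the 5-fold setting are assumptions chosen for illustration, not the settings reported for this project.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures, StandardScaler

# Synthetic stand-in data using the feature names from the text.
rng = np.random.default_rng(1)
n = 240
X = pd.DataFrame({
    "MONTH": rng.integers(1, 13, size=n),
    "CLIMATE.REGION": rng.choice(["Northeast", "Southeast", "West"], size=n),
    "CAUSE.CATEGORY": rng.choice(["severe weather", "intentional attack"], size=n),
    "CLIMATE.CATEGORY": rng.choice(["normal", "warm", "cold"], size=n),
    "POPPCT_URBAN": rng.uniform(40.0, 100.0, size=n),
    "ANOMALY.LEVEL": rng.normal(0.0, 0.8, size=n),
})
y = rng.exponential(2000.0, size=n)

categorical = ["MONTH", "CLIMATE.REGION", "CAUSE.CATEGORY", "CLIMATE.CATEGORY"]
preprocess = ColumnTransformer([
    ("categorical", OneHotEncoder(handle_unknown="ignore"), categorical),
    ("urban", StandardScaler(), ["POPPCT_URBAN"]),
    ("anomaly_poly", PolynomialFeatures(include_bias=False), ["ANOMALY.LEVEL"]),
])
model = Pipeline([("preprocess", preprocess), ("ridge", Ridge())])

# Assumed grid: nested parameter names follow step__transformer__parameter.
param_grid = {
    "preprocess__anomaly_poly__degree": [1, 2, 3],
    "ridge__alpha": [0.1, 1.0, 10.0],
}
search = GridSearchCV(model, param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)
best = search.best_params_
print(best)
```

If the search selects degree 1, the polynomial step reduces to the raw feature, which is the situation the paragraph above describes as linear terms generalizing best.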

Final Model Performance:

TRAINING MSE  TESTING MSE  TESTING MAE
1.11e+07      1.04e+07     2158.37

Overall, this model yielded an 18% reduction in test mean squared error compared to our baseline (from 1.27e+07 to 1.04e+07). The improvement suggests that adding information about urban population and cause type, together with regularization, led to better generalization.
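
The relative improvement can be checked directly from the two performance tables. The inputs below are the rounded values as reported, so the result is approximate.

```python
# Test-set metrics copied from the baseline and final model tables.
baseline_test_mse, final_test_mse = 1.27e7, 1.04e7
baseline_test_mae, final_test_mae = 2404.05, 2158.37

mse_reduction = (baseline_test_mse - final_test_mse) / baseline_test_mse
mae_reduction = (baseline_test_mae - final_test_mae) / baseline_test_mae

print(f"test MSE reduction: {mse_reduction:.1%}")
print(f"test MAE reduction: {mae_reduction:.1%}")
```

This gives roughly an 18% drop in test MSE and a 10% drop in test MAE.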