Data science project aiming to predict the cause of power outages, using a public dataset of power outage information.
In a majority of contexts when power outages occur, utility control centers and monitoring agencies have a need to understand the root cause of an outage as it occurs in real time. Knowing whether an unexpected loss of power was caused by a weather related incident, an electricity grid malfunction, or even a malicious attack can be critical for handling the situation efficiently and reducing resources spent on diagnostics. Many modern “smart grid” systems collect vast amounts of data on electricity use and allocation, making it possible to utilize the trends in such data and allowing us to predict how disruptions may occur.
This is a project that utilizes a publicly available dataset containing data on major outages witnessed by different states in the United States from January 2000 to July 2016. Using a cleaned version of this dataset initially containing 1,533 observations and 55 variables, this project builds a machine learning classification model which predicts the root cause of a power outage at the time it occurs. Of these 55 variables, some of the most important to this problem are:
The following sections catalogue this project’s development process, including steps taken to explore the data as well as iteratively construct and improve predictive models.
During cleaning, empty columns or columns which were deemed irrelevant to the problem were removed. Irrelevant columns include:
Additionally, basic type casting for numerical features was performed. After performing cleaning (as well as imputations and transformations, as described later), the first few rows of the dataset now look like:
YEAR | MONTH | STATE | NERC.REGION | CLIMATE.REGION | ANOMALY.LEVEL | CLIMATE.CATEGORY | OUTAGE.RESTORATION.DATE | OUTAGE.RESTORATION.TIME | CAUSE.CATEGORY | CAUSE.CATEGORY.DETAIL | OUTAGE.DURATION | CUSTOMERS.AFFECTED | RES.PRICE | COM.PRICE | IND.PRICE | TOTAL.PRICE | RES.SALES | COM.SALES | IND.SALES | TOTAL.SALES | RES.PERCEN | COM.PERCEN | IND.PERCEN | RES.CUSTOMERS | COM.CUSTOMERS | IND.CUSTOMERS | TOTAL.CUSTOMERS | RES.CUST.PCT | COM.CUST.PCT | IND.CUST.PCT | PC.REALGSP.STATE | PC.REALGSP.USA | PC.REALGSP.REL | PC.REALGSP.CHANGE | UTIL.REALGSP | TOTAL.REALGSP | UTIL.CONTRI | PI.UTIL.OFUSA | POPULATION | POPPCT_URBAN | POPPCT_UC | POPDEN_URBAN | POPDEN_UC | POPDEN_RURAL | AREAPCT_URBAN | AREAPCT_UC | PCT_LAND | PCT_WATER_TOT | PCT_WATER_INLAND | CUST.AFF.MISSING | TIME.DATE.DOW | TIME.HOUR-SIN | TIME.HOUR-COS |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2011 | 7 | MN | MRO | East North Central | -0.3 | normal | 2011-07-03 00:00:00 | 20:00:00 | severe weather | nan | 3060 | 70000 | 11.6 | 9.18 | 6.81 | 9.28 | 2332915 | 2114774 | 2113291 | 6562520 | 35.5491 | 32.225 | 32.2024 | 2.30874e+06 | 276286 | 10673 | 2.5957e+06 | 88.9448 | 10.644 | 0.411181 | 51268 | 47586 | 1.07738 | 1.6 | 4802 | 274182 | 1.75139 | 2.2 | 5.34812e+06 | 73.27 | 15.28 | 2279 | 1700.5 | 18.2 | 2.14 | 0.6 | 91.5927 | 8.40733 | 5.47874 | 0 | 4 | 0.0741085 | 0.99725 |
2014 | 5 | MN | MRO | East North Central | -0.1 | normal | 2014-05-11 00:00:00 | 18:39:00 | intentional attack | vandalism | 1 | 63000 | 12.12 | 9.71 | 6.49 | 9.28 | 1586986 | 1807756 | 1887927 | 5284231 | 30.0325 | 34.2104 | 35.7276 | 2.34586e+06 | 284978 | 9898 | 2.64074e+06 | 88.8335 | 10.7916 | 0.37482 | 53499 | 49091 | 1.08979 | 1.9 | 5226 | 291955 | 1.79 | 2.2 | 5.45712e+06 | 73.27 | 15.28 | 2279 | 1700.5 | 18.2 | 2.14 | 0.6 | 91.5927 | 8.40733 | 5.47874 | 1 | 6 | 0.0812137 | 0.996697 |
2010 | 10 | MN | MRO | East North Central | -1.5 | cold | 2010-10-28 00:00:00 | 22:00:00 | severe weather | heavy wind | 3000 | 70000 | 10.87 | 8.19 | 6.07 | 8.15 | 1467293 | 1801683 | 1951295 | 5222116 | 28.0977 | 34.501 | 37.366 | 2.30029e+06 | 276463 | 10150 | 2.5869e+06 | 88.9206 | 10.687 | 0.392361 | 50447 | 47287 | 1.06683 | 2.7 | 4571 | 267895 | 1.70627 | 2.1 | 5.3109e+06 | 73.27 | 15.28 | 2279 | 1700.5 | 18.2 | 2.14 | 0.6 | 91.5927 | 8.40733 | 5.47874 | 0 | 1 | 0.0871557 | 0.996195 |
2012 | 6 | MN | MRO | East North Central | -0.1 | normal | 2012-06-20 00:00:00 | 23:00:00 | severe weather | thunderstorm | 2550 | 68200 | 11.79 | 9.25 | 6.71 | 9.19 | 1851519 | 1941174 | 1993026 | 5787064 | 31.9941 | 33.5433 | 34.4393 | 2.31734e+06 | 278466 | 11010 | 2.60681e+06 | 88.8954 | 10.6822 | 0.422355 | 51598 | 48156 | 1.07148 | 0.6 | 5364 | 277627 | 1.93209 | 2.2 | 5.38044e+06 | 73.27 | 15.28 | 2279 | 1700.5 | 18.2 | 2.14 | 0.6 | 91.5927 | 8.40733 | 5.47874 | 0 | 1 | 0.0196337 | 0.999807 |
2015 | 7 | MN | MRO | East North Central | 1.2 | warm | 2015-07-19 00:00:00 | 07:00:00 | severe weather | nan | 1740 | 250000 | 13.07 | 10.16 | 7.74 | 10.43 | 2028875 | 2161612 | 1777937 | 5970339 | 33.9826 | 36.2059 | 29.7795 | 2.37467e+06 | 289044 | 9812 | 2.67353e+06 | 88.8216 | 10.8113 | 0.367005 | 54431 | 49844 | 1.09203 | 1.7 | 4873 | 292023 | 1.6687 | 2.2 | 5.48959e+06 | 73.27 | 15.28 | 2279 | 1700.5 | 18.2 | 2.14 | 0.6 | 91.5927 | 8.40733 | 5.47874 | 0 | 5 | 0.00872654 | 0.999962 |
To gain a better understanding of this dataset before formalizing a prediction task, a visual analysis of various distributions and their correlations was performed to identify any patterns. At first attempts were made to identify any correlations between critical variables, however in many cases no such pattern became apparent. One such instance occurs when attempting to plot OUTAGE.DURATION against CUSTOMERS.AFFECTED, which was initially done to search for ways to capture the “severity” of an outage:
Although relationships like these do not reveal immediate patterns, they show that the patterns in this dataset are heavily nuanced by the wide range of types of power outages included. In other words, variables such as these are indeed dependent on CAUSE.CATEGORY, as the nature of a power outage naturally determines many of its characteristics. One such example of these patterns, which will later be used for its predictive ability, focuses on a pattern emerging from the fact that many missing values in the CUSTOMERS.AFFECTED column correlate with CAUSE.CATEGORY. The following plots showcase the distribution of CAUSE.CATEGORY, before and after filtering for these missing values:
Notably, intentional attacks consist of nearly 50% of observations missing a value for CUSTOMERS.AFFECTED. While it is unknown how collecting the data used in this dataset may have led to these results, the following classifier models which leverage this information assume similar circumstances exist at the time an outage occurs. If so, this information can be used by including relevant features in a model. Furthermore, the distribution of CAUSE.CATEGORY alone showcases the uneven distribution of outage causes in this dataset. This directly implies later models must account for this fact in order to make accurate predictions.
Investigating further, aggregating by CAUSE.CATEGORY yields some interesting statistics. For example, the variations across categories show how each type of outage affects customers as shown below.
CAUSE.CATEGORY | outages | total_customers_all_time | avg_customers |
---|---|---|---|
severe weather | 744 | 1.37634e+08 | 181336 |
system operability disruption | 123 | 1.97349e+07 | 156626 |
intentional attack | 403 | 1.20169e+07 | 28748.6 |
equipment failure | 55 | 4.59074e+06 | 80539.2 |
public appeal | 69 | 3.32178e+06 | 48141.7 |
fuel supply emergency | 38 | 2.37765e+06 | 47553 |
islanding | 44 | 743319 | 16159.1 |
Imputations were performed on the following columns: CUSTOMERS.AFFECTED - the number of customers affected RES.PERCEN - Percentage of residential electricity consumption compared to the total electricity consumption in the state (in %) IND.PERCEN - Percentage of industrial electricity consumption compared to the total electricity consumption in the state (in %)
CUSTOMERS.AFFECTED was imputed using a conditional median strategy, grouping on NERC.REGION (a location metric) then imputing using the median of CUSTOMERS.AFFECTED. The median was chosen over the mean due to the skewed distribution of CUSTOMERS.AFFECTED, as a number of the most extreme outages in the united states caused by hurricanes or major grid failures affected exponentially more consumers. The effects are visualized below:
RES.PERCEN and IND.PERCEN were both imputed using a strategy that grouped STATE for a given YEAR, then imputing using the mean. For remaining observations with a unique year/state combination, groupings are only performed by state. The effects are visualized below:
After performing this deep dive on the dataset and identifying some of the key patterns lying within it, we are ready to formalize this data into a prediction problem fit for a machine learning classification algorithm. To be explicit, the question the model will answer is:
Given the immediate context of a power outage as it occured, what caused it?
This results in the need for a multiclass classification algorithm which treats CAUSE.CATEGORY as the response variable. The metric used to evaluate models will be overall accuracy, however confusion matrices are used to evaluate recall for specific categories.
Finally, it is worth noting that the only features which are unavailable at the time of prediction are OUTAGE.DURATION and OUTAGE.RESTORATION.DATE–these have been omitted from the project to prevent data leakage. There is some ambiguity of whether CUSTOMERS.AFFECTED should be included in this group based on how this data was actually collected, however as mentioned previously this project assumes a modern monitoring system capable of implementing such a classifier also has the technology to gauge the number of affected customers, or at least produce a valid estimate. Since models will also encode the missingness of such information as its own feature, any extraordinary cases are accounted for.
A standard Multi-class Logistic Regression model is used to create a baseline for this project. This model will utilize the following list of features, organized by which information dimension they belong to. Features that are engineered (and not in the original dataset) are included and specified in the list.
MONTH
: Month (1=Jan … 12=Dec) of outage (ordinal)TIME.DATE.DOW
(Engineered): Day of week (0=Mon … 6=Sun) (ordinal)TIME.HOUR‑SIN
, TIME.HOUR‑COS
(Engineered): Sine/cosine transforms of outage start time (hour + minute), to capture cyclical daily patterns (quantitative)
STATE
: U.S. postal code (nominal)NERC.REGION
: North American Electric Reliability region (nominal)
ANOMALY.LEVEL
: The oceanic El Niño/La Niña (ONI) index referring to the cold and warm episodes by season (quantitative)
CUSTOMERS.AFFECTED
: Number of customers impacted (quantitative)CUST.AFF.MISSING
(Engineered): Indicator (0/1) for originally missing CUSTOMERS.AFFECTED counts (quantitative - binary)POPPCT_URBAN
: Percent urban population in that state (quantitative)
RES.PERCEN
: Residential electricity consumption percentage of total load (quantitative)IND.PERCEN
: Industrial electricity consumption percentage of total load (quantitative)UTIL.CONTRI
: Utility industry’s contribution to the total GSP in the State (as %) (quantitative)TOTAL.REALGSP
: Real GSP contributed by all industries in a given state (measured in 2009 U.S dollars) (quantitative)Of these 14 total features, 10 are quantitative/numeric, 2 are ordinal, and 2 are nominal/categorical.
The following schema was followed to produce the model:
LogisticRegression
(with l1 regularization)C
(log‑spaced 1e‑4 to 1e2)This model yields 76.72% training accuracy and 70.82% test accuracy. Overall, this a promising benchmark to set for future revisions. These results indicate the presence of a strong, real pattern in the data, however there are still several issues and opportunities for improvement to address in future models. While this first baseline is good in that it has promising accuracy, the goal now is to produce a model that is less overfit (as indicated by the difference in train/test scores) and which has better performance.
To improve upon the baseline model several key strategies are employed. These aim to address the following potential issues with the baseline model:
Surveying the results of the baseline model’s performance across each CAUSE.CATEGORY, a problem became apparent relating to the ability to predict ‘public appeal.’ The model suffers from low recall (~16%) and only moderate precision (~50%), showcasing an apparent struggle in predicting public appeal. Upon further research into the dataset itself, the reason becomes apparent. According to this dataset (and the NERC), most public appeal events are instances where officials urge electricity consumers to lower their usage for some reason (like a storm). Such events are planned, and oftentimes do not even represent a loss of power. Beyond intuition, These results prove ‘public appeal’ events are not fit for this problem–the model struggles as it searches for patterns that aren’t there.
As a result, the first major change is to remove ‘public appeal’ observations from the dataset (less than 5% of the overall data)
To examine which features are (un)important to the model, the absolute value of the mean of each coefficient across all classes is used to produce a ranking of feature importance. However, this process only resulted in one feature which can surely be removed: RES.PERCEN, which averaged 0.0 importance across all categories indicating it did not help predict any category. Following these results, RES.PERCEN will be dropped from the final model.
When training the baseline model via cross-validation, a particular error occurred where rare categorical features were encountered in validation sets but not training sets. An exact example output is as follows:
UserWarning: Found unknown categories in columns [...] during transform. These unknown categories will be encoded as all zeros
.
Such a warning indicates that in order to negate this information loss and produce better results, rare categories should be grouped into an “Other” category to prevent errors. This is confirmed by surveying the distribution of the STATE and NERC.REGION variables:
To resolve these, a cutoff for state was set which binned states which represent less than .5% of all data into an “Other” category, and a cutoff after ECAR region was set for NERC.REGION data. Retraining the baseline model with these new binned features, ‘STATE-T’ and ‘NERC.REGION-T’, does not report any warnings. Additionally, binning these categories as new features may reduce overfitting–when surveying feature importance, the values of certain coefficients relating to rare states/regions imply the model may have “memorized” these categories and their CAUSE.CATEGORY in the training set.
Finally, the same logic is applied to CAUSE.CATEGORY, the response variable. CAUSE.CATEGORY-T maps ‘equipment failure’, ‘fuel supply emergency’, and ‘islanding’ all as ‘physical failures’, representing a broad hardware malfunction leading to an outage. This is done as these categories are similar in nature in reality, which is hinted at in the baseline model’s tendency to misclassify each of these for each other. However, binning the response to create less categories overall will almost certainly improve raw accuracy–this is more done as an exercise in curiosity. To examine whether the gains in accuracy surpass the loss of information, all candidate final models are evaluated on both CAUSE.CATEGORY and CAUSE.CATEGORY-T separately.
To optimize performance, several other classification algorithms were tested as candidates for the final model. The other algorithms tested were Random Forest classification and a multi-layer perceptron (MLP), as both algorithms are powerful classifiers which have potential to yield promising results. A Decision Tree model was also included in this set, however this model was only included to visualize a decision tree in the case Random Forest proved best.
With 4 algorithms and 2 variations of a response variable, CAUSE.CATEGORY and CAUSE.CATEGORY-T, 8 total models were created and evaluated. For each model, all hyperparameters were tuned across a vast set of combinations utilizing sklearn GridSearchCV. The following graph outlines the results:
The resulting final model chosen from this set is the multi-class Logistic Regression, which classifies on the non-engineered CAUSE.CATEGORY. This model yields a 79.20% training score, and a 78.08% test score. While the logistic regression using ‘physical failures’ did perform strictly better, overall a ~1.5% increase in test accuracy does not justify losing specificity in predictions. Within the context of this project, a power outage diagnostician is likely more interested in the specifics of a physical failure than marginal improvements in accuracy.
The final model has the following hyperparameters, and was trained by one hot encoding all categorical features and standardizing all numerical features:
Parameter | Value | Description |
---|---|---|
C |
78.80462815669921 |
Inverse of regularization strength (L2) |
penalty |
"l2" |
Regularization norm |
solver |
"liblinear" |
Optimization algorithm (supports L1 & L2) |
max_iter |
10000 |
Maximum iterations for solver convergence |
Features used:
MONTH
: Month (1=Jan … 12=Dec) of outage (ordinal)TIME.DATE.DOW
(Engineered): Day of week (0=Mon … 6=Sun) (ordinal)TIME.HOUR‑SIN
, TIME.HOUR‑COS
(Engineered): Sine/cosine transforms of outage start time (hour + minute), to capture cyclical daily patterns (quantitative)
STATE-T
(Engineered): U.S. postal code (nominal)NERC.REGION-T
(Engineered): North American Electric Reliability region (nominal)
ANOMALY.LEVEL
: The oceanic El Niño/La Niña (ONI) index referring to the cold and warm episodes by season (quantitative)
CUSTOMERS.AFFECTED
: Number of customers impacted (quantitative)CUST.AFF.MISSING
(Engineered): Indicator (0/1) for originally missing CUSTOMERS.AFFECTED counts (quantitative - binary)POPPCT_URBAN
: Percent urban population in that state (quantitative)
IND.PERCEN
: Industrial electricity consumption percentage of total load (quantitative)UTIL.CONTRI
: Utility industry’s contribution to the total GSP in the State (as %) (quantitative)TOTAL.REALGSP
: Real GSP contributed by all industries in a given state (measured in 2009 U.S dollars) (quantitative)Overall, this model has improved significantly over the baseline. While the choice of algorithm remains a multi-class logistic regression, this new model has both strictly better performance while also being far less overfit than the baseline.
In conclusion, the final model’s test accuracy of 78.08% demonstrates that there is real predictive ability in determining the cause of power outages. To improve further, improvements can be made moving forward to further increase performance that were not made throughout this project. For instance, many incorrect predictions are caused by the uneven distribution of CAUSE.CATEGORY–the rare categories lose precision/recall performance due to the model’s frequent biasing towards the more common groups. While creating a logistic regression that balances the weight of each class does not resolve this issue, more advanced methods can be utilized in the future.