Overview
Built machine learning pipeline to predict total energy usage of data centers using standardized datasets from SPGlobal, LBNL Building Performance Database, and Aterio US Data Center Power Demand Dataset. Analyzed 202 unique data centers across the US and internationally, applying multiple imputation techniques (KNN, Random Forest, MICE, Neural Networks) to handle missing data.
Developed Random Forest model achieving 79% accuracy in classifying power usage into 3MW buckets, outperforming Decision Tree and Neural Network alternatives. Delivered comprehensive GitHub repository with model pipeline, data standardization scripts, and usage instructions for GridCARE to estimate energy consumption for incomplete records and inform sustainability planning.
Impact
- 79% accuracy: Random Forest classification performance
- 202 data centers analyzed: across US and international locations
- 3 datasets unified: SPGlobal, LBNL, Aterio standardized
- 3MW prediction buckets: granular energy consumption estimates
The Challenge
Data centers are critical infrastructure for the digital economy, but predicting their energy consumption is challenging due to fragmented data sources, missing values across key features, and lack of standardized schemas. GridCARE needed a reliable way to estimate power usage for incomplete records to evaluate efficiency opportunities and inform strategic decisions around sustainability and infrastructure planning.
Existing datasets from SPGlobal (~600 data centers), LBNL (274 data centers), and Aterio (115 data centers) each used different schemas and had significant missing data across critical features like IT power allocation, square footage, and climate zones. The challenge was to standardize these sources and build a robust predictive model despite incomplete information.
Data Science Approach
Data Standardization
Unified three major datasets (SPGlobal, LBNL, Aterio) into single schema, choosing SPGlobal as baseline due to completeness. Excluded fragmented sources like EIA CBECS and individual utility records that lacked specific data center identifiers.
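The unification step could be sketched roughly as below: rename each source's columns onto the SPGlobal-style baseline schema, then concatenate. All column names and mappings here are illustrative assumptions, not the project's actual schema.

```python
# Illustrative schema unification: map each source onto a common baseline
# schema and stack the frames. Column names are hypothetical.
import pandas as pd

COLUMN_MAPS = {
    "lbnl": {"total_sq_ft": "TOTAL_SQFT", "it_power_kw": "IT_POWER_KW"},
    "aterio": {"sqft": "TOTAL_SQFT", "power_kw": "IT_POWER_KW"},
}

def unify(spglobal: pd.DataFrame, lbnl: pd.DataFrame, aterio: pd.DataFrame) -> pd.DataFrame:
    """Stack the three sources into one frame, tagging each row's origin."""
    frames = [
        spglobal.assign(source="spglobal"),
        lbnl.rename(columns=COLUMN_MAPS["lbnl"]).assign(source="lbnl"),
        aterio.rename(columns=COLUMN_MAPS["aterio"]).assign(source="aterio"),
    ]
    return pd.concat(frames, ignore_index=True, sort=False)
```

Tagging rows with a `source` column keeps provenance available for later sanity checks across datasets.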
Exploratory Data Analysis
Conducted correlation analysis revealing strong relationships (r up to 0.80) between TOTALPOWER and features like Total Square Footage, IT Space Occupied, and Number of Racks. ANOVA showed climate significantly impacts energy intensity, with Warm Marine regions exhibiting the highest usage.
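The correlation and ANOVA steps could be sketched as follows, assuming a pandas frame with hypothetical column names (TOTALPOWER, CLIMATE); the project's actual notebook defines the real schema.

```python
# Hedged EDA sketch: Pearson correlation against the target, and a one-way
# ANOVA across climate zones. Column names are assumptions.
import pandas as pd
from scipy.stats import f_oneway

def correlation_with_target(df: pd.DataFrame, target: str) -> pd.Series:
    """Pearson correlation of every numeric feature with the target column."""
    numeric = df.select_dtypes(include="number")
    return numeric.corrwith(numeric[target]).drop(target).sort_values(ascending=False)

def climate_anova(df: pd.DataFrame, value_col: str, group_col: str):
    """One-way ANOVA: does energy intensity differ across climate zones?"""
    groups = [g[value_col].dropna() for _, g in df.groupby(group_col)]
    return f_oneway(*groups)  # F statistic and p-value
```

A small p-value from `climate_anova` would support the reported finding that climate zone significantly affects energy intensity.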
Multiple Imputation Techniques
Applied K-Nearest Neighbors (KNN), Random Forest, Multiple Imputation by Chained Equations (MICE), and Neural Network-based imputation to handle missing data, with Random Forest imputation achieving lowest RMSE (4035.99).
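One way to compare imputers as described is to mask known cells, impute, and score RMSE on the hidden values. This is a minimal sketch using scikit-learn's `KNNImputer` and `IterativeImputer` (MICE-style; the Random Forest variant swaps in a `RandomForestRegressor` estimator); the exact evaluation protocol here is an assumption.

```python
# Hedged sketch of the imputation comparison: KNN, MICE-style chained
# equations, and a Random Forest-backed variant.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import KNNImputer, IterativeImputer
from sklearn.ensemble import RandomForestRegressor

def rmse_on_masked(imputer, X_true: np.ndarray, mask: np.ndarray) -> float:
    """Hide known values, impute, and score RMSE on the hidden cells only."""
    X_missing = X_true.copy()
    X_missing[mask] = np.nan
    X_imputed = imputer.fit_transform(X_missing)
    return float(np.sqrt(np.mean((X_imputed[mask] - X_true[mask]) ** 2)))

imputers = {
    "KNN": KNNImputer(n_neighbors=5),
    "MICE": IterativeImputer(random_state=0),
    "RandomForest": IterativeImputer(
        estimator=RandomForestRegressor(n_estimators=50, random_state=0),
        random_state=0,
    ),
}
```

Scoring each imputer with `rmse_on_masked` on the same mask gives a like-for-like comparison, which is how a lowest-RMSE winner such as the reported Random Forest result would be chosen.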
Feature Engineering
Estimated TOTALPOWER for missing values using formula: (Total Square Footage × Kilowatts per Rack) ÷ 1000. Applied StandardScaler for numerical features and OneHotEncoder for categorical variables (climate, market, state).
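The fallback formula and the preprocessing step could look roughly like this; the feature and category column names are assumptions standing in for the project's real schema.

```python
# Hedged sketch: the report's TOTALPOWER fallback formula, plus a
# ColumnTransformer pairing StandardScaler with OneHotEncoder.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

def estimate_total_power(total_sqft: float, kw_per_rack: float) -> float:
    """Fallback estimate per the report: (Total Square Footage x kW per Rack) / 1000."""
    return total_sqft * kw_per_rack / 1000

# Hypothetical feature lists.
numeric_features = ["TOTAL_SQFT", "NUM_RACKS", "KW_PER_RACK"]
categorical_features = ["CLIMATE", "MARKET", "STATE"]

preprocessor = ColumnTransformer([
    ("num", StandardScaler(), numeric_features),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
])
```

`handle_unknown="ignore"` lets the encoder tolerate climate zones or markets unseen at training time instead of raising at prediction time.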
Model Development
Evaluated three model architectures—Decision Tree, Random Forest, and Neural Network—using accuracy, precision, recall, and f1-score metrics. Random Forest outperformed alternatives even after adding measures to prevent overfitting (reducing tree depth), achieving 79% overall accuracy with weighted average f1-score of 0.76.
Model performed best on:
- 0–3MW bucket: 94% f1-score (15 data centers)
- 3–6MW bucket: 78% f1-score (9 data centers)
- 15–18MW bucket: 100% f1-score (1 data center)
Random Forest was selected for the final pipeline for its ability to handle high-dimensional feature spaces and to better discriminate among relevant features. Rather than regressing a single point estimate, the model classifies power into buckets, yielding approximate power-range predictions in 3MW increments.
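The bucket-classification setup could be sketched as below: discretize TOTALPOWER (in kW) into 3MW bins and fit a depth-limited Random Forest. The bin mapping and hyperparameters here are illustrative assumptions, not the tuned values from the project.

```python
# Hedged sketch: 3 MW target buckets plus a depth-limited Random Forest,
# mirroring the reported overfitting countermeasure of reduced tree depth.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def to_3mw_bucket(total_power_kw: np.ndarray) -> np.ndarray:
    """Map kW readings to bucket indices: 0 -> 0-3MW, 1 -> 3-6MW, and so on."""
    return np.floor_divide(total_power_kw, 3000).astype(int)

clf = RandomForestClassifier(
    n_estimators=200,
    max_depth=8,      # assumed value; shallower trees curb overfitting
    random_state=0,
)
```

With labels produced by `to_3mw_bucket`, the classifier can be fit on the preprocessed feature matrix and scored with accuracy, precision, recall, and f1 as the report describes.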
Key Insights
Climate Effects on Energy Intensity
ANOVA revealed a statistically significant climate effect: Warm Marine (San Francisco) showed the highest energy use intensity at ~620 kBtu/ft²/yr, while Cool Humid (Chicago) exhibited the lowest at ~300 kBtu/ft²/yr.
Feature Correlations
Strongest predictors of total power: Total IT Power Allocated (r=0.805), Total IT Space Occupied (r=0.773), and Total IT Power (r=0.425). Watts per square foot remained consistent across facilities due to similar optimization strategies.
Model Recommendations
Future improvements should expand coverage of missing fields (wattage per rack, climate zone), gather time-series data for dynamic modeling, and include granular infrastructure metrics like cooling type and server density.
Deliverables
Delivered complete machine learning pipeline with comprehensive documentation, enabling GridCARE to predict data center power consumption for incomplete records. The pipeline includes:
- Model pipeline Jupyter notebook accepting JSON input for new predictions
- Trained Random Forest model saved as model.pkl for deployment
- Data standardization and imputation scripts
- Exploratory data analysis visualizations (correlation heatmaps, climate analysis)
- GitHub repository with full codebase and usage instructions
- Final report with methodology, results, and recommendations
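Invocation of the delivered pipeline might look roughly like this: load the saved model.pkl and score a JSON record, as the notebook's JSON input suggests. The field names and the flat feature layout are hypothetical; the repository's usage instructions define the real interface.

```python
# Hedged sketch of scoring a new record with the pickled model. Field names
# (total_sqft, num_racks, kw_per_rack) are illustrative assumptions.
import json
import pickle

def predict_bucket(model_path: str, json_record: str):
    """Load the pickled model and return the predicted power bucket."""
    with open(model_path, "rb") as f:
        model = pickle.load(f)
    record = json.loads(json_record)
    features = [[record["total_sqft"], record["num_racks"], record["kw_per_rack"]]]
    return model.predict(features)[0]
```

In practice the pickled object would likely bundle the preprocessing step (e.g. a scikit-learn Pipeline) so that raw JSON fields can be scored without manual scaling or encoding.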