Overview
Built machine learning pipeline to predict total energy usage of data centers using standardized datasets from SPGlobal, LBNL Building Performance Database, and Aterio US Data Center Power Demand Dataset. Analyzed 202 unique data centers across the US and internationally, applying multiple imputation techniques (KNN, Random Forest, MICE, Neural Networks) to handle missing data.
Developed Random Forest model achieving 79% accuracy in classifying power usage into 3MW buckets, outperforming Decision Tree and Neural Network alternatives. Delivered comprehensive GitHub repository with model pipeline, data standardization scripts, and usage instructions for GridCARE to estimate energy consumption for incomplete records and inform sustainability planning.
Impact
- 79% accuracy: Random Forest classification performance
- 202 data centers analyzed: across US and international locations
- 3 datasets unified: SPGlobal, LBNL, Aterio standardized
- 3MW prediction buckets: granular energy consumption estimates
The Challenge
Data centers are critical infrastructure for the digital economy, but predicting their energy consumption is challenging due to fragmented data sources, missing values across key features, and lack of standardized schemas. GridCARE needed a reliable way to estimate power usage for incomplete records to evaluate efficiency opportunities and inform strategic decisions around sustainability and infrastructure planning.
Existing datasets from SPGlobal (~600 data centers), LBNL (274 data centers), and Aterio (115 data centers) each used different schemas and had significant missing data across critical features like IT power allocation, square footage, and climate zones. The challenge was to standardize these sources and build a robust predictive model despite incomplete information.
Data Science Approach
Data Standardization
Unified three major datasets (SPGlobal, LBNL, Aterio) into single schema, choosing SPGlobal as baseline due to completeness. Excluded fragmented sources like EIA CBECS and individual utility records that lacked specific data center identifiers.
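The unification step could be sketched roughly as below: rename each source's columns onto the SPGlobal-style baseline schema, then concatenate. All column names and mappings here are illustrative assumptions, not the project's actual schema.

```python
# Illustrative schema unification: map each source onto a common baseline
# schema and stack the frames. Column names are hypothetical.
import pandas as pd

COLUMN_MAPS = {
    "lbnl": {"total_sq_ft": "TOTAL_SQFT", "it_power_kw": "IT_POWER_KW"},
    "aterio": {"sqft": "TOTAL_SQFT", "power_kw": "IT_POWER_KW"},
}

def unify(spglobal: pd.DataFrame, lbnl: pd.DataFrame, aterio: pd.DataFrame) -> pd.DataFrame:
    """Stack the three sources into one frame, tagging each row's origin."""
    frames = [
        spglobal.assign(source="spglobal"),
        lbnl.rename(columns=COLUMN_MAPS["lbnl"]).assign(source="lbnl"),
        aterio.rename(columns=COLUMN_MAPS["aterio"]).assign(source="aterio"),
    ]
    return pd.concat(frames, ignore_index=True, sort=False)
```

Tagging rows with a `source` column keeps provenance available for later sanity checks across datasets.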
Exploratory Data Analysis
Conducted correlation analysis revealing strong relationships (r up to 0.80) between TOTALPOWER and features like Total Square Footage, IT Space Occupied, and Number of Racks. ANOVA showed climate significantly impacts energy intensity, with Warm Marine regions exhibiting the highest usage.
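The correlation and ANOVA steps could be sketched as follows, assuming a pandas frame with hypothetical column names (TOTALPOWER, CLIMATE); the project's actual notebook defines the real schema.

```python
# Hedged EDA sketch: Pearson correlation against the target, and a one-way
# ANOVA across climate zones. Column names are assumptions.
import pandas as pd
from scipy.stats import f_oneway

def correlation_with_target(df: pd.DataFrame, target: str) -> pd.Series:
    """Pearson correlation of every numeric feature with the target column."""
    numeric = df.select_dtypes(include="number")
    return numeric.corrwith(numeric[target]).drop(target).sort_values(ascending=False)

def climate_anova(df: pd.DataFrame, value_col: str, group_col: str):
    """One-way ANOVA: does energy intensity differ across climate zones?"""
    groups = [g[value_col].dropna() for _, g in df.groupby(group_col)]
    return f_oneway(*groups)  # F statistic and p-value
```

A small p-value from `climate_anova` would support the reported finding that climate zone significantly affects energy intensity.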
Multiple Imputation Techniques
Applied K-Nearest Neighbors (KNN), Random Forest, Multiple Imputation by Chained Equations (MICE), and Neural Network-based imputation to handle missing data, with Random Forest imputation achieving lowest RMSE (4035.99).
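One way to compare imputers as described is to mask known cells, impute, and score RMSE on the hidden values. This is a minimal sketch using scikit-learn's `KNNImputer` and `IterativeImputer` (MICE-style; the Random Forest variant swaps in a `RandomForestRegressor` estimator); the exact evaluation protocol here is an assumption.

```python
# Hedged sketch of the imputation comparison: KNN, MICE-style chained
# equations, and a Random Forest-backed variant.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import KNNImputer, IterativeImputer
from sklearn.ensemble import RandomForestRegressor

def rmse_on_masked(imputer, X_true: np.ndarray, mask: np.ndarray) -> float:
    """Hide known values, impute, and score RMSE on the hidden cells only."""
    X_missing = X_true.copy()
    X_missing[mask] = np.nan
    X_imputed = imputer.fit_transform(X_missing)
    return float(np.sqrt(np.mean((X_imputed[mask] - X_true[mask]) ** 2)))

imputers = {
    "KNN": KNNImputer(n_neighbors=5),
    "MICE": IterativeImputer(random_state=0),
    "RandomForest": IterativeImputer(
        estimator=RandomForestRegressor(n_estimators=50, random_state=0),
        random_state=0,
    ),
}
```

Scoring each imputer with `rmse_on_masked` on the same mask gives a like-for-like comparison, which is how a lowest-RMSE winner such as the reported Random Forest result would be chosen.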
Feature Engineering
Estimated TOTALPOWER for missing values using formula: (Total Square Footage × Kilowatts per Rack) ÷ 1000. Applied StandardScaler for numerical features and OneHotEncoder for categorical variables (climate, market, state).
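The fallback formula and the preprocessing step could look roughly like this; the feature and category column names are assumptions standing in for the project's real schema.

```python
# Hedged sketch: the report's TOTALPOWER fallback formula, plus a
# ColumnTransformer pairing StandardScaler with OneHotEncoder.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

def estimate_total_power(total_sqft: float, kw_per_rack: float) -> float:
    """Fallback estimate per the report: (Total Square Footage x kW per Rack) / 1000."""
    return total_sqft * kw_per_rack / 1000

# Hypothetical feature lists.
numeric_features = ["TOTAL_SQFT", "NUM_RACKS", "KW_PER_RACK"]
categorical_features = ["CLIMATE", "MARKET", "STATE"]

preprocessor = ColumnTransformer([
    ("num", StandardScaler(), numeric_features),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
])
```

`handle_unknown="ignore"` lets the encoder tolerate climate zones or markets unseen at training time instead of raising at prediction time.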
Model Development
Evaluated three model architectures—Decision Tree, Random Forest, and Neural Network—using accuracy, precision, recall, and f1-score metrics. Random Forest outperformed alternatives even after adding measures to prevent overfitting (reducing tree depth), achieving 79% overall accuracy with weighted average f1-score of 0.76.
Model performed best on:
- 0–3MW bucket: 94% f1-score (15 data centers)
- 3–6MW bucket: 78% f1-score (9 data centers)
- 15–18MW bucket: 100% f1-score (1 data center)
Random Forest was selected for the final pipeline for its ability to handle high-dimensional feature spaces and to better discriminate among relevant features. Rather than regressing a single point estimate, the model classifies power into buckets, yielding approximate power-range predictions in 3MW increments.
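The bucket-classification setup could be sketched as below: discretize TOTALPOWER (in kW) into 3MW bins and fit a depth-limited Random Forest. The bin mapping and hyperparameters here are illustrative assumptions, not the tuned values from the project.

```python
# Hedged sketch: 3 MW target buckets plus a depth-limited Random Forest,
# mirroring the reported overfitting countermeasure of reduced tree depth.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def to_3mw_bucket(total_power_kw: np.ndarray) -> np.ndarray:
    """Map kW readings to bucket indices: 0 -> 0-3MW, 1 -> 3-6MW, and so on."""
    return np.floor_divide(total_power_kw, 3000).astype(int)

clf = RandomForestClassifier(
    n_estimators=200,
    max_depth=8,      # assumed value; shallower trees curb overfitting
    random_state=0,
)
```

With labels produced by `to_3mw_bucket`, the classifier can be fit on the preprocessed feature matrix and scored with accuracy, precision, recall, and f1 as the report describes.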
Key Insights
Climate Effects on Energy Intensity
ANOVA revealed a statistically significant climate effect: Warm Marine (San Francisco) showed the highest energy use intensity at ~620 kBtu/ft²/yr, while Cool Humid (Chicago) exhibited the lowest at ~300 kBtu/ft²/yr.
Feature Correlations
Strongest predictors of total power: Total IT Power Allocated (r=0.805), Total IT Space Occupied (r=0.773), and Total IT Power (r=0.425). Watts per square foot remained consistent across facilities due to similar optimization strategies.
Model Recommendations
Future improvements should expand coverage of missing fields (wattage per rack, climate zone), gather time-series data for dynamic modeling, and include granular infrastructure metrics like cooling type and server density.
Deliverables
Delivered complete machine learning pipeline with comprehensive documentation, enabling GridCARE to predict data center power consumption for incomplete records. The pipeline includes:
- Model pipeline Jupyter notebook accepting JSON input for new predictions
- Trained Random Forest model saved as model.pkl for deployment
- Data standardization and imputation scripts
- Exploratory data analysis visualizations (correlation heatmaps, climate analysis)
- GitHub repository with full codebase and usage instructions
- Final report with methodology, results, and recommendations
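Invocation of the delivered pipeline might look roughly like this: load the saved model.pkl and score a JSON record, as the notebook's JSON input suggests. The field names and the flat feature layout are hypothetical; the repository's usage instructions define the real interface.

```python
# Hedged sketch of scoring a new record with the pickled model. Field names
# (total_sqft, num_racks, kw_per_rack) are illustrative assumptions.
import json
import pickle

def predict_bucket(model_path: str, json_record: str):
    """Load the pickled model and return the predicted power bucket."""
    with open(model_path, "rb") as f:
        model = pickle.load(f)
    record = json.loads(json_record)
    features = [[record["total_sqft"], record["num_racks"], record["kw_per_rack"]]]
    return model.predict(features)[0]
```

In practice the pickled object would likely bundle the preprocessing step (e.g. a scikit-learn Pipeline) so that raw JSON fields can be scored without manual scaling or encoding.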