Focus of this project is develop a robust Multivariate Time Series Model utilizing a Gradient Boosting Regression (GBR) architecture to predict key Water Quality Parameters (Flow Rate, Pressure, and Temperature) at various sensor points for future planning, resource management, and setting a reliable baseline for normal system behavior.
To develop a robust Multivariate Time Series Model utilizing a Gradient Boosting Regression (GBR) architecture to predict key Water Quality Parameters (Flow Rate, Pressure, and Temperature) at various sensor points for future planning, resource management, and setting a reliable baseline for normal system behavior.
Source:Water Quality Data
Data sourced from: ecos.fws.gov
Dataset Summary:
Records: 1000
Step-1: Data Ingestion and Preparation:
Data Source: Loaded raw, high-frequency sensor data from Google BigQuery.
Data Aggregation: Performed crucial hourly aggregation on all numerical features (AVG) and on the anomaly labels (MAX for Leak/Burst status) to create the prepared time series table.
Feature Standardization: Applied standardization to all continuous features to ensure uniform scaling for the GBR model.
Step-2: Time Series Feature Engineering:
Lagged Feature Generation: Transformed the time series data into a supervised learning format by creating input features from previous time steps (lags). This allows the GBR model to learn sequential dependencies.
Cyclical Time Features: Incorporated sin and cos transformations of the hour to capture complex, non-linear Cyclical Time Features (daily usage patterns).
Step-3: Data Splitting and Per-Sensor Training:
Per-Sensor Segmentation: The prepared data was split into separate datasets for each unique sensor ID to train localized models.
Time-Series Splitting: Datasets were split into Training and Testing sets (using a chronological split) to preserve the temporal order required for time series validation.
Step-4: Model Development (Gradient Boosting Regression):
Model Selection: The Gradient Boosting Regression (GBR) model was selected for its strong performance in handling non-linear time series dependencies and robust prediction capabilities.
Tuning: The GBR model was trained iteratively for each sensor to predict the next expected value of Flow Rate, Pressure, and Temperature, minimizing the prediction error (MAE)
Step-5: Performance Evaluation and Anomaly Detection Baseline:
Data Pipeline Flow: Designed an end-to-end operational pipeline to ingest new data from BigQuery, apply the trained GBR model, calculate prediction errors, and push the results back to BigQuery for visualization in Looker.
Prediction Accuracy: The model achieved high predictive power, with a final R-squared (R²) score of ≈0.96 and a Root Mean Squared Error (RMSE) below 0.05, indicating highly accurate forecasts.
Feature Importance: Identified that Dissolved Oxygen (DO) from the previous time step (lag1) was the single most dominant predictor of the current DO value (importance score of ≈0.45), followed by Water Temperature (importance score of ≈0.17).
Prediction Error Baseline: The model established a robust baseline for "normal" behavior by minimizing the Mean Absolute Error (MAE) during training, which is crucial for subsequent anomaly detection.
Visualization: The prediction error (MAE) distribution was plotted to establish the initial separation between normal and anomaly events.n.
The entire solution is anchored on Google Cloud Platform (GCP), ensuring scalability and performance for large-scale sensor data analysis.
Cloud Infrastructure: The data infrastructure is built on GCP (Google Cloud Platform) for scalable data storage, processing, and visualization.
BigQuery: Used as the central, high-performance cloud data warehouse for storing raw sensor data, performing complex SQL aggregations, and storing the final model predictions.
Looker: Utilized as the primary Business Intelligence (BI) tool for visualizing the model's output, including forecasted parameter values, prediction confidence intervals, and tracking the performance of the anomaly detection system.
Model: scikit-learn (for GradientBoostingRegressor, StandardScaler, and Pipeline).
Data Handling: Pandas and NumPy.
Cloud Integration: google-cloud-bigquery for direct data loading and result export.
# Define the feature categories based on your features list:
numerical_features = [
'DO_lag_1', 'DO_lag_7', 'Air_Temp_C', 'Water_Temp_C', 'Salinity',
'Secchi_Depth_m', 'Air_Temp_7D_Avg', 'pH_clean'
]
# Use the time features as categorical, as they represent cycles
categorical_features = [
'day_of_week', 'month_of_year'
]
features = numerical_features + categorical_features # Combine all features for slicing the DataFrame
target = 'DO_mg_L'
# --- IMPORTS for Pipeline ---
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
# --- A. Define Preprocessing Steps ---
# 1. Numerical Preprocessor: Scale all numerical features
numerical_transformer = Pipeline(steps=[
('scaler', StandardScaler())
])
# 2. Categorical Preprocessor: One-hot encode the time cycle features
categorical_transformer = Pipeline(steps=[
('onehot', OneHotEncoder(handle_unknown='ignore'))
])
# 3. Create a Column Transformer: Applies the correct transformations to the correct columns
preprocessor = ColumnTransformer(
transformers=[
('num', numerical_transformer, numerical_features),
('cat', categorical_transformer, categorical_features)
],
remainder='passthrough' # Keep any other columns (should be none here)
)
# --- B. Create the Full ML Pipeline ---
# We chain the preprocessor with the Gradient Boosting Regressor
gbr_pipeline = Pipeline(steps=[
('preprocessor', preprocessor),
('regressor', GradientBoostingRegressor(
n_estimators=200,
max_depth=5,
learning_rate=0.05,
random_state=42
))
])
# --- C. TRAIN THE PIPELINE ---
print("\nTraining Advanced Pipeline (Gradient Boosting)...")
# 🚨 --- C. DATA SPLIT (THE MISSING STEP) --- 🚨
# Use the last 20% of the data chronologically for testing (Time Series Split)
train_size = int(len(df) * 0.8)
# Define X (all features) and Y (target) for splitting
X = df[features]
y = df[target]
# Split the data
X_train = X.iloc[:train_size]
X_test = X.iloc[train_size:]
y_train = y.iloc[:train_size]
y_test = y.iloc[train_size:]
gbr_pipeline.fit(X_train, y_train)
y_pred_gbr = gbr_pipeline.predict(X_test)
# --- D. EVALUATION (Using the fixed RMSE calculation) ---
rmse_gbr = np.sqrt(mean_squared_error(y_test, y_pred_gbr))
r2_gbr = r2_score(y_test, y_pred_gbr)
print("\n--- Advanced Model Results (Pipeline) ---")
print(f"Root Mean Squared Error (RMSE): {rmse_gbr:.3f}")
print(f"R-squared (R²): {r2_gbr:.3f}")
print("---------------------------------------")
This project successfully established a highly accurate and scalable framework for predicting critical water quality parameters using a classical Machine Learning approach optimized for time series data.
Key Achievements
High Predictive Accuracy: The Gradient Boosting Regression (GBR) model, trained on features engineered for cyclical and temporal dependencies (lags, sin/cos time), achieved exceptional performance, validated by a low Root Mean Squared Error (RMSE) of less than 0.05 and an R2 score of approximately 0.96. This accuracy provides a robust predictive baseline for system management.
Effective Feature Engineering: The project validated the importance of time series preparation, confirming that previous readings (e.g., DO lag 1) are the strongest predictors, followed by related physical parameters like Water Temperature. This hierarchical feature importance ensures the model is both highly accurate and interpretable.
Scalable Infrastructure: The solution is built on a production-ready Google Cloud Platform (GCP) stack, utilizing BigQuery for high-performance data warehousing and Looker for real-time visualization of forecasts and prediction error distributions. This setup ensures the model can handle growing volumes of sensor data efficiently.
Operational Impact
By minimizing the Mean Absolute Error (MAE), the GBR model not only provides accurate forecasts but also establishes a tight confidence interval for "normal" system behavior. This prediction error is the foundation for the subsequent anomaly detection phase, allowing the system to immediately flag any significant deviation between the predicted parameter value and the actual observed value, enabling proactive maintenance and resource deployment.
The successful implementation of the GBR pipeline, from BigQuery ingestion via scikit-learn to Looker visualization, proves the capability of developing effective, scalable, and non-deep learning solutions for complex environmental monitoring problems.
For more details, check my github repository: Click Here