
Conversation

@hansent (Collaborator) commented Jan 20, 2026

Summary

Adds a structured logging system for GCP serverless deployments that captures model lifecycle events for observability and analysis. This enables understanding cache hit rates, model thrashing patterns, and memory usage in production.

Key features:

  • JSON structured logs compatible with GCP Cloud Logging
  • 7 event types covering the full model lifecycle
  • Tiered sampling to control log volume/costs (high-volume events sampled, lifecycle events always logged; see the sketch after this list)
  • Tiered memory measurement (basic vs detailed mode)
  • Request context propagation (correlates events across workflow steps)
  • Support for both legacy model loading and inference-models paths
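
A minimal sketch of the tiered sampling decision (the constant and function names are illustrative, not the PR's actual API):

```python
import random

# Lifecycle events bypass the sampler entirely; high-volume events are
# dropped probabilistically. Names here are illustrative only.
LIFECYCLE_EVENTS = {"model_loaded_to_disk", "model_loaded_to_memory", "model_evicted"}

def should_log(event_type: str, sample_rate: float) -> bool:
    if event_type in LIFECYCLE_EVENTS:
        return True  # always log lifecycle events
    return random.random() < sample_rate  # sample high-volume events
```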

Event types (an example log line follows the list):

  • request_received - Direct inference request (sampled)
  • workflow_request_received - Workflow execution started (sampled)
  • model_cache_status - Cache hit/miss per request (sampled)
  • model_loaded_to_disk - Model artifacts downloaded (always logged)
  • model_loaded_to_memory - Model loaded to GPU/CPU (always logged)
  • model_evicted - Model removed from cache (always logged)
  • inference_completed - Inference finished with timing (sampled)
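
For illustration, a model_cache_status entry might be emitted like this; the field names are assumptions, and GCP Cloud Logging parses each JSON line on stdout into a structured jsonPayload:

```python
import json
import sys
import time

# Illustrative only: one structured log line per event. The "severity" key
# maps onto the Cloud Logging entry's severity.
event = {
    "severity": "INFO",
    "event_type": "model_cache_status",
    "request_id": "a1b2c3",      # hypothetical correlation id
    "model_id": "yolov8n-640",   # hypothetical model id
    "cache_hit": True,
    "timestamp": time.time(),
}
print(json.dumps(event), file=sys.stdout, flush=True)
```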

Configuration (env vars; a parsing sketch follows the list):

  • GCP_LOGGING_ENABLED - Master switch (default: True when GCP_SERVERLESS)
  • GCP_LOGGING_SAMPLE_RATE - Sample rate for high-volume events (default: 1.0)
  • GCP_LOGGING_DETAILED_MEMORY - Enable full memory snapshots (default: False)
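
A sketch of how env.py might parse these flags, assuming a simple str2bool helper (the defaults follow the list above):

```python
import os

def str2bool(value: str) -> bool:
    # Assumed helper: treat common truthy strings as True.
    return value.strip().lower() in {"1", "true", "yes"}

GCP_SERVERLESS = str2bool(os.getenv("GCP_SERVERLESS", "False"))
# Enabled by default only when running in GCP serverless mode.
GCP_LOGGING_ENABLED = str2bool(
    os.getenv("GCP_LOGGING_ENABLED", "True" if GCP_SERVERLESS else "False")
)
GCP_LOGGING_SAMPLE_RATE = float(os.getenv("GCP_LOGGING_SAMPLE_RATE", "1.0"))
GCP_LOGGING_DETAILED_MEMORY = str2bool(os.getenv("GCP_LOGGING_DETAILED_MEMORY", "False"))
```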

Test plan

  • Run server locally with GCP_SERVERLESS=true GCP_LOGGING_ENABLED=true and verify JSON logs appear in stdout (a log-counting snippet follows this list)
  • Make direct inference request → verify request_received, model_cache_status, model_loaded_to_memory, inference_completed events
  • Make second request to same model → verify cache_hit: true in events
  • Execute workflow → verify workflow_request_received and events include workflow_instance_id, step_name
  • Test with USE_INFERENCE_EXP_MODELS=true → verify model_loaded_to_disk event logged via callback
  • Test with GCP_LOGGING_DETAILED_MEMORY=true → verify memory snapshot fields populated
  • Test with GCP_LOGGING_SAMPLE_RATE=0.1 → verify ~10% of high-volume events logged
  • Verify logging disabled when GCP_SERVERLESS=false
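
A hypothetical helper for spot-checking the plan above: capture the server's stdout to a file, then count event types to confirm that lifecycle events are always present and high-volume events appear at roughly the configured rate:

```python
import json
from collections import Counter

counts = Counter()
with open("server_stdout.log") as f:   # assumed capture of server stdout
    for line in f:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON log lines
        counts[entry.get("event_type", "unknown")] += 1
print(counts.most_common())
```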

Note

Adds a self-contained GCP structured logging system and wires it through key inference paths for observability in serverless deployments.

  • New inference/core/gcp_logging/ module: JSON logger, event dataclasses, request context, and memory utilities; configurable via GCP_LOGGING_ENABLED, GCP_LOGGING_SAMPLE_RATE, GCP_LOGGING_DETAILED_MEMORY
  • env.py: adds GCP logging env flags (enabled by default when GCP_SERVERLESS)
  • HTTP: GCPServerlessMiddleware sets/clears per-request GCPRequestContext; logs request_received and workflow_request_received (sampled); see the context sketch after this list
  • Model manager: logs model_cache_status (sampled), model_loaded_to_memory (always), and inference_completed (sampled); measures GPU/system memory when detailed mode enabled
  • Fixed-size cache: logs model_evicted with reason, lifetime, inference count; tracks model load times
  • Model downloads: logs model_loaded_to_disk from both legacy roboflow.py path and inference_models via GCPLoggingModelAccessManager
  • Workflows: updates context per step (step_name, workflow_instance_id) for correlated events
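
A minimal sketch of the per-request context described above, assuming a ContextVar-backed GCPRequestContext (field names taken from the PR description; the set/get helpers are illustrative):

```python
import uuid
from contextvars import ContextVar, Token
from dataclasses import dataclass, field
from typing import Optional

# The middleware sets the context when a request arrives and resets it
# afterwards, so every event logged during the request can attach the same
# correlation fields.
@dataclass
class GCPRequestContext:
    request_id: str = field(default_factory=lambda: uuid.uuid4().hex[:12])
    workflow_instance_id: Optional[str] = None
    step_name: Optional[str] = None

_current_context: ContextVar[Optional[GCPRequestContext]] = ContextVar(
    "gcp_request_context", default=None
)

def set_context(ctx: GCPRequestContext) -> Token:
    return _current_context.set(ctx)  # middleware keeps the Token for reset

def get_context() -> Optional[GCPRequestContext]:
    return _current_context.get()
```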

Written by Cursor Bugbot for commit 974b343.

hansent and others added 3 commits January 20, 2026 13:39
Introduces a separate structured logging system for GCP serverless deployments
that captures model lifecycle events to understand cache hit rates, model
thrashing, and memory usage patterns.

New module: inference/core/gcp_logging/
- events.py: 7 event dataclasses (request_received, workflow_request_received,
  model_cache_status, model_loaded_to_disk, model_loaded_to_memory,
  model_evicted, inference_completed)
- context.py: Request context tracking via ContextVar
- memory.py: Tiered memory measurement (basic vs detailed mode; sketch after this list)
- logger.py: GCPServerlessLogger singleton with sampling support
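
A sketch of the basic vs detailed split, assuming psutil for process memory and optional torch for GPU stats; detailed mode synchronizes CUDA first, which is why it is off by default:

```python
import psutil  # assumed dependency for process memory

def measure_memory(detailed: bool = False) -> dict:
    # Basic mode: one cheap process-level counter (resident set size).
    snapshot = {"rss_bytes": psutil.Process().memory_info().rss}
    if detailed:
        try:
            import torch
            if torch.cuda.is_available():
                torch.cuda.synchronize()  # flush pending kernels before reading stats
                snapshot["gpu_allocated_bytes"] = torch.cuda.memory_allocated()
                snapshot["gpu_reserved_bytes"] = torch.cuda.memory_reserved()
        except ImportError:
            pass  # no torch available -> no GPU metrics
    return snapshot
```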

Environment variables:
- GCP_LOGGING_ENABLED: Master switch (default: True when GCP_SERVERLESS)
- GCP_LOGGING_SAMPLE_RATE: Sample rate for high-volume events (default: 1.0)
- GCP_LOGGING_DETAILED_MEMORY: Enable detailed memory introspection (default: False)

Instrumentation points:
- HTTP middleware: Sets up request context with correlation ID
- Workflow executor: Adds step context for workflow invocations
- ModelManager: Logs cache status, model loads, and inference completion
- Fixed size cache: Logs eviction events with lifetime and inference count (sketch after this list)
- Roboflow model: Logs disk download events
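
An illustrative shape for the eviction logging (class layout, attribute names, and the print stand-in for the logger call are all assumptions):

```python
import time

class FixedSizeCacheSketch:
    def __init__(self) -> None:
        self._model_load_times = {}   # queue_id -> load timestamp
        self._inference_counts = {}   # queue_id -> completed inferences

    def evict(self, queue_id: str, reason: str) -> None:
        loaded_at = self._model_load_times.pop(queue_id, None)
        lifetime = time.time() - loaded_at if loaded_at is not None else None
        print({  # stand-in for the real model_evicted logger call
            "event_type": "model_evicted",
            "model_id": queue_id,
            "reason": reason,                # e.g. "cache_full"
            "lifetime_seconds": lifetime,
            "inference_count": self._inference_counts.pop(queue_id, 0),
        })
```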

Features:
- Tiered sampling: High-volume events sampled, lifecycle events always logged
- Tiered memory: Basic mode (cheap) vs detailed mode (GPU sync + syscalls)
- Context propagation: request_id, workflow_instance_id, step_name flow through
- Cache hit tracking: inference_completed includes whether model was pre-loaded

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When USE_INFERENCE_EXP_MODELS is enabled, the model is loaded through
the InferenceExpObjectDetectionModelAdapter which wraps the inference_models
AutoModel. This change detects that adapter pattern and:

- Sets backend to "inference-models" or "inference-models/{backend}"
- Extracts device info from the underlying exp_model

This ensures ModelLoadedToMemoryEvent correctly identifies models loaded
through the inference-models path vs the legacy ONNX path.
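
A sketch of that detection, with exp_model, backend, and device as assumed attribute names:

```python
from typing import Optional, Tuple

# When the model wraps an inference-models AutoModel, report that backend
# (and its device) instead of the legacy ONNX defaults.
def resolve_backend(model: object) -> Tuple[str, Optional[str]]:
    exp_model = getattr(model, "exp_model", None)  # assumed adapter attribute
    if exp_model is None:
        return "onnx", None  # legacy model loading path
    backend = getattr(exp_model, "backend", None)
    label = f"inference-models/{backend}" if backend else "inference-models"
    device = getattr(exp_model, "device", None)
    return label, str(device) if device is not None else None
```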

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When USE_INFERENCE_EXP_MODELS is enabled, models are loaded via the
inference-models library's AutoModel.from_pretrained(). This path
bypasses the legacy download code, so ModelLoadedToDiskEvent wasn't
being logged.

This adds GCPLoggingModelAccessManager - a custom ModelAccessManager
that hooks into inference-models download callbacks:
- on_model_package_access_granted: Start tracking download
- on_file_created: Accumulate file sizes and count
- on_model_loaded: Log ModelLoadedToDiskEvent with totals

The adapter only creates the access manager when GCP logging is enabled.
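
A sketch of the manager with the three hooks named above; the real class would subclass the inference-models ModelAccessManager, which is omitted here along with the actual logger call:

```python
import os
import time
from typing import Optional

class GCPLoggingModelAccessManagerSketch:
    def __init__(self) -> None:
        self._started_at: Optional[float] = None
        self._total_bytes = 0
        self._file_count = 0

    def on_model_package_access_granted(self, model_id: str) -> None:
        self._started_at = time.time()  # start tracking the download

    def on_file_created(self, path: str) -> None:
        self._total_bytes += os.path.getsize(path)  # accumulate artifact sizes
        self._file_count += 1

    def on_model_loaded(self, model_id: str) -> None:
        started = self._started_at or time.time()
        print({  # stand-in for the real model_loaded_to_disk logger call
            "event_type": "model_loaded_to_disk",
            "model_id": model_id,
            "total_bytes": self._total_bytes,
            "file_count": self._file_count,
            "duration_seconds": time.time() - started,
        })
```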

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
"""
if not api_key:
return None
return hashlib.sha256(api_key.encode()).hexdigest()[:16]

Check failure

Code scanning / CodeQL

Use of a broken or weak cryptographic hashing algorithm on sensitive data (High)

Sensitive data (password) is used in a hashing algorithm (SHA256) that is insecure for password hashing, since it is not a computationally expensive hash function.

Copilot Autofix: could not generate an autofix suggestion for this alert.

@grzegorz-roboflow (Collaborator)

bugbot review

@cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

```python
logger.debug(f"Marking new model {queue_id} as most recently used.")
self._key_queue.append(queue_id)
# Track load time for the new model
self._model_load_times[queue_id] = time.time()
```

Memory leak in model load time tracking when logging disabled

Medium Severity

The _model_load_times dictionary is populated unconditionally at line 173 (self._model_load_times[queue_id] = time.time()) for every new model added, but the cleanup logic at line 163 (self._model_load_times.pop(to_remove_model_id, None)) only executes when gcp_logger.enabled is True. When GCP logging is disabled (the default when GCP_SERVERLESS=False), entries are added but never removed during eviction, causing the dictionary to grow unboundedly over the lifetime of the process.

Additional Locations (1)
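
One possible fix, sketched below: pop the load-time entry unconditionally during eviction so the dictionary shrinks with the cache even when logging is disabled, keeping only the event emission behind the enabled check:

```python
import time

def on_evict(model_load_times: dict, model_id: str, logging_enabled: bool) -> None:
    loaded_at = model_load_times.pop(model_id, None)  # unconditional cleanup
    if logging_enabled and loaded_at is not None:
        print({  # stand-in for the real model_evicted logger call
            "event_type": "model_evicted",
            "model_id": model_id,
            "lifetime_seconds": time.time() - loaded_at,
        })
```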

Fix in Cursor Fix in Web

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants