Triton Inference Server

Overview

The Triton Inference Server panel provides a comprehensive interface for managing NVIDIA Triton Inference Server deployments within the Shakudo Platform. This feature enables users to monitor server health, manage AI model loading/unloading, and control serving endpoints for production machine learning inference workloads. Triton supports models from any framework (TensorFlow, PyTorch, ONNX, TensorRT, or custom) and can be deployed on GPU or CPU infrastructure.

Access & Location

  • Route: ?panel=triton-inference-server
  • Navigation: Main Navigation → Triton Inference Server
  • Access Requirements: None specified (standard user access)
  • Feature Flags: None

Key Capabilities

Server Health Monitoring

The panel continuously monitors the Triton Inference Server health status by checking the /v2/health/ready endpoint. A visual indicator (green/red circle) displays whether the server is healthy and ready to serve requests.
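The same readiness probe can be reproduced outside the panel. Below is a minimal sketch in Python, assuming the requests package is available and the TRITON_SERVER environment variable (described under Context & Configuration) points at the server; the localhost fallback is only a placeholder. Triton's standard /v2/health/ready endpoint returns HTTP 200 when the server is ready to serve requests.

```python
import os
import requests

# TRITON_SERVER is the environment variable named in this page;
# the localhost fallback below is only an illustrative placeholder.
TRITON_URL = os.environ.get("TRITON_SERVER", "http://localhost:8000")

def triton_is_ready(base_url: str = TRITON_URL, timeout: float = 5.0) -> bool:
    """Return True if Triton reports ready on /v2/health/ready (HTTP 200)."""
    try:
        resp = requests.get(f"{base_url}/v2/health/ready", timeout=timeout)
        return resp.status_code == 200
    except requests.RequestException:
        return False

if __name__ == "__main__":
    print("Triton ready:", triton_is_ready())
```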

Model Management

Users can view all models in the Triton model repository and control their loading state. Models can be loaded into or unloaded from the server's memory individually, or all at once via the bulk load/unload actions. This allows for efficient resource management when multiple models are available.
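The individual toggles correspond to Triton's model repository load/unload API. Below is a minimal sketch, assuming direct access to the Triton HTTP port; the base URL and the model name resnet50 are placeholders, and Triton only accepts these calls when started in explicit model control mode.

```python
import requests

TRITON_URL = "http://localhost:8000"  # placeholder; use your Triton server's base URL

def set_model_state(model_name: str, load: bool, base_url: str = TRITON_URL) -> None:
    """Load or unload a single model via Triton's model repository API.

    Requires the server to run with explicit model control enabled.
    """
    action = "load" if load else "unload"
    resp = requests.post(f"{base_url}/v2/repository/models/{model_name}/{action}")
    resp.raise_for_status()

# Mirror the panel's per-model toggle for a hypothetical model.
set_model_state("resnet50", load=True)
```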

Endpoint Management

The panel tracks active Triton serving endpoints (pipeline jobs) that are currently running. Users can view endpoint details, check their health status, and cancel endpoints when needed. Each endpoint represents a running service that exposes model inference capabilities via HTTP/HTTPS.

Real-time Logs

Both server-level logs and endpoint-specific logs are available in dedicated panels, providing visibility into model operations, inference requests, and system events.

User Interface

Main View

The interface features two primary tabs accessed via chip buttons:

  • Models Tab: Displays the models table with a server logs panel alongside
  • Endpoints Tab: Shows the endpoints table with an endpoint-specific logs panel alongside

A server health indicator is prominently displayed in the header, showing real-time status of the Triton server.

Dialogs & Modals

  1. Cancel Endpoint Dialog
    • Purpose: Confirm cancellation of a running Triton endpoint
    • Fields: Confirmation message with endpoint ID
    • Actions: Close (dismiss the dialog without cancelling) or Cancel (confirm cancellation of the endpoint)

Tables & Data Grids

  1. Models Table

    • Columns:
      • Model: Model name (clickable to copy)
      • Version: Model version number
      • Bucket Path: Full path to model in cloud storage (clickable to copy)
      • State: Toggle switch showing Loaded/Unloaded status
    • Actions:
      • Load All: Load all available models into server memory
      • Unload All: Unload all models from server memory
      • Refresh: Reload the models list
      • Individual toggle: Load/unload specific models
    • Filtering: None
    • Pagination: 10 items per page
  2. Endpoints Table

    • Columns:
      • Name: Endpoint name with cancel button (clickable to copy)
      • Endpoint: Full URL to the serving endpoint (clickable to copy)
      • Health: Real-time health check indicator
    • Actions:
      • Cancel endpoint (X button per row)
      • Refresh: Reload the endpoints list
      • Row click: Select endpoint to view logs
    • Filtering: Automatically filters to only show active Triton endpoints (excludes cancelled, failed, or completed jobs)
    • Pagination: Server-side pagination with 10 items per page

Technical Details

GraphQL Operations

Queries:

  • tritonServices - Fetches active Triton pipeline jobs (endpoints) with filtering for jobType='triton' and excluding cancelled/failed/completed jobs. Returns id, jobName, jobType, status, dashboardPrefix, and daskDashboardUrl.

Mutations:

  • cancelEndpoint - Updates a pipeline job status to 'cancelled' by ID, effectively terminating the endpoint.

Subscriptions: None
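For reference, the tritonServices query above can also be exercised directly against the platform's GraphQL API. The sketch below is illustrative only: the GraphQL endpoint URL, the authentication header, and the exact query arguments are assumptions; only the query name and field names come from this page.

```python
import requests

# Hypothetical GraphQL endpoint and auth token; adjust to your Shakudo deployment.
GRAPHQL_URL = "https://<your-domain>/graphql"
HEADERS = {"Authorization": "Bearer <token>"}

QUERY = """
query TritonServices {
  tritonServices {
    id
    jobName
    jobType
    status
    dashboardPrefix
    daskDashboardUrl
  }
}
"""

resp = requests.post(GRAPHQL_URL, json={"query": QUERY}, headers=HEADERS, timeout=10)
resp.raise_for_status()
print(resp.json())
```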

REST API Endpoints

Model Operations:

  • POST /api/triton-dashboard/get-models - Fetches model repository index from Triton server (/v2/repository/index)
  • POST /api/triton-dashboard/load-models - Loads or unloads a specific model (/v2/repository/models/{name}/{action})

Server Monitoring:

  • POST /api/triton-dashboard/check-url - Health check endpoint validator
  • POST /api/triton-dashboard/logs - Retrieves Triton server logs
  • POST /api/triton-dashboard/server-metrics - Fetches server performance metrics

Endpoint Operations:

  • POST /api/triton-dashboard/check-endpoint-status - Validates endpoint health
  • POST /api/triton-dashboard/endpoint-logs - Retrieves logs for specific endpoints

Component Structure

  • Main Component: shakudo-apps/triton-dashboard/components/Panels/TritonPanel.tsx
  • Tables:
    • shakudo-apps/triton-dashboard/components/Tables/TritonModels.tsx
    • shakudo-apps/triton-dashboard/components/Tables/TritonEndpoints.tsx
  • Dialogs: shakudo-apps/triton-dashboard/components/Dialogs/CancelEndpoint.tsx
  • Toggles: shakudo-apps/triton-dashboard/components/Toggle/LoadUnloadModelToggle.tsx
  • Log Containers:
    • shakudo-apps/triton-dashboard/components/Containers/TritonLogs.tsx
    • shakudo-apps/triton-dashboard/components/Containers/EndpointLogs.tsx

Context & Configuration

  • TritonAppContext: Provides server URL and model repository path configuration
  • Environment Variables:
    • TRITON_SERVER: Base URL for the Triton Inference Server

Common Workflows

Deploying a New Model

  1. Upload the model checkpoint to the Triton model repository (cloud bucket path: {bucket}/triton-server/model-repository/)
  2. Structure the model following Triton model repository format
  3. Wait for automatic detection or manually refresh the Models tab
  4. Toggle the model state from "Unloaded" to "Loaded"
  5. Verify the model appears as "Loaded" in the state column
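Step 5 can also be verified programmatically against the same repository index the panel reads. A minimal sketch, assuming direct access to the Triton HTTP port (the base URL is a placeholder); each index entry reports the model's name, version, and state (READY once loaded).

```python
import requests

TRITON_URL = "http://localhost:8000"  # placeholder for your Triton base URL

# Ask Triton for its model repository index and print each model's state.
resp = requests.post(f"{TRITON_URL}/v2/repository/index", json={})
resp.raise_for_status()
for model in resp.json():
    print(model.get("name"), model.get("version"), model.get("state"))
```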

Creating a Model Serving Endpoint

  1. Ensure your model is loaded in the Models tab
  2. Write a client application using Triton client libraries
  3. Wrap the client with FastAPI or Flask
  4. Deploy the client as a pipeline job with jobType='triton'
  5. Monitor the endpoint in the Endpoints tab
  6. Use the provided URL to make inference requests
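Steps 2 and 3 can look roughly like the sketch below: a FastAPI app that forwards requests to Triton via the official tritonclient package. The Triton host, model name, tensor names, shape, and datatype are illustrative placeholders and must match your model's configuration; fastapi, pydantic, numpy, and tritonclient[http] are assumed to be installed.

```python
import numpy as np
import tritonclient.http as httpclient
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
# The Triton client expects host:port without a scheme; the host is a placeholder.
triton = httpclient.InferenceServerClient(url="triton-server:8000")

class InferRequest(BaseModel):
    data: list[float]

@app.post("/infer")
def infer(req: InferRequest):
    # Build the input tensor; names, shape, and dtype depend on your model.
    arr = np.array(req.data, dtype=np.float32).reshape(1, -1)
    inp = httpclient.InferInput("INPUT__0", list(arr.shape), "FP32")
    inp.set_data_from_numpy(arr)
    result = triton.infer(model_name="my_model", inputs=[inp])
    return {"output": result.as_numpy("OUTPUT__0").tolist()}
```

Deploying this app as a pipeline job with jobType='triton' (steps 4 through 6) then follows the platform's usual pipeline job flow.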

Managing Server Resources

  1. Navigate to the Models tab
  2. Review which models are currently loaded
  3. Unload unused models to free memory using individual toggles
  4. Use "Load All" before batch inference operations
  5. Use "Unload All" to clear server memory completely

Troubleshooting Failed Endpoints

  1. Switch to the Endpoints tab
  2. Identify the problematic endpoint
  3. Click on the endpoint row to view its logs in the side panel
  4. Review logs for error messages
  5. If necessary, cancel the endpoint using the X button
  6. Fix the underlying issue and redeploy

Notes & Tips

Model Repository Best Practices

  • Follow the Triton model repository structure strictly to ensure automatic detection
  • For TensorFlow models, config.pbtxt can be auto-generated by Triton
  • Model files are stored at: {cloud_bucket}/triton-server/model-repository/
  • Each model should have its own subdirectory with version subdirectories
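For illustration, a repository laid out along the following lines is what Triton expects to find under that path; the model name, version numbers, and file format are placeholders that depend on your backend.

```
{cloud_bucket}/triton-server/model-repository/
└── my_model/               # one directory per model
    ├── config.pbtxt        # model configuration (auto-generated for some backends)
    ├── 1/                  # version subdirectory
    │   └── model.onnx      # model file; name and format depend on the backend
    └── 2/
        └── model.onnx
```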

Performance Optimization

  • Keep only frequently used models loaded to optimize memory usage
  • Unload models during low-traffic periods to free resources
  • Use bulk load operations when preparing for batch inference workloads
  • Monitor server health indicator before deploying new endpoints

Endpoint Configuration

  • Custom URL endpoints can be specified during client deployment
  • Inference endpoints typically follow the pattern: https://{domain}/hyperplane.dev/{endpoint_name}/infer/
  • An endpoint must have its daskDashboardUrl field populated in order to appear in the endpoints table
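Once an endpoint is up, calling it is an ordinary HTTPS request to the URL shown in the Endpoints tab. The sketch below is illustrative: the domain, endpoint name, and JSON payload are placeholders, since the payload schema is defined entirely by the FastAPI/Flask client you deployed (see the serving-endpoint sketch above).

```python
import requests

# Placeholder values following the URL pattern noted above.
ENDPOINT_URL = "https://{domain}/hyperplane.dev/{endpoint_name}/infer/".format(
    domain="example.com", endpoint_name="my-endpoint"
)

# The request body must match whatever schema your client application expects.
resp = requests.post(ENDPOINT_URL, json={"data": [0.1, 0.2, 0.3]}, timeout=30)
resp.raise_for_status()
print(resp.json())
```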

Multi-Model and Ensemble Serving

  • Multiple models can be loaded simultaneously and served from a single endpoint
  • Parameterize client inference functions with model_name for multi-model endpoints
  • Ensemble models use the ensemble platform in config.pbtxt with ensemble_scheduling configuration
  • Ensemble models can execute multiple models concurrently using Python backend with asyncio

Log Monitoring

  • Server logs (left panel in Models tab) show server-level events and model loading operations
  • Endpoint logs (right panel in Endpoints tab) show request-specific logs for selected endpoints
  • Click on any endpoint row to switch the log view to that specific endpoint

Health Checks

  • Server health is checked via /v2/health/ready endpoint
  • Individual endpoint health indicators appear in the Health column
  • Green indicator = healthy and ready, Red indicator = unhealthy or not ready

Troubleshooting

  • If models don't appear after upload, check the model repository structure and refresh
  • If load/unload operations fail, verify server health and check server logs
  • Endpoint cancellation changes job status but doesn't immediately terminate running processes
  • Failed endpoints remain visible until explicitly filtered or cancelled