Triton Inference Server

Overview

The Triton Inference Server panel provides a comprehensive interface for managing NVIDIA Triton Inference Server deployments within the Shakudo Platform. This feature enables users to monitor server health, manage AI model loading/unloading, and control serving endpoints for production machine learning inference workloads. Triton supports models from any framework (TensorFlow, PyTorch, ONNX, TensorRT, or custom) and can be deployed on GPU or CPU infrastructure.

Access & Location

  • Route: ?panel=triton-inference-server
  • Navigation: Main Navigation → Triton Inference Server
  • Access Requirements: None specified (standard user access)
  • Feature Flags: None

Key Capabilities

Server Health Monitoring

The panel continuously monitors the Triton Inference Server health status by checking the /v2/health/ready endpoint. A visual indicator (green/red circle) displays whether the server is healthy and ready to serve requests.
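The same readiness probe can be reproduced outside the panel. Below is a minimal sketch in Python, assuming the requests package is available and the TRITON_SERVER environment variable (described under Context & Configuration) points at the server; the localhost fallback is only a placeholder. Triton's standard /v2/health/ready endpoint returns HTTP 200 when the server is ready to serve requests.

```python
import os
import requests

# TRITON_SERVER is the environment variable named in this page;
# the localhost fallback below is only an illustrative placeholder.
TRITON_URL = os.environ.get("TRITON_SERVER", "http://localhost:8000")

def triton_is_ready(base_url: str = TRITON_URL, timeout: float = 5.0) -> bool:
    """Return True if Triton reports ready on /v2/health/ready (HTTP 200)."""
    try:
        resp = requests.get(f"{base_url}/v2/health/ready", timeout=timeout)
        return resp.status_code == 200
    except requests.RequestException:
        return False

if __name__ == "__main__":
    print("Triton ready:", triton_is_ready())
```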

Model Management

Users can view all models in the Triton model repository and control their loading state. Models can be loaded into or unloaded from the server's memory individually, or all at once via the bulk load/unload actions. This allows for efficient resource management when multiple models are available.
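The individual toggles correspond to Triton's model repository load/unload API. Below is a minimal sketch, assuming direct access to the Triton HTTP port; the base URL and the model name resnet50 are placeholders, and Triton only accepts these calls when started in explicit model control mode.

```python
import requests

TRITON_URL = "http://localhost:8000"  # placeholder; use your Triton server's base URL

def set_model_state(model_name: str, load: bool, base_url: str = TRITON_URL) -> None:
    """Load or unload a single model via Triton's model repository API.

    Requires the server to run with explicit model control enabled.
    """
    action = "load" if load else "unload"
    resp = requests.post(f"{base_url}/v2/repository/models/{model_name}/{action}")
    resp.raise_for_status()

# Mirror the panel's per-model toggle for a hypothetical model.
set_model_state("resnet50", load=True)
```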

Endpoint Management

The panel tracks active Triton serving endpoints (pipeline jobs) that are currently running. Users can view endpoint details, check their health status, and cancel endpoints when needed. Each endpoint represents a running service that exposes model inference capabilities via HTTP/HTTPS.

Real-time Logs

Both server-level logs and endpoint-specific logs are available in dedicated panels, providing visibility into model operations, inference requests, and system events.

User Interface

Main View

The interface features two primary tabs accessed via chip buttons:

  • Models Tab: Displays the models table with a server logs panel alongside
  • Endpoints Tab: Shows the endpoints table with an endpoint-specific logs panel alongside

A server health indicator is prominently displayed in the header, showing real-time status of the Triton server.

Dialogs & Modals

  1. Cancel Endpoint Dialog
    • Purpose: Confirm cancellation of a running Triton endpoint
    • Fields: Confirmation message with endpoint ID
    • Actions: Close (dismiss the dialog without cancelling) or Cancel (confirm cancellation of the endpoint)

Tables & Data Grids

  1. Models Table

    • Columns:
      • Model: Model name (clickable to copy)
      • Version: Model version number
      • Bucket Path: Full path to model in cloud storage (clickable to copy)
      • State: Toggle switch showing Loaded/Unloaded status
    • Actions:
      • Load All: Load all available models into server memory
      • Unload All: Unload all models from server memory
      • Refresh: Reload the models list
      • Individual toggle: Load/unload specific models
    • Filtering: None
    • Pagination: 10 items per page
  2. Endpoints Table

    • Columns:
      • Name: Endpoint name with cancel button (clickable to copy)
      • Endpoint: Full URL to the serving endpoint (clickable to copy)
      • Health: Real-time health check indicator
    • Actions:
      • Cancel endpoint (X button per row)
      • Refresh: Reload the endpoints list
      • Row click: Select endpoint to view logs
    • Filtering: Automatically filters to only show active Triton endpoints (excludes cancelled, failed, or completed jobs)
    • Pagination: Server-side pagination with 10 items per page

Technical Details

GraphQL Operations

Queries:

  • tritonServices - Fetches active Triton pipeline jobs (endpoints) with filtering for jobType='triton' and excluding cancelled/failed/completed jobs. Returns id, jobName, jobType, status, dashboardPrefix, and daskDashboardUrl.

Mutations:

  • cancelEndpoint - Updates a pipeline job status to 'cancelled' by ID, effectively terminating the endpoint.

Subscriptions: None
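For reference, the tritonServices query above can also be exercised directly against the platform's GraphQL API. The sketch below is illustrative only: the GraphQL endpoint URL, the authentication header, and the exact query arguments are assumptions; only the query name and field names come from this page.

```python
import requests

# Hypothetical GraphQL endpoint and auth token; adjust to your Shakudo deployment.
GRAPHQL_URL = "https://<your-domain>/graphql"
HEADERS = {"Authorization": "Bearer <token>"}

QUERY = """
query TritonServices {
  tritonServices {
    id
    jobName
    jobType
    status
    dashboardPrefix
    daskDashboardUrl
  }
}
"""

resp = requests.post(GRAPHQL_URL, json={"query": QUERY}, headers=HEADERS, timeout=10)
resp.raise_for_status()
print(resp.json())
```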

REST API Endpoints

Model Operations:

  • POST /api/triton-dashboard/get-models - Fetches model repository index from Triton server (/v2/repository/index)
  • POST /api/triton-dashboard/load-models - Loads or unloads a specific model (/v2/repository/models/{name}/{action})

Server Monitoring:

  • POST /api/triton-dashboard/check-url - Health check endpoint validator
  • POST /api/triton-dashboard/logs - Retrieves Triton server logs
  • POST /api/triton-dashboard/server-metrics - Fetches server performance metrics

Endpoint Operations:

  • POST /api/triton-dashboard/check-endpoint-status - Validates endpoint health
  • POST /api/triton-dashboard/endpoint-logs - Retrieves logs for specific endpoints

Component Structure

  • Main Component: shakudo-apps/triton-dashboard/components/Panels/TritonPanel.tsx
  • Tables:
    • shakudo-apps/triton-dashboard/components/Tables/TritonModels.tsx
    • shakudo-apps/triton-dashboard/components/Tables/TritonEndpoints.tsx
  • Dialogs: shakudo-apps/triton-dashboard/components/Dialogs/CancelEndpoint.tsx
  • Toggles: shakudo-apps/triton-dashboard/components/Toggle/LoadUnloadModelToggle.tsx
  • Log Containers:
    • shakudo-apps/triton-dashboard/components/Containers/TritonLogs.tsx
    • shakudo-apps/triton-dashboard/components/Containers/EndpointLogs.tsx

Context & Configuration

  • TritonAppContext: Provides server URL and model repository path configuration
  • Environment Variables:
    • TRITON_SERVER: Base URL for the Triton Inference Server

Common Workflows

Deploying a New Model

  1. Upload the model checkpoint to the Triton model repository (cloud bucket path: {bucket}/triton-server/model-repository/)
  2. Structure the model following Triton model repository format
  3. Wait for automatic detection or manually refresh the Models tab
  4. Toggle the model state from "Unloaded" to "Loaded"
  5. Verify the model appears as "Loaded" in the state column
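Step 5 can also be verified programmatically against the same repository index the panel reads. A minimal sketch, assuming direct access to the Triton HTTP port (the base URL is a placeholder); each index entry reports the model's name, version, and state (READY once loaded).

```python
import requests

TRITON_URL = "http://localhost:8000"  # placeholder for your Triton base URL

# Ask Triton for its model repository index and print each model's state.
resp = requests.post(f"{TRITON_URL}/v2/repository/index", json={})
resp.raise_for_status()
for model in resp.json():
    print(model.get("name"), model.get("version"), model.get("state"))
```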

Creating a Model Serving Endpoint

  1. Ensure your model is loaded in the Models tab
  2. Write a client application using Triton client libraries
  3. Wrap the client with FastAPI or Flask
  4. Deploy the client as a pipeline job with jobType='triton'
  5. Monitor the endpoint in the Endpoints tab
  6. Use the provided URL to make inference requests
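Steps 2 and 3 can look roughly like the sketch below: a FastAPI app that forwards requests to Triton via the official tritonclient package. The Triton host, model name, tensor names, shape, and datatype are illustrative placeholders and must match your model's configuration; fastapi, pydantic, numpy, and tritonclient[http] are assumed to be installed.

```python
import numpy as np
import tritonclient.http as httpclient
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
# The Triton client expects host:port without a scheme; the host is a placeholder.
triton = httpclient.InferenceServerClient(url="triton-server:8000")

class InferRequest(BaseModel):
    data: list[float]

@app.post("/infer")
def infer(req: InferRequest):
    # Build the input tensor; names, shape, and dtype depend on your model.
    arr = np.array(req.data, dtype=np.float32).reshape(1, -1)
    inp = httpclient.InferInput("INPUT__0", list(arr.shape), "FP32")
    inp.set_data_from_numpy(arr)
    result = triton.infer(model_name="my_model", inputs=[inp])
    return {"output": result.as_numpy("OUTPUT__0").tolist()}
```

Deploying this app as a pipeline job with jobType='triton' (steps 4 through 6) then follows the platform's usual pipeline job flow.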

Managing Server Resources

  1. Navigate to the Models tab
  2. Review which models are currently loaded
  3. Unload unused models to free memory using individual toggles
  4. Use "Load All" before batch inference operations
  5. Use "Unload All" to clear server memory completely

Troubleshooting Failed Endpoints

  1. Switch to the Endpoints tab
  2. Identify the problematic endpoint
  3. Click on the endpoint row to view its logs in the side panel
  4. Review logs for error messages
  5. If necessary, cancel the endpoint using the X button
  6. Fix the underlying issue and redeploy

Notes & Tips

Model Repository Best Practices

  • Follow the Triton model repository structure strictly to ensure automatic detection
  • For TensorFlow models, config.pbtxt can be auto-generated by Triton
  • Model files are stored at: {cloud_bucket}/triton-server/model-repository/
  • Each model should have its own subdirectory with version subdirectories
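For illustration, a repository laid out along the following lines is what Triton expects to find under that path; the model name, version numbers, and file format are placeholders that depend on your backend.

```
{cloud_bucket}/triton-server/model-repository/
└── my_model/               # one directory per model
    ├── config.pbtxt        # model configuration (auto-generated for some backends)
    ├── 1/                  # version subdirectory
    │   └── model.onnx      # model file; name and format depend on the backend
    └── 2/
        └── model.onnx
```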

Performance Optimization

  • Keep only frequently used models loaded to optimize memory usage
  • Unload models during low-traffic periods to free resources
  • Use bulk load operations when preparing for batch inference workloads
  • Monitor server health indicator before deploying new endpoints

Endpoint Configuration

  • Custom URL endpoints can be specified during client deployment
  • Inference endpoints typically follow the pattern: https://{domain}/hyperplane.dev/{endpoint_name}/infer/
  • An endpoint must have its daskDashboardUrl field populated in order to appear in the endpoints table
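Once an endpoint is up, calling it is an ordinary HTTPS request to the URL shown in the Endpoints tab. The sketch below is illustrative: the domain, endpoint name, and JSON payload are placeholders, since the payload schema is defined entirely by the FastAPI/Flask client you deployed (see the serving-endpoint sketch above).

```python
import requests

# Placeholder values following the URL pattern noted above.
ENDPOINT_URL = "https://{domain}/hyperplane.dev/{endpoint_name}/infer/".format(
    domain="example.com", endpoint_name="my-endpoint"
)

# The request body must match whatever schema your client application expects.
resp = requests.post(ENDPOINT_URL, json={"data": [0.1, 0.2, 0.3]}, timeout=30)
resp.raise_for_status()
print(resp.json())
```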

Multi-Model and Ensemble Serving

  • Multiple models can be loaded simultaneously and served from a single endpoint
  • Parameterize client inference functions with model_name for multi-model endpoints
  • Ensemble models use the ensemble platform in config.pbtxt with ensemble_scheduling configuration
  • Ensemble models can execute multiple models concurrently using Python backend with asyncio

Log Monitoring

  • Server logs (left panel in Models tab) show server-level events and model loading operations
  • Endpoint logs (right panel in Endpoints tab) show request-specific logs for selected endpoints
  • Click on any endpoint row to switch the log view to that specific endpoint

Health Checks

  • Server health is checked via /v2/health/ready endpoint
  • Individual endpoint health indicators appear in the Health column
  • Green indicator = healthy and ready, Red indicator = unhealthy or not ready

Troubleshooting

  • If models don't appear after upload, check the model repository structure and refresh
  • If load/unload operations fail, verify server health and check server logs
  • Endpoint cancellation changes job status but doesn't immediately terminate running processes
  • Failed endpoints remain visible until explicitly filtered or cancelled