Triton Inference Server
Overview
The Triton Inference Server panel provides a comprehensive interface for managing NVIDIA Triton Inference Server deployments within the Shakudo Platform. This feature enables users to monitor server health, manage AI model loading/unloading, and control serving endpoints for production machine learning inference workloads. Triton supports models from any framework (TensorFlow, PyTorch, ONNX, TensorRT, or custom) and can be deployed on GPU or CPU infrastructure.
Access & Location
- Route: ?panel=triton-inference-server
- Navigation: Main Navigation → Triton Inference Server
- Access Requirements: None specified (standard user access)
- Feature Flags: None
Key Capabilities
Server Health Monitoring
The panel continuously monitors the Triton Inference Server health status by checking the /v2/health/ready
endpoint. A visual indicator (green/red circle) displays whether the server is healthy and ready to serve requests.
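For reference, the same readiness probe can be reproduced outside the panel. The sketch below is a minimal example, assuming the server is reachable at the base URL configured in TRITON_SERVER (the host and port shown are placeholders).

```python
import requests

# Placeholder base URL; in practice this comes from the TRITON_SERVER environment variable.
TRITON_SERVER = "http://triton-server:8000"

def is_server_ready(base_url: str = TRITON_SERVER, timeout: float = 2.0) -> bool:
    """Return True when Triton answers HTTP 200 on /v2/health/ready."""
    try:
        return requests.get(f"{base_url}/v2/health/ready", timeout=timeout).status_code == 200
    except requests.RequestException:
        return False

print("ready" if is_server_ready() else "not ready")
```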
Model Management
Users can view all models in the Triton model repository and control their loading state. Models can be individually loaded or unloaded from the server's memory, or bulk operations can load/unload all models simultaneously. This allows for efficient resource management when multiple models are available.
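Under the hood, these operations correspond to Triton's standard model repository HTTP API, which the panel reaches through its /api/triton-dashboard routes. The sketch below calls that API directly for illustration; the server URL and model name are placeholders, and load/unload requests only succeed when Triton runs in explicit model control mode.

```python
import requests

TRITON_SERVER = "http://triton-server:8000"  # placeholder; see the TRITON_SERVER env var

def list_models(base_url: str = TRITON_SERVER) -> list:
    """Return the model repository index (name, version, and state for each model)."""
    resp = requests.post(f"{base_url}/v2/repository/index")
    resp.raise_for_status()
    return resp.json()

def set_model_state(name: str, action: str, base_url: str = TRITON_SERVER) -> None:
    """Load or unload a single model; action must be 'load' or 'unload'."""
    requests.post(f"{base_url}/v2/repository/models/{name}/{action}").raise_for_status()

set_model_state("my_model", "load")  # hypothetical model name
print(list_models())
```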
Endpoint Management
The panel tracks active Triton serving endpoints (pipeline jobs) that are currently running. Users can view endpoint details, check their health status, and cancel endpoints when needed. Each endpoint represents a running service that exposes model inference capabilities via HTTP/HTTPS.
Real-time Logs
Both server-level logs and endpoint-specific logs are available in dedicated panels, providing visibility into model operations, inference requests, and system events.
User Interface
Main View
The interface features two primary tabs accessed via chip buttons:
- Models Tab: Displays the models table with server logs panel on the side
- Endpoints Tab: Shows the endpoints table with endpoint-specific logs panel on the side
A server health indicator is prominently displayed in the header, showing real-time status of the Triton server.
Dialogs & Modals
- Cancel Endpoint Dialog
- Purpose: Confirm cancellation of a running Triton endpoint
- Fields: Confirmation message with endpoint ID
- Actions: Close (abort) or Cancel (confirm cancellation)
Tables & Data Grids
Models Table
- Columns:
- Model: Model name (clickable to copy)
- Version: Model version number
- Bucket Path: Full path to model in cloud storage (clickable to copy)
- State: Toggle switch showing Loaded/Unloaded status
- Actions:
- Load All: Load all available models into server memory
- Unload All: Unload all models from server memory
- Refresh: Reload the models list
- Individual toggle: Load/unload specific models
- Filtering: None
- Pagination: 10 items per page
Endpoints Table
- Columns:
- Name: Endpoint name with cancel button (clickable to copy)
- Endpoint: Full URL to the serving endpoint (clickable to copy)
- Health: Real-time health check indicator
- Actions:
- Cancel endpoint (X button per row)
- Refresh: Reload the endpoints list
- Row click: Select endpoint to view logs
- Filtering: Automatically filters to only show active Triton endpoints (excludes cancelled, failed, or completed jobs)
- Pagination: Server-side pagination with 10 items per page
Technical Details
GraphQL Operations
Queries:
tritonServices
- Fetches active Triton pipeline jobs (endpoints) with filtering for jobType='triton' and excluding cancelled/failed/completed jobs. Returns id, jobName, jobType, status, dashboardPrefix, and daskDashboardUrl.
Mutations:
cancelEndpoint
- Updates a pipeline job status to 'cancelled' by ID, effectively terminating the endpoint.
Subscriptions: None
REST API Endpoints
Model Operations:
POST /api/triton-dashboard/get-models
- Fetches the model repository index from the Triton server (/v2/repository/index)
POST /api/triton-dashboard/load-models
- Loads or unloads a specific model (/v2/repository/models/{name}/{action})
Server Monitoring:
POST /api/triton-dashboard/check-url
- Health check endpoint validator
POST /api/triton-dashboard/logs
- Retrieves Triton server logs
POST /api/triton-dashboard/server-metrics
- Fetches server performance metrics
Endpoint Operations:
POST /api/triton-dashboard/check-endpoint-status
- Validates endpoint health
POST /api/triton-dashboard/endpoint-logs
- Retrieves logs for a specific endpoint
Component Structure
- Main Component:
shakudo-apps/triton-dashboard/components/Panels/TritonPanel.tsx
- Tables:
shakudo-apps/triton-dashboard/components/Tables/TritonModels.tsx
shakudo-apps/triton-dashboard/components/Tables/TritonEndpoints.tsx
- Dialogs:
shakudo-apps/triton-dashboard/components/Dialogs/CancelEndpoint.tsx
- Toggles:
shakudo-apps/triton-dashboard/components/Toggle/LoadUnloadModelToggle.tsx
- Log Containers:
shakudo-apps/triton-dashboard/components/Containers/TritonLogs.tsx
shakudo-apps/triton-dashboard/components/Containers/EndpointLogs.tsx
Context & Configuration
- TritonAppContext: Provides server URL and model repository path configuration
- Environment Variables:
TRITON_SERVER: Base URL for the Triton Inference Server
Common Workflows
Deploying a New Model
- Upload the model checkpoint to the Triton model repository (cloud bucket path: {bucket}/triton-server/model-repository/)
- Structure the model following the Triton model repository format (see the layout sketch after this list)
- Wait for automatic detection or manually refresh the Models tab
- Toggle the model state from "Unloaded" to "Loaded"
- Verify the model appears as "Loaded" in the State column
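As an illustration of the repository format, a minimal layout for a hypothetical ONNX model named my_model might look like the following (the model name and backend are placeholders; some backends, such as ONNX and TensorFlow SavedModel, let Triton auto-generate the configuration):

```text
{bucket}/triton-server/model-repository/
└── my_model/
    ├── config.pbtxt      # optional for backends that support auto-generated configs
    └── 1/                # version subdirectory
        └── model.onnx
```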
Creating a Model Serving Endpoint
- Ensure your model is loaded in the Models tab
- Write a client application using Triton client libraries
- Wrap the client with FastAPI or Flask
- Deploy the client as a pipeline job with jobType='triton'
- Monitor the endpoint in the Endpoints tab
- Use the provided URL to make inference requests
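A minimal sketch of such a client is shown below, using the tritonclient HTTP library wrapped in FastAPI. The server address, model name, tensor names, and dtype are hypothetical placeholders; adjust them to match your model's config.pbtxt and serve the app (for example with uvicorn) when deploying it as a jobType='triton' pipeline job.

```python
import numpy as np
import tritonclient.http as httpclient
from fastapi import FastAPI

app = FastAPI()

TRITON_URL = "triton-server:8000"   # placeholder host:port of the Triton HTTP endpoint
MODEL_NAME = "my_model"             # hypothetical model name

client = httpclient.InferenceServerClient(url=TRITON_URL)

@app.post("/infer/")
def infer(payload: dict):
    # Expects a JSON body like {"inputs": [[...]]}; shape and dtype are model-specific.
    data = np.asarray(payload["inputs"], dtype=np.float32)
    infer_input = httpclient.InferInput("INPUT__0", list(data.shape), "FP32")
    infer_input.set_data_from_numpy(data)
    result = client.infer(MODEL_NAME, inputs=[infer_input])
    return {"outputs": result.as_numpy("OUTPUT__0").tolist()}
```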
Managing Server Resources
- Navigate to the Models tab
- Review which models are currently loaded
- Unload unused models to free memory using individual toggles
- Use "Load All" before batch inference operations
- Use "Unload All" to clear server memory completely
Troubleshooting Failed Endpoints
- Switch to the Endpoints tab
- Identify the problematic endpoint
- Click on the endpoint row to view its logs in the side panel
- Review logs for error messages
- If necessary, cancel the endpoint using the X button
- Fix the underlying issue and redeploy
Related Features
- Immediate Jobs - Triton endpoints are managed as pipeline jobs
- Services - Similar service management for other types of deployments
- Environment Configs - Configure compute resources for Triton deployments
Notes & Tips
Model Repository Best Practices
- Follow the Triton model repository structure strictly to ensure automatic detection
- For TensorFlow models, config.pbtxt can be auto-generated by Triton
- Model files are stored at {cloud_bucket}/triton-server/model-repository/
- Each model should have its own subdirectory with version subdirectories
Performance Optimization
- Only keep frequently-used models loaded to optimize memory usage
- Unload models during low-traffic periods to free resources
- Use bulk load operations when preparing for batch inference workloads
- Monitor server health indicator before deploying new endpoints
Endpoint Configuration
- Custom URL endpoints can be specified during client deployment
- Inference endpoints typically follow the pattern https://{domain}/hyperplane.dev/{endpoint_name}/infer/ (see the request sketch after this list)
- Endpoints must have the daskDashboardUrl field set in order to appear in the endpoints table
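Once an endpoint is live, inference requests can be made from any HTTP client. The sketch below is illustrative only: the URL is a placeholder to be replaced by the value copied from the Endpoint column, and the payload shape is defined by your client application rather than by Triton.

```python
import requests

# Placeholder URL; copy the real one from the Endpoint column of the Endpoints tab.
ENDPOINT_URL = "https://{domain}/hyperplane.dev/{endpoint_name}/infer/"

# Illustrative payload; the expected body depends on how the client application was written.
response = requests.post(ENDPOINT_URL, json={"inputs": [[1.0, 2.0, 3.0]]}, timeout=30)
response.raise_for_status()
print(response.json())
```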
Multi-Model and Ensemble Serving
- Multiple models can be loaded simultaneously and served from a single endpoint
- Parameterize client inference functions with model_name for multi-model endpoints
- Ensemble models use the ensemble platform in config.pbtxt with an ensemble_scheduling configuration (see the sketch after this list)
- Ensemble models can execute multiple models concurrently using the Python backend with asyncio
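As a sketch of the ensemble format, a config.pbtxt for a hypothetical two-step pipeline (preprocessing followed by a classifier) could look like the following; every model and tensor name here is a placeholder:

```text
name: "ensemble_model"
platform: "ensemble"
input [ { name: "RAW_INPUT", data_type: TYPE_FP32, dims: [ -1 ] } ]
output [ { name: "SCORES", data_type: TYPE_FP32, dims: [ -1 ] } ]
ensemble_scheduling {
  step [
    {
      model_name: "preprocess"
      model_version: -1
      input_map { key: "INPUT", value: "RAW_INPUT" }
      output_map { key: "OUTPUT", value: "features" }
    },
    {
      model_name: "classifier"
      model_version: -1
      input_map { key: "INPUT", value: "features" }
      output_map { key: "OUTPUT", value: "SCORES" }
    }
  ]
}
```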
Log Monitoring
- Server logs (left panel in Models tab) show server-level events and model loading operations
- Endpoint logs (right panel in Endpoints tab) show request-specific logs for selected endpoints
- Click on any endpoint row to switch the log view to that specific endpoint
Health Checks
- Server health is checked via the /v2/health/ready endpoint
- Individual endpoint health indicators appear in the Health column
- Green indicator = healthy and ready, Red indicator = unhealthy or not ready
Troubleshooting
- If models don't appear after upload, check the model repository structure and refresh
- If load/unload operations fail, verify server health and check server logs
- Endpoint cancellation changes job status but doesn't immediately terminate running processes
- Failed endpoints remain visible until explicitly filtered or cancelled