Distributed Workloads Dashboard

Overview

The Distributed Workloads Dashboard provides a centralized view for monitoring and managing distributed compute clusters created within the Shakudo platform. This feature displays active Dask and Ray clusters that have been deployed as part of user sessions or pipeline jobs, allowing administrators and users to view cluster details, access cluster dashboards, and manage cluster lifecycle.

Access & Location

  • Route: ?panel=distributed-workloads-dashboard
  • Navigation: Monitoring → Distributed Workloads
  • Access Requirements: None specified (accessible to authenticated users)
  • Feature Flags: None

Key Capabilities

View Active Dask Clusters

Monitor all active Dask clusters across the platform, including those created from JupyterHub sessions and pipeline jobs. View cluster specifications such as worker cores, RAM allocation, and number of worker nodes.

View Active Ray Clusters

Monitor all active Ray clusters across the platform. Similar to Dask clusters, Ray clusters can originate from sessions or pipeline jobs and display their resource configuration.

Access Cluster Dashboards

Quickly navigate to the native monitoring dashboards for both Dask and Ray clusters through convenient links in the interface. Dask clusters link to their status page, while Ray clusters link to the Ray dashboard.

Stop Ray Clusters

Directly terminate Ray clusters from the dashboard interface. This action removes the cluster's pods, services, and associated Kubernetes resources.

Refresh Cluster Data

Manually refresh the list of active clusters to get the most up-to-date information about cluster status and resource allocation.

User Interface

Main View

The panel displays two tables of active compute clusters beneath a shared header:

  1. Compute Nodes Header: Displays "Compute Nodes" as the main title with a Refresh button
  2. Dask Clusters Section: Shows all active Dask clusters with a subtitle explaining these are "Dask clusters created as part of sessions or pipeline jobs"
  3. Ray Clusters Section: Shows all active Ray clusters with a subtitle explaining these are "Ray clusters created as part of sessions or pipeline jobs"

Both tables use server-side pagination with a default page size of 8 items per table.

Dialogs & Modals

  1. Stop Ray Cluster Dialog
    • Purpose: Confirm before terminating a Ray cluster
    • Trigger: Click the cancel icon button in the Ray Clusters table
    • Fields:
      • Cluster type identifier
      • Namespace information
    • Actions:
      • Cancel: Close the dialog without taking action
      • Confirm: Stop the Ray cluster and remove all associated resources
    • Behavior: Shows a loading indicator during the deletion process and refreshes the table after 2 seconds on success

Tables & Data Grids

  1. Dask Clusters Table

    • Columns:
      • Dashboard Link (icon): Opens the Dask dashboard in a new tab
      • Source: Origin of the cluster (session or pipeline job)
      • Owner: Username of the cluster owner
      • Cluster Name: Full cluster identifier with copy-to-clipboard functionality
      • Cores per Worker: CPU cores allocated to each worker node
      • RAM per Worker: Memory allocated to each worker node
      • # of Worker Nodes: Total number of worker nodes in the cluster
    • Actions:
      • Click dashboard icon to open Dask status page
      • Click cluster name to copy to clipboard
    • Filtering: None
    • Pagination: Server-side, 8 items per page
  2. Ray Clusters Table

    • Columns:
      • Actions (icons): Stop cluster button and dashboard link
      • Source: Origin of the cluster (session or pipeline job)
      • Owner: Username of the cluster owner
      • Name: Ray cluster type identifier
      • Cluster Name: Full cluster identifier with copy-to-clipboard functionality
      • Cores per Worker: CPU cores allocated to each worker node
      • RAM per Worker: Memory allocated to each worker node
      • # of Worker Nodes: Total number of worker nodes in the cluster
    • Actions:
      • Stop cluster (cancel icon) - Opens confirmation dialog
      • Open Ray dashboard (external link icon) - Opens dashboard in new tab
      • Click cluster name to copy to clipboard
    • Filtering: None
    • Pagination: Server-side, 8 items per page

Technical Details

API Endpoints

GET /api/distributed-workloads-dashboard/get-nodes

  • Retrieves all active Dask and Ray clusters from Kubernetes namespaces
  • Queries pods in both JupyterHub and Pipelines namespaces
  • Filters pods based on Kubernetes labels:
    • Dask: app=dask and dask.org/component=scheduler
    • Ray: pods whose type label starts with ray-worker
  • Returns cluster details including resource allocations and dashboard URLs
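
The sketch below shows how a browser client might call this endpoint. It is illustrative only: the response envelope (daskClusters / rayClusters keys) is an assumption made for this example, while the cluster shapes themselves are the DaskCluster and RayCluster interfaces listed under Data Structures.

// Illustrative only: the response envelope is assumed, not a documented contract.
// DaskCluster and RayCluster are the interfaces shown under Data Structures.
interface GetNodesResponse {
  daskClusters: DaskCluster[];
  rayClusters: RayCluster[];
}

async function fetchClusters(): Promise<GetNodesResponse> {
  const res = await fetch("/api/distributed-workloads-dashboard/get-nodes");
  if (!res.ok) {
    throw new Error(`get-nodes failed with status ${res.status}`);
  }
  return (await res.json()) as GetNodesResponse;
}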

POST /api/distributed-workloads-dashboard/ray/stop-cluster

  • Terminates a Ray cluster by deleting associated Kubernetes resources
  • Parameters: type (cluster type), namespace (Kubernetes namespace)
  • Deletes:
    • Ray worker pods
    • Associated services (-svc suffix)
    • Istio virtual services (-vs suffix)
  • Returns success/error status with a descriptive message
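
A hedged sketch of a client call follows, mirroring the dialog's behaviour of re-fetching the tables roughly 2 seconds after a successful stop. The JSON request body and the error handling are assumptions based on the parameter names documented above, not the platform's actual implementation.

// Illustrative only: request shape is assumed from the documented
// parameters (type, namespace).
async function stopRayCluster(type: string, namespace: string): Promise<void> {
  const res = await fetch("/api/distributed-workloads-dashboard/ray/stop-cluster", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ type, namespace }),
  });
  if (!res.ok) {
    throw new Error(`stop-cluster failed with status ${res.status}`);
  }
  // The confirmation dialog refreshes the cluster tables ~2 seconds after success.
  setTimeout(() => {
    void fetchClusters();   // see the get-nodes sketch above
  }, 2000);
}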

Kubernetes Integration

The dashboard interacts directly with the Kubernetes API to:

  1. Discover Clusters: Lists namespaced pods using the Kubernetes Core V1 API
  2. Extract Metadata: Reads pod labels to identify cluster ownership, source, and configuration
  3. Resource Information: Parses pod specs to determine CPU and memory limits
  4. Cleanup: Uses both Core V1 API and Custom Objects API to delete Ray cluster resources
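
The discovery step could look roughly like the sketch below, which uses the @kubernetes/client-node package; the tooling choice and namespace names are assumptions for illustration, not the platform's actual code. Method names follow recent client releases (object-style parameters); older versions take positional arguments and wrap results in a .body property.

// Illustrative discovery sketch (assumed tooling, placeholder namespaces).
import { KubeConfig, CoreV1Api } from "@kubernetes/client-node";

async function discoverPods() {
  const kc = new KubeConfig();
  kc.loadFromDefault();                        // in-cluster config or ~/.kube/config
  const core = kc.makeApiClient(CoreV1Api);

  // Dask schedulers can be selected with an equality-based label selector.
  const daskSchedulers = await core.listNamespacedPod({
    namespace: "hyperplane-jhub",              // placeholder namespace
    labelSelector: "app=dask,dask.org/component=scheduler",
  });

  // Label selectors cannot express "starts with", so Ray worker pods are
  // listed and then filtered client-side on the type label.
  const pipelinePods = await core.listNamespacedPod({
    namespace: "hyperplane-pipelines",         // placeholder namespace
  });
  const rayWorkers = pipelinePods.items.filter((pod) =>
    pod.metadata?.labels?.type?.startsWith("ray-worker"),
  );

  return { daskSchedulers: daskSchedulers.items, rayWorkers };
}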

Key Kubernetes labels used:

  • dask.org/cluster-name: Identifies Dask cluster membership
  • dask.org/component: Distinguishes scheduler from worker pods
  • type: Identifies Ray worker pods (starts with ray-worker)
  • hyperplane.dev/user: Cluster owner username
  • hyperplane.dev/source: Origin (session or pipeline)
  • hyperplane.dev/component-id: Component identifier for URL construction
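
To illustrate how these labels map onto the table rows, the hypothetical helper below classifies a pod from its labels. The helper name and the PodLike shape are invented for this example, and URL construction is omitted for brevity.

// Hypothetical helper: maps pod labels to the owner/source fields shown
// in the tables. Simplified for illustration.
interface PodLike {
  metadata?: { name?: string; namespace?: string; labels?: Record<string, string> };
}

function classifyPod(pod: PodLike) {
  const labels = pod.metadata?.labels ?? {};

  if (labels["app"] === "dask" && labels["dask.org/component"] === "scheduler") {
    return {
      kind: "dask" as const,
      clusterName: labels["dask.org/cluster-name"],
      owner: labels["hyperplane.dev/user"],
      source: labels["hyperplane.dev/source"],        // session or pipeline
      componentId: labels["hyperplane.dev/component-id"],
    };
  }

  if (labels["type"]?.startsWith("ray-worker")) {
    return {
      kind: "ray" as const,
      type: labels["type"],
      owner: labels["hyperplane.dev/user"],
      source: labels["hyperplane.dev/source"],
      namespace: pod.metadata?.namespace,
    };
  }

  return null;   // not a distributed-workload pod
}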

Component Structure

  • Main Component: /components/Panels/DistributedWorkloads.tsx
  • Table Component: /shakudo-apps/distributed-workloads-dashboard/components/Tables/ClusterNodes.tsx
  • Dialog: /shakudo-apps/distributed-workloads-dashboard/components/Dialogs/Ray/StopCluster.tsx
  • API Routes:
    • /pages/api/distributed-workloads-dashboard/get-nodes.ts
    • /pages/api/distributed-workloads-dashboard/ray/stop-cluster.ts

Data Structures

DaskCluster Interface:

interface DaskCluster {
  id: string | undefined;
  user: string | undefined;
  url: string | undefined;
  nodeCores: string | undefined;
  nodeGbRam: string | undefined;
  numNodes: number | undefined;
  clusterName: string | undefined;
  source: string | undefined;
}

RayCluster Interface:

interface RayCluster {
  id: string | undefined;
  user: string | undefined;
  url: string | undefined;
  type: string | undefined;
  nodeCores: string | undefined;
  nodeGbRam: string | undefined;
  numNodes: number | undefined;
  clusterName: string | undefined;
  source: string | undefined;
  namespace: string | undefined;
}
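
For orientation, the objects below show what entries of these shapes might look like. All values are invented for illustration and do not reflect real clusters or actual naming conventions.

// Invented example values -- for illustration only.
const exampleDask: DaskCluster = {
  id: "dask-scheduler-abc123",
  user: "jane.doe",
  url: "https://example-domain/dask/abc123/status",
  nodeCores: "2",
  nodeGbRam: "8",
  numNodes: 3,
  clusterName: "dask-jane-doe-abc123",
  source: "session",
};

const exampleRay: RayCluster = {
  id: "ray-worker-xyz789",
  user: "jane.doe",
  url: "https://example-domain/ray/xyz789/dashboard",
  type: "ray-worker-xyz789",
  nodeCores: "4",
  nodeGbRam: "16",
  numNodes: 2,
  clusterName: "ray-jane-doe-xyz789",
  source: "pipeline",
  namespace: "hyperplane-pipelines",
};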

Common Workflows

Viewing Active Clusters

  1. Navigate to Monitoring → Distributed Workloads
  2. View the list of active Dask clusters in the first table
  3. Scroll down to view the list of active Ray clusters in the second table
  4. Click the Refresh button to update the cluster information

Accessing a Cluster Dashboard

  1. Locate the cluster in either the Dask or Ray table
  2. Click the external link icon (OpenInNew) in the first column
  3. The cluster's native dashboard opens in a new browser tab

Stopping a Ray Cluster

  1. Locate the Ray cluster you want to stop in the Ray Clusters table
  2. Click the cancel icon (X) in the Actions column
  3. Review the confirmation dialog showing the cluster type and namespace
  4. Click "Confirm" to stop the cluster or "Cancel" to abort
  5. Wait for the deletion to complete (loading indicator appears)
  6. The table automatically refreshes after 2 seconds to reflect the change

Copying Cluster Names

  1. Locate the cluster in either table
  2. Click on the cluster name in the "Cluster Name" column
  3. The full cluster name is copied to your clipboard
  4. A success notification appears confirming the copy action

Related Features

  • Sessions Management - Sessions can create Dask and Ray clusters
  • Jobs - Pipeline jobs can deploy distributed compute clusters
  • Stack Components - Pre-configured data tools that may use distributed computing

Notes & Tips

  • Namespace Awareness: The dashboard monitors both JupyterHub and Pipelines namespaces, ensuring all clusters are visible regardless of where they were created
  • Resource Monitoring: Use this dashboard to monitor resource allocation across all distributed compute workloads to identify optimization opportunities
  • Dask vs Ray: Dask clusters cannot be stopped from this interface (no stop button provided), while Ray clusters can be terminated directly
  • Automatic URL Construction: Dashboard URLs are constructed automatically from cluster metadata (including the hyperplane.dev/component-id label) and environment configuration; a hypothetical sketch follows these notes
  • Label-Based Discovery: The system relies on Kubernetes labels to identify and classify clusters, so proper labeling is critical
  • Dual Namespace Support: If the JupyterHub and Pipelines namespaces are the same, the namespace is queried only once so clusters are not listed twice
  • Error Handling: If cluster retrieval fails, an error notification is displayed and the tables show empty states
  • Real-time Updates: The Refresh button queries the Kubernetes API directly, ensuring you always have current information
  • Copy-to-Clipboard: Long cluster names are truncated with ellipsis in the UI, but clicking copies the full name
  • Source Tracking: The "Source" column helps identify whether a cluster originated from a session or pipeline job
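
The function below is a purely hypothetical sketch included only to make the Automatic URL Construction note concrete; the actual URL patterns, path segments, and environment variables used by the platform are not documented here and are assumed for illustration.

// Purely hypothetical: URL patterns and environment configuration are
// assumptions, not the platform's documented behaviour.
function buildDashboardUrl(
  kind: "dask" | "ray",
  componentId: string,           // from the hyperplane.dev/component-id label
  baseDomain: string,            // from environment configuration
): string {
  return kind === "dask"
    ? `https://${baseDomain}/dask-${componentId}/status`     // Dask status page
    : `https://${baseDomain}/ray-${componentId}/dashboard`;  // Ray dashboard
}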