Distributed Workloads Dashboard

Overview

The Distributed Workloads Dashboard provides a centralized view for monitoring and managing distributed compute clusters created within the Shakudo platform. This feature displays active Dask and Ray clusters that have been deployed as part of user sessions or pipeline jobs, allowing administrators and users to view cluster details, access cluster dashboards, and manage cluster lifecycle.

Access & Location

  • Route: ?panel=distributed-workloads-dashboard
  • Navigation: Monitoring → Distributed Workloads
  • Access Requirements: None specified (accessible to authenticated users)
  • Feature Flags: None

Key Capabilities

View Active Dask Clusters

Monitor all active Dask clusters across the platform, including those created from JupyterHub sessions and pipeline jobs. View cluster specifications such as worker cores, RAM allocation, and number of worker nodes.

View Active Ray Clusters

Monitor all active Ray clusters across the platform. Similar to Dask clusters, Ray clusters can originate from sessions or pipeline jobs and display their resource configuration.

Access Cluster Dashboards

Quickly navigate to the native monitoring dashboards for both Dask and Ray clusters through convenient links in the interface. Dask clusters link to their status page, while Ray clusters link to the Ray dashboard.

Stop Ray Clusters

Directly terminate Ray clusters from the dashboard interface. This action removes the cluster's pods, services, and associated Kubernetes resources.

Refresh Cluster Data

Manually refresh the list of active clusters to get the most up-to-date information about cluster status and resource allocation.

User Interface

Main View

The panel displays two tables of active compute clusters beneath a shared header:

  1. Compute Nodes Header: Displays "Compute Nodes" as the main title with a Refresh button
  2. Dask Clusters Section: Shows all active Dask clusters with a subtitle explaining these are "Dask clusters created as part of sessions or pipeline jobs"
  3. Ray Clusters Section: Shows all active Ray clusters with a subtitle explaining these are "Ray clusters created as part of sessions or pipeline jobs"

Both tables use server-side pagination with a default page size of 8 items per table.

Dialogs & Modals

  1. Stop Ray Cluster Dialog
    • Purpose: Confirm before terminating a Ray cluster
    • Trigger: Click the cancel icon button in the Ray Clusters table
    • Fields:
      • Cluster type identifier
      • Namespace information
    • Actions:
      • Cancel: Close the dialog without taking action
      • Confirm: Stop the Ray cluster and remove all associated resources
    • Behavior: Shows a loading indicator during the deletion process and refreshes the table after 2 seconds on success

Tables & Data Grids

  1. Dask Clusters Table

    • Columns:
      • Dashboard Link (icon): Opens the Dask dashboard in a new tab
      • Source: Origin of the cluster (session or pipeline job)
      • Owner: Username of the cluster owner
      • Cluster Name: Full cluster identifier with copy-to-clipboard functionality
      • Cores per Worker: CPU cores allocated to each worker node
      • RAM per Worker: Memory allocated to each worker node
      • # of Worker Nodes: Total number of worker nodes in the cluster
    • Actions:
      • Click dashboard icon to open Dask status page
      • Click cluster name to copy to clipboard
    • Filtering: None
    • Pagination: Server-side, 8 items per page
  2. Ray Clusters Table

    • Columns:
      • Actions (icons): Stop cluster button and dashboard link
      • Source: Origin of the cluster (session or pipeline job)
      • Owner: Username of the cluster owner
      • Name: Ray cluster type identifier
      • Cluster Name: Full cluster identifier with copy-to-clipboard functionality
      • Cores per Worker: CPU cores allocated to each worker node
      • RAM per Worker: Memory allocated to each worker node
      • # of Worker Nodes: Total number of worker nodes in the cluster
    • Actions:
      • Stop cluster (cancel icon) - Opens confirmation dialog
      • Open Ray dashboard (external link icon) - Opens dashboard in new tab
      • Click cluster name to copy to clipboard
    • Filtering: None
    • Pagination: Server-side, 8 items per page

Technical Details

API Endpoints

GET /api/distributed-workloads-dashboard/get-nodes

  • Retrieves all active Dask and Ray clusters from Kubernetes namespaces
  • Queries pods in both JupyterHub and Pipelines namespaces
  • Filters pods based on Kubernetes labels:
    • Dask: app=dask and dask.org/component=scheduler
    • Ray: pods whose type label starts with ray-worker
  • Returns cluster details including resource allocations and dashboard URLs
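
The sketch below shows how a browser client might call this endpoint. It is illustrative only: the response envelope (daskClusters / rayClusters keys) is an assumption made for this example, while the cluster shapes themselves are the DaskCluster and RayCluster interfaces listed under Data Structures.

// Illustrative only: the response envelope is assumed, not a documented contract.
// DaskCluster and RayCluster are the interfaces shown under Data Structures.
interface GetNodesResponse {
  daskClusters: DaskCluster[];
  rayClusters: RayCluster[];
}

async function fetchClusters(): Promise<GetNodesResponse> {
  const res = await fetch("/api/distributed-workloads-dashboard/get-nodes");
  if (!res.ok) {
    throw new Error(`get-nodes failed with status ${res.status}`);
  }
  return (await res.json()) as GetNodesResponse;
}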

POST /api/distributed-workloads-dashboard/ray/stop-cluster

  • Terminates a Ray cluster by deleting associated Kubernetes resources
  • Parameters: type (cluster type), namespace (Kubernetes namespace)
  • Deletes:
    • Ray worker pods
    • Associated services (-svc suffix)
    • Istio virtual services (-vs suffix)
  • Returns success/error status with a descriptive message
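
A hedged sketch of a client call follows, mirroring the dialog's behaviour of re-fetching the tables roughly 2 seconds after a successful stop. The JSON request body and the error handling are assumptions based on the parameter names documented above, not the platform's actual implementation.

// Illustrative only: request shape is assumed from the documented
// parameters (type, namespace).
async function stopRayCluster(type: string, namespace: string): Promise<void> {
  const res = await fetch("/api/distributed-workloads-dashboard/ray/stop-cluster", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ type, namespace }),
  });
  if (!res.ok) {
    throw new Error(`stop-cluster failed with status ${res.status}`);
  }
  // The confirmation dialog refreshes the cluster tables ~2 seconds after success.
  setTimeout(() => {
    void fetchClusters();   // see the get-nodes sketch above
  }, 2000);
}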

Kubernetes Integration

The dashboard interacts directly with the Kubernetes API to:

  1. Discover Clusters: Lists namespaced pods using the Kubernetes Core V1 API
  2. Extract Metadata: Reads pod labels to identify cluster ownership, source, and configuration
  3. Resource Information: Parses pod specs to determine CPU and memory limits
  4. Cleanup: Uses both Core V1 API and Custom Objects API to delete Ray cluster resources
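
The discovery step could look roughly like the sketch below, which uses the @kubernetes/client-node package; the tooling choice and namespace names are assumptions for illustration, not the platform's actual code. Method names follow recent client releases (object-style parameters); older versions take positional arguments and wrap results in a .body property.

// Illustrative discovery sketch (assumed tooling, placeholder namespaces).
import { KubeConfig, CoreV1Api } from "@kubernetes/client-node";

async function discoverPods() {
  const kc = new KubeConfig();
  kc.loadFromDefault();                        // in-cluster config or ~/.kube/config
  const core = kc.makeApiClient(CoreV1Api);

  // Dask schedulers can be selected with an equality-based label selector.
  const daskSchedulers = await core.listNamespacedPod({
    namespace: "hyperplane-jhub",              // placeholder namespace
    labelSelector: "app=dask,dask.org/component=scheduler",
  });

  // Label selectors cannot express "starts with", so Ray worker pods are
  // listed and then filtered client-side on the type label.
  const pipelinePods = await core.listNamespacedPod({
    namespace: "hyperplane-pipelines",         // placeholder namespace
  });
  const rayWorkers = pipelinePods.items.filter((pod) =>
    pod.metadata?.labels?.type?.startsWith("ray-worker"),
  );

  return { daskSchedulers: daskSchedulers.items, rayWorkers };
}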

Key Kubernetes labels used:

  • dask.org/cluster-name: Identifies Dask cluster membership
  • dask.org/component: Distinguishes scheduler from worker pods
  • type: Identifies Ray worker pods (starts with ray-worker)
  • hyperplane.dev/user: Cluster owner username
  • hyperplane.dev/source: Origin (session or pipeline)
  • hyperplane.dev/component-id: Component identifier for URL construction
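
To illustrate how these labels map onto the table rows, the hypothetical helper below classifies a pod from its labels. The helper name and the PodLike shape are invented for this example, and URL construction is omitted for brevity.

// Hypothetical helper: maps pod labels to the owner/source fields shown
// in the tables. Simplified for illustration.
interface PodLike {
  metadata?: { name?: string; namespace?: string; labels?: Record<string, string> };
}

function classifyPod(pod: PodLike) {
  const labels = pod.metadata?.labels ?? {};

  if (labels["app"] === "dask" && labels["dask.org/component"] === "scheduler") {
    return {
      kind: "dask" as const,
      clusterName: labels["dask.org/cluster-name"],
      owner: labels["hyperplane.dev/user"],
      source: labels["hyperplane.dev/source"],        // session or pipeline
      componentId: labels["hyperplane.dev/component-id"],
    };
  }

  if (labels["type"]?.startsWith("ray-worker")) {
    return {
      kind: "ray" as const,
      type: labels["type"],
      owner: labels["hyperplane.dev/user"],
      source: labels["hyperplane.dev/source"],
      namespace: pod.metadata?.namespace,
    };
  }

  return null;   // not a distributed-workload pod
}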

Component Structure

  • Main Component: /components/Panels/DistributedWorkloads.tsx
  • Table Component: /shakudo-apps/distributed-workloads-dashboard/components/Tables/ClusterNodes.tsx
  • Dialog: /shakudo-apps/distributed-workloads-dashboard/components/Dialogs/Ray/StopCluster.tsx
  • API Routes:
    • /pages/api/distributed-workloads-dashboard/get-nodes.ts
    • /pages/api/distributed-workloads-dashboard/ray/stop-cluster.ts

Data Structures

DaskCluster Interface:

interface DaskCluster {
  id: string | undefined;
  user: string | undefined;
  url: string | undefined;
  nodeCores: string | undefined;
  nodeGbRam: string | undefined;
  numNodes: number | undefined;
  clusterName: string | undefined;
  source: string | undefined;
}

RayCluster Interface:

interface RayCluster {
  id: string | undefined;
  user: string | undefined;
  url: string | undefined;
  type: string | undefined;
  nodeCores: string | undefined;
  nodeGbRam: string | undefined;
  numNodes: number | undefined;
  clusterName: string | undefined;
  source: string | undefined;
  namespace: string | undefined;
}
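
For orientation, the objects below show what entries of these shapes might look like. All values are invented for illustration and do not reflect real clusters or actual naming conventions.

// Invented example values -- for illustration only.
const exampleDask: DaskCluster = {
  id: "dask-scheduler-abc123",
  user: "jane.doe",
  url: "https://example-domain/dask/abc123/status",
  nodeCores: "2",
  nodeGbRam: "8",
  numNodes: 3,
  clusterName: "dask-jane-doe-abc123",
  source: "session",
};

const exampleRay: RayCluster = {
  id: "ray-worker-xyz789",
  user: "jane.doe",
  url: "https://example-domain/ray/xyz789/dashboard",
  type: "ray-worker-xyz789",
  nodeCores: "4",
  nodeGbRam: "16",
  numNodes: 2,
  clusterName: "ray-jane-doe-xyz789",
  source: "pipeline",
  namespace: "hyperplane-pipelines",
};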

Common Workflows

Viewing Active Clusters

  1. Navigate to Monitoring → Distributed Workloads
  2. View the list of active Dask clusters in the first table
  3. Scroll down to view the list of active Ray clusters in the second table
  4. Click the Refresh button to update the cluster information

Accessing a Cluster Dashboard

  1. Locate the cluster in either the Dask or Ray table
  2. Click the external link icon (OpenInNew) in the first column
  3. The cluster's native dashboard opens in a new browser tab

Stopping a Ray Cluster

  1. Locate the Ray cluster you want to stop in the Ray Clusters table
  2. Click the cancel icon (X) in the Actions column
  3. Review the confirmation dialog showing the cluster type and namespace
  4. Click "Confirm" to stop the cluster or "Cancel" to abort
  5. Wait for the deletion to complete (loading indicator appears)
  6. The table automatically refreshes after 2 seconds to reflect the change

Copying Cluster Names

  1. Locate the cluster in either table
  2. Click on the cluster name in the "Cluster Name" column
  3. The full cluster name is copied to your clipboard
  4. A success notification appears confirming the copy action

Related Features

  • Sessions Management - Sessions can create Dask and Ray clusters
  • Jobs - Pipeline jobs can deploy distributed compute clusters
  • Stack Components - Pre-configured data tools that may use distributed computing

Notes & Tips

  • Namespace Awareness: The dashboard monitors both JupyterHub and Pipelines namespaces, ensuring all clusters are visible regardless of where they were created
  • Resource Monitoring: Use this dashboard to monitor resource allocation across all distributed compute workloads to identify optimization opportunities
  • Dask vs Ray: Dask clusters cannot be stopped from this interface (no stop button provided), while Ray clusters can be terminated directly
  • Automatic URL Construction: Dashboard URLs are constructed automatically from cluster metadata (including the hyperplane.dev/component-id label) and environment configuration; a hypothetical sketch follows these notes
  • Label-Based Discovery: The system relies on Kubernetes labels to identify and classify clusters, so proper labeling is critical
  • Dual Namespace Support: If the JupyterHub and Pipelines namespaces are the same, the namespace is queried only once so clusters are not listed twice
  • Error Handling: If cluster retrieval fails, an error notification is displayed and the tables show empty states
  • Real-time Updates: The Refresh button queries the Kubernetes API directly, ensuring you always have current information
  • Copy-to-Clipboard: Long cluster names are truncated with ellipsis in the UI, but clicking copies the full name
  • Source Tracking: The "Source" column helps identify whether a cluster originated from a session or pipeline job
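
The function below is a purely hypothetical sketch included only to make the Automatic URL Construction note concrete; the actual URL patterns, path segments, and environment variables used by the platform are not documented here and are assumed for illustration.

// Purely hypothetical: URL patterns and environment configuration are
// assumptions, not the platform's documented behaviour.
function buildDashboardUrl(
  kind: "dask" | "ray",
  componentId: string,           // from the hyperplane.dev/component-id label
  baseDomain: string,            // from environment configuration
): string {
  return kind === "dask"
    ? `https://${baseDomain}/dask-${componentId}/status`     // Dask status page
    : `https://${baseDomain}/ray-${componentId}/dashboard`;  // Ray dashboard
}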