
Datalake (Shakudo Data Lakehouse)

Overview

The Datalake panel provides a comprehensive interface for managing and exploring the Shakudo Data Lakehouse, an Apache Iceberg-based lakehouse built on object storage. This feature enables users to browse S3 buckets, explore Iceberg catalogs and tables, query sample data, and interact with data through an integrated PySpark terminal. The lakehouse combines the flexibility of data lakes with the structure and ACID guarantees of data warehouses, using Apache Iceberg as the table format and supporting both Nessie and Postgres as catalog backends.

Access & Location

  • Route: ?panel=datalake
  • Navigation: Admin → Datalake
  • Access Requirements: None (feature-flag gated)
  • Feature Flags: datalakeEnabled

Key Capabilities

Browse Object Storage

Navigate through the S3-compatible object storage bucket that backs the data lakehouse. View folders and files, and see metadata such as size, object count, and last-modified timestamps. The file browser provides a hierarchical view of the storage structure.

Explore Iceberg Catalogs

View and interact with Apache Iceberg catalogs that organize tables into logical namespaces. Each catalog contains multiple tables with metadata about schema, partitions, and data files. Catalogs are grouped by name and display statistics about total tables and storage size.
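
The same hierarchy can also be explored programmatically from a configured PySpark session. The snippet below is a minimal sketch: the catalog name lakehouse, the namespace analytics, and the table events are placeholders, not values from your deployment.

```python
# Sketch: listing catalog contents from PySpark. Assumes a SparkSession
# ("spark") already configured for the lakehouse; all names are placeholders.
spark.sql("SHOW NAMESPACES IN lakehouse").show()
spark.sql("SHOW TABLES IN lakehouse.analytics").show()

# Iceberg also exposes per-table metadata (schema, partitions, data files)
# through metadata tables, e.g. the data-file listing:
spark.sql(
    "SELECT file_path, record_count FROM lakehouse.analytics.events.files"
).show(truncate=False)
```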

Query Sample Data

Preview data from Parquet files and Iceberg tables directly in the browser. Sample data can be viewed in both table format and JSON format, with configurable sample sizes (10, 20, 50, 100, 250, or 500 rows). This allows quick data validation without running full queries.
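
The same kind of preview can be reproduced in PySpark by limiting the number of rows read. A minimal sketch, assuming a configured SparkSession and placeholder table and file names:

```python
# Preview an Iceberg table through the catalog (placeholder names):
spark.table("lakehouse.analytics.events").limit(50).show(truncate=False)

# Preview a standalone Parquet file by its object-storage path
# (bucket name and key are placeholders; Spark typically uses the s3a:// scheme):
spark.read.parquet("s3a://my-datalake-bucket/raw/events.parquet").limit(50).show()
```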

Integrated PySpark Terminal

Access an interactive PySpark shell pre-configured with the lakehouse connection. Execute Spark SQL queries, perform data transformations, and interact with Iceberg tables programmatically. The terminal can be restarted to clear the session state.
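
A short example of the kind of session this enables (a sketch only; the spark object is provided by the shell, and the table and column names are illustrative):

```python
# Run a Spark SQL query against an Iceberg table (placeholder names):
df = spark.sql("""
    SELECT country, COUNT(*) AS orders
    FROM lakehouse.sales.orders
    WHERE order_date >= '2024-01-01'
    GROUP BY country
    ORDER BY orders DESC
""")
df.show(20)

# DataFrame transformations work as usual; writes require appropriate permissions:
df.filter("orders > 1000").write.mode("overwrite").saveAsTable("lakehouse.sales.top_markets")
```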

Configuration Management

Access setup guides for configuring PySpark sessions and Nessie/Dremio integrations. These guides provide copy-paste ready code snippets with bucket names and connection strings pre-filled.
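
As an illustration of what such a setup looks like, the sketch below configures a SparkSession for an Iceberg catalog backed by Postgres (JDBC catalog) and S3-compatible storage. All host names, bucket names, and identifiers are placeholders; use the pre-filled snippet from the Spark Session Setup dialog, which also covers the required packages and credentials omitted here.

```python
from pyspark.sql import SparkSession

# Placeholder values throughout -- substitute the pre-filled snippet from the
# Spark Session Setup dialog for your actual deployment. JDBC credentials and
# the Iceberg runtime package are intentionally omitted from this sketch.
spark = (
    SparkSession.builder.appName("lakehouse-example")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lakehouse", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lakehouse.catalog-impl",
            "org.apache.iceberg.jdbc.JdbcCatalog")
    .config("spark.sql.catalog.lakehouse.uri",
            "jdbc:postgresql://postgres-host:5432/iceberg_catalog")
    .config("spark.sql.catalog.lakehouse.warehouse",
            "s3a://my-datalake-bucket/warehouse")
    .config("spark.sql.catalog.lakehouse.io-impl",
            "org.apache.iceberg.aws.s3.S3FileIO")
    .getOrCreate()
)

spark.sql("SHOW NAMESPACES IN lakehouse").show()
```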

Bytebase SQL Sandbox

For systems with Bytebase integration, users can open a SQL sandbox environment to run queries against the lakehouse using a web-based SQL editor.

User Interface

Main View

The main panel displays:

  • Header: Panel title "Shakudo Data Lakehouse" with an "Advanced" menu for accessing configuration dialogs
  • Bucket Details Card: Shows bucket name, last modified date, total objects, total size, and Postgres catalog connection status
  • Navigation: Breadcrumb-style path navigation showing current location in the folder hierarchy
  • Content Area: Either displays the catalog list view or the file/folder browser depending on navigation state

Dialogs & Modals

  1. Interactive Shell Dialog

    • Purpose: Provides a fullscreen PySpark terminal for interactive data queries
    • Features: Embedded iframe terminal, session restart button, clipboard access for copy/paste
    • Actions: Execute Spark commands, restart session, close dialog
  2. Spark Session Setup Dialog

    • Purpose: Displays markdown documentation for configuring PySpark sessions
    • Fields: Pre-filled code examples with bucket name and Postgres URL
    • Actions: Copy code snippets, close dialog
  3. Nessie Dremio Setup Dialog

    • Purpose: Displays markdown documentation for Nessie catalog configuration
    • Fields: Setup instructions and configuration examples
    • Actions: Copy code snippets, close dialog
  4. Sample Data Viewer Dialog

    • Purpose: Display sample rows from Parquet files or Iceberg tables
    • Fields: Data table view, JSON view toggle, sample size selector
    • Actions: Switch between table/JSON views, copy JSON to clipboard, close dialog

Tables & Data Grids

  1. Iceberg Catalogs Grid

    • Display: Card-based grid layout grouped by catalog name
    • Columns: Table name, catalog name, namespace, object count, size
    • Actions: Click to explore catalog contents, view sample data, copy path, open in Bytebase
    • Features: Expandable accordions for each catalog group
  2. Files and Folders Table

    • Display: Striped table with alternating row colors
    • Columns: Name (with folder/file icons), Last Modified, Objects (for folders), Size, Actions
    • Actions: Click folders to navigate, view sample data, copy path, copy code examples, download files
    • Features: Pagination (5/10/25/50 rows per page), hover highlighting for folders
    • Filtering: None (shows all contents of current path)

Technical Details

GraphQL Operations

Queries:

  • isDatalakeAvailable - Checks if the datalake feature is configured and available
  • getDatalakeBucketDetails - Retrieves bucket metadata including name, size, file count, and last modified
  • getPostgresFullUrl - Gets the Postgres connection URL for the Iceberg catalog
  • getIcebergCatalogs - Lists all Iceberg catalogs with their tables and metadata locations
  • getDatalakeBucketFilesAndFolders - Lists files and folders at a given S3 prefix
  • getDatalakeObjectDetails - Gets detailed metadata for a specific S3 object or folder
  • isIcebergTable - Checks if a given path is an Iceberg table
  • getSampleData - Retrieves sample rows from a Parquet file or Iceberg table

Mutations: None - this is a read-only interface

Subscriptions: None
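
For illustration, a query from the list above could be issued directly against the GraphQL endpoint. The sketch below is hypothetical: the endpoint path, authentication header, and selected fields are assumptions and should be checked against your deployment's schema; only the operation name comes from the list above.

```python
import requests

# Hypothetical GraphQL call -- endpoint, auth header, and field selection are
# assumptions; getDatalakeBucketDetails is the operation name from the list above.
query = """
query {
  getDatalakeBucketDetails {
    name
    totalObjects
    totalSize
    lastModified
  }
}
"""
resp = requests.post(
    "https://<your-shakudo-host>/api/graphql",    # hypothetical endpoint
    json={"query": query},
    headers={"Authorization": "Bearer <token>"},  # hypothetical auth
    timeout=30,
)
print(resp.json())
```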

Component Structure

  • Main Component: components/Datalake/Panel.tsx
  • Core Component: components/Datalake/Datalake.tsx
  • Bucket Details: components/Datalake/DatalakeBucketDetails.tsx
  • Catalogs View: components/Datalake/Catalogs.tsx
  • File Browser: components/Datalake/DatalakePathFolderAndFiles.tsx
  • Catalog Cards: components/Datalake/CatalogCard2.tsx
  • Dialogs:
    • components/Datalake/DatalakeIntegratedTerminalDialog.tsx
    • components/Datalake/PySparkConfigDialog.tsx
    • components/Datalake/NessieConfigDialog.tsx
    • components/Datalake/IcebergTableSampleData.tsx
  • Actions: components/Datalake/DataLakeTableObjectActions.tsx

State Management

The component uses Jotai atoms for state management:

  • BucketNameAtom - Stores the current bucket name
  • SubpathAtom - Tracks the current navigation path in the bucket
  • DataLakePostgresFullUrlAtom - Stores the Postgres catalog connection URL
  • LoadingFilesAndFolderAtom - Tracks loading state for file browser
  • DataSampleSizeAtom - Controls the number of rows to fetch when sampling data
  • DataLakeCatalogSelected - Stores the currently selected catalog for exploration

Common Workflows

Browse Data Lakehouse Contents

  1. Navigate to Admin → Datalake panel
  2. View the main bucket details card showing overall statistics
  3. Click on an Iceberg catalog in the catalogs list to explore its contents
  4. Navigate through folders by clicking on folder rows
  5. Use the breadcrumb navigation to return to parent folders or catalogs list
  6. Click the "Refresh" button to update bucket statistics

Query Sample Data from a Table

  1. Browse to a catalog or folder containing data
  2. Locate the table or Parquet file you want to preview
  3. Click the eye icon in the Actions column
  4. Wait for data to load (respects the sample size setting)
  5. Toggle between "Table" and "JSON" views using the toggle buttons
  6. Optionally copy the JSON data to clipboard
  7. Close the dialog when finished

Run Interactive Spark Queries

  1. Click the "Advanced" button in the panel header
  2. Select "Interactive Shell" from the dropdown menu
  3. Wait for the PySpark terminal to load (rendered in an embedded iframe)
  4. Execute Spark SQL or PySpark commands in the terminal
  5. Use the restart button if you need to reset the session
  6. Close the dialog when finished

Set Up PySpark Connection

  1. Click the "Advanced" button in the panel header
  2. Select "Spark Session Setup" from the dropdown
  3. Review the configuration documentation with pre-filled values
  4. Copy the code snippets to use in your own notebooks or scripts
  5. Close the dialog when finished

Access Bytebase SQL Sandbox

  1. Ensure Bytebase is configured (the sandbox button is visible only if the datalakeBytebase platform parameter is set)
  2. Click the "Try it in Sandbox" button in the catalogs view
  3. Bytebase opens in a new browser tab
  4. Run SQL queries against the lakehouse catalog

Related Features

  • Sessions - For running Jupyter notebooks that connect to the datalake
  • Jobs - For scheduling data pipeline jobs that read/write to the lakehouse
  • Stack Components - For deploying complementary tools like Dremio or Trino

Notes & Tips

Performance Considerations

  • Large folders may take time to load - the interface fetches all items at once before displaying them
  • Sample data queries are limited to configured sample sizes to prevent slow queries
  • The integrated terminal maintains session state, so restart if you encounter memory issues

Data Organization Best Practices

  • Iceberg tables are organized in a hierarchy: catalog → namespace → table
  • Each table's data files are stored in a data/ subdirectory
  • Metadata files are stored in a metadata/ subdirectory
  • Use consistent naming conventions for catalogs and namespaces
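
A sketch of what this hierarchy looks like when creating a table from Spark SQL (assumes a configured SparkSession; all names are placeholders):

```python
# catalog -> namespace -> table, using placeholder names. The table's files
# land under its warehouse location in data/ and metadata/ subdirectories.
spark.sql("CREATE NAMESPACE IF NOT EXISTS lakehouse.analytics")
spark.sql("""
    CREATE TABLE IF NOT EXISTS lakehouse.analytics.events (
        event_id   BIGINT,
        event_time TIMESTAMP,
        payload    STRING
    )
    USING iceberg
    PARTITIONED BY (days(event_time))
""")
```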

Path Formats

  • S3 paths are displayed in the format: s3://bucket-name/path/to/object
  • Internal paths used for navigation exclude the s3:// prefix and bucket name
  • Copy path buttons provide the full S3 URL for use in external tools
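
A small helper sketch for converting between the two forms (the bucket name is a placeholder; requires Python 3.9+ for removeprefix):

```python
BUCKET = "my-datalake-bucket"  # placeholder -- use your bucket name

def to_s3_url(internal_path: str) -> str:
    """Internal navigation path (no scheme, no bucket) -> full S3 URL."""
    return f"s3://{BUCKET}/{internal_path.lstrip('/')}"

def to_internal_path(s3_url: str) -> str:
    """Full S3 URL -> internal path used for navigation."""
    return s3_url.removeprefix(f"s3://{BUCKET}/")

print(to_s3_url("warehouse/analytics/events"))
print(to_internal_path("s3://my-datalake-bucket/warehouse/analytics/events"))
```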

Iceberg Table Detection

  • The system automatically detects whether a path contains an Iceberg table
  • Iceberg tables have special handling for sample data (reads from data/ subdirectory)
  • Non-Iceberg Parquet files are read directly
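
If you need a similar check in your own scripts, one simple approach (not necessarily how the panel implements detection) is to look for the metadata/ subdirectory that every Iceberg table carries. A sketch using boto3 with placeholder bucket and prefix values:

```python
import boto3

def looks_like_iceberg_table(bucket: str, prefix: str) -> bool:
    """Heuristic: an Iceberg table directory contains a metadata/ subdirectory."""
    # For S3-compatible storage you may also need endpoint_url and credentials.
    s3 = boto3.client("s3")
    resp = s3.list_objects_v2(
        Bucket=bucket,
        Prefix=f"{prefix.rstrip('/')}/metadata/",
        MaxKeys=1,
    )
    return resp.get("KeyCount", 0) > 0

print(looks_like_iceberg_table("my-datalake-bucket", "warehouse/analytics/events"))
```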

Connection Information

  • The Postgres connection URL shown in the bucket details card is masked for security
  • Hover over the connection chip to reveal the masked URL
  • Click the chip to copy the full unmasked URL to clipboard

Availability

  • If the datalake is not configured, an informational message appears
  • The feature requires backend services to be running and properly configured
  • Check with your platform administrator if the datalake appears unavailable