
Datalake (Shakudo Data Lakehouse)

Overview

The Datalake panel provides a comprehensive interface for managing and exploring the Shakudo Data Lakehouse, an Apache Iceberg-based lakehouse built on object storage. This feature enables users to browse S3 buckets, explore Iceberg catalogs and tables, query sample data, and interact with data through an integrated PySpark terminal. The lakehouse combines the flexibility of data lakes with the structure and ACID guarantees of data warehouses, using Apache Iceberg as the table format and supporting both Nessie and Postgres as catalog backends.

Access & Location

  • Route: ?panel=datalake
  • Navigation: Admin → Datalake
  • Access Requirements: None (feature-flag gated)
  • Feature Flags: datalakeEnabled

Key Capabilities

Browse Object Storage

Navigate through the S3-compatible object storage bucket that backs the data lakehouse. View folders and files, and see metadata such as size, object count, and last-modified timestamps. The file browser provides a hierarchical view of the storage structure.

Explore Iceberg Catalogs

View and interact with Apache Iceberg catalogs that organize tables into logical namespaces. Each catalog contains multiple tables with metadata about schema, partitions, and data files. Catalogs are grouped by name and display statistics about total tables and storage size.
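
The same hierarchy can also be explored programmatically from a configured PySpark session. The snippet below is a minimal sketch: the catalog name lakehouse, the namespace analytics, and the table events are placeholders, not values from your deployment.

```python
# Sketch: listing catalog contents from PySpark. Assumes a SparkSession
# ("spark") already configured for the lakehouse; all names are placeholders.
spark.sql("SHOW NAMESPACES IN lakehouse").show()
spark.sql("SHOW TABLES IN lakehouse.analytics").show()

# Iceberg also exposes per-table metadata (schema, partitions, data files)
# through metadata tables, e.g. the data-file listing:
spark.sql(
    "SELECT file_path, record_count FROM lakehouse.analytics.events.files"
).show(truncate=False)
```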

Query Sample Data

Preview data from Parquet files and Iceberg tables directly in the browser. Sample data can be viewed in both table format and JSON format, with configurable sample sizes (10, 20, 50, 100, 250, or 500 rows). This allows quick data validation without running full queries.
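
The same kind of preview can be reproduced in PySpark by limiting the number of rows read. A minimal sketch, assuming a configured SparkSession and placeholder table and file names:

```python
# Preview an Iceberg table through the catalog (placeholder names):
spark.table("lakehouse.analytics.events").limit(50).show(truncate=False)

# Preview a standalone Parquet file by its object-storage path
# (bucket name and key are placeholders; Spark typically uses the s3a:// scheme):
spark.read.parquet("s3a://my-datalake-bucket/raw/events.parquet").limit(50).show()
```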

Integrated PySpark Terminal

Access an interactive PySpark shell pre-configured with the lakehouse connection. Execute Spark SQL queries, perform data transformations, and interact with Iceberg tables programmatically. The terminal can be restarted to clear the session state.
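
A short example of the kind of session this enables (a sketch only; the spark object is provided by the shell, and the table and column names are illustrative):

```python
# Run a Spark SQL query against an Iceberg table (placeholder names):
df = spark.sql("""
    SELECT country, COUNT(*) AS orders
    FROM lakehouse.sales.orders
    WHERE order_date >= '2024-01-01'
    GROUP BY country
    ORDER BY orders DESC
""")
df.show(20)

# DataFrame transformations work as usual; writes require appropriate permissions:
df.filter("orders > 1000").write.mode("overwrite").saveAsTable("lakehouse.sales.top_markets")
```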

Configuration Management

Access setup guides for configuring PySpark sessions and Nessie/Dremio integrations. These guides provide copy-paste ready code snippets with bucket names and connection strings pre-filled.
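
As an illustration of what such a setup looks like, the sketch below configures a SparkSession for an Iceberg catalog backed by Postgres (JDBC catalog) and S3-compatible storage. All host names, bucket names, and identifiers are placeholders; use the pre-filled snippet from the Spark Session Setup dialog, which also covers the required packages and credentials omitted here.

```python
from pyspark.sql import SparkSession

# Placeholder values throughout -- substitute the pre-filled snippet from the
# Spark Session Setup dialog for your actual deployment. JDBC credentials and
# the Iceberg runtime package are intentionally omitted from this sketch.
spark = (
    SparkSession.builder.appName("lakehouse-example")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lakehouse", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lakehouse.catalog-impl",
            "org.apache.iceberg.jdbc.JdbcCatalog")
    .config("spark.sql.catalog.lakehouse.uri",
            "jdbc:postgresql://postgres-host:5432/iceberg_catalog")
    .config("spark.sql.catalog.lakehouse.warehouse",
            "s3a://my-datalake-bucket/warehouse")
    .config("spark.sql.catalog.lakehouse.io-impl",
            "org.apache.iceberg.aws.s3.S3FileIO")
    .getOrCreate()
)

spark.sql("SHOW NAMESPACES IN lakehouse").show()
```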

Bytebase SQL Sandbox

For systems with Bytebase integration, users can open a SQL sandbox environment to run queries against the lakehouse using a web-based SQL editor.

User Interface

Main View

The main panel displays:

  • Header: Panel title "Shakudo Data Lakehouse" with an "Advanced" menu for accessing configuration dialogs
  • Bucket Details Card: Shows bucket name, last modified date, total objects, total size, and Postgres catalog connection status
  • Navigation: Breadcrumb-style path navigation showing current location in the folder hierarchy
  • Content Area: Either displays the catalog list view or the file/folder browser depending on navigation state

Dialogs & Modals

  1. Interactive Shell Dialog

    • Purpose: Provides a fullscreen PySpark terminal for interactive data queries
    • Features: Embedded iframe terminal, session restart button, clipboard access for copy/paste
    • Actions: Execute Spark commands, restart session, close dialog
  2. Spark Session Setup Dialog

    • Purpose: Displays markdown documentation for configuring PySpark sessions
    • Fields: Pre-filled code examples with bucket name and Postgres URL
    • Actions: Copy code snippets, close dialog
  3. Nessie Dremio Setup Dialog

    • Purpose: Displays markdown documentation for Nessie catalog configuration
    • Fields: Setup instructions and configuration examples
    • Actions: Copy code snippets, close dialog
  4. Sample Data Viewer Dialog

    • Purpose: Display sample rows from Parquet files or Iceberg tables
    • Fields: Data table view, JSON view toggle, sample size selector
    • Actions: Switch between table/JSON views, copy JSON to clipboard, close dialog

Tables & Data Grids

  1. Iceberg Catalogs Grid

    • Display: Card-based grid layout grouped by catalog name
    • Columns: Table name, catalog name, namespace, object count, size
    • Actions: Click to explore catalog contents, view sample data, copy path, open in Bytebase
    • Features: Expandable accordions for each catalog group
  2. Files and Folders Table

    • Display: Striped table with alternating row colors
    • Columns: Name (with folder/file icons), Last Modified, Objects (for folders), Size, Actions
    • Actions: Click folders to navigate, view sample data, copy path, copy code examples, download files
    • Features: Pagination (5/10/25/50 rows per page), hover highlighting for folders
    • Filtering: None (shows all contents of current path)

Technical Details

GraphQL Operations

Queries:

  • isDatalakeAvailable - Checks if the datalake feature is configured and available
  • getDatalakeBucketDetails - Retrieves bucket metadata including name, size, file count, and last modified
  • getPostgresFullUrl - Gets the Postgres connection URL for the Iceberg catalog
  • getIcebergCatalogs - Lists all Iceberg catalogs with their tables and metadata locations
  • getDatalakeBucketFilesAndFolders - Lists files and folders at a given S3 prefix
  • getDatalakeObjectDetails - Gets detailed metadata for a specific S3 object or folder
  • isIcebergTable - Checks if a given path is an Iceberg table
  • getSampleData - Retrieves sample rows from a Parquet file or Iceberg table

Mutations: None - this is a read-only interface

Subscriptions: None
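
For illustration, a query from the list above could be issued directly against the GraphQL endpoint. The sketch below is hypothetical: the endpoint path, authentication header, and selected fields are assumptions and should be checked against your deployment's schema; only the operation name comes from the list above.

```python
import requests

# Hypothetical GraphQL call -- endpoint, auth header, and field selection are
# assumptions; getDatalakeBucketDetails is the operation name from the list above.
query = """
query {
  getDatalakeBucketDetails {
    name
    totalObjects
    totalSize
    lastModified
  }
}
"""
resp = requests.post(
    "https://<your-shakudo-host>/api/graphql",    # hypothetical endpoint
    json={"query": query},
    headers={"Authorization": "Bearer <token>"},  # hypothetical auth
    timeout=30,
)
print(resp.json())
```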

Component Structure

  • Main Component: components/Datalake/Panel.tsx
  • Core Component: components/Datalake/Datalake.tsx
  • Bucket Details: components/Datalake/DatalakeBucketDetails.tsx
  • Catalogs View: components/Datalake/Catalogs.tsx
  • File Browser: components/Datalake/DatalakePathFolderAndFiles.tsx
  • Catalog Cards: components/Datalake/CatalogCard2.tsx
  • Dialogs:
    • components/Datalake/DatalakeIntegratedTerminalDialog.tsx
    • components/Datalake/PySparkConfigDialog.tsx
    • components/Datalake/NessieConfigDialog.tsx
    • components/Datalake/IcebergTableSampleData.tsx
  • Actions: components/Datalake/DataLakeTableObjectActions.tsx

State Management

The component uses Jotai atoms for state management:

  • BucketNameAtom - Stores the current bucket name
  • SubpathAtom - Tracks the current navigation path in the bucket
  • DataLakePostgresFullUrlAtom - Stores the Postgres catalog connection URL
  • LoadingFilesAndFolderAtom - Tracks loading state for file browser
  • DataSampleSizeAtom - Controls the number of rows to fetch when sampling data
  • DataLakeCatalogSelected - Stores the currently selected catalog for exploration

Common Workflows

Browse Data Lakehouse Contents

  1. Navigate to Admin → Datalake panel
  2. View the main bucket details card showing overall statistics
  3. Click on an Iceberg catalog in the catalogs list to explore its contents
  4. Navigate through folders by clicking on folder rows
  5. Use the breadcrumb navigation to return to parent folders or catalogs list
  6. Click the "Refresh" button to update bucket statistics

Query Sample Data from a Table

  1. Browse to a catalog or folder containing data
  2. Locate the table or Parquet file you want to preview
  3. Click the eye icon in the Actions column
  4. Wait for data to load (respects the sample size setting)
  5. Toggle between "Table" and "JSON" views using the toggle buttons
  6. Optionally copy the JSON data to clipboard
  7. Close the dialog when finished

Run Interactive Spark Queries

  1. Click the "Advanced" button in the panel header
  2. Select "Interactive Shell" from the dropdown menu
  3. Wait for the PySpark terminal to load (rendered in an embedded iframe)
  4. Execute Spark SQL or PySpark commands in the terminal
  5. Use the restart button if you need to reset the session
  6. Close the dialog when finished

Set Up PySpark Connection

  1. Click the "Advanced" button in the panel header
  2. Select "Spark Session Setup" from the dropdown
  3. Review the configuration documentation with pre-filled values
  4. Copy the code snippets to use in your own notebooks or scripts
  5. Close the dialog when finished

Access Bytebase SQL Sandbox

  1. Ensure Bytebase is configured (the sandbox button is visible only if the datalakeBytebase platform parameter is set)
  2. Click the "Try it in Sandbox" button in the catalogs view
  3. Bytebase opens in a new browser tab
  4. Run SQL queries against the lakehouse catalog

Related Features

  • Sessions - For running Jupyter notebooks that connect to the datalake
  • Jobs - For scheduling data pipeline jobs that read/write to the lakehouse
  • Stack Components - For deploying complementary tools like Dremio or Trino

Notes & Tips

Performance Considerations

  • Large folders may take time to load - the interface fetches all items at once before displaying them
  • Sample data queries are limited to configured sample sizes to prevent slow queries
  • The integrated terminal maintains session state, so restart if you encounter memory issues

Data Organization Best Practices

  • Iceberg tables are organized in a hierarchy: catalog → namespace → table
  • Each table's data files are stored in a data/ subdirectory
  • Metadata files are stored in a metadata/ subdirectory
  • Use consistent naming conventions for catalogs and namespaces
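
A sketch of what this hierarchy looks like when creating a table from Spark SQL (assumes a configured SparkSession; all names are placeholders):

```python
# catalog -> namespace -> table, using placeholder names. The table's files
# land under its warehouse location in data/ and metadata/ subdirectories.
spark.sql("CREATE NAMESPACE IF NOT EXISTS lakehouse.analytics")
spark.sql("""
    CREATE TABLE IF NOT EXISTS lakehouse.analytics.events (
        event_id   BIGINT,
        event_time TIMESTAMP,
        payload    STRING
    )
    USING iceberg
    PARTITIONED BY (days(event_time))
""")
```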

Path Formats

  • S3 paths are displayed in the format: s3://bucket-name/path/to/object
  • Internal paths used for navigation exclude the s3:// prefix and bucket name
  • Copy path buttons provide the full S3 URL for use in external tools
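
A small helper sketch for converting between the two forms (the bucket name is a placeholder; requires Python 3.9+ for removeprefix):

```python
BUCKET = "my-datalake-bucket"  # placeholder -- use your bucket name

def to_s3_url(internal_path: str) -> str:
    """Internal navigation path (no scheme, no bucket) -> full S3 URL."""
    return f"s3://{BUCKET}/{internal_path.lstrip('/')}"

def to_internal_path(s3_url: str) -> str:
    """Full S3 URL -> internal path used for navigation."""
    return s3_url.removeprefix(f"s3://{BUCKET}/")

print(to_s3_url("warehouse/analytics/events"))
print(to_internal_path("s3://my-datalake-bucket/warehouse/analytics/events"))
```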

Iceberg Table Detection

  • The system automatically detects whether a path contains an Iceberg table
  • Iceberg tables have special handling for sample data (reads from data/ subdirectory)
  • Non-Iceberg Parquet files are read directly
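
If you need a similar check in your own scripts, one simple approach (not necessarily how the panel implements detection) is to look for the metadata/ subdirectory that every Iceberg table carries. A sketch using boto3 with placeholder bucket and prefix values:

```python
import boto3

def looks_like_iceberg_table(bucket: str, prefix: str) -> bool:
    """Heuristic: an Iceberg table directory contains a metadata/ subdirectory."""
    # For S3-compatible storage you may also need endpoint_url and credentials.
    s3 = boto3.client("s3")
    resp = s3.list_objects_v2(
        Bucket=bucket,
        Prefix=f"{prefix.rstrip('/')}/metadata/",
        MaxKeys=1,
    )
    return resp.get("KeyCount", 0) > 0

print(looks_like_iceberg_table("my-datalake-bucket", "warehouse/analytics/events"))
```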

Connection Information

  • The Postgres connection URL shown in the bucket details card is masked for security
  • Hover over the connection chip to reveal the masked URL
  • Click the chip to copy the full unmasked URL to clipboard

Availability

  • If the datalake is not configured, an informational message appears
  • The feature requires backend services to be running and properly configured
  • Check with your platform administrator if the datalake appears unavailable