Comet Roadmap
Comet is an open-source project and contributors are welcome to pick up any issue at any time, but we find it helpful to maintain a roadmap for the major items that require coordination between contributors.
Window Expressions
Native window execution is currently disabled by default due to known correctness issues (#2721, #2841).
In addition, dedicated window functions such as rank, dense_rank, row_number, lag, lead, ntile,
cume_dist, percent_rank, and nth_value are not yet implemented and fall back to Spark (#2705). The
goal is to enable windowed aggregates by default (#4007) and add the missing dedicated window functions.
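As a hedged illustration of the kind of query affected, a ranking window query such as the following currently falls back to Spark (the `spark` session and the `sales` table are assumed for the example):

```scala
// Illustrative only: rank() is one of the dedicated window functions
// not yet implemented natively, so this query falls back to Spark.
val ranked = spark.sql("""
  SELECT category, item, revenue,
         rank() OVER (PARTITION BY category ORDER BY revenue DESC) AS rnk
  FROM sales
""")
```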
Lambda Expressions
Spark supports higher-order functions on arrays and maps that take a lambda, including transform, exists,
forall, aggregate, zip_with, map_filter, and map_zip_with. Comet currently lacks a general mechanism
for serializing lambda expressions and evaluating them in DataFusion. Adding this capability will unlock a
significant family of Spark expressions in one effort.
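To make the shape of these expressions concrete, here is a sketch of two higher-order function calls whose lambda arguments Comet cannot yet serialize for native evaluation (the `measurements` table and its `values` array column are assumed for the example):

```scala
// Illustrative only: each call takes a lambda (x -> ...) that would
// need to be serialized and evaluated inside DataFusion.
val transformed = spark.sql("""
  SELECT transform(values, x -> x * 2) AS doubled,
         exists(values, x -> x > 10)   AS has_large
  FROM measurements
""")
```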
Dynamic Partition Pruning
Native Parquet scans (CometNativeScanExec) support Dynamic Partition Pruning (DPP) both with and without
Adaptive Query Execution. Non-AQE DPP landed in #4011 and AQE DPP with broadcast reuse landed in #4112.
Iceberg native scans currently support non-AQE DPP only (#3349, #3511); extending broadcast reuse to AQE
DPP for Iceberg is tracked at #3510.
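For context, DPP applies to query shapes like the sketch below, where a filter on the dimension table lets the scan of the fact table skip partitions at runtime (the `fact` and `dim` tables, with `fact` partitioned by `date_key`, are assumed for the example):

```scala
// Illustrative only: the runtime filter derived from d.year = 2024
// prunes fact-table partitions before the native Parquet scan reads them.
val pruned = spark.sql("""
  SELECT f.amount
  FROM fact f
  JOIN dim d ON f.date_key = d.date_key
  WHERE d.year = 2024
""")
```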
TPC-H and TPC-DS Performance
We regularly publish benchmark results derived from TPC-H and TPC-DS to track performance against Spark. Closing the remaining gaps and increasing the speedup on both benchmark suites is an ongoing focus, tracked under #2004 (TPC-H), #858 (TPC-DS), and #3799 (improving on the TPC-DS results published by awslabs).
Upstream Work in DataFusion
A growing number of Spark-compatible expressions live in the datafusion-spark crate in the core DataFusion
repository. Comet is migrating its expression implementations to that crate so that they can be shared by other
DataFusion-based projects, tracked in #2084. Improvements to core DataFusion operators (joins, aggregates, window functions) made in support of Comet also benefit the wider ecosystem.
Spillable Hash Join
Comet’s native hash join currently requires the build side to fit entirely in memory. Adding spill-to-disk support will allow Comet to handle larger joins without falling back to Spark, improving both reliability and performance for memory-intensive workloads.
Java/Scala Columnar and Arrow UDF Support
Spark users frequently define custom UDFs in Java or Scala. Comet currently falls back to Spark when a query contains a JVM UDF. Adding support for calling Java/Scala UDFs that operate on columnar Arrow data directly from native execution will reduce fallbacks and allow more queries to run end-to-end in Comet.
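A minimal sketch of the kind of JVM UDF that triggers a fallback today (the `spark` session is assumed; the UDF name and logic are illustrative):

```scala
import org.apache.spark.sql.functions.udf

// Illustrative only: any query referencing this JVM UDF currently
// causes Comet to fall back to Spark for that plan subtree.
val plusOne = udf((x: Int) => x + 1)
spark.udf.register("plus_one", plusOne)

val result = spark.sql("SELECT plus_one(id) FROM range(10)")
```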
Memory Management Improvements
Comet coordinates memory between the JVM and native Rust execution through a custom memory pool. Improving memory accounting, reservation strategies, and spill integration will reduce out-of-memory errors and allow Comet to make better use of available resources, especially in multi-query and multi-task environments.
Prepare for 1.0.0 Release
The project is working toward a 1.0.0 release. This effort includes finalizing configuration options, resolving known correctness issues, and improving documentation. Progress is tracked in #4082.
Native Parquet Writes
Comet has experimental support for native Parquet writes via InsertIntoHadoopFsRelationCommand, currently
disabled by default. The goal is to reach correctness and performance parity with Spark’s writer so it can be
enabled by default (#1625).
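For reference, a plain DataFrame write such as the sketch below is planned through InsertIntoHadoopFsRelationCommand and is the kind of operation the native writer targets (the output path is illustrative):

```scala
// Illustrative only: a standard Parquet write, planned via
// InsertIntoHadoopFsRelationCommand, which the experimental
// native writer would execute once enabled.
spark.range(1000)
  .write
  .mode("overwrite")
  .parquet("/tmp/comet_demo")
```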