Apache DataFusion Comet

A high-performance accelerator for Apache Spark

Runs your existing Spark queries on the Apache DataFusion native engine, no code changes required. Also accelerates Parquet scans for Apache Iceberg.

spark-shell — comet enabled
# Point COMET_JAR at the Comet plugin jar you downloaded for your Spark / Scala version
$ export COMET_JAR=comet-spark-spark4.1_2.13-0.16.0.jar

# Launch Spark with Comet enabled — drop-in, no code changes
$ $SPARK_HOME/bin/spark-shell \
    --jars $COMET_JAR \
    --conf spark.driver.extraClassPath=$COMET_JAR \
    --conf spark.executor.extraClassPath=$COMET_JAR \
    --conf spark.plugins=org.apache.spark.CometPlugin \
    --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
    --conf spark.memory.offHeap.enabled=true \
    --conf spark.memory.offHeap.size=4g

// Your existing Spark queries — now executed natively via DataFusion
scala> spark.sql("SELECT category, COUNT(*) FROM events GROUP BY category").show()
scala> 
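
The launch flags above can also be kept in a Spark properties file and passed with `--properties-file`, which avoids retyping them. The following is a minimal sketch under stated assumptions, not an official Comet setup script: the helper name `comet_properties` and the file name `comet.conf` are made up for illustration, while the keys and values are copied from the command above (`spark.jars` is the configuration-property form of `--jars`).

```shell
#!/bin/sh
# Hypothetical helper (not part of Comet): print the settings from the
# launch command in Spark properties-file format, one "key value" per line.
comet_properties() {
  jar="$1"   # path to the Comet jar
  cat <<EOF
spark.jars $jar
spark.driver.extraClassPath $jar
spark.executor.extraClassPath $jar
spark.plugins org.apache.spark.CometPlugin
spark.shuffle.manager org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager
spark.memory.offHeap.enabled true
spark.memory.offHeap.size 4g
EOF
}

# Write the file. comet.conf is an illustrative name; the fallback jar
# name is the one used in the launch command.
comet_properties "${COMET_JAR:-comet-spark-spark4.1_2.13-0.16.0.jar}" > comet.conf

# Launch with the file (left commented so the sketch runs without Spark):
# "$SPARK_HOME/bin/spark-shell" --properties-file comet.conf
```

By default, Spark reads a file given via `--properties-file` in place of `conf/spark-defaults.conf`, so any settings you rely on from that file need to be copied across.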

Apache 2.0  ·  Apache Software Foundation project  ·  Runs on commodity hardware

Run Spark Queries at DataFusion Speeds

Comet speeds up many queries, which means faster data processing and shorter time to insight.

The chart below compares Comet with stock Apache Spark on TPC-DS at 1 TB. See the Comet Benchmarking Guide for the full per-query breakdown and reproduction methodology.

Figure: Total time to run all TPC-DS queries, Comet versus stock Apache Spark (lower is better).

Spark Compatibility

Designed for 100% compatibility with supported Spark versions.

Comet aims for 100% compatibility with all supported versions of Apache Spark, so you can add it to existing Spark deployments and workflows without changing your application code. The Comet extension automatically detects unsupported features and falls back to the Spark engine for them.
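
One way to see where fallback happened is to inspect a query's physical plan. The helper below is a rough sketch, not a Comet tool: `count_comet_lines` is a hypothetical name, and it assumes that operators executed by Comet show up in `EXPLAIN` output with names beginning with `Comet`, so plan lines without that marker were left to Spark.

```shell
#!/bin/sh
# Hypothetical helper (not part of Comet): read a physical plan on stdin
# and report how many non-empty plan lines mention a Comet operator.
# Assumption: Comet operator names start with "Comet".
count_comet_lines() {
  plan="$(cat)"
  # grep -c exits non-zero when the count is 0, so ignore its status.
  total=$(printf '%s\n' "$plan" | grep -c . || true)
  comet=$(printf '%s\n' "$plan" | grep -c 'Comet' || true)
  echo "comet=$comet total=$total"
}

# Example use (left commented so the sketch runs without Spark). Pass the
# same Comet settings as in the launch command where the "..." is:
# "$SPARK_HOME/bin/spark-sql" ... \
#   -e "EXPLAIN SELECT category, COUNT(*) FROM events GROUP BY category" \
#   | count_comet_lines
```

A line count is only a coarse signal. Reading the plan itself shows which specific operators stayed on the Spark engine.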

Architecture

Tight integration with Apache DataFusion.

The diagram below shows an overview of Comet's architecture: how the Comet plugin intercepts Spark physical plans, translates supported operators into a protocol-buffer representation, and hands them to the Apache DataFusion native engine for execution.

Figure: Comet Overview. Architecture diagram showing the bridge between Apache Spark and Apache DataFusion.

Getting Started

To get started with Apache DataFusion Comet, follow the installation instructions. Join the DataFusion Slack and Discord channels to connect with other users, ask questions, and share your experiences with Comet.

Contributing

We welcome community contributions to Apache DataFusion Comet, whether that means fixing bugs, adding new features, writing documentation, or optimizing performance. Check out our contributor guide to get started.