DataFusion Comet 0.13.0 Changelog

This release consists of 169 commits from 15 contributors. See the credits at the end of this changelog for more information.

Fixed bugs:

  • fix: NativeScan count assert firing for no reason #2850 (EmilyMatt)

  • fix: Correct link to tracing guide in CometConf #2866 (manuzhang)

  • fix: Fall back to Spark for MakeDecimal with unsupported input type #2815 (andygrove)

  • fix: Normalize s3 paths for PME key retriever #2874 (mbutrovich)

  • fix: modify CometNativeScan to generate the file partitions without instantiating RDD #2891 (mbutrovich)

  • fix: Modulus on decimal data type mismatch #2922 (andygrove)

  • fix: [iceberg] Mark nativeIcebergScanMetadata @transient #2930 (mbutrovich)

  • fix: enable cast tests for Spark 4.0 #2919 (manuzhang)

  • fix: Remove fallback for maps containing complex types #2943 (andygrove)

  • fix: CometShuffleManager hang by deferring SparkEnv access #3002 (Shekharrajak)

  • fix: format decimal to string when casting to short #2916 (manuzhang)

  • fix: [iceberg] reduce granularity of metrics updates in IcebergFileStream #3050 (mbutrovich)

  • fix: native shuffle now reports spill metrics correctly #3197 (andygrove)

  • fix: Prevent native write when input is not Arrow format #3227 (andygrove)

  • fix: Add JDK to Docker image for release build #3262 (hsiang-c)

Performance related:

  • perf: [iceberg] Deduplicate serialized metadata for Iceberg native scan #2933 (mbutrovich)

  • perf: Use await instead of block_on in native shuffle writer #2937 (mbutrovich)

  • perf: refactor executePlan to try to avoid constantly entering Tokio runtime #2938 (mbutrovich)

  • perf: Optimize lpad/rpad to remove unnecessary memory allocations per element #2963 (andygrove)

  • perf: Improve performance of normalize_nan #2999 (andygrove)

  • perf: Improve string expression microbenchmarks #3012 (andygrove)

  • perf: Improve date/time microbenchmarks to avoid redundant/duplicate benchmarks #3020 (andygrove)

  • perf: Improve aggregate expression microbenchmarks #3021 (andygrove)

  • perf: Improve conditional expression microbenchmarks #3024 (andygrove)

  • perf: Improve performance of date truncate #2997 (andygrove)

  • perf: Add microbenchmark for comparison expressions #3026 (andygrove)

  • perf: Implement more microbenchmarks for cast expressions #3031 (andygrove)

  • perf: Add microbenchmark for hash expressions #3028 (andygrove)

  • perf: Improve performance of CAST from string to int #3017 (coderfender)

  • perf: Improve criterion benchmarks for cast string to int #3049 (andygrove)

  • perf: Additional optimizations for cast from string to int #3048 (andygrove)

  • perf: set DataFusion session context’s target_partitions to match Spark’s spark.task.cpus #3062 (mbutrovich)

  • perf: don’t busy-poll Tokio stream for plans without CometScan #3063 (mbutrovich)

  • perf: minor optimizations in process_sorted_row_partition #3059 (andygrove)

  • perf: optimize complex-type hash implementations #3140 (mbutrovich)

  • perf: [iceberg] Remove IcebergFileStream, use iceberg-rust’s parallelization, bump iceberg-rust to latest, cache SchemaAdapter #3051 (mbutrovich)

  • perf: [iceberg] reduce nativeIcebergScanMetadata serialization points #3243 (mbutrovich)

  • perf: reduce GC pressure in protobuf serialization #3242 (andygrove)

  • perf: cache serialized query plans to avoid per-partition serialization #3246 (andygrove)

  • perf: [iceberg] Use protobuf instead of JSON to serialize Iceberg partition values #3247 (parthchandra)

Implemented enhancements:

  • feat: Add experimental support for native Parquet writes #2812 (andygrove)

  • feat: Partially implement file commit protocol for native Parquet writes #2828 (andygrove)

  • feat: CometNativeWriteExec support with native scan as a child #2839 (mbutrovich)

  • feat: Add support for explode and explode_outer for array inputs #2836 (andygrove)

  • feat: Support ANSI mode SUM (Decimal types) #2826 (coderfender)

  • feat: Add expression registry to native planner #2851 (andygrove)

  • feat: Implement native operator registry #2875 (andygrove)

  • feat: Improve fallback reporting for native_datafusion scan #2879 (andygrove)

  • feat: Enable bucket pruning with native_datafusion scans #2888 (mbutrovich)

  • feat: support_ansi-mode_aggregated_benchmarking #2901 (coderfender)

  • feat: [iceberg] REST catalog support for CometNativeIcebergScan #2895 (mbutrovich)

  • feat: [iceberg] Support session token in Iceberg Native scan #2913 (hsiang-c)

  • feat: Make shuffle writer buffer size configurable #2899 (andygrove)

  • feat: Add partial support for from_json #2934 (andygrove)

  • feat: Create benchmarks comet cast #2932 (coderfender)

  • feat: Support string decimal cast #2925 (coderfender)

  • feat: Remove unnecessary transition for native writes #2960 (comphead)

  • feat: Initial implementation of size for array inputs #2862 (andygrove)

  • feat: Support ANSI mode sum expr (int inputs) #2600 (coderfender)

  • feat: Support casting string float types #2835 (coderfender)

  • feat: Support ANSI mode avg expr (int inputs) #2817 (coderfender)

  • feat: Add support for remote Parquet HDFS writer with openDAL #2929 (comphead)

  • feat: Expand murmur3 hash support to complex types #3077 (andygrove)

  • feat: Comet Writer should respect object store settings #3042 (comphead)

  • feat: add support for unix_date expression #3141 (andygrove)

  • feat: add partial support for date_format expression #3201 (andygrove)

  • feat: add complex type support to native Parquet writer #3214 (andygrove)

  • feat: implement framework to support multiple pyspark benchmarks #3080 (andygrove)

  • feat: add support for datediff expression #3145 (andygrove)

  • feat: Add support for unix_timestamp function #2936 (andygrove)

  • feat: add support for last_day expression #3143 (andygrove)

  • feat: Support left expression #3206 (Shekharrajak)

  • feat: Add support for round-robin partitioning in native shuffle #3076 (andygrove)

  • feat: Native columnar to row conversion (Phase 1) #3221 (andygrove)

Documentation updates:

  • docs: add documentation for fully-native Iceberg scans #2868 (mbutrovich)

  • docs: Add documentation to contributor guide explaining native + JVM shuffle implementation #3055 (andygrove)

  • docs: add guidance on disabling constant folding for literal tests #3200 (andygrove)

  • docs: Add common pitfalls and improve PR checklist in development guide #3231 (andygrove)

  • docs: various documentation updates in preparation for next release #3254 (andygrove)

  • docs: Stop generating dynamic docs content in build #3212 (andygrove)

  • docs: document datetime rebasing and V2 API limitations for DataFusion-based scans #3259 (andygrove)

  • docs: Mark native_comet scan as deprecated #3274 (andygrove)

Other:

  • chore: Add 0.12.0 changelog #2811 (andygrove)

  • chore: Prepare for 0.13.0 development #2809 (andygrove)

  • minor: Add microbenchmark for integer sum with grouping #2805 (andygrove)

  • test: extract conditional expression tests (if, case_when and coalesce) #2807 (rluvaton)

  • build: Disable caching for macOS PR builds #2816 (andygrove)

  • chore(deps): bump actions/checkout from 5 to 6 #2818 (dependabot[bot])

  • chore(deps): bump object_store_opendal from 0.54.1 to 0.55.0 in /native #2819 (dependabot[bot])

  • chore(deps): bump cc from 1.2.46 to 1.2.47 in /native #2822 (dependabot[bot])

  • chore(deps): bump opendal from 0.54.1 to 0.55.0 in /native #2821 (dependabot[bot])

  • chore: update Iceberg install docs #2824 (comphead)

  • chore(deps): bump cc from 1.2.47 to 1.2.48 in /native #2833 (dependabot[bot])

  • chore(deps): bump the proto group in /native with 2 updates #2832 (dependabot[bot])

  • minor: Clean up shuffle transformation code in CometExecRule #2840 (andygrove)

  • chore: fix broken link to Apache DataFusion Comet Overview in README #2846 (onestn)

  • chore: Refactor some of the scan and sink handling in CometExecRule to reduce duplicate code #2844 (andygrove)

  • deps: bump lz4_flex, downgrade prost from yanked version #2847 (mbutrovich)

  • minor: Move shuffle logic from CometExecRule to CometShuffleExchangeExec serde implementation #2853 (andygrove)

  • chore: remove coverage file auto generator #2854 (comphead)

  • chore(deps): bump cc from 1.2.48 to 1.2.49 in /native #2858 (dependabot[bot])

  • chore: Refactor CometExecRule handling of BroadcastHashJoin and fix fallback reporting #2856 (andygrove)

  • chore: update actions/checkout from v4 to v6 in setup-iceberg and set… #2857 (bjornjorgensen)

  • minor: Small refactor in CometExecRule to remove confusing code and fix more fallback reporting #2860 (andygrove)

  • chore: Add unit tests for CometExecRule #2863 (andygrove)

  • chore: Add unit tests for CometScanRule #2867 (andygrove)

  • minor: Pedantic refactoring to move some methods from CometSparkSessionExtensions to CometScanRule and CometExecRule #2873 (andygrove)

  • deps: [iceberg] upgrade DataFusion to 51, Arrow to 57, Iceberg to latest, MSRV to 1.88 #2729 (mbutrovich)

  • chore: Enable plan stability suite for native_datafusion scans #2877 (andygrove)

  • chore: ScanExec::new no longer fetches data #2881 (andygrove)

  • Chore: refactor bit_not #2896 (kazantsev-maksim)

  • chore(deps): bump actions/cache from 4 to 5 #2909 (dependabot[bot])

  • chore(deps): bump actions/upload-artifact from 5 to 6 #2910 (dependabot[bot])

  • chore: Refactor string benchmarks (~10x reduction in LOC) #2907 (andygrove)

  • chore(deps): bump actions/download-artifact from 6 to 7 #2908 (dependabot[bot])

  • chore: use datafusion impl of hex function #2915 (kazantsev-maksim)

  • chore: Use fixed seed in RNG in tests #2917 (andygrove)

  • chore: Remove row_step from process_sorted_row_partition #2920 (andygrove)

  • chore: Move string function handling to new expression registry #2931 (andygrove)

  • chore: Reduce syscalls in metrics update logic #2940 (andygrove)

  • chore: Add shuffle benchmark for deeply nested schemas #2902 (andygrove)

  • chore: Reduce timer overhead in native shuffle writer #2941 (andygrove)

  • chore: Remove low-level ffi/jvm timers from native ScanExec #2939 (andygrove)

  • build: Skip problematic Spark SQL test for Spark 4.0.x #2947 (andygrove)

  • build: Reinstate macOS CI builds of Comet with Spark 4.0 #2950 (manuzhang)

  • chore(deps): bump reqwest from 0.12.25 to 0.12.26 in /native #2952 (dependabot[bot])

  • chore(deps): bump cc from 1.2.49 to 1.2.50 in /native #2954 (dependabot[bot])

  • chore(deps): bump assertables from 9.8.2 to 9.8.3 in /native #2953 (dependabot[bot])

  • minor: Refactor expression microbenchmarks to remove duplicate code #2956 (andygrove)

  • build: fix missing import in main #2962 (andygrove)

  • build: Skip macOS Spark 4 fuzz test #2966 (andygrove)

  • Avoid duplicated writer nodes when AQE enabled #2982 (comphead)

  • build: Set thread thresholds envs for spark test on macOS #2987 (wForget)

  • chore: Add microbenchmark for casting string to temporal types #2980 (andygrove)

  • chore(deps): bump reqwest from 0.12.26 to 0.12.28 in /native #3009 (dependabot[bot])

  • chore(deps): bump tempfile from 3.23.0 to 3.24.0 in /native #3006 (dependabot[bot])

  • chore(deps): bump serde_json from 1.0.145 to 1.0.148 in /native #3010 (dependabot[bot])

  • chore: Add microbenchmark for casting string to numeric #2979 (andygrove)

  • chore: Skip some CI workflows for benchmark changes #3030 (andygrove)

  • chore: Skip more workflows on benchmark PRs #3034 (andygrove)

  • chore: Improve microbenchmark for string expressions #2964 (andygrove)

  • chore(deps): bump tokio from 1.48.0 to 1.49.0 in /native #3039 (dependabot[bot])

  • chore(deps): bump libc from 0.2.178 to 0.2.179 in /native #3038 (dependabot[bot])

  • chore(deps): bump actions/cache from 4 to 5 #3037 (dependabot[bot])

  • Chore: to_json unit/benchmark tests #3011 (kazantsev-maksim)

  • chore: Add checks to microbenchmarks for plan running natively in Comet #3045 (andygrove)

  • chore: Refactor CometScanRule to improve scan selection and fallback logic #2978 (andygrove)

  • chore: Respect to legacySizeOfNull option for size function #3036 (kazantsev-maksim)

  • chore: Add PySpark-based benchmarks, starting with ETL example #3065 (andygrove)

  • chore(deps): bump the proto group in /native with 2 updates #3071 (dependabot[bot])

  • chore: add MacOS file and event trace log to gitignore #3070 (manuzhang)

  • chore(deps): bump arrow from 57.1.0 to 57.2.0 in /native #3073 (dependabot[bot])

  • chore(deps): bump parquet from 57.1.0 to 57.2.0 in /native #3074 (dependabot[bot])

  • chore(deps): bump cc from 1.2.50 to 1.2.52 in /native #3072 (dependabot[bot])

  • chore: improve cast documentation to add support per eval mode #3056 (coderfender)

  • chore: Refactor JVM shuffle: Move SpillSorter to top level class and add tests #3081 (andygrove)

  • minor: Split CometShuffleExternalSorter into sync/async implementations #3192 (andygrove)

  • chore: Add pending PR shield #3205 (comphead)

  • chore: deprecate native_comet scan in favor of native_iceberg_compat #2949 (Shekharrajak)

  • chore: add script to regenerate golden files for plan stability tests #3204 (andygrove)

  • chore: fix clippy warnings for Rust 1.93 #3239 (andygrove)

  • build: build native library once and share across CI test jobs #3249 (andygrove)

  • Experimental: Native CSV files read #3044 (kazantsev-maksim)

  • build: add missing datafusion-datasource dependency #3252 (andygrove)

  • chore: Auto scan mode no longer falls back to native_comet #3236 (andygrove)

  • build: optimize CI cache usage and add fast lint gate #3251 (andygrove)

  • build: use install instead of compile in TPC CI jobs #3263 (andygrove)

  • build: remove dead code for 0.8/0.9 docs that broke CI #3264 (andygrove)

  • refactor: rename scan.allowIncompatible to scan.unsignedSmallIntSafetyCheck #3238 (andygrove)

Credits

Thank you to everyone who contributed to this release. Here is a breakdown of commits (PRs merged) per contributor.

    91	Andy Grove
    23	dependabot[bot]
    18	Matt Butrovich
     9	B Vadlamani
     7	Oleks V
     5	Kazantsev Maksim
     5	Manu Zhang
     3	Shekhar Prasad Rajak
     2	hsiang-c
     1	Bjørn Jørgensen
     1	Emily Matheys
     1	Parth Chandra
     1	Raz Luvaton
     1	Wonseok Yang
     1	Zhen Wang

Thank you also to everyone who contributed in other ways, such as filing issues, reviewing PRs, and providing feedback on this release.