DataFusion Comet 0.13.0 Changelog
This release consists of 169 commits from 15 contributors. See credits at the end of this changelog for more information.
Fixed bugs:
fix: NativeScan count assert firing for no reason #2850 (EmilyMatt)
fix: Correct link to tracing guide in CometConf #2866 (manuzhang)
fix: Fall back to Spark for MakeDecimal with unsupported input type #2815 (andygrove)
fix: Normalize s3 paths for PME key retriever #2874 (mbutrovich)
fix: modify CometNativeScan to generate the file partitions without instantiating RDD #2891 (mbutrovich)
fix: Modulus on decimal data type mismatch #2922 (andygrove)
fix: [iceberg] Mark nativeIcebergScanMetadata @transient #2930 (mbutrovich)
fix: enable cast tests for Spark 4.0 #2919 (manuzhang)
fix: Remove fallback for maps containing complex types #2943 (andygrove)
fix: CometShuffleManager hang by deferring SparkEnv access #3002 (Shekharrajak)
fix: format decimal to string when casting to short #2916 (manuzhang)
fix: [iceberg] reduce granularity of metrics updates in IcebergFileStream #3050 (mbutrovich)
fix: native shuffle now reports spill metrics correctly #3197 (andygrove)
fix: Prevent native write when input is not Arrow format #3227 (andygrove)
fix: Add JDK to Docker image for release build #3262 (hsiang-c)
Performance related:
perf: [iceberg] Deduplicate serialized metadata for Iceberg native scan #2933 (mbutrovich)
perf: Use await instead of block_on in native shuffle writer #2937 (mbutrovich)
perf: refactor executePlan to try to avoid constantly entering Tokio runtime #2938 (mbutrovich)
perf: Optimize lpad/rpad to remove unnecessary memory allocations per element #2963 (andygrove)
perf: Improve performance of normalize_nan #2999 (andygrove)
perf: Improve string expression microbenchmarks #3012 (andygrove)
perf: Improve date/time microbenchmarks to avoid redundant/duplicate benchmarks #3020 (andygrove)
perf: Improve aggregate expression microbenchmarks #3021 (andygrove)
perf: Improve conditional expression microbenchmarks #3024 (andygrove)
perf: Improve performance of date truncate #2997 (andygrove)
perf: Add microbenchmark for comparison expressions #3026 (andygrove)
perf: Implement more microbenchmarks for cast expressions #3031 (andygrove)
perf: Add microbenchmark for hash expressions #3028 (andygrove)
perf: Improve performance of CAST from string to int #3017 (coderfender)
perf: Improve criterion benchmarks for cast string to int #3049 (andygrove)
perf: Additional optimizations for cast from string to int #3048 (andygrove)
perf: set DataFusion session context’s target_partitions to match Spark’s spark.task.cpus #3062 (mbutrovich)
perf: don’t busy-poll Tokio stream for plans without CometScan #3063 (mbutrovich)
perf: minor optimizations in process_sorted_row_partition #3059 (andygrove)
perf: optimize complex-type hash implementations #3140 (mbutrovich)
perf: [iceberg] Remove IcebergFileStream, use iceberg-rust’s parallelization, bump iceberg-rust to latest, cache SchemaAdapter #3051 (mbutrovich)
perf: [iceberg] reduce nativeIcebergScanMetadata serialization points #3243 (mbutrovich)
perf: reduce GC pressure in protobuf serialization #3242 (andygrove)
perf: cache serialized query plans to avoid per-partition serialization #3246 (andygrove)
perf: [iceberg] Use protobuf instead of JSON to serialize Iceberg partition values #3247 (parthchandra)
Implemented enhancements:
feat: Add experimental support for native Parquet writes #2812 (andygrove)
feat: Partially implement file commit protocol for native Parquet writes #2828 (andygrove)
feat: CometNativeWriteExec support with native scan as a child #2839 (mbutrovich)
feat: Add support for explode and explode_outer for array inputs #2836 (andygrove)
feat: Support ANSI mode SUM (Decimal types) #2826 (coderfender)
feat: Add expression registry to native planner #2851 (andygrove)
feat: Implement native operator registry #2875 (andygrove)
feat: Improve fallback reporting for native_datafusion scan #2879 (andygrove)
feat: Enable bucket pruning with native_datafusion scans #2888 (mbutrovich)
feat: support_ansi-mode_aggregated_benchmarking #2901 (coderfender)
feat: [iceberg] REST catalog support for CometNativeIcebergScan #2895 (mbutrovich)
feat: [iceberg] Support session token in Iceberg Native scan #2913 (hsiang-c)
feat: Make shuffle writer buffer size configurable #2899 (andygrove)
feat: Add partial support for from_json #2934 (andygrove)
feat: Create benchmarks comet cast #2932 (coderfender)
feat: Support string decimal cast #2925 (coderfender)
feat: Remove unnecessary transition for native writes #2960 (comphead)
feat: Initial implementation of size for array inputs #2862 (andygrove)
feat: Support ANSI mode sum expr (int inputs) #2600 (coderfender)
feat: Support casting string float types #2835 (coderfender)
feat: Support ANSI mode avg expr (int inputs) #2817 (coderfender)
feat: Add support for remote Parquet HDFS writer with openDAL #2929 (comphead)
feat: Expand murmur3 hash support to complex types #3077 (andygrove)
feat: Comet Writer should respect object store settings #3042 (comphead)
feat: add support for unix_date expression #3141 (andygrove)
feat: add partial support for date_format expression #3201 (andygrove)
feat: add complex type support to native Parquet writer #3214 (andygrove)
feat: implement framework to support multiple pyspark benchmarks #3080 (andygrove)
feat: add support for datediff expression #3145 (andygrove)
feat: Add support for unix_timestamp function #2936 (andygrove)
feat: add support for last_day expression #3143 (andygrove)
feat: Support left expression #3206 (Shekharrajak)
feat: Add support for round-robin partitioning in native shuffle #3076 (andygrove)
feat: Native columnar to row conversion (Phase 1) #3221 (andygrove)
Documentation updates:
docs: add documentation for fully-native Iceberg scans #2868 (mbutrovich)
docs: Add documentation to contributor guide explaining native + JVM shuffle implementation #3055 (andygrove)
docs: add guidance on disabling constant folding for literal tests #3200 (andygrove)
docs: Add common pitfalls and improve PR checklist in development guide #3231 (andygrove)
docs: various documentation updates in preparation for next release #3254 (andygrove)
docs: Stop generating dynamic docs content in build #3212 (andygrove)
docs: document datetime rebasing and V2 API limitations for DataFusion-based scans #3259 (andygrove)
docs: Mark native_comet scan as deprecated #3274 (andygrove)
Other:
chore: Add 0.12.0 changelog #2811 (andygrove)
chore: Prepare for 0.13.0 development #2809 (andygrove)
minor: Add microbenchmark for integer sum with grouping #2805 (andygrove)
test: extract conditional expression tests (if, case_when and coalesce) #2807 (rluvaton)
build: Disable caching for macOS PR builds #2816 (andygrove)
chore(deps): bump actions/checkout from 5 to 6 #2818 (dependabot[bot])
chore(deps): bump object_store_opendal from 0.54.1 to 0.55.0 in /native #2819 (dependabot[bot])
chore(deps): bump cc from 1.2.46 to 1.2.47 in /native #2822 (dependabot[bot])
chore(deps): bump opendal from 0.54.1 to 0.55.0 in /native #2821 (dependabot[bot])
chore: update Iceberg install docs #2824 (comphead)
chore(deps): bump cc from 1.2.47 to 1.2.48 in /native #2833 (dependabot[bot])
chore(deps): bump the proto group in /native with 2 updates #2832 (dependabot[bot])
minor: Clean up shuffle transformation code in CometExecRule #2840 (andygrove)
chore: fix broken link to Apache DataFusion Comet Overview in README #2846 (onestn)
chore: Refactor some of the scan and sink handling in CometExecRule to reduce duplicate code #2844 (andygrove)
deps: bump lz4_flex, downgrade prost from yanked version #2847 (mbutrovich)
minor: Move shuffle logic from CometExecRule to CometShuffleExchangeExec serde implementation #2853 (andygrove)
chore: remove coverage file auto generator #2854 (comphead)
chore(deps): bump cc from 1.2.48 to 1.2.49 in /native #2858 (dependabot[bot])
chore: Refactor CometExecRule handling of BroadcastHashJoin and fix fallback reporting #2856 (andygrove)
chore: update actions/checkout from v4 to v6 in setup-iceberg and set… #2857 (bjornjorgensen)
minor: Small refactor in CometExecRule to remove confusing code and fix more fallback reporting #2860 (andygrove)
chore: Add unit tests for CometExecRule #2863 (andygrove)
chore: Add unit tests for CometScanRule #2867 (andygrove)
minor: Pedantic refactoring to move some methods from CometSparkSessionExtensions to CometScanRule and CometExecRule #2873 (andygrove)
deps: [iceberg] upgrade DataFusion to 51, Arrow to 57, Iceberg to latest, MSRV to 1.88 #2729 (mbutrovich)
chore: Enable plan stability suite for native_datafusion scans #2877 (andygrove)
chore: ScanExec::new no longer fetches data #2881 (andygrove)
Chore: refactor bit_not #2896 (kazantsev-maksim)
chore(deps): bump actions/cache from 4 to 5 #2909 (dependabot[bot])
chore(deps): bump actions/upload-artifact from 5 to 6 #2910 (dependabot[bot])
chore: Refactor string benchmarks (~10x reduction in LOC) #2907 (andygrove)
chore(deps): bump actions/download-artifact from 6 to 7 #2908 (dependabot[bot])
chore: use datafusion impl of hex function #2915 (kazantsev-maksim)
chore: Use fixed seed in RNG in tests #2917 (andygrove)
chore: Remove row_step from process_sorted_row_partition #2920 (andygrove)
chore: Move string function handling to new expression registry #2931 (andygrove)
chore: Reduce syscalls in metrics update logic #2940 (andygrove)
chore: Add shuffle benchmark for deeply nested schemas #2902 (andygrove)
chore: Reduce timer overhead in native shuffle writer #2941 (andygrove)
chore: Remove low-level ffi/jvm timers from native ScanExec #2939 (andygrove)
build: Skip problematic Spark SQL test for Spark 4.0.x #2947 (andygrove)
build: Reinstate macOS CI builds of Comet with Spark 4.0 #2950 (manuzhang)
chore(deps): bump reqwest from 0.12.25 to 0.12.26 in /native #2952 (dependabot[bot])
chore(deps): bump cc from 1.2.49 to 1.2.50 in /native #2954 (dependabot[bot])
chore(deps): bump assertables from 9.8.2 to 9.8.3 in /native #2953 (dependabot[bot])
minor: Refactor expression microbenchmarks to remove duplicate code #2956 (andygrove)
build: fix missing import in main #2962 (andygrove)
build: Skip macOS Spark 4 fuzz test #2966 (andygrove)
Avoid duplicated writer nodes when AQE enabled #2982 (comphead)
build: Set thread thresholds envs for spark test on macOS #2987 (wForget)
chore: Add microbenchmark for casting string to temporal types #2980 (andygrove)
chore(deps): bump reqwest from 0.12.26 to 0.12.28 in /native #3009 (dependabot[bot])
chore(deps): bump tempfile from 3.23.0 to 3.24.0 in /native #3006 (dependabot[bot])
chore(deps): bump serde_json from 1.0.145 to 1.0.148 in /native #3010 (dependabot[bot])
chore: Add microbenchmark for casting string to numeric #2979 (andygrove)
chore: Skip some CI workflows for benchmark changes #3030 (andygrove)
chore: Skip more workflows on benchmark PRs #3034 (andygrove)
chore: Improve microbenchmark for string expressions #2964 (andygrove)
chore(deps): bump tokio from 1.48.0 to 1.49.0 in /native #3039 (dependabot[bot])
chore(deps): bump libc from 0.2.178 to 0.2.179 in /native #3038 (dependabot[bot])
chore(deps): bump actions/cache from 4 to 5 #3037 (dependabot[bot])
Chore: to_json unit/benchmark tests #3011 (kazantsev-maksim)
chore: Add checks to microbenchmarks for plan running natively in Comet #3045 (andygrove)
chore: Refactor CometScanRule to improve scan selection and fallback logic #2978 (andygrove)
chore: Respect to legacySizeOfNull option for size function #3036 (kazantsev-maksim)
chore: Add PySpark-based benchmarks, starting with ETL example #3065 (andygrove)
chore(deps): bump the proto group in /native with 2 updates #3071 (dependabot[bot])
chore: add MacOS file and event trace log to gitignore #3070 (manuzhang)
chore(deps): bump arrow from 57.1.0 to 57.2.0 in /native #3073 (dependabot[bot])
chore(deps): bump parquet from 57.1.0 to 57.2.0 in /native #3074 (dependabot[bot])
chore(deps): bump cc from 1.2.50 to 1.2.52 in /native #3072 (dependabot[bot])
chore: improve cast documentation to add support per eval mode #3056 (coderfender)
chore: Refactor JVM shuffle: Move SpillSorter to top level class and add tests #3081 (andygrove)
minor: Split CometShuffleExternalSorter into sync/async implementations #3192 (andygrove)
chore: Add pending PR shield #3205 (comphead)
chore: deprecate native_comet scan in favor of native_iceberg_compat #2949 (Shekharrajak)
chore: add script to regenerate golden files for plan stability tests #3204 (andygrove)
chore: fix clippy warnings for Rust 1.93 #3239 (andygrove)
build: build native library once and share across CI test jobs #3249 (andygrove)
Experimental: Native CSV files read #3044 (kazantsev-maksim)
build: add missing datafusion-datasource dependency #3252 (andygrove)
chore: Auto scan mode no longer falls back to native_comet #3236 (andygrove)
build: optimize CI cache usage and add fast lint gate #3251 (andygrove)
build: use install instead of compile in TPC CI jobs #3263 (andygrove)
build: remove dead code for 0.8/0.9 docs that broke CI #3264 (andygrove)
refactor: rename scan.allowIncompatible to scan.unsignedSmallIntSafetyCheck #3238 (andygrove)
Credits
Thank you to everyone who contributed to this release. Here is a breakdown of commits (PRs merged) per contributor.
91 Andy Grove
23 dependabot[bot]
18 Matt Butrovich
9 B Vadlamani
7 Oleks V
5 Kazantsev Maksim
5 Manu Zhang
3 Shekhar Prasad Rajak
2 hsiang-c
1 Bjørn Jørgensen
1 Emily Matheys
1 Parth Chandra
1 Raz Luvaton
1 Wonseok Yang
1 Zhen Wang
Thank you also to everyone who contributed in other ways such as filing issues, reviewing PRs, and providing feedback on this release.