DataFusion Comet 0.16.0 Changelog#

This release consists of 127 commits from 17 contributors. See credits at the end of this changelog for more information.

Fixed bugs:

  • fix: report task output metrics in Spark UI #3999 (0lai0)

  • fix: cast to and from timestamp_ntz #4008 (parthchandra)

  • fix: support to_json on Spark 4.0 #4036 (andygrove)

  • fix: enable arrays_overlap #3901 (kazuyukitanimura)

  • fix: Iceberg reflection for current() on TableOperations hierarchy #3895 (karuppayya)

  • fix: fall back to Spark for shuffle/sort/aggregate on non-default collated strings [Spark 4] #4035 (andygrove)

  • fix: scalar subquery pushdown and reuse for CometNativeScanExec (SPARK-43402) #4053 (mbutrovich)

  • fix: fall back for shredded Variant scans on Spark 4.0 #4084 (andygrove)

  • fix: enable Spark 4 SQL tests previously ignored for issues #3313 and #3314 #4092 (andygrove)

  • fix: fall back to Spark for hash join and sort-merge join on non-default collated string keys [Spark 4] #4095 (0lai0)

  • fix: reject string/binary read as numeric in native_datafusion scan #4091 (andygrove)

  • fix: reject incompatible decimal precision/scale in native_datafusion scan #4090 (andygrove)

  • fix: throw SchemaColumnConvertNotSupportedException from native_datafusion schema mismatch #4117 (andygrove)

  • fix: substring with negative start index #4017 (kazuyukitanimura)

  • fix: honor strictFloatingPoint in RangePartitioning #4167 (0lai0)

  • fix: [Spark 4.1.1] preserve stored allowDecimalPrecisionLoss in DecimalPrecision rule #4179 (andygrove)

  • fix: [Spark 4.1.1] preserve parent struct nullness when all requested fields missing in Parquet #4190 (andygrove)

  • fix: support Spark 4.1 BloomFilter V2 format and bit-scattering #4196 (andygrove)

  • fix: JNI local reference cleanup in JVMClasses::with_env #4225 (0lai0)

  • fix: broadcast exchange bypasses AQE partition coalescing #4163 (andygrove)

  • fix: resolve Scala compiler warnings for auto-tupling and bare try #4227 (andygrove)

  • fix: [Spark 4.1] preserve union output partitioning in CometUnionExec #4207 (andygrove)

  • fix: re-enable tests skipped for Spark 4.1 (issue #4098) #4253 (andygrove)

  • fix: cargo clean before release build to avoid stale native libs #4257 (andygrove)

Performance related:

  • perf: avoid redundant columnar shuffle when both parent and child are non-Comet #4010 (andygrove)

  • perf: reduce per-node allocations in to_native_metric_node #4075 (andygrove)

Implemented enhancements:

  • feat: enable native Iceberg reader by default #3819 (andygrove)

  • feat: support collect_set #3954 (comphead)

  • feat: non-AQE DPP for native Parquet scans, broadcast exchange reuse for DPP subqueries #4011 (mbutrovich)

  • feat: add support for array_position expression #3172 (andygrove)

  • feat: Cast string to timestamp_ntz #4034 (parthchandra)

  • feat: Add TimestampNTZType support for unix_timestamp #4039 (parthchandra)

  • feat: fix array_compact for Spark 4.0 and correct return type metadata #3796 (andygrove)

  • feat: task-level input metrics (bytesRead) for Iceberg native scan #4128 (mbutrovich)

  • feat: add MapSort expression support for Spark 4.0 #4076 (andygrove)

  • feat: Support Spark expression str_to_map #3654 (unknowntpo)

  • feat: add support for timestamp_seconds expression #3146 (andygrove)

  • feat: add config to gate converting Spark shuffle to Comet shuffle when child is non-Comet plan #4166 (andygrove)

  • feat: AQE DPP for native Parquet scans with broadcast reuse #4112 (mbutrovich)

  • feat: support regular BuildRight+LeftAnti hash join #4073 (viirya)

  • feat: add bug-triage Claude skill #4109 (andygrove)

  • feat: support PartialMerge aggregation mode #4003 (comphead)

  • feat: add encode time tracking for shuffle operations #4068 (0lai0)

  • feat: Add support for Spark ToDegrees and ToRadians math expressions #3786 (rafafrdz)

  • feat: Add support for Spark Acosh, Asinh, Atanh math expressions #3787 (rafafrdz)

  • feat: Add support for Spark Cbrt math expression #3788 (rafafrdz)

  • feat: Add support for Spark Pi math expression #3789 (rafafrdz)

  • feat: support Parquet field ID matching in native_datafusion scan #4216 (mbutrovich)

  • feat: support AQE DPP broadcast reuse for Iceberg native scans #4215 (mbutrovich)

  • feat: add support for url_encode, url_decode, and try_url_decode #4231 (parthchandra)

  • feat: support TimestampType join keys in SortMergeJoin #3986 (andygrove)

Documentation updates:

  • docs: Add changelog for 0.15.0 #4000 (andygrove)

  • docs: Update README and benchmark results for 0.15.0 release #3995 (andygrove)

  • docs: fix errors in benchmark pages #4001 (andygrove)

  • docs: split compatibility guide into multiple pages #4055 (andygrove)

  • docs: Generate expression compatibility docs from code #4057 (andygrove)

  • doc: update documentation for cast and datetime functions #4058 (parthchandra)

  • docs: add compatibility documentation to all expressions #4067 (andygrove)

  • docs: rename SQL File Tests to Comet SQL Tests #4108 (andygrove)

  • docs: add Understanding Comet Plans user guide page #4086 (andygrove)

  • docs: support conditional content for snapshot vs release builds #4030 (andygrove)

  • docs: update Spark version support and add version compatibility page #4138 (andygrove)

  • docs: improve review skill and contributor guide for serde patterns #4132 (andygrove)

  • docs: Fix errors in list of supported Spark versions #4141 (andygrove)

  • docs: Update roadmap in contributor guide #4144 (andygrove)

  • docs: add implement-comet-expression Claude skill #4158 (andygrove)

  • docs: add roadmap items for spillable hash join, UDF support, memory management, and 1.0.0 #4171 (andygrove)

  • docs: start Spark 4.1 known-limitations section, seeded with #4199 #4202 (andygrove)

  • docs: document Spark 4 IntelliJ setup #4198 (yuboxx)

  • docs: refresh Gluten comparison with ANSI, Spark 4, and Iceberg coverage #4169 (andygrove)

  • docs: check off 53 implemented expressions in support doc #4147 (andygrove)

  • docs: replace project logos with updated branding #4230 (pranamya123)

  • docs: Documentation updates in preparation for 0.16 release #4244 (andygrove)

Other:

  • chor: enable array_distinct #3987 (kazuyukitanimura)

  • chore: Improve shuffle fallback logic #3989 (andygrove)

  • chore: Start 0.16 development #4016 (andygrove)

  • chore: update compatibility guide for primitive to string casts #4012 (parthchandra)

  • test: add sql-file test confirming fallback on parquet variant reads #4021 (andygrove)

  • chore: skip Iceberg and Spark SQL test workflows on test-only changes #4023 (andygrove)

  • ci: force GNU ld on Linux to fix -ljvm linker error #4024 (andygrove)

  • chore: update pending PR filter #4025 (comphead)

  • test: run more Spark 4 tests #4013 (kazuyukitanimura)

  • chore: update documentation links for 0.15.0 release #4029 (andygrove)

  • test: run more JVM tests #4026 (kazuyukitanimura)

  • feature: Support Spark expression: arrays_zip #3643 (hsiang-c)

  • test: fix SparkToColumnar plan-shape assertions on Spark 4 #4032 (andygrove)

  • test: unignore DynamicPartitionPruning static scan metrics test #4038 (andygrove)

  • test: add non-AQE DPP edge case coverage for native Parquet scans #4037 (mbutrovich)

  • test: unignore passing Spark 4.0 #3321 tests, retag remaining failures #4041 (andygrove)

  • test: unignore passing DPP test, retag remaining failures #4046 (andygrove)

  • chore: enable array_union #4043 (kazuyukitanimura)

  • test: add date_trunc DST regression test in non-UTC session timezone #4040 (andygrove)

  • ci: skip heavy test workflows for GenerateDocs and changelog changes #4056 (andygrove)

  • chore(deps): bump the all-other-cargo-deps group in /native with 3 updates #4061 (dependabot[bot])

  • ci: add concurrency blocks to more workflows to cancel on new commit #4064 (mbutrovich)

  • chore: gitignore generated docs directories #4066 (andygrove)

  • chore: promote array_insert to compatible and consolidate expression docs #4065 (andygrove)

  • test: re-enable sql_hive-1 for Spark 4.0 and fix two small failures #4047 (andygrove)

  • ci: stopping running PR builds with 3.5.5/3.5.6 #4103 (andygrove)

  • chore: Bump Spark 4.0.1 to 4.0.2 #4114 (andygrove)

  • build: add spark-4.1 Maven profile and shim sources #4097 (andygrove)

  • refactor: consolidate identical spark-4.0 and spark-4.1 shims into spark-4.x #4118 (andygrove)

  • ci: enable Spark 4.1 PR test matrix #4106 (andygrove)

  • build: add spark-4.2 Maven profile targeting 4.2.0-preview4 #4119 (andygrove)

  • ci: reduce macOS PR matrix to single Spark 4.0 profile #4104 (andygrove)

  • chore: audit array_intersect and expand SQL test coverage #4071 (andygrove)

  • chore: support Spark 4.1/4.2 for golden files generation #4134 (comphead)

  • ci: add Spark 4.0 / JDK 21 profile #4060 (james-willis)

  • ci: Enable Comet PR test matrix and TPCDS plan-stability for Spark 4.2 #4126 (andygrove)

  • test: support fallback chain in CometPlanStabilitySuite, dedupe existing goldens #4129 (andygrove)

  • chore: Update spark_expressions_support.md doc #4165 (comphead)

  • build: Enable Spark SQL tests for Spark 4.1.1 #4093 (andygrove)

  • test: unignore InjectRuntimeFilterSuite tests gated on issue #242 #4178 (andygrove)

  • chore(deps): bump cc from 1.2.60 to 1.2.61 in /native in the all-other-cargo-deps group #4168 (dependabot[bot])

  • test: [Spark 4.1.1] unignore SPARK-52921 union partitioning tests #4195 (andygrove)

  • test: [Spark 4.1.1] unignore CachedBatchSerializerNoUnwrapSuite #4204 (andygrove)

  • test: relax bytesRead ratio assertion for Spark 4.1+ #4197 (andygrove)

  • deps: Bump OpenDAL to 0.56.0 #4217 (mbutrovich)

  • ci: pin JDK per Spark version in Iceberg workflow matrix #4120 (dmefs)

  • test: skip flaky StateStoreSuite under Comet and disambiguate JDK matrix names #4226 (andygrove)

  • Map ProfileCredentialsProvider to profile credential chain #4213 (karuppayya)

  • test: add explicit negative-count cases for array_repeat #4182 (slavlotski)

  • test: add unhex dictionary coverage #4222 (yuboxx)

  • chore: Enable GitHub project view #4247 (andygrove)

  • test: dedupe redundant Spark 4.2 TPC-DS plan-stability golden files #4243 (andygrove)

  • build: change default Maven profile to Spark 4.1 / Scala 2.13 #4140 (andygrove)

  • chore: remove legacy ENABLE_COMET_SCAN_ONLY and ENABLE_COMET_ANSI_MODE env vars from Spark diffs #4252 (parthchandra)

Credits#

Thank you to everyone who contributed to this release. Here is a breakdown of commits (PRs merged) per contributor.

    78	Andy Grove
     9	Matt Butrovich
     7	Parth Chandra
     6	KAZUYUKI TANIMURA
     5	ChenChen Lai
     5	Oleks V
     4	Rafael Fernández
     2	Karuppayya
     2	Yubo Xu
     2	dependabot[bot]
     1	Eric Chang
     1	James Willis
     1	Kuo-Hao Huang
     1	Liang-Chi Hsieh
     1	Pranamya Vadlamani
     1	Vladislav Zabolotsky
     1	hsiang-c

Thank you also to everyone who contributed in other ways such as filing issues, reviewing PRs, and providing feedback on this release.