DataFusion Comet 0.16.0 Changelog#

This release consists of 127 commits from 17 contributors. See credits at the end of this changelog for more information.

Fixed bugs:

fix: report task output metrics in Spark UI #3999 (0lai0)
fix: cast to and from timestamp_ntz #4008 (parthchandra)
fix: support to_json on Spark 4.0 #4036 (andygrove)
fix: enable arrays_overlap #3901 (kazuyukitanimura)
fix: Iceberg reflection for current() on TableOperations hierarchy #3895 (karuppayya)
fix: fall back to Spark for shuffle/sort/aggregate on non-default collated strings [Spark 4] #4035 (andygrove)
fix: scalar subquery pushdown and reuse for CometNativeScanExec (SPARK-43402) #4053 (mbutrovich)
fix: fall back for shredded Variant scans on Spark 4.0 #4084 (andygrove)
fix: enable Spark 4 SQL tests previously ignored for issues #3313 and #3314 #4092 (andygrove)
fix: fall back to Spark for hash join and sort-merge join on non-default collated string keys [Spark 4] #4095 (0lai0)
fix: reject string/binary read as numeric in native_datafusion scan #4091 (andygrove)
fix: reject incompatible decimal precision/scale in native_datafusion scan #4090 (andygrove)
fix: throw SchemaColumnConvertNotSupportedException from native_datafusion schema mismatch #4117 (andygrove)
fix: substring with negative start index #4017 (kazuyukitanimura)
fix: honor strictFloatingPoint in RangePartitioning #4167 (0lai0)
fix: [Spark 4.1.1] preserve stored allowDecimalPrecisionLoss in DecimalPrecision rule #4179 (andygrove)
fix: [Spark 4.1.1] preserve parent struct nullness when all requested fields missing in Parquet #4190 (andygrove)
fix: support Spark 4.1 BloomFilter V2 format and bit-scattering #4196 (andygrove)
fix: JNI local reference cleanup in JVMClasses::with_env #4225 (0lai0)
fix: broadcast exchange bypasses AQE partition coalescing #4163 (andygrove)
fix: resolve Scala compiler warnings for auto-tupling and bare try #4227 (andygrove)
fix: [Spark 4.1] preserve union output partitioning in CometUnionExec #4207 (andygrove)
fix: re-enable tests skipped for Spark 4.1 (issue #4098) #4253 (andygrove)
fix: cargo clean before release build to avoid stale native libs #4257 (andygrove)

Performance related:

perf: avoid redundant columnar shuffle when both parent and child are non-Comet #4010 (andygrove)
perf: reduce per-node allocations in to_native_metric_node #4075 (andygrove)

Implemented enhancements:

feat: enable native Iceberg reader by default #3819 (andygrove)
feat: support collect_set #3954 (comphead)
feat: non-AQE DPP for native Parquet scans, broadcast exchange reuse for DPP subqueries #4011 (mbutrovich)
feat: add support for array_position expression #3172 (andygrove)
feat: Cast string to timestamp_ntz #4034 (parthchandra)
feat: Add TimestampNTZType support for unix_timestamp #4039 (parthchandra)
feat: fix array_compact for Spark 4.0 and correct return type metadata #3796 (andygrove)
feat: task-level input metrics (bytesRead) for Iceberg native scan #4128 (mbutrovich)
feat: add MapSort expression support for Spark 4.0 #4076 (andygrove)
feat: Support Spark expression str_to_map #3654 (unknowntpo)
feat: add support for timestamp_seconds expression #3146 (andygrove)
feat: add config to gate converting Spark shuffle to Comet shuffle when child is non-Comet plan #4166 (andygrove)
feat: AQE DPP for native Parquet scans with broadcast reuse #4112 (mbutrovich)
feat: support regular BuildRight+LeftAnti hash join #4073 (viirya)
feat: add bug-triage Claude skill #4109 (andygrove)
feat: support PartialMerge aggregation mode #4003 (comphead)
feat: add encode time tracking for shuffle operations #4068 (0lai0)
feat: Add support for Spark ToDegrees and ToRadians math expressions #3786 (rafafrdz)
feat: Add support for Spark Acosh, Asinh, Atanh math expressions #3787 (rafafrdz)
feat: Add support for Spark Cbrt math expression #3788 (rafafrdz)
feat: Add support for Spark Pi math expression #3789 (rafafrdz)
feat: support Parquet field ID matching in native_datafusion scan #4216 (mbutrovich)
feat: support AQE DPP broadcast reuse for Iceberg native scans #4215 (mbutrovich)
feat: add support for url_encode, url_decode, and try_url_decode #4231 (parthchandra)
feat: support TimestampType join keys in SortMergeJoin #3986 (andygrove)

Documentation updates:

docs: Add changelog for 0.15.0 #4000 (andygrove)
docs: Update README and benchmark results for 0.15.0 release #3995 (andygrove)
docs: fix errors in benchmark pages #4001 (andygrove)
docs: split compatibility guide into multiple pages #4055 (andygrove)
docs: Generate expression compatibility docs from code #4057 (andygrove)
doc: update documentation for cast and datetime functions #4058 (parthchandra)
docs: add compatibility documentation to all expressions #4067 (andygrove)
docs: rename SQL File Tests to Comet SQL Tests #4108 (andygrove)
docs: add Understanding Comet Plans user guide page #4086 (andygrove)
docs: support conditional content for snapshot vs release builds #4030 (andygrove)
docs: update Spark version support and add version compatibility page #4138 (andygrove)
docs: improve review skill and contributor guide for serde patterns #4132 (andygrove)
docs: Fix errors in list of supported Spark versions #4141 (andygrove)
docs: Update roadmap in contributor guide #4144 (andygrove)
docs: add implement-comet-expression Claude skill #4158 (andygrove)
docs: add roadmap items for spillable hash join, UDF support, memory management, and 1.0.0 #4171 (andygrove)
docs: start Spark 4.1 known-limitations section, seeded with #4199 #4202 (andygrove)
docs: document Spark 4 IntelliJ setup #4198 (yuboxx)
docs: refresh Gluten comparison with ANSI, Spark 4, and Iceberg coverage #4169 (andygrove)
docs: check off 53 implemented expressions in support doc #4147 (andygrove)
docs: replace project logos with updated branding #4230 (pranamya123)
docs: Documentation updates in preparation for 0.16 release #4244 (andygrove)

Other:

chor: enable array_distinct #3987 (kazuyukitanimura)
chore: Improve shuffle fallback logic #3989 (andygrove)
chore: Start 0.16 development #4016 (andygrove)
chore: update compatibility guide for primitive to string casts #4012 (parthchandra)
test: add sql-file test confirming fallback on parquet variant reads #4021 (andygrove)
chore: skip Iceberg and Spark SQL test workflows on test-only changes #4023 (andygrove)
ci: force GNU ld on Linux to fix -ljvm linker error #4024 (andygrove)
chore: update pending PR filter #4025 (comphead)
test: run more Spark 4 tests #4013 (kazuyukitanimura)
chore: update documentation links for 0.15.0 release #4029 (andygrove)
test: run more JVM tests #4026 (kazuyukitanimura)
feature: Support Spark expression: arrays_zip #3643 (hsiang-c)
test: fix SparkToColumnar plan-shape assertions on Spark 4 #4032 (andygrove)
test: unignore DynamicPartitionPruning static scan metrics test #4038 (andygrove)
test: add non-AQE DPP edge case coverage for native Parquet scans #4037 (mbutrovich)
test: unignore passing Spark 4.0 #3321 tests, retag remaining failures #4041 (andygrove)
test: unignore passing DPP test, retag remaining failures #4046 (andygrove)
chore: enable array_union #4043 (kazuyukitanimura)
test: add date_trunc DST regression test in non-UTC session timezone #4040 (andygrove)
ci: skip heavy test workflows for GenerateDocs and changelog changes #4056 (andygrove)
chore(deps): bump the all-other-cargo-deps group in /native with 3 updates #4061 (dependabot[bot])
ci: add concurrency blocks to more workflows to cancel on new commit #4064 (mbutrovich)
chore: gitignore generated docs directories #4066 (andygrove)
chore: promote array_insert to compatible and consolidate expression docs #4065 (andygrove)
test: re-enable sql_hive-1 for Spark 4.0 and fix two small failures #4047 (andygrove)
ci: stopping running PR builds with 3.5.5/3.5.6 #4103 (andygrove)
chore: Bump Spark 4.0.1 to 4.0.2 #4114 (andygrove)
build: add spark-4.1 Maven profile and shim sources #4097 (andygrove)
refactor: consolidate identical spark-4.0 and spark-4.1 shims into spark-4.x #4118 (andygrove)
ci: enable Spark 4.1 PR test matrix #4106 (andygrove)
build: add spark-4.2 Maven profile targeting 4.2.0-preview4 #4119 (andygrove)
ci: reduce macOS PR matrix to single Spark 4.0 profile #4104 (andygrove)
chore: audit array_intersect and expand SQL test coverage #4071 (andygrove)
chore: support Spark 4.1/4.2 for golden files generation #4134 (comphead)
ci: add Spark 4.0 / JDK 21 profile #4060 (james-willis)
ci: Enable Comet PR test matrix and TPCDS plan-stability for Spark 4.2 #4126 (andygrove)
test: support fallback chain in CometPlanStabilitySuite, dedupe existing goldens #4129 (andygrove)
chore: Update spark_expressions_support.md doc #4165 (comphead)
build: Enable Spark SQL tests for Spark 4.1.1 #4093 (andygrove)
test: unignore InjectRuntimeFilterSuite tests gated on issue #242 #4178 (andygrove)
chore(deps): bump cc from 1.2.60 to 1.2.61 in /native in the all-other-cargo-deps group #4168 (dependabot[bot])
test: [Spark 4.1.1] unignore SPARK-52921 union partitioning tests #4195 (andygrove)
test: [Spark 4.1.1] unignore CachedBatchSerializerNoUnwrapSuite #4204 (andygrove)
test: relax bytesRead ratio assertion for Spark 4.1+ #4197 (andygrove)
deps: Bump OpenDAL to 0.56.0 #4217 (mbutrovich)
ci: pin JDK per Spark version in Iceberg workflow matrix #4120 (dmefs)
test: skip flaky StateStoreSuite under Comet and disambiguate JDK matrix names #4226 (andygrove)
Map ProfileCredentialsProvider to profile credential chain #4213 (karuppayya)
test: add explicit negative-count cases for array_repeat #4182 (slavlotski)
test: add unhex dictionary coverage #4222 (yuboxx)
chore: Enable GitHub project view #4247 (andygrove)
test: dedupe redundant Spark 4.2 TPC-DS plan-stability golden files #4243 (andygrove)
build: change default Maven profile to Spark 4.1 / Scala 2.13 #4140 (andygrove)
chore: remove legacy ENABLE_COMET_SCAN_ONLY and ENABLE_COMET_ANSI_MODE env vars from Spark diffs #4252 (parthchandra)

Credits#

Thank you to everyone who contributed to this release. Here is a breakdown of commits (PRs merged) per contributor.

Andy Grove
Matt Butrovich
Parth Chandra
KAZUYUKI TANIMURA
ChenChen Lai
Oleks V
Rafael Fernández
Karuppayya
Yubo Xu
dependabot[bot]
Eric Chang
James Willis
Kuo-Hao Huang
Liang-Chi Hsieh
Pranamya Vadlamani
Vladislav Zabolotsky
hsiang-c

Thank you also to everyone who contributed in other ways such as filing issues, reviewing PRs, and providing feedback on this release.