DataFusion Comet 0.8.0 Changelog#

This release consists of 81 commits from 11 contributors. See credits at the end of this changelog for more information.

Fixed bugs:

fix: remove code duplication in native_datafusion and native_iceberg_compat implementations #1443 (parthchandra)
fix: Refactor CometScanRule and fix bugs #1483 (andygrove)
fix: check if handle has been initialized before closing #1554 (wForget)
fix: Taking slicing into account when writing BooleanBuffers as fast-encoding format #1522 (Kontinuation)
fix: isCometEnabled name conflict #1569 (kazuyukitanimura)
fix: make register_object_store use same session_env as file scan #1555 (wForget)
fix: adjust CometNativeScan’s doCanonicalize and hashCode for AQE, use DataSourceScanExec trait #1578 (mbutrovich)
fix: corrected the logic of eliminating CometSparkToColumnarExec #1597 (wForget)
fix: avoid panic caused by close null handle of parquet reader #1604 (wForget)
fix: Make AQE capable of converting Comet shuffled joins to Comet broadcast hash joins #1605 (Kontinuation)
fix: Making shuffle files generated in native shuffle mode reclaimable #1568 (Kontinuation)
fix: Support per-task shuffle write rows and shuffle write time metrics #1617 (Kontinuation)
fix: Modify Spark SQL core 2 tests for native_datafusion reader, change 3.5.5 diff hash length to 11 #1641 (mbutrovich)
fix: fix spark/sql test failures in native_iceberg_compat #1593 (parthchandra)
fix: handle missing field correctly in native_iceberg_compat #1656 (parthchandra)
fix: better int96 support for experimental native scans #1652 (mbutrovich)
fix: respect ignoreNulls flag in first_value and last_value #1626 (andygrove)
fix: update row groups count in internal metrics accumulator #1658 (parthchandra)
fix: Shuffle should maintain insertion order #1660 (EmilyMatt)

Performance related:

perf: Use a global tokio runtime #1614 (andygrove)
perf: Respect Spark’s PARQUET_FILTER_PUSHDOWN_ENABLED config #1619 (andygrove)
perf: Experimental fix to avoid join strategy regression #1674 (andygrove)

Implemented enhancements:

feat: add read array support #1456 (comphead)
feat: introduce hadoop mini cluster to test native scan on hdfs #1556 (wForget)
feat: make parquet native scan schema case insensitive #1575 (wForget)
feat: enable iceberg compat tests, more tests for complex types #1550 (comphead)
feat: pushdown filter for native_iceberg_compat #1566 (wForget)
feat: Fix struct of arrays schema issue #1592 (comphead)
feat: adding more struct/arrays tests #1594 (comphead)
feat: respect batchSize/workerThreads/blockingThreads configurations for native_iceberg_compat scan #1587 (wForget)
feat: add MAP type support for first level #1603 (comphead)
feat: Add more tests for nested types combinations for native_datafusion #1632 (comphead)
feat: Override MapBuilder values field with expected schema #1643 (comphead)
feat: track unified memory pool #1651 (wForget)
feat: Add support for complex types in native shuffle #1655 (andygrove)

Documentation updates:

docs: Update configuration guide to show optional configs #1524 (andygrove)
docs: Add changelog for 0.7.0 release #1527 (andygrove)
docs: Use a shallow clone for Spark SQL test instructions #1547 (mbutrovich)
docs: Update benchmark results for 0.7.0 release #1548 (andygrove)
doc: Renew kubernetes.md #1549 (comphead)
docs: various improvements to tuning guide #1525 (andygrove)
docs: Update supported Spark versions #1580 (andygrove)
docs: change OSX/OS X to macOS #1584 (mbutrovich)
docs: docs for benchmarking in aws ec2 #1601 (andygrove)
docs: Update compatibility docs for new native scans #1657 (andygrove)
doc: Document local HDFS setup #1673 (comphead)

Other:

chore: fix issue in release process #1528 (andygrove)
chore: Remove all subdependencies #1514 (EmilyMatt)
chore: Drop support for Spark 3.3 (EOL) #1529 (andygrove)
chore: Prepare for 0.8.0 development #1530 (andygrove)
chore: Re-enable GitHub discussions #1535 (andygrove)
chore: [FOLLOWUP] Drop support for Spark 3.3 (EOL) #1534 (kazuyukitanimura)
build: Use unique name for surefire artifacts #1544 (andygrove)
chore: Update links for released version #1540 (andygrove)
chore: Enable Comet explicitly in CometTPCDSQueryTestSuite #1559 (andygrove)
chore: Fix some inconsistencies in memory pool configuration #1561 (andygrove)
upgraded spark 3.5.4 to 3.5.5 #1565 (YanivKunda)
minor: fix typo #1570 (wForget)
Chore: simplify array related functions impl #1490 (kazantsev-maksim)
added fallback using reflection for backward-compatibility #1573 (YanivKunda)
chore: Override node name for CometSparkToColumnar #1577 (l0kr)
chore: Reimplement ShuffleWriterExec using interleave_record_batch #1511 (Kontinuation)
chore: Run Comet tests for more Spark versions #1582 (andygrove)
Feat: support array_except function #1343 (kazantsev-maksim)
minor: Fix clippy warnings #1606 (Kontinuation)
chore: Remove some unwraps in hashing code #1600 (andygrove)
chore: Remove redundant shims for getFailOnError #1608 (andygrove)
chore: Making comet native operators write spill files to spark local dir #1581 (Kontinuation)
chore: Refactor QueryPlanSerde to use idiomatic Scala and reduce verbosity #1609 (andygrove)
chore: Create simple fuzz test as part of test suite #1610 (andygrove)
chore: Document testSingleLineQuery test method #1628 (comphead)
chore: Parquet fuzz testing #1623 (andygrove)
chore: Change default Spark version to 3.5 #1620 (andygrove)
chore: Add manually-triggered CI jobs for testing Spark SQL with native scans #1624 (andygrove)
chore: refactor v2 scan conversion #1621 (andygrove)
chore: clean up planner.rs #1650 (comphead)
chore: correct name of pipelines for native_datafusion ci workflow #1653 (parthchandra)
chore: Upgrade to datafusion 47.0.0-rc1 and arrow-rs 55.0.0 #1563 (andygrove)
chore: Upgrade to datafusion 47.0.0 #1663 (YanivKunda)
chore: Enable CometFuzzTestSuite int96 test for experimental native scans (without complex types) #1664 (mbutrovich)
chore: Refactor Memory Pools #1662 (EmilyMatt)

Credits#

Thank you to everyone who contributed to this release. Here is a breakdown of commits (PRs merged) per contributor.

Andy Grove
Oleks V
Zhen Wang
Kristin Cowalcijk
Matt Butrovich
Parth Chandra
Emily Matheys
Yaniv Kunda
KAZUYUKI TANIMURA
Kazantsev Maksim
Łukasz

Thank you also to everyone who contributed in other ways such as filing issues, reviewing PRs, and providing feedback on this release.