DataFusion Comet 0.8.0 Changelog#

This release consists of 81 commits from 11 contributors. See credits at the end of this changelog for more information.

Fixed bugs:

  • fix: remove code duplication in native_datafusion and native_iceberg_compat implementations #1443 (parthchandra)

  • fix: Refactor CometScanRule and fix bugs #1483 (andygrove)

  • fix: check if handle has been initialized before closing #1554 (wForget)

  • fix: Taking slicing into account when writing BooleanBuffers as fast-encoding format #1522 (Kontinuation)

  • fix: isCometEnabled name conflict #1569 (kazuyukitanimura)

  • fix: make register_object_store use same session_env as file scan #1555 (wForget)

  • fix: adjust CometNativeScan’s doCanonicalize and hashCode for AQE, use DataSourceScanExec trait #1578 (mbutrovich)

  • fix: corrected the logic of eliminating CometSparkToColumnarExec #1597 (wForget)

  • fix: avoid panic caused by close null handle of parquet reader #1604 (wForget)

  • fix: Make AQE capable of converting Comet shuffled joins to Comet broadcast hash joins #1605 (Kontinuation)

  • fix: Making shuffle files generated in native shuffle mode reclaimable #1568 (Kontinuation)

  • fix: Support per-task shuffle write rows and shuffle write time metrics #1617 (Kontinuation)

  • fix: Modify Spark SQL core 2 tests for native_datafusion reader, change 3.5.5 diff hash length to 11 #1641 (mbutrovich)

  • fix: fix spark/sql test failures in native_iceberg_compat #1593 (parthchandra)

  • fix: handle missing field correctly in native_iceberg_compat #1656 (parthchandra)

  • fix: better int96 support for experimental native scans #1652 (mbutrovich)

  • fix: respect ignoreNulls flag in first_value and last_value #1626 (andygrove)

  • fix: update row groups count in internal metrics accumulator #1658 (parthchandra)

  • fix: Shuffle should maintain insertion order #1660 (EmilyMatt)

Performance related:

  • perf: Use a global tokio runtime #1614 (andygrove)

  • perf: Respect Spark’s PARQUET_FILTER_PUSHDOWN_ENABLED config #1619 (andygrove)

  • perf: Experimental fix to avoid join strategy regression #1674 (andygrove)

Implemented enhancements:

  • feat: add read array support #1456 (comphead)

  • feat: introduce hadoop mini cluster to test native scan on hdfs #1556 (wForget)

  • feat: make parquet native scan schema case insensitive #1575 (wForget)

  • feat: enable iceberg compat tests, more tests for complex types #1550 (comphead)

  • feat: pushdown filter for native_iceberg_compat #1566 (wForget)

  • feat: Fix struct of arrays schema issue #1592 (comphead)

  • feat: adding more struct/arrays tests #1594 (comphead)

  • feat: respect batchSize/workerThreads/blockingThreads configurations for native_iceberg_compat scan #1587 (wForget)

  • feat: add MAP type support for first level #1603 (comphead)

  • feat: Add more tests for nested types combinations for native_datafusion #1632 (comphead)

  • feat: Override MapBuilder values field with expected schema #1643 (comphead)

  • feat: track unified memory pool #1651 (wForget)

  • feat: Add support for complex types in native shuffle #1655 (andygrove)

Documentation updates:

  • docs: Update configuration guide to show optional configs #1524 (andygrove)

  • docs: Add changelog for 0.7.0 release #1527 (andygrove)

  • docs: Use a shallow clone for Spark SQL test instructions #1547 (mbutrovich)

  • docs: Update benchmark results for 0.7.0 release #1548 (andygrove)

  • doc: Renew kubernetes.md #1549 (comphead)

  • docs: various improvements to tuning guide #1525 (andygrove)

  • docs: Update supported Spark versions #1580 (andygrove)

  • docs: change OSX/OS X to macOS #1584 (mbutrovich)

  • docs: docs for benchmarking in aws ec2 #1601 (andygrove)

  • docs: Update compatibility docs for new native scans #1657 (andygrove)

  • doc: Document local HDFS setup #1673 (comphead)

Other:

  • chore: fix issue in release process #1528 (andygrove)

  • chore: Remove all subdependencies #1514 (EmilyMatt)

  • chore: Drop support for Spark 3.3 (EOL) #1529 (andygrove)

  • chore: Prepare for 0.8.0 development #1530 (andygrove)

  • chore: Re-enable GitHub discussions #1535 (andygrove)

  • chore: [FOLLOWUP] Drop support for Spark 3.3 (EOL) #1534 (kazuyukitanimura)

  • build: Use unique name for surefire artifacts #1544 (andygrove)

  • chore: Update links for released version #1540 (andygrove)

  • chore: Enable Comet explicitly in CometTPCDSQueryTestSuite #1559 (andygrove)

  • chore: Fix some inconsistencies in memory pool configuration #1561 (andygrove)

  • upgraded spark 3.5.4 to 3.5.5 #1565 (YanivKunda)

  • minor: fix typo #1570 (wForget)

  • Chore: simplify array related functions impl #1490 (kazantsev-maksim)

  • added fallback using reflection for backward-compatibility #1573 (YanivKunda)

  • chore: Override node name for CometSparkToColumnar #1577 (l0kr)

  • chore: Reimplement ShuffleWriterExec using interleave_record_batch #1511 (Kontinuation)

  • chore: Run Comet tests for more Spark versions #1582 (andygrove)

  • Feat: support array_except function #1343 (kazantsev-maksim)

  • minor: Fix clippy warnings #1606 (Kontinuation)

  • chore: Remove some unwraps in hashing code #1600 (andygrove)

  • chore: Remove redundant shims for getFailOnError #1608 (andygrove)

  • chore: Making comet native operators write spill files to spark local dir #1581 (Kontinuation)

  • chore: Refactor QueryPlanSerde to use idiomatic Scala and reduce verbosity #1609 (andygrove)

  • chore: Create simple fuzz test as part of test suite #1610 (andygrove)

  • chore: Document testSingleLineQuery test method #1628 (comphead)

  • chore: Parquet fuzz testing #1623 (andygrove)

  • chore: Change default Spark version to 3.5 #1620 (andygrove)

  • chore: Add manually-triggered CI jobs for testing Spark SQL with native scans #1624 (andygrove)

  • chore: refactor v2 scan conversion #1621 (andygrove)

  • chore: clean up planner.rs #1650 (comphead)

  • chore: correct name of pipelines for native_datafusion ci workflow #1653 (parthchandra)

  • chore: Upgrade to datafusion 47.0.0-rc1 and arrow-rs 55.0.0 #1563 (andygrove)

  • chore: Upgrade to datafusion 47.0.0 #1663 (YanivKunda)

  • chore: Enable CometFuzzTestSuite int96 test for experimental native scans (without complex types) #1664 (mbutrovich)

  • chore: Refactor Memory Pools #1662 (EmilyMatt)

Credits#

Thank you to everyone who contributed to this release. Here is a breakdown of commits (PRs merged) per contributor.

    31	Andy Grove
    11	Oleks V
    10	Zhen Wang
     7	Kristin Cowalcijk
     6	Matt Butrovich
     5	Parth Chandra
     3	Emily Matheys
     3	Yaniv Kunda
     2	KAZUYUKI TANIMURA
     2	Kazantsev Maksim
     1	Łukasz

Thank you also to everyone who contributed in other ways such as filing issues, reviewing PRs, and providing feedback on this release.