DataFusion Comet 0.9.0 Changelog

This release consists of 139 commits from 24 contributors. See credits at the end of this changelog for more information.

Fixed bugs:

  • fix: typo for instr in fuzz testing #1686 (mbutrovich)

  • fix: Bucketed scan fallback for native_datafusion Parquet scan #1720 (mbutrovich)

  • fix: Skip row index Spark SQL tests for native_datafusion Parquet scan #1724 (mbutrovich)

  • fix: Check acquired memory when CometMemoryPool grows #1732 (wForget)

  • fix: Fix data race in memory profiling #1727 (andygrove)

  • fix: Enable some DPP Spark SQL tests #1734 (andygrove)

  • fix: support literal null list and map #1742 (kazuyukitanimura)

  • fix: get_struct field is incorrect when struct in array #1687 (comphead)

  • fix: cast map types correctly in schema adapter #1771 (parthchandra)

  • fix: correct schema type checking in native_iceberg_compat #1755 (parthchandra)

  • fix: default values for native_datafusion scan #1756 (mbutrovich)

  • fix: [native_scans] Support CASE_SENSITIVE when reading Parquet #1782 (andygrove)

  • fix: cargo install tpchgen-cli in benchmark doc #1797 (zhuqi-lucas)

  • fix: support map_keys #1788 (comphead)

  • fix: fall back on nested types for default values #1799 (mbutrovich)

  • fix: Re-enable Spark 4 tests on Linux #1806 (andygrove)

  • fix: fallback to Spark scan if encryption is enabled (native_datafusion/native_iceberg_compat) #1785 (parthchandra)

  • fix: native_iceberg_compat: move checking parquet types above fetching batch #1809 (mbutrovich)

  • fix: translate missing or corrupt file exceptions, fall back if asked to ignore #1765 (mbutrovich)

  • fix: Fix Spark SQL AQE exchange reuse test failures #1811 (coderfender)

  • fix: Enable more Spark SQL tests #1834 (andygrove)

  • fix: support map_values #1835 (comphead)

  • fix: Handle case where num_cols == 0 in native execution #1840 (andygrove)

  • fix: Fix shuffle writing rows containing null struct fields #1845 (Kontinuation)

  • fix: Fall back to Spark for RANGE BETWEEN window expressions #1848 (andygrove)

  • fix: Remove COMET_SHUFFLE_FALLBACK_TO_COLUMNAR hack #1865 (andygrove)

  • fix: support read Struct by user schema #1860 (comphead)

  • fix: map parquet field_id correctly (native_iceberg_compat) #1815 (parthchandra)

  • fix: cast_struct_to_struct aligns to Spark behavior #1879 (mbutrovich)

  • fix: correctly handle schemas with nested array of struct (native_iceberg_compat) #1883 (parthchandra)

  • fix: set RangePartitioning for native shuffle default to false #1907 (mbutrovich)

  • fix: conflict between #1905 and #1892. #1919 (mbutrovich)

  • fix: Add overflow check to evaluate of sum decimal accumulator #1922 (leung-ming)

  • fix: Fix overflow handling when casting float to decimal #1914 (leung-ming)

  • fix: Ignore a test case fails on Miri #1951 (leung-ming)

Performance related:

  • perf: Add memory profiling #1702 (andygrove)

  • perf: Add performance tracing capability #1706 (andygrove)

  • perf: Add COMET_RESPECT_PARQUET_FILTER_PUSHDOWN config #1936 (andygrove)

Implemented enhancements:

  • feat: add jemalloc as optional custom allocator #1679 (mbutrovich)

  • feat: support array_repeat #1680 (comphead)

  • feat: More warning info for users #1667 (hsiang-c)

  • feat: decode() expression when using ‘utf-8’ encoding #1697 (mbutrovich)

  • feat: regexp_replace() expression with no starting offset #1700 (mbutrovich)

  • feat: Improve performance tracing feature #1730 (andygrove)

  • feat: Set/cancel with job tag and make max broadcast table size configurable #1693 (wForget)

  • feat: Add support for expm1 expression from datafusion-spark crate #1711 (andygrove)

  • feat: Add config option for showing all Comet plan transformations #1780 (andygrove)

  • feat: Support Type widening: byte → short/int/long, short → int/long #1770 (huaxingao)

  • feat: Translate Hadoop S3A configurations to object_store configurations #1817 (Kontinuation)

  • feat: Upgrade to official DataFusion 48.0.0 release #1877 (andygrove)

  • feat: Add experimental auto mode for COMET_PARQUET_SCAN_IMPL #1747 (andygrove)

  • feat: support RangePartitioning with native shuffle #1862 (mbutrovich)

  • feat: Add support for signum expression #1889 (andygrove)

  • feat: Add support to lookup map by key #1898 (comphead)

  • feat: support array_max #1892 (drexler-sky)

  • feat: pass ignore_nulls flag to first and last #1866 (rluvaton)

  • feat: Implement ToPrettyString #1921 (andygrove)

  • feat: Support hadoop s3a config in native_iceberg_compat #1925 (parthchandra)

  • feat: rand expression support #1199 (akupchinskiy)

  • feat: supports array_distinct #1923 (drexler-sky)

  • feat: auto scan mode should check for supported file location #1930 (andygrove)

  • feat: Encapsulate Parquet objects #1920 (huaxingao)

  • feat: Change default value of COMET_NATIVE_SCAN_IMPL to auto #1933 (andygrove)

  • feat: Supports array_union #1945 (drexler-sky)

Documentation updates:

  • docs: Add changelog for 0.8.0 #1675 (andygrove)

  • docs: Add instructions on running TPC-H on macOS #1647 (andygrove)

  • docs: Add documentation for accelerating Iceberg Parquet scans with Comet #1683 (andygrove)

  • docs: Add note on setting core.abbrev when generating diffs #1735 (andygrove)

  • docs: Remove outdated param in macos bench guide #1748 (ding-young)

  • docs: Add instructions for running individual Spark SQL tests from sbt #1752 (coderfender)

  • docs: Add documentation for native_datafusion Parquet scanner’s S3 support #1832 (Kontinuation)

  • docs: Add docs stating that Comet does not support reading decimals encoded in Parquet BINARY format #1895 (andygrove)

Other:

  • chore: Start 0.9.0 development #1676 (andygrove)

  • chore: Update viable crates #1677 (EmilyMatt)

  • chore: match Maven plugin versions with Spark 3.5 #1668 (hsiang-c)

  • chore: Remove fallback reason “because the children were not native” #1672 (andygrove)

  • chore: Rename scalarExprToProto to scalarFunctionExprToProto #1688 (comphead)

  • chore: fix build errors #1690 (comphead)

  • chore: Make Aggregate transformation more compact #1670 (EmilyMatt)

  • chore: update dev/release/rat_exclude_files.txt #1689 (hsiang-c)

  • chore: Move Comet rules into their own files #1695 (andygrove)

  • chore: Remove fast encoding option #1703 (andygrove)

  • chore: fix CI job name #1712 (hsiang-c)

  • minor: Warn if memory pool is dropped with bytes still reserved #1721 (andygrove)

  • chore: Correct memory acquired size in unified memory pool #1738 (zuston)

  • chore: allow large errors for Clippy #1743 (comphead)

  • chore: Refactor DataTypeSupport #1741 (andygrove)

  • chore: More refactoring of type checking logic #1744 (andygrove)

  • chore: Enable more complex type tests #1753 (andygrove)

  • chore: Add scanImpl attribute to CometScanExec #1746 (andygrove)

  • chore: Prepare for DataFusion 48.0.0 #1710 (andygrove)

  • Docs: Setup Comet on IntelliJ #1760 (coderfender)

  • chore: Reenable nested types for CometFuzzTestSuite with int96 #1761 (mbutrovich)

  • chore: Enable partial Spark SQL tests for native_iceberg_compat scan #1762 (andygrove)

  • chore: [native_iceberg_compat / native_datafusion] Ignore Spark SQL Parquet encryption tests #1763 (andygrove)

  • build: Ignore array_repeat test to fix CI issues #1774 (andygrove)

  • chore: Upload crash logs if Java tests fail #1779 (andygrove)

  • chore: Drop support for Java 8 #1777 (andygrove)

  • chore: Bump arrow to 18.3.0 #1773 (Kontinuation)

  • build: Stop running Comet’s Spark 4 tests on Linux for PR builds #1802 (andygrove)

  • Chore: Moved strings expressions to separate file #1792 (kazantsev-maksim)

  • chore: Speed up “PR Builds” CI workflows #1807 (andygrove)

  • chore: [native scans] Ignore Spark SQL test for string predicate pushdown #1768 (andygrove)

  • chore: Bump DataFusion to git rev 2c2f225 #1814 (andygrove)

  • Feat: support bit_count function #1602 (kazantsev-maksim)

  • Chore: implement bit_not as ScalarUDFImpl #1825 (kazantsev-maksim)

  • build: Specify -Dsbt.log.noformat=true in sbt CI runs #1822 (andygrove)

  • chore: Use unique artifact names in Java test run #1818 (andygrove)

  • minor: Refactor PhysicalPlanner::default() to avoid duplicate code #1821 (andygrove)

  • Chore: implement bit_count as ScalarUDFImpl #1826 (kazantsev-maksim)

  • chore: IgnoreCometNativeScan on a few more Spark SQL tests #1837 (mbutrovich)

  • chore: Enable tests in RemoveRedundantProjectsSuite.scala related to issue #242 #1838 (rishvin)

  • minor: Replace many instances of checkSparkAnswer with checkSparkAnswerAndOperator #1851 (andygrove)

  • chore: Update documentation and ignore Spark SQL tests for known issue with count distinct on NaN in aggregate #1847 (andygrove)

  • chore: Ignore Spark SQL WholeStageCodegenSuite tests #1859 (andygrove)

  • chore: Upgrade to DataFusion 48.0.0-rc3 #1863 (andygrove)

  • upgraded spark 3.5.5 to 3.5.6 #1861 (YanivKunda)

  • build: Disable some rounding tests when miri is enabled #1873 (andygrove)

  • chore: Enable Spark SQL tests for native_iceberg_compat #1876 (andygrove)

  • chore: Enable more Spark SQL tests #1869 (andygrove)

  • chore: refactor planner read schema tests #1886 (comphead)

  • chore: Implement date_trunc as ScalarUDFImpl #1880 (leung-ming)

  • Chore: implement datetime funcs as ScalarUDFImpl #1874 (trompa)

  • minor: Improve testing of math scalar functions #1896 (andygrove)

  • minor: Avoid rewriting join to unsupported join #1888 (andygrove)

  • chore: Enable native_iceberg_compat Spark SQL tests (for real, this time) #1910 (andygrove)

  • chore: rename makeParquetFileAllTypes to makeParquetFileAllPrimitiveTypes #1905 (parthchandra)

  • chore: add a test case to read from an arbitrarily complex type schema #1911 (parthchandra)

  • test: Trigger Spark 3.4.3 SQL tests for iceberg-compat #1912 (kazuyukitanimura)

  • build: Fix conflict between #1910 and #1912 #1924 (andygrove)

  • minor: fix kube/Dockerfile build failed #1918 (zhangxffff)

  • chore: Improve reporting of fallback reasons for CollectLimit #1694 (andygrove)

  • chore: move udf registration to better place #1899 (rluvaton)

  • chore: Comet + Iceberg (1.8.1) CI #1715 (hsiang-c)

  • chore: Introduce exprHandlers map in QueryPlanSerde #1903 (andygrove)

  • chore: Enable Spark SQL tests for auto scan mode #1885 (andygrove)

  • Feat: support bit_get function #1713 (kazantsev-maksim)

  • chore: Clippy fixes for Rust 1.88 #1939 (andygrove)

  • Minor: Add unit tests for ceil/floor functions #1728 (tlm365)

Credits

Thank you to everyone who contributed to this release. Here is a breakdown of commits (PRs merged) per contributor.

    62	Andy Grove
    16	Matt Butrovich
    10	Oleks V
     8	Parth Chandra
     5	Kazantsev Maksim
     5	hsiang-c
     4	Kristin Cowalcijk
     4	Leung Ming
     3	B Vadlamani
     3	drexler-sky
     2	Emily Matheys
     2	Huaxin Gao
     2	KAZUYUKI TANIMURA
     2	Raz Luvaton
     2	Zhen Wang
     1	Artem Kupchinskiy
     1	Junfan Zhang
     1	Qi Zhu
     1	Rishab Joshi
     1	Tai Le Manh
     1	Yaniv Kunda
     1	Zhang Xiaofeng
     1	ding-young
     1	trompa

Thank you also to everyone who contributed in other ways such as filing issues, reviewing PRs, and providing feedback on this release.