DataFusion Comet 0.9.0 Changelog#

This release consists of 139 commits from 24 contributors. See credits at the end of this changelog for more information.

Fixed bugs:

fix: typo for instr in fuzz testing #1686 (mbutrovich)
fix: Bucketed scan fallback for native_datafusion Parquet scan #1720 (mbutrovich)
fix: Skip row index Spark SQL tests for native_datafusion Parquet scan #1724 (mbutrovich)
fix: Check acquired memory when CometMemoryPool grows #1732 (wForget)
fix: Fix data race in memory profiling #1727 (andygrove)
fix: Enable some DPP Spark SQL tests #1734 (andygrove)
fix: support literal null list and map #1742 (kazuyukitanimura)
fix: get_struct field is incorrect when struct in array #1687 (comphead)
fix: cast map types correctly in schema adapter #1771 (parthchandra)
fix: correct schema type checking in native_iceberg_compat #1755 (parthchandra)
fix: default values for native_datafusion scan #1756 (mbutrovich)
fix: [native_scans] Support CASE_SENSITIVE when reading Parquet #1782 (andygrove)
fix: cargo install tpchgen-cli in benchmark doc #1797 (zhuqi-lucas)
fix: support map_keys #1788 (comphead)
fix: fall back on nested types for default values #1799 (mbutrovich)
fix: Re-enable Spark 4 tests on Linux #1806 (andygrove)
fix: fallback to Spark scan if encryption is enabled (native_datafusion/native_iceberg_compat) #1785 (parthchandra)
fix: native_iceberg_compat: move checking parquet types above fetching batch #1809 (mbutrovich)
fix: translate missing or corrupt file exceptions, fall back if asked to ignore #1765 (mbutrovich)
fix: Fix Spark SQL AQE exchange reuse test failures #1811 (coderfender)
fix: Enable more Spark SQL tests #1834 (andygrove)
fix: support map_values #1835 (comphead)
fix: Handle case where num_cols == 0 in native execution #1840 (andygrove)
fix: Fix shuffle writing rows containing null struct fields #1845 (Kontinuation)
fix: Fall back to Spark for RANGE BETWEEN window expressions #1848 (andygrove)
fix: Remove COMET_SHUFFLE_FALLBACK_TO_COLUMNAR hack #1865 (andygrove)
fix: support read Struct by user schema #1860 (comphead)
fix: map parquet field_id correctly (native_iceberg_compat) #1815 (parthchandra)
fix: cast_struct_to_struct aligns to Spark behavior #1879 (mbutrovich)
fix: correctly handle schemas with nested array of struct (native_iceberg_compat) #1883 (parthchandra)
fix: set RangePartitioning for native shuffle default to false #1907 (mbutrovich)
fix: conflict between #1905 and #1892. #1919 (mbutrovich)
fix: Add overflow check to evaluate of sum decimal accumulator #1922 (leung-ming)
fix: Fix overflow handling when casting float to decimal #1914 (leung-ming)
fix: Ignore a test case fails on Miri #1951 (leung-ming)

Performance related:

perf: Add memory profiling #1702 (andygrove)
perf: Add performance tracing capability #1706 (andygrove)
perf: Add COMET_RESPECT_PARQUET_FILTER_PUSHDOWN config #1936 (andygrove)

Implemented enhancements:

feat: add jemalloc as optional custom allocator #1679 (mbutrovich)
feat: support array_repeat #1680 (comphead)
feat: More warning info for users #1667 (hsiang-c)
feat: decode() expression when using ‘utf-8’ encoding #1697 (mbutrovich)
feat: regexp_replace() expression with no starting offset #1700 (mbutrovich)
feat: Improve performance tracing feature #1730 (andygrove)
feat: Set/cancel with job tag and make max broadcast table size configurable #1693 (wForget)
feat: Add support for expm1 expression from datafusion-spark crate #1711 (andygrove)
feat: Add config option for showing all Comet plan transformations #1780 (andygrove)
feat: Support Type widening: byte → short/int/long, short → int/long #1770 (huaxingao)
feat: Translate Hadoop S3A configurations to object_store configurations #1817 (Kontinuation)
feat: Upgrade to official DataFusion 48.0.0 release #1877 (andygrove)
feat: Add experimental auto mode for COMET_PARQUET_SCAN_IMPL #1747 (andygrove)
feat: support RangePartitioning with native shuffle #1862 (mbutrovich)
feat: Add support for signum expression #1889 (andygrove)
feat: Add support to lookup map by key #1898 (comphead)
feat: support array_max #1892 (drexler-sky)
feat: pass ignore_nulls flag to first and last #1866 (rluvaton)
feat: Implement ToPrettyString #1921 (andygrove)
feat: Support hadoop s3a config in native_iceberg_compat #1925 (parthchandra)
feat: rand expression support #1199 (akupchinskiy)
feat: supports array_distinct #1923 (drexler-sky)
feat: auto scan mode should check for supported file location #1930 (andygrove)
feat: Encapsulate Parquet objects #1920 (huaxingao)
feat: Change default value of COMET_NATIVE_SCAN_IMPL to auto #1933 (andygrove)
feat: Supports array_union #1945 (drexler-sky)

Documentation updates:

docs: Add changelog for 0.8.0 #1675 (andygrove)
docs: Add instructions on running TPC-H on macOS #1647 (andygrove)
docs: Add documentation for accelerating Iceberg Parquet scans with Comet #1683 (andygrove)
docs: Add note on setting core.abbrev when generating diffs #1735 (andygrove)
docs: Remove outdated param in macos bench guide #1748 (ding-young)
docs: Add instructions for running individual Spark SQL tests from sbt #1752 (coderfender)
docs: Add documentation for native_datafusion Parquet scanner’s S3 support #1832 (Kontinuation)
docs: Add docs stating that Comet does not support reading decimals encoded in Parquet BINARY format #1895 (andygrove)

Other:

chore: Start 0.9.0 development #1676 (andygrove)
chore: Update viable crates #1677 (EmilyMatt)
chore: match Maven plugin versions with Spark 3.5 #1668 (hsiang-c)
chore: Remove fallback reason “because the children were not native” #1672 (andygrove)
chore: Rename scalarExprToProto to scalarFunctionExprToProto #1688 (comphead)
chore: fix build errors #1690 (comphead)
chore: Make Aggregate transformation more compact #1670 (EmilyMatt)
chore: update dev/release/rat_exclude_files.txt #1689 (hsiang-c)
chore: Move Comet rules into their own files #1695 (andygrove)
chore: Remove fast encoding option #1703 (andygrove)
chore: fix CI job name #1712 (hsiang-c)
minor: Warn if memory pool is dropped with bytes still reserved #1721 (andygrove)
chore: Correct memory acquired size in unified memory pool #1738 (zuston)
chore: allow large errors for Clippy #1743 (comphead)
chore: Refactor DataTypeSupport #1741 (andygrove)
chore: More refactoring of type checking logic #1744 (andygrove)
chore: Enable more complex type tests #1753 (andygrove)
chore: Add scanImpl attribute to CometScanExec #1746 (andygrove)
chore: Prepare for DataFusion 48.0.0 #1710 (andygrove)
Docs: Setup Comet on IntelliJ #1760 (coderfender)
chore: Reenable nested types for CometFuzzTestSuite with int96 #1761 (mbutrovich)
chore: Enable partial Spark SQL tests for native_iceberg_compat scan #1762 (andygrove)
chore: [native_iceberg_compat / native_datafusion] Ignore Spark SQL Parquet encryption tests #1763 (andygrove)
build: Ignore array_repeat test to fix CI issues #1774 (andygrove)
chore: Upload crash logs if Java tests fail #1779 (andygrove)
chore: Drop support for Java 8 #1777 (andygrove)
chore: Bump arrow to 18.3.0 #1773 (Kontinuation)
build: Stop running Comet’s Spark 4 tests on Linux for PR builds #1802 (andygrove)
Chore: Moved strings expressions to separate file #1792 (kazantsev-maksim)
chore: Speed up “PR Builds” CI workflows #1807 (andygrove)
chore: [native scans] Ignore Spark SQL test for string predicate pushdown #1768 (andygrove)
chore: Bump DataFusion to git rev 2c2f225 #1814 (andygrove)
Feat: support bit_count function #1602 (kazantsev-maksim)
Chore: implement bit_not as ScalarUDFImpl #1825 (kazantsev-maksim)
build: Specify -Dsbt.log.noformat=true in sbt CI runs #1822 (andygrove)
chore: Use unique artifact names in Java test run #1818 (andygrove)
minor: Refactor PhysicalPlanner::default() to avoid duplicate code #1821 (andygrove)
Chore: implement bit_count as ScalarUDFImpl #1826 (kazantsev-maksim)
chore: IgnoreCometNativeScan on a few more Spark SQL tests #1837 (mbutrovich)
chore: Enable tests in RemoveRedundantProjectsSuite.scala related to issue #242 #1838 (rishvin)
minor: Replace many instances of checkSparkAnswer with checkSparkAnswerAndOperator #1851 (andygrove)
chore: Update documentation and ignore Spark SQL tests for known issue with count distinct on NaN in aggregate #1847 (andygrove)
chore: Ignore Spark SQL WholeStageCodegenSuite tests #1859 (andygrove)
chore: Upgrade to DataFusion 48.0.0-rc3 #1863 (andygrove)
upgraded spark 3.5.5 to 3.5.6 #1861 (YanivKunda)
build: Disable some rounding tests when miri is enabled #1873 (andygrove)
chore: Enable Spark SQL tests for native_iceberg_compat #1876 (andygrove)
chore: Enable more Spark SQL tests #1869 (andygrove)
chore: refactor planner read schema tests #1886 (comphead)
chore: Implement date_trunc as ScalarUDFImpl #1880 (leung-ming)
Chore: implement datetime funcs as ScalarUDFImpl #1874 (trompa)
minor: Improve testing of math scalar functions #1896 (andygrove)
minor: Avoid rewriting join to unsupported join #1888 (andygrove)
chore: Enable native_iceberg_compat Spark SQL tests (for real, this time) #1910 (andygrove)
chore: rename makeParquetFileAllTypes to makeParquetFileAllPrimitiveTypes #1905 (parthchandra)
chore: add a test case to read from an arbitrarily complex type schema #1911 (parthchandra)
test: Trigger Spark 3.4.3 SQL tests for iceberg-compat #1912 (kazuyukitanimura)
build: Fix conflict between #1910 and #1912 #1924 (andygrove)
minor: fix kube/Dockerfile build failed #1918 (zhangxffff)
chore: Improve reporting of fallback reasons for CollectLimit #1694 (andygrove)
chore: move udf registration to better place #1899 (rluvaton)
chore: Comet + Iceberg (1.8.1) CI #1715 (hsiang-c)
chore: Introduce exprHandlers map in QueryPlanSerde #1903 (andygrove)
chore: Enable Spark SQL tests for auto scan mode #1885 (andygrove)
Feat: support bit_get function #1713 (kazantsev-maksim)
chore: Clippy fixes for Rust 1.88 #1939 (andygrove)
Minor: Add unit tests for ceil/floor functions #1728 (tlm365)

Credits#

Thank you to everyone who contributed to this release. Here is a breakdown of commits (PRs merged) per contributor.

Andy Grove
Matt Butrovich
Oleks V
Parth Chandra
Kazantsev Maksim
hsiang-c
Kristin Cowalcijk
Leung Ming
B Vadlamani
drexler-sky
Emily Matheys
Huaxin Gao
KAZUYUKI TANIMURA
Raz Luvaton
Zhen Wang
Artem Kupchinskiy
Junfan Zhang
Qi Zhu
Rishab Joshi
Tai Le Manh
Yaniv Kunda
Zhang Xiaofeng
ding-young
trompa

Thank you also to everyone who contributed in other ways such as filing issues, reviewing PRs, and providing feedback on this release.