DataFusion Comet 0.5.0 Changelog

This release consists of 69 commits from 15 contributors. See credits at the end of this changelog for more information.

Fixed bugs:

  • fix: Unsigned type related bugs #1095 (kazuyukitanimura)

  • fix: Use RDD partition index #1112 (viirya)

  • fix: Various metrics bug fixes and improvements #1111 (andygrove)

  • fix: Don’t create CometScanExec for subclasses of ParquetFileFormat #1129 (Kimahriman)

  • fix: Fix metrics regressions #1132 (andygrove)

  • fix: Enable scenarios accidentally commented out in CometExecBenchmark #1151 (mbutrovich)

  • fix: Spark 4.0-preview1 SPARK-47120 #1156 (kazuyukitanimura)

  • fix: Document enabling comet explain plan usage in Spark (4.0) #1176 (parthchandra)

  • fix: stddev_pop should not directly return 0.0 when count is 1.0 #1184 (viirya)

  • fix: fix missing explanation for then branch in case when #1200 (rluvaton)

  • fix: Fall back to Spark for unsupported partition or sort expressions in window aggregates #1253 (andygrove)

  • fix: Fall back to Spark for distinct aggregates #1262 (andygrove)

  • fix: disable initCap by default #1276 (kazuyukitanimura)

Performance related:

  • perf: Stop passing Java config map into native createPlan #1101 (andygrove)

  • feat: Make native shuffle compression configurable and respect spark.shuffle.compress #1185 (andygrove)

  • perf: Improve query planning to more reliably fall back to columnar shuffle when native shuffle is not supported #1209 (andygrove)

  • feat: Move shuffle block decompression and decoding to native code and add LZ4 & Snappy support #1192 (andygrove)

  • feat: Implement custom RecordBatch serde for shuffle for improved performance #1190 (andygrove)
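The shuffle compression item above (#1185) says native shuffle now respects spark.shuffle.compress. As a minimal sketch of passing that setting on the command line, the helper below renders a dictionary of Spark settings as spark-submit --conf arguments. The function name to_spark_submit_args and the settings dictionary are hypothetical and used only for illustration; the only configuration key shown is the one named in the changelog entry.

```python
def to_spark_submit_args(conf):
    """Render a dict of Spark settings as spark-submit --conf arguments.

    Hypothetical helper for illustration; not part of Comet or Spark.
    """
    args = []
    for key in sorted(conf):
        value = conf[key]
        # Spark configuration files conventionally use lowercase true/false.
        if isinstance(value, bool):
            value = "true" if value else "false"
        args.extend(["--conf", f"{key}={value}"])
    return args


# Example: keep shuffle compression enabled.
settings = {"spark.shuffle.compress": True}
print(" ".join(to_spark_submit_args(settings)))
```

The resulting arguments can be appended to an existing spark-submit invocation; any Comet-specific shuffle settings would be added to the same dictionary once their exact names are confirmed against the Comet configuration guide.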

Implemented enhancements:

  • feat: support array_insert #1073 (SemyonSinchenko)

  • feat: enable decimal to decimal cast of different precision and scale #1086 (himadripal)

  • feat: Improve ScanExec native metrics #1133 (andygrove)

  • feat: Add Spark-compatible implementation of SchemaAdapterFactory #1169 (andygrove)

  • feat: Improve shuffle metrics (second attempt) #1175 (andygrove)

  • feat: Add a spark.comet.exec.memoryPool configuration for experimenting with various datafusion memory pool setups. #1021 (Kontinuation)

  • feat: Reenable tests for filtered SMJ anti join #1211 (comphead)

  • feat: add support for array_remove expression #1179 (jatin510)

Documentation updates:

  • docs: Update documentation for 0.4.0 release #1096 (andygrove)

  • docs: Fix readme typo FGPA -> FPGA #1117 (gstvg)

  • docs: Add more technical detail and new diagram to Comet plugin overview #1119 (andygrove)

  • docs: Add some documentation explaining how shuffle works #1148 (andygrove)

  • docs: Update TPC-H benchmark results #1257 (andygrove)

Other:

  • chore: Add changelog for 0.4.0 #1089 (andygrove)

  • chore: Prepare for 0.5.0 development #1090 (andygrove)

  • build: Skip installation of spark-integration and fuzz testing modules #1091 (parthchandra)

  • minor: Add hint for finding the GPG key to use when publishing to maven #1093 (andygrove)

  • chore: Include first ScanExec batch in metrics #1105 (andygrove)

  • chore: Improve CometScan metrics #1100 (andygrove)

  • chore: Add custom metric for native shuffle fetching batches from JVM #1108 (andygrove)

  • chore: Remove unused StringView struct #1143 (andygrove)

  • test: enable more Spark 4.0 tests #1145 (kazuyukitanimura)

  • chore: Refactor cast to use SparkCastOptions param #1146 (andygrove)

  • chore: Move more expressions from core crate to spark-expr crate #1152 (andygrove)

  • chore: Remove dead code #1155 (andygrove)

  • chore: Move string kernels and expressions to spark-expr crate #1164 (andygrove)

  • chore: Move remaining expressions to spark-expr crate + some minor refactoring #1165 (andygrove)

  • chore: Add ignored tests for reading complex types from Parquet #1167 (andygrove)

  • test: enabling Spark tests with offHeap requirement #1177 (kazuyukitanimura)

  • minor: move shuffle classes from common to spark #1193 (andygrove)

  • minor: refactor to move decodeBatches to broadcast exchange code as private function #1195 (andygrove)

  • minor: refactor prepare_output so that it does not require an ExecutionContext #1194 (andygrove)

  • minor: remove unused source files #1202 (andygrove)

  • chore: Upgrade to DataFusion 44.0.0-rc2 #1154 (andygrove)

  • chore: Add safety check to CometBuffer #1050 (viirya)

  • chore: Remove unreachable code #1213 (andygrove)

  • test: Enable Comet by default except some tests in SparkSessionExtensionSuite #1201 (kazuyukitanimura)

  • chore: extract struct expressions to folders based on spark grouping #1216 (rluvaton)

  • chore: extract static invoke expressions to folders based on spark grouping #1217 (rluvaton)

  • chore: Follow-on PR to fully enable onheap memory usage #1210 (andygrove)

  • chore: extract agg_funcs expressions to folders based on spark grouping #1224 (rluvaton)

  • chore: extract datetime_funcs expressions to folders based on spark grouping #1222 (rluvaton)

  • chore: Upgrade to DataFusion 44.0.0 from 44.0.0 RC2 #1232 (rluvaton)

  • chore: extract strings file to strings_func like in spark grouping #1215 (rluvaton)

  • chore: extract predicate_functions expressions to folders based on spark grouping #1218 (rluvaton)

  • build(deps): bump protobuf version to 3.21.12 #1234 (wForget)

  • chore: extract json_funcs expressions to folders based on spark grouping #1220 (rluvaton)

  • test: Enable shuffle by default in Spark tests #1240 (kazuyukitanimura)

  • chore: extract hash_funcs expressions to folders based on spark grouping #1221 (rluvaton)

  • build: Fix test failure caused by merging conflicting PRs #1259 (andygrove)

Credits

Thank you to everyone who contributed to this release. Here is a breakdown of commits (PRs merged) per contributor.

    37	Andy Grove
    10	Raz Luvaton
     7	KAZUYUKI TANIMURA
     3	Liang-Chi Hsieh
     2	Parth Chandra
     1	Adam Binford
     1	Dharan Aditya
     1	Himadri Pal
     1	Jagdish Parihar
     1	Kristin Cowalcijk
     1	Matt Butrovich
     1	Oleks V
     1	Sem
     1	Zhen Wang
     1	gstvg

Thank you also to everyone who contributed in other ways such as filing issues, reviewing PRs, and providing feedback on this release.