# Parquet Compatibility
Comet currently has two distinct implementations of the Parquet scan operator.
| Scan Implementation | Notes |
|---|---|
| `native_datafusion` | Fully native scan |
| `native_iceberg_compat` | Hybrid JVM/native scan |
The configuration property `spark.comet.scan.impl` is used to select an implementation. The default setting is
`spark.comet.scan.impl=auto`, which attempts to use `native_datafusion` first and falls back to Spark if the scan
cannot be converted (e.g., due to unsupported features). Most users should not need to change this setting. However,
it is possible to force Comet to use a particular implementation for all scan operations by setting this
configuration property to one of the implementations above. For example: `--conf spark.comet.scan.impl=native_datafusion`.
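As an illustration, the property can be passed on the command line when launching a job (the jar path and application name below are placeholders, not actual paths):

```shell
# Force the fully native scan for all Parquet reads.
# The default (auto) normally needs no change.
spark-submit \
  --jars /path/to/comet.jar \
  --conf spark.plugins=org.apache.spark.CometPlugin \
  --conf spark.comet.scan.impl=native_datafusion \
  my_app.py
```

The same property can also be set in `spark-defaults.conf` or via `SparkConf` at session startup.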
## native_datafusion Limitations
The `native_datafusion` scan has some additional limitations, mostly related to Parquet metadata. All of these
cause Comet to fall back to Spark (including when using `auto` mode). Note that the `native_datafusion` scan
requires `spark.comet.exec.enabled=true` because the scan node must be wrapped by `CometExecRule`.
- No support for row indexes
- No support for reading Parquet field IDs
- No support for the `input_file_name()`, `input_file_block_start()`, or `input_file_block_length()` SQL functions. The `native_datafusion` scan does not use Spark's `FileScanRDD`, so these functions cannot populate their values.
- No support for `ignoreMissingFiles` or `ignoreCorruptFiles` being set to `true`
- Duplicate field names in case-insensitive mode (e.g., a Parquet file with both `B` and `b` columns) are detected at read time and raise a `SparkRuntimeException` with error class `_LEGACY_ERROR_TEMP_2093`, matching Spark's behavior.
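The duplicate-field check in the last point amounts to grouping field names by their lowercased form. The following is a pure-Python sketch of that idea (a hypothetical helper for illustration, not Comet's actual code):

```python
from collections import Counter


def find_case_insensitive_duplicates(field_names):
    """Return lowercased names that appear more than once when case is ignored."""
    counts = Counter(name.lower() for name in field_names)
    return sorted({name.lower() for name in field_names if counts[name.lower()] > 1})


# A Parquet schema with both "B" and "b" is ambiguous in case-insensitive
# mode, which is the situation that raises _LEGACY_ERROR_TEMP_2093 in Spark.
print(find_case_insensitive_duplicates(["B", "b", "id"]))  # -> ['b']
print(find_case_insensitive_duplicates(["a", "b", "id"]))  # -> []
```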
The following `native_datafusion` limitations may produce incorrect results on Spark versions prior to 4.0
without falling back to Spark:
- Reading `TimestampLTZ` as `TimestampNTZ`. On Spark 3.x, Spark raises an error per SPARK-36182 because LTZ encodes UTC-adjusted instants that cannot be safely reinterpreted as timezone-free values. Comet does not raise this error and instead returns the raw UTC instant as a `TimestampNTZ` value. This applies to all LTZ physical encodings (INT96, TIMESTAMP_MICROS, TIMESTAMP_MILLIS). On Spark 4.0+, this read is permitted (SPARK-47447) and Comet matches Spark's behavior. See #4219.
- Unsupported Parquet type conversions. Spark raises schema incompatibility errors for certain conversions (e.g., reading INT32 as BIGINT, reading BINARY as TIMESTAMP, unsupported decimal precision changes). The `native_datafusion` scan may not detect these mismatches and could return unexpected values instead of raising an error. See #3720.
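To see why the LTZ-as-NTZ read is unsafe, consider a stored TIMESTAMP_MICROS value, which encodes microseconds since the UTC epoch. Reading it as NTZ returns the UTC wall clock, while the value a user in a non-UTC session would expect differs by the zone offset. A pure-Python sketch (not Comet code; the UTC+8 session zone is an arbitrary example):

```python
from datetime import datetime, timezone, timedelta

# A TIMESTAMP_MICROS (LTZ) value: microseconds since the UTC epoch.
micros_since_epoch = 1_700_000_000_000_000
utc_instant = datetime.fromtimestamp(micros_since_epoch / 1e6, tz=timezone.utc)

# Spark 3.x raises an error here (SPARK-36182); the native_datafusion scan
# instead returns the raw UTC wall clock as a timezone-free (NTZ) value:
ntz_as_read = utc_instant.replace(tzinfo=None)

# In a UTC+8 session, the wall clock a user would expect is shifted by 8 hours:
session_tz = timezone(timedelta(hours=8))
ntz_expected = utc_instant.astimezone(session_tz).replace(tzinfo=None)

print(ntz_expected - ntz_as_read)  # -> 8:00:00
```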
## native_iceberg_compat Limitations
The `native_iceberg_compat` scan has the following additional limitation that may produce incorrect results
without falling back to Spark:
- Some Spark configuration values are hard-coded to their defaults rather than respecting user-specified values. This may produce incorrect results when non-default values are set. The affected configurations are `spark.sql.parquet.binaryAsString`, `spark.sql.parquet.int96AsTimestamp`, `spark.sql.caseSensitive`, `spark.sql.parquet.inferTimestampNTZ.enabled`, and `spark.sql.legacy.parquet.nanosAsLong`. See issue #1816 for more details.