Parquet Data Wrapper Reference


The following are the supported data types and configuration for use with HeavyConnect and the Parquet data format. This reference outlines prerequisites, restrictions, and supported mappings of HeavyDB column types to Parquet column types.

Parquet Data Wrapper Prerequisites & Restrictions

| Type | Applicable to | Requirement or restriction |
| --- | --- | --- |
| Prerequisite | Every column in every row group in every Parquet file | The metadata must exist in a usable form, with statistics that are populated. In particular, the minimum, maximum, and null count must all be correctly set. |
| Restriction | HeavyDB foreign table max fragment size and every row group in every Parquet file | The HeavyDB foreign table’s max fragment size must be greater than or equal to the row count of the largest row group in any Parquet file. |
| Restriction | Every Parquet file schema | Every Parquet file must have a flat schema (no nested types in the schema definition). The only exception is the mapping of Array types to Parquet lists (which are nested), described below. |
| Restriction | Every Parquet file schema | When more than one Parquet file is used, all file schemas must be identical to each other and compatible with the HeavyDB foreign table schema. |
| Prerequisite | Every date/time type column in every Parquet file | All date/time type columns are expected to be in UTC or adjusted for UTC. |
| Restriction | Every decimal column mapping | Decimal column mappings are required to have the same precision and scale in both the Parquet schema and the HeavyDB foreign table schema. |
| Restriction | Every numeric/boolean column mapping | Each Parquet numeric/boolean type maps to only one HeavyDB type, the one that best represents it. See the mappings detailed in the tables below. Coercion allows for more than one mapping and is labelled as such below. Coercion can result in a loss of information; FSI attempts to detect this possibility and reports an error upon detection. |
| Restriction | Every numeric/boolean column mapping | Coercion that widens types is disabled because there is no apparent use case for it. |
| Restriction | Parquet list columns | Parquet list columns must specify a schema with a max definition level of 3. This directly means that no node in the list schema is required. Note: if a node is required, the Parquet data wrapper still detects the list schema, but throws an error indicating that the max definition level is not as expected. |
| Restriction | Parquet scalar columns | Parquet (flat) scalar columns must have the OPTIONAL repetition type. REQUIRED is currently not supported. |
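Several of the requirements above (populated statistics, row groups no larger than the max fragment size, OPTIONAL scalar columns) can be checked before a file is attached. The following is a minimal sketch, not part of HeavyDB: `ColumnChunkInfo`, `RowGroupInfo`, and `find_violations` are hypothetical names, and the metadata is assumed to have already been read from the Parquet footer by whatever tool is available.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ColumnChunkInfo:
    """Hypothetical stand-in for one scalar column chunk's footer metadata."""
    repetition: str             # "OPTIONAL" or "REQUIRED"
    has_min_max: bool           # True when min and max statistics are set
    null_count: Optional[int]   # None when the writer did not record it


@dataclass
class RowGroupInfo:
    """Hypothetical stand-in for one row group's footer metadata."""
    num_rows: int
    columns: List[ColumnChunkInfo]


def find_violations(row_groups: List[RowGroupInfo], max_fragment_size: int) -> List[str]:
    """Return human-readable problems; an empty list means no issue was found."""
    problems = []
    for i, group in enumerate(row_groups):
        # The max fragment size must be >= the row count of every row group.
        if group.num_rows > max_fragment_size:
            problems.append(
                f"row group {i}: {group.num_rows} rows exceed "
                f"max fragment size {max_fragment_size}"
            )
        for j, column in enumerate(group.columns):
            # Minimum, maximum, and null count must all be set.
            if not column.has_min_max or column.null_count is None:
                problems.append(f"row group {i}, column {j}: statistics incomplete")
            # Flat scalar columns must be OPTIONAL.
            if column.repetition != "OPTIONAL":
                problems.append(f"row group {i}, column {j}: repetition is not OPTIONAL")
    return problems
```

An empty result does not guarantee that a file is accepted; it only covers the three checks modeled here.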

HeavyDB to Parquet Data Type Mapping

Numeric and Boolean Types: Table 1

| HeavyDB (Down) \ Parquet (Right) | INT(64) | INT(32) | INT(16) | INT(8) |
| --- | --- | --- | --- | --- |
| BIGINT | Yes | No | No | No |
| BIGINT ENCODING FIXED (32) / INTEGER | Coercible | Yes | No | No |
| BIGINT ENCODING FIXED (16) / INTEGER ENCODING FIXED (16) / SMALLINT | Coercible | Coercible | Yes | No |
| BIGINT ENCODING FIXED (8) / INTEGER ENCODING FIXED (8) / SMALLINT ENCODING FIXED (8) / TINYINT | Coercible | Coercible | Coercible | Yes |
| DECIMAL (Precision, Scale) | No | No | No | No |
| DOUBLE | No | No | No | No |
| FLOAT | No | No | No | No |
| BOOLEAN | No | No | No | No |

Numeric and Boolean Types: Table 2

| HeavyDB (Down) \ Parquet (Right) | Unsigned INT(64) | Unsigned INT(32) | Unsigned INT(16) | Unsigned INT(8) |
| --- | --- | --- | --- | --- |
| BIGINT | Coercible | Yes [1] | No | No |
| BIGINT ENCODING FIXED (32) / INTEGER | Coercible | Coercible | Yes [1] | No |
| BIGINT ENCODING FIXED (16) / INTEGER ENCODING FIXED (16) / SMALLINT | Coercible | Coercible | Coercible | Yes [1] |
| BIGINT ENCODING FIXED (8) / INTEGER ENCODING FIXED (8) / SMALLINT ENCODING FIXED (8) / TINYINT | Coercible | Coercible | Coercible | Coercible |
| DECIMAL (Precision, Scale) | No | No | No | No |
| DOUBLE | No | No | No | No |
| FLOAT | No | No | No | No |
| BOOLEAN | No | No | No | No |
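The integer rows of Tables 1 and 2 follow a single pattern, which can be summarized in a few lines. This is a sketch derived only from the table entries, not from the FSI source code; `integer_mapping` and its parameters are hypothetical names. The pattern is that a signed Parquet integer needs a HeavyDB type of the same width, an unsigned one needs a HeavyDB type twice as wide, narrower HeavyDB types are reachable only by coercion, and wider ones are not allowed.

```python
VALID_BITS = (8, 16, 32, 64)


def integer_mapping(heavydb_bits: int, parquet_bits: int, parquet_signed: bool = True) -> str:
    """Return "Yes", "Coercible", or "No" for an integer column mapping.

    heavydb_bits is the storage width of the HeavyDB column, for example
    32 for INTEGER or for BIGINT ENCODING FIXED (32).
    """
    if heavydb_bits not in VALID_BITS or parquet_bits not in VALID_BITS:
        raise ValueError("bit widths must be one of 8, 16, 32, 64")
    # An unsigned Parquet type needs a signed type one size larger to be lossless.
    required_bits = parquet_bits if parquet_signed else parquet_bits * 2
    if heavydb_bits == required_bits:
        return "Yes"
    if heavydb_bits < required_bits:
        return "Coercible"  # narrowing: possible loss of information
    return "No"             # widening coercion is disabled


# SMALLINT from Parquet INT(32) is a narrowing coercion.
print(integer_mapping(16, 32))
# BIGINT from Parquet Unsigned INT(32) is the direct mapping.
print(integer_mapping(64, 32, parquet_signed=False))
```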

Numeric and Boolean Types: Table 3

| HeavyDB (Down) \ Parquet (Right) | DECIMAL (Precision, Scale) | DOUBLE | FLOAT | BOOLEAN |
| --- | --- | --- | --- | --- |
| BIGINT | No | No | No | No |
| BIGINT ENCODING FIXED (32) / INTEGER | No | No | No | No |
| BIGINT ENCODING FIXED (16) / INTEGER ENCODING FIXED (16) / SMALLINT | No | No | No | No |
| BIGINT ENCODING FIXED (8) / INTEGER ENCODING FIXED (8) / SMALLINT ENCODING FIXED (8) / TINYINT | No | No | No | No |
| DECIMAL (Precision, Scale) | Yes if Precision and Scale match; No otherwise | No | No | No |
| DOUBLE | No | Yes | No | No |
| FLOAT | No | Coercible [2] | Yes | No |
| BOOLEAN | No | No | No | Yes |

[1] Unsigned Parquet types must map to HeavyDB signed types that are one size larger (for example, Parquet’s Unsigned INT(32) must map to HeavyDB’s BIGINT) to ensure that no information loss occurs.

[2] Float types use 32 bits, while double types use 64 bits in their representation according to the IEEE standard. There is no check for precision loss when coercion from a double to float is requested. However, there is a check to see if the double fits in the range that float types are capable of representing.
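The difference between the range check and the missing precision check in note [2] can be illustrated as follows. This is a sketch of the described behavior only: `coerce_double_to_float` is a hypothetical name, and how the data wrapper performs its check internally is not stated here.

```python
import math
import struct

# Largest finite IEEE 754 single-precision value.
FLT_MAX = (2.0 - 2.0 ** -23) * 2.0 ** 127


def coerce_double_to_float(value: float) -> float:
    """Round a 64-bit double to the nearest 32-bit float.

    Raises ValueError when a finite value is outside the float range.
    Precision loss is deliberately not checked.
    """
    if math.isfinite(value) and abs(value) > FLT_MAX:
        raise ValueError(f"{value!r} is outside the range of a 32-bit float")
    # Packing as "f" rounds to single precision; unpacking widens it again.
    return struct.unpack("<f", struct.pack("<f", value))[0]


# 1.5 is exactly representable in 32 bits, so nothing changes.
print(coerce_double_to_float(1.5) == 1.5)
# 0.1 is not, so the coerced value differs from the original double.
print(coerce_double_to_float(0.1) == 0.1)
```

A value such as `1e39` would raise the error, while `0.1` passes silently even though its stored value changes.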

Date and Time Types: Table 1

| HeavyDB (Down) \ Parquet (Right) | DATE | TIME MILLIS (UTC) | TIME MICROS (UTC) | TIME NANOS (UTC) |
| --- | --- | --- | --- | --- |
| DATE / DATE ENCODING DAYS (32) | Yes | No | No | Coercible |
| DATE ENCODING DAYS (16) [1] | Coercible | No | No | Coercible |
| TIME [2] | No | Yes | Yes | Yes |
| TIME ENCODING FIXED (32) | No | Yes | Coercible | Coercible |
| TIMESTAMP (0) [3] | No | No | No | No |
| TIMESTAMP (3) | No | No | No | No |
| TIMESTAMP (6) | No | No | No | No |
| TIMESTAMP (9) | No | No | No | No |
| TIMESTAMP ENCODING FIXED (32) [4][5] | No | No | No | No |

Date and Time Types: Table 2

| HeavyDB (Down) \ Parquet (Right) | TIMESTAMP MILLIS (UTC) | TIMESTAMP MICROS (UTC) | TIMESTAMP NANOS (UTC) | INT64 | INT32 |
| --- | --- | --- | --- | --- | --- |
| DATE / DATE ENCODING DAYS (32) | Coercible | Coercible | No | No | No |
| DATE ENCODING DAYS (16) [1] | Coercible | Coercible | No | No | No |
| TIME [2] | No | No | No | No | No |
| TIME ENCODING FIXED (32) | No | No | No | No | No |
| TIMESTAMP (0) [3] | Yes | Yes | Yes | Coercible | No |
| TIMESTAMP (3) | Yes | No | No | Coercible | No |
| TIMESTAMP (6) | No | Yes | No | Coercible | No |
| TIMESTAMP (9) | No | No | Yes | Coercible | No |
| TIMESTAMP ENCODING FIXED (32) [4][5] | Coercible | Coercible | Coercible | Coercible | Coercible |

[1] DATE ENCODING DAYS (16) has no mapping from any Parquet type where the mapping would not result in a potential loss of information. Some mappings are allowed through coercion.

[2] The HeavyDB TIME type represents the number of seconds elapsed in a 24-hour period, while the Parquet TIME type represents a similar quantity but in milli/micro/nanoseconds. To make use of such Parquet columns, this mapping is allowed even though it breaks the convention that only direct mappings are supported. An intermediate transform takes place, calculating the number of seconds elapsed given the number of milli, micro, or nano seconds.

[3] Similar to the HeavyDB TIME type, TIMESTAMP (0) represents the time/date using seconds, which is not compatible with any Parquet TIMESTAMP types. As such, an exception is made for this case to support mapping from all of milli, micro, or nano second Parquet TIMESTAMPs.

[4] Because Parquet’s TIMESTAMP stores the data in a 64-bit representation, and the ENCODING FIXED (32) uses a 32-bit representation, information loss is possible. In these cases, no mapping is supported; however, a coercion is supported.

[5] Timestamps that use a 32-bit representation with any precision other than seconds have a very limited range and are not supported. Second-precision timestamps that use a 32-bit representation have a maximum range of 8:45:53 pm UTC, Friday, December 13, 1901 to 3:14:07 am UTC, Tuesday, January 19, 2038.
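The intermediate transform from notes [2] and [3] and the 32-bit range from note [5] can be sketched as below. All names are hypothetical. Two assumptions are made: sub-second digits are dropped by floor division (the notes do not say whether the wrapper truncates or rounds), and the lower bound is taken as -(2^31 - 1) seconds because that value reproduces the start of the range quoted in note [5].

```python
from datetime import datetime, timedelta, timezone

UNITS_PER_SECOND = {"MILLIS": 1_000, "MICROS": 1_000_000, "NANOS": 1_000_000_000}

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
FIXED32_MAX_SECONDS = 2 ** 31 - 1
# Assumption: chosen so that the bound matches the range stated in note [5].
FIXED32_MIN_SECONDS = -(2 ** 31 - 1)


def to_seconds(value: int, unit: str) -> int:
    """Convert a Parquet TIME/TIMESTAMP value to whole seconds (floor division)."""
    return value // UNITS_PER_SECOND[unit]


def to_fixed32_seconds(value: int, unit: str) -> int:
    """Coerce a 64-bit Parquet TIMESTAMP to 32-bit seconds, or raise if it cannot fit."""
    seconds = to_seconds(value, unit)
    if not FIXED32_MIN_SECONDS <= seconds <= FIXED32_MAX_SECONDS:
        raise OverflowError(f"{seconds} seconds does not fit in 32 bits")
    return seconds


def as_utc(seconds: int) -> datetime:
    """Render epoch seconds as a UTC datetime (works for negative values too)."""
    return EPOCH + timedelta(seconds=seconds)


# 12:34:56.789 expressed as TIME MILLIS becomes 45296 seconds.
print(to_seconds(45_296_789, "MILLIS"))
# The two ends of the 32-bit second-precision range.
print(as_utc(FIXED32_MIN_SECONDS).isoformat())
print(as_utc(FIXED32_MAX_SECONDS).isoformat())
```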

String Types

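| HeavyDB (Down) \ Parquet (Right) | STRING | ENUM (Not implemented) | UUID (Not implemented) | BYTE_ARRAY |
| --- | --- | --- | --- | --- |
| TEXT ENCODING DICT | Yes | Yes | Yes | Yes |
| TEXT ENCODING (16) | Yes | Yes | Yes | Yes |
| TEXT ENCODING (8) | Yes | Yes | Yes | Yes |
| TEXT ENCODING NONE | Yes | Yes | Yes | Yes |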

GeoTypes

Only geometry types in WKT format are currently supported.

| HeavyDB (Down) \ Parquet (Right) | STRING | BYTE_ARRAY |
| --- | --- | --- |
| POINT | Yes | Yes |
| MULTIPOINT | Yes | Yes |
| LINESTRING | Yes | Yes |
| MULTILINESTRING | Yes | Yes |
| POLYGON | Yes | Yes |
| MULTIPOLYGON | Yes | Yes |

Array Types

The HeavyDB array data type maps to the Parquet list data type.
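The list restriction stated earlier (a max definition level of 3, with no required nodes) follows from how Parquet counts definition levels: each OPTIONAL or REPEATED node on the path from the root to the leaf adds one level, and REQUIRED nodes add none. The sketch below applies that rule to the standard three-level list layout; the function name and the column name `tags` are hypothetical.

```python
# Standard three-level Parquet list layout, with a hypothetical column name:
#
#   optional group tags (LIST) {
#     repeated group list {
#       optional binary element (STRING);
#     }
#   }

REPETITION_TYPES = ("REQUIRED", "OPTIONAL", "REPEATED")


def max_definition_level(path_repetitions):
    """Count the non-required nodes on the path from the root to the leaf."""
    for repetition in path_repetitions:
        if repetition not in REPETITION_TYPES:
            raise ValueError(f"unknown repetition type: {repetition!r}")
    return sum(1 for repetition in path_repetitions if repetition != "REQUIRED")


# All three nodes are non-required, so the list reaches level 3.
print(max_definition_level(["OPTIONAL", "REPEATED", "OPTIONAL"]))
# A required outer group lowers the level to 2, which the restriction rejects.
print(max_definition_level(["REQUIRED", "REPEATED", "OPTIONAL"]))
```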