[Python] Consistent to_pandas_dtype (and to_pandas) behavior for canonical extension types #50165

@aboderinsamuel

Describe the enhancement requested

Tracking / discussion issue spun out of the review on #50145 (which
implements FixedShapeTensorType.to_pandas_dtype, GH-49907).

Today all canonical extension types (bool8, json, uuid, opaque,
fixed_shape_tensor, …) inherit DataType.to_pandas_dtype, which raises
NotImplementedError. As a result to_pandas / Table.to_pandas fall back to
converting the storage (often an object/numpy column), and
Table.to_pandas(split_blocks=True) raises KeyError for these columns.

#50145 returns pandas.ArrowDtype(self) from
FixedShapeTensorType.to_pandas_dtype — a pandas ExtensionDtype implementing
__from_arrow__ — which fixes the error and yields a faithful, round-trippable
extension column on pandas >= 2.1. This issue tracks extending that approach and
the open questions raised in review.
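To check which behavior a given type currently has, a small probe along the following lines may help. This is a minimal sketch: `probe_to_pandas_dtype` and `StandInType` are hypothetical names introduced here for illustration, and the stand-in only mimics the inherited `NotImplementedError` described above so the snippet runs without pyarrow installed.

```python
def probe_to_pandas_dtype(arrow_type):
    """Call to_pandas_dtype() and report whether it is implemented.

    Returns (True, dtype) on success, or (False, message) when the
    method raises NotImplementedError.
    """
    try:
        return True, arrow_type.to_pandas_dtype()
    except NotImplementedError as exc:
        return False, str(exc)


class StandInType:
    """Hypothetical stand-in (not a pyarrow class) that mimics a type
    inheriting a to_pandas_dtype which raises NotImplementedError."""

    def __init__(self, name):
        self.name = name

    def to_pandas_dtype(self):
        raise NotImplementedError(self.name)


implemented, detail = probe_to_pandas_dtype(StandInType("uuid"))
print(implemented, detail)
```

The helper is duck-typed, so with pyarrow installed it can be pointed at a real type such as `pa.fixed_shape_tensor(pa.int32(), [2, 2])`. What it reports then depends on the installed pyarrow version and on whether #50145 is included.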

Open questions

  1. Which canonical extension types should implement to_pandas_dtype, and
    what should each return?
    pd.ArrowDtype(self) is a sensible generic default, but some types may
    map more naturally to a native pandas dtype (e.g. bool8 → a boolean dtype).
    Decide per-type vs. a shared default on BaseExtensionType — note a
    BaseExtensionType default would also change behavior for user-defined
    extension types, which relates to the ExtensionScalar.as_py() fallback in
    [Python] Use ExtensionScalar.as_py() as fallback in ExtensionArray to_pandas? #33134.

  2. Implications for to_pandas / Table.to_pandas. Returning a dtype with
    __from_arrow__ changes conversion from the storage/object fallback to a
    faithful extension-typed column. Pros: round-trips preserve the type,
    split_blocks=True works. Cons: this is a user-facing behavior change
    (changelog needed), and it is gated to pandas >= 2.1, which is needed for
    reliable ArrowDtype extension blocks (see
    [Python][CI] Extension type test fails with pandas 2.0.2 #35821).
    types_mapper continues to take precedence.

  3. Docstring cleanup. BaseExtensionType and its subclasses inherit
    to_pandas_dtype (and related methods) from DataType with no mention of
    extension-specific behavior; document this.
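The precedence discussed in question 2 can be summarized with a toy model. This is a sketch under stated assumptions, not pyarrow's conversion code: `resolve_pandas_dtype` and both stand-in classes are hypothetical, and the dtype values are placeholder strings rather than real pandas dtype objects.

```python
class BaseExtensionStandIn:
    """Hypothetical stand-in for a type that only inherits the base method."""

    def to_pandas_dtype(self):
        raise NotImplementedError("no pandas dtype for this type")


class TensorStandIn(BaseExtensionStandIn):
    """Hypothetical stand-in for a type with a per-type override."""

    def to_pandas_dtype(self):
        # Placeholder string; a real override would return a pandas dtype.
        return "ArrowDtype(fixed_shape_tensor)"


def resolve_pandas_dtype(arrow_type, types_mapper=None):
    """Toy model of the lookup order described above.

    1. types_mapper wins whenever it returns something other than None.
    2. Otherwise use arrow_type.to_pandas_dtype() if it is implemented.
    3. Otherwise return None, meaning "convert the storage" (the fallback).
    """
    if types_mapper is not None:
        mapped = types_mapper(arrow_type)
        if mapped is not None:
            return mapped
    try:
        return arrow_type.to_pandas_dtype()
    except NotImplementedError:
        return None


plain, tensor = BaseExtensionStandIn(), TensorStandIn()
print(resolve_pandas_dtype(plain))                                   # storage fallback
print(resolve_pandas_dtype(tensor))                                  # per-type override
print(resolve_pandas_dtype(tensor, types_mapper=lambda t: "custom"))  # mapper wins
```

Moving the override from `TensorStandIn` up to `BaseExtensionStandIn` would model the shared-default option from question 1: every subclass, including user-defined ones, would then stop reaching the storage fallback.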

Proposed direction

Keep #50145 scoped to fixed_shape_tensor; handle the rest as small
follow-up PRs, each with its own changelog note:

  • bool8
  • uuid
  • json
  • opaque
  • Docstring pass over BaseExtensionType + subclasses documenting
    to_pandas_dtype / to_pandas behavior

cc @AlenkaF @jorisvandenbossche

Component(s)

Python
