Tracking / discussion issue spun out of the review on #50145 (which
implements FixedShapeTensorType.to_pandas_dtype, GH-49907).
Today all canonical extension types (bool8, json, uuid, opaque, fixed_shape_tensor, …) inherit DataType.to_pandas_dtype, which raises NotImplementedError. As a result to_pandas / Table.to_pandas fall back to
converting the storage (often an object/numpy column), and Table.to_pandas(split_blocks=True) raises KeyError for these columns.
#50145 returns pandas.ArrowDtype(self) from FixedShapeTensorType.to_pandas_dtype — a pandas ExtensionDtype implementing __from_arrow__ — which fixes the error and yields a faithful, round-trippable
extension column on pandas >= 2.1. This issue tracks extending that approach and
the open questions raised in review.
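The mechanism described above can be modeled without pyarrow or pandas installed. The sketch below is a toy model, not pyarrow's conversion code: every class and the `convert_column` helper are hypothetical stand-ins. The only behavior taken from this issue is that the inherited `to_pandas_dtype` raises `NotImplementedError`, that conversion then falls back to the storage, and that a dtype exposing `__from_arrow__` yields an extension-typed column instead.

```python
class DataType:
    """Hypothetical stand-in for pyarrow.DataType (heavily simplified)."""

    name = "data_type"

    def to_pandas_dtype(self):
        # Mirrors the inherited behavior described in the issue.
        raise NotImplementedError(f"no pandas dtype for {self.name}")


class Bool8Type(DataType):
    """Stand-in canonical extension type that keeps the inherited default."""

    name = "bool8"


class ArrowDtypeStandIn:
    """Hypothetical stand-in for pandas.ArrowDtype wrapping an Arrow type."""

    def __init__(self, arrow_type):
        self.arrow_type = arrow_type

    def __from_arrow__(self, values):
        # A real dtype would build a pandas extension array here; the toy
        # model returns a tagged tuple so the chosen path is visible.
        return ("extension", self.arrow_type.name, list(values))


class FixedShapeTensorType(DataType):
    """Stand-in for the type after the change made in #50145."""

    name = "fixed_shape_tensor"

    def to_pandas_dtype(self):
        return ArrowDtypeStandIn(self)


def convert_column(arrow_type, values):
    """Toy conversion: use the dtype if one is available, else the storage."""
    try:
        dtype = arrow_type.to_pandas_dtype()
    except NotImplementedError:
        # Storage/object fallback, which loses the extension type.
        return ("storage", "object", list(values))
    return dtype.__from_arrow__(values)
```

In this model `convert_column(Bool8Type(), ...)` takes the storage path, while `convert_column(FixedShapeTensorType(), ...)` keeps the extension type name attached to the result, which is what makes a round trip possible.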
Open questions
Which canonical extension types should implement to_pandas_dtype, and to
what? pd.ArrowDtype(self) is a sensible generic default, but some types may
map more naturally to a native pandas dtype (e.g. bool8 → a boolean dtype).
Decide per-type vs. a shared default on BaseExtensionType. Note that a BaseExtensionType default would also change behavior for user-defined
extension types, which relates to the ExtensionScalar.as_py() fallback discussed in #33134 ([Python] Use ExtensionScalar.as_py() as fallback in ExtensionArray to_pandas?).
Implications for to_pandas / Table.to_pandas. Returning a dtype with __from_arrow__ changes conversion from the storage/object fallback to a
faithful extension-typed column. Pros: round-trips preserve the type, split_blocks=True works. Cons: user-facing behavior change (changelog
needed); gated to pandas >= 2.1 (reliable ArrowDtype extension blocks; see #35821, [Python][CI] Extension type test fails with pandas 2.0.2). types_mapper continues to take precedence.
Docstring cleanup. BaseExtensionType and its subclasses inherit to_pandas_dtype (and related methods) from DataType with no mention of
extension-specific behavior; document this.
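The first two questions above can be illustrated with a small dependency-free model. It is a sketch under stated assumptions, not pyarrow's implementation: `resolve_pandas_dtype`, the class names, and the string dtypes are all hypothetical, and where the pandas version gate would actually live is an open design choice. It shows types_mapper taking precedence, the pandas >= 2.1 gate, and how a shared default (variant B) changes the result for a user-defined type while a per-type override (variant A) does not.

```python
MIN_PANDAS = (2, 1)  # version gate mentioned in the issue text


class DataType:
    """Hypothetical stand-in for pyarrow.DataType."""

    name = "data_type"

    def to_pandas_dtype(self):
        raise NotImplementedError(self.name)


class BaseExtensionType(DataType):
    """Variant A: no shared default; each canonical type opts in itself."""


class SharedDefaultExtensionType(DataType):
    """Variant B: a shared default that every subclass inherits."""

    def to_pandas_dtype(self):
        # A string stands in for pandas.ArrowDtype(self) in this toy model.
        return f"ArrowDtype({self.name})"


class Bool8PerType(BaseExtensionType):
    name = "bool8"

    def to_pandas_dtype(self):
        # One possible reading of "a boolean dtype"; an assumption.
        return "boolean"


class UserTypeA(BaseExtensionType):
    name = "user_defined"


class UserTypeB(SharedDefaultExtensionType):
    name = "user_defined"


def resolve_pandas_dtype(arrow_type, types_mapper=None, pandas_version=(2, 1)):
    """Return the dtype to convert with, or None for the storage fallback."""
    # 1. An explicit types_mapper result wins over everything else.
    if types_mapper is not None:
        mapped = types_mapper(arrow_type)
        if mapped is not None:
            return mapped
    # 2. Older pandas keeps the existing storage fallback.
    if pandas_version < MIN_PANDAS:
        return None
    # 3. Otherwise ask the type, falling back if it has no answer.
    try:
        return arrow_type.to_pandas_dtype()
    except NotImplementedError:
        return None
```

Under variant A the user-defined type still resolves to `None` (storage fallback), while under variant B it silently picks up the generic default. That is the user-facing behavior change a shared default on BaseExtensionType would introduce.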
Proposed direction
Keep #50145 scoped to fixed_shape_tensor; handle the rest as small
follow-up PRs, each with its own changelog note:
- bool8
- uuid
- json
- opaque
- BaseExtensionType + subclasses documenting to_pandas_dtype / to_pandas behavior

cc @AlenkaF @jorisvandenbossche
Component(s)
Python