FrankenPandas

Clean-room Rust reimplementation of the full pandas API surface.

Drop-in replacement. Zero unsafe. Profile-proven performance.

TL;DR

The Problem: pandas is the lingua franca of data analysis, but it's single-threaded Python with unpredictable memory spikes, GIL contention in production pipelines, and dtype coercion surprises that silently corrupt results.

The Solution: FrankenPandas rebuilds the entire pandas API from first principles in Rust. Same semantics, same method names, same edge-case behavior, but with columnar storage, vectorized kernels, arena-backed execution, and compile-time safety guarantees.

Why FrankenPandas?

Feature	pandas	Polars	FrankenPandas
API compatibility with pandas	-	Partial (different API)	Full parity target
Memory safety	Runtime errors	Safe Rust	Safe Rust (`#![forbid(unsafe_code)]`)
Index alignment semantics	Yes	No (no index)	Yes (AACE)
GroupBy with named aggregation	Yes	Yes (different syntax)	Yes (`agg_named`)
`eval()`/`query()` string expressions	Yes	No	Yes
MultiIndex	Yes	No	Yes (foundation)
Categorical dtype	Yes	Yes	Yes (metadata layer)
7 IO formats (CSV/JSON/JSONL/Parquet/Excel/SQL/Feather)	Yes	Partial	Yes
Conformance testing against pandas oracle	-	-	Yes (20+ packet suites)

Quick Example

use frankenpandas::prelude::*;

// Read CSV (with pandas-style options)
let df = read_csv_str("name,age,city\nAlice,30,NYC\nBob,25,LA\nCarol,35,NYC")?;

// Filter with query expressions (just like df.query() in pandas)
let adults = df.query("age > 28")?;

// GroupBy with named aggregation
let summary = adults.groupby(&["city"])?.agg_named(&[
    ("avg_age", "age", "mean"),
    ("count", "age", "count"),
])?;

// Export to any format
let json = write_json_string(&summary, JsonOrient::Records)?;
let feather = write_feather_bytes(&summary)?;

println!("{}", summary);
// city    avg_age  count
// NYC       32.5      2

Crown-Jewel Innovation

Alignment-Aware Columnar Execution (AACE): Every binary operation between DataFrames or Series goes through an explicit index-alignment planning phase before any data is touched. An EvidenceLedger records each materialization decision with Bayesian confidence scores, creating a fully auditable execution trace.

AACE is a core identity constraint, not a best-effort optimization. pandas' alignment semantics (outer join on index for arithmetic, left join for assignment) are preserved exactly, with formal correctness evidence.

Design Philosophy

Principle	What It Means
Semantic parity over speed	We never sacrifice pandas-observable behavior for performance. Every dtype promotion, NaN propagation, and output ordering contract is preserved.
Prove, then optimize	Each optimization round produces a baseline, an opportunity matrix, an isomorphism proof, and a recommendation contract. No optimization without correctness evidence.
Fail closed	Unknown features, incompatible dtypes, and ambiguous coercions produce errors, not silent corruption. Strict mode rejects; hardened mode logs and recovers.
Zero unsafe	Every crate uses `#![forbid(unsafe_code)]`. Memory safety comes from the type system, not audits.
Test everything differentially	Conformance packets run FrankenPandas operations and compare against the pandas oracle. Machine-readable parity reports catch regressions automatically.

Architecture

                  ┌────────────────────────┐
                  │     frankenpandas       │  ← Unified facade crate
                  │   (prelude re-exports)  │
                  └───────────┬────────────┘
                              │
       ┌──────────────────────┼──────────────────────┐
       ▼                      ▼                      ▼
  ┌──────────┐      ┌────────────────┐      ┌──────────────┐
  │ fp-expr   │      │   fp-frame     │      │    fp-io     │
  │ eval()    │      │ DataFrame      │      │  7 formats   │
  │ query()   │      │ Series         │      │  CSV/JSON/   │
  └─────┬─────┘      │ Categorical    │      │  JSONL/      │
        │            │ MultiIndex     │      │  Parquet/    │
        ▼            └───────┬────────┘      │  Excel/SQL/  │
  ┌──────────┐               │               │  Feather     │
  │fp-runtime│        ┌──────┼──────┐        └──────────────┘
  │ Policy   │        ▼      ▼      ▼
  │ Ledger   │   fp-index fp-groupby fp-join
  └──────────┘   alignment  arena-   merge_asof
                  planning  backed   cross join
                            agg
       ┌────────────────────┘
       ▼
  ┌────────────┐      ┌───────────────┐
  │ fp-columnar │      │   fp-types    │
  │ Column      │      │ Scalar, DType │
  │ ValidMask   │      │ NaN/NaT/Null  │
  └────────────┘      └───────────────┘

12 workspace crates, 84,000+ lines of Rust, 788 public functions.

Workspace Structure

frankenpandas/
├── crates/
│   ├── frankenpandas/    # Unified facade crate with prelude
│   ├── fp-types/         # Scalar, DType, NullKind, coercion rules
│   ├── fp-columnar/      # Column, ValidityMask, vectorized kernels
│   ├── fp-index/         # Index, MultiIndex, alignment planning
│   ├── fp-frame/         # DataFrame, Series, Categorical, 788 methods
│   ├── fp-expr/          # Expression parser, eval()/query()
│   ├── fp-groupby/       # GroupBy with 3 execution paths
│   ├── fp-join/          # Inner/Left/Right/Outer/Cross/Asof joins
│   ├── fp-io/            # 7 IO formats, 95 tests
│   ├── fp-conformance/   # Differential testing against pandas oracle
│   ├── fp-runtime/       # Strict/Hardened policy, EvidenceLedger
│   └── fp-frankentui/    # Terminal UI dashboard (experimental)
├── artifacts/perf/       # Optimization round evidence
└── artifacts/phase2c/    # Conformance packet artifacts

IO Format Support

Format	Read	Write	In-Memory	File	Options
CSV	`read_csv_str`	`write_csv_string`	Yes	Yes	delimiter, headers, na_values, index_col, usecols, nrows, skiprows, dtype
JSON	`read_json_str`	`write_json_string`	Yes	Yes	5 orients: Records, Columns, Index, Split, Values
JSONL	`read_jsonl_str`	`write_jsonl_string`	Yes	Yes	One object per line, blank-line tolerant, union-key detection
Parquet	`read_parquet_bytes`	`write_parquet_bytes`	Yes	Yes	Arrow RecordBatch integration, multi-batch reading
Excel	`read_excel_bytes`	`write_excel_bytes`	Yes	Yes	sheet_name, has_headers, index_col, skip_rows; .xlsx/.xls/.xlsb/.ods
Feather	`read_feather_bytes`	`write_feather_bytes`	Yes	Yes	Arrow IPC file + stream formats
SQL	`read_sql`	`write_sql`	N/A	SQLite	query or table, SqlIfExists (Fail/Replace/Append), transaction-wrapped

All formats accessible through DataFrameIoExt trait methods on DataFrame.

Installation

Add to your Cargo.toml:

[dependencies]
frankenpandas = { path = "crates/frankenpandas" }

Or use individual crates for finer-grained control:

[dependencies]
fp-frame = { path = "crates/fp-frame" }
fp-io = { path = "crates/fp-io" }
fp-types = { path = "crates/fp-types" }

Build from Source

git clone https://github.com/Dicklesworthstone/frankenpandas.git
cd frankenpandas
cargo build --workspace --release
cargo test --workspace

Requires Rust nightly (2024 edition). See rust-toolchain.toml.

Quick Start

use frankenpandas::prelude::*;

// 1. Create a DataFrame
let df = read_csv_str("ticker,price,volume\nAAPL,185.50,1000\nGOOG,140.25,500\nAAPL,186.00,1200")?;

// 2. Filter rows
let expensive = df.query("price > 150")?;

// 3. GroupBy and aggregate
let by_ticker = expensive.groupby(&["ticker"])?.sum()?;

// 4. Convert dates
let dates = Series::from_values("d", vec![0i64.into()], vec![Scalar::Utf8("2024-01-15".into())])?;
let parsed = to_datetime(&dates)?;

// 5. Export to multiple formats
let csv = write_csv_string(&by_ticker)?;
let json = write_json_string(&by_ticker, JsonOrient::Records)?;
let feather = write_feather_bytes(&by_ticker)?;

// 6. SQL round-trip
let conn = rusqlite::Connection::open_in_memory()?;
write_sql(&by_ticker, &conn, "results", SqlIfExists::Fail)?;
let back = read_sql_table(&conn, "results")?;

How It Works: Deep Dive

Data Model

FrankenPandas uses a columnar storage model identical to pandas' internal representation:

DataFrame
├── index: Index (Vec<IndexLabel>)  ← Row labels (Int64 or Utf8)
├── columns: BTreeMap<String, Column>  ← Named columns
└── column_order: Vec<String>  ← Insertion-order preservation

Column
├── dtype: DType  ← {Null, Bool, Int64, Float64, Utf8}
├── values: Vec<Scalar>  ← Typed values
└── validity: ValidityMask  ← Bitpacked null bitmap

Scalar = Null(NullKind) | Bool(bool) | Int64(i64) | Float64(f64) | Utf8(String)
NullKind = Null | NaN | NaT  ← Three-way null semantics matching pandas

Every Scalar knows its own type, and every Column enforces type homogeneity through DType. The type promotion hierarchy follows pandas exactly: Null < Bool < Int64 < Float64 (with Utf8 incompatible with numerics).

ValidityMask: Bitpacked Null Tracking

Nulls are tracked using a packed bitmap, not per-element Option types:

ValidityMask {
    words: Vec<u64>,  // Each u64 holds 64 validity bits
    len: usize,       // Actual number of elements
}

Bit 1 = valid, Bit 0 = null/missing
Element at position i is in word i / 64, bit i % 64
Last word is masked to clear unused high bits: (1u64 << remainder) - 1
Boolean algebra: and_mask, or_mask, not_mask operate word-by-word on u64s, processing 64 null checks per CPU instruction
count_valid() uses word.count_ones() (hardware POPCNT) across all words

This is the same approach used by Apache Arrow and gives O(n/64) performance for null-aware operations instead of O(n).

Index Alignment: The AACE Pipeline

When you write series_a + series_b in pandas, it silently performs an outer join on the indexes before adding. FrankenPandas makes this explicit:

Step 1: Plan
   align_union(left.index, right.index) → AlignmentPlan {
       union_index: Index,          // Merged label set
       left_positions: Vec<Option<usize>>,   // Where each left value goes
       right_positions: Vec<Option<usize>>,  // Where each right value goes
   }

Step 2: Materialize
   Reindex both columns according to the plan.
   None positions → fill with NaN (missing propagation).

Step 3: Execute
   Vectorized element-wise operation on aligned arrays.

Step 4: Log
   EvidenceLedger records the decision: timestamp, mode, action,
   Bayesian posterior, expected losses.

The alignment planner uses adaptive lookup (AG-13): it detects whether the index is sorted (via a lazily-computed OnceCell<SortOrder>) and switches between O(log n) binary search for sorted indexes and O(n) HashMap lookup for unsorted ones.

For multi-way alignment (e.g., DataFrame.from_series([s1, s2, s3])), a leapfrog triejoin variant (AG-05) computes the N-way union in a single O(n log n) pass instead of iterative pairwise merges.

GroupBy: Three Execution Paths

The GroupBy engine automatically selects the fastest execution path based on key cardinality and memory budget:

                       ┌───────────────────────┐
                       │  All keys Int64 AND    │
                       │  key range ≤ 65,536?   │
                       └───────────┬────────────┘
                          Yes      │       No
                    ┌──────────────┘       └──────────┐
                    ▼                                  ▼
          ┌──────────────────┐             ┌──────────────────┐
          │ Dense Int64 Path  │             │ HashMap Generic   │
          │ O(1) array index  │             │ Path (fallback)   │
          │ Pre-alloc by key  │             │ (source_idx, sum) │
          │ range: max-min    │             │ pairs, no clone   │
          └──────────────────┘             └──────────────────┘

Path 1, Dense Int64: When all group keys are integers spanning ≤ 65,536 values, pre-allocates a dense array indexed directly by key - min_key. O(1) per-element grouping with zero hash overhead. Used for common patterns like grouping by year, month, category ID.

Path 2, Arena-backed (Bumpalo): When estimated intermediate memory fits within the arena budget (default 256 MB), allocates all working memory from a Bumpalo bump allocator. Single malloc + pointer bumps; bulk deallocation when the arena drops. Zero fragmentation, cache-friendly.

Path 3, Global allocator HashMap: Fallback for arbitrary key types and unbounded cardinality. Stores (source_index, accumulating_sum) pairs and never clones the group key Scalar itself (AG-08 optimization). The original IndexLabel is reconstructed at output time from the source position.

All three paths produce identical output. This is verified by property-based tests that run the same inputs through arena and non-arena paths and assert bitwise equality.

Vectorized Kernels

Arithmetic on typed arrays avoids per-element enum dispatch:

// Instead of matching each Scalar variant per element:
fn vectorized_binary_f64(
    left: &[f64], right: &[f64],
    left_validity: &ValidityMask, right_validity: &ValidityMask,
    op: fn(f64, f64) -> f64,
) -> (Vec<f64>, ValidityMask)

The compiler auto-vectorizes the inner loop to SIMD instructions. Combined validity is computed via and_mask on the bitmap words, again handling 64 elements per instruction.

When types don't align for the fast path (e.g., Int64 + Float64), a scalar fallback promotes each element through cast_scalar(), respecting the full pandas type hierarchy.

Expression Engine

df.eval("a + b * c > threshold") and df.query("price > 100 and volume < 500") are powered by a recursive-descent parser in fp-expr:

Grammar (simplified):
  expr       → or_expr
  or_expr    → and_expr ( "or" and_expr )*
  and_expr   → comparison ( "and" comparison )*
  comparison → arithmetic ( (">" | "<" | "==" | ...) arithmetic )?
  arithmetic → term ( ("+" | "-") term )*
  term       → atom ( ("*" | "/") atom )*
  atom       → NUMBER | COLUMN_NAME | @LOCAL_VAR | "(" expr ")"

The parser produces an Expr AST that the evaluator walks, resolving column references against the DataFrame's EvalContext. Local variables (prefixed with @) are broadcast to Series of the appropriate length. The entire pipeline (parse, resolve, evaluate, filter) happens in a single call with no temporary DataFrames.

Bayesian Runtime Policy

The runtime distinguishes between strict mode (fail on any ambiguity) and hardened mode (log and attempt repair). The decision between Allow, Reject, and Repair uses Bayesian expected-loss minimization:

For each compatibility issue:
  1. Start with prior P(compatible) based on issue type
     - Unknown feature: 0.25
     - Join admission: 0.60

  2. Update with evidence terms (log-likelihood ratios)
     - allowlist_miss:        compat=-3.5, incomp=-0.2
     - unknown_protocol_field: compat=-2.0, incomp=-0.1
     - overflow_risk:         compat=-0.3, incomp=-1.2

  3. Compute posterior via Bayes' theorem

  4. Calculate expected loss for each action:
     E[loss(Allow)]  = L(allow|compat) * P(compat) + L(allow|incomp) * P(incomp)
     E[loss(Reject)] = L(reject|compat) * P(compat) + L(reject|incomp) * P(incomp)
     E[loss(Repair)] = L(repair|compat) * P(compat) + L(repair|incomp) * P(incomp)

  5. Choose action = argmin expected loss

The asymmetric loss matrix penalizes "allow if incompatible" (100.0) far more than "reject if compatible" (6.0), making the system conservative by default. Every decision is recorded in the EvidenceLedger with full trace: timestamp, mode, prior, posterior, Bayes factor, and the evidence terms that drove it.

Categorical Data

Rather than adding a DType::Categorical variant (which would require changing 52+ match arms across 4 crates), categoricals are implemented as a metadata layer on Series:

Series {
    column: Column<Int64>,  // Stores integer codes (0, 1, 2, ...)
    categorical: Some(CategoricalMetadata {
        categories: Vec<Scalar>,  // ["low", "medium", "high"]
        ordered: bool,            // Whether categories have total ordering
    })
}

The .cat() accessor provides pandas-compatible operations: categories(), codes(), rename_categories(), add_categories(), remove_unused_categories(), set_categories(), as_ordered(), as_unordered(), to_values(). Missing values use code -1.

MultiIndex

Hierarchical indexing is represented as parallel label vectors (one per level):

MultiIndex {
    levels: Vec<Vec<IndexLabel>>,  // levels[0] = ["a","a","b","b"]
    names: Vec<Option<String>>,    //            levels[1] = [1, 2, 1, 2]
}

Constructors mirror pandas: from_tuples, from_arrays, from_product (Cartesian product). Operations: get_level_values(level), droplevel(level) → MultiIndexOrIndex, swaplevel(i,j), reorder_levels(order), to_flat_index(sep).

DataFrame integration via set_index_multi(&["col1", "col2"], drop, sep) creates composite index labels, and to_multi_index(&["col1", "col2"]) extracts a standalone MultiIndex from columns.

String Accessor

The .str() accessor provides 50+ string operations matching pandas Series.str:

Category	Methods
Case	`lower`, `upper`, `capitalize`, `title`, `casefold`, `swapcase`
Whitespace	`strip`, `lstrip`, `rstrip`, `expandtabs`
Search	`contains`, `startswith`, `endswith`, `find`, `rfind`, `index_of`, `rindex_of`
Transform	`replace`, `slice`, `repeat`, `pad`, `zfill`, `center`, `ljust`, `rjust`
Split/Join	`split_get`, `split_count`, `join`, `partition`, `rpartition`
Predicates	`isdigit`, `isalpha`, `isalnum`, `isspace`, `islower`, `isupper`, `isnumeric`, `isdecimal`, `istitle`
Regex	`contains_regex`, `replace_regex`, `replace_regex_all`, `extract`, `count_matches`, `findall`, `fullmatch`, `match_regex`
Prefix/Suffix	`removeprefix`, `removesuffix`
Other	`len`, `get`, `wrap`, `normalize`, `cat`, `get_dummies`

Join Types

Type	Behavior	Output Size
Inner	Only rows with keys in both sides	≤ min(left, right)
Left	All left rows; right fills missing with NaN	= left
Right	All right rows; left fills missing with NaN	= right
Outer	All rows from both sides	≤ left + right
Cross	Cartesian product (no key matching)	= left × right
Asof (backward)	Last right row where `right_key ≤ left_key`	= left
Asof (forward)	First right row where `right_key ≥ left_key`	= left
Asof (nearest)	Closest right row by absolute distance	= left

merge_asof is particularly useful for time-series data, e.g., joining trades with quotes at the nearest preceding timestamp.

Window Operations

Full rolling, expanding, and exponentially-weighted window support on both Series and DataFrame:

// Rolling window (like df["col"].rolling(30).mean())
let ma_30 = series.rolling(30, None).mean()?;
let vol   = series.rolling(252, Some(20)).std()?;

// Expanding window (cumulative)
let cum_max = series.expanding(None).max()?;

// Exponentially weighted moving average
let ewma = series.ewm(Some(20.0), None).mean()?;

// Time-based resampling
let monthly = series.resample("M").sum()?;

Window Type	Series Methods	DataFrame Methods
Rolling	`sum`, `mean`, `min`, `max`, `std`, `var`, `count`, `median`, `quantile`, `apply`	`sum`, `mean`, `min`, `max`, `std`, `var`, `count`, `median`, `quantile`
Expanding	`sum`, `mean`, `min`, `max`, `std`, `var`, `median`, `apply`	`sum`, `mean`, `min`, `max`, `std`, `var`, `median`
EWM	`mean`, `std`, `var`	`mean`, `std`, `var`
Resample	`sum`, `mean`, `count`, `min`, `max`, `first`, `last`	`sum`, `mean`, `count`, `min`, `max`, `first`, `last`

GroupBy also supports rolling() and resample() for within-group window operations.

Reshaping

All major pandas reshaping operations are implemented:

// Long → Wide
let pivoted = df.pivot_table("revenue", "region", "product", "sum")?;

// Wide → Long
let melted = df.melt(&["id"], &["q1", "q2", "q3"], "quarter", "sales")?;

// Stack/Unstack (with composite key round-trip)
let stacked = df.stack()?;
let unstacked = stacked.unstack()?;

// Contingency tables
let ct = df.crosstab("gender", "department")?;
let ct_norm = df.crosstab_normalize("gender", "department", "all")?;

// One-hot encoding
let dummies = df.get_dummies(&["color", "size"])?;

// Cross-section selection
let row = df.xs("2024-01-15")?;

DataFrame Output Formats

17 output methods for different consumption contexts:

Method	pandas Equivalent	Format
`to_csv(sep, include_index)`	`df.to_csv()`	Comma/tab-separated values
`to_json(orient)`	`df.to_json()`	JSON with 5 orients
`to_string_table(include_index)`	`df.to_string()`	Aligned ASCII table
`to_string_truncated(idx, rows, cols)`	`df.to_string(max_rows=)`	Truncated with head/tail + "..."
`to_html(include_index)`	`df.to_html()`	HTML `<table>`
`to_latex(include_index)`	`df.to_latex()`	LaTeX `tabular`
`to_markdown(include_index)`	`df.to_markdown()`	GitHub-flavored markdown table
`to_dict(orient)`	`df.to_dict()`	dict/list/records/index/split/tight
`to_series_dict()`	`df.to_dict('series')`	`BTreeMap<String, Series>`
`to_records()`	`df.to_records()`	Vec of row vectors
`to_numpy_2d()`	`df.to_numpy()`	`Vec<Vec<f64>>`
`Display` trait	`print(df)`	Column-aligned with shape footer

GroupBy: Complete Aggregation Matrix

14 aggregation functions available through string dispatch:

Function	Returns	Notes
`sum`	Float64	Null-skipping
`mean`	Float64	Null-skipping
`count`	Int64	Non-null count
`min`	Same as input	Null-skipping
`max`	Same as input	Null-skipping
`std`	Float64	Sample standard deviation (ddof=1)
`var`	Float64	Sample variance (ddof=1)
`median`	Float64	Middle value
`first`	Same as input	First non-null
`last`	Same as input	Last non-null
`prod`	Float64	Product of values
`sem`	Float64	Standard error of mean
`skew`	Float64	Fisher's skewness
`kurt`/`kurtosis`	Float64	Excess kurtosis

Plus group-level operations: cumsum, cumprod, cummax, cummin, rank, shift, diff, nth, head, tail, pct_change, value_counts, describe, get_group, cumcount, ngroup, pipe, ohlc.

Datetime Parsing and Accessors

to_datetime() auto-detects common formats and normalizes to ISO 8601:

Input Format	Example	Auto-Detected?
ISO 8601 date	`2024-01-15`	Yes
ISO 8601 datetime	`2024-01-15T10:30:00`	Yes
Space-separated	`2024-01-15 10:30:00`	Yes
Slash date	`2024/01/15`	Yes
US date (MM/DD/YYYY)	`01/15/2024`	Yes
Epoch seconds (Int64)	`1705312200`	Yes
Epoch milliseconds	`1705312200000`	Yes (auto-detected from magnitude > 10^11)
Custom format	`15-Jan-2024`	Via `to_datetime_with_format(s, Some("%d-%b-%Y"))`

The .dt accessor provides 20+ component extraction methods:

let years   = series.dt().year()?;     // Extract year
let months  = series.dt().month()?;    // 1-12
let dow     = series.dt().dayofweek()?; // Mon=0..Sun=6
let quarter = series.dt().quarter()?;  // 1-4
let woy     = series.dt().weekofyear()?; // ISO week 1-53
let fmt     = series.dt().strftime("%Y-%m-%d")?; // Custom format

Timezone support: tz_localize(tz), tz_convert(tz) with chrono-tz for IANA timezone names.

to_timedelta() parses duration strings with similar flexibility:

// All of these work:
// "02:30:45"          → HH:MM:SS
// "3 days 04:15:30"   → pandas-style timedelta
// "5 days"            → day-only
// "3 hours"           → natural language
// Int64(3661)         → seconds (→ "01:01:01")

Describe: Statistical Summary

DataFrame.describe() generates the same 8-row statistical summary as pandas:

         price      volume
count     8.0         8.0
mean    183.9       862.5
std       3.2       261.1
min     140.3       500.0
25%     141.4       575.0
50%     185.8       850.0
75%     186.3      1075.0
max     187.3      1200.0

Supports custom percentiles (describe_with_percentiles(&[0.1, 0.5, 0.9])) and dtype-filtered describe (describe_dtypes(&["number"], &[]) for numeric only, or include "object" for string columns which produce count/unique/top/freq).

Correlation and Covariance

Pairwise correlation and covariance matrices:

let pearson  = df.corr()?;               // Pearson (default)
let spearman = df.corr_method("spearman")?; // Rank correlation
let kendall  = df.corr_method("kendall")?;  // Kendall tau
let cov_mat  = df.cov()?;               // Covariance matrix
let corr_w   = df.corrwith(&other_df)?;  // Column-wise correlation

Series-level: series.corr(&other), series.cov_with(&other), series.autocorr(lag).

Apply and Transform

Multiple ways to apply custom logic:

// Element-wise on each column
let transformed = df.applymap(|scalar| { /* transform */ })?;

// Row-wise with full row access
let result = df.apply_row(|row_values| { /* produce scalar */ })?;

// Shape-preserving transform (output same shape as input)
let normed = df.transform("zscore")?;

// Column assignment with closures (pandas df.assign(new_col=lambda df: ...))
let df2 = df.assign_fn(vec![
    ("ratio", Box::new(|df| {
        // Compute new column from existing DataFrame
        Ok(compute_ratio_column(df))
    })),
])?;

// Pipe for method chaining
let result = df.pipe(|d| d.query("x > 0"))?.pipe(|d| d.sort_values("x", true))?;

Module-Level Functions

Functions matching pandas top-level API:

Function	pandas Equivalent	Description
`concat_dataframes`	`pd.concat`	Concatenate along axis 0 or 1 with join modes
`concat_dataframes_with_keys`	`pd.concat(keys=)`	Hierarchical index labeling
`to_datetime`	`pd.to_datetime`	Parse dates from strings/epochs
`to_timedelta`	`pd.to_timedelta`	Parse durations from strings/seconds
`to_numeric`	`pd.to_numeric`	Coerce to numeric with NaN for failures
`cut`	`pd.cut`	Equal-width binning
`qcut`	`pd.qcut`	Quantile-based binning
`timedelta_total_seconds`	`Series.dt.total_seconds()`	Convert timedelta to seconds

Performance

Five optimization rounds with formal evidence:

Round	Optimization	Speedup
Round 2	`align_union` borrowed-key HashMap	Eliminates index clones
Round 3	GroupBy identity-alignment fast path	Skips reindex when indexes match
Round 4	Dense Int64 aggregation path	O(1) array access, no HashMap
Round 5	`has_duplicates` OnceCell memoization	87% faster on groupby benchmark

Measured Baselines (10K rows, debug profile)

Operation	p50	p95	Notes
Join (inner, 50% overlap)	12ms	19ms	HashMap-based equijoin
Join (outer, 50% overlap)	27ms	37ms	Union index construction
Boolean filter (5 cols)	15ms	20ms	Reindex all columns by mask
`head(100)`	22us	28us	O(1) slice, no copy
Scalar add (5 cols)	2.5ms	2.8ms	Per-column vectorized
Scalar comparison (5 cols)	3.0ms	4.1ms	Per-column with Bool output

Performance baselines tracked for join (inner/left/right/outer), filter (boolean mask, head/tail), and DataFrame arithmetic at 10K-100K row scales. Benchmarks run via cargo test -p fp-conformance --test perf_baselines -- --ignored --nocapture.

Optimization Catalog

FrankenPandas applies 14 named optimization techniques (AG-01 through AG-15) drawn from the alien-graveyard systems-pattern library. Each is independently toggled and proven via isomorphism tests:

ID	Technique	Where Applied	What It Does
AG-02	Borrowed-key HashMap	fp-index `align_union`, fp-join build phase	Builds position maps using `&IndexLabel` references instead of cloning labels. Eliminates O(n) allocations in the join build phase.
AG-03	Identity-cast skip	fp-types `cast_scalar_owned`	When source dtype already equals target dtype, returns the value without cloning. Saves one allocation per element in column coercion.
AG-05	N-way leapfrog triejoin	fp-index `multi_way_align`	Computes the union of N indexes in a single O(n log n) sorted-merge pass instead of iterative pairwise O(n^2 log n). Used by `DataFrame::from_series`.
AG-06	Arena-backed execution	fp-groupby, fp-join	Routes intermediate allocations through Bumpalo bump allocator. Single `malloc`, pointer bumps, bulk dealloc. Configurable budget (default 256 MB).
AG-07	Vec-based column accumulation	fp-io CSV parser	Pre-allocates `Vec<Vec<Scalar>>` with capacity hints from input byte count. O(1) amortized per cell vs O(log c) BTreeMap insertion.
AG-08	Source-index referencing	fp-groupby HashMap path	Stores `(source_row_index, accumulator)` instead of `(Scalar_clone, accumulator)`. Reconstructs group key labels at output time, avoiding per-group Scalar clones.
AG-10	Typed-array vectorization	fp-columnar binary ops	Dispatches to `&[f64]` / `&[i64]` typed arrays instead of per-element `match Scalar`. Enables LLVM auto-vectorization to SIMD.
AG-11	Fast-path alignment skip	fp-frame, fp-groupby	When both operands share the same index and have no duplicates, skips the alignment planning phase entirely. O(1) check via `OnceCell` memoization.
AG-13	Adaptive sort-order lookup	fp-index `position()`	Lazily detects whether an index is sorted ascending (Int64 or Utf8). Uses O(log n) binary search for sorted, O(n) scan for unsorted. Sort order cached in `OnceCell`.
AG-14	Alignment plan validation	fp-index	Debug-mode assertion that position vectors have consistent lengths and valid indices. Catches alignment bugs at the source.

DType System and Coercion Rules

The type hierarchy determines how values are promoted when columns with different types interact:

Null (bottom type; promotes to anything)
 │
 ├── Bool
 │     └── Int64
 │           └── Float64
 │
 └── Utf8 (incompatible with numeric branch)

Coercion rules (matching pandas exactly):

Left	Right	Result	Example
Null	Any	That type	`Null + Int64 → Int64`
Bool	Int64	Int64	`True + 3 → 4`
Bool	Float64	Float64	`True + 1.5 → 2.5`
Int64	Float64	Float64	`3 + 1.5 → 4.5`
Utf8	Int64	Error	Incompatible; fails closed
Utf8	Float64	Error	Incompatible; fails closed
Same	Same	Same	Identity; no coercion needed (AG-03 fast path)

The identity-cast optimization (AG-03) detects when source dtype already matches target dtype and skips the clone entirely, which matters when cast_scalar_owned() is called millions of times during column coercion.

infer_dtype(values) folds common_dtype() across all elements to find the narrowest type that fits. This is used during CSV/JSON parsing where cell types are inferred individually and then unified per-column.

Null Propagation Semantics

FrankenPandas distinguishes three kinds of missing values, exactly matching pandas:

Kind	Meaning	Created By	`is_missing()`
`Null`	Generic absence	Missing CSV cells, JSON `null`, outer join mismatches	`true`
`NaN`	Float not-a-number	`0.0 / 0.0`, explicit `f64::NAN`, Float64 column nulls	`true`
`NaT`	Not-a-time	Failed datetime parse, timedelta overflow	`true`

Propagation rules:

Arithmetic with missing: 5 + NaN → NaN, NaN + NaN → NaN
Comparison with missing: NaN == NaN → false (IEEE 754), NaN != NaN → true
But semantic_eq(NaN, NaN) → true for index dedup and testing
Aggregation skips nulls by default: nansum([1, NaN, 3]) → 4
GroupBy with dropna=true excludes null group keys

The ValidityMask makes null checking O(1) per element (single bit test) and O(n/64) for bulk operations (word-level popcount).

NanOps: Null-Aware Aggregation Library

The fp-types crate provides 10 null-skipping aggregation primitives that underpin all statistical operations:

// All skip Null/NaN values automatically:
nansum(&values)       // → Scalar::Float64 (sum of non-missing)
nanmean(&values)      // → Scalar::Float64 (mean of non-missing)
nancount(&values)     // → Scalar::Int64 (count of non-missing)
nanmin(&values)       // → same type as input minimum
nanmax(&values)       // → same type as input maximum
nanmedian(&values)    // → Scalar::Float64 (middle value)
nanvar(&values, ddof) // → Scalar::Float64 (sample variance, ddof=1 default)
nanstd(&values, ddof) // → Scalar::Float64 (sample std dev)
nanprod(&values)      // → Scalar::Float64 (product of non-missing)
nannunique(&values)   // → Scalar::Int64 (count of unique non-missing)

These are the building blocks for Series.sum(), DataFrame.mean(), GroupBy.std(), describe(), and every other statistical method. The "skip nulls by default" behavior matches pandas' skipna=True default.

Empty inputs or all-null inputs return NaN for float aggregations and 0 for count, matching pandas exactly.

Error Architecture

Every crate has its own typed error enum, all implementing std::error::Error + Display:

Error Type	Crate	Key Variants
`TypeError`	fp-types	`IncompatibleDtypes { left, right }`
`ColumnError`	fp-columnar	`LengthMismatch`, `DtypeMismatch`, `EmptyColumn`
`IndexError`	fp-index	`OutOfBounds { position, length }`, `LengthMismatch`, `InvalidAlignmentVectors`
`FrameError`	fp-frame	`LengthMismatch`, `DuplicateIndexUnsupported`, `CompatibilityRejected(String)`
`ExprError`	fp-expr	`ParseError(String)`, `UnknownColumn(String)`, `UnknownLocal(String)`
`JoinError`	fp-join	wraps `FrameError` + join-specific failures
`GroupByError`	fp-groupby	wraps `FrameError` + aggregation failures
`IoError`	fp-io	`MissingHeaders`, `MissingIndexColumn`, `Csv(...)`, `Json(...)`, `Parquet(...)`, `Excel(...)`, `Arrow(...)`, `Sql(...)`

All error types are re-exported through the frankenpandas facade crate, so users can pattern-match without importing internal crates:

use frankenpandas::{IoError, FrameError, ExprError};

match result {
    Err(IoError::MissingHeaders) => eprintln!("CSV has no headers"),
    Err(IoError::Sql(msg)) => eprintln!("SQL error: {msg}"),
    _ => {}
}

DataFrame Constructors

15+ ways to create a DataFrame, matching every pandas construction pattern:

Constructor	pandas Equivalent	Example
`from_dict(&[col_order], data)`	`pd.DataFrame({"a": [1,2], "b": [3,4]})`	Column-oriented dict
`from_dict_with_index(data, labels)`	`pd.DataFrame(data, index=[...])`	With custom index
`from_dict_mixed(data)`	`pd.DataFrame({"a": [1], "b": ["x"]})`	Mixed types per column
`from_series(vec![s1, s2])`	`pd.DataFrame({"a": s1, "b": s2})`	From Series with alignment
`from_records(records, columns)`	`pd.DataFrame.from_records(...)`	Row-oriented records
`from_tuples(rows, columns)`	`pd.DataFrame([(1,"a"), (2,"b")])`	Tuple rows
`from_tuples_with_index(rows, cols, idx)`	Same with custom index	Tuples + index labels
`from_csv(text, sep)`	`pd.read_csv(StringIO(text))`	Inline CSV string
`from_dict_index(data)`	`pd.DataFrame.from_dict(data, orient='index')`	Row-keyed dict
`from_dict_index_columns(data, cols)`	Same with column names	Row-keyed + col names
`DataFrame::new(index, columns)`	Low-level construction	Direct index + BTreeMap

When constructing from multiple Series with different indexes, from_series automatically performs N-way index alignment (AG-05 leapfrog triejoin) to produce a DataFrame with the union of all index labels.

Merge: Advanced Options

Beyond the basic join types, the merge system supports pandas' full merge parameter set:

// Merge with validation (like pandas validate='one_to_one')
let merged = df1.merge_with_options(&df2, &["key"], JoinType::Inner,
    MergeOptions {
        validate_mode: Some(MergeValidateMode::OneToOne),
        ..Default::default()
    })?;

// Merge with indicator column (like pandas indicator=True)
// Adds a column showing "left_only", "right_only", or "both"
let merged = df1.merge_with_options(&df2, &["key"], JoinType::Outer,
    MergeOptions {
        indicator_name: Some("_merge".to_owned()),
        ..Default::default()
    })?;

// Merge with custom suffixes for overlapping columns
let merged = df1.merge_with_options(&df2, &["key"], JoinType::Inner,
    MergeOptions {
        suffixes: Some([Some("_left".to_owned()), Some("_right".to_owned())]),
        ..Default::default()
    })?;

Validation modes catch data quality issues at merge time:

Mode	Constraint	Fails When
`OneToOne`	Each key appears once in both sides	Duplicates in either side
`OneToMany`	Keys unique in left, may repeat in right	Duplicates in left
`ManyToOne`	Keys may repeat in left, unique in right	Duplicates in right
`ManyToMany`	No constraint (default)	Never

Testing

1,500+ tests across the workspace:

Category	Count	What It Covers
fp-frame unit tests	1,016	DataFrame, Series, Categorical, MultiIndex integration
fp-io tests	96	7 IO formats, adversarial inputs, round-trip correctness
fp-index tests	119	Index alignment, MultiIndex, duplicate detection
fp-join tests	58	Inner/Left/Right/Outer/Cross/Asof joins, merge_asof edge cases
fp-expr tests	45	Expression parsing, eval/query, @local variables
fp-conformance tests	117	Differential conformance against pandas oracle
Property-based (proptest)	75	DType coercion, CSV/JSON/SQL/Excel/Feather round-trip, ValidityMask algebra, DataFrame arithmetic invariants
End-to-end pipeline	11	Full workflow: CSV read → query → groupby → merge → sort → export
Performance baselines	13	Join/filter/arithmetic latency measurement

Conformance Gate

./scripts/phase2c_gate_check.sh

Regenerates conformance packet artifacts and fails closed if any parity report or gate is not green. 20+ packet suites covering alignment, join, groupby, concat, filter, CSV, dtype, null semantics, and more.

Missing Data Handling

Complete pandas-compatible toolkit for detecting, filling, and removing missing values:

Detection

let nulls = series.isna()?;        // Bool Series: true where null/NaN
let valid = series.notna()?;       // Bool Series: true where valid
let count = series.count();        // Count of non-missing values
let has   = series.hasnans();      // true if any missing values exist

DataFrame-level: df.isna(), df.notna(), df.isnull(), df.notnull() return DataFrames of Booleans. first_valid_index() and last_valid_index() scan for the first/last non-null row.

Filling

// Fill with constant
let filled = series.fillna(&Scalar::Float64(0.0))?;

// Forward fill (propagate last valid value)
let ffilled = series.ffill(None)?;      // No limit
let ffilled = series.ffill(Some(3))?;   // Fill at most 3 consecutive NaN

// Backward fill (propagate next valid value)
let bfilled = series.bfill(Some(2))?;

// Linear interpolation
let interp = series.interpolate()?;

// Combine two DataFrames, filling nulls from `other`
let combined = df.combine_first(&other)?;  // Like pd.DataFrame.combine_first

// Update in place from another DataFrame
let updated = df.update(&other)?;  // Non-null values in `other` overwrite `df`

DataFrame-level: df.fillna(&scalar), df.ffill(limit), df.bfill(limit), df.interpolate(), df.fillna_method("ffill") / df.fillna_method("bfill").

Dropping

// Drop rows with any null (default)
let clean = df.dropna()?;

// Drop rows where ALL values are null
let clean = df.dropna_with_options(DropNaHow::All, None)?;

// Drop rows with nulls in specific columns only
let clean = df.dropna_with_options(DropNaHow::Any, Some(&["price".into(), "volume".into()]))?;

// Drop rows with fewer than N non-null values (thresh)
let clean = df.dropna_with_thresh(3)?;

// Drop COLUMNS with nulls (axis=1)
let clean = df.dropna_columns()?;

Type Coercion and Conversion

// Cast Series to a specific dtype
let float_col = int_series.astype(DType::Float64)?;

// Cast DataFrame column
let df2 = df.astype_column("price", DType::Float64)?;

// Cast multiple columns at once
let df2 = df.astype_columns(&[("id", DType::Utf8), ("score", DType::Float64)])?;

// Auto-infer best dtypes (Utf8 → Int64/Float64 where possible)
let df2 = df.convert_dtypes()?;
let df2 = df.infer_objects()?;  // Same idea, pandas-compatible name

// Coerce to numeric with NaN for failures
let numeric = to_numeric(&string_series)?;

The common_dtype() function determines the result type when combining two columns. It follows pandas' promotion rules exactly: Bool + Int64 → Int64, Int64 + Float64 → Float64, Utf8 + numeric → Error.

Element-Wise Operations

Arithmetic

// Scalar operations
let doubled = df.mul_scalar(2.0)?;
let offset  = df.add_scalar(100.0)?;
let pct     = df.div_scalar(total)?;
let squared = df.pow_scalar(2.0)?;
let modulo  = df.mod_scalar(10.0)?;
let floored = df.floordiv_scalar(3.0)?;

// DataFrame-to-DataFrame (with automatic index alignment)
let diff  = df1.sub_df(&df2)?;         // Aligned subtraction
let ratio = df1.div_df(&df2)?;         // Aligned division
let product = df1.mul_df(&df2)?;       // Aligned multiplication

// With fill_value for missing alignment positions
let sum = df1.add_df_fill(&df2, 0.0)?;  // Fill missing with 0 before adding

Cumulative and Sequential

let csum  = df.cumsum()?;      // Running sum
let cprod = df.cumprod()?;     // Running product
let cmax  = df.cummax()?;      // Running maximum
let cmin  = df.cummin()?;      // Running minimum
let delta = df.diff(1)?;       // First difference (n periods)
let moved = df.shift(1)?;      // Shift values by n periods
let pct   = df.pct_change()?;  // Percentage change

Clipping and Rounding

let clipped = df.clip(0.0, 100.0)?;    // Clip to [0, 100]
let lower   = df.clip_lower(0.0)?;     // Floor at 0
let upper   = df.clip_upper(100.0)?;   // Cap at 100
let rounded = df.round(2)?;            // Round to 2 decimal places
let absolute = df.abs()?;              // Absolute value

Replacement

// Replace specific values
let cleaned = df.replace(&Scalar::Int64(-999), &Scalar::Null(NullKind::NaN))?;

// Series: regex replace
let fixed = series.str().replace_regex(r"\d{3}-\d{4}", "***-****")?;

// Series: map with replacement dictionary
let mapped = series.map_with_na_action(&mapping, true)?;  // na_action=ignore

// Series: conditional assignment
let graded = scores.case_when(&[
    (scores.ge(&Scalar::Int64(90))?, Series::constant("A", n)?),
    (scores.ge(&Scalar::Int64(80))?, Series::constant("B", n)?),
])?;

Advanced Selection Methods

Beyond basic loc/iloc, FrankenPandas provides the full pandas selection toolkit:

// Top-N and Bottom-N rows by column value
let top5 = df.nlargest(5, "revenue")?;
let bot3 = df.nsmallest(3, "price")?;
// With keep parameter: 'first' (default), 'last', 'all'
let top5 = df.nlargest_keep(5, "revenue", "all")?;

// Find label of min/max value
let worst_day = series.idxmin()?;   // → IndexLabel of minimum
let best_day  = series.idxmax()?;   // → IndexLabel of maximum

// Value counts (like Series.value_counts())
let counts = series.value_counts()?;
// With full options: normalize, sort, ascending, dropna
let pcts = series.value_counts_with_options(true, true, false, true)?;

// Membership testing
let mask = series.isin(&[Scalar::Utf8("A".into()), Scalar::Utf8("B".into())])?;
let in_range = series.between(&Scalar::Int64(10), &Scalar::Int64(20))?;

// Index-based position lookup
let pos = series.searchsorted(&Scalar::Float64(42.0), "left")?;
let (codes, uniques) = series.factorize()?;  // Encode as integers

// Select columns by dtype
let numeric_only = df.select_dtypes(&[DType::Int64, DType::Float64], &[])?;
let non_numeric = df.select_dtypes(&[], &[DType::Int64, DType::Float64])?;

// Flexible label-based filtering
let subset = df.filter_labels(Some(&["price", "volume"]), None, None, 1)?; // axis=1
let regex_match = df.filter_labels(None, None, Some("^rev"), 1)?;  // Regex on col names

// Reindex to new labels (fill missing with NaN)
let reindexed = series.reindex(new_labels)?;
let trimmed = series.truncate(Some(&start_label), Some(&end_label))?;

// Column manipulation
let (popped_series, remaining_df) = df.pop("temp_col")?;       // Remove and return
let with_new = df.insert(2, "computed", new_column)?;            // Positional insert
let renamed = df.add_prefix("raw_")?;                           // Prefix all columns
let renamed = df.add_suffix("_v2")?;                             // Suffix all columns

DataFrame Introspection

let shape = df.shape();              // (nrows, ncols)
let dtypes = df.dtypes();            // Vec<(column_name, DType)>
let info = df.info();                // String summary (like df.info() in pandas)
let mem = df.memory_usage()?;        // Per-column byte estimates as Series
let ndim = df.ndim();                // Always 2 for DataFrame
let axes = df.axes();                // (index, column_names)
let is_empty = df.is_empty();        // True if zero rows

// Deep equality (structural + value comparison)
let same = df1.equals(&df2);         // true if identical structure and values

// Element-wise diff between two DataFrames
let changes = df1.compare(&df2)?;    // Shows only positions that differ

// Squeeze single-column/single-row DataFrames
let series = single_col_df.squeeze(1)?;   // DataFrame → Series
let scalar = single_cell_df.squeeze(0)?;  // DataFrame → Scalar

// Scalar access
let val = series.iat(0)?;           // By position (like .iat[0])
let val = series.at(&label)?;       // By label (like .at[label])

// Lookup specific (row, col) pairs
let values = df.lookup(&row_labels, &col_names)?;

SeriesGroupBy

Series-level groupby is separate from DataFrame groupby and provides a lightweight API for single-column aggregation:

// Group one Series by another
let by_region = revenue_series.groupby(&region_series)?;

// Aggregate
let sums  = by_region.sum()?;
let means = by_region.mean()?;
let stds  = by_region.std()?;
let meds  = by_region.median()?;
let prods = by_region.prod()?;

// Multiple aggregations at once
let multi = by_region.agg(&["sum", "mean", "count"])?;  // Returns DataFrame

SeriesGroupBy supports: sum, mean, count, min, max, std, var, median, first, last, prod, sem, skew, kurtosis, agg (multi-function), and value_counts.

Sorting

// Sort by column values
let sorted = df.sort_values("price", true)?;          // ascending=true
let sorted = df.sort_values("price", false)?;         // descending

// Control NaN position
let sorted = series.sort_values_na(true, "first")?;   // NaN at top
let sorted = series.sort_values_na(true, "last")?;    // NaN at bottom (default)

// Sort by index labels
let sorted = df.sort_index(true)?;                    // ascending
let sorted = df.sort_index(false)?;                   // descending

Concat: Full Options

// Axis 0 (stack rows, default)
let stacked = concat_dataframes(&[&df1, &df2])?;

// Axis 1 (add columns side-by-side, outer join on index)
let wide = concat_dataframes_with_axis(&[&df1, &df2], 1)?;

// Axis 1 with inner join (only shared index labels)
let inner = concat_dataframes_with_axis_join(&[&df1, &df2], 1, ConcatJoin::Inner)?;

// Axis 0 with inner join (only shared columns)
let shared = concat_dataframes_with_axis_join(&[&df1, &df2], 0, ConcatJoin::Inner)?;

// With hierarchical keys
let labeled = concat_dataframes_with_keys(&[&df1, &df2], &["train", "test"])?;

// Ignore original indexes (reindex to 0..n)
let clean = concat_dataframes_with_ignore_index(&[&df1, &df2])?;

Pivot Tables: Full Options

// Basic pivot table
let pt = df.pivot_table("revenue", "region", "product", "sum")?;

// Multiple values columns
let pt = df.pivot_table_multi_values(&["revenue", "quantity"], "region", "product", "sum")?;

// With margins (subtotals row/column)
let pt = df.pivot_table_with_margins("revenue", "region", "product", "sum")?;

// Custom margins label
let pt = df.pivot_table_with_margins_name("revenue", "region", "product", "sum", "Grand Total")?;

// Fill NaN in pivot output
let pt = df.pivot_table_fill("revenue", "region", "product", "sum", 0.0)?;

// Multiple aggregation functions
let pt = df.pivot_table_multi_agg("revenue", "region", "product", &["sum", "mean", "count"])?;

Time-Series Operations

// Select rows at a specific time
let noon = df.at_time("12:00:00")?;

// Select rows within a time range
let morning = df.between_time("09:00:00", "12:00:00")?;

// Datetime component extraction (full list)
let components = series.dt();
components.year()?;            components.month()?;
components.day()?;             components.hour()?;
components.minute()?;          components.second()?;
components.dayofweek()?;       components.dayofyear()?;
components.quarter()?;         components.weekofyear()?;
components.is_month_start()?;  components.is_month_end()?;
components.is_quarter_start()?; components.is_quarter_end()?;
components.strftime("%Y-%m-%d %H:%M")?;

// Timezone operations
let localized = series.dt().tz_localize(Some("America/New_York"))?;
let converted = series.dt().tz_convert(Some("UTC"))?;

Column Manipulation

// Rename columns
let renamed = df.rename_with(|name| format!("col_{name}"))?;
let prefixed = df.add_prefix("input_")?;
let suffixed = df.add_suffix("_raw")?;

// Assign new column (value vector)
let df2 = df.assign_column("computed", computed_values)?;

// Assign with closure (sees current DataFrame state)
let df2 = df.assign_fn(vec![
    ("ratio", Box::new(|df: &DataFrame| {
        // Compute from existing columns
        let a = df.column("revenue").unwrap();
        let b = df.column("cost").unwrap();
        // ... return Column
        Ok(result_column)
    })),
])?;

// Reorder columns
let reordered = df.select_columns(&["id", "name", "value"])?;

Recipes

Financial Data Pipeline

use frankenpandas::prelude::*;

// Load trade data
let trades = read_csv_str(
    "date,ticker,price,volume\n\
     2024-01-15,AAPL,185.50,1000\n\
     2024-01-15,GOOG,140.25,500\n\
     2024-01-16,AAPL,186.00,1200\n\
     2024-01-16,GOOG,141.00,800"
)?;

// Parse dates
let date_series = Series::new("date", trades.index().clone(),
    trades.column("date").unwrap().clone())?;
let parsed_dates = to_datetime(&date_series)?;

// Daily VWAP per ticker
let vwap = trades.groupby(&["ticker"])?.agg_named(&[
    ("total_value", "price", "sum"),   // Simplified; real VWAP needs price*vol
    ("total_vol", "volume", "sum"),
    ("trade_count", "volume", "count"),
])?;

// Export for downstream consumption
write_jsonl(&vwap, Path::new("daily_vwap.jsonl"))?;

Merge-Asof for Time Series Alignment

// Join trades with quotes at the nearest preceding timestamp
let result = merge_asof(
    &trades, &quotes, "timestamp", AsofDirection::Backward
)?;
// Each trade row now has the most recent quote as of that trade time

Categorical Analysis

// Create categorical with explicit ordering
let ratings = Series::from_categorical(
    "satisfaction",
    vec![
        Scalar::Utf8("good".into()),
        Scalar::Utf8("poor".into()),
        Scalar::Utf8("excellent".into()),
        Scalar::Utf8("good".into()),
    ],
    true, // ordered
)?;

// Access category operations
let cat = ratings.cat().unwrap();
println!("Categories: {:?}", cat.categories());  // [good, poor, excellent]
println!("Codes: {:?}", cat.codes()?.values());   // [0, 1, 2, 0]

// Rename categories
let renamed = cat.rename_categories(vec![
    Scalar::Utf8("Good".into()),
    Scalar::Utf8("Poor".into()),
    Scalar::Utf8("Excellent".into()),
])?;

// Materialize back to values
let values = renamed.cat().unwrap().to_values()?;

MultiIndex Operations

// Create from product (Cartesian)
let mi = MultiIndex::from_product(vec![
    vec!["east".into(), "west".into()],
    vec![2023i64.into(), 2024i64.into()],
])?
.set_names(vec![Some("region".into()), Some("year".into())]);

// Extract a level
let regions = mi.get_level_values(0)?;

// Flatten to single index
let flat = mi.to_flat_index("_");  // "east_2023", "east_2024", ...

// From DataFrame columns
let mi = df.to_multi_index(&["region", "year"])?;

Expression-Driven Analysis

use std::collections::BTreeMap;

// Compute new columns with eval
let profit = df.eval("revenue - cost")?;

// Filter with compound conditions
let hot_deals = df.query("price < 50 and rating > 4.0")?;

// Use local variables in expressions
let locals = BTreeMap::from([
    ("threshold".to_owned(), Scalar::Float64(100.0)),
]);
let above = df.query_with_locals("value > @threshold", &locals)?;

Conformance Testing in Depth

The conformance system is a differential testing framework that verifies FrankenPandas output against the actual pandas library:

                  ┌───────────────┐
                  │ Fixture JSON   │  ← Input DataFrame + operation
                  └───────┬───────┘
                          │
              ┌───────────┴───────────┐
              ▼                       ▼
    ┌───────────────────┐   ┌─────────────────────┐
    │ FrankenPandas      │   │ pandas Oracle        │
    │ (Rust execution)   │   │ (Python subprocess)  │
    └─────────┬─────────┘   └──────────┬──────────┘
              │                        │
              ▼                        ▼
    ┌──────────────────────────────────────────┐
    │          Parity Comparison                │
    │  dtype match?  value match?  order?       │
    └────────────────────┬─────────────────────┘
                         │
              ┌──────────┴──────────┐
              ▼                     ▼
    ┌────────────────┐   ┌────────────────┐
    │ parity_report   │   │ parity_gate    │
    │ .json           │   │ _result.json   │
    └────────────────┘   └────────────────┘

Packet families cover: series alignment (FP-P2C-001–003), join semantics (FP-P2C-004), groupby aggregates (FP-P2C-005, 011), concat (FP-P2C-006), null/NaN ops (FP-P2C-007), CSV round-trip (FP-P2C-008), dtype invariants (FP-P2C-009), filter/selection (FP-P2C-010), plus 30+ DataFrame-level packets (FP-P2D-014 through FP-P2D-055) covering merge, concat axis options, head/tail, loc/iloc, sort, constructor variants, and more.

Every parity report gets a RaptorQ repair-symbol sidecar for bit-rot detection. The drift history ledger (artifacts/phase2c/drift_history.jsonl) tracks parity trends over time.

Troubleshooting

Problem	Cause	Fix
`cargo build` fails with "edition 2024" error	Requires Rust nightly	Install via `rustup toolchain install nightly` and check `rust-toolchain.toml`
`Column::from_values` returns `IncompatibleDtypes`	Mixed Utf8 + numeric values in one column	Ensure homogeneous types, or use `to_numeric()` first
`query("col > 5")` returns error	Column name not found in DataFrame	Verify column exists with `df.column_names()`
CSV round-trip changes Float64 to Int64	`1.0` written as `"1"`, parsed back as Int64	Use `dtype` parameter in `CsvReadOptions` to force types
SQL write fails with "table already exists"	Default `SqlIfExists::Fail` policy	Use `SqlIfExists::Replace` or `SqlIfExists::Append`
Excel round-trip loses integer precision for large values	Excel stores all numbers as f64	Values within i64 range with zero fraction are recovered as Int64
JSONL reader drops columns	Columns only in later rows were missed	Fixed: reader now unions all keys across all rows
`to_datetime` returns NaT	Unrecognized date format	Use `to_datetime_with_format(series, Some("%d/%m/%Y"))` with explicit format

Limitations

Limitation	Status	Workaround
No native Datetime dtype	Datetimes stored as ISO 8601 Utf8 strings	Use `to_datetime()` for normalization
MultiIndex not yet integrated as DataFrame row index	Foundation type exists	Use `set_index_multi()` for composite keys
Categorical metadata not propagated through arithmetic	By design (matches pandas)	Use `.cat().to_values()` to materialize
No HDF5, Clipboard, or HTML IO	System-library dependencies	Use Feather (faster) or Parquet instead
Single-threaded execution	No parallel execution yet	Profile-proven fast paths compensate
No plotting	Requires graphics library	Export to pandas/matplotlib for visualization

FAQ

Q: How compatible is this with pandas? A: We target absolute API parity. The same method names, same parameter names, same edge-case behavior. Differential conformance tests verify against the actual pandas oracle. We're in early development so not every method is implemented yet, but the architecture is designed for full coverage.

Q: Why not just use Polars? A: Polars is excellent but has a fundamentally different API (lazy evaluation, no index alignment, different method names). FrankenPandas targets teams that want pandas semantics with Rust performance. No code rewrite required; just change the import.

Q: Is unsafe code used anywhere? A: No. Every crate in the workspace uses #![forbid(unsafe_code)]. Memory safety comes from the Rust type system.

Q: How do I use this from Python? A: PyO3 bindings are a planned future step. Currently this is a pure Rust library.

Q: What's the EvidenceLedger? A: Every alignment decision, dtype coercion, and policy override is logged with Bayesian confidence scores. This creates an auditable trail of exactly how your data was transformed. pandas makes these same decisions silently with no record.

Q: What's the performance like? A: Five formal optimization rounds with measured evidence. The groupby path alone saw an 87% speedup through memoization and dense aggregation paths. All optimizations include isomorphism proofs showing behavior is unchanged.

Q: How is the GroupBy implemented? A: Three automatic execution paths. Dense Int64 keys (range ≤ 65,536) use O(1) array indexing. Medium cardinality uses Bumpalo arena allocation (single malloc, zero fragmentation). Everything else falls back to a HashMap with source-index referencing to avoid per-group clones. All three produce identical results, verified by property-based tests.

Q: What does "clean-room" mean? A: We never read, reference, or copy from the pandas source code. We study pandas' behavior (input → output contracts, edge cases, dtype rules) via the conformance oracle, then implement from first principles in Rust. This avoids any license contamination and often produces better implementations.

Q: How do you handle NaN vs None vs NaT? A: Three distinct null kinds: NullKind::Null (generic missing), NullKind::NaN (float not-a-number), NullKind::NaT (not-a-time). Float64(NaN) also counts as missing. is_missing() returns true for all of these. semantic_eq() treats NaN == NaN as true (unlike IEEE 754), matching pandas behavior.

Q: What's the memory overhead vs pandas? A: Comparable for numeric data. ValidityMask uses 1 bit per element (vs pandas' 8-byte nullable dtype). Column stores typed arrays (Vec<f64>, Vec<i64>) instead of object arrays. String data uses Vec<String> (heap-allocated per element, same as pandas object dtype). Arena-backed GroupBy/Join operations avoid per-group heap fragmentation.

Q: Can I use this for production ETL pipelines? A: The core DataFrame, IO, and GroupBy/Join operations are solid and well-tested (1,500+ tests including adversarial inputs and property-based fuzzing). This is pre-1.0 software, so API stability is not yet guaranteed, but the correctness bar is high.

Selection and Indexing

Full pandas-style selection API:

// Label-based (like df.loc[])
let row = df.loc("row_label")?;           // Single row by label
let subset = df.loc_rows(&["a", "b"])?;   // Multiple rows

// Position-based (like df.iloc[])
let row = df.iloc(0)?;                     // First row
let last = df.iloc(-1)?;                   // Last row (negative indexing)
let slice = df.head(10);                   // First 10 rows
let tail = df.tail(5);                     // Last 5 (supports negative n)

// Column selection
let col = df.column("price")?;             // Single column as &Column
let subset = df.select_columns(&["price", "volume"])?;
let numeric = df.select_dtypes(&["int64", "float64"], &[])?;

// Boolean masking
let mask = df.query("price > 100")?;
let filtered = df.filter_rows(&bool_series)?;

// Conditional replacement
let filled = df.where_mask_df(&cond_df, &other_df)?;
let masked = df.mask_df(&cond_df, &other_df)?;

// Index operations
let reindexed = df.set_index("date", true)?;  // Column → index
let reset = df.reset_index(false)?;            // Index → column
let sorted = df.sort_index(true)?;             // Sort by index
let deduped = df.drop_duplicates()?;           // Remove duplicate rows

Serialization and Interoperability

All core types are fully serializable via serde:

// Every type derives Serialize + Deserialize:
// Scalar, DType, NullKind, IndexLabel, Index, MultiIndex,
// Series, DataFrame, CategoricalMetadata, Column, ValidityMask

// JSON serialization
let json = serde_json::to_string(&scalar)?;      // Tagged enum: {"kind":"int64","value":42}
let back: Scalar = serde_json::from_str(&json)?;  // Round-trips perfectly

// Binary serialization (via bincode, messagepack, etc.)
let bytes = bincode::serialize(&dataframe)?;
let back: DataFrame = bincode::deserialize(&bytes)?;

The Scalar enum uses serde's tagged representation (#[serde(tag = "kind", content = "value")]) for human-readable JSON while remaining efficient for binary formats. ValidityMask serializes as a Vec<bool> for JSON compatibility but uses bitpacked Vec<u64> in memory.

Arrow interop: DataFrame ↔ Arrow RecordBatch conversion is built in (used by Parquet and Feather IO). This means FrankenPandas data can be zero-copy shared with any Arrow-compatible system (DuckDB, DataFusion, Spark via Arrow Flight).

Adversarial and Property-Based Testing

Beyond unit tests, FrankenPandas employs two advanced testing strategies:

Property-Based Tests (proptest)

75 properties that must hold for ALL randomly-generated inputs:

Property Category	Examples	Cases
DType coercion	`common_dtype` is symmetric, reflexive, transitive	500
Scalar cast	Identity cast preserves value, missing stays missing	500
Index alignment	Union contains all labels, position vectors correct length	500
Series arithmetic	Self-add doubles values, no panics in hardened mode	200
Join invariants	Inner ⊆ Left ⊆ Outer, output lengths consistent	200
GroupBy	Groups bounded by input rows, arena ≡ global allocator	200
ValidityMask	De Morgan's law, NOT involution, AND commutativity	300
CSV round-trip	Shape preserved, column names preserved, Int64 exact	100
JSON round-trip	Shape preserved across Records/Columns/Split/Values	50
SQL round-trip	Shape and values preserved through SQLite	30
Excel round-trip	Int64 values recovered from f64 through xlsx	30
Feather round-trip	Shape, names, and values exact through Arrow IPC	50
DataFrame arithmetic	add_scalar preserves shape, add(0) ≡ identity	100
Comparison ops	eq XOR ne = true for non-NaN values	100

Adversarial Input Tests

15 tests targeting parser edge cases:

CSV: 200K-character field, 1000 columns, embedded newlines in quotes, multi-byte UTF-8 (Japanese, Cyrillic, emoji), header-only files, no trailing newline
JSON: Deeply nested objects, i64::MAX/i64::MIN boundary values, near-f64::MAX floats, empty arrays/objects
SQL: 10,000-row batch insert, column names with spaces (quoted identifier handling), SQL injection rejection (Bobby Tables pattern)

Duplicate Handling

Pandas-compatible keep parameter for duplicate detection:

// Mark duplicates (like df.duplicated(keep='first'))
let mask = df.duplicated()?;  // First occurrence = false, subsequent = true

// Drop duplicates
let unique = df.drop_duplicates()?;

// Series-level with keep parameter
let deduped = series.drop_duplicates()?;

// Index-level
let has_dups = index.has_duplicates();  // O(1) after first call (OnceCell)
let unique_idx = index.drop_duplicates(DuplicateKeep::First)?;

DuplicateKeep enum: First (keep first occurrence), Last (keep last), None (mark all duplicates).

Random Sampling

Deterministic sampling with seed control:

// Sample n rows
let sampled = df.sample(100, None, false, Some(42))?;  // n=100, seed=42

// Sample fraction
let sampled = df.sample(0, Some(0.1), false, Some(42))?;  // 10% sample

// Sample with replacement (bootstrap)
let bootstrap = df.sample(1000, None, true, Some(42))?;

// Weighted sampling
let weighted = df.sample_weights(100, &weights_series, false, Some(42))?;

Uses a deterministic LCG (Linear Congruential Generator) with Fisher-Yates shuffle for reproducible results across runs. Matching seed → identical sample, regardless of platform.

Roadmap

Priority	Feature	Status
High	PyO3 Python bindings	Planned; would enable `import frankenpandas as fpd` from Python
High	Native Datetime DType	Design phase; would replace Utf8 ISO 8601 string representation
High	Full MultiIndex ↔ DataFrame integration	Foundation exists; needs integration with groupby output, stack/unstack
Medium	Parallel execution (rayon)	Not started; architecture supports it (columns are independent)
Medium	DataFrame.plot() via plotters crate	Not started; would enable terminal/SVG chart output
Medium	Lazy evaluation / query planning	Not started; would enable optimization across chained operations
Low	HDF5 IO	Needs system library (libhdf5)
Low	Clipboard IO	Needs system clipboard access
Low	DataFrame.style for HTML formatting	Decorative; low priority vs correctness work

Key Documents

Document	Purpose
`AGENTS.md`	Guidelines for AI coding agents
`COMPREHENSIVE_SPEC_FOR_FRANKENPANDAS_V1.md`	Full V1 specification
`FEATURE_PARITY.md`	Detailed parity matrix with status per feature family
`artifacts/perf/`	Optimization round baselines, opportunity matrices, proofs
`artifacts/phase2c/`	Conformance packet artifacts and drift history

About Contributions

Please don't take this the wrong way, but I do not accept outside contributions for any of my projects. I simply don't have the mental bandwidth to review anything, and it's my name on the thing, so I'm responsible for any problems it causes; thus, the risk-reward is highly asymmetric from my perspective. I'd also have to worry about other "stakeholders," which seems unwise for tools I mostly make for myself for free. Feel free to submit issues, and even PRs if you want to illustrate a proposed fix, but know I won't merge them directly. Instead, I'll have Claude or Codex review submissions via gh and independently decide whether and how to address them. Bug reports in particular are welcome. Sorry if this offends, but I want to avoid wasted time and hurt feelings. I understand this isn't in sync with the prevailing open-source ethos that seeks community contributions, but it's the only way I can move at this velocity and keep my sanity.

License

MIT License (with OpenAI/Anthropic Rider). See LICENSE.

Name		Name	Last commit message	Last commit date
Latest commit History 272 Commits
.github/workflows		.github/workflows
artifacts		artifacts
crates		crates
references/frankensqlite		references/frankensqlite
scripts		scripts
.gitignore		.gitignore
AGENTS.md		AGENTS.md
CHANGELOG.md		CHANGELOG.md
COMPREHENSIVE_SPEC_FOR_FRANKENPANDAS_V1.md		COMPREHENSIVE_SPEC_FOR_FRANKENPANDAS_V1.md
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
EXHAUSTIVE_LEGACY_ANALYSIS.md		EXHAUSTIVE_LEGACY_ANALYSIS.md
EXISTING_PANDAS_STRUCTURE.md		EXISTING_PANDAS_STRUCTURE.md
FEATURE_PARITY.md		FEATURE_PARITY.md
LICENSE		LICENSE
PHASE2C_EXTRACTION_PACKET.md		PHASE2C_EXTRACTION_PACKET.md
PLAN_TO_PORT_PANDAS_TO_RUST.md		PLAN_TO_PORT_PANDAS_TO_RUST.md
PROPOSED_ARCHITECTURE.md		PROPOSED_ARCHITECTURE.md
README.md		README.md
TODO_EXECUTION_TRACKER.md		TODO_EXECUTION_TRACKER.md
frankenpandas_illustration.webp		frankenpandas_illustration.webp
rust-toolchain.toml		rust-toolchain.toml

Folders and files

Latest commit

History

Repository files navigation