Pyra - A High-level Linter for Data Science Software

Pyra is a high-level linter (static analyzer) for data science applications written in Python. Built as an extension of Lyra, it helps developers identify potential issues in their data science code.

Pyra is based on the peer-reviewed publications:

Greta Dolcetti, Vincenzo Arceri, Antonella Mensi, Enea Zaffanella, Caterina Urban, Agostino Cortesi (2026). "Introducing Pyra: A High-Level Linter for Data Science Software". In: Dutra, I., et al. Machine Learning and Knowledge Discovery in Databases. Applied Data Science Track and Demo Track.

Greta Dolcetti, Agostino Cortesi, Caterina Urban, Enea Zaffanella. "Towards a High Level Linter for Data Science". In Proceedings of the 10th ACM SIGPLAN International Workshop on Numerical and Symbolic Abstract Domains (NSAD 2024), co-located with SPLASH 2024.

Abstract datatype analysis

Let us consider the following fragment. The code represents a simple data science pipeline that reads a CSV file, drops duplicates, plots the data, scales it, splits it into training and testing sets, and fits a logistic regression model.

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("data.csv")
# Columns: ['Fruit', 'Amount', 'Label']
result = df.drop_duplicates(inplace=True)

plt.plot(df["Fruit"], df["Amount"])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(df[["Amount"]])

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, df["Label"])

model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)

The code fragment contains several issues that could lead to misleading results and challenges in reproducibility:

The drop_duplicates method is called with inplace=True, which modifies the DataFrame in place and returns None. This can lead to confusion, as the variable result will be assigned None.
The plot method is used to create a line plot with a categorical x-axis. This is inappropriate, as line plots are typically used for continuous data. A bar plot would be more suitable in this case.
The train_test_split method is called without setting the random_state parameter, meaning the split will differ each time the code is run. This can result in non-reproducible outcomes.
The data is scaled before the train-test split. This can cause data leakage, as the scaling parameters are computed using the entire dataset, including the test set. The scaling should be performed after the split to avoid this issue.
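To make the first issue concrete, the sketch below flags assignments whose right-hand side is a call with inplace=True, using only Python's standard ast module. It is a purely syntactic approximation written for illustration and should not be read as how Pyra works; Pyra relies on a static analysis of the program rather than on pattern matching. The function name find_inplace_assignments and the sample snippet are hypothetical.

```python
import ast


def find_inplace_assignments(source: str) -> list:
    """Return the line numbers of assignments whose value is a call
    that passes the literal keyword argument inplace=True."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        # Only plain assignments of the form `name = some_call(...)`.
        if isinstance(node, ast.Assign) and isinstance(node.value, ast.Call):
            for kw in node.value.keywords:
                if (
                    kw.arg == "inplace"
                    and isinstance(kw.value, ast.Constant)
                    and kw.value.value is True
                ):
                    hits.append(node.lineno)
    return sorted(hits)


# Hypothetical input: the source is only parsed, never executed.
snippet = '''import pandas as pd
df = pd.read_csv("data.csv")
result = df.drop_duplicates(inplace=True)
clean = df.drop_duplicates()
'''

print(find_inplace_assignments(snippet))  # [3]
```

A check like this only sees the shape of the code. It cannot tell whether the receiver is actually a DataFrame, which is the kind of information a datatype analysis is meant to provide.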

Pyra detects these issues and raises a warning for each of them.
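One possible way to address the four issues is sketched below. It is an illustrative rewrite, not an official fix: the inline DataFrame is a hypothetical stand-in for data.csv (same column names, made-up values), and the Agg backend is selected only so that the sketch runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend (assumption: no display available)
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for data.csv; the last row duplicates the first.
df = pd.DataFrame({
    "Fruit": ["apple", "banana", "cherry", "kiwi", "mango", "pear", "plum",
              "grape", "lemon", "melon", "peach", "fig", "apple"],
    "Amount": [10, 25, 7, 30, 12, 28, 9, 33, 11, 27, 8, 31, 10],
    "Label": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
})

# Without inplace=True the method returns a new DataFrame, so nothing is bound to None.
df = df.drop_duplicates()

# A bar plot suits a categorical x-axis better than a line plot.
plt.bar(df["Fruit"], df["Amount"])

# Split first, with a fixed random_state so the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    df[["Amount"]], df["Label"], random_state=42
)

# Fit the scaler on the training set only, then reuse it on the test set.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = LogisticRegression(random_state=42)
model.fit(X_train_scaled, y_train)
```

Fitting the scaler on X_train alone and applying it to X_test through transform keeps test-set statistics out of the preprocessing step, which is what the data leakage point is about.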

Getting Started

Prerequisites

Install Git
Install Python 3.9.18
Install pyenv

Installation

Create a virtual Python environment:

Linux or Mac OS X

pyenv local 3.9.18

Install Pyra in the virtual environment:

Linux or Mac OS X

./<env>/bin/pip install git+https://github.com/spangea/Pyra.git

Command Line Usage

To analyze a specific Python program run:

Linux or Mac OS X
`./<env>/bin/pyra --analysis type-datascience test.py`

After the analysis, Pyra generates a PDF file showing the control flow graph of the program annotated with the result of the abstract data type analysis before and after each statement in the program.
