
Commit c7473a6

feat: add support for validation of polars dataframe and lazyframe (snakemake#3262)
Implement validation for polars DataFrame and LazyFrame. Refactor the setting of default values.

### QC

* [x] The PR contains a test case for the changes or the changes are already covered by an existing test case.
* [x] The documentation (`docs/`) is updated to reflect the changes or this is not necessary (e.g. if the change neither modifies the language nor the behavior or functionalities of Snakemake).

## Summary by CodeRabbit

- **New Features**
  - Enhanced sample data validation now supports both Pandas and Polars data frames for improved reliability and performance.
  - Introduced new methods for reading sample data using Polars, expanding data handling options.
  - Added support for executing Xonsh scripts within Snakemake workflows.
  - New rule added for running Python scripts with Conda environments.
  - New functionality for generating self-contained HTML reports, including default statistics and user-specified results.
  - New functions added for parsing input files and extracting checksums.
- **Bug Fixes**
  - Improved error handling for validation failures, providing more specific error messages.
- **Documentation**
  - Updated the sample metadata schema with new fields for replicate count and tissue origin, alongside a refined description for sample condition.
  - Clarified usage of conda environments and apptainer integration within Snakemake workflows.
  - Expanded guidance on generating, customizing, and sharing reports in Snakemake.
  - Added documentation for integrating Xonsh scripts into Snakemake rules.
  - Updated help text for the `--keep-storage-local-copies` argument to enhance clarity and usability.
1 parent e6023c8 commit c7473a6

File tree

5 files changed, +162 −27 lines


docs/snakefiles/configuration.rst

Lines changed: 2 additions & 2 deletions

```diff
@@ -112,8 +112,8 @@ Instead, for data provenance and reproducibility reasons, you are required to pa
 Validation
 ----------
 
-With Snakemake 5.1, it is possible to validate both types of configuration via `JSON schemas <https://json-schema.org>`_.
-The function ``snakemake.utils.validate`` takes a loaded configuration (a config dictionary or a Pandas data frame) and validates it with a given JSON schema.
+With Snakemake 5.1, it is possible to validate both types of configuration (standard and tabular) via `JSON schemas <https://json-schema.org>`_.
+The function ``snakemake.utils.validate`` takes a loaded configuration (a config dictionary, a Pandas DataFrame, Polars DataFrame or Polars LazyFrame) and validates it with a given JSON schema.
 Thereby, the schema can be provided in JSON or YAML format. Also, by using the defaults property it is possible to populate entries with default values. See `jsonschema FAQ on setting default values <https://python-jsonschema.readthedocs.io/en/latest/faq/>`_ for details.
 In case of the data frame, the schema should model the record that is expected in each row of the data frame.
 In the following example,
```
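The docs above describe per-record validation: each row becomes a dict record, nulls are dropped, required keys are checked, and schema defaults fill the gaps. A minimal plain-Python sketch of that idea (the `schema` dict and `validate_record` helper are hypothetical stand-ins, not the Snakemake API, which uses `jsonschema` and the data-frame libraries):

```python
# Hand-rolled stand-in for JSON-schema validation with defaults.
# `schema` and `validate_record` are illustrative only.
schema = {
    "required": ["sample"],
    "defaults": {"n": 0, "tissue": "blood"},
}


def validate_record(record, schema, set_default=True):
    # Check required keys, mirroring a JSON-schema "required" clause
    for key in schema["required"]:
        if key not in record:
            raise ValueError(f"missing required field: {key}")
    # Populate entries with default values, as the "default" property does
    if set_default:
        for key, value in schema["defaults"].items():
            record.setdefault(key, value)
    return record


rows = [
    {"sample": "A", "condition": "case", "n": 1},
    {"sample": "B", "condition": "control"},  # "n" was null and was dropped
]
validated = [validate_record(dict(r), schema) for r in rows]
```

The row with a missing `n` ends up with the schema default, while explicit values are left untouched.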

snakemake/utils.py

Lines changed: 103 additions & 20 deletions

```diff
@@ -108,43 +108,126 @@ def set_defaults(validator, properties, instance, schema):
         logger.warning("Note that schema file may not be validated correctly.")
     DefaultValidator = extend_with_default(Validator)
 
-    if not isinstance(data, dict):
+    def _validate_record(record):
+        if set_default:
+            DefaultValidator(schema, resolver=resolver).validate(record)
+            return record
+        else:
+            jsonschema.validate(record, schema, resolver=resolver)
+
+    def _validate_pandas(data):
         try:
             import pandas as pd
 
-            recordlist = []
             if isinstance(data, pd.DataFrame):
+                logger.debug("Validating pandas DataFrame")
+
+                recordlist = []
                 for i, record in enumerate(data.to_dict("records")):
-                    record = {k: v for k, v in record.items() if not pd.isnull(v)}
+                    # Exclude NULL values
+                    record = {k: v for k, v in record.items() if pd.notnull(v)}
                     try:
-                        if set_default:
-                            DefaultValidator(schema, resolver=resolver).validate(record)
-                            recordlist.append(record)
-                        else:
-                            jsonschema.validate(record, schema, resolver=resolver)
+                        recordlist.append(_validate_record(record))
                     except jsonschema.exceptions.ValidationError as e:
                         raise WorkflowError(
                             f"Error validating row {i} of data frame.", e
                         )
+
                 if set_default:
                     newdata = pd.DataFrame(recordlist, data.index)
-                    newcol = ~newdata.columns.isin(data.columns)
-                    n = len(data.columns)
-                    for col in newdata.loc[:, newcol].columns:
-                        data.insert(n, col, newdata.loc[:, col])
-                        n = n + 1
-                return
+                    # Add missing columns
+                    newcol = newdata.columns[~newdata.columns.isin(data.columns)]
+                    data[newcol] = None
+                    # Fill in None values with values from newdata
+                    data.update(newdata)
+
+            else:
+                return False
         except ImportError:
-            pass
-        raise WorkflowError("Unsupported data type for validation.")
-    else:
+            return False
+        return True
+
+    def _validate_polars(data):
         try:
-            if set_default:
-                DefaultValidator(schema, resolver=resolver).validate(data)
+            import polars as pl
+
+            if isinstance(data, pl.DataFrame):
+                logger.debug("Validating polars DataFrame")
+
+                recordlist = []
+                for i, record in enumerate(data.iter_rows(named=True)):
+                    # Exclude NULL values
+                    record = {
+                        k: v
+                        for k, v in record.items()
+                        if pl.Series(k, [v]).is_not_null().all()
+                    }
+                    try:
+                        recordlist.append(_validate_record(record))
+                    except jsonschema.exceptions.ValidationError as e:
+                        raise WorkflowError(
+                            f"Error validating row {i} of data frame.", e
+                        )
+
+                if set_default:
+                    newdata = pl.DataFrame(recordlist)
+                    # Add missing columns
+                    newcol = [col for col in newdata.columns if col not in data.columns]
+                    [
+                        data.insert_column(
+                            len(data.columns),
+                            pl.lit(None, newdata[col].dtype).alias(col),
+                        )
+                        for col in newcol
+                    ]
+                    # Fill in None values with values from newdata
+                    for i in range(data.shape[0]):
+                        for j in range(data.shape[1]):
+                            if data[i, j] is None:
+                                data[i, j] = newdata[i, j]
+
+            elif isinstance(data, pl.LazyFrame):
+                # If a LazyFrame is being used, probably it is a large dataframe (so check only first 1000 records)
+                logger.debug("Validating first 1000 rows of polars LazyFrame")
+
+                recordlist = []
+                for i, record in enumerate(
+                    data.head(1000).collect().iter_rows(named=True)
+                ):
+                    # Exclude NULL values
+                    record = {
+                        k: v
+                        for k, v in record.items()
+                        if pl.Series(k, [v]).is_not_null().all()
+                    }
+                    try:
+                        recordlist.append(_validate_record(record))
+                    except jsonschema.exceptions.ValidationError as e:
+                        raise WorkflowError(
+                            f"Error validating row {i} of data frame.", e
+                        )
+
+                if set_default:
+                    logger.warning("LazyFrame does not support setting default values.")
+
             else:
-                jsonschema.validate(data, schema, resolver=resolver)
+                return False
+        except ImportError:
+            return False
+        return True
+
+    if isinstance(data, dict):
+        logger.debug("Validating dictionary")
+        try:
+            _validate_record(data)
         except jsonschema.exceptions.ValidationError as e:
            raise WorkflowError("Error validating config file.", e)
+        logger.debug("Dictionary validated!")
+    else:
+        if _validate_pandas(data):
+            logger.debug("Pandas dataframe validated!")
+        elif _validate_polars(data):
+            logger.debug("Polars dataframe validated!")
 
 
 def simplify_path(path):
```
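The pandas branch of `_validate_pandas` fills defaults by first adding missing columns as `None` and then updating null cells from the validated records. A rough pure-Python sketch of that merge, with lists of dicts standing in for the data frames (all names here are illustrative, not the Snakemake code):

```python
# `data` plays the original frame (rows as dicts); `newdata` plays the
# validated records with schema defaults applied. Missing columns are
# added as None, then None cells are filled from newdata, mirroring
# `data[newcol] = None` followed by `data.update(newdata)`.
data = [
    {"sample": "A", "condition": "case", "n": 1},
    {"sample": "B", "condition": "control", "n": None},
]
newdata = [
    {"sample": "A", "condition": "case", "n": 1, "tissue": "blood"},
    {"sample": "B", "condition": "control", "n": 0, "tissue": "blood"},
]

# Add missing columns as None
columns = list(data[0])
newcols = [c for c in newdata[0] if c not in columns]
for row in data:
    for c in newcols:
        row[c] = None

# Fill None cells from the validated records (in place, like DataFrame.update)
for row, newrow in zip(data, newdata):
    for c, v in row.items():
        if v is None:
            row[c] = newrow[c]
```

Updating in place keeps existing values (e.g. `n == 1` for sample A) and only fills the gaps, which matches the intent of the pandas `data.update(newdata)` call above.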

tests/test_validate/Snakefile

Lines changed: 45 additions & 1 deletion

```diff
@@ -1,16 +1,60 @@
 shell.executable("bash")
 
 import pandas as pd
+import polars as pl
 from snakemake.utils import validate
 
 
 configfile: "config.yaml"
 
 
-validate(config, "config.schema.yaml")
+# Dict
+df = pd.read_table(config["samples"])
+samples = df.iloc[0].to_dict()
+validate(samples, "samples.schema.yaml")
+assert samples["tissue"] == "blood"
+assert samples["n"] == 1
+samples = {k: v for k, v in df.iloc[1].to_dict().items() if pd.notnull(v)}
+validate(samples, "samples.schema.yaml")
+assert samples["tissue"] == "blood"
+assert samples["n"] == 0
+
+# Pandas DataFrame without index
+samples = pd.read_table(config["samples"])
+validate(samples, "samples.schema.yaml")
+assert samples.iloc[0]["tissue"] == "blood"
+assert samples.iloc[0]["n"] == 1
+assert samples.iloc[1]["n"] == 0
 
+# Polars DataFrame
+samples = pl.read_csv(
+    config["samples"],
+    separator="\t",
+    schema={"sample": pl.String, "condition": pl.String, "n": pl.UInt8},
+    null_values="NA",
+)
+validate(samples, "samples.schema.yaml")
+assert samples[0, "tissue"] == "blood"
+assert samples[0, "n"] == 1
+assert samples[1, "n"] == 0
+
+# Polars LazyFrame
+samples = pl.scan_csv(
+    config["samples"],
+    separator="\t",
+    schema={"sample": pl.String, "condition": pl.String, "n": pl.UInt8},
+    null_values="NA",
+)
+validate(samples, "samples.schema.yaml", set_default=False)
+assert samples.collect()[0, "n"] == 1
+
+# Pandas DataFrame with index
+validate(config, "config.schema.yaml")
 samples = pd.read_table(config["samples"]).set_index("sample", drop=False)
 validate(samples, "samples.schema.yaml")
+assert samples.iloc[0]["tissue"] == "blood"
+assert samples.iloc[0]["n"] == 1
+assert samples.iloc[1]["n"] == 0
 
 
 rule all:
```
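The LazyFrame branch in `snakemake/utils.py` validates only the first 1000 rows (`data.head(1000).collect()`) instead of materializing the whole frame, which is why the test above passes `set_default=False` and only asserts on collected values. A stdlib-only sketch of that bounded-sampling idea (the `lazy_rows` generator is a hypothetical stand-in for a lazy scan):

```python
from itertools import islice

# Stand-in for a lazy scan: rows are produced on demand, never all at once.
def lazy_rows(total):
    for i in range(total):
        yield {"sample": f"s{i}", "n": i}

# Check only a bounded prefix, mirroring head(1000).collect():
# the remaining rows are never generated, so memory stays bounded.
sample = list(islice(lazy_rows(1_000_000), 1000))
```

This trades completeness for speed: rows beyond the prefix are assumed to follow the same schema and are not validated.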

tests/test_validate/samples.schema.yaml

Lines changed: 9 additions & 1 deletion

```diff
@@ -6,7 +6,15 @@ properties:
     description: sample name/identifier
   condition:
     type: string
-    description: sample condition that will be compared during differential expression analysis (e.g. a treatment, a tissue time, a disease)
+    description: sample condition
+  n:
+    type: integer
+    default: 0
+    description: replicate count
+  tissue:
+    type: string
+    default: blood
+    description: sample tissue of origin
 
 required:
   - sample
```

tests/test_validate/samples.tsv

Lines changed: 3 additions & 3 deletions

```diff
@@ -1,3 +1,3 @@
-sample	condition
-A	tumor
-B	blood
+sample	condition	n
+A	case	1
+B	control	NA
```
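The updated TSV uses `NA` for a missing replicate count, which the Polars readers in the test map to null via `null_values="NA"`. A stdlib-only sketch of the same NA-to-null convention (the inline `tsv` string mirrors the file contents above):

```python
import csv
import io

# Same content as the updated samples.tsv
tsv = "sample\tcondition\tn\nA\tcase\t1\nB\tcontrol\tNA\n"

rows = []
for row in csv.DictReader(io.StringIO(tsv), delimiter="\t"):
    # Treat "NA" as null; parse "n" as an integer otherwise
    rows.append(
        {k: (None if v == "NA" else (int(v) if k == "n" else v))
         for k, v in row.items()}
    )
```

Leaving `n` null for sample B is what lets the schema's `default: 0` kick in during validation.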
