Testing in Data Science
Here's what you need to know about testing in data science.
In data science, two kinds of tests come up in addition to the usual unit
tests written with the pytest library:
1. For Data Analysis
2. For Machine Learning
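For contrast, an ordinary pytest-style unit test asserts an exact, known value (a minimal sketch; `add` is a made-up example function, not from these notes):

```python
# A plain unit test in pytest style: call the code, assert an exact value.
# `add` is a made-up example function for illustration.
def add(a, b):
    return a + b

def test_add():
    assert add(2, 3) == 5
    assert add(-1, 1) == 0
```

The data-science tests below differ in that they assert properties of the output instead of exact values.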
In data analysis, you need to test your code against previously unseen data
(essentially data validation).
You do that by checking properties of the outcome rather than its exact
value. There are libraries for this:
I found four of them; there are obviously more:
1. En garde
2. Hypothesis
3. Feature Forge
4. Voluptuous
These libraries check properties of the output data rather than the exact values.
In addition, NumPy and pandas ship built-in testing utilities (`numpy.testing`,
`pandas.testing`) that you can use for this.
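A minimal sketch of those built-in helpers: `numpy.testing.assert_allclose` and `pandas.testing.assert_frame_equal` compare values with a tolerance, which plain `==` does not give you:

```python
import numpy as np
import pandas as pd

# assert_allclose tolerates floating-point error that exact comparison would not.
a = np.array([0.1 + 0.2, 1.0])  # 0.1 + 0.2 != 0.3 exactly
b = np.array([0.3, 1.0])
np.testing.assert_allclose(a, b)  # passes despite the rounding error

# assert_frame_equal does the same for whole DataFrames
# (float columns are compared with a default tolerance).
df1 = pd.DataFrame({"x": [1, 2], "y": [0.1 + 0.2, 0.3]})
df2 = pd.DataFrame({"x": [1, 2], "y": [0.3, 0.3]})
pd.testing.assert_frame_equal(df1, df2)
```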
For example, Hypothesis (which seems to be the most useful in our case)
generates random data matching a specification you give it, runs it through
your code, and asserts the properties you want to check. It also hunts for
edge cases on its own and reports minimal failing examples.
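The core idea can be sketched without Hypothesis itself, using only the stdlib `random` module: generate many arbitrary inputs and assert a property of the output rather than a fixed value (`clip_outliers` is a hypothetical function under test):

```python
import random

def clip_outliers(xs, low=-1.0, high=1.0):
    # Hypothetical function under test: clamp every value into [low, high].
    return [min(max(x, low), high) for x in xs]

# Property-based check: for many random inputs, every output value must lie
# inside the bounds, and the length must be preserved. We never assert a
# specific output list.
random.seed(0)
for _ in range(100):
    data = [random.uniform(-10, 10) for _ in range(random.randrange(20))]
    out = clip_outliers(data)
    assert all(-1.0 <= x <= 1.0 for x in out)
    assert len(out) == len(data)
```

Hypothesis does the generation for you via strategies (`@given`) and additionally shrinks any failing input to a minimal counterexample.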
This blog basically confirms your doubts
An example of Hypothesis
These talks would help:
1. Testing for Properties
2. Data Validation
NumPy builtin data validation
Testing ML models involves a couple of steps. First, test all the
non-machine-learning code with pytest.
Since the models themselves cannot be tested directly, there are ways around that:
1. Blackbox Testing for Machine Learning
2. QA for ML Models
You can still do the property checks on the output data; Feature Forge is
aimed specifically at ML features.
Then there are the metrics we talked about in class yesterday, which are used to
check the quality of the model.
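Whatever the specific metrics from class were, each is just a function of predictions versus labels; here's accuracy sketched by hand, no ML library needed:

```python
def accuracy(y_true, y_pred):
    # Fraction of predictions that match the true labels.
    assert len(y_true) == len(y_pred)
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# 3 of 4 predictions correct.
score = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
```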
In our specific problem, we could use the Hypothesis library to get a random
DataFrame, pass it through our function, and check whether any pair of columns
still has correlation above a certain threshold. Since the data is random but
its parameters can be specified, we get exactly the kind of test we want.
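A sketch of that test using plain NumPy random data rather than Hypothesis strategies; `drop_correlated` is a hypothetical stand-in for our function, and the assertion is on a property of the output, not its values:

```python
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.9):
    # Hypothetical stand-in for our function: greedily drop any column whose
    # absolute correlation with an already-kept column exceeds the threshold.
    corr = df.corr().abs()
    keep = []
    for col in df.columns:
        if all(corr.loc[col, k] <= threshold for k in keep):
            keep.append(col)
    return df[keep]

# Random DataFrame with controlled parameters: six independent columns plus
# one near-duplicate of column "a".
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 6)), columns=list("abcdef"))
df["g"] = df["a"] * 2 + rng.normal(scale=0.01, size=200)

out = drop_correlated(df, threshold=0.9)

# Property: no pair of remaining columns correlates above the threshold.
corr = out.corr().abs()
off_diag = corr.values[~np.eye(len(out.columns), dtype=bool)]
assert (off_diag <= 0.9).all()
```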
I'll write a test for this later. I'll share the code once it works.
Hope this helps.