During the SynPUF ETL, we occasionally referred to the White Rabbit scans of values for a given source column. The frequency counts are helpful, but as @mav7014 pointed out in #29, they leak PHI. They also don’t make it easy to see how big a slice of my data contains a particular value. I’d like to see what percentage of the total each value represents. Perhaps if we reported small values as “less than 0.1%” or something similar, we’d also avoid leaking PHI, and @mav7014 could reincorporate frequencies into his ETL spec decision-making process.
Please note that I don’t intend for percentages to replace counts entirely. In some cases, it would be nice to have both raw counts and percentages.
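
To make the idea concrete, here is a rough sketch in Python of what I have in mind. This is not White Rabbit's actual code or report format; `summarize_column` and `floor_pct` are names I made up for illustration, and hiding the raw count for masked values is my assumption about how to avoid leaking PHI.

```python
from collections import Counter


def summarize_column(values, floor_pct=0.1):
    """Return (value, count, percent_label) rows for one source column.

    Hypothetical helper, not part of White Rabbit. Values whose share of
    the column falls below floor_pct percent get a masked label, and their
    raw count is withheld (None) so the exact small count is not exposed.
    """
    counts = Counter(values)
    total = sum(counts.values())
    rows = []
    # most_common() yields values from most to least frequent.
    for value, count in counts.most_common():
        pct = 100.0 * count / total
        if pct < floor_pct:
            rows.append((value, None, f"<{floor_pct}%"))
        else:
            rows.append((value, count, f"{pct:.1f}%"))
    return rows


if __name__ == "__main__":
    # Made-up example data: 10,000 rows of a gender-like column.
    column = ["M"] * 6000 + ["F"] * 3997 + ["U"] * 3
    for value, count, label in summarize_column(column):
        print(value, count, label)
```

Common values keep both the raw count and the percentage, while rare ones only show the masked label. Whether the value itself should also be hidden in that case is a separate question I haven't tried to answer here.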