Significance Tests for Patterns in Continuous Data

2001, IEEE International Conference on Data Mining

16 pages

In this paper we consider the question of uncertainty of detected patterns in data mining. In particular, we develop statistical tests for patterns found in continuous data, indicating the significance of these patterns in terms of the probability that they have occurred by chance. We examine the performance of these tests on patterns detected in several large data sets, including

Figures (12)

The concentrations of objects at the corners of the plot show that edge effects are particularly evident here. In spatial statistics, there are many approaches to coping with these edge effects. One solution is a toroidal wrap-around of the data such that no object is, in effect, on an edge. This is a practical solution but not so easy to justify in higher dimensions. We also note that an object on the edge of a cluster will display edge effects that will not be removed by a wrap-around of the whole data set. For example, in Figure 2 we have a data set in two dimensions with 10,000 objects uniformly distributed over four unit squares. Figure 3 shows the objects significant at the 1% level after the toroidal wrap-around has removed edge effects.
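The toroidal wrap-around described above can be sketched as a distance function in which each axis wraps, so that an object near one side of the region is close to objects near the opposite side. This is a minimal illustration, not the paper's implementation: the function names, the brute-force neighbor search, and the 2 x 2 box used in the usage note are assumptions made for the example.

```python
import math

def toroidal_distance(p, q, box):
    """Euclidean distance between p and q when every axis wraps around.

    `box` holds the side length of the region along each axis
    (an assumed representation, not taken from the paper).
    """
    total = 0.0
    for a, b, side in zip(p, q, box):
        d = abs(a - b) % side
        d = min(d, side - d)  # take the shorter way around the torus
        total += d * d
    return math.sqrt(total)

def k_nearest_distances(points, index, k, box):
    """Sorted toroidal distances from points[index] to its k nearest neighbors."""
    dists = sorted(
        toroidal_distance(points[index], other, box)
        for j, other in enumerate(points)
        if j != index
    )
    return dists[:k]
```

With a box of (2.0, 2.0), the points (0.05, 0.5) and (1.95, 0.5) are about 0.1 apart rather than 1.9, so an object next to the boundary no longer has neighbors on one side only. As the text notes, this does nothing for an object on the edge of a cluster inside the region.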

Figure 1. Scatterplot Matrix of significant objects at 0.05% level

Figure 2. Full data set

Figure 4. Illustration of edge effect. We note that objects that are precisely on an edge or a corner are not necessarily affected by edge effects. In this case, the nearest neighbors all lie in a volume arc of the hypersphere centered at the object in question. Within this arc, the assumptions of the distribution of the radius still hold. However, an object a short distance from the edge will have nearest neighbors in a small radius sphere, but the next nearest neighbors only on one side (see Figure 4), thus artificially reducing the mean nearest neighbor distance. This is why one approach to edge correction in spatial statistics is to examine only those objects within a ‘guard area’ such that edge effects are not encountered.
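The guard-area correction mentioned above can be sketched as a simple filter: only objects at least a fixed margin away from every boundary are examined, while all objects remain available as neighbors. The function names, the axis-aligned rectangular region, and the `margin` parameter are hypothetical choices for this illustration.

```python
def inside_guard_area(point, lower, upper, margin):
    """True if `point` is at least `margin` away from every boundary.

    `lower` and `upper` give the bounds of an axis-aligned region
    (an assumption of this sketch).
    """
    return all(
        lo + margin <= x <= hi - margin
        for x, lo, hi in zip(point, lower, upper)
    )

def interior_objects(points, lower, upper, margin):
    """Objects to be tested; the rest serve only as potential neighbors."""
    return [p for p in points if inside_guard_area(p, lower, upper, margin)]
```

A natural choice is a margin at least as large as the largest neighbor distance being considered, so that the sphere around each tested object lies entirely inside the region. The cost is that objects near the boundary are never tested.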

Figure 5. Art2 data. These added objects constitute a peak above the background noise of Art1, although it is not clear from the plot whether the peak is significantly different from the background noise. We set k = 20 and L = 50, 100 and 200 to represent ‘small’, ‘medium’ and ‘large’ peaks; note that the smallest value of L was set to equal the number of objects in the peak and the medium value was set at twice the number of objects in the peak. We also employed the toroidal wrap-around to remove edge effects.

Table 1. Results from running PEAKER+ on Art1.

Table 2. Results from running PEAKER+ on Art2. The number of peaks found and the minimum value of F_,, for the mean and median

Table 3. Results from running PEAKER+ on Quakes data. The Quakes data set consists of the longitude and latitude measurements of all 2,049 earthquakes with Richter intensity above 2.5 in California, USA recorded between 1962 and 1981. Patterns found here may indicate an unusually high density of earthquakes in a local region, as opposed to earthquakes that occur randomly in the region. Again, we set k = 20 but this time we varied L over smaller values than those we used in the artificial data sets in order to find peaks with smaller support. The number of peaks significant at the 5% level found by PEAKER+ for these parameter values on the Quakes data is shown in Table 3.
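To illustrate what "occurring randomly in the region" means for a local density, the sketch below computes the chance of seeing a k-th nearest-neighbor distance as small as the observed one when objects follow a homogeneous Poisson process in the plane. This is a standard textbook null model and is not PEAKER+ or the paper's test, neither of which is specified in this excerpt; the function name and the `intensity` parameter (expected objects per unit area) are assumptions of the example.

```python
import math

def knn_distance_p_value(r, k, intensity):
    """P(k-th nearest-neighbor distance <= r) under a homogeneous 2-D Poisson process.

    The count of objects in a disc of radius r is Poisson with mean
    mu = intensity * pi * r**2, and the k-th neighbor lies within r
    exactly when that count is at least k.
    """
    mu = intensity * math.pi * r * r
    term = math.exp(-mu)      # P(count = 0)
    prob_fewer_than_k = 0.0
    for j in range(k):
        prob_fewer_than_k += term
        term *= mu / (j + 1)  # P(count = j + 1) from P(count = j)
    # Clamp to guard against tiny negative values from rounding.
    return max(0.0, 1.0 - prob_fewer_than_k)
```

A small value means the k-th neighbor is closer than uniform scatter would usually produce, which points to a locally high density. The sketch ignores edge effects and the fact that many objects are examined at once, both of which a practical test has to handle.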

Figure 6. Quakes data set with significant peaks. k = 20, L = 25. The value of L is an indication of the level of support for a pattern, so our choice of L reflects the pattern size if a peak is flagged as being significant. In Figure 6, many peaks are flagged but only a few are significant. In some cases, peaks that look like they should be patterns are not significant (for example, the upper left peak). They are not significant for this level of L; however, this is not to say that they will not be significant at some other level of L. We see in Figure 7 that the upper left peak is now significant, whereas the three significant peaks on the left-hand side in Figure 6 are not significant for L = 50. By eye, we can sometimes find objects that look like they are not features of random variation, but by applying PEAKER+ we can find peaks and assess their significance for particular values of L, also identifying patterns that the naked eye may miss.

Table 4. Results from running PEAKER+ on Crpret1 data. We set k = 20 and recorded results for L = 25, 50 and 100, to discover small patterns in the data. Results in Table 4 suggest that there are 2 significant peaks at the 5% level for L = 50 and that either one or both of these peaks are also significant for L = 30, depending on which test we use. No peaks are significant at the larger L = 100 level.

Figure 7. Quakes data set with significant peaks. k = 20, L = 50.