Showing posts with label algorithm. Show all posts
Showing posts with label algorithm. Show all posts

Wednesday, July 23, 2014

Flood fill a region of an active device in R


The following is a function to "flood fill" a region on the active plotting device. Once called, the user will be asked to click on the desired target region. The flood fill algorithm then searches neighbors in 4 directions of the target cell (down, left, up, right) and checks for similar colors to the target cell. If neighboring cells are of the same color, their color is changed to a defined replacement color, and the cell number is added to a "queue" for further searches of neighbors. Once a cell has been checked, its position is added to a list of completed cells. This algorithm is referred to as "Four-way flood fill using a queue for storage".

Here's a visualization of the Four-way flood fill from Wikimedia Commons:

http://commons.wikimedia.org/wiki/File:Wfm_floodfill_animation_stack.gif
This is kind of a pointless exercise given that any basic image editing programs (e.g. Microsoft Paint) can do this much more efficiently; Nevertheless, I felt compelled to figure out a way of programming this in R (I was originally interested in filling in land areas on a map that I created in R). You'll see from my example above that I didn't quite get it right - there is still some blank white space within the regions that I filled. Part of this problem is remedied by exporting a higher resolution image (floodfill argument "res"), but this slows things down considerably.

In order to have this function work directly on an open graphics device, I exported a PNG image and then re-imported it and trimmed off the margins. What remains is an image of the plot region itself  which I convert to a matrix and look-up dataframe, where each cell's color and neighboring cells are defined. It is this dataframe that forms the basis of my searching algorithm. I'm guessing I have made some sort of small mistake in how I trimmed the margins of the image, thus creating the slight offset in the filled region. Anyway, feel free to suggest improvements!

Function:

Monday, May 19, 2014

Automated determination of distribution groupings - A StackOverflow collaboration

For those of you not familiar with StackOverflow (SO), it's a coder's help forum on the StackExchange website. It's one of the best resources for R-coding tips that I know of, due entirely to the community of users that routinely give expert advise (assuming you show that you have done your homework and provide a clear question and a reproducible example). It's hard to believe that users spend time to offer this help for nothing more than virtual reputation points. I think a lot of coders are probably puzzle fanatics at heart, and enjoy the challenge of a given problem, but I'm nevertheless amazed by the depth of some of the R-related answers. The following is a short example of the value of this community (via SO), which helped me find a solution to a tricky problem.

I have used figures like the one above (left) in my work at various times. It present various distributions in the form of a boxplot, and uses differing labels (in this case, the lowercase letters) to denote significant differences; i.e. levels sharing a label are not significantly different. This type of presentation is common when showing changes in organism condition indices over time (e.g. Figs 3 & 4, Bullseye puffer fish in Mexico).

In the example above, a Kruskal-Wallis rank sum test is used to test differences across all levels, followed by pairwise Mann-Whitney rank tests. The result is a matrix of p-values showing significant differences in distribution. So far so good, but it's not always clear how the grouping relationships should be labelled. In this relatively simple example, the tricky part is that level 1 should be grouped with 3 and 5, but 3 and 5 should not be grouped; Therefore, two labeling codes should be designated, with level 1 sharing both. I have wondered, for some time, if there might be some way to do this in an automated fashion using an algorithm. After many attempts on my own, I finally decided to post a question to SO.

So, my first question "Algorithm for automating pairwise significance grouping labels in R" led me to the concept of the "clique cover problem", and "graph theory" in general, via SO user "David Eisenstat". While I didn't completely understand his recommendation at first, it got me pointed in the right direction - I ultimately found the R package igraph for analyzing and plotting these types of problems.

The next questions were a bit more technical. I figured out that I could return the "cliques" of my grouping relationships network using the cliques function of the igraph package, but my original attempt was giving me a list all relationships in my matrix. It was obvious to me that I would need to identify groupings where all levels were fully connected (i.e. each node in the clique connects to all others). So, my next question "How to identify fully connected node clusters with igraph [in R]" got me a tip from SO user "majom", who showed me that these fully connected cliques could be identified by first reordering the starting nodes in my list of connections (before use in the graph.data.frame function), and then subjecting the resulting igraph object to the function maximal.cliques. So, the first suggestions from David were right on, even though they didn't include code. The result shows nicely all those groupings in the above example (right plot) with fully connected cliques [i.e. (1, 3), (1, 5), (2), (4, 6), (7)].

The final piece of the puzzle was more cosmetic - "How to order a list of vectors based on the order of values contained within [in R]". A bit vague, I know, but what I was trying to do was to label groups in a progressive way so that earlier levels received their labels first. I think this leads to more legible labeling, especially when levels represent some process of progression. At the time of this posting, I have received a single negative (-1) vote on this question... This may have to do with the clarity of the question - I seem to have confused some of the respondents based on follow up comments for clarification - or, maybe someone thought I hadn't shown enough effort on my own. There's no way to know without an accompanying comment. In any case, I got a robust approach from SO user "MrFlick", and I can safely say that I would never have come up with such an eloquent solution on my own.

In all, this solution seems to work great. I have tried it out on larger problems involving more levels and it appears to give correct results. Here is an example with 20 levels (a problem that would have been an amazing headache to do manually):
Any comments are welcome. There might be other ways of doing this (clustering?), but searching for similar methods seems to be limited by my ability to articulate the problem. Who would have thought this was an example of a "clique cover problem"? Thanks again to all those that provided help on SO!

Code to reproduce the example:

Tuesday, October 30, 2012

DINEOF (Data Interpolating Empirical Orthogonal Functions)


I finally got around to reproducing the DINEOF method (Beckers and Rixon, 2003) for optimizing EOF analysis on gappy data fields - it is especially useful for remote sensing data where cloud cover can result in large gaps in data. Their paper gives a nice overview of some of the various methods that have been used for such data sets. One of these approaches, which I have written about before,  involves deriving EOFs from a covariance matrix as calculated from available data. Unfortunately, as the author's point out, such covariance matrices are no longer positive-definite, which can lead to several problems. The DINEOF method seems to overcome several of these issues.