The Glowing Python: filtering


Monday, January 28, 2013

A toy Bloom Filter

A Bloom Filter is a data structure designed to tell you, rapidly and memory-efficiently, whether an element is present in a set. It is based on a probabilistic mechanism where false positive retrieval results are possible, but false negatives are not. In this post we will see a pure Python implementation of the Bloom Filter, and at the end we will see how to tune the parameters in order to minimize the number of false positive results.
Let's begin with a little bit of theory. The idea behind the filter is to allocate a bit vector of length m, initially all set to 0, and then choose k independent hash functions, h₁, h₂, ..., h_k, each with range {1, ..., m}. When an element a is added to the set, the bits at positions h₁(a), h₂(a), ..., h_k(a) in the bit vector are set to 1. Given a query element q, we can test whether it is in the set using the bits at positions h₁(q), h₂(q), ..., h_k(q) in the vector. If any of these bits is 0 we report that q is not in the set; otherwise we report that q is. The thing we have to care about is that in the second case there remains some probability that q is not in the set, because other elements may have set all of those bits, which leads to a false positive response.
The following class is a naive implementation of a Bloom Filter (pay attention: this implementation is not supposed to be suitable for production. It is made just to show how a Bloom Filter works and to study its behavior):

class Bloom:
 """ Bloom Filter """
 def __init__(self,m,k,hash_fun):
  """
   m, size of the vector
   k, number of hash functions to compute
   hash_fun, hash function to use
  """
  self.m = m
  # initialize the vector 
  # (attention a real implementation 
  #  should use an actual bit-array)
  self.vector = [0]*m
  self.k = k
  self.hash_fun = hash_fun
  self.data = {} # data structure to store the data
  self.false_positive = 0

 def insert(self,key,value):
  """ insert the pair (key,value) in the database """
  self.data[key] = value
  for i in range(self.k):
   self.vector[self.hash_fun(key+str(i)) % self.m] = 1

 def contains(self,key):
  """ check if key is contained in the database
      using the filter mechanism """
  for i in range(self.k):
   if self.vector[self.hash_fun(key+str(i)) % self.m] == 0:
    return False # the key doesn't exist
  return True # the key can be in the data set

 def get(self,key):
  """ return the value associated with key """
  if self.contains(key):
   try:
    return self.data[key] # actual lookup
   except KeyError:
    self.false_positive += 1
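
The comment in the constructor notes that a real implementation should use an actual bit array. As a rough illustration of what that could look like, here is a minimal sketch that packs the m bits into a bytearray. The class name BitBloom and its method names are made up for this example, and sha256 is hard-coded only to keep the block self-contained; it is not meant as a production design.

```python
import hashlib

class BitBloom:
    """Bloom filter sketch backed by a bytearray (hypothetical name)."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        # one bit per slot, rounded up to a whole number of bytes
        self.bits = bytearray((m + 7) // 8)

    def _positions(self, key):
        # same trick as the class above: salt the key with the index i
        for i in range(self.k):
            digest = hashlib.sha256((key + str(i)).encode('utf-8')).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, key):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))


bf = BitBloom(m=1000, k=5)
keys = ['key-%d' % i for i in range(50)]
for key in keys:
    bf.add(key)

# every inserted key is reported as present (no false negatives)
print(all(key in bf for key in keys))  # True
print(len(bf.bits))                    # 125 bytes for 1000 bits
```

The vector now takes one bit per position instead of one list element, while the lookup logic stays the same as in the class above.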

Using this filter is pretty easy: we initialize the data structure with the size of the bit vector, a value for k and a hash function, then we can start adding items as in this example:

import hashlib

def hash_f(x):
 # sha256 works on bytes, so the string is encoded first
 h = hashlib.sha256(x.encode('utf-8')) # we'll use sha256 just for this example
 return int(h.hexdigest(),base=16)

b = Bloom(100,10,hash_f)
b.insert('this is a key','this is a value')
print(b.get('this is a key'))

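Now, the problem is to choose the parameters of the filter in order to minimize the number of false positive results. Each insertion sets k bits, so after inserting n elements into a table of size m, the probability that a particular bit is still 0 is exactly

$$\left(1 - \frac{1}{m}\right)^{kn}$$

Hence, after n insertions, the probability that a certain bit is 1 is

$$1 - \left(1 - \frac{1}{m}\right)^{kn}$$

A false positive occurs when all k bits probed for a query are 1, which happens with probability approximately $\left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^k$. So, for fixed parameters m and n, the optimal value k that minimizes this probability is

$$k = \frac{m}{n} \ln 2$$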

With this in mind we can test our filter. The first thing we need is a function which tests the Bloom Filter for fixed values of m, n and k, counting the percentage of false positives:

import random

def rand_data(n, chars):
 """ generate random strings using the characters in chars """
 return ''.join(random.choice(chars) for i in range(n))

def bloomTest(m,n,k):
 """ return the percentage of false positives
     over n queries for keys that were never inserted """
 bloom = Bloom(m,k,hash_f)
 # generating random data
 rand_keys = [rand_data(10,'abcde') for i in range(n)]
 # pushing the items into the data structure
 for rk in rand_keys:
  bloom.insert(rk,'data')
 # adding other elements to the dataset
 rand_keys = rand_keys + [rand_data(10,'fghil') for i in range(n)]
 # performing a query for each element of the dataset
 for rk in rand_keys:
  bloom.get(rk)
 return float(bloom.false_positive)/n*100.0

If we fix m = 10000 and n = 1000, according to the equations above, the value of k which minimizes the false positive rate is (m/n) ln 2 = 10 ln 2 ≈ 6.93. We can confirm that experimentally with the following test:

# testing the filter
m = 10000
n = 1000
k = range(1,64)
perc = [bloomTest(m,n,kk) for kk in k] # k is varying

# plotting the result of the test
from pylab import plot,show,xlabel,ylabel
plot(k,perc,'--ob',alpha=.7)
ylabel('false positive %')
xlabel('k')
show()

The result of the test should be as follows

Looking at the graph we can confirm that for k around 7 we have the lowest false positive percentage.
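
As a cross-check on the experiment, the expected false positive rate can also be computed directly from the standard approximation (1 - (1 - 1/m)^(kn))^k, which assumes the k hash positions behave as independent uniform draws. The sketch below evaluates it for the same m and n; the helper name false_positive_rate is just for this example.

```python
import math

def false_positive_rate(m, n, k):
    """Approximate false positive probability after n insertions."""
    p_bit_set = 1.0 - (1.0 - 1.0 / m) ** (k * n)
    return p_bit_set ** k

m, n = 10000, 1000

# real-valued optimum from k = (m/n) ln 2
k_opt = float(m) / n * math.log(2)
print(round(k_opt, 4))  # 6.9315

# best integer k over the same range used in the experiment
rates = {k: false_positive_rate(m, n, k) for k in range(1, 64)}
best_k = min(rates, key=rates.get)
print(best_k)  # 7
print(rates[best_k] * 100.0)  # predicted percentage, a bit under 1%
```

If the hash function mixes well, the percentages measured by bloomTest should fluctuate around these predicted values.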

Sunday, February 5, 2012

Convolution with numpy

A convolution is a way to combine two sequences, x and w, to get a third sequence, y, that is a filtered version of x. The convolution of the sample x_t is computed as follows:

It is the mean of a weighted sum over a window of length k, where the w_t are the weights. Usually, the sequence w is generated using a window function. NumPy has a number of window functions already implemented: bartlett, blackman, hamming, hanning and kaiser. So, let's plot some Kaiser windows varying the parameter beta:

import numpy
import pylab

beta = [2,4,16,32]

pylab.figure()
for b in beta:
 w = numpy.kaiser(101,b) 
 pylab.plot(range(len(w)),w,label="beta = "+str(b))
pylab.xlabel('n')
pylab.ylabel('W_K')
pylab.legend()
pylab.show()

The graph would appear as follows:

And now, we can use the function convolve(...) to compute the convolution between a vector x and one of the Kaiser windows we have seen above:

def smooth(x,beta):
 """ kaiser window smoothing """
 window_len=11
 # extending the data at beginning and at the end
 # to apply the window at the borders
 s = numpy.r_[x[window_len-1:0:-1],x,x[-1:-window_len:-1]]
 w = numpy.kaiser(window_len,beta)
 y = numpy.convolve(w/w.sum(),s,mode='valid')
 return y[5:len(y)-5]
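
The final slice in smooth() trims 5 samples from each side, which matches window_len = 11 only. A possible generalization, sketched below under the assumption that window_len is odd and no longer than the input, derives the trim from the window length. The name smooth_general is hypothetical and not part of the original code.

```python
import numpy

def smooth_general(x, beta, window_len=11):
    """Kaiser smoothing for any odd window_len (hypothetical helper)."""
    x = numpy.asarray(x, dtype=float)
    half = (window_len - 1) // 2
    # mirror window_len-1 samples at each end, as in smooth() above
    s = numpy.r_[x[window_len-1:0:-1], x, x[-1:-window_len:-1]]
    w = numpy.kaiser(window_len, beta)
    # 'valid' keeps only the positions where the window fully overlaps s,
    # which leaves len(x) + window_len - 1 samples
    y = numpy.convolve(w / w.sum(), s, mode='valid')
    # drop half a window on each side to get back to len(x)
    return y[half:len(y) - half]

x = numpy.ones(50)
y = smooth_general(x, beta=4)
print(len(y))                  # 50
print(numpy.allclose(y, 1.0))  # True
```

Because the weights are divided by their sum, a constant signal passes through unchanged, which is a quick way to check that the padding and trimming line up.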

Let's test it on a random sequence:

# random data generation
y = numpy.random.random(100)*100 
for i in range(100):
 y[i]=y[i]+i**((150-i)/80.0) # modifies the trend

# smoothing the data
pylab.figure(1)
pylab.plot(y,'-k',label="original signal",alpha=.3)
for b in beta:
 yy = smooth(y,b) 
 pylab.plot(yy,label="filtered (beta = "+str(b)+")")
pylab.legend()
pylab.show()

The program would have an output similar to the following:

As we can see, the original sequence has been smoothed by the windows.
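
To see what numpy.convolve computes on a case small enough to check by hand, here is a minimal sketch with three equal weights, which reduces to a plain moving average. The input values are arbitrary and chosen only for illustration.

```python
import numpy

x = numpy.array([1.0, 2.0, 3.0, 4.0, 5.0])
w = numpy.ones(3) / 3.0  # three equal weights summing to 1

# 'valid' returns only the positions where w fully overlaps x
valid = numpy.convolve(x, w, mode='valid')
print(valid)  # approximately [2. 3. 4.]

# the other modes differ only in how much of the border they keep
print(len(numpy.convolve(x, w, mode='same')))  # 5
print(len(numpy.convolve(x, w, mode='full')))  # 7
```

Strictly speaking, convolution reverses one of the two sequences before sliding it, but for symmetric windows such as the Kaiser windows used here that makes no difference.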

Friday, October 14, 2011

Beginning with OpenCV in Python

OpenCV (Open Source Computer Vision) is a library of programming functions for real time computer vision [Ref]. In this post we will see how to use some of the basic functions of OpenCV in Python.

The following code opens an image from the disk, prints some image properties on the console and shows a window that contains the image.

import cv # legacy OpenCV 2.x bindings (removed in OpenCV 3)

# load and show an image in gray scale
image = cv.LoadImage('ariellek.jpg',cv.CV_LOAD_IMAGE_GRAYSCALE)

# print some image properties
print 'Depth:',image.depth,'# Channels:',image.nChannels
print 'Size:',image.width,image.height
print 'Pixel values average',cv.Avg(image)

# create the window
cv.NamedWindow('my window', cv.CV_WINDOW_AUTOSIZE)
cv.ShowImage('my window', image) # show the image
cv.WaitKey() # the window closes when any key is pressed

This is the image I used for this example.

And this is what the script showed on the console:

Depth: 8 # Channels: 1
Size: 366 550
Pixel values average (80.46735717834079, 0.0, 0.0, 0.0)

Now we can resize the image loaded above:

# resize the image
dst = cv.CreateImage((150,150), 8, 1)
cv.Resize(image,dst,interpolation=cv.CV_INTER_LINEAR)
cv.ShowImage('my window', dst)
cv.WaitKey()
cv.SaveImage('image2.jpg', dst) # save the image

And this is the result.

A Sobel operator can be applied as follows:

# Sobel operator
dstSobel = cv.CreateMat(image.height, image.width, cv.CV_32FC1)
cv.Sobel(image,dstSobel,1,1,3)
cv.ShowImage('my window', dstSobel)
cv.WaitKey()
cv.SaveImage('imageSobel.jpg', dstSobel)

And this is the result on the picture that I'm using:

The final example below uses two operations, a smoothing filter and a subtraction. It blurs the original image with a 15x15 box filter (cv.CV_BLUR; a Gaussian blur would use cv.CV_GAUSSIAN instead) and takes the absolute difference between the original image and the filtered one.

# image smoothing and subtraction
imageBlur = cv.CreateImage(cv.GetSize(image), image.depth, image.nChannels)
# filtering the original image
cv.Smooth(image, imageBlur, cv.CV_BLUR, 15, 15)
diff = cv.CreateImage(cv.GetSize(image), image.depth, image.nChannels)
# absolute difference |original - filtered|
cv.AbsDiff(image,imageBlur,diff)
cv.ShowImage('my window', diff)
cv.WaitKey()
cv.SaveImage('imageDiff.jpg', diff)

The final output is:
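
To make the last two operations concrete without depending on a particular OpenCV version, here is a NumPy-only sketch of the same idea: a normalized box blur followed by an absolute difference. It is an illustration rather than a drop-in replacement for cv.Smooth and cv.AbsDiff. The function box_blur is hypothetical, the image is synthetic, and the border is handled by edge replication, so pixels near the border may differ from what OpenCV produces.

```python
import numpy

def box_blur(img, size=15):
    """Normalized size x size box filter (size odd), edge-replicated borders."""
    pad = size // 2
    padded = numpy.pad(img.astype(float), pad, mode='edge')
    kernel = numpy.ones(size) / size
    # a box filter is separable: blur along the rows, then along the columns
    rows = numpy.apply_along_axis(numpy.convolve, 1, padded, kernel, mode='valid')
    return numpy.apply_along_axis(numpy.convolve, 0, rows, kernel, mode='valid')

# synthetic 8-bit image: dark background with a bright square
image = numpy.zeros((40, 40), dtype=numpy.uint8)
image[10:30, 10:30] = 200

blurred = box_blur(image)
# absolute difference, computed in float to avoid uint8 wrap-around
diff = numpy.abs(image.astype(float) - blurred)

print(diff.shape)           # (40, 40)
print(diff[10, 20] > 50.0)  # True: large response on the square's edge
```

Flat regions are left almost unchanged by the blur, so they cancel in the difference, while edges are smeared and therefore stand out. In the newer cv2 bindings the corresponding calls are cv2.blur and cv2.absdiff.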