
28 May 2012

The Math of Search and Similarity


Part one: Lucene, The Boolean Model, tf*idf, and the Vector Space Model


Lucene is built on classic IR (Information Retrieval) math concepts that are actually pretty old by current standards in our industry; most of its roots are from the 1970s, and some of them, like the Boolean Model, are quite a bit older. Still, that's a solid quarter century older than search techniques like PageRank, and yet it remains a central and almost necessary technology in many systems.  The math of searching in Lucene relies on ideas like similarity measures, which also relate to other disciplines used in Data Mining and Machine Learning.  The basic ideas of Lucene and IR provide a nice introduction to these ideas, and I want to focus on the following classic IR topics:

Tokenizing Text


Lucene is built on the idea of Documents, which can be thought of as collections of terms.  Think of a document as a web page or book, and think of the terms as the words and, in some cases, phrases.  Lucene uses an Analyzer to tokenize the text; for example, you may not want to include Stop Words like [the, and, to, a, ...] in your index.  Indexes can also be built using the concept of N-Grams: take the expression "Information Retrieval", for example; you might want to treat this as a single term as opposed to breaking it up.  There are also Analyzers that do Stemming and even phonetic analysis.  The tokenizing process is involved, but that should give you a basic idea; more can be found here.  Tokenization allows the terms to be defined, identified and parsed, and it is the properties of terms in relation to the documents which drive the Boolean Model, the Vector Space Model and tf*idf.
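
To make this concrete, here is a minimal tokenizer sketch in Python. The stop list, the regex split, and the bigram construction are all simplifying assumptions for illustration, not how Lucene's Analyzer chain actually works:

    import re

    STOP_WORDS = {"the", "and", "to", "a", "of", "in", "is"}  # tiny illustrative stop list

    def tokenize(text, n=2):
        """Lower-case, split on non-letters, drop stop words, and emit word n-grams."""
        words = [w for w in re.split(r"[^a-z]+", text.lower()) if w and w not in STOP_WORDS]
        ngrams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
        return words, ngrams

    terms, bigrams = tokenize("An Introduction to Information Retrieval")
    print(terms)    # ['an', 'introduction', 'information', 'retrieval']
    print(bigrams)  # ['an introduction', 'introduction information', 'information retrieval']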

The Boolean Model


The Boolean Model treats documents as vectors of boolean indicators. What follows is a brief summary of ideas discussed in chapter one of Introduction to Information Retrieval, and I am borrowing part of the example from that chapter:

                 Antony and Cleopatra   Julius Caesar   The Tempest   Hamlet   Othello   Macbeth
    Antony                1                   1              0           0        0         1
    Brutus                1                   1              0           1        0         0
    Caesar                1                   1              0           1        1         1
    Calpurnia             0                   1              0           0        0         0
    Cleopatra             1                   0              0           0        0         0

This is an incidence matrix that shows which terms are in which documents.  This representation can be viewed either as a set of columns, which form binary vectors indicating which terms occur in each document, or as a set of binary term vectors (rows), which show which documents contain a given term.  This example is also a bit deceiving in that these vectors will usually be pretty sparse; The Tempest and Othello document vectors (columns) and the Cleopatra and Calpurnia term vectors (rows) are more indicative of this sparseness at this small scale.  Whether we store and search this data as a full matrix or as individual vectors, it would require a lot of space and the space would be mostly zeros, so a better structure, the inverted index, is used to store and search it; the details are covered in chapter one of Introduction to Information Retrieval and Appendix B of Lucene in Action.

The queries on this structure take the form of standard Boolean logic, including Logical Conjunction, Logical Disjunction and Logical Negation. An example (again taken from Introduction to Information Retrieval, with some notational modification):

To answer the query [Brutus AND Caesar AND NOT Calpurnia] we create a vector (q): we take the Boolean vectors for Brutus (b), Caesar (cs) and Calpurnia (cl), complement the last, and then do a bitwise AND, where all Boolean vector operations are bitwise operations. This can be denoted as follows:

    q = b ∧ cs ∧ ¬cl
      = 110100 ∧ 110111 ∧ ¬010000
      = 110100 ∧ 110111 ∧ 101111
      = 100100

The resulting vector for this query yields [Antony and Cleopatra, Hamlet], vector positions one and four.

In the above example you can see how using the term (row) vectors demonstrates the way the inverted index structure works: we are indexing into the corpus based on the terms, not on the documents. Where a document is usually a collection of terms (a column), we have inverted this document/term relationship and are now using the term/document relationship (a row) to link to the documents. The inverted index structure in Lucene is built on listing terms and then linking to the documents; in Lucene each document is given an integer document id, so each term has a list of the ids of the documents that contain that term. The inverted index structure can also include other information, such as the position of the term in the document or other metadata.
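
Here is a toy sketch of the idea in Python, using sets in place of Lucene's postings lists; the document ids and terms come from the example above, and everything else is assumed for illustration:

    from collections import defaultdict

    # Document id -> terms it contains (a tiny slice of the incidence matrix above).
    docs = {
        1: ["antony", "brutus", "caesar", "cleopatra"],   # Antony and Cleopatra
        2: ["antony", "brutus", "caesar", "calpurnia"],   # Julius Caesar
        4: ["brutus", "caesar"],                          # Hamlet
    }

    # Invert it: term -> set of ids of documents containing that term.
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)

    # Brutus AND Caesar AND NOT Calpurnia
    all_ids = set(docs)
    result = index["brutus"] & index["caesar"] & (all_ids - index["calpurnia"])
    print(sorted(result))  # [1, 4] -> Antony and Cleopatra, Hamlet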

The inverted index is the same concept as a concordance, a list of words and their positions in a work, used by scholars. A famous example is the one published for the Dead Sea Scrolls by the team who exclusively controlled them; they published images of the scrolls at a very slow pace, which left many unpublished and created a researcher monopoly on the scripts, upsetting the many researchers who did not have access. In 1988 Ben Zion Wacholder used a computer to reconstruct 17 documents from the concordance, and these were published in 1991.  Coincidentally, that same month a complete set of facsimiles was discovered at the Huntington Library and published, breaking the monopoly on the scrolls; they are now available free online, as they should be.

tf*idf Weighting (Term Frequency*Inverse Document Frequency)


tf*idf stands for Term Frequency * Inverse Document Frequency, which is just what it says: the Term Frequency multiplied by the Inverse Document Frequency.  In 1972 Karen Spärck Jones published a paper proposing what has become known as Inverse Document Frequency (idf); it is defined as:

    idf(t) = log ( |D| / |dₜ| )

Where |D| is the size of the set of all of the documents and |dₜ| is the number of documents that contain the term t. This formula is effectively the log of the inverse of the estimated (uniform) probability that a term will occur in a document, which would be given by the formula:

    P(t) = |dₜ| / |D|

This can be rewritten as:

    idf(t) = log ( 1 / P(t) )

Using the logarithmic identity below; the reciprocal is needed since the log of a probability, which is less than or equal to one, will always be either negative or zero:

    log ( 1 / x ) = −log x

The interesting thing about using logs is that it makes the idf terms addable. For example, suppose we want to compute the Inverse Document Frequency for term one and (∧) term two, assuming they are independent, which means the probability is multiplicative: P(x ∧ y) = P(x)P(y). Then:

    idf(t₁ ∧ t₂) = log ( 1 / ( P(t₁) P(t₂) ) )
                 = log ( 1 / P(t₁) ) + log ( 1 / P(t₂) )
                 = idf(t₁) + idf(t₂)

The additive property is due to another logarithmic identity, which also forms a group isomorphism between multiplication of positive real numbers and addition of real numbers:

    log ( x y ) = log x + log y

Now, to get tf*idf we multiply the Term Frequency, which is the number of times the term occurs in the document, by the Inverse Document Frequency. The complete formula is:

    tf*idf(t, d) = tf(t, d) × log ( |D| / |dₜ| )

When I learned about this I immediately thought of the entropy formula from Shannon's Information Theory:

    H(X) = −Σᵢ P(xᵢ) log P(xᵢ)

It turns out that this connection has been thought of and debated and is not really correct, a topic that is fairly involved; you can read more about it in this paper by Stephen Robertson. The ideas above about the logarithmic properties of addition in idf are taken from that paper.

Basically, tf*idf is a term weighting mechanism which maps documents and terms to real numbers, and we need this for the next piece of the puzzle.
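
As a rough sketch, and borrowing the term counts from the concordance table later in this post, the two formulas can be computed like this in Python (the natural log is an assumption that happens to match the tables below):

    import math

    # Term counts per play, taken from the concordance table later in this post.
    docs = {
        "Antony and Cleopatra": {"antony": 157, "brutus": 3, "caesar": 159, "cleopatra": 56},
        "Julius Caesar":        {"antony": 61, "brutus": 112, "caesar": 145, "calpurnia": 10},
        "The Tempest":          {},
        "Hamlet":               {"brutus": 1, "caesar": 2},
        "Othello":              {"caesar": 1},
        "Macbeth":              {"antony": 1, "caesar": 1},
    }

    def idf(term):
        """idf(t) = ln(|D| / |d_t|): log of total docs over docs containing the term."""
        d_t = sum(1 for counts in docs.values() if term in counts)
        return math.log(len(docs) / d_t)

    def tf_idf(term, doc):
        return docs[doc].get(term, 0) * idf(term)

    print(round(idf("calpurnia"), 4))                          # 1.7918 ≈ ln(6/1)
    print(round(tf_idf("antony", "Antony and Cleopatra"), 4))  # 108.8241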



The Vector Space Model

One problem with the Boolean Model is that it gives either too many results or too few.  In the example above a Shakespearean "micro-corpus" was used to demonstrate how it works, but in a real corpus a query like that could potentially retrieve thousands of documents, which would make it impractical to wade through the result set.  This is where a weighting mechanism like tf*idf comes in: it effectively maps the boolean values to values in the domain of Real Numbers, well, technically floating point numbers.  The Vector Space Model uses the columns of the incidence matrix above, the vectors of terms for each document, where each term position in the vector is calculated using the tf*idf weighting scheme.  These document vectors exist in a Vector Space, and a query can be constructed as a vector; each document can then be compared to the query vector using the angles between the query vector and the document vectors (α, θ) to determine which documents (d1, d2) best answer a query (q), which is easy to visualize in two dimensions.  You should note that in reality there will be a dimension for each additional term, which becomes a high-dimensional vector space consisting of hundreds or even thousands of dimensions. 

It is not the angle between the vectors that is calculated but the cosine of the angle, which is known as the Cosine Similarity; this is used because it is easier to calculate than the angle itself. The formula is:

    cos θ = (q · d) / ( ‖q‖ ‖d‖ )

The numerator is the Dot Product of the two vectors, a query vector and a document vector; the denominator is the product of the Euclidean Norms, or Magnitudes, of the two vectors.  The above equation can be written as follows with the expanded Dot Product and Euclidean Norm calculations:

    cos θ = Σᵢ qᵢ dᵢ / ( √(Σᵢ qᵢ²) √(Σᵢ dᵢ²) )

tf*idf is designed to increase as the number of occurrences of the term in a given document increases.  It should also be noted that the Vector Space Model can be built with term weighting mechanisms other than tf*idf.  More information can be found in "Cosine Similarity and Term Weight Tutorial", in chapter six of Introduction to Information Retrieval, and in the Lucene-specific information on scoring.
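
A minimal Python version of the cosine similarity formula above, with made-up toy vectors, might look like this:

    import math

    def cosine_similarity(q, d):
        """cos θ = (q · d) / (‖q‖ ‖d‖) for two equal-length weight vectors."""
        dot = sum(qi * di for qi, di in zip(q, d))
        norm_q = math.sqrt(sum(qi * qi for qi in q))
        norm_d = math.sqrt(sum(di * di for di in d))
        return dot / (norm_q * norm_d)

    # A toy query vector against two document vectors: the higher score wins.
    query = [1.0, 1.0, 0.0]
    print(cosine_similarity(query, [2.0, 1.0, 0.0]))  # ≈ 0.9487
    print(cosine_similarity(query, [0.0, 1.0, 3.0]))  # ≈ 0.2236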

Extending our example from above we can calculate our weights; first the idf values (using the natural log):

                 |D|   |dₜ|   idf
    Antony        6     3     0.69314718
    Brutus        6     3     0.69314718
    Caesar        6     5     0.18232155
    Calpurnia     6     1     1.79175946
    Cleopatra     6     1     1.79175946



From a Shakespearean concordance we can get the number of occurrences for each term:

                 Antony and Cleopatra   Julius Caesar   The Tempest   Hamlet   Othello   Macbeth
    Antony               157                  61             0           0        0         1
    Brutus                 3                 112             0           1        0         0
    Caesar               159                 145             0           2        1         1
    Calpurnia              0                  10             0           0        0         0
    Cleopatra             56                   0             0           0        0         0



Using the idf values and term frequencies we can calculate tf*idf for our documents:

                 Antony and Cleopatra   Julius Caesar   The Tempest   Hamlet       Othello      Macbeth
    Antony           108.82410            42.281978          0        0            0            0.69314718
    Brutus             2.0794415          77.632484          0        0.69314718   0            0
    Caesar            28.989127           26.436625          0        0.36464311   0.18232155   0.18232155
    Calpurnia          0                  17.917594          0        0            0            0
    Cleopatra        100.33853             0                 0        0            0            0


I will leave calculating a query as an exercise for you, if you are feeling so inclined or masochistic.

Lucene uses a combination of the Boolean Model and the Vector Space Model: when a query is run, the documents are first "approved", i.e. selected, by the Boolean Model and then scored by the Vector Space Model.  I believe this approach offers two advantages.  First, it reduces the number of calculations, since you aren't really searching the whole Vector Space but limiting your calculations to a subset.  Secondly, the Vector Space Model does not handle negation in queries, and I am not sure how negation is handled in Lucene in regards to scoring.  Actually, a number of IR tweaks and improvements for flexibility and efficiency have been incorporated into Lucene, and I confess I have only scratched the surface in both my coverage here and my current knowledge.

27 November 2011

Eleven Equations True Computer Science Geeks Should (at Least Pretend to) Know




This idea is a complete rip-off of an article that appeared in Wired a little while ago, and it got me thinking: what would my list for Computer Science look like?  Plus I thought it might be a fun post, and unlike the Wired list this one goes to eleven.  So here they are, in no particular order:



Binomial Coefficient

    C(n, k) = n! / ( k! (n − k)! )

The Binomial Coefficient equation generates Pascal's Triangle and gives you the coefficients for the Binomial Theorem. These ideas are often attributed to Pascal, but in fact they have been known, in part, for over a millennium.


As I mentioned, this list is in no particular order and I don't wish to play favorites, but if there is one equation here that you should really consider learning and committing to memory, this is it. It is central to Combinatorics, which has connections to many areas of math. I guess I should qualify that: if you are serious about computer science and math-related programming topics, then you should be able to recite and apply it from memory, even when you are drunk, sleep deprived or being held at gunpoint. Ok, that might be a bit much, but you get the idea.  If you are a programmer and you haven't learned it, or need a refresher, I have a post that relates it to the Java Collections API.
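
For a quick illustration, Python's standard library can generate the coefficients directly; each row printed below is a row of Pascal's Triangle (math.comb requires Python 3.8+):

    from math import comb  # comb(n, k) = n! / (k! * (n - k)!)

    # Each row of Pascal's Triangle is just the binomial coefficients C(n, 0..n).
    for n in range(5):
        print([comb(n, k) for k in range(n + 1)])
    # [1]
    # [1, 1]
    # [1, 2, 1]
    # [1, 3, 3, 1]
    # [1, 4, 6, 4, 1]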



De Morgan's Laws


Logical form:

    ~(P ∧ Q) ⇔ (~P ∨ ~Q)
    ~(P ∨ Q) ⇔ (~P ∧ ~Q)

Set form:

    (A ∩ B)ᶜ = Aᶜ ∪ Bᶜ
    (A ∪ B)ᶜ = Aᶜ ∩ Bᶜ

Ok, so that's like four "equations" for De Morgan's Laws. One can't help but be struck by the similarity between the two sets, and this is not a coincidence: these two sets of formulas are essentially the same thing. They are special cases of complemented distributive lattices, which means technically it's really just two formulas:

    ¬(x ∨ y) = ¬x ∧ ¬y
    ¬(x ∧ y) = ¬x ∨ ¬y

In this case the ∨ symbol means the lattice join operation, the ∧ symbol means the lattice meet operation, and ¬ denotes lattice complementation, which I used to differentiate it from the tilde for Boolean complementation.  Also, these two formulas are categorical duals of each other, which means that if one were smart and had a good grasp of Category Theory, one could probably be really OCD about it and get it down to one formula.


Understanding Set Theory and Boolean Algebra are very important basic concepts in our profession, and the Boolean version can make you a better programmer, as you can use it to refactor logical if statements.
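
A tiny, hedged example of that kind of refactoring; the function and bounds are made up for illustration:

    def out_of_range(x, lo, hi):
        # Original guard:
        #   if not (x >= lo and x <= hi): ...
        # De Morgan's law pushes the negation inward, removing the outer "not":
        return x < lo or x > hi

    assert out_of_range(11, 0, 10) and not out_of_range(5, 0, 10)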


Eigenvector and Eigenvalue

    A v = λ v

This equation and these concepts are not only central to Linear Algebra; these Linear Algebraic ideas and others are used in, and extend into, many other areas of math, including Linear Transformations and Graphs in terms of matrices like the Adjacency Matrix, and much more.


Linear Algebra is becoming increasingly central to the types of things we are doing in our profession; it is used heavily in Statistics and Machine Learning, not to mention that the PageRank Algorithm is based in part on this equation.
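
As a sketch of why the equation matters computationally, here is a bare-bones power iteration, the same basic idea behind finding PageRank's dominant eigenvector; the step count and the matrix are arbitrary illustrative choices:

    def power_iteration(A, steps=100):
        """Approximate the dominant eigenvalue/eigenvector of a square matrix A."""
        n = len(A)
        v = [1.0] * n
        for _ in range(steps):
            w = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]  # w = A v
            norm = max(abs(x) for x in w)
            v = [x / norm for x in w]  # rescale so the iteration doesn't blow up
        eigenvalue = sum(A[0][j] * v[j] for j in range(n)) / v[0]  # from row 0 of A v = λ v
        return eigenvalue, v

    # The eigenvalues of [[2, 1], [1, 2]] are 3 and 1; the dominant eigenvector is [1, 1].
    print(power_iteration([[2.0, 1.0], [1.0, 2.0]]))  # ≈ (3.0, [1.0, 1.0])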



Pumping Lemma for Regular Languages

    (∀ regular L ⊆ Σ*) (∃ p ≥ 1) (∀ w ∈ L, |w| ≥ p) (∃ x, y, z : w = xyz ∧ |y| ≥ 1 ∧ |xy| ≤ p ∧ (∀ i ≥ 0)(x yⁱ z ∈ L))

No Comp Sci list would be complete without at least one formula from either Formal Language Theory or Automata Theory.  The problem is that these two are pretty different from other areas of math in terms of "equations".  I tried to think of some basic ideas, and the Pumping Lemma came to mind; I found the above formula concisely expressed on Wikipedia.  This version of the Pumping Lemma is a more limited version of another theorem, and it is used to show that a language is not regular. Also, this is maybe not the best choice as it is pretty dense; if you want a better explanation, I recommend this.


While this "equation" is pretty specialized, it is actually a special case of the more general Iteration Theorem, and the real point here is about more general domain knowledge.  It is central to things you use every day as a programmer, such as compilers and regular expressions, not to mention the almost ubiquitous Kleene Star.


Information Entropy


I confess, after all my machinations to reduce De Morgan's Laws down to fewer equations, I am totally cheating here by pulling a twofer and including two formulas. The first is the entropy formula from Shannon's Information Theory:

    H(X) = −Σᵢ p(xᵢ) log p(xᵢ)

And the formula for Chaitin's Constant, which is associated with Algorithmic Information Theory and Kolmogorov Complexity:

    Ω = Σ 2^(−|p|)   (summed over all programs p that halt)

Interestingly, this area has a parallel in the physical world, in terms of Thermodynamic Entropy, which also parallels the Boltzmann Equation mentioned in the Wired article.  Seth Lloyd draws some interesting connections between computation and the physical world, including Entropy and Boltzmann's Equation.


In programming and computer science, information is our business, and these ideas are central to many things that we deal with, especially the storage, transmission, and computational transformation (like compression) of information.  We've already seen the possible relationship between both of these theories and Reusability and Software Frameworks.



Bayes' Theorem

    P(A|B) = P(B|A) P(A) / P(B)

Of all the equations on the list, Bayes' Theorem may stand out on its own merits as one of the most recognizable, and it might even qualify as a poster child for the statistical revolution sweeping our industry.  It rose from obscurity to being used for spam detection, and it shows up in such applications as classifiers, Machine Learning, Data Mining and more.


Mathematically it is part of a rift in statistics and probability, and I confess I may not yet call myself "an initiate of the Bayesian conspiracy", but it is hard to deny its utility; plus there seems to be a more axiomatic basis that relates it to Boolean Logic and Set Theory, which makes it all the more alluring.
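
A back-of-the-envelope sketch of the classic single-word spam filter application; all the probabilities here are invented for illustration:

    # P(spam | word) = P(word | spam) * P(spam) / P(word)
    p_spam = 0.4                  # prior: 40% of mail is spam (assumed)
    p_word_given_spam = 0.25      # the word appears in 25% of spam (assumed)
    p_word_given_ham = 0.01       # ...and in 1% of legitimate mail (assumed)

    # Total probability of seeing the word at all:
    p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)
    p_spam_given_word = p_word_given_spam * p_spam / p_word
    print(round(p_spam_given_word, 3))  # 0.943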



Fermat's Little Theorem

    a^(p−1) ≡ 1 (mod p)   where p ∈ P and a ⊥ p

    a^p ≡ a (mod p)   where p ∈ P

These two equations of Fermat’s Little Theorem are really the same equation written in two different forms, where (a ⊥ p) means coprime and (p ∈ P) means prime.


This is a beautiful little theorem in Number Theory which can be generalized and written even more succinctly using Euler's Totient Function, which is represented by φ(n):

    a^φ(n) ≡ 1 (mod n)   where a ⊥ n

These ideas are crucial for understanding encryption algorithms like RSA and there’s this cool result.
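
As a sketch, Fermat's Little Theorem gives a quick (though fallible) primality filter, using the same fast modular exponentiation that RSA implementations rely on:

    def fermat_test(p, a=2):
        """Fermat's little theorem as a probabilistic primality check:
        if p is prime and gcd(a, p) == 1 then a**(p-1) % p == 1.
        Some composites (Carmichael numbers) can fool it, so it is only a filter."""
        return pow(a, p - 1, p) == 1  # pow with a modulus is fast modular exponentiation

    print(fermat_test(101))  # True: 101 is prime
    print(fermat_test(100))  # False: 2**99 % 100 != 1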


Natural Join

    R ⋈ S = π_A ( σ_C ( R × S ) )

Natural Join, like all of the Relational Algebra Join Operations, is a composition of the more basic operations: Selection (σ), Projection (π), Cartesian Product (×), Set Difference (−) and Union (∪).  Natural Join is the Cartesian Product of two tables, selecting matching rows and eliminating duplicate columns.  (In the formula above, the selection (σ) keeps the rows where the set (C) of common attributes matches in both tables, and the projection (π) keeps the set (A) of attributes with duplicates removed.)


The Relational Algebra is, in my opinion, a bit of an odd beast when it comes to algebraic structures, but if you use SQL you are already using an implementation of it.  As data persistence becomes more heterogeneous, I think understanding these concepts will become more important.  Additionally, Relational Algebra has strong ties to Abstract Algebra and Category Theory.
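
A small illustrative sketch of a natural join over Python dicts, mirroring π_A(σ_C(R × S)); the tables and column names are made up:

    # Natural join over lists of dicts: match rows on the shared column names
    # ("id" here) and keep each duplicated column only once.
    R = [{"id": 1, "name": "ada"}, {"id": 2, "name": "alan"}]
    S = [{"id": 1, "lang": "lisp"}, {"id": 3, "lang": "ml"}]

    common = set(R[0]) & set(S[0])  # the shared attributes C
    joined = [
        {**r, **s}                            # projection: duplicate columns collapse
        for r in R for s in S                 # Cartesian product R × S
        if all(r[c] == s[c] for c in common)  # selection: equal on the common attributes
    ]
    print(joined)  # [{'id': 1, 'name': 'ada', 'lang': 'lisp'}]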



The Fixed-Point (Y) Combinator

    Y = λf. (λx. f (x x)) (λx. f (x x))

Just as we need an "equation" from Formal Language Theory or Automata Theory, we really should have one related to Lambda Calculus; in this case the equation is in the untyped lambda calculus.  The Fixed-Point Combinator stands up on its own, but it is fairly well known, at least in name, due to Paul Graham's use of it for the name of his incubator (Y Combinator).


The Fixed-Point Combinator allows one to implement anonymous recursion, which is a powerful programming technique.  It also ties into some deeper theoretical aspects of computer science and logic, such as Combinatory Logic.
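
Since Python evaluates arguments eagerly, the applicative-order variant of the fixed-point combinator (often called the Z combinator) is the one that actually runs; this is a sketch, not the lambda-calculus term above verbatim:

    # The inner self-application is eta-expanded (wrapped in "lambda v: ...") to
    # delay evaluation, otherwise a strict language recurses forever.
    Z = lambda f: (lambda x: f(lambda v: x(x)(v)))(lambda x: f(lambda v: x(x)(v)))

    # Anonymous recursion: a factorial that never refers to itself by name.
    fact = Z(lambda rec: lambda n: 1 if n == 0 else n * rec(n - 1))
    print(fact(5))  # 120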



O(n)


Many CS majors may cringe at this, and another possible reaction is that it's not much of an equation.  There's actually quite a bit to it; for example, did you know there are some substantial and quite beautiful equations associated with this seemingly simple notation?  They are:

    f(n) ∈ O(g(n))  ⟺  ∃ k > 0, ∃ n₀, ∀ n > n₀ : |f(n)| ≤ k·g(n)

    f(n) ∈ O(g(n))  ⟺  lim sup(n→∞) |f(n)| / g(n) < ∞

Also, the following is a definition of the Limit Superior (lim sup):

    lim sup(n→∞) xₙ = lim(n→∞) ( sup{ xₘ : m ≥ n } ) = inf{ sup{ xₘ : m ≥ n } : n ≥ 0 }

Where inf is the infimum and sup is the supremum, two concepts that seem to show up quite a bit, especially when you are concerned with bounds, as with Lattices, and they also show up in relation to Topology.


And we all know why we should care about this one: Algorithms, Algorithms, Algorithms.



Euler's Identity

    e^(iπ) + 1 = 0

Euler's Identity is also listed in Wired's article, and I have to agree that it is one of the most beautiful and intriguing formulas in math.


Now this isn't really as relevant to CS as the other formulas, but if you are a true computer geek you are hopefully some kind of math geek as well, and this is such a great formula, not to mention that it appears in The Simpsons' Homer³. It's actually a specific case (where θ = π) of Euler's Formula:

    e^(iθ) = cos θ + i sin θ

These formulas will give one a better understanding of complex numbers, which are needed in other areas of advanced math, and they also relate to Fourier Analysis, which does have applications in Computer Science.


I am curious to see how well this list holds up; how will I feel about it in a year?  Also, in writing this I realized that I still have quite a bit more to learn about at least half of this list, if not all of it, which made it hard to write, as it became quite a little (unfinished) research project. I have already written individual articles on several of these equations and have at least two in the works for Linear Algebra, at least two on the Relational Algebra and one planned for Pascal's Triangle, and those are just the ideas I've already had; who knows what the future holds.  I hope you enjoyed reading this as much as I enjoyed writing it.


The Set version may not apply to all cases in set theory; it assumes Union and Intersection over the Power Set of a Set Universe, which is closed under these operations and includes complements of all elements.  The Set Universe is the upper bound and the empty set is the lower bound.

19 June 2011

Free Math Resources You Can Use



In my opinion, this actually may be the best time in human history to learn math, or just about anything for that matter. If you are of sufficient means to have access to the internet, which unfortunately not all people are, you have access to an unprecedented amount of information. The problem is the overwhelming amount of information that is out there and how to find it. So I wanted to share some of my tricks for finding good math information, which I call "Math Mining"; of course these can be used for any academic information, so maybe "Academic Mining" would be more apropos.

One of the reasons that the present is a good time to learn math is the diversity of sources of information. Wikipedia is one of those sources, via Wiki Surfing, as has been previously discussed, but that is just the tip of the math iceberg, and the really cool thing is that there are many resources that can give you entirely new and fresh perspectives on things that may sometimes seem dull and obfuscated by the more traditional approaches found in books, not that there aren't a lot of great books too. It really is an exciting time.

Many professors have their publications, notes and course resources freely available on the internet, and some of these include full books in pdf, ps or html format. In fact, that leads me to my little Google trick. Let's assume you want to find some information about a math subject; we'll use Linear Algebra as an example. Then use the Google search:

"Linear Algebra" inurl:pdf

You will get a lot of hits that are academic pages; these will be a mix of publications and course-related material. Once you find a document, you can use its url to find more information. For example, let's say that our above search leads us to the following (fictitious) url:

www.math.umars.edu/~hjfarnsworth/math420/fall2010/chapter10.pdf

After you click the link and get your reward, you should realize that this is a potential gateway to much more information. Now admittedly this might be seen as a moral gray area, because sometimes I get the feeling that some of these resources are not as openly exposed as they could be, so it may be that the instructors do not want to openly share their work and are practicing security through obscurity. But in my opinion, if directory browsing is enabled and/or your documents are indexed by Google, then they're fair game; so if you are someone this applies to, I suggest that you either share your work openly or lock it down. I encourage anyone who is sharing their work to do it freely and openly regardless of whether people are taking your classes. After all, it's for the greater good. "The greater good." And if you openly share it, then people like me can read it, learn it, know it and talk about how awesome you and your work are. It's a win-win.

Hack the Site

Hack #1 Url Suffix Removal

Removing "chapter10.pdf", yielding www.math.umars.edu/~hjfarnsworth/math420/fall2010/, will expose more resources if this directory has browsing enabled or if it has a default page. You can progressively remove directories to find one that is useful; actually, sometimes it is worth it to jump directly to www.math.umars.edu/~hjfarnsworth/, which will often be a professor's home page and can yield links to publications, course pages with documents, and other potentially interesting information.

Hack #2 File Name Enumeration

So you looked at chapter10.pdf and it's awesome, but Hack #1 did not yield it or the related chapters. Given the naming convention, try www.math.umars.edu/~hjfarnsworth/math420/fall2010/chapter09.pdf or www.math.umars.edu/~hjfarnsworth/math420/fall2010/chapter9.pdf; often this approach will yield other related documents.

Hack #3 Invoke the Power of Google

Let's say that Hack #1 didn't work and the resultant url has a random characteristic like:

The following Google search will ferret out those pesky hard to find pdf's:

site:www.math.umars.edu/~hjfarnsworth/ inurl:pdf

Also, you can use .ps and .ps.gz in place of .pdf for file-type searches. If you feel that this is crossing some kind of moral line then don't do it, but I like to say: all is fair in Love and Math.

I would like to give another example of this technique. I recently came across "Mapreduce & Hadoop Algorithms in Academic Papers (4th update - May 2011)", which linked to "Max-cover algorithm in map-reduce", which caught my interest. Of course the ACM is charging for it, but no worries, there is usually no need to pay them; actually, I recommend boycotting them. I employed the above tricks, but they didn't work; simply Googling one of the authors did (always pick the most unique name(s)):

"Flavio Chierichetti"

This pulled up his web site, which had a free copy of the paper; now all I have to do is find the time to read it. Also, the above techniques yielded the paper's "cliff notes".

Of course you can just look up someone by name, for example, you can find some of Donald Knuth's publications here.

In regards to academic publications, there are two excellent repositories with a wealth of information. One is Citeseer out of Penn State; this site can be a little flaky in terms of availability, at least that's been my experience in the past. The other is arXiv, run by Cornell University. These mostly contain research-oriented work, but you can often find relevant information even for neophytes; actually, a lot of advanced papers, and books for that matter, start out with introductory sections that can be worth looking at.

Encyclopedic and other Miscellaneous Resources

Wikipedia, obviously, as previously mentioned. Also the oft controversial Stephen Wolfram provides an excellent resource called Wolfram Mathworld.

Project Euler is a site dedicated to collaboratively solving math-oriented problems programmatically; more about it can be found here.

Math on the Web provides some interesting links by category here; I believe it is run by the American Mathematical Society, but I am not sure.


The National Institute of Standards and Technology site: NIST Digital Library of Mathematical Functions.

Also there is Mathoverflow which is a Stackoverflow type of question and answer community devoted to Math.

Blogs

There are a number of blogs that cover both math and programming-related math. If your primary interest is machine learning, I recommend Bradford Cross's Measuring Measures blog; it is hard to find things on his site, and it was recently restyled with a magenta/maroon background which I now find a little bit harder to read. The relevant links here are: Learning About Network Theory, Learning About Statistical Learning, and Learning About Machine Learning, 2nd ed. Additionally, Ravi Mohan did a follow-up: Learning about Machine Learning.

Good Math Bad Math by Mark Chu-Carroll has lots of good articles about math, including some for beginners in various areas. Catonmat by Peteris Krumins has some nice entries with notes about the online MIT courses he has worked through, which currently cover Algorithms and Linear Algebra, also mentioned above. The Unapologetic Mathematician has a lot of nice articles, though it is a bit more advanced. Math-Blog has a lot of articles as well; they tend to focus on more traditional areas of math. Math blogs abound and there are too many to mention; here are a few:

Lastly, I will mention a blog by Jeff Moser; he only has a few math-related posts, but his Computing your Skill post on Probability and Statistics is a beautiful work of art, well worth looking at.

Online Courses

The well known Khan Academy offers a number of courses including several math courses.

MIT Open Courseware has many online courses most notably for CS majors Introduction to Algorithms by the venerable Charles Leiserson and Erik Demaine videos here and Linear Algebra by Gilbert Strang.

On Stanford Engineering Everywhere the following might be of interest:

Artificial Intelligence | Machine Learning

Artificial Intelligence | Natural Language Processing

Linear Systems and Optimization | The Fourier Transform and its Applications

The Mechanical Universe is primarily dedicated to physics, but several math topics such as Calculus and Vectors are covered explicitly. It's also a nice series of lectures on the topic, in spite of being a little dated in production values.

Other Online Videos

Two math documentaries covered here are Fermat's Enigma: The Epic Quest to Solve the World's Greatest Mathematical Problem and the overly dramatic but still interesting Dangerous Knowledge.

The Story of Maths by Marcus du Sautoy.

Keith Devlin talks about Pascal and Fermat's correspondence while they worked out probability in this interesting talk: Authors@Google: Keith Devlin.

Bob Franzosa - Introduction to Topology.

The Catsters videos on youtube cover various Category Theory related topics.

N J Wildberger's Algebraic Topology

Dan Spielman has a video discussing Expander Graphs.


Introduction to Game Theory by Benjamin Polak at Yale.


The site videolectures has many lectures in Computer Science and Math.

If you find these videos too slow, this might interest you.

Math Software

There are many math-related software packages and libraries, three of which are covered in more detail here.

The math library Sage, written in Python

GNU Octave

The R project for Statistical Computing

Scilab

Maxima, a Computer Algebra System

Various Books and Academic Stuff


Here are a bunch of courses and books that I have encountered during my searching which you might find interesting as well. These are presented in no particular order:




Algorithms

Algorithms by S. Dasgupta, C.H. Papadimitriou, and U.V. Vazirani.

Jeff Erickson has some Algorithms Course Materials



Steven Skiena, author of The Algorithm Design Manual, offers some pretty comprehensive course notes for his cse541 LOGIC for COMPUTER SCIENCE, not to mention the opportunity to learn how to bet on Jai-alai in the Cayman Islands.


Gregory Chaitin's Algorithmic Information Theory.



Computer Science


Foundations of Computer Science by Jeffrey Ullman and Al Aho.


The Haskell Road to Logic, Math and Programming by Kees Doets and Jan van Eijck



Discrete Math

Discrete Mathematics with Algorithms by M. O. Albertson and J. P. Hutchinson.




Analysis

Analysis WebNotes is a self-contained course in Mathematical Analysis for undergraduates or beginning graduate students.


Introduction to Analysis Lecture Notes by Vitali Liskevich.

Applied Analysis by John Hunter and Bruno Nachtergaele.


REAL ANALYSIS by Gabriel Nagy.



Probability Theory

Introduction to Probability Theory by Ali Ghodsi.

Introduction to Probability by Charles M. Grinstead and J. Laurie Snell.

The first three chapters of Probability Theory: The Logic of Science by E. T. Jaynes can be found here.

Think Stats: Probability and Statistics for Programmers by Allen B. Downey.


LECTURE NOTES MEASURE THEORY and PROBABILITY by Rodrigo Bañuelos.


Principles of Uncertainty, published by Chapman and Hall.



Information Theory, Inference, and Learning Algorithms by David MacKay.





Machine Learning/Data Mining

Machine Learning Module ML(M) by M. A. Girolami.

Alexander J. Smola's and S.V.N. Vishwanathan's draft of Introduction to Machine Learning.

The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Second Edition) by Trevor Hastie, Robert Tibshirani and Jerome Friedman.

Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze.

Mining of Massive Datasets by Anand Rajaraman and Jeffrey Ullman.



Fourier Theory


Lecture Notes for EE 261 The Fourier Transform and its Applications pdf By Brad Osgood.



Abstract Algebra

Abstract Algebra by Thomas W. Judson.


James Milne has a number of sets of extensive notes on algebraic topics like group theory here.


ABSTRACT ALGEBRA: A STUDY GUIDE FOR BEGINNERS by John A. Beachy.


Elements of Abstract and Linear Algebra by Edwin H. Connell.


Abstract Algebra by Elbert A. Walker.


A series of chapters on groups by Christopher Cooper.



Linear Algebra

Linear Algebra by Robert A. Beezer.


A course on Linear Algebra with book chapters.


Really cool interactive tutorial on Singular Value Decomposition by Todd Will.





Model Theory

Fundamentals of Model Theory pdf by William Weiss and Cherie D'Mello.



Set Theory

A book on Set Theory pdf by William Weiss.



Graph Theory

Reinhard Diestel makes his excellent and comprehensive book Graph Theory available, pdf here.



Logic

You can find Introduction to Mathematical Logic by J. Adler and J. Schmid, Model Theory, Universal Algebra and Order by J. Adler, J. Schmid and M. Sprenger, and other goodies here.


Introduction to Logic by Michal Walicki.


Logic for Computer Science: Foundations of Automatic Theorem Proving by Jean Gallier.


The Novel Research Institute has a number of free academic books, including Logic and Metalogic: Logic, Metalogic, Fuzzy and Quantum Logics and Algebraic Topology, Category Theory and Higher Dimensional Algebra - Results and Applications to Quantum Physics.



Category Theory

Some course notes on Category Theory by Tom Leinster.

Basic Category Theory pdf by Jaap van Oosten.

Abstract and Concrete Categories: The Joy of Cats by Jiří Adámek, Horst Herrlich and George E. Strecker.

A gentle introduction to category theory --- the calculational approach pdf by Maarten M. Fokkinga.

Steve Easterbrook's An introduction to Category Theory for Software Engineers.



Algebraic Topology/Topos Theory

Eugenia Cheng of Catsters fame has a course in Algebraic Topology with some substantial notes.

The above links of Eugenia Cheng refer to Algebraic Topology by Allen Hatcher.

An informal introduction to topos theory pdf by Tom Leinster.



Topology

A free, password-protected (password available by request) e-book on topology: Topology without Tears by Sidney A. Morris.

Chapters for a topology course by Anatole Katok can be found here.



Computational Topology

Jeff Erickson has some nice notes on Computational Topology, pdf's can be found on the schedule page.

Afra Zomorodian has some nice resources on Computational Topology including a nice introductory paper.



Spectral Graph Theory

Fan Chung Graham has a lot of interesting stuff, some pretty advanced, relating to graph theory, including social graph theory and spectral graph theory.

Dan Spielman has some course notes on Spectral Graph Theory.



Expander Graphs

Avi Wigderson's Expander Graphs and their Applications.



Fractal Geometry

The Algorithmic Beauty of Plants pdf by Przemyslaw Prusinkiewicz and Aristid Lindenmayer is available on the Algorithmic Botany site.



Game Theory

Thomas S. Ferguson's course at UCLA on Game Theory, as well as his Game Theory for Statisticians.


A course in game theory by Martin J. Osborne and Ariel Rubinstein, requires registration.




Algebraic/Enumerative Combinatorics

MIT Open Courseware in Algebraic Combinatorics


An uncompleted book and notes on Enumerative Combinatorics by the late Kenneth P. Bogart, also available here.


Lionel Levine's notes on Algebraic Combinatorics


Richard P. Stanley's new edition of Enumerative Combinatorics, Volume One.


A Course in Universal Algebra by Stanley N. Burris and H.P. Sankappanavar


Misc

Pat Hanrahan's CS448B: Visualization.



Sean Luke's "Essentials of Metaheuristics".


It's all a click away

The links in this entry, especially the academic links, are susceptible to link rot; people move from institution to institution or leave academia for jobs in the private sector. I will endeavor to revisit this entry and try to keep these up to date, and perhaps even add to them. However, if you encounter this page and have any interest in any or all of these resources, I recommend downloading them now so that you have them.

Using the resources of this entry, you should be able to get your hands on a huge number of free resources on a wide range of topics. This can be helpful if you are on a budget or just want to try before you buy an expensive book on a topic. I hope you avail yourself of some of these; there's lots of great stuff, and if you know of some that I do not, please add them in the comments.

01 June 2011

Math for Programmers TNG

Steve Yegge wrote a blog entry a few years back called "Math for Programmers"; as I previously mentioned, I have a few other entries that parallel it, and this is one of those.  I consider this an extension of his list, hence the title.

One of the "themes" of my blog is math for programmers; for example, see "Math You Can Use", "Monoid for the Masses", and "The Combinatorial and Other Math of the Java Collections API".  Here I am going to give some of my opinions and what I feel is a more comprehensive list than the one provided in Steve Yegge's original post; not that his isn't a good post, in fact I recommend reading it.  In it he recommends a technique I like to call Wiki Surfing, which has a nice Hawaiian/Polynesian ring to it, where you surf Wikipedia to learn about math; I am hoping that this post can in part serve as a giant "root node" to start from.

In my blog entry "The Math Debate" I talked about my future what and how posts; this is the "what".  I often see questions on various sites like Reddit from people who want to improve their math skills, asking how and what to learn, and I wanted to talk about what I feel is important. Remember, I am neither a mathematician nor do I play one on TV, although I do hope to play one on TV someday. Also remember that I am still learning about all of these areas, so take my advice with a kosher-sized grain of salt, and if you see errors please point them out.

In my opinion you can mostly not worry, for the moment, about things like Calculus and a lot of the continuous math. Don't get me wrong, you do eventually want to learn or relearn these things, but for CS and programming the Discrete Math is more immediately relevant.  I definitely think that if you really want to successfully use math in programming, you need to be something of a math generalist, which is really the "breadth first" approach that Steve Yegge mentions.

So the real question, problem if you will, is what to study; this is of course subjective and dependent on your goals. My motivation is multifaceted, with two primary motivations: one is pure curiosity coupled with wanting a higher understanding of Life, The Universe, and Everything; the other is wanting to do more interesting computer work and to do it better.  Fortunately, I mostly do not find these motivations contradictory.

Math is a huge field and its branches are intimately intertwined, which becomes more evident the deeper you dig. It seems to me that a broad perspective is helpful; of course I am laying out quite a bit here, so my recommendation is to try to glean a feel for the various disciplines. Also, as I mentioned, once you start digging you sometimes end up reading about areas that previously seemed unrelated.

If you wish to only focus on a smaller set of areas I would recommend the following for programming: 

Set Theory, Logic, Graph Theory with some Combinatorics, and some basic Abstract Algebra like Groups and Monoids.  Then go on to Linear Algebra, more Combinatorics, Probability and Statistics, and some type of intro to Formal Language Theory and Automata Theory; pretty much the stuff you might have taken for a B.S. in CS.  How and what you learn is really up to you; here is a more comprehensive list:

Set Theory is, in many respects, the "arithmetic" of doing advanced and CS-related math; it is essential and something whose basics you should plan on committing to memory. Do you know the following concepts: union, intersection, symmetric difference, set difference, disjoint, cardinality and the power set? Many of the following disciplines rely on Set Theory, and learning these basics can save you a lot of time, in the sense that you won't need to stop and think about them or look them up as you are learning other areas.
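
As a quick self-test, all of these operations are one-liners on Python sets; the power set helper is an illustrative sketch:

    from itertools import chain, combinations

    A, B = {1, 2, 3}, {3, 4}

    print(A | B)   # union: {1, 2, 3, 4}
    print(A & B)   # intersection: {3}
    print(A - B)   # set difference: {1, 2}
    print(A ^ B)   # symmetric difference: {1, 2, 4}
    print(A.isdisjoint({5, 6}))  # True; and len(A) is the cardinality: 3

    def power_set(s):
        """All subsets of s, 2**len(s) of them."""
        items = list(s)
        return [set(c) for c in chain.from_iterable(
            combinations(items, r) for r in range(len(items) + 1))]

    print(power_set({1, 2}))  # [set(), {1}, {2}, {1, 2}]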

Another discipline which could be thought of as an "arithmetic" of Math and CS is Propositional Logic; in fact it really is the "arithmetic" of your computer at the hardware level.  Also, Set Theory and Logic have a number of parallels, and the two are actually somewhat intertwined.  Here you should ask yourself: do you know the truth tables for and, or, nand, nor, and xor off the top of your head? The general relevance of Propositional Logic is enhanced by the added abstraction of Predicate Logic. Also, there are Combinatory Logics like Lambda Calculus which will help you better understand functional programming, among other things.
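
If you want to check yourself, a few lines of Python will print those tables; the column layout is just an illustrative choice:

    from itertools import product

    # Truth tables for the basic connectives, one row per input pair.
    print("a     b     and   or    nand  nor   xor")
    for a, b in product([False, True], repeat=2):
        print(a, b, a and b, a or b, not (a and b), not (a or b), a != b)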

The next topic, which probably needs no introduction, is Graph Theory.  This is another area where you should really plan on committing the basics to memory, and you might want to consider extensive study here if your motivation is advanced programming and software architecture. It's really cool, and there's tons of cool stuff happening in this field both mathematically and in our industry, like Social Network Theory.  I think Graph Theory is one of the best maths to study; I find it fascinating and extremely relevant to IT. It's also what I call a gateway math: once you get hooked on it, it leads to other maths.

Another area which could be considered foundational in an arithmetic sense is Combinatorics, the science of counting. I admit that this is a math that I love, and one that fits with being a programmer. A good programmer is always looking at things from the perspective of what all the possible outcomes and contingencies are; Combinatorics gives you a methodology to do that. I have written a programmer intro here.

While we are talking about possible outcomes, the natural extension is Probability and Statistics, which build in part on Combinatorics and are now pretty much at the forefront of hot "maths" for IT because of areas like Bayesian Statistics, Machine Learning and Data Science.  Also, if you delve deep into these, then you need to be looking at both differential and integral calculus.

An increasingly important area is Abstract Algebra, which includes our old friend the Monoid, along with Groups, Rings and Fields.  I confess that there are still quite a few things here that I am trying to wrap my head around, especially intuitively understanding the whole Polynomial thing, but I feel you can't go wrong studying this and that it is central to fundamentally understanding what we do.

One math that I may have implied you could initially skip is Trigonometry. That isn't really true, but I think you can get away with just getting Sine and Cosine under your belt at first; then you have enough to fall back on if you need more, and those two will help you with things like Linear Algebra and, if you have the inclination, Fourier Transforms.

Linear Algebra is pretty crucial; it's like the Swiss Army Knife of math, and to use another weak analogy, it's a little like a Collections API of math.  This is definitely worth studying and is used in a lot of areas, including Statistics, Machine Learning and Search Technologies, which can involve things like Vector Spaces.

Relations, Functions, Posets, etc.: there are some "basic" concepts that you usually get hit with during an introductory CS math course; it might be part of a Discrete Math class or some type of Introduction to "Advanced" or "Abstract" Math class, and it will usually follow pretty closely from Set Theory.  These things are pretty essential, and I don't think I was ever exposed to Posets in my curriculum, which I think was one of several deficiencies.  A couple of concepts that you want to know backwards and forwards, pun intended, are Injection, Surjection, and Bijection.  Also, it seems like the Ordering/Posets-related concepts of Infimum, aka Greatest Lower Bound, and Supremum, aka Least Upper Bound, come up a lot, at least in the stuff I look at.

Number Theory is actually pretty cool.  Prime Numbers, for example, fascinate many people, including me, and the field has some direct but perhaps more specialized applications to CS, like Hash Functions and Encryption.

Proofs – This is one area that is important and in which I am severely lacking; actually, with the exception of a few algebraic combinatorial proofs, I've been proof free since '93, but it's definitely on my todo list.   I mention it separately because it can be treated as its own discipline even though it's ubiquitous in Math.

CS Stuff

The following areas I would consider to be very Computer Science related and generally somewhat outside of mainstream Math.

Algorithms, this is a no-brainer, and includes the obvious Big O, aka Landau Notation; it's not only O, you know. Also there are Complexity Classes, and things that overlap with other areas, like Graph Algorithms and Computability in general, which involve:

Formal Language Theory and Automata Theory go hand in hand and are needed to really understand Regular Expressions, Parsers and Compilers, not to mention that the Von Neumann architecture is based on the Turing Machine. I am, however, a little dismayed by the number of developers, including CS majors, that I have encountered who have minimal or no knowledge of these, and, not to be rude, but if you do not have some basic knowledge of these you don't know jack about computers and software. I mean this in the most positive, encouraging way, so get on it!  Do it now! Click the links and get started! Also, for Formal Language Theory it helps to understand the String Monoid.
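
As a tiny illustration of the connection, here is a hand-rolled DFA for the regular language (ab)*, the kind of machine a regex engine builds behind the scenes; the state names are arbitrary:

    # Transition table: (state, input symbol) -> next state. Missing entries reject.
    TRANSITIONS = {("start", "a"): "saw_a", ("saw_a", "b"): "start"}

    def accepts(s, state="start"):
        for ch in s:
            state = TRANSITIONS.get((state, ch))
            if state is None:
                return False
        return state == "start"  # "start" doubles as the accepting state

    print(accepts("abab"))  # True
    print(accepts("aba"))   # False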

Relational Algebra is the underlying Algebra of how SQL works. Yes, I know SQL is not cool right now, but learning this underlying math gives you insights that extend beyond SQL.  Actually, Relational Algebra relates to Set Theory and Abstract Algebra, and to some pretty heavy other maths like Category Theory, Topos Theory, and Fibrations, most of which are beyond my understanding at the moment.

Queuing Theory is pretty relevant to things that we do and is an area in which I want to improve, so I don't have a lot to say here; it does, of course, relate pretty heavily to Probability Theory and things like Markov Chains.

Information Theory (Shannon) and Algorithmic Information Theory (Kolmogorov, et al)1 both relate to a few areas, one of course being Probability Theory. Compression and Encryption are related to these in a number of ways.  Put simply, Shannon's Information Theory is about representing and transmitting information, whereas Algorithmic Information Theory is concerned with the smallest algorithm that can generate a given string.  This is definitely interesting and relevant stuff that you want to look at.

Taking it up a notch

With Math it can be infinite; there are always more advanced things to learn. Now realize that this is a broad list, and while you probably want to have a general view of things, you will undoubtedly have areas that you want to focus on.  Actually, to do anything with math, unless you are a genius, you will have to make choices about what you study.

If you want to go to the next "level", whatever that means to you, you will probably want to look at some of the above areas in more detail; here's a list of some areas to ponder as well:

Calculus – At this point you may be thinking: hey, wait, I had to take this as a freshman in college, or, like me, you took it in high school; how can this be next-level stuff, isn't it basic?  Well, yes and no. Actually, when was the last time you needed to use a Derivative or Integral when programming?  Calculus, specifically Differential Calculus, is often cited in arguments against math in programming.  Now don't get me wrong, it's important and relevant, especially as your level in other areas increases, and it may become relevant more quickly, especially Integral Calculus, if you go the Statistics/Machine Learning route.

Graph Theory is pretty huge, and if you are interested in math in regards to programming work, I don't think you can go wrong learning more about it. After you've got a lot of the "basic" concepts like Coloring, Planarity, Walks, and Cycles, you can look into areas like Spectral Graph Theory and Expander Graphs; I just found out about these two recently and they look pretty interesting.

Other Logic topics, like Multi-Valued Logics, extend the ideas of logic, creating possibilities that fall outside the traditional two values of true and false.  Fuzzy Logic and Probabilistic Logic add numeric values and probability to the interpretation of logic values. Modal Logics add additional semantics to traditional Logic to describe additional types of attributes, like necessity and temporal characteristics. Additionally, there is Metalogic, and don't forget Gödel's incompleteness theorems.

Order Theory and Lattice Theory take Partially Ordered Sets to a higher level.  These seem to be pretty intertwined, especially lattices, with a bunch of other areas like Algebra, Logic, and Probability. I have to wonder if Lattices should be formally introduced earlier.

Model Theory is in some sense a convergence of Abstract Algebra and Logic, so if those two maths get you jazzed you will probably want to be looking here at some point.

I think an understanding of Topology is going to be critical in doing many math-related tasks with computers; in a sense it already is. After all, Graphs are topological concepts; in fact, Euler's solution of the "Seven Bridges of Königsberg" problem is often cited as the inception of both Graph Theory and Topology, and Euler went on to create a formula which describes the "planarity" of graphs on topological surfaces of varying genus. Additionally, I have seen a number of interesting books and papers involving Topology and Computing; more on those later.

Measure Theory, put simply, deals with the idea of measuring sets.  It is built on top of a structure called a Sigma Algebra, which has some properties similar to a Topological Space; actually, the two can be combined to create a Borel Set. Most of this stuff is beyond my current level.  It also relates to Probability Theory, in that probability can be described in terms of measures, especially the relationship of Event Spaces to Measure Spaces.

Analysis - I still have a lot to learn in this area, but it seems to be a kind of nexus of continuous math and is a very important area for understanding advanced concepts; I have also heard it described as the study of change. One branch of it is Functional Analysis. Analysis ties a lot of pretty heavy areas together, including but not limited to Calculus, Measure Theory, Hilbert Spaces and an important, interesting and pretty cool area of Topology called Metric Spaces. This is another area that lays a lot of the groundwork for things like Machine Learning.

Algebraic Topology is the combination of concepts from Abstract Algebra and Topology, and surprisingly this may be one of the most central maths for Computer Science. One of my goals is to have some decent working knowledge of this field, and I recommend that anyone seriously attempting to acquire a deeper understanding of Software and Computing consider studying it. Now you may not want to jump right in, and there are some related areas that are more immediately relevant, like Category Theory, which I have read might be better described as "Abstract Function Theory"; it's sort of the Math of Math, and it is showing up more and more in CS-related areas, I'm looking at you, Monads! Category Theory seems to have a lot of relevance to functional programming, which is more intuitive if you think "Abstract Function Theory". This also gets you back to that whole Relational Algebra area, with things like Topos Theory and Fibrations.

Game Theory is perhaps a "Penny Stock" in CS, in that it is probably not on a lot of people's radar, but it has some very interesting possible applications, especially in terms of algorithms.  It applies heavily to the social sciences, and considering how prominent Social Networking and concepts like the Social Graph are becoming, it seems like it might be pretty relevant.  It also has applications in general management; perhaps Software Process will benefit from it as well someday.

Complex Numbers: it has always struck me how alien complex numbers seem. I am confused by the fact that they seem to be a way to represent two dimensions, so should we just always use them for that instead of two-dimensional vectors?  And you can take it to higher dimensions with Hypercomplex Numbers like Quaternions and Octonions. I know that complex numbers are a key to understanding Hilbert Spaces and:

Fourier Theory has many applications to natural phenomena related to Signals and Signal Processing, and it is also used in Image Processing.  This is an area that I still don't know a lot about, but it seems that Euler's Formula is pretty important here.

Fractal Geometry: "All of Nature talks to me, if I could just figure out what it was trying to tell me." Fractal Geometry and its sister discipline Chaos Theory describe perhaps all natural phenomena, from the clustering of galaxies to the pattern of capillaries in your fingers to the Brownian Motion in your hot beverage; these patterns even apply to social and computer phenomena, such as stock market patterns and network traffic, and more. I feel that there are a multitude of areas in which we have only scratched the surface on this, and I think that these ideas will become equally commonplace in computing.  There are even constructs called L-Systems, invented/discovered by Aristid Lindenmayer, a botanist, to model plant growth; these structures also generate "pure" fractals like the Koch Curve.  A very compelling thing about L-Systems is their similarity to the structures in the Chomsky Hierarchy.   Benoit Mandelbrot, who sadly recently passed away, coined the term Fractal and described a fractal as "a set for which the Hausdorff-Besicovitch Dimension strictly exceeds the Topological Dimension." These are both topological concepts, so in a sense Fractals are also steeped in Topology.

Now this list might seem a bit crazy and overly ambitious, and it is mostly my crazy list, but ironically it partially jibes with, and was partially influenced by, a book called Comprehensive Mathematics for Computer Scientists, which is published in two volumes. It is a beautifully crafted book, but it uses a very modern approach to math, so I find it a bit hard to follow at times; still, I recommend it, and actually one of my goals is to fully understand it.  Here are some top-level topics they cover that I did not:

One last and critical point: this is a pretty broad and probably intimidating list; I know writing it made my brain hurt. If it's any consolation, my knowledge of the above topics ranges from a decent understanding to a vague idea of what the topic is about.  If you develop a passion for any or all of these areas, you will always feel like there is not enough time to learn what you want to learn, but remember: don't look at it as intimidating or frustrating, look at it as an opportunity to never be bored with life.

1This is also attributed to Chaitin and Solomonoff