bayesian

package module
v1.1.0
Published: Dec 7, 2025 License: BSD-3-Clause Imports: 8 Imported by: 70

README

Naive Bayesian Classification


Perform naive Bayesian classification into an arbitrary number of classes on sets of strings. bayesian also supports term frequency-inverse document frequency calculations (TF-IDF).

Copyright (c) 2011-2024. Jake Brukhman. ([email protected]). All rights reserved. See the LICENSE file for BSD-style license.


Background

This is meant to be a low-barrier-to-entry Go library for basic Bayesian classification. See the code comments for a refresher on naive Bayesian classifiers, and please take some time to understand the underflow edge cases, as these can otherwise result in inaccurate classifications.


Installation

Using the go command:

go get github.com/jbrukh/bayesian

Documentation

See the documentation on pkg.go.dev.


Features

  • Conditional probability and "log-likelihood"-like scoring.
  • Underflow detection.
  • Simple persistence of classifiers.
  • Statistics.
  • TF-IDF support.

Example 1 (Simple Classification)

To use the classifier, first you must create some classes and train it:

import "github.com/jbrukh/bayesian"

const (
    Good bayesian.Class = "Good"
    Bad  bayesian.Class = "Bad"
)

classifier := bayesian.NewClassifier(Good, Bad)
goodStuff := []string{"tall", "rich", "handsome"}
badStuff  := []string{"poor", "smelly", "ugly"}
classifier.Learn(goodStuff, Good)
classifier.Learn(badStuff,  Bad)

Then you can ascertain the scores of each class and the most likely class your data belongs to:

scores, likely, _ := classifier.LogScores(
                        []string{"tall", "girl"},
                     )

Magnitude of the score indicates likelihood. Alternatively (but with some risk of float underflow), you can obtain actual probabilities:

probs, likely, _ := classifier.ProbScores(
                        []string{"tall", "girl"},
                     )

Example 2 (TF-IDF Support)

To use the TF-IDF classifier, first create some classes and train the classifier. You must call ConvertTermsFreqToTfIdf() after training and before calling classification methods such as LogScores, SafeProbScores, and ProbScores:

import "github.com/jbrukh/bayesian"

const (
    Good bayesian.Class = "Good"
    Bad bayesian.Class = "Bad"
)

// Create a classifier with TF-IDF support.
classifier := bayesian.NewClassifierTfIdf(Good, Bad)

goodStuff := []string{"tall", "rich", "handsome"}
badStuff  := []string{"poor", "smelly", "ugly"}

classifier.Learn(goodStuff, Good)
classifier.Learn(badStuff,  Bad)

// Required
classifier.ConvertTermsFreqToTfIdf()

Then you can ascertain the scores of each class and the most likely class your data belongs to:

scores, likely, _ := classifier.LogScores(
    []string{"tall", "girl"},
)

Magnitude of the score indicates likelihood. Alternatively (but with some risk of float underflow), you can obtain actual probabilities:

probs, likely, _ := classifier.ProbScores(
    []string{"tall", "girl"},
)

Use wisely.

Documentation

Overview

Package bayesian implements a naive Bayesian classifier.

Jake Brukhman <[email protected]>

BAYESIAN CLASSIFICATION REFRESHER: suppose you have a set
of classes (e.g. categories) C := {C_1, ..., C_n}, and a
document D consisting of words D := {W_1, ..., W_k}.
We wish to ascertain the probability that the document
belongs to some class C_j given some set of training data
associating documents and classes.

By Bayes' Theorem, we have that

  P(C_j|D) = P(D|C_j)*P(C_j)/P(D).

The LHS is the probability that the document belongs to class
C_j given the document itself (by which is meant, in practice,
the word frequencies occurring in this document), and our program
will calculate this probability for each j and spit out the
most likely class for this document.

P(C_j) is referred to as the "prior" probability, or the
probability that a document belongs to C_j in general, without
seeing the document first. P(D|C_j) is the probability of seeing
such a document, given that it belongs to C_j. Here, by assuming
that words appear independently in documents (this being the
"naive" assumption), we can estimate

  P(D|C_j) ~= P(W_1|C_j)*...*P(W_k|C_j)

where P(W_i|C_j) is the probability of seeing the given word
in a document of the given class. Finally, P(D) can be seen as
merely a scaling factor and is not strictly relevant to
classification, unless you want to normalize the resulting
scores and actually see probabilities. In this case, note that

  P(D) = SUM_j(P(D|C_j)*P(C_j))

One practical issue with performing these calculations is the
possibility of float64 underflow when calculating P(D|C_j), as
individual word probabilities can be arbitrarily small, and
a document can have an arbitrarily large number of them. A
typical method for dealing with this case is to transform the
probability to the log domain and perform additions instead
of multiplications:

  log P(C_j|D) ~ log(P(C_j)) + SUM_i(log P(W_i|C_j))

where i = 1, ..., k. Note that by doing this, we are discarding
the scaling factor P(D) and our scores are no longer
probabilities; however, the monotonic relationship of the
scores is preserved by the log function.

Index

Constants

This section is empty.

Variables

View Source
var ErrAlreadyConverted = errors.New("cannot add class after TF-IDF conversion")

ErrAlreadyConverted is returned when trying to add a class after TF-IDF conversion.

View Source
var ErrClassExists = errors.New("class already exists")

ErrClassExists is returned when trying to add a class that already exists.

View Source
var ErrUnderflow = errors.New("possible underflow detected")

ErrUnderflow is returned when an underflow is detected.

Functions

This section is empty.

Types

type Class

type Class string

Class defines a class that the classifier will filter: C = {C_1, ..., C_n}. You should define your classes as a set of constants, for example as follows:

const (
    Good Class = "Good"
    Bad  Class = "Bad"
)

Class values should be unique.

type Classifier

type Classifier struct {
	Classes []Class

	DidConvertTfIdf bool // we can't classify a TF-IDF classifier if we haven't yet
	// contains filtered or unexported fields
}

Classifier implements the Naive Bayesian Classifier.

func NewClassifier

func NewClassifier(classes ...Class) *Classifier

NewClassifier returns a new classifier. The classes provided should be at least 2 in number and unique, or this method will panic.

func NewClassifierFromFile

func NewClassifierFromFile(name string) (c *Classifier, err error)

NewClassifierFromFile loads an existing classifier from file. The classifier was previously saved with a call to c.WriteToFile(string).

func NewClassifierFromReader

func NewClassifierFromReader(r io.Reader) (c *Classifier, err error)

NewClassifierFromReader deserializes a gob-encoded classifier from the given reader.

func NewClassifierTfIdf

func NewClassifierTfIdf(classes ...Class) *Classifier

NewClassifierTfIdf returns a new TF-IDF classifier. The classes provided should be at least 2 in number and unique, or this method will panic.

func (*Classifier) AddClass added in v1.1.0

func (c *Classifier) AddClass(class Class) error

AddClass adds a new class to the classifier dynamically. Returns ErrClassExists if the class already exists, or ErrAlreadyConverted if the classifier has been converted to TF-IDF. This method is safe for concurrent use.

func (*Classifier) Classify

func (c *Classifier) Classify(document []string) (class Class, scores []float64, strict bool)

Classify returns the most likely class for the given document along with the log scores and whether the classification is strict. This is a convenience wrapper around LogScores that returns the Class directly instead of an index.

func (*Classifier) ClassifyProb

func (c *Classifier) ClassifyProb(document []string) (class Class, scores []float64, strict bool)

ClassifyProb returns the most likely class for the given document along with the probability scores and whether the classification is strict. This is a convenience wrapper around ProbScores that returns the Class directly instead of an index.

func (*Classifier) ClassifySafe

func (c *Classifier) ClassifySafe(document []string) (class Class, scores []float64, strict bool, err error)

ClassifySafe returns the most likely class for the given document along with the probability scores, whether the classification is strict, and an error if underflow is detected. This is a convenience wrapper around SafeProbScores that returns the Class directly instead of an index.

func (*Classifier) ConvertTermsFreqToTfIdf

func (c *Classifier) ConvertTermsFreqToTfIdf()

ConvertTermsFreqToTfIdf converts the accumulated term-frequency samples for each class to TF-IDF (https://en.wikipedia.org/wiki/Tf%E2%80%93idf). Call it once all classes have been learned and the totals are available. This method is safe for concurrent use.

func (*Classifier) IsTfIdf

func (c *Classifier) IsTfIdf() bool

IsTfIdf reports whether this is a TF-IDF classifier.

func (*Classifier) Learn

func (c *Classifier) Learn(document []string, which Class)

Learn will accept new training documents for supervised learning. This method is safe for concurrent use.

func (*Classifier) Learned

func (c *Classifier) Learned() int

Learned returns the number of documents ever learned in the lifetime of this classifier.

func (*Classifier) LogScores

func (c *Classifier) LogScores(document []string) (scores []float64, inx int, strict bool)

LogScores produces "log-likelihood"-like scores that can be used to classify documents into classes.

The value of the score is proportional to the likelihood, as determined by the classifier, that the given document belongs to the given class. This is true even when scores returned are negative, which they will be (since we are taking logs of probabilities).

The index j of the score corresponds to the class given by c.Classes[j].

Additionally returned are "inx" and "strict" values. The inx corresponds to the maximum score in the array. If more than one score holds the maximum value, then strict is false.

Unlike c.Probabilities(), this function is not prone to floating point underflow and is relatively safe to use. This method is safe for concurrent use.

func (*Classifier) Observe

func (c *Classifier) Observe(word string, count int, which Class)

Observe should be used when word frequencies have already been learned externally (e.g., in Hadoop). This method is safe for concurrent use.

func (*Classifier) ProbScores

func (c *Classifier) ProbScores(doc []string) (scores []float64, inx int, strict bool)

ProbScores works the same as LogScores, but delivers actual probabilities as discussed above. Note that float64 underflow is possible if the word list contains too many words that have probabilities very close to 0.

Notes on underflow: underflow is going to occur when you're trying to assess large numbers of words that you have never seen before. Depending on the application, this may or may not be a concern. Consider using SafeProbScores() instead.

If all scores underflow to zero, returns equal probabilities for all classes (1/n each). This method is safe for concurrent use.

func (*Classifier) ReadClassFromFile

func (c *Classifier) ReadClassFromFile(class Class, location string) (err error)

ReadClassFromFile loads existing class data from a file. This method is safe for concurrent use.

func (*Classifier) SafeProbScores

func (c *Classifier) SafeProbScores(doc []string) (scores []float64, inx int, strict bool, err error)

SafeProbScores works the same as ProbScores, but is able to detect underflow in those cases where underflow results in the reverse classification. If an underflow is detected, this method returns an ErrUnderflow, allowing the user to deal with it as necessary. Note that underflow, under certain rare circumstances, may still result in incorrect probabilities being returned, but this method guarantees that all error-less invocations are properly classified.

Underflow detection is more costly because it also has to make additional log score calculations.

When underflow is detected, the returned scores are computed from log-domain scores using the log-sum-exp trick for numerical stability. This method is safe for concurrent use.

func (*Classifier) Seen

func (c *Classifier) Seen() int

Seen returns the number of documents ever classified in the lifetime of this classifier.

func (*Classifier) WordCount

func (c *Classifier) WordCount() (result []int)

WordCount returns the number of words counted for each class in the lifetime of the classifier. This method is safe for concurrent use.

func (*Classifier) WordFrequencies

func (c *Classifier) WordFrequencies(words []string) (freqMatrix [][]float64)

WordFrequencies returns a matrix of word frequencies that currently exist in the classifier for each class state for the given input words. In other words, if you obtain the frequencies

freqs := c.WordFrequencies(/* [j]string */)

then the expression freq[i][j] represents the frequency of the j-th word within the i-th class. This method is safe for concurrent use.

func (*Classifier) WordsByClass

func (c *Classifier) WordsByClass(class Class) (freqMap map[string]float64)

WordsByClass returns a map of words and their probability of appearing in the given class. This method is safe for concurrent use.

func (*Classifier) WriteClassToFile

func (c *Classifier) WriteClassToFile(name Class, rootPath string) error

WriteClassToFile writes a single class to file. This method is safe for concurrent use.

func (*Classifier) WriteClassesToFile

func (c *Classifier) WriteClassesToFile(rootPath string) error

WriteClassesToFile writes all classes to files. This method is safe for concurrent use.

func (*Classifier) WriteGob

func (c *Classifier) WriteGob(w io.Writer) (err error)

WriteGob serializes this classifier to GOB and writes to Writer. This method is safe for concurrent use.

func (*Classifier) WriteToFile

func (c *Classifier) WriteToFile(name string) error

WriteToFile serializes this classifier to a file. This method is safe for concurrent use.
