unik

Version: v0.0.0-...-7ba71fb
Published: Nov 8, 2021 | License: MIT | Imports: 5 | Imported by: 0

README

unik - a k-mer serialization package

This package provides k-mer serialization methods for the kmers package. TaxIds of k-mers can optionally be saved; frequency information is not stored.

This package is used in the unikmer and kmcp projects.

Details

K-mers (represented as uint64 in RAM) are serialized as 8-byte arrays (fewer bytes for shorter k-mers in the compact format, and far fewer for sorted k-mers) and optionally compressed in gzip format. Files use the extension .unik.

TaxIds are optionally stored next to k-mers, using 4 or fewer bytes each.

Compression rate comparison

No TaxIds stored in this test.

[Figure: compression rate comparison (cr.jpg)]

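| label        | encoded-kmer (a) | gzip-compressed (b) | compact-format (c) | sorted (d) | comment                                                   |
|--------------|------------------|---------------------|--------------------|------------|-----------------------------------------------------------|
| plain        |                  |                     |                    |            | plain text                                                |
| gzip         |                  | yes                 |                    |            | gzipped plain text                                        |
| unik.default | yes              | yes                 |                    |            | gzipped encoded k-mers in fixed-length byte array         |
| unik.compat  | yes              | yes                 | yes                |            | gzipped encoded k-mers in shorter fixed-length byte array |
| unik.sorted  | yes              | yes                 |                    | yes        | gzipped sorted encoded k-mers                             |
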
  • a. Each k-mer is encoded as a uint64 and serialized in 8 bytes.
  • b. K-mer files are compressed in gzip format by default; the global option -C/--no-compress switches compression off.
  • c. Each k-mer is encoded as a uint64 and serialized in 8 bytes by default. Shorter k-mers need fewer bytes, e.g., 4 bytes are enough for 15-mers (30 bits). The global option -c/--compact enables this more compact format, which reduces file size.
  • d. Each k-mer is encoded as a uint64; all k-mers are sorted and compressed using the varint-GB algorithm.
  • In all tests, the --canonical flag is ON when running unikmer count.
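
The byte counts for the compact format follow from 2 bits per base. As a minimal sketch (the helper name compactBytes is hypothetical and not part of this package), the formula int((K + 3) / 4) can be checked like this:

```go
package main

import "fmt"

// compactBytes returns the number of bytes needed to hold a k-mer of
// length k at 2 bits per base, i.e. int((k + 3) / 4).
// Hypothetical helper for illustration; it is not exported by unik.
func compactBytes(k int) int {
	return (k + 3) / 4
}

func main() {
	for _, k := range []int{15, 21, 31, 32} {
		fmt.Printf("k=%d: %d bits, %d bytes\n", k, 2*k, compactBytes(k))
	}
	// Output:
	// k=15: 30 bits, 4 bytes
	// k=21: 42 bits, 6 bytes
	// k=31: 62 bits, 8 bytes
	// k=32: 64 bits, 8 bytes
}
```

For k of 29 or more the compact format needs the full 8 bytes, so it only saves space for shorter k-mers.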

License

MIT License

History

This package was originally maintained in unikmer.

The magic number of the binary format is still .unikmer, to maintain compatibility.

Documentation

Index

Constants

const (
	// UnikCompact means k-mers are serialized in fixed-length byte arrays of n = int((K + 3) / 4) bytes.
	UnikCompact = 1 << iota
	// UnikCanonical means only canonical k-mers are kept.
	UnikCanonical
	// UnikSorted means k-mers are sorted.
	UnikSorted // when sorted, the serialization structure is very different
	// UnikIncludeTaxID means each k-mer is followed by its LCA taxid.
	UnikIncludeTaxID

	// UnikHashed means ntHash values are saved as codes.
	UnikHashed
	// UnikScaled means only hashes smaller than or equal to max_hash are saved.
	UnikScaled
)
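
These constants are bit flags, so they can be combined with | and tested with &. The sketch below mirrors the constants locally so that it compiles on its own; real code would reference the package's exported names instead. The helper hasFlag is hypothetical and not part of this package.

```go
package main

import "fmt"

// Local mirror of the package's flag constants, copied here only so the
// sketch is self-contained.
const (
	UnikCompact = 1 << iota
	UnikCanonical
	UnikSorted
	UnikIncludeTaxID
	UnikHashed
	UnikScaled
)

// hasFlag reports whether every bit of want is set in flag.
// Hypothetical helper for illustration.
func hasFlag(flag, want uint32) bool {
	return flag&want == want
}

func main() {
	var flag uint32 = UnikCompact | UnikCanonical
	fmt.Println(flag)                         // 3
	fmt.Println(hasFlag(flag, UnikCanonical)) // true
	fmt.Println(hasFlag(flag, UnikSorted))    // false
}
```

A value built this way has the same type as the flag parameter of NewWriter and the Flag field of Header (uint32).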
const MainVersion uint8 = 5

MainVersion is the main version number.

const MinorVersion uint8 = 0

MinorVersion is the minor version number.

Variables

var ErrBrokenFile = errors.New("unik: broken file")

ErrBrokenFile means the file is not complete.

var ErrCallLate = errors.New("unik: SetMaxTaxid/SetGlobalTaxid should be called before writing KmerCode/code/taxid")

ErrCallLate means SetMaxTaxid/SetGlobalTaxid should be called before writing KmerCode/code/taxid

var ErrCallOrder = errors.New("unik: WriteTaxid/ReadTaxid should be called after WriteCode/ReadCode")

ErrCallOrder means WriteTaxid/ReadTaxid should be called after WriteCode/ReadCode

var ErrCallReadWriteTaxid = errors.New("unik: can not call ReadTaxid/WriteTaxid when flag UnikIncludeTaxID is off")

ErrCallReadWriteTaxid means flag UnikIncludeTaxID is off, but you call ReadTaxid/WriteTaxid

var ErrDescTooLong = errors.New("unik: description too long, 128 bytes at most")

ErrDescTooLong means the description is too long.

var ErrInvalidFileFormat = errors.New("unik: invalid binary format")

ErrInvalidFileFormat means invalid file format.

var ErrInvalidTaxid = errors.New("unik: invalid taxid, 0 not allowed")

ErrInvalidTaxid means zero given for a taxid.

var ErrKMismatch = errors.New("unik: K mismatch")

ErrKMismatch means K size mismatch.

var ErrKOverflow = errors.New("unik: k-mer size (1-32) overflow")

ErrKOverflow means K is outside the supported range (1-32).

var ErrVersionMismatch = errors.New("unik: version mismatch")

ErrVersionMismatch means version mismatch between files and program

var Magic = [8]byte{'.', 'u', 'n', 'i', 'k', 'm', 'e', 'r'}

Magic number of binary file.

Functions

func PutUint64s

func PutUint64s(buf []byte, v1, v2 uint64) (ctrl byte, n int)

PutUint64s encodes two uint64s into 2-16 bytes, and returns the control byte and the encoded byte length.

func Uint64s

func Uint64s(ctrl byte, buf []byte) (values [2]uint64, n int)

Uint64s decodes two uint64s from bytes encoded by PutUint64s.
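
To illustrate the idea behind a control byte plus 2-16 data bytes, here is a minimal group-varint sketch written from scratch. It is not this package's implementation: the function names putPair and getPair, the big-endian byte order, and the control-byte layout (bits 3-5 hold the first length minus one, bits 0-2 the second) are all assumptions made for the example, and the real on-disk layout may differ.

```go
package main

import "fmt"

// byteLen returns how many bytes (1-8) are needed to hold v.
func byteLen(v uint64) int {
	n := 1
	for v >>= 8; v > 0; v >>= 8 {
		n++
	}
	return n
}

// putPair writes v1 and v2 into buf using only the bytes each value needs
// and returns a control byte recording both lengths, plus the total length.
// buf must have room for 16 bytes.
func putPair(buf []byte, v1, v2 uint64) (ctrl byte, n int) {
	for _, v := range [2]uint64{v1, v2} {
		l := byteLen(v)
		ctrl = ctrl<<3 | byte(l-1) // 3 bits per length, stored as length-1
		for j := l - 1; j >= 0; j-- {
			buf[n] = byte(v >> (8 * uint(j)))
			n++
		}
	}
	return ctrl, n
}

// getPair reverses putPair and returns the values and the bytes consumed.
func getPair(ctrl byte, buf []byte) (values [2]uint64, n int) {
	lens := [2]int{int(ctrl>>3&7) + 1, int(ctrl&7) + 1}
	for i, l := range lens {
		for j := 0; j < l; j++ {
			values[i] = values[i]<<8 | uint64(buf[n])
			n++
		}
	}
	return values, n
}

func main() {
	buf := make([]byte, 16)
	ctrl, n := putPair(buf, 300, 5)
	vals, m := getPair(ctrl, buf[:n])
	fmt.Println(n, m, vals) // 3 3 [300 5]
}
```

Small values take few bytes under this scheme, which is why it suits data whose stored numbers tend to be small.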

Types

type Header struct {
	MainVersion  uint8
	MinorVersion uint8
	K            int
	Flag         uint32
	Number       uint64 // Number of Kmers, may not be accurate

	Description []byte // let's limit it to 128 Bytes
	Scale       uint32 // scale of down-sampling
	MaxHash     uint64 // max hash for scaling/down-sampling
	// contains filtered or unexported fields
}

Header contains metadata

func (Header) String

func (h Header) String() string

type Reader

type Reader struct {
	Header
	// contains filtered or unexported fields
}

Reader is for reading kmers.KmerCode.

func NewReader

func NewReader(r io.Reader) (reader *Reader, err error)

NewReader returns a Reader.

func (*Reader) GetGlobalTaxid

func (reader *Reader) GetGlobalTaxid() uint32

GetGlobalTaxid returns the global taxid

func (*Reader) GetMaxHash

func (reader *Reader) GetMaxHash() uint64

GetMaxHash returns the max hash for scaling.

func (*Reader) GetScale

func (reader *Reader) GetScale() uint32

GetScale returns the scale of down-sampling

func (*Reader) GetTaxidBytesLength

func (reader *Reader) GetTaxidBytesLength() int

GetTaxidBytesLength returns the number of bytes used to store a taxid.

func (*Reader) HasGlobalTaxid

func (reader *Reader) HasGlobalTaxid() bool

HasGlobalTaxid means the file has a global taxid

func (*Reader) HasTaxidInfo

func (reader *Reader) HasTaxidInfo() bool

HasTaxidInfo means the binary file contains global taxid or taxids for all k-mers

func (*Reader) IsCanonical

func (reader *Reader) IsCanonical() bool

IsCanonical tells whether only canonical k-mers are stored.

func (*Reader) IsCompact

func (reader *Reader) IsCompact() bool

IsCompact tells if the k-mers are stored in a compact format

func (*Reader) IsHashed

func (reader *Reader) IsHashed() bool

IsHashed tells if ntHash values are saved.

func (*Reader) IsIncludeTaxid

func (reader *Reader) IsIncludeTaxid() bool

IsIncludeTaxid tells if every k-mer is followed by its taxid

func (*Reader) IsScaled

func (reader *Reader) IsScaled() bool

IsScaled tells whether the hashes are scaled.

func (*Reader) IsSorted

func (reader *Reader) IsSorted() bool

IsSorted tells whether the k-mers in the file are sorted.

func (*Reader) Read

func (reader *Reader) Read() (kmers.KmerCode, error)

Read reads one kmers.KmerCode.

func (*Reader) ReadCode

func (reader *Reader) ReadCode() (uint64, error)

ReadCode reads one code.

func (*Reader) ReadCodeWithTaxid

func (reader *Reader) ReadCodeWithTaxid() (code uint64, taxid uint32, err error)

ReadCodeWithTaxid reads a code and also returns its taxid, if any.

func (*Reader) ReadTaxid

func (reader *Reader) ReadTaxid() (taxid uint32, err error)

ReadTaxid reads one taxid.

func (*Reader) ReadWithTaxid

func (reader *Reader) ReadWithTaxid() (kmers.KmerCode, uint32, error)

ReadWithTaxid reads a kmers.KmerCode and also returns its taxid, if any.
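
A typical way to consume a Reader is to loop over ReadCode until the stream ends. The sketch below is written against a small local interface with the same ReadCode signature, so it runs without the package. It assumes the end of the stream is signalled with io.EOF, which this page does not state; codeReader, readAll, and sliceReader are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// codeReader is the subset of the Reader method set used by this sketch.
type codeReader interface {
	ReadCode() (uint64, error)
}

// readAll collects codes until the reader reports io.EOF.
// Assumption: the end of the stream is signalled with io.EOF.
func readAll(r codeReader) ([]uint64, error) {
	var codes []uint64
	for {
		code, err := r.ReadCode()
		if errors.Is(err, io.EOF) {
			return codes, nil
		}
		if err != nil {
			return codes, err
		}
		codes = append(codes, code)
	}
}

// sliceReader is a stand-in for a real Reader so the sketch runs alone.
type sliceReader struct{ codes []uint64 }

func (s *sliceReader) ReadCode() (uint64, error) {
	if len(s.codes) == 0 {
		return 0, io.EOF
	}
	c := s.codes[0]
	s.codes = s.codes[1:]
	return c, nil
}

func main() {
	codes, err := readAll(&sliceReader{codes: []uint64{3, 7, 11}})
	fmt.Println(codes, err) // [3 7 11] <nil>
}
```

With the real package, the value returned by NewReader would be passed to readAll in place of sliceReader.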

type Writer

type Writer struct {
	Header
	// contains filtered or unexported fields
}

Writer writes kmers.KmerCode.

func NewWriter

func NewWriter(w io.Writer, k int, flag uint32) (*Writer, error)

NewWriter creates a Writer.

func (*Writer) Flush

func (writer *Writer) Flush() (err error)

Flush writes the last k-mer.

func (*Writer) SetGlobalTaxid

func (writer *Writer) SetGlobalTaxid(taxid uint32) error

SetGlobalTaxid sets the global taxid

func (*Writer) SetMaxHash

func (writer *Writer) SetMaxHash(maxHash uint64) error

SetMaxHash sets the max hash.

func (*Writer) SetMaxTaxid

func (writer *Writer) SetMaxTaxid(taxid uint32) error

SetMaxTaxid sets the max taxid.

func (*Writer) SetScale

func (writer *Writer) SetScale(scale uint32) error

SetScale sets the scale.

func (*Writer) Write

func (writer *Writer) Write(kcode kmers.KmerCode) (err error)

Write writes one kmers.KmerCode.

func (*Writer) WriteCode

func (writer *Writer) WriteCode(code uint64) (err error)

WriteCode writes one code

func (*Writer) WriteCodeWithTaxid

func (writer *Writer) WriteCodeWithTaxid(code uint64, taxid uint32) (err error)

WriteCodeWithTaxid writes a code and its taxid. If UnikIncludeTaxID is off, taxid will not be written.

func (*Writer) WriteHeader

func (writer *Writer) WriteHeader() (err error)

WriteHeader writes file header

func (*Writer) WriteKmer

func (writer *Writer) WriteKmer(mer []byte) error

WriteKmer writes one k-mer.

func (*Writer) WriteKmerWithTaxid

func (writer *Writer) WriteKmerWithTaxid(mer []byte, taxid uint32) error

WriteKmerWithTaxid writes one k-mer and its taxid

func (*Writer) WriteTaxid

func (writer *Writer) WriteTaxid(taxid uint32) (err error)

WriteTaxid appends taxid to the code

func (*Writer) WriteWithTaxid

func (writer *Writer) WriteWithTaxid(kcode kmers.KmerCode, taxid uint32) (err error)

WriteWithTaxid writes one kmers.KmerCode and its taxid. If UnikIncludeTaxID is off, taxid will not be written.
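
The error variables above imply a call order for writing: SetMaxTaxid/SetGlobalTaxid before any code or taxid is written (ErrCallLate), and a final Flush to write the last k-mer. The sketch below shows one such sequence against a local interface whose method signatures match the Writer methods documented here, so it runs without the package. The names taxidWriter, writeAll, and recorder are hypothetical.

```go
package main

import "fmt"

// taxidWriter lists the Writer methods this sketch relies on.
type taxidWriter interface {
	SetMaxTaxid(taxid uint32) error
	WriteCodeWithTaxid(code uint64, taxid uint32) error
	Flush() error
}

// writeAll sets the max taxid first, writes every (code, taxid) pair,
// then flushes so the last k-mer is written.
func writeAll(w taxidWriter, codes []uint64, taxids []uint32) error {
	if len(codes) != len(taxids) {
		return fmt.Errorf("sketch: %d codes but %d taxids", len(codes), len(taxids))
	}
	var maxTaxid uint32
	for _, t := range taxids {
		if t > maxTaxid {
			maxTaxid = t
		}
	}
	if err := w.SetMaxTaxid(maxTaxid); err != nil {
		return err
	}
	for i, c := range codes {
		if err := w.WriteCodeWithTaxid(c, taxids[i]); err != nil {
			return err
		}
	}
	return w.Flush()
}

// recorder is a stand-in writer that logs the calls it receives.
type recorder struct{ calls []string }

func (r *recorder) SetMaxTaxid(t uint32) error {
	r.calls = append(r.calls, fmt.Sprintf("SetMaxTaxid(%d)", t))
	return nil
}

func (r *recorder) WriteCodeWithTaxid(c uint64, t uint32) error {
	r.calls = append(r.calls, fmt.Sprintf("Write(%d,%d)", c, t))
	return nil
}

func (r *recorder) Flush() error {
	r.calls = append(r.calls, "Flush")
	return nil
}

func main() {
	rec := &recorder{}
	err := writeAll(rec, []uint64{3, 7}, []uint32{9606, 562})
	fmt.Println(rec.calls, err) // [SetMaxTaxid(9606) Write(3,9606) Write(7,562) Flush] <nil>
}
```

With the real package, a Writer created by NewWriter with the UnikIncludeTaxID flag set would take the place of recorder. Taxid 0 is rejected (ErrInvalidTaxid), so the example uses non-zero values.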
