Here is a paper written by Paul Wilson and me about GLZA titled Grammatical Ziv-Lempel Compression: Achieving PPM-Class Text Compression Ratios With LZ-Class Decompression Speed.
Is it possible to disable Tree/GLZA's ability to find the most frequently occurring symbols across many lines, and instead have it find them only within a single line? Example:
aaabbb<CR-LF>
bbbccc
has these symbols: aaa + bbb + ccc, but NOT bbb<CR-LF>bbb. Serializing 20 GB into one line is impractical.
Is there also an input file size limit?
Best regards,
FatBit
Dear FatBit,
Yes, it is possible if you are willing to modify the source code. To accomplish this, the score_node function would need to be changed. Shortly after the "top_score_loop" label, the code that scores candidate strings would need to skip any candidate containing a carriage return or linefeed symbol. That would block deduplications that include the carriage return or linefeed symbol.
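A minimal sketch of the kind of check meant here, with placeholder names (contains_line_break, candidate_symbols, and candidate_length are not the actual identifiers in score_node); a candidate for which this returns nonzero would simply not be scored:

    /* Sketch only: reject any candidate string that contains a carriage
       return (0x0D) or linefeed (0x0A) symbol, so that deduplicated
       strings never span more than one input line. */
    static int contains_line_break(const unsigned int *candidate_symbols,
                                   int candidate_length)
    {
        int i;
        for (i = 0; i < candidate_length; i++) {
            if (candidate_symbols[i] == 0x0D || candidate_symbols[i] == 0x0A)
                return 1;   /* candidate crosses a line boundary */
        }
        return 0;
    }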
I don't really understand why you want to do this, so if I am missing something, please let me know. In the example you provided GLZA would only deduplicate "bbb", but perhaps the implication is that there are more "bbb" strings elsewhere in the file.
Best Regards,
Kennon
Thank you for your reply. The reason is simple and relates to chemical structure coding, i.e. SMILES. Because each molecule's structure representation is "independent" (one molecule per line), it is not right to compute common parts across many lines together.
Best regards,
FatBit
Ah, I should have known :). In that case, it may be better to allow the carriage returns and linefeeds to be put at either the start or the end of deduplicated strings. It seems like putting them at the start would be better. If so, then the code would not skip strings starting with carriage returns and linefeeds, but would instead allow those symbols within a string only until it saw one that was not a carriage return or linefeed.
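A rough sketch of that alternative, again with placeholder names rather than GLZA's actual variables; CR/LF symbols are accepted only as a prefix, and any CR or LF appearing after other content disqualifies the candidate:

    /* Sketch only: allow CR/LF symbols as a leading prefix of a candidate
       string, but reject the candidate if CR or LF appears after any
       other symbol. */
    static int leading_line_breaks_only(const unsigned int *syms, int len)
    {
        int i = 0;
        while (i < len && (syms[i] == 0x0D || syms[i] == 0x0A))
            i++;                          /* skip the CR/LF prefix */
        for (; i < len; i++) {
            if (syms[i] == 0x0D || syms[i] == 0x0A)
                return 0;                 /* CR/LF after real content */
        }
        return 1;                         /* acceptable candidate */
    }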
BTW, I forgot to answer a question in your first post. GLZA should work for files up to 1 GB. Beyond that, I haven't done a lot of testing. It will have problems at some point if the file is big enough to cause more than 10 million dictionary entries.
Thank you for your reply. For example, Benson's group contribution method has roughly 300-400 structure fragments, so I am not afraid of tens of mega-records. I have both smaller and bigger data files, so I can test incrementally.
substance   formula            SMILES
ethane      CH3-CH3            CC
propane     CH3-CH2-CH3        CCC
butane      CH3-CH2-CH2-CH3    CCCC
alkane      CH3-(CH2)n-CH3     C(C)nC
I meant 1 GB rather than 1 MB and fixed my response. You should be fine with tens of mega-records. On the other hand, for data like this I think LZHAM might provide a better compression ratio. It is an open-source program and has fast decoding. The data has a different structure than regular text, and LZHAM will probably do better with the local repetition.
I also have to clarify my response. By "record" I mean a fragment/record in the dictionary, not the number of substances. And the goal is not only compression but also "catching" the different fragments.
Could you be so kind as to help me? I am not able to generate symbol frequencies as a text file the way you did (as in the Tree alpha v01 download thread, post #424). Please give me a step-by-step batch solution. And a second question: do you provide x64 builds only? I am not able to run your programs on Win 7 x32 Professional Czech edition; the message "This is not a Win32 application" occurs, and my access to x64 machines is limited.
I fixed a couple of things in the code that were problems for 32-bit builds and posted the source and executables in the Tree thread. If you have any problems, please let me know. I didn't test much.
To generate the list of symbols in the encoded stream, you can run the batch file to compress and decompress a file using "TestGLZA32 <filename>" (where <filename> is your file's name). You could then rerun GLZAencode32 like this "GLZAencode32 -v <filename>.glzc <filename>.glze >symbol_list.txt" and the symbol list will be put into a file named symbol_list.txt. You can adjust the dictionary size using the -m command line argument (production cost in bits) with GLZAcompress32. The larger the value, the smaller the dictionary. Default is 7.0. For example "GLZAcompress32 -m100 <filename>.glzf <filename>.glzc" will typically cause the dictionary to be much smaller.
Best Regards,
Kennon
I used GLZA 03b and tried TestGLZA.bat data.asc and, on W7 x32 Prof Czech, obtained for all programs: "*.exe is not compatible with the running version of Windows. Check whether you need the x32 or x64 version and contact the software publisher."
Kennon, thanks very much for the paper and for sharing Tree/GLZA in general.
Unfortunately the paper doesn't help me much in understanding the details of Tree/GLZA.
How exactly is entropy coding done? How are literals sent, and how are dictionary indexes sent? Is there a limit on the number of dictionary entries? Is it per first-char? How are the localized strings handled? What does the "ergodic" bit do? etc. etc.
Maybe I can answer some of your questions. (I'm coauthor on the paper, but Kennon came up with most of it all by himself, and designed and wrote all the code and knows all the details. He's at DCC, and will probably correct some details when he has a chance.)
There's no localization per se, if I understand your question. GLZA just parses UTF8 characters up front, in the prepass that characterizes the input as UTF8 or not, normal-looking space-separated text or not, and strided binary or not, then massages the UTF8 characters (or bytes) a tiny bit in preparation for IRR grammar construction. (That's all in GLZAformat.c, which is surprisingly simple.) None of that matters a whole lot---if you treat a normal space-separated-words UTF8 text file as a binary bytes file, it still compresses well. The biggest wins are using estimates of encoded string length to guide mostly-top-down grammar construction (dictionary string selection), and GLZ coding (using the natural first-occurrence order to implicitly communicate dictionary offsets).
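To make the "characterizes the input as UTF8 or not" step concrete, here is a minimal sketch of one such test; it ignores overlong forms and surrogates and is only an illustration, not the actual logic in GLZAformat.c:

    #include <stddef.h>
    #include <stdint.h>

    /* Sketch only: returns 1 if the buffer parses as UTF-8 (valid lead
       bytes followed by the right number of continuation bytes). */
    static int looks_like_utf8(const uint8_t *buf, size_t len)
    {
        size_t i = 0;
        while (i < len) {
            uint8_t b = buf[i++];
            size_t extra;
            if (b < 0x80)                 extra = 0;   /* ASCII             */
            else if ((b & 0xE0) == 0xC0)  extra = 1;   /* 2-byte sequence   */
            else if ((b & 0xF0) == 0xE0)  extra = 2;   /* 3-byte sequence   */
            else if ((b & 0xF8) == 0xF0)  extra = 3;   /* 4-byte sequence   */
            else                          return 0;    /* invalid lead byte */
            while (extra--) {
                if (i >= len || (buf[i] & 0xC0) != 0x80)
                    return 0;                          /* bad continuation  */
                i++;
            }
        }
        return 1;
    }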
One of the cool things about grammar-based compression is that it doesn't have any particular difficulty dealing with multibyte and variable-length characters. Once you've initially parsed them and turned them into discrete symbols, they're pretty much just atomic symbols for most purposes. The decoder doesn't know much about it, or care---it's mainly just concatenating variable-length byte strings anyhow, and doesn't care whether they represent fixed-size ASCII characters, binary bytes, or variable-length Cyrillic or Chinese characters. You don't need much postprocessing, and what little there is can be integrated into the decoder with very little overhead. (No separate postprocessing of everything to get back to UTF8.) This could easily be generalized to handle non-UTF8 encodings too, like Big5 Chinese & Japanese, or the usual Cyrillic code pages.
The symbol size is 32 bits, so you could have billions of dictionary entries ("productions" in "grammar-based"-speak). Right now there's a more serious implementation limit on total file size, but that is trivial and should be easy to fix, so that you could handle inputs of many gigabytes (like all of Wikipedia) with up to a few billion dictionary entries.
The ergodicity bit distinguishes symbols that tend to repeat pretty soon from those that don't, so that the decoder knows whether to put them in one of the MTF queues when it sees them. (And not to put others in the MTF queues, which would waste code space and slow decoding down: you don't want to spend time looking in the MTF lists for symbols that aren't likely to be there, or give such symbols shorter codes than they would otherwise get from normal frequency-based coding, just because they happen to be sitting in a queue.) In text, for example, you tend to have some very common strings with relatively stable frequencies (" if the", " of a", etc.) that are best coded by frequency, and other strings in between that are better coded by recency. (That's basically the same idea as "stop words" in various text & query processing contexts; you don't want to key off of stuff that's both frequent and ubiquitous, and want to focus on stuff that's specific to the current topic and/or mode of speech---tense, person, etc.)
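For concreteness, a minimal sketch of the decode-side idea, with made-up names and sizes (the 64-entry MtfQueue below is illustrative, not GLZA's actual structure): symbols flagged as likely to repeat soon are moved to the front of a small MTF queue, and other symbols never enter it.

    #include <string.h>

    #define MTF_QUEUE_SIZE 64

    typedef struct {
        unsigned int entries[MTF_QUEUE_SIZE];
        int count;
    } MtfQueue;

    /* Move a symbol to the front of the queue, inserting it if absent
       and dropping the oldest entry if the queue is full. */
    static void mtf_touch(MtfQueue *q, unsigned int symbol)
    {
        int i;
        for (i = 0; i < q->count; i++)
            if (q->entries[i] == symbol)
                break;                       /* found: move to front  */
        if (i == q->count && q->count < MTF_QUEUE_SIZE)
            q->count++;                      /* room for a new entry  */
        if (i == MTF_QUEUE_SIZE)
            i = MTF_QUEUE_SIZE - 1;          /* full: drop the oldest */
        memmove(&q->entries[1], &q->entries[0], i * sizeof(unsigned int));
        q->entries[0] = symbol;
    }

    /* Called after a dictionary symbol is decoded; repeats_soon is the
       flag carried with the symbol's definition. */
    static void after_decode(MtfQueue *q, unsigned int symbol, int repeats_soon)
    {
        if (repeats_soon)
            mtf_touch(q, symbol);   /* recency-coded symbols only */
    }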
I'm not sure I understand what you're asking with "is it per first-char?". The order-1 Markov model basically predicts the first character of the next symbol based on the last char or few chars of the current symbol. It's pretty crude at this point, but good enough to handle some very important common cases, like digits tending to be followed by digits or commas or decimal points, and Chinese chars by Chinese chars (vs. Latin-1 by Latin-1, when you have a mix of HTML and Chinese in, say, Chinese Wikipedia).
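As a rough illustration of that order-1 idea (not GLZA's actual model), the trailing byte of the previous symbol could select a row of adaptive counts over the leading byte of the next symbol:

    #include <stdint.h>

    /* Sketch only: adaptive counts indexed by [previous trailing byte]
       [next leading byte], used to bias the entropy coder. */
    static uint32_t lead_counts[256][256];

    /* Probability estimate for next_leading, scaled out of 4096. */
    static uint32_t predict_leading(uint8_t prev_trailing, uint8_t next_leading)
    {
        uint64_t total = 256;                /* +1 per entry avoids zeros */
        int i;
        for (i = 0; i < 256; i++)
            total += lead_counts[prev_trailing][i];
        return (uint32_t)(((uint64_t)(lead_counts[prev_trailing][next_leading] + 1)
                           * 4096) / total);
    }

    /* Adaptive update after each symbol is coded or decoded. */
    static void update_leading(uint8_t prev_trailing, uint8_t next_leading)
    {
        lead_counts[prev_trailing][next_leading]++;
    }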
As to how literals are sent, the first occurrence of a character (for UTF8) is sent with a special code, saying to add it to the dictionary. Subsequent occurrences are sent with a normal length-1 dictionary code, saying it's a known symbol in the dictionary and not a new char/string to be put in the dictionary.
Paul's comments are accurate but I have a few additional comments. Literals are not handled much differently than other dictionary entries. The main difference is that when the first instance of a literal is sent, it is sent with a special length code indicating a literal. The literal is sent with a fixed width field with size corresponding to the largest literal value. For standard ASCII text it is 8 bits (or 7 if only ASCII characters 0x0 - 0x7F are used), and it is up to 19 bits for UTF-8 compliant text. Once the initial literal is sent it is treated the same as any other dictionary symbol.
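A tiny sketch of how such a fixed-width field could be sized, assuming the encoder knows the largest literal value in the input (the function name is made up; the exact width GLZA uses for UTF-8 input depends on how it maps characters to symbol values):

    /* Sketch only: number of bits needed to hold the largest literal
       value, e.g. 0x7F -> 7 bits, 0xFF -> 8 bits. */
    static int literal_field_bits(unsigned int max_literal_value)
    {
        int bits = 1;
        while ((max_literal_value >> bits) != 0)
            bits++;
        return bits;
    }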
The dictionary is kept as a set of lists per first (sub)symbol, with a list per dictionary code length. The set of dictionary lists for each leading symbol is mapped into up to 4096 code bins. Dictionary symbols are sent by sending the first symbol (with adaptive leading-symbol predictions based on the previous symbol's trailing symbol), then the bin is sent with range coding, and then, if the bin is for a code length of more than 12 bits, the index within that code-length list is also range coded.
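A structural sketch of that layout, with illustrative sizes and field names (MAX_CODE_LENGTH, CodeLengthList, and the rest are assumptions, not GLZA's actual declarations):

    #include <stdint.h>

    #define MAX_CODE_LENGTH  25     /* illustrative bound on code lengths */
    #define MAX_CODE_BINS    4096   /* "up to 4096 code bins"             */

    typedef struct {
        uint32_t *symbols;          /* entries sharing this leading (sub)symbol
                                       and this dictionary code length */
        uint32_t  count;
    } CodeLengthList;

    typedef struct {
        CodeLengthList by_length[MAX_CODE_LENGTH + 1];  /* one list per length */
        uint16_t first_bin[MAX_CODE_LENGTH + 1];        /* list-to-bin mapping */
        uint16_t num_bins[MAX_CODE_LENGTH + 1];
    } LeadingSymbolDict;

    /* Sending a dictionary symbol then follows the steps described above:
       1. code the leading symbol (adaptively predicted from the previous
          symbol's trailing symbol),
       2. range code the bin,
       3. if the code length exceeds 12 bits, also range code the index
          within that code length's list. */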