Printtok2 Text Compression: What Am I Doing Wrong?

Discussion in 'Software' started by HarryPotter, Dec 10, 2024.

  1. HarryPotter

    HarryPotter MajorGeek

    Hi! I'm working on a text compression program called PrintTok2. Its main strategy is tokenization, but the compression ratio is currently doing horrible. :( I believe the main culprit is the way I'm handling tokenization. I'm using the following to identify tokens in the input:

    For every byte in the input, do:
    Scan for a match in the known tokens. If found, see if it can be larger than the previous match. If not found or is better, add the new token. Otherwise, increment the number of occurrences.
    If a token, skip to the end of this token and try again. If not, advance to the next byte.

    I'm also using the following code to sort the tokens by gains and trim at where the tokens no longer gain compressibility, written in ANSI-compliant C:
    Code:
    void sorttokens1 (void)
    {
        unsigned i, j, k, l;
        struct tok2buf tmpswaptok;
    unsigned char c[64];
        for (i=0; i<tok2bufsize; i++) {
            tok2buf[i].saved=((tok2buf[i].len)*(tok2buf[i].occur))-(tok2buf[i].len+1)+(tok2buf[i].len<6);
            if (tok2buf[i].len<4 || tok2buf[i].occur<3) tok2buf[i].saved=0;
        }
        for (i=0; i<tok2bufsize-1; i++) {
            k=i;
            for (j=i; j<tok2bufsize; j++) if (tok2buf[j].saved>tok2buf[k].saved) k=j;
            if (k!=i) {
                memcpy (&tmpswaptok, &tok2buf[i], sizeof(tmpswaptok));
                memcpy (&tok2buf[i], &tok2buf[k], sizeof(tmpswaptok));
                memcpy (&tok2buf[k], &tmpswaptok, sizeof(tmpswaptok));
            }
        } 
        for (i=0; i<tok2bufsize; i++) {
            //tok2buf[i].saved=((tok2buf[i].len-1)*(tok2buf[i].occur))-(tok2buf[i].len+1);
            memcpy (c, tok2buf[i].token, tok2buf[i].len);
            c[tok2buf[i].len]=0;
            printf (" Token# %d: \"%s\", Occur %d\n", i, c, tok2buf[i].occur);
        }
    getchar();
        for (i=0; i<tok2bufsize; i++) {
            if (tok2buf[i].saved<10) {tok2bufsize=i; break;}
        }
    printf ("# tokens after sort: %d\n", tok2bufsize);
        collecttokens();
    }
    
    I'm wondering if I'm doing it the wrong way.
     

MajorGeeks.Com Menu

Downloads All In One Tweaks \ Android \ Anti-Malware \ Anti-Virus \ Appearance \ Backup \ Browsers \ CD\DVD\Blu-Ray \ Covert Ops \ Drive Utilities \ Drivers \ Graphics \ Internet Tools \ Multimedia \ Networking \ Office Tools \ PC Games \ System Tools \ Mac/Apple/Ipad Downloads

Other News: Top Downloads \ News (Tech) \ Off Base (Other Websites News) \ Way Off Base (Offbeat Stories and Pics)

Social: Facebook \ YouTube \ Twitter \ Tumblr \ Pintrest \ RSS Feeds