blob: 526b3ff2a83616a52a18738729fcd584311e364c [file] [log] [blame]
LZW compression used to encode/decode a GIF file by Bob Montgomery [73357,3140]
ENCODER
Consider the following input data stream in a 4 color (A, B, C, D) system.
We will build a table of codes which represent strings of colors. Each time
we find a new string, we will give it the next code, and break it into a
prefix string and a suffix color. The symbols \ or --- represent the prefix
string, and / represents the suffix color. The first 4 entries in the table
are the 4 colors with codes 0 thru 3, each of which represents a single
color. The next 2 codes (4 and 5) are the clear code and the end of image
code. The first available code to represent a string of colors is 6. Each
time we find a new code, we will send the prefix for that code to the
output data stream.
A B A B A B A B B B A B A B A A C D A C D A D C A B A A A B A B .....
\/\/---/-----/\/---/-------/\/ \/ \ /\/---/---/\ /\/-----/---/---/
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 Code
6 8 10 8 14 16 8 13 7 Prefix
The encoder table is built from input data stream. Always start with the
suffix of last code, and keep getting colors until you have a new
combination.
Color Code Prefix Suffix String Output
A 0 -
B 1 -
C 2 -
D 3 -
Clear 4 -
End 5 -
A \ A A First color is a special case.
B / \ 6 A B AB A
A | / 7 B A BA B
B |
A / | 8 6 A ABA 6
B |
A |
B \ / 9 8 B ABAB 8
B / | 10 B B BB B
B |
A | / 11 10 A BBA 10
B |
A |
B |
A / \ 12 9 A ABABA 9
A \ / 13 A A AA A
C / \ 14 A C AC A
D \ / 15 C D CD C
A / | 16 D A DA D
C |
D | / 17 14 D ACD 14
A |
D / \ 18 16 D DAD 16
C \ / 19 D C DC D
A / | 20 C A CA C
B |
A |
A | / 21 8 A ABAA 8
A |
B / | 22 13 B AAB 13
A |
B / 23 7 B BAB 7
The resultant output stream is: A B 6 8 B 10 9 A A C D 14 16 D C 8 ....
The GIF encoder starts with a code length of 2+1=3 bits for 4 colors, so
when the code reaches 8 we will have to increase the code size to 4 bits.
Similarly, when the code gets to 16 we will have to increse the code size
to 5 bits, etc. If the code gets to 13 bits, we send a clear code and start
over. See GIFENCOD.GIF for a flow diagram of the encoding process. This
uses a tree method to search if a new string is already in the table, which
is much simpler, faster, and easier to understand than hashing.
===========================================================================
DECODER
We will now see if we can regenerate the original data stream and duplicate
the table looking only at the output data stream generated by the encoder
on the previous page. The output data stream is:
A B 6 8 B 10 9 A A C D 14 16 D C 8 ....
The docoding process is harder to see, but easier to implement, than the
encoding process. The data is taken in pairs, and a new code is assigned to
each pair. The prefix is the left side of the pair, and the suffix is the
color that the right side of the pair decomposes to from the table. The
decomposition is done by outputing the suffix of the code, and using the
prefix as the new code. The process repeats until the prefix is a single
color, and it is output too. The output of the decomposition is pushed onto
a stack, and then poped off the stack to the display, which restores the
original order that the colors were seen by the encoder. We will go thru
the first few entries in detail, which will hopefully make the process
clearer.
The first pair is (A B), so the prefix of code 6 is A and the suffix
is B, and 6 represents the string AB. The color A is sent to the display.
The 2nd pair is (B 6), so the prefix of code 7 is B and the suffix is
the the last color in the decomposition of code 6. Code 6 decomposes into
BA, so code 7 = BA, and has a suffix A. The color B is sent to the display.
The 3rd pair is (6 8) and the next code is 8. How can we decompose 8.
We know that the prefix of code 8 is 6, but we don't know the suffix. The
answer is that we use the suffix of the prefix code; A in this case since
the suffix of 6 is A. Thus, code 8 = ABA and has a suffix A. We decompose 6
to get BA, which becomes AB when we pop it off the stack to the display.
The 4th pair is (8 B), so code 9 has a prefix of 8 and a suffix of B,
and code 9 = ABAB. We output ABA to the stack, and pop it off to the
display as ABA.
The 5th pair is (B 10) and the next code is 10. The prefix of code 10
is B and the suffix is B (since the prefix is B). Code 10 = BB, and we
output the prefix B to the display.
The 6th pair is (10 9) and the next code is 11. Thus the prefix of
code 11 is 10 and the suffix is the last color in the decomposition of 9,
which is A. Thus code 11 = BBA, And we output BB to the display.
So far, we have output the correct colors stream A B AB ABA B BB to
the display, and have duplicated the codes 6 thru 11 in the encoder table.
This process is repeated for the whole data stream to reconstruct the
original color stream and build a table identical to the one built by the
encoder. We start the table with codes 0-5 representing the 4 colors, the
clear code, and the end code. When we get to code 8, we must increse the
code size to 5 bits, etc. See GIFDECOD.GIF for a flow diagram of the
decoding process.
I Hope this helps take some of the mystery out of LZW compression, which is
really quite easy once you 'see' it. Bob Montgomery