INFO: Structure of .ARC files

From: Doug Wokoun (aa384@cleveland.Freenet.Edu)
Date: 07/03/90-12:00:51 PM Z




(ARCINF.TXT)
 
ARC-FILE.INF, created by Keith 
Petersen, W8SDZ, 21-Sep-86, extracted 
from UNARC.INF by Robert A. Freed.
 
From: Robert A. Freed
Subject: Technical Information for ARC files
Date: June 24, 1986
 
Note: In the following discussion, 
UNARC refers to my CP/M-80 program for 
extracting files from MSDOS ARCs. The 
definitions of the ARC file format are 
based on MSDOS ARC512.EXE.
 
ARCHIVE FILE FORMAT 
-------------------
 
Component files are stored 
sequentially within an archive. Each 
entry is preceded by a 29-byte header, 
which contains the directory 
information. There is no wasted space 
between entries. (This is in contrast 
to the centralized directory used by 
Novosielski libraries. Although random 
access to subfiles within an archive 
can be noticeably slower than with 
libraries, archives do have the 
advantage of not requiring 
pre-allocation of directory space.)
 
Archive entries are normally 
maintained in sorted name order. The 
format of the 29-byte archive header 
is as follows:
 
Byte 1: 1A Hex.
         This marks the start of an archive header. If this byte is not
         found when expected, UNARC will scan forward in the file (up to
         64K bytes) in an attempt to find it (followed by a valid
         compression version). If a valid header is found in this
         manner, a warning message is issued and archive file processing
         continues. Otherwise, the file is assumed to be an invalid
         archive and processing is aborted. (This is compatible with
         MS-DOS ARC version 5.12.) Note that a special exception is made
         at the beginning of an archive file, to accommodate
         "self-unpacking" archives (see below).
 
Byte 2: Compression version, as follows:
 
         0 = end of file marker (remaining bytes not present)
         1 = unpacked (obsolete)
         2 = unpacked
         3 = packed
         4 = squeezed (after packing)
         5 = crunched (obsolete)
         6 = crunched (after packing) (obsolete)
         7 = crunched (after packing, using faster hash algorithm)
             (obsolete)
         8 = crunched (after packing, using dynamic LZW variations)
 
Bytes 3-15: ASCII file name, nul-terminated.
 
(All of the following numeric values are stored low-byte first.)
 
Bytes 16-19: Compressed file size in bytes.
 
Bytes 20-21: File date, in 16-bit MS-DOS format:
              Bits 15:9 = year - 1980
              Bits 8:5 = month of year
              Bits 4:0 = day of month
              (All zero means no date.)
 
Bytes 22-23: File time, in 16-bit MS-DOS format:
              Bits 15:11 = hour (24-hour clock)
              Bits 10:5 = minute
              Bits 4:0 = second/2 (not displayed by UNARC)
 
Bytes 24-25: Cyclic redundancy check (CRC) value (see below).
 
Bytes 26-29: Original (uncompressed) file length in bytes.
              (This field is not present for version 1 entries, byte 2
              = 1. I.e., in this case the header is only 25 bytes long.
              Because version 1 files are uncompressed, the value
              normally found in this field may be obtained from bytes
              16-19.)
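 
As an illustration only, the header layout above can be decoded with
a few lines of Python. This is a minimal sketch, not code taken from
UNARC or ARC; the names parse_header and decode_dos_datetime and the
dictionary keys are hypothetical, chosen for this example.

```python
import struct

HEADER_MARKER = 0x1A  # byte 1 of every archive header


def decode_dos_datetime(date, time):
    """Unpack the 16-bit MS-DOS date and time words described above.

    Returns (year, month, day, hour, minute, second), or None when the
    date word is all zero ("no date").
    """
    if date == 0:
        return None
    year = (date >> 9) + 1980        # bits 15:9
    month = (date >> 5) & 0x0F       # bits 8:5
    day = date & 0x1F                # bits 4:0
    hour = time >> 11                # bits 15:11
    minute = (time >> 5) & 0x3F      # bits 10:5
    second = (time & 0x1F) * 2       # bits 4:0 hold second/2
    return (year, month, day, hour, minute, second)


def parse_header(buf, offset=0):
    """Parse one archive header starting at buf[offset].

    Returns (fields, header_length). For the end-of-file marker
    (version 0) it returns (None, 2), since only two bytes are present.
    """
    marker, version = struct.unpack_from("<BB", buf, offset)
    if marker != HEADER_MARKER:
        raise ValueError("missing 1A hex header marker")
    if version == 0:
        return None, 2
    # Bytes 3-25: name (13), compressed size (4), date, time, CRC (2 each).
    # "<" selects low-byte-first order with no padding between fields.
    name, csize, date, time, crc = struct.unpack_from("<13sIHHH", buf, offset + 2)
    if version == 1:
        # Version 1 headers are 25 bytes; the length equals bytes 16-19.
        length, header_length = csize, 25
    else:
        (length,) = struct.unpack_from("<I", buf, offset + 25)
        header_length = 29
    fields = {
        "version": version,
        "name": name.split(b"\0", 1)[0].decode("ascii"),
        "compressed_size": csize,
        "date": decode_dos_datetime(date, time),
        "crc": crc,
        "length": length,
    }
    return fields, header_length
```

Because entries are stored back to back, the next header would be
expected at offset + header_length + compressed_size, so a caller
could list an archive by calling parse_header repeatedly until it
returns None for the fields.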
 
SELF-UNPACKING ARCHIVES 
-----------------------
 
A "self-unpacking" archive is one 
which can be renamed to a .COM file 
and executed as a program. An example 
of such a file is the MS-DOS program 
ARC512.COM, which is a standard 
archive file preceded by a three-byte 
jump instruction. The first entry in 
this file is a simple "bootstrap" 
program in uncompressed form, which 
loads the subfile ARC.EXE (also 
uncompressed) into memory and passes 
control to it. In anticipation of a 
similar scheme for future distribution 
of UNARC, the program permits up to 
three bytes to precede the first 
header in an archive file (with no 
error message).
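 
A rough sketch of how a reader might tolerate those leading bytes is
given below, in Python. It is an assumption about one possible
approach, not UNARC's actual logic; the name find_header and its
parameters are hypothetical. The same routine, called with a larger
max_skip, approximates the 64K forward scan described under Byte 1.

```python
HEADER_MARKER = 0x1A
VALID_VERSIONS = range(0, 9)  # compression versions 0-8 from the header table


def find_header(data, start=0, max_skip=3):
    """Return the offset of the first plausible header at or after start.

    Up to max_skip bytes may precede the header. A position counts as a
    header only if the 1A hex marker is followed by a valid compression
    version. Returns None if nothing suitable is found.
    """
    last = min(start + max_skip, len(data) - 2)
    for pos in range(start, last + 1):
        if data[pos] == HEADER_MARKER and data[pos + 1] in VALID_VERSIONS:
            return pos
    return None
```

For the start of a file, find_header(data) allows the three-byte
jump; for recovery after a damaged entry, something like
find_header(data, start=pos, max_skip=65536) could be used, with a
warning issued whenever the returned offset differs from pos. Note
that this check is only a heuristic: the skipped bytes could by
chance contain 1A hex followed by a small value.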
 
CRC COMPUTATION
---------------
 
Archive files use a 16-bit cyclic 
redundancy check (CRC) for error 
control. The particular CRC polynomial 
used is x^16 + x^15 + x^2 + 1, which 
is commonly known as "CRC-16" and is 
used in many data transmission 
protocols (e.g. DEC DDCMP and IBM 
BSC), as well as by most floppy disk 
controllers. Note that this differs 
from the CCITT polynomial (x^16 + x^12 
+ x^5 + 1), which is used by the 
XMODEM-CRC protocol and the public 
domain CHEK program (although these do 
not adhere strictly to the CCITT 
standard). The MS-DOS ARC program does 
perform a mathematically sound and 
accurate CRC calculation. (We mention 
this because it contrasts with some 
unfortunately popular public domain 
programs we have witnessed, which from 
time immemorial have based their 
calculation on an obscure magazine 
article which contained a 
typographical error!)
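 
For reference, a bit-at-a-time computation with this polynomial might
look like the Python sketch below. The text above gives only the
polynomial; the zero initial value and the low-bit-first processing
order (hence the reflected constant A001 hex) are assumptions here,
following the parameter set usually catalogued as "CRC-16/ARC". The
function name crc16 is hypothetical.

```python
def crc16(data, crc=0):
    """CRC-16 with polynomial x^16 + x^15 + x^2 + 1.

    Assumes a zero initial value and low-bit-first processing, so the
    polynomial 8005 hex appears in its bit-reversed form A001 hex.
    Pass a previous result as crc to continue over more data.
    """
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ 0xA001
            else:
                crc >>= 1
    return crc
```

Under these assumptions the standard test string "123456789" yields
BB3D hex. The optional second argument lets a file be processed in
blocks, e.g. crc = crc16(block, crc) for each block read.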
 
Additional note (while we are on the 
subject of CRC's): The validity of 
using a 16-bit CRC for checking an 
entire file is somewhat questionable. 
Many people quote the statistics 
related to these functions (e.g. "all 
two-bit errors, all single burst 
errors of 16 or fewer bits, 99.997% of 
all single 17-bit burst errors, 
etc."), without realizing that these 
claims are valid only if the total 
number of bits checked is less than 
32767 (which is why they are used in 
small-packet data transmission 
protocols). I.e., for file sizes in 
excess of about 4K bytes, a 16-bit CRC 
is not really as good as what is often 
claimed. This is not to say that it is 
bad, but there are more reliable 
methods available (e.g. the 32-bit 
AUTODIN-II polynomial). (End of 
lecture!)
 
                           Bob Freed
                           62 Miller Road
                           Newton Centre, MA 02159
                           Telephone (617) 332-3533
