kusano fc6ab3
A Fast Method for Identifying Plain Text Files
kusano fc6ab3
==============================================
kusano fc6ab3
kusano fc6ab3
kusano fc6ab3
Introduction
kusano fc6ab3
------------
kusano fc6ab3
kusano fc6ab3
Given a file coming from an unknown source, it is sometimes desirable
kusano fc6ab3
to find out whether the format of that file is plain text.  Although
kusano fc6ab3
this may appear like a simple task, a fully accurate detection of the
kusano fc6ab3
file type requires heavy-duty semantic analysis on the file contents.
kusano fc6ab3
It is, however, possible to obtain satisfactory results by employing
kusano fc6ab3
various heuristics.
kusano fc6ab3
kusano fc6ab3
Previous versions of PKZip and other zip-compatible compression tools
kusano fc6ab3
were using a crude detection scheme: if more than 80% (4/5) of the bytes
kusano fc6ab3
found in a certain buffer are within the range [7..127], the file is
kusano fc6ab3
labeled as plain text, otherwise it is labeled as binary.  A prominent
kusano fc6ab3
limitation of this scheme is the restriction to Latin-based alphabets.
kusano fc6ab3
Other alphabets, like Greek, Cyrillic or Asian, make extensive use of
kusano fc6ab3
the bytes within the range [128..255], and texts using these alphabets
kusano fc6ab3
are most often misidentified by this scheme; in other words, the rate
kusano fc6ab3
of false negatives is sometimes too high, which means that the recall
kusano fc6ab3
is low.  Another weakness of this scheme is a reduced precision, due to
kusano fc6ab3
the false positives that may occur when binary files containing large
kusano fc6ab3
amounts of textual characters are misidentified as plain text.
kusano fc6ab3
kusano fc6ab3
In this article we propose a new, simple detection scheme that features
kusano fc6ab3
a much increased precision and a near-100% recall.  This scheme is
kusano fc6ab3
designed to work on ASCII, Unicode and other ASCII-derived alphabets,
kusano fc6ab3
and it handles single-byte encodings (ISO-8859, MacRoman, KOI8, etc.)
kusano fc6ab3
and variable-sized encodings (ISO-2022, UTF-8, etc.).  Wider encodings
kusano fc6ab3
(UCS-2/UTF-16 and UCS-4/UTF-32) are not handled, however.
kusano fc6ab3
kusano fc6ab3
kusano fc6ab3
The Algorithm
kusano fc6ab3
-------------
kusano fc6ab3
kusano fc6ab3
The algorithm works by dividing the set of bytecodes [0..255] into three
kusano fc6ab3
categories:
kusano fc6ab3
- The white list of textual bytecodes:
kusano fc6ab3
  9 (TAB), 10 (LF), 13 (CR), 32 (SPACE) to 255.
kusano fc6ab3
- The gray list of tolerated bytecodes:
kusano fc6ab3
  7 (BEL), 8 (BS), 11 (VT), 12 (FF), 26 (SUB), 27 (ESC).
kusano fc6ab3
- The black list of undesired, non-textual bytecodes:
kusano fc6ab3
  0 (NUL) to 6, 14 to 31.
kusano fc6ab3
kusano fc6ab3
If a file contains at least one byte that belongs to the white list and
kusano fc6ab3
no byte that belongs to the black list, then the file is categorized as
kusano fc6ab3
plain text; otherwise, it is categorized as binary.  (The boundary case,
kusano fc6ab3
when the file is empty, automatically falls into the latter category.)
kusano fc6ab3
kusano fc6ab3
kusano fc6ab3
Rationale
kusano fc6ab3
---------
kusano fc6ab3
kusano fc6ab3
The idea behind this algorithm relies on two observations.
kusano fc6ab3
kusano fc6ab3
The first observation is that, although the full range of 7-bit codes
kusano fc6ab3
[0..127] is properly specified by the ASCII standard, most control
kusano fc6ab3
characters in the range [0..31] are not used in practice.  The only
kusano fc6ab3
widely-used, almost universally-portable control codes are 9 (TAB),
kusano fc6ab3
10 (LF) and 13 (CR).  There are a few more control codes that are
kusano fc6ab3
recognized on a reduced range of platforms and text viewers/editors:
kusano fc6ab3
7 (BEL), 8 (BS), 11 (VT), 12 (FF), 26 (SUB) and 27 (ESC); but these
kusano fc6ab3
codes are rarely (if ever) used alone, without being accompanied by
kusano fc6ab3
some printable text.  Even the newer, portable text formats such as
kusano fc6ab3
XML avoid using control characters outside the list mentioned here.
kusano fc6ab3
kusano fc6ab3
The second observation is that most of the binary files tend to contain
kusano fc6ab3
control characters, especially 0 (NUL).  Even though the older text
kusano fc6ab3
detection schemes observe the presence of non-ASCII codes from the range
kusano fc6ab3
[128..255], the precision rarely has to suffer if this upper range is
kusano fc6ab3
labeled as textual, because the files that are genuinely binary tend to
kusano fc6ab3
contain both control characters and codes from the upper range.  On the
kusano fc6ab3
other hand, the upper range needs to be labeled as textual, because it
kusano fc6ab3
is used by virtually all ASCII extensions.  In particular, this range is
kusano fc6ab3
used for encoding non-Latin scripts.
kusano fc6ab3
kusano fc6ab3
Since there is no counting involved, other than simply observing the
kusano fc6ab3
presence or the absence of some byte values, the algorithm produces
kusano fc6ab3
consistent results, regardless what alphabet encoding is being used.
kusano fc6ab3
(If counting were involved, it could be possible to obtain different
kusano fc6ab3
results on a text encoded, say, using ISO-8859-16 versus UTF-8.)
kusano fc6ab3
kusano fc6ab3
There is an extra category of plain text files that are "polluted" with
kusano fc6ab3
one or more black-listed codes, either by mistake or by peculiar design
kusano fc6ab3
considerations.  In such cases, a scheme that tolerates a small fraction
kusano fc6ab3
of black-listed codes would provide an increased recall (i.e. more true
kusano fc6ab3
positives).  This, however, incurs a reduced precision overall, since
kusano fc6ab3
false positives are more likely to appear in binary files that contain
kusano fc6ab3
large chunks of textual data.  Furthermore, "polluted" plain text should
kusano fc6ab3
be regarded as binary by general-purpose text detection schemes, because
kusano fc6ab3
general-purpose text processing algorithms might not be applicable.
kusano fc6ab3
Under this premise, it is safe to say that our detection method provides
kusano fc6ab3
a near-100% recall.
kusano fc6ab3
kusano fc6ab3
Experiments have been run on many files coming from various platforms
kusano fc6ab3
and applications.  We tried plain text files, system logs, source code,
kusano fc6ab3
formatted office documents, compiled object code, etc.  The results
kusano fc6ab3
confirm the optimistic assumptions about the capabilities of this
kusano fc6ab3
algorithm.
kusano fc6ab3
kusano fc6ab3
kusano fc6ab3
--
kusano fc6ab3
Cosmin Truta
kusano fc6ab3
Last updated: 2006-May-28