DjVu: a Compression Technique and Software Platform for Publishing Scanned Documents, Digital Documents and High-Resolution Images on the Web.
Despite the growing importance of the Internet, much of the knowledge,
culture, and educational material in existence today is still available
only in paper form. Bringing this wealth of information into the
digital realm in a form that is faithful to the original, easily
accessible, and searchable, is an essential step towards making
the Internet the World's Universal Library.
DjVu (pronounced "deja vu") is a compression technique, a file format,
and a delivery platform that is specifically designed to enable the
creation of digital libraries of printed material, either scanned
from paper or digitally produced. For scanned documents, DjVu file
sizes are typically 3 to 10 times smaller than TIFF or PDF in
black and white, and 5 to 10 times smaller than JPEG in color.
A typical page from a book, magazine, or ancient document scanned in
color at 300dpi contains on the order of 8 million pixels, and occupies
24MB uncompressed. Traditional compression techniques such as JPEG are
notoriously inefficient on several counts:
- typical JPEG file sizes for a page are between 400KB and 2MB
at best, which is totally impractical for remote access.
- sharp edges (such as character outlines) are the cause of
numerous wasted bits and/or unpleasant ringing artifacts.
- such large images are very slow to render, require a very large
memory buffer for the decompressed image in the client, and are
not easily zoomable or pannable with current web browsers.
- the text is not normally separated from the image, and therefore
cannot be OCRed, indexed, or searched.
- no provision is made for multipage documents, unless one encapsulates
the images into a container format such as PDF, thereby adding
additional layers of inefficiency.
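
The pixel count and uncompressed size quoted above can be checked with a few lines of arithmetic. This is only a sketch: the text does not name a page size, so a US Letter page (8.5 x 11 inches) and 24-bit RGB pixels are assumed here, and KB is taken as 1024 bytes.

```python
# Assumptions (not stated in the text): US Letter page, 3 bytes per pixel.
WIDTH_IN, HEIGHT_IN = 8.5, 11.0
DPI = 300
BYTES_PER_PIXEL = 3  # 24-bit RGB


def raw_page_size(width_in, height_in, dpi, bytes_per_pixel=BYTES_PER_PIXEL):
    """Return (pixel_count, uncompressed_bytes) for a scanned page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    pixels = width_px * height_px
    return pixels, pixels * bytes_per_pixel


pixels, raw_bytes = raw_page_size(WIDTH_IN, HEIGHT_IN, DPI)
print(f"{pixels / 1e6:.1f} million pixels, {raw_bytes / 2**20:.1f} MiB uncompressed")

# Compression ratios implied by the JPEG file sizes quoted in the list above.
for label, size_kb in [("JPEG, 400KB", 400), ("JPEG, 2MB", 2048)]:
    print(f"{label}: about {raw_bytes / (size_kb * 1024):.0f}:1")
```

Under these assumptions the page comes to roughly 8.4 million pixels and about 24 MiB, consistent with the figures in the text.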
The DjVu system alleviates these problems and can handle bitonal
documents, low-color (palettized) images, continuous-tone images
(photos, etc.), scanned documents in color or grayscale, as well
as digitally produced documents (e.g. in PostScript or PDF).
Bitonal documents are encoded with a technique dubbed JB2, which
builds a compressed library of repeating shapes in the document
(such as characters), and codes the locations where they appear on
each page. Low-color images are compressed much the same way, with
the addition of a color palette, and a color index for each shape.
Continuous-tone images are compressed with a progressive
wavelet-based method dubbed IW44 that is on par with JPEG-2000 in
terms of signal-to-noise ratio, but whose decoder/renderer is very
memory efficient and optimized for speed (3 times faster than the
fastest JPEG-2000 mode). The coder back-ends make extensive use
of a new binary adaptive arithmetic coder called the Z-coder.
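
The idea of a shape library plus a list of locations can be illustrated with a toy example. This is a minimal sketch and not the JB2 codec: it only reuses shapes that are pixel-identical and skips the entropy-coding stage entirely. The function names and the bitmap representation (tuples of strings, `#` for an ink pixel) are hypothetical.

```python
def encode_shapes(marks):
    """Split marks into a shape library and a list of placements.

    marks: list of (bitmap, x, y); bitmap is a tuple of row strings.
    Each distinct bitmap is stored once; each occurrence becomes
    (shape_index, x, y).
    """
    library = []
    index_of = {}
    placements = []
    for bitmap, x, y in marks:
        if bitmap not in index_of:
            index_of[bitmap] = len(library)
            library.append(bitmap)
        placements.append((index_of[bitmap], x, y))
    return library, placements


def decode_shapes(library, placements, width, height):
    """Paint every placed shape onto a blank bitonal page (0 = paper, 1 = ink)."""
    page = [[0] * width for _ in range(height)]
    for shape_index, x, y in placements:
        for dy, row in enumerate(library[shape_index]):
            for dx, pixel in enumerate(row):
                if pixel == "#":
                    page[y + dy][x + dx] = 1
    return page


# Two made-up 3x3 glyphs; the first appears twice on the page.
GLYPH_A = ("###", "# #", "###")
GLYPH_B = ("#  ", "###", "###")
marks = [(GLYPH_A, 0, 0), (GLYPH_B, 4, 0), (GLYPH_A, 8, 0)]

library, placements = encode_shapes(marks)
page = decode_shapes(library, placements, width=11, height=3)
print(len(library), "shapes stored for", len(placements), "marks")
```

The saving comes from the repeated glyph: it is stored once in the library and referenced by index wherever it occurs.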
Scanned color documents are decomposed into a foreground plane and a
background plane. The foreground plane contains the text and the line
drawings compressed as a bitonal or low-color image at maximum
resolution (using JB2), thereby preserving the sharpness and
readability of the text. The background plane contains the pictures
and paper textures compressed at reduced resolution with IW44. Areas
of the background covered by foreground components are smoothly
interpolated so as to minimize their coding cost. The
foreground/background segmenter first detects objects that are sharply
contrasted with their surroundings, and then classifies them into the
foreground or the background planes using several criteria, such as
their color uniformity, their geometry, and an estimate of their
coding cost.
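
On the decoding side, the two planes have to be recombined into a single image. The sketch below shows one simple way to do that, under stated assumptions: a full-resolution 0/1 mask marks foreground pixels, all foreground pixels share one color, and the background is upsampled by nearest-neighbor lookup. The function name and these simplifications are illustrative and are not taken from the DjVu decoder.

```python
def compose_page(mask, fg_color, background, scale):
    """Rebuild a page from a full-resolution mask and a low-resolution background.

    mask:       rows of 0/1 at full resolution (1 = foreground pixel)
    fg_color:   one (r, g, b) color used for every foreground pixel
    background: rows of (r, g, b) stored at 1/scale of the full resolution
    scale:      integer subsampling factor of the background
    """
    page = []
    for y, mask_row in enumerate(mask):
        bg_row = background[y // scale]
        page.append([
            fg_color if is_foreground else bg_row[x // scale]
            for x, is_foreground in enumerate(mask_row)
        ])
    return page


INK = (0, 0, 0)
PAPER = (250, 245, 230)
PHOTO = (120, 80, 60)

mask = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 1],
]
background = [
    [PAPER, PAPER],
    [PAPER, PHOTO],
]
page = compose_page(mask, INK, background, scale=2)
```

Text edges stay sharp because the mask is kept at full resolution, while the background can be stored with a quarter of the pixels in this example.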
Digitally produced PDF or PostScript documents are turned into a list
of low-level drawing commands using the popular tool Ghostscript.
This list is then translated into a list of non-overlapping shapes
which are subsequently classified into the foreground or the
background layer using a number of shape-based heuristics. The
layers are then compressed as with scanned documents.
Bitonal documents in DjVu typically occupy 5 to 30KB per page at
300dpi, which is 3 to 8 times smaller than Group 4 (used in Fax
machines, in TIFF files, and in PDF files). Low-color images such as
icons are typically 2 times smaller than with GIF, but can be up to 10
times smaller if they contain lots of text. Photos are about 2 times
smaller than JPEG, and about the same size as fast modes of JPEG-2000
for the same SNR. An interesting aspect of the IW44 wavelet codec is that
it is optimized to allow on-the-fly decompression/rendering of the area
visible in the display window (and not more) as the user zooms and
pans around. This makes it possible to keep images in compressed form in
the RAM of the client machine and to display very large images
without excessive memory requirements. Scanned color and grayscale
documents in DjVu are typically 30 to 100KB per page at 300dpi, which
is 5 to 10 times smaller than JPEG, and about 2-3 times smaller than
MRC/T.44 or TIFF/FX. Digitally produced documents with mostly text
are typically 1 to 3 times smaller than PDF or gzipped PostScript
originals at 300dpi, but can be considerably smaller if the documents
contain many pictures.
DjVu documents are displayed within web browsers through a very
lightweight plug-in (available for all major platforms). Everything in
the design of DjVu was optimized to reduce the delay between the user's
decision to view a page, and the display of that page on the screen.
A multithreaded software architecture with smart caching allows
individual document components to be loaded and pre-decoded on demand.
Pages are loaded on demand, allowing random access without prior
download of the entire document, and without the help of a byte
server. Page components (foreground layer, background chunks, etc.) are
downloaded in sequence and rendered by a separate thread as soon as
they are complete. This allows progressive rendering and refinement of
the images. The page that follows the page currently being displayed
is pre-loaded, pre-decoded, and cached automatically, thereby reducing
the page-flipping delay. The DjVu viewer has a "modeless" graphical
user interface that allows fast zooming, panning, and page flipping
with a single mouse operation or keystroke.
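
The pre-loading behavior described above can be sketched as a small cache that decodes the following page whenever a page is shown. This is a simplified, synchronous illustration: the viewer described in the text does this work in separate threads, and the class name, the `decode_page` callable, and the cache capacity are all hypothetical.

```python
from collections import OrderedDict


class PageCache:
    """Keeps a few decoded pages and pre-decodes the page after the current one."""

    def __init__(self, decode_page, page_count, capacity=4):
        self._decode_page = decode_page  # hypothetical: page number -> decoded page
        self._page_count = page_count
        self._capacity = capacity
        self._pages = OrderedDict()      # page number -> decoded page, oldest first

    def _load(self, number):
        if number in self._pages:
            self._pages.move_to_end(number)  # mark as recently used
            return self._pages[number]
        page = self._decode_page(number)
        self._pages[number] = page
        while len(self._pages) > self._capacity:
            self._pages.popitem(last=False)  # evict the least recently used page
        return page

    def show(self, number):
        """Return the requested page and pre-decode the one that follows it."""
        page = self._load(number)
        if number + 1 < self._page_count:
            self._load(number + 1)
        return page


decoded_log = []


def fake_decode(number):
    # Stand-in for real decoding; records which pages were actually decoded.
    decoded_log.append(number)
    return f"page-{number}"


cache = PageCache(fake_decode, page_count=3)
first = cache.show(0)   # decodes pages 0 and 1
second = cache.show(1)  # page 1 is already cached; only page 2 is decoded
```

Flipping forward therefore only waits on decoding when the user outruns the prefetch.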
The foreground layer can be OCRed and the result embedded back into
the DjVu file as a searchable "hidden text" layer. Tools are available
to extract that text and translate it into a variety of formats that
include each word annotated with the coordinates of its bounding box
on the page. The formats also include the document structure (pages,
columns, paragraphs, lines, words). Hyperlinks, annotations, page
thumbnails, and other metadata can also be embedded into DjVu documents.
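
A word list annotated with bounding boxes is enough to support a simple search that also reports where each hit sits on the page. The sketch below is illustrative only: the `Word` record, the box layout `(x_min, y_min, x_max, y_max)`, and the sample data are assumptions, not the format produced by the DjVu tools.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    text: str
    page: int
    box: tuple  # assumed layout: (x_min, y_min, x_max, y_max) in page pixels


def build_index(words):
    """Map each lowercased word to the list of its occurrences."""
    index = {}
    for word in words:
        index.setdefault(word.text.lower(), []).append(word)
    return index


def search(index, query):
    """Return (page, box) for every occurrence of the query word."""
    return [(word.page, word.box) for word in index.get(query.lower(), [])]


# Made-up sample of extracted words.
words = [
    Word("Wavelet", 1, (120, 300, 260, 340)),
    Word("coder", 1, (270, 300, 350, 340)),
    Word("wavelet", 2, (90, 1200, 230, 1240)),
]
index = build_index(words)
hits = search(index, "wavelet")
```

Because each hit carries its page number and bounding box, a viewer or a search results page can highlight the matching word in place.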
Server-side full-text search can easily be provided using free indexing
tools and a few Perl scripts. Large collections have been (and are being)
put on the Web in DjVu with full-text search capabilities, including
the NIPS Proceedings (http://nips.djvuzone.org, 13 volumes, 14,000
pages at 400dpi, 191MB), the Century Dictionary
(http://www.century-dictionary.com, 12 volumes, over 10,000 pages,
500,000 definitions, 22 million searchable words, 850MB), along with
several national library collections and content from commercial
providers around the world. DjVu is currently used by thousands of
users to publish and exchange scanned documents on the Web.
DjVu can be seen as a general open platform for document delivery.
The DjVu Reference Library, which includes the full multithreaded
decoder/renderer, the IW44 encoder, the palettized image encoder,
and basic bitonal and color document encoders, is available as Free
Software under the GNU GPL and can be used as a platform for
research on new codecs, segmentation schemes, delivery mechanisms,
viewing interfaces, and content analysis systems.
Papers, examples, benchmarks, and pointers are available
at http://www.djvuzone.org.
Source code is available at http://djvu.sourceforge.net.
Plug-ins, compressors, SDKs, and commercial software can be
obtained at http://www.djvu.com.
Servers that can convert documents in any format to DjVu are
available at http://openlib.djvuzone.org,
http://bib2web.djvuzone.org,
and http://any2djvu.djvuzone.org.