This project implements a custom Huffman compression algorithm, designed to compress text-based files such as `.txt`, `.json`, `.docx`, and `.pdf`. It uses the Huffman coding technique to reduce the size of the text by encoding characters based on their frequency of occurrence in the source data.
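As a quick illustration, in the string `hello world` the letter `l` occurs three times while `w` occurs only once, so `l` receives a shorter bit pattern. One valid (purely illustrative) code table:

```js
// One possible Huffman code table for "hello world"; the exact codes depend
// on how ties are broken while building the tree.
const codes = {
  l: '00', o: '01',                              // most frequent: 2 bits
  r: '100', d: '101',                            // 3 bits
  h: '1100', e: '1101', ' ': '1110', w: '1111',  // least frequent: 4 bits
};
// Encoded length: 32 bits, versus 88 bits (11 characters x 8 bits) in ASCII.
```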
- Text Compression: Compresses text-based file formats by analyzing the frequency of characters and encoding them using the Huffman coding algorithm.
- Binary Output: Generates compressed files in a binary format.
- Metadata: Saves metadata related to the compression process, including the Huffman tree structure, to allow for decompression.
- Compression Metrics: Reports on the compression ratio and the original and compressed file sizes.
- `.txt`
- `.json`
- `.docx`
- `.pdf` (only the text is compressed; embedded images are not)

Note: The algorithm is designed for text-based formats. When handling PDFs that contain images, only the text portion will be compressed.
Before running the project, ensure you have the following dependencies installed:
- Node.js (v16 or higher)
- `npm` or `yarn` for managing packages
- Clone this repository:

  ```bash
  git clone https://github.com/HUMBLEF0OL/file-squeeze.git
  ```

- Navigate to the project directory:

  ```bash
  cd file-squeeze
  ```

- Install the required dependencies:

  ```bash
  npm install
  ```
- Compress a file: Use the `filesqueeze` command with the `compress` option.

  ```bash
  filesqueeze compress <inputFile> [--output <outputDir>]
  ```

  - `<inputFile>`: The file to be compressed (e.g., `sample.txt`).
  - `[--output <outputDir>]`: The directory to store the compressed files (defaults to `./output`).
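  For example, to compress `sample.txt` into a custom directory (the directory name here is illustrative):

  ```bash
  filesqueeze compress sample.txt --output ./compressed
  ```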
- Decompress a file: To decompress a previously compressed file, use the `decompress` command.

  ```bash
  filesqueeze decompress <inputDir> [--output <outputDir>]
  ```

  - `<inputDir>`: The directory containing the compressed files (`encoded.bin` and `metaData.bin`).
  - `[--output <outputDir>]`: The directory to store the decompressed files (defaults to `./output`).
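  For example, to restore the files produced above (paths are illustrative):

  ```bash
  filesqueeze decompress ./compressed --output ./restored
  ```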
The project generates a compression report for each file processed. The report includes:
- Original File Size: Size of the file before compression.
- Compressed File Size: Size of the file after compression.
- Compression Ratio: The ratio of the original file size to the compressed file size.
- Time Taken: Time spent to process and compress the file.
You can view the results in the console after the compression completes.
- The algorithm starts by analyzing the frequency of each character in the input file.
- A priority queue (min-heap) is built from the frequency data, so the least frequent characters are processed first.
- The Huffman tree is built by repeatedly merging the two lowest-frequency nodes into a parent node until only one node (the root) remains.
- Once the tree is built, binary codes are assigned to each character based on its position in the tree. More frequent characters sit closer to the root and receive shorter codes, which is what makes the encoding compact.
- The Huffman tree is serialized and saved in binary format for use during decompression.
- The input text is encoded using the generated Huffman codes, and both the compressed data and the metadata (the serialized Huffman tree) are saved to files.
- The decompression process reads the serialized Huffman tree and decodes the compressed data back into its original form. (A condensed sketch of these steps appears below.)
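Below is a minimal sketch of this pipeline in plain JavaScript. It is not the project's actual implementation: all names are illustrative, the re-sorted array stands in for a real min-heap, and the bit string stands in for packed binary output.

```js
// Minimal Huffman coding sketch (illustrative; not the project's source).

// 1. Count character frequencies.
function countFrequencies(text) {
  const freq = new Map();
  for (const ch of text) freq.set(ch, (freq.get(ch) || 0) + 1);
  return freq;
}

// 2-3. Build the Huffman tree. Re-sorting the array each round stands in for
// a min-heap; a real implementation would use a binary heap for efficiency.
function buildTree(freq) {
  const nodes = [...freq].map(([ch, count]) => ({ ch, count, left: null, right: null }));
  while (nodes.length > 1) {
    nodes.sort((a, b) => a.count - b.count);        // least frequent first
    const left = nodes.shift();
    const right = nodes.shift();
    nodes.push({ ch: null, count: left.count + right.count, left, right });
  }
  return nodes[0];
}

// 4. Assign codes: '0' for a left branch, '1' for a right branch. Frequent
// characters sit closer to the root, so they receive shorter codes.
// (Assumes the input contains at least two distinct characters.)
function assignCodes(node, prefix = '', codes = {}) {
  if (node.ch !== null) {
    codes[node.ch] = prefix;
    return codes;
  }
  assignCodes(node.left, prefix + '0', codes);
  assignCodes(node.right, prefix + '1', codes);
  return codes;
}

// 6. Encode the text as a bit string; real code would pack these bits into a
// Buffer before writing them out. (Step 5, tree serialization, is omitted
// here for brevity.)
function encode(text, codes) {
  return [...text].map((ch) => codes[ch]).join('');
}

// 7. Decode by walking the tree bit by bit, emitting a character at each leaf.
function decode(bits, root) {
  let out = '';
  let node = root;
  for (const bit of bits) {
    node = bit === '0' ? node.left : node.right;
    if (node.ch !== null) {
      out += node.ch;
      node = root;
    }
  }
  return out;
}

const text = 'hello world';
const tree = buildTree(countFrequencies(text));
const codes = assignCodes(tree);
const bits = encode(text, codes);
console.log(`${bits.length} bits vs ${text.length * 8} bits uncompressed`);
console.log(decode(bits, tree) === text); // true
```

In the real tool, the encoded bits are written to `encoded.bin` and the serialized tree to `metaData.bin`, which is what makes standalone decompression possible.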
For example, given an input file containing:

```
hello world
```

- The file will be compressed into a binary file (`encoded.bin`), and metadata will be saved in a separate file (`metaData.bin`).
- Original File Size: 90 KB
- Compressed File Size: 48 KB
- Compression Ratio: 1.875 (original size / compressed size)
If you'd like to contribute to this project, feel free to open a pull request. For bug reports or suggestions, please create an issue in the GitHub repository.
This project is licensed under the MIT License.
- The core compression algorithm is based on the Huffman coding technique. You can read more about it here: [Huffman coding - Wikipedia](https://en.wikipedia.org/wiki/Huffman_coding).
- Special thanks to libraries like `pdf-lib` and `pdf-parse` for PDF text extraction and manipulation.