Implementing Huffman Encoding for Data Compression using Rust

Benjamin Chibuzor-Orie
May 28, 2023

In the vast landscape of programming, one of the most exciting challenges developers encounter is working with data compression algorithms. These algorithms serve an invaluable purpose: they enable us to store and transmit data more efficiently, a necessity in our increasingly data-driven world. Among the myriad of data compression algorithms, the Huffman algorithm has proven to be an effective and widely utilized method. This algorithm, based on the concept of using shorter code representations for frequently occurring entries, has been the foundation of many file compression systems.

In this blog post, we’re going to delve into the nitty-gritty of implementing the Huffman Compression Algorithm, but with an intriguing twist — we’ll be using Rust, a programming language that’s been gaining traction for its emphasis on performance, memory safety, and concurrency.

Rust, with its impressive speed and safety guarantees, lends itself perfectly to the rigorous demands of implementing a compression algorithm like Huffman’s. Whether you’re a seasoned Rustacean or a curious programmer looking to expand your horizons, this post promises to be an informative journey. We’ll break down the complexities of the Huffman algorithm, explain the nuances of implementing it in Rust, and offer step-by-step guidance to ensure you can follow along.

So buckle up, as we set out to explore the implementation of the Huffman Compression Algorithm using Rust. By the end of this post, you’ll have gained a deeper understanding of data compression techniques and the robust capabilities of Rust, and hopefully, you’ll be equipped to take on similar programming challenges in your future endeavors.

Huffman Encoding

This algorithm operates by assigning shorter codes to bytes that occur frequently and longer codes to those that appear less often. The notion of encoding isn’t a novel one, so to get a solid grasp of the benefit Huffman provides, we will compare two modes of encoding:

Fixed-width Encoding

This is a type of encoding where the codes assigned to the bytes of an input stream all have the same fixed width. Let’s say we need to encode the string abbcdeef. It contains 6 unique characters, so we need a code width of at least 3 bits per character: 2 bits can only distinguish 2^2 = 4 values, while 3 bits can distinguish 2^3 = 8.

We can arbitrarily assign codes to those characters with that width in mind, as below:

a — 000, b — 001, c — 010, d — 011, e — 100, f — 101

Using the above code table we can now compress that string and it will look like this instead:

000001001010011100100101

Well, that’s disappointingly longer than our original string, isn’t it? So why does anyone call this “compression” by any metric? Firstly, this new string is composed of only 1s and 0s, which means we can represent it as a group of bits, so treat that output as binary data and not a string. Treated that way, we need only 24 bits to store the encoded data. Compare that to the unencoded string (abbcdeef), which stores each of its 8 characters as a full byte (8 bits): without any code-to-character mapping we would have needed 64 bits.
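To make that arithmetic concrete, here’s a small sketch of mine (a helper for illustration, not part of the compressor we’ll build) that computes the minimum fixed width and the total encoded size for an input:

use std::collections::HashSet;

// Sketch: the smallest fixed width is the number of bits needed to give
// every unique symbol its own code.
fn fixed_width_bits(input: &str) -> (u32, usize) {
    let unique: HashSet<char> = input.chars().collect();
    let width = (unique.len() as f64).log2().ceil() as u32;
    (width, width as usize * input.chars().count())
}

fn main() {
    let (width, total) = fixed_width_bits("abbcdeef");
    println!("{width} bits per symbol, {total} bits total"); // 3 bits, 24 bits
}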

Variable-width encoding

This is a type of encoding where the codes assigned to the bytes of an input stream do not all have the same width. This encoding is attractive because it reduces the number of bits needed to store the encoded data. However, it has its drawbacks: it complicates the encoding process, because we have to prevent clashes between the codes of different bytes and avoid misleading prefixes. Let’s take an example.

a — 0, b — 1, c — 10, d — 11, e — 100, f — 101

Using the code table above, let’s encode the stream fa. That would return 1010. Now let’s try to decode that back to the original string.

It shouldn’t take you long to notice that the decoded output is ambiguous, because we can get several different outputs from the same input:

baba, cba, cc, and fa can all be decoded from 1010 using that code table. So the central problem of this form of encoding becomes finding a code table that avoids clashes due to prefixes, hence eliminating any ambiguity. This is exactly the problem Huffman encoding solves, and it does the job exceptionally well by doing two things:

  1. Constructing a byte-to-code mapping such that no code is a prefix of any other, while reducing the number of bits needed to represent the data.
  2. Assigning shorter codes to bytes that occur more often and longer codes to bytes that occur less often.
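A quick way to test whether a code table suffers from prefix clashes is to check every pair of codes for a prefix relationship. This little checker is my own illustration, not part of the compressor:

// Sketch: a code table is unambiguous only if no code is a prefix of another.
fn is_prefix_free(codes: &[&str]) -> bool {
    for (i, a) in codes.iter().enumerate() {
        for (j, b) in codes.iter().enumerate() {
            if i != j && b.starts_with(*a) {
                return false;
            }
        }
    }
    true
}

fn main() {
    // The ambiguous table from above: "1" is a prefix of "10", "100", ...
    assert!(!is_prefix_free(&["0", "1", "10", "11", "100", "101"]));
    // A prefix-free table has no such clashes.
    assert!(is_prefix_free(&["0", "10", "110", "111"]));
}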

Let’s walk through the algorithm for Huffman encoding!

The Algorithm

First thing we want to do is create a frequency table for our data to determine how many times each unique byte occurs in the stream. This will help us determine which bytes are more frequent than others.

Then, using this frequency table, we create a binary tree in which more frequently occurring bytes sit closer to the root and less frequently occurring bytes sit farther away. So in a string like aaaabbbccd, a would end up closest to the root while d would be farthest. To achieve this arrangement we do the following (a worked trace follows the list):

  1. Pick the two least frequent entities.
  2. Create a new subtree from these nodes, where the frequency of the parent node is the sum of the frequencies of its children.
  3. Add this new subtree to the list of entities and remove the other two.
  4. Go back to step 1 and continue until there is just one entity left.
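For aaaabbbccd the frequencies are a: 4, b: 3, c: 2, d: 1, and the merging plays out like this:

merge d (1) and c (2) → subtree dc (3), leaving a: 4, b: 3, dc: 3
merge b (3) and dc (3) → subtree bdc (6), leaving a: 4, bdc: 6
merge a (4) and bdc (6) → the root (10)

Reading off 0 for every left branch and 1 for every right branch (which is what the code below does) gives codes like a → 0, b → 11, c → 101, d → 100. The exact bits depend on how ties are broken, but the code lengths always track the frequencies: encoding the string costs 4·1 + 3·2 + 2·3 + 1·3 = 19 bits, versus 80 bits for the raw bytes.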

Okay, let’s step through that in terms of code. Feel free to clone the source from GitHub (recommended) or follow along with this blog post and copy the code samples.

The project is laid out as a Cargo crate; judging from the imports used later in this post, the folder structure looks roughly like this:
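huffman/
├── Cargo.toml
├── data/
│   └── S.csv
└── src/
    ├── main.rs
    ├── lib.rs
    ├── huffman.rs
    └── bitmanip.rs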

I defined this struct to represent a node of our binary tree in huffman.rs:

#[derive(Debug)]
pub struct HuffmanNode {
    freq: usize,      // number of occurrences in the input
    byte: Option<u8>, // Some(..) for leaves, None for internal nodes
    left: Option<Box<HuffmanNode>>,
    right: Option<Box<HuffmanNode>>,
}

I defined the following types for handling errors idiomatically:

pub type Result<T> = std::result::Result<T, Error>;

#[derive(Debug)]
pub enum Error {
    HuffTreeErr,
    Message(String),
}

impl Display for Error {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        match self {
            Self::HuffTreeErr => write!(f, "Failed to create Huffman tree"),
            Self::Message(msg) => msg.fmt(f),
        }
    }
}

impl From<std::io::Error> for Error {
    fn from(value: std::io::Error) -> Self {
        Error::Message(value.to_string())
    }
}

impl From<bitmanip::Error> for Error {
    fn from(value: bitmanip::Error) -> Self {
        match value {
            bitmanip::Error::NotValid => Error::Message("Invalid bit string".into()),
        }
    }
}

I defined this helper function to create nodes easily:

fn new_node(byte: Option<u8>, freq: usize) -> Box<HuffmanNode> {
    Box::new(HuffmanNode {
        byte,
        freq,
        left: None,
        right: None,
    })
}

Then, a function for counting the unique bytes as discussed in the algorithm. We read the input in chunks so we never hold the whole file in memory:

fn count_unique_bytes(mut reader: impl Read + Seek) -> Result<HashMap<u8, usize>> {
    let mut freq_map: HashMap<u8, usize> = HashMap::new();

    // Read in 64 KiB chunks so we never hold the whole input in memory.
    let mut buffer: [u8; 65536] = [0; 65536];

    loop {
        let bytes_read = reader.read(&mut buffer)?;
        if bytes_read == 0 {
            break;
        }
        // Only the first `bytes_read` bytes of the buffer are valid.
        for byt in buffer.iter().take(bytes_read) {
            let count = freq_map.entry(*byt).or_insert(0);
            *count += 1;
        }
    }

    // Rewind so the caller can read the stream again for the encoding pass.
    reader.seek(SeekFrom::Start(0))?;

    Ok(freq_map)
}
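std::io::Cursor wraps an in-memory buffer and implements both Read and Seek, which makes it handy for a quick sanity check. Here’s a small test of my own, assuming it sits in the same module:

#[test]
fn counts_unique_bytes() -> Result<()> {
    use std::io::Cursor;
    let freq = count_unique_bytes(Cursor::new(b"abbcdeef".to_vec()))?;
    assert_eq!(freq[&b'a'], 1);
    assert_eq!(freq[&b'b'], 2);
    assert_eq!(freq[&b'e'], 2);
    Ok(())
}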

Then we have to create a binary tree:

fn create_huffman_tree(mut reader: impl Read + Seek) -> Result<Box<HuffmanNode>> {
    let freq_map = count_unique_bytes(&mut reader)?;

    // Start with one leaf node per unique byte.
    let mut nodes: Vec<Box<HuffmanNode>> = freq_map
        .iter()
        .map(|(ch, freq)| new_node(Some(*ch), *freq))
        .collect();

    while nodes.len() > 1 {
        // Sort descending by frequency so the two rarest nodes sit at the end.
        nodes.sort_by(|a, b| b.freq.cmp(&a.freq));
        let a = nodes.pop().ok_or(Error::HuffTreeErr)?;
        let b = nodes.pop().ok_or(Error::HuffTreeErr)?;

        // Merge them under a byte-less internal node carrying the summed frequency.
        let mut node = new_node(None, a.freq + b.freq);
        node.left = Some(a);
        node.right = Some(b);
        nodes.push(node);
    }

    let root = nodes.pop().ok_or(Error::HuffTreeErr)?;

    Ok(root)
}

This loop sorts the nodes in descending order of frequency, so the most frequent elements come first in the array and the least frequent come last. It then pops the two least frequent elements, creates a subtree out of them, and pushes that subtree back into the array. This repeats until only one element, the root of the tree, remains.
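A note on the design: re-sorting the vector on every pass is quadratic-ish, which is perfectly fine for a 256-entry byte alphabet. If you wanted the textbook O(n log n) construction instead, a min-heap works. Here’s a sketch of my own (not the repository’s code, and assuming it lives in huffman.rs next to HuffmanNode) that keys a std::collections::BinaryHeap on (frequency, insertion order) so HuffmanNode itself never needs to implement Ord:

use std::cmp::Reverse;
use std::collections::BinaryHeap;

fn create_huffman_tree_heap(freq_map: &HashMap<u8, usize>) -> Option<Box<HuffmanNode>> {
    // The heap stores (frequency, arena index); nodes live in the arena so the
    // heap never has to compare HuffmanNode values directly.
    let mut arena: Vec<Option<Box<HuffmanNode>>> = Vec::new();
    let mut heap: BinaryHeap<Reverse<(usize, usize)>> = BinaryHeap::new();

    for (byte, freq) in freq_map {
        heap.push(Reverse((*freq, arena.len())));
        arena.push(Some(new_node(Some(*byte), *freq)));
    }

    while heap.len() > 1 {
        // Reverse turns the max-heap into a min-heap: pop the two rarest nodes.
        let Reverse((fa, ia)) = heap.pop()?;
        let Reverse((fb, ib)) = heap.pop()?;
        let mut parent = new_node(None, fa + fb);
        parent.left = arena[ia].take();
        parent.right = arena[ib].take();
        heap.push(Reverse((fa + fb, arena.len())));
        arena.push(Some(parent));
    }

    let Reverse((_, root)) = heap.pop()?;
    arena[root].take()
}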

Then we need another function for creating the Huffman code table from the binary tree:

fn assign_huffman_codes(node: Box<HuffmanNode>, h: &mut HashMap<u8, String>, s: String) {
    if let Some(ch) = node.byte {
        // Leaf: the accumulated path string is this byte's code.
        h.insert(ch, s);
    } else {
        // Internal node: append 0 going left and 1 going right.
        if let Some(left) = node.left {
            assign_huffman_codes(left, h, s.clone() + "0");
        }
        if let Some(right) = node.right {
            assign_huffman_codes(right, h, s + "1");
        }
    }
}

In this function we assign 0 to every left branch and 1 to every right branch. By recursively traversing the tree from the root, we arrive at a bit string for each byte. Because a byte only ever sits at a leaf, the traversal stops there, so no code can be a prefix of any other.
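Putting the tree and the table together on our earlier example, we can check both properties the encoding promises. Again, a test of my own, assuming it lives in the same module:

#[test]
fn builds_a_prefix_free_table() -> Result<()> {
    use std::io::Cursor;
    let root = create_huffman_tree(Cursor::new(b"aaaabbbccd".to_vec()))?;
    let mut table: HashMap<u8, String> = HashMap::new();
    assign_huffman_codes(root, &mut table, String::new());
    // The most frequent byte must get the shortest code...
    assert!(table[&b'a'].len() <= table[&b'd'].len());
    // ...and no code may be a prefix of another.
    for (x, cx) in &table {
        for (y, cy) in &table {
            assert!(x == y || !cy.starts_with(cx.as_str()));
        }
    }
    Ok(())
}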

I wrote a small module for bit manipulation and I simply named it bitmanip.rs. I used this module to convert a bit string (string containing only 0’s and 1’s) to a byte array and vice versa:

use std::cmp::min;

#[derive(Debug)]
pub enum Error {
    NotValid,
}

pub type Result<T> = std::result::Result<T, Error>;

pub fn bytes_to_bit_str(bytes: Vec<u8>) -> String {
    let mut bits = String::new();

    for byte in bytes {
        // Zero-pad every byte to exactly 8 bits; plain `{:b}` would drop
        // leading zeros and corrupt the round trip.
        bits.push_str(&format!("{:08b}", byte));
    }
    bits
}

pub fn bit_str_to_bytes(bits: &str) -> Result<Vec<u8>> {
    let mut bytes: Vec<u8> = Vec::new();

    let mut i = 0;

    while i < bits.len() {
        // Take up to 8 bits at a time. A trailing chunk shorter than 8 bits
        // parses as if left-padded with zeros, so a full decoder would need
        // to know how many padding bits were added.
        let bin_str = &bits[i..min(i + 8, bits.len())];
        let byte = u8::from_str_radix(bin_str, 2).map_err(|_| Error::NotValid)?;
        bytes.push(byte);
        i += 8;
    }

    Ok(bytes)
}
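Since bytes_to_bit_str zero-pads every byte, whole bytes survive the round trip. Here’s a quick check of my own:

#[test]
fn round_trips_whole_bytes() -> Result<()> {
    let bits = "0100000101000010"; // 0x41 and 0x42, each as 8 padded bits
    let bytes = bit_str_to_bytes(bits)?;
    assert_eq!(bytes, vec![0x41, 0x42]);
    assert_eq!(bytes_to_bit_str(bytes), bits);
    Ok(())
}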

Then the final function that pulls all of it together:

pub fn compress_data(mut reader: impl Read + Seek, mut writer: impl Write) -> Result<()> {
    let tree_root = create_huffman_tree(&mut reader)?;

    let mut huffman_code_table: HashMap<u8, String> = HashMap::new();

    assign_huffman_codes(tree_root, &mut huffman_code_table, "".to_string());

    let mut buffer: [u8; 65536] = [0; 65536];
    // Bits that haven't been flushed yet; carried across chunks so padding
    // only ever happens once, at the very end of the stream.
    let mut compressed_segment = String::new();

    loop {
        let bytes_read = reader.read(&mut buffer)?;

        if bytes_read == 0 {
            break;
        }

        // Only encode the bytes actually read, not the whole 64 KiB buffer.
        for byte in &buffer[..bytes_read] {
            compressed_segment.push_str(&huffman_code_table[byte]);
        }

        // Flush whole bytes only; keep any trailing partial byte for the next chunk.
        let whole = compressed_segment.len() - compressed_segment.len() % 8;
        let remainder = compressed_segment.split_off(whole);
        writer.write_all(&bitmanip::bit_str_to_bytes(&compressed_segment)?)?;
        compressed_segment = remainder;
    }

    // Flush whatever is left; the final partial byte parses as left-padded
    // (see the note in bitmanip).
    if !compressed_segment.is_empty() {
        writer.write_all(&bitmanip::bit_str_to_bytes(&compressed_segment)?)?;
    }

    Ok(())
}
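To see the whole pipeline working end to end without touching the filesystem, a Cursor can stand in for the input file and a Vec<u8> for the output. Another check of my own:

#[test]
fn compresses_in_memory() -> Result<()> {
    use std::io::Cursor;
    let input = b"aaaabbbccd".to_vec();
    let mut compressed: Vec<u8> = Vec::new();
    compress_data(Cursor::new(input.clone()), &mut compressed)?;
    // 10 bytes of input should come out as just a few compressed bytes.
    assert!(compressed.len() < input.len());
    Ok(())
}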

And finally, a main function to run our code:

use std::fs::File;
use std::io::{BufReader, BufWriter};
use std::io;
use huffman::huffman;

fn main() {
    // Compress data/S.csv into data/S.csv.huff.
    let mut reader = file_reader("data/S.csv").unwrap();
    let mut writer = file_writer("data/S.csv.huff").unwrap();

    huffman::compress_data(&mut reader, &mut writer).unwrap();
}

fn file_reader(filename: &str) -> io::Result<BufReader<File>> {
    let file = File::open(filename)?;
    Ok(BufReader::new(file))
}

fn file_writer(filename: &str) -> io::Result<BufWriter<File>> {
    let file = File::create(filename)?;
    Ok(BufWriter::new(file))
}
