I have an application in which I store a lot of data in text files. Recently I’ve needed to compress this data into datasets and send it to a browser. I’ve also decided to remove the uncompressed data and keep only the zipped files. The major advantage is disk usage: 90% less space needed to store the data! However, I’ve run into a problem: how do you retrieve a single file from a zipped collection without unzipping the whole collection? Well, as always, with Ruby it’s quite easy :)

I’ve created a small wrapper around the Ruby Zip library (rubyzip). It contains 3 methods:

  1. self.zip – used to compress a directory
  2. self.unzip – used to decompress a directory
  3. self.open_one – used to retrieve the content of a single file from a compressed directory

First of all, compression…

Zipping directory

require 'rubygems'
require 'zip/zip'
require 'find'
require 'fileutils'

class Zipper

  def self.zip(dir, zip_dir, remove_after = false)
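    Zip::ZipFile.open(zip_dir, Zip::ZipFile::CREATE) do |zipfile|
      Find.find(dir) do |path|
        # Skip hidden files and directories
        Find.prune if File.basename(path)[0] == ?.
        # Entry name is the path relative to dir; escape dir so it is matched literally
        dest = /\A#{Regexp.escape(dir)}\/(\w.*)/.match(path)
        # Skip entries that already exist in the archive
        begin
          zipfile.add(dest[1], path) if dest
        rescue Zip::ZipEntryExistsError
        end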
      end
    end
    FileUtils.rm_rf(dir) if remove_after
  end

end

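We catch the Zip::ZipEntryExistsError exception, so a file that already exists in the archive is not overwritten. Once the archive has been written (no other exceptions raised), the source directory can be removed by passing true as the third argument. A basic call that keeps the source directory: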

Zipper.zip('/home/user/directory', '/home/user/compressed.zip')
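To make the naming inside the archive explicit: each file is stored under its path relative to dir. Here is a minimal sketch of just that step. The helper name entry_name_for is hypothetical (it is not part of the wrapper above), and it escapes dir with Regexp.escape so that characters like + or ( in a directory name are matched literally.

```ruby
# Hypothetical helper, not part of the original Zipper class:
# maps a path found under +dir+ to the name it would get inside the archive.
# Returns nil for the directory itself and for names that do not start
# with a word character (for example hidden files).
def entry_name_for(dir, path)
  match = /\A#{Regexp.escape(dir)}\/(\w.*)/.match(path)
  match && match[1]
end

entry_name_for('/home/user/directory', '/home/user/directory/sub/file.txt')
```

With these arguments the helper returns the relative path sub/file.txt, which is the name you would later pass to open_one.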

Unzipping directory

class Zipper

  def self.unzip(zip, unzip_dir, remove_after = false)
    Zip::ZipFile.open(zip) do |zip_file|
      zip_file.each do |f|
        f_path = File.join(unzip_dir, f.name)
        FileUtils.mkdir_p(File.dirname(f_path))
        zip_file.extract(f, f_path) unless File.exist?(f_path)
      end
    end
    FileUtils.rm(zip) if remove_after
  end

end

Usage is similar to the zip method. We pass the zip file and the directory to unzip into, and we decide whether or not to remove the source archive after unzipping its content.

Zipper.unzip('/home/user/compressed.zip','/home/user/directory', true)

Retrieving single file content

class Zipper

  def self.open_one(zip_source, file_name)
    Zip::ZipFile.open(zip_source) do |zip_file|
      zip_file.each do |f|
        next unless f.name == file_name
        return f.get_input_stream.read
      end
    end
    nil
  end

end

Usage:

Zipper.open_one('/home/user/source.zip', 'subdir_in_zip/file.ext')
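The loop above walks through every entry until the name matches. As an alternative, rubyzip archives also offer find_entry, which looks an entry up by name and returns nil when it is missing. Below is a rough sketch of that variant, not a drop-in replacement: read_entry is a hypothetical helper, and it assumes archive is an already opened Zip::ZipFile.

```ruby
# Sketch: look an entry up by name instead of scanning every entry.
# +archive+ is assumed to be an already opened Zip::ZipFile;
# read_entry is a hypothetical helper, not part of the original class.
def read_entry(archive, file_name)
  entry = archive.find_entry(file_name) # nil when the name is not in the archive
  entry && entry.get_input_stream.read
end

# Intended use with rubyzip (needs the gem and a real archive, so commented out):
#   Zip::ZipFile.open('/home/user/source.zip') do |zip_file|
#     read_entry(zip_file, 'subdir_in_zip/file.ext')
#   end
```

Like open_one, this returns the decompressed content, or nil if the file is not in the archive.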

If the file doesn’t exist, nil will be returned. This method does not save the file to disk; it only returns the decompressed content. I use it to serve this content via a web server.

What about performance? Well, it depends on the size of the zip file, the number of compressed files in the archive and the size of our “target” file. Below is a simple chart showing the relationship between the number of files and the time needed to access a single one. The results are satisfactory for my purposes. A single uncompressed file in a dataset is about 15.9 KB.

As you can see above, access times are quite bearable when you consider the 90% savings on your hard drive.
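If you want to check access times on your own archives, something along these lines should do. It is only a sketch and not necessarily how the chart was produced: average_time is a hypothetical helper built on Ruby's standard Benchmark module, and the number of runs is arbitrary.

```ruby
require 'benchmark'

# Hypothetical helper: average wall-clock time (in seconds) of the given
# block over +runs+ executions.
def average_time(runs = 10)
  total = Benchmark.realtime { runs.times { yield } }
  total / runs
end

# Example with the wrapper from this post (needs rubyzip and a real archive,
# so it is left commented out):
#   average_time(20) do
#     Zipper.open_one('/home/user/source.zip', 'subdir_in_zip/file.ext')
#   end
```

Running it against archives with different numbers of files gives a rough picture of how lookup time grows with archive size.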

Munin chart with disk usage before and after zipping data (fuck yeah!). Look at /home: