I have an application that stores a lot of data in text files. Recently I needed to compress this data into datasets and send it to a browser. I also decided to remove the uncompressed data and keep only the zipped files. The major advantage is disk usage: 90% less space needed to store the data! However, I ran into a problem: how do you retrieve a single file from a zipped collection without unzipping the whole collection? Well, as always, with Ruby it's quite easy :)
I've created a small wrapper around the rubyzip library. It contains 3 methods:
- self.zip - compresses a directory
- self.unzip - decompresses a directory
- self.open_one - retrieves the content of a single file from a compressed directory
First of all, compression...
Zipping directory
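require 'rubygems'
require 'zip/zip'
require 'find'
require 'fileutils'

class Zipper
  # Compresses dir into the archive zip_dir; removes dir afterwards if remove_after is true
  def self.zip(dir, zip_dir, remove_after = false)
    Zip::ZipFile.open(zip_dir, Zip::ZipFile::CREATE) do |zipfile|
      Find.find(dir) do |path|
        # Skip hidden files and do not descend into hidden directories
        Find.prune if File.basename(path)[0] == ?.
        # The entry name is the path relative to dir (dir is escaped so that
        # characters such as "." are matched literally)
        dest = /#{Regexp.escape(dir)}\/(\w.*)/.match(path)
        # Skip files that already exist in the archive
        begin
          zipfile.add(dest[1], path) if dest
        rescue Zip::ZipEntryExistsError
        end
      end
    end
    FileUtils.rm_rf(dir) if remove_after
  end
end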
We rescue Zip::ZipEntryExistsError, so files that already exist in the archive are not overwritten. Once the whole directory has been added (and no other exception was raised), the source directory is removed if remove_after is true. Usage, keeping the source directory:
Zipper.zip('/home/user/directory', '/home/user/compressed.zip')
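The entry names stored in the archive are paths relative to the source directory, and anything whose name starts with a dot is pruned. Here is a minimal sketch of just that path handling, with no zip library involved. The helper name relative_entries is made up for illustration and is not part of Zipper, and it escapes the directory name with Regexp.escape as a precaution:

```ruby
require 'find'
require 'tmpdir'
require 'fileutils'

# Hypothetical helper (not part of Zipper): lists the entry names that a
# directory walk like the one in Zipper.zip would produce.
def relative_entries(dir)
  entries = []
  Find.find(dir) do |path|
    # Skip hidden files and do not descend into hidden directories
    Find.prune if File.basename(path).start_with?('.')
    # Capture the part of the path below dir; no match for dir itself
    dest = /#{Regexp.escape(dir)}\/(\w.*)/.match(path)
    entries << dest[1] if dest
  end
  entries.sort
end

# Try it on a throwaway directory tree
Dir.mktmpdir do |tmp|
  FileUtils.mkdir_p(File.join(tmp, 'sub'))
  File.write(File.join(tmp, 'a.txt'), 'a')
  File.write(File.join(tmp, 'sub', 'b.txt'), 'b')
  File.write(File.join(tmp, '.hidden'), 'h')
  p relative_entries(tmp) # => ["a.txt", "sub", "sub/b.txt"]
end
```

Note that directories show up as entries too, and that names must start with a word character to match.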
Unzipping directory
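class Zipper
  # Extracts the archive zip into unzip_dir; removes the archive afterwards if remove_after is true
  def self.unzip(zip, unzip_dir, remove_after = false)
    Zip::ZipFile.open(zip) do |zip_file|
      zip_file.each do |f|
        f_path = File.join(unzip_dir, f.name)
        FileUtils.mkdir_p(File.dirname(f_path))
        # Do not overwrite files that already exist on disk
        zip_file.extract(f, f_path) unless File.exist?(f_path)
      end
    end
    FileUtils.rm(zip) if remove_after
  end
end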
Usage is similar to the zip method. We pass the zip file, the directory to unzip into, and whether or not to remove the source file after unzipping its content.
Zipper.unzip('/home/user/compressed.zip','/home/user/directory', true)
Retrieving single file content
class Zipper
  # Returns the decompressed content of file_name from zip_source, or nil if it is not in the archive
  def self.open_one(zip_source, file_name)
    Zip::ZipFile.open(zip_source) do |zip_file|
      zip_file.each do |f|
        next unless f.name == file_name
        return f.get_input_stream.read
      end
    end
    nil
  end
end
Usage:
Zipper.open_one('/home/user/source.zip', 'subdir_in_zip/file.ext')
If the file doesn't exist, nil is returned. This method does not save the file to disk; it only returns the decompressed content. I use it to serve this content via a web server.
What about performance? Well, it depends on the size of the zip file, the number of compressed files in the archive and the size of our "target" file. Below is a simple chart showing the relationship between the number of files and the speed of accessing a single one. The results are satisfactory for my purposes. A single uncompressed file in the dataset is about 15.9 KB.
As you can see above, access times are quite bearable when you consider the 90% savings on your hard drive.
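If you want to take similar measurements on your own archives, a rough sketch using Ruby's standard Benchmark module could look like the following. The helper average_seconds is a made-up name, and the workload in the block is only a stand-in so the snippet runs without any archive; it is not how the chart above was produced:

```ruby
require 'benchmark'

# Hypothetical helper: average wall-clock seconds per call of the given block
def average_seconds(runs = 10)
  total = Benchmark.realtime { runs.times { yield } }
  total / runs
end

# Stand-in workload; to time the wrapper, swap in a call such as
#   Zipper.open_one('/home/user/source.zip', 'subdir_in_zip/file.ext')
avg = average_seconds(5) { (1..1000).reduce(:+) }
puts format('%.6f s per call', avg)
```

Averaging over several runs smooths out disk cache effects a little, although the first read of a cold archive will usually be slower than the rest.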
Munin chart with disk usage before and after zipping data (fuck yeah!). Look at /home: