zip Archives - Closer to Code

Content

Introduction
Approach
Hardware
Memory gain due to compression
The read-only test case
The write-only test case
Read-write test case
Overall summary

Introduction

Year and a half ago, I was working with a software that used Redis as a buffer to store large sets of text data. We had some bottlenecks there. One of them was related to Redis and the large amount of data, that we had there (large comparing to RAM amount). Since then, I've wanted to check if using Zlib library would be a big improvement or would it be just a next bottleneck (CPU). Unfortunately I don't have access to this software any more, that's why I've decided to create a simple test case just to check this matter.

Approach

We mostly did read operations from Redis (the data was inserted and it was used many times through the cache life cycle), however I've wanted to test as many cases as I could, that's why I've tested following cases (each of them is tested with and without compression):

Read only operations
Write only operations
Read/write operations

I've also considered given data sets:

empty file source
small file source (16.6 kB)
medium file source (47.5 kB)
big file source (221.9 kB)
large file source (4.2 MB)

To make it quite accurate I've tested it 10 times for given case and also for different number of iterations (from 1 to 1000). You can check the results on charts below.

Hardware

Hardware used was rather simple:

2 GB RAM
Pentium 4 HT
2TB SATA WD Disk
Unstressed Debian Linux on board

Memory gain due to compression

The memory gain is quite obvious - it depends on the compression level that we can achive using Zlib library. Below you can see the memory usage before and after compression - for all the given example files:

So, as you can see, except the first case, with the empty file, we gain around 75-85% of space, when using the Zlib compression of text data. In my opinion this is a huge improvement.

The read-only test case

First let's see how the read-only performance decreases due to the compression mechanism that was implemented. On the chart we can see both - the non-compressed and compressed script versions, both for given (same) number of iterations. Each chart represents one file size.

Empty file

For empty file, the speed decrease is around 6%.

Small file

For small file, the speed decrease is around 26%. But still - 1000 reads take us less than a half of second.

Medium file

Here the difference is around 47.5% - the bigger the file gets, the bigger the decompression overhead is. Since the read operations from Redis are quite fast, the decompression time seems to affect the whole process in a quite high level.

Big file

Diffence is around 55%. Which is still tolerable. Of course if we would do a lot of readings that should be performed really fast - this would be a problem.

Large file

Difference is around 33.5%, which is quite weird, since it is lower than with the previous smaller file.

Read-only summary

As you can see above, the decompression overhead is around 30-50%. Is it much? It depends what you are doing. If you are ready to exchange such a speed loss for a memory saving, this might be the solution for you (of course for now we are talking about the read-only scenario).

The write-only test case

Empty file

Difference is around 29%. Seems like the compression impact comparing to decompression is way bigger (nearly 5 times). This doesn't look good for the write-only test case, however let's see more examples.

Small file

Difference is around 161%. This is huge difference - even with small data amounts, compression seams to really influence the write process.

Medium file

Difference is around 520%. No comments here - this level is way beyond the heavy duty usage tolerance level. With such a difference, using compression on heavy-write system would be

Big file

Difference is around 1063%.

Large file

Difference is around 675%.

Write-only summary

It seems like the heavy duty write-only usage with Zlib compression won't be a good idea. The time spent compressing the input data is way beyond any acceptable level. Write process increases so much, that it would be way better to use storage that has compression already implemented (or one that is not in-memory).

Read-write test case

This is the last test case. I've performed one write per 1 read.

Empty file

Difference is around 19.8%.

Small file

Difference is around 104%.

Medium file

Difference is around 350%.

Big file

Difference is around 667%.

Large file

Difference is around 402%.

Read-write summary

It looks like the write process slowed the whole program in a big way. This is a second case that shows, that we should not use this approach in a heavy-write environment

Overall summary

First of all, if you have such a problem at the beginning of your software development, probably you should use a different data-storage (if you plan to have more data than you can handle in memory). This article and the compression "tweak" that I've tried to check/use should be implemented only when you don't have any other option and the Redis memory usage is your biggest bottleneck. Also, you should not try to use this idea when you do a lot of writing to Redis and a only few reads per data chunk. This would slow down your software dramatically. Of course if you do less than few hundred operations per second, you probably would not feel this, but still, try to think about future issues as well.

On the other hand, it seems that when we have an application that uses Redis mostly for reading stuff from it, we might think about compressing the input data. However as I mentioned before - try not to overdo it and implement such a mechanism only when you really need it.

To be honest, I'm a bit disappointed. I thought, that using the compression would not be such a big overhead, however it seems like Redis is way faster than I thought comparing to compression process.

I have an application in which I store a lot of data in text files.Recently I've needed to compress this data into datasets and send it to a browser. I've also decided to remove uncompressed data and leave only zipped files. The mayor advantage is HDD consumption - 90% less space needed to store data! However I've encountered a problem. How to retrieve a single file from a zipped collection without unzipping whole collection? Well as always - with Ruby it's quite easy :)

I've created a small wrapper to a Zip Ruby library. It will contain 3 methods:

self.zip - used to compress directory
self.unzip - used to decompress directory
self.open_one - used to retrieve single file content from a compressed directory

First of all, compression...

Zipping directory

require 'rubygems'
require 'zip/zip'
require 'find'
require 'fileutils'

class Zipper

  def self.zip(dir, zip_dir, remove_after = false)
    Zip::ZipFile.open(zip_dir, Zip::ZipFile::CREATE)do |zipfile|
      Find.find(dir) do |path|
        Find.prune if File.basename(path)[0] == ?.
        dest = /#{dir}\/(\w.*)/.match(path)
        # Skip files if they exists
        begin
          zipfile.add(dest[1],path) if dest
        rescue Zip::ZipEntryExistsError
        end
      end
    end
    FileUtils.rm_rf(dir) if remove_after
  end

end

We catch Zip::ZipEntryExistsError exception - so we won't overwrite files in an archive if the file already exist. After all (no exceptions raised) we can remove the source directory:

Zipper.zip('/home/user/directory', '/home/user/compressed.zip')

Unzipping directory

class Zipper

  def self.unzip(zip, unzip_dir, remove_after = false)
    Zip::ZipFile.open(zip) do |zip_file|
      zip_file.each do |f|
        f_path=File.join(unzip_dir, f.name)
        FileUtils.mkdir_p(File.dirname(f_path))
        zip_file.extract(f, f_path) unless File.exist?(f_path)
      end
    end
    FileUtils.rm(zip) if remove_after
  end

end

Usage is similar to the zip method. We provide zip file, directory to unzip and we decide whether or not to remove source file after unzipping its content.

Zipper.unzip('/home/user/compressed.zip','/home/user/directory', true)

Retrieving single file content

class Zipper

  def self.open_one(zip_source, file_name)
    Zip::ZipFile.open(zip_source) do |zip_file|
      zip_file.each do |f|
        next unless "#{f}" == file_name
        return f.get_input_stream.read
      end
    end
    nil
  end

end

Usage:

Zipper.open_one('/home/user/source.zip', 'subdir_in_zip/file.ext')

If file doesn't exist nil will be returned. This method does not save this file - it will return decompressed content (but won't save it). I use it to serve this content via web-server. What about performance? Well it depends on zipped file size, amount of compressed files in archive and our "target" file size. Below a simple chart showing relationship between the number of files and the speed of accessing a single one. The results are satisfactory for my purposes. The single uncompressed file in a dataset has about 15.9KB.

As you can see above access times are quite bearable when you think about 90% savings on your hard drive.

Munin chart with disk usage before and after zipping data (fuck yeah!). Look at /home:

Tag: zip

Compressing large data sets in Redis with Zlib – Ruby test case

Content

Introduction

Approach

Hardware

Memory gain due to compression

The read-only test case

Empty file

Small file

Medium file

Big file

Large file

Read-only summary

The write-only test case

Empty file

Small file

Medium file

Big file

Large file

Write-only summary

Read-write test case

Empty file

Small file

Medium file

Big file

Large file

Read-write summary

Overall summary

Using Ruby and Zip library to compress directories and read single file from compressed collection

Zipping directory

Unzipping directory

Retrieving single file content