Tag: Mongoid

Ruby, Rails + objects serialization (Marshal), Mongoid and performance matters

Introduction

Sometimes we want to store our objects directly in files or in a database (not mapped through an ORM/ODM). We can achieve this with serialization. This process converts (almost) any Ruby object into a format that can be saved as a byte stream. You can read more about serialization here.
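
For a quick illustration, here is a minimal Marshal round trip on a plain Ruby object (the Point struct is just an example):

# A named Struct so that Marshal can dump its instances
Point = Struct.new(:x, :y)

original = Point.new(1, 2)
bytes    = Marshal.dump(original)   # a binary String (byte stream)
restored = Marshal.load(bytes)      # a new, equal Point instance

restored == original # => true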

Serializing stuff with Ruby

Ruby ships with Marshal-based serialization, and it is quite easy to use. If you use ActiveRecord, you can use this simple class to store objects in any AR-supported database:

class PendingObject < ActiveRecord::Base

  # Iterate through all pending objects
  def self.each
    self.all.each do |el|
      yield el, el.restore
    end
  end

  # Marshal given object and store it on db
  def store(object)
    self.object = Marshal.dump(object)
    self.save!
  end

  # "Unmarshal" it and return
  def restore
    Marshal.load(self.object)
  end

end
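
Usage could then look like this (some_ruby_object stands in for any object that Marshal can dump; what you do with the restored object is up to you):

# Persist an object for later processing
PendingObject.new.store(some_ruby_object)

# Iterate over all pending records together with their restored objects
PendingObject.each do |record, object|
  # work with the deserialized object, then clean up the record
  record.destroy
end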

Of course this is just a simple example of how to use serialization. Serialized data should be stored in a binary field:

      t.binary :object
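
For completeness, a minimal migration sketch (the table name follows Rails conventions for the model above; adjust the migration superclass to your Rails version if it requires one, e.g. ActiveRecord::Migration[5.2]):

class CreatePendingObjects < ActiveRecord::Migration
  def change
    create_table :pending_objects do |t|
      t.binary :object
      t.timestamps
    end
  end
end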

Mongo, Mongoid and its issues with serialization

Unfortunately you can't just copy-paste this ActiveRecord solution directly into Mongoid:

class PendingObject

  include Mongoid::Document
  include Mongoid::Timestamps

  field :object, :type => Binary

  # Iterate through all pending objects
  def self.each
    self.all.each do |el|
      yield el, el.restore
    end
  end

  def store(object)
    self.object = Marshal.dump(object)
    self.save!
  end

  def restore
    Marshal.load(self.object)
  end

end

It doesn't matter whether you use Binary or String in the field type declaration. Either way you'll get this as a result:

String not valid UTF-8

I can understand why this happens with a String field, but why does it happen when the field is declared as binary? It should just store whatever I put there...

Base64 to the rescue

In order to fix this, I've decided to run the serialized data through Base64. This has a significant impact on the size of each serialized object (30-35% more), but I can live with that. I was more concerned about the performance, which is why I decided to benchmark it. There are 2 cases that I wanted to check:

  • Serialization
  • Serialization and deserialization (reading serialized objects)
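
Before getting to the numbers, here is a minimal sketch of how the Base64 wrapping could look in the Mongoid model from above (a String field is enough now, since the Base64-encoded payload is plain ASCII; the each class method stays the same as before):

require "base64"

class PendingObject

  include Mongoid::Document
  include Mongoid::Timestamps

  # Base64-encoded marshal dump - a plain String, so no UTF-8 complaints
  field :object, :type => String

  # Marshal the object, then Base64-encode it before saving
  def store(object)
    self.object = Base64.encode64(Marshal.dump(object))
    self.save!
  end

  # Decode the Base64 payload first, then unmarshal it
  def restore
    Marshal.load(Base64.decode64(self.object))
  end

end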

Here are the steps that I took:

  1. Create a simple Ruby object
  2. Serialize it up to 100 000 times, in steps of 1000 (without Base64)
  3. Serialize it up to 100 000 times, in steps of 1000 (with Base64)
  4. Benchmark plain object creation (just as a reference point)
  5. Analyze all the data

Just to be sure (and to minimize the impact of random CPU spikes), I performed each test case 10 times and then took the average values.

Benchmark

Benchmark code is really simple:

  1. Code responsible for preparing the iterations
  2. DummyObject - object that will be serialized
  3. PendingObject - object that will be used to store data in Mongo
  4. ResultStorer - object used to store time results (time taken)
  5. Benchmark - container for all the things
  6. Loops :)
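
The real code is linked below; as a rough sketch only (the object shape and names are assumptions, not the original benchmark.rb), the serialization measurement loop could look like this:

require "benchmark"
require "base64"

# Stand-in for the DummyObject used in the benchmark
DummyObject = Struct.new(:id, :name, :payload)

results = {}

# 100 000 iterations, measured in steps of 1000
(1_000..100_000).step(1_000) do |count|
  dummy = DummyObject.new(1, "dummy", "some payload")

  plain  = Benchmark.realtime { count.times { Marshal.dump(dummy) } }
  base64 = Benchmark.realtime { count.times { Base64.encode64(Marshal.dump(dummy)) } }

  results[count] = { :plain => plain, :base64 => base64 }
end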

You can download source code here (benchmark.rb).

Results, charts, fancy data

First the reference point: pure object initialization (without serialization). We can see that there's no big drop in performance, no matter how many objects we initialize. Initializing 100 000 objects takes around 0.25 seconds.

Now some more interesting data :) Object initialization vs. initialization with serialization (serialization only, without Base64):

It is pretty clear that serialization isn't the fastest thing to do. It can slow the whole process down around 10 times, but that's still only about 2.5 seconds for 100 000 objects. Now let's see what happens when we add Base64 on top of it (for reference, the previous values stay on the chart as well):

It seems that the Base64 conversion slows the whole process down by 10-12% at most. That is still bearable (for 100 000 objects it's around 2.7s).

Now it's time for the most interesting part: deserialization. By "deserialization" I mean the time needed to convert a stream of bytes back into objects (serialization time is not taken into account here):

The results are quite predictable. Adding Base64 to the deserialization process increases the overall time by around 12-14%. As before, this is an acceptable overhead - especially when you realize that, even then, 100 000 objects can be deserialized in less than 2 seconds.

Let's summarize everything we have (pure initialization, serialization, serialization with Base64, deserialization, deserialization with Base64, the full serialization-deserialization process, and serialization-deserialization with Base64):

Conclusions

Based on these calculations and benchmarks, we can see that the overall performance drop when serializing and deserializing with Base64 is around 23-26%. If you're not planning to work with a huge number of objects at once, the whole process will still be extremely fast and perfectly usable.

Of course, if you can use, for example, MySQL with a binary column, there is no need for Base64 at all. On the other hand, if you're using MongoDB (with Mongoid) or any other database that has issues with binary data and you still want to store serialized objects in it, this is a way to go. Even when you factor in the bigger size of the Base64 data, the total performance loss should not exceed 35%.

So: if you don't have time to look for a better solution and you're aware of this one's disadvantages - go ahead and use it ;)

Using multiple MongoDB databases instead of one – performance check

I'm starting to develop a new application. I can't say what it is, but it fits MongoDB's document-oriented approach perfectly. Everything is great, except for one small detail - I don't want to store everything in one database. Of course, I could use collections and embedded documents to organize the nested structures and keep users' data separated, but that would make the source code much more complicated than it should be. Instead, I've decided to use one MongoDB database per user. That way I can separate users' data and I don't need to worry about scoping it out. There will be a gateway that authorizes incoming requests and routes them to the proper database.
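
As a sketch of that gateway idea (assuming Mongoid 3+, where Mongoid.override_database is available; current_user and the database naming scheme are assumptions, not the actual implementation):

class ApplicationController < ActionController::Base
  before_filter :switch_to_user_database
  after_filter  :reset_database

  private

  # Route all Mongoid queries in this request to the current user's database
  def switch_to_user_database
    Mongoid.override_database("app_user_#{current_user.id}")
  end

  # Clear the override so the next request starts from the default database
  def reset_database
    Mongoid.override_database(nil)
  end
end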

Gateway

The schema is pretty straightforward, and the only thing that bothered me was the performance of the multi-db switching. I decided to write a simple benchmark to test whether there's a difference between using one database and using many. The results are promising: there's only around a 5% (5.3% exactly) performance loss when using many databases instead of one. That's a difference I can easily accept. To be honest, I think I will gain much more, especially once there's a lot of data. Let's say I have 100 customers with 100 000 000 records in total. With one database I would have to query all of it; with separate databases, each query touches only 1% of it.
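
A rough sketch of such a comparison could look like this (SomeDocument stands for any Mongoid model, the database naming and query count are made up, and this is not the original benchmark code):

require "benchmark"

QUERIES = 10_000

# All queries hit the single default database
one_db = Benchmark.realtime do
  QUERIES.times { SomeDocument.count }
end

# The same queries spread across 100 per-user databases
many_dbs = Benchmark.realtime do
  QUERIES.times do |i|
    Mongoid.override_database("benchmark_db_#{i % 100}")
    SomeDocument.count
  end
end
Mongoid.override_database(nil)

puts "one db: #{one_db.round(2)}s, many dbs: #{many_dbs.round(2)}s"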

Below you can see the performance difference when querying one vs. many MongoDB databases.

I will definitely go with that approach and I will try to keep you posted.

Note: This is not a full-pro-extremely accurate long-time test - more like a proof of concept. Keep that in mind ;)
