
Mongoid and Aggregation Framework: Get similar elements based on tags, ordered by total number of matches (similarity level)

So, let's say we have an Article model with a tags array:

class Article
  include Mongoid::Document
  include Mongoid::Timestamps

  field :content, type: String, default: ''
  field :tags, type: Array, default: []
end

Let's try to pick any similar articles first (without similarity level)

We have an article that has some tags (%w{ ruby rails mongoid mongodb }) and we would like to get similar articles. Nothing special (yet):

current_article = Article.first
similar = Article.in tags: current_article.tags

Let's also exclude our base article (current_article) from the results, since we decided to get similar articles, not similar or equal:

Article
  .ne(_id: current_article.id)
  .in(tags: current_article.tags)

We could even refactor it a bit...

class Article
  include Mongoid::Document
  include Mongoid::Timestamps

  scope :exclude, ->(article) { ne(_id: article.id) }
  scope :similar_to, ->(article) { exclude(article).in(tags: article.tags) }

  field :content, type: String, default: ''
  field :tags, type: Array, default: []

  def similar
    @similar ||= self.class.similar_to self
  end
end

# Example usage:
current_article.similar #=> [Article, Article]

Seems pretty decent, but this won't give us the most similar articles. It will just return articles that share at least one tag with our current_article, without ranking them by how many tags they share. What should we do then?

Mongo Aggregation Framework to the rescue

To get such information, sorted in a proper way, we need to perform the following steps:

  1. Don't include current_article in the result set
  2. Get all articles (except the current one) that share at least one tag with current_article (we did this earlier)
  3. Count how many matching tags occur in each of the articles
  4. Sort articles by similarity
  5. Take the first 10 articles
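
Before moving to the database, the five steps above can be sketched in plain Ruby on in-memory data. This is only an illustration under stated assumptions: the rank_similar helper and the hash-based article records are hypothetical stand-ins (not Mongoid API), and loading every article into memory would be far too slow for a real collection.

```ruby
# Hypothetical in-memory stand-in for the five steps (no MongoDB involved).
# Each article is a plain hash with :id and :tags keys.
def rank_similar(current, articles, limit = 10)
  articles
    .reject { |a| a[:id] == current[:id] }              # step 1: drop the current article
    .map { |a| [a, (a[:tags] & current[:tags]).size] }  # step 3: count shared tags
    .reject { |_, matches| matches.zero? }              # step 2: keep articles with a shared tag
    .sort_by { |_, matches| -matches }                  # step 4: most similar first
    .first(limit)                                       # step 5: take the top results
    .map { |a, matches| { id: a[:id], matches: matches } }
end

current = { id: 1, tags: %w[ruby rails mongoid mongodb] }
articles = [
  current,
  { id: 2, tags: %w[ruby python] },
  { id: 3, tags: %w[ruby rails mongoid] },
  { id: 4, tags: %w[java] },
  { id: 5, tags: %w[mongodb rails] }
]

ranked = rank_similar(current, articles)
```

Here ranked lists article 3 first (three shared tags), then article 5 (two), then article 2 (one); article 4 shares nothing and is dropped.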

Step 1 - Excluding

# Mongoid
Article.where(_id: { "$ne" => current_article.id })
# Mongo (this is still in Ruby - not in Mongo shell!)
{ "$match" => {
    _id: { "$ne" => current_article.id }
  }
}

Step 2 - All articles with at least one similar tag

# Mongoid
Article.in(tags: current_article.tags)
# Mongo (this is still in Ruby - not in Mongo shell!)
{ "$match" => {
    tags: { "$in" => %w{ ruby rails mongoid mongodb } }
  }
}

Step 3 - Unwind by tags

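If you're not familiar with $unwind, have a look at the MongoDB documentation first. It outputs one copy of each article for every element of its tags array, with the tags field holding just that single tag.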

{ "$unwind" => "$tags" }
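
To make the effect of this stage more concrete, here is a rough plain-Ruby imitation of it. The unwind_tags helper and the sample documents are hypothetical and exist only for illustration; the real work is done by MongoDB.

```ruby
# Hypothetical plain-Ruby imitation of { "$unwind" => "$tags" }:
# one output document per element of the tags array.
def unwind_tags(docs)
  docs.flat_map do |doc|
    doc[:tags].map { |tag| doc.merge(tags: tag) }
  end
end

docs = [
  { _id: 1, tags: %w[ruby rails] },
  { _id: 2, tags: %w[ruby] }
]

unwound = unwind_tags(docs)
# unwound holds three documents: two for _id 1 and one for _id 2,
# each with a single string in :tags instead of an array.
```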

Step 4 - Second matching

You may wonder why we filter the results again. The initial filtering was not strictly required, but we did it to remove all unrelated articles, so the data set is much smaller. Unfortunately, $unwind created a document copy for each of the tags, including the tags we don't care about. That's why we have to filter again.

{ "$match" => {
    tags: { "$in" => %w{ ruby rails mongoid mongodb } }
  }
}

Note that we don't need to filter by ID again, since the incoming data set no longer contains the current_article document.

Step 5 - Grouping

Now we can group by document ID. We also add a sum to the grouping, so we know the similarity level of each document. One point in the sum equals one matching tag.

{ "$group" => {
    _id: "$_id",
    matches: { "$sum" => 1 }
  }
}
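
The counting done by this stage can be imitated in plain Ruby as well. Again, this is just a sketch: group_matches and the sample input are hypothetical, and the input is assumed to be the unwound and re-filtered documents, one per matching (article, tag) pair.

```ruby
# Hypothetical plain-Ruby imitation of the $group stage.
def group_matches(unwound_docs)
  counts = Hash.new(0)
  unwound_docs.each { |doc| counts[doc[:_id]] += 1 }  # "$sum" => 1 per document
  counts.map { |id, matches| { _id: id, matches: matches } }
end

unwound_docs = [
  { _id: 1, tags: "ruby" },
  { _id: 1, tags: "rails" },
  { _id: 2, tags: "ruby" }
]

grouped = group_matches(unwound_docs)
# Article 1 appears twice in the input, so it ends up with matches == 2;
# article 2 appears once, so it ends up with matches == 1.
```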

Step 6 - Sorting

Now we can sort by sum to have elements in descending order (most similar on top):

{ "$sort" => { matches: -1 } }

Step 7 - 10 first elements

And the last step - limiting:

{ "$limit" => 10 }

Making it all work together

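In order to execute this whole pipeline in Ruby, we need to use the Article.collection.aggregate method. The stages are passed here as a single array, which is the form the official Mongo Ruby driver expects:

results = Article.collection.aggregate([
  # Steps 1 and 2: skip the current article, keep articles sharing a tag
  { "$match" => {
      tags: { "$in" => current_article.tags },
      _id: { "$ne" => current_article.id }
    }
  },
  # Step 3: one document per tag
  { "$unwind" => "$tags" },
  # Step 4: drop the unwound copies whose tag does not match
  { "$match" => {
      tags: { "$in" => current_article.tags }
    }
  },
  # Step 5: count matching tags per article
  { "$group" => {
      _id: "$_id",
      matches: { "$sum" => 1 }
    }
  },
  # Steps 6 and 7: most similar first, top 10 only
  { "$sort" => { matches: -1 } },
  { "$limit" => 10 }
])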

We won't get Article instances as a result (we'll get a list of hashes, each with an _id and a matches key). We can process it further if we need the similarity level, but if we just need the similar articles (for example to display them), we can simply do:

Article.find results.map { |result| result["_id"] }
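
One caveat: as far as I know, finding by a list of ids is not guaranteed to return the documents in the order of that list, so the similarity ordering may get lost at this point. Below is a minimal plain-Ruby sketch of restoring it. Everything in it is a hypothetical stand-in: fake_article plays the role of the Article model, results mimics the aggregation output and fetched mimics what Article.find might return.

```ruby
# Hypothetical stand-ins, so the sketch runs without Mongoid.
fake_article = Struct.new(:id, :content)

results = [
  { "_id" => 3, "matches" => 3 },
  { "_id" => 5, "matches" => 2 },
  { "_id" => 2, "matches" => 1 }
]
# Pretend the database returned the documents in a different order.
fetched = [fake_article.new(2, "c"), fake_article.new(3, "a"), fake_article.new(5, "b")]

ids     = results.map { |result| result["_id"] }
by_id   = fetched.each_with_object({}) { |article, index| index[article.id] = article }
ordered = ids.map { |id| by_id[id] }.compact  # compact skips ids that were not found

# Similarity level per article id, in case it should be displayed too.
matches = results.each_with_object({}) { |result, h| h[result["_id"]] = result["matches"] }
```

After this, ordered follows the order of results (ids 3, 5, 2) and matches[article.id] gives the similarity level of each article.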

Rails 4.0.1: Revert change on ActiveRecord::Relation#order method - monkey patch to keep Rails 4.0.0 order behaviour

It is a really good habit to review the source code of each new Rails release (or at least the changelog). Today, while reviewing the 4.0.1 release notes, I noticed that the Rails team is going to revert the ActiveRecord::Relation#order functionality, so it will work like it did in 3.2.

I must say that I'm a bit disappointed. I really got used to this functionality and I used it really often. It was quite convenient to create scopes with a default sorting that could be easily "overwritten" by any other. Of course, after that change I can still use the reorder method to get exactly the same effect, but it will require a lot of changes in the code. Also, IMHO it seems kinda unfair: I put a lot of effort into migrating from Rails 3.2 to Rails 4.0.0 (stable) and it seems that some of that work just got wasted, because the Rails guys seem to be a bit unstable. I can understand behaviour changes between major releases, but this is just a fuck*ng small one!


If you're not willing to spend a lot of time going back to the previous ordering mode (and dealing with it), you can use this monkey patch (put it into config/initializers) to keep the current (4.0.0) ordering behaviour:

module ActiveRecord
  class Relation

    def order!(*args)
      args.flatten!
      validate_order_args args

      references = args.reject { |arg| Arel::Node === arg }
      references.map! { |arg| arg =~ /^([a-zA-Z]\w*)\.(\w+)/ && $1 }.compact!
      references!(references) if references.any?

      # if a symbol is given we prepend the quoted table name
      args = args.map { |arg|
        arg.is_a?(Symbol) ? "#{quoted_table_name}.#{arg} ASC" : arg
      }
      self.order_values = args + self.order_values
      self
    end

  end
end
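
The whole difference boils down to whether new order arguments are put in front of the existing ones or behind them. Here is a minimal sketch of that, with plain arrays standing in for a relation's order_values; the helper names are hypothetical and no Rails is required to run it.

```ruby
# Plain arrays stand in for a relation's order_values (hypothetical helpers).
def order_4_0_0(order_values, *args)
  args + order_values  # new arguments go first (what the patch above keeps)
end

def order_4_0_1(order_values, *args)
  order_values + args  # new arguments go last (the 3.2 behaviour)
end

scope_order = ["created_at DESC"]  # e.g. an ordering defined in a scope

prepended = order_4_0_0(scope_order, "title ASC")
appended  = order_4_0_1(scope_order, "title ASC")
# prepended corresponds to ORDER BY title ASC, created_at DESC
# appended  corresponds to ORDER BY created_at DESC, title ASC
```

With prepending, a later order call effectively takes priority over the ordering from the scope; with appending, the ordering from the scope stays the primary one.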

Rafael says that the 4.0.1 ordering will stay as the default one, although I would not recommend making any hasty moves now. I think it's worth waiting at least a few months to find out whether they are going to change it again soon.

Meanwhile, you can review the GitHub commit with this change. Also be prepared for the shitstorm that is coming on Thursday (4.0.1 release day)...


Copyright © 2024 Closer to Code
