Running with Ruby

Rack/Rails middleware that will add rel="nofollow" to all your links

A few years ago I wrote a post about adding rel="nofollow" to all the links in your comments, news, posts, messages in Ruby on Rails. I’ve been using this solution for a long time, but recently it started to be a pain in the ass. More and more models, more and more content: having to always declare some sort of filtering logic in the models doesn’t seem legit any more. Instead, I’ve decided to use a different approach. Why not use a Rack middleware that would add the nofollow rel to all “outgoing” links? That way models would not be “polluted” with stuff that is directly related only to views.

Nokogiri is the answer

To replace all the rel attributes, we can use Nokogiri. It is both convenient and fast:

require 'nokogiri'

doc = Nokogiri::HTML.parse(content)

doc.css('a').each do |a|
  a.set_attribute('rel', 'noindex nofollow')
end

doc.to_s

Small corner cases that we need to cover

Unfortunately there are some cases that we need to cover, so simply replacing all the links is not an option. We should not add nofollow when:

  • There’s already a rel defined on an anchor
  • There are local links that should be “followable”
  • There are local links with a full domain in them
  • We want to narrow the anchor search to a given CSS selector (for example, to leave links that are in the layout untouched)

If we include all of the above, our code should look like this:

require 'nokogiri'

doc = Nokogiri::HTML.parse(content)
scope = '#main-content'
host = 'mensfeld.pl'

doc.css(scope + ' a').each do |a|
  # If there's a rel already don't change it
  next unless a.get_attribute('rel').blank?
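  # If this is a local link don't change it (the group anchors both prefixes
  # at the start of the href)
  next unless a.get_attribute('href') =~ /\A(www|http)/i
  # Don't change it also if it is a local link with host (escaped, so the dot
  # in the host name matches only a literal dot)
  next if a.get_attribute('href') =~ /#{Regexp.escape(host)}/i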

  a.set_attribute('rel', 'noindex nofollow')
end
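
The regex checks above are intentionally simple. As a rough alternative sketch (not what the middleware in this post uses), the same "is this link external?" decision can be made by parsing the href with Ruby's standard URI library and comparing host names. The helper name external_link? is made up for this illustration, and treating subdomains of our host as local is an assumption:

```ruby
require 'uri'

# Hypothetical helper (not part of the original middleware): returns true when
# +href+ points to a host other than +host+ or one of its subdomains.
def external_link?(href, host)
  return false if href.nil? || href.strip.empty?

  candidate = href.strip
  # A bare "www.example.com/..." has no scheme, so add one before parsing
  candidate = "http://#{candidate}" if candidate =~ /\Awww\./i

  uri = URI.parse(candidate)
  # Relative links, anchors and mailto: links have no host part: treat as local
  return false if uri.host.nil?

  link_host = uri.host.downcase
  own_host = host.downcase
  # Assumption: the host itself and any of its subdomains count as local
  !(link_host == own_host || link_host.end_with?(".#{own_host}"))
rescue URI::InvalidURIError
  # Hrefs that cannot be parsed are left alone
  false
end
```

Comparing parsed hosts means that a link such as http://example.com/?ref=mensfeld.pl is still treated as external, while a plain substring match on the host would consider it local.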

Hooking it up to Rack middleware

There’s a great Rails on Rack tutorial, so I will skip some details.

Our middleware needs to accept the following options:

  • whitelisted host
  • CSS scope (if we decide to narrow the anchor search)

So, the initialize method for our middleware should look like this:

# @param app [SenpuuV7::Application]
# @param host [String] host that should be allowed - we should allow our internal
#   links to be without nofollow
# @param scope [String] we can narrow to a given part of HTML (id, class, etc)
def initialize(app, host, scope = 'body')
  @app = app
  @host = host
  @scope = scope
end

Each middleware needs to have a call method:

# @param [Hash] env hash
# @return [Array] full rack response
def call(env)
  response = @app.call(env)
  proxy = response[2]

  # Ignore any non text/html requests
  if proxy.is_a?(Rack::BodyProxy) &&
    proxy.content_type == 'text/html'
    proxy.body = sanitize(proxy.body)
  end

  response
end
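
The call method above relies on the body object answering content_type and body=, which depends on what the framework puts behind Rack::BodyProxy. A more framework-neutral sketch, under the assumption that the body is a regular (non-streaming) Rack body responding to each, reads the Content-Type header and rebuilds the body itself. The class name HtmlBodyRewriter and the transform argument are hypothetical; transform is any callable that takes and returns an HTML string, for example a lambda wrapping the sanitize logic:

```ruby
# A sketch of a body-rewriting Rack middleware. The class name and the
# +transform+ argument are hypothetical and not part of the original code.
class HtmlBodyRewriter
  def initialize(app, transform)
    @app = app
    @transform = transform
  end

  def call(env)
    status, headers, body = @app.call(env)
    # Leave anything that is not HTML untouched
    return [status, headers, body] unless html?(headers)

    # A Rack body only has to respond to #each, so collect the chunks first
    html = +''
    body.each { |chunk| html << chunk }
    body.close if body.respond_to?(:close)

    rewritten = @transform.call(html)

    # The size may have changed, so an existing Content-Length must be updated
    length_key = find_key(headers, 'content-length')
    headers[length_key] = rewritten.bytesize.to_s if length_key

    [status, headers, [rewritten]]
  end

  private

  # Header names are looked up case-insensitively
  def find_key(headers, name)
    headers.keys.find { |key| key.to_s.downcase == name }
  end

  def html?(headers)
    key = find_key(headers, 'content-type')
    !key.nil? && headers[key].to_s.start_with?('text/html')
  end
end
```

Matching on the start of the header value also covers values like text/html; charset=utf-8, which a strict equality check would skip.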

And finally, the sanitize method, which encapsulates the Nokogiri logic:

# @param [String] content of a response (body)
# @return [String] sanitized content of response (body)
def sanitize(content)
  doc = Nokogiri::HTML.parse(content)
  # Stop if we couldn't parse with HTML
  return content unless doc

  doc.css(@scope + ' a').each do |a|
    # If there's a rel already don't change it
    next unless a.get_attribute('rel').blank?
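    # If this is a local link don't change it (the group anchors both prefixes
    # at the start of the href)
    next unless a.get_attribute('href') =~ /\A(www|http)/i
    # Don't change it also if it is a local link with host (escaped, so the dot
    # in the host name matches only a literal dot)
    next if a.get_attribute('href') =~ /#{Regexp.escape(@host)}/i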

    a.set_attribute('rel', 'noindex nofollow')
  end

  doc.to_s
# If anything goes wrong, return original content
rescue
  return content
end

Usage example

To use it, just create an initializer in config/initializers of your app with the following code:

require 'nofollow_anchors'

MyApp::Application.config.middleware.use NofollowAnchors, 'mensfeld.pl', 'body #main-content'

Also, don’t forget to add gem 'nokogiri' to your Gemfile.

Performance

Nokogiri is quite fast and, based on a benchmark that I ran, it takes about 5-30 milliseconds to parse the whole content. Below you can see the time and the number of links (up to 488) per page. Keep that in mind when you use this middleware.

[Figure: processing time and number of links per page]

TL;DR – Whole middleware

require 'nokogiri'

# Middleware used to ensure that we don't allow any links outside without a
# nofollow rel
# @example
#   App.middleware.use NofollowAnchors, 'example.com', 'body'
class NofollowAnchors
  # @param app [SenpuuV7::Application]
  # @param host [String] host that should be allowed - we should allow our internal
  #   links to be without nofollow
  # @param scope [String] we can narrow to a given part of HTML (id, class, etc)
  def initialize(app, host, scope = 'body')
    @app = app
    @host = host
    @scope = scope
  end

  # @param [Hash] env hash
  # @return [Array] full rack response
  def call(env)
    response = @app.call(env)
    proxy = response[2]

    if proxy.is_a?(Rack::BodyProxy) &&
      proxy.content_type == 'text/html'
      proxy.body = sanitize(proxy.body)
    end

    response
  end

  private

  # @param [String] content of a response (body)
  # @return [String] sanitized content of response (body)
  def sanitize(content)
    doc = Nokogiri::HTML.parse(content)
    # Stop if we couldn't parse with HTML
    return content unless doc

    doc.css(@scope + ' a').each do |a|
      # If there's a rel already don't change it
      next unless a.get_attribute('rel').blank?
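      # If this is a local link don't change it (the group anchors both
      # prefixes at the start of the href)
      next unless a.get_attribute('href') =~ /\A(www|http)/i
      # Don't change it also if it is a local link with host (escaped, so the
      # dot in the host name matches only a literal dot)
      next if a.get_attribute('href') =~ /#{Regexp.escape(@host)}/i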

      a.set_attribute('rel', 'noindex nofollow')
    end

    doc.to_s
  rescue
    return content
  end
end

Adding rel="nofollow" to all the links in your comments, news, posts, messages in Ruby on Rails

SEO, SEO, SEO

After migrating around 600 news messages from my old PHP CMS to Susanoo (at the Naruto Shippuuden Senpuu website), I’ve realized that a lot of them have external links to other websites. Most of them should have rel="nofollow", but hey, who cared about SEO 8 years ago :)

Fortunately there is a quite simple and convenient way to fix this. Also, after implementing this, we will be able to prevent such things in the future.

Nokogiri, HTML and CSS selectors

What do we need to do? Well, we must:

  • parse our news content (html part generated by CKEditor),
  • fetch all the link tags,
  • add a rel="nofollow" attribute to all the external links.

To start doing this, let’s add a gem called Nokogiri to our gemfile:

gem 'nokogiri'

Nokogiri basic usage is quite simple:

doc = Nokogiri::HTML.parse(html_stuff)

After we create a Nokogiri::HTML::Document instance we can use a CSS selector to get all the links:

doc.css('any selector').each do |link|
  # do smthng with those links
end

Each link instance is a Nokogiri::XML::Element so we can easily add a rel attribute:

# Nokogiri::XML::Element link instance
link[:rel] = 'nofollow'

So we could iterate through all the links, add a rel attribute, run the to_html method and save the output as our content. Unfortunately we cannot do so, because Nokogiri wraps the parsed document in some extra stuff, like a doctype and html and body tags, and displaying that on our website would break the layout.

We could try to gsub links like this:

# Nokogiri::XML::Element link instance
# convert it into a html element
old = link.to_s
# add nofollow and convert to html
link[:rel] = 'nofollow'
new = link.to_s

# try to replace old whole tag with a new one
content.gsub!(old, new)

It might work, but it requires well-formed and valid XHTML. It won’t work with something like this:

<A href="LINK">msg</A>

Some posts on Naruto Shippuuden Senpuu.net are 8 years old and I also had to handle tags like the one above. So, what can I do?

There is a third, slightly less elegant approach (yet it works pretty well). We can replace all the href="link" parts with href="link" rel="nofollow". This approach seems to work for any type of link, valid or not.

Before save callback to the rescue

We can use a before_save callback to handle “nofollowing” all the links in our content. To do so, just place:

before_save :add_nofollow

in your model declaration file and implement the add_nofollow method.

The add_nofollow method is pretty straightforward. We skip links that are local (within our website), cover both quote styles (' and ") around the href value, and we are done.

def add_nofollow
  doc = Nokogiri::HTML.parse(content)
  hrefs = []
  # Collect the hrefs of external links that don't have a rel yet
  doc.css('a').each do |link|
    href = link['href']
    # Anchors without an href have nothing to replace
    next if href.blank?
    next unless link['rel'].blank?
    next unless href.start_with?('http', 'www')
    next if href.downcase.include?('senpuu.net')
    hrefs << href
  end

  # Each href is handled once, so rel is never appended twice to the same link
  hrefs.uniq.each do |href|
    href1 = "href='#{href}'"
    href2 = "href=\"#{href}\""

    self.content = content.gsub(href1, href1 + ' rel="nofollow"')
    self.content = content.gsub(href2, href2 + ' rel="nofollow"')
  end
end
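
The two gsub calls can also be folded into one pattern that accepts either quote style. Below is a small standalone sketch of just the string replacement step, without Rails or Nokogiri. The helper name append_nofollow is hypothetical, and it assumes the list of external hrefs has already been collected:

```ruby
# Hypothetical helper: appends rel="nofollow" right after every href attribute
# whose value is one of +hrefs+, for both quote styles.
def append_nofollow(content, hrefs)
  hrefs.uniq.inject(content) do |result, href|
    # (['"]) captures the opening quote and \1 requires the same closing quote
    pattern = /href=(['"])#{Regexp.escape(href)}\1/
    result.gsub(pattern) { |match| "#{match} rel=\"nofollow\"" }
  end
end

old_style = '<A href="http://example.com">msg</A>'
puts append_nofollow(old_style, ['http://example.com'])
```

Because only the href part is matched, the case of the surrounding tag does not matter, which is the point of this approach for old markup.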

After implementing this in your model logic, you don’t need to worry again about any external links that are inserted into the models.

Problems installing Nokogiri?

If you’ve encountered problems during the Nokogiri installation (running bundle install):

Building native extensions.  This could take a while...
ERROR:  Error installing nokogiri:
ERROR: Failed to build gem native extension.
/opt/ruby-enterprise-1.8.7-2010.02/bin/ruby extconf.rb
checking for iconv.h... yes
checking for libxml/parser.h... yes
checking for libxslt/xslt.h... no
-----
libxslt is missing.
-----
*** extconf.rb failed ***
Could not create Makefile due to some reason, probably lack of
necessary libraries and/or headers.  Check the mkmf.log file for more
details.  You may need configuration options.
Provided configuration options:
--with-opt-dir
--without-opt-dir
--with-opt-include
--without-opt-include=${opt-dir}/include
--with-opt-lib
--without-opt-lib=${opt-dir}/lib
--with-make-prog
--without-make-prog
--srcdir=.
--curdir
--ruby=/opt/ruby-enterprise-1.8.7-2010.02/bin/ruby
--with-zlib-dir
--without-zlib-dir
--with-zlib-include

Type into your console:

sudo apt-get install libxslt-dev libxml2-dev

and run again:

bundle install

Copyright © 2017 Running with Ruby
