utf8 Archives - Closer to Code

If you get an error like this one:

Mysql2::Error: Incorrect string value: '\xF0\x9F\x99\x82'

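it means that you are using a version of MySQL with settings that do not support full Unicode (MySQL's utf8 character set stores at most three bytes per character, and the character above needs four). Changing that in Ruby on Rails should be fairly simple: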

class ChangeEncoding < ActiveRecord::Migration[5.0]
  def change
    config = Rails.configuration.database_configuration
    db_name = config[Rails.env]["database"]
    collate = 'utf8mb4_polish_ci'
    char_set = 'utf8mb4'

    # Change the defaults used for newly created tables
    execute("ALTER DATABASE `#{db_name}` CHARACTER SET #{char_set} COLLATE #{collate};")

    # Convert every existing table (and its text columns) to the new character set
    ActiveRecord::Base.connection.tables.each do |table|
      execute("ALTER TABLE `#{table}` CONVERT TO CHARACTER SET #{char_set} COLLATE #{collate};")
    end
  end
end

However, you might encounter the following error:

Specified key was too long; max key length is 767 bytes

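767 bytes is the index key prefix limit for InnoDB tables that use the COMPACT or REDUNDANT row format. With utf8mb4 every character can take up to 4 bytes, so an index on a VARCHAR(255) column may need 1020 bytes, while with the old 3-byte utf8 it needed at most 765.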
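The arithmetic behind both errors can be sketched in plain Ruby. This is only an illustration: the helper names are made up for this example, the 255-character column is an assumed typical VARCHAR(255), and the 3 and 4 bytes per character are the maximum sizes for MySQL's utf8 and utf8mb4 character sets.

```ruby
# Worst-case size in bytes of an index over char_count characters.
def max_index_bytes(char_count, bytes_per_char)
  char_count * bytes_per_char
end

# True when that worst case still fits under the key length limit
# (767 bytes is the limit quoted in the error message).
def fits_in_key?(char_count, bytes_per_char, key_limit = 767)
  max_index_bytes(char_count, bytes_per_char) <= key_limit
end

# The byte sequence from the first error is one character that needs 4 bytes.
emoji = "\xF0\x9F\x99\x82".dup.force_encoding('UTF-8')
puts emoji.length    # 1
puts emoji.bytesize  # 4

# An indexed VARCHAR(255) column (an assumed, typical length):
puts fits_in_key?(255, 3) # utf8:    765 bytes  -> true
puts fits_in_key?(255, 4) # utf8mb4: 1020 bytes -> false
puts 767 / 4              # 191, the longest utf8mb4 prefix that still fits
```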

To fix that, you need to:

Update your database.yml file
Upgrade to MySQL 5.7 or edit your my.cnf to enable innodb_large_prefix
Change row format to DYNAMIC
Alter the database and all the tables using Ruby on Rails migration

Database.yml

You need to change your encoding and collation for ActiveRecord:

production:
  encoding: utf8mb4
  collation: utf8mb4_polish_ci

MySQL my.cnf

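Edit my.cnf and add the following lines to the [mysqld] section (these options apply to MySQL 5.6 and 5.7; innodb_large_prefix and innodb_file_format were removed in MySQL 8.0):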

innodb_large_prefix=on
innodb_file_format=barracuda
innodb_file_per_table=true

DYNAMIC row format

If you don't do this, you will probably (if you have indexes on long string columns) still run into this limit:

767 bytes is the stated prefix limitation for InnoDB tables

To change the row format, you need to run the following query for each table:

ALTER TABLE table_name ROW_FORMAT=DYNAMIC;

However, you don't need to do this manually. Below you will find a Ruby on Rails migration that does everything required to get things going.

Complex migration to set everything as it should be in the MySQL database

class ChangeEncoding < ActiveRecord::Migration[5.0]
  def change
    config = Rails.configuration.database_configuration
    db_name = config[Rails.env]["database"]
    collate = 'utf8mb4_polish_ci'
    char_set = 'utf8mb4'
    row_format = 'DYNAMIC'

    # Change the defaults used for newly created tables
    execute("ALTER DATABASE `#{db_name}` CHARACTER SET #{char_set} COLLATE #{collate};")

    ActiveRecord::Base.connection.tables.each do |table|
      # Switch the row format first, so long index keys are allowed during the conversion
      execute("ALTER TABLE `#{table}` ROW_FORMAT=#{row_format};")
      execute("ALTER TABLE `#{table}` CONVERT TO CHARACTER SET #{char_set} COLLATE #{collate};")
    end
  end
end

After you run this migration, everything should work fine.
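If you would rather review the SQL before touching a production database, the statements can be built without executing them. The following is a minimal sketch, not part of the migration itself: quote_identifier and conversion_statements are hypothetical helpers, and the database and table names are placeholders.

```ruby
# Quote an identifier the way MySQL expects: wrap it in backticks and
# double any backtick inside the name.
def quote_identifier(name)
  "`#{name.to_s.gsub('`', '``')}`"
end

# Build (but do not run) the statements for the given database and tables.
def conversion_statements(db_name, tables, char_set: 'utf8mb4',
                          collate: 'utf8mb4_polish_ci', row_format: 'DYNAMIC')
  statements = [
    "ALTER DATABASE #{quote_identifier(db_name)} CHARACTER SET #{char_set} COLLATE #{collate};"
  ]

  tables.each do |table|
    quoted = quote_identifier(table)
    statements << "ALTER TABLE #{quoted} ROW_FORMAT=#{row_format};"
    statements << "ALTER TABLE #{quoted} CONVERT TO CHARACTER SET #{char_set} COLLATE #{collate};"
  end

  statements
end

# Hypothetical database and table names, for illustration only.
puts conversion_statements('my_app_production', %w[users posts])
```

In a Rails console the table list could come from ActiveRecord::Base.connection.tables, the same call the migration uses.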

# Quick test - just copy-paste this: ?%28t%B3odei%29 into your app's URL - if the app crashes, you should read the stuff below ;)

If you're here, then you've probably encountered this weird issue:

ArgumentError: invalid byte sequence in UTF-8

You might even have a backtrace like this:

rack-1.5.2/lib/rack/utils.rb:104→ normalize_params
rack-1.5.2/lib/rack/utils.rb:96→ block in parse_nested_query
rack-1.5.2/lib/rack/utils.rb:93→ each
rack-1.5.2/lib/rack/utils.rb:93→ parse_nested_query
rack-1.5.2/lib/rack/request.rb:373→ parse_query
actionpack-4.0.4/lib/action_dispatch/http/request.rb:321→ parse_query
rack-1.5.2/lib/rack/request.rb:188→ GET
actionpack-4.0.4/lib/action_dispatch/http/request.rb:274→ GET
actionpack-4.0.4/lib/action_dispatch/http/parameters.rb:16→ parameters
actionpack-4.0.4/lib/action_dispatch/http/filter_parameters.rb:37→ filtered_parameters
...
activesupport-4.0.4/lib/active_support/cache/strategy/local_cache.rb:83→ call
rack-1.5.2/lib/rack/sendfile.rb:112→ call
railties-4.0.4/lib/rails/engine.rb:511→ call
railties-4.0.4/lib/rails/application.rb:97→ call
railties-4.0.4/lib/rails/railtie/configurable.rb:30→ method_missing
puma-2.7.1/lib/puma/configuration.rb:68→ call
puma-2.7.1/lib/puma/server.rb:486→ handle_request
puma-2.7.1/lib/puma/server.rb:357→ process_client
puma-2.7.1/lib/puma/server.rb:250→ block in run
puma-2.7.1/lib/puma/thread_pool.rb:92→ call
puma-2.7.1/lib/puma/thread_pool.rb:92→ block in spawn_thread

First of all, this issue is not super important, and it's not a security issue either. It's just an invalid byte sequence in the request URL. Either way, it would be good to fix it, even if only to get rid of it in our bug tracker.

But before we do anything about it, how can we determine that a URL contains invalid UTF-8? We can percent-decode it with URI.decode and check the encoding of the result:

# Note: URI.decode was removed in Ruby 3.0; on newer Rubies use a replacement
# such as URI.decode_www_form_component or CGI.unescape.
require 'uri'

# With an invalid byte sequence
url = 'http://senpuu.net/?techniki,Sawarabi_no_Mai_%28taniec_m%B3odej_paproci%29'
URI.decode(url).force_encoding('UTF-8').valid_encoding? #=> false

# and with a valid one
url = 'http://www.senpuu.net/aktualnosci'
URI.decode(url).force_encoding('UTF-8').valid_encoding? #=> true
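On Ruby 3.0 and later, where URI.decode is gone, the same check can be sketched with URI.decode_www_form_component from the standard library. This is a swapped-in technique rather than the original one: valid_utf8_url? is a hypothetical helper name, the method also turns "+" into a space (harmless for validation), and treating malformed percent-escapes as invalid is an assumption of this sketch.

```ruby
require 'uri'

# Percent-decode the string and check that the resulting bytes are valid UTF-8.
def valid_utf8_url?(url)
  # .b gives a binary copy, so the decoder never trips over raw invalid bytes.
  decoded = URI.decode_www_form_component(url.to_s.b, Encoding::UTF_8)
  decoded.valid_encoding?
rescue ArgumentError
  # Raised for malformed percent-escapes such as "%zz".
  false
end

puts valid_utf8_url?('http://senpuu.net/?techniki,Sawarabi_no_Mai_%28taniec_m%B3odej_paproci%29') # false
puts valid_utf8_url?('http://www.senpuu.net/aktualnosci')                                         # true
```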

So, how can we handle this? Well, we need to catch it in middleware before anything else tries to process it. I think that in such cases we should just return a 400 (Bad Request) response, since this is not something that we expect. Middleware like this can be really simple:

class Utf8Sanitizer
  SANITIZE_ENV_KEYS = %w(
    HTTP_REFERER
    PATH_INFO
    REQUEST_URI
    REQUEST_PATH
    QUERY_STRING
  )

  def initialize(app)
    @app = app
  end

  def call(env)
    SANITIZE_ENV_KEYS.each do |key|
      string = env[key].to_s
      valid = URI.decode(string).force_encoding('UTF-8').valid_encoding?
      # Don't accept requests with invalid byte sequence
      return [ 400, { }, [ 'Bad request' ] ] unless valid
    end

    @app.call(env)
  end
end

After that, just put this into your config/application.rb:

  config.middleware.use Utf8Sanitizer

and you're immune to this issue.
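To convince yourself that such a middleware behaves as intended, it can be exercised without booting Rails, because a Rack app is just an object that responds to call(env). The sketch below is a hedged variant for Ruby 3.0+, not the original class: Utf8SanitizerSketch is a hypothetical name, it uses URI.decode_www_form_component in place of URI.decode, and it also answers 400 for malformed percent-escapes.

```ruby
require 'uri'

class Utf8SanitizerSketch
  SANITIZE_ENV_KEYS = %w[HTTP_REFERER PATH_INFO REQUEST_URI REQUEST_PATH QUERY_STRING].freeze

  def initialize(app)
    @app = app
  end

  def call(env)
    # Don't pass requests with invalid byte sequences down the stack
    return [400, { 'Content-Type' => 'text/plain' }, ['Bad request']] unless clean?(env)

    @app.call(env)
  end

  private

  # True when every inspected env value decodes to valid UTF-8.
  def clean?(env)
    SANITIZE_ENV_KEYS.all? do |key|
      URI.decode_www_form_component(env[key].to_s.b, Encoding::UTF_8).valid_encoding?
    end
  rescue ArgumentError
    # Malformed percent-escapes (for example "%zz") are rejected as well
    false
  end
end

# A stand-in Rack app that always answers 200.
app = ->(_env) { [200, { 'Content-Type' => 'text/plain' }, ['OK']] }
stack = Utf8SanitizerSketch.new(app)

puts stack.call({ 'QUERY_STRING' => 'q=%C5%82' }).first       # 200
puts stack.call({ 'QUERY_STRING' => '%28t%B3odei%29' }).first # 400
```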

## Update

It seems that there's a gem called utf8-cleaner that sanitizes non-UTF-8 strings. It has one issue: instead of raising a 400 error, it just removes invalid bytes. Still, it's way better than nothing. If you just want to get rid of this problem, put this into your Gemfile:

gem 'utf8-cleaner'
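To illustrate what "removing invalid bytes" means, here is a minimal sketch built on Ruby's own String#scrub (available since Ruby 2.1). It is not utf8-cleaner's actual implementation, and strip_invalid_utf8 is a hypothetical helper name.

```ruby
# Drop bytes that are not valid UTF-8 instead of rejecting the whole request.
def strip_invalid_utf8(value)
  value.to_s.dup.force_encoding(Encoding::UTF_8).scrub('')
end

puts strip_invalid_utf8("m\xB3odej") # modej
```

The trade-off is visible in the output: the request goes through, but the stray byte is silently lost.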

## Update 2
I got a response from the Rack guys, and it seems that it's more of a Rails issue than a Rack one. Raggi stated that:

It is a web servers responsibility to translate IO to valid binary representations for the application layer. This isn't the whole picture though, in this case, the webserver has done that - the webserver does not know the encoding of the URI...

It is the responsibility of the IETF to define the validity of URI data in various encodings (not done), and so it is not entirely valid for web servers to make no assumptions for this field for the above...

Rack itself uses a binary regular expression here, which expects binary input strings. This is our response to the above subtleties. In normal operation (say, Webrick + Rack), this error is not raised...

The reason that this error is raised in your application is:

You have middleware in your stack that is forcing this string to UTF-8, even when it is not valid UTF-8. The code that is doing this is bugged.

Observe:

s = "a=\xff"
# => "a=\xFF"
s.force_encoding("binary")
# => "a=\xFF"
s.valid_encoding?
# => true
Rack::Utils.parse_nested_query(s)
# => {"a"=>"\xFF"}
s.force_encoding("utf-8")
# => "a=\xFF"
s.valid_encoding?
# => false
Rack::Utils.parse_nested_query(s)
ArgumentError: invalid byte sequence in UTF-8
        from /usr/local/google/home/raggi/.rbenv/versions/2.0.0-p247/lib/ruby/gems/2.0.0/gems/rack-1.5.2/lib/rack/utils.rb:93:in `split'
        from /usr/local/google/home/raggi/.rbenv/versions/2.0.0-p247/lib/ruby/gems/2.0.0/gems/rack-1.5.2/lib/rack/utils.rb:93:in `parse_nested_query'
        from (irb):21
        from /usr/local/google/home/raggi/.rbenv/versions/2.0.0-p247/bin/irb:12:in `<main>'

This is a rails bug. Calls to force_encoding should always assert that their output is valid.
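The last point, that a call to force_encoding should assert its output is valid, could look roughly like this. It is only a sketch of the idea and not code from Rails or Rack: checked_utf8 is a hypothetical helper name.

```ruby
# Tag a copy of the string as UTF-8 and fail loudly if its bytes
# do not form valid UTF-8, instead of passing a broken string along.
def checked_utf8(string)
  tagged = string.dup.force_encoding(Encoding::UTF_8)
  raise ArgumentError, 'invalid byte sequence in UTF-8' unless tagged.valid_encoding?

  tagged
end

puts checked_utf8('a=b'.b).encoding # UTF-8

begin
  checked_utf8("a=\xFF".b)
rescue ArgumentError => e
  puts e.message # invalid byte sequence in UTF-8
end
```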
