# Quick test - just append this: ?%28t%B3odei%29 to your app URL - if the app crashes, you should read the stuff below ;)
If you're here, then you've probably encountered this weird issue:
ArgumentError: invalid byte sequence in UTF-8
You might even have a backtrace like this:

    rack-1.5.2/lib/rack/utils.rb:104→ normalize_params
    rack-1.5.2/lib/rack/utils.rb:96→ block in parse_nested_query
    rack-1.5.2/lib/rack/utils.rb:93→ each
    rack-1.5.2/lib/rack/utils.rb:93→ parse_nested_query
    rack-1.5.2/lib/rack/request.rb:373→ parse_query
    actionpack-4.0.4/lib/action_dispatch/http/request.rb:321→ parse_query
    rack-1.5.2/lib/rack/request.rb:188→ GET
    actionpack-4.0.4/lib/action_dispatch/http/request.rb:274→ GET
    actionpack-4.0.4/lib/action_dispatch/http/parameters.rb:16→ parameters
    actionpack-4.0.4/lib/action_dispatch/http/filter_parameters.rb:37→ filtered_parameters
    ...
    activesupport-4.0.4/lib/active_support/cache/strategy/local_cache.rb:83→ call
    rack-1.5.2/lib/rack/sendfile.rb:112→ call
    railties-4.0.4/lib/rails/engine.rb:511→ call
    railties-4.0.4/lib/rails/application.rb:97→ call
    railties-4.0.4/lib/rails/railtie/configurable.rb:30→ method_missing
    puma-2.7.1/lib/puma/configuration.rb:68→ call
    puma-2.7.1/lib/puma/server.rb:486→ handle_request
    puma-2.7.1/lib/puma/server.rb:357→ process_client
    puma-2.7.1/lib/puma/server.rb:250→ block in run
    puma-2.7.1/lib/puma/thread_pool.rb:92→ call
    puma-2.7.1/lib/puma/thread_pool.rb:92→ block in spawn_thread

First of all, this issue is not super important, and it's not a security issue either. It's just an invalid byte sequence in your request URL. Still, it's worth fixing, if only to get it out of our bug tracker.
But before we do anything about it, how can we tell that a URL contains invalid UTF-8? We can percent-decode it with URI.decode and check the encoding of the result:

    require 'uri'

    # With an invalid byte sequence
    url = 'http://senpuu.net/?techniki,Sawarabi_no_Mai_%28taniec_m%B3odej_paproci%29'
    URI.decode(url).force_encoding('UTF-8').valid_encoding? #=> false

    # and with a valid one
    url = 'http://www.senpuu.net/aktualnosci'
    URI.decode(url).force_encoding('UTF-8').valid_encoding? #=> true

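On Ruby 3.0 and later `URI.decode` no longer exists, so the check above needs a different decoder. Here is a minimal sketch that percent-decodes by hand on raw bytes. The method name `valid_utf8_after_decoding?` is made up for illustration and is not part of any library:

```ruby
# Hypothetical helper: percent-decode on raw bytes, then ask whether the
# resulting byte sequence is valid UTF-8.
def valid_utf8_after_decoding?(string)
  # String#b gives a binary copy, so decoded bytes can be spliced in freely.
  decoded = string.b.gsub(/%\h{2}/) { |match| match[1, 2].hex.chr }
  decoded.force_encoding(Encoding::UTF_8).valid_encoding?
end

puts valid_utf8_after_decoding?('?%28t%B3odei%29')  # false, 0xB3 cannot stand alone in UTF-8
puts valid_utf8_after_decoding?('/aktualnosci')     # true
```

Unlike `URI.decode_www_form_component`, this version leaves `+` alone and does not raise on a stray `%`, which seems closer to what the original check did.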
So, how can we handle this? We need to catch it in middleware before anything else tries to process the request. I think that in such cases we should just respond with a 400 (Bad Request), since this is not input we expect. Middleware like this can be really simple:

    class Utf8Sanitizer
      SANITIZE_ENV_KEYS = %w(
        HTTP_REFERER
        PATH_INFO
        REQUEST_URI
        REQUEST_PATH
        QUERY_STRING
      )

      def initialize(app)
        @app = app
      end

      def call(env)
        SANITIZE_ENV_KEYS.each do |key|
          string = env[key].to_s
          valid = URI.decode(string).force_encoding('UTF-8').valid_encoding?
          # Don't accept requests with invalid byte sequence
          return [ 400, { }, [ 'Bad request' ] ] unless valid
        end

        @app.call(env)
      end
    end

After that, just put this into your config/application.rb:

    config.middleware.use Utf8Sanitizer

and you're resistant to this issue.
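To check the behaviour without booting the whole app, the middleware can be called directly with a hand-built env hash. The sketch below is a variant that avoids `URI.decode` so that it also runs on Ruby 3.x. The class name `Utf8SanitizerSketch`, the hand-rolled decoder and the stand-in app are all assumptions made for the example:

```ruby
# Illustrative variant of the middleware; not the exact class shown above.
class Utf8SanitizerSketch
  SANITIZE_ENV_KEYS = %w(HTTP_REFERER PATH_INFO REQUEST_URI REQUEST_PATH QUERY_STRING).freeze

  def initialize(app)
    @app = app
  end

  def call(env)
    SANITIZE_ENV_KEYS.each do |key|
      # Reject the request as soon as one of the keys fails the check.
      return [400, {}, ['Bad request']] unless valid_utf8?(env[key].to_s)
    end

    @app.call(env)
  end

  private

  # Percent-decode on raw bytes, then test whether they form valid UTF-8.
  def valid_utf8?(string)
    decoded = string.b.gsub(/%\h{2}/) { |match| match[1, 2].hex.chr }
    decoded.force_encoding(Encoding::UTF_8).valid_encoding?
  end
end

# Stand-in for the real application: anything that responds to call(env).
downstream = ->(_env) { [200, {}, ['OK']] }
stack = Utf8SanitizerSketch.new(downstream)

puts stack.call({ 'QUERY_STRING' => '%28t%B3odei%29' }).first                        # 400
puts stack.call({ 'PATH_INFO' => '/aktualnosci', 'QUERY_STRING' => 'page=2' }).first # 200
```

One thing worth checking: `config.middleware.use` appends to the end of the stack. If something earlier in the stack already parses the query string, `config.middleware.insert_before 0, Utf8Sanitizer` should place the check first. Running `rake middleware` shows the resulting order.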
## Update
It seems that there's a gem called utf8-cleaner that sanitizes non-UTF-8 strings. It has one issue: instead of raising a 400 error it just removes the invalid bytes, but still, it's way better than nothing. If you just want to get rid of this problem, put this into your Gemfile:
gem 'utf8-cleaner'
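To make the difference between "remove the invalid bytes" and "reject the request" concrete, here is a small illustration using Ruby's own `String#scrub` (available since Ruby 2.1). This is not the gem's implementation, only a sketch of the stripping strategy, and `strip_invalid_utf8` is a made-up name:

```ruby
# Hypothetical helper: tag the bytes as UTF-8 and drop whatever is not valid.
def strip_invalid_utf8(bytes)
  bytes.b.force_encoding(Encoding::UTF_8).scrub('')
end

raw = "m\xB3odej"                # 0xB3 is a lone continuation byte in UTF-8
cleaned = strip_invalid_utf8(raw)

puts cleaned                     # modej
puts cleaned.valid_encoding?     # true
```

The request goes through, but the parameter silently changes, which is why a 400 may be the more honest answer.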
## Update 2
I've got a response from the Rack guys, and it seems that this is more of a Rails issue than a Rack one:
Raggi stated here that:
It is a web servers responsibility to translate IO to valid binary representations for the application layer. This isn't the whole picture though, in this case, the webserver has done that - the webserver does not know the encoding of the URI...
It is the responsibility of the IETF to define the validity of URI data in various encodings (not done), and so it is not entirely valid for web servers to make no assumptions for this field for the above...
Rack itself uses a binary regular expression here, which expects binary input strings. This is our response to the above subtleties. In normal operation (say, Webrick + Rack), this error is not raised...
The reason that this error is raised in your application is:
You have middleware in your stack that is forcing this string to UTF-8, even when it is not valid UTF-8. The code that is doing this is bugged.
Observe:

    s = "a=\xff" # => "a=\xFF"
    s.force_encoding("binary") # => "a=\xFF"
    s.valid_encoding? # => true
    Rack::Utils.parse_nested_query(s) # => {"a"=>"\xFF"}
    s.force_encoding("utf-8") # => "a=\xFF"
    s.valid_encoding? # => false
    Rack::Utils.parse_nested_query(s)
    ArgumentError: invalid byte sequence in UTF-8
      from /usr/local/google/home/raggi/.rbenv/versions/2.0.0-p247/lib/ruby/gems/2.0.0/gems/rack-1.5.2/lib/rack/utils.rb:93:in `split'
      from /usr/local/google/home/raggi/.rbenv/versions/2.0.0-p247/lib/ruby/gems/2.0.0/gems/rack-1.5.2/lib/rack/utils.rb:93:in `parse_nested_query'
      from (irb):21
      from /usr/local/google/home/raggi/.rbenv/versions/2.0.0-p247/bin/irb:12:in `<main>'

This is a rails bug. Calls to force_encoding should always assert that their output is valid.
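The last point, that a call to `force_encoding` should assert its output is valid, can be sketched as a small guard. This is only an illustration of the idea, not code from Rails or Rack, and the name `checked_utf8` is hypothetical:

```ruby
# Hypothetical guard: tag bytes as UTF-8 only when they really are valid
# UTF-8, and fail loudly at the boundary otherwise.
def checked_utf8(bytes)
  candidate = bytes.dup.force_encoding(Encoding::UTF_8)
  unless candidate.valid_encoding?
    raise ArgumentError, "not valid UTF-8: #{bytes.b.inspect}"
  end

  candidate
end

puts checked_utf8("a=\xC5\x82".b).encoding  # UTF-8

begin
  checked_utf8("a=\xff".b)
rescue ArgumentError => e
  puts e.message                            # not valid UTF-8: "a=\xFF"
end
```

With a guard like this, the error surfaces where the string is relabelled instead of later, deep inside the query parser.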