Today I Learned

6 posts about #elasticsearch

Elasticsearch and Catastrophic Backtracking threats

“The Regular expression Denial of Service (ReDoS) is a Denial of Service attack, that exploits the fact that most Regular Expression implementations may reach extreme situations that cause them to work very slowly (exponentially related to input size).”

The above assumes that there’s some kind of user with malicious intent, who submits an evil regexp to the backend, stalling the web server. But that assumption imposes another one which precedes it: there must be a sloppy programmer who let that happen.
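To see what we’re dealing with, here’s a minimal sketch (not from the original post) of such a pathological pattern in action. The runtime roughly doubles with every extra character, because the nested quantifier forces the engine to try exponentially many ways of splitting the input:

require "benchmark"

# Classic catastrophic-backtracking shape: a quantified group containing
# another quantifier, followed by something that can never match.
evil = /\A(a+)+$/

(20..26).each do |n|
  input = "a" * n + "!" # the trailing "!" guarantees a failed match
  time = Benchmark.realtime { evil.match?(input) }
  puts format("n=%2d %6.2fs", n, time)
end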

In one of my previous TILs I shared an example approach for building up regexps for ES querying and filtering based on user input. We had a validation in place which ensured that the regexp we ended up with would be a valid one. But what about catastrophic backtracking? Can you do something about it?

I did some research and it turns out that Elasticsearch itself has no native mechanism implemented which would put us on the safe side of the story. If anyone is curious and wants to know why, here is a chain of links to follow: https://github.com/elastic/elasticsearch/issues/17934 -> https://issues.apache.org/jira/browse/LUCENE-7256 -> http://mail.openjdk.java.net/pipermail/core-libs-dev/2016-March/039269.html

So Elasticsearch gives us no safety net. What about Ruby? Can it bring something to the table? I’ll use a quote here:

“Ruby uses the “Non-recursive Backtracking Implementation” discussed in the second article, which means that it does exhibit the same exponentially slow performance as Perl does for pathological regex expressions. In other words, this means that Ruby is NOT using the most optimal regex algorithm available, Thompson NFA, that Russ described in the first article.”

Hence you should not expect to be able to validate a regexp (using Ruby) before passing it into the Elasticsearch engine.

But there is a thing you can actually do when looking at my previous example. We defined the special chars as follows:

ESCAPE_ES_CHARS = %w(# @ & < > ~ \\\\ \.).freeze

and escaped them in the following way:

str.gsub(Regexp.new(ESCAPE_ES_CHARS.join('|'))) { |match| '\\' + match }

What we can actually do instead is use the built-in Ruby method Regexp.escape: https://ruby-doc.org/core-3.0.0/Regexp.html#method-c-escape
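For example:

Regexp.escape("192.168.0.1") # => "192\\.168\\.0\\.1"

It escapes every character that is special to the Ruby regexp engine in a single call. One caveat (my observation, not from the docs above): Lucene-only operators from the ESCAPE_ES_CHARS list, like & or <, are not special to Ruby and pass through unescaped.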

Also, as Ruby engineers we can always roll up our sleeves and come up with some star-height analyzer, basing it on https://github.com/substack/safe-regex or other implementations which can be found on the Internet.
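For a taste of what such an analyzer could look like, here’s a naive sketch (the helper name is mine, and it’s nowhere near as thorough as safe-regex): it flags a quantified group that itself contains a quantifier, i.e. the nested-repetition shape behind most catastrophic backtracking:

# Naive heuristic: a group containing a quantifier, itself quantified.
NESTED_QUANTIFIER = /\([^()]*[*+][^()]*\)[*+]/

def probably_safe?(pattern)
  !pattern.match?(NESTED_QUANTIFIER)
end

probably_safe?("(a+)+") # => false
probably_safe?("a+b*")  # => true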

Working with ElasticSearch and Rails; part 4

Imagine you find yourself dealing with the following case: you want to allow users to search for employees with a given name. Easy. You quickly assemble a query similar to the one below:

GET company_employee/employee/_search
{
   "query": {
      "bool": {
         "must": [
            {
                "match_phrase": {
                   "name": "Charles"
                }
            }
         ]
      }
   }
}

Where “Charles” is the user input. However, you quickly realize (or your client helps you realize ;) ) that you actually need to retrieve all the Charleses, even if the user types ChaRleS or charles or CHarles into the form.

Assuming that changing the index config is not an option, what you can do is change the query and go with the regexp approach. The caveat here is of course that the regexp query doesn’t allow for case-insensitive searching, but you can always make it do so “manually”. Here’s how:

ESCAPE_ES_CHARS = %w(# @ & < > ~ \\\\ \.).freeze

def filter
  {
    bool: {
      must:
        query_strings.map do |query_string|
          {
            bool: {
              should: fields.map do |field|
                {
                  regexp: {
                    field => { value: ".*(#{query_string}).*" }
                  }
                }
              end
            }
          }
        end
    }
  }
end

def query_strings
  @query_strings ||= q.split.map do |keyword|
    qs = keyword.split("").map(&:downcase).map { |char| "[#{[char.upcase, char].uniq.join}]" }.join
    escape_regexp_string(qs)
  end
end

def escape_regexp_string(str)
  str.gsub(Regexp.new(ESCAPE_ES_CHARS.join('|'))) do |match|
    '\\' + match
  end
end

Where q is the user input and fields denotes the collection of fields we would like to match against. My example code is from a slightly more advanced case, however the idea is exactly the same: programmatically create regexps from the user input, so that our Charles ends up mangled like the one below:

GET company_employee/employee/_search
{
   "query": {
      "regexp": { "name": "[Cc][Hh][Aa][Rr][Ll][Ee][Ss]" }
    }
}
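For reference, the character-class pipeline inside query_strings can be checked in isolation in an IRB session:

"Charles".split("").map(&:downcase)
         .map { |char| "[#{[char.upcase, char].uniq.join}]" }
         .join
# => "[Cc][Hh][Aa][Rr][Ll][Ee][Ss]"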

Working with ElasticSearch and Rails; part 3

“You can pass any object which implements a to_hash method, which is called automatically, so you can use a custom class or your favourite JSON builder to build the search definition”

The aforementioned sentence lies somewhere in the middle of the lengthy Readme file for the elasticsearch-model gem and can be easily overlooked, however it lets you create abstractions for some standard elements of the user interface.

Imagine you have a listing displaying records which can be filtered by the user. There are different types of filters depending on whether the underlying field is a date, string, boolean and so on. The user interacts with the filters and a request is fired to the backend.

Then on the backend, you translate the payload, matching the params contents against the info specified in your META for a given search class, generating an array of filtering directives like the one below:

module Search
  module Filters
    class Range
      def initialize(name:, min:, max:)
        raise ::Search::Error::RangeFilterMissing.new(filter: name) if min.blank? && max.blank?

        @name = name
        @min  = min
        @max  = max
      end

      def to_hash
        {}.tap do |range|
          range[:range] = Hash[name, {}]
          range[:range][name][:gte] = min if min.present?
          range[:range][name][:lte] = max if max.present?
        end
      end

      private

      attr_accessor :name, :min, :max
    end
  end
end
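A quick sanity check of such a filter in isolation (salary is just an example field name):

Search::Filters::Range.new(name: :salary, min: 1_000, max: nil).to_hash
# => { range: { salary: { gte: 1000 } } }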

Such filter objects can then easily be passed as part of the input for Model.search:

def search_query
  { query: query, sort: sort, aggs: aggregations }
end

As long as we organize it with the following interface:

def query
  filter.presence || { match_all: {} }
end

def filter
  return if filter_options.blank?

  filter = { bool: { filter: [] } }
  filter_options.each do |f|
    filter[:bool][:filter].push(f)
  end

  filter
end

where filter_options contains already-built instances of the various filter class abstractions.
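For illustration, with a couple of hypothetical filters in filter_options (a range and a term filter, say), the generated query part boils down to:

{
  bool: {
    filter: [
      { range: { salary: { gte: 1000 } } },
      { term:  { active: true } }
    ]
  }
}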

Working with ElasticSearch and Rails; part 2

A typical setup for a Ruby on Rails app and Elasticsearch is built around the elasticsearch-model, elasticsearch-rails, elasticsearch-persistence and elasticsearch-dsl bundle.

One of the facilities of the aforementioned bundle is that it “automagically” refreshes the appropriate indexes in an atomic way, mainly thanks to these callbacks you set on the related model:

  after_commit on: [:create] do
    __elasticsearch__.index_document
  end

  after_commit on: [:update] do
    __elasticsearch__.update_document
  end

  after_commit on: [:destroy] do
    __elasticsearch__.delete_document
  end

Thing is, as development continues, it’s easy to end up with some “composite” indexed values, where several database fields are taken into consideration when calculating the final value. Example:

def as_indexed_json(_options = {})
  {
    branch: branch_names
  }
end

def branch_names
  Array.new.tap do |ary|
    ary << "cleaning" if provides_cleaning_services.present?
    ary << "accommodation" if provides_accommodation_services.present?
    ary << "food_service" if has_alcohol_license.present?
  end.presence || ["no_branch"]
end

Sadly, updating any of the database fields impacting the branch_names value won’t trigger a refresh of the indexed field: update_document sends a partial update based on the record’s changed attributes, and a derived value like branch_names is not among them. Instead of waiting for a client to file a bug report (and making them unhappy in general), consider overriding the update callback with something along these lines:

after_commit on: [:update] do
  if self.previous_changes.present?
    __elasticsearch__.index_document
  end
end
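If reindexing on every change feels too heavy-handed, a stricter variant (a sketch, reusing the field names from the example above) reindexes only when one of the source fields actually changed:

BRANCH_SOURCE_FIELDS = %w(
  provides_cleaning_services
  provides_accommodation_services
  has_alcohol_license
).freeze

after_commit on: [:update] do
  # previous_changes keys are attribute names as strings
  if (previous_changes.keys & BRANCH_SOURCE_FIELDS).any?
    __elasticsearch__.index_document
  end
end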

Working with ElasticSearch and Rails; part 1

A typical setup for a Ruby on Rails app and Elasticsearch is built around the elasticsearch-model, elasticsearch-rails, elasticsearch-persistence and elasticsearch-dsl bundle.

The above setup integrates well with ActiveRecord, adding a convenient way to define indexes based on AR models, as well as providing a neat way to create and refresh them in one go, in an atomic way.

Typically you achieve so by iterating over your models and calling the following on each of them:

model_class.__elasticsearch__.create_index!(force: true)
model_class.__elasticsearch__.refresh_index!
model_class.import

The thing I learnt the hard way is that the default scope (used by gems like acts_as_paranoid, for instance) is not respected by the elasticsearch gem. It is also very easy to run into the N+1 queries problem if your field mappings go through your associations.

A remedy for this is to pass the scope to the import method call, like this:

model_class.import scope: 'for_es_import'

I typically define such a scope for each indexed model, so all the includes and default scopes are satisfied and I can easily iterate over a set of multiple indexed AR models.
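Such a scope might look like this (the model and its associations are just an illustration):

class Company < ApplicationRecord
  # Eager-load everything the index mappings touch,
  # so the import doesn't fire N+1 queries.
  scope :for_es_import, -> { includes(:branches, :employees) }
end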

Finally, if you need to work with multiple indexes based on the same model, you’ll achieve that much more easily using the Chewy gem. From my experience it’s also a much easier ES wrapper gem to use outside of the Rails environment (e.g. when you have to use it in a Sinatra app).

How to fix Elasticsearch 'FORBIDDEN/12/index read-only'

By default, Elasticsearch installed with Homebrew on macOS goes into read-only mode when you have less than 5% of free disk space. If you see errors similar to this:

Elasticsearch::Transport::Transport::Errors::Forbidden:
  [403] {"error":{"root_cause":[{"type":"cluster_block_exception","reason":"blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];"}],"type":"cluster_block_exception","reason":"blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];"},"status":403}

Or in /usr/local/var/log/elasticsearch.log you can see logs similar to:

flood stage disk watermark [95%] exceeded on [nCxquc7PTxKvs6hLkfonvg][nCxquc7][/usr/local/var/lib/elasticsearch/nodes/0] free: 15.3gb[4.1%], all indices on this node will be marked read-only

Then you can fix it by running the following commands:

curl -XPUT -H "Content-Type: application/json" http://localhost:9200/_cluster/settings -d '{ "transient": { "cluster.routing.allocation.disk.threshold_enabled": false } }'
curl -XPUT -H "Content-Type: application/json" http://localhost:9200/_all/_settings -d '{"index.blocks.read_only_allow_delete": null}'
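Once you’ve freed up some disk space, it’s worth reverting the first command so the watermark protection is active again:

curl -XPUT -H "Content-Type: application/json" http://localhost:9200/_cluster/settings -d '{ "transient": { "cluster.routing.allocation.disk.threshold_enabled": true } }'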