Robinhood provides indexing, search, and management capabilities to computing centres with massive high-performance filesystems, namely Lustre.

It does this by scanning every file on the filesystem as efficiently as possible to initially populate its MySQL/MariaDB relational database, then continuously listening for changelogs coming from Lustre to keep its state in line with reality.

Looking into the type of operations the Robinhood client tools typically perform, and reading some recent presentations and papers, it seems that MySQL/MariaDB was settled on as a well-known, good general database engine and not necessarily for any specific qualities that make it exceptional for this use case.

I have worked with NoSQL quite a bit over the last year and it seems the Robinhood schema could fit quite well. For example, without looking into the code: the rbh-find user tool likely performs a full table scan for matching filenames, and rbh-du will do the same while matching the base path. This behaviour is more akin to a search engine than a key-value lookup.
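For illustration only (I have not checked the Robinhood source, so the table and column names here are assumptions based on the schema described under Further work), a filename match of the kind rbh-find performs would reduce to a wildcard scan along these lines:

-- Hypothetical rbh-find-style lookup; the leading wildcard cannot
-- use a B-tree index, so MySQL must scan every row in the table.
SELECT * FROM NAMES WHERE name LIKE '%myfile%';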

Without expounding the pros of both DB engine types, I understand that NoSQL has advantages with “search engine” style queries, with significant software projects dedicated to the workload: one being Apache Lucene, while still having options to experiment further by trying MongoDB, CouchDB or even BlazingDB.

Experiment

The easiest and quickest way to investigate whether the scale and typical requests would fit into NoSQL is the following:

Lustre → Robinhood → SQL → NoSQL ← Query script

Feeding NoSQL from the existing MySQL data is easier than retrofitting Robinhood to speak to NoSQL, and doing so provides a simple query performance benchmark.

I settled on using Elasticsearch as my NoSQL database as it is the easiest to get started with and I have prior experience with the system. My parsing and ingestion method is to use Logstash with the JDBC input plugin. This setup can be configured to connect to MySQL and SELECT all the entries in the ENTRIES table; it does this in configurable increments to avoid long-running queries. The full configuration is given in the Source section below.

Results

Using a copy of one of our production Robinhood databases, our high-performance server was able to achieve:

  • Ingesting 1 million rows per query from a single 54GiB MySQL instance serving Robinhood
  • Resulting in 62,226,322 Elasticsearch documents over 53 minutes!
  • Each MySQL query took longer the more records it had to offset, i.e. the further through the table it got, taking a maximum of 31 seconds to return (see the sketch after this list)
  • Elasticsearch & Logstash’s load wasn’t particularly high while ingesting

In my first attempt I set the per-query amount to 10,000, hoping to avoid the process stalling, but this made the ingest very slow; I aborted after 9 hours to try a higher value!

Something of note: Logstash is unable to reload its configuration or shut down while the JDBC plugin is ingesting data, as the plugin blocks these operations. I had to kill Logstash in order to abort.
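For reference, on a systemd-managed host that came down to something like the following (assuming the stock logstash service unit):

# Force-stop Logstash; a plain `systemctl stop` hangs while the
# JDBC input is mid-ingest, so send SIGKILL instead.
systemctl kill --signal=SIGKILL logstash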

Query Benchmark

As hoped, Elasticsearch queries like uid:xyz return within a few seconds while matching millions of records. As for dumping the matched records in full, Elasticsearch took 1.31 seconds to return 10,000 records.
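As a concrete example, that query can be issued as a Lucene-style query string against the rbh index (ELASTICBOX and the uid value being placeholders):

# Return the first 10000 documents whose uid field matches 'root'.
curl 'http://ELASTICBOX:9200/rbh/_search?q=uid:root&size=10000'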

A MySQL dump with the same query (SELECT * FROM ENTRIES WHERE uid = 'root' LIMIT 10000) took 10.092 seconds.

These numbers are not entirely fair: the uid column in MySQL is not indexed, and it is left unindexed deliberately to keep ingest into the ENTRIES table fast. Still, I believe this shows a benefit of using Elasticsearch, as no additional configuration was needed to query against the uid field.
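For a fairer comparison the column could be indexed on the MySQL side, at the cost of extra work on every insert during scans:

-- Add a secondary index on uid; owner queries become fast, but every
-- insert/update into ENTRIES now also has to maintain the index.
CREATE INDEX idx_entries_uid ON ENTRIES (uid);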

Further work

More work is needed to make this data useful: Robinhood stores data across multiple tables and I have only ingested one. ANNEX_INFO seems to store the lookup table matching filepaths (link column) with their Robinhood/Lustre ID. NAMES seems to serve a similar purpose but splits every file into its own row and represents paths via a chain of foreign keys; this means there are duplicate filenames in the table, but each of their rows has a different parent_id as they belong to different parent directories. Lastly, ACCT_STAT contains file statistics generated by Robinhood; this is the data people are most interested in, so it will need to be merged into the Elasticsearch documents.
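As a starting point, the per-file names could be joined in before ingest; a minimal sketch, assuming NAMES and ENTRIES share the same id column:

-- Attach each entry's filename and parent directory ID so every
-- Elasticsearch document carries its own name; full paths would
-- still need resolving by walking parent_id.
SELECT ENTRIES.*, NAMES.name, NAMES.parent_id
FROM ENTRIES JOIN NAMES ON NAMES.id = ENTRIES.id;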

Later it is worth investigating whether multiple Robinhood instances can be merged into one Elasticsearch index. This would allow for whole-estate searches and metrics, but it assumes file IDs are unique to the file's state and not incremental, although this can be worked around by adding another field stating the source filesystem.
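On the Logstash side that extra field is a one-line mutate filter; a sketch, with "lustre01" standing in for the real filesystem name:

filter {
  if "robinhood" in [tags] {
    # Tag each document with its source filesystem so several
    # Robinhood instances can safely share one index.
    mutate { add_field => { "filesystem" => "lustre01" } }
  }
}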

Source

/etc/logstash/elasticsearch.robinhood.json:

{
  "template": "rbh",
  "version": 2,
  "settings": {
    "index.refresh_interval": "5s"
  },  
  "mappings": {
    "_default_": {
      "properties": {
        "@timestamp": { "type": "date" },
        "@version": { "type": "keyword" },

        "class_update": { "type": "date" },
        "creation_time": { "type": "date" },
        "last_access": { "type": "date" },
        "last_mdchange": { "type": "date" },
        "last_mod": { "type": "date" },
        "md_update": { "type": "date" },
        "size": { "type": "bytes" }
      }   
    }   
  }
}

Excerpt: /etc/logstash/conf.d/inputs.conf:

  jdbc {
    jdbc_driver_library => "/usr/share/java/mysql-connector-java.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://HOST:3306/robinhood"
    jdbc_user => "robinhood"
    jdbc_password => "XYZ"
    jdbc_paging_enabled => "true"
    jdbc_page_size => "1000000"
    statement => "SELECT * from ENTRIES"
    tags => ["robinhood"]
  }

This Logstash JDBC configuration requires the mysql-connector-java package on CentOS.
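For completeness, on CentOS that package comes straight from the distribution repositories:

# Provides /usr/share/java/mysql-connector-java.jar referenced above.
yum install mysql-connector-java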

/etc/logstash/config/rbh.output.conf:

output {
  if "robinhood" in [tags] {
    elasticsearch {
      hosts => ["ELASTICBOX"]
      index => "rbh"
      template => "/etc/logstash/elasticsearch.robinhood.json"
      template_name => "robinhood"
      template_overwrite => true
    }
  }
}