In Robinhood NoSQL I experimented with porting the Robinhood database to Elasticsearch, a commonly-used NoSQL database that specialises in behaving like a search engine. Having ingested the ENTRIES table, I now need to augment the records with filepaths from ANNEX_INFO, as I found out previously:

ANNEX_INFO seems to store the lookup table matching filepaths (the link column) with their Robinhood/Lustre ID.

NAMES seems to serve a similar purpose but splits every file into its own row and represents paths via a chain of foreign keys. This means there are duplicate filenames in the table, but each of those rows has a different parent_id as they belong to different parent directories.

I thought about ingesting both tables in separate SELECT queries, using the id column as the document_id in Elasticsearch: the first ingest would create a document with a given id, and the second would match that same document and extend it with its fields.
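For what it's worth, that merge-on-id behaviour maps onto real options in the logstash-output-elasticsearch plugin. A minimal sketch of the output block, with placeholder host and index names:

    output {
      elasticsearch {
        hosts         => ["localhost:9200"]  # placeholder host
        index         => "robinhood"         # placeholder index name
        document_id   => "%{id}"             # reuse the Robinhood id as the _id
        action        => "update"            # update the existing document
        doc_as_upsert => true                # or create it if it doesn't exist yet
      }
    }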

Thankfully I didn’t jump straight into this and instead tried some SQL first; I worked through the following:

  • SELECT * FROM ENTRIES - Pull everything from ENTRIES, including the relevant timestamps for metadata, uid, gid and miscellaneous fields.

  • SELECT * FROM ENTRIES, ANNEX_INFO WHERE ENTRIES.id = ANNEX_INFO.id - Using SQL to correlate the two tables where the ids match. Unfortunately JDBC/SQL doesn’t tolerate a duplicate column name, in this case “id”, which appears in both tables.

  • SELECT ENTRIES.id AS id, uid, gid, size, blocks, creation_time, last_access, last_mod, last_mdchange, type, mode, nlink, md_update, invalid, fileclass, class_update, link FROM ENTRIES, ANNEX_INFO WHERE ENTRIES.id = ANNEX_INFO.id - This version is more verbose but it is necessary to eliminate the duplicate “id” column: SELECT ENTRIES.id AS id specifies which “id” column to return, while WHERE ENTRIES.id = ANNEX_INFO.id explicitly compares the “id” column from each table. (An explicit JOIN, shown after this list, achieves the same thing more concisely.)
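As an aside, the same correlation can be written as an explicit join. In MySQL, the USING clause coalesces the join column into a single “id”, so even SELECT * stops triggering the duplicate-column error. A sketch:

    -- Explicit join; USING (id) returns one coalesced id column,
    -- so the duplicate column name never reaches JDBC.
    SELECT *
    FROM ENTRIES
    INNER JOIN ANNEX_INFO USING (id);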

Unfortunately, once this was applied to Logstash and the data ingested, I noticed that the data from ANNEX_INFO is incomplete: a lot of paths don’t make sense or aren’t paths at all. It appears I was unlucky when I used LIMIT 100 while digging through the database, as all the rows I saw had either full or partial paths.

The only table with a complete directory of files is NAMES, but this requires pre-processing before ingestion, as every segment of a path is connected by a foreign key. So for every full path to a file, the pre-processing has to follow X keys, where X is the number of directories/links traversed plus the destination file. This could be achieved with a Logstash filter plugin written in Ruby, connecting to MySQL to execute JOINs rather than SELECTing each row one at a time.
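Before writing a Ruby plugin, it may be worth checking the MySQL version: recursive CTEs (MySQL 8.0+, MariaDB 10.2+) can resolve the whole chain server-side in a single query. A sketch, assuming NAMES exposes id, parent_id and name columns and that root entries carry a NULL parent_id (both assumptions, not confirmed against the schema):

    -- Sketch: walk parent_id chains in NAMES to build full paths.
    -- Assumes columns id, parent_id, name, and NULL parent_id for roots;
    -- requires MySQL 8.0+ or MariaDB 10.2+ for WITH RECURSIVE.
    WITH RECURSIVE paths (id, path) AS (
        SELECT id, CAST(name AS CHAR(4096))   -- widen the column so CONCAT below fits
        FROM NAMES
        WHERE parent_id IS NULL               -- assumed root marker
        UNION ALL
        SELECT n.id, CONCAT(p.path, '/', n.name)
        FROM NAMES n
        JOIN paths p ON n.parent_id = p.id    -- extend one level per iteration
    )
    SELECT id, path FROM paths;

On an older server this walk would indeed have to live in the Ruby filter, batching JOINs as described above.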