I experimented with porting the Robinhood database from Elasticsearch, it being a commonly-used NoSQL database with specialisation in behaving like a Search Engine. Having ingested the
ENTRIES table I now need to augment the records with filepaths from
ANNEX_INFO, as I found out previously:
ANEX_INFOseems to store the lookup table matching filepaths (
linkcolumn) with their Robinhood/Lustre
NAMESseems to serve a similar purpose but splits every file into its own row and represents paths via string of a foreign keys; This means there are duplicate filenames in the table, but each of their rows have different
parent_id’s as they belong to different parent directories.
I thought about ingesting both tables in separate
SELECT queries, using the
id column as the
document_id in Elasticsearch, causing one document with a given id to be created and the second to match the same document and extend it with its fields.
Thankfully I didn’t jump straight into this and tried some SQL, I worked through the following:
SELECT * FROM ENTRIES- Pull everything from entries, including the relevant timestamps for metadata, uid, gid & misc.
SELECT * FROM ENTRIES, ANNEX_INFO WHERE ENTRIES.id = ANNEX_INFO.id- Using SQL to correlate the two tables where the id’s match. Unfortunately JDBC/SQL doesn’t tollerate a duplicate column name, in this case that is “id” which appears in both tables.
SELECT ENTRIES.id as id, uid, gid, size, blocks, creation_time, last_access, last_mod, last_mdchange, type, mode, nlink, md_update, invalid, fileclass, class_update, link FROM ENTRIES, ANNEX_INFO WHERE ENTRIES.id = ANNEX_INFO.id- This version is more verbose but it is necassary to eliminate the duplicate “id” column.
SELECT ENTRIES.id as idspecifies the “id” column to return whereas
WHERE ENTRIES.id = ANNEX_INFO.idexplicitly compares the “id” column from each table.
Unfortunately once this was applied to Logstash and the data ingested I noticed that the data from
ANEX_INFO is incomplete, a lot of paths don’t make sense or aren’t paths. It appears I was unlucky when I used
LIMIT 100 while digging through the database, as all the rows I saw had either full or partial paths.
The only table with a complete directory of files is
ENTRIES but this requires pre-processing before ingestion as every segment of a path is connected by a foreign key. So for every full path to a file the pre-processing has to follow X amount of keys where X is the number of directories/links traversed plus the destination file. This could be achieve with a Logstash plugin written in Ruby, connecting to MySQL to execute
JOIN’s to save
SELECTing each row one at a time.