My IRIS-HEP Fellowship | Work Report
The Skyhook Data Management project adds data management capabilities to object storage for tabular data. SkyhookDM is a Ceph distributed object storage module that allows you to store and query database tables.
The goal of Skyhook is to allow users to transparently grow and shrink their data storage and processing needs as demands change, offloading computation and other data management tasks to the storage layer in order to reduce client-side resources needed for data processing in terms of CPU, memory, IO, and network traffic. Skyhook utilizes Ceph’s existing object class mechanism (“cls”) by developing customized C++ object classes and methods that enable database operations such as SELECT, PROJECT, AGGREGATE to be offloaded (i.e., pushed down) to the object storage layer. Data processing tasks are executed directly within storage at the object-level, and also include data management tasks such as local indexing and data transformations (e.g., row to column layout) to support dynamic data management in the cloud.
SkyhookDM is a performance-critical computational storage system that embeds Apache Arrow. Small changes to the performance-critical parts of its source code often cause significant changes in performance. It is essential to track these changes so that the project becomes more performant over time and avoids silent performance deterioration.
To address this, the Ursa Labs benchmarking framework (Conbench) can be used to create benchmarks (very similar to unit tests) for all the performance-critical parts of the source code. These benchmarks can be added as a separate job in the CI/CD pipeline, which is triggered by events such as a commit or push. A web dashboard can also be integrated to monitor the performance results of the CI tests.
Overview of Work Done
The following are the major features I implemented during the coding period:
1. Integrated Conbench:- Conbench is a language-independent, continuous benchmarking framework, built specifically with the needs of a cross-language, platform-independent, high-performance project like Apache Arrow in mind.
Conbench lets you write benchmarks in any language, publish the results as JSON via an API, and persist them for comparison while iterating on performance improvements or to guard against regressions. It was the best fit for our project because SkyhookDM is built on top of Apache Arrow.
All the relevant commits I made can be found here.
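To make the idea of "publish results as JSON and compare them to guard against regressions" concrete, here is a minimal standard-library sketch. It does not use the Conbench API. The function names (`run_benchmark`, `is_regression`), the record fields, and the 5% threshold are all assumptions chosen for illustration, not the schema or defaults Conbench actually uses.

```python
import json
import statistics
import time


def run_benchmark(name, func, iterations=5, tags=None):
    """Time `func` several times and return a JSON-serializable record.

    The field names below are hypothetical, not Conbench's schema.
    """
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    return {
        "name": name,
        "tags": tags or {},
        "unit": "s",
        "data": samples,
        "mean": statistics.mean(samples),
    }


def is_regression(baseline, contender, threshold=0.05):
    """Return True if the contender's mean time is more than
    `threshold` (a fraction, assumed 5% here) slower than the baseline's."""
    change = (contender["mean"] - baseline["mean"]) / baseline["mean"]
    return change > threshold


if __name__ == "__main__":
    record = run_benchmark(
        "sum-range",
        lambda: sum(range(100_000)),
        tags={"format": "parquet"},
    )
    # The record can be written to disk or posted to a results service.
    print(json.dumps(record, indent=2))
```

Persisting one such record per commit is what makes a later comparison possible: the baseline comes from the target branch and the contender from the PR.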
2. Automated Benchmarking using GitHub Actions:-
With the benchmarking framework integrated and a few custom benchmarks working, I automated the process with a CI/CD pipeline built on GitHub Actions. It runs a series of steps (shown in the picture below) as soon as a new pull request is created.
This pipeline uses the most recently built SkyhookDM Docker image, which contains all the changes committed in the PR that are yet to be merged. All the benchmarks are then executed, producing JSON output that custom Python scripts use to generate analytical plots.
3. Visualizing the Outputs:-
To better understand the JSON outputs, I wrote Python scripts that compile the outputs and generate plots for each type of selectivity (row/column). Currently, we compare results for only two data formats, Parquet and Rados-Parquet, but the scripts are written so that they can be extended to any number of data formats.
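The grouping step behind such plots might look like the sketch below. It is not the project's actual script. The tag names (`format`, `selectivity_type`, `selectivity`) and the `mean` field are assumed for illustration. Because the formats are discovered from the data rather than hard-coded, adding a third format requires no code change.

```python
import json
from collections import defaultdict


def load_results(json_lines):
    """Parse one JSON benchmark record per line, skipping blank lines."""
    return [json.loads(line) for line in json_lines if line.strip()]


def group_by_format(records, selectivity_type):
    """Return {format: [(selectivity, mean_seconds), ...]} for one
    selectivity type ("row" or "column"), sorted by selectivity.

    The tag and field names are hypothetical.
    """
    series = defaultdict(list)
    for rec in records:
        tags = rec["tags"]
        if tags["selectivity_type"] != selectivity_type:
            continue
        series[tags["format"]].append((tags["selectivity"], rec["mean"]))
    return {fmt: sorted(points) for fmt, points in series.items()}


if __name__ == "__main__":
    lines = [
        '{"tags": {"format": "parquet", "selectivity_type": "row", "selectivity": 100}, "mean": 2.0}',
        '{"tags": {"format": "parquet", "selectivity_type": "row", "selectivity": 10}, "mean": 1.0}',
        '{"tags": {"format": "rados-parquet", "selectivity_type": "row", "selectivity": 10}, "mean": 0.5}',
    ]
    print(group_by_format(load_results(lines), "row"))
```

Each entry in the returned dictionary is one line on the plot, so it can be handed to whichever plotting library is in use.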
4. Integrated a Comment Bot:-
As the cherry on top, I integrated a comment bot that adds a comment (containing the plots) to the same PR thread. This proved to be a very useful add-on because contributors don't need to click extra links to see the plots.
The sequence of events is:
- A GitHub Actions build is triggered whenever a contributor raises a PR in the SkyhookDM repo. The main job of the workflow is to build the latest Docker image containing all the changes made by the contributor; as soon as that job finishes, it triggers another build in the benchmarks repo using webhooks.
- In the benchmarks workflow, all the benchmark tests are executed with Conbench, and all the generated outputs are stored here in a well-structured format.
- Once all the JSON results are pushed, the Python scripts are executed to generate the plots.
- As the last step of the workflow, the comment bot adds all the plots to the same PR thread.
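The last step above can be sketched as a function that assembles the comment text. This is an illustration, not the bot used in the project: the function name `build_comment_body`, the plot titles, and the URLs are hypothetical. The sketch assumes the plots are already hosted at URLs that GitHub can render inline.

```python
def build_comment_body(plot_urls, commit_sha):
    """Return a Markdown comment that embeds each plot inline.

    `plot_urls` maps a plot title to an image URL (both hypothetical).
    """
    lines = [f"Benchmark results for commit `{commit_sha}`:", ""]
    for title, url in plot_urls.items():
        lines.append(f"**{title}**")
        lines.append(f"![{title}]({url})")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


if __name__ == "__main__":
    body = build_comment_body(
        {
            "Row selectivity": "https://example.com/row.png",
            "Column selectivity": "https://example.com/column.png",
        },
        commit_sha="abc123",
    )
    print(body)
```

One way to post the resulting text is the GitHub REST API endpoint for issue comments, `POST /repos/{owner}/{repo}/issues/{issue_number}/comments`, with a JSON payload of the form `{"body": ...}`. Pull requests share that endpoint with issues. Whether the project's bot calls this endpoint directly or goes through a ready-made action is not stated here.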
As I was completely new to the Ceph file system and SkyhookDM, it was quite hard for me to start working on the project, which needs a good understanding of both. But my mentor Jayjeet was very helpful and explained everything that was important to me. I am very grateful that I got the chance to work with him.