HarkDB: current state of affairs
I wanted to discuss one of my projects: what it does, where it's at, and where it's going. I've always been fascinated by GPU technology, as I mentioned in my last blog post. For one of my class projects I decided to implement a GPU query engine that leverages the parallelism of the GPU to speed up analytical SQL queries.
Writing GPU kernels is far from easy. That's where the Futhark programming language comes in. Futhark is a functional programming language for data parallelism. It was created, and is being actively developed, at DIKU (Datalogisk Institut på Københavns Universitet). As a current student at DIKU, it felt appropriate to play around with it. The general premise is that writing good kernels is pretty hard and takes a lot of experience, so why not have a compiler generate highly optimized kernels for you, from code written at a higher level of abstraction? It's actually pretty cute. From what I know there really aren't good alternative programming languages that do the same (it's hard to tell because I'm in the echo chamber).
The logic for performing the operations of the SQL queries is implemented in Futhark. The Futhark code is then compiled to kernels, which can be called from a generated shared library. The parsing of the queries is done in C++. Futhark provides multiple backends, including OpenCL, CUDA, PyOpenCL, and more; in practice, however, the OpenCL and CUDA backends perform the best, which is why I chose to implement the rest of the project in C++.
In the current state of the project, all actions are performed via a CLI. Meta commands and queries are entered in the HarkDB REPL.
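The core decision such a REPL has to make is whether an input line is a meta command or a SQL query. A minimal sketch of that dispatch, assuming a `.` prefix for meta commands (an illustrative convention, not necessarily HarkDB's actual syntax):

```cpp
#include <string>

// Classify a line of REPL input. The '.' prefix for meta commands
// (e.g. ".tables", ".exit") is an assumption for illustration;
// queries would be handed off to the SQL parser.
enum class LineKind { Blank, Meta, Query };

LineKind classify_line(const std::string& line) {
    if (line.empty()) return LineKind::Blank;
    if (line[0] == '.') return LineKind::Meta;
    return LineKind::Query;
}
```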
Where we are
Currently we only support the SQL SELECT clause end to end. The kernels for the other SQL clauses have been implemented in Futhark, but the parser does not yet support them. The parser is thus the biggest gap, and the main roadblock, in the project.
Where we are going
The functionality for the other clauses is basically implemented, so certain steps can be taken to bootstrap the work already completed on the SQL SELECT clause and get the remaining clauses working end to end. The first is that there should be a single Futhark main function for handling all the clauses. We should probably hold off on implementing the WHERE and HAVING clauses, as their parsing is more involved.
Thus, an involved SQL statement we would aim to support (not including JOIN) would look something like this:
SELECT PID, AVG(Price), MIN(Weight) FROM USERS GROUPBY PID
With the ideal Futhark function call having the following type signature:
entry select_groupby_from (Cols : [n]i32) (Cols_t : [n]i32) (db: u32) (GroupBy : u32) -- possibly an array, but we don't currently support that
It's important to note that there are some invariants that have to be met for the Futhark call to be valid. These invariants should be checked in the C++ code that orchestrates the calls to the kernels generated from the Futhark code.
Joins are more complicated, as they are multi-table operations, and coupling joins with GROUPBY and SELECT complicates things further. However, an important observation is that the join can be performed before the other operations. This means we can effectively compute an intermediary view/table, and then the GROUPBY, SELECT, and other clauses can be computed using this intermediary value as the table they operate on.
The type signature of a join statement in Futhark looks like this:
join [n][m][s][t][l][k] (db1: [n][m]u32) (db2: [s][t]u32) (col1: i32) (col2: i32) (cols1: [l]i32) (cols2: [k]i32)
The idea is that we join on col1 from db1 and col2 from db2, and select the columns from db1 and db2 specified in cols1 and cols2 respectively. Currently only single-column joins are supported, as can be seen from the fact that col1 and col2 are scalars rather than arrays.
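As a CPU reference for those semantics, a naive nested-loop equi-join sketch (illustrative only; this is not the parallel kernel Futhark would generate, just the result it should agree with):

```cpp
#include <cstdint>
#include <vector>

// Mirrors the Futhark signature above: for every pair of rows where
// db1[i][col1] == db2[j][col2], emit the columns of db1 listed in
// cols1 followed by the columns of db2 listed in cols2.
std::vector<std::vector<uint32_t>> join(
    const std::vector<std::vector<uint32_t>>& db1,
    const std::vector<std::vector<uint32_t>>& db2,
    int col1, int col2,
    const std::vector<int>& cols1, const std::vector<int>& cols2) {
    std::vector<std::vector<uint32_t>> out;
    for (const auto& r1 : db1) {
        for (const auto& r2 : db2) {
            if (r1[col1] != r2[col2]) continue;   // single-column equi-join
            std::vector<uint32_t> row;
            for (int c : cols1) row.push_back(r1[c]);
            for (int c : cols2) row.push_back(r2[c]);
            out.push_back(row);
        }
    }
    return out;
}
```

Because the join runs first, its output can serve directly as the intermediary table that GROUPBY and SELECT then operate on.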
The current parser does not support join statements. This is something I will have to expand in order to orchestrate the call to the join kernel generated by my Futhark code.
Honestly, the parser I wrote is hot garbage. I have been grappling with the idea of completely rewriting it. It did what I needed it to do with a deadline fast approaching, but some of the code is a spectacle in how never to write code. Rob Pike has an interesting talk about writing lexers in Go, which might provide some good insights on where to start my rewrite. At this stage, adding to the parser seems like a poor way to move forward; a rewrite is the only good solution.
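The core idea in Pike's talk is a lexer built from state functions, each of which consumes some input and returns the next state. A rough C++ translation of the pattern, with the self-referential function type wrapped in a struct (token categories here are illustrative, not HarkDB's actual token set):

```cpp
#include <cctype>
#include <string>
#include <vector>

struct Token { enum Kind { Word, Number, Symbol } kind; std::string text; };

struct Lexer {
    std::string src;
    size_t pos = 0;
    std::vector<Token> out;
};

// A state is a function taking the lexer and returning the next state;
// the struct wrapper lets the function type refer to itself.
struct State { State (*fn)(Lexer&); };

State lex_any(Lexer&);

State lex_word(Lexer& lx) {
    size_t start = lx.pos;
    while (lx.pos < lx.src.size() && std::isalnum((unsigned char)lx.src[lx.pos])) ++lx.pos;
    lx.out.push_back({Token::Word, lx.src.substr(start, lx.pos - start)});
    return {lex_any};
}

State lex_number(Lexer& lx) {
    size_t start = lx.pos;
    while (lx.pos < lx.src.size() && std::isdigit((unsigned char)lx.src[lx.pos])) ++lx.pos;
    lx.out.push_back({Token::Number, lx.src.substr(start, lx.pos - start)});
    return {lex_any};
}

State lex_any(Lexer& lx) {
    while (lx.pos < lx.src.size() && std::isspace((unsigned char)lx.src[lx.pos])) ++lx.pos;
    if (lx.pos == lx.src.size()) return {nullptr};   // nullptr ends the loop
    unsigned char c = lx.src[lx.pos];
    if (std::isalpha(c)) return {lex_word};
    if (std::isdigit(c)) return {lex_number};
    lx.out.push_back({Token::Symbol, std::string(1, (char)c)});
    ++lx.pos;
    return {lex_any};
}

// Drive the state machine until a state returns nullptr.
std::vector<Token> lex(const std::string& src) {
    Lexer lx{src};
    for (State s{lex_any}; s.fn; s = s.fn(lx)) {}
    return lx.out;
}
```

The appeal of this design is that each piece of lexing logic lives in its own small function, and adding a new token class means adding one state rather than growing a monolithic switch.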
I have outlined a couple of next steps for HarkDB in this blog post so far:
- Implement GROUPBY end to end
- Implement JOIN end to end
- Rewrite the Parser
- There's a lot more I didn't mention...
However, what is the end goal? I'd really like to see if what I wrote actually does anything worthwhile. The way I plan on finding out is by comparing the results against the Alenka GPU database engine. If some benchmarks are within a factor of four, I'll be pretty stoked. It should be noted that I have a lot to implement before I can actually test against Alenka.
So I'll end with some positives. I got CI and testing working end to end with Travis and Google Test. I converted from Make to CMake, which has been pretty awesome, as I've been able to use some of the more advanced features of CMake to handle dependencies, making installing HarkDB far easier. I'm excited to see where the rest of this goes.
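As an example of the kind of dependency handling CMake enables, its FetchContent module can pull in Google Test at configure time instead of requiring a system-wide install. A sketch of the pattern (the target names are illustrative, not necessarily HarkDB's exact setup):

```cmake
# Fetch Google Test at configure time rather than requiring a system install.
include(FetchContent)
FetchContent_Declare(
  googletest
  GIT_REPOSITORY https://github.com/google/googletest.git
  GIT_TAG        release-1.12.1
)
FetchContent_MakeAvailable(googletest)

# Hypothetical test target for illustration.
add_executable(harkdb_tests test/main.cpp)
target_link_libraries(harkdb_tests PRIVATE gtest_main)
```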
Oh, and the source code for HarkDB, if you're interested.