Polyglot data sets in Clojure - Reloaded

18 May 2014

Since the original installment of the datasets library, a couple of the original design decisions had to be revisited. This post will delve into a couple of interesting bits of the evolving design.

Query reordering

A key requirement of the dataset wrapper API is to delegate to native implementations wherever possible, be that running filter logic on the original source (to limit network transfers) or supporting joins inside the data source. At the same time, operations not supported by the original source are handled in a Clojure wrapping layer. One case not supported in the original design is a sequence of operations such as the one below:

(-> (sql-table->dataset h2-spec "accounts")
    (select [(subs :$mnemonic 3) :as :mnem] :$strategy)
    (where (= :$strategy "001")))

The challenge here is that the native SQL data source is wrapped by a select that uses Clojure functions, and will hence force evaluation in the Clojure wrapper. The ensuing where clause on the other hand is not dependent on the select statement (as strategy is passed through unchanged) and could/should be executed natively. One may argue that in this example the code author could simply reorder the statements. However, if the where clause is introduced for example as part of a directed join, or if the dataset and the select are provided as a utility data source this may not be possible. What is needed hence is support for query reordering.
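To see why the reordering is safe in this case, here is a minimal sketch using plain Clojure sequences rather than the datasets API. The `accounts` vector, `project` and `strategy-001?` are hypothetical stand-ins for the table, the select and the where clause. Because the filter only touches a field that the projection passes through unchanged, both orderings give the same result.

```clojure
;; Hypothetical in-memory stand-in for the "accounts" table.
(def accounts
  [{:mnemonic "ABC123" :strategy "001"}
   {:mnemonic "DEF456" :strategy "002"}
   {:mnemonic "GHI789" :strategy "001"}])

;; The projection from the example: one computed field, one pass-through field.
(defn project [rec]
  {:mnem     (subs (:mnemonic rec) 3)
   :strategy (:strategy rec)})

;; The where clause from the example, written as a plain predicate.
(defn strategy-001? [rec]
  (= (:strategy rec) "001"))

;; Order as written: project first, then filter in the Clojure layer.
(def select-then-where
  (->> accounts (map project) (filter strategy-001?)))

;; Reordered: filter first (the part a SQL source could run natively), then project.
(def where-then-select
  (->> accounts (filter strategy-001?) (map project)))

;; Both orderings yield the same records.
(= select-then-where where-then-select)
```

Had the predicate tested `:mnem` instead, it could only run after the projection, which is exactly the case the wrapper has to detect.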

To query plan or not to query plan

At this stage one possibility is to abandon the original design of smart datasources entirely and move to dumb datasource objects backed by an external query planner algorithm. However, part of the appeal of the original design is that datasources are easy to add and, following a couple of common behaviour rules, contain all the logic needed to integrate into the larger fabric. So for now let's implement where/select query reordering in the localized datasources.

First let's revisit the current distribution of responsibilities for select/where statements.

  1. The macro layer parses code arguments and turns them into quoted forms.
  2. The functional layer handles cross-cutting concerns of parsing arguments and splitting into supported and unsupported ones.
  3. The protocol implementations handle the datasource specific access and argument handling.

The natural place to handle query reordering would seem to be the functional layer, which already handles join flows. However, that layer both has too little information and tries to do too much. For example, it is not clear whether we want to reorder select and where clauses in general or only in cases where a Clojure wrapper was generated and prevents native selects. Secondly, given that arguments are parsed in the functional layer for pushing where clauses, we would need to retrieve the nested source and try to parse that one as well. Note that query reordering for the described scenario is only really needed for Clojure wrapper datasets, not for any other datasource. So we can put the logic there.
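As an illustration of the first layer in the list above, the following is a minimal sketch of how a macro can hand its arguments to the functional layer as quoted forms. The `where*` defined here is a hypothetical stub that only echoes what it receives, and the macro body is an assumption, not the library's actual code.

```clojure
;; Hypothetical stub for the functional layer: it just echoes its arguments.
(defn where* [source conditions]
  {:source source :conditions conditions})

(defmacro where
  "Quotes the condition forms so they reach where* as data instead of being evaluated."
  [source & conditions]
  `(where* ~source '~conditions))

;; The condition arrives unevaluated, as a list of forms that later layers can inspect.
(def result
  (-> :accounts
      (where (= :$strategy "001"))))
```

Keeping the conditions as data is what allows the later layers to decide whether a clause can be translated to SQL or has to be compiled into a Clojure function.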

Another challenge after this decision is that the query arguments are parsed in the functional layer, so that the wrapper datasource does not get the raw arguments the wrapped datasource would need. Moreover, to safely push a where clause the intervening select must not modify the fields used in the where clause, which presupposes introspection into the where clause itself. Finally, given that the functional layer handles argument segregation, that logic would now also have to be replicated in the wrapper so that it only passes those arguments to the wrapped datasource that the latter can handle. At this point it should be apparent that the division of responsibilities into a functional layer and protocol implementations has become untenable. So the first change is to push all logic into the protocols. The functional layer now only remains to delegate to wrapper datasources for operations not implemented by a given datasource. Also note that pushing the logic into individual datasources means that some recurring concerns, such as argument segregation, now have to be handled by each datasource individually. However, we can still provide utilities that can be used to remove code duplication.
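One such shared utility could look like the sketch below. The names `segregate` and `equality-test?` are hypothetical and not part of the library. The idea is simply that each datasource supplies its own predicate for what it can handle natively and reuses the splitting logic.

```clojure
(defn segregate
  "Splits args into those a datasource can handle natively and the rest.
  supported? is a datasource-specific predicate (hypothetical)."
  [supported? args]
  (let [{native true, fallback false} (group-by (comp boolean supported?) args)]
    {:native   (vec native)
     :fallback (vec fallback)}))

;; Toy predicate: pretend the source only understands equality tests.
(defn equality-test? [sexp]
  (and (seq? sexp) (= '= (first sexp))))

(def split
  (segregate equality-test?
             ['(= :$strategy "001")
              '(> (count :$mnemonic) 3)]))
```

A datasource would then run the `:native` part itself and delegate the `:fallback` part to a wrapper.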

The new functional layer functions for select and where are shown below.

(defn select* [source fields]
  (if (satisfies? Selectable source)
    (-select source fields)
    (select-wrapper source fields)))

(defn where* [source conditions]
  (if (satisfies? Filterable source)
    (-where source conditions)
    (where-wrapper source conditions)))
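The protocols themselves are not shown in this post. As a rough sketch of the dispatch, assume they have the shape below, and take a toy record that implements only `Filterable`. The protocol definitions and `ToySource` are assumptions for illustration, and the conditions are plain predicates here rather than quoted forms to keep the example short.

```clojure
;; Assumed protocol shapes; the library's real definitions may differ.
(defprotocol Selectable
  (-select [source fields]))

(defprotocol Filterable
  (-where [source conditions]))

;; A toy source that can filter natively but knows nothing about select.
(defrecord ToySource [rows]
  Filterable
  (-where [_ conditions]
    (ToySource. (filter (fn [row] (every? #(% row) conditions)) rows))))

(def src (->ToySource [{:strategy "001"} {:strategy "002"}]))

;; where* would take the native branch for this source ...
(def filterable? (satisfies? Filterable src))

;; ... while select* would fall through to select-wrapper.
(def selectable? (satisfies? Selectable src))
```

With this split a new datasource only has to implement the protocols it can serve natively. Everything else is picked up by the wrappers.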

Wrapper datasets

The most important utility is the concept of a default implementation of all methods that is available to other datasources for operations they don't wish to handle. Instead of a single ClojureDatasource class, though, each wrapper has separate behaviour. For example, the select wrapper below has the logic for the query reordering we wish to accomplish:

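;; Assumes these aliases in the ns form:
;;   [clojure.core.reducers :as r] [clojure.core.protocols :as p] [clojure.set :as set]
(defn select-wrapper [source fields]
  (let [;; to allow some push behaviour we want to exclude one-to-one mappings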
        output-fields (set (keep (fn [[k exp]] (when-not (= k exp) k)) fields))
        parsed-fields (map (fn [[k sexp]] [(field->keyword k) (sexp->fn sexp)]) fields)
        mapped-source
        (r/map
          (fn [rec]
            (persistent!
              (reduce
                (fn [res [key f]] (assoc! res key (f rec)))
                (transient {})
                parsed-fields)))
          source)]
    (reify
      Filterable
      (-where [self conditions]
        (let [{pushable true unpushable false}
              (group-by #(empty? (set/intersection (set (field-refs %)) output-fields)) conditions)
              inner (if (seq pushable) (where* source pushable) source)]
          (where-wrapper
            (select-wrapper inner fields)
            unpushable)))

      p/CollReduce
      (coll-reduce [_ f]
        (p/coll-reduce mapped-source f))
      (coll-reduce [_ f val]
        (p/coll-reduce mapped-source f val)))))
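The push decision in `-where` hinges on `field-refs`, which is not shown here. A minimal sketch of the check, assuming field references are keywords starting with `$` and that `field-refs` simply collects them from a quoted condition, could look as follows. Both helpers and the sample `output-fields` set are hypothetical stand-ins.

```clojure
(require '[clojure.set :as set])

(defn field-ref?
  "Assumed convention: field references are keywords whose name starts with $."
  [x]
  (and (keyword? x) (= \$ (first (name x)))))

(defn field-refs
  "Hypothetical stand-in: collects all field references in a quoted condition."
  [sexp]
  (filter field-ref? (tree-seq coll? seq sexp)))

(defn pushable?
  "A condition can move below the select when it touches no computed field."
  [output-fields condition]
  (empty? (set/intersection (set (field-refs condition)) output-fields)))

;; Pretend :$mnem is computed by the select while :$strategy passes through.
(def output-fields #{:$mnem})

(def grouped
  (group-by #(pushable? output-fields %)
            ['(= :$strategy "001")
             '(= :$mnem "123")]))
```

Conditions under the `true` key can be handed to the wrapped source, while those under `false` stay in the Clojure layer on top of the select.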

The where wrapper, on the other hand, is quite a bit simpler. Note that there may also be a case for reordering where clauses to push native predicates inside. For the moment we have deferred that requirement, given that the implementation is a bit more complex than for select, as it has to prevent spurious flip-flopping of where clauses.

(defn where-wrapper
  "Default implementation of Filterable protocol handled in Clojure.
  Delegate to this function for where clauses not natively handled."
  [source conditions]
  (let [parsed-conditions (map sexp->fn conditions)]
    (r/filter
      (fn [rec] (every? #(% rec) parsed-conditions))
      source)))
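Both wrappers rely on `sexp->fn` to turn a quoted expression into a function of a record. Its implementation is not part of this post, so the following is only a sketch of one way it might work, assuming that `:$strategy` refers to the record key `:strategy`. It uses `eval` for brevity, and both helpers are hypothetical stand-ins rather than the library's code.

```clojure
(require '[clojure.walk :as walk])

(defn field->keyword
  "Assumed convention: :$strategy refers to the record key :strategy."
  [k]
  (keyword (subs (name k) 1)))

(defn sexp->fn
  "Hypothetical stand-in: compiles a quoted expression into a function of a record."
  [sexp]
  (let [rec  (gensym "rec")
        ;; replace every :$field keyword with a lookup on the record argument
        body (walk/postwalk
               (fn [x]
                 (if (and (keyword? x) (= \$ (first (name x))))
                   (list `get rec (field->keyword x))
                   x))
               sexp)]
    (eval (list `fn [rec] body))))

(def strategy-001? (sexp->fn '(= :$strategy "001")))

(def matching
  (filter strategy-001? [{:strategy "001"} {:strategy "002"}]))
```

The same helper would serve the select wrapper, for example `(sexp->fn '(subs :$mnemonic 3))` for the computed field in the earlier example.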

Next up .. pattern datasets

With the SQL layer in reasonable shape, the next step is to support datasources with more rigid access, such as caches with linearized access keys or individual field indices. These will require a different approach to parsing where clauses, in order to extract the field level settings and transform them to the non-SQL query DSL of the individual datasource.
