Microsoft SQL Server PolyBase (product site)
As illustrated by the figure below, the goal of the Polybase project is to allow SQL Server PDW users to execute queries against data stored in Hadoop, specifically the Hadoop distributed file system (HDFS). Polybase is agnostic on both the type of the Hadoop cluster (Linux or Windows) and whether it is a separate cluster or whether the Hadoop nodes are co-located with the nodes of the PDW appliance. Using PDW’s CTAS (“create table as select”) syntax, Polybase provides users with the ability to move data in parallel between nodes of the Hadoop and PDW clusters. In addition, users can create external tables over HDFS-resident data. All standard HDFS file formats are supported as well as custom file formats as long as InputFormat (with RecordReader) and OutputFormat (RecordWriter) implementations are provided. This allows queries to reference data stored in HDFS as if it were loaded into a relational table. Users can seamlessly perform joins between tables in PDW and data in HDFS.
While some other parallel database systems provide similar capabilities, Polybase pushes the state-of-the-art further into two significant ways. First, when optimizing a SQL query that references data stored in HDFS, the Polybase query optimizer makes a cost-based decision (using statistics on the HDFS file stored in the PDW catalog) on whether or not it should transform relational operators over HDFS-resident data into MapReduce jobs for execution on the Hadoop cluster. Consider for example a simple query with two selections, a join, and an aggregate. If one of the two input tables is stored in HDFS, the Polybase query optimizer will evaluate whether or not to perform the select operator on that table as a Map job on the Hadoop cluster or whether it is more efficient to pull the entire file into PDW and perform the selection using the SQL Server instances running on the PDW appliance. If both input tables are in HDFS, not only will the Polybase QO consider pushing the selection as Map job, but it will also consider the benefits of pushing the join and aggregate as well. All of this is totally transparent to the user and is driven by a state-of-the art parallel query optimizer initially developed at the Gray Systems Lab. Polybase, however, carries the idea of “split-query” processing even further. Consider a configuration consisting of a small PDW appliance paired with a large Hadoop cluster. Polybase is capable of fully leveraging the larger compute and I/O power of the Hadoop cluster by moving work to Hadoop for processing even for queries that only reference PDW-resident data.
- Rimma Nehme
- Alan Halverson
- Srinath Shankar
- David DeWitt
- Hideaki Kimura
- Nikhil Teletia
- Willis Lang
- Jessie Li
- Vineeta Gankidi
 "Split Query Processing in Polybase", David J. DeWitt, Alan Halverson, Rimma Nehme, Srinath Shankar, Josep Aguilar-Saborit, Artin Avanes, Miro Flasza, and Jim Gramling, Proceedings of the 2013ACM SIGMOD Conference, N.Y., N.Y., June 2013.
 "Indexing HDFS Data in PDW: Splitting the data from the index", Vinitha Reddy Gankidi, Nikhil Teletia, Jignesh M. Patel, Alan Halverson, David J. DeWitt. VLDB 2014