dask-sql is a distributed SQL query engine in Python.
It allows you to query and transform your data using a mixture of
common SQL operations and Python code, and also to scale up the calculation easily
if you need it.
With dask-sql you can mix the well-known Python dataframe API of Dask with common SQL operations, to process your data in exactly the way that is easiest for you.
Built on the Dask ecosystem, your computations can scale as you need it - from your laptop to your super cluster - without changing any line of SQL code. From k8s to cloud deployments, from batch systems to YARN - if Dask supports it, so will dask-sql.
dask-sql is just a pip/conda install away (or a docker run if you prefer). No need for complicated cluster setups - dask-sql will run out of the box on your machine and can be easily connected to your computing cluster.
dask-sql integrates with your jupyter notebook, your normal Python module or can be used as a standalone SQL server from any BI tool. It even integrates natively with Apache Hue.
dask-sql has experimental support for running SQL queries on CUDA-enabled GPUs by utilizing RAPIDS libraries like cuDF, enabling accelerated compute for SQL.
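As a minimal sketch of what this looks like (assuming a CUDA-capable machine with the RAPIDS libraries installed; the file path and column name are placeholders), a GPU-backed dask-cuDF dataframe can be registered just like any other dask dataframe:

import dask_cudf
from dask_sql import Context

c = Context()

# Any dask dataframe works as input - including a GPU-backed dask-cuDF one
gdf = dask_cudf.read_csv("...")  # placeholder path
c.create_table("gpu_data", gdf)

# The computation now runs on the GPU via RAPIDS
# (the column x is assumed to exist in the data)
result = c.sql("SELECT SUM(x) FROM gpu_data")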
Read more in the documentation.
For this example, we use some data loaded from disk and query it with a SQL command from our Python code.
Any pandas or dask dataframe can be used as input and
dask-sql understands a wide range of formats (csv, parquet, json, ...) and locations (s3, hdfs, gcs, ...).
import dask.dataframe as dd
from dask_sql import Context

# Create a context to hold the registered tables
c = Context()

# Load the data and register it in the context
# This will give the table a name that we can use in queries
df = dd.read_csv("...")
c.create_table("my_data", df)

# Now execute a SQL query. With return_futures=False, the result
# is computed directly instead of being returned as a lazy dask dataframe.
result = c.sql("""
    SELECT
        my_data.name,
        SUM(my_data.x)
    FROM my_data
    GROUP BY my_data.name
""", return_futures=False)

# Show the result
print(result)
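Because the engine is plain Dask underneath, the same query can scale to a cluster simply by connecting a distributed client before executing it. A minimal sketch, where the scheduler address and the S3 path are hypothetical placeholders:

from dask.distributed import Client
from dask_sql import Context

# Connect to an already running dask cluster;
# dask-sql picks up the active client automatically
client = Client("tcp://my-scheduler:8786")  # hypothetical address

c = Context()

# Tables can also be registered directly from a storage location;
# the format is then deduced from the file extension
c.create_table("my_data", "s3://my-bucket/data.parquet")  # hypothetical bucket

# The default return_futures=True keeps the result lazy
df = c.sql("SELECT COUNT(*) FROM my_data")
print(df.compute())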
dask-sql is currently under development and does not yet understand all SQL commands (but a large fraction). We are actively looking for feedback, improvements and contributors!
dask-sql can be installed via conda (preferred) or pip - or in a development environment.
Create a new conda environment or use your already present environment:
conda create -n dask-sql
conda activate dask-sql
Install the package from the conda-forge channel:
conda install dask-sql -c conda-forge
dask-sql needs Java for parsing the SQL queries.
Make sure you have a working Java installation with version >= 8.
To test if you have Java properly installed and set up, run
$ java -version
openjdk version "1.8.0_152-release"
OpenJDK Runtime Environment (build 1.8.0_152-release-1056-b12)
OpenJDK 64-Bit Server VM (build 25.152-b12, mixed mode)
After installing Java, you can install the package with
pip install dask-sql
If you want to use the newest (unreleased) dask-sql version or if you plan to do development on dask-sql, you can also install the package from source.
git clone https://github.com/dask-contrib/dask-sql.git
Create a new conda environment and install the development environment:
conda env create -f continuous_integration/environment-3.9-jdk11-dev.yaml
It is not recommended to use pip instead of conda for the environment setup.
If you nevertheless need to, make sure you have Java (JDK >= 8) and Maven installed and correctly set up before continuing.
Have a look at environment-3.9-jdk11-dev.yaml for the rest of the development environment.
After that, you can install the package in development mode:
pip install -e ".[dev]"
To recompile the Java classes after changes have been made to the Java source, run:
python setup.py build_ext
This repository uses pre-commit hooks. To install them, call:
pre-commit install
You can run the tests (after installation) with:
pytest tests
dask-sql comes with a small test implementation for a SQL server.
Instead of rebuilding a full ODBC driver, we re-use the presto wire protocol.
It is - so far - only a start of the development and is still missing important concepts.
You can test the SQL presto server by running (after installation):
dask-sql-server
or by using the created docker image
docker run --rm -it -p 8080:8080 nbraun/dask-sql
in one terminal. This will spin up a server on port 8080 (by default) that looks similar to a normal presto database to any presto client.
You can test this for example with the default presto client:
presto --server localhost:8080
Now you can fire simple SQL queries (as no data is loaded by default):
=> SELECT 1 + 1;
 EXPR$0
--------
      2
(1 row)
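Any presto client library should work as well. A hedged sketch using the prestodb Python package (the package choice and the catalog/schema values below are assumptions, not something dask-sql prescribes):

import prestodb

# Talk to the running dask-sql server as if it were a normal presto instance
conn = prestodb.dbapi.connect(
    host="localhost",
    port=8080,
    user="test",       # required by the presto client
    catalog="system",  # assumed values - adjust to your setup
    schema="default",
)
cur = conn.cursor()
cur.execute("SELECT 1 + 1")
print(cur.fetchall())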
You can find more information in the documentation.
You can also run the CLI dask-sql for testing out SQL commands quickly:
dask-sql --load-test-data --startup
(dask-sql) > SELECT * FROM timeseries LIMIT 10;
At the core, dask-sql does two things: it translates the SQL query into a relational algebra using Apache Calcite, and it converts this relational algebra into a physical dask execution plan.
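You can inspect the result of the first step yourself: Context.explain returns the relational algebra plan for a query without executing it. A small, self-contained example:

import dask.dataframe as dd
import pandas as pd
from dask_sql import Context

c = Context()
df = dd.from_pandas(pd.DataFrame({"name": ["a", "b"], "x": [1, 2]}), npartitions=1)
c.create_table("my_data", df)

# Print the relational algebra (step 1) instead of running the query
print(c.explain("SELECT SUM(x) FROM my_data GROUP BY name"))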
For the first step, Apache Calcite needs to know about the columns and types of the dask dataframes; therefore, some Java classes that store this information for dask dataframes are defined in the Java part of the code base.
After the translation to a relational algebra is done (using RelationalAlgebraGenerator.getRelationalAlgebra), the Python methods defined in dask_sql.physical turn this into a physical dask execution plan by converting each piece of the relational algebra one by one.
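The dispatch works roughly like the following sketch. This is only an illustration of the general idea, not the actual dask_sql.physical code, and all names in it are made up:

# Illustrative sketch only - not dask-sql's real internals
def convert_table_scan(rel, context):
    # Leaf node: look the table up among the registered tables
    return context.tables[rel.table_name]

def convert_projection(rel, context):
    # Convert the node's input first, then select the projected columns
    df = convert(rel.input, context)
    return df[rel.column_names]

# One converter per relational-algebra node type
CONVERTERS = {
    "LogicalTableScan": convert_table_scan,
    "LogicalProject": convert_projection,
    # ... one entry per supported relational operation
}

def convert(rel, context):
    # Walk the tree bottom-up, turning every node into dask API calls
    return CONVERTERS[rel.node_type](rel, context)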