Dask Development Log, Scipy 2018
This work is supported by Anaconda Inc.
To increase transparency I’m trying to blog more often about the current work going on around Dask and related projects. Nothing here is ready for production. This blogpost is written in haste, so refined polish should not be expected.
Last week many Dask developers gathered for the annual SciPy 2018 conference. As a result, very little work was completed, but many projects were started or discussed. To reflect this change in activity, this blogpost will highlight possible changes and opportunities for readers to engage further in development.
Dask on HPC Machines
The dask-jobqueue project was a hit at the conference. Dask-jobqueue helps people launch Dask on traditional job schedulers like PBS, SGE, SLURM, Torque, LSF, and others that are commonly found on high performance computers. These schedulers are very common among scientific, research, and high performance machine learning groups, but are often a bit hard to use with anything other than MPI.
This project came up in the Pangeo talk, lightning talks, and the Dask Birds of a Feather session.
During the sprints a number of people came by and we walked through the process of configuring Dask on common supercomputers like Cheyenne, Titan, and Cori. This process usually takes around fifteen minutes and will likely be the subject of a future blogpost. We published known-good configurations for these clusters in our configuration documentation.
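As a rough sketch of what this looks like in practice (the resource values below are illustrative, not recommendations for any particular machine), dask-jobqueue lets you describe a single job and then scale it out:

```python
from dask.distributed import Client
from dask_jobqueue import PBSCluster

# Describe what a single PBS job looks like; these values are
# illustrative and should match your machine's queue policies
cluster = PBSCluster(cores=36,
                     memory='100GB',
                     queue='regular',
                     walltime='00:30:00')

cluster.scale(10)         # ask the job scheduler for ten such jobs
client = Client(cluster)  # connect Dask to the resulting workers
```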
Additionally, there is a JupyterHub issue to improve documentation on best practices to deploy JupyterHub on these machines. The community has done this well a few times now, and it might be time to write up something for everyone else.
Get involved
If you have access to a supercomputer then please try things out. There is a 30-minute YouTube screencast in the dask-jobqueue documentation that should help you get started.
If you are an administrator on a supercomputer you might consider helping to build a configuration file and place it in /etc/dask for your users (a sketch follows below). You might also want to get involved in the JupyterHub on HPC conversation.
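As a sketch of what such a site-wide file might contain (the values are placeholders to tune for your machine, not recommendations), dask-jobqueue reads defaults from YAML files like this one:

```yaml
# /etc/dask/jobqueue.yaml -- site-wide defaults for dask-jobqueue
# (illustrative values; adjust for your machine's queue policies)
jobqueue:
  pbs:
    cores: 36
    memory: 100GB
    queue: regular
    walltime: '00:30:00'
    interface: ib0            # high-speed network interface
    local-directory: $TMPDIR  # fast local scratch space for spilling
```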
Dask / Scikit-learn talk
Olivier Grisel and Tom Augspurger prepared and delivered a great talk on the current state of the new Dask-ML project.
MyBinder and Bokeh Servers
Not a Dask change, but Min Ragan-Kelley showed how to run services other than Jupyter through mybinder.org. As an example, here is a repository that deploys a Bokeh server application with a single click.
I think that by composing with Binder, Min has effectively created a free-to-use hosted Bokeh server service. Presumably this same model could be adapted to other applications just as easily.
Dask and Automated Machine Learning with TPOT
Dask and TPOT developers are discussing how to parallelize the automated machine learning tool TPOT.
TPOT uses genetic algorithms to search over a space of scikit-learn style pipelines to automatically find a decently performing pipeline and model. This involves a fair amount of computation which Dask can help to parallelize out to multiple machines.
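As a sketch of what the integration under discussion might look like: the `use_dask=True` keyword below comes from the in-progress pull request mentioned above, so treat it as an interface being experimented with rather than a stable API.

```python
from dask.distributed import Client
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

client = Client()  # connect to a Dask cluster (here, a local one)

X_train, X_test, y_train, y_test = train_test_split(
    *load_digits(return_X_y=True))

# use_dask=True reflects the interface being discussed in the pull
# request above; it may change before release
tpot = TPOTClassifier(generations=5, population_size=20,
                      use_dask=True, verbosity=2)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
```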
Get involved
Trivial things work now, but to make this efficient we'll need to dive in a bit more deeply. Extending that pull request to descend within pipelines would be a good task if anyone wants to get involved; this would help to share intermediate results between pipelines.
Dask and Scikit-Optimize
Among various features, Scikit-Optimize offers a BayesSearchCV object that is like Scikit-Learn's GridSearchCV and RandomizedSearchCV, but is a bit smarter about how to choose new parameters to test given previous results.
Hyper-parameter optimization is a low-hanging fruit for Dask-ML workloads today,
so we investigated how the project might help here.
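As a minimal sketch of the kind of experiment involved, assuming a running Dask cluster and using the Dask joblib backend (the estimator and search space here are illustrative):

```python
import joblib
from dask.distributed import Client
from sklearn.datasets import load_digits
from sklearn.svm import SVC
from skopt import BayesSearchCV

client = Client()  # connect to a Dask cluster

digits = load_digits()

# Illustrative search space; BayesSearchCV picks new candidates
# based on the results of previous ones
search = BayesSearchCV(
    SVC(),
    {'C': (1e-3, 1e3, 'log-uniform'),
     'gamma': (1e-4, 1e1, 'log-uniform')},
    n_iter=32)

# Route the search's internal joblib parallelism out to the cluster
with joblib.parallel_backend('dask'):
    search.fit(digits.data, digits.target)
```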
So far we’re just experimenting with Scikit-Learn/Dask integration through joblib to see what opportunities there are. Discussion among Dask and Scikit-Optimize developers is happening here:
Centralize PyData/SciPy tutorials on Binder
We’re putting a number of the PyData/SciPy tutorials on Binder, and hope to embed snippets of the YouTube videos into the notebooks themselves.
This effort lives here:
Motivation
The PyData and SciPy community delivers tutorials as part of most conferences. This activity generates both educational Jupyter notebooks and explanatory videos that teach people how to use the ecosystem.
However, this content isn’t very discoverable after the conference. People can search on YouTube for their topic of choice and hopefully find a link to the notebooks to download locally, but this is a somewhat noisy process. It’s not clear which tutorial to choose, and it’s difficult to match up the video with the notebooks during exercises. We’re probably not getting as much value out of these resources as we could be.
To help increase access we’re going to try a few things:
- Produce a centralized website with links to recent tutorials delivered for each topic
- Ensure that those notebooks run easily on Binder
- Embed sections of the talk on YouTube within each notebook so that the explanation of each section is tied to the exercises
Get involved
This only really works long-term under a community maintenance model. So far we’ve only done a few hours of work and there is still plenty to do in the following tasks:
- Find good tutorials for inclusion
- Ensure that they work well on mybinder.org:
  - are self-contained and don’t rely on external scripts to run
  - have an environment.yml or requirements.txt (see the sketch after this list)
  - don’t require a lot of resources
- Find the video for each tutorial
- Submit a pull request to the tutorial repository that embeds a link to the YouTube talk, cued to the proper time, in the top cell of each notebook
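For reference, a minimal environment.yml that mybinder.org can consume might look like the following (the package list is a placeholder for whatever the tutorial actually needs):

```yaml
# environment.yml -- read by mybinder.org to build the tutorial image
# (illustrative package list; pin whatever the tutorial requires)
name: tutorial
channels:
  - conda-forge
dependencies:
  - python=3.6
  - numpy
  - pandas
  - dask
  - matplotlib
  - jupyterlab
```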
Dask, Actors, and Ray
I really enjoyed the talk on Ray, another distributed task scheduler for Python. I suspect that Dask will steal Ray’s ideas about actors for stateful operation. I hope that Ray takes on ideas about using standard Python interfaces so that more of the community can adopt it more quickly. I encourage people to check out the talk and give Ray a try. It’s pretty slick.
Planning conversations for Dask-ML
Dask and Scikit-learn developers had the opportunity to sit down again and raise a number of issues to help plan near-term development. This focused mostly around building important case studies to motivate future development, and identifying algorithms and other projects to target for near-term integration.
Case Studies
- What is the purpose of a case study: dask/dask-ml #302
- Case study: Sparse Criteo Dataset: dask/dask-ml #295
- Case study: Large scale text classification: dask/dask-ml #296
- Case study: Transfer learning from pre-trained model: dask/dask-ml #297
Algorithms
- Gradient boosted trees with Numba: dask/dask-ml #299
- Parallelize Scikit-Optimize for hyperparameter optimization: dask/dask-ml #300
Get involved
We could use help in building out case studies to drive future development in the project. There are also several algorithmic places to get involved. Dask-ML is a young and fast-moving project with many opportunities for new developers to get involved.
Dask and UMAP for low-dimensional embeddings
Leland McInnes gave a great talk, Uniform Manifold Approximation and Projection for Dimensionality Reduction, in which he lays out a well-founded algorithm for dimensionality reduction, similar to PCA or t-SNE, but with some nice properties. He worked together with some Dask developers, and we identified some challenges due to Dask array slicing with random-ish slices.
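To make the challenge concrete, here is a small sketch (with made-up sizes) of the kind of random-ish fancy indexing such algorithms perform, which is currently awkward for Dask arrays because each output chunk may touch many input chunks:

```python
import numpy as np
import dask.array as da

# A large chunked array, standing in for a dataset to embed
x = da.random.random((1_000_000, 50), chunks=(100_000, 50))

# Random-ish row selection, as a nearest-neighbor-style algorithm
# might request; each output chunk can pull from many input chunks,
# which makes this pattern expensive for the scheduler today
idx = np.random.randint(0, x.shape[0], size=10_000)
subset = x[idx]

result = subset.mean(axis=0).compute()  # illustrative follow-on work
```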
A proposal to fix this problem lives here, if anyone wants a fun problem to work on:
Dask stories
We soft-launched Dask Stories, a webpage and project to collect and share stories about how people use Dask in practice. We’re also publishing a separate blogpost about this today.
See blogpost: Who uses Dask?
If you use Dask and want to share your story, we would absolutely welcome your experience. Having people like yourself share how they use Dask is incredibly important for the project.