Databricks provides a Unified Analytics Platform that accelerates innovation by unifying data science, engineering and business.
You can try it if you register here. There are two options:
I suggest you start trying it with the **Community Edition**.
To use Databricks you will first need to create a cluster. The available options change depending on the Databricks version you are using.
With this version you can configure more things.
The default mode is **Standard** and works well for a single user. For multi-user clusters use the **High Concurrency** mode. If you want to use Scala you must use the **Standard** mode, since High Concurrency clusters do not support Scala.
Databricks runtimes are the set of core components that run on Databricks clusters. The runtime determines the Spark version as well as some pre-installed libraries.
You can read more about each Databricks runtime in the [Databricks release notes](https://docs.databricks.com/release-notes/runtime/supported.html#release-notes).
Use Python 3. Python 2 is deprecated, so please do not use it.
The first parameter is **autoscaling**. Enabling it allows the cluster to create more workers when needed; of course, this means the cost might increase as more machines are used.
There is also the **Terminate after** parameter, which sets the minutes of inactivity before the cluster shuts down.
Under **Advanced Options** / **Spark** you can set environment variables. This is really useful for doing configuration at the cluster level.
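Once the cluster starts, any notebook attached to it can read those variables with plain Python. A minimal sketch (the variable name here is hypothetical):

```python
import os

# Environment variables defined under Advanced Options / Spark are
# exported on every machine of the cluster, so notebooks can read them.
# "MY_STORAGE_ACCOUNT" is a made-up name for illustration.
storage_account = os.environ.get("MY_STORAGE_ACCOUNT", "not-set")
print(storage_account)
```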
If you are using the Community Edition of Databricks, it is really easy to create a new cluster. You only need to enter a cluster name and choose the Databricks runtime version.
**Advanced Options** / **Permissions** is where permissions are set.
The available options are:
There are multiple ways of installing python libraries:
**Libraries** section in the cluster configuration
Each Databricks runtime comes with some basic libraries pre-installed. The good part is that those libraries are available out of the box, but the cost is that it is not easy to change their default versions.
For libraries that need different configuration on the driver and the worker machines, it is useful to install them using an init script. dask is a good example, since it needs to be started on both the workers and the driver.
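One possible sketch, assuming the DBFS path and the pip location are right for your runtime, is to write the script from a notebook with `dbutils.fs.put` and then reference it under **Advanced Options** / **Init Scripts** in the cluster configuration:

```python
# Hypothetical sketch: store an init script in DBFS.
# The script runs on every node (driver and workers) at cluster start.
dbutils.fs.put(
    "dbfs:/init-scripts/install-dask.sh",
    """#!/bin/bash
/databricks/python/bin/pip install dask
""",
    overwrite=True,
)
```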
The default way to install new libraries should be the **Libraries** section of the cluster configuration. From there you can install eggs from a blob storage or directly from PyPI.
You can see an example of the libraries installed on a cluster:
It is not possible to install python
And lastly you should avoid installing libraries using
This also won't work if you are using a conda Databricks runtime.
There are several kinds of logs:
In this section it is possible to view the actions performed on the cluster. For example, a cluster start will be registered here.
These are the logs written by the driver machine. The **standard output** and **standard error** logs are self-explanatory, while the **log4j** output is the logging output that Spark uses.
This allows you to access the stderr logs of each Spark worker.
The way to work with Databricks is through notebooks. They are really similar to Jupyter notebooks, so it should be easy if you already know them.
Once you create a new notebook you will need to set:
You can later rename the notebook or attach it to another cluster if needed.
You can work with multiple languages in the same notebook even though there is a default one.
To do so you only need to use magic commands, such as `%python` to work with Python. There is no need to use any magic command for the default language.
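For example, in a notebook whose default language is Python, a cell can run SQL by starting with the `%sql` magic (the table name here is made up):

```sql
%sql
-- This whole cell runs as SQL thanks to the magic command
SELECT * FROM my_table
```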
The languages you can use are:
One of the great things about Databricks is that it comes with Spark installed, configured, and already imported.
To use Spark you can simply call the `spark` object. For example, in Python you could run `spark.sql("SELECT * FROM my_table")`.
Databricks comes with a utility class (`dbutils`) that has 3 main functionalities:
This allows you to interact with secrets. In Azure you can connect it to a Key Vault so that you can retrieve secrets from there.
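A minimal sketch, assuming an already-configured secret scope (the scope and key names are made up):

```python
# Retrieve a secret from a pre-configured secret scope.
# "kv-scope" and "storage-key" are hypothetical names.
storage_key = dbutils.secrets.get(scope="kv-scope", key="storage-key")
```

Secret values are redacted in the notebook output, so they never end up printed in plain text.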
This allows you to write and read files. It is also used to mount external storage units such as a data lake or a blob.
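As a sketch, mounting an Azure Blob container could look like this (the container, account, scope and key names are all made up):

```python
# Hypothetical example: mount an Azure Blob container under /mnt/blob.
dbutils.fs.mount(
    source="wasbs://mycontainer@myaccount.blob.core.windows.net",
    mount_point="/mnt/blob",
    extra_configs={
        "fs.azure.account.key.myaccount.blob.core.windows.net":
            dbutils.secrets.get(scope="kv-scope", key="storage-key")
    },
)
```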
You can create custom widgets that allow you to interact with the notebook. For example, you can create a dropdown that sets the value of a Python variable.
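A minimal sketch (the widget name and its values are made up):

```python
# Create a dropdown widget named "env" with a default value of "dev".
dbutils.widgets.dropdown("env", "dev", ["dev", "staging", "prod"])

# Read the currently selected value into a regular Python variable.
env = dbutils.widgets.get("env")
```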
This allows you to install packages. However, I would advise managing libraries with the **Libraries** section of the cluster configuration.
If you are using a conda environment
The `display` function is mainly used to show tables. You can pass a Spark or pandas dataframe and it will be nicely rendered. It also allows you to create plots based on the dataframe.
`displayHTML`, as the name suggests, allows you to display HTML content. It is the equivalent of an `iframe` in HTML.
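For example (the HTML content is made up):

```python
# Render arbitrary HTML in the notebook output cell.
displayHTML("<h3>Report</h3><p>Everything looks <b>good</b>.</p>")
```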
Databricks comes with a folder where you can store data.
It is under the path
However, I suggest that you mount your own external storage unit and store the data there.
One important aspect of reading and writing data is getting the paths right.
Imagine that you have a delta table stored in `/mnt/blob/data.delta`. To read it in the different languages you would do:
| Language | Query |
|----------|-------|
| SQL | ``SELECT * FROM delta.`/mnt/blob/data.delta` `` |
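In Python, a sketch of the equivalent (assuming the cluster's runtime supports the Delta format) would be:

```python
# Read the Delta table with the DataFrame reader API...
df = spark.read.format("delta").load("/mnt/blob/data.delta")

# ...or through Spark SQL, matching the SQL row above.
df = spark.sql("SELECT * FROM delta.`/mnt/blob/data.delta`")
```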
You can schedule jobs from the **Jobs** section. It works like the Linux cron. However, it should only be used for small tasks, since it is not a full orchestrator like Airflow or Azure Data Factory (ADF).