
Commit 4325cf5

test(v1): More testing, minor documentation updates (#94)
1 parent 6475104 commit 4325cf5

9 files changed, +212 -48 lines changed

.github/workflows/pypi_deploy.yaml

Lines changed: 3 additions & 3 deletions
@@ -9,15 +9,15 @@ jobs:
     steps:
       - uses: actions/checkout@v2
 
-      - name: Set up Python 3.9
+      - name: Set up Python 3.10
         uses: actions/setup-python@v2
         with:
-          python-version: 3.9
+          python-version: 3.10
 
       - name: Install Poetry
         uses: abatilo/actions-poetry@v2.1.4
         with:
-          poetry-version: 1.2.0b1
+          poetry-version: 1.3.2
 
       - name: Install airflow-dbt-python with Poetry
         run: poetry install

.github/workflows/tagged_release.yml

Lines changed: 3 additions & 3 deletions
@@ -17,14 +17,14 @@ jobs:
           checkName: CI
           ref: ${{ github.sha }}
       - uses: actions/checkout@v2.3.4
-      - name: Set up Python 3.9
+      - name: Set up Python 3.10
         uses: actions/setup-python@v2
         with:
-          python-version: '3.9'
+          python-version: '3.10'
       - name: Install Poetry
         uses: abatilo/actions-poetry@v2.1.4
         with:
-          poetry-version: 1.2.2
+          poetry-version: 1.3.2
       - name: Install airflow-dbt-python with Poetry
         run: poetry install
       - name: Build airflow-dbt-python with Poetry

README.md

Lines changed: 44 additions & 30 deletions
@@ -24,9 +24,11 @@ Before using *airflow-dbt-python*, ensure you meet the following requirements:
 * Running Python 3.7 or later in your Airflow environment.
 
 > **Warning**
+>
 > Even though we don't impose any upper limits on versions of Airflow and *dbt*, it's possible that new versions are not supported immediately after release, particularly for *dbt*. We recommend testing the latest versions before upgrading and [reporting any issues](https://github.com/tomasfarias/airflow-dbt-python/issues/new/choose).
 
 > **Note**
+>
 > Older versions of Airflow and *dbt* may work with *airflow-dbt-python*, although we cannot guarantee this. Our testing pipeline runs the latest *dbt-core* with the latest Airflow release, and the latest version supported by [AWS MWAA](https://aws.amazon.com/managed-workflows-for-apache-airflow/).
 
 ## From PyPI
@@ -58,17 +60,23 @@ And installing with *Poetry*:
 poetry install
 ```
 
-## In MWAA:
+## In AWS MWAA
 
 Add *airflow-dbt-python* to your `requirements.txt` file and edit your Airflow environment to use this new `requirements.txt` file, or upload it as a plugin.
 
 Read the [documentation](https://airflow-dbt-python.readthedocs.io/en/latest/getting_started.html#installing-in-mwaa) for a more detailed AWS MWAA installation breakdown.
 
+## In other managed services
+
+*airflow-dbt-python* should be compatible with most or all Airflow managed services. Consult the documentation specific to your provider.
+
+If you notice an issue when installing *airflow-dbt-python* in a specific managed service, please open an [issue](https://github.com/tomasfarias/airflow-dbt-python/issues/new/choose).
+
 # Features
 
-*airflow-dbt-python* aims to make dbt a **first-class citizen** of Airflow by supporting additional features that integrate both tools. As you would expect, *airflow-dbt-python* can run all your dbt workflows in Airflow with the same interface you are used to from the CLI, but without being a mere wrapper: *airflow-dbt-python* directly interfaces with internal *dbt-core* classes, bridging the gap between them and Airflow's operator interface.
+*airflow-dbt-python* aims to make dbt a **first-class citizen** of Airflow by supporting additional features that integrate both tools. As you would expect, *airflow-dbt-python* can run all your dbt workflows in Airflow with the same interface you are used to from the CLI, but without being a mere wrapper: *airflow-dbt-python* directly communicates with internal *dbt-core* classes, bridging the gap between them and Airflow's operator interface. Essentially, we are attempting to use *dbt* **as a library**.
 
-As this integration was completed, several features were developed to **extend the capabilities of dbt** to leverage Airflow as much as possible. Can you think of a way *dbt* could leverage Airflow that is not currently supported? Let us know in a [GitHub issue](https://github.com/tomasfarias/airflow-dbt-python/issues/new/choose)! The current list of supported features is as follows:
+As this integration was completed, several features were developed to **extend the capabilities of dbt** to leverage Airflow as much as possible. Can you think of a way *dbt* could leverage Airflow that is not currently supported? Let us know in a [GitHub issue](https://github.com/tomasfarias/airflow-dbt-python/issues/new/choose)!
 
 ## Independent task execution
 
@@ -77,36 +85,39 @@ Airflow executes [Tasks](https://airflow.apache.org/docs/apache-airflow/stable/c
 In order to work with this constraint, *airflow-dbt-python* runs each dbt command in a **temporary and isolated directory**. Before execution, all the relevant dbt files are copied from supported backends, and after executing the command any artifacts are exported. This ensures dbt can work with any Airflow deployment, including most production deployments as they are usually running [Remote Executors](https://airflow.apache.org/docs/apache-airflow/stable/executor/index.html#executor-types) and do not guarantee any files will be shared by default between tasks, since each task may run in a completely different environment.
 
 
-## Download dbt files from S3
+## Download dbt files from a remote storage
+
+The dbt parameters `profiles_dir` and `project_dir` would normally point to a directory containing a `profiles.yml` file and a dbt project in the local environment respectively (defined by the presence of a *dbt_project.yml* file). *airflow-dbt-python* extends these parameters to also accept a URL pointing to a remote storage.
+
+Currently, we support the following remote storages:
 
-The dbt parameters `profiles_dir` and `project_dir` would normally point to a directory containing a `profiles.yml` file and a dbt project in the local environment respectively (defined by the presence of a *dbt_project.yml* file). *airflow-dbt-python* extends these parameters to also accept an [AWS S3](https://aws.amazon.com/s3/) URL (identified by a *s3* scheme):
+* [AWS S3](https://aws.amazon.com/s3/) (identified by a *s3* scheme).
+* Remote git repositories, like those stored in GitHub (both *https* and *ssh* schemes are supported).
 
-* If an S3 URL is used for `profiles_dir`, then this URL must point to a directory in S3 that contains a *profiles.yml* file. The *profiles.yml* file will be downloaded and made available for the operator to use when running.
-* If an S3 URL is used for `project_dir`, then this URL must point to a directory in S3 containing all the files required for a dbt project to run. All of the contents of this directory will be downloaded and made available for the operator. The URL may also point to a zip file containing all the files of a dbt project, which will be downloaded, uncompressed, and made available for the operator.
+* If a remote URL is used for `project_dir`, then this URL must point to a location in your remote storage containing a *dbt* project to run. A *dbt* project is identified by the presence of a *dbt_project.yml*, and contains all your [resources](https://docs.getdbt.com/docs/build/projects). All of the contents of this remote location will be downloaded and made available for the operator. The URL may also point to an archived file containing all the files of a dbt project, which will be downloaded, uncompressed, and made available for the operator.
+* If a remote URL is used for `profiles_dir`, then this URL must point to a location in your remote storage that contains a *profiles.yml* file. The *profiles.yml* file will be downloaded and made available for the operator to use when running. The *profiles.yml* may be part of your *dbt* project, in which case this argument may be omitted.
 
 This feature is intended to work in line with Airflow's [description of the task concept](https://airflow.apache.org/docs/apache-airflow/stable/concepts/tasks.html#relationships):
 
 > Tasks don’t pass information to each other by default, and run entirely independently.
 
-In our world, that means task should be responsible of fetching all the dbt related files it needs in order to run independently, as already described in [Independent Task Execution](#independent-task-execution).
-
-As of the time of writing S3 is the only supported backend for dbt projects, but we have plans to extend this to support more backends, initially targeting other file storages that are commonly used in Airflow connections.
+We interpret this as meaning a task should be responsible for fetching all the *dbt* related files it needs in order to run independently, as already described in [Independent Task Execution](#independent-task-execution).
 
 ## Push dbt artifacts to XCom
 
 Each dbt execution produces one or more [JSON artifacts](https://docs.getdbt.com/reference/artifacts/dbt-artifacts/) that are valuable to produce meta-metrics, build conditional workflows, for reporting purposes, and other uses. *airflow-dbt-python* can push these artifacts to [XCom](https://airflow.apache.org/docs/apache-airflow/stable/concepts/xcoms.html) as requested via the `do_xcom_push_artifacts` parameter, which takes a list of artifacts to push.
 
 ## Use Airflow connections as dbt targets (without a profiles.yml)
 
-[Airflow connections](https://airflow.apache.org/docs/apache-airflow/stable/howto/connection.html) allow users to manage and store connection information, such as hostname, port, user name, and password, for operators to use when accessing certain applications, like databases. Similarly, a dbt `profiles.yml` file stores connection information under each target key. *airflow-dbt-python* bridges the gap between the two and allows you to use connection information stored as an Airflow connection by specifying the connection id as the `target` parameter of any of the dbt operators it provides. What's more, if using an Airflow connection, the `profiles.yml` file may be entirely omitted (although keep in mind a `profiles.yml` file contains a configuration block besides target connection information).
+[Airflow connections](https://airflow.apache.org/docs/apache-airflow/stable/howto/connection.html) allow users to manage and store connection information, such as hostname, port, user name, and password, for operators to use when accessing certain applications, like databases. Similarly, a *dbt* `profiles.yml` file stores connection information under each target key. *airflow-dbt-python* bridges the gap between the two and allows you to use connection information stored as an Airflow connection by specifying the connection id as the `target` parameter of any of the *dbt* operators it provides. What's more, if using an Airflow connection, the `profiles.yml` file may be entirely omitted (although keep in mind a `profiles.yml` file contains a configuration block besides target connection information).
 
 See an example DAG [here](examples/airflow_connection_target_dag.py).
 
 # Motivation
 
 ## Airflow running in a managed environment
 
-Although [`dbt`](https://docs.getdbt.com/) is meant to be installed and used as a CLI, we may not have control of the environment where Airflow is running, disallowing us the option of using `dbt` as a CLI.
+Although [`dbt`](https://docs.getdbt.com/) is meant to be installed and used as a CLI, we may not have control of the environment where Airflow is running, disallowing us the option of using *dbt* as a CLI.
 
 This is exactly what happens when using [Amazon's Managed Workflows for Apache Airflow](https://aws.amazon.com/managed-workflows-for-apache-airflow/) or MWAA: although a list of Python requirements can be passed, the CLI cannot be found in the worker's PATH.
 
@@ -122,11 +133,11 @@ operator = BashOperator(
 )
 ```
 
-But it can get sloppy when appending all potential arguments a `dbt run` command (or other subcommand) can take.
+But it can get cumbersome when appending all potential arguments a `dbt run` command (or other subcommand) can take.
 
 That's where *airflow-dbt-python* comes in: it abstracts the complexity of interfacing with *dbt-core* and exposes one operator for each *dbt* subcommand that can be instantiated with all the corresponding arguments that the *dbt* CLI would take.
 
-## An alternative to *airflow-dbt* that works without the dbt CLI
+## An alternative to *airflow-dbt* that works without the *dbt* CLI
 
 The alternative [`airflow-dbt`](https://pypi.org/project/airflow-dbt/) package, by default, would not work if the *dbt* CLI is not in PATH, which means it would not be usable in MWAA. There is a workaround via the `dbt_bin` argument, which can be set to `"python -c 'from dbt.main import main; main()' run"`, in similar fashion as the `BashOperator` example. Yet this approach is not without its limitations:
 * *airflow-dbt* works by wrapping the *dbt* CLI, which makes our code dependent on the environment in which it runs.
@@ -135,7 +146,7 @@ The alternative [`airflow-dbt`](https://pypi.org/project/airflow-dbt/) package,
 
 # Usage
 
-Currently, the following `dbt` commands are supported:
+Currently, the following *dbt* commands are supported:
 
 * `clean`
 * `compile`
@@ -153,34 +164,35 @@ Currently, the following `dbt` commands are supported:
 
 ## Examples
 
-All example DAGs are tested against against `apache-airflow==2.2.5`. Some changes, like modifying `import` statements or changing types, may be required for them to work in other versions.
+All example DAGs are tested against the latest Airflow version. Some changes, like modifying `import` statements or changing types, may be required for them to work in other versions.
 
 ``` python
-from datetime import timedelta
+import datetime as dt
 
+import pendulum
 from airflow import DAG
-from airflow.utils.dates import days_ago
+
 from airflow_dbt_python.operators.dbt import (
     DbtRunOperator,
     DbtSeedOperator,
-    DbtTestoperator,
+    DbtTestOperator,
 )
 
 args = {
-    'owner': 'airflow',
+    "owner": "airflow",
 }
 
 with DAG(
-    dag_id='example_dbt_operator',
+    dag_id="example_dbt_operator",
     default_args=args,
-    schedule_interval='0 0 * * *',
-    start_date=days_ago(2),
-    dagrun_timeout=timedelta(minutes=60),
-    tags=['example', 'example2'],
+    schedule="0 0 * * *",
+    start_date=pendulum.today("UTC").add(days=-1),
+    dagrun_timeout=dt.timedelta(minutes=60),
+    tags=["example", "example2"],
 ) as dag:
     dbt_test = DbtTestOperator(
         task_id="dbt_test",
-        selector_name=["pre-run-tests"],
+        selector_name="pre-run-tests",
     )
 
     dbt_seed = DbtSeedOperator(
@@ -201,16 +213,18 @@ with DAG(
 
 More examples can be found in the [`examples/`](examples/) directory and the [documentation](https://airflow-dbt-python.readthedocs.io).
 
-# Testing
+# Development
+
+See the [development documentation](https://airflow-dbt-python.readthedocs.io/en/latest/development.html) for a more in-depth dive into setting up a development environment, running the test suite, and general commentary on working on *airflow-dbt-python*.
 
-Tests are written using *pytest*, can be located in `tests/`, and they can be run locally with *Poetry*:
+## Testing
+
+Tests are run with *pytest* and can be found in `tests/`. To run them locally, you may use *Poetry*:
 
 ``` shell
 poetry run pytest tests/ -vv
 ```
 
-See development and testing instructions in the [documentation](https://airflow-dbt-python.readthedocs.io/en/latest/development.html).
-
 # License
 
 This project is licensed under the MIT license. See ![LICENSE](LICENSE).
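The README features above, a remote `project_dir`, an Airflow connection used as the dbt `target`, and artifact pushing via `do_xcom_push_artifacts`, combine naturally in a single operator call. A minimal sketch, assuming an S3 bucket layout, a `my_warehouse` Airflow connection id, and `select`/`full_refresh` arguments mirroring the dbt CLI flags; these names are illustrative and not taken from this commit:

``` python
import datetime as dt

import pendulum
from airflow import DAG

from airflow_dbt_python.operators.dbt import DbtRunOperator

with DAG(
    dag_id="example_dbt_remote_project",
    schedule=None,
    start_date=pendulum.today("UTC").add(days=-1),
    catchup=False,
    dagrun_timeout=dt.timedelta(minutes=60),
) as dag:
    dbt_run = DbtRunOperator(
        task_id="dbt_run",
        # The project is pulled from remote storage into a temporary, isolated directory.
        project_dir="s3://my-bucket/dbt/project/",
        # profiles.yml may live in remote storage too, or be omitted entirely when
        # `target` points at an Airflow connection.
        profiles_dir="s3://my-bucket/dbt/profiles/",
        # Airflow connection id used as the dbt target (illustrative name).
        target="my_warehouse",
        # Arguments mirror dbt CLI flags; check the operator signature of your version.
        select=["my_model+"],
        full_refresh=True,
        # Push this artifact to XCom after the run.
        do_xcom_push_artifacts=["run_results.json"],
    )
```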
airflow_dbt_python/hooks/dbt.py

Lines changed: 11 additions & 1 deletion
@@ -242,7 +242,17 @@ def run_dbt_task(
 
         saved_artifacts = {}
         for artifact in artifacts:
-            with open(Path(dbt_dir) / "target" / artifact) as artifact_file:
+            artifact_path = Path(dbt_dir) / "target" / artifact
+
+            if not artifact_path.exists():
+                self.log.warn(
+                    "Required dbt artifact %s was not found. "
+                    "Perhaps dbt failed and couldn't generate it.",
+                    artifact,
+                )
+                continue
+
+            with open(artifact_path) as artifact_file:
                 json_artifact = json.load(artifact_file)
 
             saved_artifacts[artifact] = json_artifact
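Because the hook now skips a missing artifact with a warning instead of failing, downstream code that reads a pushed artifact from XCom should treat it as optional. A minimal sketch of such a consumer, assuming the artifact is pushed under an XCom key equal to its file name and that the upstream task id is `dbt_run`; verify both against your DAG and installed version:

``` python
from airflow.decorators import task
from airflow.operators.python import get_current_context


@task
def summarize_run_results():
    """Summarize the run_results.json artifact pushed by an upstream dbt task."""
    ti = get_current_context()["ti"]
    # The key name is an assumption; check how your airflow-dbt-python version keys artifacts.
    run_results = ti.xcom_pull(task_ids="dbt_run", key="run_results.json")

    # The artifact may be absent if dbt failed before writing it.
    if run_results is None:
        return "no run_results.json artifact was produced"

    statuses = [result["status"] for result in run_results["results"]]
    return f"{len(statuses)} node results: " + ", ".join(statuses)
```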

docs/how_does_it_work.rst

Lines changed: 4 additions & 2 deletions
@@ -3,7 +3,7 @@
 How does it work?
 =================
 
-*airflow-dbt-python*'s main goal is to elevate *dbt* to a **first-class citizen** in Airflow. By this we mean that users of *dbt* can leverage as many Airflow features as possible, without breaking the assumptions that Airflow expects from any workflows it orchestrates. Perhaps more importantly, Airflow should **enhance** a *dbt* user's experience and not simply emulate the way they would run *dbt* in the command line.
+*airflow-dbt-python*'s main goal is to elevate *dbt* to **first-class citizen** status in Airflow. By this we mean that users of *dbt* can leverage as many Airflow features as possible, without breaking the assumptions that Airflow expects from any workflows it orchestrates. Perhaps more importantly, Airflow should **enhance** a *dbt* user's experience and not simply emulate the way they would run *dbt* in the command line. This is what separates *airflow-dbt-python* from other alternatives like *airflow-dbt*, which simply wrap *dbt* CLI commands in ``BashOperator``.
 
 To achieve this goal *airflow-dbt-python* provides Airflow operators, hooks, and other utilities. Hooks in particular come in two flavors:
 
@@ -15,7 +15,9 @@ To achieve this goal *airflow-dbt-python* provides Airflow operators, hooks, and
 *dbt* as a library
 ------------------
 
-A lot of the code in *airflow-dbt-python* is required to provide a `wrapper <https://en.wikipedia.org/wiki/Adapter_pattern>`_ for *dbt*, as *dbt* only provides a CLI interface. There are `ongoing efforts <https://github.com/dbt-labs/dbt-core/issues/6356>`_ to provide a dbt library, which would significantly simplify our codebase.
+A lot of the code in *airflow-dbt-python* is required to provide a `wrapper <https://en.wikipedia.org/wiki/Adapter_pattern>`_ for *dbt*, as *dbt* only provides a CLI interface. There are `ongoing efforts <https://github.com/dbt-labs/dbt-core/issues/6356>`_ to provide a dbt library, which would significantly simplify our codebase. As of the time of development, these efforts are not in a state where we can use them, but we are keeping an eye on them for the future.
+
+Most of the code used to adapt *dbt* can be found in the utilities module, as some of our features require that we break some assumptions *dbt* makes when initializing. For example, we need to set up *dbt* to access project files stored remotely, or initialize all profile settings from an Airflow Connection.
 
 .. _dbt_operators:
 
Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
+"""Sample DAG which showcases dbt-lab's very own Jaffle Shop GitHub repo."""
+import datetime as dt
+
+import pendulum
+from airflow import DAG
+
+from airflow_dbt_python.operators.dbt import (
+    DbtRunOperator,
+    DbtSeedOperator,
+    DbtTestOperator,
+)
+
+with DAG(
+    dag_id="example_dbt_worflow_with_github",
+    schedule=None,
+    start_date=pendulum.today("UTC").add(days=-2),
+    catchup=False,
+    dagrun_timeout=dt.timedelta(minutes=60),
+) as dag:
+    # Project files will be pulled from "https://github.com/dbt-labs/jaffle_shop"
+    dbt_seed = DbtSeedOperator(
+        task_id="dbt_seed",
+        project_dir="https://github.com/dbt-labs/jaffle_shop",
+        target="github_connection",
+        do_xcom_push_artifacts=["run_results.json"],
+    )
+
+    dbt_run = DbtRunOperator(
+        task_id="dbt_run",
+        project_dir="https://github.com/dbt-labs/jaffle_shop",
+        target="github_connection",
+        do_xcom_push_artifacts=["run_results.json"],
+    )
+
+    dbt_test = DbtTestOperator(
+        task_id="dbt_test",
+        project_dir="https://github.com/dbt-labs/jaffle_shop",
+        target="github_connection",
+        do_xcom_push_artifacts=["run_results.json"],
+    )
+
+    dbt_seed >> dbt_run >> dbt_test

0 commit comments
