Data Engineering Zoomcamp FAQ


The purpose of this document is to capture frequently asked technical questions.

General course-related questions

When will the course start?

The course starts on 15th Jan 2023 at 18:00, with the first “Office Hours” live stream.
● Subscribe to the course public Google Calendar (it works from Desktop only).
● Register for the course before it starts using this link.
● Join the course Telegram channel with announcements.
● Don’t forget to register in DataTalks.Club's Slack and join the #course-data-engineering channel.

I have registered for the Data Engineering Zoomcamp. When can


I expect to receive the confirmation email?
You don't need it. You're accepted.

What is the video/zoom link to the stream for the first “Office
Hour”?
It should be posted in the announcements channel on Telegram before it begins. Also, you will
see it live on the DataTalksClub YouTube Channel.

I can’t attend the “Office Hours” live. Will it be recorded?

Yes! Every “Office Hours” session will be recorded, so you can watch it whenever you want.

What are the prerequisites for this course?


https://github.com/DataTalksClub/data-engineering-zoomcamp#prerequisites
What can I do before the course starts?

You can start by installing and setting up all the requirements:

● Google Cloud account
● Git and GitHub
● Google Cloud SDK
● Python 3 (installed with Anaconda)
● Terraform

Look over the prerequisites and syllabus to see if you are comfortable with these subjects.
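If you want a quick sanity check of the setup (an optional suggestion, not an official course step), you can print each tool’s version from a terminal:

python -V
git --version
docker --version
gcloud --version
terraform -version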

What are the homework deadlines and project deadlines?

You can find the latest and up-to-date deadlines here:


https://docs.google.com/spreadsheets/d/e/2PACX-1vQACMLuutV5rvXg5qICuJGL-yZqIV0FBD8
4CxPdC5eZHf8TfzB-CJT_3Mo7U7oGVTXmSihPgQxuuoku/pubhtml

Is the 20xx cohort going to be different from the 20xx cohort?


Yes. 🙂 (DE Zoomcamp was first launched in 2022.)

For 2023, the main difference is the orchestration tool: we will use Prefect and not Airflow. There will also be new homeworks.

For 2024 (source),

● we will again have a different orchestrator for the 2nd module: it'll be Mage instead of Prefect
● the Terraform videos might be re-recorded

Can I follow the course after it finishes?


Yes, we will keep all the materials after the course finishes, so you can follow the course at your
own pace after it finishes.

Can I follow the course in a self-paced mode and get a


certificate?
No, you can only get a certificate if you finish the course with a “live” cohort. We don't award certificates for the self-paced mode.

Can I get support if I take the course in the self-paced mode?


Yes, the slack channel remains open and you can ask questions there. But always check the
FAQ first, most likely all your questions are already answered here.

Why are we using GCP and not other cloud providers?


Because everyone has a Google account, GCP has a free trial period and gives $300 in credits to new users. Also, we are working with BigQuery, which is a part of GCP.

Note that to sign up for a free GCP account, you need to have a valid credit card.

Do I need to pay for the cloud services?


No, if you use GCP and take advantage of their free trial.

The GCP and other cloud providers are not available in some
countries. Is it possible to provide a guide to installing a home
lab?
You can do most of the course without a cloud. Almost everything we use (excluding BigQuery)
can be run locally. We won’t be able to provide guidelines for some things, but most of the
materials are runnable without GCP.

For everything in the course, there’s a local alternative. You could even do the whole course
locally.

I want to use AWS. May I do that?


Yes, you can. Just remember to adapt all the information in the videos to AWS. Besides, the final capstone will be evaluated based on the tasks: create a data pipeline and develop a visualisation.
Besides the “Office Hours”, are there other live Zoom calls?

We will probably have some calls during the Capstone period to clear up questions, but if that happens, it will be announced in advance.

I don’t want to watch the weekly videos or do homework. Can I still do the final capstone?

Yes :) You can do the final capstone and if you pass it, you will get a certificate.

Are we still using the NYC Trip data for January 2021 or are we
using the 2022 data?
We will use the same data, as the project will essentially remain the same as last year’s. The data is available here.

Is the 2022 repo deleted?


No, but we moved the 2022 stuff here

Can I use Airflow instead of the course’s orchestration tool for my final project?

Yes, you can use any tool you want for your project.

To complete the course, you'll build your data engineering pipeline from scratch using the knowledge acquired during the course:

・Workflow orchestration

・Data Warehousing

・Analytical engineering

・Batch processing

・Stream processing

Is it possible to use x tool instead of the one tool you use?


Yes, this applies if you want to use Airflow instead of Prefect, AWS or Snowflake instead of GCP
products or Tableau instead of Metabase or Google data studio.

The course covers 2 alternative data stacks, one using GCP and one using local installation of
everything. You can use either one of those, or use your tool of choice.

You do need to take into consideration that we can’t support you if you choose to use a different stack; you would also need to explain your different choices of tools in the peer review of your capstone project.

How can we contribute to the course?


Star the repo! Share it with friends if you find it useful ❣️

Is the course [Windows/mac/Linux/...] friendly?


Yes! Linux is ideal, but technically it should not matter. Students last year used all 3 OSes successfully.

Any books or additional resources you recommend?


Yes to both! Check out this document:

Awesome Data Engineering

Can I still join the course?


Yes, even if you don't submit the homeworks, you're still eligible for a certificate as long as you
successfully pass the project at the end.
Be aware, however, that there will be deadlines for turning in the final projects. So don't leave
everything for the last minute.

What is Project Cohort#1 and Project Cohort#2 exactly?


You will have two attempts for a project. If the final project deadline is over and you’re late or
you submit the project and fail the first attempt, you have another chance to submit the project
with the second attempt.

I'm using 2 email ids while working on the zoomcamp. While uploading the homework/projects, are you mapping progress to the email id shared as an input in the form, or the one used while being logged into Google (required to submit the form)?

You can use any email you want for the homeworks; it doesn't have to be the same as the one you used for signing up. Just make sure you use the same email for all the homeworks.

Alternative news source other than Slack


https://t.me/dezoomcamp

https://docs.google.com/document/d/1Bfp-K2hIovkETjeGsJOKl8Zo2dVyHY6SXIHyV5rkE0w

How to troubleshoot issues


The first step is to try to solve the issue on your own; get used to solving problems. This will be
a real life skill you need when employed.

1. What does the error say? There will often be a description of the error or instructions on
what is needed, I have even seen a link to the solution. Does it reference a specific line
of your code?
2. Restart the application or server/pc.
3. Google it. It is going to be rare that you are the first to have the problem, someone out
there has posted the issue and likely the solution. Search using: <technology> <problem
statement>. Example: pgcli error column c.relhasoids does not exist.
a. There are often different solutions for the same problem due to variation in
environments.
4. Check the tech’s documentation. Use its search if available, or use the browser’s search function.
5. Try uninstalling (this may remove the bad actor) and reinstalling the application, or redoing the action. Don’t forget to restart the server/pc after a reinstall.
a. Sometimes reinstalling fails to resolve the issue but works if you uninstall first.
6. Post your question to Stackoverflow. Be sure to read the Stackoverflow guide on posting
good questions.
a. https://stackoverflow.com/help/how-to-ask
b. This will be your real life ask an expert in the future (in addition to coworkers).
7. Ask in Slack
a. Before asking a question, check the FAQ (this document)
b. DO NOT use screenshots, especially don’t take pictures from a phone.
c. DO NOT tag instructors, it may discourage others from helping you. Copy and paste errors; if it’s long, just post it in a reply to your thread.
i. Use ``` for formatting your code.
d. Use the same thread for the conversation (that means reply to your own thread).
i. DO NOT create multiple posts to discuss the issue.
ii. You may create a new post if the issue reemerges down the road. Be
sure to describe what has changed in the environment.
e. Provide additional information in the same thread of the steps you have taken for
resolution.
8. Take a break and come back to it later. You will be amazed at how often you figure out
the solution after letting your brain rest. Get some fresh air, workout, play a video game,
watch a tv show, whatever allows your brain to not think about it for a little while or even
until the next day.
9. Remember technology issues in real life sometimes take days or even weeks to resolve.
10. If somebody helped you with your problem and it's not in the FAQ, please add it there.
It will help other students.

Week 1

Taxi Data - Yellow Taxi Trip Records downloading error, Error 403
or XML error webpage
When you try to download the 2021 data from the TLC website, you may get an XML error webpage if you click on the link, and “ERROR 403: Forbidden” on the terminal.

We have a backup, so use it instead: https://github.com/DataTalksClub/nyc-tlc-data


So the link should be
https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/yellow_tripdata_2021-0
1.csv.gz

Note: Make sure to unzip the “gz” file (no, the “unzip” command won’t work for this.)

“gzip -d file.gz”

Taxi Data - How to handle taxi data files, now that the files are
available as *.csv.gz?

The pandas read_csv function can read csv.gz files directly. So no need to change anything in
the script.
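For example (a minimal sketch; the file name and chunk size are just illustrative):

import pandas as pd

df_iter = pd.read_csv("yellow_tripdata_2021-01.csv.gz", iterator=True, chunksize=100000)
df = next(df_iter)  # compression is inferred from the .gz extension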

Taxi Data - Data Dictionary for NY Taxi data?


Yellow Trips:
https://www1.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf

Green Trips:
https://www1.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_green.pdf

How do I use Git / GitHub for this course?


After you create a GitHub account, you should clone the course repo to your local machine
using the process outlined in this video: Git for Everybody: How to Clone a Repository from
GitHub

Having this local repository on your computer will make it easy for you to access the instructors’
code and make pull requests (if you want to add your own notes or make changes to the course
content).
You will probably also create your own repositories that host your notes, versions of your file, to
do this. Here is a great tutorial that shows you how to do this:
https://www.atlassian.com/git/tutorials/setting-up-a-repository

Remember to ignore large database, .csv, and .gz files, as well as other files that should not be saved to a repository. Use .gitignore for this: https://www.atlassian.com/git/tutorials/saving-changes/gitignore. NEVER store passwords or keys in a git repo (even if that repo is set to private).

This is also a great resource: https://dangitgit.com/
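For reference, a minimal .gitignore sketch along the lines of the advice above (the patterns are only examples; adjust them to your repo layout):

ny_taxi_postgres_data/
*.csv
*.csv.gz
*.parquet
.env
*credentials*.json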


wget is not recognized as an internal or external command
If you get “wget is not recognized as an internal or external command”, you need to install it.

On Ubuntu, run:

$ sudo apt-get install wget

On MacOS, the easiest way to install wget is to use Brew:

$ brew install wget

On Windows, the easiest way to install wget is to use Chocolatey:

$ choco install wget

Or you can download a binary (https://gnuwin32.sourceforge.net/packages/wget.htm) and put it


to any location in your PATH (e.g. C:/tools/)

Also, you can follow these steps to install wget on MS Windows:

* Download the latest wget binary for Windows from [eternallybored](https://eternallybored.org/misc/wget/) (they are available as a zip with documentation, or just an exe).

* If you downloaded the zip, extract all (if the Windows built-in zip utility gives an error, use [7-zip](https://7-zip.org/)).

* Rename the file `wget64.exe` to `wget.exe` if necessary.

* Move wget.exe to your `Git\mingw64\bin\`.

Alternatively, you can use a Python wget library, but instead of simply using “wget” you’ll need to
use

python -m wget

You need to install it with pip first:


pip install wget

Alternatively, you can just paste the file URL into your web browser and download the file
normally that way. You’ll want to move the resulting file into your working directory.

It’s also worth looking at the Python requests library for downloading the gz file: https://pypi.org/project/requests
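A minimal sketch with requests (the URL is the backup link from the Taxi Data question above; the output file name is just an example):

import requests

url = "https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/yellow_tripdata_2021-01.csv.gz"

# stream the download so the whole file is not held in memory at once
with requests.get(url, stream=True) as resp:
    resp.raise_for_status()
    with open("yellow_tripdata_2021-01.csv.gz", "wb") as f:
        for chunk in resp.iter_content(chunk_size=1024 * 1024):
            f.write(chunk)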

wget - ERROR: cannot verify <website> certificate (MacOS)


First, make sure that you add “!” before wget if you’re running the command in a Jupyter Notebook. Then, you can check one of these two things (from the CLI):

1. Using the Python library wget you installed with pip, try python -m wget <url>

2. Write the usual command and add --no-check-certificate at the end. So it should
be:

!wget <website_url> --no-check-certificate

Git Bash - Backslash as an escape character in Git Bash for


Windows
For those who wish to use the backslash as an escape character in Git Bash for
Windows (as Alexey normally does), type in the terminal: bash.escapeChar=\ (no
need to include in .bashrc)

Docker - Cannot connect to Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?

Make sure you're able to start the Docker daemon, and check the issue immediately below.

And don’t forget to update WSL; in PowerShell the command is wsl --update
Docker - Error during connect: In the default daemon
configuration on Windows, the docker client must be run with
elevated privileges to connect.: Post:
"http://%2F%2F.%2Fpipe%2Fdocker_engine/v1.24/containers/cre
ate" : open //./pipe/docker_engine: The system cannot find the file
specified

As the official Docker for Windows documentation says, the Docker engine can either use the
Hyper-V or WSL2 as its backend. However, a few constraints might apply

● Windows 10 Pro / 11 Pro Users:

In order to use Hyper-V as its back-end, you MUST have it enabled first, which you can
do by following the tutorial: Enable Hyper-V Option on Windows 10 / 11

● Windows 10 Home / 11 Home Users:

On the other hand, users of the 'Home' version do NOT have the Hyper-V option enabled, which means you can only get Docker up and running using WSL2 (Windows Subsystem for Linux).

You can find the detailed instructions to do so here:
https://pureinfotech.com/install-wsl-windows-11/

In case, you run into another issue while trying to install WSL2
(WslRegisterDistribution failed with error: 0x800701bc), Make sure you update the
WSL2 Linux Kernel, following the guidelines here:

https://github.com/microsoft/WSL/issues/5393

Docker - I am trying to run postgres with volume mounting it ran


success and the database server started but i could not see the
files on my machine for persistency. I am using Ubuntu WSL this
is the code I used to run docker:
docker run -it \
-e POSTGRES_USER="root" \
-e POSTGRES_PASSWORD="root" \
-e POSTGRES_DB="ny_taxi" \
-v $(pwd)/ny_taxi_postgres_data:/var/lib/postgres/data \
-p 5432:5432 \
postgres:13

I had to do this to make it work; I got it from the README files in the repo.
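One likely culprit (an observation about the command above, not a confirmed diagnosis): the container path should be /var/lib/postgresql/data rather than /var/lib/postgres/data. A corrected sketch:

docker run -it \
  -e POSTGRES_USER="root" \
  -e POSTGRES_PASSWORD="root" \
  -e POSTGRES_DB="ny_taxi" \
  -v $(pwd)/ny_taxi_postgres_data:/var/lib/postgresql/data \
  -p 5432:5432 \
  postgres:13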

Docker - Pull access denied for dbpage/pgadmin4, repository does not exist or may require 'docker login': denied: requested access to the resource is denied

Whenever a `docker pull` is performed (either manually or by `docker-compose up`), it attempts to fetch the given image name (pgadmin4, for the example above) from a repository (dbpage).
IF the repository is public, the fetch and download happens without any issue whatsoever.

For instance:
● docker pull postgres:13
● docker pull dpage/pgadmin4

BE ADVISED:

The Docker Images we'll be using throughout the Data Engineering Zoomcamp are all public
(except when or if explicitly said otherwise by the instructors or co-instructors).

Meaning: you are NOT required to perform a docker login to fetch them.

So if you get the message above saying "docker login': denied: requested access to the resource is denied", that is most likely due to a typo in your image name:

For instance:

$ docker pull dbpage/pgadmin4

● Will throw that exception telling you "repository does not exist or may require 'docker
login'

$ docker pull dbpage/pgadmin4

Using default tag: latest

Error response from daemon: pull access denied for dbpage/pgadmin4, repository does not exist or
may require 'docker login': denied: requested access to the resource is denied

● But that actually happened because the actual image is dpage/pgadmin4 and NOT
dbpage/pgadmin4

How to fix it:


$ docker pull dpage/pgadmin4

Using default tag: latest

latest: Pulling from dpage/pgadmin4


a9eaa45ef418: Already exists
942bbf3d7389: Pull complete
fbe23c71dc3b: Pull complete
7c1be9e99602: Pull complete
ccc31a15f27f: Pull complete
617b6e01309f: Pull complete
e6cfa0ba7132: Pull complete
9dd539b143fa: Pull complete
6f3ff58d53db: Pull complete
a79e40a556fb: Pull complete
b05884a10df3: Pull complete
3a39531f7518: Pull complete
0337d3baf297: Pull complete
c7a9de9c5d61: Pull complete

Digest: sha256:79b2d8da14e537129c28469035524a9be7cfe9107764cc96781a166c8374da1f
Status: Downloaded newer image for dpage/pgadmin4:latest
docker.io/dpage/pgadmin4:latest

EXTRA NOTES:

In the real world, occasionally, when you're working for a company or closed organisation, the
Docker image you're trying to fetch might be under a private repo that your DockerHub
Username was granted access to.

For which cases, you must first execute:


$ docker login

● Fill in the details of your username and password.


● And only then perform the `docker pull` against that private repository

Docker - Error response from daemon: invalid mode: \Program


Files\Git\var\lib\postgresql\data.

Change the mounting path. Replace it with one of following:

● -v /e/zoomcamp/...:/var/lib/postgresql/data
● -v /c:/.../ny_taxi_postgres_data:/var/lib/postgresql/data\ (leading
slash in front of c:)
Docker - Docker won't start or is stuck in settings (Windows 10 /
11)

- First off, make sure you're running the latest version of Docker for Windows, which you
can download from here. Sometimes using the menu to "Upgrade" doesn't work (which
is another clear indicator for you to uninstall, and reinstall with the latest version)

- If docker is stuck on starting, first try to switch containers by right clicking the docker
symbol from the running programs and switch the containers from windows to linux or
vice versa

- [Windows 10 / 11 Pro Edition] The Pro Edition of Windows can run Docker either by
using Hyper-V or WSL2 as its backend (Docker Engine)

- In order to use Hyper-V as its back-end, you MUST have it enabled first, which
you can do by following the tutorial: Enable Hyper-V Option on Windows 10 / 11

- If you opt-in for WSL2, you can follow the same steps as detailed in the tutorial
here

- [Windows 10 / 11 Home Edition] If you're running a Home Edition, you can still make
it work with WSL2 (Windows Subsystem for Linux) by following the tutorial here

If even after making sure your WSL2 (or Hyper-V) is set up accordingly, Docker remains stuck,
you can try the option to Reset to Factory Defaults or do a fresh install.

Docker - Should I run docker commands from the windows file


system or a file system of a Linux distribution in WSL?
It is recommended by the Docker docs to store all code in your default Linux distro to get the
best out of file system performance (since Docker runs on WSL2 backend by default for
Windows 10 Home / Windows 11 Home users).

More info in the Docker Docs on Best Practises

Docker - The input device is not a TTY (Docker run for Windows)
You may have this error:

$ docker run -it ubuntu bash

the input device is not a TTY. If you are using mintty, try prefixing the command with 'winpty'
Solution:

Use winpty before docker command (source)

$ winpty docker run -it ubuntu bash

You also can make an alias:


echo "alias docker='winpty docker'" >> ~/.bashrc

OR

echo "alias docker='winpty docker'" >> ~/.bash_profile

Docker - Cannot pip install on Docker container (Windows)


You may have this error:

Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7efe331cf790>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution')': /simple/pandas/

Possible solution might be:

$ winpty docker run -it --dns=8.8.8.8 --entrypoint=bash python:3.9

Docker - Setting up Docker on Mac


Check this article for details - Setting up docker in macOS

From researching, it seems this method might be out of date; since Docker changed their licensing model, the above is a bit hit and miss. What worked for me was to just go to the Docker website and download their dmg. I haven’t had an issue with that method.
Docker - Could not change permissions of directory
"/var/lib/postgresql/data": Operation not permitted
$ docker run -it\
-e POSTGRES_USER="root" \
-e POSTGRES_PASSWORD="admin" \
-e POSTGRES_DB="ny_taxi" \
-v "/mnt/path/to/ny_taxi_postgres_data":"/var/lib/postgresql/data" \
-p 5432:5432 \
postgres:13

The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

The database cluster will be initialized with locale "en_US.utf8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

Data page checksums are disabled.

fixing permissions on existing directory /var/lib/postgresql/data ... initdb:
error: could not change permissions of directory "/var/lib/postgresql/data":
Operation not permitted

One way to solve this issue is to create a local docker volume and map it to postgres data
directory /var/lib/postgresql/data

$ docker volume create --name dtc_postgres_volume_local -d local


$ docker run -it \
-e POSTGRES_USER="root" \
-e POSTGRES_PASSWORD="root" \
-e POSTGRES_DB="ny_taxi" \
-v dtc_postgres_volume_local:/var/lib/postgresql/data \
-p 5432:5432 \
postgres:13

An alternate error could be:

initdb: error: directory "/var/lib/postgresql/data" exists but is not empty
If you want to create a new database system, either remove or empty
the directory "/var/lib/postgresql/data" or run initdb
with an argument other than "/var/lib/postgresql/data".
Docker - invalid reference format: repository name must be
lowercase (Mounting volumes with Docker on Windows)
Mapping volumes on Windows could be tricky. The way it was done in the course video doesn’t
work for everyone.

First, if you have spaces in the path, move your data to some folder without spaces. E.g. if your
code is in “C:/Users/Alexey Grigorev/git/…”, move it to “C:/git/…”

Try replacing the “-v” part with one of the following options:

● -v /c:/some/path/ny_taxi_postgres_data:/var/lib/postgresql/data
● -v //c:/some/path/ny_taxi_postgres_data:/var/lib/postgresql/data
● -v /c/some/path/ny_taxi_postgres_data:/var/lib/postgresql/data
● -v //c/some/path/ny_taxi_postgres_data:/var/lib/postgresql/data
● --volume //driveletter/path/ny_taxi_postgres_data/:/var/lib/postgresql/data

Try adding winpty before the whole command

winpty docker run -it


-e POSTGRES_USER="root"
-e POSTGRES_PASSWORD="root"
-e POSTGRES_DB="ny_taxi"
-v /c:/some/path/ny_taxi_postgres_data:/var/lib/postgresql/data
-p 5432:5432
postgres:13

Try adding quotes:

● -v "/c:/some/path/ny_taxi_postgres_data:/var/lib/postgresql/data"
● -v "//c:/some/path/ny_taxi_postgres_data:/var/lib/postgresql/data"
● -v “/c/some/path/ny_taxi_postgres_data:/var/lib/postgresql/data"
● -v "//c/some/path/ny_taxi_postgres_data:/var/lib/postgresql/data"
● -v "c:\some\path\ny_taxi_postgres_data":/var/lib/postgresql/data

Note: (Windows) if it automatically creates a folder called “ny_taxi_postgres_data;C”, that suggests you have problems with volume mapping; try deleting both folders and replacing the “-v” part with the other options. For me “//c/” works instead of “/c/”, and it will then work by automatically creating a correct folder called “ny_taxi_postgres_data”.

A possible solution to this error would be to use /”$(pwd)”/ny_taxi_postgres_data:/var/lib/postgresql/data (with the quotes’ position varying as in the above list).

For Windows, the following command works fine:

● -v /”$(pwd)”/ny_taxi_postgres_data:/var/lib/postgresql/data

Important: note how the quotes are placed.

If none of these options work, you can use a volume name instead of the path:

● -v ny_taxi_postgres_data:/var/lib/postgresql/data

For Mac: You can wrap $(pwd) with quotes:

docker run -it \
-e POSTGRES_USER="root" \
-e POSTGRES_PASSWORD="root" \
-e POSTGRES_DB="ny_taxi" \
-v "$(pwd)"/ny_taxi_postgres_data:/var/lib/postgresql/data \
-p 5432:5432 \
postgres:13

Source:https://stackoverflow.com/questions/48522615/docker-error-invalid-reference-format-rep
ository-name-must-be-lowercase

Docker - build error: error checking context: 'can't stat


'/home/user/repos/data-engineering/week_1_basics_n_setup/2_d
ocker_sql/ny_taxi_postgres_data''.
This error appeared when running the command: docker build -t taxi_ingest:v001 .

When feeding the database with the data, the user id of the directory ny_taxi_postgres_data was changed to 999, so my user couldn’t access it when running the above command. Even though this is not the problem here, it helped to raise the error due to the permission issue.

Since at this point we only need the files Dockerfile and ingest_data.py, to fix this error one can
run the docker build command on a different directory (having only these two files).

A more complete explanation can be found here:


https://stackoverflow.com/questions/41286028/docker-build-error-checking-context-cant-stat-c-u
sers-username-appdata
Docker - ERRO[0000] error waiting for container: context
canceled
You might have installed docker via snap. Run “sudo snap status docker” to verify.
If you get “error: unknown command "status", see 'snap help'.” as a response, then uninstall docker and install it via the official website.

Docker - build error checking context: can’t stat


‘/home/fhrzn/Projects/…./ny_taxi_postgres_data’
Found the issue on PopOS Linux. It happened because our user didn’t have authorization rights to the host folder (which also caused the folder to seem empty, but it wasn’t!).

✅Solution:
Just add permission for everyone to the corresponding folder

sudo chmod -R 777 <path_to_folder>

Example:

sudo chmod -R 777 ny_taxi_postgres_data/

Docker - failed to solve with frontend dockerfile.v0: failed to read


dockerfile: error from sender: open ny_taxi_postgres_data:
permission denied.
This happens on Ubuntu/Linux systems when trying to run the command to build the Docker
container again.

$ docker build -t taxi_ingest:v001 .

A folder is created to host the Docker files. When the build command is executed again to
rebuild the pipeline or create a new one the error is raised as there are no permissions on this
new folder. Grant permissions by running this command;

$ sudo chmod -R 755 ny_taxi_postgres_data

Or use 777 if you still see problems. 755 grants write access to only the owner.
Docker - Docker network name (solution for mac) ?

Get the network name via: $ docker network ls.

Docker - Error response from daemon: Conflict. The container


name "pg-database" is already in use by container “xxx”. You
have to remove (or rename) that container to be able to reuse that
name.
Sometimes, when you try to restart a docker image configured with a network name, the above message appears. In this case, use the following commands with the appropriate container name:
>>> If the container is in a running state, use docker stop <container_name>
>>> Then, docker rm pg-database
Or use docker start instead of docker run in order to restart the docker image without removing it.

Docker - ingestion when using docker-compose could not


translate host name
Typical error: sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) could
not translate host name "pgdatabase" to address: Name or service not known

When running docker-compose up -d, see which network is created and use this for the ingestion script instead of pg-network, and see the name of the database service to use instead of pgdatabase.

E.g.:
- pg-network becomes 2docker_default
- pgdatabase becomes 2docker-pgdatabase-1
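For example, the ingestion container could then be run along these lines (a sketch only; the taxi_ingest image and argument names follow the course’s ingest_data.py and may differ in your setup, and ${URL} is a placeholder for the CSV link):

docker run -it \
  --network=2docker_default \
  taxi_ingest:v001 \
    --user=root \
    --password=root \
    --host=2docker-pgdatabase-1 \
    --port=5432 \
    --db=ny_taxi \
    --table_name=yellow_taxi_trips \
    --url=${URL}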
Docker - Cannot install docker on MacOS/Windows 11 VM
running on top of Linux (due to Nested virtualization).

Run this command before starting your VM:


- On Intel CPU:
modprobe -r kvm_intel
modprobe kvm_intel nested=1
- On AMD CPU:
modprobe -r kvm_amd
modprobe kvm_amd nested=1


Docker - Connecting from VS Code


It’s very easy to manage your docker container, images, network and compose projects from VS
Code.
Just install the official extension and launch it from the left side icon.

It will work even if your Docker runs on WSL2, as VS Code can easily connect with your Linux.
Docker - How to stop a container?
Use the following command:

$ docker stop <container_id>

Docker-Compose - mounting error


error: could not change permissions of directory "/var/lib/postgresql/data": Operation not permitted

- If you have used the previous answer (just before this) and have created a local docker volume, then you need to tell the compose file about the named volume:

volumes:
  dtc_postgres_volume_local: # define the named volume here

# services mentioned in the compose file automatically become part of the same network!
services:
  your remaining code here . . .

- Now use docker volume inspect dtc_postgres_volume_local to see the location by checking the value of Mountpoint.
- In my case, after I ran docker compose up, the mounting dir created was named ‘docker_sql_dtc_postgres_volume_local’ whereas it should have used the already existing ‘dtc_postgres_volume_local’.
- All I did to fix this was rename the existing ‘dtc_postgres_volume_local’ to ‘docker_sql_dtc_postgres_volume_local’ and remove the newly created one (just be careful when doing this).
- Run docker compose up again and check if the table is there or not!

Docker-Compose - Error translating host name to address

Couldn’t translate host name to address


Make sure the postgres database is running.

Use the command to start containers in detached mode: docker-compose up -d

(data-engineering-zoomcamp) hw % docker compose up -d


[+] Running 2/2
⠿ Container pg-admin Started
0.6s
⠿ Container pg-database Started

To view the containers use: docker ps.

(data-engineering-zoomcamp) hw % docker ps

CONTAINER ID IMAGE COMMAND CREATED STATUS


PORTS NAMES

faf05090972e postgres:13 "docker-entrypoint.s…" 39 seconds ago Up 37 seconds


0.0.0.0:5432->5432/tcp pg-database

6344dcecd58f dpage/pgadmin4 "/entrypoint.sh" 39 seconds ago Up 37 seconds


443/tcp, 0.0.0.0:8080->80/tcp pg-admin
hw

To view logs for a container: docker logs <containerid>


(data-engineering-zoomcamp) hw % docker logs faf05090972e

PostgreSQL Database directory appears to contain a database; Skipping initialization

2022-01-25 05:58:45.948 UTC [1] LOG: starting PostgreSQL 13.5 (Debian 13.5-1.pgdg110+1) on
aarch64-unknown-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
2022-01-25 05:58:45.948 UTC [1] LOG: listening on IPv4 address "0.0.0.0", port 5432
2022-01-25 05:58:45.948 UTC [1] LOG: listening on IPv6 address "::", port 5432
2022-01-25 05:58:45.954 UTC [1] LOG: listening on Unix socket
"/var/run/postgresql/.s.PGSQL.5432"
2022-01-25 05:58:45.984 UTC [28] LOG: database system was interrupted; last known up at
2022-01-24 17:48:35 UTC
2022-01-25 05:58:48.581 UTC [28] LOG: database system was not properly shut down; automatic
recovery in
progress
2022-01-25 05:58:48.602 UTC [28] LOG: redo starts at 0/872A5910
2022-01-25 05:59:33.726 UTC [28] LOG: invalid record length at 0/98A3C160: wanted 24, got 0
2022-01-25 05:59:33.726 UTC [28] LOG: redo done at 0/98A3C128


2022-01-25 05:59:48.051 UTC [1] LOG: database system is ready to accept connections

If docker ps doesn’t show pgdatabase running, run: docker ps -a

This should show all containers, either running or stopped.

Get the container id for pgdatabase-1, and run

Docker-Compose - Data retention (could not translate host name


"pg-database" to address: Name or service not known)
After executing `docker-compose up` - if you lose database data and are unable to successfully
execute your Ingestion script (to re-populate your database) but receive the following error:
sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) could not translate host name
"pg-database" to address: Name or service not known

Docker compose is creating its own default network since it is no longer specified in a docker
execution command or file. Docker Compose will emit to logs the new network name. See the
logs after executing `docker compose up` to find the network name and change the network
name argument in your Ingestion script.

Docker-Compose - Hostname does not resolve


It returns --> Error response from daemon: network
66ae65944d643fdebbc89bd0329f1409dec2c9e12248052f5f4c4be7d1bdc6a3 not found
Try:

docker ps -a to see all the stopped & running containers

Then remove (“nuke”) all the containers.

Try: docker-compose up -d again

On localhost:8080 server → Unable to connect to server: could not translate host name
'pg-database' to address: Name does not resolve

Try a new host name, preferably without a hyphen, e.g. pgdatabase.

And in docker-compose.yml, you should specify the docker network and specify the same network in both containers:

services:
  pgdatabase:
    image: postgres:13
    environment:
      - POSTGRES_USER=root
      - POSTGRES_PASSWORD=root
      - POSTGRES_DB=ny_taxi
    volumes:
      - "./ny_taxi_postgres_data:/var/lib/postgresql/data:rw"
    ports:
      - "5431:5432"
    networks:
      - pg-network

  pgadmin:
    image: dpage/pgadmin4
    environment:
      - [email protected]
      - PGADMIN_DEFAULT_PASSWORD=root
    ports:
      - "8080:80"
    networks:
      - pg-network

networks:
  pg-network:
    name: pg-network

Docker-Compose - Persist PGAdmin docker contents on GCP

One common issue is that when you run docker-compose on GCP, pgAdmin won’t persist its data to the mentioned path, for example:

services:
  ...
  pgadmin:
    ...
    volumes:
      - "./pgadmin:/var/lib/pgadmin:rw"

might not work, so in this case you can use a Docker volume to make it persist, by simply changing it to:

services:
  ...
  pgadmin:
    ...
    volumes:
      - pgadmin:/var/lib/pgadmin

volumes:
  pgadmin:

Docker-Compose - dial unix /var/run/docker.sock: connect:


permission denied
This happens if you did not create the docker group and added your user. Follow these steps
from the link:

guides/docker-without-sudo.md at main · sindresorhus/guides · GitHub

And then press Ctrl+D to log out and log in again.

pgAdmin - Maintain state so that it remembers your previous connection

If you are tired of having to set up your database connection each time that you fire up the containers, all you have to do is create a volume for pgAdmin.

In your docker-compose.yaml file, enter the following into your pgAdmin declaration:

volumes:
  - type: volume
    source: pgadmin_data
    target: /var/lib/pgadmin

Also add the following to the end of the file:

volumes:
  pgadmin_data:

Docker-Compose - docker-compose still not available after


changing .bashrc
This happened to me after following the 1.4.1 video where we install docker-compose in our Google Cloud VM. In my case, the docker-compose file downloaded from GitHub is named docker-compose-linux-x86_64, while it is more convenient to use the docker-compose command instead. So just rename docker-compose-linux-x86_64 to docker-compose (a sketch of the rename follows below).
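A sketch of that rename (assuming the binary was downloaded into a ~/bin directory that is on your PATH):

cd ~/bin
mv docker-compose-linux-x86_64 docker-compose
chmod +x docker-compose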

Docker-Compose - Error getting credentials after running


docker-compose up -d
Installing pass via ‘sudo apt install pass’ helped to solve the issue. More about this can be found
here: https://github.com/moby/buildkit/issues/1078

Docker-Compose - Which docker-compose binary to use for


WSL?
To figure out which docker-compose you need to download from
https://github.com/docker/compose/releases you can check your system with these commands:
● uname -s -> return Linux most likely
● uname -m -> return "flavor"
Or try this command -
sudo curl -L
"https://github.com/docker/compose/releases/download/1.29.2/docker-com
pose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
Docker-Compose - Error undefined volume in Windows/WSL
If you wrote the docker-compose.yaml file exactly like the video, you might run into an error like
this:

service "pgdatabase" refers to undefined volume


dtc_postgres_volume_local: invalid compose project

In order to make it work, you need to include the volume in your docker-compose file. Just add
the following:
volumes:
dtc_postgres_volume_local:

(Make sure volumes is at the same level as services.)

WSL Docker initdb: error: could not change permissions of


directory
Solution: Use Docker volumes.

</> docker-compose.yaml

services:
  postgres:
    image: postgres:15-alpine
    container_name: postgres
    user: "0:0"
    environment:
      - POSTGRES_USER=postgres
      - POSTGRES_PASSWORD=postgres
      - POSTGRES_DB=ny_taxi
    volumes:
      - "pg-data:/var/lib/postgresql/data"
    ports:
      - "5432:5432"
    networks:
      - pg-network

  pgadmin:
    image: dpage/pgadmin4
    container_name: pgadmin
    user: "${UID}:${GID}"
    environment:
      - [email protected]
      - PGADMIN_DEFAULT_PASSWORD=pgadmin
    volumes:
      - "pg-admin:/var/lib/pgadmin"
    ports:
      - "8080:80"
    networks:
      - pg-network

networks:
  pg-network:
    name: pg-network

volumes:
  pg-data:
    name: ingest_pgdata
  pg-admin:
    name: ingest_pgadmin

WSL - Insufficient system resources exist to complete the


requested service.
Cause:
It happens because the apps are not updated. To be specific, search for any pending updates
for Windows Terminal, WSL and Windows Security updates.

Solution
- for updating Windows terminal which worked for me:
1. Go to Microsoft Store.
2. Go to the library of apps installed in your system.
3. Search for Windows terminal.
4. Update the app and restart your system to see the changes.

- For updating the Windows security updates:


1. Go to Windows updates and check if there are any pending updates from Windows,
especially security updates.
2. Do restart your system once the updates are downloaded and installed successfully.
WSL - WSL integration with distro Ubuntu unexpectedly stopped
with exit code 1.

Upon restarting, the same issue appears. Happens out of the blue on Windows.

Solution 1: Fixing the DNS issue (credit: reddit); this worked for me personally.

reg add "HKLM\System\CurrentControlSet\Services\Dnscache" /v "Start" /t REG_DWORD /d "4" /f

Restart your computer and then enable it again with:

reg add "HKLM\System\CurrentControlSet\Services\Dnscache" /v "Start" /t REG_DWORD /d "2" /f

Restart your OS again. It should work.

Solution 2: right click on the running Docker icon (next to the clock) and choose "Switch to Linux containers".

bash: conda: command not found

Database is uninitialized and superuser password is not specified.

PGCLI - connection failed: (::1), port 5432 failed: could not receive data from server: Connection refused; could not send SSL negotiation packet: Connection refused

Change

pgcli -h localhost -p 5432 -u root -d ny_taxi

to

pgcli -h 127.0.0.1 -p 5432 -u root -d ny_taxi

PGCLI --help error


- probably some installation error, check out Install (pgcli.com)

PGCLI - Should we run pgcli inside another docker container?


In this section of the course, the 5432 port of pgsql is mapped to your computer’s 5432 port.
Which means you can access the postgres database via pgcli directly from your computer.

So No, you don’t need to run it inside another container. Your local system will do.


PGCLI - FATAL: password authentication failed for user "root"


(You already have Postgres)
FATAL: password authentication failed for user "root"

Observation: do not forget the folder ny_taxi_postgres_data that was created.

This can happen if you already have Postgres installed on your computer. If that’s the case, use a different port, e.g. 5431:

-p 5431:5432

And use it when connecting with

pgcli:

pgcli -h localhost -p 5431 -U root -d ny_taxi

This will connect you to postgres.

If you want to debug: the following can help (on a MacOS)

To find out if something is blocking your port (on a MacOS):

● You can use the lsof command to find out which application is using a specific port on your local machine: `lsof -i :5432`
● Or list the running postgres services on your local machine with launchctl.
To unload the running service on your local machine (on a MacOS):

● unload the launch agent for the PostgreSQL service, which will stop the service and
free up the port
`launchctl unload -w
~/Library/LaunchAgents/homebrew.mxcl.postgresql.plist`
● this one to start it again
`launchctl load -w
~/Library/LaunchAgents/homebrew.mxcl.postgresql.plist`

Changing port from 5432:5432 to 5431:5432 helped me to avoid this error.

PGCLI - PermissionError: [Errno 13] Permission denied:


'/some/path/.config/pgcli'
I get this error

pgcli -h localhost -p 5432 -U root -d ny_taxi

Traceback (most recent call last):
  File "/opt/anaconda3/bin/pgcli", line 8, in <module>
    sys.exit(cli())
  File "/opt/anaconda3/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/opt/anaconda3/lib/python3.9/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/opt/anaconda3/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/anaconda3/lib/python3.9/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/opt/anaconda3/lib/python3.9/site-packages/pgcli/main.py", line 880, in cli
    os.makedirs(config_dir)
  File "/opt/anaconda3/lib/python3.9/os.py", line 225, in makedirs
    mkdir(name, mode)
PermissionError: [Errno 13] Permission denied: '/Users/vray/.config/pgcli'

Make sure you install pgcli without sudo.

The recommended approach is to use conda/anaconda to make sure your system python is not
affected.

If conda install gets stuck at "Solving environment" try these alternatives:


https://stackoverflow.com/questions/63734508/stuck-at-solving-environment-on-anaconda

PGCLI - no pq wrapper available.


ImportError: no pq wrapper available.
Attempts made:
- couldn't import psycopg 'c' implementation: No module named 'psycopg_c'
- couldn't import psycopg 'binary' implementation: No module named 'psycopg_binary'
- couldn't import psycopg 'python' implementation: libpq library not found

Solution:

First, make sure your Python is set to 3.9, at least.

And the reason for that is we have had cases of 'psycopg2-binary' failing to install because of an
old version of Python (3.7.3).

0. You can check your current python version with:

$ python -V (the V must be capital)

1. Based on the previous output, if you've got a 3.9, skip to Step #2.

Otherwise, you're better off with a new environment with 3.9:

$ conda create -n de-zoomcamp python=3.9
$ conda activate de-zoomcamp

2. Next, you should be able to install the lib for postgres like this:

```
$ pip install psycopg_binary
```

3. Finally, make sure you're also installing pgcli, but use conda for that (e.g. conda install -c conda-forge pgcli), and then check the connection:

```
$ pgcli -h localhost -U root -d ny_taxi
```

There, you should be good to go now!

PGCLI - stuck on password prompt


If your Bash prompt is stuck on the password command for postgres

Use winpty: winpty pgcli -h localhost -p 5432 -u root -d ny_taxi

Alternatively, try using Windows terminal or terminal in VS code.

PGCLI - pgcli: command not found


Problem: If you have already installed pgcli but bash doesn't recognize pgcli

- On Git bash: bash: pgcli: command not found


- On Windows Terminal: pgcli: The term 'pgcli' is not recognized…

Solution: Try adding a Python path C:\Users\...\AppData\Roaming\Python\Python39\Scripts to


Windows PATH

For details:

1. Get the location: pip list -v


2. Copy C:\Users\...\AppData\Roaming\Python\Python39\site-packages
3. Replace site-packages with Scripts: C:\Users\...\AppData\Roaming\Python\Python39\Scripts

It can also be that you have Python installed elsewhere.

For me it was under c:\python310\lib\site-packages


So I had to add c:\python310\lib\Scripts to PATH.

Put the above path in "Path" (or "PATH") in System Variables.

Reference: https://stackoverflow.com/a/68233660

PGCLI - case sensitive use “Quotations” around columns with


capital letters
PULocationID will not be recognized but “PULocationID” will be. This is because unquoted identifiers are case insensitive. See docs.

PGCLI - error column c.relhasoids does not exist


When using the command `\d <database name>` you get the error column `c.relhasoids does not exist`.

Resolution:
1. Uninstall pgcli
2. Reinstall pgcli
3. Restart pc

Postgres - OperationalError: (psycopg2.OperationalError)


connection to server at "localhost" (::1), port 5432 failed: FATAL:
role "root" does not exist
Can happen when connecting via pgcli

pgcli -h localhost -p 5432 -U root -d ny_taxi

Or while uploading data via the connection in jupyter notebook

engine = create_engine('postgresql://root:root@localhost:5432/ny_taxi')

This can happen when Postgres is already installed on your computer. Changing the port can
resolve that (e.g. from 5432 to 5431).

To check whether there even is a root user with the ability to login:

Try: docker exec -it <your_container_name> /bin/bash

- And then run: psql -h localhost -d ny_taxi -U root

Also, you could change port from 5432:5432 to 5431:5432

Other solution that worked:

Changing `POSTGRES_USER=root` to `PGUSER=postgres`

Based on this:
https://stackoverflow.com/questions/60193781/postgres-with-docker-compose-gives-fatal-role-ro
ot-does-not-exist-error

Also `docker compose down`, removing folder that had postgres volume, running `docker
compose up` again.

Postgres - OperationalError: (psycopg2.OperationalError)


connection to server at "localhost" (::1), port 5432 failed: FATAL:
database "ny_taxi" does not exist

~\anaconda3\lib\site-packages\psycopg2\__init__.py in connect(dsn, connection_factory, cursor_factory, **kwargs)
    120
    121     dsn = _ext.make_dsn(dsn, **kwargs)
--> 122     conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
    123     if cursor_factory is not None:
    124         conn.cursor_factory = cursor_factory

OperationalError: (psycopg2.OperationalError) connection to server at "localhost" (::1), port 5432 failed: FATAL: database "ny_taxi" does not exist

Make sure postgres is running. You can check that by running `docker ps`

✅Solution: If you have postgres software installed on your computer before now, build your
instance on a different port like 8080 instead of 5432

Postgres - ModuleNotFoundError: No module named 'psycopg2'


Issue: ModuleNotFoundError: No module named 'psycopg2'

Solution:

pip install psycopg2-binary

If you already have it, you might need to update it:

pip install psycopg2-binary --upgrade

Other methods, if the above fails:

● if you are getting the “ModuleNotFoundError: No module named 'psycopg2'” error even after the above installation, then try updating conda using the command conda update -n base -c defaults conda. Or if you are using pip, then try updating it before installing the psycopg packages, i.e.:
○ First uninstall the psycopg package
○ Then update conda or pip
○ Then install psycopg again using pip.
● if you are still facing an error with psycopg2 and it shows pg_config not found, then you will have to install postgresql; on MAC it is brew install postgresql

Postgres - "Column does not exist" but it actually does (Pyscopg2


error in MacBook Pro M2)
In the join queries, if we mention the column name directly or enclosed in single quotes it’ll


throw an error says “column does not exist”.
Solution: But if we enclose the column names in double quotes then it will work

pgAdmin - Create server dialog does not appear


pgAdmin has a new version. Create server dialog may not appear. Try using register-> server
instead.

Python - ModuleNotFoundError: No module named


'pysqlite2'
ImportError: DLL load failed while importing _sqlite3: The specified module
could not be found. ModuleNotFoundError: No module named 'pysqlite2'


The issue seems to arise from the missing of sqlite3.dll in path ".\Anaconda\Dlls\".
I solved it by simply copying that .dll file from \Anaconda3\Library\bin and put it under the
path mentioned above. (if you are using anaconda)

Python - Ingestion with Jupyter notebook - missing 100000


records
If you follow the video 1.2.2 - Ingesting NY Taxi Data to Postgres and you execute all the same
steps as Alexey does, you will ingest all the data (~1.3 million rows) into the table
yellow_taxi_data as expected.
However, if you try to run the whole script in the Jupyter notebook for a second time from top to
bottom, you will be missing the first chunk of 100000 records. This is because there is a call to
the iterator before the while loop that puts the data in the table. The while loop therefore starts
by ingesting the second chunk, not the first.

✅Solution: remove the cell “df=next(df_iter)” that appears higher up in the notebook than the while loop. The first time next(df_iter) is called should be within the while loop.

📔Note: As this notebook is just used as a way to test the code, it was not intended to be run top to bottom, and the logic is tidied up in a later step when it is instead inserted into a .py file for the pipeline.
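A minimal sketch of the corrected structure (the file, table and connection details are illustrative, not the exact notebook code):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://root:root@localhost:5432/ny_taxi")

df_iter = pd.read_csv("yellow_tripdata_2021-01.csv.gz", iterator=True, chunksize=100000)

# do NOT call next(df_iter) before the loop - that would silently consume the first chunk
while True:
    try:
        df = next(df_iter)  # the first call to next() now happens inside the loop
    except StopIteration:
        break
    df.to_sql(name="yellow_taxi_data", con=engine, if_exists="append")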

Python - Iteration csv without error

To iterate through the CSV in chunks without an error at the end, call next(df_iter) inside a try/except block and break when StopIteration is raised; the print of “... {t_end - t_start} seconds” is just there to time each chunk.
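A minimal sketch of that pattern (a reconstruction under the same illustrative assumptions as the previous sketch, not the original snippet):

import pandas as pd
from time import time
from sqlalchemy import create_engine

engine = create_engine("postgresql://root:root@localhost:5432/ny_taxi")
df_iter = pd.read_csv("yellow_tripdata_2021-01.csv.gz", iterator=True, chunksize=100000)

while True:
    try:
        t_start = time()
        df = next(df_iter)
        df.to_sql(name="yellow_taxi_data", con=engine, if_exists="append")
        t_end = time()
        print(f"inserted another chunk, took {t_end - t_start:.3f} seconds")
    except StopIteration:
        print("finished ingesting all the chunks")
        break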

Python - Pandas can read *.csv.gzip


When a CSV file is compressed using Gzip, it is saved with a ".csv.gz" file extension. This file
type is also known as a Gzip compressed CSV file. When you want to read a Gzip compressed
CSV file using Pandas, you can use the read_csv() function, which is specifically designed to
read CSV files. The read_csv() function accepts several parameters, including a file path or a
file-like object. To read a Gzip compressed CSV file, you can pass the file path of the ".csv.gz"
file as an argument to the read_csv() function.
Here is an example of how to read a Gzip compressed CSV file using Pandas:

import pandas as pd
df = pd.read_csv('path/to/file.csv.gz', compression='gzip')

If you prefer to keep the uncompressed csv (easier preview n vscode and similar), gzip files can
be unzipped using gunzip (but not unzip). On a Ubuntu local or virtual machine, you may need
to apt-get install gunzip first.

Python - How to iterate through and ingest parquet file

Contrary to pandas’ read_csv method, there’s no such easy way to iterate through and set a chunksize for parquet files. We can use PyArrow (Apache Arrow Python bindings) to resolve that.

import pyarrow.parquet as pq
from time import time
from sqlalchemy import create_engine

output_name = "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-01.parquet"

parquet_file = pq.ParquetFile(output_name)
parquet_size = parquet_file.metadata.num_rows

engine = create_engine(f'postgresql://{user}:{password}@{host}:{port}/{db}')

table_name = "yellow_taxi_schema"

# Clear the table if it exists
pq.read_table(output_name).to_pandas().head(n=0).to_sql(name=table_name, con=engine, if_exists='replace')

# default (and max) batch size
index = 65536

for i in parquet_file.iter_batches(use_threads=True):
    t_start = time()
    print(f'Ingesting {index} out of {parquet_size} rows ({index / parquet_size:.0%})')
    i.to_pandas().to_sql(name=table_name, con=engine, if_exists='append')
    index += 65536
    t_end = time()
    print(f'\t- it took %.1f seconds' % (t_end - t_start))

GCP - Project creation failed: HttpError accessing … Requested entity already exists

It asked me to create a project. This should be done from the cloud console, so maybe we don’t need this FAQ.

WARNING: Project creation failed: HttpError accessing


<https://cloudresourcemanager.googleapis.com/v1/projects?alt=json>: response:
<{'vary': 'Origin, X-Origin, Referer', 'content-type':
'application/json; charset=UTF-8', 'content-encoding': 'gzip', 'date': 'Mon,
24 Jan 2022 19:29:12 GMT', 'server': 'ESF', 'cache-control': 'private',
'x-xss-protection': '0', 'x-frame-options': 'SAMEORIGIN',
'x-content-type-options': 'nosniff', 'server-timing': 'gfet4t7; dur=189',
'alt-svc': 'h3=":443"; ma=2592000,h3-29=":443"; ma=2592000,h3-Q050=":443";
ma=2592000,h3-Q046=":443"; ma=2592000,h3-Q043=":443"; ma=2592000,quic=":443";
ma=2592000; v="46,43"', 'transfer-encoding': 'chunked', 'status': 409}>,
content <{

"error": {

"code": 409,
"message": "Requested entity alreadytpep_pickup_datetime exists",

"status": "ALREADY_EXISTS"

From Stackoverflow:
https://stackoverflow.com/questions/52561383/gcloud-cli-cannot-create-project-the-project-id-yo
u-specified-is-already-in-us?rq=1

Project IDs are unique across all projects. That means if any user ever had a project with that
ID, you cannot use it. testproject is pretty common, so it's not surprising it's already taken.

GCP - The project to be billed is associated with an absent billing account

If you receive the error “Error 403: The project to be billed is associated with an absent billing account., accountDisabled”, it is most likely because you did not enter YOUR project ID (the value shown in video 1.3.2).

The value you enter here is unique to each student. You can find it on your GCP Dashboard when you log in.

Another possibility is that you have not linked your billing account to your current project.

GCP - OR-CBAT-15 ERROR Google Cloud free trial account / GCP account suspension inquiry

If Google refuses your credit/debit card, try another one - I had an issue with Kaspi (Kazakhstan) but it worked with TBC (Georgia).
Unfortunately, there is little hope that support will help.
It seems that a Pyypl web-card should work too.

GCP - Where can I find the “ny-ride.json” file?


The ny-rides.json is your private file in Google Cloud Platform (GCP).

And here’s the way to find it:

GCP -> Select project with your instance -> IAM & Admin -> Service Accounts Keys tab -> add
key, JSON as key type, then click create

Note: Once you go into Service Accounts Keys tab, click the email, then you can see the
“KEYS” tab where you can add key as a JSON as its key type
GCP - Do I need to delete my instance in Google Cloud?
In this lecture, Alexey deleted his instance in Google Cloud. Do I have to do it?

Nope. Do not delete your instance in the Google Cloud platform. Otherwise, you will have to redo the setup for the rest of the week 1 material.

GCP - Windows Google Cloud SDK install issue

On Windows, if you are having trouble installing the SDK, follow the steps in the link below if you get this error:

These credentials will be used by any library that requests Application Default Credentials (ADC).

WARNING:

Cannot find a quota project to add to ADC. You might receive a "quota exceeded" or "API not enabled" error. Run $ gcloud auth application-default set-quota-project to add a quota project.

What worked for me:

● I reinstalled the SDK from the unzipped archive using install.bat,
● after successfully checking gcloud version,
● I ran gcloud init to set up the project before
● running gcloud auth application-default login

https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/week_1_basics_n
_setup/1_terraform_gcp/windows.md

GCP VM - I cannot get my Virtual Machine to start because GCP


has no resources.
1. Click on your VM
2. Create an image of your VM
3. On the page of the image, tell GCP to create a new VM instance via the image
4. On the settings page, change the location
GCP VM - Is it necessary to use a GCP VM? When is it useful?
The reason this video about the GCP VM exists is that many students had problems configuring
their env. You can use your own env if it works for you.

And the advantage of using your own environment is that if you are working in a Github repo
where you can commit, you will be able to commit the changes that you do. In the VM the repo
is cloned via HTTPS so it is not possible to directly commit, even if you are the owner of the
repo.

GCP VM - mkdir: cannot create directory ‘.ssh’: Permission denied
I am trying to create a directory but it won't let me do it

User1@DESKTOP-PD6UM8A MINGW64 /

$ mkdir .ssh

mkdir: cannot create directory ‘.ssh’: Permission denied

You should create it in your home directory (~), but it seems you are trying to do it in the root folder (/). Run cd ~ first, and then mkdir .ssh.

Link to Video 1.4.1

GCP VM - Error while saving the file in VM via VS Code


Failed to save '<file>': Unable to write file
'vscode-remote://ssh-remote+de-zoomcamp/home/<user>/data_engineering_course/w
eek_2/airflow/dags/<file>' (NoPermissions (FileSystemError): Error: EACCES:
permission denied, open
'/home/<user>/data_engineering_course/week_2/airflow/dags/<file>')

You need to change the owner of the files you are trying to edit via VS Code. SSH into the VM and run the following command to change the ownership:

sudo chown -R <user> <path to your directory>


GCP VM - VM connection request timeout
Question: I connected to my VM perfectly fine last week (ssh) but when I tried again this week,
the connection request keeps timing out.

✅Answer: Start your VM. Once the VM is running, copy its External IP and paste that into your
config file within the ~/.ssh folder.

cd ~/.ssh
code config ← this opens the config file in VSCode

GCP VM - connect to host port 22 no route to host

(reference: https://serverfault.com/questions/953290/google-compute-engine-ssh-connect-to-host-ip-port-22-operation-timed-out)

Go to edit your VM:

1. Go to the Automation section
2. Add a Startup script:
```
#!/bin/bash
sudo ufw allow ssh
```
3. Stop and start the VM.

GCP VM - Port forwarding from GCP without using VS Code

You can easily forward the ports of pgAdmin, postgres and Jupyter Notebook using the built-in
tools in Ubuntu and without any additional client:
1. First, in the VM machine, launch docker-compose up -d and jupyter notebook
in the correct folder.
2. From the local machine, execute: ssh -i ~/.ssh/gcp -L
5432:localhost:5432 username@external_ip_of_vm
3. Execute the same command but with ports 8080 and 8888.
4. Now you can access pgAdmin on local machine in browser typing localhost:8080
5. For Jupyter Notebook, type localhost:8888 in the browser of your local machine. If
you have problems with the credentials, it is possible that you have to copy the link with
the access token provided in the logs of the terminal of the VM machine when you
launched the jupyter notebook command.
6. To forward both pgAdmin and postgres at once, use: ssh -i ~/.ssh/gcp -L 5432:localhost:5432 -L 8080:localhost:8080 username@external_ip_of_vm
Terraform - Install for WSL
https://techcommunity.microsoft.com/t5/azure-developer-community-blog/configuring-terraform-
on-windows-10-linux-sub-system/ba-p/393845

Terraform - Error acquiring the state lock


https://github.com/hashicorp/terraform/issues/14513

Terraform - Error 403 : Access denied


│ Error: googleapi: Error 403: Access denied., forbidden

Your $GOOGLE_APPLICATION_CREDENTIALS might not be pointing to the correct file


run = export GOOGLE_APPLICATION_CREDENTIALS=~/.gc/YOUR_JSON.json
And then = gcloud auth activate-service-account --key-file
$GOOGLE_APPLICATION_CREDENTIALS

Terraform - Do I need to make another service account for


terraform before I get the keys (.json file)?

One service account is enough for all the services/resources you'll use in this course. After you
get the file with your credentials and set your environment variable, you should be good to go.

Terraform - Where can I find the Terraform 1.1.3 Linux (AMD 64)?
Here: https://releases.hashicorp.com/terraform/1.1.3/terraform_1.1.3_linux_amd64.zip
Terraform - Terraform initialized in an empty directory! The directory has no Terraform configuration files. You may begin working with Terraform immediately by creating Terraform configuration files.

You get this message because you ran the command terraform init outside the working directory, which is wrong.
You first need to navigate to the working directory that contains the Terraform configuration files, and then run the command.

Terraform - Error creating Dataset: googleapi: Error 403: Request


had insufficient authentication scopes

The error:
Error: googleapi: Error 403: Access denied., forbidden

and
│ Error: Error creating Dataset: googleapi: Error 403: Request had
insufficient authentication scopes.

For this solution make sure to run:


echo $GOOGLE_APPLICATION_CREDENTIALS
echo $?

Solution:
You have to set again the GOOGLE_APPLICATION_CREDENTIALS as Alexey did in the
environment set-up video in week1:
export GOOGLE_APPLICATION_CREDENTIALS="<path/to/your/service-account-authkeys>.json"

SQL - SELECT * FROM zones_taxi WHERE Zone='Astoria Zone';


Error Column Zone doesn't exist

- For the HW1 I encountered this issue. The solution is


SELECT * FROM zones AS z WHERE z."Zone" = 'Astoria Zone';
- Column names that start with an uppercase letter need to go between double quotes, e.g. "Zone". I ran into a lot of issues like this and the double quotes made it work out.

- In addition to the point above, for me there is no ‘Astoria Zone’, only ‘Astoria’ exists in the dataset:
SELECT * FROM zones AS z WHERE z."Zone" = 'Astoria';

SQL - SELECT Zone FROM taxi_zones Error Column Zone


doesn't exist
- It is inconvenient to use quotation marks all the time, so it is better to load the data into the database with all column names in lowercase. In Pandas, after
df = pd.read_csv('taxi+_zone_lookup.csv')
add the line:
df.columns = df.columns.str.lower()

CURL - curl: (6) Could not resolve host: output.csv


Solution (for mac users): os.system(f"curl {url} --output {csv_name}")

SSH Error: ssh: Could not resolve hostname linux: Name or service not known

To resolve this, ensure that your config file is in C:/Users/<Username>/.ssh/config

Week 2

Conda Env: Why and for what do we need to work with Conda
environments exactly? What does this environment give us in
terms of ability?
Python and many other languages that advise on a virtual env suffer from a major flaw called
dependency hell. In a nutshell, that means:
It's common sense that different projects have different requirements, right? That could mean
different python versions, different libraries, frameworks, etc
So imagine you have a single Python installation (the one that comes with the operating system,
for example), and you're invited to work in a Project A and in a Project B

Project A Dependencies:
- A1
- A2
- A3

Project B Dependencies:
- B1
- B2
- B3
Now, a dependency, at the end of the day, also makes use of other dependencies. (These "implicit" dependencies of the libraries/frameworks you are using in your project are called transitive dependencies.)
Now, with that in mind:
● A1 depends on X version 1.0, meaning "X" v1.0 is a transitive dependency of A1
● B1 depends on X version 2.0, meaning "X" v2.0 is a transitive dependency of B1
When you are working on the same "python environment", the rule that applies is: Always keep
the most updated version of a library. So when you ask your system to install the dependency
"A1", it will also bring "X" at version 1.0
But then you switch to another project, called B, which also depends on X - but at version 2.0
That means that this Python environment will upgrade "X" from v1 to v2 to make it work for B1.
But the thing is: when you hop between major versions of a library, things often break.
Meaning: B1 will work just fine, because its transitive dependency (X v2.0) is satisfied, but A1 may stop working if v2.0 is incompatible with the previous version (v1.0).

The scenario just described is quite simple, but in the real world you're going to be working with different versions of Python and other sets of constraints. For example, we had many people here with Python 3.7 unable to install the project dependencies for psycopg2, and once they updated to Python 3.9, everything started to work fine.

But what if you DO have another project that is running in production on Python 3.7 and starts
bugging out ? The best you can do to reproduce/replicate said environment is to make yourself
an equivalent environment (w/ 3.7 instead of 3.9)... and the list goes on and on and on

So long story short,


It is considered a best practice to prevent catastrophic project issues like the ones I listed
above, not because "it's a best practice because it's a best practice".

Besides, you NEVER EVER want to mess with the Python environment that comes with your Operating System (macOS / Linux) -> many system tools today use Python, and if you break the one that comes bundled with the OS, you are going to have a lot of headaches (as in: unexpected behaviors) putting the pieces back together.
Hence why, once again, you use virtualenvs to provide you with isolation: not only between your projects, but also between the projects and the underlying infrastructure of the OS.

WHICH is why you don't use virtualenvs for containers.


Did you notice that the Dockerfile you're using already comes with Python and we didn't actually
have to install conda in there?

That's because the containers are not only meant to be disposable if/when they break, but they
also run in an isolated workspace of their own (but this is new and entire different discussion)

(Comment/Question: Your answer applies as well to Conda environments as also to virtualenv,


right? I write this because I wonder why people install Conda at all as you can just use
virtualenv.)

Why does Jeff have the same line twice in video 2.2.3 and 2.2.4,
etl_web_to_gcs.py and parameterized_flow.py?
df["tpep_pickup_datetime"] = pd.to_datetime(df["tpep_pickup_datetime"])

Second one should be:

df["tpep_dropoff_datetime"] = pd.to_datetime(df["tpep_dropoff_datetime"])

Repo is updated. Thank you jralduaveuthey and Valentine Zaretsky for catching!

Why does Jeff use a default parameter assignment in


etl_web_to_gcs.py instead of a type hint around time 12:40 in
video 2.2.3 and in 2.2.4?
That’s a typo. Should be:

@task(log_prints=True)
def clean(df: pd.DataFrame) -> pd.DataFrame:

not

@task(log_prints=True)
def clean(df=pd.DataFrame) -> pd.DataFrame:

Repo is updated. Thank you Valentine Zaretsky for catching it!


Prefect Blocks are not showing even after installing the
requirements.txt as shown in the video. How to fix:

With the Prefect Orion up & running, execute:


● prefect orion database reset
● prefect block register -m prefect_gcp
● prefect block register -m prefect_sqlalchemy

Prefect blocks: raise RuntimeError(RuntimeError: Unable to load 'de-zoomcamp-gcs' of block type None due to failed validation. To load without validation, try loading again with `validate=False`.
Probable cause of error:

Copied json key for credentials incorrectly or the service account doesn't have the necessary
permissions

How to fix it:


Create a new Service Account with the permissions necessary, copy the json key, paste it into
the credentials block and finally execute the pipeline again.

This should be fixed!

Prefect blocks: (TimeOutError) requests.exceptions.ConnectionError: ('Connection aborted.', timeout('The write operation timed out'))
I was hitting this error in the gcs_block.upload_from_path function.

The solution for me was to set the timeout parameter of the function to 120 (seconds):
gcs_block.upload_from_path(from_path=path, to_path=path, timeout=120)

The default timeout is 60 seconds. Timeout may vary depending on your internet speed.

You could also opt to work from a VM.

Also you can try this workaround https://github.com/googleapis/python-storage/issues/74 (try


smaller value of parameters _DEFAULT_CHUNKSIZE and _MAX_MULTIPART_SIZE for gcp blob lib
which is used under the hood of prefect gcp blocks)
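A sketch of that workaround; these are private constants of the google-cloud-storage library (used under the hood by the Prefect GCS block), so treat it as a best-effort fallback, and the 5 MB value is only an example:

```python
from google.cloud.storage import blob

# smaller chunks mean each upload request finishes well before the timeout
blob._DEFAULT_CHUNKSIZE = 5 * 1024 * 1024     # 5 MB (example value)
blob._MAX_MULTIPART_SIZE = 5 * 1024 * 1024    # 5 MB (example value)
```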

Prefect: With Windows, Prefect-gcp 0.2.3 converted / slashes in a


path to \ in the to_path statement in the upload_from_path
function

Bug with prefect-gcp 0.2.3 on Windows only. Couldn’t upload the file into a folder as in

the video.

✅SOLUTION: Use prefect-gcp 0.2.4. You can specify the new version in requirements.txt before installing, or run pip install -U prefect-gcp to upgrade in an existing environment.

Then use path.as_posix() before the upload command (see the Windows path entry later in this section).


Prefect deployment: ValueError: Path opt/prefect/C:\Users\user\.prefect\storage/5eeca69056a042a284e87ea46f757188 does not exist.
Error when you run the command prefect deployment run
etl-parent-flow/docker-flow -p "months=[1,2]"
It looks like you ran the flow before with caching on. Now when it tries to find the cached
location, it’s outside Docker and can’t be accessed. If using Prefect 2.7.8 you can refresh the
cache and it should work.

Alternatively, you can set the cache_expiration to a short period of time - say a minute - and rerun outside Docker and then in Docker; it won’t try to find the cached result.

See more instructions in the docs.

1. ✅Deleted the part that use the cache from the fetch task and the flow runs, i’ll provide
further details if i find why it was trying to pull from local cache
# @task(retries=3,cache_key_fn=task_input_hash,
cache_expiration=timedelta(days=1))
@task(retries=3)
def fetch(dataset_url: str) -> pd.DataFrame:

If using Docker with a function that says to look for a cached run, but you have cache
stored locally outside Docker from a previous run, Docker can’t access the cached file, so it
throws an error.
2. Also, make sure the Prefect Block you have for the GCS creds is having the service
account JSON data directly instead of the path to the service account JSON file
3. made sure I removed the CACHE settings on the @task and I re-ran every
command (building docker image, pushing, updating deployment) works!!!

For more info, you can also refer this slack thread for the same error -
https://datatalks-club.slack.com/archives/C01FABYF2RG/p1674928505612379

ISSUE with activating the credentials in the “Service Account Info” field

When inserting your JSON credentials dict into the “Service Account Info” text field, Prefect does not seem to check whether the JSON input is valid. In my case, copying the content with nano introduced a formatting problem; extracting the content with another text editor worked. You can check it yourself as follows: if the paste failed, it just prints eight stars:
"********"

If it inserted correctly, then you should see the following masked by the stars (with a formatting
of key/value per line):
{ "type": "********", "auth_uri": "********", "client_id":
"********", "token_uri": "********", "project_id": "********",
"private_key": "********", "client_email": "********",
"private_key_id": "********", "client_x509_cert_url": "********",
"auth_provider_x509_cert_url": "********"}

Prefect Flow: ERROR | Task run 'write_local' - Encountered


exception during execution
Remember, you have to create the folders where you keep the files from the repository.

Prefect Flow: ERROR | Flow run 'xxxxxx' - Finished in state Failed('Flow run encountered an exception. google.api_core.exceptions.Forbidden: 403 GET: Access Denied: Table xxxxx: Permission bigquery.tables.get denied on table xxxxxx (or it may not exist).\n')
If you reuse the block with the service account which you created before to connect to Cloud
Storage Bucket you have to add permissions of BigQuery Administrator.

Prefect deployment: Attempt to run deployment etl-parent-flow/docker-flow - prefect.exceptions.ScriptError: Script at 'parameterized_flow.py' encountered an exception: FileNotFoundError(2, 'No such file or directory')

Place parameterized_flow.py in the flows folder.

Solution:

Ensure you copy the parameterized_flow.py file to the /opt/prefect/flows/ directory giving it the
same name as the file parameterized_flow.py

01_start/parameterized_flow.py -> /opt/prefect/flows/parameterized_flow.py
Prefect Deployment Encountered Exception - prefect deployment build ./parameterized_flow.py:etl_parent_flow -n "Parameterized ETL" fails with: Script at './parameterized_flow.py' encountered an exception: TypeError("'type' object is not subscriptable")

While performing a SQL query in Python using pandas, I am facing the error: TypeError: __init__() got multiple values for argument 'schema'

Install an earlier version of SQLAlchemy (the 1.4.46 release). SQLAlchemy 2.0.0 was released recently and isn't compatible with the pandas version used here. If you have the latest version of SQLAlchemy (v2.0.0), just do:

- pip uninstall sqlalchemy


- pip install sqlalchemy==1.4.46

$ python parameterized_flow.py
Traceback (most recent call last):
  File "parameterized_flow.py", line 60, in <module>
    def etl_parent_flow(months: list[int] = [1,2], year: int = 2021, color: str = "yellow"):
TypeError: 'type' object is not subscriptable

This is the error from the "prefect deployment build" question above: the built-in generic syntax list[int] requires Python 3.9 or newer. Upgrade to Python 3.9+ (or use typing.List[int]) and the script runs.

Prefect Deployment - How to pass multiple parameters to prefect


deployment run?

During class 2.2.6, when he runs something like


prefect deployment run etl-parameters/docker-flow -p "month= [1,2]"

How can I add more arguments? This didn’t work:


prefect deployment run etl-parameters/docker-flow -p "color=yellow" "month= [1,2]"
"year= 2021"
I'm getting some errors and I've found out that my deployment doesn't have defined parameters,
that's why I need to know how to configure all parameters (we were using by
etl_parameters-deployment.yaml file)

✅Solution:
With this syntax you can pass multiple parameters:
prefect deployment run etl-parameters/docker-flow --params= '{"color":"yellow",
"month":[1,2], "year":2021}'

More info in the prefect documentation on: Build the deployment

Prefect deployment: ERROR | Flow could not be retrieved from


deployment

When using the GitHub block, you have to run the prefect deployment build command in
the same local folder as the root folder in the GitHub repository.

In the build command, you have to provide the path to the python file:
path/to/my_file.py:my_flow

--path is used as the upload path, which doesn’t apply for GitHub repository-based storage.

Docker: Trying to build the docker image throws this error:


=> ERROR [5/5] RUN mkdir /opt/prefect/data/yellow 0.3s
——
> [5/5] RUN mkdir /opt/prefect/data/yellow: #9 0.299 mkdir: cannot create directory
‘/opt/prefect/data/yellow’: No such file or directory
—— executor failed running [/bin/sh -c mkdir /opt/prefect/data/yellow]: exit code: 1

Check dockerfile and add -p to mkdir command:


RUN mkdir -p /opt/prefect/data/yellow

Thank you! Fixed in repo now.

OSError: Cannot save file into a non-existent directory:


‘/data/yellow’
Change the path to go back 2 directories or however many you need so it goes to the proper
path
path = Path(f"../../data/{color}/{dataset_file}.parquet")

OR
I added 2 lines code to write_local Function:
if not path.parent.is_dir():
path.parent.mkdir(parents=True)

I suggest using the 2nd fix, the first can cause issues when running through prefect doing steps
2.2.5.

Previous If statement didn’t work for me. I added following line to the write_local function, which
did the trick:
path.parent.mkdir(parents=True, exist_ok=True)

ALTERNATIVE:

Import os library, and do:


@task()
def write_local(df: pd.DataFrame, color: str, dataset_file: str) -> Path:
"""Write df out locally as a parquet file"""

outdir = f"./data/{color}"
if not os.path.exists(outdir):
os.makedirs(outdir)

path = Path(f"{outdir}/{dataset_file}.parquet")
df.to_parquet(path, compression='gzip')
return path

The first block checks if the path already exists, and if it doesn’t, creates it first.

Github Block - OSError: Failed to pull from remote: fatal: Too many arguments. (when running deployment)

Make sure you don’t use the “Reference” field as a description of the block, because otherwise it will look endlessly for a branch or tag in your GitHub repository that doesn’t exist.
Prefect flow: etl-parent-flow/docker-flow Crashed with
ConnectionRefusedError: [Errno 111] Connect call failed
('127.0.0.1', 4200)
This error occurs when you run your prefect on WSL 2 on Windows 10. The error looks like this:
UI

Prefect Agent Log


httpx.ConnectError: All connection attempts failed
10:10:16.423 | INFO | prefect.infrastructure.docker-container - Docker
container 'placid-corgi' has status 'removing'
10:10:16.437 | INFO | prefect.infrastructure.docker-container - Docker
container 'placid-corgi' has status 'removing'
10:10:16.502 | INFO | prefect.agent - Reported flow run
'24feeb23-6eb9-4523-b330-19c365bd68fc' as crashed: Flow run infrastructure
exited with non-zero status code 1.

The error above happens because we run Prefect locally on our machine at localhost:4200; when we run Docker without specifying a network, the container calls localhost:4200 inside the container, not localhost:4200 on our machine.
To solve this, you only need to set the Network Mode to bridge.


Config from UI

Config from Code (make_docker_block.py)


from prefect.infrastructure.docker import DockerContainer

# alternative to creating DockerContainer block in the UI


docker_block = DockerContainer(
image="discdiver/prefect:zoom", # insert your image here
image_pull_policy="ALWAYS",
auto_remove=True,
network_mode="bridge"
)

docker_block.save("zoom", overwrite=True)
Prefect flow: Why are we writing the .csv locally in
etl_web_to_gcs.py?

Just to do some transformation. It is not necessary to keep it.

Prefect Flow: could not be retrieved from deployment: github storage

Solved: ✅
It is due to Prefect trying to read the file at the origin or root level. Either push the file to the root directory on GitHub, or mention the full path in the deployment command. Example:
prefect deployment build path/to/filename.py:flow_name -n "name" -sb github/your-block-name --apply

Prefect flow: Why Prefect DockerContainer deployments can’t


connect to other dockers on the host
Setup:
The project is running with docker compose. They include Prefect Orion, Prefect Agent,
Postgres DB etc.

Problem: When trigger the flows of Prefect DockerContainer deployment it was unable to
connect to other dockers.

Cause:
The DockerContainer deployment is not running on the same network as rest of the dockers

Fix:
Create a docker network
```
if [ ! "$(docker network ls | grep dtc-network)" ]; then
echo "\n**Create docker network"
docker network create -d bridge dtc-network
else
echo "dtc-network already exists."
fi
```
Add the networks to each container within docker-compose.yml
```
Container:
networks:
- dtc-network
networks:
dtc-network:
driver: bridge
name: dtc-network
```

Python lib: Problem with prefect with macOS M1 (arm) with poetry
the package manager, I got the below message:

I got this problem while trying to run `prefect orion start` to start the Prefect UI on my laptop:

“ValueError: the greenlet library is required to use this function.
No module named 'greenlet'”

For people who got the same problem with poetry, I solved this issue by adding `greenlet` to `pyproject.toml`.

So run this command:


poetry add greenlet

Docker: “Localhost (127.0.0.1), port 5433 failed: Connection refused.”

The instructor uses port 5433 for week 2. If you are reusing the containers from week 1, set port 5432 and user/pass root/root.

Prefect API: Client error '404 Not Found'

Error Message:
httpx.HTTPStatusError: Client error '404 Not Found' for url
'http://ephemeral-orion/api/flow_runs/6aed1011-1599-4fba-afad-b8d999cf
9073'
For more information check: https://httpstatuses.com/404
Solution: ✅
reference
[method-1]
Input the below command on local PC.
prefect config set PREFECT_API_URL=http://127.0.0.1:4200/api
[method-2]
Add the below lines in the Dockerfile
ENV PREFECT_API_URL=http://127.0.0.1:4200/api
RUN prefect config set PREFECT_API_URL="${PREFECT_API_URL}"
(if you need, you can set the Env in Prefect Orion, like below. And you can change the value
depending on your demand, such as connection to Prefect Cloud)

Uploading to GCS Does not put the file inside a folder but instead
the file is named ‘data\green\...’ (Affects Windows)

Add .as_posix() to the path variable.


path = Path(f"data/{color}/{dataset_file}.parquet").as_posix()

Additionally, in my case I needed to update the prefect_gcp package to the version 0.2.6.

GITHUB PUSH ERROR PRE-HOOK DECLINED


You get this error if you accidentally commit large files (over 100 MB) and push them to GitHub. Use the following commands to correct this:
1. git reset --soft HEAD~1 : this resets the commit back by 1 step; you can use HEAD~N to reset the commits back N steps. --soft is used to retain the changes made in the local directory.
2. Use git restore --staged <file name> to remove the large files from the staging area.
3. Now you can push your commits and add the large file name to your .gitignore file.

Slack Webhook Not Working/‘404’


Error Message:
None
(The functionality of the webhook can be tested with the below code after setting up a slack
block, which if not functioning properly, will throw an error)
from prefect.blocks.notifications import SlackWebhook

slack_webhook_block = SlackWebhook.load("slack-block-name")
slack_webhook_block.notify("Hello!")
WARNING | apprise - Failed to send to Slack: Page not found.,
error=404

This can be fixed by adding your own webhook to the channel.

1. While logged in to the Temporary Data Engineering Zoomcamp slack (check the week 2
homework for an updated link), visit the Slack API Apps page to create a new app.
2. Using the splash or via the green “Create a new app” button, create an app “From an
app manifest”
3. Select the workspace “Temporary DE Zoomcamp Prefect Notifications”

4. Click “next” at the app manifest page (step 2 of 3)


5. Click “create”
6. At the next page, select “Incoming Webhooks” under “Add features and functionality”
7. Set the toggle to the right of “Activate incoming webhooks” to “On”
8. Click on the “Add new webhook to workspace” button now at the bottom of the screen
9. Select the #testing-notifications channel

10. The Webhook URL provided below is your new webhook- use it in place of the webhook
supplied by the homework.
Or you can use an existing one configuration:
1. Go to
https://temp-notify.slack.com/apps/A0F7XDUAZ-incoming-webhooks?tab=settings&next
_id=0
2. Click “edit” on any existing configuration

3. Scroll down to Webhook URL and copy URL


4. Paste it in your prefect notifications

GCS Bucket Block not found on Prefect Cloud UI


Step 1: From the command line type the following to register and configure the GCP blocks as
follows-
prefect block register -m prefect_gcp

Step 2: Go back to the UI and refresh and look at the blocks again, the GCP Bucket Block will
now be available for use.

File path does not exist after building Prefect to Docker

Tags: prefect, docker, copy


ValueError: Path /opt/prefect/C:\Users\(your
username)\.prefect\storage/5eeca69056a042a284e87ea46f757188 does not exist.

Solution: Remove the `task input hash` at the decorator function of your task. This is due to
Prefect “remembering” the absolute directory instead of relative.

File path does not exist after building Prefect deployment using
GitHub Bucket
Error: OSError: Cannot save file into a non-existent directory: 'data\green'

Solution: Create a directory data/green at the home directory of your repo.


Or

2.2.5 OSError: Cannot save file into a non-existent directory:


'..\\..\\data\\yellow'\n")
Add
if not path.parent.is_dir():
path.parent.mkdir(parents=True)
path = Path(path).as_posix()

see:
https://datatalks-club.slack.com/archives/C01FABYF2RG/p1675774214591809?thread_ts=1675
768839.028879&cid=C01FABYF2RG

File Path: Cannot save file into a non-existent directory: 'data/green'

Git won’t push an empty folder to GitHub, so if you put a file in that folder and then push, then
you should be good to go.

Or - in your code- make the folder if it doesn’t exist using Pathlib as shown here:
https://stackoverflow.com/a/273227/4590385.
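A one-line sketch of that Pathlib approach, assuming the same color variable used in the flow:

```python
from pathlib import Path

# create data/<color> (and any missing parents) before writing the parquet file
Path(f"data/{color}").mkdir(parents=True, exist_ok=True)
```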

For some reason, when using github storage, the relative path for writing locally no longer
works. Try using two separate paths, one full path for the local write, and the original relative
path for GCS bucket upload.

Prefect: Docker Connection Error on Windows 11/WSL 2


Error Message:
ERROR | prefect.engine - Engine execution of flow run '<hash>'
exited with unexpected exception
Traceback (most recent call last):
...
ConnectionRefusedError: [Errno 111] Connect call failed
...
httpcore.ConnectError: All connection attempts failed
...
httpx.ConnectError: All connection attempts failed
WARNING | prefect.infrastructure.docker-container - Docker container
<container-name> was removed before we could wait for its completion.

Likely associated with how WSL handles IPv4 addresses, as in this issue: WSL by default will
bind to an IPv6 address, which Prefect does not appear to handle.

A temporary fix can be had by using the subsystem IP assigned to the WSL instance. This
should work for Debian-based distros on WSL 2, including Ubuntu. Steps as follows:
1. In a WSL terminal, run the command ip addr to find the subsystem IP. You’re looking
for a connection with a valid inet address: in my example below, this is
172.22.53.106
eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group
default qlen 1000
link/ether 00:15:5d:f5:0c:25 brd ff:ff:ff:ff:ff:ff
inet 172.22.53.106/20 brd 172.22.63.255 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::215:5dff:fef5:c25/64 scope link
valid_lft forever preferred_lft forever
2. While in WSL, start Prefect Orion with that IP address as host:
prefect orion start --host 172.22.53.106
3. While in WSL, set the Prefect API to that IP address. The exact code for this should be
displayed when you start Prefect Orion in the step above.
prefect config set PREFECT_API_URL=http://172.22.53.106:4200/api
4. You should be able to run Prefect as set in the course videos from this point.

A couple of caveats:
● This is only a temporary fix. You'll likely need to copy the steps above for every new
instance of WSL- i.e., if you restart, look for the subsystem IP again and plug it into the
right places.
● I haven't completely ironed out the issues on my end, but I am getting new error
messages that suggest the docker deployment is running and connected (i.e.,
prefect.exceptions.ScriptError: Script at 'parameterized_flow.py'
encountered an exception: FileNotFoundError(2, 'No such file or
directory') ) Your mileage may vary.

Although I haven’t tested this, you may also be able to resolve the issue using docker-ce in
place of Docker Desktop for Windows.

Docker: Error on docker image build -t


failed to solve with frontend dockerfile.v0: failed to create LLB definition: dockerfile parse error
line 1: FROM requires either one or three arguments
Solved: check Dockerfile for some typo + space
FROM prefecthq/prefect:2.7.7-python3.9 (should not have space)

Run out of space in your machine while running prefect


Check:
- ls ~/.prefect/storage
- There might be cached data stored in there; try to remove some of it to get space back

Email Server Credentials block not found on Prefect Cloud


Solved:
It seems that some blocks that exist in the local Prefect UI do not exist by default in Prefect Cloud. For these blocks, we need to register the module manually by executing:
prefect block register -m <module-name>
For example, in my case:
prefect block register -m prefect_email
Don’t forget to install the block module with pip install first.

No column name lpep_pickup_datetime / tpep_pickup_datetime

Do not forget that the green dataset contains lpep_pickup_datetime while the yellow contains
tpep_pickup_datetime. Modify the script(s) accordingly based on the color of dataset when
needed.
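One possible sketch for handling both colors in the same script; the dictionary and variable names are made up for illustration:

```python
import pandas as pd

# map each dataset color to its pickup/dropoff column names
datetime_cols = {
    "green": ("lpep_pickup_datetime", "lpep_dropoff_datetime"),
    "yellow": ("tpep_pickup_datetime", "tpep_dropoff_datetime"),
}

pickup_col, dropoff_col = datetime_cols[color]   # color is assumed to be "green" or "yellow"
df[pickup_col] = pd.to_datetime(df[pickup_col])
df[dropoff_col] = pd.to_datetime(df[dropoff_col])
```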

Process to download the CSV using Pandas is killed right away

Read the CSV in chunks instead of loading it all at once:

df_iter = pd.read_csv(dataset_url, iterator=True, chunksize=100000)

Each chunk then needs to be appended to the parquet file using the fastparquet engine:

df.to_parquet(path, compression="gzip", engine='fastparquet', append=True)

Push to docker image failure

denied: requested access to the resource is denied

This can happen when you


1. Haven't logged in properly to Docker Desktop (use docker login -u "myusername")
2. Have used the wrong username when pushing to docker images. Use the same one as
your username and as the one you build on
docker image build -t <myusername>/<imagename>:<tag>
docker image push <myusername>/<imagename>:<tag>
Flow script fails with “killed” message:

16:21:35.607 | INFO | Flow run 'singing-malkoha' - Executing 'write_bq-b366772c-0'


immediately...

Killed

Solution: You probably are running out of memory on your VM and need to add more. For
example, if you have 8 gigs of RAM on your VM, you may want to expand that to 16 gigs.

FileNotFoundError: [Errno 2] No such file or directory:


In “etl_web_to_gcs.py”, when handling csv/parquet files, this error appeared because the directory the path variable pointed to did not exist.
Solution: Create the directories automatically by splitting the path variable into two variables, one for path_file and one for path_dir. On path_dir, use the method mkdir(parents=True, exist_ok=True) to create all needed directories. Also, note that two path objects can be joined with a slash (/):

@task()
def write_local(df: pd.DataFrame, color: str, dataset_file: str) -> Path:
"""Write DataFrame out locally as parquet file"""
path_file = f"{dataset_file}.parquet"
path_dir = Path(f"data/{color}")
path_dir.mkdir(parents=True, exist_ok=True)
df.to_parquet(path_dir / path_file, compression="gzip")
return path_dir / path_file

Access to a directory <folder_name> denied when trying to deploy and run a workflow through the agent

The problem is that prefect deployment recursively checks every directory, subdirectory and file under the directory that contains the .yaml file. In my case it was a Postgres volume directory from the previous week; I had accidentally created the Docker container using sudo, and that was the reason why Prefect didn’t have access to it.

Prefect Block regenerate on Prefect Cloud or elsewhere:


So, you have created a lot of blocks locally and now you are deploying your flow on Prefect Cloud, GCP or somewhere else. The issue you face is that you would have to recreate all the Prefect blocks your flow needs on that machine.
Solution: Doing this by hand in the UI is tedious. Instead, use the Python API to create the blocks in a script, then re-run that script anywhere to get the blocks created easily (see the sketch below).
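A hedged sketch of such a script using prefect-gcp; the block names, bucket name and key path are placeholders, and the exact fields may differ slightly between prefect-gcp versions:

```python
from prefect_gcp import GcpCredentials
from prefect_gcp.cloud_storage import GcsBucket

# recreate the credentials block from the service-account key (path is a placeholder)
credentials = GcpCredentials(service_account_file="~/.gc/your-key.json")
credentials.save("zoom-gcp-creds", overwrite=True)

# recreate the bucket block on top of it (names are placeholders)
bucket = GcsBucket(bucket="your-bucket-name", gcp_credentials=credentials)
bucket.save("zoom-gcs", overwrite=True)
```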

Question 4 Hint
Read the docs first: you execute Python code to clone the repo locally, then build the deployment. Note: you can also create the GitHub block from the UI, so you can skip the first part; make sure to provide a token if your repo is private. A sketch of creating the block from Python is shown below.
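A sketch of creating the GitHub block from Python instead of the UI; the repository URL, reference and block name are placeholders, and access_token is only needed for private repos:

```python
from prefect.filesystems import GitHub

github_block = GitHub(
    repository="https://github.com/<your-username>/<your-repo>",
    reference="main",              # branch or tag
    # access_token="<token>",      # uncomment for a private repo
)
github_block.save("zoom-github", overwrite=True)
```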

Docker - Finished in state Failed('Flow run encountered an


exception. ValueError: Path
/home/clamytoe/.prefect/storage/342da9557ddb4eecab92e889
7192f906 does not exist.\n'):
While trying to run the code from a docker container, it kept failing with this error. It was trying to
access the cache directory on my host system.

Solution: The issue turned out to be that the code had caching enabled. Once all of the caching
code was removed from the script, it worked flawlessly.

Docker - Rancher Desktop on Mac M1 Ventura: ValueError:


'host.docker.internal' does not appear to be an IPv4 or IPv6
address
When trying to run the code from a docker container declared as a block in Prefect, the DNS
'host.docker.internal' was not recognized.

Solution:
1. prefect orion start --host 192.168.178.90
Where the IP is the internal IP of my MAC
2. prefect config set PREFECT_API_URL=http://192.168.178.90:4200/api
3. Declaring the docker container block in Prefect with Network Mode = host
install -r requirements.txt on Mac: fatal error: Python.h: No such file or directory, compilation terminated.

Add the env variable:

export C_INCLUDE_PATH=/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/Headers

Add a new dependency in requirements.txt:

backports.zoneinfo==0.2.1

RuntimeError: Tasks cannot be run from within tasks. Did you mean to call this task in a flow?

For example: if you call the `bigquery_load_cloud_storage` function inside a task, you will get this error.

Solution: That function creates a task, and you can’t run a task inside another task in Prefect. Call the function from the flow itself, as in the sketch below.
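A minimal sketch of the correct structure; the task and flow names here are invented for illustration:

```python
from prefect import flow, task

@task
def extract() -> list:
    return [1, 2, 3]

@task
def load_to_bq(rows: list) -> None:
    print(f"loading {len(rows)} rows")

@flow
def etl() -> None:
    rows = extract()    # tasks are called from the flow...
    load_to_bq(rows)    # ...never from inside another task

if __name__ == "__main__":
    etl()
```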

Cannot find ‘secret’ block in prefect local UI

To fix this error I had to run prefect orion database reset. After the reset, the blocks that were not appearing in the UI appeared (no extra pip installs or block registering should be required). Please note that this will delete all the blocks you currently have saved, such as your GCP credentials, so note down how to recreate them before doing this step.

GCP VM: Disk Space is full

After playing around with prefect for a while this can happen.
Ssh to your VM and run sudo du -h --block-size=G | sort -n -r | head -n 30 to see which
directory needs the most space.
Most likely it will be …/.prefect/storage, where your cached flows are stored. You can delete
older flows from there. You also have to delete the corresponding flow in the UI, otherwise it will
throw you an error, when you try to run your next flow.
Error Found in Python Script Already Deployed as a Flow
Initially, I would run the entire process, from editing the script to building and applying a new deployment. However, I found the following to work best.

Solution:
Docker: Edit the Python script, rebuild your Docker image, push it to Docker Hub using the same name and tag, and run your deployment.
GitHub: Edit the Python script, commit your changes, push them to GitHub and run your deployment.

OSError: Failed to pull from remote: 'git' is not recognized as an


internal or external command.
Faced this issue when trying github flow code deployment.

Solution:
Take clone of your repo to local directory.
- change directory to cloned repo ($ cd de-zoom-camp)
- Initialize git with command, git init
- open your terminal in VS code, and activate virtual environment(eg; conda)
conda activate zoomcamp
-Install prefect-github with pip:
pip install prefect-github
- register the block,
prefect block register -m prefect_github

It worked successfully.

SSL Certificate Verify: (I got it when trying to run flows on MAC):


urllib.error.URLError: <urlopen error [SSL:
CERTIFICATE_VERIFY_FAILED]

pip install certifi


/Applications/Python\ {ver}/Install\ Certificates.command
or

running the “Install Certificate.command” inside of the python{ver} folder

Docker: container crashed with status code 137.


It means your container consumed all available RAM allocated to it. It can happen in particular
when working on Question#3 in the homework as the dataset is relatively large and containers
eat a lot of memory in general.
I would recommend restarting your computer and only starting the processes necessary to run the container. If that doesn’t work, allocate more resources to Docker. If that also doesn’t work because your workstation is a potato, you can use an online compute environment service like GitPod, which is free under 50 hours / month of use.

Timeout due to slow upload internet

In Q3 there was a task to run the ETL script from web to GCS. The problem was that it isn’t really an ETL straight from web to GCS; it is actually web -> local storage -> local memory -> GCS over the network. Yellow data is about 100 MB per month compressed and ~700 MB uncompressed in memory.
This leads to a problem where I either got a network-type error because of my slow internet connection, or my WSL2 crashed/hung because of an out-of-memory error and/or a 100% resource usage hang.

Solution:
If you have a lot of time at hand, try compressing the data to parquet and writing it to GCS with the timeout argument set to a really high number (the default is 60 seconds).

The yellow taxi data for Feb 2019 is about 100 MB as a parquet file:

gcp_cloud_storage_bucket_block.upload_from_path(
from_path=f"{path}",
to_path=path,
timeout=600
)

Error: Bucket name must begin and end with alphanumeric character.

Check your GCS Bucket block in the Prefect UI @ http://127.0.0.1:4200 - you may have added blank spaces at the beginning/end of the bucket name.

alembic.script.revision.ResolutionError: No such revision or


branch

https://discourse.prefect.io/t/installation-error-prefect-version-cmd-errors/1784/5

Run the following commands to restart your prefect


rm ~/.prefect/orion.db
prefect orion start

ImportError: cannot import name 'SecretField' from 'pydantic'

pip install pydantic==1.10.0

Pydantic changed to v2 and older Prefect versions are not compatible.

Try: pip install --force-reinstall -v "pydantic==1.10.0"

Week 3

Does Q2 of the homework refer to Q1?


This question could be old. I am confused: Could it be that the wording of the question changed
during the week? I think it was not as clear as now, at the beginning of the week. See the
question also in:
https://app.slack.com/client/T01ATQK62F8/threads/thread/C01FABYF2RG-1675969917.402839

Suggestion:
For question 2, I feel the question is not correct: either it means the count of the entire dataset for both tables (in which case the answer is among the options), or the answer is not there.

Docker-compose takes infinitely long to install the zip/unzip packages for Linux, which are required to unpack datasets

A:

1) Add the -y flag, so that apt-get automatically agrees to install additional packages

2) Use the Python zipfile package, which is included in all modern Python distributions (see the sketch below)
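A minimal sketch of option 2; the archive name and target folder are just examples:

```python
from zipfile import ZipFile

# extract every member of the archive into the data/ folder
with ZipFile("dataset.zip") as archive:
    archive.extractall("data/")
```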
If you’re having problems loading the FHV_2021 data from the
github repo into GCS and then into BQ (input file not of type
parquet), you need to do two things. First, append the URL
Template link with ‘?raw=true’ like so:

URL_TEMPLATE = URL_PREFIX + "/fhv_tripdata_{{


execution_date.strftime(\'%Y-%m\') }}.parquet?raw=true"

Second, make sure the URL_PREFIX is set to the following value:

URL_PREFIX = "https://github.com/alexeygrigorev/datasets/blob/master/nyc-tlc/fhv"

It is critical that you use this link with the keyword blob. If your link has ‘tree’ here, replace it.
Everything else can stay the same, including the curl -sSLf command.

“bq: command not found”


Run the following command to check if “BigQuery Command Line Tool” is installed or not:
gcloud components list

You can also use bq.cmd instead of bq to make it work.

I am having problems with column datatypes while running DBT/BigQuery

R: If you don’t define the column types while converting from csv to parquet, Python will “choose” them based on the first rows.

✅Solution: Define the schema while running the web_to_gcs.py pipeline.


Sebastian adapted the script:

https://github.com/sebastian2296/data-engineering-zoomcamp/blob/main/week_4_analytics_en
gineering/web_to_gcs.py
Same ERROR - When running dbt run for fact_trips.sql, the task failed with error:

“Parquet column 'ehail_fee' has type DOUBLE which does not match the target
cpp_type INT64”

Reason: Parquet files have their own schema. Some parquet files for green data have records
with decimals in ehail_fee column.

There are some possible fixes:

Drop the ehail_fee column since it is not really used, for instance when creating a partitioned table from the external table in BigQuery:

SELECT * EXCEPT (ehail_fee) FROM…

Modify stg_green_tripdata.sql model using this line cast(0 as numeric) as ehail_fee.

Modify Airflow dag to make the conversion and avoid the error.

pv.read_csv(src_file,
convert_options=pv.ConvertOptions(column_types = {'ehail_fee':
'float64'}))

Same type of ERROR - parquet files with different data types - Fix it with pandas

Here is another possibility that could be interesting:

You can specify the dtypes when importing the file from csv into a dataframe with pandas:

pd.read_csv(..., dtype=type_dict)

One obstacle is that the regular int64 pandas uses (this comes from the numpy library) does not accept null values (NaN, not a number). But you can use the pandas nullable Int64 instead - notice the capital ‘I’. The type_dict is a Python dictionary mapping the column names to the dtypes (see the example below).
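For example, a sketch using the FHV location columns; adapt the dictionary and file name to your own dataset:

```python
import pandas as pd

# "Int64" (capital I) is the pandas nullable integer dtype, so missing values are allowed
type_dict = {
    "PUlocationID": "Int64",
    "DOlocationID": "Int64",
}

df = pd.read_csv("fhv_tripdata_2019-01.csv.gz", dtype=type_dict)  # file name is an example
```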

Sources:

https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html

Nullable integer data type — pandas 1.5.3 documentation

Problem with prefect with macOS M1 (arm)


In video 2.2.2 at 15:16 when running the command `python ingest_data_flow.py`, I got
a big error message who said this:
ValueError: the greenlet library is required to use this function....
... is an incompatible architecture (have 'x86_64', need 'arm64'))

My computer is a MacBook Pro M1 and miniconda arm64 installed.

I searched for the arm version of greenlet but couldn't find it, so I found the following solution
instead.

I used the instructions from this site How to Manage Conda Environments on an Apple Silicon
M1 Mac to create a conda x86 environment.

So instead of running the command:


conda create -n zoom python=3.9

Instead, I ran the command:


create_x86_conda_environment myenv_x86 python=3.9

This solution works! And I can now continue to prefect on my arm computer!

ERROR Cannot read and write in different locations: source: EU,


destination: US - Loading data from GCS into BigQuery (different
Region):
Be careful when you create your resources on GCP, all of them have to share the same Region
in order to allow load data from GCS Bucket to BigQuery. If you forgot it when you created them,
you can create a new dataset on BigQuery using the same Region which you used on your
GCS Bucket.
This means that your GCS Bucket and the BigQuery dataset are placed in different regions. You
have to create a new dataset inside BigQuery in the same region with your GCS bucket and
store the data in the newly created dataset.

Failed to create table: Error while reading data, error message:


Parquet column 'XYZ' has type INT which does not match the
target cpp_type DOUBLE. File: gs://path/to/some/blob.parquet

Ultimately, when trying to ingest data into a BigQuery table, all files within a given directory must
have the same schema.

When dealing for example with the FHV Datasets from 2019, however (see image below), one
can see that the files for '2019-05', and 2019-06, have the columns "PUlocationID" and
"DOlocationID" as Integers, while for the period of '2019-01' through '2019-04', the same column
is defined as FLOAT.

So while importing these files as parquet to BigQuery, the first one will be used to define the
schema of the table, while all files following that will be used to append data on the existing
table. Which means, they must all follow the very same schema of the file that created the table.
So, in order to prevent errors like that, make sure to enforce the data types for the columns on
the DataFrame before you serialize/upload them to BigQuery. Like this:

pd.read_csv("path_or_url").astype({
"col1_name": "datatype",
"col2_name": "datatype",
...
"colN_name": "datatype"
})
Encoding error when writing data from web to GCS:
Make sure to use Nullable dataTypes, such as Int64 when appliable.

These won't work. You need to make sure you use Int64:
Incorrect:
df['DOlocationID'] = pd.to_numeric(df['DOlocationID'], downcast=integer) or
df['DOlocationID'] = df['DOlocationID'].astype(int)
Correct:
df['DOlocationID'] = df['DOlocationID'].astype('Int64')

Question 5: The partitioned/clustered table isn’t giving me the


prediction I expected

Take a careful look at the format of the dates in the question.

What does it mean Stop with loading the files into a bucket.' Stop with loading the
files into a bucket? In the homework

What they mean is that they don't want you to do anything more than that. You should load
the files into the bucket and create an external table based on those files (but nothing like
cleaning the data and putting it in parquet format)

Cannot read and write in different locations when creating


external table
Error Message:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 41721:
invalid start byte

Solution:
Step 1: When reading the data from the web into the pandas dataframe mention the encoding
as follows:
pd.read_csv(dataset_url, low_memory=False, encoding='latin1')
Step 2: When writing the dataframe from the local system to GCS as a csv mention the
encoding as follows:
df.to_csv(path_on_gsc, compression="gzip", encoding='utf-8')

Remember to save your queries


By the way, this isn’t a problem/solution, but a useful hint:
● Please, remember to save your progress in BigQuery SQL Editor.
● I was almost finishing the homework, when my Chrome Tab froze and I had to reload it.
Then I lost my entire SQL script.
● Save your script from time to time. Just click on the button at the top bar. Your saved file
will be available on the left panel.

Alternatively, you can copy paste your queries into an .sql file in your preferred editor
(Notepad++, VS Code, etc.). Using the .sql extension will provide convenient color formatting.

Fix Error when importing FHV data to GCS


If you receive the error gzip.BadGzipFile: Not a gzipped file (b'\n\n'), this is because you have
specified the wrong URL to the FHV dataset. Make sure to use
https://github.com/DataTalksClub/nyc-tlc-data/releases/download/fhv/{dataset_file}.csv.gz
Emphasising the ‘/releases/download’ part of the URL.

Reading parquets from nyc.gov directly into pandas returns Out of


bounds error
If for whatever reason you try to read parquets directly from nyc.gov’s cloudfront into pandas,
you might run into this error:
pyarrow.lib.ArrowInvalid: Casting from timestamp[us] to timestamp[ns] would result in out of
bounds
Cause:
1. there is one errant data record where the dropOff_datetime was set to year 3019 instead
of 2019.
2. pandas uses “timestamp[ns]” (as noted above), and int64 only allows a ~580 year range,
centered on 2000. See `pd.Timestamp.max` and `pd.Timestamp.min`
3. This becomes out of bounds when pandas tries to read it, because 3019 > 2262 (the approximate year of pd.Timestamp.max).
Fix:
1. Use pyarrow to read it:
import pyarrow.parquet as pq
df = pq.read_table('fhv_tripdata_2019-02.parquet').to_pandas(safe=False)
However this results in weird timestamps for the offending record
2. Read the datetime columns separately using pq.read_table

table = pq.read_table('taxi.parquet')
datetimes = ['list of datetime column names']
df_dts = pd.DataFrame()
for col in datetimes:
    df_dts[col] = pd.to_datetime(table.column(col), errors='coerce')

The `errors=’coerce’` parameter will convert the out of bounds timestamps into either the
max or the min
3. Use pyarrow.compute to filter out the offending rows:

import pyarrow as pa
import pyarrow.compute as pc

table = pq.read_table('taxi.parquet')
df = table.filter(
    pc.less_equal(table["dropOff_datetime"], pa.scalar(pd.Timestamp.max))
).to_pandas()
Unable to load data from external tables into a materialized table
in BigQuery due to an invalid timestamp error that are added
while appending data to the file in Google Cloud Storage

could not parse 'pickup_datetime' as timestamp for field pickup_datetime (position 2)

This error is caused by invalid data in the timestamp column. A way to identify the problem is to define the schema of the external table using the string data type. This enables the queries to work, at which point we can filter out the invalid rows from the import into the materialized table and cast the fields to the timestamp data type.

Unable to run command (shown in video) to export ML model from BQ to GCS
Issue: Tried running command to export ML model from BQ to GCS from Week 3
bq --project_id taxi-rides-ny extract -m nytaxi.tip_model gs://taxi_ml_model/tip_model

It is failing on following error:


BigQuery error in extract operation: Error processing job Not found: Dataset was not found in
location US

I verified the BQ dataset and GCS bucket are in the same region (us-west1). Not sure how it gets location US. I couldn't find the solution yet.
Solution: Please enter the correct project_id and gcs_bucket folder address. My gcs_bucket folder address is gs://dtc_data_lake_optimum-airfoil-376815/tip_model

I query my dataset and get a Bad character (ASCII 0) error?


- Check the Schema
- You might have wrong formatting
- Try to upload the CSV.GZ files without formatting or going through pandas via wget
- See this Slack conversation for helpful tips

Uploading files to GCS via GUI


This can help avoid schema issues in the homework.
Download files locally and use the ‘upload files’ button in GCS at the desired path. You can
upload many files at once. You can also choose to upload a folder.

Error: Missing close double quote (") character


To avoid this error you can upload data from Google Cloud Storage to BigQuery through
BigQuery Cloud Shell using the command:
$ bq load --autodetect --allow_quoted_newlines --source_format=CSV
dataset_name.table_name "gs://dtc-data-lake-bucketname/fhv/fhv_tripdata_2019-*.csv.gz"
Error on Running Prefect Flow to Load data to GCS
ValueError: Path /Users/kt/.prefect/storage/44ccce0813ed4f24ab2d3783de7a9c3a does not exist.

Remove ```cache_key_fn=task_input_hash``` from the arguments of your task and run your flow again.
Note: the cache key is beneficial if you happen to run the code multiple times; it won't repeat the work that already finished in a previous run. That also means that if you had this ```cache_key``` in your initial run, it might cause this error (see the sketch below).
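For illustration, a minimal sketch of the decorator change in a Prefect 2.x flow (the task name and body are made up):

from datetime import timedelta
from prefect import flow, task
from prefect.tasks import task_input_hash

# Before (can trigger the missing storage-path error if ~/.prefect/storage was cleaned):
# @task(retries=3, cache_key_fn=task_input_hash, cache_expiration=timedelta(days=1))

# After: drop the caching arguments
@task(retries=3)
def fetch(url: str):
    ...

@flow()
def etl_web_to_gcs():
    fetch("https://example.com/data.csv.gz")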

Cannot read and write in different locations: source: asia-south2, destination: US

Solution: This problem arises if your GCS bucket and BigQuery storage are in different regions.
One potential way to solve it:
1. Go to your Google Cloud bucket and check the region in the field named "Location".

2. Now in BigQuery, click on the three-dot icon near your project name and select Create dataset.
3. In the Region field choose the same region as you saw for your Google Cloud bucket (the same thing can be done from code, as sketched below).
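If you prefer creating the dataset programmatically, a hedged sketch with the BigQuery Python client (project, dataset and region names are placeholders):

from google.cloud import bigquery

client = bigquery.Client()

# The dataset location must match the bucket's region (e.g. asia-south2).
dataset = bigquery.Dataset("your-project-id.trips_data_all")
dataset.location = "asia-south2"
client.create_dataset(dataset, exists_ok=True)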
Tip: Using Cloud Function to read csv.gz files from github directly
to BigQuery in Google Cloud:

There are multiple benefits of using Cloud Functions to automate tasks in Google Cloud.

Use the Cloud Function Python script below to load files directly into BigQuery. Use your project id, dataset id & table id as defined by you.

import tempfile
import requests
import logging
from google.cloud import bigquery

def hello_world(request):

    # table_id = <project_id.dataset_id.table_id>
    table_id = 'de-zoomcap-project.dezoomcamp.fhv-2019'

    # Create a new BigQuery client
    client = bigquery.Client()

    for month in range(4, 13):

        # URL of the CSV.gz file for this month
        url = 'https://github.com/DataTalksClub/nyc-tlc-data/releases/download/fhv/fhv_tripdata_2019-{:02d}.csv.gz'.format(month)

        # Download the CSV.gz file from Github
        response = requests.get(url)

        # Create new table if loading first month data, else append
        write_disposition_string = "WRITE_APPEND" if month > 1 else "WRITE_TRUNCATE"

        # Define LoadJobConfig with the schema of the table to prevent it from
        # changing with every file
        job_config = bigquery.LoadJobConfig(
            schema=[
                bigquery.SchemaField("dispatching_base_num", "STRING"),
                bigquery.SchemaField("pickup_datetime", "TIMESTAMP"),
                bigquery.SchemaField("dropOff_datetime", "TIMESTAMP"),
                bigquery.SchemaField("PUlocationID", "STRING"),
                bigquery.SchemaField("DOlocationID", "STRING"),
                bigquery.SchemaField("SR_Flag", "STRING"),
                bigquery.SchemaField("Affiliated_base_number", "STRING"),
            ],
            skip_leading_rows=1,
            write_disposition=write_disposition_string,
            autodetect=True,
            source_format="CSV",
        )

        # Load the data into BigQuery.
        # Use a temporary file to prevent the exception:
        # AttributeError: 'bytes' object has no attribute 'tell'
        with tempfile.NamedTemporaryFile() as f:
            f.write(response.content)
            f.seek(0)

            job = client.load_table_from_file(
                f,
                table_id,
                location="US",
                job_config=job_config,
            )
            job.result()
            logging.info("Data for month %d successfully loaded into table %s.",
                         month, table_id)

    return 'Data loaded into table {}.'.format(table_id)

Tip: Downloading csv.gz from a url in a prefect environment (sample snippet)

import requests
from prefect import flow, task

@task
def download_file(url: str, file_path: str):
    response = requests.get(url)
    open(file_path, "wb").write(response.content)
    return file_path

@flow
def extract_from_web() -> None:
    file_path = download_file(url=f'{url-filename}.csv.gz', file_path=f'{filename}.csv.gz')

What do I do if my VM runs out of space?


- Try deleting data you’ve saved to your VM locally during ETLs
- Kill processes related to deleted files
- Download ncdu and look for large files (pay particular attention to files related to Prefect)
- If you delete any files related to Prefect, eliminate caching from your flow code

When querying two different tables (external and materialized) you get the same result for count(distinct(*))
You need to uncheck cache preferences in query settings

How to handle type error from big query and parquet data?

Problem: When you ingest data into GCS using Pandas, there is a chance that some dataset
has missing values in DOlocationID and PUlocationID. Pandas by default will cast these
columns as float data type, causing inconsistent data type between parquet in GCS and schema
defined in big query. You will see something like this:
error: Error while reading table: trips_data_all.external_fhv_tripdata,
error message: Parquet column 'DOlocationID' has type INT64 which does not
match the target cpp_type DOUBLE.

Solution:
- Fix the data type issue in data pipeline
- Before ingesting data into GCS, use astype with Int64 (which, unlike int64, accepts both missing values and integers in the column) to cast the columns.

Something like:
df["PUlocationID"] = df.PUlocationID.astype("Int64")
df["DOlocationID"] = df.DOlocationID.astype("Int64")
NOTE: It is best to define the data type of all the columns in the
Transformation section of the ETL pipeline before loading to BigQuery

Invalid project ID. Project IDs must contain 6-63 lowercase letters, digits, or dashes. Some project…

The problem occurs when misplacing content after the FROM clause in BigQuery SQL.
Check and remove any extra spaces or other symbols; keep only lowercase letters, digits and dashes.

DATE() Error in BigQuery

Error Message:
PARTITION BY expression must be DATE(<timestamp_column>),
DATE(<datetime_column>), DATETIME_TRUNC(<datetime_column>,
DAY/HOUR/MONTH/YEAR), a DATE column,
TIMESTAMP_TRUNC(<timestamp_column>, DAY/HOUR/MONTH/YEAR),
DATE_TRUNC(<date_column>, MONTH/YEAR), or RANGE_BUCKET(<int64_column>,
GENERATE_ARRAY(<int64_value>, <int64_value>[, <int64_value>]))
Solution:
Convert the column to datetime first.
df["pickup_datetime"] = pd.to_datetime(df["pickup_datetime"])
df["dropOff_datetime"] = pd.to_datetime(df["dropOff_datetime"])
Native tables vs External tables in BigQuery?

Native tables are tables where the data is stored in BigQuery. External tables store the data
outside BigQuery, with BigQuery storing metadata about that external table.

Resources:
● https://cloud.google.com/bigquery/docs/external-tables
● https://cloud.google.com/bigquery/docs/tables-intro
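For reference, a hedged sketch of creating an external table over files in GCS with the BigQuery Python client (the bucket, project and table names are placeholders):

from google.cloud import bigquery

client = bigquery.Client()

# Metadata-only definition; the data itself stays in the GCS bucket.
external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://your-bucket/fhv/fhv_tripdata_2019-*.csv.gz"]
external_config.compression = "GZIP"
external_config.autodetect = True

table = bigquery.Table("your-project.trips_data_all.external_fhv_tripdata")
table.external_data_configuration = external_config
client.create_table(table, exists_ok=True)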

Week 4

When attempting to use the provided quick script to load trip data
into GCS, you receive error Access Denied from the S3 bucket
If the provided URL isn’t working for you (https://nyc-tlc.s3.amazonaws.com/trip+data/):

We can use the GitHub CLI to easily download the needed trip data from
https://github.com/DataTalksClub/nyc-tlc-data, and manually upload to a GCS bucket.

Instructions on how to download the CLI here: https://github.com/cli/cli

Commands to use:

gh auth login

gh release list -R DataTalksClub/nyc-tlc-data

gh release download yellow -R DataTalksClub/nyc-tlc-data

gh release download green -R DataTalksClub/nyc-tlc-data

etc.

Now you can upload the files to a GCS bucket using the GUI.
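If you would rather skip the GUI, a hedged sketch of uploading the downloaded files with the google-cloud-storage client (the bucket name and prefix are placeholders):

import os
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("your-gcs-bucket")

# Upload every downloaded .csv.gz file under a simple "yellow/" prefix in the bucket
for filename in os.listdir("."):
    if filename.endswith(".csv.gz"):
        bucket.blob(f"yellow/{filename}").upload_from_filename(filename)
        print(f"uploaded {filename}")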
All of a sudden, ssh stopped working on my VM after the last restart
One common cause experienced is lack of space after running prefect several times. When
running prefect, check the folder ‘.prefect/storage’ and delete the logs now and then to
avoid the problem.

If you have lost SSH access to your machine due to lack of space: Permission denied (publickey)
You can try these steps:
When running your first dbt model, if it fails with an error:
404 Not found: Dataset was not found in location US404 Not
found: Dataset eighth-zenith-372015:trip_data_all was not found
in location us-west1

R: Go to BigQuery, and check the location of BOTH

1. The source dataset (trips_data_all), and

2. The schema you’re trying to write to (name should be dbt_<first initial><last name>
(if you didn’t change the default settings at the end when setting up your project))

Likely, your source data will be in your region, but the write location will be a multi-regional
location (US in this example). Delete these datasets, and recreate them with your specified
region and the correct naming format.

Alternatively, instead of removing datasets, you can specify the single-region location you are using. E.g. instead of 'location: US', specify the region, e.g. 'location: us-east1'. See this Github comment for more detail. Additionally, please see this post by Sandy.

In DBT cloud you can actually specify the location using the following steps:
1. Go to your profile page (top right drop-down --> profile)

2. Then go to Credentials --> Analytics (you may have customised this name)

3. Click on Bigquery >

4. Hit Edit

5. Update your location, you may need to re-upload your service account JSON to re-fetch
your private key, and save. (NOTE: be sure to exactly copy the region BigQuery
specifies your dataset is in.)

When executing dbt run after installing the latest dbt-utils version (i.e., 1.0.0), a warning is generated
Error: `dbt_utils.surrogate_key` has been replaced by `dbt_utils.generate_surrogate_key`

Fix:

Replace dbt_utils.surrogate_key with dbt_utils.generate_surrogate_key in stg_green_tripdata


When executing dbt run after fact_trips.sql has been created,
the task failed with error:
R: “Access Denied: BigQuery BigQuery: Permission denied while globbing file
pattern.”

1. Fixed by adding the Storage Object Viewer role to the service account in use in BigQuery.

2. Add the related roles to the service account in use in GCS.


When you are getting the error dbt_utils not found
You need to create a packages.yml file in the main project directory and add the package's metadata:

packages:
  - package: dbt-labs/dbt_utils
    version: 0.8.0

After creating the file, run:

dbt deps

and hit enter.

Lineage is currently unavailable. Check that your project does not contain compilation errors or contact support if this error persists.

Ensure you properly format your yml file.

Why do my Fact_trips only contain a few days of data?


Make sure you use:

dbt run --var 'is_test_run: false' or dbt build --var 'is_test_run: false'

(watch out for formatted text from this document: re-type the single quotes)

Why do my fact_trips only contain one month of data?


Check if you specified if_exists argument correctly when writing data from GCS to BigQuery.
When I wrote my automated flow for each month of the years 2019 and 2020 for green and
yellow data I had specified if_exists="replace" while I was experimenting with the flow setup.
Once you want to run the flow for all months in 2019 and 2020 make sure to set
if_exists="append"
- if_exists="replace" will replace the whole table with only the month of data that you are writing into BigQuery in that one iteration -> you end up with only one month in BigQuery (the last one you inserted)
- if_exists="append" will append the new monthly data -> you end up with data from all months (see the sketch below)
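For illustration, a minimal sketch of the idea with pandas' to_gbq (the table name, project id and source path are placeholders, not the course's exact flow code):

import pandas as pd

# one month of taxi data previously written to the data lake (reading gs:// paths requires gcsfs)
df = pd.read_parquet("gs://your-bucket/yellow/yellow_tripdata_2019-01.parquet")

df.to_gbq(
    destination_table="trips_data_all.yellow_tripdata",
    project_id="your-project-id",
    if_exists="append",  # "replace" would overwrite the previous months on every iteration
)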

BigQuery returns an error when I try to run the dm_monthly_zone_revenue.sql model.

R: After the second SELECT, change this line:

date_trunc('month', pickup_datetime) as revenue_month,

To this line:

date_trunc(pickup_datetime, month) as revenue_month,

Make sure that “month” isn’t surrounded by quotes!

I changed location in dbt, but dbt run still gives me an error


Remove the dataset from BigQuery which was created by dbt and run dbt run again so that it
will recreate the dataset in BigQuery with the correct location

I ran dbt run without specifying the variable, which gave me a table of 100 rows. I ran again with the variable value specified, but my table still has 100 rows in BQ.
Remove the dataset from BigQuery created by dbt and run again (with test disabled) to ensure
the dataset created has all the rows.

Error thrown by format_to_parquet_task when converting fhv_tripdata_2020-01.csv using Airflow
R: This conversion is needed for the question 3 of homework, in order to process files for fhv
data. The error is:

pyarrow.lib.ArrowInvalid: CSV parse error: Expected 7 columns, got 1:


B02765

Cause: Some random line breaks in this particular file.

Fixed by opening a bash in the container executing the dag and manually running the following
command that deletes all \n not preceded by \r.

perl -i -pe 's/(?<!\r)\n/\1/g' fhv_tripdata_2020-01.csv

After that, clear the failed task in Airflow to force re-execution.

Why do we need the Staging dataset?


Vic created three different datasets in the videos. dbt_<name> was used for development, and a production dataset was used for the production environment. What was the use for the staging dataset?

R: Staging, as the name suggests, is like an intermediate between the raw datasets and the fact
and dim tables, which are the finished product, so to speak. You'll notice that the datasets in
staging are materialised as views and not tables.

Vic didn't use it for the project, you just need to create production and dbt_name + trips_data_all
that you had already.

DBT Docs Served but Not Accessible via Browser


Try removing the “network: host” line in docker-compose.

BigQuery adapter: 404 Not found: Dataset was not found in location europe-west6
Go to Account settings >> Project Analytics >> Click on your connection >> go all the way down to Location and type in the GCP location just as displayed in GCP (e.g. europe-west6). You might need to reupload your GCP key.
Main branch is “read-only”
Create a new branch to edit. More on this can be found here in the dbt docs.

It appears that I can't edit the files because I'm in read-only. Does anyone know how I can change that?
Create a new branch and switch to this branch. It allows you to make changes. Then you
can commit and push the changes to the “main” branch.

Compilation Error (Model 'model.my_new_project.stg_green_tripdata' (models/staging/stg_green_tripdata.sql) depends on a source named 'staging.green_trip_external' which was not found)

If you're following video DE Zoomcamp 4.3.1 - Building the First DBT Models, you
may have encountered an issue at 14:25 where the Lineage graph isn't displayed
and a Compilation Error occurs, as shown in the attached image. Don't worry - a
quick fix for this is to simply save your schema.yml file. Once you've done this, you
should be able to view your Lineage graph without any further issues.
Compilation Error in test
accepted_values_stg_green_tripdata_Payment_type__False___v
ar_payment_type_values_ (models/staging/schema.yml)

'NoneType' object is not iterable

> in macro test_accepted_values (tests/generic/builtin.sql)


> called by test
accepted_values_stg_green_tripdata_Payment_type__False___var_payment_t
ype_values_ (models/staging/schema.yml)

Remember that you have to add to dbt_project.yml the vars:

vars:
payment_type_values: [1, 2, 3, 4, 5, 6]

Troubleshooting in dbt:
The dbt error log contains a link to BigQuery. When you follow it you will see your query and the
problematic line will be highlighted.

Why does changing the target schema to "marts" actually create a schema named "dbt_marts" instead?
It is a default behaviour of dbt to append custom schema to initial schema. To override this
behaviour simply create a macro named “generate_schema_name.sql”:

{% macro generate_schema_name(custom_schema_name, node) -%}


{%- set default_schema = target.schema -%}
{%- if custom_schema_name is none -%}
{{ default_schema }}
{%- else -%}
{{ custom_schema_name | trim }}
{%- endif -%}
{%- endmacro %}

Now you can override default custom schema in “dbt_project.yml”:


How to set subdirectory of the github repository as the dbt project
root
There is a project setting which allows you to set `Project subdirectory` in dbt cloud:
Compilation Error : Model 'model.XXX'
(models/<model_path>/XXX.sql) depends on a source named '<a
table name>' which was not found
Remember that you should modify your .sql models accordingly, to read from existing table names in your BigQuery/postgres db.
Example: select * from {{ source('staging', '<your table name in the database>') }}

Compilation Error : Model '<model_name>' (<model_path>) depends on a node named '<seed_name>' which was not found (Production Environment)

Make sure that you create a pull request from your Development branch to the Production
branch (main by default). After that, check in your ‘seeds’ folder if the seed file is inside it.
Another thing to check is your .gitignore file. Make sure that the .csv extension is not included.

When executing dbt run after using fhv_tripdata as an external table: you get "Access Denied: BigQuery BigQuery: Permission denied"
1. Go to your dbt cloud service account.
2. Add the Storage Object Admin and Storage Admin roles in addition to BigQuery Admin.

How to automatically infer the column data type (pandas missing value issues)?

Problem: when ingesting data into BigQuery, you may face a type error. This is because pandas by default will parse integer columns with missing values as float type.

Solution:

- One way to solve this problem is to specify/ cast data type Int64 during the data
transformation stage.
- However, you may be lazy to type all the int columns. If that is the case, you can simply
use convert_dtypes to infer the data type

# Make pandas to infer correct data type (as pandas parse int with
missing as float)

df.fillna(-999999, inplace=True)

df = df.convert_dtypes()
df = df.replace(-999999, None)

When loading the github repo, an exception is raised that 'taxi_zone_lookup' was not found

Seed files are loaded from the directory named 'seed'; that's why you should rename the directory named 'data' to 'seed'.

‘taxi_zone_lookup’ not found


Check the .gitignore file and make sure you don’t have *.csv in it

Dbt error 404 was not found in location

My specific error:
Runtime Error in rpc request (from remote system.sql) 404 Not found: Table
dtc-de-0315:trips_data_all.green_tripdata_partitioned was not found in location europe-west6
Location: europe-west6 Job ID: 168ee9bd-07cd-4ca4-9ee0-4f6b0f33897c

Make sure all of your datasets have the correct region and not a generalised region:
Europe-west6 as opposed to EU

Match this in dbt settings:


dbt -> projects -> optional settings -> manually set location to match

Unable to configure Continuous Integration (CI) with Github


If you're trying to configure CI with Github and on the job's options you can't see Run on Pull Requests? under triggers, you have to reconnect with Github using the native connection instead of cloning by SSH. Follow these steps:

1. On Profile Settings > Linked Accounts connect your Github account with the dbt project, allowing the permissions asked. More info at https://docs.getdbt.com/docs/collaborate/git/connect-gith

2. Disconnect your current Github configuration from Account Settings > Projects (analytics) > Github connection. At the bottom left appears the button Disconnect; press it.

3. Once we have confirmed the change, we can configure it again. This time, choose Github and it will show all the repositories which you have allowed to work with dbt. Select your repository and it's ready.

4. Go to the Deploy > job configuration page and go down until Triggers; now you can see the option Run on Pull Requests:
Data type errors when ingesting with parquet files

The easiest way to avoid these errors is by ingesting the relevant data in a .csv.gz file
type. Then, do:
CREATE OR REPLACE EXTERNAL TABLE `dtc-de.trips_data_all.fhv_tripdata`

OPTIONS (

format = 'CSV',

uris = ['gs://dtc_data_lake_dtc-de-updated/data/fhv_all/fhv_tripdata_2019-*.csv.gz']

);

As an example. You should no longer have any data type issues for week 4.

Inconsistent number of rows when re-running fact_trips


model
This is due to the way the deduplication is done in the two staging files.

Solution: add an order by to the window (the partition by part) in both staging files. Keep adding columns to the order by until the number of rows in the fact_trips table is consistent when re-running the fact_trips model.

Explanation (a bit convoluted, feel free to clarify, correct etc.)

We partition by vendor id and pickup_datetime and choose the first row (rn=1) from all these
partitions. These partitions are not ordered, so every time we run this, the first row might be a
different one. Since the first row is different between runs, it might or might not contain an
unknown borough. Then, in the fact_trips model we will discard a different number of rows when
we discard all values with an unknown borough.

Data Type Error when running fact table


If you encounter data type error on trip_type column, it may due to some nan values that isn’t
null in bigquery.

Solution: try casting it to FLOAT datatype instead of NUMERIC


CREATE TABLE has columns with duplicate name locationid.
This error could result if you are using some select * query without mentioning the name of table
for ex:

with dim_zones as (
select * from `engaged-cosine-374921`.`dbt_victoria_mola`.`dim_zones`
where borough != 'Unknown'
),

fhv as (
select * from `engaged-cosine-374921`.`dbt_victoria_mola`.`stg_fhv_tripdata`
)
select * from fhv
inner join dim_zones as pickup_zone
on fhv.PUlocationID = pickup_zone.locationid
inner join dim_zones as dropoff_zone
on fhv.DOlocationID = dropoff_zone.locationid

To resolve it, just use: select fhv.* from fhv

Bad int64 value: 0.0 error

Some ehail fees are null and casting them to integer gives Bad int64 value: 0.0 error,
Solution:
Using safe_cast returns NULL instead of throwing an error. So use safe_cast from dbt_utils
function in the jinja code for casting into integer as follows:
{{ dbt_utils.safe_cast('ehail_fee', api.Column.translate_type("integer"))}} as
ehail_fee,

The --vars argument must be a YAML dictionary, but was of type str
Remember to add a space between the variable and the value. Otherwise, it won't be
interpreted as a dictionary.
It should be:
dbt run --var 'is_test_run: false'
Not able to change Environment Type as it is greyed out and
inaccessible

You don't need to change the environment type. If you are following the videos, you are creating a Production Deployment, so the only available option is the correct one.

Could not parse the dbt project. please check that the
repository contains a valid dbt project

Running the Environment on the master branch causes this error; you must activate the "Only run on a custom branch" checkbox and specify the branch you are working on when the Environment is set up.
Made changes to your model files and committed them to your development branch, but the job still runs on the old files?
Change to the main branch and make a pull request from the development branch.
Note: this will take you to Github.
Approve the merge and rerun your job; it will work as planned now.

I've set up Github and BigQuery in dbt successfully. Why is nothing showing in my Develop tab?

Before you can develop a data model in dbt, you should create a development environment and set some parameters on it. After the model is developed, you should also create a deployment environment to create and run jobs.

Prefect Agent retrieving runs from queue sometimes fails with httpx.LocalProtocolError
Error Message:
Investigate Sentry error: ProtocolError "Invalid input
ConnectionInputs.SEND_HEADERS in state ConnectionState.CLOSED"
Solution:
reference
Run it again because it happens sometimes. Or wait a few minutes, it will continue.

BigQuery returns an error when I try to run 'dbt run':


My taxi data was loaded into gcs with etl_web_to_gcs.py script that converts csv data into
parquet. Then I placed raw data trips into external tables and when I executed dbt run I got an
error message: Parquet column 'passenger_count' has type INT64 which does not match the
target cpp_type DOUBLE. It is because several columns in files have different formats of data.
When I added df[col] = df[col].astype('Int64') transformation to the columns: passenger_count,
payment_type, RatecodeID, VendorID, trip_type it went ok. Several people also faced this error
and more about it you can read on the slack channel.
Running dbt run --models stg_green_tripdata --var 'is_test_run:
false' is not returning anything:

Use the syntax below instead if the code in the tutorial is not working.
dbt run --select stg_green_tripdata --vars '{"is_test_run": false}'

Week 5

Spark-shell: unable to load native-hadoop library for platform - Windows
If after installing Java (either jdk or openjdk), Hadoop and Spark, and setting the corresponding
environment variables you find the following error when spark-shell is run at CMD:

java.lang.IllegalAccessError: class
org.apache.spark.storage.StorageUtils$ (in unnamed module @0x3c947bc5)
cannot access class sun.nio.ch.DirectBuffer (in module java.base)
because module java.base does not export sun.nio.ch to unnamed
module @0x3c947bc5

Solution: Java 17 or 19 is not supported by this Spark version (Spark 3.0–3.2 require Java 8 or 11; Java 17 support only arrived in Spark 3.3). Install Java 11 from the website provided in the windows.md setup file.

Python was not found; run without arguments to install from the
Microsoft Store, or disable this shortcut from Settings > Manage
App Execution Aliases.
I found this error while executing the user defined function in Spark (crazy_stuff_udf). I am
working on Windows and using conda. After following the setup instructions, I found that the
PYSPARK_PYTHON environment variable was not set correctly, given that conda has different
python paths for each environment.

Solution:

● pip install findspark on the command line inside proper environment


● Add to the top of the script

import findspark
findspark.init()
Easy setup with miniconda env (worked on MacOS)
If anyone is a Pythonista or becoming one (which you will essentially be one along this journey),
and desires to have all python dependencies under same virtual environment (e.g. conda) as
done with prefect and previous exercises, simply follow these steps

1. Install OpenJDK 11,


a. on MacOS: $ brew install java11
b. Add export PATH="/opt/homebrew/opt/openjdk@11/bin:$PATH"
to ~/.bashrc or ~/zshrc
2. Activate working environment (by pipenv / poetry / conda)
3. Run $ pip install pyspark
4. Work with exercises as normal

All default Spark commands will also be available in a shell session under the activated environment.

Hope this can help!

P.s. you won't need findspark to initialize it first.

Py4JJavaError: An error occurred while calling o35.csv. : java.net.ConnectException: Call From USERNAME/192.168.29.38 to localhost:9000 failed on connection exception: java.net.ConnectException: Connection refused: no further information;

If you're getting `Py4JJavaError` with a generic root cause such as the one described above (Connection refused: no further information), you're most likely using incompatible versions of the JDK or Python with Spark.

As of the current latest Spark version (3.3.2), it supports JDK 8 / 11 / 17. All of which can be easily
installed with SDKMan!:

$ sdk list java


$ sdk install java 17.0.6-librca

More importantly, Python 3.11 is not yet stable for PySpark. So, make sure you're setting up your
virtualenv with either Python 3.9 or Python 3.10

$ conda create -n ENV_NAME python=3.10 or


$ conda create -n ENV_NAME python=3.9

Followed by installing PySpark properly within the virtualenv


$ conda activate ENV_NAME
$ pip install pyspark
RuntimeError: Java gateway process exited before sending its port number

After installing all including pyspark (and it is successfully imported), but then running this script
on the jupyter notebook

import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder \
.master("local[*]") \
.appName('test') \
.getOrCreate()

df = spark.read \
.option("header", "true") \
.csv('taxi+_zone_lookup.csv')

df.show()

it gives the error:

RuntimeError: Java gateway process exited before sending its port number

✅The solution (for me) was:


● pip install findspark on the command line and then
● Add

import findspark
findspark.init()

to the top of the script.

Another possible solution is:

● Check that pyspark is pointing to the correct location.


● Run pyspark.__file__. It should list /home/<your user name>/spark/spark-3.0.3-bin-hadoop3.2/python/pyspark/__init__.py if you followed the videos.
● If it is pointing to your python site-packages, remove the pyspark directory there and check that you have added the correct exports to your .bashrc file and that there are no other exports which might supersede the ones provided in the course content.
To add to the solution above, if the errors persist with regard to setting the correct path for Spark, an alternative way to solve the error permanently is to set the system and user environment variables following this tutorial:
Install Apache PySpark on Windows PC | Apache Spark Installation Guide

- Once everything is installed, skip to 7:14 to set up environment variables. This allows for
the environment variables to be set permanently.

Module Not Found Error in Jupyter Notebook


Even after installing pyspark correctly on a linux machine (VM) as per the course instructions, I faced a module not found error in jupyter notebook.

The solution which worked for me(use following in jupyter notebook) :

!pip install findspark

import findspark

findspark.init()

Thereafter, import pyspark and create the spark context as usual.

None of the solutions above worked for me till I ran !pip3 install pyspark instead of !pip install pyspark.

ModuleNotFoundError: No module named 'py4j' while executing `import pyspark`
Make sure that the version under `${SPARK_HOME}/python/lib/` matches the filename of
py4j or you will encounter `ModuleNotFoundError: No module named 'py4j'` while
executing `import pyspark`.

For instance, if the file under `${SPARK_HOME}/python/lib/` was `py4j-0.10.9.3-src.zip`, then the export PYTHONPATH statement above should be changed accordingly to `export PYTHONPATH="${SPARK_HOME}/python/lib/py4j-0.10.9.3-src.zip:$PYTHONPATH"`.

Additionally, you can check for the version of ‘py4j’ of the spark you’re using from here and
update as mentioned above.

TypeError: code() argument 13 must be str, not int, while executing `import pyspark` (Windows / Spark 3.0.3 - Python 3.11)

This is because Python 3.11 has some inconsistencies with such an old version of Spark. The
solution is a downgrade in the Python version. Python 3.9 using a conda environment takes
care of it.

Exception: Jupyter command `jupyter-notebook` not found.


Even after we have exported our paths correctly you may find that even though
Jupyter is installed you might not have Jupyter Notebook for one reason or
another. Full instructions are found here (for my walkthrough) or here (where I
got the original instructions from) but are included below. These instructions
include setting up a virtual environment (handy if you are on your own machine
doing this and not a VM):

Full steps:

1. Update and upgrade packages:

a. sudo apt update && sudo apt -y upgrade

2. Install Python:

a. sudo apt install python3-pip python3-dev

3. Install Python virtualenv:

a. sudo -H pip3 install --upgrade pip

b. sudo -H pip3 install virtualenv

4. Create a Python Virtual Environment:


a. mkdir notebook

b. cd notebook

c. virtualenv jupyterenv

d. source jupyterenv/bin/activate

5. Install Jupyter Notebook:

a. pip install jupyter

6. Run Jupyter Notebook:

a. jupyter notebook

Error java.io.FileNotFoundException
Code executed:

df = spark.read.parquet(pq_path)

… some operations on df …

df.write.parquet(pq_path, mode="overwrite")

java.io.FileNotFoundException: File
file:/home/xxx/code/data/pq/fhvhv/2021/02/part-00021-523f9ad5-14af-4332-9434-bdcb0831
f2b7-c000.snappy.parquet does not exist

The problem is that Spark performs lazy evaluation, so the actual action that triggers the job is df.write, which deletes the very parquet files it is still trying to read (mode="overwrite").

✅Solution: Write to a different directory:

df.write.parquet(pq_path_temp, mode="overwrite")
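A minimal sketch of the pattern (paths are placeholders); write to a temporary location instead of overwriting the input you are still reading from:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("rewrite").getOrCreate()

pq_path = "data/pq/fhvhv/2021/02/"
pq_path_temp = "data/pq/fhvhv_tmp/2021/02/"

df = spark.read.parquet(pq_path)
df_transformed = df.repartition(4)  # ...your transformations here...

# Never overwrite the directory you are reading from in the same job
df_transformed.write.parquet(pq_path_temp, mode="overwrite")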
FileNotFoundException: Hadoop bin directory does not exist ,
when trying to write (Windows)
You need to create the Hadoop /bin directory manually and add the downloaded files in there,
since the shell script provided for Windows installation just puts them in /c/tools/hadoop-3.2.0/ .

Which type of SQL is used in Spark? Postgres? MySQL? SQL Server?

Actually Spark SQL is one independent “type” of SQL - Spark SQL.

The several SQL providers are very similar:

SELECT [attributes]

FROM [table]

WHERE [filter]

GROUP BY [grouping attributes]

HAVING [filtering the groups]

ORDER BY [attribute to order]

(INNER/FULL/LEFT/RIGHT) JOIN [table2]

ON [attributes table joining table2] (...)

What differs the most between several SQL providers are built-in functions.

For Built-in Spark SQL function check this link:


https://spark.apache.org/docs/latest/api/sql/index.html

Extra information on SPARK SQL :

https://databricks.com/glossary/what-is-spark-sql#:~:text=Spark%20SQL%20is%20a%20Spark,
on%20existing%20deployments%20and%20data.
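For example, a quick hedged sketch of calling a Spark SQL built-in function from PySpark (the file path and column names are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("sql-demo").getOrCreate()

df = spark.read.parquet("data/pq/green/2021/01/")
df.createOrReplaceTempView("trips")

# date_trunc is one of the Spark SQL built-in functions documented at the link above
spark.sql("""
    SELECT date_trunc('month', lpep_pickup_datetime) AS revenue_month,
           SUM(total_amount) AS revenue
    FROM trips
    GROUP BY 1
    ORDER BY 1
""").show()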
The spark viewer on localhost:4040 was not showing the
current run
✅Solution: I had two notebooks running, and the one I wanted to look at had opened a port on
localhost:4041.

If the port is in use, Spark uses the next one; it can even be 4044. You can run
spark.sparkContext.uiWebUrl

and the result will be something like

'http://172.19.10.61:4041'

java.lang.NoSuchMethodError:
sun.nio.ch.DirectBuffer.cleaner()Lsun/misc/Cleaner Error
during repartition call (conda pyspark installation)
✅Solution: replace Java Developer Kit 11 with Java Developer Kit 8.
RuntimeError: Java gateway process exited before sending
its port number
The notebook log shows that JAVA_HOME is not set:

https://sparkbyexamples.com/pyspark/pyspark-exception-java-gateway-process-exited-before-s
ending-the-driver-its-port-number/

Spark fails when reading from BigQuery and using `.show()` on `SELECT` queries

✅I got it working using `gcs-connector-hadoop-2.2.5-shaded.jar` and Spark 3.1


I also added the google_credentials.json and .p12 to auth with gcs. These files are
downloadable from GCP Service account.

To create the SparkSession:

spark = SparkSession.builder.master('local[*]') \
    .appName('spark-read-from-bigquery') \
    .config('BigQueryProjectId', 'razor-project-xxxxxxx') \
    .config('BigQueryDatasetLocation', 'de_final_data') \
    .config('parentProject', 'razor-project-xxxxxxx') \
    .config("google.cloud.auth.service.account.enable", "true") \
    .config("credentialsFile", "google_credentials.json") \
    .config("GcpJsonKeyFile", "google_credentials.json") \
    .config("spark.driver.memory", "4g") \
    .config("spark.executor.memory", "2g") \
    .config("spark.memory.offHeap.enabled", True) \
    .config("spark.memory.offHeap.size", "5g") \
    .config('google.cloud.auth.service.account.json.keyfile', "google_credentials.json") \
    .config("fs.gs.project.id", "razor-project-xxxxxxx") \
    .config("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem") \
    .config("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS") \
    .getOrCreate()

Spark BigQuery connector Automatic configuration


While creating a SparkSession, using the config spark.jars.packages as
com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.23.2

spark = SparkSession.builder.master('local') \
    .appName('bq') \
    .config("spark.jars.packages", "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.23.2") \
    .getOrCreate()

automatically downloads the required dependency jars and configures the connector, removing the need to manage this dependency. More details available here

Spark Cloud Storage connector


Link to Slack Thread

Has anyone figured out how to read from the GCP data lake instead of downloading all the taxi data again?
There’s a few extra steps to go into reading from GCS with PySpark
1.) IMPORTANT: Download the Cloud Storage connector for Hadoop here:
https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage#clusters

As the name implies, this .jar file is what essentially connects PySpark with your GCS

2.) Move the .jar file to your Spark file directory. I installed Spark using homebrew on my MacOS
machine and I had to create a /jars directory under "/opt/homebrew/Cellar/apache-spark/3.2.1/
(where my spark dir is located)

3.) In your Python script, there are a few extra classes you’ll have to import:

import pyspark
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf
from pyspark.context import SparkContext

4.) You must set up your configurations before building your SparkSession. Here’s my code
snippet:

conf = SparkConf() \
    .setMaster('local[*]') \
    .setAppName('test') \
    .set("spark.jars", "/opt/homebrew/Cellar/apache-spark/3.2.1/jars/gcs-connector-hadoop3-latest.jar") \
    .set("spark.hadoop.google.cloud.auth.service.account.enable", "true") \
    .set("spark.hadoop.google.cloud.auth.service.account.json.keyfile", "path/to/google_credentials.json")

sc = SparkContext(conf=conf)

sc._jsc.hadoopConfiguration().set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
sc._jsc.hadoopConfiguration().set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
sc._jsc.hadoopConfiguration().set("fs.gs.auth.service.account.json.keyfile", "path/to/google_credentials.json")
sc._jsc.hadoopConfiguration().set("fs.gs.auth.service.account.enable", "true")

5.) Once you run that, build your SparkSession with the new parameters we’d just instantiated in
the previous step:

spark = SparkSession.builder \
.config(conf=sc.getConf()) \
.getOrCreate()

6.) Finally, you’re able to read your files straight from GCS!

df_green = spark.read.parquet("gs://{BUCKET}/green/202*/")

How can I read a small number of rows from the parquet file
directly?
from pyarrow.parquet import ParquetFile
pf = ParquetFile('fhvhv_tripdata_2021-01.parquet')
#pyarrow builds tables, not dataframes
tbl_small = next(pf.iter_batches(batch_size = 1000))
#this function converts the table to a dataframe of manageable size
df = tbl_small.to_pandas()

Alternatively without PyArrow:

df = spark.read.parquet('fhvhv_tripdata_2021-01.parquet')
df1 = df.sort('DOLocationID').limit(1000)
pdf = df1.select("*").toPandas()

DataType error when creating Spark DataFrame with a specified schema?
Probably you’ll encounter this if you followed the video ‘5.3.1 - First Look at Spark/PySpark’ and
used the parquet file from the TLC website (csv was used in the video).

When defining the schema, the PULocation and DOLocationID are defined as IntegerType. This
will cause an error because the Parquet file is INT64 and you’ll get an error like:
Parquet column cannot be converted in file [...] Column [...] Expected: int, Found:
INT64

Change the schema definition from IntegerType to LongType and it should work
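For illustration, a hedged sketch of a schema that uses LongType for the location ID columns (only a subset of the FHV columns is shown, column names may differ slightly by dataset, and an existing SparkSession named `spark` is assumed):

from pyspark.sql import types

fhv_schema = types.StructType([
    types.StructField("dispatching_base_num", types.StringType(), True),
    types.StructField("pickup_datetime", types.TimestampType(), True),
    types.StructField("dropOff_datetime", types.TimestampType(), True),
    types.StructField("PULocationID", types.LongType(), True),  # LongType matches the parquet INT64
    types.StructField("DOLocationID", types.LongType(), True),
])

df = spark.read.schema(fhv_schema).parquet("fhv_tripdata_2019-10.parquet")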

ModuleNotFoundError: No module named 'py4j'


This error was resolved by adding:
export PYTHONPATH=${SPARK_HOME}/python/:$(echo
${SPARK_HOME}/python/lib/py4j-*-src.zip):${PYTHONPATH}

To bashrc.

ModuleNotFoundError: No module named 'py4j' (solve with latest version)
If the fix above does not work, download the latest available py4j version with
conda install -c conda-forge py4j
Take note of the latest version number on the website so you can replace it appropriately below.
Now add
export PYTHONPATH="${SPARK_HOME}/python/:$PYTHONPATH"
export
PYTHONPATH="${SPARK_HOME}/python/lib/py4j-0.10.9.7-src.zip:$PYTHONPATH"
in your .bashrc file.

AttributeError: 'DataFrame' object has no attribute 'iteritems'

This error comes up in the Spark video 5.3.1 - First Look at Spark/PySpark, because at the time the video was created, 2021 data was the most recent and came as csv files, but now it is parquet.
So when you run the command spark.createDataFrame(df1_pandas).show(), you get the AttributeError. This is caused by pandas version 2.0.0, which seems incompatible with Spark 3.3.2, so to fix it you have to downgrade pandas to 1.5.3 using the command pip install -U pandas==1.5.3

Another option is adding the following after importing pandas, if one does not want to
downgrade pandas version (source) :
pd.DataFrame.iteritems = pd.DataFrame.items

Spark Standalone Mode on Windows


- Open a CMD terminal in administrator mode
- cd %SPARK_HOME%
- Start a master node: bin\spark-class
org.apache.spark.deploy.master.Master
- Start a worker node: bin\spark-class
org.apache.spark.deploy.worker.Worker
spark://<master_ip>:<port> --host <IP_ADDR>
- spark://<master_ip>:<port>: copy the address from the previous
command, in my case it was spark://localhost:7077
- Use --host <IP_ADDR> if you want to run the worker on a different machine.
For now leave it empty.
- Now you can access Spark UI through localhost:8080

Homework for Week 5:


Do not refer to the homework file located under week_5_batch_processing. The correct file is
located under cohorts/2023/week_5_batch_processing/.
https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/cohorts/2023/week_5
_batch_processing/homework.md

The homework says we will be loading data from June 2021 HVFHV Data.
This is very basic but it will save some time.

Export PYTHONPATH command in linux is temporary

You can either type the export command every time you run a new session, add it to your .bashrc (which you can find in your home directory), or run this command at the beginning of your jupyter notebook:

import findspark
findspark.init()
Compressed file ended before the end-of-stream marker was
reached
I solved this issue: unzip the file with:

!gzip -d fhvhv_tripdata_2021-01.csv.gz

before creating head.csv

Compression Error: zcat output is gibberish, seems like still compressed

In the code-along from Video 5.3.3, Alexey downloads the CSV files from the NYC TLC website and gzips them in the bash script. If we now (2023) follow along but download the data from the GH course repo, it will already be zipped as csv.gz files. Therefore we zip it again if we follow the code from the video exactly. This then leads to a gibberish outcome when we then try to cat the contents or count the lines with zcat, because the file is zipped twice and zcat only unzips it once.

✅solution: do not gzip the files downloaded from the course repo. Just wget them and save
them as they are as csv.gz files. Then the zcat command and the showSchema command will
also work

URL="${URL_PREFIX}/${TAXI_TYPE}/${TAXI_TYPE}_tripdata_${YEAR}-${FMONTH}.csv.gz"
LOCAL_PREFIX="data/raw/${TAXI_TYPE}/${YEAR}/${FMONTH}"
LOCAL_FILE="${TAXI_TYPE}_tripdata_${YEAR}_${FMONTH}.csv.gz"
LOCAL_PATH="${LOCAL_PREFIX}/${LOCAL_FILE}"

echo "downloading ${URL} to ${LOCAL_PATH}"


mkdir -p ${LOCAL_PREFIX}
wget ${URL} -O ${LOCAL_PATH}

echo "compressing ${LOCAL_PATH}"


# gzip ${LOCAL_PATH} <- uncomment this line

PicklingError: Could not serialise object: IndexError: tuple index out of range.
Occurred while running: spark.createDataFrame(df_pandas).show()
This error is usually due to the Python version: as of 2 March 2023, Spark doesn't support Python 3.11. Try creating a new env with Python 3.8 and then run the command again.

Connecting from local Spark to GCS - Spark does not find my google credentials as shown in the video?
Make sure you have your credentials of your GCP in your VM under the location defined in the
script.

Spark docker-compose setup


To run spark in a docker setup:
1. Build the bitnami spark docker image
a. clone the bitnami repo using the command
git clone https://github.com/bitnami/containers.git
(tested on commit 9cef8b892d29c04f8a271a644341c8222790c992)
b. edit file `bitnami/spark/3.3/debian-11/Dockerfile` and update java and spark version as
following
"python-3.10.10-2-linux-${OS_ARCH}-debian-11" \
"java-17.0.5-8-3-linux-${OS_ARCH}-debian-11" \

reference: https://github.com/bitnami/containers/issues/13409
c. build docker image by navigating to above directory and running docker build
command
navigate cd bitnami/spark/3.3/debian-11/
build command docker build -t spark:3.3-java-17 .
2. run docker compose
using following file
```yaml docker-compose.yml
version: '2'

services:
spark:
image: spark:3.3-java-17
environment:
- SPARK_MODE=master
- SPARK_RPC_AUTHENTICATION_ENABLED=no
- SPARK_RPC_ENCRYPTION_ENABLED=no
- SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
- SPARK_SSL_ENABLED=no
volumes:
- "./:/home/jovyan/work:rw"
ports:
- '8080:8080'
- '7077:7077'
spark-worker:
image: spark:3.3-java-17
environment:
- SPARK_MODE=worker
- SPARK_MASTER_URL=spark://spark:7077
- SPARK_WORKER_MEMORY=1G
- SPARK_WORKER_CORES=1
- SPARK_RPC_AUTHENTICATION_ENABLED=no
- SPARK_RPC_ENCRYPTION_ENABLED=no
- SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
- SPARK_SSL_ENABLED=no
volumes:
- "./:/home/jovyan/work:rw"
ports:
- '8081:8081'
spark-nb:
image: jupyter/pyspark-notebook:java-17.0.5
environment:
- SPARK_MASTER_URL=spark://spark:7077
volumes:
- "./:/home/jovyan/work:rw"

ports:
- '8888:8888'
- '4040:4040'
```
run command to deploy docker compose
docker-compose up
Access jupyter notebook using link logged in docker compose logs
Spark master url is spark://spark:7077

How do you read data stored in gcs on pandas with your local
computer?

To do this:
pip install gcsfs
Thereafter copy the URI path to the file and use:
df = pandas.read_csv("gs://path")

TypeError when using spark.createDataFrame function on a pandas df
Error:
spark.createDataFrame(df_pandas).schema
TypeError: field Affiliated_base_number: Can not merge type <class
'pyspark.sql.types.StringType'> and <class 'pyspark.sql.types.DoubleType'>

Solution:

Affiliated_base_number is a mix of letters and numbers (you can check this with a preview of the
table), so it cannot be set to DoubleType (only for double-precision numbers). The suitable type
would be StringType. Spark inferSchema is more accurate than Pandas infer type method in this
case. You can set it to true while reading the csv, so you don’t have to take out any data from
your dataset. Something like this can help:

df = spark.read \
.options(
header = "true", \
inferSchema = "true", \
)\
.csv('path/to/your/csv/file/')

Solution B:
It's because some rows in the affiliated_base_number are null and therefore it is assigned the
datatype String and this cannot be converted to type Double. So if you really want to convert
this pandas df to a pyspark df only take the rows from the pandas df that are not null in the
'Affiliated_base_number' column. Then you will be able to apply the pyspark function
createDataFrame.

# Only take rows that have no null values


pandas_df= pandas_df[pandas_df.notnull().all(1)]

MemoryManager: Total allocation exceeds 95.00% (1,020,054,720 bytes) of heap memory
Default executor memory is 1gb. This error appeared when working with the homework dataset.

Error: MemoryManager: Total allocation exceeds 95.00% (1,020,054,720 bytes) of heap memory
Scaling row group sizes to 95.00% for 8 writers

Solution:
Increase the memory of the executor when creating the Spark session like this:
spark = SparkSession.builder \
.master("local[*]") \
.appName('test') \
.config("spark.executor.memory", "4g") \
.config("spark.driver.memory", "4g") \
.getOrCreate()
Remember to restart the Jupyter session (ie. close the Spark session) or the config won’t take
effect.

How to run a Spark standalone cluster on Windows OS

Change the working directory to the spark directory:


if you have setup up your SPARK_HOME variable, use the following;
cd %SPARK_HOME%
if not, use the following;
cd <path to spark installation>

Creating a Local Spark Cluster

To start Spark Master:

bin\spark-class org.apache.spark.deploy.master.Master --host localhost

Starting up a cluster:

bin\spark-class org.apache.spark.deploy.worker.Worker spark://localhost:7077 --host localhost

Env variables set in ~/.bashrc are not loaded to Jupyter in VS Code

I added PYTHONPATH, JAVA_HOME and SPARK_HOME to ~/.bashrc; import pyspark worked OK in iPython in the terminal, but couldn't be found in a .ipynb opened in VS Code.

Instead of configuring paths in ~/.bashrc, I created .env file in the root of my workspace:

JAVA_HOME="${HOME}/app/java/jdk-11.0.2"
PATH="${JAVA_HOME}/bin:${PATH}"
SPARK_HOME="${HOME}/app/spark/spark-3.3.2-bin-hadoop3"
PATH="${SPARK_HOME}/bin:${PATH}"
PYTHONPATH="${SPARK_HOME}/python/:$PYTHONPATH"
PYTHONPATH="${SPARK_HOME}/python/lib/py4j-0.10.9.5-src.zip:$PYTHONPATH"
PYTHONPATH="${SPARK_HOME}/python/lib/pyspark.zip:$PYTHONPATH"
How to port forward outside VS Code

I don't use Visual Studio, so I did it the old-fashioned way: ssh -L 8888:localhost:8888 <my user>@<VM IP> (replace user and IP with the ones used by the GCP VM, e.g. ssh -L 8888:localhost:8888 <user>@<external IP>).

“wc -l” is giving a different result then shown in the video

If you are doing wc -l fhvhv_tripdata_2021-01.csv.gz with the gzip file as the file
argument, you will get a different result. Unzip the file and then do wc -l
fhvhv_tripdata_2021-01.csv to get the right results.

`spark-submit` errors
when trying to:
URL="spark://$HOSTNAME:7077"
spark-submit \
--master="{$URL}" \
06_spark_sql.py \
--input_green=data/pq/green/2021/*/ \
--input_yellow=data/pq/yellow/2021/*/ \
--output=data/report-2021
and you get errors like the following (SUMMARIZED):
WARN Utils: Your hostname, <HOSTNAME> resolves to a loopback address..
WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address Setting
default log level to "WARN".
Exception in thread "main" org.apache.spark.SparkException: Master must either be
yarn or start with spark, mesos, k8s, or local at …

Try replacing --master="{$URL}" with --master=$URL
Exception in thread "main" java.lang.UnsatisfiedLinkError:
org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/
lang/String;I)Z
If you are seeing this (or similar) error when attempting to write to parquet, it is likely an issue
with your path variables.
For Windows, create a new User Variable “HADOOP_HOME” that points to your Hadoop
directory. Then add “%HADOOP_HOME%\bin” to the PATH variable.

Additional tips can be found here:


https://stackoverflow.com/questions/41851066/exception-in-thread-main-java-lang-unsatisfiedlin
kerror-org-apache-hadoop-io

Java.io.IOException. Cannot run program


“C:\hadoop\bin\winutils.exe”. CreateProcess error=216, This
version of %1 is not compatible with the version of Windows you
are using.
Change the hadoop version to 3.0.1. Replace all the files in the local hadoop bin folder with the
files in this repo: winutils/hadoop-3.0.1/bin at master · cdarlint/winutils (github.com)
If this does not work try to change other versions found in this repository.
For more information please see this link: This version of %1 is not compatible with the version
of Windows you're running · Issue #20 · cdarlint/winutils (github.com)

ERROR: (gcloud.dataproc.jobs.submit.pyspark) The required property [project] is not currently set. It can be set on a per-command basis by re-running your command with the [--project] flag.
Fix is to set the flag like the error states. Get your project ID from your dashboard and set it like
so:
gcloud dataproc jobs submit pyspark \
--cluster=my_cluster \
--region=us-central1 \
--project=my-dtc-project-1010101 \
gs://my-dtc-bucket-id/code/06_spark_sql.py
-- \

Run Local Cluster Spark in Windows 10 with CMD

1. Go to %SPARK_HOME%\bin
2. Run spark-class org.apache.spark.deploy.master.Master to run the
master. This will give you a URL of the form spark://ip:port
3. Run spark-class org.apache.spark.deploy.worker.Worker
spark://ip:port to run the worker. Make sure you use the URL you obtained in
step 2.
4. Create a new Jupyter notebook:
spark = SparkSession.builder \
.master("spark://192.168.0.38:7077") \
.appName('test') \
.getOrCreate()
5. Check on Spark UI the master, worker and app.

ServiceException: 401 Anonymous caller does not have storage.objects.list access to the Google Cloud Storage bucket. Permission 'storage.objects.list' denied on resource (or it may not exist).

This occurs because you are not logged in ("gcloud auth login") and maybe the project id is not set. Type in a terminal:

gcloud auth login

This will open a tab in the browser; accept the terms, and after that close the tab if you want. Then set the project id like:
gcloud config set project <YOUR PROJECT_ID>

Then you can run the command to upload the pq dir to a GCS Bucket:

gsutil -m cp -r pq/ <YOUR URI from gsutil>/pq


py4j.protocol.Py4JJavaError GCP

When submitting a job, it might throw an error about Java in the log panel within Dataproc. I had changed the Versioning Control (the cluster image version) when I created the cluster, so I deleted the cluster and created a new one: instead of choosing Debian-Hadoop-Spark, I switched to Ubuntu 20.02-Hadoop3.3-Spark3.3. The main reason to choose this is that I have the same Ubuntu version on my laptop. I tried to find documentation to support this but unfortunately I couldn't; nevertheless, it worked for me.

Week 6
Could not start docker image “control-center” from the
docker-compose.yaml file.
On Mac OSX 12.2.1 (Monterey) I could not start the kafka control center. I opened Docker Desktop and saw containers still running from week 4, which I did not see when I typed "docker ps". I deleted them in Docker Desktop and then had no problem starting up the kafka environment.

Module “kafka” not found when trying to run producer.py


Solution from Alexey: create a virtual environment and run requirements.txt and the python files
in that environment.

To create a virtual env and install packages (run only once)

python -m venv env


source env/bin/activate
pip install -r ../requirements.txt
To activate it (you'll need to run it every time you need the virtual env):

source env/bin/activate

To deactivate it:

deactivate
This works on MacOS, Linux and Windows - but for Windows the path is slightly different
(it's env/Scripts/activate)

Also the virtual environment should be created only to run the python file. Docker images should
first all be up and running.

Error importing cimpl dll when running avro examples


ImportError: DLL load failed while importing cimpl: The specified module could not be
found

... you may have to load librdkafka-5d2e2910.dll in the code. Add this before importing
avro:

from ctypes import CDLL


CDLL("C:\\Users\\YOUR_USER_NAME\\anaconda3\\envs\\dtcde\\Lib\\site-packages\\confluent_kafka.libs\\librdkafka-5d2e2910.dll")

It seems that the error may occur depending on the OS and python version installed.

ALTERNATIVE:

ImportError: DLL load failed while importing cimpl

✅SOLUTION: $env:CONDA_DLL_SEARCH_MODIFICATION_ENABLE=1 in Powershell.


You need to set this DLL manually in Conda Env.

Source: https://githubhot.com/repo/confluentinc/confluent-kafka-python/issues/1186?page=2

ModuleNotFoundError: No module named 'avro'


✅SOLUTION: pip install confluent-kafka[avro].
For some reason, Conda also doesn't include this when installing confluent-kafka via pip.

More sources on Anaconda and confluent-kafka issues:

● https://github.com/confluentinc/confluent-kafka-python/issues/590

● https://github.com/confluentinc/confluent-kafka-python/issues/1221
● https://stackoverflow.com/questions/69085157/cannot-import-producer-from-confluent-ka
fka

Error while running python3 stream.py worker


If you get an error while running the command python3 stream.py worker

Run pip uninstall kafka-python

Then run pip install kafka-python==1.4.6

Negsignal:SIGKILL while converting dta files to parquet format

Got this error because the Docker container's memory was exhausted. The .dta file was up to
800MB, but my Docker container did not have enough memory to handle it.

The solution was to load the file in chunks with pandas and then create multiple parquet files for each
.dta file I was processing (see the sketch below). This worked smoothly and the issue was resolved.
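
A minimal sketch of that chunked conversion, assuming pandas with pyarrow installed; the file name, chunk size and output pattern are placeholders:

import pandas as pd

# Read the Stata (.dta) file in chunks so the whole ~800MB file never sits in memory at once
reader = pd.read_stata("trips.dta", chunksize=100_000)

for i, chunk in enumerate(reader):
    # Each chunk becomes its own parquet file (requires pyarrow or fastparquet)
    chunk.to_parquet(f"trips_part_{i:03d}.parquet", index=False)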

data-engineering-zoomcamp/week_6_stream_processing/python/
resources/rides.csv is missing
Copy the file found in the Java example:
data-engineering-zoomcamp/week_6_stream_processing/java/kafka_examples/src/main/resour
ces/rides.csv

Kafka Python videos have low audio and are hard to follow


Tip: since the videos have low audio, I downloaded them and played them in VLC with the
volume boosted to the maximum 200% of the original, which made the audio quite good. Alternatively, use the
auto-generated captions on YouTube directly.

Kafka Python Videos - Rides.csv


There is no clear explanation of the rides.csv data that the producer.py python programs use.
You can find that here
https://raw.githubusercontent.com/DataTalksClub/data-engineering-zoomcamp/2bd33e8990618
1e424f7b12a299b70b19b7cfcd5/week_6_stream_processing/python/resources/rides.csv.

kafka.errors.NoBrokersAvailable: NoBrokersAvailable

If you see this error, it most likely means your Kafka broker Docker container is not running.
Use docker ps to confirm.
Then, in the folder with the docker-compose yaml file, run docker compose up -d to start all the
instances (see the snippet below).
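
For example (container names depend on your docker-compose file, so treat these as assumptions):

docker ps             # confirm the broker (and zookeeper/schema-registry) containers are up
docker compose up -d  # start everything defined in docker-compose.yaml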

Kafka homework Q3: there are options that support the scaling concept more than the others

Ankush said we can focus on the horizontal scaling option:
"think of scaling in terms of scaling from the consumer end, or consuming messages via horizontal
scaling"

How to fix docker compose error: Error response from daemon:
pull access denied for spark-3.3.1, repository does not exist or
may require 'docker login': denied: requested access to the
resource is denied

If you get this error, it means you have not built your Spark and Jupyter images; these images
aren't readily available on Docker Hub.
In the spark folder, run ./build.sh from a bash CLI to build all the images before running
docker compose.

PipeRider workshop

Since most of us ran week 4 dbt in dbt Cloud, how is it possible to
run PipeRider with dbt locally?

The workshop tutorial can be used as a starting point, but instead of DuckDB as the data source,
the dbt-bigquery package must be installed and everything run on dbt-core (see the links below and the hedged profiles.yml sketch after them):
https://docs.getdbt.com/reference/warehouse-setups/bigquery-setup
https://docs.getdbt.com/docs/get-started/getting-started-dbt-core
Afterwards, the week 4 project repository can be used, and PipeRider can be added after
installing it for BigQuery.
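
For illustration only, a dbt-core profiles.yml for BigQuery might look roughly like this; the profile name, project, dataset and keyfile path are all assumptions:

taxi_rides_ny:                 # must match the profile name in dbt_project.yml
  target: dev
  outputs:
    dev:
      type: bigquery
      method: service-account
      project: my-gcp-project-id          # assumption
      dataset: dbt_dev                    # assumption
      keyfile: /home/user/.google/credentials/google_credentials.json   # assumption
      threads: 4
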
I already have the .parquet files from Week 4 (Question 8) on my
Google Cloud Storage, how can I use that in conjunction with my
local dbt + DuckDB + PipeRider setup ?

Step 1.:
Start off by copying the files from your GCS bucket to a local directory on your computer with
gsutil:

gsutil cp -r "gs://BUCKET_NAME/PATH/TO/FOLDER/WITH/DATASETS" .

For instance:
gsutil cp -r "gs://iobruno_datalake_raw/dtc_ny_taxi_tripdata" .

datasets/
├── fhv
│ ├── fhv_tripdata_2019-01.parquet.snappy
│ ├── fhv_tripdata_2019-02.parquet.snappy
│ ├── fhv_tripdata_2019-03.parquet.snappy
│ ├── fhv_tripdata_2019-04.parquet.snappy
│ ├── fhv_tripdata_2019-05.parquet.snappy
│ ├── fhv_tripdata_2019-06.parquet.snappy
│ ├── fhv_tripdata_2019-07.parquet.snappy
│ ├── fhv_tripdata_2019-08.parquet.snappy
│ ├── fhv_tripdata_2019-09.parquet.snappy
│ ├── fhv_tripdata_2019-10.parquet.snappy
│ ├── fhv_tripdata_2019-11.parquet.snappy
│ └── fhv_tripdata_2019-12.parquet.snappy
├── green
│ ├── green_tripdata_2019-01.parquet.snappy
│ ├── green_tripdata_2019-02.parquet.snappy
│ ├── green_tripdata_2019-03.parquet.snappy
│ ├── green_tripdata_2019-04.parquet.snappy
│ ├── green_tripdata_2019-05.parquet.snappy
│ ├── green_tripdata_2019-06.parquet.snappy
│ ├── green_tripdata_2019-07.parquet.snappy
│ ├── green_tripdata_2019-08.parquet.snappy
│ ├── green_tripdata_2019-09.parquet.snappy
│ ├── green_tripdata_2019-10.parquet.snappy
│ ├── green_tripdata_2019-11.parquet.snappy
│ ├── green_tripdata_2019-12.parquet.snappy
│ ├── green_tripdata_2020-01.parquet.snappy
│ ├── green_tripdata_2020-02.parquet.snappy
│ ├── green_tripdata_2020-03.parquet.snappy
│ ├── green_tripdata_2020-04.parquet.snappy
│ ├── green_tripdata_2020-05.parquet.snappy
│ ├── green_tripdata_2020-06.parquet.snappy
│ ├── green_tripdata_2020-07.parquet.snappy
│ ├── green_tripdata_2020-08.parquet.snappy
│ ├── green_tripdata_2020-09.parquet.snappy
│ ├── green_tripdata_2020-10.parquet.snappy
│ ├── green_tripdata_2020-11.parquet.snappy
│ └── green_tripdata_2020-12.parquet.snappy
├── yellow
│ ├── yellow_tripdata_2019-01.parquet.snappy
│ ├── yellow_tripdata_2019-02.parquet.snappy
│ ├── yellow_tripdata_2019-03.parquet.snappy
│ ├── yellow_tripdata_2019-04.parquet.snappy
│ ├── yellow_tripdata_2019-05.parquet.snappy
│ ├── yellow_tripdata_2019-06.parquet.snappy
│ ├── yellow_tripdata_2019-07.parquet.snappy
│ ├── yellow_tripdata_2019-08.parquet.snappy
│ ├── yellow_tripdata_2019-09.parquet.snappy
│ ├── yellow_tripdata_2019-10.parquet.snappy
│ ├── yellow_tripdata_2019-11.parquet.snappy
│ ├── yellow_tripdata_2019-12.parquet.snappy
│ ├── yellow_tripdata_2020-01.parquet.snappy
│ ├── yellow_tripdata_2020-02.parquet.snappy
│ ├── yellow_tripdata_2020-03.parquet.snappy
│ ├── yellow_tripdata_2020-04.parquet.snappy
│ ├── yellow_tripdata_2020-05.parquet.snappy
│ ├── yellow_tripdata_2020-06.parquet.snappy
│ ├── yellow_tripdata_2020-07.parquet.snappy
│ ├── yellow_tripdata_2020-08.parquet.snappy
│ ├── yellow_tripdata_2020-09.parquet.snappy
│ ├── yellow_tripdata_2020-10.parquet.snappy
│ ├── yellow_tripdata_2020-11.parquet.snappy
│ └── yellow_tripdata_2020-12.parquet.snappy

Step 2.:

Next, in your schema.yml, where you define your data sources, set up a name/alias for the source and add a meta block as shown below:

meta:
  external_location:
    "read_parquet('/path/to/datasets/where/you/downloaded/from/gcs/{name}/*')"

Note the {name} expression above. It will be replaced by the name of each table defined in the
tables block below. That means that when you refer to {{ source('parquet', 'yellow') }}, for example,
it will attempt to fetch the parquet files for the yellow tripdata dataset from:

/path/to/datasets/where/you/downloaded/from/gcs/yellow/

version: 2

sources:
  - name: parquet
    meta:
      external_location:
        "read_parquet('/Users/iobruno/Vault/datasets/{name}/*.parquet.snappy')"
    tables:
      - name: fhv
      - name: green
      - name: yellow
      - name: zone_lookup
With that, you can start fresh with DuckDB, with no tables loaded on it whatsoever; a model can simply select straight from the source (see the sketch below).
Pretty neat, huh!?
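
For instance, a hypothetical staging model can then read directly from the parquet source (the model name is illustrative only):

-- models/staging/stg_yellow_tripdata.sql (hypothetical)
select *
from {{ source('parquet', 'yellow') }}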

The .html report is not loading correctly on my VM.


Check out this extension for VS code. It should correct the issues.
https://marketplace.visualstudio.com/items?itemName=ritwickdey.LiveServer

Error: Unable to initialize main class org.example.JsonProducer -
Caused by: java.lang.NoClassDefFoundError: CsvException

Solution: open the folder as a Java project in Visual Studio Code, and the dependencies etc. seem
to be loaded correctly.
Project

How is my capstone project going to be evaluated?

● Each submitted project will be evaluated by 3 (three) randomly assigned students that
have also submitted the project.

● You will also be responsible for grading the projects of 3 fellow students yourself.
Please be aware that not complying with this rule means you will not receive the
certificate at the end of the course.

● The final grade you get will be the median score of the grades you get from the peer
reviewers.

● And of course, the peer review criteria for evaluating or being evaluated must follow the
guidelines defined here.

Does anyone know nice and relatively large datasets?


See a list of datasets here:
https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/week_7_project/datas
ets.md

How do I run Python in a startup script?


You need to redefine the Python environment variable to that of your user account.
Spark Streaming - How do I read from multiple topics in the same
Spark Session

Initiate a Spark session, create one streaming query per topic, start them all, and then wait on any of them. The Kafka options and console sinks below are placeholders; substitute your own servers, topics and sinks:

from pyspark.sql import SparkSession

spark = (SparkSession
    .builder
    .appName(app_name)
    .master(master)
    .getOrCreate())

spark.streams.resetTerminated()

# One streaming DataFrame per topic (bootstrap servers and topic names are placeholders)
df1 = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "topic_1")
    .load())

df2 = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "topic_2")
    .load())

df3 = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "topic_3")
    .load())

# Each writeStream sink below is a placeholder; use whatever sink and checkpoint you need
query1 = df1.writeStream.format("console").start()
query2 = df2.writeStream.format("console").start()
query3 = df3.writeStream.format("console").start()

# Waits until any one of the queries receives a kill signal or fails with an error.
spark.streams.awaitAnyTermination()

# By contrast, query3.awaitTermination() is a blocking call on a single query.
# That works well when we are reading from only one topic.

Orchestrating dbt with Airflow


The trial dbt Cloud account provides access to the dbt Cloud API. The job still needs to be created manually in dbt Cloud.
Airflow can then run the job using a Python operator that calls the API; you will need to provide the API key,
job id, etc. (be careful not to commit them to GitHub). A hedged sketch follows the links below.

Detailed explanation here: https://docs.getdbt.com/blog/dbt-airflow-spiritual-alignment

Source code example here:


https://github.com/sungchun12/airflow-toolkit/blob/95d40ac76122de337e1b1cdc8eed35ba1c305
1ed/dags/examples/dbt_cloud_example.py
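
A rough sketch of that pattern is shown here. The account id, job id and API endpoint are assumptions, so check the dbt Cloud API docs, and pull the token from a secrets backend or environment variable rather than hard-coding it:

import os
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator

ACCOUNT_ID = "12345"   # assumption: your dbt Cloud account id
JOB_ID = "67890"       # assumption: id of the job you created manually in dbt Cloud


def trigger_dbt_cloud_job():
    # assumption: dbt Cloud API v2 "trigger job run" endpoint
    url = f"https://cloud.getdbt.com/api/v2/accounts/{ACCOUNT_ID}/jobs/{JOB_ID}/run/"
    headers = {"Authorization": f"Token {os.environ['DBT_CLOUD_API_KEY']}"}
    resp = requests.post(url, headers=headers, json={"cause": "Triggered from Airflow"})
    resp.raise_for_status()


with DAG(
    dag_id="dbt_cloud_job",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    PythonOperator(
        task_id="trigger_dbt_cloud_job",
        python_callable=trigger_dbt_cloud_job,
    )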

Orchestrating DataProc with Airflow


https://airflow.apache.org/docs/apache-airflow-providers-google/stable/_api/airflow/providers/go
ogle/cloud/operators/dataproc/index.html

https://airflow.apache.org/docs/apache-airflow-providers-google/stable/_modules/airflow/provide
rs/google/cloud/operators/dataproc.html

Give the following roles to your service account:

- DataProc Administrator

- Service Account User (explanation here)


Use DataprocCreateClusterOperator, DataprocSubmitPySparkJobOperator and
DataprocDeleteClusterOperator (a sketch of the submit operator is shown below).

When using DataprocSubmitPySparkJobOperator, do not forget to add:

dataproc_jars = ["gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.24.0.jar"]

because Dataproc does not already ship with the BigQuery connector.
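
As an illustration only (the cluster name, region, project id and script path are assumptions, and parameter names can differ between provider versions, so double-check the operator docs):

from airflow.providers.google.cloud.operators.dataproc import (
    DataprocSubmitPySparkJobOperator,
)

submit_pyspark_job = DataprocSubmitPySparkJobOperator(
    task_id="submit_pyspark_job",
    main="gs://my-dtc-bucket-id/code/06_spark_sql.py",  # assumption: your script in GCS
    cluster_name="my-cluster",                          # assumption
    region="us-central1",                               # assumption
    project_id="my-dtc-project-1010101",                # assumption
    dataproc_jars=[
        "gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.24.0.jar"
    ],
)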

Project evaluation - Reproducibility


The slack thread :
https://datatalks-club.slack.com/archives/C01FABYF2RG/p1677678161866999

The question: sometimes, even if you put plenty of effort into documenting every single step,
we can't be sure the person doing the peer review will be able to follow it, so how will this
criterion be evaluated?

Alex clarifies: “Ideally yes, you should try to re-run everything. But I understand that not
everyone has time to do it, so if you check the code by looking at it and try to spot errors, places
with missing instructions and so on - then it's already great”

Orchestrating dbt - cloud with Prefect


dbt Cloud is not compatible with Prefect in the free version. The other option is to use dbt-core
instead, as this can work with Prefect.

Spark docker - `ModuleNotFoundError: No module named 'py4j'`
while executing `import pyspark`

You can get the version of py4j from inside docker using this command:

docker exec -it --user airflow airflow-airflow-scheduler-1 bash -c "ls /opt/spark/python/lib"


Why is there a 'Learning in public links' block in the project
evaluation form?

Add here any links to posts where you share thoughts and insights you got during your
assessment of other participants' projects.

2022 - Week 2 (Airflow)

Airflow - I’ve got this error:


google.auth.exceptions.DefaultCredentialsError: File
/.google/credentials/google_credentials.json was not found.

Change the path of the google_credentials mounting in the docker-compose file to an absolute
one. For example in Ubuntu,

Instead of this: /.google/credentials/:/.google/credentials:ro

Use this: /home/<username>/.google/credentials/:/.google/credentials
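
In docker-compose syntax, that volume mapping would look roughly like this (the username is a placeholder):

volumes:
  - /home/<username>/.google/credentials/:/.google/credentials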

I got the error below when I was running download_dataset_task:

*** Log file does not exist:


/opt/airflow/logs/taxi_zone_dag/download_dataset_task/2022-02-02T09:39:17.124318+00:00/6.log

*** Fetching from:


http://:8793/log/taxi_zone_dag/download_dataset_task/2022-02-02T09:39:17.124318+00:00/6.log

*** Failed to fetch log file from worker. Request URL missing either an 'http://' or
'https://' protocol.
✅I resolved it by running:
docker-compose down -v --rmi all --remove-orphans

After that, I removed the following line from my code:

from datetime import time

And then, restart docker-compose again:

docker-compose up

Installing additional Python libraries in Airflow

In the docker-compose.yaml file, find the _PIP_ADDITIONAL_REQUIREMENTS setting in this section:

build:
  context: .
  dockerfile: ./Dockerfile
environment:
  _PIP_ADDITIONAL_REQUIREMENTS: ${_PIP_ADDITIONAL_REQUIREMENTS:-}

E.g.

_PIP_ADDITIONAL_REQUIREMENTS: ${_PIP_ADDITIONAL_REQUIREMENTS:- pyspark}

See documentation:
https://airflow.apache.org/docs/docker-stack/entrypoint.html#installing-additional-requireme
nts

Airflow won’t update the DAG / It keeps returning errors even


though I supposedly installed additional Python libraries

Make sure that you update your Airflow image to a more recent one. Inside your Dockerfile,
modify the FROM apache/airflow:2.2.3 to any of the more recent images available in the
official Airflow Docker repository, available at https://hub.docker.com/r/apache/airflow/tags
Airflow web login issue on docker:

I was unable to log in to the Airflow web UI on my Linux instance with the web password until I modified the
config in docker-compose.yaml from:

_AIRFLOW_WWW_USER_CREATE: 'true'
_AIRFLOW_WWW_USER_USERNAME: ${_AIRFLOW_WWW_USER_USERNAME:-airflow}
_AIRFLOW_WWW_USER_PASSWORD: ${_AIRFLOW_WWW_USER_PASSWORD:-airflow}

to :

_AIRFLOW_WWW_USER_CREATE=True
_AIRFLOW_WWW_USER_USERNAME=airflow
_AIRFLOW_WWW_USER_PASSWORD=airflow

google.api_core.exceptions.NotFound:
If you are stuck with this problem, check this: https://github.com/mozilla/bigquery-etl/issues/1409

P.S. These entities must be created by your Terraform main.tf file


GCP credentials json file cannot be found when running a DAG
I got this error when running a DAG which needs to authenticate connection to the GCP:

File "/home/airflow/.local/lib/python3.7/site-packages/google/auth/_default.py", line 108,


in load_credentials_from_filechown
"File {} was not found.".format(filename)
google.auth.exceptions.DefaultCredentialsError: File
/.google/credentials/google_credentials.json was not found.

Make sure first that you put the credentials file into the directory. Then proceed.

How did I solve it? Those are the changes I made on the docker-compose.yaml:

volumes:
  - ~/.google/credentials/<credentials_file_name>.json:/.google/credentials/google_credentials.json:ro

environment:
  AIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT: 'google-cloud-platform://?key_path=%2F.google%2Fcredentials%2Fgoogle_credentials.json&scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcloud-platform&project=<project_id>&num_retries=5'

At this stage, you might get another error associated with directory permissions. Just grant permission via this command (run it in the terminal):

chmod 774 ~/.google/credentials/<filename.json>


Airflow DAG is failing with "CsvOptions can only be specified if
storage format is CSV." error while trying to create external table
in BigQuery

Case:
Airflow DAG tries to create external table in BigQuery, which refers to Google cloud storage
parquet file, and fails with the error: "CsvOptions can only be specified if storage format is CSV."

Failing part of the code of data_ingestion_gcs_dag.py:

bigquery_external_table_task = BigQueryCreateExternalTableOperator(
task_id="bigquery_external_table_task",
table_resource={
"tableReference": {
"projectId": PROJECT_ID,
"datasetId": BIGQUERY_DATASET,
"tableId": "external_table",
},
"externalDataConfiguration": {
"sourceFormat": "PARQUET",
"sourceUris": [f"gs://{BUCKET}/raw/{parquet_file}"],
},
},
)

Solution:
Add an underscore sign in the names of the two parameters of
BigQueryCreateExternalTableOperator:
"source_format"
"source_uris"

bigquery_external_table_task = BigQueryCreateExternalTableOperator(
task_id="bigquery_external_table_task",
table_resource={
"tableReference": {
"projectId": PROJECT_ID,
"datasetId": BIGQUERY_DATASET,
"tableId": "external_table",
},
"externalDataConfiguration": {
"source_format": "PARQUET",
"source_uris": [f"gs://{BUCKET}/raw/{parquet_file}"],
"autodetect": True,
},
},
)

The full class description with the correct names can be found in the following source file:
https://github.com/apache/airflow/blob/main/airflow/providers/google/cloud/operators/bigquery.p
y

Postgres is failing with 'could not open relation mapping file


"global/pg_filenode.map" '

See this link

Assigning the unprivileged Postgres user as the owner of the Postgres data directory

sudo chown -R postgres /usr/local/var/postgres

Memory is not enough to run airflow init


✅ Solution:
Make a .wslconfig file at the HOME directory and add the following configurations
```
# Settings apply across all Linux distros running on WSL 2
[wsl2]

# Limits VM memory to use no more than 6 GB; this can be set as whole
# numbers using GB or MB
memory=6GB

```

Running the ingestion file using python

File "/usr/lib/python3/dist-packages/psycopg2/__init__.py", line 122, in


connect
conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: could not translate host name "pg-database" to
address: Name or service not known

This happens because the host name "pg-database" only resolves inside the Docker network, not from
your local machine. In this case, point the script to the localhost port instead.
Command:
python3 ingest_data.py \
--user=root \
--password=root \
--host=localhost \
--port=5432 \
--db=ny_taxi \
--table_name=green_taxi_trips \
--url="https://github.com/DataTalksClub/nyc-tlc-data/releasexecute
binaryes/download/green/green_tripdata_2019-01.csv.gz"

Connecting to postgres in docker from local machine (windows)


using pgcli via port 5432

Returns: Password Authentication Failed for user <username>

Cause: I had pgAdmin installed on my Windows PC and it was already using port 5432.
Fix: delete the container and rerun it, this time changing the port binding from 5432:5432 to
5433:5432. Docker then binds the container's port 5432 to your PC's port 5433, since your
PC's port 5432 is already in use (see the example below).
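
For example (the image tag, credentials and database name are assumptions based on the course setup):

docker run -it \
  -e POSTGRES_USER=root \
  -e POSTGRES_PASSWORD=root \
  -e POSTGRES_DB=ny_taxi \
  -p 5433:5432 \
  postgres:13

pgcli -h localhost -p 5433 -u root -d ny_taxi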

Blocker: DE Zoomcamp 2.2.6 at around 11:11: denied: requested


access to the resource is denied
✅ Solution:
It is assumed that you have an active Docker account. If not sign-up here.

Make sure you run the commands below before pushing the image to Docker Hub:

docker login

docker build -t DOCKER_USERNAME/IMAGE_NAME:VERSION .
(e.g. docker build -t timapple/prefect_cloud:v000 .)

docker push DOCKER_USERNAME/IMAGE_NAME:VERSION
(e.g. docker push timapple/prefect_cloud:v000)

In case the above does not work, kindly read this thread.

Blocker: NotADirectoryError: [Errno 20] Not a directory:


'/opt/prefect/flows'
Code block:

# base Docker image that we will build on
FROM prefecthq/prefect:2.7.7-python3.9

# Copy the list of env packages needed. Don't forget what's on the list
COPY docker_env_req.txt .

# Set up the env package requirements
RUN pip install -r docker_env_req.txt --trusted-host pypi.python.org --no-cache-dir

# Copy the flow code
COPY parametric_web_to_gcs.py /opt/prefect/flows
COPY data /opt/prefect/data

✅Solution:
From Jeff Hale: this looks like a Docker COPY error. You may not be able to copy a file into a directory that is
being created in the same statement, so point the COPY at an existing (or more specific) destination folder,
for example copy the local flows folder into the /opt/prefect/flows folder.

Change the line to:

COPY parametric_web_to_gcs.py /opt/prefect/flows/week_2_workflow_orchestration

I later reverted to COPY parametric_web_to_gcs.py /opt/prefect/flows but got an error:
Script at 'parametric_web_to_gcs.py' encountered an exception:
FileNotFoundError(2, 'No such file or directory')

Building and pushing the image again and then deploying made it work. The mystery of the universe.
URL:
https://datatalks-club.slack.com/archives/C01FABYF2RG/p1674964665817089?thread_ts=1674570724.323079&cid=C01FABYF2RG

Alternative solution:

Managed to make this work by adding a / to the copy lines in Dockerfile, like so:

COPY parameterized_flow.py /opt/prefect/flows/


COPY data /opt/prefect/data/

Running pip command in Dockerfile


THE ERROR:
WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None,
status=None)) after connection broken by
'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object
at 0x7fc88c50a940>: Failed to establish a new connection: [Errno -3]
Temporary failure in name resolution')': /simple/pip/

✅THE SOLUTION:
Add the nameserver line below into /etc/resolv.conf
- sudo nano /etc/resolv.conf

```
nameserver 8.8.8.8
```

Unable to find block document named zoom-gcs for block type


gcs-bucket when running $ python3 etl_web_to_gcs.py
You have not set up the GCS block yet. Follow this excellent summary by Padhila, Section: DE
Zoomcamp 2.2.3 - ETL with GCP & Prefect, up until Step 8. Note: if you've replaced your GCP
credentials, you can generate a new key via: GCP > MENU > IAM & Admin > Service Accounts >
"User" > Keys > Add Key > Create New Key > JSON

UPDATE: I noticed that this can also happen if you are logged into Prefect Cloud, even in another
terminal. Log out of Prefect Cloud with `prefect cloud logout`.

Question 4. GitHub Storage Block

pydantic.error_wrappers.ValidationError: 1 validation error for Deployment
infrastructure
  Infrastructure block must have 'run-infrastructure' capabilities. (type=value_error)

This error typically appears when the GitHub block is passed as an infrastructure block rather than
as storage. Pass it with the -sb / --storage-block option instead, in block-type/block-name (or
block-type/block-name/path) format; note that the library supporting the storage filesystem must
be installed:

prefect deployment build path/to/etl_web_to_gcs.py:etl_web_to_gcs -n "github-deployment" -sb github/github-zoomcamp
Running out of space on GCP VM.
While deploying Docker images I received a notification that my VM (GCP, Ubuntu) was out of
space. This can be confirmed in the terminal with df -h, which showed 99% of /dev/root in use.
The easiest way to fix this is via the GCP GUI at
https://console.cloud.google.com/compute/disksDetail/
where you can edit and increase the disk size without having to mount extra disks.
You will need to restart your VM, but no other changes are needed.
Database Error in rpc request (from remote system.sql) Location
europe-west3 does not support this operation. (source destination
error)

The only regions available in Europe for Data Transfer are europe-north1, europe-west2
and europe-west6 [Source].

Even if the source and destination locations are the same, BigQuery can still throw this error;
apparently it's an internal bug.
For the Americas, one of the "bug free" regions is us-east4, where the source/destination error
should not occur.
I had to re-create my BigQuery dataset, re-run the Airflow DAGs only for creating the external and
partitioned tables (I didn't have to change the location of GCS), and then dbt run worked.

Prefect ERROR: This profile is already authenticated with that key.

When running the Prefect CLI: `prefect cloud login -k $ENV_VAR`. Having
PREFECT_API_KEY set as an environment variable prevents logging in to Prefect Cloud. Open issue
on the Prefect side: https://github.com/PrefectHQ/prefect/issues/7797
Quick fix: just name your variable something other than PREFECT_API_KEY.

VS Code shortcut keys

Cmd+J - Hide/Show terminal

Homework Week 5: `**HVFHW June 2021**`


HVFHW in the question refers to the FHVHV dataset linked here:
https://github.com/DataTalksClub/nyc-tlc-data/releases/tag/fhvhv

Literally, the best resource to study Kafka: https://www.gentlydownthe.stream/
