
Assignment 1: Data Lakehouse with Snowflake

Student Name: Sadia Afrin Mallick

Student ID: 24880231

Course Name: 94693 Big Data Engineering

Course Coordinator: William So

Submission Date: 07 September, 2024


Introduction
The objective of this assignment is to analyze a comprehensive YouTube dataset using a Data
Lakehouse architecture built on Snowflake. The dataset consists of CSV and JSON files listing the
daily top trending YouTube videos from 10 regions (India, USA, Great Britain, Germany, Canada,
France, Brazil, Mexico, South Korea, and Japan). The data was obtained from the YouTube API
and made public through Kaggle, and it covers the period from August 2020 to April 2024. The
main objective of the analysis is to gain insights into YouTube trends, including metrics such as
views, likes, shares, and comments.

For this project, we ingest, clean, transform, and analyze this dataset using Snowflake's cloud
data platform. Using Snowflake's data warehousing features in conjunction with SQL queries, we
examine key trends, identify the most popular video genres, and evaluate the performance of
different content producers. Based on the trending data, the study also helps develop effective
strategies for launching a successful YouTube channel.

By the end of the analysis, we aim to provide data-driven recommendations, highlighting content
strategies that maximize reach and engagement on the platform across different countries.

Part 1: Data Ingestion

Objective:

The goal of the data ingestion process was to upload the YouTube Trending and Category
datasets into Snowflake from an Azure Blob Storage account. These datasets were utilized to
extract insights and analyze trends. Setting up Azure Blob Storage, linking it to Snowflake, and
generating external tables for the data were all part of the ingestion process.

The first two crucial steps in the ingestion process were creating a shared access signature (SAS)
token for authenticated access and configuring Azure Blob Storage to safely store the datasets.
After that, external stages were created to link Snowflake to the Azure Blob Storage account,
enabling smooth data integration between Snowflake's environment and the cloud storage.
Finally, external tables were constructed to query the raw data. The data was then transformed
and loaded into final, structured tables that could be used for analysis.

This procedure ensured that large datasets could be maintained effectively in a cloud setting
while making use of Snowflake's ability to handle both semi-structured and structured data.
Establishing this ingestion pipeline made the datasets easily accessible for the subsequent stages
of data cleansing, transformation, and analysis.

1. Download the (compressed) dataset

In the initial step, the YouTube Trending Videos dataset was downloaded from the Google Drive
link provided in the Assignment Brief. This dataset was compressed and contained data in both
CSV and JSON formats, providing a record of daily trending videos from various regions between
August 2020 and April 2024.

2. Setting Up Azure Blob Storage

The data ingestion process began with the upload of the YouTube Trending and Category
datasets to Azure Blob Storage. To facilitate this, an Azure Storage Account named 'utssadiam'
was established, acting as a cloud-based storage location for the raw data files. A container
named 'assignment-1' (essentially a folder within Azure storage) was created under this storage
account to host the CSV files containing trending video data and the JSON files containing video
category data.
Once the files were uploaded, an Azure Shared Access Signature (SAS) token was generated.
The SAS token plays a crucial role in ensuring secure access to the files stored in Azure Blob
Storage. It offers temporary access to the storage resources, enabling us to read, write, and edit
files in accordance with the permissions granted. Since the SAS token in this instance was
configured to provide read and write rights, Snowflake was able to access the data securely for
processing and querying without exposing private credentials.
The next step involves using this SAS token to authenticate and establish an external stage in
Snowflake, enabling a smooth connection between the Snowflake environment and Azure Blob
Storage. This configuration created a secure linkage that allowed Snowflake to pull the raw data
directly from Azure for subsequent analysis. This approach not only guaranteed the protection of
the data but also enhanced the efficiency of the process by using cloud infrastructure to store
and retrieve extensive datasets.
3. Connecting Snowflake to Azure Blob Storage

To enable Snowflake to access the data stored in Azure Blob Storage, a stage was established
in Snowflake using the CREATE STAGE command. A stage in Snowflake serves as a link
between Snowflake's environment and external cloud storage, in this case Azure Blob Storage.
It gives Snowflake a logical pointer to the files' location so that it can interact with them directly
without requiring that they first be imported into Snowflake's internal storage.

In order to establish a secure connection, the previously generated Shared Access Signature
(SAS) token from Azure was utilized. The SAS token granted Snowflake the requisite privileges
to access and manipulate data in the designated Azure Blob Storage containers, ensuring a
secure connection. This ensures that unauthorized action is prevented, as the token is
time-limited and scoped to specific actions (like reading or writing) that are required for the
ingestion process.

The URL points to the exact location of the storage container in Azure where the data resides,
while the SAS token authenticates Snowflake's access to that location. Once the stage was
successfully created, Snowflake could retrieve, query, and analyze the data files directly from the
cloud without requiring manual file handling or downloading.
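
For illustration, a minimal sketch of such a CREATE STAGE statement is shown below; the stage name is illustrative, and the container URL and SAS token value are placeholders rather than the actual credentials used.

-- Stage pointing at the 'assignment-1' container in the 'utssadiam' storage account
CREATE OR REPLACE STAGE stage_youtube_raw
  URL = 'azure://utssadiam.blob.core.windows.net/assignment-1'
  CREDENTIALS = (AZURE_SAS_TOKEN = '<generated-sas-token>');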

By employing token-based authentication, this method reduces the possibility of data leakage or
unauthorized access while offering a scalable, secure, and effective way to manage large
datasets.
4. Creating External Tables

In order to optimize the retrieval and manipulation of the datasets stored in Azure Blob Storage,
external tables were established in Snowflake. An external table allows Snowflake to reference
data stored outside of its environment (in this case, in Azure Blob Storage) without needing to
import the data directly into Snowflake's internal storage. This method minimizes needless data
duplication and reduces storage costs, making it highly effective, particularly when working with
very large datasets.

Two external tables were constructed, one for the Category data and one for the YouTube
Trending data, using the 'CREATE EXTERNAL TABLE' command. To guarantee that Snowflake
could correctly interpret and query the data, each table was linked to the appropriate file
format. The trending data, being structured and stored in CSV format, required Snowflake to
treat each row as a separate record and each comma as a delimiter between fields. The category
data, however, was semi-structured and kept in JSON format, so Snowflake had to parse the
data's hierarchical structure.

For the trending data (CSV format), the external table was created as follows:

In this case, the file format is specified as CSV, and the pattern '.*[.]csv' ensures that only files
with a .csv extension in the Azure Blob Storage are read into this external table.
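
A sketch of what such a definition could look like is given below; the file format, stage, and external table names are assumptions used for illustration.

-- CSV file format and external table over the staged trending files
CREATE OR REPLACE FILE FORMAT ff_csv
  TYPE = CSV
  SKIP_HEADER = 1
  FIELD_OPTIONALLY_ENCLOSED_BY = '"';

CREATE OR REPLACE EXTERNAL TABLE ext_table_youtube_trending
  WITH LOCATION = @stage_youtube_raw
  FILE_FORMAT = (FORMAT_NAME = 'ff_csv')
  PATTERN = '.*[.]csv'
  AUTO_REFRESH = FALSE;  -- refreshed manually with ALTER EXTERNAL TABLE ... REFRESH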

For the category data (JSON format), the external table was created as follows:
Here, the file format is set to JSON, allowing Snowflake to parse the nested structure of the
JSON files. The pattern '.*[.]json' ensures that only JSON files are loaded into this external table.
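
A corresponding sketch for the JSON data, again with illustrative names:

CREATE OR REPLACE FILE FORMAT ff_json
  TYPE = JSON;

CREATE OR REPLACE EXTERNAL TABLE ext_table_youtube_category
  WITH LOCATION = @stage_youtube_raw
  FILE_FORMAT = (FORMAT_NAME = 'ff_json')
  PATTERN = '.*[.]json'
  AUTO_REFRESH = FALSE;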

By using external tables, Snowflake can directly query the data stored in Azure without copying
it into Snowflake's local storage. This makes data retrieval much more efficient, especially when
working with huge datasets, and guarantees that any modifications made to the source files in
Azure (such as adding new data) are promptly reflected in Snowflake.

5. Transfer the data from external tables into tables

Once the external tables were created and connected to the raw data stored in Azure, the next
step was to transfer this data into structured tables in Snowflake. These tables were designed to
hold well-organized data, ready for analysis and reporting.

table_youtube_trending:

To hold the data on trending videos, a new table named table_youtube_trending was created.
Each video's essential details are listed in this table, along with performance indicators such as
views, likes, and comments, as well as metadata such as the country. The following columns
were included in the table:

● video_id
● title
● publishedAt
● channelId
● channelTitle
● categoryId
● trending_date
● view_count
● likes
● dislikes
● comment_count
● country

This table was populated with data using the following SQL query:
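
A sketch of what such a statement could look like is shown below. A CSV-backed external table exposes each row through its VALUE column as fields c1, c2, and so on; the ordinal positions used here assume the CSV contains exactly the listed columns in this order, so they may need adjusting to the actual file header.

CREATE OR REPLACE TABLE table_youtube_trending AS
SELECT
    value:c1::STRING         AS video_id,
    value:c2::STRING         AS title,
    value:c3::TIMESTAMP_NTZ  AS publishedAt,
    value:c4::STRING         AS channelId,
    value:c5::STRING         AS channelTitle,
    value:c6::INT            AS categoryId,
    value:c7::DATE           AS trending_date,
    value:c8::INT            AS view_count,
    value:c9::INT            AS likes,
    value:c10::INT           AS dislikes,
    value:c11::INT           AS comment_count,
    -- country code taken from the file name, e.g. US_youtube_trending_data.csv
    SPLIT_PART(metadata$filename, '_', 1) AS country
FROM ext_table_youtube_trending;
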
The result is shown below after creating the table_youtube_trending:

table_youtube_category

Similarly, a table named table_youtube_category was created to store the category information
of each video. The category data is essential for understanding how YouTube videos are grouped
into themes or genres. The table includes the following columns:

● categoryId
● category_title
● country

The following query was used to extract and load the category data into the table from its
JSON-formatted storage:
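
A sketch of how such a query could look is shown below; it assumes the category JSON follows the YouTube API layout, with an items array whose elements carry an id and a snippet.title (these field names are assumptions).

CREATE OR REPLACE TABLE table_youtube_category AS
SELECT
    item.value:id::INT                    AS categoryId,
    item.value:snippet.title::STRING      AS category_title,
    SPLIT_PART(metadata$filename, '_', 1) AS country
FROM ext_table_youtube_category t,
     LATERAL FLATTEN(input => t.value:items) item;
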
The result is shown below after creating the table_youtube_category:

This section of the process ensured that both the trending video data and the category data
were properly structured and transformed. These tables now serve as the starting point for all
further data transformation, cleaning, and analysis once they are combined into one final table.

6. Loading Data into Final Tables

After creating the external tables, the data was transformed, properly structured, and loaded
into two tables named 'table_youtube_trending' and 'table_youtube_category'. Now we are
going to combine these two individual tables into one final table for further analysis.

One of the key transformations involved extracting the country code from the file name stored in
Azure Blob Storage. The SPLIT_PART() function in Snowflake was used to separate the country
code, since the file names (such as US_youtube_trending_data.csv) followed a particular pattern
in which the country code was embedded. Using the underscore (_) as a separator, this function
allowed us to divide the file name string into segments and retrieve the first segment, which
represents the country. By doing this, we were able to link each video or category to the
appropriate country without depending on other data sources.

For example:
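
An expression of the following form can be used (assuming the stage points directly at the folder holding the files, so that metadata$filename begins with the country code):

SPLIT_PART(metadata$filename, '_', 1) AS country
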
This line pulls the initial section, which corresponds to the country code (e.g., "US" for the
United States, "IN" for India), from the metadata$filename field, which contains the entire file
name.

Additionally, the UUID_STRING() function was used to uniquely identify each record across the
YouTube Trending and Category datasets. This function ensures that every row in the final table
has a unique, non-repeating identifier by creating a universally unique identifier (UUID) for each
record. This is particularly helpful in large datasets where several records can share the same
properties (such as the video ID or category ID) and further data processing, analysis, or joins
with other tables require a unique identifier.

The query for combining the two tables into one final table is given below:
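
A sketch of such a combining query is shown below; the LEFT JOIN keys (categoryId and country) are assumptions based on the column lists above.

CREATE OR REPLACE TABLE table_youtube_final AS
SELECT
    UUID_STRING() AS id,          -- unique identifier for every row
    t.video_id,
    t.title,
    t.publishedAt,
    t.channelId,
    t.channelTitle,
    t.categoryId,
    t.trending_date,
    t.view_count,
    t.likes,
    t.dislikes,
    t.comment_count,
    t.country,
    c.category_title
FROM table_youtube_trending t
LEFT JOIN table_youtube_category c
       ON t.categoryId = c.categoryId
      AND t.country    = c.country;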

In this query, 'UUID_STRING() AS id' generates a unique string for each row, ensuring that every
record has a distinct identifier that can be referenced later during the analysis.

The output of the final table is given below:


Here, we can see that the mentioned function has created a new column named ‘id’.

Next, we are going to check if we have successfully merged the two tables into one final table
by checking the row numbers of the final tables.

Query for checking row numbers is given below:
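
This is a simple aggregate over the final table:

SELECT COUNT(*) AS row_count
FROM table_youtube_final;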

The output is shown below:

As we can see from the output, the final table has a total of 2,667,041 rows, as mentioned in the
assignment brief. That means we have successfully merged the two tables without losing any data.

Finally, we can say that in this part, the YouTube Trending and Category datasets were
transferred from Azure Blob Storage to Snowflake. The process involved setting up an Azure
Blob Storage account, generating a SAS token for secure access, and creating an external stage
in Snowflake. External tables were created to reference the raw data without copying it into
internal storage. The data was transformed using functions like SPLIT_PART() and
UUID_STRING() to extract country codes and generate unique identifiers for each record. These
steps prepared the data for further cleaning, analysis, and insight generation.

Part 2: Data Cleaning

A crucial stage in getting the dataset ready for analysis is data cleaning, which ensures there are
no duplicates, inconsistencies, or missing entries. For this part, we focused on cleaning the
table_youtube_category and table_youtube_final datasets by addressing duplicates, missing
values, and erroneous records. The activities associated with each data cleaning task are shown
here, along with the SQL queries and an explanation of each procedure.

1. Identifying Duplicate category_title (Excluding categoryid)

The first task is to find the 'category_title' values that recur in table_youtube_category without
taking the categoryid column into account. Since certain categories might have been
unintentionally repeated, this helps verify that category names are distinct throughout the
dataset.

SQL Query:

This query groups the records by category_title and uses the HAVING clause to keep only the
categories that appear more than once. As required, the query returns a single duplicate
category_title.
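
A sketch of such a query is shown below; the inner SELECT DISTINCT reduces the data to one row per (categoryId, category_title) pair so that the duplicate check ignores the categoryid column itself.

SELECT category_title
FROM (
    SELECT DISTINCT categoryId, category_title
    FROM table_youtube_category
)
GROUP BY category_title
HAVING COUNT(*) > 1;
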
Output:

From the output, we can see that the category "Gaming" has duplicates in
'table_youtube_category'. This indicates that, even when the categoryid field is ignored, the
category_title "Gaming" exists multiple times in the dataset.

2. Finding category_title that Appears in Only One Country

The next task is to find the category_title in "table_youtube_category" that exists in only one
country. This helps us understand whether the dataset contains any regionally restricted
categories, possibly exposing local patterns or issues with data collection.

SQL Query:

This query groups the data by category_title and uses COUNT(DISTINCT country) to ensure that
only titles present in a single country are selected. This enables us to recognize categories with a
narrow geographic scope.
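
A sketch of such a query:

SELECT category_title
FROM table_youtube_category
GROUP BY category_title
HAVING COUNT(DISTINCT country) = 1;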

Output:
The output shows that the category "Nonprofits & Activism" is present in only one country in the
table_youtube_category dataset. This suggests that the category has a restricted geographic
distribution, because it is limited to a single country in the data. This may point to regional
patterns or possibly the under-representation of certain countries in videos under the
"Nonprofits & Activism" category.

3. Identifying Missing ‘category_title’ in ‘table_youtube_final’

In ‘table_youtube_final’, we needed to find the ‘categoryid’ of the rows where the ‘category_title’
was missing (i.e., NULL). Missing values can cause issues in analysis, so identifying and
addressing them is crucial.

SQL Query:

This query returns the distinct ‘categoryid’ for all records where the ‘category_title’ is missing.
This result will be used in the next step to fill in the missing values.
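
A sketch of such a query:

SELECT DISTINCT categoryId
FROM table_youtube_final
WHERE category_title IS NULL;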

Output:
The output reveals that in 'table_youtube_final', categoryid 29 is linked to the missing
'category_title'. This indicates that category ID 29 is associated with the dataset records in
which the 'category_title' is NULL. This information is crucial for the next step, where we will
replace the missing 'category_title' values with the appropriate category corresponding to
categoryid 29.

4. Updating 'table_youtube_final' to Replace NULL Values in 'category_title'

After identifying the ‘categoryid’ associated with the missing ‘category_title’, the next step is to
update the table and fill in the missing values with the correct category information.

SQL Query:

In this query, we perform an UPDATE operation on table_youtube_final, setting the
'category_title' to the value from 'table_youtube_category' where the 'categoryid' matches
between both tables and the 'category_title' in 'table_youtube_final' is NULL.
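
A sketch of such an update, using Snowflake's UPDATE ... FROM form, is shown below.

UPDATE table_youtube_final
SET category_title = c.category_title
FROM table_youtube_category c
WHERE table_youtube_final.categoryId = c.categoryId
  AND table_youtube_final.category_title IS NULL;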

Output:

In ‘table_youtube_final’, 1,563 rows were modified in total. This shows that there were 1,563
records in which the ‘category_title’ was absent (NULL). By using the matching 'categoryid' from
‘table_youtube_category’, these records were successfully updated with the right
‘category_title’.
5. Finding Videos Without ‘channeltitle’

Another task is to find the videos in 'table_youtube_final' without a 'channeltitle'. These records
should be handled appropriately, since they may indicate incomplete data.

SQL Query:

This simple query returns the ‘title’ of any video where the ‘channeltitle’ is missing, indicating
potential gaps in the data that may need further investigation.
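
A sketch of such a query:

SELECT title
FROM table_youtube_final
WHERE channelTitle IS NULL;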

Output:

The output shows that the query produced no results, meaning that there are no videos in the
‘table_youtube_final’ where the ‘channeltitle’ is missing (i.e., NULL).

This indicates that the dataset is complete in terms of channel information, and every video has
an associated ‘channeltitle’. As a result, there are no records that need further cleaning or
attention for missing channelTitle values.

6. Deleting Records with video_id = '#NAME?'

Records with invalid video_id values, such as '#NAME?', have to be deleted from the dataset.
This guarantees that there are no corrupt entries and that the data remains valid.
SQL Query:

This query deletes any record where the ‘video_id’ is equal to '#NAME?', cleaning up any invalid
data entries.
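
A sketch of such a delete statement:

DELETE FROM table_youtube_final
WHERE video_id = '#NAME?';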

Output:

According to the output, 32,081 rows were successfully removed from 'table_youtube_final'.
These removed rows correspond to entries with a video_id of '#NAME?', which denotes incorrect
or corrupted entries in the dataset.

7. Creating a new table, "table_youtube_duplicates", containing only the "bad" duplicates by using the ROW_NUMBER() function

Dealing with duplicate records in 'table_youtube_final' is the next task. The duplicates had
different metrics (likes, dislikes, etc.) even though they shared the same video_id, country, and
trending_date. To remedy this, we chose to preserve the record with the greatest view_count.

SQL Query:
This query generates a new table called 'table_youtube_duplicates' by using the ROW_NUMBER()
function to determine which videos are duplicates. The data is partitioned by video_id, country,
and trending_date, and the rows within each partition are ordered by view_count. Only the
records that are duplicates (row_num greater than 1) are selected.
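
A sketch of such a statement is shown below; rows are numbered per (video_id, country, trending_date) group in descending order of view_count, so only the lower-ranked ("bad") copies land in the new table.

CREATE OR REPLACE TABLE table_youtube_duplicates AS
SELECT *
FROM (
    SELECT f.*,
           ROW_NUMBER() OVER (
               PARTITION BY video_id, country, trending_date
               ORDER BY view_count DESC
           ) AS row_num
    FROM table_youtube_final f
)
WHERE row_num > 1;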

Output:

This output shows that the new table named ‘TABLE_YOUTUBE_DUPLICATES’ has been
created.

8. Deleting the Duplicates

Our next task is to delete the duplicates in "table_youtube_final" by using "table_youtube_duplicates".

SQL Query:

The SQL query shown removes duplicate entries from 'table_youtube_final' based on the "bad"
duplicates recorded in 'table_youtube_duplicates'. The DELETE FROM statement eliminates
records from 'table_youtube_final' where a matching entry exists in 'table_youtube_duplicates',
matched on the combination of id, country, trending_date, and view_count.
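
A sketch of such a delete, using Snowflake's DELETE ... USING form and matching on the columns mentioned above:

DELETE FROM table_youtube_final
USING table_youtube_duplicates d
WHERE table_youtube_final.id            = d.id
  AND table_youtube_final.country       = d.country
  AND table_youtube_final.trending_date = d.trending_date
  AND table_youtube_final.view_count    = d.view_count;
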
Output:

The output shows that a total of 37,466 duplicate rows were successfully deleted from
‘table_youtube_final’. These rows were identified as "bad" duplicates based on the combination
of id, country, trending_date, and view_count.

9. Verifying the Final Row Count

Our last task is to verify the row count mentioned in the question. The final table should contain
2,597,494 rows.

SQL Query:

This query ensures that the final table contains the correct number of rows after all cleaning
processes have been applied.

Output:
Finally, we verified that the number of rows in ‘table_youtube_final’ matches the expected total
of 2,597,494 rows.

The data cleaning process ensured that the dataset was free from duplicates, missing values,
and invalid entries. The ROW_NUMBER() function was used to find and eliminate duplicates,
data from the table_youtube_category was used to fill in the category_title field's missing values,
and records with incorrect video_id values were eliminated. After being thoroughly cleaned, the
dataset has a verified total of 2,597,494 rows and is now available for additional analysis.

Part 3: Data Analysis

The objective of this section is to analyze the YouTube Trending and Category datasets in order
to extract meaningful insights from the data. A number of SQL queries were run to address
specific business and analytical questions about video views, content trends, and channel
performance across different countries. These queries provide deeper insight into the popularity
of content on YouTube across various regions and genres. A detailed explanation of the analysis
for each question is provided below.

1. Top 3 Most Viewed Videos in the Gaming Category for Each Country on 2024-04-01

In order to determine the top three most viewed videos in the Gaming category for each country
on April 1, 2024, we filtered the dataset based on both the category and the date. The
view_count was used to rank the results within each country. To understand better, the SQL
query is given below:
The query uses the ROW_NUMBER() function to rank videos by view_count in descending order
within each country (PARTITION BY country), and a WHERE clause filters the records to the
trending date 2024-04-01. It selects country, title, channeltitle, view_count, and rank, keeps only
the top 3 videos for each country, and orders the results by country and rank so they are
displayed in rank order.
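
A sketch of a query matching this description (category and date literals as stated above):

SELECT country, title, channelTitle, view_count, rank
FROM (
    SELECT country, title, channelTitle, view_count,
           ROW_NUMBER() OVER (
               PARTITION BY country
               ORDER BY view_count DESC
           ) AS rank
    FROM table_youtube_final
    WHERE trending_date = '2024-04-01'
      AND category_title = 'Gaming'
)
WHERE rank <= 3
ORDER BY country, rank;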

Output:

The query highlighted the top-performing gaming videos for each country, showcasing the
diversity in gaming preferences across countries. Content producers and marketers who want to
understand regional trends in YouTube gaming can benefit from this.

2. Count of Distinct Videos with "BTS" in the Title (Case-Insensitive) by Country
In order to answer this question, we counted the number of unique videos in each country
whose title contained the term "BTS" (regardless of case). This analysis was carried out to
understand the global influence and reach of the BTS phenomenon on YouTube. To understand
better, the SQL query is given below:

The code counts distinct video_ids with the word "BTS" in the title for each country, performs a
case-insensitive search using the LOWER() function, groups results by country, and orders
results in descending order based on the count of videos, with countries with the highest
number of BTS-related videos appearing first.
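
A sketch matching this description:

SELECT country,
       COUNT(DISTINCT video_id) AS num_bts_videos
FROM table_youtube_final
WHERE LOWER(title) LIKE '%bts%'
GROUP BY country
ORDER BY num_bts_videos DESC;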

Output:

South Korea has the most BTS-related videos, with 468, reflecting the band's popularity in its
home country. India follows with 288, showcasing a strong BTS fanbase, and the US follows with
268, demonstrating BTS's global reach. This provides useful insights for content strategists and
fans of the band.

3. Most Viewed Video and Likes Ratio for Each Country by Year and
Month (2024 Only)

The goal of this query was to determine the most viewed video for each country in each month
of 2024. The likes_ratio, which is the percentage of likes relative to the view_count, was also
computed and rounded to two decimal places. To understand better, the SQL query is given
below:
The TO_CHAR() function formats the trending_date into a YYYY-MM column. The ROUND()
function calculates the likes_ratio, while the ROW_NUMBER() function ranks videos by view
count within each country and month. The query filters for the year 2024 using the
EXTRACT(YEAR FROM trending_date) = 2024 condition. The query then selects the most viewed
video (rk = 1) for each country and year_month. The result is ordered by year_month and
country to ensure chronological order and grouping by country.
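
A sketch of a query along these lines (the likes ratio is expressed as a percentage, as in the results described below):

SELECT year_month, country, title, channelTitle, category_title,
       view_count, likes_ratio
FROM (
    SELECT TO_CHAR(trending_date, 'YYYY-MM') AS year_month,
           country, title, channelTitle, category_title, view_count,
           ROUND(likes / view_count * 100, 2) AS likes_ratio,
           ROW_NUMBER() OVER (
               PARTITION BY country, TO_CHAR(trending_date, 'YYYY-MM')
               ORDER BY view_count DESC
           ) AS rk
    FROM table_youtube_final
    WHERE EXTRACT(YEAR FROM trending_date) = 2024
)
WHERE rk = 1
ORDER BY year_month, country;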

Output:

The video's trending country, month, title, channel, category, view count, and likes ratio are all
important details for analyzing user engagement. MrBeast's videos consistently had high
viewership in January and February 2024, especially in the Entertainment category. The likes
ratio indicates how many viewers liked a video relative to its total views, serving as a measure of
engagement.
Notably, the video with the highest likes_ratio for a single month and country is Rockstar
Games' Grand Theft Auto VI Trailer 1 in the US for February 2024, with a ratio of 6.73.

This query provided insights into the most popular videos and their engagement levels for each
country on a monthly basis. It highlighted videos with high engagement rates, offering useful
data for creators looking to replicate such success in future campaigns.
4. Category with the Most Distinct Videos by Country from 2022

The goal of this analysis was to determine which category contributed the most distinct videos
for each country in 2022. We also computed the proportion of distinct videos in that category
relative to the total number of distinct videos for the country. To understand better, the SQL
query is given below:

The first CTE (Common Table Expression) calculates the number of distinct videos for each
country and category in 2022, grouping the data by country and category_title. The total number
of distinct videos per country, without considering the category, is also calculated. The
MaxCategoryByCountry CTE joins these results to calculate the percentage of videos from each
category out of the total videos in the country. The ROUND() function formats the percentage to
two decimal places, and the ROW_NUMBER() function ranks categories by the number of
distinct videos for each country.
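
A sketch of such a query, with CTE names partly assumed (MaxCategoryByCountry is mentioned above; the other two names are illustrative):

WITH videos_by_category AS (
    SELECT country, category_title,
           COUNT(DISTINCT video_id) AS num_videos
    FROM table_youtube_final
    WHERE EXTRACT(YEAR FROM trending_date) = 2022
    GROUP BY country, category_title
),
videos_by_country AS (
    SELECT country,
           COUNT(DISTINCT video_id) AS total_videos
    FROM table_youtube_final
    WHERE EXTRACT(YEAR FROM trending_date) = 2022
    GROUP BY country
),
MaxCategoryByCountry AS (
    SELECT c.country, c.category_title, c.num_videos,
           ROUND(c.num_videos / t.total_videos * 100, 2) AS percentage,
           ROW_NUMBER() OVER (
               PARTITION BY c.country
               ORDER BY c.num_videos DESC
           ) AS rk
    FROM videos_by_category c
    JOIN videos_by_country t ON c.country = t.country
)
SELECT country, category_title, num_videos, percentage
FROM MaxCategoryByCountry
WHERE rk = 1
ORDER BY country;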

Output:

The "Entertainment" category is the most prolific in several countries, indicating a strong interest
in entertainment videos. The percentage representation shows the significant role of this
category in content production within each country.The percentage column shows how
significant the top category is in terms of content production within each country. For example, if
the "Entertainment" category accounts for 35.67% in a country, it means that over a third of all
distinct videos from that country in 2022 belong to this category.Country-specific trends may
also influence the top categories, such as "Music" in Japan, "KR" in South Korea, or
"Entertainment" or "Music" in India.

5. Channel with the Most Distinct Videos

This query identifies the channel that has produced the most distinct videos across the dataset.
Knowing which channels are the most prolific gives a sense of which content creators are
producing the most content on YouTube. To understand better, the SQL query is given below:

The COUNT function counts distinct videos per channel, ensuring that only unique videos are
counted. The GROUP BY channeltitle clause groups the results by channel title, while ORDER BY
num_distinct_videos DESC sorts the results in descending order. The LIMIT 1 clause restricts the
result to the top channel, i.e. the one that has produced the most distinct videos.
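
A sketch matching this description:

SELECT channelTitle,
       COUNT(DISTINCT video_id) AS num_distinct_videos
FROM table_youtube_final
GROUP BY channelTitle
ORDER BY num_distinct_videos DESC
LIMIT 1;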

Output:

This output shows that the channel "Vijay Television" has produced the most distinct videos, with
a total of 2,049. This means that, out of all the channels in the dataset, Vijay Television is the
most prolific in terms of unique content (i.e., it has the largest number of unique video_id
values). This result highlights Vijay Television as a highly active channel on YouTube, creating a
significant number of distinct videos compared to other channels.

The data analysis of the YouTube dataset aimed to understand video trends across countries
and categories. The top three most viewed gaming videos were identified for each country on
the trending date "2024-04-01". The most distinct BTS videos were found in South Korea,
indicating the global popularity of the South Korean band. The most viewed video and likes ratio
for 2024 were calculated for each country and month, with popular channels like MrBeast
dominating viewership. The most distinct videos by category were found in the "Entertainment"
category in most countries, while "Gaming" featured prominently in the US and Canada. Vijay
Television was identified as the most prolific content creator on YouTube.

Part 4: Business Question

In this part, the goal is to determine which category of YouTube videos (excluding "Music" and
"Entertainment") would be the most strategic to focus on when launching a new YouTube
channel that aims to get its videos into the top trending lists, and whether this strategy would
work across different countries.

I have focused on two different approaches to find the better strategy to launch a new YouTube
channel with top trending videos. The approaches are mentioned below:

1. Top Categories Excluding Music and Entertainment:

The first step was to determine which categories (excluding "Music" and "Entertainment") have
the most trending videos across all countries. This gives us insight into which video categories
are generally more popular worldwide. The SQL query is attached below:

The query aggregates trending videos across countries, excluding Music and Entertainment; it
counts the total videos per category and the number of countries each category appears in,
identifying globally popular categories.
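
A sketch of such a query (whether "total videos" counts rows or distinct video_ids is an assumption; distinct video_ids are used here):

SELECT category_title,
       COUNT(DISTINCT video_id) AS total_trending_videos,
       COUNT(DISTINCT country)  AS num_countries
FROM table_youtube_final
WHERE category_title NOT IN ('Music', 'Entertainment')
GROUP BY category_title
ORDER BY total_trending_videos DESC;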

2. Country-Specific Category Rankings:

The second part ranked each country's top categories based on several factors, including:

● Distinct trending videos
● Average views per video
● Average likes, dislikes, and comments per video

The SQL query is attached below:

The first CTE filters the dataset to 2024 data, excluding Music and Entertainment. The
trending_categories CTE calculates statistics such as average views, likes, and comments per
video by country and category. The ranking_categories CTE ranks categories based on the
number of distinct videos per country, and the final SELECT retrieves the top-ranked category
for each country.
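
A sketch following this structure (CTE names taken from the description above; the exact statistics and their aliases are assumptions):

WITH dataset AS (
    SELECT *
    FROM table_youtube_final
    WHERE EXTRACT(YEAR FROM trending_date) = 2024
      AND category_title NOT IN ('Music', 'Entertainment')
),
trending_categories AS (
    SELECT country, category_title,
           COUNT(DISTINCT video_id) AS num_videos,
           AVG(view_count)          AS avg_views,
           AVG(likes)               AS avg_likes,
           AVG(dislikes)            AS avg_dislikes,
           AVG(comment_count)       AS avg_comments
    FROM dataset
    GROUP BY country, category_title
),
ranking_categories AS (
    SELECT t.*,
           ROW_NUMBER() OVER (
               PARTITION BY country
               ORDER BY num_videos DESC
           ) AS rk
    FROM trending_categories t
)
SELECT country, category_title, num_videos,
       avg_views, avg_likes, avg_dislikes, avg_comments
FROM ranking_categories
WHERE rk = 1
ORDER BY country;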

Outputs and Analysis

Global Insights:
From the global results, the Gaming category had the highest number of trending videos
worldwide (62,009), followed by Sports with 55,305 and People & Blogs with 50,122. All of these
categories appeared in 10 different countries, indicating their global appeal and presence on the
YouTube trending lists.

Country-Specific Insights:

With 627 unique trending videos and an astounding average view count of over 10.6 million
views per video, Gaming is the top category in the United States. This is a crucial sign of its
acceptance and popularity there. Canada, Japan, and Mexico also showed Gaming as the top
category, with substantial average views per video (7.5 million in Canada, 9.3 million in Japan,
and 1.9 million in Mexico). Other countries, such as Brazil, Germany, France, and the U.K.,
showed Sports as the dominant category, with high levels of engagement (5.2 million to 6.8
million average views per video). People & Blogs was the most popular category in both South
Korea and India, with relatively high average view counts of 11.6 million and 3.5 million,
respectively.

Based on the analysis, the best category to pick would be Gaming. Here's why:

● Global Popularity: In a number of nations, including the United States, Canada, Japan,
and Mexico, gaming continuously holds the top spot. With 62,009 trending videos, it has
the most trending videos worldwide.

● Engagement: The Gaming category has remarkably high average views per video. With
an average of more than 10.6 million views per video in the United States, this category
excels in terms of audience interaction.

● Widespread Appeal: The popularity of gaming is evident not only in the United States but
also in Canada, Japan, Mexico, and other countries, where it holds the top spot. This
increases the category's potential for growth into new markets.

While Sports is strong in certain countries like Brazil and Germany, Gaming has a more
consistent presence across a variety of key regions, making it the most strategic choice for
launching a YouTube channel.

Challenges and Solutions

Many obstacles were faced during this assignment, all of which needed to be carefully
examined and creatively solved in order to guarantee a reliable data flow and appropriate
analysis.

Challenge 1: Invalid URL in Azure Blob Storage

During the data ingestion phase, an error occurred when connecting Azure Blob Storage with
Snowflake due to an incorrectly formatted SAS token, which blocked secure access to the data
stored in Azure.

Solution:

By closely examining the SAS token creation procedure and making sure the right permissions
were granted, the problem was fixed. To further assure its legitimacy, the storage account URL
was double-checked, which made it possible for Snowflake to access the data without any
problems.

Challenge 2: Handling CSV and JSON File Formats

The dataset contained both CSV and JSON files. Ensuring that both file types were appropriately
ingested and processed was a challenge, especially the JSON data, which required extra
processing steps due to its nested structures.

Solution:

Using the CREATE FILE FORMAT command, appropriate file format configurations were set up
in Snowflake, with distinct parameters for CSV and JSON files. Certain parsing options were
used on JSON files in order to retrieve the data required for analysis and flatten nested
structures.
Challenge 3: Duplicate Records and Data Cleaning

Duplicates in the 'table_youtube_final' table presented a challenge, as different records with the
same video_id, country, and trending_date could have varying metrics such as view_count,
likes, and dislikes. Because of this, the row count initially did not match the expected output.

Solution:

The duplicates were identified using the ROW_NUMBER() function, ranking records within each
group by view_count. Then, the "bad" duplicates were separated into a new table
(table_youtube_duplicates) and deleted from the main table, ensuring that only the highest
view_count record for each video was kept. Also, instead of video_id, I used the id column to
delete the rows; when the deletion was matched on 'id', the resulting row count matched the
output provided.

Conclusion

Using a Snowflake Data Lakehouse environment, this project illustrates the entire data analysis
lifecycle, from data ingestion and cleaning to in-depth trend analysis and answering business
questions.

The YouTube Trending dataset provided rich insight into global video trends across multiple
countries. This dataset was successfully imported into Snowflake from Azure Blob Storage.
Duplicate values were removed, and the dataset was cleaned using SQL to enable thorough
analysis. This helped us find the most popular channels, the categories with the most trending
videos, and the answers to important business questions, such as which video category to focus
on to obtain the greatest exposure.

Based on our research, the Gaming category consistently ranked highly in a number of different
countries, which makes it the best category for starting a new YouTube channel that aims to
follow global trends. However, depending on the popularity of certain categories in a given
location, localized strategies could also concentrate on categories like Sports in countries such
as Brazil and Germany, or People & Blogs in countries like India and South Korea.

Overall, the use of Snowflake provided a powerful platform for handling large datasets, with
scalability and flexibility that allowed us to ingest, clean, and analyze data with ease. The results
of the analysis provide actionable insights for launching successful YouTube channels in various
regions, demonstrating the power of data-driven decision-making.

