Analyzing Youtube Trends in a Data Lakehouse

04 Sep 2022

Snowflake SQL Azure

In today’s data-driven world, platforms like YouTube generate vast amounts of data daily. Analyzing this data can provide valuable insights into user behavior and trending content. In this blog, we’ll explore how to use a Data Lakehouse on Snowflake to analyze YouTube’s trending data. We’ll walk through the entire process, from setting up cloud storage to deriving meaningful insights.

Link to Github

Why Snowflake?

Snowflake is a fully managed SaaS solution that acts as a Data Warehouse, Data Lake, or Data Lakehouse. It operates on AWS, Azure, or Google Cloud Platform, making it versatile and accessible. Its unique architecture separates storage and compute, enabling independent scaling and optimizing performance.

Dataset
Infrastructure Setup
1. Integrate Snowflake with Azure
2. Data Ingestion on Snowflake
Data Cleaning
Data Analysis

Dataset

This dataset is a daily record of the top trending YouTube videos. It includes several months of data on daily trending videos for multiple regions, including the US, GB, DE, CA, FR, RU, MX, KR, JP, and IN (USA, Great Britain, Germany, Canada, France, Russia, Mexico, South Korea, Japan, and India)

Each region’s data is stored in a separate file and contains detailed information about the trending videos, such as video title, channel title, publish time, tags, views, likes, dislikes, description, and comment count. Additionally, the dataset features a category_id field, which can be cross-referenced with the associated JSON files to retrieve category details for specific videos.

The dataset is available here.

Infrastructure Setup

A Snowflake and Cloud Storage account are needed for this project. Microsoft Azure was used due to its ease of use and availability of free credits for students.

Integrate Snowflake with Azure

Set up an Azure account and upload all the uncompressed dataset files into a new storage container. Retrieve the Directory ID, so that it can be used for integration with Snowflake.
In a new Snowflake worksheet, use SQL commands to create a new database and switch to the database.
```
 CREATE OR REPLACE DATABASE bde_at_1;
 USE DATABASE bde_at_1;
```

Set up a storage integration with Azure using the following script.

 CREATE OR REPLACE STORAGE INTEGRATION azure_bde_at_1
 TYPE = EXTERNAL_STAGE
 STORAGE_PROVIDER = AZURE
 ENABLED = TRUE
 AZURE_TENANT_ID = 'your-tenant-id'
 STORAGE_ALLOWED_LOCATIONS = ('azure://your-storage-url');

Assign the necessary roles and permissions on Azure to allow Snowflake to access the storage container.

Data Ingestion on Snowflake

Once our storage is set up, we need to load the data into Snowflake.

Create a Stage: A stage is a named location where data files are stored to be loaded into or unloaded from Snowflake tables, providing a seamless way to interact with external storage.
```
  CREATE OR REPLACE STAGE stage_bde_at_1
  STORAGE_INTEGRATION = azure_bde_at_1
  URL = 'azure://utsbdeat1.blob.core.windows.net/bde-at1';
```
After creating the stage, the files stored in the stage can be viewed with the ‘LIST’ command.
Load CSV and JSON Files: As there were both csv and json files present, different approaches were required to process these file formats into Snowflake tables.
- For the CSV files, the following script was used to specify the file format to interpret their structure and content.
```
 CREATE OR REPLACE FILE FORMAT file_format_csv TYPE = 'CSV'
 FIELD_DELIMITER = ','
 SKIP_HEADER = 1
 NULL_IF = ('\\N', 'NULL', 'NUL', '') FIELD_OPTIONALLY_ENCLOSED_BY = '"';
```
  This file format is then used to create an external table by specifying the schema (column names and data types), the location (stage), the file format (file_format_csv), and the filename (pattern = ‘filename’). Additionally, a new column for the country is included to distinguish between the files. The external tables were combined into a single table called table_youtube_trending.
- For JSON files, specifying the file format as JSON imports the data as a string. A lateral flatten is used to obtain fields at the same hierarchy. The data is then loaded into external tables with the required columns, including the country. These external tables are combined into a single table and call it table_youtube_category.
Merging the tables: The final trending and category tables are merged by joining on country and category ID. The new table is called table_youtube_final.

Data Cleaning

To ensure data quality and accuracy, several cleaning steps were performed on the datasets to remove duplicates and handle missing values. Here’s a summary of the key data cleaning steps:

Identify Duplicate Categories: The table_youtube_category was checked for duplicate category_titles, with Comedy the only category found to contain duplicates.
```
 SELECT country, category_title, COUNT(*) AS count
 FROM table_youtube_category
 GROUP BY country, category_title
 HAVING COUNT(*) > 1;
```
Single-Country Categories: The category_title that only appeared in one country in the table_youtube_category was Nonprofits & Activism.
```
 SELECT category_title, COUNT(*)
 FROM table_youtube_category
 GROUP BY category_title
 HAVING COUNT(*) = 1;
```

Handle Missing Category Titles: In table_youtube_final, the categoryid for records with missing category_title was found to be 29.

 SELECT country, categoryid 
 FROM table_youtube_final
 WHERE category_title IS NULL
 GROUP BY country, categoryid;

Update Missing Category Titles: The table_youtube_final was updated to replace NULL values in category_title with the identified category from the previous step.
```
 UPDATE table_youtube_final
 SET category_title = 'Nonprofits & Activism'
 WHERE category_title IS NULL;
```
Videos without Channel Titles: Only 1 Video in table_youtube_final that didn’t have a channeltitle was found: Kala Official Teaser | Tovino Thomas | Rohith V S | Juvis Productions | Adventure Company
```
 SELECT title, channeltitle
 FROM table_youtube_final
 WHERE (channeltitle is NULL or channeltitle = '');
```
Remove Invalid Records: Records from table_youtube_final where video_id was “#NAME?” were deleted.
```
 DELETE FROM table_youtube_final
 WHERE video_id = '#NAME?';
```

Handle Duplicate Records: A new table called table_youtube_duplicates was created to store “bad” duplicates, assuming the record with the highest view_count is the one to keep. The bad duplicates were then deleted from table_youtube_final.

 CREATE OR REPLACE TABLE table_youtube_duplicates AS
 SELECT  *
 FROM (SELECT *, ROW_NUMBER() OVER(PARTITION BY country, video_id, trending_date ORDER BY view_count DESC) AS rownum 
       FROM table_youtube_final) q
 WHERE rownum > 1;

 DELETE FROM table_youtube_final
 WHERE EXISTS( SELECT 1 FROM table_youtube_duplicates Where table_youtube_final.id = table_youtube_duplicates.id);

The number of rows in table_youtube_final was counted to ensure it contained 1,123,017 rows, verifying the integrity of the cleaned dataset.
```
 SELECT COUNT(*) 
 FROM table_youtube_final;
```

Data Analysis

With clean data, various analyses were performed:

On the trending date ‘2021-10-17’, the analysis focused on identifying the top three most viewed sports videos for each country. The videos were ranked by their view counts, and the results were filtered accordingly. This provided insights into the popularity of sports content across different regions on that specific day.

SELECT *
FROM (SELECT country, title, channeltitle, view_count, RANK() OVER (PARTITION BY country ORDER BY view_count DESC) AS rk
        FROM table_youtube_final
    WHERE trending_date = '2021-10-17'
    AND category_title = 'Sports') q
WHERE rk IN (1,2,3);

Popularity of “BTS” Videos

The analysis then examined the global popularity of BTS by counting the number of distinct videos with “BTS” in their titles for each country. The results were ordered in descending order of count, revealing the countries where BTS videos are trending the most. This highlighted the widespread appeal and significant following of BTS across various regions.

SELECT country, count(distinct video_id) as ct 
FROM table_youtube_final
WHERE title LIKE '%BTS%'
GROUP BY country
ORDER BY ct DESC;

Most Viewed and Liked Videos by Month

For each country and month, the analysis identified the video with the highest view count and calculated its likes ratio (percentage of likes against view count). This helped to pinpoint the most popular and well-received videos over time, providing valuable insights into viewer preferences and engagement trends across different periods.

SELECT country, year_month, title, channeltitle, category_title, view_count, ROUND(likes * 100.0 / view_count, 2) AS like_ratio
FROM (SELECT *, DATE_FROM_PARTS(YEAR(trending_date), MONTH(trending_date), 1 ) AS year_month, ROW_NUMBER()OVER (PARTITION BY country, year_month ORDER BY view_count DESC) AS rownum
        FROM table_youtube_final
    )
WHERE rownum = 1
ORDER BY year_month, country;

Distribution of Video Categories

To understand the distribution of content types, the analysis calculated the percentage of distinct videos for each category in different countries. This involved counting the number of distinct videos for each category title and dividing by the total number of distinct videos in that country. The results revealed the prevalence of various video categories in different regions, indicating which types of content are most popular.

SELECT a.country, category_title, total_category_video, total_country_video, ROUND(total_category_video * 100.0 / total_country_video, 2) AS percentage  
FROM 
    (SELECT country, category_title, count(distinct video_id) as total_category_video
    from table_youtube_final
    group by country, category_title
    order by total_category_video desc) a
LEFT JOIN 
    (SELECT country, count(distinct video_id) as total_country_video
    from table_youtube_final
    group by country
    order by total_country_video desc) b
ON a.country = b.country
ORDER BY category_title, country ASC;

Most Prolific Channels

The analysis also identified the most prolific content creators on YouTube by counting the number of distinct videos produced by each channel. The channel Colors TV emerged as the top content creator with 805 distinct videos, the highest in the dataset. This highlighted the channels with extensive content production and their significant contribution to the platform.

SELECT channeltitle, count(distinct video_id) AS ct
FROM table_youtube_final
GROUP BY channeltitle
ORDER BY ct DESC;