
Data cleaning with SQL
Cleaning a dataset of Audible audiobooks using PostgreSQL.
Introduction
Data cleaning is a fundamental part of the data analysis pipeline. It involves transforming raw data into a usable format by resolving inconsistencies, handling missing values, and ensuring the data adheres to a standardized structure. This process lays the groundwork for accurate analysis and, ultimately, actionable insights.
In this case study, I will walk you through the SQL queries I used to clean a 10,000-row dataset of Audible audiobooks. For this project, I used PostgreSQL as the database management system and DBeaver as the administration tool. You can access the full SQL script on my GitHub repository.
Identifying and handling duplicates
To start, I used a Common Table Expression (CTE) to check for potential duplicates in the dataset. While no duplicates were found in this particular dataset, the CTE could easily be adapted to delete duplicate rows so that only unique entries are retained (a sketch of that variant follows the query below). It's important to validate duplicates carefully before removal to avoid unintentional data loss.
WITH duplicate_cte AS (
    SELECT *,
        ROW_NUMBER() OVER(PARTITION BY name, author, narrator, time, releasedate, language, price) AS row_num
    FROM audible_audiobooks.audible_staging
)
SELECT *
FROM duplicate_cte
WHERE row_num > 1;
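Had any duplicates turned up, deletion could reuse the same structure. A minimal sketch, using PostgreSQL's ctid system column (rows can't be deleted from the CTE's result directly):
WITH duplicate_cte AS (
    SELECT ctid,
        ROW_NUMBER() OVER(PARTITION BY name, author, narrator, time, releasedate, language, price) AS row_num
    FROM audible_audiobooks.audible_staging
)
-- Delete every row beyond the first occurrence of each duplicate group
DELETE FROM audible_audiobooks.audible_staging
WHERE ctid IN (
    SELECT ctid
    FROM duplicate_cte
    WHERE row_num > 1
);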
Standardizing name formats
A significant part of the cleaning process involved inconsistencies in how names were written in the author and narrator columns. Names were stored as concatenated strings without spaces, often with multiple names in the same field (e.g., "JaneAusten" rather than "Jane Austen"), making them hard to read. To fix this, I used regular expressions (REGEX) to insert spaces where appropriate. Because the dataset contains many edge cases (initials, digits, accented and Cyrillic characters), the query chains several REGEXP_REPLACE calls into one long but effective statement. This approach most likely didn't catch every possible variation, but it significantly improved the readability of the vast majority of names.
UPDATE audible_audiobooks.audible_staging
SET author =
    REGEXP_REPLACE(
    REGEXP_REPLACE(
    REGEXP_REPLACE(
    REGEXP_REPLACE(
    REGEXP_REPLACE(
    REGEXP_REPLACE(
    REGEXP_REPLACE(
    REGEXP_REPLACE(
    REGEXP_REPLACE(
    REGEXP_REPLACE(
    REGEXP_REPLACE(
    REGEXP_REPLACE(author,
        '([a-z])([A-Z])', '\1 \2', 'g'),           -- space at lowercase-uppercase boundaries ("JaneAusten" -> "Jane Austen")
        '([A-Za-z])\.(?=[A-Za-z])', '\1. ', 'g'),  -- space after a period between letters, e.g. initials
        ',([A-Za-z])', ', \1', 'g'),               -- space after commas separating names
        '([A-Za-z])([0-9])', '\1 \2', 'g'),        -- space between a letter and a digit
        '([0-9])([A-Za-z])', '\1 \2', 'g'),        -- space between a digit and a letter
        ',([A-Za-zÅåÄäÖö])', ', \1', 'g'),         -- comma rule extended to Nordic characters
        '([ÁÉÍÓÚáéíóú])([A-Z])', '\1 \2', 'g'),    -- space after accented vowels before uppercase letters
        '([А-Яа-я])([А-Я])', '\1 \2', 'g'),        -- space before uppercase Cyrillic letters
        '([А-Яа-я])\.(?=[А-Яа-я])', '\1. ', 'g'),  -- space after periods in Cyrillic initials
        ',([А-Яа-я])', ', \1', 'g'),               -- space after commas between Cyrillic names
        'Mc ', 'Mc', 'g'),                         -- undo the split inside surnames like "McCarthy"
        'BKFKStudio', 'BKFK Studio', 'g');         -- fix one specific entry the general rules miss
Correcting typos
During the data cleaning process, I identified a recurring typo in the narrator column. I used the REPLACE function to systematically correct the spelling, ensuring consistency across the dataset.
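The exact misspelled value isn't reproduced here, so the query below shows the pattern with placeholder strings ('misspelled name' and 'corrected name' are stand-ins, not values from the dataset):
-- Correct a recurring typo in the narrator column
-- ('misspelled name' / 'corrected name' are illustrative placeholders)
UPDATE audible_audiobooks.audible_staging
SET narrator = REPLACE(narrator, 'misspelled name', 'corrected name')
WHERE narrator LIKE '%misspelled name%';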
Standardizing language entries
The language column also required standardization. To ensure a uniform format, I used the INITCAP function to capitalize the first letter of each word in the language names (e.g., 'english' becomes 'English'). This simple adjustment improved the overall consistency of the dataset.
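A quick look at the distinct values reveals the inconsistent casing (an inspection step preceding the update):
SELECT DISTINCT language
FROM audible_audiobooks.audible_staging
ORDER BY language;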
UPDATE audible_audiobooks.audible_staging
SET language = INITCAP(language);
Date format normalization
Date formatting is a common challenge in data cleaning, especially when a dataset uses inconsistent formats. Here, dates were stored in dd/mm/yy format, so I used a SQL query to convert them into the standard yyyy-mm-dd format, which is compatible with further analysis. Note that PostgreSQL resolves a two-digit year to the year nearest 2020, so '08/04/18' becomes '2018-04-08'.
UPDATE audible_audiobooks.audible_staging
SET releasedate = TO_CHAR(TO_DATE(releasedate , 'DD/MM/YY'), 'YYYY-MM-DD');
Time standardization
The time column, which represents the audiobook's listening duration, was stored as a string in various formats (e.g., "2 hrs and 3 mins", "10 hrs"). To standardize this, I separated the hours and minutes into distinct columns, using the word "and" to split entries that contain both. I then removed the "hrs" and "mins" text so the values could be converted to numeric format for easier manipulation and calculation.
-- Add new columns: hours and mins
ALTER TABLE audible_audiobooks.audible_staging
    ADD COLUMN hours VARCHAR(50),
    ADD COLUMN mins VARCHAR(50);

-- Split values that include both hours and minutes into their respective columns
UPDATE audible_audiobooks.audible_staging
SET
    hours = SPLIT_PART(time, 'and', 1),
    mins = SPLIT_PART(time, 'and', 2)
WHERE time LIKE '%and%';

-- Remove ' hrs', ' hr', ' mins' and ' min' text from the new columns
UPDATE audible_audiobooks.audible_staging
SET
    hours = REPLACE(REPLACE(hours, ' hrs', ''), ' hr', ''),
    mins = REPLACE(REPLACE(mins, ' mins', ''), ' min', '')
WHERE time LIKE '%and%';

-- Transfer minutes-only values from time to the mins column
UPDATE audible_audiobooks.audible_staging
SET mins = time
WHERE time NOT LIKE '%and%';

UPDATE audible_audiobooks.audible_staging
SET mins = REPLACE(REPLACE(mins, ' mins', ''), ' min', '')
WHERE time NOT LIKE '%and%';

-- Move hours-only values (which landed in mins) to the correct column
SELECT time, hours, mins
FROM audible_audiobooks.audible_staging
WHERE mins LIKE '%hr%';

UPDATE audible_audiobooks.audible_staging
SET
    hours = mins,
    mins = '0'
WHERE mins LIKE '%hr%';

-- Verify and strip the remaining 'hrs'/'hr' text from hours
SELECT time, hours, mins
FROM audible_audiobooks.audible_staging
WHERE hours LIKE '%hr%';

UPDATE audible_audiobooks.audible_staging
SET hours = REPLACE(REPLACE(hours, ' hrs', ''), ' hr', '')
WHERE hours LIKE '%hr%';
Calculating total listening time
Once the time data was standardized, I calculated the total listening time for each audiobook in minutes. By having hours and minutes stored in numeric columns, I was able to perform this calculation easily and ensure that the dataset was ready for analysis.
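The conversion of the cleaned hours and mins strings into the numeric columns hours_numeric and mins_numeric (referenced below and in the final renames) lives in the full script; a minimal sketch, with the column types being my assumption:
-- Sketch: convert the cleaned strings to numeric columns
-- (NUMERIC(10, 0) is assumed; see the full script for the exact types)
ALTER TABLE audible_audiobooks.audible_staging
    ADD COLUMN hours_numeric NUMERIC(10, 0),
    ADD COLUMN mins_numeric NUMERIC(10, 0);

-- COALESCE/NULLIF default empty or missing parts to 0, since rows
-- without an 'and' never received a value in one of the two columns
UPDATE audible_audiobooks.audible_staging
SET
    hours_numeric = CAST(COALESCE(NULLIF(TRIM(hours), ''), '0') AS NUMERIC),
    mins_numeric = CAST(COALESCE(NULLIF(TRIM(mins), ''), '0') AS NUMERIC);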
ALTER TABLE audible_audiobooks.audible_staging
    ADD COLUMN total_time_mins NUMERIC(10, 0);

UPDATE audible_audiobooks.audible_staging
SET total_time_mins = (hours_numeric * 60) + mins_numeric;
Converting price and ratings to numeric format
Similar to the time column, the price and stars (ratings) columns were also converted to numeric formats for more precise analysis. This allows for further calculations, such as identifying price trends or analyzing correlations.
-- Add two new columns: price_temp and price_numeric
ALTER TABLE audible_audiobooks.audible_staging
    ADD COLUMN price_temp VARCHAR(50),
    ADD COLUMN price_numeric NUMERIC(10, 2);

-- Copy price into price_temp, removing the commas used as thousands separators
UPDATE audible_audiobooks.audible_staging
SET price_temp = REPLACE(price, ',', '');

-- Update price_temp instances with the value 'Free' to '0'
UPDATE audible_audiobooks.audible_staging
SET price_temp = '0'
WHERE price_temp = 'Free';

-- Convert price_temp to numeric format
UPDATE audible_audiobooks.audible_staging
SET price_numeric = CAST(price_temp AS NUMERIC);
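The stars conversion isn't shown above; the full script handles it along the same lines. A sketch of one approach, assuming entries of the form '4.5 out of 5 stars' and a rating_numeric column (both assumptions on my part):
-- Sketch: extract the leading rating value from stars
-- (format '4.5 out of 5 stars' is assumed; unrated entries stay NULL)
ALTER TABLE audible_audiobooks.audible_staging
    ADD COLUMN rating_numeric NUMERIC(3, 1);

UPDATE audible_audiobooks.audible_staging
SET rating_numeric = CAST(SPLIT_PART(stars, ' out of', 1) AS NUMERIC)
WHERE stars LIKE '%out of%';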
Final adjustments
Once the data was cleaned, I created a new table to store the cleaned dataset. At this point, I also removed unnecessary columns and renamed others to provide clarity and adopt consistent naming practices. This ensured that the final dataset was both well structured and easy to work with.
-- Preview the cleaned staging table
SELECT *
FROM audible_audiobooks.audible_staging;

-- Create the final version of the table
CREATE TABLE audible_audiobooks.audible_cleaned AS
TABLE audible_audiobooks.audible_staging;

-- Remove unnecessary columns (price, price_temp, hours, mins, time, stars)
ALTER TABLE audible_audiobooks.audible_cleaned
    DROP COLUMN price,
    DROP COLUMN price_temp,
    DROP COLUMN hours,
    DROP COLUMN mins,
    DROP COLUMN time,
    DROP COLUMN stars;

-- Rename price_numeric to price, hours_numeric to hours, mins_numeric to mins
ALTER TABLE audible_audiobooks.audible_cleaned
    RENAME COLUMN price_numeric TO price;
ALTER TABLE audible_audiobooks.audible_cleaned
    RENAME COLUMN hours_numeric TO hours;
ALTER TABLE audible_audiobooks.audible_cleaned
    RENAME COLUMN mins_numeric TO mins;
Conclusion
Through the data cleaning process, the dataset was transformed into a more organized and consistent form. By resolving formatting issues and enforcing consistency across the data, I now have a solid foundation for further exploration and insight generation. For the full SQL script, as well as before-and-after versions of the dataset, please visit my GitHub repository.
