When you have an app that is a number of years old, you'll find that the database schema has changed over the years. As a DB admin or analyst, this can be very frustrating as you'll find that you have to write logic in your SQL to work around having data you need in two different places/tables, with the need to combine them into one column for analysis.
Scenario
You need to export all of your article and author data from one database to another. The problem is that author data is stored in two locations after a change was made to the database schema.
Initially the app stored authors within the article
table as a string value in the author
column.
article_id | author | slug |
---|---|---|
1 | Mary Poppins | twas-an-east-wind |
2 | Jane Banks | my-dad-doesnt-listen |
After a while, a change was made to the app where two new tables were created, authorships
and author
. This allowed multiple authors to be connected to an article; a much better choice of schema. Now in the authors
table we can have additional information about the author that can be used to populate an author page and have separate links on an article page and ensure we keep each author separate in the database.
author_id | first_name | last_name | |
---|---|---|---|
110 | Mary | Poppins | twitter.com/mpoppins |
111 | Jane | Banks | twitter.com/jbanks |
The authorships
table can associate many articles to many authors. In the example below we have two authors 110
and 111
assigned to article_id
1
.
author_id | article_id |
---|---|
110 | 1 |
111 | 1 |
The problem is that when the new tables were created, there wasn't a migration for the old data to the new tables. So some articles will have the author information stored in the articles
table and other articles will have a BLANK
value in the articles
table and instead have information stored in the authorships
/authors
tables.
So you are now faced with three tables (articles
, authorships
and authors
) that have unique information for the authorship of each article.
Tech debt
Why did the app developer not migrate the author data? The app requires logic to render the article page properly. It has to now check both tables for an author entry and use an IF statement to determine what table to use. Ideally the app developer would have migrated all old articles to use the new authorships
and author
tables, so that all article data is stored in the same way in the database, instead of creating a divide in how the author data is stored.
This is called "technical debt" and you need to pay up.
IF statement to the rescue
If you were to solve this with an IF statement with a scripting language like Python, you could do the following:
# check to see if the article table has an author
if ( article_author != NULL ):
# get the author from the article table
author_byline = article_author
else:
# get the author from the authorships table
author_byline = authorship_author
This is straighforward and may be the first thing you think of doing. IF/ELSE is what computers were designed for. However when you're doing analysis, you'll want to avoid having to run your analysis with an additional scripting language like Python etc. We can transform the data on the fly using the right SQL operators.
Creating an IF statement with SQL using CASE
CASE
is a great way to perform conditional logic in your SQL. You can throw it in your SELECT
or WHERE
statement. Here is an example.
SELECT
CASE
WHEN author.first_name = 'Mary' AND author.last_name = 'Poppins'
THEN CONCAT(author.first_name, ' ', author.last_name, ' is practically perfect in every way')
ELSE CONCAT(author.first_name, ' ', author.last_name)
END
FROM
authors
;
Here we check the author name. WHEN
author matches 'Mary Poppins' THEN
we create a new value on the fly by using CONCAT() to put together a string of text that appends ' is practically perfect in every way'. The end result being "Marry Poppins is practically perfect in every way".
Silly example, but you get the point.
Writing the initial query
I always like to write pseudo SQL in the beginning so I can annotate the information I need with any details that are helpful to remember. I also add a LIMIT 100
(if using MySQL or Postgres) or TOP 100
(if using MSSQL) to ensure I'm not burdening the database server. I do this as a habit to ensure I'm not needlessly grabbing 100,000's of rows for no reason. I just need to QA a small data set for now.
SELECT
articles.article_id
, articles.article_url
-- check for the author name from the articles table
, articles.author
-- check for the author name from the authorships/author table
, CONCAT(authors.first_name, ' ', authors.last_name)
, articles.article_status
FROM
articles
LIMIT 100
This query is unusable. But it's a starting point to help me understand the data objectives I have.
You may notice that in order to get authors.first_name
and authors.last_name
you would need to JOIN
two tables authors
and authorships
. That could be done with the following JOIN
statements:
LEFT JOIN authorships ON articles.article_id = authorships.article_id
LEFT JOIN authors ON authors.author_id = authorships.author.id
I'd do a LEFT JOIN
here, as I know that only a subset of articles have author information assigned to them. A LEFT JOIN
ensures that I show all the article rows even if the authorships
table doesn't have any data connected to an article. If I left it as a simple JOIN
or INNER JOIN
then the only results I'd get back would be articles that do have a connected author via the authorships
table.
This all said though, we actually want to add this JOIN
inside of a subquery.
Writing an SQL Subquery
A subquery is a regular query that is put inside of parenthesis. You can add a subquery in a SELECT
statement, or in a WHERE
statement. I personally like to write my subqueries separately and test them before adding them as a subquery. Helps with QA.
Complete SQL query
SELECT
article.article_id
, article.url
, CASE
WHEN article.author IS NULL
THEN (SELECT CONCAT(authors.first, ' ', authors.last)
FROM authors
JOIN authorships ON authorships.author_id = authors.author_id
WHERE authorships.article_id = a.article_id
LIMIT 1)
ELSE article.author
END as author
, articles.article_status
FROM
articles
;