When you have an app that is a number of years old, you'll find that the database schema has changed over the years. As a DB admin or analyst, this can be very frustrating as you'll find that you have to write logic in your SQL to work around having data you need in two different places/tables, with the need to combine them into one column for analysis.
You need to export all of your article and author data from one database to another. The problem is that author data is stored in two locations after a change was made to the database schema.
Initially the app stored authors within the
article table as a string value in the
After a while, a change was made to the app where two new tables were created,
author. This allowed multiple authors to be connected to an article; a much better choice of schema. Now in the
authors table we can have additional information about the author that can be used to populate an author page and have separate links on an article page and ensure we keep each author separate in the database.
authorships table can associate many articles to many authors. In the example below we have two authors
111 assigned to
The problem is that when the new tables were created, there wasn't a migration for the old data to the new tables. So some articles will have the author information stored in the
articles table and other articles will have a
BLANK value in the
articles table and instead have information stored in the
So you are now faced with three tables (
authors) that have unique information for the authorship of each article.
Why did the app developer not migrate the author data? The app requires logic to render the article page properly. It has to now check both tables for an author entry and use an IF statement to determine what table to use. Ideally the app developer would have migrated all old articles to use the new
author tables, so that all article data is stored in the same way in the database, instead of creating a divide in how the author data is stored.
This is called "technical debt" and you need to pay up.
IF statement to the rescue
If you were to solve this with an IF statement with a scripting language like Python, you could do the following:
# check to see if the article table has an author
if ( article_author != NULL ):
# get the author from the article table
author_byline = article_author
# get the author from the authorships table
author_byline = authorship_author
This is straighforward and may be the first thing you think of doing. IF/ELSE is what computers were designed for. However when you're doing analysis, you'll want to avoid having to run your analysis with an additional scripting language like Python etc. We can transform the data on the fly using the right SQL operators.
Creating an IF statement with SQL using
CASE is a great way to perform conditional logic in your SQL. You can throw it in your
WHERE statement. Here is an example.
WHEN author.first_name = 'Mary' AND author.last_name = 'Poppins'
THEN CONCAT(author.first_name, ' ', author.last_name, ' is practically perfect in every way')
ELSE CONCAT(author.first_name, ' ', author.last_name)
Here we check the author name.
WHEN author matches 'Mary Poppins'
THEN we create a new value on the fly by using CONCAT() to put together a string of text that appends ' is practically perfect in every way'. The end result being "Marry Poppins is practically perfect in every way".
Silly example, but you get the point.
Writing the initial query
I always like to write pseudo SQL in the beginning so I can annotate the information I need with any details that are helpful to remember. I also add a
LIMIT 100 (if using MySQL or Postgres) or
TOP 100 (if using MSSQL) to ensure I'm not burdening the database server. I do this as a habit to ensure I'm not needlessly grabbing 100,000's of rows for no reason. I just need to QA a small data set for now.
-- check for the author name from the articles table
-- check for the author name from the authorships/author table
, CONCAT(authors.first_name, ' ', authors.last_name)
This query is unusable. But it's a starting point to help me understand the data objectives I have.
You may notice that in order to get
authors.last_name you would need to
JOIN two tables
authorships. That could be done with the following
LEFT JOIN authorships ON articles.article_id = authorships.article_id
LEFT JOIN authors ON authors.author_id = authorships.author.id
I'd do a
LEFT JOIN here, as I know that only a subset of articles have author information assigned to them. A
LEFT JOIN ensures that I show all the article rows even if the
authorships table doesn't have any data connected to an article. If I left it as a simple
INNER JOIN then the only results I'd get back would be articles that do have a connected author via the
This all said though, we actually want to add this
JOIN inside of a subquery.
Writing an SQL Subquery
A subquery is a regular query that is put inside of parenthesis. You can add a subquery in a
SELECT statement, or in a
WHERE statement. I personally like to write my subqueries separately and test them before adding them as a subquery. Helps with QA.
Complete SQL query
WHEN article.author IS NULL
THEN (SELECT CONCAT(authors.first, ' ', authors.last)
JOIN authorships ON authorships.author_id = authors.author_id
WHERE authorships.article_id = a.article_id
END as author