1: Basic SQL
- ERD Fundamentals
- Types of Databases
- Types of Statements
- SELECT & FROM
- ORDER BY
- Derived Columns – Alias (AS)
- Logical Operators
- AND & BETWEEN
|Statement||How to Use It||Other Details|
|SELECT||SELECT Col1, Col2, …||Provide the columns you want|
|FROM||FROM Table||Provide the table where the columns exist|
|LIMIT||LIMIT 10||Limits based number of rows returned|
|ORDER BY||ORDER BY Col||Orders table based on the column. Used with DESC.|
|WHERE||WHERE Col > 5||A conditional statement to filter your results|
|LIKE||WHERE Col LIKE ‘%me%’||Only pulls rows where column has ‘me’ within the text|
|IN||WHERE Col IN (‘Y’, ‘N’)||A filter for only rows with column of ‘Y’ or ‘N’|
|NOT||WHERE Col NOT IN (‘Y’, ‘N’)||NOT is frequently used with LIKE and IN|
|AND||WHERE Col1 > 5 AND Col2 < 3||Filter rows where two or more conditions must be true|
|OR||WHERE Col1 > 5 OR Col2 < 3||Filter rows where at least one condition must be true|
|BETWEEN||WHERE Col BETWEEN 3 AND 5||Often easier syntax than using an AND|
2: SQL Joins
When creating a database, it is really important to think about how data will be stored. This is known as normalization, and it is a huge part of most SQL classes. If you are in charge of setting up a new database, it is important to have a thorough understanding of database normalization.
There are essentially three ideas that are aimed at database normalization:
- Are the tables storing logical groupings of the data?
- Can I make changes in a single location, rather than in many tables for the same information?
- Can I access and manipulate data quickly and efficiently?
- Inner Joins
- Alias in joins helps frequently just letters
- Left Join
- Right Join
- Full Outer Join
Primary and Foreign Keys
You learned a key element for JOINing tables in a database has to do with primary and foreign keys:
- primary keys – are unique for every row in a table. These are generally the first column in our database (like you saw with the id column for every table in the Parch & Posey database).
- foreign keys – are the primary key appearing in another table, which allows the rows to be non-unique.
Choosing the set up of data in our database is very important, but not usually the job of a data analyst. This process is known as Database Normalization.
In this lesson, you learned how to combine data from multiple tables using JOINs. The three JOIN statements you are most likely to use are:
- JOIN – an INNER JOIN that only pulls data that exists in both tables.
- LEFT JOIN – pulls all the data that exists in both tables, as well as all of the rows from the table in the FROM even if they do not exist in the JOIN statement.
- RIGHT JOIN – pulls all the data that exists in both tables, as well as all of the rows from the table in the JOIN even if they do not exist in the FROM statement.
There are a few more advanced JOINs that we did not cover here, and they are used in very specific use cases. UNION and UNION ALL, CROSS JOIN, and the tricky SELF JOIN. These are more advanced than this course will cover, but it is useful to be aware that they exist, as they are useful in special cases.
You learned that you can alias tables and columns using AS or not using it. This allows you to be more efficient in the number of characters you need to write, while at the same time you can assure that your column headings are informative of the data in your table.
3: SQL Aggregations
- NULL > is NULL
- COUNT (Count*)
- MIN MAX
- AVG (Median complex)
- GROUP BY
- DISTINCT is always used in SELECT statements, and it provides the unique rows for all columns written in the SELECT statement. Therefore, you only use DISTINCT once in any particular SELECT statement.
- Expert Tip – It’s worth noting that using DISTINCT, particularly in aggregations, can slow your queries down quite a bit.
- HAVING (where clause you cannot filter on aggregates)
- HAVING is the “clean” way to filter a query that has been aggregated, but this is also commonly done using a subquery. Essentially, any time you want to perform a WHERE on an element of your query that was created by an aggregate, you need to use HAVING instead.
- Dates in Databases are yyyy mm dd
- CASE > “If then else”
4: Subqueries & Temporary Tables
What exactly is a subquery?
A subquery is a query within a query.
- As a reminder, a query has both SELECT and FROM clauses to signify what you want to extract from a table and what table you’d like to pull data from. A query that includes subquery, as a result, has multiple SELECT and FROM clauses.
- The subquery that sits nested inside a larger query is called an INNER QUERY. This inner query can be fully executed on its own and often is run independently before when trying to troubleshoot bugs in your code.
When do you need to use a subquery?
You need to use a subquery when you have the need to manipulate an existing table to “pseudo-create” a table that is then used as a part of a larger query.
Subqueries vs. Joins
Output: Either a scalar (a single value) or rows that have met a condition.
Use Case: Calculate a scalar value to use in a later part of the query (e.g., average price as a filter).
Dependencies: Stand independently and be run as complete queries themselves.
Output: A joint view of multiple tables stitched together using a common “key”.
Use Case: Fully stitch tables together and have full flexibility on what to “select” and “filter from”.
Dependencies: Cannot stand independently.
Use Case for
- When a user wants to create a version of an existing table to be used in a larger query (e.g., aggregate daily prices to an average price table).
- It is advantageous for readability purposes.
Use Case for a Nested Subquery
- When a user wants to filter an output using a condition met from another table.
- This type of placement also has advantages for making the code easy to read.
Use Case for Inline Subquery
- It is very similar use case to ‘With’ subqueries. It creates a “pseudo table” that aggregates or manipulates an existing table to be used in a larger query.
- The disadvantage of the inline subquery is that it is not easy to read.
Use Case for Scalar subquery placement
- It selects only one column or expression and returns one row, used in the select clause of the main query
- It has the advantage of performance or if the data set is small
- If a scalar subquery does not find a match, it returns a NULL.
- If a scalar subquery finds multiple matches, it returns an ERROR.
Simple Subquery: The inner subquery is completely independent from the larger query.
Correlated Subquery: The inner subquery is dependent on the larger query.
Readability: How easy it is to determine what the code is doing.
Performance: How quickly the code runs.
Query Plan: What happens under the hood.
Before diving head first into building a subquery, consider the workflow below. Strong SQL users walk through the following before ever writing a line of code:
- Determine if a subquery is needed (or a join/aggregation function will suffice).
- If a subquery is needed, determine where you’ll need to place it.
- Run the subquery as an independent query first: is the output what you expect?
- Call it something! If you are working with With or Inline subquery, you’ll most certainly need to name it.
- Run the entire query — both the inner query and outer query.
Subquery Facts to Know:
- Commonly used as a filter/aggregation tool
- Commonly used to create a “temporary” view that can be queried off
- Commonly used to increase readability
- Can stand independently
This lesson was the first of the more advanced topics in writing SQL. Being able to break a problem down into the necessary tables and finding a solution using the resulting table is very useful in practice.
If you didn’t get the solutions to these queries on the first pass, don’t be afraid to come back another time and give them another try. Additionally, you might try coming up with some questions of your own to see if you can find the solution.
The remaining portions of this course may be key to certain analytics roles, but you have now covered all of the main SQL topics you are likely to use on a day to day basis.
SQL Data Cleaning
Normalization: Standardizing or “cleaning up a column” by transforming it in some way to make it ready for analyses. A few normalization techniques are below:
- Adjusting a column that includes multiple currencies to one common currency
- Adjusting the varied distribution of a column value by transforming it into a z-score
- Converting all price into a common metric (e.g., price per ounce)
The key steps to consider when going about your data cleaning task include the following:
- What data do you need?: Review what data you need to run an analysis and solve the problem at hand.
- What data do you have?: Take stock of not only the information you have in your dataset today but what data types those fields are. Do these align with your data needs?
- How will you clean your data?: Build a game plan of how you’ll convert the data you currently have to the data you need. What types of actions and data cleaning techniques will you have to apply? Do you have the skills you need to go from the current to future state?
- How will you analyze your data?: Now, it’s game time! How do you run an effective analysis? Build an approach for analysis, as well. And visualize your plan to solve the problem. Finally, remember to question “so what?” at then end of your results, which will help drive recommendations for your organization.
The following set of methods cover three types of data cleaning techniques: parsing information, returning where information lives, and changing the data type of the information.
- Left: Extracts a number of characters from a string starting from the left
- Right: Extracts a number of characters from a string starting from the right
- Substr: Extracts a substring from a string (starting at any position)
- Position: Returns the position of the first occurrence of a substring in a string
- Strpos: Returns the position of a substring within a string
- Concat: Adds two or more expressions together
- Cast: Converts a value of any type into a specific, different data type
- Coalesce: Returns the first non-null value in a list A handful of these functions, as you’ll quickly realize, are more commonly used than others.