August 10, 2017

SQL Intermediate: PostgreSQL, Subqueries, and more!

If you're in the early phases of learning SQL and have completed one or more introductory-level courses, you've probably learned most of the basic fundamentals and possibly even some high-level database concepts. As you prepare to embark on the next phase of learning SQL, it's important to not only understand SQL itself, but also the engine that makes it all possible: the database.

In most introductory-level courses, you'll typically use some sort of embedded database (e.g. SQLite).

Embedded databases are an excellent choice for beginners because they let you focus on learning SQL without the burden of allocating additional computing resources or spending time installing and configuring the software. However, as the requirements of your application and/or data begin to scale, you may need a more robust solution.

At a high level, there are two main types of databases: server-based, and file-based (also called embedded).

One of the most popular file-based databases is SQLite. It's great for small-scale applications, but as your needs become more complex it has many limitations. For example, it's based on one large binary file that's limited to 140 terabytes.

If you're thinking that this is larger than anything you'd ever need (at least any time soon), you're probably right. However, many filesystems limit the maximum size of a file to something far less than this, so developing a large-scale web application on top of a single database file would not be feasible.

Furthermore, there are security concerns, among other limitations. Server-based databases provide key functionality for larger and more robust applications, like high availability/failover, increased and more flexible security, scalability, increased performance, and more.

Today, we're going to perform data discovery and analysis on two data sets from the Consumer Financial Protection Bureau (CFPB) using PostgreSQL (postgres). The data consists of consumer complaints regarding two types of financial services: bank accounts and credit cards.

First, we'll set up your database by creating tables, creating users, and loading the data from CSV files. Next, we'll dive into more advanced SQL concepts while analyzing the data and discovering valuable insights! Some of the SQL concepts we'll be covering today are below.

  • Creating users, databases and tables,
  • The IS NULL/IS NOT NULL clause,
  • Views,
  • Subqueries,
  • Union/union all,
  • Intersect/except,
  • String concatenation, and
  • Casting data types.

If you need to brush up on your SQL fundamentals, we recommend our blog post SQL Basics: Working with Databases.

Our data

We’re going to be working with two different CSV files of consumer complaint data from the CFPB:

  • Bank_Account_or_Service_Complaints.csv
  • Credit_Card_Complaints.csv

The files were originally downloaded from the CFPB website; however, since the CFPB regularly updates its data, we have uploaded the versions used in this post, which you can download here so that you can work with the exact files we used. Both CSV files have identical fields. The table below lists each field and the data type we intend to use for it:

Field Name Data Type
complaint_id text
date_received date
product text
sub_product text
issue text
sub_issue text
consumer_complaint_narrative text
company_public_response text
company text
state text
zip_code text
tags text
consumer_consent_provided text
submitted_via text
date_sent date
company_response_to_consumer text
timely_response text
consumer_disputed text

You may be wondering why some of the numeric fields like zip_code and complaint_id are stored as text. The general rule of thumb is to store numeric characters as text unless you plan on doing some type of mathematical calculation on the values themselves. Here is an example of a field containing numbers that you'd want to store as a numeric type, and one that you wouldn't:

  • Store as numeric type – An age field that stores people's ages, where you might want to do things like find the average age of people living in a certain zip code.
  • Store as text – An ID field of some sort that you cannot sensibly do any mathematical functions on. Generally speaking, you can't add, multiply, or perform any other type of arithmetic on a zip code in a way that would make sense (see the sketch below).
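To make this concrete, here is a minimal sketch, assuming a hypothetical people table with a numeric age column and a text zip_code column:

SELECT avg(age)
FROM people
WHERE zip_code = '30301';  -- meaningful: averaging a numeric age field

-- avg(zip_code) would first require casting the text to a number, and the
-- resulting "average zip code" would be meaningless anyway.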

Downloading & Installing PostgreSQL

The first thing you'll need to do is set up PostgreSQL if you don't have it set up yet. Below are download links for the major operating systems.

Create New User, Database, and Tables

We need to create two new tables to store the data in the files that we downloaded earlier. You can achieve this in several different ways, but we're going to use the psql shell throughout this tutorial. To open the psql shell, open a command prompt (Windows) or terminal (Mac) window, type psql, and hit enter.

This will launch the psql shell. If this is your first time using the psql shell, your user will likely not have the necessary permissions to launch the utility. First, you will need to use an account with administrator privileges, and then use psql -U postgres to log in with the postgres user (which was created automatically during the install).

~$ psql
psql (9.5.3)
Type "help" for help.

postgres=#

The postgres=# that you see is the psql prompt, and postgres will be replaced with whichever database you are connected to. We’re going to create a new user for our database.

For this tutorial we’ll call our user oracle, but you can use anything, and it’s convenient to use the same username you use for your operating system, as you then don’t have to worry about passwords or specifying a user moving forward. To create a new user, we use the following syntax:

CREATE ROLE name WITH options_go_here;

Click here for the documentation regarding the CREATE ROLE statement, where you'll find all of the available options and their descriptions. We're going to give this new user the following permissions:

  • SUPERUSER: The user will be a superuser.
  • CREATEDB: The user will be allowed to create databases.
  • CREATEROLE: The user will be allowed to create new roles.
  • LOGIN: The user will be able to login.
  • INHERIT: The user will inherit all privileges of roles it's a member of.

Let's create the new user, oracle, with the password welcome1.

CREATE ROLE oracle password 'welcome1' SUPERUSER CREATEDB CREATEROLE INHERIT LOGIN;

Our new user should now be created. Let's check to make sure our new user has sufficient privileges. Type the command \du and hit enter. Your user should have the following permissions.

postgres=# \du
               List of roles
 Role name | Attributes  | Member of
-----------+-------------+-----------
 oracle    | Superuser   | {}
           : Create role
           : Create DB
 postgres  | Superuser   | {}
           : Create role
           : Create DB

This will allow you to connect using the oracle user from now on. For now, we can switch over to the oracle user using the command

SET ROLE oracle;

If we had quit the psql shell (using \q), we can connect directly when we launch the psql shell:

~$ psql -U oracle postgres
psql (9.5.3)
Type "help" for help.

postgres=#

The -U flag allows us to specify the user, and postgres specifies the database to connect to (if we don't specify a database here, we'll get an error). Now that we're logged in as the oracle user, we're going to create a new database. To create a new database, we use the following syntax:

CREATE DATABASE database_name;

The CREATE DATABASE statement has a lot more options that we don't need right now – if you like, you can read about them in the documentation. Now, let's create the new database, consumer_complaints.

CREATE database consumer_complaints;

To confirm that our database has been created, retrieve a list of the databases in the Postgres cluster by using the \l (list) command. You should see the new database, as shown below.

postgres=# \l
                                      List of databases
        Name         |  Owner   | Encoding |  Collation  |    Ctype    |   Access privileges
---------------------+----------+----------+-------------+-------------+-----------------------
 consumer_complaints | oracle   | UTF8     | en_US.UTF-8 | en_US.UTF-8 |

Let's connect to our new database by using the \connect command:

postgres=# \connect consumer_complaints
You are now connected to database "consumer_complaints" as user "oracle".
consumer_complaints=#

You can also launch psql directly connected to your database from the command line.

~$ psql -U oracle consumer_complaints
psql (9.5.3)
Type "help" for help.

consumer_complaints=#

Creating tables for the data

Next, we can create the tables to store our CSV data. The table names are as follows:

  • credit_card_complaints
  • bank_account_complaints

The fields and their respective data types will be the same in both tables. To create a new table, we use the following syntax:

CREATE TABLE name_of_table (
   name_of_column data_type [PRIMARY KEY],
   name_of_column data_type,
   name_of_column data_type,
   name_of_column data_type,
   name_of_column data_type);

Like the CREATE DATABASE statement, CREATE TABLE has many more options, which you can find in the documentation. To create our two tables, you'll need to run the following two statements in the psql shell.

CREATE TABLE bank_account_complaints (
complaint_id text PRIMARY KEY,
date_received date,
product text,
sub_product text,
issue text,
sub_issue text,
consumer_complaint_narrative text,
company_public_response text,
company text,
state text,
zip_code text,
tags text,
consumer_consent_provided text,
submitted_via text,
date_sent date,
company_response_to_consumer text,
timely_response text,
consumer_disputed text);
CREATE TABLE credit_card_complaints (
complaint_id text PRIMARY KEY,
date_received date,
product text,
sub_product text,
issue text,
sub_issue text,
consumer_complaint_narrative text,
company_public_response text,
company text,
state text,
zip_code text,
tags text,
consumer_consent_provided text,
submitted_via text,
date_sent date,
company_response_to_consumer text,
timely_response text,
consumer_disputed text);

You should get the output 'CREATE TABLE' from these statements to let you know they were created correctly.

Loading Data into the Database

There are several ways to load external data into your postgres tables. Today, we're going to load the data from our two CSV files into our two new tables using the Postgres COPY utility via the psql shell. COPY allows you to export data out of, or import data into, a Postgres database. It supports the following formats:

  • Binary.
  • Tab delimited.
  • Comma delimited.

To copy data from a file to a database table using the copy statement, we use the following syntax:

COPY name_of_table (name_of_column, name_of_column, name_of_column)
   FROM 'filename'
   WITH options_go_here;

We'll be using the WITH CSV HEADER option, but as before, the documentation has the full syntax and explanations. We'll presume that the CSV files are in the same folder from which you launched your psql session. We'll start by importing the data from the Credit_Card_Complaints.csv file into the credit_card_complaints postgres table.

consumer_complaints=# COPY credit_card_complaints (date_received, product, sub_product, issue, sub_issue, consumer_complaint_narrative, company_public_response, company, state, zip_code, tags, consumer_consent_provided, submitted_via, date_sent, company_response_to_consumer, timely_response, consumer_disputed, complaint_id)
   FROM './Credit_Card_Complaints.csv'
   WITH CSV HEADER;

Now let's do the same for the Bank_Account_or_Service_Complaints.csv file and the bank_account_complaints postgres table.

consumer_complaints=# COPY bank_account_complaints (date_received, product, sub_product, issue, sub_issue, consumer_complaint_narrative, company_public_response, company, state, zip_code, tags, consumer_consent_provided, submitted_via, date_sent, company_response_to_consumer, timely_response, consumer_disputed, complaint_id)
   FROM './Bank_Account_or_Service_Complaints.csv'
   WITH CSV HEADER;
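Note that COPY ... FROM reads files from the server's filesystem, which requires appropriate privileges (our oracle superuser has them). If you're ever connected to a remote server or working from a restricted account, psql's client-side \copy meta-command takes the same column list and options but reads the file from your local machine instead. A sketch, using the same column list as above on a single line:

consumer_complaints=# \copy credit_card_complaints (date_received, product, sub_product, issue, sub_issue, consumer_complaint_narrative, company_public_response, company, state, zip_code, tags, consumer_consent_provided, submitted_via, date_sent, company_response_to_consumer, timely_response, consumer_disputed, complaint_id) FROM './Credit_Card_Complaints.csv' WITH CSV HEADER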

Querying our Postgres database

We can run queries directly from our psql shell; however, it can be difficult to understand the results, especially with many rows and columns. In this tutorial, we will be querying the database using Python. When we want to return the results of a query, we will use the pandas module, which will allow us to store the results in a DataFrame object. This will make it easier to visualize and analyze the results. To do this, we first have to import the pandas, SQLAlchemy, and psycopg2 modules. If you don't have these, you can easily install them using pip:

~$ pip install pandas sqlalchemy psycopg2

Or conda (although all three come with the standard Anaconda package):

~$ conda install pandas sqlalchemy psycopg2

We will also need to create two helper functions – one to run our queries and one to run our commands (i.e. creating/dropping views, tables, sequences, etc.). Don’t worry if you don’t understand how these functions work, just know that they’ll help us interface with our database using Python and give us results that are easier to read.

import pandas as pd

# psycopg2 lets us easily run commands against our db
import psycopg2
conn = psycopg2.connect("dbname=consumer_complaints user=oracle")
conn.autocommit = True
cur = conn.cursor()

def run_command(command):
    cur.execute(command)
    return cur.statusmessage

# sqlalchemy is needed to allow pandas to seamlessly connect and run queries
from sqlalchemy import create_engine
engine = create_engine('postgresql://oracle@localhost/consumer_complaints')

def run_query(query):
    return pd.read_sql_query(query, con=engine)

As a test query to show how these functions work, we'll validate the data in our tables to ensure that all the data from the CSV files was loaded. A quick way to do this is to get the number of records for a specific table using the count function.

query = 'SELECT count(*) FROM credit_card_complaints;'
run_query(query)

   count
0  87718

query = 'SELECT count(*) FROM bank_account_complaints;'
run_query(query)

   count
0  84811

Counting with Conditionals

When comparing a column to null (no value), you cannot use the = operator. null denotes an undefined value. Since an undefined value cannot equal an undefined value, null cannot equal null.

  • Instead of = NULL, we use IS NULL.
  • Instead of <> NULL, we use IS NOT NULL.

If we look at the data in each file, we can see that several fields have a large number of null values. One of the biggest value propositions of this dataset is the consumer complaint narratives, which allow us to understand and analyze the sentiment of the consumer's complaint. Let's see how many records in each table have null values for the consumer complaint narrative field, starting with the credit_card_complaints table.

query = '''
SELECT count(*) FROM credit_card_complaints
WHERE consumer_complaint_narrative IS NOT NULL;
'''
run_query(query)

   count
0  17433

There are 17,433 records with a consumer complaint narrative in the credit_card_complaints table. Since there are 87,718 records in the table in total, there must be 70,285 (87,718 – 17,433) records that do not have a consumer complaint narrative (i.e. are null). Let's check this manually:

query = '''
SELECT count(*) FROM credit_card_complaints
WHERE consumer_complaint_narrative IS NULL;
'''
run_query(query)

   count
0  70285

Looks like our numbers line up! Next, we'll look at null values in the consumer_complaint_narrative column of the bank_account_complaints table.

query = '''
SELECT count(*) FROM bank_account_complaints
WHERE consumer_complaint_narrative IS NOT NULL;
'''
run_query(query)

   count
0  13860

There are 13,860 records with a consumer complaint narrative in the bank_account_complaints table. Since there are 84,811 records in the table in total, there must be 70,951 (84,811 – 13,860) records that do not have a consumer complaint narrative. Let's see:

query = '''
SELECT count(*) FROM bank_account_complaints
WHERE consumer_complaint_narrative IS NULL;
'''
run_query(query)

   count
0  70951

Again, the numbers line up.

Using views

A view is essentially a logical representation of a query's result. It behaves like a traditional table in that you can select values from it, but you cannot insert into, update, or delete from it. Here are a few benefits of using views:

  • Reusability – Views can be used to reuse complex queries that are frequently used.
  • Security – Views are often made accessible to certain users so that they cannot view the underlying tables and only relevant data is available.
  • Query Performance – Sometimes, writing complex queries consisting of several subqueries and aggregations will utilize a large amount of resources.

Considering the last point in more detail: even if we have adequate resources, a complex query could degrade the performance of other processes running on the machine, or it could simply take an extremely long time. We can use views to filter down the dataset ahead of time, tremendously reducing the complexity and overall execution time.

A temporary view (TEMPORARY or TEMP) is automatically dropped at the end of the current session. Depending on the use case, you may or may not want to use a temporary view. For example, temporary views can be extremely helpful while validating or testing certain data points that are created via several joins, tables, and calculations. Performing complex data validations often requires large and convoluted queries that are not only hard to read and understand, but also perform poorly. Large queries consisting of multiple subqueries/inline views can use a large amount of spool space and computing resources, resulting in poor query performance and degraded performance of the application. In these instances, it's usually a much better idea to create one or more views.

In the previous section, we determined that a large number of records had null values for the consumer complaint narrative field. Instead of having to filter on this field later on, we'll create views now for later use. We'll create four views based on two dimensions: the product (e.g. credit card vs. bank account), and whether the records contain null values for the consumer complaint narrative. Our four views will be called:

  • credit_card_w_complaints
  • credit_card_wo_complaints
  • bank_account_w_complaints
  • bank_account_wo_complaints

The syntax for creating a view is extremely simple:

CREATE VIEW view_name AS
   [query_to_generate_view_goes_here];

Let’s create our four views.

command = '''
CREATE VIEW credit_card_w_complaints AS
    SELECT * FROM credit_card_complaints
    WHERE consumer_complaint_narrative IS NOT NULL;
'''
run_command(command)

'CREATE VIEW'

command = '''
CREATE VIEW credit_card_wo_complaints AS
    SELECT * FROM credit_card_complaints
    WHERE consumer_complaint_narrative IS NULL;
'''
run_command(command)

'CREATE VIEW'

command = '''
CREATE VIEW bank_account_w_complaints AS
    SELECT * FROM bank_account_complaints
    WHERE consumer_complaint_narrative IS NOT NULL;
'''
run_command(command)

'CREATE VIEW'

command = '''
CREATE VIEW bank_account_wo_complaints AS
    SELECT * FROM bank_account_complaints
    WHERE consumer_complaint_narrative IS NULL;
'''
run_command(command)

'CREATE VIEW'

Let’s take a quick look at one of our newly-created views so we can see that it works just like a table.

query = '''
SELECT * FROM credit_card_w_complaints LIMIT 5;
'''
run_query(query)
complaint_id date_received product sub_product issue sub_issue consumer_complaint_narrative company_public_response company state zip_code tags consumer_consent_provided submitted_via date_sent company_response_to_consumer timely_response consumer_disputed
0 1297939 2015-03-24 Credit card None Other None Received Capital One charge card offer XXXX. A… None Capital One OH 440XX None Consent provided Web 2015-03-30 Closed with explanation Yes No
1 1296693 2015-03-23 Credit card None Rewards None I ‘m a longtime member of Charter One Bank/RBS… None Citizens Financial Group, Inc. MI 482XX None Consent provided Web 2015-03-23 Closed with explanation Yes Yes
2 1295056 2015-03-23 Credit card None Other None I attempted to apply for a Discover Card Onlin… None Discover MD 217XX None Consent provided Web 2015-03-23 Closed with non-monetary relief Yes No
3 1296880 2015-03-23 Credit card None Late fee None XXXX Card services was bought out by Capital O… None Capital One MI 488XX None Consent provided Web 2015-03-23 Closed with monetary relief Yes No
4 1296890 2015-03-23 Credit card None Billing disputes None I was reported late by Discover Card to the re… None Discover OK 741XX Servicemember Consent provided Web 2015-03-23 Closed with explanation Yes No

Union/Union All

The UNION operator combines the result-sets of two or more queries. Think of two Lego bricks with the exact same dimensions: because they share the same dimensions, they stack on top of each other perfectly. Unions behave in a similar fashion. If the select statements have the same number of columns, with matching data types in the same order, then they will "stack" on top of each other, just like the Lego bricks. There are two variations of union, demonstrated below:

  • UNION selects only distinct (unique) values.
  • UNION ALL selects all values (including duplicates).
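As a quick demonstration using the run_query helper from earlier:

query = 'SELECT 1 AS n UNION SELECT 1;'
run_query(query)       # one row: UNION removes the duplicate

query = 'SELECT 1 AS n UNION ALL SELECT 1;'
run_query(query)       # two rows: UNION ALL keeps it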

If you recall from the last section, we have four views. What if we wanted to view the two products’ data together, and only differentiate by whether their records contain null values for consumer complaint narrative? Since all four views have the same number of columns and data types, this makes them a perfect candidate for a union. We’ll create two new views that will essentially consolidate the four current views down to two by performing a union on each products’ views.

command = '''
CREATE VIEW with_complaints AS
    SELECT * FROM credit_card_w_complaints
    UNION ALL
    SELECT * FROM bank_account_w_complaints;
'''
run_command(command)

'CREATE VIEW'

command = '''
CREATE VIEW without_complaints AS
    SELECT * FROM credit_card_wo_complaints
    UNION ALL
    SELECT * FROM bank_account_wo_complaints;
'''
run_command(command)

'CREATE VIEW'
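As a quick sanity check, each new view's row count should equal the sum of its parts: 17,433 + 13,860 = 31,293 for with_complaints, and 70,285 + 70,951 = 141,236 for without_complaints.

query = 'SELECT count(*) FROM with_complaints;'
run_query(query)       # expect 31293

query = 'SELECT count(*) FROM without_complaints;'
run_query(query)       # expect 141236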

Intersect/Except

The INTERSECT operator returns all records that are in the result-sets of both queries, while the EXCEPT operator returns all records that are in the result-set of the first query but not the second. The syntax for INTERSECT/EXCEPT is simple; we just use the operator to connect two select statements:

(SELECT * FROM name_of_table_one
INTERSECT [or EXCEPT]
SELECT * FROM name_of_table_two);

In the previous section, we consolidated our four views into two new views by performing a union. If we want to validate that our new views were created correctly, we can do so using both the INTERSECT and EXCEPT operators. First, let's retrieve the record count for the credit_card_wo_complaints view.

query = 'SELECT count(*) FROM credit_card_wo_complaints;'
run_query(query)

   count
0  70285

If our logic was correct in our union, we should be able to intersect the without_complaints view with the credit_card_wo_complaints view and return the same number of records.

query = '''
SELECT count(*) FROM (SELECT * FROM without_complaints
      INTERSECT
      SELECT * FROM credit_card_wo_complaints) ppg;
'''
run_query(query)

   count
0  70285

Conversely, if we replace the INTERSECT operator with an EXCEPT operator, we should get the difference in record counts between the without_complaints view and the 70,285 record count that was returned previously. The difference should be 70,951 (141,236 – 70,285).

query = '''
SELECT count(*) FROM (SELECT * FROM without_complaints
      EXCEPT
      SELECT * FROM credit_card_wo_complaints) ppg;
'''
run_query(query)

   count
0  70951

String Concatenation

String concatenation is simply combining two or more strings (text values) into a single string. This is an extremely useful tool in several instances. Let's say we had a "year" field and a "month" field, but wanted to show them as one field (e.g. "2017-March") instead. We can achieve this by concatenating the values in both fields using the || operator. The syntax is as follows:

SELECT <string_1> || <string_2> FROM name_of_table;

Let's try it out by selecting complaint_id, product, company, and zip_code, and concatenating all of those fields, separated by a hyphen (-), from credit_card_complaints.

query = '''
SELECT complaint_id, product, company, zip_code,
       complaint_id || '-' || product || '-' || company || '-' || zip_code AS concat
FROM credit_card_complaints
LIMIT 10;
'''
run_query(query)
complaint_id product company zip_code concat
0 469026 Credit card Citibank 45247 469026-Credit card-Citibank-45247
1 469131 Credit card Synchrony Financial 98548 469131-Credit card-Synchrony Financial-98548
2 479990 Credit card Amex 78232 479990-Credit card-Amex-78232
3 475777 Credit card Capital One 32226 475777-Credit card-Capital One-32226
4 469473 Credit card Citibank 53066 469473-Credit card-Citibank-53066
5 470828 Credit card Wells Fargo & Company 89108 470828-Credit card-Wells Fargo & Company-89108
6 470852 Credit card Citibank 78249 470852-Credit card-Citibank-78249
7 479338 Credit card JPMorgan Chase & Co. 19809 479338-Credit card-JPMorgan Chase & Co.-19809
8 480935 Credit card Citibank 07018 480935-Credit card-Citibank-07018
9 469738 Credit card Wells Fargo & Company 95409 469738-Credit card-Wells Fargo & Company-95409
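One caveat worth knowing: if any operand of || is null, the whole concatenated value becomes null. Since many fields in this dataset contain nulls, you can wrap nullable columns in coalesce() to substitute a placeholder. A quick sketch:

query = '''
SELECT complaint_id || '-' || coalesce(sub_product, 'unknown') AS concat
FROM credit_card_complaints
LIMIT 5;
'''
run_query(query)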

Subqueries

A subquery (technically known as an inline view, and also referred to as a subselect or derived table) is a select statement in the FROM clause of another select statement. Subqueries are extremely useful in many different scenarios; we actually saw one earlier when we used COUNT(*) in our INTERSECT and EXCEPT example. Subqueries help to simplify otherwise complex queries and allow for more flexibility, and as their proper name (inline view) suggests, they are just like creating a mini view within a single query. The syntax is:

SELECT [column_names_here]
   FROM ( [subquery_goes_here] );

The idea of subqueries can be a little difficult at first, so we’re going to show you a subquery in action and then break it down.

query = '''
SELECT ccd.complaint_id, ccd.product, ccd.company, ccd.zip_code
FROM (SELECT complaint_id, product, company, zip_code
      FROM credit_card_complaints
      WHERE zip_code = '91701') ccd
LIMIT 10;
'''
run_query(query)
complaint_id product company zip_code
0 24857 Credit card Barclays PLC 91701
1 33157 Credit card Citibank 91701
2 12245 Credit card Bank of America 91701
3 3151 Credit card Barclays PLC 91701
4 352534 Credit card Citibank 91701
5 1963836 Credit card JPMorgan Chase & Co. 91701
6 2178015 Credit card Discover 91701
7 2234754 Credit card Discover 91701
8 2235915 Credit card Discover 91701

The most important thing to realize with subqueries is that you need to read them from the inside out.

The innermost query is executed first, filtering the result set down to a specific zip code. Then the outer query executes its select statement on the result-set derived from the innermost query. You'll notice that we've given our subquery a name, ccd, which we then use to refer to the columns in the outer SELECT statement. PostgreSQL requires this, but beyond that, it makes the query easier to read and ensures the SQL engine understands exactly what you intend (which can avoid errors or unintended results). If you're thinking that the query above could have been simplified and that you could have achieved the same result without a subquery, you're correct: we didn't actually need an inline view here. It was merely a simple example to demonstrate how inline views are structured. A more efficient and simpler way to structure the query is as follows.

query = '''
SELECT complaint_id, product, company, zip_code
FROM credit_card_complaints
WHERE zip_code = '91701'
LIMIT 10;
'''
run_query(query)
complaint_id product company zip_code
0 24857 Credit card Barclays PLC 91701
1 33157 Credit card Citibank 91701
2 12245 Credit card Bank of America 91701
3 3151 Credit card Barclays PLC 91701
4 352534 Credit card Citibank 91701
5 1963836 Credit card JPMorgan Chase & Co. 91701
6 2178015 Credit card Discover 91701
7 2234754 Credit card Discover 91701
8 2235915 Credit card Discover 91701

So why are inline views/subqueries useful? Let’s look at a more complex example.

Subqueries in action

If we were a financial services company, one of the insights we might want to derive from this data is where the highest number of complaints are coming from. Moreover, it might also be useful to see the same metrics for our competitors. To achieve this, we'll need to return a distinct list of all companies, and then find the zip code associated with the highest number of complaints in each state. So for each company, there should be a distinct list of states, along with the zip code associated with the highest number of complaints in that state. Let's start by querying the company name, state, and zip code, along with the number of complaints. Let's also go ahead and filter out the records that have a null state name. For now, let's start with one company, Citibank.

query = '''
SELECT company, state, zip_code, count(complaint_id) AS complaint_count
FROM credit_card_complaints
WHERE company = 'Citibank' AND state IS NOT NULL
GROUP BY company, state, zip_code
ORDER BY 4 DESC
LIMIT 10;
'''
run_query(query)
company state zip_code complaint_count
0 Citibank NY 100XX 80
1 Citibank TX 750XX 67
2 Citibank CA 945XX 58
3 Citibank NY 112XX 52
4 Citibank CA 900XX 50
5 Citibank GA 300XX 46
6 Citibank IL 606XX 44
7 Citibank NJ 070XX 44
8 Citibank FL 331XX 39
9 Citibank CA 926XX 38

Now that we have a list of states, zip codes, and complaint counts by company, sorted by the number of complaints, we can take this a step further. Next, using the query above as a subquery, we can return the record with the highest number of complaints for each state. We're still going to use Citibank for now.

Our first query has become the subquery (aliased ppt below), and is executed first. The outer query is then run on the results of our subquery. Let's take a look at the results.

query = '''
SELECT ppt.company, ppt.state, max(ppt.complaint_count) AS complaint_count
FROM (SELECT company, state, zip_code, count(complaint_id) AS complaint_count
      FROM credit_card_complaints
      WHERE company = 'Citibank'
        AND state IS NOT NULL
      GROUP BY company, state, zip_code
      ORDER BY 4 DESC) ppt
GROUP BY ppt.company, ppt.state
ORDER BY 3 DESC
LIMIT 10;
'''
run_query(query)
company state complaint_count
0 Citibank NY 80
1 Citibank TX 67
2 Citibank CA 58
3 Citibank GA 46
4 Citibank IL 44
5 Citibank NJ 44
6 Citibank FL 39
7 Citibank VA 32
8 Citibank MD 32
9 Citibank DC 31

Next, let’s continue to expand on our query and return the state and zip code with the highest number of complaints.

This is a complex query, so let’s look at it one subquery at a time.

  • We start with our first subquery (aliased ens below), which counts the number of complaints grouped by company, state, and zip_code.
  • Next, we join those results to a subquery that contains two more subqueries. Remembering that we need to read subqueries from the inside out, let's look at these one at a time:
    • The innermost subquery (aliased ppt) we have seen before: it's the first query from this section, which counts complaints for Citibank grouped by state and zip_code.
    • The next subquery out (aliased ppx) we've also seen before: it's the outer query from our last example, minus the LIMIT statement.
    • The outermost subquery (aliased apx) finds just the maximum complaint count across all states/zip codes.
query = '''
SELECT ens.company, ens.state, ens.zip_code, ens.complaint_count
FROM (SELECT company, state, zip_code, count(complaint_id) AS complaint_count
      FROM credit_card_complaints
      WHERE state IS NOT NULL
      GROUP BY company, state, zip_code) ens
INNER JOIN
   (SELECT ppx.company, max(ppx.complaint_count) AS complaint_count
    FROM (SELECT ppt.company, ppt.state, max(ppt.complaint_count) AS complaint_count
          FROM (SELECT company, state, zip_code, count(complaint_id) AS complaint_count
                FROM credit_card_complaints
                WHERE company = 'Citibank'
                  AND state IS NOT NULL
                GROUP BY company, state, zip_code
                ORDER BY 4 DESC) ppt
          GROUP BY ppt.company, ppt.state
          ORDER BY 3 DESC) ppx
    GROUP BY ppx.company) apx
ON apx.company = ens.company
 AND apx.complaint_count = ens.complaint_count
ORDER BY 4 DESC;
'''
run_query(query)
company state zip_code complaint_count
0 Citibank NY 100XX 80

Finally, we can remove the company filter from the innermost subquery so that we can see the state and zip code where each company receives the most complaints, and how many complaints they've received.

query = '''
SELECT ens.company, ens.state, ens.zip_code, ens.complaint_count
FROM (SELECT company, state, zip_code, count(complaint_id) AS complaint_count
      FROM credit_card_complaints
      WHERE state IS NOT NULL
      GROUP BY company, state, zip_code) ens
INNER JOIN
   (SELECT ppx.company, max(ppx.complaint_count) AS complaint_count
    FROM (SELECT ppt.company, ppt.state, max(ppt.complaint_count) AS complaint_count
          FROM (SELECT company, state, zip_code, count(complaint_id) AS complaint_count
                FROM credit_card_complaints
                WHERE state IS NOT NULL
                GROUP BY company, state, zip_code
                ORDER BY 4 DESC) ppt
          GROUP BY ppt.company, ppt.state
          ORDER BY 3 DESC) ppx
    GROUP BY ppx.company) apx
ON apx.company = ens.company
 AND apx.complaint_count = ens.complaint_count
ORDER BY 4 DESC;
'''
run_query(query)
company state zip_code complaint_count
0 Citibank NY 100XX 80
1 JPMorgan Chase & Co. NY 100XX 44
2 Amex NY 100XX 43
3 Bank of America NY 10024 34
4 Capital One NY 112XX 30
5 Capital One GA 300XX 30
6 Synchrony Financial NY 112XX 29
7 Barclays PLC FL 337XX 28
8 Discover MN 551XX 19
9 U.S. Bancorp DC 200XX 19

Now we can see which companies receive the most complaints, and where the most complaints are generated from!
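As an aside, PostgreSQL also offers a more concise way to express "the top zip code for each company": the DISTINCT ON clause, which keeps the first row per group according to the ORDER BY. Here is a sketch that should return equivalent rows (ordered by company rather than by complaint count), with one caveat: unlike the join approach, DISTINCT ON returns only one row per company even when two zip codes tie for the highest count.

query = '''
SELECT DISTINCT ON (company)
       company, state, zip_code, count(complaint_id) AS complaint_count
FROM credit_card_complaints
WHERE state IS NOT NULL
GROUP BY company, state, zip_code
ORDER BY company, complaint_count DESC;
'''
run_query(query)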

Casting Data Types

Sometimes you'll need to change a field's data type on the fly. For example, if the field Age has a text data type but you need it to be a numeric data type so that you can perform calculations on the data, you can simply use the CAST() function to temporarily accommodate the calculation you're trying to perform. To cast a field to a new data type, we use the following syntax:

CAST(source_column AS type);

Each data type may or may not have additional required parameters, so please check the documentation for your current version of PostgreSQL to be sure. When we created our tables earlier, we declared the complaint_id field as our primary key, but we declared its data type as text. Let's assume we want to create a view by querying this table, but want to add an index on complaint_id for increased performance. Indexing on numeric data types is faster than on string (text) data types, simply because they take up less space. So let's first run a simple query on the bank_account_complaints table to return complaint_id as a float data type.

query = '''
SELECT CAST(complaint_id AS float) AS complaint_id
FROM bank_account_complaints LIMIT 10;
'''
run_query(query)
complaint_id
0 468889.0
1 468879.0
2 468949.0
3 468981.0
4 469185.0
5 475273.0
6 469309.0
7 469414.0
8 469446.0
9 469447.0
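Incidentally, Postgres also supports the :: shorthand operator for casting, which is equivalent to CAST(... AS ...):

query = '''
SELECT complaint_id::float AS complaint_id
FROM bank_account_complaints LIMIT 10;
'''
run_query(query)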

Next, let’s test our query that we’re going to use to define our new view. Our query should pull all of the fields’ values for Wells Fargo (company = ‘Wells Fargo & Company’) in the state of California (state = ‘CA’) where the consumer did not dispute the company’s response (consumer_disputed = ‘No’).

query = '''
SELECT CAST(complaint_id AS int) AS complaint_id,
       date_received, product, sub_product, issue, company,
       state, zip_code, submitted_via, date_sent, company_response_to_consumer,
       timely_response, consumer_disputed
FROM bank_account_complaints
WHERE state = 'CA'
  AND consumer_disputed = 'No'
  AND company = 'Wells Fargo & Company'
LIMIT 5;
'''
run_query(query)
complaint_id date_received product sub_product issue company state zip_code submitted_via date_sent company_response_to_consumer timely_response consumer_disputed
0 468889 2013-07-29 Bank account or service Checking account Using a debit or ATM card Wells Fargo & Company CA 95992 Web 2013-07-31 Closed with explanation Yes No
1 469185 2013-07-29 Bank account or service Other bank product/service Deposits and withdrawals Wells Fargo & Company CA 96088 Web 2013-08-01 Closed with explanation Yes No
2 469446 2013-07-29 Bank account or service Checking account Deposits and withdrawals Wells Fargo & Company CA 92277 Web 2013-08-05 Closed with explanation Yes No
3 471075 2013-07-31 Bank account or service Savings account Account opening, closing, or management Wells Fargo & Company CA 91214 Web 2013-08-02 Closed with explanation Yes No
4 463525 2013-07-22 Bank account or service Other bank product/service Account opening, closing, or management Wells Fargo & Company CA 95376 Phone 2013-07-30 Closed with explanation Yes No

Now let's create our new view, named wells_complaints_v.

command = '''
CREATE VIEW wells_complaints_v AS (
    SELECT CAST(complaint_id AS int) AS complaint_id,
           date_received, product, sub_product, issue, company,
           state, zip_code, submitted_via, date_sent,
           company_response_to_consumer,
           timely_response, consumer_disputed
    FROM bank_account_complaints
    WHERE state = 'CA'
      AND consumer_disputed = 'No'
      AND company = 'Wells Fargo & Company')
'''
run_command(command)

'CREATE VIEW'

SQL Challenges

Learning the fundamentals of SQL is relatively straightforward. However, learning the syntax and memorizing all the concepts that are typically taught in SQL tutorials or database classes is just the tip of the iceberg. Your ability to apply the concepts you've learned so far, and to translate a set of requirements by leveraging your critical thinking and problem solving skills, is the most important piece of the puzzle. With that said, the following sections consist of different scenarios, each with a short list of requirements. All scenarios use the same database tables/views that were created during this tutorial. The solutions to all scenarios are posted at the end of this section.

Challenge 1

  • Return a unique list of companies (company), along with the total number of records associated with each distinct company name (company_amt) from the credit_card_complaints table.
  • Sort by the company_amt field, from greatest to least.
  • Output Sample:
company company_amt
Citibank 16561
Capital One 12740
JPMorgan Chase & Co. 10203
Bank of America 8995
Synchrony Financial 8637

Challenge 2

  • Build on the query created in Challenge 1, adding a third column (total) that reflects the total number of records in the credit_card_complaints table. Each row should contain the same number.
  • Output Sample:
company company_amt total
Citibank 16561 87718
Capital One 12740 87718
JPMorgan Chase & Co. 10203 87718
Bank of America 8995 87718
Synchrony Financial 8637 87718

Challenge 3

  • Build on the query created in Challenge 2, adding a fourth column (percent) that reflects the percentage of records associated with each company in the credit_card_complaints table.
  • Output Sample:
company company_amt total percent
Citibank 16561 87718 18.879819
Capital One 12740 87718 14.523815
JPMorgan Chase & Co. 10203 87718 11.631592
Bank of America 8995 87718 10.254452
Synchrony Financial 8637 87718 9.846326

Solutions – SQL Challenges

Here are the solutions to each challenge.

Solution 1

SELECT company, COUNT(company) AS company_amt
FROM credit_card_complaints
GROUP BY company
ORDER BY 2 DESC;

Solution 2

SELECT company, COUNT(company) AS company_amt,
   (SELECT COUNT(*) FROM credit_card_complaints) AS total
FROM credit_card_complaints
GROUP BY company
ORDER BY 2 DESC;

Solution 3

SELECT ppg.company, ppg.company_amt, ppg.total,
       ((CAST(ppg.company_amt AS double precision) / CAST(ppg.total AS double precision)) * 100) AS percent
FROM (SELECT company, COUNT(company) AS company_amt, (SELECT COUNT(*) FROM credit_card_complaints) AS total
      FROM credit_card_complaints
      GROUP BY company
      ORDER BY 2 DESC) ppg;
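For comparison, here's a sketch of the same calculation using a window function instead of a subquery. SUM(COUNT(company)) OVER () totals the per-company counts across all result rows, which matches COUNT(*) assuming company is never null:

SELECT company, COUNT(company) AS company_amt,
       SUM(COUNT(company)) OVER () AS total,
       COUNT(company) * 100.0 / SUM(COUNT(company)) OVER () AS percent
FROM credit_card_complaints
GROUP BY company
ORDER BY 2 DESC;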

Next Steps

In this tutorial we’ve analyzed financial services consumer complaints data while learning the intermediate concepts of SQL, including:

  • Creating users, databases and tables,
  • The IS NULL/IS NOT NULL clause,
  • Views,
  • Subqueries,
  • Union/union all,
  • Intersect/except,
  • String concatenation, and
  • Casting data types.

Although we covered each concept in great detail and even tested our knowledge with the SQL scenario problems, it's important that you practice each concept on your own. You can do this by coming up with your own scenarios using the data and then writing queries to return that data. If you'd like to dive into SQL in more detail, you can sign up for free and check out our interactive course SQL & Databases: Intermediate, which covers much of the content taught in this tutorial and more.


About the author

Eric Sunby

Business intelligence/data warehousing/big data consultant. An avid fisherman living in Atlanta with a passion for data.
