Analysing Sales Data: SQL and Data Visualization
The majority of companies keep their important data stored in a database.
That said, how often do these companies take the time to analyse such data for meaningful insights?
Interrogating a database with SQL and then visualizing the results can greatly enhance understanding of the data.
Let’s take an example.
Revenue Analysis with SQL
Suppose a retailer has the following hypothetical table stored in a database. Each row records the product sold (identified by a product ID), the category to which the product belongs, the date on which it was sold, and the revenue gained from the sale.
>>> select * from products;
date | product | category | revenue
---------------------+---------+-----------+---------
2023-01-01 00:00:00 | B1 | Clothes | 88.91
2023-01-02 00:00:00 | A1 | Books | 155.88
2023-01-03 00:00:00 | C1 | Computers | 2109.58
2023-01-04 00:00:00 | C2 | Computers | 957.41
2023-01-05 00:00:00 | D1 | Games | 425.17
2023-01-06 00:00:00 | C3 | Computers | 869.65
2023-01-07 00:00:00 | A2 | Books | 180.90
2023-01-08 00:00:00 | C4 | Computers | 273.31
2023-01-09 00:00:00 | B2 | Clothes | 426.63
2023-01-10 00:00:00 | D2 | Games | 482.78
2023-01-11 00:00:00 | B3 | Clothes | 69.51
2023-01-12 00:00:00 | D3 | Games | 176.90
2023-01-13 00:00:00 | C5 | Computers | 408.94
2023-01-14 00:00:00 | D4 | Games | 92.12
2023-01-15 00:00:00 | D5 | Games | 289.52
2023-01-16 00:00:00 | A3 | Books | 176.73
2023-01-17 00:00:00 | A4 | Books | 143.18
2023-01-18 00:00:00 | D6 | Games | 436.34
2023-01-19 00:00:00 | E3 | Health | 350.77
2023-01-20 00:00:00 | B4 | Clothes | 133.26
2023-01-21 00:00:00 | D7 | Games | 348.53
2023-01-22 00:00:00 | D8 | Games | 338.95
2023-01-23 00:00:00 | E4 | Health | 349.63
2023-01-24 00:00:00 | E5 | Health | 72.66
2023-01-25 00:00:00 | C6 | Computers | 959.80
2023-01-26 00:00:00 | E6 | Health | 63.45
2023-01-27 00:00:00 | E7 | Health | 436.94
2023-01-28 00:00:00 | C7 | Computers | 1320.21
2023-01-29 00:00:00 | E1 | Health | 65.72
2023-01-30 00:00:00 | A5 | Books | 144.29
2023-01-31 00:00:00 | E2 | Health | 277.49
(31 rows)
How could the retailer use SQL queries to analyse the above data? Let’s take some examples.
1. Calculating revenue share by category with a subquery
To assess the contribution of each category to revenue, we can calculate a percentage share using a subquery as follows:
>>> select category, cast((group_values/(select sum(revenue) from products))*100 as decimal(4,2)) as group_values
from (select category, sum(revenue) as group_values from products group by category order by category) as subquery;
category | group_values
-----------+--------------
Books | 6.34
Clothes | 5.69
Computers | 54.64
Games | 20.52
Health | 12.81
(5 rows)
In this example, computers account for the majority of revenue, with over 54% of total revenue coming from that category.
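For what it's worth, the same percentage share can also be computed without a subquery, by dividing each category total by the grand total obtained with a window function over the grouped rows. The sketch below has not been run against the table above, and the revenue_share alias is just an illustrative name:
>>> select category,
cast(sum(revenue)*100/sum(sum(revenue)) over () as decimal(4,2)) as revenue_share
from products group by category order by category;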
2. Calculating a moving average of revenue by category
Suppose that a retailer wants to get a better idea of how average revenue by category varies over time. They wish to smooth out this data to account for the fact that products in the same category can have significantly different prices.
In SQL, a four-period moving average can be calculated with a window function, with PARTITION BY ensuring that the average for each category is computed separately.
>>> select date, category, revenue, avg(revenue) OVER
(PARTITION BY category ORDER BY date ROWS BETWEEN 3 PRECEDING AND CURRENT ROW)
FROM products order by category, date;
date | category | revenue | avg
---------------------+-----------+---------+-----------------------
2023-01-02 00:00:00 | Books | 155.88 | 155.8800000000000000
2023-01-07 00:00:00 | Books | 180.90 | 168.3900000000000000
2023-01-16 00:00:00 | Books | 176.73 | 171.1700000000000000
2023-01-17 00:00:00 | Books | 143.18 | 164.1725000000000000
2023-01-30 00:00:00 | Books | 144.29 | 161.2750000000000000
2023-01-01 00:00:00 | Clothes | 88.91 | 88.9100000000000000
2023-01-09 00:00:00 | Clothes | 426.63 | 257.7700000000000000
2023-01-11 00:00:00 | Clothes | 69.51 | 195.0166666666666667
2023-01-20 00:00:00 | Clothes | 133.26 | 179.5775000000000000
2023-01-03 00:00:00 | Computers | 2109.58 | 2109.5800000000000000
2023-01-04 00:00:00 | Computers | 957.41 | 1533.4950000000000000
2023-01-06 00:00:00 | Computers | 869.65 | 1312.2133333333333333
2023-01-08 00:00:00 | Computers | 273.31 | 1052.4875000000000000
2023-01-13 00:00:00 | Computers | 408.94 | 627.3275000000000000
2023-01-25 00:00:00 | Computers | 959.80 | 627.9250000000000000
2023-01-28 00:00:00 | Computers | 1320.21 | 740.5650000000000000
2023-01-05 00:00:00 | Games | 425.17 | 425.1700000000000000
2023-01-10 00:00:00 | Games | 482.78 | 453.9750000000000000
2023-01-12 00:00:00 | Games | 176.90 | 361.6166666666666667
2023-01-14 00:00:00 | Games | 92.12 | 294.2425000000000000
2023-01-15 00:00:00 | Games | 289.52 | 260.3300000000000000
2023-01-18 00:00:00 | Games | 436.34 | 248.7200000000000000
2023-01-21 00:00:00 | Games | 348.53 | 291.6275000000000000
2023-01-22 00:00:00 | Games | 338.95 | 353.3350000000000000
2023-01-19 00:00:00 | Health | 350.77 | 350.7700000000000000
2023-01-23 00:00:00 | Health | 349.63 | 350.2000000000000000
2023-01-24 00:00:00 | Health | 72.66 | 257.6866666666666667
2023-01-26 00:00:00 | Health | 63.45 | 209.1275000000000000
2023-01-27 00:00:00 | Health | 436.94 | 230.6700000000000000
2023-01-29 00:00:00 | Health | 65.72 | 159.6925000000000000
2023-01-31 00:00:00 | Health | 277.49 | 210.9000000000000000
(31 rows)
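The long trailing decimals above are simply PostgreSQL's default numeric precision for averages. If tidier output is preferred, the window average can be cast, much like the earlier query. This is a sketch that has not been run against the table, and moving_avg is just an illustrative alias:
>>> select date, category, revenue, cast(avg(revenue) OVER
(PARTITION BY category ORDER BY date ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) as decimal(10,2)) as moving_avg
FROM products order by category, date;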
3. Ranking products by revenue
Suppose a retailer wants to rank products in each category by revenue. This can be done with the DENSE_RANK window function, which assigns consecutive ranks in descending order of revenue and gives identical values the same rank.
>>> select category, product, revenue,
dense_rank() over (partition by category order by revenue desc) as revenueorder
from products order by category, revenueorder;
category | product | revenue | revenueorder
-----------+---------+---------+--------------
Books | A2 | 180.90 | 1
Books | A3 | 176.73 | 2
Books | A1 | 155.88 | 3
Books | A5 | 144.29 | 4
Books | A4 | 143.18 | 5
Clothes | B2 | 426.63 | 1
Clothes | B4 | 133.26 | 2
Clothes | B1 | 88.91 | 3
Clothes | B3 | 69.51 | 4
Computers | C1 | 2109.58 | 1
Computers | C7 | 1320.21 | 2
Computers | C6 | 959.80 | 3
Computers | C2 | 957.41 | 4
Computers | C3 | 869.65 | 5
Computers | C5 | 408.94 | 6
Computers | C4 | 273.31 | 7
Games | D2 | 482.78 | 1
Games | D6 | 436.34 | 2
Games | D1 | 425.17 | 3
Games | D7 | 348.53 | 4
Games | D8 | 338.95 | 5
Games | D5 | 289.52 | 6
Games | D3 | 176.90 | 7
Games | D4 | 92.12 | 8
Health | E7 | 436.94 | 1
Health | E3 | 350.77 | 2
Health | E4 | 349.63 | 3
Health | E2 | 277.49 | 4
Health | E5 | 72.66 | 5
Health | E1 | 65.72 | 6
Health | E6 | 63.45 | 7
(31 rows)
As we can see, the products in each category can now be ranked by revenue.
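Building on this, the ranking can be wrapped in a subquery and filtered if, say, only the highest-earning product in each category is of interest. The following is a sketch in the same style as the queries above and has not been run against the table:
>>> select category, product, revenue
from (select category, product, revenue,
dense_rank() over (partition by category order by revenue desc) as revenueorder
from products) as ranked
where revenueorder = 1 order by category;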
Data Visualization with Seaborn
Now that the above analysis has been conducted in SQL, let us see how the results of these queries can be visualised using Python’s seaborn library.
Using Python, we can connect to the SQL database using SQLAlchemy as follows:
>>> from sqlalchemy import create_engine
>>> import pandas as pd
>>> import matplotlib.pyplot as plt
>>> import seaborn as sns
>>> sns.set()
>>> engine = create_engine('postgresql+psycopg2://postgres:password@localhost/db')
>>> table_name = 'products'
>>> df = pd.read_sql_table(table_name, engine)
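A quick check such as df.head() can confirm that the table loaded as expected before running further queries (output omitted here):
>>> df.head()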
By reusing the first query above, a heatmap of revenue share by category can be generated:
>>> df = pd.read_sql_query("select category, cast((group_values/(select sum(revenue) from products))*100 as decimal(4,2)) as group_values from (select category, sum(revenue) as group_values from products group by category order by category) as subquery;", engine)
>>> analysis = pd.pivot_table(df, index='category', values="group_values")
>>> sns.heatmap(analysis, annot=True, cmap="coolwarm", vmin=5, vmax=55, fmt='g')
>>> plt.title("Revenue by category (%)")
>>> plt.savefig("revenue by category.png")
>>> plt.close()
Source: Plot generated by author
Now, let us generate a boxplot, which shows the spread of revenue across categories:
>>> df = pd.read_sql_query("SELECT category, revenue FROM products", engine)
>>> sns.boxplot(data=df, x="revenue", y="category")
>>> plt.xlabel("Revenue")
>>> plt.ylabel("Category")
>>> plt.title("Category by Revenue")
>>> plt.savefig("category boxplot.png")
>>> plt.close()
Source: Plot generated by author
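Finally, the moving-average query from the second example can be visualized in a similar way. The sketch below assumes the same engine connection as above and plots one line per category; it has not been run here, and the default column name avg simply mirrors the SQL output shown earlier:
>>> df = pd.read_sql_query("select date, category, avg(revenue) over (partition by category order by date rows between 3 preceding and current row) from products order by category, date;", engine)
>>> sns.lineplot(data=df, x="date", y="avg", hue="category")
>>> plt.xlabel("Date")
>>> plt.ylabel("4-period moving average revenue")
>>> plt.title("Moving average of revenue by category")
>>> plt.xticks(rotation=45)
>>> plt.tight_layout()
>>> plt.savefig("moving average by category.png")
>>> plt.close()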
By combining SQL and data visualization, we can achieve a much more intuitive understanding of the data.