Home | Portfolio | GitHub | LinkedIn | Medium | Stack Overflow | Terms | E-mail
Importing CSVs in PostgreSQL and Conducting Regression Analysis
As data scientists, we often prefer to conduct regression analysis — and admittedly almost all forms of analysis — in Python and R.
That said, while it is common to work with CSV files across small datasets or on side projects — real world data of any substantial size is almost always stored in a SQL database.
Often times, importing an entire table into Python or R (depending on its size) can be impractical.
Let’s suppose that we wish to conduct regression analysis on a particular dataset that was imported from a SQL database. For all we know, the results of a regression analysis across this dataset could be spurious — and this means that valuable time and resources have been wasted by pulling the data for more advanced analysis.
Wouldn’t it be more efficient to be able to conduct some preliminary analysis in SQL before importing the data to an external program for more advanced analysis?
Well, simple regression analysis is indeed possible using SQL. Let’s take a look at how this can be done.
Dataset and import from CSV
Consider a dataset of car sales (available from my GitHub), whereby the sale price of a particular vehicle is provided, along with factors such as the monthly income level of the buyer, their total debt, and the average number of miles per month driven.
While the data may be provided in a table within the SQL database already, there are often occasions where we wish to import the data from a CSV file. Let’s see how this can be done.
The first thing to ensure is that SQL has appropriate permissions to access the CSV file. For instance, suppose you are using LibreOffice Calc (which I am as a Linux user) — you will need to ensure that permissions are allowed by right-clicking on the CSV file, selecting Properties, and toggling to Permissions:
Source: LibreOffice Calc
Before the CSV file can be imported, both the database and table need to be created with the appropriate variables defined in the latter.
To create and open the database (called vehicles in this case), the following is entered:
sudo -u postgres createdb vehicles sudo -u postgres psql vehicles
Now, a table called cars is created:
vehicles=# create table cars ( vehicles(# age decimal(8,2), vehicles(# gender decimal(8,2), vehicles(# miles decimal(8,2), vehicles(# debt decimal(8,2), vehicles(# income decimal(8,2), vehicles(# sales decimal(8,2) vehicles(# );
Having created the table, we can now import the data from the CSV file into the table:
vehicles=# copy cars (age, gender, miles, debt, income, sales) vehicles-# from '/home/Desktop/cars.csv' vehicles-# DELIMITER ',' vehicles-# CSV HEADER; COPY 963
With 963 entries in the dataset, the COPY 963 message indicates that the values were transferred correctly.
We can now verify that the values are present by selecting the first 10 observations from the table:
vehicles=# select * from cars limit 10; age | gender | miles | debt | income | sales -------+--------+-------+----------+---------+---------- 28.00 | 0.00 | 23.00 | 0.00 | 4099.00 | 620.00 26.00 | 0.00 | 27.00 | 0.00 | 2677.00 | 1792.00 30.00 | 1.00 | 58.00 | 41576.00 | 6215.00 | 27754.00 26.00 | 1.00 | 25.00 | 43172.00 | 7626.00 | 28256.00 20.00 | 1.00 | 17.00 | 6979.00 | 8071.00 | 4438.00 58.00 | 1.00 | 18.00 | 0.00 | 1262.00 | 2102.00 44.00 | 1.00 | 17.00 | 418.00 | 7017.00 | 8520.00 39.00 | 1.00 | 28.00 | 0.00 | 3282.00 | 500.00 44.00 | 0.00 | 24.00 | 48724.00 | 9980.00 | 22997.00 46.00 | 1.00 | 46.00 | 57827.00 | 8163.00 | 26517.00 (10 rows)
Correlation Coefficients and Regression Analysis
Now that the data has been imported, let’s start conducting some analysis.
Particularly, we would like to determine which variables have a particular impact on the sales price for each vehicle. As mentioned, we would like to determine if there are particularly relevant findings by using SQL before considering more advanced analysis.
Firstly, let’s generate correlation coefficients (which measure the strength of the relationship between two variables) between the following variables:
- Sales and Income
- Sales and Debt
- Sales and Miles
To calculate the correlation coefficient between sales and income, we will enter the following, with the Y (dependent) variable entered first, and the X (independent) variable entered second:
vehicles=# select corr(sales, income) as correlation from cars; correlation ------------------- 0.674685436054842 (1 row)
The analysis yields a correlation coefficient of 0.67, indicating that there is a positive correlation between the two variables.
For reference, a correlation coefficient of 1 indicates a perfectly positive correlation, -1 indicates a perfectly negative correlation, while 0 indicates no correlation.
Here are the correlation coefficients for the other variables:
vehicles=# select corr (sales, debt) as correlation from cars; correlation ------------------ 0.83554132018907 (1 row) vehicles=# select corr (sales, miles) as correlation from cars; correlation ------------------- 0.636675650108261 (1 row)
Interestingly, we see that sales and debt has the strongest correlation at 0.83. This could indicate to us that customers who hold more debt are more likely to take out a loan to finance a car purchase. Intuitively, this makes sense — since more expensive vehicles are often financed with debt rather than a cash purchase.
To dive deeper, let’s conduct a regression analysis.
We will firstly regress sales against income:
vehicles=# select round(regr_slope(sales, income)::numeric, 2) as slope, round(regr_intercept(sales, income)::numeric, 2) as y_intercept from cars; slope | y_intercept -------+------------- 1.86 | 205.29 (1 row)
The analysis generates a slope of 1.86, indicating that for every $1 increase in income, the sales figure increases by $1.86. The intercept is somewhat spurious in this case — but essentially represents what the minimum sales figure would be if a customer’s income was $0.
Now, let’s conduct regression analysis on the other variables:
vehicles=# select round(regr_slope(sales, debt)::numeric, 2) as slope, round(regr_intercept(sales, debt)::numeric, 2) as y_intercept from cars; slope | y_intercept -------+------------- 0.41 | 5892.27 (1 row) vehicles=# select round(regr_slope(sales, miles)::numeric, 2) as slope, round(regr_intercept(sales, miles)::numeric, 2) as y_intercept from cars; slope | y_intercept --------+------------- 427.69 | -158.93 (1 row)
- For every $1 increase in debt, the sales figure rises by $0.41. The intercept indicates that the minimum number of sales would be $5,892.27 if debt was $0.
- For every increase in average miles driven per day by 1, the sales price increases by $427.69. This is contrary to what we would expect — we would assume that the more miles on a car — the lower the sales price. However, in this case the average miles driven per day is the variable in question — not total miles. In this case, this variable may well be spurious from a modelling perspective and cannot really tell us much about particular influences on the sales price.
To better gauge the meaningfulness of these readings, let’s take a look at a statistic known as R-Squared.
Essentially, any dependent variable (in this case, sales) is influenced by both independent variables in the model and random variation that these variables do not pick up.
R-Squared serves as a measure of the degree to which a model accounts for the total variation in the dependent variable. The R-Squared value ranges between 0 and 1, with 0 indicating that the model accounts for no variation in the dependent variable, while 1 indicates that the model accounts for all the variation in the dependent variable.
Here are the R-Squared calculations for the variables in question:
vehicles=# select round(regr_r2(sales, income)::numeric, 3) as r_squared from cars; r_squared ----------- 0.455 (1 row) vehicles=# select round(regr_r2(sales, debt)::numeric, 3) as r_squared from cars; r_squared ----------- 0.698 (1 row) vehicles=# select round(regr_r2(sales, miles)::numeric, 3) as r_squared from cars; r_squared ----------- 0.405 (1 row)
We can see that the regression of sales against debt has the highest R-Squared at almost 70%, meaning that the debt variable accounts for almost 70% of the total variation in sales.
This has been an introduction to the use of regression analysis in PostgreSQL. You have also seen how to import a dataset in CSV format and transfer it to a SQL table.
I recommend reading Practical SQL by Anthony DeBarros for a more in-depth look on how to implement regression analysis and other analytical tools using SQL.
As mentioned, SQL can be quite useful when it comes to a “quick and dirty” analysis of a particular set of data — and this can greatly aid in determining whether further analysis is justified using an external program such as Python or R.
While more in-depth regression analysis such as t-test calculations can be difficult to implement in SQL (and generally not recommended) — conducting preliminary analysis in the database table itself always has its advantages.
Many thanks for reading, and you can find further examples of useful SQL practices here.