# Cohort Analysis: Data Sourcing With SQL

As an online business owner, you hope that engagement increases over time, leading to a longer period of retention. This is rarely the case, though. The majority of web applications see a gradual decrease in user engagement, eventually leading to churn. Your goal then is to stretch out the length of engagement as long as possible. The best way to measure this is through cohort analysis.

Put simply, cohort analysis is used to test whether certain groups of users, based on the conversion date, are active and engaging longer (or shorter) than other groups.

The ultimate goal is to test not only how users within each cohort engaged within your app over time, but to also compare and contrast different cohorts with one another. It’s quite often the case that even subtle changes in your application’s feature set will change user engagement, positively or negatively. If the latter, you want to know this as soon as possible to prevent further churn.

That said, let’s look at how to source and cleanse your data in order to begin analysis.

Please note: I will be using a MySQL database for these examples. The corresponding SQL statements should be simple enough to port over to whatever RDMS you use. Comment if you have questions. You can also look at this spreadsheet to see how to conduct cohort analysis in Excel.

## Setup

If you’d like to follow along, follow these steps to create the table and load the sample data.

### Create the database and tables

1. Access MySQL via the Shell (or your preferred method):
1. Create a new database, and then select it for use:
1. Use the following commands to create the tables users and events:

Download the database file here. Then using the command line, navigate to the file path where the dump.sql file was downloaded. Now, type the following command into your terminal:

This command should take a few minutes to run as it loads the data in the database.

## Sourcing Data

When sourcing your data, it’s important to begin by structuring your SQL queries around user behavior - engagement, in this case.

In this example, we will use a database filled with sample data for a subscription photo-editing application. The database has one table for customers, Users, and another for engagement, Events, among others:

Users:

Events:

As you can tell, there are 54,541 rows of data in the Users table and 101,684 rows in the Events table. The type field in the Events table specifies whether a user has shared (to Twitter or Facebook), commented on, or liked a photo.

Start by looking at each table individually, beginning with basic queries before moving on to more advanced queries to get a feel for the data you’re working with.

### Monthly Cohorts

I will be using a new Group Date function to create the cohorts. Follow the installation instructions here.

Group users into monthly cohorts, based on sign-up date, and add in the total users for each cohort:

Did you notice the time zone (Greenwich)? Try experimenting with other time zones to see how the results change. This is my favorite feature from the Group Date function.

Like the last query, split the events into cohorts:

Based on these tables, you can see the total engagement vs. the total users. Right off the bat you can tell that engagement decreases in the latter months? Will this trend continue?

### Individual Cohorts

Now start looking at each individual cohort to see if you can see any outliers or large discrepancies. For simplicity, you can follow this syntax:

For example:

In this query, we look at the total percent of twitter shares divided by the total # of engagements:

You can even zoom in on individual users:

### Analysis

Finally, create a chart that shows all the cohorts and their subsequent monthly engagement % for easy comparison. Make sure to normalize the data in this chart (see the ROUND function).

This is clearly a difficult query, which will require significant modifications based on your own data. In this query we are defining-

• each cohort by breaking the dates into intervals of 30.4 days (~ one month); and,
• the percent of active users (grouped by cohort) by dividing the number of cohorts by total activities for each cohort, then obtaining a percentage by multiplying the results by 100.

To make such queries easier, make sure you -