Reverse causation

data science
The tools and strategies data scientists use when analyzing correlations in collections data, the pitfalls and misconceptions that can emergence from misunderstanding correlations, and the importance of experimenting with different strategies using randomized samples.
Author

Lucas A. Meyer

Published

November 22, 2021

Gather close as I tell you a cautionary tale about data science.

A new data scientist had just started working on the collections team for a large company. One of the problems they wanted to solve was “does contacting customers help at all?”

It’s an important question. After all, the cost of collections is directly related to the number of contacts. The more contacts, the more people.

The DS had a table with of invoices and payment dates, and a table with a history of contacts by invoice. Using the invoice issue date and invoice payment date, he created a payment delay metric. Then, he joined the contact history and created a metric for number of contacts. Finally, he calculated the correlation between the number of contacts and payment delay.

He was startled by the results: the more contacts per invoice, the longer it took customers to pay.

He told the collections manager, who got very excited. He long suspected that the contact center had too many people. Now he had some proof.

The DS carefully separated the data in a training and a test sample, and trained a regression model using XGBoost. Given inputs such as the invoice country and amount, the model predicted the delay in days given an estimated number of contacts. The model had a very low error (<0.5d) in the training dataset. When using the unseen test dataset, the error was also minimal.

The DS and the collections manager made a plan to deploy a tool to collectors. When assigned an invoice, collectors should enter the characteristics of the invoice in the tool and add 1 contact. For example, if a collector was assigned an invoice with 10 days delay and 2 contacts, they should enter 3 as the number of contacts. If the model returned a number larger than the current delay, the collector should not make the contact. They should just wait.

Once the tool was deployed, the number of contacts initially dropped a lot. To the collections manager surprise, however, the payment delays also increased. Customers were paying much later than usual, and ended up having to be contacted anyway.

As you may already have realized, the problem with the idea was that customers were not delaying because they were annoyed with too many contacts. The relation went in the other direction: customers were being contacted because they were late. While contacts and lateness was very correlated, lateness caused contacts, not the other way around.

The project was scrapped. The team decided that data science was all hype. The data scientist moved on to another team. The collections team went back to the old way of contacting customers: about once per week until they paid.

It took a while (and a new data scientist) for that team to trust data science again: a few years later, a new data scientist started experimenting with assigning different strategies to randomized customers, and that improved collections a lot. But that’s another tale.