Is Your Predictive Model “Too Good To Be True”? Beware Of Biases From Your Data

Niels Bohr - Predictions are difficult, especially regarding the future

This famous quote is attributed to the Nobel laureate physicist Niels Bohr who received his Nobel prize for his work on atomic structure and quantum mechanics. I was reminded of this quote when analyzing the results of proof-of-concept (POC) “bake-offs” that several of our prospects were running as an evaluation of Mintigo and other predictive marketing vendors.

Can too much data be as dangerous for modeling as too little data? It might be. When building predictive marketing models, you always run the risk of giving your modeler too much of the wrong data, which can rig the test and produce models that predict the past superbly but the future poorly. In this post I’ll give a few reasons why this happens, share some metrics from our own modeling experience at Mintigo, and then show how you can design your modeling process to avoid most of those pitfalls.

3 Types Of Data Typically Used For Predictive Models

There is a lot of noise in the predictive marketing space – lots of vendors are telling similar stories on the surface, which makes it hard for those looking to select and implement a predictive solution to understand the differences. So what do you do? A good option is to run a POC – I’ve written several posts in the past that discuss how to conduct a successful POC.

However, I’ve recently heard from customers that some of the models they received in the POC process were magnificent – so good that they might be “too good to be true”. What does this mean and why is this a bad thing?

The main cause of this behavior is simply the type of data used in these models. There are three types of data that models can use in predictive marketing:

  1. Data about the lead and the company: This refers to information about the lead that is not updated as part of the sales process in terms of the status of the lead or where they fall in the marketing funnel. For example, data that falls into this category includes the lead’s job title, the industry that the company is categorized in, the database or the type of VPN the company is using, and the regulations it has to comply with. All of these are characteristics of the lead or company that we at Mintigo call Marketing Indicators (MI’s).
  2. A lead’s behavioral data: This includes information about a lead or an account in terms of its activity in the marketing funnel. For example, activities such as content downloaded by a lead, website visits or webinar attendance fall into this category.
  3. Data about the sales process: This is information about the sales process that happens after marketing nurtures or passes a lead to sales. For example, when sales starts working on an account, they tend to open new contacts under that account. Let’s call this “sales process” data.

“The Good, The Bad, and The Ugly” of Predictive Modeling With Wrong Data

The first thing you need to ask when considering implementing predictive at your organization is: what is the goal or desired outcome of the predictive models? Do I have different goals that require different sets of data?

A few common examples are shown in the table below, with the different data types that can or can’t be used in each:

Table: 3 Types Of Data Used In Predictive Modeling

Let’s say for example that you are trying to identify which leads from your funnel are most likely to convert to sales opportunities (use case #2 in the table above). When modeling propensity to buy, we need to be very careful to include only data that was time stamped before the lead was engaged with sales. Let me explain why through an example.

Mike, an IT director at ACME company, is an imaginary lead in our funnel. Our goal is to engage Mike in a sales process to offer IT solutions for field employees. As described previously, we have three types of data that we may use, and the use of two of these may result in a few pitfalls that we need to avoid.

Data about the lead and the company:

Mintigo’s Marketing Indicators are data points discovered by mining the web using big data algorithms. By scanning the web, we discovered that Mike’s company actually has employees in the field service function, as well as locations across the US. In addition, they have an active “Bring Your Own Device” initiative. These are important data points to determine whether or not Mike is a good fit as a potential customer and his readiness to buy.

A lead’s behavioral data:

Mike had engaged several times with our company in the past few weeks: repeat visits to the website as well as attending a webinar. After connecting with a sales rep, Mike then downloaded a few white papers and also revisited our website a few more times.

Here is the catch: to predict who should talk to sales, the model may only use what we knew before that conversation, namely Mike’s initial visits to the website and his attendance of the webinar. If we also use the engagement data that happened after talking to sales, the model will learn that subsequent engagement by leads similar to Mike, such as downloading those white papers, marks a lead as a great fit for sales follow-up. Unfortunately, this will skew the results, because those activities were a consequence of the sales contact rather than a signal that preceded it. By using data that was received after talking to sales, we are actually diminishing the predictive power of the model for leads that should be talking with sales but are not yet. Therefore, we need to be careful and use only data that was time-stamped before leads contacted sales.
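The time-stamp rule above can be sketched in a few lines of Python. This is only an illustration, not Mintigo's implementation: the function name `events_before_sales_contact`, the tuple layout, and all dates are assumptions invented for the example.

```python
from datetime import datetime


def events_before_sales_contact(events, first_sales_contact):
    """Keep only events time-stamped strictly before the first sales contact.

    `events` is a list of (timestamp, activity) tuples (hypothetical layout).
    `first_sales_contact` is a datetime, or None if sales never engaged.
    """
    if first_sales_contact is None:
        return list(events)  # no sales contact, so nothing can leak
    return [(ts, activity) for ts, activity in events if ts < first_sales_contact]


# Hypothetical timeline for Mike; the dates are invented for illustration.
mike_events = [
    (datetime(2015, 3, 2), "website_visit"),
    (datetime(2015, 3, 9), "webinar_attended"),
    (datetime(2015, 3, 20), "whitepaper_download"),  # after talking to sales
    (datetime(2015, 3, 22), "website_visit"),        # after talking to sales
]
mike_sales_contact = datetime(2015, 3, 15)

usable = events_before_sales_contact(mike_events, mike_sales_contact)
```

Here `usable` holds only Mike's first website visit and the webinar; the white paper download and the later visit are dropped because they happened after sales stepped in.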

Data about the sales process:

After Mike was qualified by sales, two additional leads were added to the ACME account – Deanna from IT and Brad from the purchasing department. These are contacts created and added to the ACME account record after sales has already qualified and converted Mike. In this case, to model for our leads’ likelihood to buy, we can only include leads in the model that were created prior to the initial contact with sales. So as was the case with some of the lead’s behavioral data, the data that came about from the sales process does not really tell us what is going to happen. Both Deanna and Brad were brought in late in the sales cycle, so in order to predict which leads at the top of the funnel sales should be talking to, both of these contact records should be discarded from the model. If we allow the models to look at all the leads/contacts in the account (Mike, Deanna and Brad), including those that were created after the opportunity was generated, we will have a skewed model.
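As a rough sketch of the same idea for contact records, the snippet below drops any contact created on or after its account's first sales contact. The field names (`name`, `account`, `created`), the function name, and the dates are hypothetical; a real CRM export will look different.

```python
from datetime import date


def training_contacts(contacts, first_sales_contact_by_account):
    """Drop contacts created on or after their account's first sales contact."""
    kept = []
    for contact in contacts:
        cutoff = first_sales_contact_by_account.get(contact["account"])
        # Accounts that sales never touched have no cutoff, so all contacts stay.
        if cutoff is None or contact["created"] < cutoff:
            kept.append(contact)
    return kept


# Hypothetical records for the ACME account; the dates are invented.
contacts = [
    {"name": "Mike", "account": "ACME", "created": date(2015, 3, 1)},
    {"name": "Deanna", "account": "ACME", "created": date(2015, 4, 10)},
    {"name": "Brad", "account": "ACME", "created": date(2015, 4, 18)},
]
first_sales_contact = {"ACME": date(2015, 3, 15)}

kept = training_contacts(contacts, first_sales_contact)
```

Only Mike survives the filter; Deanna and Brad were created after sales engaged the account, so they never reach the model.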

Identifying “Cheat Features” In Modeling

Similar to the above example, we were looking at data from one of our prospects in a POC project with us, and we noticed all sorts of false signals from behavioral and “sales process” data that weakened the predictive power of the model. We call these “cheat features” in predictive modeling terminology because they allow the model to “cheat” during training and produce more false positives when put into practice.

If we looked at their entire data set of sales process data and the lead’s behavioral data, we would notice the following:

  1. The different leads created in each account.
  2. The dates they were created.
  3. The result of each lead in sales.

To identify the effect of potential biases from cheat features in predictive modeling, we compared two approaches to building a predictive model. The first approach used data about the leads in the account as well as the dates of when they were created and stages they reached in the sales process. The second approach only used data about the lead and the company, and not behavioral and sales process data that weren’t timestamped.

We then created a test model to see the effect of cheat features by taking these steps:

  1. We took one year of historical data for all the leads created in each account and the dates these contacts were generated.
  2. We randomly selected 30% of the leads in step 1 as a “testing set”. This represented the leads that we are not using to create a model (a.k.a. “training the model”**). After training the model on the remaining 70%, we tried to predict which members of the testing set of leads had the highest likelihood to buy.
  3. We then created two models: one that is based only on data about the lead and the company (i.e., a model without cheat features), and another which used data about the sales process in addition to the data about the lead and the company (i.e. a model with cheat features).

Figure: Biases of Predictive Models With Cheat Features vs Model Without Cheat Features

In the graph above (which is called an ROC Curve if you’re a statistics geek), the red line represents the model that only included data about the lead and the company (i.e., the model without cheat features). It shows that 70% of the positive leads (as defined by wins) fell in the top 20% of the leads scored by this model. The blue line represents the model that also included sales process and behavioral data (i.e., the model with cheat features); there, approximately 80% of the positive leads fell in the top 20% of the leads scored. As can be seen, the model represented by the blue line seemed to perform significantly better than the model represented by the red line.
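The "share of wins in the top 20% of scored leads" figure quoted above can be computed along the following lines. This is a minimal sketch: the function name `capture_rate` and the toy scores and labels are invented for illustration and are not the data behind the graph.

```python
def capture_rate(scores, labels, top_fraction=0.2):
    """Share of all positive leads that land in the top-scored fraction.

    `scores` are model scores (higher means more likely to buy);
    `labels` are 1 for a win and 0 otherwise.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("no positive leads to capture")
    # Rank leads from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    cutoff = max(1, int(len(ranked) * top_fraction))
    captured = sum(label for _, label in ranked[:cutoff])
    return captured / total_positives


# Toy data: 10 leads, 2 of them wins (invented for illustration).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]
labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]

rate = capture_rate(scores, labels)
```

With this toy data the top 20% is the two highest-scored leads, which contain one of the two wins, so `rate` is 0.5. Computing the same number for both models on the same test leads gives the comparison described above.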

However, this is not indicative of true behavior that led to wins. How real and accurate is the second prediction? Not very – the main mistake is in the test design. In order to build a correct test model, you must look at a training set that is older in time than the test set. The goal of building a predictive model is to predict future events, not reiterate past ones – remember Niels Bohr’s quote?

For this example, let’s take a year’s worth of data. What we need to do is take the first 9 months in the data set for training the model, and then the next 3 months of the data set for testing the model (see footnote #2). With only data about the lead and the company used in modeling (non-cheat features), this 9-month/3-month split works extremely well. On the other hand, including data about the sales process and the lead’s behavioral data (cheat features) overlooks the human factor, which introduces incorrect biases into the model. As an example, most of the sales process happened in the first 9 months, and then all the activities that are related to closing the deal are in the last 3 months, which are used in the testing set. Due to the human factor of the sales process in these models, there is typically a leakage, or cross-over, of data between the training set and the test set of the machine learning model.

Note that this does not fully eliminate bias in the model from behavioral and sales process features, but it does reduce it. The reason it doesn’t fully eliminate it is that some leads take a while to close. So while most of the sales process happened in the first 9 months, some sales processes can take 6 months or more, which means there may still be some artifacts in the model. However, if the models we build this way perform significantly worse, it does indeed raise a red flag on how predictive these models truly are.
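A time-based split like the 9-month/3-month one described above might look like this in Python. The `created` field, the function name, and the sample year are assumptions made for the sketch.

```python
from datetime import date


def time_based_split(leads, cutoff):
    """Train on leads created before `cutoff`, test on those created on or after it."""
    train = [lead for lead in leads if lead["created"] < cutoff]
    test = [lead for lead in leads if lead["created"] >= cutoff]
    return train, test


# One hypothetical lead per month of an invented year.
leads = [{"id": month, "created": date(2015, month, 15)} for month in range(1, 13)]

# The first 9 months go to training, the last 3 months to testing.
train, test = time_based_split(leads, cutoff=date(2015, 10, 1))
```

Unlike a random 70/30 split, every training lead here is older than every test lead, so the model is always asked to predict forward in time.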

Recommendations For Designing a Proper Predictive Modeling POC

When building a predictive scoring model, you need to make sure that the model you choose can actually predict the future, rather than reiterate the past by relying on types of data that have no bearing on future predictions for the problem you are trying to solve.

To prevent such misguided methods as described in this post, you need to design the proof-of-concept correctly. Our recommendations are:

  1. Use the right data for test purposes – put 70% of the leads that were created earlier into the training set, and put the newest 30% into the test set. Of course, don’t provide the modeler the results on the test set – check them after you get the results to see for yourself how well the model performed.
  2. Ask to see the top features used by the model – are these the features that you need to predict who is a better fit lead, or is this simply a reflection of how your sales team works a lead?
  3. Compare the model performance between newer and older data – as seen above, if the model is stable in performance when training on older and newer data, it is more likely to be able to predict the future successfully.
  4. Don’t provide too much information at the proof-of-concept stage – see what modelers can predict based only on account names, emails, and basic identification details. If they can build a good model on little information, the models will only get better after adding more information at a later stage.
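Recommendation 1 could be prepared roughly as follows: sort the leads by creation date, hand over the oldest 70% with outcomes, and hand over the newest 30% with the outcome removed while keeping an answer key for yourself. The field names (`id`, `created`, `won`) and the function name are hypothetical, chosen only for this sketch.

```python
from datetime import date


def prepare_poc_files(leads, train_percent=70):
    """Split leads chronologically and withhold outcomes for the newest ones.

    Returns (training, blind_test, answer_key):
      training   - oldest leads, outcomes included
      blind_test - newest leads with the "won" field stripped
      answer_key - {lead id: outcome}, kept private until scores come back
    """
    ordered = sorted(leads, key=lambda lead: lead["created"])
    split = len(ordered) * train_percent // 100  # integer math avoids float rounding
    training = ordered[:split]
    holdout = ordered[split:]
    blind_test = [
        {key: value for key, value in lead.items() if key != "won"}
        for lead in holdout
    ]
    answer_key = {lead["id"]: lead["won"] for lead in holdout}
    return training, blind_test, answer_key


# Ten hypothetical leads, one per month; every third lead is a win.
leads = [
    {"id": i, "created": date(2015, i, 1), "won": i % 3 == 0}
    for i in range(1, 11)
]

training, blind_test, answer_key = prepare_poc_files(leads)
```

Once the modeler returns scores for `blind_test`, you can compare them against `answer_key` yourself, which is the "check them after you get the results" step from the first recommendation.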

I hope this explains why a model that uses data about the lead and the company has much more predictive power, and why a model that uses sales process data simply predicts the past. If you’d like to get additional information on how to run a successful POC for selecting a predictive marketing vendor, check out my previous blog post series covering this topic.

In addition, our research analyst friends have additional content on this topic, such as SiriusDecisions’ brief on “Questions To Ask Predictive Lead Scoring Vendors” and Forrester’s “New Technologies Emerge To Help Unearth Buyer Insight From Mountains Of B2B Data” – subscriptions are required to access both.

Please share any questions or thoughts you may have in the comments section below.

**Note – In machine learning, computers apply statistical learning techniques to automatically identify patterns in data; training the model is the part of the learning process in which the model learns a set of distinguishing data characteristics from the training data. The trained model is then run on the ‘testing set’ mentioned above. See this site: Introduction to Machine Learning for an excellent explanation.


Tal Segalov


Tal is a Co-Founder and Chief Technology Officer at Mintigo. He brings more than 15 years of experience in software development. Prior to Mintigo, Tal was AVP Research and Development for modu, the modular mobile handset company. His previous experience includes developing complex, large scale data analysis systems. He holds a B.Sc. EE and a B.A in Physics from the Technion – Israel’s leading school of technology. He also holds an executive MBA from Tel Aviv University.