Linear, tree, forest and support vector regression: comparison, source code and ready-to-use app

In this article I am going to walk you through the process of building, training and evaluating a prediction model for the number of ad impressions delivered in a digital marketing campaign. All the techniques can be applied analogously to other regression problems, especially to predicting various campaign performance metrics. The predictions can be used to evaluate future marketing campaigns before launch as well as to determine the best parameters for such campaigns, e.g. timeline and budget size. You can use your own campaign data or the provided sample dataset to code along in Python. Alongside the full source code I also provide a simple app to predict impressions, clicks and conversions for purchase-based digital marketing campaigns.
App: predictor.stagelink.com
Code: github.com/kinosal/predictor
Story outline
- Requirements
- Define your goals
- Get the dataset
- The first glance
- Pre-process your data
- Train your models
- Evaluate your models
- Predict the results for your next campaign
- BONUS: Ready-to-use trained model
Requirements
We will use data from past marketing campaigns in order to predict the outcome of future campaigns. Generally speaking, the more data, i.e. campaigns, the more accurate the predictions. The exact number depends, among other things, on the homogeneity of your campaigns, but you will likely need data from at least a few hundred campaigns. Also, since we’ll use supervised learning techniques, you need the same inputs, i.e. dimensions or features, for the future campaigns whose outcome you want to estimate.
In case you do not have a suitable dataset at hand right now, don’t worry: You can download a CSV with the sample I am going to use for this article here:
https://github.com/kinosal/predictor/blob/master/model/impressions.csv
Define your goals
What do we actually mean when we refer to the success or outcome of a campaign? This obviously depends on your specific situation. For this article, we will try to predict the number of impressions of a single campaign. Similarly, clicks and conversions can be predicted to complete the classic marketing funnel:

Get the dataset
We are presented with several past campaigns, each of which provides one observation or row in a table with several dimensions or columns, including the dependent variable we want to predict as well as multiple explanatory independent variables or features:

Since the campaigns whose outcome we would like to predict lie in the future, the features in this case do not include any prior performance data but only different observable qualities of a campaign. As we usually don’t know beforehand which features will turn out to be good predictors, I recommend also using variables that might seem only remotely related to your campaigns and investing some time in finding or constructing new features. Although there are arguments for reducing the feature space, this can usually still be handled at a later stage.
You can load the CSV and save it into a Pandas data frame with a single function call:
import pandas as pd
data = pd.read_csv('impressions.csv')
The first glance
Before building and training the prediction model, I always take a first look at the data to get an idea of what I am dealing with and to spot potential peculiarities early. We will use the sample data to predict the number of impressions for a marketing campaign, hence “impressions.csv” contains one row per campaign, each with its total number of impressions as well as metric and categorical features to help us predict the number of impressions for future campaigns. We will confirm this by loading the data and showing its shape, columns and first 5 rows:
>>> data.shape
(241, 13)
>>> data.columns
Index(['impressions', 'budget', 'start_month', 'end_month', 'start_week',
       'end_week', 'days', 'region', 'category', 'facebook', 'instagram',
       'google_search', 'google_display'],
      dtype='object')
>>> data.head(5)
impressions  budget  start_month  ...  google_search  google_display
       9586     600            7  ...              1               0
...
The first column contains the dependent (to be predicted) variable “impressions” while there are 12 feature columns for 241 records (rows) in total. We can also use data.describe() to display count, mean, standard deviation, range and quartiles for every metric column.
We can further observe that we are dealing with ten numerical features and two categorical ones, while four of the numerical columns are binary:

Now let us plot the histograms for the numerical features. We’ll use two very handy data visualization libraries, Matplotlib and Seaborn (which builds upon Matplotlib):
import matplotlib.pyplot as plt
import seaborn as sns

quan = list(data.loc[:, data.dtypes != 'object'].columns.values)
grid = sns.FacetGrid(pd.melt(data, value_vars=quan), col='variable',
                     col_wrap=4, height=3, aspect=1,
                     sharex=False, sharey=False)
grid.map(plt.hist, 'value', color="steelblue")
plt.show()

As a final glance we will have a look at basic linear correlations between the numerical features. First, let us visualize those with a Seaborn heatmap:
sns.heatmap(data._get_numeric_data().astype(float).corr(),
square=True, cmap='RdBu_r', linewidths=.5,
annot=True, fmt='.2f').figure.tight_layout()
plt.show()

In addition, we can also output the correlations of each feature with the dependent variable:
>>> data.corr(method='pearson').iloc[0].sort_values(ascending=False)
impressions 1.000000
budget 0.556317
days 0.449491
google_display 0.269616
google_search 0.164593
instagram 0.073916
start_month 0.039573
start_week 0.029295
end_month 0.014446
end_week 0.012436
facebook -0.382057
Here we can see that the number of impressions is positively correlated with the budget amount and the campaign duration in days, and negatively correlated with the binary option to use Facebook as a channel. However, this only shows us a pair-wise linear relationship and can merely serve as a crude initial observation.
Pre-process your data
Before we can begin to construct a predictive model, we need to make sure our data is clean and usable, since the old adage applies: “Garbage in, garbage out.”
We are lucky to be presented with a fairly well-structured dataset in this case, but we should still go through a quick pre-processing pipeline specific to the challenge at hand:
- Only keep rows where the dependent variable is greater than zero since we only want to predict outcomes larger than zero (theoretically values equal to zero are possible, but they won’t help our predictions).
- Check for columns with missing data and decide whether to drop or fill them. Here we’ll drop columns with more than 50% missing data since these features cannot add much to the model.
- Check for rows with missing values and decide whether to drop or fill them (doesn’t apply for the sample data).
- Put rare categorical values (e.g. with a share of less than 10%) into one “other” bucket to prevent overfitting our model to those specific occurrences.
- Encode categorical data into one-hot dummy variables since the models we will use require numerical inputs. There are various ways to encode categorical data; this article provides a very good overview in case you’re interested in learning more.
- Specify dependent variable vector and independent variable matrix.
- Split dataset into training and test set to properly evaluate your models’ goodness of fit after training.
- Scale features as required for one of the models we are going to build.
The full pre-processing script is included in the repository linked above.
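A condensed sketch of those steps could look like the following. The 50% and 10% thresholds mirror the list above, while the 80/20 train/test split, the use of StandardScaler and the exact handling of the target scaling are my assumptions and may differ from the script in the repository:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 1. Keep only campaigns with a positive number of impressions
data = data[data['impressions'] > 0]

# 2. Drop columns with more than 50% missing values
data = data.dropna(thresh=int(0.5 * len(data)), axis=1)

# 3. Drop remaining rows with missing values (none in the sample data)
data = data.dropna()

# 4. Bucket rare categorical values (share below 10%) into an "other" class
for column in ['region', 'category']:
    shares = data[column].value_counts(normalize=True)
    rare = shares[shares < 0.1].index
    data[column] = data[column].replace(list(rare), 'other')

# 5. One-hot encode the categorical columns
data = pd.get_dummies(data, columns=['region', 'category'])

# 6. Separate the dependent variable from the feature matrix
y = data['impressions']
X = data.drop(columns=['impressions'])

# 7. Split into training and test set (the 80/20 ratio is an assumption)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

# 8. Scale features and target for the support vector regressor
X_scaler = StandardScaler().fit(X_train)
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)
y_scaler = StandardScaler().fit(y_train.values.reshape(-1, 1))
y_train_scaled = y_scaler.transform(y_train.values.reshape(-1, 1)).ravel()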
Train your models
Finally, we can proceed to build and train multiple regressors to ultimately predict the outcome (value of the dependent variable), i.e. the number of impressions for the marketing campaign in question. We will try four different supervised learning techniques — linear regression, decision trees, random forest (of decision trees) and support vector regression — and will implement those with the respective classes provided by the Scikit-learn library, which was already used to scale and split the data during pre-processing.
There are many more models we could potentially use to develop regressors, e.g. artificial neural networks, which might yield even better predictors. However, the focus of this article is to explain some of the core principles of regression in a mostly intuitive and interpretable way, rather than to produce the most accurate predictions possible.
Linear regression

Constructing a linear regressor with Scikit-learn is very simple and only requires two lines of code, importing the LinearRegression class from Scikit-learn’s linear_model module and instantiating it:
from sklearn.linear_model import LinearRegression

# note: the 'normalize' argument has been removed in recent Scikit-learn versions
linear_regressor = LinearRegression(fit_intercept=True, normalize=False, copy_X=True)
We want to keep the default parameters since we need an intercept (the result when all features are 0) to be calculated, and we do not require normalization, preferring interpretability of the coefficients. The regressor will calculate the independent variable coefficients and the intercept by minimizing the sum of the squared errors, i.e. the deviations of the predicted from the true outcomes, which is known as the Ordinary Least Squares method.
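As a quick illustration, and assuming the X_train and y_train objects from the pre-processing sketch above, the fitted coefficients and intercept can be inspected directly on the Scikit-learn estimator:
# fit the linear model and inspect its parameters
linear_regressor.fit(X_train, y_train)
print(linear_regressor.intercept_)  # predicted impressions when all features are 0
print(dict(zip(X_train.columns, linear_regressor.coef_)))  # coefficient per feature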
We can also output the coefficients and their respective p-values. The p-value tells us how likely it would be to observe a coefficient at least this far from zero if the feature actually had no linear effect on the output (the null hypothesis of the coefficient equaling 0); it is hence a measure of statistical significance: the lower the p-value, the more significant the feature.
Having visualized the correlations between the numerical features earlier during our “first glance”, we expect the features “budget”, “days” and “facebook” to carry relatively small p-values with “budget” and “days” having positive and “facebook” a negative coefficient. The statsmodels module provides an easy way to output this data:
import statsmodels.api as sm

model = sm.OLS(y_train, sm.add_constant(X_train)).fit()
print(model.summary())

The p-value here was calculated using a t-statistic (or t-score) based on the t-distribution. The summary also gives us a first hint at the accuracy or goodness of fit of the whole model, scored by the coefficient of determination R-squared, which measures the share of variance in the output explained by the input variables, here 54.6%.
However, in order to compare all models and to suit our particular challenge, we will use a different scoring method which I call “Mean Relative Accuracy”, defined as 1 – mean absolute percentage error = 1 – mean(|(prediction – true value) / true value|). This metric is undefined if the true value is 0, but in our case this is not relevant since we removed such rows in the pre-processing step (see above), and we thus obtain good interpretability matching an intuitive definition of accuracy. We will calculate this score for all models using five-fold cross-validation, splitting the training data into five folds and taking the mean of the five scores. Scikit-learn also provides a handy method for this:
import numpy as np
from sklearn.model_selection import cross_val_score

linear_score = np.mean(cross_val_score(estimator=linear_regressor,
                                       X=X_train, y=y_train, cv=5,
                                       scoring=mean_relative_accuracy))
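The mean_relative_accuracy object passed as the scorer above is not a built-in Scikit-learn metric. A minimal sketch of how such a scorer could be defined with make_scorer follows; the actual implementation in the repository’s helper module may differ:
import numpy as np
from sklearn.metrics import make_scorer

def mean_relative_accuracy_score(y_true, y_pred):
    # 1 - mean(|(prediction - true value) / true value|); assumes y_true != 0
    return 1 - np.mean(np.abs((y_pred - y_true) / y_true))

mean_relative_accuracy = make_scorer(mean_relative_accuracy_score,
                                     greater_is_better=True)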
The training score we obtain for the linear regressor is 0.18, meaning the best fit we were able to produce with this model yields only an 18% prediction accuracy. Let’s hope the other models are able to outperform this.
Decision Tree

Next up is a regressor made from a single decision tree. Here we will use a Scikit-learn estimator with a few more arguments, so-called hyperparameters, than for the linear model, including some whose desired settings we don’t know yet. That’s why we are going to introduce a concept called Grid Search. Grid Search, again available from Scikit-learn, lets us define a grid or matrix of parameters to test when training the prediction model and returns the best parameters, i.e. the ones yielding the highest score. This way we could test all available parameters of the decision tree model, but we will focus on two of them: the “criterion” to measure the quality of a split of one branch into two, and the minimum number of samples (data points) for one leaf (final node) of the tree. This will help us find a good model with the training data while limiting overfitting, i.e. failing to generalize from the training data to new samples. From now on we will also set a random state equal to one for all stochastic calculations so that you will obtain the same values when coding along. The rest works similarly to the linear regression we built earlier:
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# note: 'iid' was removed and 'mae'/'mse' were renamed to
# 'absolute_error'/'squared_error' in recent Scikit-learn versions
tree_parameters = [{'min_samples_leaf': list(range(2, 10, 1)),
                    'criterion': ['mae', 'mse'],
                    'random_state': [1]}]
tree_grid = GridSearchCV(estimator=DecisionTreeRegressor(),
                         param_grid=tree_parameters,
                         scoring=mean_relative_accuracy, cv=5,
                         n_jobs=-1, iid=False)
tree_grid_result = tree_grid.fit(X_train, y_train)
best_tree_parameters = tree_grid_result.best_params_
tree_score = tree_grid_result.best_score_
The best parameters chosen from the grid we defined include the mean squared error as the criterion to determine the optimal split at each node and a minimum of nine samples for each leaf, yielding a mean relative (training) accuracy of 67% — which is already a lot better compared to the 18% from the linear regression.
One of the advantages of a decision tree is that we can easily visualize and intuitively understand the model. With Scikit-learn’s export_graphviz function you can generate a DOT representation of the fitted decision tree which you can then convert into a PNG image:
from sklearn.tree import export_graphviz

# export the best tree found by the grid search; convert the DOT file with
# Graphviz, e.g. "dot -Tpng tree.dot -o tree.png"
export_graphviz(tree_grid_result.best_estimator_, out_file='tree.dot',
                feature_names=X_train.columns)

As you can see, only 4 of all 16 features have been used to construct this model: budget, days, category_concert and start_month.
Random Forest
The main challenges of single decision trees lie in finding the optimal split at each node and overfitting to the training data. Both can be mitigated when combining multiple trees into a random forest ensemble. Here, the trees of a forest will be trained on different (random) subsets of the data and each node of a tree will consider a (again random) subset of the available features.
The random forest regressor is built almost exactly like the decision tree. We only need to add the number of trees, here called estimators, as a parameter. Since we do not know the optimal number, we will add another element to the grid search to determine the best regressor:
from sklearn.ensemble import RandomForestRegressor

forest_parameters = [{'n_estimators': helpers.powerlist(10, 2, 4),
                      'min_samples_leaf': list(range(2, 10, 1)),
                      'criterion': ['mae', 'mse'],
                      'random_state': [1], 'n_jobs': [-1]}]
forest_grid = GridSearchCV(estimator=RandomForestRegressor(),
                           param_grid=forest_parameters,
                           scoring=mean_relative_accuracy, cv=5,
                           n_jobs=-1, iid=False)
forest_grid_result = forest_grid.fit(X_train, y_train)
best_forest_parameters = forest_grid_result.best_params_
forest_score = forest_grid_result.best_score_
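helpers.powerlist comes from a small helper module in the repository. Judging by the parameter values reported below, it presumably returns a geometric sequence; a sketch of such a helper could look like this:
def powerlist(start, base, times):
    # geometric sequence, e.g. powerlist(10, 2, 4) -> [10, 20, 40, 80]
    return [start * base ** i for i in range(times)]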
The best parameters for the forest model according to the grid search we defined include the mean absolute error criterion, a minimum leaf sample size of three and 80 estimators (trees). With these settings we can again — compared to the single decision tree — increase the training accuracy to 70%.
Support Vector Regressor
The last regressor we are going to build is based on support vector machines, a beautiful mathematical concept developed by Vladimir Vapnik between the 1960s and 90s. Unfortunately, explaining their inner workings would go beyond the scope of this article. Still, I strongly recommend checking them out; one good introductory resource is Professor Winston’s lecture at MIT.
A very rudimentary summary: A support vector regressor tries to fit a hyperplane (in a space whose dimensionality is on the order of the number of features) to the given samples such that as many of them as possible lie within a tube of width epsilon around it, while minimizing the error, or cost, of the samples that fall outside this tube.
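To make this a bit more tangible, here is a tiny toy example, unrelated to the campaign data, that fits an SVR with a linear kernel to synthetic one-dimensional data; points within the epsilon tube around the fitted function incur no loss:
import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(1)
X_toy = np.sort(rng.uniform(0, 10, 50)).reshape(-1, 1)
y_toy = 2 * X_toy.ravel() + rng.normal(0, 1, 50)

# points closer than epsilon to the fitted function do not contribute to the cost
svr_toy = SVR(kernel='linear', C=1.0, epsilon=0.5).fit(X_toy, y_toy)
print(svr_toy.predict([[5.0]]))  # roughly 10 for this synthetic relationship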
Although this type of model is fundamentally different from decision trees and forests, the implementation with Scikit-learn is similar:
from sklearn.svm import SVR

svr_parameters = [{'kernel': ['linear', 'rbf'],
                   'C': helpers.powerlist(0.1, 2, 10),
                   'epsilon': helpers.powerlist(0.01, 2, 10),
                   'gamma': ['scale']},
                  {'kernel': ['poly'],
                   'degree': list(range(2, 5, 1)),
                   'C': helpers.powerlist(0.1, 2, 10),
                   'epsilon': helpers.powerlist(0.01, 2, 10),
                   'gamma': ['scale']}]
svr_grid = GridSearchCV(estimator=SVR(),
                        param_grid=svr_parameters,
                        scoring=mean_relative_accuracy, cv=5,
                        n_jobs=-1, iid=False)
svr_grid_result = svr_grid.fit(X_train_scaled, y_train_scaled)
best_svr_parameters = svr_grid_result.best_params_
svr_score = svr_grid_result.best_score_
We use grid search again to find the optimal values for some of the model parameters. The most important one here is the kernel, which transforms the samples into a higher-dimensional feature space where the data can be separated or approximated linearly, i.e. by the hyperplane described above. We are testing a linear kernel, a polynomial one and a radial basis function. With an epsilon of 0.08, the maximum (scaled) distance between prediction and true value for which no error is counted, and a penalty parameter C of 12.8, the linear kernel performs best, reaching a (scaled) training accuracy of 23%.
Evaluate your models
After we have now identified the best parameters for our models based on the training data at hand, we can use these to eventually predict the outcomes of the test set and calculate their respective test accuracies. First, we need to fit our models with the desired hyperparameters to the training data. This time, we don’t need cross-validation anymore and will fit the models to the full training set. Then we can use the fitted regressors to predict training and test set results and calculate their accuracies.
training_accuracies = {}
test_accuracies = {}
# regressors is a list of the four models configured with their best hyperparameters
for regressor in regressors:
    if 'SVR' in str(regressor):
        # the SVR was trained on scaled data, so scale its inputs and
        # inverse-transform its predictions before scoring
        regressor.fit(X_train_scaled, y_train_scaled)
        training_accuracies[regressor] = hel.mean_relative_accuracy(
            y_scaler.inverse_transform(regressor.predict(X_train_scaled)),
            y_train)
        test_accuracies[regressor] = hel.mean_relative_accuracy(
            y_scaler.inverse_transform(regressor.predict(X_test_scaled)),
            y_test)
    else:
        regressor.fit(X_train, y_train)
        training_accuracies[regressor] = hel.mean_relative_accuracy(
            regressor.predict(X_train), y_train)
        test_accuracies[regressor] = hel.mean_relative_accuracy(
            regressor.predict(X_test), y_test)
Here are the results:
Training accuracies: Linear 0.34, Tree 0.67, Forest 0.75, SVR 0.63
Test accuracies: Linear 0.32, Tree 0.64, Forest 0.66, SVR 0.61
Our best regressor is the random forest with the highest test accuracy of 66%. It seems to be a little overfit though since the deviation from its training accuracy is relatively large. Feel free to experiment with other values for the hyperparameters to further improve all of the models.
Before finally saving our model to make predictions on new data, we will fit it to all available data (training and test set) to incorporate as much information as possible.
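A sketch of this final step, assuming the pre-processed X and y from above, the best_forest_parameters found by the grid search, and joblib for persistence (the file name and the use of joblib are my choices; the repository may handle this differently):
import joblib
from sklearn.ensemble import RandomForestRegressor

# refit the best random forest on all available data (training and test set)
final_regressor = RandomForestRegressor(**best_forest_parameters)
final_regressor.fit(X, y)

# persist the fitted model for later predictions
joblib.dump(final_regressor, 'impression_model.joblib')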
Predict results
Now we have a model ready to predict the results of future marketing campaigns. We only have to call its predict method, passing a specific feature vector, and we will receive the respective prediction for the metric we trained our regressor on. We can also compare the true impressions from the existing dataset to their predictions based on our new model:

The average relative deviation of a prediction from its actual, true value equals 26%, so we reached an accuracy of 74%. The median deviation is even smaller, at only 14%.
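For a single new campaign, the prediction call itself is a one-liner once the feature row has been assembled. The file name below refers to the persistence sketch above, and the feature values are purely illustrative; the row must contain exactly the same one-hot encoded columns as the training matrix X:
import joblib
import pandas as pd

# load the persisted model and build one feature row with all columns set to 0
model = joblib.load('impression_model.joblib')
new_campaign = pd.DataFrame([dict.fromkeys(X.columns, 0)])
new_campaign.loc[0, 'budget'] = 1000  # illustrative values
new_campaign.loc[0, 'days'] = 14
new_campaign.loc[0, 'facebook'] = 1

print(model.predict(new_campaign)[0])  # predicted number of impressions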
Conclusion
We were able to build and train regressors that allow us to predict the number of impressions (and, in an analogous fashion, other performance metrics) for future marketing campaigns from historical campaign data.
The highest prediction accuracy has been achieved with a random forest model.
We can now use these predictions to evaluate a new marketing campaign even before its start. They also allow us to determine the best parameters for our campaigns, such as timeline and budget size, since we can calculate the predictions with different values for those features.
BONUS: Ready-to-use trained model
You don’t (yet) have the data at hand to build an accurate prediction model for your planned digital marketing campaigns? Don’t worry: I have trained multiple models to predict impressions, clicks and purchases with data from more than 1,000 campaigns. With a combination of different models, those predictions reach an accuracy of up to 90%. At predictor.stagelink.com you’ll find a simple app to predict the results of your future campaigns with just a few inputs. The models are mainly trained on data from digital marketing campaigns promoting event ticket sales, so that is where they will likely perform best.

In addition to that you can find all the code used for the discussed marketing performance predictor on my Github: github.com/kinosal/predictor
Thank you for reading – I look forward to any feedback you might have!