There can be a lot of data to sort through.
The goal I set for this project was relatively simple. Given data such as the timestamp of a visitor arriving on our platform, how long would they need to wait to be connected with an agent?
Looking at the data frame manually, it’s pretty obvious we have some insane outliers. Let’s see if they follow any sort of pattern.
A simple scatter plot was hard to read: real-life activity doesn't start at one midnight and end at the next the way a square 0-to-23-hour scatter plot implies. Instead, I converted the hour of each record to radians and plotted the records on a polar plot using matplotlib.
```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

rds = df.loc[:, ['createTime', 'timeTillAccept']].copy()
# Map each record's hour of day onto the circle (24 hours -> 2*pi radians)
rds['createTime'] = rds['createTime'].apply(lambda x: np.pi * ((x.hour + 1) / 12))
rds['timeTillAccept'] = rds['timeTillAccept'].apply(lambda x: x.total_seconds())

sns.set()
plt.figure(figsize=(10, 10))
ax = plt.subplot(111, projection='polar')
ax.set_rmax(rds['timeTillAccept'].max())
ax.set_theta_zero_location('N')
ax.set_theta_direction(-1)
ax.set_title('Time to accept and hour')
angles = np.arange(360, step=15) + 15
angleLabels = np.arange(24) + 1
ax.set_thetagrids(angles, labels=angleLabels)
for index, row in rds.iterrows():
    ax.plot(row['createTime'], row['timeTillAccept'], 'bo')
```
This plot did a pretty good job, clearly showing how customers' wait times gradually shorten as their agents come in for a day's work.
Now that we know what our data that we will be trying to predict looks like, we need to look at what data could be used to predict it. The seaborn pairplot works very well for this, working out of the box as long as you have your data frame’s data-types in a format that seaborn likes.
You can see the cover image of this post for what the raw pair-plot looked like, but after realizing that half the information in that graphic was useless, I cut it down, showing only plots that seemed to have some sort of relationships in them.
Look at them curves
In order for seaborn to graph these without complaining, I used sklearn's LabelEncoder. I'm not sure why this was needed, considering seaborn's API example used string classes… but that's how it turned out.
Labels and Label Encoding Tag
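The label-encoding step looks roughly like this; a minimal sketch, where the state abbreviations are just placeholder values standing in for the real categorical column:

```python
from sklearn.preprocessing import LabelEncoder

# Placeholder categorical values standing in for the real column
states = ['NC', 'SC', 'NC', 'GA']

le = LabelEncoder()
encoded = le.fit_transform(states)  # one integer code per category

print(list(encoded))      # → [1, 2, 1, 0]
print(list(le.classes_))  # → ['GA', 'NC', 'SC'] (classes are sorted)
```

Note that `LabelEncoder` assigns codes in sorted order of the class labels, and `le.inverse_transform` can recover the original strings if needed.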
While the final sections of code didn't require enormous amounts of pain or hours of work, I still feel this wasn't quite as easy as it ought to be. The tools I used in school, such as Weka and Excel, while not as powerful overall, still had pretty robust features for data exploration.
Later, after the project was completely done, I learned about a tool called PixieDust, a Python library by the IBM team behind Watson. It plugs into Jupyter notebooks with code as simple as the following:
```python
import pixiedust
display(data)
```
PixieDust then launches an interactive tool in your notebook cell where you can use its GUI to build a graph, going as far as dragging and dropping x and y variables and choosing the rendering engine (such as matplotlib, seaborn, plot.ly, or Bokeh). While I'd never used Bokeh before, the charts it generated via PixieDust were actually quite nice and had built-in interactivity. In the future, I'll be exploring Bokeh more, especially for its powerful dashboard creation tools.
Easy as pie!
The first, and perhaps easiest, thing I attempted was a simple linear regression. However, this had quite terrible results, as our data was not strictly linear.
Vanilla Linear Regression vs. One with Polynomial Features
My first intuition was to find a way to fit the linear regression to some sort of curve, which the underlying function did seem to follow. Luckily, there is a simple way to do this using sklearn's PolynomialFeatures tool. The graph above compares the plain linear regression against the one with polynomial features, where the green dots are data points from the training set and the black dots are points from the test set.
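The polynomial-feature approach can be sketched as follows; this uses synthetic quadratic data as a stand-in for the real wait-time data, and the degree of 2 is an assumption for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic curved data standing in for the real wait-time data
rng = np.random.default_rng(0)
X = np.linspace(0, 10, 100).reshape(-1, 1)
y = 0.5 * X.ravel() ** 2 + rng.normal(scale=2, size=100)

# Plain linear fit vs. a pipeline that expands X into [x, x^2] first
linear = LinearRegression().fit(X, y)
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)

print(linear.score(X, y))  # R² of the straight-line fit
print(poly.score(X, y))    # noticeably higher R² on the curved data
```

`make_pipeline` keeps the feature expansion and the regression bundled together, so the same transform is applied consistently at predict time.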
While the polynomial-featured regression had a very high R² score, I soon realized that its root mean square error was also very high, and it had some very notable over-fitting issues.
While I believe these two problems of the polynomial-featured regression could be resolved with time, I decided to keep moving through other models before buckling down on one.
After throwing several models at it, I ended up settling on a random forest regression. This sort of model automatically grows a series of decision trees from your training set and combines their predictions; by taking all the trees into account, these models often over-fit less than their single-treed cousins.
Example tree from our random forest.
After deciding on the random forest, I briefly attempted to engineer several new features from the data we already had to improve our scores. However, with data already limited, the model still over-fit these new features and performance actually suffered. The only thing that genuinely helped was cutting down to a single feature and then splitting the model to predict each state separately. With this trade-off we were able to improve performance in almost all states, with a few exceptions (some states have very little data available). In the future, after more data has been collected, I'd like to revisit this problem to see whether the scores can be improved further.
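The per-state split can be sketched like this; the column names (`state`, `hour`, `timeTillAccept`) and the minimum-rows cutoff are hypothetical, since the post doesn't show the real schema:

```python
from sklearn.ensemble import RandomForestRegressor

def train_per_state(df, min_rows=50):
    """Train one forest per state, skipping states with too little data.
    Column names are assumed: 'state', 'hour', 'timeTillAccept'."""
    models = {}
    for state, group in df.groupby('state'):
        if len(group) < min_rows:
            continue  # too few rows: a dedicated model would over-fit badly
        model = RandomForestRegressor(n_estimators=100, random_state=0)
        model.fit(group[['hour']], group['timeTillAccept'])
        models[state] = model
    return models
```

At prediction time you look up the visitor's state in the dictionary and fall back to a global model (or a default estimate) for states that were skipped.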
After finishing our model, it was time to evaluate it (again). This time we trained the model on all the data except the most recent full week, then tested it against that week.
While common evaluations such as RMSE, R², and others could have been used here as well, we decided that raw accuracy wasn't the only thing we cared about. Since we are predicting wait times for real people in real time, it is better to somewhat overestimate a visitor's wait than to underestimate it: a visitor could feel ignored or cheated if their actual wait turned out much longer than the estimate they were given.
So instead of something such as RMSE, which can be hard to interpret in a real-world context, or R², which is pretty much incomprehensible without at least some knowledge of statistics, I decided to evaluate our model purely on the median difference between the predicted and actual values. This way we can see not just the magnitude of the error but also its direction. Below is an example of the charts created to view this sort of error. Note that the Y axis is the difference in seconds.
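The metric itself is a one-liner; a minimal sketch with made-up numbers for illustration:

```python
import numpy as np

def median_signed_error(predicted, actual):
    """Median of (predicted - actual) in seconds.
    Positive: we tend to overestimate the wait (the safer direction).
    Negative: we tend to underestimate it."""
    return float(np.median(np.asarray(predicted) - np.asarray(actual)))

# Made-up wait times in seconds: differences are 20, -5, 50
print(median_signed_error([120, 90, 200], [100, 95, 150]))  # → 20.0
```

Unlike RMSE, this value keeps its sign, so a chart of it per state immediately shows which direction each state's model is biased.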
Median difference of Predicted-Real
Overall, for the majority of cases we are now able to predict a visitor's wait time on our platform with some degree of precision. However, in the coming months our methods for making these predictions could be vastly improved. For the time being, we will wait to collect more data before trying different techniques.
Do you want to read more of botsplash team contributions? Check out articles [here](https://medium.com/botsplash-engineering).
For more articles on Live Chat, Automated Bots, SMS Messaging and Conversational Voice solutions, explore our [blog](https://blogs.botsplash.com/).
Botsplash is an innovative digital messaging software platform with the ability to connect agents and customers across any digital channel. To win and keep a customer's business, businesses must be able to connect with customers in a meaningful way using websites, social media, text, and email. Botsplash helps businesses adopt a digital strategy with the right balance of Live Chat and Automation.