Dataset Introduction: San Francisco 311 Cases
Provided by DataSF, this set is a monstrosity containing about 4.25 million rows and counting. For those not familiar, 311 is a general customer service number for city government, most commonly associated with non-emergency complaints. 311 cases can be created via phone, mobile app, and the web. The dataset covers the period from July 2008 to the present.
In order to do data exploration and analysis of this dataset, we needed to make a working sample to reduce memory and CPU usage upfront:
awk 'BEGIN {srand()} !/^$/ { if (rand() <= .01 || FNR==1) print $0}' filename
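A roughly equivalent approach in pandas, if you prefer to sample while reading (a sketch; the filename is a placeholder for the full DataSF export):
import random
import pandas as pd
# Keep the header row plus roughly 1% of the data rows while reading.
sample = pd.read_csv('311_Cases.csv', skiprows=lambda i: i > 0 and random.random() > 0.01)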
Further information about the dataset can be found here.
One of the fascinating aspects of city living is human behavior.
We’re going to look at what people complain about.
One behavior every human seems to enjoy is complaining, in some shape, way, or form. Whether it's done in exercise of the right to recourse, for entertainment, or out of civic duty, we encounter a situation that makes us uncomfortable and… in SF, we open a 311 case on it. You can even do it via Twitter! I wanted to focus on the subset of cases that reflect the spirit of 'complaints', in particular those concerned with the behavior of others, and that usually require some sort of short-term physical response to resolve.
Sadly, there aren't many cases filed to commend folks for good behavior, so we will be looking at mostly negative or unpleasant situations here. Accordingly, we have attempted to exclude cases concerning general administrative requests, such as building permits, tree maintenance, and the like. I also chose to exclude the Muni category, despite it being filled with negative comments, since Muni (the city's bus and train operator) is its own organization, with its own culture and union, that I don't care to upset by pointing out the exceedingly high volume of complaints.
From my personal experience, once a 311 case is filed, as with many government functions, what happens next is a mystery. It goes into a black box, and we can only guess whether or when the matter will be addressed. This can be very frustrating for the complainant, and in turn likely results in corresponding frustration for the people inside the black box, who receive many repeated complaints each day.
If only there were a way to predict how long it would take to resolve each issue…
Well, luckily, there are ways, in particular statistical learning models, and we shall see what fruit they may bear.
Data processing highlights (a sketch of the datetime handling follows this list):
- encode, convert or cast every single feature
- extreme values
- malformed data
- datetime index
- 4.2M rows
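A minimal sketch of what some of that cleanup might look like on the sampled frame (column names follow the DataSF export; the actual steps in the project differ in detail):
# Malformed timestamps become NaT instead of raising errors.
for col in ['Opened', 'Closed', 'Updated']:
    sample[col] = pd.to_datetime(sample[col], errors='coerce')
# Cast repeated string labels to categoricals to save memory.
sample['Category'] = sample['Category'].astype('category')
# Index the frame on case creation time.
sample = sample.set_index('Opened', drop=False).sort_index()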
MUNI - A Short Detour
Here are the top cases filed against the city's public transportation operation:
MUNI - Conduct_Inattentiveness_Negligence 0.240359
MUNI - Services_Service_Delivery_Facilities 0.160987
MUNI - Conduct_Discourteous_Insensitive_Inappropriate_Conduct 0.112108
MUNI - Conduct_Unsafe_Operation 0.078027
MUNI - Services_Miscellaneous 0.055157
MUNI - Services_Service_Delivery_Facilities 0.051121
MUNI - Services_Service_Planning 0.046188
MUNI - 0.045291
MUNI - Conduct_Discourteous_Insensitive_Inappropriate_Conduct 0.039462
MUNI - **Commendation** 0.039013
MUNI - Conduct_Inattentiveness_Negligence 0.035426
MUNI - Services_Miscellaneous 0.031839
MUNI - Conduct_Unsafe_Operation 0.024664
MUNI - Services_Service_Planning 0.021525
MUNI - Services_Criminal_Activity 0.016592
MUNI - Services_Criminal_Activity 0.001794
So…. I was wrong about it being ALL bad…
Fully 3% of the MUNI cases are commendations for MUNI employees. Sounds about right. I would draw your attention to the subject of the remaining cases but I promised not to bash them…
Anyway, back to the future:
We want to predict how long it will take for our case to be resolved, but that feature didn't exist.
It was created by finding the time difference between the Opened and Closed timestamps. We will call this the Time To Resolution, or ttr for short:
wrangler.make_feature('ttr',['Opened','Closed'],lambda x : x[1]-x[0])
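Under the hood this is just a datetime subtraction; a roughly equivalent plain-pandas version:
# Time To Resolution as a Timedelta; still-open cases (no Closed timestamp) yield NaT.
wrangler.working['ttr'] = wrangler.working['Closed'] - wrangler.working['Opened']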
In order to extract more predictive power from the past data, we create another feature, called workload, which is simply the number of cases that were open at the time of case creation.
def calc_open_cases(sometime):  # input: a timestamp
    df = wrangler.working[['CaseID', 'Opened', 'Closed']]
    opened_prior = df['Opened'] < sometime           # cases opened before that time,
    not_closed_prior = ~(df['Closed'] < sometime)    # that were not closed before it,
    open_at_thattime = opened_prior & not_closed_prior  # and were therefore still open
    return open_at_thattime.sum()
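Applying this per case gives the workload feature. A sketch (note that this scans the whole frame once per case, so it is only practical on the sample):
# Workload at the moment each case was opened.
wrangler.working['workload'] = wrangler.working['Opened'].apply(calc_open_cases)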
Now let's look more at our features.
Data Exploration:
Daily Case Statistics
- upward trend in new cases, workload
- non-monotonic increase
- extreme values in TTR
Monthly Case Statistics
- use groupby() and agg() (see the sketch after this list)
- mean ttr shows a possible artifact created by dropping cases still left open at the end of the dataset
- NOTE: ttr is ‘looking ahead’; it contains information about a future state
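A sketch of the monthly aggregation (assuming the working frame is indexed by the Opened timestamp):
# One row per calendar month: case volume and mean resolution time.
monthly = df.groupby(pd.Grouper(freq='M')).agg(
    new_cases=('CaseID', 'count'),
    mean_ttr=('ttr', 'mean'))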
How can we model ttr and predict the resolution time of a newly opened case?
Since the variance of the target is so high, and considering the wide range of request types being serviced, it seems reasonable to divide response time into a few classes. Doing so, we may discover that certain types of requests tend to be resolved faster in general, or that there are other presently unknown factors that influence how soon our complaints are attended to.
We can explore the boundaries for the classes.
Everyone likes a low, medium, high system. We also preview the effect of dropping extreme values.
Assign the classes according to (my) common sense heuristic:
- ‘low’ ttr = 4 hr or less
- ‘med’ ttr = between 4 and 40 hours
- ‘high’ ttr = anything above 40 hrs
I am loath to drop rows if they don't represent bad data, and there is justification for some of the long ttrs.
df = wrangler.working.copy()
df.ttr = df.ttr.dt.total_seconds() / 3600  # convert ttr from a Timedelta to hours
df[df.ttr > xb].Category.value_counts().head(6)  # xb: candidate cutoff (in hours) for 'extreme' ttr values
- Graffiti 1873
- Street and Sidewalk Cleaning 610
- Other 370
- Damaged Property 348
- Encampments 232
- Street Defects 163
So we will add the target classes as above, without dropping rows, yielding the following balance:
high 0.450375
med 0.285966
low 0.263659
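For reference, one way to encode this heuristic with pandas (a sketch, assuming ttr has already been converted to hours as in the snippet above):
import numpy as np
# Bin ttr (hours) into the three classes defined above.
df['ttr_class'] = pd.cut(df.ttr, bins=[-np.inf, 4, 40, np.inf], labels=['low', 'med', 'high'])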
Time to predict the future….
We have simplified our problem to predicting whether the case we open today will be resolved within one of three time intervals, making this a classification problem. Our first approach will be to fit a simple logistic regression, followed by some tree-based classifiers. At this point we'll drop CaseID, Status, Status Notes, and any timestamped columns that can either uniquely identify the case or contain information that was not known at the time of creation. We can retain features like hour of the day and day of week. Like so:
features = df.columns.drop(['CaseID','Opened','Closed','Updated','Status',
'Status Notes','ttr','workload','ttr_class']) #drop time or status related columns
and then:
# ...setup stuff...
logreg.fit(X_train, y_train)
y_val_pred = logreg.predict(X_val)
y_pred_proba = logreg.predict_proba(X_val)  # needed for the ROC curve
y_test_pred = logreg.predict(X_test)
accuracy_score(y_val, y_val_pred), accuracy_score(y_test, y_test_pred)
(0.4929255097794424, 0.4899284168470118)
… and survey says, LogReg not so hot.
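As a point of reference, the 'baseline Accuracy' reported with the tree results further below appears to be the majority-class frequency: 'high' makes up about 45% of cases (see the class balance above), so always guessing 'high' scores about 0.45. A sketch of that calculation (assuming y_train holds the training targets):
# Majority-class baseline: always predict the most common class.
baseline_acc = pd.Series(y_train).value_counts(normalize=True).max()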
With validation and test accuracy both around 0.49, this certainly sets a baseline for further model evaluation. Here are the results of several iterations of various tree models:
decisiontreeclassifier
Training Accuracy: 0.9958901258974092
Validation Accuracy: 0.5624219725343321
Test Accuracy: 0.5568503412685201
baseline Accuracy: 0.45037456300982187
randomforestclassifier
Training Accuracy: 0.7205805847466444
Validation Accuracy: 0.643986683312526
Test Accuracy: 0.6435824870983852
baseline Accuracy: 0.45037456300982187
xgbclassifier
Training Accuracy: 0.6871293309749246
Validation Accuracy: 0.6437786100707449
Test Accuracy: 0.6445813217912435
baseline Accuracy: 0.45037456300982187
precision recall f1-score support
high 0.67 0.91 0.77 2633
low 0.64 0.69 0.66 1590
med 0.52 0.22 0.30 1784
accuracy 0.64 6007
macro avg 0.61 0.60 0.58 6007
weighted avg 0.62 0.64 0.60 6007
OK, that's a bit better; trees win.
Although we are much better at predicting the majority class, ‘high’, it's not that amazing. Here is a visual representation of the classification results in the form of a confusion matrix:
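The matrix itself can be produced with scikit-learn; a sketch (assuming y_val_pred here holds the validation predictions of the best tree model):
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_val, y_val_pred, labels=['high', 'low', 'med'])
ConfusionMatrixDisplay(cm, display_labels=['high', 'low', 'med']).plot()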
Why can’t our models achieve better performance with all this data?
For one, the span of time our data covers is vast… Recall the trends observed in the data exploration: we see at least two types of motion, one trending linearly upward, and the other oscillating at various frequencies. Unfortunately, our simplistic treatment of the data is likely confounding the models. What other influences could be at play?
- Are there seasonal trends for some types of cases?
- Are external factors, like population growth, relevant?
- Public sentiment and news about 311?
All of the above seem likely. Separating out factors like seasonality and other multi-year cycles from the underlying trend is part of another methodology: time series analysis. While I don't have the time or experience yet to fully elaborate on this, we can use some of the tools already demonstrated to set the stage.
Let us revisit the ‘workload’ feature as a function of time:
Linear regression lite, using Plotly Express with an OLS trendline:
Without doing much more than a simple scatter plot of case workload, we can visualize a regression line that captures the general upward trend inside of the undulation.
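The plot itself is essentially a one-liner in Plotly Express (a sketch, using the daily frame df_dl with day_offset and workload columns that appears in the next code block):
import plotly.express as px
# Scatter of daily workload with an ordinary-least-squares trendline overlaid.
fig = px.scatter(df_dl, x='day_offset', y='workload', trendline='ols')
fig.show()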
The regression line looks reasonable, but let's fit our own model and evaluate performance:
Linear Regression in detail:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# df_dl: one row per day, with a day_offset counter and the daily workload
train, test = train_test_split(df_dl, test_size=.25)
X_train = train.day_offset.values.reshape(-1, 1)
y_train = train.workload.values.reshape(-1, 1)
X_test = test.day_offset.values.reshape(-1, 1)
y_test = test.workload.values.reshape(-1, 1)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(model.coef_, model.intercept_)  # same slope calculated by px above
print(model.score(y_pred, y_test), mean_absolute_error(y_pred, y_test))  # clearly regression leaves something to be desired
Slope: 0.09857098  Intercept: -11.25573084  R-squared: -2.044872940246718  MAE: 43.194449250712374
Linear Regression Model Coeffs and Error:
The sklearn linear model yields a slope that is the same as that calculated by Plotly Express above: 0.098. Note that the intercept is slightly shifted, but we are fitting on a random sample here vs. the entire dataframe above. We used both the model's .score() method and mean_absolute_error to estimate how well the linear model fits the data. The negative R2 value of about -2.0 tells us that our model is actually doing worse than a mean baseline.
Will a tree model perform better?
Based on the R-squared metric, linear regression isn't really working well for this data. There is just too much variance that isn't accounted for. Let us see if a tree model can better handle the non-monotonic changes in workload. Using the same train/test split as above, we only have to shape the arrays differently before feeding them to the tree ensemble.
Got trees? Let’s fit some more models and try to see the forest..
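A sketch of the forest fit behind the interactive plot (max_depth is the parameter wired to the slider described below; the other settings here are illustrative):
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
# Same random train/test split as above; note the 1-D target via ravel().
forest = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)
forest.fit(X_train, y_train.ravel())
print(mean_absolute_error(y_test, forest.predict(X_test)))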
It is evident that the tree-based regressor has no problem overfitting the wild variance of the daily workload. Using the slider to control the max_depth parameter, we can see the effect on fit and obtain the best MAE. But are we done? Can this model really predict future workload? Since we did random sampling of our data, we would expect the fitted model to predict well on the test set, provided we have enough data points in each set.
Time-Based Split
To test our Random Forest model further, we can create a time-based split. The datetime index is sorted, and the model is trained on the first 80% of the observations along the time axis, with the remaining 20% held out for test. I used the training-set % slider to observe the effect of changing the split on the predictions. Further, out of the original train set, we take a random 80% sample and use the remainder for validation. Prepare to be amazed…
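A sketch of that split (assuming df_dl carries a sorted datetime index):
# Train on the first 80% of days, hold out the most recent 20% for test.
df_dl = df_dl.sort_index()
cutoff = int(len(df_dl) * 0.8)
train, test = df_dl.iloc[:cutoff], df_dl.iloc[cutoff:]
X_train, y_train = train[['day_offset']], train['workload']
X_test, y_test = test[['day_offset']], test['workload']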
Wow this tree model is very smart??!
It seems that while we can easily fit the wild variance of the training data, when it's time to make a prediction about the future its best guess is a horizontal line. While this seems counterintuitive at first, recall the negative R2 we saw with the linear model above, telling us that it was actually doing worse than a mean baseline estimate.
Where do we go from here? Let’s roll up some windows.
Now that we have seen the deficiencies of these regression models, we need some means to better interpret the data. If there appears to be a periodic fluctuation, we can use a rolling average, or moving mean, to smooth out short-term variance and visualize larger trends.
Rolling Means for Daily Workload
To see through some of the daily variance and look for seasonal effects, we can calculate rolling means with various window sizes. After doing so, we can see that there is a slight upward linear trend in the daily workload, but with the extreme smoothing of a 360-day window (the red line) there also appears to be a multi-year cycle, perhaps related to staffing and/or process changes, that reduces the workload for a time before the upward trend continues.
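A sketch of the smoothing (the 360-day window is the heavily smoothed red line described above; the shorter window is illustrative):
# Rolling means of daily workload; larger windows smooth more aggressively.
smooth_30 = df_dl['workload'].rolling(window=30, center=True).mean()
smooth_360 = df_dl['workload'].rolling(window=360, center=True).mean()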
In contrast, the mean time to resolution (ttr) in hours shows extreme fluctuations, but perhaps without a clear upward trend. We’ll look at this more later. The takeaway is that after smoothing the function, we can see larger patterns.
Autocorrelation, Stationarity, and Autoregressive Processes
In a nutshell, these are measures of how much the value of a variable depends on its previous values.
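The stationarity check behind the number below is presumably an Augmented Dickey-Fuller test; a sketch using statsmodels on the daily workload series:
from statsmodels.tsa.stattools import adfuller
# adfuller returns (test statistic, p-value, ...); the null hypothesis is a unit root (non-stationarity).
result = adfuller(df_dl['workload'].dropna())
print('p value is', result[1])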
p value is 0.41090577512001386
Since the p value is larger than the significance level of 0.05, we cannot reject the null hypothesis that the time series is non-stationary. Thus, the data is likely non-stationary and would need differencing or detrending before autoregressive modeling,
and we don't have time for that… not yet.
Conclusions
As an initial attempt to predict the time to resolve 311 cases, we were mildly successful. On deeper analysis, the targets we are interested in predicting depend both on time and on their own previous values. This type of relationship looks like 'leakage' when viewed in isolation, but it is a necessary feature of the real world we are trying to model. We have established that this type of problem is most likely a fit for time series analysis. Next steps are to increase the sampling rate of the original data and verify the applicability of methods for removing seasonality. Some subjects and resources that I am looking forward to learning more about and using in future iterations of this project:
- Geospatial analysis (kNN clustering) for prediction based on proximity
- TRAMO / ARIMA / SEATS in Python