Sunday, January 7, 2018

Coursera Communicating Data Science Results Assignment 1

Primary Finding: Violent Crime Events in Seattle Are Strongly Associated With Distance From Downtown and Hour of the Day.  The Two Variables Show a Strong "Inverse Relationship" in Predicting Violent Crime Events.


As shown below, reported crimes in Seattle are heavily concentrated in the downtown area.  The suburban areas, particularly in the southwest metro area, are far less likely to experience crime in general.


The Seattle crime data set includes a location grouping variable, "District", which subdivides the metro area into 19 more-or-less distinct geographic units.  The downtown district, "K", is within the black rectangle.  The height of the points on the graph shows the relative number of reported crimes.



While there appears to be a strong geographic element to the count of reports, the time of day is also a key factor in criminal activity.  To explore the relationship between location and time, I chose to split the data into three broad categories: violent crime (homicide, assault, etc.), drug/alcohol/prostitution ("vice"), and all other reports (theft, traffic, etc.).
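The three-way split described above can be sketched as a simple keyword lookup.  This is only an illustration: the keyword lists and the `categorize` function are hypothetical names, not the labels or code actually used for this analysis, and the real offense descriptions in the Seattle data may need a longer list.

```python
# Hypothetical keyword lists; the real offense labels in the data may differ.
VIOLENT_KEYWORDS = ("HOMICIDE", "ASSAULT")
VICE_KEYWORDS = ("NARCOTICS", "DRUG", "ALCOHOL", "LIQUOR", "DUI", "PROSTITUTION")


def categorize(offense_description):
    """Map a free-text offense description to one of three broad categories."""
    label = offense_description.upper()
    if any(keyword in label for keyword in VIOLENT_KEYWORDS):
        return "violent"
    if any(keyword in label for keyword in VICE_KEYWORDS):
        return "vice"
    # Everything else (theft, traffic, etc.) falls into the catch-all group.
    return "other"
```

Violent keywords are checked first, so a description that matches both lists is counted as violent.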

The plots below (time of the event, top; time of report to police, bottom) show the probability densities of these categories for every minute of the day.  Vice crimes (black line) have three clear peaks: one around noon, the next between 9 and 10 PM, and a smaller one in the early morning hours.  Violent crime (green) also has a peak in the early morning and another between 9 and 10 PM, but has no similar midday peak.  Other crime stays fairly stable between 8 AM and 8 PM, tapering off in the early morning and late evening.
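A per-minute density like the one described above can be approximated without any plotting library.  The sketch below is an assumption about the approach, not the code behind the plots: it counts events per minute of the day and smooths them with a circular moving average (`minute_density` and the 30-minute `window` are illustrative choices).

```python
from datetime import datetime

MINUTES_PER_DAY = 24 * 60


def minute_density(timestamps, window=30):
    """Smoothed share of events falling in each minute of the day (sums to 1)."""
    counts = [0] * MINUTES_PER_DAY
    for ts in timestamps:
        counts[ts.hour * 60 + ts.minute] += 1
    total = sum(counts)
    if total == 0:
        raise ValueError("no timestamps supplied")
    half = window // 2
    width = 2 * half + 1
    density = []
    for minute in range(MINUTES_PER_DAY):
        # Wrap around midnight so 23:59 and 00:00 are treated as neighbours.
        in_window = sum(
            counts[(minute + offset) % MINUTES_PER_DAY]
            for offset in range(-half, half + 1)
        )
        density.append(in_window / (width * total))
    return density
```

Running this once per category and overlaying the three curves would give a rough equivalent of the plots; a kernel density estimate would produce smoother lines.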


The similarities between vice and violent crime extend to the geographic dimension as well.  The density plot below shows nearly identical peaks for the two categories in areas approximately 2.5 miles from the downtown city center.
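Distance from downtown can be computed from each report's latitude and longitude.  The sketch below uses the haversine great-circle formula; the `DOWNTOWN` reference point is an assumed approximate coordinate for central Seattle, not necessarily the one used for the plot.

```python
import math

EARTH_RADIUS_MILES = 3958.8
# Assumed reference point for "downtown"; the analysis may have used another.
DOWNTOWN = (47.6062, -122.3321)


def miles_from_downtown(lat, lon, center=DOWNTOWN):
    """Great-circle (haversine) distance in miles from the reference point."""
    lat1, lon1 = math.radians(center[0]), math.radians(center[1])
    lat2, lon2 = math.radians(lat), math.radians(lon)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))
```

At city scale the difference between haversine and a flat-earth approximation is negligible, so either would serve for this purpose.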




To explore the relationship between distance from downtown and hour of the day, I created a dataset aggregating the count of reports for the three categories for every hour of the day between 6/1/2014 and 8/31/2014.  The left plot shows the significance of the following variables in predicting the count of violent crimes.  Average distance from downtown (bottom left) is by far the strongest predictor, followed by the count of violent crimes in the previous 2 hours and the hour of the day.  I added an interaction term (Hour of the Day * Distance from Downtown) to evaluate how the two variables were related.  Interestingly, this term was significant at the 99% level, and its coefficient was negative.  This implies that the greater the distance from downtown and the later the hour of the day, the lower the expected count of violent crimes in Seattle.

AvgDistanceFromDownTownViolentEvents, CountDrugsAlcoholProstEvents, CountDrugsAlcoholProstEvents_Prior_2_Hours, CountDrugsAlcoholProstEvents_Prior_4_to_2_Hours, CountDrugsAlcoholProstEvents_Prior_6_to_4_Hours, CountViolenceEvents_Prior_2_Hours, CountViolenceEvents_Prior_4_to_2_Hours, CountViolenceEvents_Prior_6_to_4_Hours, HourDateEvent, TimeDistanceInteraction
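As an illustration of how a few of the variables listed above might be built, the sketch below aggregates violent events by hour and adds the prior-2-hour count and the interaction term.  It is an assumption about the construction, not the original code: `hourly_features`, the `CountViolenceEvents` target name, and the reading of `HourDateEvent` as the hour of day (0 to 23) are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def hourly_features(events):
    """events: iterable of (timestamp, category, miles_from_downtown) tuples."""
    violent_count = defaultdict(int)
    violent_miles = defaultdict(list)
    for ts, category, miles in events:
        if category == "violent":
            hour = ts.replace(minute=0, second=0, microsecond=0)
            violent_count[hour] += 1
            violent_miles[hour].append(miles)

    rows = []
    # Hours with no violent events are skipped here because their average
    # distance is undefined; a full dataset would need a rule for them.
    for hour in sorted(violent_count):
        avg_miles = sum(violent_miles[hour]) / len(violent_miles[hour])
        prior_2h = sum(
            violent_count.get(hour - timedelta(hours=k), 0) for k in (1, 2)
        )
        rows.append({
            "HourDateEvent": hour.hour,
            "CountViolenceEvents": violent_count[hour],
            "AvgDistanceFromDownTownViolentEvents": avg_miles,
            "CountViolenceEvents_Prior_2_Hours": prior_2h,
            # Interaction term: hour of the day times distance from downtown.
            "TimeDistanceInteraction": hour.hour * avg_miles,
        })
    return rows
```

The remaining lag variables (4-to-2 and 6-to-4 hours, and the vice counts) would follow the same pattern with different offsets and categories.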



To test how these predictors might change for the area closest to downtown, District K, I subset the data to that district and re-ran the regression.  Again, the distance from downtown was by far the most significant variable, but the interaction coefficient was ~17 times as large as in the model above; it was still negative, and significant at the 95% level.  As an example of the effect, take two areas at 1 AM, one ~0.18 miles from downtown and the other ~0.09 miles away: the expected count of violent crimes reported for the more distant area would be ~0.11, while the prediction for the less distant area would be ~0.07.  For the original model based on all districts, the corresponding predictions would be ~0.03 and ~0.02, demonstrating the effect of the larger coefficient.
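To make the role of the interaction coefficient explicit, here is a sketch of the comparison under the assumption of a linear specification (the symbols are mine; if the regression used a log link, the same expression would apply to the log of the expected count instead).  With $d$ the distance from downtown and $h$ the hour of the day:

```latex
\hat{y}(d, h) = \beta_0 + \beta_d\, d + \beta_h\, h + \beta_{hd}\, (h \cdot d) + \dots

\hat{y}(d_1, h) - \hat{y}(d_2, h) = \left(\beta_d + \beta_{hd}\, h\right)\,(d_1 - d_2)
```

Holding the hour and the other predictors fixed, the gap between two locations depends on $\beta_d + \beta_{hd}\, h$, so a negative $\beta_{hd}$ shrinks the effect of distance as the hour gets later, and a larger coefficient in absolute value makes that adjustment stronger.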