Recently Google’s BigQuery came across my radar, so I decided to get hands-on with it and see what all the buzz was about. The US traffic fatality records database looked interesting because it contains information on all traffic accidents in the US in which at least one person died. No doubt a sad topic, but also one that is quite interesting in terms of how understanding the trends could potentially help prevent future accidents. Not only was I able to zero in on some pretty neat insights, but I was also able to utilize some Python to visualize my findings.

BigQuery is a Google Cloud product for storing and accessing large databases super quickly. If you’re interested in experimenting more with BigQuery, then you should definitely head over to Kaggle, where you can explore some of the data sets they have available.

Set Up

Setting up for a BigQuery notebook is actually quite simple thanks to the fine folks at Kaggle. They created a Python package called bq_helper, which has some helper functions for getting data out of BigQuery.

Data analysis with SQL and Python - BigQuery Helper
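In a Kaggle notebook the package is already installed, so bringing it in is a one-liner. A minimal sketch:

    # bq_helper wraps the google.cloud.bigquery client with a few convenience methods
    import bq_helper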

Once you’ve imported the helper package, the next step is to create a BigQueryHelper object that points to the specific data set you want to use. Below is the BigQueryHelper object I created, which I appropriately named ‘accidents’. The active_project argument takes the BigQuery project information, which is currently “bigquery-public-data” for all of the BigQuery data sets on Kaggle. The dataset_name argument takes the name of the data set, in this case “nhtsa_traffic_fatalities”.

BigQuery Helper Object - traffic fatalities
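Here is a sketch of what that object creation looks like, using the argument values described above:

    # Point a helper object at the NHTSA traffic fatalities data set
    accidents = bq_helper.BigQueryHelper(active_project="bigquery-public-data",
                                         dataset_name="nhtsa_traffic_fatalities")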

With those two pieces of code your environment is ready to go! It’s time to get analyzing!


Analysis

A good starting point for getting a feel for a new data set is to look at the schema, which is simply a description of how the data is organized and structured. Below is the list of tables in the accidents BigQueryHelper object that I created in the previous step.

List of tables in a BigQueryHelper object
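A sketch of the call that produces that list, using the helper’s list_tables method:

    # Returns the names of all tables in the data set
    accidents.list_tables()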

Once you have a feel for the tables in the data set, try digging deeper into a specific table to get more information on the columns within that table. I chose the accident_2015 table.

Table column data in BigQuery
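A sketch of how to pull that column information, via the helper’s table_schema method:

    # Returns one SchemaField per column in the accident_2015 table
    accidents.table_schema("accident_2015")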

The SchemaField gives us important information about a specific column, such as:

  • The name of the column
  • The datatype in the column
  • Whether or not the column is NULLABLE
  • And a description of the data in that column


So for the state_name column we have the following schema field:

SchemaField('state_name', 'string', 'NULLABLE', "This data element identifies the state in which the crash occurred.", ())

To take this a step further, let’s look at the first few rows of the accident_2015 table.

Sample table rows in BigQuery
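A sketch of that preview, using the helper’s head method (which avoids scanning the full table):

    # Fetch just the first few rows of accident_2015
    accidents.head("accident_2015", num_rows=5)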

OK, once you’ve done some exploratory research and gotten a feel for the data set, it’s time to have some fun writing SQL queries.


The first thing I was interested in identifying was which day of the week the most fatal accidents happen on. So I wrote a query to get the count of unique IDs (consecutive_number in this data set), which I aliased as Count, as well as the day of the week for each accident, which I aliased as Day. Finally, I sorted the table in descending order based on Count.

Writing a query and sorting a table in BigQuery
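Here’s a sketch of that query. The query text and the names query1 and accidents_by_day are my reconstruction, and I’m assuming the crash time lives in a timestamp_of_crash column:

    query1 = """
             SELECT COUNT(consecutive_number) AS Count,
                    EXTRACT(DAYOFWEEK FROM timestamp_of_crash) AS Day
             FROM `bigquery-public-data.nhtsa_traffic_fatalities.accident_2015`
             GROUP BY Day
             ORDER BY Count DESC
             """
    # query_to_pandas_safe cancels the query if it would scan more than 1 GB
    accidents_by_day = accidents.query_to_pandas_safe(query1)
    accidents_by_day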

Just a little clarification: Sunday is the first day of the week (Day 1) in the data set and Saturday is the last (Day 7), so according to the information returned from this query, Saturday is the most dangerous day to be out on the road and Tuesday is the day with the fewest fatal accidents.


Next I imported matplotlib.pyplot so I could show this information graphically.

Import plotting library into BigQuery
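A sketch of the plot, assuming the accidents_by_day DataFrame from the query sketch above:

    import matplotlib.pyplot as plt

    # The DataFrame is sorted by Count, so the x-axis is rank rather than day number
    plt.plot(accidents_by_day.Count)
    plt.title("Number of Accidents by Rank of Day")
    plt.show()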

Next, I was interested in which hours of the day tend to be the most dangerous in terms of fatal accidents.

Below is the query. Again, I selected the count of the unique ID of each accident, extracted the hour of the day the accident occurred, grouped by hour, and ordered by count in descending order.

I ran ‘query3’, assigned the results to a DataFrame called accidents_by_hour, and then called the DataFrame to display the results.

Calling a DataFrame to run a query in BigQuery
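Here’s a sketch of query3; as above, the timestamp_of_crash column is my assumption:

    query3 = """
             SELECT COUNT(consecutive_number) AS Count,
                    EXTRACT(HOUR FROM timestamp_of_crash) AS Hour
             FROM `bigquery-public-data.nhtsa_traffic_fatalities.accident_2015`
             GROUP BY Hour
             ORDER BY Count DESC
             """
    accidents_by_hour = accidents.query_to_pandas_safe(query3)
    accidents_by_hour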

This data is recorded as hours 00-23, so if I’m interpreting it correctly, the three deadliest hours to be on the road are 6:00pm, 8:00pm, and 5:00pm, which makes sense, as those are times when people are heading home from work, running errands after the workday, etc.

The most deadly hours to be on the road

Instead of another line graph, I imported seaborn and created a barplot to visualize the information and added a splash of color.

Importing seaborn to create a barplot for data visualization in BigQuery
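A sketch of the barplot, assuming the accidents_by_hour DataFrame from above (the palette choice is mine):

    import matplotlib.pyplot as plt
    import seaborn as sns

    # Bar chart of fatal accidents per hour; the palette adds the splash of color
    sns.barplot(x=accidents_by_hour.Hour, y=accidents_by_hour.Count, palette="coolwarm")
    plt.title("Fatal Accidents by Hour of Day")
    plt.show()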

Finally, I wanted to see which states had the most hit and runs, so I could avoid them in the future.

Query to identify hit and run accidents by state
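A sketch of that query. I’m assuming the vehicle_2015 table, its hit_and_run flag, and registration_state_name for the state (which would also explain the “Unknown” bucket below); query4 and hit_and_runs are my names:

    query4 = """
             SELECT registration_state_name AS State,
                    COUNT(hit_and_run) AS Count
             FROM `bigquery-public-data.nhtsa_traffic_fatalities.vehicle_2015`
             WHERE hit_and_run = "Yes"
             GROUP BY State
             ORDER BY Count DESC
             """
    hit_and_runs = accidents.query_to_pandas_safe(query4)
    hit_and_runs.head(10)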

Here are the top 10 states with the most hit and runs. Your guess is as good as mine as to what the “Unknown” grouping is, but for the year 2015 California had the most hit and runs, followed by Florida and Texas.

List of the states where the most hit and run accidents occur

I hope you enjoyed following along with this short data analysis exercise, and hopefully it sparked your interest in BigQuery. Once again, if you’re interested, head over to Kaggle to check out some of the BigQuery data sets they have available. As always, if you have any questions or comments, please shoot me an email at:

Kristi Smith,

ksmith@ecapitaladvisors.com