In this fun project I break down the behaviour of Americans on Thanksgiving as a way to dive deep into pandas! I use this dataset by DataQuest, which contains responses to an online survey about what Americans eat for Thanksgiving dinner. Each respondent was asked what they typically eat for Thanksgiving, along with some demographic questions such as gender, income, and location. This dataset lets us discover regional and income-based patterns in what Americans eat for Thanksgiving dinner.

  • The dataset is stored in the thanksgiving.csv file.
  • The dataset has 65 columns and 1058 rows.
  • Most of the column names are questions.
  • Most of the column values are string responses to the questions.

KEY RESULTS


  • Most people (>99%) ate Turkey!
  • Most people had multiple kinds of pie (~60%), followed by only Pumpkin pie (~20%) and only Apple pie (~8%).
  • The majority of the high-income group celebrates at home, while the low-income group travels.

Please continue reading if you want to go through the step-by-step Python tutorial that generates these results. I have also linked my Jupyter Notebook here. Let’s begin by importing the relevant modules and loading the data file.

# Importing relevant modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline 
import re
plt.style.use('ggplot')
plt.rcParams.update({'font.size': 16})
# Reading data
data = pd.read_csv("thanksgiving.csv", encoding="Latin-1")
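
Right after loading, it is worth confirming that the shape and column names match the description above. The sketch below uses a tiny in-memory stand-in for thanksgiving.csv (the three rows and the `RespondentID` column are made up for illustration), so it runs without the real file; on the real `data` frame, `data.shape` should report the row and column counts quoted earlier.

```python
import io
import pandas as pd

# Tiny stand-in for thanksgiving.csv (hypothetical rows, for illustration only)
csv_text = (
    "RespondentID,Do you celebrate Thanksgiving?,Age\n"
    "1,Yes,18 - 29\n"
    "2,No,60+\n"
    "3,Yes,45 - 59\n"
)
sample = pd.read_csv(io.StringIO(csv_text))

# shape is a (rows, columns) tuple; columns holds the survey questions
n_rows, n_cols = sample.shape
column_names = list(sample.columns)
print(n_rows, n_cols)
print(column_names)
```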

What do people eat for Thanksgiving?

  • Since we want to understand what people ate for Thanksgiving, we’ll first remove any responses from people who don’t celebrate it. The column ‘Do you celebrate Thanksgiving?’ contains this information, and we only want to keep data for people who answered ‘Yes’ to this question.
  • Let’s explore what main dishes people tend to eat during Thanksgiving dinner. We can use the value_counts method to help us with this.
  • “Surprise!” Most people ate Turkey!
# Boolean mask: True for respondents who celebrate Thanksgiving
yes_celebrating = data['Do you celebrate Thanksgiving?'] == 'Yes'

# Keep the rows for which [Do you celebrate Thanksgiving?] = Yes
# (.copy() avoids a SettingWithCopyWarning when new columns are added later)
data = data[yes_celebrating].copy()
# Count how many times each category occurs (sorted by frequency)
dish_type = data['What is typically the main dish at your Thanksgiving dinner?'].value_counts()

# Now make a bar chart
plt.figure(figsize=(12,4))
dish_type.plot(kind='bar', color=['blue', 'magenta'])
plt.ylabel('Frequency')
plt.xlabel('Dish')
plt.title('Number of appearances in dataset')
plt.show()
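
The bar chart shows raw counts. To express them as shares of respondents, `value_counts` accepts `normalize=True`. Here is a minimal sketch on a made-up Series (the five answers are illustrative, not survey data):

```python
import pandas as pd

# Toy answers, for illustration only
main_dish = pd.Series(["Turkey", "Turkey", "Ham/Pork", "Turkey", "Tofurkey"])

counts = main_dish.value_counts()                      # absolute frequencies
shares = main_dish.value_counts(normalize=True) * 100  # percentages of all answers

print(counts)
print(shares.round(1))
```

On the real data, the same call on the main-dish column would give the percentage of respondents per dish.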


What’s for dessert?

  • Now that we’ve looked into the main dishes, let’s explore the dessert dishes.
  • Specifically, we’ll look at how many people eat Apple, Pecan, or Pumpkin pie during Thanksgiving dinner.
  • As expected, Pumpkin pie is the most popular single choice.
  • Another interesting observation is that most people ate more than one kind of pie.
#apple
apple_isnull = pd.isnull(data['Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Apple'])
apple_notnull = pd.notnull(data['Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Apple'])
#pumpkin
pumpkin_isnull = pd.isnull(data['Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Pumpkin'])
pumpkin_notnull = pd.notnull(data['Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Pumpkin'])
#pecan
pecan_isnull = pd.isnull(data['Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Pecan'])
pecan_notnull = pd.notnull(data['Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Pecan'])

no_pies = apple_isnull & pumpkin_isnull & pecan_isnull
only_apple_pies = apple_notnull & pumpkin_isnull & pecan_isnull
only_pumpkin_pies = apple_isnull & pumpkin_notnull & pecan_isnull
only_pecan_pies = apple_isnull & pumpkin_isnull & pecan_notnull
# create a dictionary with pie counts (summing a boolean mask counts its True values)
pie_types = {}
pie_types['Apple'] = only_apple_pies.sum()
pie_types['Pumpkin'] = only_pumpkin_pies.sum()
pie_types['Pecan'] = only_pecan_pies.sum()
pie_types['None'] = no_pies.sum()
# everyone who had at least one of the three pies, minus those who had exactly one kind
pie_types['Multiple'] = (~no_pies).sum() - pie_types['Apple'] - pie_types['Pumpkin'] - pie_types['Pecan']
# plot pie data in pie chart 
plt.figure(figsize=(7,7))
plt.pie([int(v) for v in pie_types.values()],labels=pie_types.keys(), autopct='%1.1f%%')
plt.title("Pie chart of pies consumed")
plt.show()
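
The same grouping can also be derived by counting the non-null pie columns per respondent, which avoids writing one mask per combination. This is an alternative sketch on a toy frame; the short column names `Apple`, `Pumpkin`, and `Pecan` are stand-ins for the long survey column names.

```python
import numpy as np
import pandas as pd

# Toy frame: a non-null cell means the respondent selected that pie
pies = pd.DataFrame({
    "Apple":   ["Apple", np.nan, "Apple", np.nan],
    "Pumpkin": [np.nan, "Pumpkin", "Pumpkin", np.nan],
    "Pecan":   [np.nan, np.nan, "Pecan", np.nan],
})

present = pies.notnull()
n_pies = present.sum(axis=1)  # number of pie types per respondent

# Name of the first selected pie (only meaningful when exactly one is selected)
first_pie = present.astype(int).idxmax(axis=1)

# Keep the pie name for single-pie rows, then label the other two cases
group = first_pie.where(n_pies == 1, "Multiple")
group = group.where(n_pies > 0, "None")

print(group.value_counts())
```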


Age & Income Groups

  • Let’s analyze the Age column in more depth. To do so, we’ll first need to convert its categories to numeric values.
  • The ‘How much total combined money did all members of your HOUSEHOLD earn last year?’ column is very similar to the Age column.
  • It contains categories, but can be converted to numerical values in the same way, by taking the first number (the lower bound) of each bracket.
  • Finally, we can plot histograms to see the distributions.
print(data["Age"].value_counts())
income_col = 'How much total combined money did all members of your HOUSEHOLD earn last year?'
print(data[income_col].value_counts()[:2])
45 - 59    269
60+        258
30 - 44    235
18 - 29    185
Name: Age, dtype: int64
$25,000 to $49,999    166
$50,000 to $74,999    127
Name: How much total combined money did all members of your HOUSEHOLD earn last year?, dtype: int64

def get_int_age(in_str):
    if pd.isnull(in_str):
        return None
    split_str = in_str.split(" ")
    age_str = re.sub(r'\+$', '', split_str[0])
    try:
        age_int = int(age_str)
    except Exception: 
        age_int = None
    return age_int

def get_int_income(in_str):
    if pd.isnull(in_str):
        return None
    first_str = in_str.split(" ")[0]
    if first_str== 'Prefer':
        return None
    
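    income_str = re.sub(r'[$,]', '', first_str)
    try:
        income_int = int(income_str)
    except ValueError:
        # not a number: treat as missing instead of dividing None by 1000
        return None
    # express income in thousands of dollars
    return income_int / 1000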
# Clean data 
data["int_age"] = data["Age"].apply(get_int_age)
data["int_income"] = data[income_col].apply(get_int_income)
# Fill missing data with median
data["int_age"] = data["int_age"].fillna(data["int_age"].median())
data["int_income"] = data["int_income"].fillna(data["int_income"].median())
# plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14,4))

data.hist(column = 'int_age', ax=ax1, color='orange')
ax1.set_title("Distribution of Age")
ax1.set_xlabel("Age (years)")
data.hist(column = 'int_income', ax=ax2, color='purple')
ax2.set_title("Distribution of Income")
ax2.set_xlabel("Income (in 1000$)")

plt.show()
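
As an alternative to the two helper functions, the lower bound of each bracket can be pulled out with pandas' vectorized string methods. This is only a sketch on made-up Series; the category strings `"$200,000 and up"` and `"Prefer not to answer"` are assumed to look like this in the survey.

```python
import pandas as pd

# Toy categories, for illustration only
age = pd.Series(["18 - 29", "60+", None])
income = pd.Series(["$25,000 to $49,999", "$200,000 and up",
                    "Prefer not to answer", None])

# Leading digits of the age bracket; missing values stay NaN
int_age = age.str.extract(r"^(\d+)", expand=False).astype(float)

# Drop thousands separators, take the number after the leading "$",
# and express it in thousands of dollars
int_income = (
    income.str.replace(",", "", regex=False)
          .str.extract(r"^\$(\d+)", expand=False)
          .astype(float) / 1000
)

print(int_age.tolist())
print(int_income.tolist())
```

Answers that do not start with a number, such as a refusal to answer, come out as NaN and can then be filled with the median as above.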


Correlating Travel Distance And Income

  • We can now see how the distance someone travels for Thanksgiving dinner relates to their income level.

  • It’s reasonable to hypothesize that people earning less money could be younger, and would travel to their parents’ houses for Thanksgiving.

  • As a result, people earning more would be more likely to have Thanksgiving at their own house.

  • We can test this by filtering the data based on int_income, and seeing what the values in the ‘How far will you travel for Thanksgiving?’ column are.

#  low income results <150K
is_low_income = data['int_income'] < 150
dist_low_income = data['How far will you travel for Thanksgiving?'][is_low_income]
value_dist_low = dist_low_income.value_counts()

# high income results >150K (an int_income of exactly 150 falls in neither group)
is_high_income = data['int_income'] > 150
dist_high_income = data['How far will you travel for Thanksgiving?'][is_high_income]
value_dist_high = dist_high_income.value_counts()
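# pie plots
# value_counts() sorts each group by frequency, so align the high-income counts
# to the low-income category order before applying one shared label list
# (my_label is assumed to follow the low-income order)
value_dist_high = value_dist_high.reindex(value_dist_low.index, fill_value=0)
my_label = ["No travel", "Local","Few Hours","Out of Town"]
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12,6))

ax1.pie(value_dist_low, labels=my_label, autopct='%1.1f%%')
ax1.set_title("Low Income")

ax2.pie(value_dist_high, labels=my_label, autopct='%1.1f%%')
ax2.set_title("High Income")
# wspace controls the horizontal gap between side-by-side axes
fig.subplots_adjust(wspace=0.4)
plt.show()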
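
The same comparison can be read off as a table with `pd.crosstab`, which keeps the category labels attached to the numbers. The sketch below runs on a toy frame; the `travel` and `income_group` column names are hypothetical, and the `>= 150` cut-off is an assumption that puts a value of exactly 150 in the high-income group.

```python
import numpy as np
import pandas as pd

# Toy data, for illustration only (income in thousands of dollars)
toy = pd.DataFrame({
    "int_income": [25, 50, 175, 200, 75, 150],
    "travel": ["Local", "Few Hours", "No travel", "No travel", "Local", "Local"],
})

# Assumed cut-off: 150K and above counts as high income
toy["income_group"] = np.where(toy["int_income"] >= 150, "High", "Low")

# normalize="index" turns each row into shares that sum to 1
shares = pd.crosstab(toy["income_group"], toy["travel"], normalize="index") * 100
print(shares.round(1))
```

Each row gives the percentage of that income group per travel category, so the two groups can be compared directly.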
