Note: The dataset, along with information about its source and the variables it includes, is available at https://www.openintro.org/data/index.php?data=county_2019. This assignment reinforces lessons and resources from the Defining Data, Critiquing Data, and Collecting Data subdomains of the DA4A Toolkit, and could also be used as the basis of exercises focused on Making Claims with Data, Visualizing Data, Mapping Data, or Telling Stories with Data.
Assignment Prompt: The American Community Survey (ACS) provides an occasion to reflect upon how the project of counting the US population is inherently messy, and implicitly (and sometimes explicitly) caught up in questions of power. This is the case not only because census numbers are used by federal, state, and local policy makers, but also because the methods and categories used to gather and organize data frequently make assumptions about what it means to be normal and about how people should be living their lives. At the same time, data can be a powerful tool for identifying patterns of injustice or systemic violence. As you work through this assignment, reflect both on how the ACS data embed bias and on how the data might contribute to a responsible data advocacy project.
PART I:
Using a spreadsheet program or a software platform for statistical analysis (such as R), access the dataset and answer the following questions:
- How does the dataset represent the phenomena under scrutiny? What variables does it include? Which of the variables are categorical? Which are numerical? How do the variables selected for inclusion impact the kinds of inquiry you can perform with the data? What kinds of values are embedded in the way the dataset presents its information?
- Pick a numerical variable (for example, “population,” “age_over_18,” “hs_grad,” or any other numerical variable you wish to explore) and create a histogram to visualize its distribution. (Note: You can create a visualization for the entire United States, encompassing every county in the country, or you can filter first for a particular state.) How are the data distributed?
- Using the same variable, calculate the mean, median, and mode. Are these three measures of central tendency relatively close to each other? If so, what does their proximity suggest? If not, what does their relative difference tell you about the distribution of the observations that make up the dataset?
- Using the same variable, locate the maximum and minimum values. Calculate the interquartile range. Finally, calculate the standard deviation for your variable. Using the information about central tendency developed above, describe how your data are dispersed. Do the observations cluster around a central point? Are they relatively spread out?
- Choose another numerical variable and calculate the correlation between it and the initial variable you’ve studied. Interpret the r coefficient for these two variables. Does the r statistic indicate a strong or weak positive correlation, a strong or weak negative correlation, or no correlation? Why might this be? What might account for the relative correlation or non-correlation?
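For students who prefer Python to a spreadsheet or R, the calculations requested in Part I could be sketched roughly as follows. This is a minimal illustration, not a required method: it uses only Python's standard `statistics` module, the helper names (`summarize`, `pearson_r`, `bin_counts`) are hypothetical, and the numbers are invented placeholders rather than values from the county_2019 dataset. Replace them with columns from the real data before drawing any conclusions.

```python
import statistics

# Invented placeholder values for illustration only; NOT taken from county_2019.
# Swap in two numerical columns from the real dataset.
hs_grad = [78.2, 81.5, 84.0, 85.3, 86.1, 86.1, 87.4, 88.9, 90.2, 93.6]
poverty = [24.1, 20.3, 17.8, 16.0, 15.2, 14.9, 13.1, 11.7, 9.8, 6.4]


def summarize(values):
    """Return measures of central tendency and dispersion for a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # the three quartile cut points
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),  # most frequent value
        "min": min(values),
        "max": max(values),
        "iqr": q3 - q1,  # interquartile range
        "stdev": statistics.stdev(values),  # sample standard deviation
    }


def pearson_r(xs, ys):
    """Pearson correlation coefficient (r) between two equal-length lists."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (ss_x * ss_y) ** 0.5


def bin_counts(values, bin_width):
    """Count observations per histogram bin, keyed by each bin's lower edge."""
    counts = {}
    for v in values:
        lower_edge = (v // bin_width) * bin_width
        counts[lower_edge] = counts.get(lower_edge, 0) + 1
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    # Text histogram: one '#' per observation in each bin.
    for lower_edge, count in bin_counts(hs_grad, 5).items():
        print(f"{lower_edge:5.1f}+ {'#' * count}")

    for name, value in summarize(hs_grad).items():
        print(f"{name}: {value:.2f}")

    print(f"r = {pearson_r(hs_grad, poverty):.3f}")
```

Note that different tools compute quartiles with slightly different conventions, so an interquartile range calculated in a spreadsheet or in R may not match this sketch exactly. Real-valued columns often have no repeated values, in which case the mode is not very informative; rounding or binning the variable first is one way to make it meaningful.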
PART II:
Reviewing the calculations and reflections above, consider how these insights might inform a data advocacy project. For this part of the assignment, write a brief reflection focused on how your exploration of the ACS dataset might help support a data advocacy project. Your reflection should include two components:
- Brainstorm answers to the following questions, which build on your analysis from Part I. What opportunities for further inquiry does your initial exploration help you identify? What kinds of power dynamics, structural inequities, or potential injustices might your analysis help identify? What kinds of information, including contextual and historical information, would be useful to help you answer these questions?
- Describe a data advocacy project that would responsibly build on the insights you’ve generated. What kinds of policy changes (including policies about data categories, data collection, and data use) might the insights you’ve generated help support? What kinds of challenges or injustices does your preliminary statistical analysis help identify? What kinds of help and input would you need to develop a data advocacy project?