How to Download and Prepare Census Data for GIS Analysis
To effectively map and analyze social, economic, or demographic patterns using Census data in a GIS interface, this tutorial will walk you through the process of downloading and preparing data manually.
This tutorial focuses on manual data preparation using spreadsheet software. While every step in this guide details manual processes, most of these tasks – including downloading, filtering, and calculating derived variables – could also be efficiently performed using programming languages like Python or R, especially for larger datasets or automated workflows.
For those interested in programmatic access:
- Python: Explore Python API tools for direct access to NHGIS data.
- R: Utilize R packages like ipumsr to integrate NHGIS data into your R-based workflows.
We will cover:
- Downloading high-quality, free data from NHGIS.org.
- Understanding and selecting appropriate geographic units.
- Cleaning and scaling the data using common spreadsheet software.
- Calculating derived variables (like percentages or rates).
- Mastering a methodology you can adapt for various Census variables.
1. Downloading High-Quality Data
Visit NHGIS.org. Create a free academic use account if you haven’t already, and log in.
From the NHGIS.org home screen, choose START HERE: Get Data in the middle of the page, or SELECT DATA in the navigation bar menu at the top.
Click TOPICS. You can browse broad categories (like "Poverty" or "Race, Ethnicity, and Origins") or use the search bar to find specific variables relevant to your project.
Once you've found the variables you need (e.g., specific population counts, income levels), select them using the check-box furthest to the left. Click SUBMIT.
Under YEARS, select the American Community Survey 5-year range or Decennial Census year most applicable to your project.
Under GEOGRAPHIC LEVELS, choose CENSUS TRACT.
Why are we using census tracts?
Censuses are collected at the individual level, but results are aggregated to specific units of geography to protect privacy. Census tracts are a popular unit for geospatial analysis because counties (the level above) can be too vague to show spatial specificity, whereas more granular geographies (such as block groups or blocks) can have a higher likelihood for sampling errors. You can learn more about the hierarchical Census geographies on census.gov. If you need help determining the best geographic level for your analysis, schedule a consultation with us.
After selecting your variables, years, and geographic levels, your DATA CART will reflect your selections. From your DATA CART, choose CONTINUE, CONTINUE, and SUBMIT to begin the download. If you are not logged in, you will be prompted to do so.
Wait a moment or so for your extract to finish processing, and then download it by selecting TABLES.
2. Preparing Your Data for Mapping
You will have downloaded rows for every census tract in the entire United States, which is often more data than you need. Reducing your data at the outset will save time and space, especially when importing to visualization software.
Editing the Scale of the Data (Filtering by State)
This data cleaning portion of this tutorial is written for three spreadsheet software options: Microsoft Excel, an open (free) desktop spreadsheet tool called LibreOffice, or Google Sheets.
Microsoft Excel instructions:
- Select the header row (row 1).
- Go to the
Datatab on the ribbon and clickFilter. - Click the down arrow next to the column header containing state names (e.g., 'STATEA').
- Uncheck
(Select All)and then check the box next to the desired state(s). ClickOK.
LibreOffice instructions:
- Select any cell within your data range.
- Go to
Datain the top menu and selectAutoFilter. - Click the down arrow next to the column header containing state names.
- Uncheck
Alland then select the desired state(s). ClickOK.
Google Sheets instructions:
- Select any cell within your data range.
- Go to
Datain the top menu and selectCreate a filter. - Click the
filter icon(upside-down triangle) in the column header for state names. - Click
Clearand then select the desired state(s). ClickOK.
Scaling Data More Granularly (Filtering by County/City)
Once you've reduced your records to a specific state (e.g., New York State), you can further scale down your data by filtering records by a particular county or a group of counties (e.g., the counties associated with New York City). You will use the same filtering steps as above, applying them to the county-level column in your dataset. If you need assistance with specific filtering strategies, schedule a consultation with us.
3. Calculating Derived Variables: A Step-by-Step Example (Poverty Rate)
Many valuable Census variables, like percentages or rates, need to be derived from raw population counts. This process is often called normalization in data and GIS analysis. When seeking spatial patterns in mapped and visualized data, understanding ratios (normalized values) is often more useful than looking at raw counts, as counts can be misleading where populations are dense. However, this approach does not apply to all variables; readers should think critically about whether their variable should be represented as a count or a ratio to truly reflect what they intend to show. For assistance in determining the best representation for your data, please consult our team at the Harvard Map Collection.
From your downloaded folder, open the codebook. The file should read something like nhgis00xx_ds262_20225_tract_codebook.txt. This codebook details the meaning of each variable.
For Poverty Rate, your goal is to divide the number of people below a certain poverty level by the total population for whom poverty status is determined.
AMZME002is the number of people with income under 0.50 of the poverty level.AMZME003is the number of people with income between 0.50 and 0.99 of the poverty level.AMZME001is the total population for whom poverty status is determined.
Poverty Rate Equation:
Poverty Rate = (AMZME002 + AMZME003) / AMZME001 * 100
Steps to create the column:
- Create a new column in your spreadsheet software called
POVERTY_RATE. - In the second cell of the new column (e.g.,
B2), enter the equation (example for Excel/Sheets):= IFERROR((AR2 + AS2)/AQ2 * 100, 0)IFERRORcatches null values in the source Census data. We ask the function to return 0 instead of an error message if the function can’t carry out the calculation.
- Fill down to complete the column by double-clicking the plus sign on the cell with the formula, or dragging the fill handle.
- Save your file, making sure to choose the .CSV format.
4. Adapting for Other Census Variables & Calculations
The methodology demonstrated above is flexible. You can apply the same principles to calculate other derived variables (like percentages of specific populations) from raw Census counts.
To create a derived percentage or rate for your specific variable:
- Consult the Codebook: Open the
codebook.txtfile from your NHGIS download. Carefully identify:- The variable representing your total population or denominator (e.g.,
AQNGE001for Total Population in the Race data). - The variables representing the specific population segment(s) you want to include in your numerator. (e.g., for "Percent Non-White," you would sum
AQNGE003throughAQNGE008and potentially other relevant categories). Pay close attention to any notes about duplicate values or subtotals, as sometimes certain categories are already included in others (e.g.,AQNGE009andAQNGE010in Race data are often breakdowns ofAQNGE008and should not be double-counted).
- The variable representing your total population or denominator (e.g.,
- Construct Your Equation: Use the sum of your chosen numerator variables, divide by your total population/denominator, and multiply by 100 to get a percentage.
- Example (Percent Non-White):
(AQNGE003 + AQNGE004 + AQNGE005 + AQNGE006 + AQNGE007 + AQNGE008) / AQNGE001 * 100
- Example (Percent Non-White):
- Apply in Spreadsheet Software: Create a new column and enter your custom formula, then fill down.
- Save as CSV: Always save your final prepared data in a .CSV format for easy import into GIS software.
5. Finalizing Your Data for GIS
You now have a dataset ready for GIS mapping. Use the GEOID column (which uniquely identifies each Census tract) to join this data with a Census tract shapefile in your GIS software. The newly computed derived field (e.g., POVERTY_RATE or PCT_NONWHITE) will then power your visualization.
For a step-by-step guide on how to join tabular data in QGIS and create a map, please refer to our tutorial on Performing a Table Join.