Data cleaning and creating files for import into ArcView
We created an Access database to store and process the data. Putting the data into a consistent format took place in several steps.
The ICPSR Study 13 data had text files and data dictionaries for election results through 1990. The data dictionary was for SPSS, so to import the data into Access we had to set up the database by hand for each election. Each election's file had a slightly different format, making this a challenging and time-consuming task.
The ICPSR Study 7757 contained the key to codes used in Study 13 to represent the party associated with each vote total. Again we had to examine the codes by hand and manually update the field labels.
The data from the 1992, 1996 and 2000 elections are from the US Election Atlas. This resource is created and maintained by a hobbyist who collects election returns from state elections offices, standardizes the format, and manually creates maps representing the results. His data is available on the web but only for one state and year at a time. We manually copied the available information for each election state by state into Excel.
Information from 1952 - 1990 was imported directly into Access. We created tables of the format DATAyyyy using the manual method mentioned above, one for each election.
Information from the Excel spreadsheets was imported state by state for each year, in the format 2000_AZ, 2000_CA, etc. These were then combined into one table using a Macro Express routine. The routine used a constant to hold the year ("2000", "1996", etc.) and read from a text file that had a list of the two-letter state abbreviations with a number representing the number of down arrows needed to select that spreadsheet from the list of available sheets in the import window. The result was again a single table for each election.
The next step was to get the tables ready to join with the HUSCO data. The shapefiles are identified by a code representing state and county. Unfortunately the ICPSR data had different State ID codes. Since county names are not unique, nor are county IDs, state ID codes were needed for accurate joining. To fix this, we created a cross-reference table (StateCodeCrossRef) containing the ICPSR state code, the HUSCO state code, and the State name.
For the election data, County ID codes were only present in the 1976 and 1988 data. Where no county ID existed, the data was joined on State ID (with its cross reference) and county name. These joins showed that valid data was getting dropped (for instance, the HUSCO table with 3080 counties might only join 2076 records from the election table.)
To fix this, each year's election county names were hand-checked against the appropriate HUSCO county names. The goal was to make sure that all HUSCO counties that could be matched were matched with election data.
Some problems that turned up:
Fixed by Hand
County names different (Franklin County vs. Franklin) [the data sets were really bad about having consistent county names]
County name with thin spaces, looked alike but not textually alike
Ignored, nothing to do to fix
Virginia cities were included in the election data, but not in the HUSCO tables
All HUSCO tables had a county 'Nansemonde' in Virginia that does not exist in any election table.
Yellowstone National Park was listed in HUSCO but not in the election data. Where necessary, a 0 vote record was created to force a match.
The District of Columbia was not present in all the election data.
Counties that existed in the election data but not in the HUSCO data, and were not Virginia cities: La Paz in Arizona (84), Hines in Mississippi (84), Cibola in New Mexico (84), and Carson City in Nevada (76).
Fixed by Access query In some data, the Total Votes field was missing, but the percentages and party votes were present (id rep 40%, dem 60%, rep 400 votes, dem 600 votes). To determine total votes, we added up the vote fields.
StateCodeCrossRef was then used in the creation of a set of standardized working tables, one for each election. A MakeTable query was used to take some fields from the .dbf file associated with the appropriate HUSCO shapefile and join them with the source election data using StateCodeCrossRef as a translator to get the records to match properly.
For the data from US Election Atlas there was no existing state or code ID code. The HUSCO ID code was joined on common fields STATE and COUNTY names to create a new table. The SQL statement used was:
SELECT [US1999].[STATE], [US1999].[StateID], [US1999].[COUNTY], [US1999].[CountyID], int([US1999].[FIP]) AS ID, [Data2000].[Dem], [Data2000].[Bush], [Data2000].[Gore], [Data2000].[Nader], [Data2000].[Other], [Data2000].[Rep], [Data2000].[Green], [Data2000].[Misc], [Data2000].[Total] INTO election2000
FROM Data2000 INNER JOIN US1999 ON ([Data2000].[County]=[US1999].[COUNTY]) AND ([Data2000].[STATE]=[US1999].[STATE]);
See Table 1 for the election year/shapefile year matchups.
|Election Year HUSCO County shape year|
The result was consistent working tables that could be imported in ArcView and joined with the county shapefiles.
Before that was done, however, one more piece of information was added to the working tables. Queries were used to calculate which party received the most votes in each county for each election. First a query determined whether or not an independent candidate won the county. These counties had a new field (votecode) filled with an I.
Another query then looked at the records without I's and populated the field with other codes (see Table 2 for the codes and when they were used).
|I||(Independent > Rep) AND (Independent > Dem)|
|R||Rep > Dem|
|D||Dem > Rep|
|T||Dem = Rep|
|N||Dem = (0 or Null) AND Rep = (0 or Null)|
At that point, the data was ready for import into ArcView to be joined with the county shapefiles.
Importing Into ArcView Test imports discovered some problems with the data. It turned out that the election results for 1972 were missing. In addition, votes for both parties were missing from Minnesota and Mississippi for 1952, 1956, and 1964. Democratic votes were missing from Mississippi for 1960. This made it impractical to use the data for any year prior to 1976, since we were looking in part at how counties switch party (or not) from election to election. We needed complete and consistent voting returns to do so. At that point the project shrank in scope to only analyze the seven elections from 1976-2000.
With all of the relevant election data joined to the proper county shapefile (see Table 1 for the matchups), the GIS analysis began.
back to County Voting Patterns in US Presidential Elections, 1976-2000