5HAW

An old data scientist learning new tricks in 5 hours a week

Week 2: Cleaning data extracted from a pdf

Posted at — Jul 9, 2019

For week 2, I continued the process of creating a clean data set of call data from the Maryland Poison Center (MPC) that I started in week 1. The idea was to create a data set of call data for all 24 counties in Maryland for the years 2006-2018 from data extracted from pdf documents like this one.

After week 1, I had successfully created a function to extract the data from the 312 documents and organize it into a data frame. In week 2 I needed to clean this data frame into something usable and provide a codebook to describe the data.

Time this week: 376 minutes (~6.25 hours)

Total time (across 3 weeks): 1,122 minutes (~ 18.75 hours, ~ 6.25 hours/week)

I spent 6.25 hours on this project this week, with the time breakdown shown below. (As always, this time doesn’t include any blogging time about the project.)

Mon 6.17 Tue 6.18 Wed 6.19 Thu 6.20 Fri 6.21 Sat 6.22 Sun 6.23 Mon 6.24 Tue 6.25
0 0 0 254 67 0 0 0 55

A stream of consciousness of the total analysis process for this project (weeks 1 and 2) can be found here, and will eventually be tidied.

A final R markdown document that produces the cleaned data can be found on Github here. You can also find the clean data set and codebook there as well.

Here are a few thoughts on the process.

Starting with the messy data

After week 1, I had a nice function, pdfMPC.page1(),for extracting data from a pdf document of the MPC type. When I run this function on a single pdf document, I get a very nice-looking data frame with 45 variables:

myData <- pdfMPC.page1("./data/Allegany County Statistical Report 2018.pdf")
dim(myData)
## [1]  1 45
myData
##   Total human exposures < 12 months 1 year 2 years 3 years 4 years 5 years 6-12 years 13-19 years 20-59 years >60 years Unknown age Unintentional General Environmental Occupational Therapeutic Error
## 1                   529          14     51      31      37      13       3         24          65         220        62           9           304     154             4            5                86
##   Misuse Bite or Sting Food Poisoning Intentional Suspected Suicide Misuse Abuse Unknown Other Contamination/Tampering Malicious Adverse Reaction/Drug Adverse Reaction/Other Other/Unknown
## 1     52             2              1         189               110     17    60       2    36                       7         3                    12                      4            10
##   Animal Exposures Information Calls On site/non Healthcare Facility Healthcare Facility Other Refused Referral No Effect Minor Effect Moderate Effect Major Effect Death Other/Unknown Year
## 1               24               159                             266                 238    17                8       104          305              58           24     3            35 2018
##                County
## 1 Allegany County, MD

There is still work to be done here, like making variable names that are easier to work with in R, but it looks very nice. However, when I extract the data from all 312 documents and combine it together, I get something that’s much messier!

d1 <- pdfMPC.page1(paste0("./data/",files[1]))
myData <- d1

options(warn=2)
for (i in 2:length(files)) {
  di <- pdfMPC.page1(paste0("./data/",files[i]))
  myData <- bind_rows(myData,di)
}

dim(myData)
## [1] 312  62
names(myData)
##  [1] "Total human exposures"           "< 12 months"                     "1 year"                          "2 years"                         "3 years"                        
##  [6] "4 years"                         "5 years"                         "6-12 years"                      "13-19 years"                     "Adult"                          
## [11] "Unknown age"                     "Unintentional"                   "General"                         "Environmental"                   "Occupational"                   
## [16] "Therapeutic Error"               "Misuse"                          "Bite or Sting"                   "Food Poisoning"                  "Intentional"                    
## [21] "Suspected Suicide"               "Abuse"                           "Unknown"                         "Other"                           "Contamination/Tampering"        
## [26] "Malicious"                       "Adverse Reaction/Drug"           "Adverse Reaction/Other"          "Animal Exposures"                "Information Calls"              
## [31] "On site/non Healthcare Facility" "Healthcare Facility"             "Refused Referral"                "No Effect"                       "Minor Effect"                   
## [36] "Moderate Effect"                 "Major Effect"                    "Death"                           "Other/Unknown"                   "Year"                           
## [41] "County"                          "Food poisoning"                  "Center"                          "Unknown Reason"                  "Adverse Reaction/Food"          
## [46] "Withdrawal"                      "Contamination/tampering"         "Unknown Age"                     "Bite or sting"                   "20-59 years"                    
## [51] ">60 years"                       "Unknown adult"                   "Unknown child"                   "<60 years"                       "Unkknown"                       
## [56] "Unknown/Other"                   "Unknown Adult"                   "Other/unknown"                   "Other?Unknown"                   "Other/Unknown Reason"           
## [61] "Outcome"                         "Medical"

It’s easier to see if we look at the variables names in alphabetical order:

myNames <- data.frame(names=names(myData))
myNames %>% arrange(names)
##                              names
## 1                      < 12 months
## 2                        <60 years
## 3                        >60 years
## 4                           1 year
## 5                      13-19 years
## 6                          2 years
## 7                      20-59 years
## 8                          3 years
## 9                          4 years
## 10                         5 years
## 11                      6-12 years
## 12                           Abuse
## 13                           Adult
## 14           Adverse Reaction/Drug
## 15           Adverse Reaction/Food
## 16          Adverse Reaction/Other
## 17                Animal Exposures
## 18                   Bite or sting
## 19                   Bite or Sting
## 20                          Center
## 21         Contamination/tampering
## 22         Contamination/Tampering
## 23                          County
## 24                           Death
## 25                   Environmental
## 26                  Food poisoning
## 27                  Food Poisoning
## 28                         General
## 29             Healthcare Facility
## 30               Information Calls
## 31                     Intentional
## 32                    Major Effect
## 33                       Malicious
## 34                         Medical
## 35                    Minor Effect
## 36                          Misuse
## 37                 Moderate Effect
## 38                       No Effect
## 39                    Occupational
## 40 On site/non Healthcare Facility
## 41                           Other
## 42                   Other?Unknown
## 43                   Other/unknown
## 44                   Other/Unknown
## 45            Other/Unknown Reason
## 46                         Outcome
## 47                Refused Referral
## 48               Suspected Suicide
## 49               Therapeutic Error
## 50           Total human exposures
## 51                   Unintentional
## 52                        Unkknown
## 53                         Unknown
## 54                   Unknown adult
## 55                   Unknown Adult
## 56                     Unknown age
## 57                     Unknown Age
## 58                   Unknown child
## 59                  Unknown Reason
## 60                   Unknown/Other
## 61                      Withdrawal
## 62                            Year

We can see that some issues are just differences in capitalization, like Bite or sting verses Bite or Sting. There are also differences in that sometimes the age groups are given as <60 years while other times the age groups are broken down into smaller age ranges. There are also lots of typos and different orderings for all of the Unknown and Other categories.

Adding higher-level categories to variable names

This is an additional problem, because it’s not immediately clear which higher category the Other or Unknown variable belongs to. For example, in this pdf document, you can see that there’s an Unknown categories for each reason for exposure (Unintentional, Intentional, and Other) as well as for Management Site and also an Other/Unknown category for Medical Outcome. I thought the best thing to do would be to try to capture the higher category when capturing the lower category, so that the Unknown for Unintentional would become Unintentional:Unknown and the second Unknown for Intentional would become Intentional:Unknown and so on. Basically, I needed to assign the higher categories of Age, Unintentional, Other, Management Site, and Medical Outcome whenever I am getting a subcategory count. Then, when data for the individual counties are merged together, the Unknown and Other categories will get properly matched.

I initially planned to do this by taking into account the fact that the subcategories are indented, but instead ended up just using the y-distance between the category heading to define the categories and assigning higher categories in a variableSub variable, and then pasting the heading and subcategory together!

## Designate the "Other Reason" variable to say "Other Reason"
countData$variable[countData$variable=="Other" & countData$xmin==min(countData$xmin)] <- "Other Reason"

y.age.min <- countData$y[countData$variable=="Total human exposures"]
y.age.max <- countData$y[countData$variable=="Unintentional"]
y.un.min <- countData$y[countData$variable=="Unintentional"]
y.un.max <- countData$y[countData$variable=="Intentional"]
y.int.min <- countData$y[countData$variable=="Intentional"]
y.int.max <- countData$y[countData$variable=="Other Reason"]
y.other.min <- countData$y[countData$variable=="Other Reason"]

y.ms.min <- countData$y[countData$variable=="Management"]
y.ms.max <- countData$y[countData$variable=="Medical"]
y.mo.min <- countData$y[countData$variable=="Medical"]

countData <- countData %>%
  mutate(variableSub=ifelse(column=="Left" & y > y.age.min & y < y.age.max, "Age:",
         ifelse(column=="Left" & y > y.un.min & y < y.un.max, "Unintentional:",
         ifelse(column=="Left" & y > y.int.min & y < y.int.max, "Intentional:",
         ifelse(column=="Left" & y > y.other.min, "Other:",
         ifelse(column=="Right" & y > y.ms.min & y < y.ms.max, "ManageSite:",
         ifelse(column=="Right" & y > y.mo.min, "MedOutcome:", "")))))))

# remove the rows that aren't variables/counts
countData <- countData %>% filter(count != "Calls", count!="exposure", count!="Site", count!="Outcome", variable!="Medical", variable!="Outcome")

countData <- countData %>%
  mutate(name=paste0(variableSub, variable))

Once I include these higher order category labels in my function (and made some additional changes to deal with spacing in specific documents), I can process the data again. I now end up with more variables (70 verses 62), which leads me to believe that some of the Unknown variables where being merged together when they shouldn’t have!

dim(myData)
## [1] 312  70
sort(names(myData))
##  [1] "Age:< 12 months"                            "Age:<60 years"                              "Age:>60 years"                              "Age:1 year"                                
##  [5] "Age:13-19 years"                            "Age:2 years"                                "Age:20-59 years"                            "Age:3 years"                               
##  [9] "Age:4 years"                                "Age:5 years"                                "Age:6-12 years"                             "Age:Adult"                                 
## [13] "Age:Unknown"                                "Age:Unknown adult"                          "Age:Unknown Adult"                          "Age:Unknown age"                           
## [17] "Age:Unknown Age"                            "Age:Unknown child"                          "Animal Exposures"                           "County"                                    
## [21] "fileNo"                                     "Information Calls"                          "Intentional"                                "Intentional:Abuse"                         
## [25] "Intentional:Misuse"                         "Intentional:Suspected Suicide"              "Intentional:Unknown"                        "ManageSite:Healthcare Facility"            
## [29] "ManageSite:On site/non Healthcare Facility" "ManageSite:Other"                           "ManageSite:Other/unknown"                   "ManageSite:Other/Unknown"                  
## [33] "ManageSite:Refused Referral"                "ManageSite:Unknown"                         "MedOutcome:Death"                           "MedOutcome:Major Effect"                   
## [37] "MedOutcome:Minor Effect"                    "MedOutcome:Moderate Effect"                 "MedOutcome:No Effect"                       "MedOutcome:Other/Unknown"                  
## [41] "Other Reason"                               "Other:Adverse Reaction/Drug"                "Other:Adverse Reaction/Food"                "Other:Adverse Reaction/Other"              
## [45] "Other:Contamination/tampering"              "Other:Contamination/Tampering"              "Other:Malicious"                            "Other:Other"                               
## [49] "Other:Other?Unknown"                        "Other:Other/Unknown"                        "Other:Other/Unknown Reason"                 "Other:Unknown"                             
## [53] "Other:Unknown Reason"                       "Other:Unknown/Other"                        "Other:Withdrawal"                           "Total human exposures"                     
## [57] "Unintentional"                              "Unintentional:Bite or sting"                "Unintentional:Bite or Sting"                "Unintentional:Environmental"               
## [61] "Unintentional:Food poisoning"               "Unintentional:Food Poisoning"               "Unintentional:General"                      "Unintentional:Misuse"                      
## [65] "Unintentional:Occupational"                 "Unintentional:Other/Unknown"                "Unintentional:Therapeutic Error"            "Unintentional:Unkknown"                    
## [69] "Unintentional:Unknown"                      "Year"

Cleaning typos in variables names and combining variables

Next I needed to clean up typos and combine variables that were measuring the same thing. I had to do this carefully to be sure the variables were really measuring the same thing by first checking for overlap between variables. For example, I wanted to combine all the different variables for “Unknown age” together, but had to first make sure they were really just different labelings and not truly different variables by checking for overlap:

# Age recoding
### Unknown
pairs(~`Age:Unknown` + `Age:Unknown age` + `Age:Unknown Age`, data=myData )

myData <- myData %>% 
  mutate(`Age:Unknown`=ifelse(!is.na(`Age:Unknown`), `Age:Unknown`, `Age:Unknown age`)) %>% select(-`Age:Unknown age`) %>%
  mutate(`Age:Unknown`=ifelse(!is.na(`Age:Unknown`), `Age:Unknown`, `Age:Unknown Age`)) %>% select(-`Age:Unknown Age`)

Reordering variables in a meaningful way

After combining all such necessary variables, I re-ordered the data in a meaningful way to put the county/year variables at the front:

myData <- myData %>%
  select(sort(names(myData)))

myData <- myData %>%
  select(fileNo, County, Year, `Total human exposures`, `Animal Exposures`, `Information Calls`, `Age:< 12 months`, `Age:1 year`, `Age:2 years`, `Age:3 years`, `Age:4 years`, `Age:5 years`, `Age:6-12 years`, `Age:13-19 years`, `Age:20-59 years`, `Age:>60 years`, everything())

Further data cleaning

Then I had to do a little more cleaning to address the following issues:

  • There was a clear typo in the pdf documents for Caroline County for 2010-2018; the variable labeled <60 years was really supposed to be >60 years to match all the other counties.
  • It was clear the Adult age category encompassed the 20-59 and >60 subcategories. From 2006-2009 MPC used the adult category. From 2010 on, MPC started splitting adult into two age groups. I created an Adult category for all years but also left the age subcategories for the later years. I also combined the Unknown Adult categories into the Adult category I generated for years 2010-2018.
  • I noticed that in 2014 there were 25 counties and in 2015 there were only 23 counties. Somerset County was showing up twice in 2014 and not at all in 2015. Looking at the pdf documents for Somerset for these two years, the pdf documents are in fact different (have different numbers) but both are labeled 2014 on the first page of the document. However,the one from the 2015 link is labeled 2015 on the second page, so I assigned the data from this document to Somerset 2015.

Modifying variable names

Finally, I made the names of the variables easier to work with in R by removing : and / and spaces!

names(myData)
namesNew <- names(myData)

namesNew <- gsub(pattern=":",replacement="_", namesNew)
namesNew <- gsub(pattern=" or s",replacement="/S", namesNew)
namesNew <- gsub(pattern="/",replacement=".", namesNew)
namesNew <- gsub(pattern="<",replacement="Under", namesNew)
namesNew <- gsub(pattern=">",replacement="Over", namesNew)
namesNew <- gsub(pattern="years",replacement="yr", namesNew)
namesNew <- gsub(pattern="year",replacement="yr", namesNew)
namesNew <- gsub(pattern="months",replacement="mo", namesNew)
namesNew <- gsub(pattern=" p",replacement=" P", namesNew)
namesNew <- gsub(pattern=" s",replacement=" S", namesNew)
namesNew <- gsub(pattern=" a",replacement=" A", namesNew)
namesNew <- gsub(pattern=" c",replacement=" C", namesNew)
namesNew <- gsub(pattern="Site.non",replacement="Site.Non", namesNew)
namesNew <- gsub(pattern=" ",replacement="", namesNew)

names(myData) <- namesNew

Final clean data set!

After all that work, I have a clean data set ready to be analyzed; you can find this data here and the accompanying codebook here.

This project made me appreciate again how much work goes into assembling and cleaning data before any actual analysis can be done!

comments powered by Disqus