For week 2, I continued the process of creating a clean data set of call data from the Maryland Poison Center (MPC) that I started in week 1. The idea was to create a data set of call data for all 24 counties in Maryland for the years 2006-2018 from data extracted from pdf documents like this one.
After week 1, I had successfully created a function to extract the data from the 312 documents and organize it into a data frame. In week 2 I needed to clean this data frame into something usable and provide a codebook to describe the data.
Time this week: 376 minutes (~6.25 hours)
Total time (across 3 weeks): 1,122 minutes (~ 18.75 hours, ~ 6.25 hours/week)
I spent 6.25 hours on this project this week, with the time breakdown shown below. (As always, this time doesn’t include any blogging time about the project.)
Mon 6.17 | Tue 6.18 | Wed 6.19 | Thu 6.20 | Fri 6.21 | Sat 6.22 | Sun 6.23 | Mon 6.24 | Tue 6.25 |
---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 254 | 67 | 0 | 0 | 0 | 55 |
A stream of consciousness of the total analysis process for this project (weeks 1 and 2) can be found here, and will eventually be tidied.
A final R markdown document that produces the cleaned data can be found on Github here. You can also find the clean data set and codebook there as well.
Here are a few thoughts on the process.
After week 1, I had a nice function, pdfMPC.page1()
,for extracting data from a pdf document of the MPC type. When I run this function on a single pdf document, I get a very nice-looking data frame with 45 variables:
myData <- pdfMPC.page1("./data/Allegany County Statistical Report 2018.pdf")
dim(myData)
## [1] 1 45
myData
## Total human exposures < 12 months 1 year 2 years 3 years 4 years 5 years 6-12 years 13-19 years 20-59 years >60 years Unknown age Unintentional General Environmental Occupational Therapeutic Error
## 1 529 14 51 31 37 13 3 24 65 220 62 9 304 154 4 5 86
## Misuse Bite or Sting Food Poisoning Intentional Suspected Suicide Misuse Abuse Unknown Other Contamination/Tampering Malicious Adverse Reaction/Drug Adverse Reaction/Other Other/Unknown
## 1 52 2 1 189 110 17 60 2 36 7 3 12 4 10
## Animal Exposures Information Calls On site/non Healthcare Facility Healthcare Facility Other Refused Referral No Effect Minor Effect Moderate Effect Major Effect Death Other/Unknown Year
## 1 24 159 266 238 17 8 104 305 58 24 3 35 2018
## County
## 1 Allegany County, MD
There is still work to be done here, like making variable names that are easier to work with in R
, but it looks very nice. However, when I extract the data from all 312 documents and combine it together, I get something that’s much messier!
d1 <- pdfMPC.page1(paste0("./data/",files[1]))
myData <- d1
options(warn=2)
for (i in 2:length(files)) {
di <- pdfMPC.page1(paste0("./data/",files[i]))
myData <- bind_rows(myData,di)
}
dim(myData)
## [1] 312 62
names(myData)
## [1] "Total human exposures" "< 12 months" "1 year" "2 years" "3 years"
## [6] "4 years" "5 years" "6-12 years" "13-19 years" "Adult"
## [11] "Unknown age" "Unintentional" "General" "Environmental" "Occupational"
## [16] "Therapeutic Error" "Misuse" "Bite or Sting" "Food Poisoning" "Intentional"
## [21] "Suspected Suicide" "Abuse" "Unknown" "Other" "Contamination/Tampering"
## [26] "Malicious" "Adverse Reaction/Drug" "Adverse Reaction/Other" "Animal Exposures" "Information Calls"
## [31] "On site/non Healthcare Facility" "Healthcare Facility" "Refused Referral" "No Effect" "Minor Effect"
## [36] "Moderate Effect" "Major Effect" "Death" "Other/Unknown" "Year"
## [41] "County" "Food poisoning" "Center" "Unknown Reason" "Adverse Reaction/Food"
## [46] "Withdrawal" "Contamination/tampering" "Unknown Age" "Bite or sting" "20-59 years"
## [51] ">60 years" "Unknown adult" "Unknown child" "<60 years" "Unkknown"
## [56] "Unknown/Other" "Unknown Adult" "Other/unknown" "Other?Unknown" "Other/Unknown Reason"
## [61] "Outcome" "Medical"
It’s easier to see if we look at the variables names in alphabetical order:
myNames <- data.frame(names=names(myData))
myNames %>% arrange(names)
## names
## 1 < 12 months
## 2 <60 years
## 3 >60 years
## 4 1 year
## 5 13-19 years
## 6 2 years
## 7 20-59 years
## 8 3 years
## 9 4 years
## 10 5 years
## 11 6-12 years
## 12 Abuse
## 13 Adult
## 14 Adverse Reaction/Drug
## 15 Adverse Reaction/Food
## 16 Adverse Reaction/Other
## 17 Animal Exposures
## 18 Bite or sting
## 19 Bite or Sting
## 20 Center
## 21 Contamination/tampering
## 22 Contamination/Tampering
## 23 County
## 24 Death
## 25 Environmental
## 26 Food poisoning
## 27 Food Poisoning
## 28 General
## 29 Healthcare Facility
## 30 Information Calls
## 31 Intentional
## 32 Major Effect
## 33 Malicious
## 34 Medical
## 35 Minor Effect
## 36 Misuse
## 37 Moderate Effect
## 38 No Effect
## 39 Occupational
## 40 On site/non Healthcare Facility
## 41 Other
## 42 Other?Unknown
## 43 Other/unknown
## 44 Other/Unknown
## 45 Other/Unknown Reason
## 46 Outcome
## 47 Refused Referral
## 48 Suspected Suicide
## 49 Therapeutic Error
## 50 Total human exposures
## 51 Unintentional
## 52 Unkknown
## 53 Unknown
## 54 Unknown adult
## 55 Unknown Adult
## 56 Unknown age
## 57 Unknown Age
## 58 Unknown child
## 59 Unknown Reason
## 60 Unknown/Other
## 61 Withdrawal
## 62 Year
We can see that some issues are just differences in capitalization, like Bite or sting
verses Bite or Sting
. There are also differences in that sometimes the age groups are given as <60 years
while other times the age groups are broken down into smaller age ranges. There are also lots of typos and different orderings for all of the Unknown
and Other
categories.
This is an additional problem, because it’s not immediately clear which higher category the Other
or Unknown
variable belongs to. For example, in this pdf document, you can see that there’s an Unknown
categories for each reason for exposure (Unintentional, Intentional, and Other) as well as for Management Site and also an Other/Unknown
category for Medical Outcome. I thought the best thing to do would be to try to capture the higher category when capturing the lower category, so that the Unknown
for Unintentional
would become Unintentional:Unknown
and the second Unknown
for Intentional
would become Intentional:Unknown
and so on. Basically, I needed to assign the higher categories of Age
, Unintentional
, Other
, Management Site
, and Medical Outcome
whenever I am getting a subcategory count. Then, when data for the individual counties are merged together, the Unknown
and Other
categories will get properly matched.
I initially planned to do this by taking into account the fact that the subcategories are indented, but instead ended up just using the y
-distance between the category heading to define the categories and assigning higher categories in a variableSub
variable, and then pasting the heading and subcategory together!
## Designate the "Other Reason" variable to say "Other Reason"
countData$variable[countData$variable=="Other" & countData$xmin==min(countData$xmin)] <- "Other Reason"
y.age.min <- countData$y[countData$variable=="Total human exposures"]
y.age.max <- countData$y[countData$variable=="Unintentional"]
y.un.min <- countData$y[countData$variable=="Unintentional"]
y.un.max <- countData$y[countData$variable=="Intentional"]
y.int.min <- countData$y[countData$variable=="Intentional"]
y.int.max <- countData$y[countData$variable=="Other Reason"]
y.other.min <- countData$y[countData$variable=="Other Reason"]
y.ms.min <- countData$y[countData$variable=="Management"]
y.ms.max <- countData$y[countData$variable=="Medical"]
y.mo.min <- countData$y[countData$variable=="Medical"]
countData <- countData %>%
mutate(variableSub=ifelse(column=="Left" & y > y.age.min & y < y.age.max, "Age:",
ifelse(column=="Left" & y > y.un.min & y < y.un.max, "Unintentional:",
ifelse(column=="Left" & y > y.int.min & y < y.int.max, "Intentional:",
ifelse(column=="Left" & y > y.other.min, "Other:",
ifelse(column=="Right" & y > y.ms.min & y < y.ms.max, "ManageSite:",
ifelse(column=="Right" & y > y.mo.min, "MedOutcome:", "")))))))
# remove the rows that aren't variables/counts
countData <- countData %>% filter(count != "Calls", count!="exposure", count!="Site", count!="Outcome", variable!="Medical", variable!="Outcome")
countData <- countData %>%
mutate(name=paste0(variableSub, variable))
Once I include these higher order category labels in my function (and made some additional changes to deal with spacing in specific documents), I can process the data again. I now end up with more variables (70 verses 62), which leads me to believe that some of the Unknown
variables where being merged together when they shouldn’t have!
dim(myData)
## [1] 312 70
sort(names(myData))
## [1] "Age:< 12 months" "Age:<60 years" "Age:>60 years" "Age:1 year"
## [5] "Age:13-19 years" "Age:2 years" "Age:20-59 years" "Age:3 years"
## [9] "Age:4 years" "Age:5 years" "Age:6-12 years" "Age:Adult"
## [13] "Age:Unknown" "Age:Unknown adult" "Age:Unknown Adult" "Age:Unknown age"
## [17] "Age:Unknown Age" "Age:Unknown child" "Animal Exposures" "County"
## [21] "fileNo" "Information Calls" "Intentional" "Intentional:Abuse"
## [25] "Intentional:Misuse" "Intentional:Suspected Suicide" "Intentional:Unknown" "ManageSite:Healthcare Facility"
## [29] "ManageSite:On site/non Healthcare Facility" "ManageSite:Other" "ManageSite:Other/unknown" "ManageSite:Other/Unknown"
## [33] "ManageSite:Refused Referral" "ManageSite:Unknown" "MedOutcome:Death" "MedOutcome:Major Effect"
## [37] "MedOutcome:Minor Effect" "MedOutcome:Moderate Effect" "MedOutcome:No Effect" "MedOutcome:Other/Unknown"
## [41] "Other Reason" "Other:Adverse Reaction/Drug" "Other:Adverse Reaction/Food" "Other:Adverse Reaction/Other"
## [45] "Other:Contamination/tampering" "Other:Contamination/Tampering" "Other:Malicious" "Other:Other"
## [49] "Other:Other?Unknown" "Other:Other/Unknown" "Other:Other/Unknown Reason" "Other:Unknown"
## [53] "Other:Unknown Reason" "Other:Unknown/Other" "Other:Withdrawal" "Total human exposures"
## [57] "Unintentional" "Unintentional:Bite or sting" "Unintentional:Bite or Sting" "Unintentional:Environmental"
## [61] "Unintentional:Food poisoning" "Unintentional:Food Poisoning" "Unintentional:General" "Unintentional:Misuse"
## [65] "Unintentional:Occupational" "Unintentional:Other/Unknown" "Unintentional:Therapeutic Error" "Unintentional:Unkknown"
## [69] "Unintentional:Unknown" "Year"
Next I needed to clean up typos and combine variables that were measuring the same thing. I had to do this carefully to be sure the variables were really measuring the same thing by first checking for overlap between variables. For example, I wanted to combine all the different variables for “Unknown age” together, but had to first make sure they were really just different labelings and not truly different variables by checking for overlap:
# Age recoding
### Unknown
pairs(~`Age:Unknown` + `Age:Unknown age` + `Age:Unknown Age`, data=myData )
myData <- myData %>%
mutate(`Age:Unknown`=ifelse(!is.na(`Age:Unknown`), `Age:Unknown`, `Age:Unknown age`)) %>% select(-`Age:Unknown age`) %>%
mutate(`Age:Unknown`=ifelse(!is.na(`Age:Unknown`), `Age:Unknown`, `Age:Unknown Age`)) %>% select(-`Age:Unknown Age`)
After combining all such necessary variables, I re-ordered the data in a meaningful way to put the county/year variables at the front:
myData <- myData %>%
select(sort(names(myData)))
myData <- myData %>%
select(fileNo, County, Year, `Total human exposures`, `Animal Exposures`, `Information Calls`, `Age:< 12 months`, `Age:1 year`, `Age:2 years`, `Age:3 years`, `Age:4 years`, `Age:5 years`, `Age:6-12 years`, `Age:13-19 years`, `Age:20-59 years`, `Age:>60 years`, everything())
Then I had to do a little more cleaning to address the following issues:
<60 years
was really supposed to be >60 years
to match all the other counties.Adult
age category encompassed the 20-59 and >60 subcategories. From 2006-2009 MPC used the adult category. From 2010 on, MPC started splitting adult into two age groups. I created an Adult
category for all years but also left the age subcategories for the later years. I also combined the Unknown Adult
categories into the Adult
category I generated for years 2010-2018.Finally, I made the names of the variables easier to work with in R
by removing :
and /
and spaces!
names(myData)
namesNew <- names(myData)
namesNew <- gsub(pattern=":",replacement="_", namesNew)
namesNew <- gsub(pattern=" or s",replacement="/S", namesNew)
namesNew <- gsub(pattern="/",replacement=".", namesNew)
namesNew <- gsub(pattern="<",replacement="Under", namesNew)
namesNew <- gsub(pattern=">",replacement="Over", namesNew)
namesNew <- gsub(pattern="years",replacement="yr", namesNew)
namesNew <- gsub(pattern="year",replacement="yr", namesNew)
namesNew <- gsub(pattern="months",replacement="mo", namesNew)
namesNew <- gsub(pattern=" p",replacement=" P", namesNew)
namesNew <- gsub(pattern=" s",replacement=" S", namesNew)
namesNew <- gsub(pattern=" a",replacement=" A", namesNew)
namesNew <- gsub(pattern=" c",replacement=" C", namesNew)
namesNew <- gsub(pattern="Site.non",replacement="Site.Non", namesNew)
namesNew <- gsub(pattern=" ",replacement="", namesNew)
names(myData) <- namesNew