Antibiotics and food in the American press: A text mining study.
Antoine Bridier-Nahmias\(\dagger\), Estera Badau\(\dagger\), Pi Nyvall Collen, Antoine Andremont, Jocelyne Arquembourg

\(\dagger\): These authors contributed equally to this work

Corpus constitution

The articles were retrieved from the Factiva database using keywords and expressions, some of them combined with one another. The search terms and expressions were the following:

antibiotic resistance, antimicrobial resistance, 
antibiotic free or antibiotic-free, antibiotics and food, 
antibiotics and farming, antibiotics and resistant, antibiotics and salmonella, 
salmonella and resistant, salmonella and outbreak, 
antibiotics and campylobacter and resistant, antibiotics and routine, 
antibiotics and routinely, antibiotics and One Health;
(antibio* near3 food) or (antibio* near3 farm*) or (antibio* near3 salmonell*) 
or (antibio* near3 campylobacter*) or (antibio* near3 animal*) or 
(antibio* near3 feed)

Data loading

The corpus consists of articles saved as individual PDF files.

Data Processing

Information scraping and cleaning

We will now merge the articles and their respective information into a data frame, and then remove the headers and footers. This operation is noisy because the footer formatting is inconsistent.

We are now ready to combine everything into one data.frame; beforehand, we just add a unique id for each article.

Tokenization can now take place. Several n-gram sizes can be used; we start with 1-grams, i.e. words.

Counting and more

Let’s first extract some figures about the whole corpus.

Publication chronology

Parsed with column specification:
cols(
  date = col_date(format = ""),
  full_event = col_character(),
  event_label = col_character()
)
longer object length is not a multiple of shorter object length

Counts and TF-IDF

We will compute the following:

  • corpus_length: the total number of documents in the corpus
  • article_length: the total number of words in each article
  • n_article: the count of each word in each article
  • word_in_n: for each word, how many articles contain it
  • n_total: total count of each word
  • n_journal: total count of each word by journal
suppressMessages(
  my_stop_words <-
    read_table(
      file = "../data/my_stop_words.txt",
      col_names = "stop_word"
    )
)
corpus_tfidf_full <-
  corpus_1_grams_unfiltered %>%
  # filter(id %in% c(30:100,400:500) ) %>%
  # filter(id %in% c(517) ) %>%
  
  # clean dataset
  mutate(word = str_to_lower(string = word)) %>% 
  mutate(word = str_replace_all(string = word, pattern = "([[:alpha:]]*)\\.([[:alpha:]]*)", replacement = "\\1\\2")) %>%
  mutate(word = str_replace_all(string = word, pattern = "'s$", replacement = "")) %>%
  # mutate(word = str_replace_all(string = word, pattern = "[^a-z]", replacement = "")) %>%
  filter(!(str_detect(string = word, pattern = "washpostcom"))) %>% # Tokenization splits on @ !!!!!!!
  filter(word != "") %>% 
  filter(nchar(word) > 1) %>% 
  filter(!str_detect(string = word, pattern = "^[0-9]|[[:punct:]]+$")) %>%
  
  # stemming with textstem::stem_words() (lemmatize_words() would return lemmas instead) and last filtering
  mutate(stem = stem_words(word)) %>%
  filter(!word %in% my_stop_words$stop_word) %>% 
  
  # total number of articles
  mutate(corpus_length = length(unique(id) )) %>%  
  
  # total word in article
  group_by(id) %>%
  mutate(article_length = n()) %>%
  
  # count of each word by article
  group_by(id, stem) %>% 
  mutate(n_article = n()) %>% 
  
  # word count
  group_by(stem) %>% 
  mutate(n_total = n()) %>% 
  group_by(stem, journal) %>% 
  mutate(n_journal = n()) %>% 
  ungroup() %>%
  # compute tf-idf
  
  mutate(tf = n_article / article_length) %>% # term frequency
  group_by(stem) %>% 
  mutate(word_in_n = length(unique(id))) %>% 
  mutate(idf = log(corpus_length / word_in_n) ) %>% # inverse document frequency
  mutate(tf_idf = tf * idf) %>% 
  ungroup() %>% 
  
  # Choose a representative word for each stem: the most frequent original word
  group_by(stem) %>%
  mutate(ori_word = word) %>%
  group_by(ori_word) %>% 
  mutate(n_ori = n()) %>%
  arrange(desc(n_ori)) %>%
  group_by(stem) %>% 
  mutate(word = word[1]) %>%
  select(-n_ori) %>% 
  ungroup()
# write_delim(x = corpus_tfidf_full, path = "../output/corpus_tfidf_full.tsv", delim = "\t", col_names = TRUE)
# corpus_tfidf_full <- 
#   read_delim(file = "../output/corpus_tfidf_full.tsv", delim = "\t", col_names = TRUE)
corpus_tfidf <-
  corpus_tfidf_full %>% 
  # reduce
  group_by(id, word) %>%
  slice(1) %>% 
  ungroup() %>%
  
  # filter out  words with tf-idf == 0
  filter(tf_idf > 0) %>% 
  identity()
corpus_tfidf %>% 
  filter(!(word %in% stop_words$word)) %>%
  filter(n_total >= 50) %>% 
  arrange(desc(n_total)) %>% 
  group_by(word, journal) %>% 
  slice(1) %>%
  select(word, journal, pub_date, n_total, n_journal) %>%
  arrange(desc(n_total)) %>% 
  datatable(caption = "Words appearing at least 50 times", filter = "top") %>% 
  identity()
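
To make the quantities computed above concrete, here is a minimal sketch on a made-up corpus of two articles. The objects toy_tokens and toy_tfidf are hypothetical and the data are invented; only the column names and formulas follow the pipeline above.

```r
library(dplyr)
library(tibble)

# Toy corpus (invented for illustration): one row per token.
toy_tokens <- tribble(
  ~id, ~stem,
  1,   "antibiot",
  1,   "antibiot",
  1,   "farm",
  2,   "antibiot",
  2,   "chicken",
  2,   "chicken",
  2,   "chicken"
)

toy_tfidf <-
  toy_tokens %>%
  mutate(corpus_length = n_distinct(id)) %>%   # number of articles
  group_by(id) %>%
  mutate(article_length = n()) %>%             # tokens per article
  group_by(id, stem) %>%
  mutate(n_article = n()) %>%                  # count of the stem in the article
  group_by(stem) %>%
  mutate(word_in_n = n_distinct(id)) %>%       # articles containing the stem
  ungroup() %>%
  mutate(
    tf     = n_article / article_length,       # term frequency
    idf    = log(corpus_length / word_in_n),   # inverse document frequency
    tf_idf = tf * idf
  ) %>%
  distinct(id, stem, .keep_all = TRUE)         # one row per article and stem

toy_tfidf
```

In this toy corpus "antibiot" occurs in both articles, so its idf is log(2 / 2) = 0 and its tf-idf is 0. This is why the filter tf_idf > 0 above removes terms present in every article.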

Word count evolution through time

Analyses

GLM

Which terms could discriminate between articles from The Washington Post (WP) and The New York Times (NYT)?

   user  system elapsed 
 13.482   0.293  13.802 
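
The modelling code is not shown in this rendered report. Since glmnet is attached in the session, one plausible approach is a lasso-penalised logistic regression on a document-term matrix; the sketch below illustrates that idea on simulated counts. The data, the term names and the penalty value s = 0.05 are all assumptions made for illustration, not the settings used in the study.

```r
library(glmnet)
library(Matrix)

set.seed(1)
n_articles <- 60
terms <- c("farm", "senate", "chicken", "hospital")

# Simulated term counts: rows are articles, columns are terms.
dense <- matrix(
  rpois(n_articles * length(terms), lambda = 2),
  nrow = n_articles,
  dimnames = list(NULL, terms)
)
journal <- factor(rep(c("NYT", "WP"), each = n_articles / 2))

# Make "senate" more frequent in the simulated WP articles.
is_wp <- journal == "WP"
dense[is_wp, "senate"] <- dense[is_wp, "senate"] + 3

# glmnet accepts sparse matrices, which suits document-term data.
x <- Matrix(dense, sparse = TRUE)

# alpha = 1 is the lasso penalty; it shrinks uninformative terms to zero.
fit <- glmnet(x = x, y = journal, family = "binomial", alpha = 1)

# Coefficients at one (arbitrary) penalty value; positive values point to WP,
# the second factor level.
coef(fit, s = 0.05)
```

Terms whose coefficients stay non-zero under the penalty are the ones the model relies on to separate the two journals. On real data the penalty would normally be chosen by cross-validation, for instance with cv.glmnet().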

Contextual analysis

Syntagma curation

In the next section, we concentrate on counting manually curated terms or expressions (syntagmas). They are presented in a named list in which each entry contains all the terms considered equivalent. We then extract all the sentences in which they occur and analyse their context (in general and across time).

$antibiotic_resistance
[1] "antibiotic resistance"     "antibiotic-resistance"     "resistant to antibiotics"  "resistance to antibiotics"

$antibiotic_free
[1] "antibiotic free"     "antibiotic-free"     "antibioticsfree"     "free of antibiotics"

$routine_use
[1] "routine use"    "routinely used"

$judicious_use
[1] "judicious use"

$responsible_use
[1] "responsible use"

$prudent_use
[1] "prudent use"

$indiscriminate_use
[1] "indiscriminate use"

$food_borne
[1] "food borne" "food-borne"

The first step is to divide the corpus into sentences.
Note: unnest_tokens(token = "sentences") fails whenever it encounters an abbreviation containing a dot, so the dots are stripped from common abbreviations first.

corpus_sentences <-
  corpus_txt %>% 
  # Each article has to be re-concatenated
  group_by(id) %>% 
  # filter(id %in% 1:5) %>%
  mutate(article = paste(text, collapse = " ")) %>%
  select(-text) %>% 
  distinct() %>%
  ungroup() %>% 
# Could/should be done in one pass with a list of terms!
  mutate(article = str_replace_all(string = article, pattern = regex(pattern = "dr\\.", ignore_case = TRUE), replacement = "dr")) %>%
  mutate(article = str_replace_all(string = article, pattern = regex(pattern = "prof\\.", ignore_case = TRUE), replacement = "prof")) %>%
  mutate(article = str_replace_all(string = article, pattern = regex(pattern = "mr\\.", ignore_case = TRUE), replacement = "mr")) %>%
  mutate(article = str_replace_all(string = article, pattern = regex(pattern = "ms\\.", ignore_case = TRUE), replacement = "ms")) %>%
  mutate(article = str_replace_all(string = article, pattern = regex(pattern = "mrs\\.", ignore_case = TRUE), replacement = "mrs")) %>%
  mutate(article = str_replace_all(string = article, pattern = regex(pattern = "st\\.", ignore_case = TRUE), replacement = "st")) %>%
  mutate(article = str_replace_all(string = article, pattern = regex(pattern = "rep\\.", ignore_case = TRUE), replacement = "rep")) %>%
  mutate(article = str_replace_all(string = article, pattern = regex(pattern = "u\\.s\\.", ignore_case = TRUE), replacement = "usa")) %>%
  mutate(article = str_replace_all(string = article, pattern = regex(pattern = "f\\.d\\.a\\.", ignore_case = TRUE), replacement = "fda")) %>%
  mutate(article = str_replace_all(string = article, pattern = regex(pattern = "gov\\.", ignore_case = TRUE), replacement = "gov")) %>%
  mutate(article = str_replace_all(string = article, pattern = regex(pattern = "sen\\.", ignore_case = TRUE), replacement = "sen")) %>%
  mutate(article = str_replace_all(string = article, pattern = regex(pattern = "( .{1})\\.", ignore_case = TRUE), replacement = "\\1")) %>%
  unnest_tokens(output = "sentence", 
                input = article, 
                token = "sentences", 
                to_lower = TRUE) %>%
  mutate(length = nchar(sentence)) %>%
  select(length, everything()) %>% 
  ungroup() %>% 
  identity()
# Diagnose problems with abbreviations (Mr. Dr. etc)
end_sentences_corpus <-
  corpus_sentences %>%
  filter(str_detect(string = sentence, pattern = "^[[:alnum:]]{1,5}\\.$")) %>%
  group_by(sentence) %>%
  summarize(n = n()) %>%
  ungroup() %>%
  mutate(n_char = nchar(sentence)) %>%
  arrange(desc(n))
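
As the comment in the pipeline above suggests, the abbreviation clean-up could be done in one pass. A minimal sketch, assuming a hypothetical lookup vector named abbreviations: str_replace_all() accepts a named character vector whose names are patterns and whose values are replacements. The "(?i)" prefix stands in for ignore_case = TRUE. The "\\b" word boundary is an addition that the original patterns do not have; without it a pattern such as "ms\\." also matches the end of "farms." and merges two sentences. The single-initial rule "( .{1})\\." is left out of this sketch.

```r
library(stringr)

# Hypothetical lookup table: names are regex patterns, values are replacements.
abbreviations <- c(
  "(?i)\\bdr\\."        = "dr",
  "(?i)\\bprof\\."      = "prof",
  "(?i)\\bmr\\."        = "mr",
  "(?i)\\bms\\."        = "ms",
  "(?i)\\bmrs\\."       = "mrs",
  "(?i)\\bst\\."        = "st",
  "(?i)\\brep\\."       = "rep",
  "(?i)\\bgov\\."       = "gov",
  "(?i)\\bsen\\."       = "sen",
  "(?i)\\bu\\.s\\."     = "usa",
  "(?i)\\bf\\.d\\.a\\." = "fda"
)

# Apply every pattern/replacement pair in a single call.
strip_abbreviation_dots <- function(text) {
  str_replace_all(string = text, pattern = abbreviations)
}

example <- "Dr. Smith visited farms. The F.D.A. and the U.S. agreed."
strip_abbreviation_dots(example)
# "dr Smith visited farms. The fda and the usa agreed."
```

Inside the pipeline this would replace the chain of mutate() calls with a single mutate(article = strip_abbreviation_dots(article)).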

We can now try to isolate sentences containing our terms of interest.


      antibiotic free antibiotic resistance            food borne    indiscriminate use         judicious use           prudent use 
                  125                   299                   119                    16                    14                     7 
      responsible use           routine use 
                    1                    53 
                       
                        The New York Times The Washington Post
  antibiotic free                       88                  37
  antibiotic resistance                165                 134
  food borne                            55                  64
  indiscriminate use                    13                   3
  judicious use                          9                   5
  prudent use                            6                   1
  responsible use                        1                   0
  routine use                           38                  15
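
The code that produced these counts is not shown in the rendered report. The following is a minimal sketch of one way to isolate the matching sentences, using two of the curated syntagmas and a few invented sentences in place of corpus_sentences. The object names toy_sentences and sentences_by_syntagma are hypothetical.

```r
library(dplyr)
library(purrr)
library(stringr)
library(tibble)

# Two of the curated syntagmas listed earlier.
syntagmas <- list(
  antibiotic_free = c("antibiotic free", "antibiotic-free",
                      "antibioticsfree", "free of antibiotics"),
  food_borne      = c("food borne", "food-borne")
)

# Invented sentences standing in for corpus_sentences.
toy_sentences <- tibble(
  id = c(1, 1, 2),
  sentence = c(
    "the chain now sells antibiotic-free chicken.",
    "prices rose last year.",
    "food borne illness is often linked to meat raised free of antibiotics."
  )
)

# For each syntagma, keep the sentences matching any of its variants.
# The variants contain no regex metacharacters, so joining them with "|" is safe.
sentences_by_syntagma <-
  imap(syntagmas, function(variants, label) {
    toy_sentences %>%
      filter(str_detect(string = sentence,
                        pattern = str_c(variants, collapse = "|"))) %>%
      mutate(syntagma = label)
  }) %>%
  bind_rows()

table(sentences_by_syntagma$syntagma)
```

With this approach a sentence is counted once per syntagma even if it contains several equivalent variants, and the same sentence can appear under more than one syntagma.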

On this new data frame, we can count the occurrences of each word in the sentence context of each syntagma, overall and by journal:

Grouping rowwise data frame strips rowwise nature
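
The counting step itself is not shown in the rendered report. As a rough sketch under stated assumptions, the context words can be obtained by tokenizing the matched sentences and counting words per syntagma. The table toy_context is invented, and removing the tidytext stop_words is an assumption, not necessarily what the study did.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Invented sentences, each already attached to a syntagma and a journal.
toy_context <- tibble(
  syntagma = c("routine_use", "routine_use", "food_borne"),
  journal  = c("The New York Times", "The Washington Post", "The Washington Post"),
  sentence = c(
    "the routine use of antibiotics in livestock",
    "critics oppose routine use in healthy livestock",
    "food borne outbreaks sicken thousands"
  )
)

# One row per word of each sentence, minus common English stop words.
context_words <-
  toy_context %>%
  unnest_tokens(output = "word", input = sentence, token = "words") %>%
  anti_join(stop_words, by = "word")

# Word counts in the context of each syntagma: overall, then by journal.
context_overall    <- count(context_words, syntagma, word, sort = TRUE)
context_by_journal <- count(context_words, syntagma, journal, word, sort = TRUE)

context_overall
```

The words of the syntagma itself (here "routine", "food", "borne") also appear in the counts; they can be filtered out afterwards if only the surrounding vocabulary is of interest.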

Figures printing and saving

Packages loading

─ Session info ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

─ Packages ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 package        * version date       lib source        
 askpass          1.1     2019-01-13 [1] CRAN (R 3.6.0)
 assertthat       0.2.1   2019-03-21 [1] CRAN (R 3.6.0)
 backports        1.1.4   2019-04-10 [1] CRAN (R 3.6.0)
 base64enc        0.1-3   2015-07-28 [1] CRAN (R 3.6.0)
 broom            0.5.2   2019-04-07 [1] CRAN (R 3.6.0)
 callr            3.2.0   2019-03-15 [1] CRAN (R 3.6.0)
 cli              1.1.0   2019-03-19 [1] CRAN (R 3.6.0)
 codetools        0.2-16  2018-12-24 [2] CRAN (R 3.6.0)
 colorspace       1.4-1   2019-03-18 [1] CRAN (R 3.6.0)
 cowplot        * 0.9.4   2019-01-08 [1] CRAN (R 3.6.0)
 crayon           1.3.4   2017-09-16 [1] CRAN (R 3.6.0)
 crosstalk        1.0.0   2016-12-21 [1] CRAN (R 3.6.0)
 data.table       1.12.2  2019-04-07 [1] CRAN (R 3.6.0)
 desc             1.2.0   2018-05-01 [1] CRAN (R 3.6.0)
 devtools         2.0.2   2019-04-08 [1] CRAN (R 3.6.0)
 digest           0.6.18  2018-10-10 [1] CRAN (R 3.6.0)
 dplyr          * 0.8.0.1 2019-02-15 [1] CRAN (R 3.6.0)
 DT             * 0.5     2018-11-05 [1] CRAN (R 3.6.0)
 evaluate         0.13    2019-02-12 [1] CRAN (R 3.6.0)
 foreach        * 1.4.4   2017-12-12 [1] CRAN (R 3.6.0)
 fs               1.2.7   2019-03-19 [1] CRAN (R 3.6.0)
 generics         0.0.2   2018-11-29 [1] CRAN (R 3.6.0)
 ggplot2        * 3.1.1   2019-04-07 [1] CRAN (R 3.6.0)
 ggrepel          0.8.0   2018-05-09 [1] CRAN (R 3.6.0)
 glmnet         * 2.0-16  2018-04-02 [1] CRAN (R 3.6.0)
 glue             1.3.1   2019-03-12 [1] CRAN (R 3.6.0)
 gtable           0.3.0   2019-03-25 [1] CRAN (R 3.6.0)
 hms              0.4.2   2018-03-10 [1] CRAN (R 3.6.0)
 htmltools        0.3.6   2017-04-28 [1] CRAN (R 3.6.0)
 htmlwidgets      1.3     2018-09-30 [1] CRAN (R 3.6.0)
 httpuv           1.5.1   2019-04-05 [1] CRAN (R 3.6.0)
 iterators        1.0.10  2018-07-13 [1] CRAN (R 3.6.0)
 janeaustenr      0.1.5   2017-06-10 [1] CRAN (R 3.6.0)
 jsonlite         1.6     2018-12-07 [1] CRAN (R 3.6.0)
 knitr            1.22    2019-03-08 [1] CRAN (R 3.6.0)
 koRpus         * 0.11-5  2018-10-28 [1] CRAN (R 3.6.0)
 koRpus.lang.en * 0.1-2   2018-03-21 [1] CRAN (R 3.6.0)
 labeling         0.3     2014-08-23 [1] CRAN (R 3.6.0)
 later            0.8.0   2019-02-11 [1] CRAN (R 3.6.0)
 lattice          0.20-38 2018-11-04 [2] CRAN (R 3.6.0)
 lazyeval         0.2.2   2019-03-15 [1] CRAN (R 3.6.0)
 lubridate      * 1.7.4   2018-04-11 [1] CRAN (R 3.6.0)
 magrittr         1.5     2014-11-22 [1] CRAN (R 3.6.0)
 Matrix         * 1.2-17  2019-03-22 [2] CRAN (R 3.6.0)
 memoise          1.1.0   2017-04-21 [1] CRAN (R 3.6.0)
 mime             0.6     2018-10-05 [1] CRAN (R 3.6.0)
 munsell          0.5.0   2018-06-12 [1] CRAN (R 3.6.0)
 nlme             3.1-139 2019-04-09 [2] CRAN (R 3.6.0)
 pdftools       * 2.2     2019-03-10 [1] CRAN (R 3.6.0)
 pillar           1.3.1   2018-12-15 [1] CRAN (R 3.6.0)
 pkgbuild         1.0.3   2019-03-20 [1] CRAN (R 3.6.0)
 pkgconfig        2.0.2   2018-08-16 [1] CRAN (R 3.6.0)
 pkgload          1.0.2   2018-10-29 [1] CRAN (R 3.6.0)
 plyr             1.8.4   2016-06-08 [1] CRAN (R 3.6.0)
 prettyunits      1.0.2   2015-07-13 [1] CRAN (R 3.6.0)
 processx         3.3.0   2019-03-10 [1] CRAN (R 3.6.0)
 promises         1.0.1   2018-04-13 [1] CRAN (R 3.6.0)
 ps               1.3.0   2018-12-21 [1] CRAN (R 3.6.0)
 purrr            0.3.2   2019-03-15 [1] CRAN (R 3.6.0)
 qpdf             1.1     2019-03-07 [1] CRAN (R 3.6.0)
 R6               2.4.0   2019-02-14 [1] CRAN (R 3.6.0)
 Rcpp             1.0.1   2019-03-17 [1] CRAN (R 3.6.0)
 readr          * 1.3.1   2018-12-21 [1] CRAN (R 3.6.0)
 remotes          2.0.4   2019-04-10 [1] CRAN (R 3.6.0)
 rlang            0.3.4   2019-04-07 [1] CRAN (R 3.6.0)
 rmarkdown        1.12    2019-03-14 [1] CRAN (R 3.6.0)
 rprojroot        1.3-2   2018-01-03 [1] CRAN (R 3.6.0)
 rstudioapi       0.10    2019-03-19 [1] CRAN (R 3.6.0)
 scales           1.0.0   2018-08-09 [1] CRAN (R 3.6.0)
 sessioninfo      1.1.1   2018-11-05 [1] CRAN (R 3.6.0)
 shiny            1.3.2   2019-04-22 [1] CRAN (R 3.6.0)
 SnowballC        0.6.0   2019-01-15 [1] CRAN (R 3.6.0)
 stringi          1.4.3   2019-03-12 [1] CRAN (R 3.6.0)
 stringr        * 1.4.0   2019-02-10 [1] CRAN (R 3.6.0)
 sylly          * 0.1-5   2018-07-29 [1] CRAN (R 3.6.0)
 sylly.en         0.1-3   2018-03-19 [1] CRAN (R 3.6.0)
 textstem       * 0.1.4   2018-04-09 [1] CRAN (R 3.6.0)
 tibble           2.1.1   2019-03-16 [1] CRAN (R 3.6.0)
 tidyr          * 0.8.3   2019-03-01 [1] CRAN (R 3.6.0)
 tidyselect       0.2.5   2018-10-11 [1] CRAN (R 3.6.0)
 tidytext       * 0.2.0   2018-10-17 [1] CRAN (R 3.6.0)
 tokenizers       0.2.1   2018-03-29 [1] CRAN (R 3.6.0)
 usethis          1.5.0   2019-04-07 [1] CRAN (R 3.6.0)
 withr            2.1.2   2018-03-15 [1] CRAN (R 3.6.0)
 xfun             0.6     2019-04-02 [1] CRAN (R 3.6.0)
 xtable           1.8-4   2019-04-21 [1] CRAN (R 3.6.0)
 yaml             2.2.0   2018-07-25 [1] CRAN (R 3.6.0)

[1] /home/abn/R/x86_64-pc-linux-gnu-library/3.6
[2] /usr/lib/R/library
