Paper Comber Tutorial

Author

Ben J. Rivera

Published

October 18, 2022

Introduction

  • To access this, go to my website:

    • https://tinyurl.com/meta-R
  • Download the R Markdown file here: https://drive.google.com/file/d/15oA8sYpUgekt_Ccq6jijLqrBr-9y9V8d/view?usp=sharing

  • @NotBenRivera on Twitter

  • Live tutorial at 2:20 Pacific on 10.18.2022

    • https://ucdavis.zoom.us/s/97766151914
    • Passcode: 763246

Example Word Cloud

Goals

  • Provide a tool to help get started with a Meta-Analysis
  • Explain how the tool works well enough for y’all to use and modify
  • At the end, we all have a bunch of papers set up to start reading and extracting information from!

Overview of how it works

  • Select and run a search query through the Web of Science Core Collection
  • Use the results of the search to download as many PDFs as possible
  • Based on titles and keywords, create word clouds! (mostly for fun)
  • Upload all the results to your Google Drive for collaboration and information extraction!

I strongly recommend downloading the script from the link above and running it line by line rather than trying to run it all at once!

Packages

Below are the packages you will use here today! If you don’t have them all installed, uncomment the lines that start with just one “#” and run them. Most are on CRAN, but two are installed from GitHub using the remotes and devtools packages, so you may have to install those too.

#####Install from CRAN#######
#install.packages(c("tidyverse", "metagear", "googlesheets4", "googledrive", "here", "wordcloud", "wordcloud2", "tm")) #package names must be wrapped in c()

#####Install from GitHub####
#if (!require("remotes") ){install.packages("remotes")}
#remotes::install_github("netique/scihubr")

#devtools::install_github("juba/rwos") #may need to install 'devtools' from CRAN

#####Load Packages########
library(tidyverse)
library(rwos) #
library(metagear)
library(scihubr) #
library(googlesheets4)
library(googledrive)
library(here)
library(wordcloud)
library(wordcloud2)
library(tm)
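
If you are not sure which of these you already have, a quick check like the one below can tell you before a library() call fails halfway through. This is only a sketch: missing_packages() is a made-up helper name for illustration, not a function from any of the packages above.

```r
# Hypothetical helper: returns the packages from 'pkgs' that are not installed.
# requireNamespace() loads a package's namespace (without attaching it) and
# returns FALSE instead of throwing an error when the package is missing.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("tidyverse", "rwos", "metagear", "scihubr", "googlesheets4",
            "googledrive", "here", "wordcloud", "wordcloud2", "tm")
missing_packages(needed) # an empty result means everything is installed
```

Anything it returns still needs to be installed, from CRAN or GitHub as described above.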

Part 2: Stealing papers from Sci-hub

In this section, we run every DOI through a for loop to download the PDFs and put them in a folder called “SearchResultsTest”. If you have this in an R Project (recommended), the folder will be created in the same place as this file. If you are running this as a regular script somehow, it will be in your working directory.

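Your VPN now needs to be turned OFF

This part takes a while and has a ~70-80% success rate: expect it to download about 7 out of every 10 PDFs in your search results. You will also see plenty of warnings and errors in the console while it runs; the loop is written to keep going past them.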

Code Downloads

try_catch <- function(exprs) {!inherits(try(eval(exprs)), "try-error")} #this allows the for-loop to proceed even if there is an error

dir.create(here("SearchResultsTest")) #Creates folder to put the pdfs
Warning in dir.create(here("SearchResultsTest")): 'C:
\Users\benny\OneDrive\Documents\Website_Rivera_Official\SearchResultsTest'
already exists
for (i in seq_len(nrow(pubs))) { #iterate through list of search results
  fileDOI <- pubs$doi[i] #extracts DOI
  if (is.na(fileDOI)) next #nothing to look up without a DOI
  #use the title as the file name, minus characters Windows does not allow in file names (like "?" and ":")
  filename2 <- gsub('[\\\\/:*?"<>|]', "", pubs$title[i])
  pdfPath <- file.path(here::here("SearchResultsTest"), paste0(filename2, ".pdf")) #".pdf" is added exactly once
  #try Sci-hub first; try_catch() returns FALSE if the download throws an error
  gotIt <- try_catch(download_paper(query = fileDOI, path = pdfPath, open = FALSE))
  #if Sci-hub won't work, it tries metagear instead. May need to change 'WindowsProxy' to FALSE if not on Windows.
  if (!gotIt) {
    try(PDF_download(DOI = fileDOI, theFileName = filename2, directory = here::here("SearchResultsTest"), WindowsProxy = TRUE))
  }
}
Error in parse_url(.) : length(url) == 1 is not TRUE
Collecting PDF from DOI: NA
            Extraction 1 of 2: HTML script.... cannot open: HTTP status was '404 Not Found'
            Extraction 2 of 2: PDF download... skipped
Warning in file(con, "wb"): cannot open file 'C:/Users/benny/OneDrive/
Documents/Website_Rivera_Official/SearchResultsTest/Does atmospheric nitrogen
deposition lead to greater nitrogen and carbon accumulation in coastal sand
dunes?.pdf.pdf': Invalid argument
Error in file(con, "wb") : cannot open the connection
Collecting PDF from DOI: 10.1016/j.biocon.2016.12.007
            Extraction 1 of 2: HTML script.... successful
            Extraction 2 of 2: PDF download... failed, connections too slow or files not PDF format
i Cite as:
  Provoost, S., Jones, M. L. M., & Edmondson, S. E. (2009).
  Changes in landscape and vegetation of coastal dunes in northwest Europe: a review. Journal of Coastal Conservation, 15(1), 207–226.
  doi:10.1007/s11852-009-0068-5
i Cite as:
  Provoost, S., Jones, M. L. M., & Edmondson, S. E. (2009).
  Changes in landscape and vegetation of coastal dunes in northwest Europe: a review. Journal of Coastal Conservation, 15(1), 207–226.
  doi:10.1007/s11852-009-0068-5
Warning in rep(yes, length.out = len): 'x' is NULL so the result will be NULL
Error in ans[ypos] <- rep(yes, length.out = len)[ypos] : 
  replacement has length zero
i Cite as:
  Rodriguez-Echeverria, S., Crisostomo, J. A., & Freitas, H. (2007).
  Genetic Diversity of Rhizobia Associated with Acacia longifolia in Two Stages of Invasion of Coastal Sand Dunes. Applied and Environmental Microbiology, 73(15), 5066–5070.
  doi:10.1128/aem.00613-07
i Cite as:
  Rodriguez-Echeverria, S., Crisostomo, J. A., & Freitas, H. (2007).
  Genetic Diversity of Rhizobia Associated with Acacia longifolia in Two Stages of Invasion of Coastal Sand Dunes. Applied and Environmental Microbiology, 73(15), 5066–5070.
  doi:10.1128/aem.00613-07
Warning in rep(yes, length.out = len): 'x' is NULL so the result will be NULL
Error in ans[ypos] <- rep(yes, length.out = len)[ypos] : 
  replacement has length zero
i Cite as:
  Jones, M. L. M., Sowerby, A., Williams, D. L., & Jones, R. E. (2008).
  Factors controlling soil development in sand dunes: evidence from a coastal dune soil chronosequence. Plant and Soil, 307(1-2), 219–234.
  doi:10.1007/s11104-008-9601-9
i Cite as:
  Jones, M. L. M., Sowerby, A., Williams, D. L., & Jones, R. E. (2008).
  Factors controlling soil development in sand dunes: evidence from a coastal dune soil chronosequence. Plant and Soil, 307(1-2), 219–234.
  doi:10.1007/s11104-008-9601-9
Warning in rep(yes, length.out = len): 'x' is NULL so the result will be NULL
Error in ans[ypos] <- rep(yes, length.out = len)[ypos] : 
  replacement has length zero
Error in parse_url(.) : length(url) == 1 is not TRUE
Collecting PDF from DOI: NA
            Extraction 1 of 2: HTML script.... cannot open: HTTP status was '404 Not Found'
            Extraction 2 of 2: PDF download... skipped
i Cite as:
  Kooijman, A. M., Lubbers, I., & van Til, M. (2009).
  Iron-rich dune grasslands: Relations between soil organic matter and sorption of Fe and P. Environmental Pollution, 157(11), 3158–3165.
  doi:10.1016/j.envpol.2009.05.022
i Cite as:
  Kooijman, A. M., Lubbers, I., & van Til, M. (2009).
  Iron-rich dune grasslands: Relations between soil organic matter and sorption of Fe and P. Environmental Pollution, 157(11), 3158–3165.
  doi:10.1016/j.envpol.2009.05.022
Warning in rep(yes, length.out = len): 'x' is NULL so the result will be NULL
Error in ans[ypos] <- rep(yes, length.out = len)[ypos] : 
  replacement has length zero
i Cite as:
  Arun, A. B., & Sridhar, K. R. (2004).
  Symbiotic performance of fast-growing rhizobia isolated from the coastal sand dune legumes of west coast of India. Biology and Fertility of Soils, 40(6), 435–439.
  doi:10.1007/s00374-004-0800-0
i Cite as:
  Arun, A. B., & Sridhar, K. R. (2004).
  Symbiotic performance of fast-growing rhizobia isolated from the coastal sand dune legumes of west coast of India. Biology and Fertility of Soils, 40(6), 435–439.
  doi:10.1007/s00374-004-0800-0
Warning in rep(yes, length.out = len): 'x' is NULL so the result will be NULL
Error in ans[ypos] <- rep(yes, length.out = len)[ypos] : 
  replacement has length zero
i Cite as:
  Hanslin, H. M., & Kollmann, J. (2016).
  Positive responses of coastal dune plants to soil conditioning by the invasive Lupinus nootkatensis. Acta Oecologica, 77, 1–9.
  doi:10.1016/j.actao.2016.08.007
i Cite as:
  Hanslin, H. M., & Kollmann, J. (2016).
  Positive responses of coastal dune plants to soil conditioning by the invasive Lupinus nootkatensis. Acta Oecologica, 77, 1–9.
  doi:10.1016/j.actao.2016.08.007
Warning in rep(yes, length.out = len): 'x' is NULL so the result will be NULL
Error in ans[ypos] <- rep(yes, length.out = len)[ypos] : 
  replacement has length zero
Error in parse_url(.) : length(url) == 1 is not TRUE
Collecting PDF from DOI: 10.1007/s12237-022-01052-2
            Extraction 1 of 2: HTML script.... successful
            Extraction 2 of 2: PDF download... successful
i Cite as:
  Kooijman, A. M., van Til, M., Noordijk, E., Remke, E., & Kalbitz, K. (2017).
  Nitrogen deposition and grass encroachment in calcareous and acidic Grey dunes (H2130) in NW-Europe. Biological Conservation, 212, 406–415.
  doi:10.1016/j.biocon.2016.08.009
i Cite as:
  Kooijman, A. M., van Til, M., Noordijk, E., Remke, E., & Kalbitz, K. (2017).
  Nitrogen deposition and grass encroachment in calcareous and acidic Grey dunes (H2130) in NW-Europe. Biological Conservation, 212, 406–415.
  doi:10.1016/j.biocon.2016.08.009
Warning in rep(yes, length.out = len): 'x' is NULL so the result will be NULL
Error in ans[ypos] <- rep(yes, length.out = len)[ypos] : 
  replacement has length zero
i Cite as:
  Kim, D., & Yu, K. B. (2008).
  A conceptual model of coastal dune ecology synthesizing spatial gradients of vegetation, soil, and geomorphology. Plant Ecology, 202(1), 135–148.
  doi:10.1007/s11258-008-9456-4
i Cite as:
  Kim, D., & Yu, K. B. (2008).
  A conceptual model of coastal dune ecology synthesizing spatial gradients of vegetation, soil, and geomorphology. Plant Ecology, 202(1), 135–148.
  doi:10.1007/s11258-008-9456-4
Warning in rep(yes, length.out = len): 'x' is NULL so the result will be NULL
Error in ans[ypos] <- rep(yes, length.out = len)[ypos] : 
  replacement has length zero
i Cite as:
  Sridhar, K. R., Arun, A. B., Narula, N., Deubel, A., & Merbach, W. (2005).
  Patterns of Sole-Carbon-Source Utilization by Fast-Growing Coastal Sand Dune Rhizobia of the Southwest Coast of India. Engineering in Life Sciences, 5(5), 425–430.
  doi:10.1002/elsc.200520091
i Cite as:
  Sridhar, K. R., Arun, A. B., Narula, N., Deubel, A., & Merbach, W. (2005).
  Patterns of Sole-Carbon-Source Utilization by Fast-Growing Coastal Sand Dune Rhizobia of the Southwest Coast of India. Engineering in Life Sciences, 5(5), 425–430.
  doi:10.1002/elsc.200520091
Warning in rep(yes, length.out = len): 'x' is NULL so the result will be NULL
Error in ans[ypos] <- rep(yes, length.out = len)[ypos] : 
  replacement has length zero
i Cite as:
  Rodríguez-Echeverría, S. (2010).
  Rhizobial hitchhikers from Down Under: invasional meltdown in a plant-bacteria mutualism? Journal of Biogeography.
  doi:10.1111/j.1365-2699.2010.02284.x
i Cite as:
  Rodríguez-Echeverría, S. (2010).
  Rhizobial hitchhikers from Down Under: invasional meltdown in a plant-bacteria mutualism? Journal of Biogeography.
  doi:10.1111/j.1365-2699.2010.02284.x
Warning in rep(yes, length.out = len): 'x' is NULL so the result will be NULL
Error in ans[ypos] <- rep(yes, length.out = len)[ypos] : 
  replacement has length zero
i Cite as:
  Hellmann, C., Sutter, R., Rascher, K. G., Máguas, C., Correia, O., & Werner, C. (2011).
  Impact of an exotic N2-fixing Acacia on composition and N status of a native Mediterranean community. Acta Oecologica, 37(1), 43–50.
  doi:10.1016/j.actao.2010.11.005
i Cite as:
  Hellmann, C., Sutter, R., Rascher, K. G., Máguas, C., Correia, O., & Werner, C. (2011).
  Impact of an exotic N2-fixing Acacia on composition and N status of a native Mediterranean community. Acta Oecologica, 37(1), 43–50.
  doi:10.1016/j.actao.2010.11.005
Warning in rep(yes, length.out = len): 'x' is NULL so the result will be NULL
Error in ans[ypos] <- rep(yes, length.out = len)[ypos] : 
  replacement has length zero
Error in parse_url(.) : length(url) == 1 is not TRUE
Collecting PDF from DOI: NA
            Extraction 1 of 2: HTML script.... cannot open: HTTP status was '404 Not Found'
            Extraction 2 of 2: PDF download... skipped
Error in open.connection(con, "rb") : Could not resolve host: downloads
Collecting PDF from DOI: 10.1002/ldr.4078
            Extraction 1 of 2: HTML script.... successful
            Extraction 2 of 2: PDF download... failed, url connections too slow or unavailable
Error in parse_url(.) : length(url) == 1 is not TRUE
Collecting PDF from DOI: NA
            Extraction 1 of 2: HTML script.... cannot open: HTTP status was '404 Not Found'
            Extraction 2 of 2: PDF download... skipped
i Cite as:
  Selami, N., Auriac, M.-C., Catrice, O., Capela, D., Kaid-Harche, M., & Timmers, T. (2014).
  Morphology and anatomy of root nodules of Retama monosperma (L.)Boiss. Plant and Soil, 379(1-2), 109–119.
  doi:10.1007/s11104-014-2045-5
i Cite as:
  Selami, N., Auriac, M.-C., Catrice, O., Capela, D., Kaid-Harche, M., & Timmers, T. (2014).
  Morphology and anatomy of root nodules of Retama monosperma (L.)Boiss. Plant and Soil, 379(1-2), 109–119.
  doi:10.1007/s11104-014-2045-5
Warning in rep(yes, length.out = len): 'x' is NULL so the result will be NULL
Error in ans[ypos] <- rep(yes, length.out = len)[ypos] : 
  replacement has length zero
i Cite as:
  Bolhuis, H., Fillinger, L., & Stal, L. J. (2013).
  Coastal Microbial Mat Diversity along a Natural Salinity Gradient. PLoS ONE, 8(5), e63166.
  doi:10.1371/journal.pone.0063166
i Cite as:
  Bolhuis, H., Fillinger, L., & Stal, L. J. (2013).
  Coastal Microbial Mat Diversity along a Natural Salinity Gradient. PLoS ONE, 8(5), e63166.
  doi:10.1371/journal.pone.0063166
Warning in rep(yes, length.out = len): 'x' is NULL so the result will be NULL
Error in ans[ypos] <- rep(yes, length.out = len)[ypos] : 
  replacement has length zero
Error in open.connection(con, "rb") : Could not resolve host: downloads
Collecting PDF from DOI: 10.1007/s13199-021-00765-5
            Extraction 1 of 2: HTML script.... successful
            Extraction 2 of 2: PDF download... successful
Error in open.connection(con, "rb") : Could not resolve host: downloads
Collecting PDF from DOI: 10.1038/s41598-019-45490-8
            Extraction 1 of 2: HTML script.... successful
            Extraction 2 of 2: PDF download... successful
i Cite as:
  Birnbaum, C., Bissett, A., Teste, F. P., & Laliberté, E. (2018).
  Symbiotic N2-Fixer Community Composition, but Not Diversity, Shifts in Nodules of a Single Host Legume Across a 2-Million-Year Dune Chronosequence. Microbial Ecology.
  doi:10.1007/s00248-018-1185-1
i Cite as:
  Birnbaum, C., Bissett, A., Teste, F. P., & Laliberté, E. (2018).
  Symbiotic N2-Fixer Community Composition, but Not Diversity, Shifts in Nodules of a Single Host Legume Across a 2-Million-Year Dune Chronosequence. Microbial Ecology.
  doi:10.1007/s00248-018-1185-1
Warning in rep(yes, length.out = len): 'x' is NULL so the result will be NULL
Error in ans[ypos] <- rep(yes, length.out = len)[ypos] : 
  replacement has length zero
i Cite as:
  Emery, S. M., & Rudgers, J. A. (2011).
  Beach Restoration Efforts Influenced by Plant Variety, Soil Inoculum, and Site Effects. Journal of Coastal Research, 274, 636–644.
  doi:10.2112/jcoastres-d-10-00120.1
i Cite as:
  Emery, S. M., & Rudgers, J. A. (2011).
  Beach Restoration Efforts Influenced by Plant Variety, Soil Inoculum, and Site Effects. Journal of Coastal Research, 274, 636–644.
  doi:10.2112/jcoastres-d-10-00120.1
Warning in rep(yes, length.out = len): 'x' is NULL so the result will be NULL
Error in ans[ypos] <- rep(yes, length.out = len)[ypos] : 
  replacement has length zero
i Cite as:
  Werner, C., Zumkier, U., Beyschlag, W., & Máguas, C. (2009).
  High competitiveness of a resource demanding invasive acacia under low resource supply. Plant Ecology, 206(1), 83–96.
  doi:10.1007/s11258-009-9625-0
i Cite as:
  Werner, C., Zumkier, U., Beyschlag, W., & Máguas, C. (2009).
  High competitiveness of a resource demanding invasive acacia under low resource supply. Plant Ecology, 206(1), 83–96.
  doi:10.1007/s11258-009-9625-0
Warning in rep(yes, length.out = len): 'x' is NULL so the result will be NULL
Error in ans[ypos] <- rep(yes, length.out = len)[ypos] : 
  replacement has length zero
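
Several of the failures in the output above are search results with no DOI at all (the "Collecting PDF from DOI: NA" lines), and no downloader can do anything with those. A small helper like the one below can set them aside so you know which papers to chase down by hand. It is a sketch only: split_by_doi() is a hypothetical name, the toy data frame is made up, and it assumes your results have a doi column like pubs does.

```r
# Hypothetical helper: splits search results into rows with and without a DOI.
split_by_doi <- function(pubs) {
  has_doi <- !is.na(pubs$doi) & nzchar(pubs$doi)
  list(with_doi    = pubs[has_doi, , drop = FALSE],
       without_doi = pubs[!has_doi, , drop = FALSE])
}

# Made-up example data standing in for real search results
toy <- data.frame(title = c("Paper A", "Paper B", "Paper C"),
                  doi   = c("10.1000/example.1", NA, ""),
                  stringsAsFactors = FALSE)
parts <- split_by_doi(toy)
parts$without_doi$title # titles you will need to find manually
```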

Resulting Folder

  • All the PDFs it was able to find should be downloaded into the folder in your working directory
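
To see how close your own run came to the ~70-80% success rate mentioned earlier, you can count the PDFs that landed in the folder and compare that to the number of search results. The function below is a rough sketch: pdf_success_rate() is a hypothetical helper, and it simply counts files ending in ".pdf" without checking that each one is a valid PDF.

```r
# Hypothetical helper: compares the number of PDFs in 'dir' with the
# number of search results that were attempted.
pdf_success_rate <- function(dir, n_results) {
  n_pdfs <- length(list.files(dir, pattern = "\\.pdf$", ignore.case = TRUE))
  c(downloaded = n_pdfs, searched = n_results, rate = n_pdfs / n_results)
}

# With the objects from this tutorial, the call would look like this:
# pdf_success_rate(here::here("SearchResultsTest"), nrow(pubs))
```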

Output of Pdf Downloading

Part 3: Word Cloud time!

Okay, this is just cool! I just wanted to make some neat word clouds, so feel free to skip this part. It takes all the keywords found in the search and formats them for use, and I do the same with the titles from the results. For each one, wordcloud() creates a static image and wordcloud2() creates an interactive version where hovering over a word tells you how many times it occurred.

Code shamelessly stolen from here and it is magic! https://cran.r-project.org/web/packages/wordcloud2/vignettes/wordcloud.html

Word Clouds made from key words

#### Word cloud from key words
kw<-pubs$keywords


# Create a corpus  
docs <- Corpus(VectorSource(kw))

docs <- docs  %>%
  tm_map(removeNumbers) %>%
  tm_map(removePunctuation) %>%
  tm_map(stripWhitespace)
Warning in tm_map.SimpleCorpus(., removeNumbers): transformation drops documents
Warning in tm_map.SimpleCorpus(., removePunctuation): transformation drops
documents
Warning in tm_map.SimpleCorpus(., stripWhitespace): transformation drops
documents
docs <- tm_map(docs, content_transformer(tolower))
Warning in tm_map.SimpleCorpus(docs, content_transformer(tolower)):
transformation drops documents
docs <- tm_map(docs, removeWords, stopwords("english"))
Warning in tm_map.SimpleCorpus(docs, removeWords, stopwords("english")):
transformation drops documents
dtm <- TermDocumentMatrix(docs) 
matrix <- as.matrix(dtm) 
words <- sort(rowSums(matrix),decreasing=TRUE) 
df <- data.frame(word = names(words),freq=words)

set.seed(27)
wordcloud(words = df$word, freq = df$freq, min.freq = 1, max.words = 200, random.order = FALSE, rot.per = 0.35, colors = brewer.pal(8, "Dark2"))
Warning in wordcloud(words = df$word, freq = df$freq, min.freq = 1, max.words =
200, : mycorrhizae could not be fit on page. It will not be plotted.
Warning in wordcloud(words = df$word, freq = df$freq, min.freq = 1, max.words =
200, : competition could not be fit on page. It will not be plotted.
Warning in wordcloud(words = df$word, freq = df$freq, min.freq = 1, max.words =
200, : mediterranean could not be fit on page. It will not be plotted.
Warning in wordcloud(words = df$word, freq = df$freq, min.freq = 1, max.words =
200, : progressive could not be fit on page. It will not be plotted.

wordcloud2(data=df, fontFamily = 'Times', color = "random-dark")

Word cloud from titles

kw2<-pubs$title
docs2 <- Corpus(VectorSource(kw2))

docs2 <- docs2  %>%
  tm_map(removeNumbers) %>%
  tm_map(removePunctuation) %>%
  tm_map(stripWhitespace)
Warning in tm_map.SimpleCorpus(., removeNumbers): transformation drops documents
Warning in tm_map.SimpleCorpus(., removePunctuation): transformation drops
documents
Warning in tm_map.SimpleCorpus(., stripWhitespace): transformation drops
documents
docs2 <- tm_map(docs2, content_transformer(tolower))
Warning in tm_map.SimpleCorpus(docs2, content_transformer(tolower)):
transformation drops documents
docs2 <- tm_map(docs2, removeWords, stopwords("english"))
Warning in tm_map.SimpleCorpus(docs2, removeWords, stopwords("english")):
transformation drops documents
dtm2 <- TermDocumentMatrix(docs2) 
matrix2 <- as.matrix(dtm2) 
words2 <- sort(rowSums(matrix2),decreasing=TRUE) 
df2 <- data.frame(word = names(words2),freq=words2)

set.seed(27)
wordcloud(words = df2$word, freq = df2$freq, min.freq = 1, max.words = 200, random.order = FALSE, rot.per = 0.35, colors = brewer.pal(8, "Dark2"))

wordcloud2(data=df2, fontFamily = 'Times', color = "random-dark")
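
Both word clouds are drawn from the same kind of table: one row per word with a count of how often it appears. If the tm steps feel like a black box, the sketch below builds a similar table in base R so you can see what is going on. It is not the tm pipeline used above: word_freq() is a hypothetical helper, its stop word list is a short made-up one rather than stopwords("english"), and it replaces punctuation with a space where removePunctuation deletes it, so hyphenated words split differently.

```r
# Hypothetical helper: counts words in a character vector of titles or keywords.
word_freq <- function(text, stop_words = c("the", "of", "and", "in", "a", "to")) {
  text <- tolower(text[!is.na(text)])               # drop missing entries, lowercase
  text <- gsub("[[:digit:][:punct:]]+", " ", text)  # numbers and punctuation become spaces
  tokens <- unlist(strsplit(text, "[[:space:]]+"))  # split into single words
  tokens <- tokens[nzchar(tokens) & !tokens %in% stop_words]
  counts <- sort(table(tokens), decreasing = TRUE)  # most frequent words first
  data.frame(word = names(counts), freq = as.integer(counts),
             stringsAsFactors = FALSE)
}

# Made-up titles, just to show the shape of the result
word_freq(c("Coastal dune soil", "Dune soil nitrogen 2009", NA))
```

The result has the same word and freq columns as df and df2, so it can be passed to wordcloud2(data = ...) in the same way.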

Part 4: Uploading results to Google Sheets

This is an important final step. This will upload the metadata and all the PDFs (as a zipped folder) to your Google Drive automatically. It will prompt you to sign in. I recommend using your UC Davis Google Drive, but it does not totally matter.

#### RUN THIS IF IT IS YOUR FIRST TIME ####

#gs4_auth() #signs you in and gives you the ability to create and alter sheets on your Google Drive

#gs4_create("SearchResultsDataTest", sheets = list(data = pubs)) #uploads list of publications and all the info to a google sheets

#zip(zipfile= "resultsTest.zip", files = here("SearchResultsTest")) #zips your pdfs

###this will ask you to sign in again because googledrive signs in separately from googlesheets4, which is annoying, but deal with it.
#drive_upload(media = "resultsTest.zip") #uploads a zipped file to your google drive. 

Example Output

Then you can get started on reading and extracting! My personal set up for my Meta Analysis

That’s it!

I hope this worked for you! Let me know if you run into any troubles or have any ways for me to make it better. Thank you so much for giving it a look. Let me know if you actually end up using this and send me your word clouds!

@NotBenRivera on Twitter or email me at benrivera@ucdavis.edu