Search term selection with litsearchr v1.0.0 for an example systematic review of the effects of fire on black-backed woodpeckers

About the package

The litsearchr package for R is designed to partially automate search term selection and writing search strategies for systematic reviews. This vignette demonstrates its utility through a mock, example review examining the effects of fire on black-backed woodpeckers by demonstrating how the package: (1) Identifies potential keywords through the naive search input, (2) Builds a keyword co-occurrence network to assist with building a more precise search strategy, (3) Uses a cutoff function to identify important changes in keyword importance, (4) Assists with grouping terms into concepts, and (5) Writes a Boolean search as a result of completion of the four previous steps.

Write and conduct naive search

In our empirical example, we begin with a naive search intended to capture a set of relevant articles. Naive search terms: ((“black-backed woodpecker” OR “picoides arcticus” OR “picoides tridactylus” AND (burn* OR fire*)). We ran the search in Scopus and Zoological Record (Web of Science), exporting results in .ris and .txt, respectively. These exported search results are then imported to litsearchr using the import_results function and next deduplicated using the remove_duplicates function. In some cases, it is best to run the remove_duplicates function two or more times, for example starting with exact matches and moving on to fuzzy matching.

# Note: system.file() is only used to identify where the example datasets are stored
# If litsearchr and its dependencies were successfully installed, this directory exists on your computer

# If you are using your own bibliographic files, you should not use system.file
# You should instead give it the full path (or relative path from your current working directory) to the directory where your files are stored
search_directory <- system.file("extdata", package="litsearchr")

naiveimport <-
  litsearchr::import_results(directory = search_directory, verbose = TRUE)
#> Reading file /tmp/RtmpGp7mOt/temp_libpath605c6719a3ad/litsearchr/extdata/scopus.ris ... done
#> Reading file /tmp/RtmpGp7mOt/temp_libpath605c6719a3ad/litsearchr/extdata/zoorec.txt ... done

naiveresults <-
  litsearchr::remove_duplicates(naiveimport, field = "title", method = "string_osa")

1. Identify potential keywords

Using the deduplicated records captured from the naive search, the extract_terms function will systematically extract all potential keywords from the article titles, abstracts, or other fields that are passed to the function as text. There are a couple different methods for extracting terms: the Rapid Automatic Keyword Extraction algorithm (RAKE), a function that approximates RAKE (fakerake), or simple ngram detection. Alternatively, extract_terms can take the keywords field from the deduplicated records (if it exists) and clean up author- and database-tagged keywords.

rakedkeywords <-
  litsearchr::extract_terms(
    text = paste(naiveresults$title, naiveresults$abstract),
    method = "fakerake",
    min_freq = 2,
    ngrams = TRUE,
    min_n = 2,
    language = "English"
  )
#> Loading required namespace: stopwords

taggedkeywords <-
  litsearchr::extract_terms(
    keywords = naiveresults$keywords,
    method = "tagged",
    min_freq = 2,
    ngrams = TRUE,
    min_n = 2,
    language = "English"
  )

2. Build the keyword co-occurrence network

Using the results from Step 1, Identify potential keywords, a series of functions are next run to create a co-occurrence network.

all_keywords <- unique(append(taggedkeywords, rakedkeywords))

naivedfm <-
  litsearchr::create_dfm(
    elements = paste(naiveresults$title, naiveresults$abstract),
    features = all_keywords
  )

naivegraph <-
  litsearchr::create_network(
    search_dfm = naivedfm,
    min_studies = 2,
    min_occ = 2
  )

3. Identify change points in keyword importance

The keyword co-occurrence network can next be quantitatively assessed to detect a cutoff point for changes in the level of importance of a particular keyword to the concept. This will help in making an efficient but comprehensive search by removing terms that are not central to a field of study.

cutoff <-
  litsearchr::find_cutoff(
    naivegraph,
    method = "cumulative",
    percent = .80,
    imp_method = "strength"
  )

reducedgraph <-
  litsearchr::reduce_graph(naivegraph, cutoff_strength = cutoff[1])

searchterms <- litsearchr::get_keywords(reducedgraph)

head(searchterms, 20)
#>  [1] "black hill"              "black hills"            
#>  [3] "black-backed woodpecker" "boreal forest"          
#>  [5] "breeding season"         "burned forest"          
#>  [7] "cavity nesters"          "cavity-nesting birds"   
#>  [9] "certhia americana"       "colaptes auratus"       
#> [11] "conifer forests"         "coniferous forest"      
#> [13] "coniferous forests"      "dendroctonus ponderosae"
#> [15] "dryocopus pileatus"      "fire severity"          
#> [17] "food availability"       "forest management"      
#> [19] "habitat selection"       "habitat suitability"

4. Group terms into concepts

Now that the important keywords for the search have been identified, they can be grouped into blocks to build the search strategy. For Boolean searches, terms are grouped into similar concept groups where they can be combined with “OR” statements and the separate blocks combined with “AND” statements.

In our example, all keywords that relate to woodpeckers would be in a similar concept group (e.g., “three-toed woodpeckers”, “cavity-nesting birds” etc.) while terms relating to fire (e.g. “post-fire”, “burned forest”, etc.) would be in their own concept group.

Terms that fit multiple concept groups can be added to both without changing the logic of the Boolean connections. For example, a term like “post-fire woodpecker ecology” would be added to both the woodpecker and fire concept groups by labeling its group “woodpecker, fire”. We recommend saving the search terms to a .csv file, adding a new column called “group”, and entering the group names in it, then reading in the .csv file. Although this can be done in R, adding tags to 300+ suggested search terms is generally quicker in a .csv file. Example code for this is commented out below as it cannot be run without the .csv file.


# write.csv(searchterms, "./search_terms.csv")
# manually group terms in the csv
# grouped_terms <- read.csv("./search_terms_grouped.csv")
# extract the woodpecker terms from the csv
# woodpecker_terms <- grouped_terms$term[grep("woodpecker", grouped_terms$group)]
# join together a list of manually generated woodpecker terms with the ones from the csv
# woodpeckers <- unique(append(c("woodpecker")), woodpecker_terms)
# repeat this for all concept groups
# then merge them into a list, using the code below as an example
# mysearchterms <- list(woodpeckers, fire)

5. Write Boolean searches

Once keywords are grouped into concept groups in a list, the write_search function can be used to write Boolean searches in multiple languages, ready for export and use in chosen databases. The example below demonstrates writing a search in English using the search terms.

# Note: these search terms are a shortened example of a full search for illustration purposes only
mysearchterms <-
  list(
    c(
      "picoides arcticus",
      "black-backed woodpecker",
      "cavity-nesting birds",
      "picoides tridactylus",
      "three-toed woodpecker"),
    c(
      "wildfire",
      "burned forest",
      "post-fire",
      "postfire salvage logging",
      "fire severity",
      "recently burned"
    )
  )

my_search <-
  litsearchr::write_search(
    groupdata = mysearchterms,
    languages = "English",
    stemming = TRUE,
    closure = "none",
    exactphrase = TRUE,
    writesearch = FALSE,
    verbose = TRUE
  )
#> [1] "English is written"

# when writing to a plain text file, the extra \ are required to render the * and " properly
# if copying straight from the console, simply find and replace them in a text editor
my_search
#> [1] "((\"picoid* arcticus*\" OR \"black-back* woodpeck*\" OR \"cavity-nest* bird*\" OR \"picoid* tridactylus*\" OR \"three-to* woodpeck*\") AND (wildfir* OR \"burn* forest*\" OR post-fir* OR \"postfir* salvag* logging\" OR \"fire* sever*\" OR \"recent* burn*\"))"

6. Check search strategy precision and recall

Save the titles of the articles you want to retrieve as a character vector called gold_standard. You may want to do this by reading in a .csv file or you can paste/type them out. In this example, we are only using three articles to test our search strategy, though normally you will want to use a longer list of articles identified through expert opinion and/or consulting the list of studies included in previous reviews related to the topic. We will write a simple search of these titles to run in the bibliographic databases we plan to use to check if they are indeed indexed and should be retrieved by our search terms. Any that are not indexed can be ignored for the following step.

gold_standard <-
  c(
    "Black-backed woodpecker occupancy in burned and beetle killed forests: disturbance agent matters",
    "Nest site selection and nest survival of Black-backed Woodpeckers after wildfire",
    "Cross scale occupancy dynamics of a postfire specialist in response to variation across a fire regime"
  )

title_search <- litsearchr::write_title_search(titles=gold_standard)

We then read in our full search results and compare them to our gold standard to determine which gold standard articles we retrieved. Note: in this case I am using the naive search results from earlier because this is just a demonstration and this is not a real systematic review, so I did not run the full searches. You will want to do this with your actual full search results.

results_directory <- system.file("extdata", package="litsearchr")

retrieved_articles <-
  litsearchr::import_results(directory = results_directory, verbose = TRUE)
#> Reading file /tmp/RtmpGp7mOt/temp_libpath605c6719a3ad/litsearchr/extdata/scopus.ris ... done
#> Reading file /tmp/RtmpGp7mOt/temp_libpath605c6719a3ad/litsearchr/extdata/zoorec.txt ... done

retrieved_articles <- litsearchr::remove_duplicates(retrieved_articles, field="title", method="string_osa")

articles_found <- litsearchr::check_recall(true_hits = gold_standard,
                                           retrieved = retrieved_articles$title)

articles_found
#>      Title                                                                                                  
#> [1,] "Black-backed woodpecker occupancy in burned and beetle killed forests: disturbance agent matters"     
#> [2,] "Nest site selection and nest survival of Black-backed Woodpeckers after wildfire"                     
#> [3,] "Cross scale occupancy dynamics of a postfire specialist in response to variation across a fire regime"
#>      Best_Match                                                                                        
#> [1,] "Black-backed woodpecker occupancy in burned and beetle-killed forests: Disturbance agent matters"
#> [2,] "Nest site selection and nest survival of Black-backed Woodpeckers after wildfire"                
#> [3,] "The ecological importance of severe wildfires: Some like it hot"                                 
#>      Similarity         
#> [1,] "0.588235294117647"
#> [2,] "1"                
#> [3,] "0.134969325153374"

The check indicates that all three of our gold standard articles were included in our search results, so we would go ahead with our final search and use it for our systematic review.