Extracting Two Words Before and After "Further" with Regex in R

Understanding the Problem

The task is to parse sentences that contain a specific word, in this case “further,” and extract up to two words before and up to two words after it from each sentence.

Background Information

We will first look at the required operations using regular expressions (regex). These patterns can be applied to strings to find occurrences of certain sequences of characters.

Understanding Regex Basics

Regex involves creating a pattern that describes what we are looking for in a string. Patterns are written using symbols and syntax specific to regex languages.

# Example of a simple pattern
hello world
^hello.*world$

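In this example, ^ anchors the match to the start of the string, . matches any single character except a newline, * repeats the preceding element zero or more times (so .* matches any run of characters), and $ anchors the match to the end of the string. The pattern therefore matches the string hello world.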
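As a quick check of how the anchors behave, the pattern can be tested against a few strings with str_detect from stringr. This is a minimal sketch; the input strings other than hello world are made up for illustration.

```r
library(stringr)

# Anchored pattern: must start with "hello" and end with "world"
pattern <- "^hello.*world$"

# Illustrative inputs (only the first comes from the example above)
inputs <- c("hello world", "hello there, world", "say hello world", "hello world!")

matches <- str_detect(inputs, pattern)
print(matches)
# [1]  TRUE  TRUE FALSE FALSE
```

The third string fails because it does not start with hello, and the fourth fails because the trailing ! comes after world.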

String Manipulation

The stringr package provides functions such as str_extract, str_remove, and str_squish for processing strings in R.

str_extract: Extracts the first substring that matches a pattern from each input string and returns NA when nothing matches. To get every match, use str_extract_all.
str_remove: Removes the first substring that matches a pattern from each input string. To remove every match, use str_remove_all.
str_squish: Trims whitespace from the start and end of a string and collapses repeated whitespace inside it into single spaces.
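The behavior of these functions can be checked on a small string. The sketch below uses made-up inputs and contrasts str_extract and str_remove, which act on the first match only, with their _all variants.

```r
library(stringr)

s <- "a fish, b fish"   # illustrative input

first_match   <- str_extract(s, "fish")          # "fish" (first match only)
all_matches   <- str_extract_all(s, "fish")[[1]] # c("fish", "fish")
removed_first <- str_remove(s, "fish")           # "a , b fish"
removed_all   <- str_remove_all(s, "fish")       # "a , b "

# str_squish trims the ends and collapses inner runs of whitespace
squished <- str_squish("  a   b  ")              # "a b"
```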

Solution Overview

To extract the two words before and after “further,” we can use these functions as shown below:

library(stringr)
library(dplyr)

x <- c("Then one morning Mills refused to mount refused to advance further",
       "further one morning Mills refused to mount refused to advance",
       "Then further morning Mills refused to mount refused to advance",
       "Then one morning Mills refused to mount refused further advance", 
       "Then one morning Mills further refused to mount refused to advance")

x %>% str_extract(regex('(?:[^ ]+ ){0,2}further(?: [^ ]+){0,2}', ignore_case = TRUE)) %>% 
    str_remove(regex("further", ignore_case = TRUE)) %>% 
    str_squish()

Breakdown of the Code

Let’s break down what each part does:

x <- c(...): Creates a vector x containing our input sentences.
library(stringr) and library(dplyr): Load stringr for string manipulation and dplyr, which is used here only for the %>% pipe. stringr re-exports %>% as well, so dplyr is optional for this snippet.
str_extract(regex('(?:[^ ]+ ){0,2}further(?: [^ ]+){0,2}', ignore_case = TRUE)): Extracts the first match of the pattern in each sentence: up to two words before “further,” the word itself, and up to two words after it, ignoring case.
- (?:[^ ]+ ){0,2}: A non-capturing group that matches a word (one or more non-space characters) followed by a space, repeated zero to two times.
- further matches the word itself, and (?: [^ ]+){0,2} matches a space followed by a word, again repeated zero to two times.
str_remove(regex("further", ignore_case = TRUE)): Removes the first occurrence of “further” from the extracted text, again ignoring case.
- "further" matches that literal sequence of letters. Because the pattern has no word boundaries, it would also match inside a longer word such as “furthermore.”
str_squish(): Trims leading and trailing whitespace and collapses the double space left behind by the removal into a single space.
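To see what each stage of the pipeline contributes, the steps can be run one at a time on a single sentence. This is a sketch using the fifth input sentence; the names sentence, pattern, and step1 to step3 are introduced here for illustration only.

```r
library(stringr)

sentence <- "Then one morning Mills further refused to mount refused to advance"
pattern  <- regex("(?:[^ ]+ ){0,2}further(?: [^ ]+){0,2}", ignore_case = TRUE)

# Step 1: keep "further" plus up to two words on each side
step1 <- str_extract(sentence, pattern)
# "morning Mills further refused to"

# Step 2: drop the keyword itself, which leaves a double space behind
step2 <- str_remove(step1, regex("further", ignore_case = TRUE))
# "morning Mills  refused to"

# Step 3: collapse the double space and trim the ends
step3 <- str_squish(step2)
# "morning Mills refused to"
```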

Example Output

Running this code on our input vector x should return:

[1] "to advance"               "one morning"              "Then morning Mills"      
[4] "mount refused advance"    "morning Mills refused to"

This matches the required output for each sentence: up to two words on each side of “further,” with fewer where “further” sits at or near the start or end of the sentence.

Handling Different Cases

The code is case-insensitive because ignore_case = TRUE is set in each regex() call. Whether ‘Further’, ‘FURTHER’, or ‘further’ appears in a string, it is matched the same way.

However, if you prefer case sensitivity (i.e., want “Further” and “further” to be treated differently), simply remove ignore_case = TRUE from your regex patterns.
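The difference between the two modes can be sketched on a sentence with a capitalized keyword. The sentence and the variable names below are made up for illustration.

```r
library(stringr)

s <- "They went Further than planned"   # illustrative input
pattern_text <- "(?:[^ ]+ ){0,2}further(?: [^ ]+){0,2}"

# Case-insensitive: "Further" is treated like "further"
insensitive <- str_extract(s, regex(pattern_text, ignore_case = TRUE))
# "They went Further than planned"

# Case-sensitive (the default): lowercase "further" never appears, so no match
sensitive <- str_extract(s, pattern_text)
# NA
```

When a sentence has no match, str_extract returns NA, and the later str_remove and str_squish steps pass that NA through unchanged.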

Last modified on 2023-06-25