Analysis in R: Introduction to String Manipulation, the “grep” and “gsub” Commands, and Regular Expressions


This article introduces basic string manipulation commands, the “grep” command for searching strings for patterns (keywords), and the “gsub” command for replacing strings. Extraction with regular expressions is also covered briefly.

The grep command can also be used to repeat a process for data frame columns selected by name. See:
R and Web Analytics: Repeating a process by data frame column name
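
As a rough sketch of that idea, grep can pick out the columns whose names match a pattern, and the same process can then be applied to each of them. The data frame `df` and its column names are made up for this illustration and do not come from the linked article.

```r
# Hypothetical data frame; the column names are assumptions for this sketch
df <- data.frame(sales_2021 = c(10, 20),
                 sales_2022 = c(30, 40),
                 memo = c("a", "b"))

# Positions of the columns whose names start with "sales_"
target <- grep("^sales_", names(df))

# Repeat the same process (here, the column mean) for each matching column
col_means <- sapply(df[target], mean)
col_means
```

Selecting by pattern rather than by position keeps the loop working when columns are added or reordered.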

The following article may also be useful for text replacement.
・Analysis in R: “stringr” package for easy text manipulation
https://www.karada-good.net/analyticsr/r-640

The commands were confirmed with R version 4.1.3.


String replacement and related commands

These commands count the number of characters (including spaces), remove spaces, replace parts of a string, and truncate or split strings.

#String example, 16 characters including spaces
#KARADANI EMONO 
Test <- " KARADANI EMONO "

#Check the number of characters
nchar(Test)
[1] 16

#Remove all single-byte spaces
gsub(" ", "", Test)
[1] "KARADANIEMONO"

#Remove the leading space
gsub("^ ", "", " KARADANI EMONO ")
[1] "KARADANI EMONO "

#Remove the trailing space
gsub(" $", "", " KARADANI EMONO ")
[1] " KARADANI EMONO"

#Extract part of a string
#Extract characters 4 through 11
substring(Test, 4, 11)
[1] "RADANI E"

#Replace the 5th and 6th characters in the string with XX
#A copy is used so that Test stays unchanged for the examples that follow
Test2 <- Test
substring(Test2, 5, 6) <- "XX"
#Before replacement
Test
[1] " KARADANI EMONO "
#After replacement
Test2
[1] " KARXXANI EMONO "

#Truncate a string to a given width (spaces count toward the width)
strtrim(Test, 9)
[1] " KARADANI"

#Split the string at the letter A
#The delimiter itself is removed from the result
#strsplit returns a list, so unlist is used to convert it to a vector
unlist(strsplit(Test, "A"))
[1] " K"        "R"         "D"         "NI EMONO "
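
When the goal is only to trim whitespace, base R's trimws is an alternative to writing a gsub pattern. The sketch below uses the same kind of string as above; the variable names are chosen only for illustration.

```r
# Same kind of string as in the example above; x is a name chosen for this sketch
x <- " KARADANI EMONO "

both <- trimws(x)                    # "KARADANI EMONO"
left <- trimws(x, which = "left")    # "KARADANI EMONO "
right <- trimws(x, which = "right")  # " KARADANI EMONO"

# A gsub equivalent that handles both ends and runs of whitespace
both_gsub <- gsub("^\\s+|\\s+$", "", x)

nchar(c(both, left, right))
```

Unlike the "^ " and " $" patterns, which remove a single space, both trimws and the "\\s+" pattern remove runs of whitespace.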

String concatenation and related commands

Commands for concatenating strings, creating character vectors, converting strings to numbers, and so on.

###String concatenation command:paste#####
#Create an example string
Text1 <- c("KARADA1", "KARADA2")
Text2 <- " NI "
Text3 <- "EMONO"

#By default, the strings are joined with a single space between them
paste(Text1, Text2, Text3)
[1] "KARADA1  NI  EMONO" "KARADA2  NI  EMONO"

#Join without a separator by setting sep = ""
paste(Text1, Text2, Text3, sep = "")
[1] "KARADA1 NI EMONO" "KARADA2 NI EMONO"

#The collapse option joins the resulting strings with
#the specified string to create a single string
paste(Text1, Text2, Text3, sep = "", collapse = " ? ")
[1] "KARADA1 NI EMONO ? KARADA2 NI EMONO"

#paste0 is paste with sep = "" built in
#Join without inserting spaces
paste0(Text1, Text2, Text3)
[1] "KARADA1 NI EMONO" "KARADA2 NI EMONO"
########
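
Text1 has two elements while Text2 and Text3 have one each, so paste recycles the shorter arguments. That is why both results above received the same suffix. A small sketch of recycling, and of collapse applied to a single vector (the names `ids`, `labels`, and `joined` are illustrative):

```r
ids <- c("A", "B", "C")

# The length-1 prefix is recycled for each element of ids
labels <- paste0("item_", ids)

# collapse joins the elements of one vector into a single string
joined <- paste(ids, collapse = ", ")

labels
joined
```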

#Create a character vector of 10 empty strings
TestChar <- character(10)
TestChar
[1] "" "" "" "" "" "" "" "" "" ""
class(TestChar)
[1] "character"

#Conversion to String
class(10)
[1] "numeric"
class(as.character(10))
[1] "character"

###Convert numeric strings to numbers:type.convert command#####
Text1 <- "1"
Text2 <- "2"

#Adding character strings directly causes an error
Text1 + Text2
Error in Text1 + Text2 : non-numeric argument to binary operator

#Using type.convert
type.convert(Text1, as.is = TRUE) + type.convert(Text2, as.is = TRUE)
[1] 3
########
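
When the target type is known in advance, as.numeric (or as.integer) is a common alternative to type.convert. The following is a minimal sketch with illustrative variable names; strings that cannot be read as numbers become NA.

```r
num_text <- c("1", "2", "3.5")

# Convert to numeric and add up
total <- sum(as.numeric(num_text))

# A non-numeric string becomes NA; R also emits a warning,
# which is suppressed here to keep the output clean
bad <- suppressWarnings(as.numeric("abc"))

total
is.na(bad)
```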

#Built-in constants for the letters of the alphabet
#Uppercase letters
LETTERS[1:7]
[1] "A" "B" "C" "D" "E" "F" "G"
#Lowercase letters
letters[10:16]
[1] "j" "k" "l" "m" "n" "o" "p"

“grep” command to search and “gsub” command to replace

See the comments in the code for details.

#Article titles from KARADA NI IIMONO
Test <- c("Analysis in R: Introduction to 'pathological' packages, including copying and creating folders, and how to be package independent",
          "A guide to R: researchers, working people, and ladies. In the meantime, why don't we all use it?",
          "Analysis in R: How to get the color code of an image! Introduction to the 'EBImage' package",
          "Playing with R: Could it be used for color schemes in presentation materials? Character hair color?")

###grep returns the positions of the strings that contain the keyword#####
#Positions of the titles containing "Introduction" (matching is case-sensitive)
grep("Introduction", Test)
[1] 1 3

##Extract strings using grep
#Extract titles containing "guide"
Test[grep("guide", Test)]
[1] "A guide to R: researchers, working people, and ladies. In the meantime, why don't we all use it?"

#Multiple keywords can also be set
#This example uses "|" (or)
Test[grep("ladies|hair", Test)]
[1] "A guide to R: researchers, working people, and ladies. In the meantime, why don't we all use it?"
[2] "Playing with R: Could it be used for color schemes in presentation materials? Character hair color?"

#Regular expressions can also be used
#Titles that begin (^) with "A guide"
Test[grep("^A guide", Test)]
[1] "A guide to R: researchers, working people, and ladies. In the meantime, why don't we all use it?"
########
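
A few grep options cover the same tasks more directly. The sketch below uses a small made-up vector `titles` (not the article titles above) to show value = TRUE, ignore.case = TRUE, and the related function grepl.

```r
# Made-up titles for this sketch
titles <- c("Introduction to R", "A guide to R", "introduction to regex")

# value = TRUE returns the matching strings instead of their positions
guides <- grep("guide", titles, value = TRUE)

# ignore.case = TRUE matches "Introduction" and "introduction" alike
pos <- grep("introduction", titles, ignore.case = TRUE)

# grepl returns a logical vector, handy for subsetting or counting
hit <- grepl("^A", titles)

guides
pos
sum(hit)
```

With value = TRUE, the `Test[grep(...)]` idiom can be written as a single call.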

###gsub replaces every part of a string that matches the target keyword with the specified content#####
##gsub(target keyword, specified content, string)
#Replace "ladies" with "women" (only the second title changes, so only it is shown)
gsub("ladies", "women", Test)[2]
[1] "A guide to R: researchers, working people, and women. In the meantime, why don't we all use it?"

#Replace every "and" with "&"
gsub("and", "&", Test)
[1] "Analysis in R: Introduction to 'pathological' packages, including copying & creating folders, & how to be package independent"
[2] "A guide to R: researchers, working people, & ladies. In the meantime, why don't we all use it?"
[3] "Analysis in R: How to get the color code of an image! Introduction to the 'EBImage' package"
[4] "Playing with R: Could it be used for color schemes in presentation materials? Character hair color?"

#sub replaces only the first "and" in each string
sub("and", "&", Test)
[1] "Analysis in R: Introduction to 'pathological' packages, including copying & creating folders, and how to be package independent"
[2] "A guide to R: researchers, working people, & ladies. In the meantime, why don't we all use it?"
[3] "Analysis in R: How to get the color code of an image! Introduction to the 'EBImage' package"
[4] "Playing with R: Could it be used for color schemes in presentation materials? Character hair color?"
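
Two further features of sub and gsub are often useful: fixed = TRUE treats the pattern as a literal string rather than a regular expression, and "\\1", "\\2" in the replacement refer to captured groups. A minimal sketch with made-up inputs:

```r
x <- c("a.b.c", "1.2.3")

# Without fixed = TRUE, "." is a regex wildcard, so every character is replaced
wild <- gsub(".", "-", x)

# With fixed = TRUE, only literal dots are replaced
lit <- gsub(".", "-", x, fixed = TRUE)

# \\1 and \\2 refer to the first and second captured groups
swapped <- sub("(\\w+) (\\w+)", "\\2 \\1", "EMONO KARADA")

wild
lit
swapped
```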

Extract strings with regular expressions

Here is an example of string extraction using regular expressions.

Test <- c("KKKKRRADAAA GOODDDDDD",
          "kkkrraaadaaaaGoood",
          "kkkrraaadaaaaGoodDDDDD", "kkkrraaadaaaaGood",
          "good for your health")

#"+" matches one or more repeats of the character immediately before it
Test[grep("ra+d", Test)]
[1] "kkkrraaadaaaaGoood"     "kkkrraaadaaaaGoodDDDDD" "kkkrraaadaaaaGood"

#[^ ] matches any single character that is not listed inside the brackets
#Here: "Goo" followed by a character other than "o"
Test[grep("Goo[^o]", Test)]
[1] "kkkrraaadaaaaGoodDDDDD" "kkkrraaadaaaaGood"

#Extract strings containing either of the given patterns
Test[grep("A|health", Test)]
[1] "KKKKRRADAAA GOODDDDDD" "good for your health"

#"." matches any single character
Test[grep("f.r", Test)]
[1] "good for your health"

#Combine "." and "+" to extract a specific pattern
Test[grep("A.+D", Test)]
[1] "KKKKRRADAAA GOODDDDDD"
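
grep returns whole elements that contain a match. To pull out only the matched part of each string, base R's regexpr and gregexpr can be combined with regmatches. The following is a sketch with a made-up vector `x`:

```r
x <- c("kkkrraaadaaaaGoood", "good for your health", "KARADA")

# regexpr finds the first match in each string;
# regmatches returns the matched text (strings without a match are dropped)
first <- regmatches(x, regexpr("o+", x))

# gregexpr finds every match; the result is a list with one element per string
all_a <- regmatches(x[1], gregexpr("a+", x[1]))[[1]]

first
all_a
```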

###### From here, the character data to be used: ########
Test <- c("karada good stuff", "karada_good_thing",
          "1 good thing for your body", "1223 good things for your body",
          "12234535 good for your body", "13234535 good for your body",
          "13234335 good for your body", "0123-45-6789",
          "012-3456-7895", "0123456789")
########

#Match strings containing whitespace
Test[grep("\\s", Test)]
[1] "karada good stuff"              "1 good thing for your body"
[3] "1223 good things for your body" "12234535 good for your body"
[5] "13234535 good for your body"    "13234335 good for your body"

#Match strings in which three consecutive digits are followed by a 3
Test[grep("\\d{3}3", Test)]
[1] "1223 good things for your body" "12234535 good for your body"
[3] "13234535 good for your body"    "13234335 good for your body"
[5] "0123-45-6789"                   "0123456789"

#Match strings in which the number 3 occurs 3 or more times
Test[grep("(3.*){3,}", Test)]
[1] "13234535 good for your body" "13234335 good for your body"

#Match strings containing two consecutive 3s
Test[grep("3{2}", Test)]
[1] "13234335 good for your body"

#Match hyphenated numbers
Test[grep("0\\d{1,4}-\\d{1,4}-\\d{4}", Test)]
[1] "0123-45-6789"  "012-3456-7895"

#Match strings in which "good" directly follows a word character (letter, digit, or underscore)
Test[grep("\\wgood", Test)]
[1] "karada_good_thing"

#Match strings that do not contain "good"
Test[grep("good", Test, invert = TRUE)]
[1] "0123-45-6789"  "012-3456-7895" "0123456789"
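
The patterns above match anywhere inside a string. Adding the anchors ^ and $ makes a pattern describe the whole string, which is useful for validation. The sketch below uses made-up phone-like strings, and the 10-digit rule is only an assumption for illustration.

```r
phones <- c("0123-45-6789", "012-3456-7895", "0123456789", "12-34")

# Remove hyphens first
digits <- gsub("-", "", phones)

# TRUE only when the whole string consists of exactly 10 digits
valid <- grepl("^[0-9]{10}$", digits)

digits
valid
```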

#Escape special characters such as "." with \\
grep("i\\.i", c("ii", "i.i", "i.a", "iii", "i9i", "i.9i"))
[1] 2

#Match any single character:.
grep("i.i", c("ii", "i.i", "i.a", "iii", "i9i", "i.9i"))
[1] 2 4 5

#Match strings starting with a certain character:combination of ^, ., *, $
grep("^i.*$", c("ii", "い.い", "i.a", "iii", "i9i", "i.9i"))
[1] 1 3 4 5 6

#Match strings ending with a certain character:combination of ^, ., *, $
grep("^.*i$", c("ii", "i.i", "i.a", "iii", "i9i", "i.9i"))
[1] 1 2 4 5 6

#Match alphanumeric characters:[A-Za-z0-9]
grep("[A-Za-z0-9]", c("ii", "い.い", "i.a", "iii", "i9i", "i.9i"))
[1] 1 3 4 5 6

#Match numbers:[0-9]
grep("[0-9]", c("ii", "い.い", "i.a", "iii", "i9i", "i.9i"))
[1] 5 6

#Match whitespace:[[:space:]]
grep("[[:space:]]", c("ii", "i.i", "i. a", "iii", "i9i", "i.9i"))
[1] 3

#Match strings with a certain file extension
#The example matches the Word extension .docx
grep("^.*\\.docx$", c("ii", "i.i", "i.docx", "iii", "i9i", "i.9i"))
[1] 3

#Match strings containing a decimal point
grep("[+-]?\\d*\\.\\d", c("45454", "-8.0", "8.15452", "7.5", ".23", "i9i"))
[1] 2 3 4 5

#Match strings containing an http, https, or ftp URL
grep("(https?|ftp)://([^:/]+)",
     c("45454", "-8.0",
       "https://www.karada-good.net/analyticsr/r-648/",
       "7.5", ".23", "https://www.karada-good.net/"))
[1] 3 6
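
The parentheses in this pattern are capture groups, so the same expression can also extract part of the URL. As a sketch, sub with "\\2" keeps only the second group (the host name). The surrounding "^" and ".*$" are additions made here so that the rest of the string is discarded.

```r
urls <- c("https://www.karada-good.net/analyticsr/r-648/",
          "https://www.karada-good.net/")

# Keep only the second captured group (the host); ".*$" drops the rest of the URL
hosts <- sub("^(https?|ftp)://([^:/]+).*$", "\\2", urls)

hosts
```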


I hope this makes your analysis a little easier!
