String manipulation commands and the “grep” command for searching strings for patterns (keywords) and the “gsub” command for replacing strings are introduced. Regular expression extraction is also briefly introduced.
The grep command can be used to repeat the process by data element name, etc.
Rとウェブ解析:データフレームの項目名で処理を繰り返す
Here are some articles that may be useful for text replacement.
・Analysis in R: “stringr” package for easy text manipulation
https://www.karada-good.net/analyticsr/r-640
The command is confirmed with R version 4.1.3.
Commands such as string replacement
These commands relate to counting the number of characters with spaces, removing spaces, replacing strings, and truncating strings.
#String example, 16 characters including spaces #KARADANI EMONO Test <- " KARADANI EMONO " #Check the number of characters nchar(Test) [1] 16 #Exclude all single-byte spaces gsub(" ", "", Test) [1] "KARADANIEMONO" #Excluding leading whitespace gsub("^ ", "", " KARADANI EMONO ") [1] "KARADANI EMONO " #Excluding trailing whitespace gsub(" $", "", " KARADANI EMONO ") [1] " KARADANI EMONO" #Retrieve a portion from a string #Retrieve 4 to 11 characters substring(Test, 4, 11) [1] "RADANI E" #Replace the 5th and 6th characters in the string with XX substring(Test, 5, 6) <- "XX" #Before replacement [1] " KARADANI EMONO " #After replacement Test [1] " KARXXANI EMONO " #To cut a string that contains whitespace at a given length strtrim(Test, 9) [1] " KARADANI" #Divide by the specified content of letter A #Split by letter A #The specified character will disappear #The result is a LIST, so it will be made into a vector by unlisting unlist(strsplit(Test, "A")) [1] " K" "R" "D" "NI EMONO "
Commands such as string concatenation
Commands for concatenating strings, creating string vectors, digitizing strings, etc.
###String concatenation command:paste##### #Create an example string Text1 <- c("KARADA1", "KARADA2") Text2 <- " NI " Text3 <- "EMONO" #Each string is concatenated with whitespace at the end paste(Text1, Text2, Text3) [1] "KARADA1 NI EMONO" "KARADA2 NI EMONO" #Combine without spaces with sep option "" paste(Text1, Text2, Text3, sep = "") [1] "KARADA1 NI EMONO" "KARADA2 NI EMONO" #The collapse option concatenates strings with #the specified string to create a single string paste(Text1, Text2, Text3, sep = "", collapse = " ? ") [1] "KARADA1 NI EMONO ? KARADA2 NI EMONO" #Commands with the option "sep = "" only:paste0 #Merge without inserting spaces paste0(Text1, Text2, Text3) [1] "KARADA NI EMONO" ######## #Creating String Vectors TsetChar <- character(10) TsetChar [1] "" "" "" "" "" "" "" "" "" "" class(TsetChar) [1] "character" #Conversion to String class(10) [1] "numeric" class(as.character(10)) [1] "character" ###Convert numeric strings to numbers:type.convertコマンド##### Text1 <- "1" Text2 <- "2" #Error. Text1 + Text2 Error in Text1 + Text2 : non-numeric argument to binary operator #Using type.convert type.convert(Text1, as.is = TRUE) + type.convert(Text2, as.is = TRUE) [1] 3 ######## #Function to output alphabets #capital letter LETTERS[1:7] [1] "A" "B" "C" "D" "E" "F" "G" #lower case letters letters[10:16] [1] "j" "k" "l" "m" "n" "o" "p"
“grep” command to search and ”gsub”command to replace
Please check the comments for content.
#Title of KARADA NI IIMONO Test <- c("Analysis in R: Introduction to "pathological" packages, including copying and creating folders, and how to be package independent", "A guide to R: researchers, working people, and ladies. In the meantime, why don't we all use it?" , "Analysis in R: How to get the color code of an image! Introduction to the "EBImage" package", "Playing with R: Could it be used for color schemes in presentation materials? Character hair color?") ###grep outputs the position of strings containing keywords ##### ###ranking of titles containing introductions grep("introduction", Test) [1] 1 3 How to extract strings using ##grep #Extract titles containing guide in string Test[grep("guide", Test)] [1] "Guide to R: Researchers, working people, and ladies. Why don't we all use it anyway?" #Multiple keywords can also be set #Example is or"|" Test[grep("onee-san|hair", Test)] [1] "Guide to R: Researchers, working people, and ladies. In the meantime, why don't we all use it?" [2] "Playing around with R: Could it be used for color schemes for presentation materials? Character hair color?" #Regular expressions can also be used #Titles with the letter "R" at the beginning (^) Test[grep("^R's", Test)]] [1] "A guide to R: Researchers, working people, and ladies. In the meantime, why don't we all use it?" [######## ###gsub replaces the string corresponding to the target keyword with the specified content ##### ##gsub(target keyword, specified content, string) Replace ##onee-san with onee-san gsub("onee-san", "oneesan", Test) [2] "A guide to R: Researchers, working people, and oneesan. Why don't we all use it anyway?" Replace all "of" with "is" in #. gsub("of", "is", Test) [1] "Analyze in R: How folders can be copied, created, etc. "pathological" package does not depend on the introduction and package." [2] "R is the guide: researchers, working people, and ladies. In the meantime, why don't we all use it?" [3] "Analyze in R: How to get images color coded! Introduced by the "EBImage" package." [4] "Fun with R: Could presentation materials be used for color schemes? Character's hair is the color?" #Replace "of" with "is" in the first occurrence sub("of", "is", Test) [1] "Analysis in R: Introduction to "pathological" packages such as folders copy and create, and how to be package independent." [2] "R is the guide: researchers, working people, and ladies. In the meantime, why don't we all use it?" [3] "Analyze with R: How to get images color coded! Introducing the "EBImage" package." [4] "Playing around in R: Could your presentation material be used for color schemes? Character hair color?"
Extract strings with regular expressions
Here is an example of string extraction using regular expressions.
Test <- c("KKKKRRADAAA GOODDDDDD", "kkkrraaadaaaaGoood", "kkkrraaadaaaaGoodDDDDD", "kkkrraaadaaaaGood", "good for your health") Extract repeats of the character immediately preceding #+ at least once Test[grep("kara+", Test)]] [1] "Kakarada dada dada dada dada dada dada dada dada dada dada dada dada dada dada dada dada dada dada dada dada dada dada dada dada Extract strings that do not contain # characters and [^] strings Test[grep("da[^ ni]i", Test)] [1] "Kakara rara da da da da da da da da da da da da da da da da da da da da da da da da da da da da da da da da da da da" #Extract strings containing certain characters Test[grep("A| ni", Test)] [1] "KKKRRADAAA GOODDDDDD" "good for you" #Extract specific pattern Test[grep("et al. to", Test)] [1] "good for you" Combine #+ to extract specific pattern Test[grep("A.+D", Test)]] [1] "KKKRRADAAA GOODDDDDD" ###### From here, the character data to be used: ######## Test <- c("karada good stuff", "karada good stuff", karada_good_thing", "karada_good_thing", "karada_good_thing", "karada_good_thing", 1 good thing for your body", "1223 good things for your body", "12234535 good for your body", "13234535 good for your body", "13234335 good for your body", "0123-45-6789", "012-3456-7895", "0123456789") ######## #Match strings containing whitespace Test[grep("\\s", Test)] [1] "karada good stuff" "karada good stuff" #Match 3 consecutive numbers before number 3 Test[grep("\\d{3}?3", Test)] [1] "1223 good things for your body" "12234535 good things for your body" "13234535 good things for your body" "13234335 good things for your body" [5] "0123-45-6789" "0123456789" #Match 3 or more occurrences of the number 3 #Test[grep("(3.*){3}", Test)] Test[grep("(3.*){3,}", Test)] [1] "13234535 good for the body" "13234335 good for the body" #Match two consecutive occurrences of the number 3 Test[grep("3{2}", Test)] [1] "13234335 good for the body" #Match hyphenated numbers Test[grep("0\\d{1,4}-\d{1,4}-\d{4}", Test)] [1] "0123-45-6789" "012-3456-7895" #Strings containing "good" are matched Test[grep("\\wいい", Test)] [1] "karada い もの" "karada_いいもの" "からだにいいもの" "からだに1いいもの" [5] "12234535 good for body" "13234535 good for body" "13234335 good for body" #Matches strings that do not contain "good Test[grep("(*good)", Test, invert = TRUE)] [1] "1223 good things for your body" "0123-45-6789" "012-3456-7895" "0123456789" #Escape special characters (. etc.) are escaped:\\fg grep("i\. i", c("ii", "i. i", "i. a", "iii", "i9i", "i.9i")) [1] 2 #Match any single character:. grep("i. i", c("ii", "i. i", "i. a", "iii", "i9i", "i9i")) [1] 2 4 #Match characters starting with a certain character:^,. ,*,$ combination grep("^Ii. *$", c("ii", "い. i", "i. a", "iii", "i9i", "i9i")) [1] 4 #Match characters ending with a certain character:^,. ,*,$ combination grep("^. *i$", c("ii", "i. i", "i. a", "iii", "i9i", "i9i")) [1] 2 4 #Match alphanumeric characters:[A-Za-z0-9]. grep("[A-Za-z0-9]", c("ii", "い. i", "i. a", "iii", "i9i", "i9i")) [1] 1 3 5 6 #Match numbers:[0-9]. grep("[0-9]", c("ii", "い. i", "i. a", "iii", "i9i", "i9i")) [1] 6 #Match for whitespace:[[:space:]]] grep("[[:space:]]", c("ii", "i. i", "i. a", "iii", "i9i", "i9i")) [1] 3 #Match a string containing a certain extension #Match .doxc in word in the example grep("^. *doxc*$", c("ii", "i. i", "i.doxc", "iii", "i9i", "i9i")) [1] 3 #Match strings containing a minority point grep("[+-]? \\d*\d. \\d", c("45454", "-8.0", "8.15452", "7.5", ".23", "i9i")) [1] 2 3 4 5 #Matches strings containing "https/ftp" grep("(https?|ftp)://([^:/]+)", c("45454", "-8.0", "https://www.karada-good.net/analyticsr/r-648/", "7.5", ".23", "https://www.karada-good.net/")) [1] 3 6
Useful Articles
I hope this makes your analysis a little easier !!