Q: Using a regular expression in R to replace the nth occurrence of a space in a string

Stack Overflow user

Asked 2022-10-20 12:07:45

Answers: 1 · Views: 42 · Followers: 0 · Votes: 1

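I'm a regex beginner working in R with data extracted from a PDF. Unfortunately, R did not capture the decimal points in the data, so I need to replace specific spaces with periods. I did find a solution to this problem, but I suspect there is a more efficient way to do it with a regular expression. Below are some sample data, the solution I'm currently using, and the resulting data set I want to end up with.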

ex<-c("16 7978 38 78 651 42 651 42","25 1967 8 94 225 26 225 26",
      "16 5000 8 00 132 00 132 00", "16 6125 2 00 33 23 33 23")

df<-data.frame(row=1:4,string=ex)
df

row                   string
 1  16 7978 38 78 651 42 651 42
 2  25 1967 8 94 225 26 225 26
 3  16 5000 8 00 132 00 132 00
 4  16 6125 2 00 33 23 33 23

library(stringi)  # stri_replace_first_fixed()
library(dplyr)    # %>% and select()

# rate: turn the first space into a period, then take the first token
df$tst<-stri_replace_first_fixed(df$string," ",".")
df$rate<-sapply(strsplit(df$tst," ",fixed=T),"[",1)
# remove the extracted number, then repeat for the next value
df$string2<-stri_replace_first_fixed(df$tst,df$rate,"")%>% trimws()
df$tst<-stri_replace_first_fixed(df$string2," ",".")
df$hours<-sapply(strsplit(df$tst," ",fixed=T),"[",1)
df$string3<-stri_replace_first_fixed(df$tst,df$hours,"")%>% trimws()
df$tst<-stri_replace_first_fixed(df$string3," ",".")
df$period_amt<-sapply(strsplit(df$tst," ",fixed=T),"[",1)
df$string4<-stri_replace_first_fixed(df$tst,df$period_amt,"")%>% trimws()
# last value: use string4 (what is left after removing period_amt)
df$tst<-stri_replace_first_fixed(df$string4," ",".")
df$ytd_amt<-sapply(strsplit(df$tst," ",fixed=T),"[",1)


df<-df %>% dplyr::select(-string2,-string3,-tst,-string4)
df
 
  row                     string   rate  hours  period_amt ytd_amt
   1 16 7978 38 78 651 42 651 42 16.7978 38.78     651.42  651.42
   2  25 1967 8 94 225 26 225 26 25.1967  8.94     225.26  225.26
   3  16 5000 8 00 132 00 132 00 16.5000  8.00     132.00  132.00
   4    16 6125 2 00 33 23 33 23 16.6125  2.00      33.23   33.23

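In the solution above, I replace the first occurrence of a space with a period, extract the corrected number, and then remove that number from the string. I repeat this process until every value has been extracted. As I said, this works, but it looks rather sloppy to me, and it would get tedious if I needed to correct and extract a large number of values from the text. Any suggestions for a better way to do this in R would be greatly appreciated.

Thanks in advance!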

regex

text

replace

data-processing

1 Answer

Stack Overflow user

Accepted answer

Answered 2022-10-20 12:27:01

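Here is one way. As @Peter noted, it may be better to address the cause of the missing decimal points, which perhaps has to do with the PDF source.

Also, whether this works will depend on the source number format, e.g. whether the numbers have leading zeros, minus signs, etc.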
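One way to guard against that format dependence, as a minimal sketch rather than part of the original answer, is to check that every string splits into the expected number of space-separated tokens before pairing them up. The second string and the name `expected_tokens` are hypothetical and only there for illustration.

```r
# First string is from the question; the second is a hypothetical malformed
# row with one extra token (for example, a space used as a thousands separator).
x <- c("16 7978 38 78 651 42 651 42",
       "16 5000 8 00 1 132 00 132 00")

# Assumption: 4 values per row, each split into two tokens by the lost decimal point.
expected_tokens <- 8

# Count the space-separated tokens in each string and flag unexpected rows.
n_tokens <- lengths(strsplit(x, " ", fixed = TRUE))
ok <- n_tokens == expected_tokens
ok
#> [1]  TRUE FALSE
```

Rows where `ok` is `FALSE` can then be inspected by hand instead of being silently mis-paired.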

ex<-c("16 7978 38 78 651 42 651 42","25 1967 8 94 225 26 225 26",
      "16 5000 8 00 132 00 132 00", "16 6125 2 00 33 23 33 23")

df<-data.frame(row=1:4,string=ex)

n <- do.call(rbind, strsplit(gsub("(\\d+) (\\d+)","\\1.\\2",df$string ), " " ))
n <- apply(n, 2, as.numeric)
colnames(n) <- c("rate",  "hours",  "period_amt", "ytd_amt")
df<-cbind(df, n)
df
#>   row                      string    rate hours period_amt ytd_amt
#> 1   1 16 7978 38 78 651 42 651 42 16.7978 38.78     651.42  651.42
#> 2   2  25 1967 8 94 225 26 225 26 25.1967  8.94     225.26  225.26
#> 3   3  16 5000 8 00 132 00 132 00 16.5000  8.00     132.00  132.00
#> 4   4    16 6125 2 00 33 23 33 23 16.6125  2.00      33.23   33.23
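To make the regex step easier to follow, here is a small sketch that runs it on a single string from the question, using base R only. The variable names `s`, `paired`, and `vals` are introduced here for illustration.

```r
s <- "16 7978 38 78 651 42 651 42"

# Step 1: join each "digits, space, digits" pair with a period.
# gsub() resumes scanning after the end of each match, so the pairs do not
# overlap and the space separating two numbers is left alone.
paired <- gsub("(\\d+) (\\d+)", "\\1.\\2", s)
paired
#> [1] "16.7978 38.78 651.42 651.42"

# Step 2: split on the remaining spaces and convert the pieces to numeric.
vals <- as.numeric(strsplit(paired, " ", fixed = TRUE)[[1]])
```

Because the replacement is applied to every non-overlapping pair in one pass, there is no need to target the 1st, 3rd, 5th, ... space individually, provided each value really consists of exactly two digit groups.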