nc
Maintainer: Toby Dylan Hocking (toby.hocking@r-project.org)
User-friendly functions for extracting a data table (row for each match, column for each group) from non-tabular text data using regular expressions, and for melting columns that match a regular expression. Patterns are defined using a readable syntax that makes it easy to build complex patterns in terms of simpler, re-usable sub-patterns. Named R arguments are translated to column names in the output, thereby providing a standard interface to three regular expression ‘C’ libraries (‘PCRE’, ‘RE2’, ‘ICU’). Output can also include numeric columns via user-specified type conversion functions.
Relationship with data.table
Whereas data.table provides several functions, such as patterns() and measure(), which support some regex engines (PCRE, TRE), nc interfaces with two other engines (RE2, ICU). nc imports data.table, and always returns regex match results as a data.table.
Overview
nc is useful for extracting numeric data from text. For example, consider the following strings, which indicate genomic positions, in bases on a chromosome:
chr.pos.vec <- c(
  "chr10:213,054,000-213,055,000",
  "chrM:111,000", # no end.
  "chr1:110-111 chr2:220-222") # two ranges.
The data above consist of a chromosome name (chr10), followed by a start position, and then optionally a dash and an end position. Using nc, we can extract these different pieces of information into a data table using the code below, which inputs the data to parse (first argument), along with a regular expression (subsequent arguments).
nc::capture_first_vec(
  chr.pos.vec,
  chrom="chr.*?",
  ":",
  start="[0-9,]+")
    chrom       start
   <char>      <char>
1:  chr10 213,054,000
2:   chrM     111,000
3:   chr1         110
The code above uses chrom and start as argument names, which are therefore used for column names in the output data table (one row per input subject string, one column per named argument / capture group). However, the code above only parses the start position (and not the optional end position). Below, we create a more complex regex to parse both the start and end, by first defining a common pattern to parse an integer:
keep.digits <- function(x) as.integer(gsub("[^0-9]", "", x))
int.pattern <- list("[0-9,]+", keep.digits)
In the code above, we use a list to group the regex "[0-9,]+" with the function keep.digits, which will be used to convert the text that is extracted by that regex. We use that pattern twice in the code below:
range.pattern <- list(
  chrom="chr.*?",
  ":",
  start=int.pattern,
  list( # un-named list becomes non-capturing group.
    "-",
    end=int.pattern
  ), "?") # end is optional.
nc::capture_first_vec(chr.pos.vec, range.pattern)
    chrom     start       end
   <char>     <int>     <int>
1:  chr10 213054000 213055000
2:   chrM    111000        NA
3:   chr1       110       111
The result above is a data table containing the first match in each subject (three rows total). Note the second row has end=NA because that optional group did not match.
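For comparison, here is a rough base R sketch of the same first-match extraction, without nc. The single regex string below is a hand-written approximation of what range.pattern expresses (named arguments as capture groups, the un-named list as a non-capturing group); it is an assumption for illustration, not the pattern that nc actually generates. The name manual.df is likewise made up here.

```r
# Same subjects as in the text above.
chr.pos.vec <- c(
  "chr10:213,054,000-213,055,000",
  "chrM:111,000",
  "chr1:110-111 chr2:220-222")
# Same conversion function as in the text above.
keep.digits <- function(x) as.integer(gsub("[^0-9]", "", x))
# Hand-written regex (assumption): three capture groups, the last one
# inside an optional non-capturing group.
manual.regex <- "(chr.*?):([0-9,]+)(?:-([0-9,]+))?"
# regexec() finds the first match in each subject; regmatches() returns
# the full match followed by the text of each capture group.
match.list <- regmatches(
  chr.pos.vec,
  regexec(manual.regex, chr.pos.vec, perl=TRUE))
match.mat <- do.call(rbind, match.list)
# Groups are identified by position, so names and type conversion
# have to be attached by hand.
manual.df <- data.frame(
  chrom=match.mat[, 2],
  start=keep.digits(match.mat[, 3]),
  end=keep.digits(match.mat[, 4]),
  stringsAsFactors=FALSE)
manual.df
```

In this sketch the column names and conversion functions are kept separate from the pattern and matched up by group position, which is the bookkeeping that the named-argument syntax in nc avoids.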
But the last subject has two potential matches (only the first is reported above). What if we wanted to get all matches in each subject? We can use another function, as in the code below.
nc::capture_all_str(chr.pos.vec, range.pattern)
    chrom     start       end
   <char>     <int>     <int>
1:  chr10 213054000 213055000
2:   chrM    111000        NA
3:   chr1       110       111
4:   chr2       220       222
The output above includes all matches in each subject (four rows total), but does not include any information about which subject each row came from, because it treats the subject as a single string to parse. To get that info, we can use capture_all_str() for each row, using by=.I as in the code below.
library(data.table)
data.table(chr.pos.vec)[, nc::capture_all_str(
  chr.pos.vec, range.pattern),
  by=.I]
       I  chrom     start       end
   <int> <char>     <int>     <int>
1:     1  chr10 213054000 213055000
2:     2   chrM    111000        NA
3:     3   chr1       110       111
4:     3   chr2       220       222
The output above includes the additional I column, which is the index of the subject that each match came from (two rows with I=3 because there are two matches in the third subject).
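To see where the repeated index comes from, the per-subject match counts can be checked with a short base R sketch. The regex string here is a hand-written stand-in for range.pattern (an assumption for illustration, without capture groups), and the variable names are made up for this example.

```r
# Same subjects as in the text above.
chr.pos.vec <- c(
  "chr10:213,054,000-213,055,000",
  "chrM:111,000",
  "chr1:110-111 chr2:220-222")
# Hand-written regex (assumption) for one range, end part optional.
range.regex <- "chr.*?:[0-9,]+(?:-[0-9,]+)?"
# gregexpr() finds every match in each subject; regmatches() returns
# one character vector of matched text per subject.
match.list <- regmatches(
  chr.pos.vec,
  gregexpr(range.regex, chr.pos.vec, perl=TRUE))
# Number of matches found in each subject.
n.matches <- lengths(match.list)
# Repeat each subject index once per match, analogous to the I column.
subject.index <- rep(seq_along(chr.pos.vec), n.matches)
data.frame(
  I=subject.index,
  match=unlist(match.list),
  stringsAsFactors=FALSE)
```

The third subject contributes two matches, so its index appears twice, mirroring the two rows with I=3 in the output above.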
Finally, data.table::melt() is used to power the wide-to-long data reshaping functionality in nc. In data.table, we could use measure() to specify a set of variables to reshape, as in the code below.
(iris.wide <- data.table(iris)[1])
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <num>       <num>        <num>       <num>  <fctr>
1:          5.1         3.5          1.4         0.2  setosa
melt(iris.wide, measure.vars=measure(value.name, dim, pattern="(.*)[.](.*)"))
   Species    dim Sepal Petal
    <fctr> <char> <num> <num>
1:  setosa Length   5.1   1.4
2:  setosa  Width   3.5   0.2
The result above has reshaped the four numeric input columns into two numeric output columns (value.name is the sentinel/keyword indicating that we want to make a new column for each unique value captured in that group). The equivalent nc code would be as below, with the regex defined using a named argument for each capture group (instead of one long pattern string with parentheses for each capture group).
nc::capture_melt_multiple(
  iris.wide,
  column=".*",
  "[.]",
  dim=".*")
   Species    dim Petal Sepal
    <fctr> <char> <num> <num>
1:  setosa Length   1.4   5.1
2:  setosa  Width   0.2   3.5
The nc code above produces the same result (apart from the order of the Sepal and Petal columns), and in fact uses data.table::melt() internally.
For more info about the nc package, please read the vignettes on its CRAN page.