Seal of Approval: nc – Blog (2024)

`nc`

Maintainer: Toby Dylan Hocking (toby.hocking@r-project.org)

User-friendly functions for extracting a data table (row for each match, column for each group) from non-tabular text data using regular expressions, and for melting columns that match a regular expression. Patterns are defined using a readable syntax that makes it easy to build complex patterns in terms of simpler, re-usable sub-patterns. Named R arguments are translated to column names in the output, thereby providing a standard interface to three regular expression ‘C’ libraries (‘PCRE’, ‘RE2’, ‘ICU’). Output can also include numeric columns via user-specified type conversion functions.

Relationship with `data.table`

Whereas data.table provides functions such as patterns() and measure(), which support the PCRE and TRE regex engines, nc also interfaces with two other engines (RE2 and ICU). nc imports data.table, and always returns regex match results as a data.table.

Overview

nc is useful for extracting numeric data from text. For example, consider the following strings, which indicate genomic positions, in bases on a chromosome:

chr.pos.vec <- c(
  "chr10:213,054,000-213,055,000",
  "chrM:111,000", # no end.
  "chr1:110-111 chr2:220-222") # two ranges.

The data above consist of a chromosome name (chr10), followed by a start position, and then optionally a dash and an end position. Using nc, we can extract these different pieces of information into a data table using the code below, which inputs the data to parse (first argument), along with a regular expression (subsequent arguments).

nc::capture_first_vec(
  chr.pos.vec,
  chrom="chr.*?",
  ":",
  start="[0-9,]+")

    chrom       start
   <char>      <char>
1:  chr10 213,054,000
2:   chrM     111,000
3:   chr1         110

The code above uses chrom and start as argument names, which are therefore used for column names in the output data table (one row per input subject string, one column per named argument / capture group). However, the code above only parses the start position (and not the optional end position). Below, we create a more complex regex to parse both the start and end, by first defining a common pattern to parse an integer:

keep.digits <- function(x) as.integer(gsub("[^0-9]", "", x))
int.pattern <- list("[0-9,]+", keep.digits)

In the code above, we use a list to group the regex "[0-9,]+" with the function keep.digits, which will be used for parsing the text that is extracted by that regex. We use that pattern twice in the code below,

range.pattern <- list(
  chrom="chr.*?",
  ":",
  start=int.pattern,
  list( # un-named list becomes non-capturing group.
    "-",
    end=int.pattern
  ), "?") # end is optional.
nc::capture_first_vec(chr.pos.vec, range.pattern)

    chrom     start       end
   <char>     <int>     <int>
1:  chr10 213054000 213055000
2:   chrM    111000        NA
3:   chr1       110       111

The result above is a data table containing the first match in each subject (three rows total). Note the second row has end=NA because that optional group did not match.
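The NA in the second row can be understood by running the conversion function on its own. The base R sketch below re-defines keep.digits so that it is self-contained, and applies it to a few illustrative strings. That nc passes an empty string to the conversion function when an optional group does not match is an assumption made for this illustration, not something stated in this post.

```r
# Same conversion function as in the post: keep only digits, then convert.
keep.digits <- function(x) as.integer(gsub("[^0-9]", "", x))

# Thousands separators are dropped before the integer conversion.
with.commas <- keep.digits(c("213,054,000", "111,000", "110"))
print(with.commas)

# A string with no digits at all (here the empty string, which is what we
# ASSUME an unmatched optional group looks like) converts to a missing value.
empty.result <- keep.digits("")
print(empty.result)
```

Grouping the regex and the conversion function in one list keeps both decisions (what text to capture, and how to convert it) in a single re-usable object.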

But the last subject has two potential matches (only the first is reported above). What if we wanted to get all matches in each subject? We can use another function, as in the code below.

nc::capture_all_str(chr.pos.vec, range.pattern)

    chrom     start       end
   <char>     <int>     <int>
1:  chr10 213054000 213055000
2:   chrM    111000        NA
3:   chr1       110       111
4:   chr2       220       222

The output above includes all matches in each subject (four rows total), but does not include any information about which subject each row came from, because it treats the subject as a single string to parse. To get that info, we can use capture_all_str() for each row, using by=.I as in the code below.

library(data.table)
data.table(chr.pos.vec)[, nc::capture_all_str(
  chr.pos.vec, range.pattern),
  by=.I]

       I  chrom     start       end
   <int> <char>     <int>     <int>
1:     1  chr10 213054000 213055000
2:     2   chrM    111000        NA
3:     3   chr1       110       111
4:     3   chr2       220       222

The output above includes the additional I column which is the index of the subject that each match came from (two rows with I=3 because there are two matches in the third subject).
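As a point of comparison, the bookkeeping of which subject each match came from can be sketched in base R alone. The snippet below is only an illustration, not a description of how nc works internally: the single pattern string is a hand-written stand-in for range.pattern (no named groups, no type conversion), and the names pattern, match.list and match.df are introduced here for the example.

```r
# Same subjects as in the post.
chr.pos.vec <- c(
  "chr10:213,054,000-213,055,000",
  "chrM:111,000",
  "chr1:110-111 chr2:220-222")

# Hand-written pattern (an assumption for this sketch): chromosome, colon,
# start, then an optional non-capturing group for the dash and end.
pattern <- "chr.*?:[0-9,]+(?:-[0-9,]+)?"

# gregexpr finds every match in each subject; regmatches extracts the text,
# returning one character vector per subject.
match.list <- regmatches(chr.pos.vec, gregexpr(pattern, chr.pos.vec, perl=TRUE))

# Repeat each subject index once per match found in that subject.
match.df <- data.frame(
  I=rep(seq_along(match.list), lengths(match.list)),
  match=unlist(match.list),
  stringsAsFactors=FALSE)
print(match.df)
```

Each match keeps the index of its subject, but splitting it into chrom, start and end columns with integer conversion would take further code, which is the part that the nc call with by=.I handles in one step.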

Finally, data.table::melt() is used to power the wide-to-long data reshaping functionality in nc. In data.table we could use measure() to specify a set of variables to reshape, as in the code below.

(iris.wide <- data.table(iris)[1])

   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <num>       <num>        <num>       <num>  <fctr>
1:          5.1         3.5          1.4         0.2  setosa

melt(iris.wide, measure.vars=measure(value.name, dim, pattern="(.*)[.](.*)"))

   Species    dim Sepal Petal
    <fctr> <char> <num> <num>
1:  setosa Length   5.1   1.4
2:  setosa  Width   3.5   0.2

The result above has reshaped the four numeric input columns into two numeric output columns (value.name is the sentinel/keyword indicating that we want to make a new column for each unique value captured in that group). The equivalent nc code would be as below, with the regex defined using a named argument for each capture group (instead of one long pattern string with parentheses for each capture group).

nc::capture_melt_multiple(
  iris.wide,
  column=".*",
  "[.]",
  dim=".*")

   Species    dim Petal Sepal
    <fctr> <char> <num> <num>
1:  setosa Length   1.4   5.1
2:  setosa  Width   0.2   3.5

The nc code above produces the same result (apart from the order of the Petal and Sepal columns), and in fact uses data.table::melt() internally.

For more info about the nc package, please read the vignettes on its CRAN page.
