The fields of linguistics and computer science (CS) are somewhat joined at the hip. Noam Chomsky, dubbed the father of modern linguistics, coined the term and concept of the Context-Free Grammar (CFG). CFGs have been central to many computer languages (parsers, compilers, etc.). Chomsky's hierarchy of grammars also gave us regular expressions, or regexes: concise programs that parse text strings to extract information from them, format them, and transform them.
Notable regex processors include sed and awk. The sed program ':a;s/\B[0-9]\{3\}\>/,&/;ta' formats numbers in a file by repeatedly inserting a comma before each trailing group of three digits: 123345600, for example, becomes 123,345,600. We are indebted to Chomsky for the concept of universal grammar, embedded in all human brains, enabling them to create an infinite number of sentences from a finite lexicon. This is done through a process of recursion, a key concept in computing. Chomsky is a definite linguistics genius.
The genius lasts but one century
But as Voltaire famously said, "the genius has but one century, after that it must degenerate." Other linguists have taken Chomsky to task over the role of recursion in universal grammar: it is more akin to iteration and merging, they say. What's more, CFG-based regexes have been shown to exhibit search times that grow exponentially with the size of the input string they are searching. I have shown this in a corollary to the recurrence relation theorem: a pattern of the form f{n}s matched against an input mx searches in O(2^n) time when x ≠ s. Python's regex implementation takes 2 hours and 10 minutes to search the 35 fields (n = 35) of a 100-character input string!
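To see this exponential blow-up concretely, here is a minimal sketch using a classic nested-quantifier pattern (an illustrative choice, not the author's 35-field benchmark): Python's backtracking re engine must try exponentially many ways to split the run of 'a's before the match fails.

```python
import re
import time

# Illustrative catastrophic-backtracking demo (not the author's benchmark).
# The nested quantifiers in (a+)+b give the backtracking engine
# exponentially many ways to partition the 'a's; since the subject never
# ends in 'b', every partition is tried before the match fails.
pattern = re.compile(r"(a+)+b")

for n in range(16, 25, 2):
    subject = "a" * n + "c"               # no trailing 'b': guaranteed failure
    start = time.perf_counter()
    pattern.match(subject)                # returns None, but only after ~2^n steps
    elapsed = time.perf_counter() - start
    print(f"n={n:2d}  {elapsed:8.4f} s")  # roughly quadruples per line (n grows by 2)
```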
Enter PEGs
In 2004, Bryan Ford formalized the Parsing Expression Grammar (PEG) concept, seen as a considerable improvement over Chomsky's CFGs. When paired with Ford's packrat parser, PEGs have been shown to match in linear time; the packrat parser is critical for that linear-time processing. However, the memory consumption of PEGs is significantly higher than that of CFG-based tools: to support their linear-time algorithm, PEGs have to remember a lot of intermediate results. This is called memoization.
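To make the memoization idea concrete, here is a minimal, hedged sketch of a packrat-style parser in Python (not Ford's implementation and not Rosie's): every (rule, position) result is cached, so no rule is re-evaluated at the same position, trading memory for linear time.

```python
from functools import lru_cache

# Minimal packrat-style sketch (illustrative; not Ford's or Rosie's code).
# Grammar (PEG with ordered choice '/'):
#   expr  <- digit '+' expr / digit
#   digit <- [0-9]
# Memoizing each (rule, position) pair is the "packrat" part: each result
# is computed at most once, which keeps the total work linear.

def parse_sum(text: str) -> bool:
    @lru_cache(maxsize=None)              # memoization table for this rule
    def digit(pos):
        if pos < len(text) and text[pos].isdigit():
            return pos + 1                # position after the match
        return None                       # PEG failure

    @lru_cache(maxsize=None)
    def expr(pos):
        after = digit(pos)
        if after is None:
            return None
        if after < len(text) and text[after] == "+":
            rest = expr(after + 1)        # first alternative: digit '+' expr
            if rest is not None:
                return rest
        return after                      # ordered fallback: just digit

    return expr(0) == len(text)           # succeed only if all input is consumed

print(parse_sum("1+2+3"))  # True
print(parse_sum("1++2"))   # False
```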
Rosie Pattern Language is PEG-based
Jamie Jennings' Rosie Pattern Language is PEG-based. It has direct application in the realms of data science, analytics, and machine learning. If data is the new oil, then you need new, efficient tools to exploit it.
CFG-based regexes have worked miracles on small data sets in the past, but they fall short in the sphere of big data. They don't scale, they carry the Achilles heel of exponential matching time on increasingly sizable input strings, and they are difficult to understand and maintain. Moreover, they cannot match PEGs' ability to express the recursive structures found in JSON and XML. Only 0.5% of the data generated in today's world can be efficiently parsed by existing tools.
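As a small, hedged illustration of that last point (plain Python, not Rosie's RPL syntax), one recursive rule recognizes arbitrarily nested brackets, the kind of structure found in JSON and XML that a single standard regex cannot express.

```python
# Sketch only: a recursive PEG-style rule for nested brackets.
#   nested <- '[' nested* ']'
# The function returns the position after a successful match, or None.
def nested(text, pos=0):
    if pos >= len(text) or text[pos] != "[":
        return None
    pos += 1
    while (inner := nested(text, pos)) is not None:
        pos = inner                       # consume each nested group in turn
    if pos < len(text) and text[pos] == "]":
        return pos + 1
    return None

print(nested("[[][[]]]") == len("[[][[]]]"))  # True: balanced, fully consumed
print(nested("[[]"))                          # None: unbalanced input
```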
Rosie, or the convergence of linguistics research and computing
In Chomsky's words, we are dealing with a finite lexicon from which we generate an infinite number of sentences. Data, too, is effectively infinite in this day and age, and taming the beast requires recursion. More PEG-based solutions can be expected to see the light of day; Rosie has gotten a head start.
Ford stood on Chomsky's shoulders, as Jennings did on Ford's, as CS held hands with linguistics. You can expect Rosie to be used to facilitate linguistics research. Computational linguistics, for example, could benefit from precise parsing that maps semi-structured text into queryable data formats.