A smaller, more compatible regular expression engine for XS
Introduction
The JavaScript language includes support for regular expressions, a compact and powerful tool to perform search and replace operations on strings. Regular expressions can be considered a separate language within JavaScript, with its own syntax and behaviors.
Just as implementing a JavaScript engine is a major undertaking, properly implementing a new regular expression engine is a significant challenge. From the start, Moddable's XS JavaScript engine has used a modified version of the venerable PCRE engine originally created for use in the Perl language.
Overview
The new regular expression engine in XS achieves several important goals:
- Support for all regular expression features in the current JavaScript standard
- Support for three regular expression proposals under consideration for inclusion in a future edition of the standard
- Significantly smaller ROM footprint, about 88% smaller.
- Easier integration with XS
The team behind XS has a longstanding commitment to supporting the full standard JavaScript language, even on extremely resource constrained devices. The new regular expression engine paves the way for ongoing support of regular expression enhancements, while reducing the ROM footprint so significantly that regular expressions are now practical to use in many more projects.
Motivation
The PCRE engine has worked well for XS for nearly 15 years. It has been necessary to modify PCRE to integrate cleanly with XS, for example to use the XS memory heap instead of system memory. More recently, we have enhanced the functionality of PCRE to achieve compatibility with the ECMAScript standard.
There are several proposals currently in the late stages of consideration by TC39, the ECMAScript language committee. We found it impractical to enhance PCRE further to support these features, in particular look behind assertions. We explored using PCRE2, but it is an even bigger engine and achieving a clean integration proved too difficult.
This challenge motivated an exploration of the feasibility of building a new regular expression engine specifically for XS. Years of experience integrating regular expression support in XS and using regular expressions in JavaScript code provided a knowledge base for building a new engine.
Implementation
The design and implementation of the new regular expression engine began as a combination of C and JavaScript to facilitate rapid experimentation. The final version is entirely written in C. The author of the engine, Patrick Soquet, explains the process together with some details of the implementation, including the fact that part of the engine was inspired by Piu, Moddable's user interface engine.
I am no specialist of regular expressions. I studied the theory a bit, before implementing a parser in C, following the specified grammar and using the hand coded parsing technique that is everywhere in XS.
The character ranges were kind of fun because the parser has to union or inverse character sets. I saw character sets like one dimension graphical regions and used the same algorithms as Piu! Graphical regions are small enough and fast enough for hit testing, which is the most frequent operation when executing regular expressions.
Initially the parser created trees of JavaScript objects that described the regular expressions. I used JavaScript to implement the specified semantics quite literally, with a lot of functions and closures. I got something that passed some tests, and, mostly, I eventually understood how regular expressions were supposed to work.
Then I progressively transformed the JavaScript code: instead of interpreting the tree while executing the regular expressions, the functions generated a state machine. With a state stack for backtracking, it was the state machine that executed the regular expression. Once that passed some tests, I recoded the generation and the execution of the state machine in C. Et voilĂ :-)
Compatibility
The official test suite for JavaScript is test262. It is maintained as part of the ECMAScript standardization effort, ensuring that up-to-date tests are available for all features in the standard and those under consideration. The test suite has been invaluable in our development of XS, as it provides examples of all the requirements described in the language specification. Without test262 understanding the language specification would be more difficult, and validating enhancements changes to an engine would be far less reliable.
The new regular expression engine achieves excellent compatibility as verified by test262 tests.
- 100% (178 of 178 tests pass) of test262
language/literals/regexp
tests
- 99% (1684 of 1698 tests pass) of test262
built-ins/RegExp
tests.
- 100% of the dotAll Stage 4 proposal
- 100% of the look behind assertions Stage 3 proposal
- 100% of the named capture groups Stage 3 proposal
The new engine does not implement the unicode property escapes Stage 3 proposal. There is no technical barrier to implementing this proposal. However, the required Unicode tables are larger than practical on the devices XS targets.
Code size
The PCRE engine, as used by XS, compiles to 219,888 bytes. The embedded devices XS is designed to target typically have between 512KB and 2MB of ROM. PCRE requires such a large fraction of the total ROM that it is usually stripped from embedded projects.
The new regular expression engine compiles to 25,644 bytes, or just 11.6% of the size of PCRE. This reduction in code size, nearly an order of magnitude, makes regular expressions practical for use by many more projects.
Note: Many factors influence the size of compiled code. The numbers provided here are based on a GCC build for ESP8266 using a release build of the Moddable SDK.
The question naturally arises as to how the new regular expression engine can be so much smaller. There is no single answer, but a variety of reasons:
- PCRE is built for Perl, not JavaScript. As such, it contains features that are unused by a JavaScript engine.
- For portability, PCRE duplicates capabilities already present in XS, for example for memory management and managing character encoding.
- The new regular expression engine takes a different approach to operations on character sets.
Performance and RAM
We have not yet performed a detailed analysis of the performance and RAM use of the new regular expression engine. Some details are worth noting:
- The engine uses XS memory heaps exclusively, reducing the free space needed in system memory, speeding allocations, and reducing fragmentation.
- The engine runs with a smaller native stack size, eliminating the need to increase the native stack to use regular expressions reliably in all circumstances.
- Performance appears to be similar to PCRE. As performance has not been a focus of development of new regular expression engine, there are likely optimization opportunities.