Is Rust's lexical grammar regular, context-free or context-sensitive?


The lexical grammar of most programming languages is deliberately inexpressive so that it can be lexed quickly. I'm not sure which category Rust's lexical grammar belongs to. Most of it seems regular, probably with the exception of raw string literals:
let s = r##"Hi lovely "\" and "#", welcome to Rust"##;
println!("{}", s);
Which prints:
Hi lovely "\" and "#", welcome to Rust
As we can add arbitrarily many #, it seems like it can't be regular, right? But is the grammar at least context-free? Or is there something non-context free about Rust's lexical grammar?
Related: Is Rust's syntactical grammar context-free or context-sensitive?
The raw string literal syntax is not context-free.
If you think of it as a string surrounded by r#ᵏ"…"#ᵏ (reading the superscript k as a repetition count), then you might expect it to be context-free:
raw_string_literal
: 'r' delimited_quoted_string
delimited_quoted_string
: quoted_string
| '#' delimited_quoted_string '#'
But that is not actually the correct syntax, because the quoted_string is not allowed to contain "#ᵏ, although it can contain "#ʲ for any j < k.
Excluding the terminating sequence without excluding any other similar sequence of a different length cannot be accomplished with a context-free grammar because it involves three (or more) uses of the k-repetition in a single production, and stack automata can only handle two. (The proof that the grammar is not context-free is surprisingly complicated, so I'm not going to attempt it here for lack of MathJax. The best proof I could come up with uses Ogden's lemma and the uncommonly cited (but highly useful) property that context-free grammars are closed under the application of a finite-state transducer.)
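Operationally, none of this makes raw strings hard to scan; a counter suffices. Here is a minimal sketch of that idea (my own illustration, not Rust's actual lexer, and ignoring details such as byte strings and newline normalization):

```rust
// Scan a raw string literal by counting the leading '#'s after 'r' and
// then searching for a '"' followed by the same number of '#'s.
// Returns (body, rest_of_input) on success.
fn scan_raw_string(input: &str) -> Option<(&str, &str)> {
    let rest = input.strip_prefix('r')?;
    let hashes = rest.chars().take_while(|&c| c == '#').count();
    let rest = &rest[hashes..]; // '#' is ASCII, so byte indexing is safe
    let body_start = rest.strip_prefix('"')?;
    // The terminator is '"' followed by exactly `hashes` '#'s.
    let terminator = format!("\"{}", "#".repeat(hashes));
    let end = body_start.find(terminator.as_str())?;
    let body = &body_start[..end];
    let remainder = &body_start[end + terminator.len()..];
    Some((body, remainder))
}
```

The unbounded counter is exactly the feature that no finite-state (or, combined with the exclusion rule, context-free) formalism can supply.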
C++ raw string literals are also context-sensitive, and pretty well all whitespace-sensitive languages (like Python and Haskell) are context-sensitive. None of these lexical analysis tasks is particularly complicated, although most standard scanner generators don't provide as much assistance as one might like, so the context-sensitivity is not a huge problem. But there it is.
Rust's lexical grammar offers a couple of other complications for a scanner generator. One issue is the double meaning of ', which is used both to create character literals and to mark lifetime variables and loop labels. Apparently it is possible to determine which of these applies by considering the previously recognized token. That could be solved with a lexical scanner which is capable of generating two consecutive tokens from a single pattern, or it could be accomplished with a scannerless parser; the latter solution would be context-free but not regular. (C++'s use of ' as part of numeric literals does not cause the same problem; the C++ tokens can be recognized with regular expressions, because the ' can not be used as the first character of a numeric literal.)
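The two roles of ' can be seen side by side in ordinary Rust (this snippet just illustrates the ambiguity the lexer must resolve; it says nothing about how rustc resolves it internally):

```rust
// The same character ' plays two lexical roles.
fn demo() -> (char, u32) {
    let quote: char = '\'';  // here ' opens (and closes) a character literal
    let mut last = 0;
    'outer: for i in 0..10 { // here ' introduces the loop label `outer`
        if i == 3 {
            break 'outer;    // and here it refers back to that label
        }
        last = i;
    }
    (quote, last)
}
```

After 'x the lexer must decide between a character literal (if a closing ' follows) and a lifetime or label (if not), which is why simple one-regex-per-token scanning does not suffice here.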
Another slightly context-dependent lexical issue is that the range operator, .., takes precedence over floating point values, so that 2..3 must be lexically analysed as three tokens: 2 .. 3, rather than two floating point numbers 2. .3, which is how it would be analysed in most languages which use the maximal munch rule. Again, this might or might not be considered a deviation from regular expression tokenisation, since it depends on trailing context. But since the lookahead is at most one character, it could certainly be implemented with a DFA.
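The range behaviour is easy to observe from the language itself: 2..5 below is lexed as integer, .., integer, so it is a range expression, whereas the maximal-munch reading 2. .5 would be two floats and would not even type-check in this position.

```rust
// `2..5` is lexed as the three tokens `2` `..` `5`, forming a range.
fn range_tokens() -> Vec<i32> {
    (2..5).collect()
}
```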
Postscript
On reflection, I am not sure that it is meaningful to ask about a "lexical grammar". Or, at least, it is ambiguous: the "lexical grammar" might refer to the combined grammar for all of the language's tokens, or it might refer to the act of separating a sentence into tokens. The latter is really a transducer, not a parser, and raises the question of whether the language can be tokenised with a finite-state transducer. (The answer, again, is no, because raw strings cannot be recognized by a FSA, or even a PDA.)
Recognizing individual tokens and tokenising an input stream are not necessarily equivalent. It is possible to imagine a language in which the individual tokens are all recognized by regular expressions but an input stream cannot be handled with a finite-state transducer. That will happen if there are two regular expressions T and U such that some string matching T is the longest token which is a strict prefix of an infinite set of strings in U. As a simple (and meaningless) example, take a language with tokens:
a
a*b
Both of these tokens are clearly regular, but the input stream cannot be tokenized with a finite-state transducer because it must examine an entire run of a's (of any length) before deciding whether to fall back to emitting individual a tokens or to accept the single token consisting of all the a's and the following b (if present).
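A tokenizer for this toy language makes the unbounded buffering explicit. This is my own sketch; the point is that nothing can be emitted until the whole run of a's has been examined:

```rust
// Tokenize the toy language with tokens `a` and `a*b`.
fn tokenize(input: &str) -> Option<Vec<String>> {
    let mut tokens = Vec::new();
    let mut chars = input.chars().peekable();
    while let Some(&c) = chars.peek() {
        match c {
            'a' => {
                // Buffer the whole run of a's: we cannot emit anything yet,
                // because the decision depends on an arbitrarily distant 'b'.
                let mut run = 0usize;
                while chars.peek() == Some(&'a') {
                    chars.next();
                    run += 1;
                }
                if chars.peek() == Some(&'b') {
                    chars.next();
                    // The entire run plus the b is one `a*b` token.
                    tokens.push(format!("{}b", "a".repeat(run)));
                } else {
                    // Fall back: each a is its own token.
                    for _ in 0..run {
                        tokens.push("a".to_string());
                    }
                }
            }
            'b' => {
                chars.next();
                tokens.push("b".to_string()); // `a*b` with zero a's
            }
            _ => return None, // not in the language
        }
    }
    Some(tokens)
}
```

A finite-state transducer has only bounded memory between outputs, so it cannot hold an arbitrarily long run of a's before committing; this function can, because `run` is an unbounded counter.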
Few languages show this pathology (and, as far as I know, Rust is not one of them), but it is technically present in some languages in which keywords are multiword phrases.
