In computer science, in the area of formal language theory, frequent use is made of a variety of string functions; however, the notation used is different from that used for computer programming, and some commonly used functions in the theoretical realm are rarely used when programming. This article defines some of these basic terms.
A string is a finite sequence of characters.The empty string is denoted by
\varepsilon
s
t
s ⋅ t
st
s ⋅ \varepsilon=s=\varepsilon ⋅ s
s ⋅ (t ⋅ u)=(s ⋅ t) ⋅ u
For example,
(\langleb\rangle ⋅ \langlel\rangle) ⋅ (\varepsilon ⋅ \langleah\rangle)=\langlebl\rangle ⋅ \langleah\rangle=\langleblah\rangle
A language is a finite or infinite set of strings.Besides the usual set operations like union, intersection etc., concatenation can be applied to languages:if both
S
T
S ⋅ T
S
T
S ⋅ T=\{s ⋅ t\mids\inS\landt\inT\}
⋅
The language
\{\varepsilon\}
\{\}
S ⋅ \{\varepsilon\}=S=\{\varepsilon\} ⋅ S
S ⋅ \{\}=\{\}=\{\} ⋅ S
S ⋅ (T ⋅ U)=(S ⋅ T) ⋅ U
For example, abbreviating
D=\{\langle0\rangle,\langle1\rangle,\langle2\rangle,\langle3\rangle,\langle4\rangle,\langle5\rangle,\langle6\rangle,\langle7\rangle,\langle8\rangle,\langle9\rangle\}
D ⋅ D ⋅ D
The alphabet of a string is the set of all of the characters that occur in a particular string. If s is a string, its alphabet is denoted by
\operatorname{Alph}(s)
The alphabet of a language
S
S
\operatorname{Alph}(S)=cups\operatorname{Alph}(s)
For example, the set
\{\langlea\rangle,\langlec\rangle,\langleo\rangle\}
\langlecacao\rangle
D
D ⋅ D ⋅ D
Let L be a language, and let Σ be its alphabet. A string substitution or simply a substitution is a mapping f that maps characters in Σ to languages (possibly in a different alphabet). Thus, for example, given a character a ∈ Σ, one has f(a)=La where La ⊆ Δ is some language whose alphabet is Δ. This mapping may be extended to strings as
f(ε)=ε
for the empty string ε, and
f(sa)=f(s)f(a)
for string s ∈ L and character a ∈ Σ. String substitutions may be extended to entire languages as [1]
f(L)=cups\inf(s)
Regular languages are closed under string substitution. That is, if each character in the alphabet of a regular language is substituted by another regular language, the result is still a regular language.[2] Similarly, context-free languages are closed under string substitution.[3] [4]
A simple example is the conversion fuc(.) to uppercase, which may be defined e.g. as follows:
character | mapped to language | remark | |
---|---|---|---|
x | fuc(x) | ||
‹a› | map lowercase char to corresponding uppercase char | ||
‹A› | map uppercase char to itself | ||
‹ß› | no uppercase char available, map to two-char string | ||
‹0› | map digit to empty string | ||
‹!› | forbid punctuation, map to empty language | ||
... | similar for other chars |
For the extension of fuc to strings, we have e.g.
For the extension of fuc to languages, we have e.g.
A string homomorphism (often referred to simply as a homomorphism in formal language theory) is a string substitution such that each character is replaced by a single string. That is,
f(a)=s
s
a
String homomorphisms are monoid morphisms on the free monoid, preserving the empty string and the binary operation of string concatenation. Given a language
L
f(L)
L
s
f-1(s)=\{w\midf(w)=s\}
while the inverse homomorphic image of a language
L
f-1(L)=\{s\midf(s)\inL\}
In general,
f(f-1(L)) ≠ L
f(f-1(L))\subseteqL
and
L\subseteqf-1(f(L))
for any language
L
The class of regular languages is closed under homomorphisms and inverse homomorphisms.[7] Similarly, the context-free languages are closed under homomorphisms[8] and inverse homomorphisms.[9]
A string homomorphism is said to be ε-free (or e-free) if
f(a) ≠ \varepsilon
\Sigma
An example string homomorphism guc can also be obtained by defining similar to the above substitution: guc(‹a›) = ‹A›, ..., guc(‹0›) = ε, but letting guc be undefined on punctuation chars. Examples for inverse homomorphic images are
For the latter language, guc(guc−1) = guc = ≠ .The homomorphism guc is not ε-free, since it maps e.g. ‹0› to ε.
A very simple string homomorphism example that maps each character to just a character is the conversion of an EBCDIC-encoded string to ASCII.
If s is a string, and
\Sigma
\Sigma
\pi\Sigma(s)
\pi\Sigma(s)=\begin{cases}\varepsilon&ifs=\varepsilontheemptystring\\ \pi\Sigma(t)&ifs=taanda\notin\Sigma\ \pi\Sigma(t)a&ifs=taanda\in\Sigma\end{cases}
Here
\varepsilon
String projection may be promoted to the projection of a language. Given a formal language L, its projection is given by
\pi\Sigma(L)=\{\pi\Sigma(s) \vert s\inL\}
The right quotient of a character a from a string s is the truncation of the character a in the string s, from the right hand side. It is denoted as
s/a
(sa)/b=\begin{cases}s&ifa=b\\ \varepsilon&ifa\neb \end{cases}
The quotient of the empty string may be taken:
\varepsilon/a=\varepsilon
Similarly, given a subset
S\subsetM
M
S/a=\{s\inM \vert sa\inS\}
Left quotients may be defined similarly, with operations taking place on the left of a string.
Hopcroft and Ullman (1979) define the quotient L1/L2 of the languages L1 and L2 over the same alphabet as .[10] This is not a generalization of the above definition, since, for a string s and distinct characters a, b, Hopcroft's and Ullman's definition implies yielding, rather than .
The left quotient (when defined similar to Hopcroft and Ullman 1979) of a singleton language L1 and an arbitrary language L2 is known as Brzozowski derivative; if L2 is represented by a regular expression, so can be the left quotient.[11]
The right quotient of a subset
S\subsetM
M
\simS =\{(s,t)\inM x M \vert S/s=S/t\}
The relation is clearly of finite index (has a finite number of equivalence classes) if and only if the family right quotients is finite; that is, if
\{S/m \vert m\inM\}
is finite. In the case that M is the monoid of words over some alphabet, S is then a regular language, that is, a language that can be recognized by a finite state automaton. This is discussed in greater detail in the article on syntactic monoids.
The right cancellation of a character a from a string s is the removal of the first occurrence of the character a in the string s, starting from the right hand side. It is denoted as
s ÷ a
(sa) ÷ b=\begin{cases}s&ifa=b\\ (s ÷ b)a&ifa\neb \end{cases}
The empty string is always cancellable:
\varepsilon ÷ a=\varepsilon
Clearly, right cancellation and projection commute:
\pi\Sigma(s) ÷ a=\pi\Sigma(s ÷ a)
The prefixes of a string is the set of all prefixes to a string, with respect to a given language:
\operatorname{Pref}L(s)=\{t \vert s=tufort,u\in\operatorname{Alph}(L)*\}
where
s\inL
The prefix closure of a language is
\operatorname{Pref}(L)=cups\in\operatorname{Pref}L(s)=\left\{t \vert s=tu;s\inL;t,u\in\operatorname{Alph}(L)*\right\}
Example:
L=\left\{abc\right\}then\operatorname{Pref}(L)=\left\{\varepsilon,a,ab,abc\right\}
A language is called prefix closed if
\operatorname{Pref}(L)=L
The prefix closure operator is idempotent:
\operatorname{Pref}(\operatorname{Pref}(L))=\operatorname{Pref}(L)
\sqsubseteq
s\sqsubseteqt
s\in\operatorname{Pref}L(t)
f(a)=\{s\}