From String to AST: parsing


Whether you deal with data in the form of CSV, JSON or a full-blooded programming language like C, JavaScript, Scala, or maybe a query language like SQL, you always transform some sequence of characters (or binary values) into a structured representation. Whatever you do with that representation depends on your domain and business goals, and is quite often the core value of whatever you are doing. With a plethora of tools doing the parsing for us (including the error handling), we might easily forget how complex and interesting a process it is.

Formal grammars

First of all, most input formats that we handle follow some formal definition, telling e.g. how key-values are organized (JSON), how you separate column names/values (CSV), how you express projections and conditions (SQL). These rules are described in an unambiguous way, so that the input can be interpreted in a very deterministic way. This directly contrasts with the language we use to communicate with other people, which is often ambiguous and relies on context. Wanna grab some burgers? might be a nice suggestion if you are talking to a colleague who had to skip lunch and likes burgers, but might be offensive if told in a sarcastic tone to someone who doesn't eat meat. Then, words can have different meanings depending on the culture you are currently in, the times you live in, or your and your conversationalist's social positions (vide, e.g. Japanese, where your position and the suffixes you add at the end of a name change the tone of the whole conversation). Languages we use when communicating with a computer must be free of such uncertainties. The meaning should depend only on the input we explicitly entered, interpreted deterministically. (Just in case: by deterministically I mean deterministically interpreted, which doesn't mean that it will always produce the same result. If I write currentTimeMillis(), the function will always return a different result, but the meaning will always be the same – the compiler/interpreter will understand that I want to call the currentTimeMillis() function, and it won't suddenly decide that I want to e.g. change a compiler flag. Of course, the meaning of the function can change in time – for instance, if I edit the source code between runs – and certainly the value returned by it, which is bound to time.)

Initially, it wasn't known how to parse languages. The reason why we had to start with punched cards, some time later moved on to assembly, later on invented Fortran and Lisp, went through the whole spaghetti-code era with Basic, got the case against the goto statement from Dijkstra, until we could – slowly – start developing the more sophisticated compilers we have today, was that there were no formal foundations for it.

Linguists know that we can distinguish some parts of speech like: noun (a specific thing, e.g. cat, Alice, Bob), pronoun (a generic replacement for a specific thing, e.g. I, you, he, she), verb (an action), adjective (a description or trait of something, e.g. red, smart), etc. However, they also know that the function of a part of speech changes depending on how we construct a sentence – that's why we also have the parts of the sentence: subject (who performs the action, e.g. Alice in Alice eats dinner), object (who is the target of the action, e.g. dinner in Alice eats dinner), modifiers and predicates, etc. We can only tell which part of speech and sentence a word is in the context of a whole sentence:

  • An alarm is set to 12 o’clock – here, set is a verb,
  • This function returns an infinite set – here, set is a noun and an object,
  • The set has the cardinality of 2 – here, set is a noun and a subject,
  • All is set and done – here, set is an adverb and a modifier.

As we can see, the same word might be a completely different thing depending on the context. This might be a problem when we try to process the sentence bottom-up, just like we (supposedly) do when we analyze sentences in English lessons. This is a noun. That is a verb. This noun is the subject, that noun is the object. This is how subsentences relate to one another. Now we can build the nice tree of relations between words and understand the meaning. As humans, we can grasp the relationships between the words on the fly; the whole exercise is only about formalizing our intuition.

But machines have no intuition. They can only follow the rules we set for them. And when dealing with computers, we quite often set them using the divide-and-conquer strategy: split the big problem into smaller ones, and then combine the solutions. With natural languages the context makes this quite hard, which is why no simple solution appeared even though we kept trying. Recent progress was made mostly using machine learning, which tackles the whole problem at once, trying to fit whole parts of the sentence as patterns, without analyzing what is what. However, when it comes to communication with a computer, ambiguities can be avoided, simply by designing the language in a way that doesn't allow them. But how to design such a language?

One of the first researchers who made this progress possible was Noam Chomsky. Interestingly, he is not considered a computer scientist – he is (among other things) a linguist, credited with the cognitive revolution. Chomsky believes that the way we structure languages is rooted in how our brains process speech, reading, etc. Therefore, similarities between languages' structures (parts of speech, parts of sentences, structuring ideas into sentences in the first place, grammatical cases) are a result of the processes inside our brains. While he wasn't the first one who tried to formalize a language into a formal grammar (we know of e.g. Pāṇini), Chomsky was the first to formalize generative grammars, that is grammars where you define a set of rules and create a language by combining the rules.

How can we define these rules? Well, we want to be able to express each text in such a grammar as a tree – at the leaves, we'll have words or punctuation marks of sorts. Then, there will be nodes aggregating words/punctuation marks by their function (part of a sentence). At the top of the tree, we'll have a root, which might be (depending on the grammar) a sentence/a statement/an expression, or maybe a sequence of sentences (a program). The definitions work this way: take a node (starting with the root) and add some children to it; the rules say how the specific node (or nodes) can have children appended (and what kind of children). The grammar definitions will rarely be expressed with specific values (e.g. you won't write down all possible names), but rather using symbols:

$Sentence \rightarrow Subject\ verb\ Object\ .$

$Subject \rightarrow name\ surname \mid nickname$

$Object \rightarrow item \mid animal$

Here, $Sentence$ would be the start symbol. We create a sentence by unrolling nodes according to the rules. There is only one rule going from $Sentence$ – one that appends $Subject$, $verb$, $Object$ and the dot sign ($.$) as children (order matters!). $verb$ is written in lowercase because it is (or eventually will be) a leaf – since unrolling ends at leaves, we call the symbols allowed to be leaves terminal symbols. As you might guess, nodes become nonterminal symbols. Terminal symbols will eventually be replaced with an actual word, unless they are keywords (have you noticed how if, else, function, class, … get special treatment in many languages?) or special symbols (;, (, ), ,, …).

Having $Subject\ verb\ Object\ .$, we can continue unrolling. Our second rule lets us turn $Subject$ into $name\ surname$ or $nickname$ (the vertical line $\mid$ is a shortcut – $A \rightarrow B \mid C$ means that both $A \rightarrow B$ and $A \rightarrow C$ are valid production rules).

Notice that in the end, we'll always end up with a sequence of terminals. If we couldn't, there would be something wrong with the language. These definitions, which take a sequence of symbols and return another sequence of symbols, are called production rules. We can describe each formal grammar as a quadruple $G = (N, \Sigma, P, S)$, where $N$ is the set of nonterminal symbols, $\Sigma$ is the set of terminal symbols (the alphabet), $P$ is the set of production rules, and $S \in N$ is the start symbol.
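To make the quadruple tangible, here is a minimal Scala sketch of such a definition (the encoding is mine, just for illustration – symbols as strings, productions as pairs of symbol sequences):

// a grammar quadruple G = (N, Σ, P, S), modelled naively
final case class Production(from: List[String], to: List[String])

final case class Grammar(
  nonterminals: Set[String],      // N
  terminals:    Set[String],      // Σ
  productions:  List[Production], // P
  start:        String            // S
)

val sentences = Grammar(
  nonterminals = Set("Sentence", "Subject", "Object"),
  terminals    = Set("name", "surname", "nickname", "verb", "item", "animal", "."),
  productions  = List(
    Production(List("Sentence"), List("Subject", "verb", "Object", ".")),
    Production(List("Subject"),  List("name", "surname")),
    Production(List("Subject"),  List("nickname")),
    Production(List("Object"),   List("item")),
    Production(List("Object"),   List("animal"))
  ),
  start = "Sentence"
)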

Besides the formalization of generative grammars, Chomsky did something else: he was responsible for organizing formal languages into a hierarchy named after him – the Chomsky hierarchy.

The Chomsky hierarchy

At the top of the hierarchy are type-0 languages, or unrestricted languages. There are no restrictions placed upon how we define such a language: a production rule might turn any sequence of terminals and nonterminals into any other sequence of terminals and nonterminals (in the earlier example there was always a single nonterminal symbol on the left side – that is not a rule in general!). These languages are hard to deal with, so we try to describe data formats and programming languages in terms of more constrained grammars, which are easier to analyze.

The first restriction appears with type-1 languages, or context-sensitive grammars (CSG). They require that all production rules be of the form:

$\alpha A \beta \rightarrow \alpha \gamma \beta$

where $A \in N$ is a nonterminal, $\alpha, \beta \in (N \cup \Sigma)^*$ are (possibly empty) sequences of terminals and nonterminals, and $\gamma \in (N \cup \Sigma)^+$ is a non-empty sequence. In other words, $A$ can be replaced with $\gamma$, but only when it appears between $\alpha$ and $\beta$ – in their context.

More specifically, we might want our grammars to be independent of the context. Type-2 languages, or context-free grammars (CFG), are CSGs where the context is always empty, or in other words, where each production rule is of the form:

$A \rightarrow \gamma$

where $A \in N$ is a single nonterminal and $\gamma \in (N \cup \Sigma)^*$ is a sequence of terminals and nonterminals.

To be exact, when it comes to programming languages, we quite often deal with context-sensitive grammars, but it is easier to handle them as if they were context-free – call that syntactic analysis (what meaning we can attribute to words based on their position in a sentence) – and then take the resulting tree, called an abstract syntax tree, and check whether it makes semantic sense (is this name a function, a variable or a type? does it make sense to use it in the context it was placed in?). If we expressed all of it as a context-sensitive grammar, we could do much (all?) of the semantic analysis at the same time we check the syntax, but the grammar could become too complex for us to understand (or at least to maintain efficiently).

To illustrate the difference between syntax and semantics, we can get back to our earlier example.

$nickname\ verb\ item\ .$

It is a valid sentence in the language. Let's replace the terminals with some specific values.

Johnny eat integral.

What we got is correct according to the rules based on words' positions in the sentence (syntax), but as a whole – when you check the function of each word (semantics) – it makes no sense. Theoretically, we could define our language in such an elaborate way that it would make sure there would always be e.g. eats after a third person in a sentence, and something edible after some form of the verb to eat, but you can easily imagine that the number of production rules would explode.

Finally, there is the most restricted kind of grammar in the Chomsky hierarchy. A type-3 grammar, or regular grammar, is a language where you basically either prepend or append terminals. That is, each production rule must be in one of the forms:

  • $A \rightarrow a$
  • $A \rightarrow \epsilon$
  • $A \rightarrow aB$

(We call it a right regular grammar – if we instead required that the third rule be of the form $A \rightarrow Ba$, we would be talking about a left regular grammar.)

Regular languages

Let's start with the most limited grammars, that is, regular grammars. No matter how we define the production rules, we will end up with a tree of the form:



[Diagram: a degenerate tree – each node $N_i$ has a terminal $t_i$ and the next node $N_{i+1}$ as children, so the whole tree is a chain $N_1 \rightarrow t_1 N_2$, $N_2 \rightarrow t_2 N_3$, $N_3 \rightarrow t_3 \dots$]

Of course, it doesn't mean that each such tree will be the same. For instance, we could define our grammar like this:

  • $A_0 \rightarrow a A_1$
  • $A_1 \rightarrow a A_2$
  • $A_2 \rightarrow a A_3$
  • $A_3 \rightarrow \epsilon$

If we started from $A_0$, the only word this grammar would accept is aaa. A slightly more interesting language is generated by:

  • $S \rightarrow aB$
  • $B \rightarrow bB \mid bC$
  • $C \rightarrow c$

If our starting symbol were $S$, the sentences we could accept as belonging to the grammar would be abc, abbc, abbbc, … If you've been programming for a while and you ever had to find some pattern in a text, you should have a feeling that this looks familiar. Indeed, regular languages are the formalism behind regular expressions.

The first example could be described simply as $aaa$, while the second as $a(b+)c$ (or $ab(b^*)c$). Here, $*$ and $+$ correspond directly to the Kleene star and the Kleene plus. Now that we know we are talking about regexps, we can provide another definition of what a regular language could be – equivalent to the production-rule-based one, but easier to work with.

A regular expression is anything built using the following rules:

  • $\epsilon$ is a regular expression accepting the empty word as belonging to the language,
  • $a$ is a regular expression accepting 'a', belonging to some alphabet $\Sigma$ (the terminal symbols), as a word belonging to the language,
  • when you concatenate two regular expressions, e.g. $AB$, you accept words made by concatenating all valid words of $A$ with all valid words of $B$ (e.g. if $a$ accepts only "a" and $b$ accepts only "b", then $ab$ accepts "ab"),
  • you can sum up regular languages, $A \mid B$ – the result accepts words belonging to either of them,
  • you can use the Kleene star, $A^*$ – it accepts any number (including zero) of concatenations of words belonging to $A$.

That is enough to define all regular expressions, though usually we also have some utilities provided by regexp engines, e.g. [a-z] as a shortcut for $a \mid b \mid \dots \mid z$, or $A^+$ (the Kleene plus) as a shortcut for $AA^*$.

Well, we haven't discussed it so far, but there are some very close relationships between types of formal grammars and models of computation. It just so happens that if we want to define a function checking whether a word/sentence/etc. belongs to a regular grammar/a regular expression – which is equivalent to defining the language – it is done by defining a finite-state automaton (FSA) that accepts this language. And vice versa: each FSA defines a regular language. That correspondence dictates how we implement regexp patterns – basically, each time we compile a regexp pattern, we are building an FSA that accepts all words of the grammar and only them.

In case you've never met an FSA, let us recall what it is. A finite-state automaton, or finite-state machine (FSM), is a 5-tuple $(Q, \Sigma, \delta, q_0, F)$, where:

  • $Q$ is a finite set of states,
  • $\Sigma$ is a finite set of input symbols (an alphabet – equivalent to the set of terminals without $\epsilon$),
  • $\delta: Q \times \Sigma \rightarrow Q$ is a transition function – given the current state and the next input symbol, it returns the next state,
  • $q_0 \in Q$ is an initial state,
  • $F \subseteq Q$ is a set of accepting states.

On a side note: an automaton – singular, meaning one machine; automata – plural, meaning machines. Another nerdy pair of words that works like that: a criterion vs criteria.

For instance: say our alphabet contains 3 possible characters, $\Sigma = \{a, b, c\}$, and we want to build a machine accepting words matching $a(b^*)c$. We:

  • would have to start with a state indicating that nothing was matched yet, but also that nothing is wrong yet. Let's mark it as $q_0$,
  • if the first incoming input symbol is $a$, everything is OK, and we can move on to matching $(b^*)c$; if $b$ or $c$ arrives instead, the word cannot belong to the language,
  • in this particular case we can safely assume that once things start to go wrong, there is no way to recover, but that is not a general rule (if there was e.g. an alternative, then failing to match one expression wouldn't mean that we would fail to match the other expression),
  • to indicate that we matched $a$, let's create a new state, $q_1$; for everything that went wrong we'll use an error state, $e$,
  • we arrived at state $q_1$, so now we expect a (possibly empty) sequence of $b$s finished by a $c$,
  • if we are in $q_1$ and $b$ arrives, we simply stay in $q_1$ – we are still matching $b^*$,
  • if we are in $q_1$ and $c$ arrives, we have matched the whole expression, so we move to a new state, $q_2$; anything else ($a$) takes us to the error state,
  • at this point we are at $q_2$, and the word can be accepted – but if any other symbol arrives, the input is invalid, and we move to $e$.

What we described right now could be defined like this:

$\Sigma = \{a, b, c\}$

$Q = \{q_0, q_1, q_2, e\}$

$\delta = \{ (q_0, a) \rightarrow q_1,\ (q_0, b) \rightarrow e,\ (q_0, c) \rightarrow e,$
$\quad (q_1, a) \rightarrow e,\ (q_1, b) \rightarrow q_1,\ (q_1, c) \rightarrow q_2,$
$\quad (q_2, a) \rightarrow e,\ (q_2, b) \rightarrow e,\ (q_2, c) \rightarrow e,$
$\quad (e, a) \rightarrow e,\ (e, b) \rightarrow e,\ (e, c) \rightarrow e \}$

$F = \{q_2\}$

We could also make it more visual (a bold border marks the accepting state):



[Diagram: the same DFA drawn as a graph – $q_0$ (initial) goes to $q_1$ on a; $q_1$ loops on b and goes to $q_2$ (accepting) on c; every other transition leads to the error state $e$, which loops on every symbol.]

As we can see, each state has to have a defined transition for every possible letter of the alphabet (even if that transition returns the current state as the next state). So, the size of the machine definition (all possible transitions) is $|Q| \times |\Sigma|$.
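Translating such a definition into code is mechanical. Here is a minimal, hand-written Scala sketch of this exact DFA (the encoding is mine, not generated by any tool):

// the DFA for a(b*)c defined above
sealed trait State
case object Q0 extends State // nothing matched yet
case object Q1 extends State // 'a' matched, consuming 'b's
case object Q2 extends State // 'c' matched - the accepting state
case object E  extends State // the error state

def delta(state: State, input: Char): State = (state, input) match {
  case (Q0, 'a') => Q1
  case (Q1, 'b') => Q1
  case (Q1, 'c') => Q2
  case _         => E // every remaining (state, symbol) pair is an error
}

def accepts(word: String): Boolean =
  word.foldLeft(Q0: State)(delta) == Q2

accepts("abbc") // true
accepts("ac")   // true
accepts("abca") // false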

Additionally, building the machine required some effort. We would like to automate generating FSMs from regular expressions, and creating them in their final version right away might be troublesome. What we built is actually called a deterministic finite state machine / deterministic finite automaton (DFA). It guarantees that every single time we will deterministically end up in an accepting state for an accepted input, and in a non-accepting state for a non-accepted input.

In practice, it is usually easier to define a non-deterministic finite automaton (NFA). The difference is that an NFA can have several possible state transitions for each state-input pair, and it picks one at random. So, it cannot reliably match the correct input every time. Instead, we say that it accepts an input if there exists a path within the graph that accepts the whole input – or, in other words, if there is a non-zero probability of ending up in an accepting state.

Let's say we want to parse $a^* \mid aba$. We could define the first expression as:



[Diagram: DFA for $a^*$ – $q_0$ (initial, accepting) loops on a; b or c leads to the error state $e$.]

and the second as:



[Diagram: DFA for $aba$ – $q_0$ goes to $q_1$ on a, $q_1$ to $q_2$ on b, $q_2$ to $q_3$ (accepting) on a; every other transition leads to the error state $e$.]

Now, if we wanted to simply merge these two DFAs, we would have a problem: they both start by accepting $a$, so we would have to know beforehand which one to pick in order to accept a valid input. With an NFA we can make some (even all!) of the transitions non-deterministic, because we are checking whether a path exists, and we don't require that we always walk it on valid input. So let's say we have 2 valid choices from an initial state – with an empty string $\epsilon$ we enter either the first machine or the second machine (yes, we can use the empty string as well!):



[Diagram: the combined NFA – a new initial state $q_{00}$ with two $\epsilon$-transitions: one into the $a^*$ machine ($q_a$, error state $e_1$) and one into the $aba$ machine ($q_0$–$q_3$, error state $e_2$).]

($q_{00}$ is the new initial state; $q_a$ comes from the $a^*$ automaton, while $q_0$–$q_3$ come from the $aba$ automaton.)

What would we have to do to make it deterministic? In this particular case, we can notice that:

  • a correct input is either empty or starts with $a$,
  • if it starts with $a$, what comes next is either a sequence of more $a$s or $ba$.

Let's change our NFA to reflect that observation:



[Diagram: the adjusted automaton – $q_{00}$ (initial) goes on a to a new state $q_{0a}$; from $q_{0a}$, a enters the $a^*$ branch ($q_a$), b enters the $aba$ branch (at $q_2$), and c leads to the error state.]

Let us think for a moment about what happened here. We now have a deterministic version of the $a^* \mid aba$ automaton:

  • if we just started, we can assume that if nothing arrives we are OK (the empty word belongs to $a^*$),
  • however, if we got $a$, there is uncertainty – should we expect a continuation of $a^*$ or of $aba$? So we go to an intermediate state, $q_{0a}$,
  • if nothing else arrives, we are at a valid input, so we accept the state,
  • if $a$ or $b$ arrives, we finally resolved the ambiguity – from now on, we can simply enter a branch directly copy-pasted from the original DFA that produced it.

Of course, this could be optimized a bit – the error states $e$, $e_1$ and $e_2$ could be merged into a single error state.

The process that we showed here is called the determinization of an NFA. In practice, this tracking of possibilities until we have enough data to finally decide requires us to create a node for each combination of "it can go here" and "it can go there", so effectively we end up building a powerset. This means that in the worst case we would have to turn our $n$-state NFA into a DFA with $2^n$ states!

That explains why in older generations of compilers the preferred flow was to generate source code with an already-built DFA, which could be compiled into native code that didn't require any construction at runtime – you paid the cost of building the DFA once, before you even started the compilation of your program.

However, it is not the most comfortable flow, especially since we now have a bit faster computers and a bit higher requirements regarding the speed of delivery and software maintenance. For that reason, we have 2 alternatives: one based on lazy evaluation – you build the required pieces of the DFA lazily as you go through the parsed input – and one based on backtracking. The former is done by simulating the NFA internally and building DFA states on demand. The latter is probably the easiest way to implement regular expressions, though the resulting implementation is no longer $\Theta(n)$ but $O(2^n)$.
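The lazy approach is easy to sketch: instead of precomputing the powerset of states, we keep the set of states the NFA could currently be in, and advance the whole set on each input symbol. A minimal Scala illustration (assuming the ε-transitions were already folded into the initial set, as in our example):

// simulating an NFA by tracking all states it could be in at once
final case class Nfa[S](
  initial: Set[S],
  accepting: Set[S],
  delta: (S, Char) => Set[S]
) {
  def accepts(word: String): Boolean =
    word.foldLeft(initial) { (states, c) =>
      states.flatMap(s => delta(s, c)) // advance every possible state
    }.exists(accepting)
}

// a* | aba, with states named as in the diagrams above
val astarOrAba = Nfa[String](
  initial   = Set("qa", "q0"), // ε-moves into both branches, pre-applied
  accepting = Set("qa", "q3"),
  delta     = {
    case ("qa", 'a') => Set("qa") // the a* branch
    case ("q0", 'a') => Set("q1") // the aba branch
    case ("q1", 'b') => Set("q2")
    case ("q2", 'a') => Set("q3")
    case _           => Set.empty // dead end
  }
)

astarOrAba.accepts("aaa") // true
astarOrAba.accepts("aba") // true
astarOrAba.accepts("ab")  // false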

Regular expressions in practice

The format(s) used to describe regular expressions are taken directly from how regular languages are defined: each symbol normally represents itself (so the regexp a would match a), */+ after an expression represents the Kleene star/plus (0 or more, 1 or more repetitions of the input – a* would match the empty string, a, aa, …), concatenation of expressions represents the concatenated language (aa would match aa, a+b would match ab, aab, …). Or $\mid$ – the sum of regular languages – is represented by | (a|b matches a or b). Parentheses can be used to clarify in which order regular expressions are concatenated (ab+ means (ab)+, so if we wanted a(b+), the parentheses help us achieve what we want). There are also some utilities like [abc], which translates to (a|b|c) and allows us to use ranges instead of listing all characters manually (e.g. [a-z] represents (a|b|c|...|z)), ? which means zero or one occurrence (a? is the same as (|a)), or predefined sets of symbols like \s (a whitespace character), \S (a non-whitespace character) and so on. For details, you can always consult the manual of the particular implementation that you are using.
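For instance, in Scala, the $a(b^*)c$ pattern from before could be used like this (a trivial illustration):

val re = """a(b*)c""".r // compiling the pattern builds the automaton

re.matches("abbbc") // true (Regex#matches exists since Scala 2.13)
re.matches("abca")  // false

// regexes also work as extractors, giving access to captured groups
"abbc" match {
  case re(bs) => println(s"matched, with '$bs' in the middle") // bs == "bb"
  case _      => println("no match")
}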

If you are interested in the process of implementing regular expressions and building finite state machines out of the regexp format, I recommend getting a book like Compilers: Principles, Techniques, and Tools by Aho, Lam, Sethi, and Ullman. There are too many implementation details, which aren't interesting to the majority of readers, to justify rewriting and shortening them just so they would fit into this short article.

Since we've become familiar with REs, we can try out a bit more powerful category of languages.

Context-Free Grammars and Push-Down Automata

Any finite state machine can store a constant amount of information – namely the current state, which is a single element of a set of values defined upfront. It doesn't let us dynamically store some additional data for the future and then retrieve data stored somewhere in the past.

An example of a problem that could be solved if we had this ability is checking whether a word is a palindrome, that is, whether you read it the same way left-to-right and right-to-left. Anna, exe, yay would be palindromes (assuming case doesn't matter). Anne, axe, ay-ay would not be. If we wanted to check for some specific palindrome, we could use a finite state machine. But if we wanted to check for any? A. Aba. Ab(5-million b's)c(5-million b's)ba. No matter what kind of FSA we came up with, it would be simple to find a word that it would not match, but which is a valid palindrome.

But let's say we are a bit more flexible than a finite state automaton. What kind of information would be helpful in deciding whether we are on the right track? We could, for instance, write each letter on a piece of paper, e.g. sticky notes. We meet a, we write down a and stick it somewhere. Then we see b, we write it down and stick it on top of the previous sticky note. Now, let's go non-deterministic. At some point, if we see the same letter arriving as the one on top of the sticky-note stack, we don't add a new one, but take the top one off instead – we are guessing that we are in the middle of a palindrome. Then each time the top note matches an incoming letter, we take it off. If we had an even-length palindrome, we should end up with an empty stack. Well, we would have to think a bit more to handle the odd-length case as well, but hey! We are on the right track, as the length of the word is no longer an issue!

What helped us get there? We had a state machine of sorts with 2 states: add-card mode (push) and take-matching-card mode (pop) (for an odd-length palindrome we could use a third state for skipping over one letter – the middle one – without pushing or popping anything). Then we had a stack that we can push things on top of, peek at the top element of, and take an element from the top of. Actually, this data structure (which could also be thought of as a last-in-first-out queue) is indeed named a stack. In combination with a finite state automaton, it makes a push-down automaton (PDA).
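The automaton has to guess the middle of the word non-deterministically, but in a program we know the input's length upfront, so we can follow the same stack discipline deterministically. A small Scala sketch (an immutable list plays the role of the stack):

// push the first half, skip the middle letter for odd lengths,
// then compare the rest against the stack (popping == zipping here)
def isPalindrome(word: String): Boolean = {
  val w = word.toLowerCase
  val half = w.length / 2

  val stack = w.take(half).foldLeft(List.empty[Char]) {
    (stack, c) => c :: stack // push phase
  }

  val rest = w.drop(w.length - half) // what is left after the middle
  stack.zip(rest).forall { case (top, incoming) => top == incoming }
}

isPalindrome("Anna") // true
isPalindrome("exe")  // true
isPalindrome("Anne") // false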

As a matter of fact, what we described for our palindrome problem is an example of a non-deterministic push-down automaton. We can define deterministic PDAs (DPDA) as a 7-tuple $(Q, \Sigma, \Gamma, \delta, q_0, Z, F)$, where:

  • $Q$ is a finite set of states,
  • $\Sigma$ is a finite set of input symbols, or an input alphabet,
  • $\Gamma$ is a finite set of stack symbols, or a stack alphabet (because we can use different sets for the input and the stack, e.g. the latter could be a superset of the former),
  • $\delta: Q \times \Sigma \times \Gamma \rightarrow Q \times \Gamma^*$ is a transition function – based on the current state, the incoming input symbol and the symbol at the top of the stack, it returns the next state and the sequence of stack symbols that replaces the top of the stack,
  • $q_0 \in Q$ is an initial state,
  • $Z \in \Gamma$ is an initial stack symbol,
  • $F \subseteq Q$ is a set of accepting states.

A non-deterministic version (NDPDA) would additionally allow $\epsilon$ as a valid symbol in $\Sigma$, and return several possible values from the transition function $\delta$ instead of one.

The palindrome example showed us that there are problems that a PDA can solve that an FSA cannot. However, a PDA can solve all problems that an FSA can – all you need to do is basically ignore the stack in your transition function, and you get an FSA. Therefore, push-down automata are a strict superset of finite-state automata.

But we were supposed to talk about formal languages. Just like finite-state machines are related to regular languages, push-down automata are related to context-free grammars. Reminder: it's a formal language where all production rules are of the form:

$A \rightarrow \gamma$

where $A \in N$ is a single nonterminal and $\gamma \in (N \cup \Sigma)^*$ is a sequence of terminals and nonterminals.

Thing is, when we are parsing, we are actually given a sequence of terminals, and we must combine them into nonterminals until we get to the root of the tree – kind of the opposite of what we are given in the language description. How could that look? Let's start with a motivating example.

Normally, when we write down arithmetic operations like $+$, $-$, $\times$, $\div$, we are inserting them in-between the numbers. Because operations have priorities ($\times$/$\div$ before $+$/$-$), if we want to change the default order we have to use parentheses ($2 + 2 \times 2$ vs $(2 + 2) \times 2$). However, in reverse Polish notation (RPN) – where the operator comes after its operands – neither priorities nor parentheses are needed. For instance:

$(1 + 2) \times (3 + 4)$

becomes

$1\ 2\ +\ 3\ 4\ +\ \times$

When it comes to calculating the value of such an expression, we can use a stack:

  • we start with an empty stack,
  • when we see a number, we push it onto the stack,
  • when we see $+$, we take the top 2 elements off the stack, add them, and push the result onto the stack,
  • same with $\times$: take the 2 top elements off the stack, multiply them, and push the result onto the stack,
  • at the end, the result of our calculation will be on top of the stack.

Let's check it for $1\ 2\ +\ 3\ 4\ +\ \times$:

  • we start with an empty stack,
  • $1$ arrives, we push it onto the stack,
  • the stack is: $1$,
  • $2$ arrives, we push it onto the stack,
  • the stack is: $1\ 2$,
  • $+$ arrives, we take the 2 top elements off the stack ($1\ 2$), add them ($3$) and push the result onto the stack,
  • the stack is: $3$,
  • $3$ arrives, we push it onto the stack,
  • the stack is: $3\ 3$,
  • $4$ arrives, we push it onto the stack,
  • the stack is: $3\ 3\ 4$,
  • $+$ arrives, we take the 2 top elements off the stack ($3\ 4$), add them ($7$) and push the result onto the stack,
  • the stack is: $3\ 7$,
  • $\times$ arrives, we take the 2 top elements off the stack ($3\ 7$), multiply them ($21$) and push the result onto the stack,
  • the stack is: $21$,
  • the input ends, so our result is the only number on the stack ($21$).
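This walkthrough translates almost one-to-one into code. A minimal Scala sketch of such an RPN evaluator (integers only, tokens pre-split on spaces):

// numbers are pushed; an operator pops two values and pushes the result
def evalRpn(tokens: List[String]): Option[Int] = {
  val ops = Map[String, (Int, Int) => Int](
    "+" -> (_ + _), "-" -> (_ - _), "*" -> (_ * _), "/" -> (_ / _)
  )

  tokens.foldLeft(Option(List.empty[Int])) {
    case (Some(b :: a :: rest), op) if ops.contains(op) =>
      Some(ops(op)(a, b) :: rest) // pop 2, push the result
    case (Some(stack), num) if num.nonEmpty && num.forall(_.isDigit) =>
      Some(num.toInt :: stack)    // push the number
    case _ =>
      None                        // malformed input
  }.collect { case result :: Nil => result } // exactly one value must remain
}

evalRpn("1 2 + 3 4 + *".split(" ").toList) // Some(21)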

If you ever wrote (or will write) a compiler that outputs assembly or bytecode, or something similarly low-level – that's basically how you write down expressions. If there is an expression in an infix form, you translate it into postfix, as it pretty much matches how mnemonics work in many architectures.

To be exact, quite a lot of them would require you to have the added/multiplied/etc. values in registers instead of on the stack; however, to evaluate a whole expression you probably use a stack and copy data from the stack to the registers and vice versa, but that is an implementation detail irrelevant to what we want to show here.

Of course, the example above is not a valid grammar. We cannot have a potentially infinite number of nonterminals (numbers) and production rules (basically all results of addition/multiplication/etc.). But we can describe the general idea of postfix arithmetic:

$BinaryOperator \rightarrow + \mid - \mid \times \mid \div$

$Expression \rightarrow Number \mid Expression\ Expression\ BinaryOperator$

We have terminals $\Sigma = \{Number, +, -, \times, \div\}$, nonterminals $N = \{Expression, BinaryOperator\}$, and $Expression$ as the start symbol. In Scala we could write it down like this:

sealed trait Terminal

final case class Number(value: java.lang.Number)
    extends Terminal

sealed trait BinaryOperator
case object Plus extends BinaryOperator with Terminal
case object Minus extends BinaryOperator with Terminal
case object Times extends BinaryOperator with Terminal
case object Div extends BinaryOperator with Terminal

sealed trait Expression
final case class FromNumber(number: Number) extends Expression
final case class FromBinary(operand1: Expression,
                            operand2: Expression,
                            bin: BinaryOperator)
    extends Expression

and now it should be possible to somehow translate a List[Terminal] into an Expression (assuming the input is a correct example of this grammar – if it isn't, we should fail). In this very simple example, it could actually be done in a way similar to how we evaluated the expression (a code sketch follows the list):

  • if the Terminal is a Number, wrap it with FromNumber and push it onto the stack,
  • if the Terminal is a BinaryOperator, take 2 Expressions from the stack, put them as operand1 and operand2, and together with the BinaryOperator put them into a FromBinary and push it onto the stack,
  • if the input is correct, we should end up with a stack with a single element,
  • if the input is incorrect, we will end up with a stack with more than one element, or during one of the operations we will be missing some Expressions while popping from the stack.
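Sketched in Scala (reusing the ADT defined above), that translation is just another fold over a stack:

// combine terminals bottom-up into an Expression tree
def parseRpn(input: List[Terminal]): Option[Expression] =
  input.foldLeft(Option(List.empty[Expression])) {
    case (Some(stack), n: Number) =>
      Some(FromNumber(n) :: stack)         // wrap the number and push it
    case (Some(e2 :: e1 :: rest), op: BinaryOperator) =>
      Some(FromBinary(e1, e2, op) :: rest) // pop 2 expressions, push 1
    case _ =>
      None                                 // operator without its operands
  }.collect { case expr :: Nil => expr }   // exactly one tree must remain

parseRpn(List(Number(1), Number(2), Plus))
// Some(FromBinary(FromNumber(Number(1)), FromNumber(Number(2)), Plus))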

It is almost enough to represent our language as a PDA. To build a binary operation we look at the two elements on top of the stack, while we are only allowed to know one – but we could represent that with a state. The initial stack symbol could be a single $EmptyStack$. Actually, we could also make sure that we end up with an empty stack at the end – if there are elements left on the stack, it's an error (because no operator consumed them). If at some point we are missing elements, it's also an error. We could end up with something like:



[Diagram: the PDA – PushSymbol (initial) pushes incoming Numbers onto the stack; an incoming operator routes through Pop2Numbers and Pop1Number, which pop two Numbers and return to PushSymbol; an $\epsilon$-move from PushSymbol enters CheckingMode, which goes to OK if only EmptyStack remains on the stack, and to Error otherwise; any further input in OK also leads to Error.]

This PDA doesn't calculate the value of the RPN expression – it only checks whether it is valid. We are pushing $Number$s onto the stack, and on a binary operation we consume 2 $Number$s from the stack. At any point we can start "checking" – if we are at the end of the input and the stack is empty (meaning that $EmptyStack$ is the top element), we can assume the input was correct, so we move to $OK$ through $CheckingMode$. However, if we start checking while there is some input left or there are elements on the stack – we err.

To make sure we understand what happened here, we should remember that this is a non-deterministic PDA – so for each valid input there should exist a valid path (and each path ending in an accepting state should describe a valid input), but we don't have to necessarily walk it each time. The other thing is that on each step the PDA has to pop from the stack – if we don't want to change the stack, we have to push the same element back; if we want to add something, we can push 2 elements or more; and if we want to get rid of the top element, we simply don't push it back.

Parsers in practice

Actually, there are 2 approaches to parsing context-free grammars:

  • top-down approach: we start from the root of the AST and take a look at the possible transitions. We try to make a prediction – if we get the next alphabet element, do we know which transition to take? If that is not enough, we could try to look at the transitions going out of these transitions and check whether any prediction is possible then, etc. We don't necessarily look 1 symbol ahead to resolve our path – we could set some $k$ and assume that we can look up to $k$ symbols ahead before making a decision (which would potentially be reflected in the number of states). If our language contains recursion, it might affect whether we can decide at all and what the minimal lookahead is. We are parsing the input left-to-right, and the top-down strategy with lookahead makes us pick a branch based on the leftmost nonterminal. That is why this approach is called LL (left-to-right, leftmost derivation). An LL parser with $k$ tokens of lookahead is called an LL(k) parser,
  • bottom-up approach: we start with terminals and look at the production rules in reverse – we try to combine incoming terminals into nonterminals, and then terminals and nonterminals into further nonterminals, until we get to the root. (This is what we have done in the PDA example above.) Just like with LL, we might need to make some predictions, so we can look ahead $k$ elements. Just like with LL, we read left-to-right. However, contrary to LL, we can make the decision when we get to the last element of a production rule, the rightmost nonterminal. This is why this approach is called LR, and if our parser requires $k$ tokens of lookahead, it is an example of an LR(k) parser. For $k = 1$ we get LR(1) parsers, the variant you will meet most often in practice.

Both approaches are usually used to build a parsing table, though they differ in how you arrive at the final table.

With LL(k) you can pretend that you can see $k$ chars ahead while simply applying production rules – that $k$-symbol lookahead is simulated by adding additional states. When we simulate seeing the $k$-th symbol ahead, we are actually already at this symbol, but with the state transitions arranged so that we end up in the state we would end up in if we really were $k$ symbols back and made the decision based on a prediction. Notice that for $k = 1$ the prediction is based directly on the next incoming symbol, so no extra simulation is needed.

LR(k), on the other hand, uses things called shift and reduce. Shift advances parsing by one symbol (shifts it by one symbol), which doesn't apply any production rule, while reduce combines (reduces) several nonterminals and/or terminals into a single nonterminal (going in the reverse direction of a production rule). When an algorithm builds such a table for the input we passed it, we might see a complaint about a shift-reduce conflict – since a well-defined LR grammar should designate either a shift operation or a reduce operation for each PDA state, it indicates that there is an ambiguity in the grammar that the parser generator managed to resolve (and produce working code), but which will bite us by parsing some inputs not the way we wanted.

For defining context-free grammars, parser generators quite often use a syntax heavily influenced by the (extended) Backus-Naur form ((E)BNF). In EBNF, the previous example:

$BinaryOperator \rightarrow + \mid - \mid \times \mid \div$

$Expression \rightarrow Number \mid Expression\ Expression\ BinaryOperator$

could look like this:

binary operator = "+" | "-" | "*" | "/" ;
digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ;
number = number, digit | digit ;
expression = number | expression, expression, binary operator ;

Notice that here terminal symbols are defined down to single digits. This might be quite inconvenient, which is why a lot of parser generators would rather:

  • assume that terminals are the results of regular expression matching – the input would be matched against a set of regular expressions, each of which would be related to a terminal symbol. We would require them to accept the whole input as a sequence of words matched by any of the regular expressions. This way we would turn a sequence of input symbols into a sequence of terminal symbols. The part of a program responsible for this tokenization is called a lexer. Such an approach is seen e.g. with parser generators based on lex (lexer) and yacc (Yet Another Compiler-Compiler) and their GNU reimplementations flex (free lex) and bison (an allusion to gnu, as a lot of GNU tooling is built with bison). (It should also explain why certain languages have weird rules regarding class/method/function/variable names – since tokenization takes place at the very beginning, it has to reliably classify each piece of code unambiguously as a terminal symbol),
  • alternatively, allow you to use regular expressions directly in the parser-defining syntax. As this approach is much more readable, it was also adopted by parser combinators.

Right, we haven't mentioned parser combinators yet. What are they, and why have they become more popular recently?

Parser combinators

When computer resources were really scarce, we didn't have the comfort of building parsers in the most convenient way – the idea behind parser generators was generating a fast, ready-to-use PDA which would parse the input in linear time and memory (that is, directly proportional to the input). Overhead had to be limited to a minimum, so the best way was to do all the calculations (both lexing and parsing) during code generation, so that when we ran the program, it would be able to parse as soon as the code was loaded from disk into memory. All in all, generating imperative code was the way to go.

But nowadays the situation is different. We have much faster computers with a lot more memory. And the requirements we have for programs are much higher, so the process of validating the parsed input became much more complex – so a small overhead for parsing is not as painful. Additionally, we made much more progress when it comes to functional programming.

This opened the gate to an alternative approach called parser combinators (which is not that new, considering that it was described in Recursive Programming Techniques by Burge from 1975 as parsing functions). What we do is, basically, function composition.

Let's try by example. This time we'll implement infix syntax. At first, we'll do something about lexing the terminal symbols (using spaces for separation):

def number(input: String) = """\s*([0-9]+)(\s*)"""
    .r
    .findPrefixMatchOf(input)
    .map { n =>
      val terminal = Number(n.group(1).toInt)
      val unmatched = input.substring(n.group(0).length)
      terminal -> unmatched
    }

def plus(input: String) = """\s*(\+)(\s*)"""
    .r
    .findPrefixMatchOf(input)
    .map { n =>
      val terminal = Plus
      val unmatched = input.substring(n.group(0).length)
      terminal -> unmatched
    }

def minus(input: String) = """\s*(-)(\s*)"""
    .r
    .findPrefixMatchOf(input)
    .map { n =>
      val terminal = Minus
      val unmatched = input.substring(n.group(0).length)
      terminal -> unmatched
    }

def times(input: String) = """\s*(\*)(\s*)"""
    .r
    .findPrefixMatchOf(input)
    .map { n =>
      val terminal = Times
      val unmatched = input.substring(n.group(0).length)
      terminal -> unmatched
    }

def div(input: String) = """\s*(/)(\s*)"""
    .r
    .findPrefixMatchOf(input)
    .map { n =>
      val terminal = Div
      val unmatched = input.substring(n.group(0).length)
      terminal -> unmatched
    }

It's quite repetitive, so we can introduce a helper utility:

type Parser[+A] = String => Option[(A, String)]

object Parser {

  // the raw interpolator keeps \s intact instead of treating it as an escape
  def apply[A](re: String)(f: String => A): Parser[A] =
    input => raw"""\s*($re)(\s*)""".r
      .findPrefixMatchOf(input)
      .map { n =>
        val terminal = f(n.group(1))
        val unmatched = input.substring(n.group(0).length)
        terminal -> unmatched
      }
}

and clear up definitions a bit:

val number = Parser[Number]("""[0-9]+""")(n => Number(n.toInt))
val plus = Parser[Plus.type]("""\+""")(_ => Plus)
val minus = Parser[Minus.type]("""-""")(_ => Minus)
val times = Parser[Times.type]("""\*""")(_ => Times)
val div = Parser[Div.type]("""/""")(_ => Div)

then we could start combining them:

val binaryOperator: Parser[BinaryOperator] = in => {
  if (in.isEmpty) None
  else plus(in) orElse minus(in) orElse times(in) orElse div(in)
}

The inquisitive reader might notice that this is a good candidate for a ReaderT/Kleisli composition, but we'll try to keep this example as simple as possible. That is why we'll create some specific utilities for this case:

implicit class ParserOps[A](parser: Parser[A]) {

  // making `another` a by-name param helps to prevent
  // stack overflow in some recursive definitions

  def |[B >: A](another: => Parser[B]): Parser[B] =
    input => if (input.isEmpty) None
             else parser(input) orElse another(input)
}

and rewrite binaryOperator as:

val binaryOperator: Parser[BinaryOperator] =
  plus | minus | times | div

Now we are missing the concatenation – moving the input forward once we have already matched something:

def expression: Parser[Expression] = {
  val fromNumber: Parser[FromNumber] = in => {
    number(in).map { case (n, in2) => FromNumber(n) -> in2 }
  }

  def fromBinary: Parser[FromBinary] = in => for {
    (ex1, in2) <- (fromNumber | inParenthesis)(in)
    (bin, in3) <- binaryOperator(in2)
    (ex2, in4) <- (fromNumber | inParenthesis)(in3)
  } yield FromBinary(ex1, ex2, bin) -> in4

  fromBinary | fromNumber
}

def inParenthesis: Parser[Expression] = in => for {
  (_, in2) <- Parser[Unit]("""\(""")(_ => ())(in)
  (ex, in3) <- expression(in2)
  (_, in4) <- Parser[Unit]("""\)""")(_ => ())(in3)
} yield ex -> in4

If we tested that code (which now looks like a candidate for a state monad), we would find that it parses one step of the way (so it doesn't run the recursion infinitely):

conveyion(""" 12  + 23 """)
res1: Option[(Expression, String)] =
  Some((FromBinary(FromNumber(Number(12)), FromNumber(Number(23)), Plus), ""))

We can prettify the code a bit:

implicit class ParserOps[A](parser: Parser[A]) {

  def |[B >: A](another: => Parser[B]): Parser[B] =
    input => if (input.isEmpty) None
             else parser(input) orElse another(input)

  def &[B](another: => Parser[B]): Parser[(A, B)] =
    input => if (input.isEmpty) None
             else for {
               (a, in2) <- parser(input)
               (b, in3) <- another(in2)
             } yield (a, b) -> in3

  def map[B](f: A => B): Parser[B] =
    input => parser(input).map { case (a, in2) => f(a) -> in2 }
}
def expression: Parser[Expression] = {
  def fromNumber =
    number.map(FromNumber(_))

  def fromBinary =
    ((fromNumber | inParenthesis) &
      binaryOperator &
     (fromNumber | inParenthesis)).map {
      case ((ex1, bin), ex2) => FromBinary(ex1, ex2, bin)
    }

  fromBinary | fromNumber
}

def inParenthesis: Parser[Expression] =
  (Parser[Unit]("""\(""")(_ => ()) &
   expression &
   Parser[Unit]("""\)""")(_ => ())).map {
    case ((_, ex), _) => ex
  }
expression(""" 12 + 23 """).map(_._1).foreach(println)
// FromBinary(FromNumber(Number(12)),FromNumber(Number(23)),Plus)

(The complete example can be seen in a gist.)

Not bad! It already shows us the potential of creating small parsers and composing them as higher-order functions. It should also explain why this concept was named parser combinators.

But can we have parser combinators out-of-the-box? We would need an implementation which:

  • is statically typed,
  • gives us concatenation (&), alternatives (|), and mapping of parsers,
  • lets us give hints whether certain matching should be greedy (match whatever it can, potentially indefinitely) or lazy (finish ASAP),
  • is probably more complex than a simple function from input to output with the unmatched part – it could e.g. make use of lookahead,
  • gives us a lot of utilities, e.g. regular expression support.

Luckily for us, such an implementation already exists, so we can just use it. FastParse is a parser combinator library written by Li Haoyi (the same guy who created Ammonite and Mill). While it provides us a nice, functional interface, it uses Scala macros to generate fast code with little overhead (which leaves us hardly any reason for preferring parser generators, at least in Scala).

Our parser can be rewritten using fastparse this way:

import $ivy.`com.lihaoyi::fastparse:2.1.0`
import fastparse._
import ScalaWhitespace._ // gives us Scala comments
                         // and whitespaces out-of-the-box

object Parsers {

  // terminals
  def number[_ : P] =
    P( CharIn("0-9").rep(1).! ).map(n => Number(n.toInt))
                      // ! makes the parser capture the input as String
  def plus[_ : P] = P("+").map(_ => Plus)
  def minus[_ : P] = P("-").map(_ => Minus)
  def times[_ : P] = P("*").map(_ => Times)
  def div[_ : P] = P("/").map(_ => Div)

  // non-terminals
  def binaryOperator[_ : P] = P(plus | minus | times | div)
  def fromNumber[_ : P]: P[FromNumber] =
    P(number.map(FromNumber(_)))
  def fromBinary[_ : P]: P[FromBinary] =
    P(((fromNumber | inParenthesis) ~
        binaryOperator ~
       (fromNumber | inParenthesis)).map {
      case (ex1, op, ex2) => FromBinary(ex1, ex2, op)
    })
  def expression[_ : P] =
    P(fromBinary | fromNumber)
  def inParenthesis[_ : P] =
    P("(" ~ expression ~ ")")

  def program[_ : P] = P( (expression | inParenthesis) ~ End )
}

parse("12 + 23", Parsers.program(_))

Before we jump on the hype train – parser combinators are not equivalent to LL and/or LR parsers. As we saw, we could define a PDA accepting reverse Polish notation. However, if we tried to write a parser combinator that would accept it, we would find that the recursive definition of expression would translate into a recursive function call without a terminating condition (parser combinators are just higher-order functions after all). An LL or LR parser would push a symbol onto the stack and take an input symbol from the input sequence, so at some point it would have to stop (at least when the input ended). A parser combinator would need some hint, e.g. a closing block (which means that usually it is not a problem), but we can see that parser combinators do not cover all context-free grammars.

Actually, LL parsers are not equivalent to LR parsers either. Seeing how they work, one might argue that LL parsers correspond to Polish notation (because they make a decision at the leftmost symbol – a prefix), while LR corresponds to reverse Polish notation (because they make a decision at the rightmost symbol – a postfix). (See a nice post about it: LL and LR parsing demystified.) Both can be treated as special cases of PDAs, while it is the set of all PDAs that corresponds to the whole CFG set.

Turing machines, linear-bounded automata, unrestricted and context-sensitive grammars

For the sake of completeness, we can mention the remaining computational models and grammar types, though this post is supposed to be about parsing, so I'll try to keep it short.

Turing machines and unrestricted grammars

A finite state machine at any given time remembers only which one of a finite number of states it is in. We read each symbol of the input once.

A push-down automaton remembers the current state and the last thing it put on a stack – it can "recall" things from the stack in the reverse order to that in which it stored them there for later. You cannot recall something from the middle of the stack without forgetting everything that was stacked on top of it. In a way, you can think that you can read each input element twice – once in incoming order, once in reverse order – and the only nuance is how you interleave these two modes.

A Turing machine (described by Alan Turing, the same guy who designed the cryptologic bombe against the German Navy's improved Enigma; the cryptologic bomba against the original Enigma was designed by the Polish Cipher Bureau) improved upon that by using an infinite tape, where the automaton can read and store a symbol in one cell of the tape, and then move forward or backward. This allows us to "recall" something as many times as we need it.

Because of that ability to read things as many times as we want, it is possible that your machine will get into an infinite loop and never finish. The question whether we can predict if a specific machine will ever halt for a given input is called the halting problem (HP), and it is proven to be impossible to solve in the general case. The proof assumes that you have a program that could use the halting-problem solver on itself and loop if the solver says it should halt, and halt if the solver says it should loop – so it shows by contradiction that such a thing cannot be built. The halting problem is used in a lot of proofs that a certain problem is impossible to solve – a reduction from the halting problem makes you use that problem to solve HP – since it is impossible to solve HP, the other problem is unsolvable as well.

Turing machines are equivalent to unrestricted grammars, that is, formal grammars that have no restrictions on how you define a production rule. They are also equivalent to lambda calculus, register machines, and several other models. Usually, if we want to have a universal programming language, we make it Turing-complete (equivalent in power to a TM, allowing you to simulate a TM on it).

Linear-Bounded Automata and Context-Sensitive Grammars

Between push-down automata and Turing machines lie linear-bounded automata (LBA). I decided to describe them after TMs because they are basically a restricted form of a TM: it puts limits on both ends of the (no longer infinite) tape, which your automaton cannot cross.

It was proven that LBAs are equivalent to context-sensitive grammars, that is, grammars with rules of the form:

$\alpha A \beta \rightarrow \alpha B \beta$

meaning that you can turn $A$ into $B$ only if it appears in the context of $\alpha$ and $\beta$.

Back to parsing

The majority of programming languages are Turing-complete. However, the first part of interpretation or compilation doesn't require that much power.

Some very simple interpreters can be built by combining lexing (tokenization) and parsing, so that on each reduction you immediately evaluate the computation inside the parser. However, it is quite messy to maintain in the long run.

After all, parsers and context-free grammars can only take care of the syntax analysis. So, you could save the results of the syntax analysis into a data structure – an abstract syntax tree – and then perform semantic analysis. Was a variable with this name already defined? Is this identifier describing a class, an object, a constant? Actually, when you take into consideration how complex some of these things are, you might not be surprised that certain compilers decided to introduce several steps in the whole compilation process – just for checking that the AST is correct. scalac has over 20 phases in total:

$ scalac -Xshow-phases
    phase name  id  description
    ----------  --  -----------
        parser   1  parse source into ASTs, perform simple desugaring
         namer   2  resolve names, attach symbols to named trees
packageobjects   3  load package objects
         typer   4  the meat and potatoes: type the trees
        patmat   5  translate match expressions
superaccessors   6  add super accessors in traits and nested classes
    extmethods   7  add extension methods for inline classes
       pickler   8  serialize symbol tables
     refchecks   9  reference/override checking, translate nested objects
       uncurry  10  uncurry, translate function values to anonymous classes
        fields  11  synthesize accessors and fields, add bitmaps for lazy vals
     tailcalls  12  replace tail calls by jumps
    specialize  13  @specialized-driven class and method specialization
 explicitouter  14  this refs to outer pointers
       erasure  15  erase types, add interfaces for traits
   posterasure  16  clean up erased inline classes
    lambdalift  17  move nested functions to top level
  constructors  18  move field definitions into constructors
       flatten  19  eliminate inner classes
         mixin  20  mixin composition
       cleanup  21  platform-specific cleanups, generate reflective calls
    delambdafy  22  remove lambdas
           jvm  23  generate JVM bytecode
      terminal  24  the last phase during a compilation run

By the way, this is a good moment to mention what compilation actually is. From the point of view of formal language theory, a compilation is just a translation from one formal grammar into another: Scala into JVM bytecode, C++ into binary code, Elm into JavaScript, TypeScript into JavaScript, ECMAScript 6 into ECMAScript 5… There is no need to introduce something like transpiler to describe compilation from one language to another. If we use that word at all, then only to describe a compiler that translates into another high-level language – not because compiler doesn't cover that case.

An interpreter would be something that, instead of translating into another formal grammar, translates directly into a computation. However, if we assume that we want to be pure, we would return something that could be turned into a computation – e.g. a free algebra. That explains why Typed Tagless Final Interpreter has interpreter in its name, even though it doesn't necessarily run computations immediately.

Separation of phases serves two purposes. One is maintainability. The other is that we can separate the front-end of a compiler (parsing and validating the AST) from the back-end (using the AST to generate the output). For instance, in the case of Scala, we can have one front-end and several back-ends: JVM Scala, Scala.js and Scala Native (though, truth be told, Scala.js and Scala Native need to extend the language a bit).

If we go fully functional with all the phases (so each phase is a function working on AST elements), then we have the option to compose functions (phase fusion) – if our language of choice allows us to optimize composed functions, then we can obtain a compiler which is both maintainable and performant.
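A toy illustration of the idea (the names are made up for illustration, this is not scalac's actual API): if each phase is a function from AST to AST, a pipeline is just function composition, and fusing phases means composing them before running them:

final case class Ast(nodes: List[String]) // a stand-in for a real tree

type Phase = Ast => Ast

val desugar:   Phase = ast => ast // identity stand-ins for real phases
val typecheck: Phase = ast => ast
val optimize:  Phase = ast => ast

// the whole compiler front half as one composed function;
// an optimizer that can fuse composed functions gives phase fusion for free
val pipeline: Phase = List(desugar, typecheck, optimize).reduce(_ andThen _)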

Of course, the parser doesn't have to be a part of a compiler. The resulting tree might be our goal after all. XML, JSON or YAML parsers exist in order to take some text representation and turn it into a tree of objects that is easier to work with. Notice that grammars of languages like XML or HTML are too complex to be handled by something like a regular expression, so if you want to process them, you'd better grab a parser.

Error handling

If you want to detect and return to the user all errors, possibly with some meaningful description of what could be wrong and how they could fix it, things get problematic.

As you noticed, our naive parser combinator simply returned the unmatched part of the input – hardly helpful. fastparse is slightly better – it can also tell you around which character things broke and what terminal/non-terminal it was (especially if you use cuts, to tell the parser where it should not perform backtracking).

With parsers created by generators, the situation is similar – out of the box you usually only get the information that parsing failed. So, how come all these compilers and document parsers can give you meaningful messages? More meaningful than things broke at the n-th character?

When it comes to parser generators like Bison, you have a special terminal symbol, error. When you match it, you can recover the position in the text and create an element of the AST which mocks the element that should be there, but carries the error instead. Whether you do it by having each element of the AST be a coproduct of a valid and an invalid version, or by making each step along the way return something like $(A, B, C) \rightarrow Either[Error, Element]$ – that is up to you.

With parser combinators like fastparse, things are similar – you can tell the parser to consume some input after your current match, so you could e.g. try to match all correct cases and – if they fail – consume the part of the input that would fail and turn it into an invalid AST element.

Now you should understand why this is not something you get for every language, and why only some of them have user-friendly error handling. It increases the effort related to parser maintenance tremendously.

Summary

In this post, we had a rather high-level overview of parsing (text) into an AST. We briefly talked about the Chomsky hierarchy and the relationship between regular languages and different models of computation. We talked a little more about regular languages and context-free grammars, though without an in-depth description of the algorithms used to parse them.

How regular languages, computational models and compilers work in greater detail you can learn from books like Compilers: Principles, Techniques, and Tools by Aho, Lam, Sethi, and Ullman or Structure and Interpretation of Computer Programs by Abelson, Sussman, and Sussman.

I hope that it will help you appreciate how much thought and effort went into letting us create REPLs and a plethora of languages – and all of that with much better syntax than the first programming languages, which were very unwelcoming. This freed us to think about how to design languages to be friendly and readable. And while we are still trying to figure out better ways of designing our tools, we should remember that all of this was possible thanks to advances in formal language theory.
