Application of Automata Theory in Compiler Design

Atharva Patilpate
5 min read · Jan 9, 2022


Computers are a combination of hardware and software. Hardware is the physical machinery, and software is what controls its functions. Hardware accepts instructions in the form of electronic pulses, which correspond to binary language: a language made up only of 0s and 1s. To send instructions to the hardware, code must ultimately be in this binary format, a plain series of 1s and 0s. Writing programs directly in that form would be nearly impossible for programmers to keep track of, which is why we need a ‘middle-man’ to produce such code for us.

Language Processing System

The hardware half of a machine understands a language that humans cannot read directly. The solution is to write programs in a high-level language, which is easier for us to understand and remember, and then hand that code to a piece of software that converts it into the low-level counterpart the machine can understand. This pipeline is what is known as the Language Processing System in the Theory of Computation.

Use of Finite Automata in the Working of Compilers

Now that we are familiar with the Language Processing System and what it has to do with compilers, we can discuss the role of Finite Automata in compilers. The concept of Finite Automata is used in two of the three phases of the process of compilation.

The first phase, Lexical Analysis, uses Regular Expressions to tokenize the input. Finite Automata are generally required to implement regular expressions. Being the first phase of the compiler, its job is to read the source program one character at a time and group those characters into lexemes: the sequences of characters that match the pattern rules for a token. These lexemes are converted into tokens, which are classified as keywords, identifiers, operators, delimiters, and punctuation symbols. Each token is represented as a pair of values, <token-name, attribute-value>. The lexical analyzer recognizes and classifies the various tokens with the help of regular expressions and pattern rules. So tokens are specified by regular grammars and recognized by finite automata. In any programming language, the recognition of tokens is the first and most important step. In compiler design, regular expressions are used to formalize the specification of tokens and for specifying regular languages.
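As an illustration, here is a minimal sketch of how a lexical analyzer might tokenize input with regular expressions. The token names and patterns are invented for this example and are far simpler than those of a real compiler.

import re

# Hypothetical token specification: pairs of (token name, regular expression).
TOKEN_SPEC = [
    ("KEYWORD",    r"\b(?:if|else|while|return)\b"),
    ("IDENTIFIER", r"[A-Za-z_]\w*"),
    ("NUMBER",     r"\d+"),
    ("OPERATOR",   r"[+\-*/=<>]"),
    ("PUNCT",      r"[();{}]"),
    ("SKIP",       r"\s+"),          # whitespace is discarded, not tokenized
]
MASTER_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(source):
    # Scan the source left to right, emitting <token-name, lexeme> pairs.
    tokens = []
    for match in MASTER_RE.finditer(source):
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
    return tokens

print(tokenize("if (count > 10) return count;"))
# [('KEYWORD', 'if'), ('PUNCT', '('), ('IDENTIFIER', 'count'), ('OPERATOR', '>'),
#  ('NUMBER', '10'), ('PUNCT', ')'), ('KEYWORD', 'return'), ('IDENTIFIER', 'count'),
#  ('PUNCT', ';')]

Under the hood, a regular-expression engine like this is equivalent to running a finite automaton over the input, which is exactly why finite automata show up in this phase.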

What’s more interesting is the second phase, Parsing. The goal here is to build what’s known as an Abstract Syntax Tree (AST, or just syntax tree). An AST represents the abstract syntactic structure of a piece of code written in a formal language, with each node denoting a construct occurring in the text.
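For instance, here is a minimal sketch of AST nodes for a small arithmetic expression; the node types are hypothetical and much simpler than what a real compiler would use.

from dataclasses import dataclass

@dataclass
class Num:
    value: int            # a numeric literal

@dataclass
class BinOp:
    op: str               # the operator, e.g. "+" or "*"
    left: object          # left sub-expression
    right: object         # right sub-expression

# The source text "1 + 2 * 3" could be represented as:
tree = BinOp("+", Num(1), BinOp("*", Num(2), Num(3)))

Each node corresponds to a construct in the source text: the whole expression, a sub-expression, or a single literal.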

There are two types of parsers: Top-down and Bottom-up. Top-down parsing is the simpler form of parsing. Top-down parsers usually use Recursive Descent, which does not rely on an explicit automaton. In this technique the parse tree is constructed from the top (the start symbol) while the input is read from left to right, and the input is parsed recursively; the process may or may not require backtracking.

The drawback is that the Context-Free Grammar (CFG) of the language must be prepared for this style of parsing: it has to be left-factored and rewritten from a left-recursive form into an equivalent non-left-recursive form. GCC uses this form of parser.
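As a minimal sketch, assume the left-recursive toy grammar E -> E "+" "n" | "n" has already been rewritten into the non-left-recursive form E -> "n" E2 and E2 -> "+" "n" E2 | (empty). A recursive-descent parser for it might then look like this; all function and token names here are illustrative only.

def parse_expression(tokens, i=0):
    # E -> "n" E2
    if i < len(tokens) and tokens[i] == "n":
        return parse_rest(tokens, i + 1)
    return None                     # syntax error: expected a number

def parse_rest(tokens, i):
    # E2 -> "+" "n" E2 | (empty)
    if i < len(tokens) and tokens[i] == "+":
        if i + 1 < len(tokens) and tokens[i + 1] == "n":
            return parse_rest(tokens, i + 2)
        return None                 # syntax error: "+" not followed by a number
    return i                        # empty production: stop consuming input

tokens = ["n", "+", "n", "+", "n"]
print(parse_expression(tokens) == len(tokens))   # True: the whole input parses

Notice that each non-terminal of the rewritten grammar becomes one function, which is why the grammar must not be left-recursive: a left-recursive rule would make the function call itself before consuming any input.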

On the contrary, Bottom-up parsing goes the other way: you provide it with a string of terminals and it reduces that string, step by step, back to a single non-terminal symbol. For each CFG, an automaton is constructed with a set of states that track how much of each production has been recognized so far, covering both the non-terminals and the terminals of the grammar.

Another thing worth mentioning is that as the automaton performs its reductions, nodes are added to the AST. If the automaton receives input that is invalid with respect to the grammar, it moves to a special, non-accepting state called the “error state”, and the input is rejected.
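To make the idea concrete, here is a minimal sketch of shift-reduce (bottom-up) parsing for the same toy grammar E -> E "+" "n" | "n". It is not tied to any particular parser generator, and instead of modelling an explicit error state it simply rejects input that cannot be reduced to E.

def shift_reduce_parse(tokens):
    stack = []                          # holds grammar symbols: "n", "+", "E"
    tokens = list(tokens) + ["$"]       # "$" marks the end of the input
    i = 0
    while True:
        # Try to reduce: replace a handle on top of the stack with a non-terminal.
        if stack[-3:] == ["E", "+", "n"]:
            stack[-3:] = ["E"]          # reduce by E -> E + n
        elif stack[-1:] == ["n"]:
            stack[-1:] = ["E"]          # reduce by E -> n
        elif tokens[i] == "$":
            # No more input and nothing left to reduce: accept only if the
            # whole input collapsed to the start symbol E.
            return stack == ["E"]
        else:
            stack.append(tokens[i])     # shift the next token onto the stack
            i += 1

print(shift_reduce_parse(["n", "+", "n", "+", "n"]))  # True
print(shift_reduce_parse(["n", "+", "+"]))            # False (rejected)

A real LR parser drives the same shift/reduce decisions with a table built from the grammar’s automaton, rather than with hand-written pattern checks like these.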

Summary

TL;DR: The compilation process starts with lexical analysis, which tokenizes the input. It is followed by syntax analysis as the second phase, where flaws related to the syntax of the provided text are detected. Note that this strictly means errors with respect to the syntax, not the logical errors that appear at run time or mistakes in the program’s own logic; those belong to semantic analysis and are often confused with syntax errors. The syntax analyzer is what applies the grammar rules to the code: parsing checks whether the input string is well-formed, and rejects it if it is not. Parsing techniques are divided into two groups, Top-Down Parsing and Bottom-Up Parsing. These are the two of the three phases of a compiler’s working in which the concept of Finite Automata is required for the compiler to function.

Authors:

Atharva Patilpate, Ratan Patil, Prakhar Rai, Pushkaraj Bhor.
