Implementing Autocomplete: how do we find what the user could insert next?

I was wondering what approaches you use for auto-completion.

I see the problem as split in two: syntactic auto-completion (i.e., which types of tokens can go here?) and semantic auto-completion (i.e., OK, I can put an ID here, but which IDs does it make sense to show here?).

For syntactic auto-completion, when we use ANTLR we tend to use https://github.com/mike-lischke/antlr4-c3, which we ported to Kotlin some time ago (https://github.com/ftomassetti/antlr4-c3-kotlin).
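
For reference, this is roughly how we drive it; a minimal Kotlin sketch, assuming the Kotlin port mirrors the original antlr4-c3 API (CodeCompletionCore, preferredRules, ignoredTokens, collectCandidates), while MyLangParser / MyLangLexer and their rule and token names are hypothetical:

fun syntacticCandidates(parser: MyLangParser, caretTokenIndex: Int): Set<Int> {
    val core = CodeCompletionCore(parser)
    // report identifier positions as rule candidates, so semantic completion can take over there
    core.preferredRules = setOf(MyLangParser.RULE_identifier)
    // tokens we never want to suggest
    core.ignoredTokens = setOf(MyLangLexer.WS, MyLangLexer.LINE_COMMENT)
    val candidates = core.collectCandidates(caretTokenIndex)
    // candidates.tokens: token types that may follow the caret
    // candidates.rules: parser rules (e.g. identifier) expected at the caret, to be resolved semantically
    return candidates.tokens.keys
}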

We have also found this project: https://github.com/oranoran/antlr4-autosuggest

So it seems there are solutions for that. For semantic auto-completion things are more complicated, also because we need to do scope analysis on a tree that is frequently broken while the user is typing. So the code that traverses the tree to find visible symbols must be robust and able to handle a lot of uncertainty and broken constructs, and this typically proves not easy for us.
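
To give an idea, here is a purely illustrative Kotlin sketch (all types are hypothetical) of the kind of defensive traversal this requires: every reference is treated as potentially missing, and shadowing/ordering is ignored for brevity.

// Hypothetical AST types: everything is nullable because the tree may be broken while typing.
data class Symbol(val name: String)
interface Node {
    val parent: Node?
    val children: List<Node?>
}
interface DeclaringNode : Node {
    val declaredSymbol: Symbol?   // may be null while the user is still typing the name
}

// Collect every symbol visible from `position`, skipping anything we cannot resolve.
fun visibleSymbols(position: Node): List<Symbol> {
    val result = mutableListOf<Symbol>()
    var current: Node? = position
    while (current != null) {
        current.children
            .filterNotNull()
            .filterIsInstance<DeclaringNode>()
            .mapNotNull { it.declaredSymbol }
            .forEach { result.add(it) }
        current = current.parent
    }
    return result
}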

Any suggestions? Any tools you are using?

This topic is very relevant to my interests :slight_smile:

Personally, I’m currently investigating only the scenario where syntactic autocompletion suggests an ID, and on the side I keep a Scope+Symbols table with the IDs that are actually valid in that scope.
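
A minimal Kotlin sketch of what such a table could look like (the types and names are hypothetical): each scope keeps its own IDs plus a link to its parent, so a lookup at the caret walks outwards.

class Scope(val parent: Scope? = null) {
    private val ids = mutableSetOf<String>()
    fun define(id: String) { ids.add(id) }
    // IDs valid here: this scope's own IDs plus everything visible in the enclosing scopes
    fun visibleIds(): Set<String> = (parent?.visibleIds() ?: emptySet()) + ids
}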

I’ll be interested to follow this thread to see others’ suggestions!


I am aware of TabNine: https://tabnine.com/
It is in turn based on research in the AI field: “Better Language Models and Their Implications”.


Quite inspiring! It is a pity that it is not open-source :frowning:

Well, life is hard.
I found this: a code-completion engine for Vim, http://ycm-core.github.io/YouCompleteMe/

Hello

I am kinda late to the party but here are my two cents.

I am thinking about a semantic auto-completion engine to be used with a strongly typed language like Java, C#, etc.
I will need something like this for a project that I have been thinking about for quite some time now.

What I was about to say is something like the code completion engine that cristian.vasile presented.

For an efficient code completion engine (fast code completion, rename/refactor, go to definition, show references, etc.),
basically the features that you use in a good IDE these days, I don’t think there is a way to bypass the analysis stages of a compiler,
or at least some of them.

The compiler scanner becomes a token factory for the keyboard input, generating single tokens.
You will push these single tokens down the processing pipe to the parsers.
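
As a rough Kotlin sketch of that idea (all names are hypothetical): the scanner buffers keystrokes and hands each completed token straight to whichever parser is currently active.

data class Token(val text: String)

class KeypressTokenizer(private val sink: (Token) -> Unit) {
    private val buffer = StringBuilder()
    fun onKeyPress(ch: Char) {
        if (ch.isLetterOrDigit() || ch == '_') {
            buffer.append(ch)                                   // still inside an identifier/keyword
        } else {
            flush()                                             // a delimiter ends the pending token...
            if (!ch.isWhitespace()) sink(Token(ch.toString()))  // ...and may itself be a token
        }
    }
    private fun flush() {
        if (buffer.isNotEmpty()) {
            sink(Token(buffer.toString()))
            buffer.clear()
        }
    }
}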

A mindset change is needed here in terms of parser design.

Whereas for a classical compiler you provide a stream of tokens that the compiler can process
at its own pace, in a code completion engine you ‘jump’ between stages, so the parser needs extra flexibility.

This problem can be addressed by using a stack of parser instances, one upon the other:
you push and pop parsers as you enter and leave code sections while typing methods, functions, loops, classes, etc.
In terms of state machines nothing changes, the parsers work the same as they do in a classical compiler,
only now they are brought in and out faster and more frequently.

And so you will have an engine that ‘works’ with you: it waits for you to type, generates the AST, traverses the tree, finds declarations, etc.
This is the stage where types and errors are highlighted, and the completion list is presented on a keypress.

You are not writing your code from start to finish in one go: you will move your cursor around the editor, you will select another line,
so a sort of AST cache for each file in the editor, one that is easy to update, is needed.
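
A minimal Kotlin sketch of such a cache (hypothetical types): one AST per open file, rebuilt only for the file that changed, so the rest of the workspace stays warm.

class AstNode                                             // placeholder for the real AST root type

class AstCache(private val parse: (String) -> AstNode) {
    private val cache = mutableMapOf<String, AstNode>()   // file path -> last known AST
    fun onFileEdited(path: String, newText: String) {
        cache[path] = parse(newText)                      // only the edited file is re-analysed
    }
    fun astFor(path: String): AstNode? = cache[path]
}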

And don’t forget speed: you want your code completion list to pop up fast,
so this engine should be multithreaded.

And here I am thinking about the Language Server Protocol, Visual Studio Code to be precise.
For fast multithreaded code completion, each key press on your keyboard will have to dispatch work to a thread (from some thread pool, obviously),
and those keypresses have to be kept in sync.
With a language server protocol I don’t know if this is possible, not to mention the back-and-forth JSON serialization (it takes time).
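
One possible way to keep keypresses in order without blocking the editor; a Kotlin sketch with hypothetical names, not tied to LSP or any particular editor: every keystroke dispatches work to a pool, and only the result of the most recent request is shown.

import java.util.concurrent.Executors
import java.util.concurrent.atomic.AtomicLong

class CompletionScheduler(private val compute: (String) -> List<String>) {
    private val pool = Executors.newFixedThreadPool(2)
    private val latest = AtomicLong()

    fun onKeyPress(textSoFar: String, show: (List<String>) -> Unit) {
        val requestId = latest.incrementAndGet()
        pool.execute {
            val items = compute(textSoFar)
            if (requestId == latest.get()) show(items)    // stale completions are silently dropped
        }
    }
}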


This approach is not always possible.
I am talking here about writing the code completion engine while you are writing the interpreter / compiler / transpiler.
If this cannot be done, then at least reuse a part of the compiler.

A programming language is a living thing, it changes, it evolves, … it has bugs.
If a new feature is implemented in the compiler, it will have to be implemented in your code completion engine as well.
Same thing happens for bugs.

But hey, remember the nice feeling you have when your interpreter runs code, or when your compiler generates the output you want?
Imagine the feeling you will have when your IDE shows you the completion list, or when your errors are highlighted in the text editor!

I would wholeheartedly agree with this.

It is not clear to me why you need this: can’t you get the parse-tree, transform it into an AST, do symbol resolution and all of your logic at that stage? Why do you need to change the parser?
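
For context, this is the flow the question refers to; a Kotlin sketch where MyLangLexer / MyLangParser stand for an ANTLR-generated lexer and parser, and compilationUnit() for whatever the start rule is called:

import org.antlr.v4.runtime.CharStreams
import org.antlr.v4.runtime.CommonTokenStream

fun analyse(text: String) {
    val lexer = MyLangLexer(CharStreams.fromString(text))
    val parser = MyLangParser(CommonTokenStream(lexer))
    val parseTree = parser.compilationUnit()
    // from here: map the parse tree to an AST, then run scope/symbol resolution on the AST,
    // without the parser itself having to know anything about code completion
}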

Yes, that would be useful indeed

Yes, I love it!

Hello

I will try to keep this short.

We try to apply clean code principles all the time.
Sometimes we succeed, sometimes not so much.

However, when working on a big and complex project such as an IDE code completion engine,
divide et impera, single responsibility, open/closed, DRY, etc. are needed more than ever.

That being said, let us consider a small snippet of code:

1  void Test()
2  {
3      int index = 0;
4
5      if (condition1)
6      {
7          int count = 0;
8
9          where (condition2)
10         {
11             bool testing = false;
12         }
13     }
14 }

Now, here we have three code sections: a method and two child sections, the if and the where.

This code will translate to an AST that might look like this:

Test                                ( method node )
├── =                               ( operator node )
│   ├── index [int]                 ( lhs )
│   └── 0                           ( rhs )
└── if                              ( if section node )
    ├── condition1
    ├── =                           ( operator node )
    │   ├── count [int]             ( lhs )
    │   └── 0                       ( rhs )
    └── where                       ( where section node )
        ├── condition2
        └── =                       ( operator node )
            ├── testing [bool]      ( lhs )
            └── false               ( rhs )

===============================
lhs = left hand side operand
rhs = right hand side operand
===============================

Using recursive descent we will be able to traverse this AST and execute / compile the code.
We are interested in generating the AST, so I will not insist on traversing it here.

We need three types of parsers here:
- a method parser
- an if parser
- a where parser

OK, why three parsers?
Single responsibility. The project is complex, and later on, when debugging, things could get really ugly.

When compiling, we extract a stream of tokens for the whole code section:

void
Test
(
)
{
int
index
=
0
;
if
(
condition1
)
{
int
count
=
0
;
where
(
condition2
)
{
bool
testing
=
false
;
}
}
}

Now let’s take a look at the state machine inside a parser.
We start by declaring a set of states (let’s say, inside an enum):

enum MethodParserState
{
    VoidKeyword,
    OpenParamsParanth,
    ParamType,
    ParamName,
    InsideMethodBody,
    AccumulateIfTokens,
    AccumulateWhereTokens
}

This is just an example to show the process, not a really working piece of code!

============== MethodParser ====================

int OpenParanths = 0;
int OpenBraces = 0;

Token[] ArrayOfTokens = [ the tokens extracted from the code ];

Action<Token> CurrentMethod = VoidKeywordState; // default state: a reference to the method handling the current state

void Parse()
{
    // for each token in ArrayOfTokens, call CurrentMethod with the token as argument:
    // the token is sent to whichever method is currently set as the processing-state method,
    // so the first token in the array goes into the VoidKeywordState method
}

For each enum label we declare a method to handle the state:

void VoidKeywordState(Token token)
{
    if the token is not ‘void’ then throw an exception
    else CurrentMethod = OpenParamsParanthState
}

void OpenParamsParanthState(Token token)
{
    if the token is not ‘(’ then throw an exception
    else CurrentMethod = ParamTypeState
}

void InsideMethodBodyState(Token token)
{
    // again, this is just an example

    switch (token)
    {
        case if:
            // in a top-to-bottom compilation process, here we start accumulating tokens
            // until we reach line nr 13

            AccumulatedTokens.Add(token)

            CurrentMethod = AccumulateIfTokensState

        case where:
            // same thing for a ‘where’ section

            AccumulatedTokens.Add(token)

            CurrentMethod = AccumulateWhereTokensState

        ...
    }
}

void AccumulateIfTokensState(Token token)
{
    switch (token)
    {
        case ‘(’:
            OpenParanths ++
            AccumulatedTokens.Add(token)
        case ‘)’:
            OpenParanths --
            AccumulatedTokens.Add(token)
        case ‘{’:
            OpenBraces ++
            AccumulatedTokens.Add(token)
        case ‘}’:
            OpenBraces --
            AccumulatedTokens.Add(token)

            if OpenBraces == 0 // end of section reached
                // we would pass the tokens from line 5 to 13 to a new if parser instance

                IfParser = new IfParser()
                IfParser.Parse(AccumulatedTokens)

                ... handle the response from the if parser or the resulting AST node

                // once the if parser has finished processing the tokens,
                // this parser ( the method parser ) goes further with the process:
                // we set a new state and keep processing
                CurrentMethod = InsideMethodBodyState

        default:
            AccumulatedTokens.Add(token)
    }
}

void AccumulateWhereTokensState(Token token)
{
… same as in the if section
}

=======================================

Again, this is just an example.

For this kind of processing, there are multiple sections where the parsers stay in the same state and accumulate tokens.

If we process whole files, this is not a problem.
However, if we are typing code, things change.

With a code completion engine, there is no longer a stream of tokens.
As soon as it is created, a token is sent to a parser.


So each token 'moves' the parser to the next state.

So if we type code and reach line 5 (the beginning of the if section),
we create a new if parser there and start sending the tokens to it.
When the if parser reaches its last state it is no longer used and we go back to the method parser.

Imagine this is a stack of parser instances

	ParsersStack

– we start typing and produce the token ‘void’:
the beginning of a method has been detected.
A new method parser is created and pushed onto the stack.

		MethodParser	<---- the tokens from the keyboard go into this parser now
	ParsersStack

– we keep typing and reach the ‘if’ keyword on line 5:
the beginning of an if section has been detected.
A new if parser is created and pushed onto the stack.

			IfParser	<---- the tokens from the keyboard go into this parser now
		MethodParser	<---- the method parser has preserved its state!
	ParsersStack

– we keep typing and reach the ‘where’ keyword on line 9:
the beginning of a where section has been detected.
A new where parser is created and pushed onto the stack.

				WhereParser		<---- the tokens from the keyboard go into this parser now
			IfParser			<---- the if parser has preserved its state
		MethodParser
	ParsersStack

– we reach line 12 - the end of the where section
We pop the where parser from the stack

			IfParser			<---- we are back at the state the parser was in when the where parser was created
		MethodParser
	ParsersStack

– we reach line 13 - the end of the if section
We pop the if parser from the stack

		MethodParser			<----- we are back at the state the parser was in when the if parser was created
	ParsersStack

The conclusion is: you must preserve and restore state from one parser to another, and this is done with the help of the stack.
And all this is needed because, with code completion, there is no longer a stream of tokens; instead, a single token at a time has to be processed.
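
To make the idea concrete, here is a minimal Kotlin sketch of that stack (all types are hypothetical): the top parser receives each new token, pushes a child parser when a nested section starts, and is popped when its own section ends, so the outer parser resumes exactly where it left off.

data class Token(val text: String)

interface SectionParser {
    // consume one token; return a child parser to push when a nested section starts, or null
    fun accept(token: Token): SectionParser?
    // true once this parser has seen the end of its own section
    val finished: Boolean
}

class ParserStack(root: SectionParser) {
    private val stack = ArrayDeque<SectionParser>().apply { addLast(root) }

    fun onToken(token: Token) {
        val current = stack.last()
        val child = current.accept(token)
        when {
            child != null -> stack.addLast(child)                       // e.g. an 'if' or 'where' section begins
            current.finished && stack.size > 1 -> stack.removeLast()    // section closed, resume the parent parser
        }
    }
}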

This would be the top layer of the engine.
While we type, each parser can resolve types and conflicts, apply syntax coloring or highlight errors.
On a dot ‘.’ keypress, or any other for that matter, a parser can use its current state and the last token to decide which completion list to show.
While we type, the AST is created in the background, and ‘go to definition’, ‘rename’, ‘refactor’ and ‘show references’ begin to take shape.

Again, everything starts from the fact that with code completion we no longer have a stream of tokens.

Have you considered using ANTLR instead? It should make it way easier to write a parser, and we routinely build autocomplete features using it together with a very nice library. @alessio.stalla wrote a few articles about that.

I will look into it
Thank You
