Polyglot Modeling/Metamodeling formats and frameworks

ftomassetti · January 20, 2021, 4:22pm

Hi,
At Strumenta, among other things, we build parsers. We typically start with ANTLR, get a parse tree, and then translate it to an AST, implemented using a library we created, such as Kolasu when the project is written in Kotlin.
The AST is a model for us, and we need features like the ability to specify properties, containment relations, and references.

For example, we could define a series of Kotlin data classes, like this one:

data class EqualityExpr(
    override val left: Expression,
    override val right: Expression,
    override val specifiedPosition: Position? = null
) : BinaryComparisonExpr(left, right, specifiedPosition)

This would be our lightweight way to define the metamodel.
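To make the distinction between properties, containment relations, and references concrete, a minimal sketch in plain Kotlin could look like the following. All names here (Node, ReferenceByName, VarDeclaration, and so on) are hypothetical and chosen for illustration; this is not Kolasu's actual API.

```kotlin
// Hypothetical types for illustration only; not Kolasu's actual API.
data class Position(val line: Int, val column: Int)

abstract class Node {
    abstract val specifiedPosition: Position?
}

// A reference names a node that is owned elsewhere in the tree.
// It starts unresolved and is linked in a later pass.
class ReferenceByName<N : Node>(val name: String, var referred: N? = null) {
    val resolved: Boolean get() = referred != null
}

abstract class Expression : Node()

data class IntLiteral(
    val value: Int,                                 // property: a plain value
    override val specifiedPosition: Position? = null
) : Expression()

data class VarDeclaration(
    val name: String,                               // property
    val initialValue: Expression,                   // containment: this node owns the child
    override val specifiedPosition: Position? = null
) : Node()

data class VarReference(
    val target: ReferenceByName<VarDeclaration>,    // reference: points at a node owned elsewhere
    override val specifiedPosition: Position? = null
) : Expression()

fun main() {
    val decl = VarDeclaration("x", IntLiteral(42))
    val use = VarReference(ReferenceByName("x"))
    // A symbol-resolution pass would normally do this; here we link by hand.
    use.target.referred = decl
    println(use.target.resolved)        // true
    println(use.target.referred?.name)  // x
}
```

The point of keeping references as a separate wrapper type is that a serializer or a metamodel exporter can tell them apart from containments just by looking at the declared type.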

We would then have instances of such classes, organized in a tree, and that would be our model. We typically process it in memory, but given that we sometimes need to write the parser in one language and the subsequent stages in another, we could serialize it to JSON or XML. For example, we might need to write the parser in Kotlin, get the model serialized, and then load it from Python to do some code generation.

We typically need to traverse, transform, and generate stuff out of these models.
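As a rough illustration of what "traverse" and "transform" mean here, the sketch below walks a tiny expression tree depth-first and rewrites it. The node types and the function names walk and foldConstants are made up for this example and are not taken from Kolasu.

```kotlin
// Hypothetical node types; only the shape of the operations matters here.
sealed class Expression {
    // Direct children, in source order; leaves have none.
    open val children: List<Expression> get() = emptyList()
}

data class IntLiteral(val value: Int) : Expression()
data class BoolLiteral(val value: Boolean) : Expression()
data class EqualityExpr(val left: Expression, val right: Expression) : Expression() {
    override val children: List<Expression> get() = listOf(left, right)
}

// Traversal: depth-first, pre-order walk over the tree.
fun Expression.walk(): Sequence<Expression> = sequence {
    yield(this@walk)
    for (child in children) yieldAll(child.walk())
}

// Transformation: replace an equality between two integer literals
// with the boolean literal it evaluates to, bottom-up.
fun Expression.foldConstants(): Expression = when (this) {
    is EqualityExpr -> {
        val l = left.foldConstants()
        val r = right.foldConstants()
        if (l is IntLiteral && r is IntLiteral) BoolLiteral(l.value == r.value)
        else EqualityExpr(l, r)
    }
    else -> this
}

fun main() {
    val tree = EqualityExpr(EqualityExpr(IntLiteral(1), IntLiteral(1)), BoolLiteral(true))
    println(tree.walk().count())   // 5
    println(tree.foldConstants())  // EqualityExpr(left=BoolLiteral(value=true), right=BoolLiteral(value=true))
}
```

Both operations are written by hand against each node type. A framework in the spirit of EMF would instead derive the children from the metamodel, which is exactly the kind of functionality that ends up being rebuilt.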

We also need tools to process the metamodel (in our cases defined by a series of class definitions) to generate things like the documentation.
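A minimal sketch of such a tool, assuming the metamodel is just a set of JVM classes: it uses plain Java reflection (so it does not need the kotlin-reflect dependency) to list each class with its superclass and fields. The function name describe and the output layout are invented for this example.

```kotlin
import java.lang.reflect.Modifier
import kotlin.reflect.KClass

abstract class Expression
data class IntLiteral(val value: Int) : Expression()
data class EqualityExpr(val left: Expression, val right: Expression) : Expression()

// Produce a short textual description of one metamodel class.
// Fields are sorted by name because reflection does not promise an order.
fun describe(kClass: KClass<*>): String {
    val cls = kClass.java
    val fields = cls.declaredFields
        .filter { !it.isSynthetic && !Modifier.isStatic(it.modifiers) }
        .sortedBy { it.name }
    return buildString {
        append("${cls.simpleName} : ${cls.superclass?.simpleName}\n")
        for (f in fields) append("  ${f.name}: ${f.type.simpleName}\n")
    }
}

fun main() {
    // Prints the class name, its superclass, and one line per field.
    print(describe(EqualityExpr::class))
    print(describe(IntLiteral::class))
}
```

Java reflection reports JVM types, so an Int property shows up as int and nullability is lost. Recovering Kotlin-level information such as nullable types would require kotlin-reflect or parsing the Kotlin sources.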

Over time we have started building these functionalities but…

At some point, we realized we are basically building a much more basic clone of EMF.

Now we wonder… is there some alternative we should just adopt?

EMF itself seems too heavy: the usage of OSGi and the fact it has a ton of dependencies make it a bit hard to use. We would also need to support the framework outside the JVM (specifically from JavaScript and Python, possibly also C#).

I remembered having used RGen many years ago as a lightweight replacement of EMF.

Maybe we could just use the XMI format.

It seems surprising that there is not a solution to this specific problem.

So before investing the next 5 years building it, I wanted to check with you if you know of alternatives to use or if instead this would be actually useful.

ftomassetti · January 20, 2021, 5:12pm

This has also been discussed on Twitter:
https://twitter.com/ftomasse/status/1351928350919548930

meinte.boersma · January 20, 2021, 9:26pm

Also already mentioned in the Twitter thread: EMF is not OSGi-native, and you can perfectly use it without OSGi, as-is. I haven’t used EMF recently, but I recall that running it outside a pure Eclipse context wasn’t too problematic.

The XMI format is an OMG standard, right? You can’t really go wrong with that, unless you truly hate XML as a serialization format, for some reason

I think EMF has a couple of valuable aspects: its minimality; the Ecore (meta-)metamodel; its role as a runtime for managing in-memory models (where it doesn’t really matter whether you do that dynamically based on an Ecore model, or with generated classes), including (de-)serialization (from/to XMI, but this generalizes to JSON just fine); and its generator from Ecore to Java classes. All of these aspects can obviously be generalized to languages other than Java, and that would be valuable and interesting.

It’s not obvious to me which aspect one should make polyglot first, in order to “force” the other aspects to “come with”. You have to make sure that existing interop works, so e.g. providing an implementation of Ecore in other metamodeling frameworks only makes sense if it can produce XMI that plugs into existing Java-EMF. Porting over the runtime is a lot of work in one go.

EMF does seem to have a few corners which might be…more generic than necessary, and which are hardly used at all. (I’m thinking of some of the attributes of an EReference here.) Maybe it’s an idea to define a “minimal-EMF” of sorts, and see how that ports to various other languages?

vincent.aranega · January 21, 2021, 7:45am

From a non-objective point of view, regarding Python, I would suggest using PyEcore, the EMF implementation in Python that I work on. It aims to be compatible with EMF, more or less at the API level (some points are simplified) and through the XMI format (and JSON following the EMF-jackson format), to ensure that you can exchange models/metamodels between Java-EMF and Python-EMF. I also tried to unify the way you handle dynamic/static metamodels/models, so switching from a static to a dynamic metamodel shouldn’t impact your app code. It’s currently used in some projects such as Geppetto/OpenWorm, where both a Java and a Python backend are used. PyEcore also comes with its own Python generator (pyecoregen) if you want to generate a static metamodel from your .ecore.

For JavaScript, there is the ecore.js library, but from what I recall, it doesn’t support opposite references and the project seems abandoned.

Regarding a fully polyglot EMF addressing Kotlin and Python, perhaps trying to run EMF in GraalVM could work?

ftomassetti · January 21, 2021, 12:53pm

This is good to hear

Well, I do not care that much. However, given that the browser will be involved JSON would feel more natural, but I suppose one could handle also XML

Yes, I think we eventually will need all of these things

In picking that I would be driven by what I immediately need.
I guess that, for the first case we need to handle, having a way to generate “metamodel classes” in Python would be the first step. So perhaps our first steps could be:

Have a component to read our Kotlin dataclasses and generate an Ecore metamodel
Maybe use PyEcore to generate the Python data classes from the Ecore metamodel

I am not even sure we will start using PyEcore for the second step. We could just use the Ecore format (should I say XMI?) instead of a random exchange format we needed to come up with.

By the way, this could also be relevant for your “MPS exporter”, maybe it could be used to produce XMI files?

Yes, indeed. If I recall correctly it had generics and the possibility of adding primitive datatypes (I could be wrong). Probably we would not need that

ftomassetti · January 21, 2021, 12:58pm

Is this a format you created or is it “standard”?

Nice! Do you take advantage of Python dataclasses in the generator?

It could be still worth taking a look. The documentation here gives the impression that the core features are there: ecore.js

Well, I would not like to depend on the availability of GraalVM. While it is interesting and promising, I might not always be able to have clients adopt GraalVM.

vincent.aranega · January 21, 2021, 2:13pm

Is this a format you created or is it “standard”?

The format is the same one used by EMF. I tried to be as compatible with EMF as I could here, for both XMI and JSON. However, some very specific features may not be implemented yet (for example, some metadata behavior from the .ecore file).
You can try it here: in this small script, a .ecore file is loaded from a remote URI, the metamodel is registered, and an XMI model is opened. The files come from the ALE project, written in EMF Java.
The JSON format supported by PyEcore is the same one as described here (which is the same used by ecore.js).

Nice! Do you take advantage of Python dataclasses in the generator?

Not yet, dataclasses were released after I released the first version of PyEcore and I had some users. I stayed with classical classes for this (you can see here an old example of how the metamodel code looks like). I should make some tests about it and modify the generator. Fortunately, it’s really easy to modify the pyecoregen generator, it relies on jinja2 templates and it’s easy to extend/change the existing generator to use your own templates.

It could be still worth taking a look. The documentation here gives the impression that the core features are there

It would be nice to check whether everything is really there. Last time I tried (a few years back), it still had some issues.

meinte.boersma · January 21, 2021, 3:42pm

It seems to me that ecore.js is very interesting, both for continuing with it for the runtime part and for its JSON format for XMI.

I think it would be good to come up with a specification of what EMF (for Java) actually is and does. Some of that is just a reference to the XMI standard, but another part is what Ecore is and does, and to what extent it aligns and differs with EMOF (ch. 12). A set of compliance tests would also be necessary in the longer than short run.

That {c|w}ould make EMF really a standard based on a de facto implementation (the Java one), with interoperability through models serializable using XMI, and using meta models in Ecore.

Some other points (in random order):

Didn’t mbeddr (MPS) have an Ecore im-/export facility? My mps-exporter stuff needs quite a bit of work before it’s complete enough that it starts making sense to have it produce XMI instead of some JSON format.
Reading XML in a browser is somewhat cumbersome. I’d rather use a sensible JSON format, as long as it’s nicely standardized.
I always thought that the built-in primitive datatypes were one of EMF’s weaker points. They obviously map to Java types, but it’s already different for e.g. JavaScript. I wouldn’t mind adding the possibility of defining primitive datatypes to try and deal with that.

Niko · January 21, 2021, 6:35pm

This mentions some “EMF-JSON Jackson mapper”: EMF.cloud
And some talk from 2018: EMF, JSON and I | EclipseCon Europe 2018

meinte.boersma · January 21, 2021, 6:45pm

I just created GitHub - dslmeinte/polyglot-emf: Proposal(s) to make EMF polyglot. to try and have a central place to gather thoughts and coordinate efforts. Please have a look at it. I invited Federico as a collaborator, and I can invite others as such, or you could send PRs.

vincent.aranega · January 21, 2021, 8:02pm

Thanks @meinte.boersma for the initiative. I’m glad to see that the 4 main points of EMF (and more) are tackled by PyEcore. I would perhaps add to the list you proposed the “Command” system of EMF: it is widely used to cleanly handle in-memory models/metamodels, and it provides the undo/redo system that is useful for creating modeling editors. I implemented it, and it was of great use for loading/modifying models from the EMFCloud and GenMyModel model repositories using either XMI or JSON.

meinte.boersma · January 21, 2021, 8:42pm

I switched on GitHub’s discussions feature on this repo, but not with the intention of moving the discussion over there right away.

@vincent.aranega I guess one could see the Command system as belonging to the runtime aspect? I’ll add some text to that point for that.

ftomassetti · January 22, 2021, 10:38am

I think there is a lot of potential regarding this. In the long term, I could imagine this could open the possibility to recreate an ecosystem richer than the EMF ecosystem, with a variety of interoperable tools so that, based on the common model and metamodel formats we could:

Have systems for storage and collaboration (Modelix)
Have systems for parsing, based on ANTLR (a subset of the features of Xtext, maybe something similar to textX)
Perhaps a way to interact with textual editors involving the Language Server Protocol
A way to plug-in web editors like WebEditKit and ProjectIt
Interaction with MPS in various forms
Work with multiple code generators supporting this format
Have the possibility of building different stages of these systems in different languages such as Kotlin, Java, Python, TypeScript, Javascript, C#

While there is a lot of potential and a ton of different things we could do, could we imagine some actual use cases of immediate applicability we see?

I will try sharing a couple.

Short term: Parsing and processing

In the very short term, I see the need we have to combine a parser written in Kotlin with a processing stage written in Python.

The parser is written using ANTLR and then we translate the parse-tree to an AST implemented using Kotlin data classes and the Kolasu framework.
To use the parser from a Python program, we are thinking of simply invoking the parser, making it output JSON, and loading that JSON from Python.

Now, we can derive the metamodel of our AST by examining the Kotlin data classes through reflection or parsing Kotlin code. Once we get this model we could serialize it in XMI or in a transposition of XMI to JSON. We could then load such metamodel in Python and generate classes. We could potentially do that using PyEcore, if I understood correctly. Ideally, we could also evolve PyEcore to use Python data classes, but this is not strictly necessary. To enable this scenario we would just need a mechanism that generates the metamodel in XMI or JSON-XMI from our Kotlin data classes.
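To give an idea of the size of that mechanism, here is a minimal sketch that writes an Ecore-style XMI document from a tiny metamodel description. The MetaClass and Containment types, the function toEcoreXmi, and the example.com namespace URI are all invented for this example. The element and attribute names follow what, to the best of my knowledge, EMF writes in .ecore files, but the output has not been validated against EMF or PyEcore, and only classes with containment references are covered.

```kotlin
// Hypothetical, minimal metamodel description: classes and containment references only.
data class Containment(val name: String, val type: String, val many: Boolean = false)

data class MetaClass(
    val name: String,
    val superType: String? = null,
    val isAbstract: Boolean = false,
    val containments: List<Containment> = emptyList()
)

fun xmlEscape(s: String): String =
    s.replace("&", "&amp;").replace("<", "&lt;").replace("\"", "&quot;")

fun attr(name: String, value: String): String = " $name=\"${xmlEscape(value)}\""

fun toEcoreXmi(packageName: String, nsUri: String, classes: List<MetaClass>): String = buildString {
    append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n")
    append("<ecore:EPackage")
    append(attr("xmi:version", "2.0"))
    append(attr("xmlns:xmi", "http://www.omg.org/XMI"))
    append(attr("xmlns:xsi", "http://www.w3.org/2001/XMLSchema-instance"))
    append(attr("xmlns:ecore", "http://www.eclipse.org/emf/2002/Ecore"))
    append(attr("name", packageName))
    append(attr("nsURI", nsUri))
    append(attr("nsPrefix", packageName))
    append(">\n")
    for (c in classes) {
        append("  <eClassifiers")
        append(attr("xsi:type", "ecore:EClass"))
        append(attr("name", c.name))
        if (c.isAbstract) append(attr("abstract", "true"))
        // Classifiers in the same file are referenced with the "#//Name" fragment syntax.
        c.superType?.let { append(attr("eSuperTypes", "#//$it")) }
        if (c.containments.isEmpty()) {
            append("/>\n")
        } else {
            append(">\n")
            for (f in c.containments) {
                append("    <eStructuralFeatures")
                append(attr("xsi:type", "ecore:EReference"))
                append(attr("name", f.name))
                // Single-valued children are treated as mandatory; lists are unbounded.
                append(if (f.many) attr("upperBound", "-1") else attr("lowerBound", "1"))
                append(attr("eType", "#//${f.type}"))
                append(attr("containment", "true"))
                append("/>\n")
            }
            append("  </eClassifiers>\n")
        }
    }
    append("</ecore:EPackage>\n")
}

fun main() {
    val classes = listOf(
        MetaClass("Expression", isAbstract = true),
        MetaClass(
            "EqualityExpr", superType = "Expression",
            containments = listOf(Containment("left", "Expression"), Containment("right", "Expression"))
        )
    )
    print(toEcoreXmi("ast", "http://example.com/ast", classes))
}
```

Attributes with primitive types and non-containment references are left out on purpose; they are where the mapping from Kotlin types to Ecore datatypes would need real decisions.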

Then we would need to translate also the actual AST instances (the model). At the moment we are serializing JSON and unserializing the JSON on the Python side. This JSON could be based on JSON-XMI instead of our own format.
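For reference, the kind of ad hoc JSON serialization described here can be sketched in a few lines of Kotlin. The "type" discriminator and the field layout below are assumptions of this example, not our actual format and not the JSON-XMI format; a real project would more likely use a JSON library than build strings by hand.

```kotlin
// Hypothetical AST types and a hand-rolled serializer, for illustration only.
sealed class Expression
data class IntLiteral(val value: Int) : Expression()
data class VarRef(val name: String) : Expression()
data class EqualityExpr(val left: Expression, val right: Expression) : Expression()

// Escape the characters that must not appear raw inside a JSON string.
fun jsonString(s: String): String = buildString {
    append('"')
    for (c in s) {
        when (c) {
            '"' -> append("\\\"")
            '\\' -> append("\\\\")
            '\n' -> append("\\n")
            '\r' -> append("\\r")
            '\t' -> append("\\t")
            else -> if (c < ' ') append("\\u%04x".format(c.code)) else append(c)
        }
    }
    append('"')
}

// Each node becomes an object carrying its type name plus its properties and children.
fun Expression.toJson(): String = when (this) {
    is IntLiteral -> """{"type":"IntLiteral","value":$value}"""
    is VarRef -> """{"type":"VarRef","name":${jsonString(name)}}"""
    is EqualityExpr ->
        """{"type":"EqualityExpr","left":${left.toJson()},"right":${right.toJson()}}"""
}

fun main() {
    println(EqualityExpr(VarRef("a"), IntLiteral(1)).toJson())
    // {"type":"EqualityExpr","left":{"type":"VarRef","name":"a"},"right":{"type":"IntLiteral","value":1}}
}
```

Every consumer of this output has to know the layout by convention, which is the argument for switching to a shared, documented format.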

Medium term: Accessing Modelix from different languages

We have APIs to work with Modelix from Kotlin (and Java). However, it makes sense to work with models stored in Modelix from all sorts of other languages, in particular from TypeScript.
At the moment we can work with Modelix only through a dynamic API. For example, if we have a concept Car we do not have a class Car; we just use the class Node and set properties by specifying their name (e.g., “plate” or “year” or “color”). We do not have a class with methods such as “getPlate” or “setColor”.

It could be useful to generate those classes. If we were exposing the metamodel in some common format, like XMI, we may be able to reuse existing code generators, and then combine them with a runtime “modelix-aware”.
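The difference between the two styles can be sketched as follows. The Node class below is a stand-in written for this example and is not the Modelix API; Car is the kind of wrapper a code generator could emit from the metamodel, layered over the dynamic node.

```kotlin
import kotlin.reflect.KProperty

// Stand-in for a dynamic node API; hypothetical, not the actual Modelix classes.
class Node(val concept: String) {
    private val properties = mutableMapOf<String, Any?>()
    fun getProperty(name: String): Any? = properties[name]
    fun setProperty(name: String, value: Any?) { properties[name] = value }
}

abstract class TypedNode(val node: Node)

// Delegate that stores a typed Kotlin property in the underlying dynamic node,
// using the Kotlin property name as the key.
class NodeProperty<T> {
    @Suppress("UNCHECKED_CAST")
    operator fun getValue(thisRef: TypedNode, property: KProperty<*>): T =
        thisRef.node.getProperty(property.name) as T

    operator fun setValue(thisRef: TypedNode, property: KProperty<*>, value: T) {
        thisRef.node.setProperty(property.name, value)
    }
}

// What a generator could produce for the concept "Car".
class Car(node: Node = Node("Car")) : TypedNode(node) {
    var plate: String? by NodeProperty<String?>()
    var year: Int? by NodeProperty<Int?>()
    var color: String? by NodeProperty<String?>()
}

fun main() {
    // Dynamic style: property names are strings, nothing is checked at compile time.
    val raw = Node("Car")
    raw.setProperty("color", "red")

    // Typed style: the same node seen through the generated wrapper.
    val car = Car(raw)
    car.plate = "AB123CD"
    car.year = 2021
    println(car.color)                 // red
    println(raw.getProperty("plate"))  // AB123CD
}
```

Because the wrapper holds no state of its own, the typed and dynamic views stay consistent, and the runtime underneath could be swapped for one that talks to a model server.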

I may just be rambling, as I do not have a clear idea on this yet.

vincent.aranega · January 22, 2021, 11:11am

Once we get this model we could serialize it in XMI or in a transposition of XMI to JSON. We could then load such metamodel in Python and generate classes. We could potentially do that using PyEcore, if I understood correctly.

Yes, if your metamodel is EMF/Ecore, it works out-of-the-box.

Then we would need to translate also the actual AST instances (the model). At the moment we are serializing JSON and unserializing the JSON on the Python side. This JSON could be based on JSON-XMI instead of our own format.

If you derive an Ecore metamodel from your AST, you should have the JSON-XMI and gain the deserialization for free in Python/JavaScript.

Out of curiosity, how big are the models you are handling? (by number of elements in one resource, for example)

meinte.boersma · January 22, 2021, 9:47pm

Formally defining the transposition of XMI to JSON would have a lot of value. That would open up the goodness of XMI (and EMOF) to languages that like JSON more than XML.

I could also imagine that “something” (e.g. a set of interfaces) to serialize any AST to XMI, and deserialize it back, could be useful. Currently, you need an EMF model to do this (at least in the Java space), but maybe that’s too strict a requirement.

meinte.boersma · January 28, 2021, 8:17pm

I updated the repo with @ftomassetti 's suggestions.

meinte.boersma · February 1, 2021, 9:22am

I updated GitHub - dslmeinte/polyglot-emf: Proposal(s) to make EMF polyglot. with some links about emfjson-jackson, and a concrete proposal: come up with a JSON Schema for the JSON format version of XMI as implemented by ecore.js, emfjson-jackson, and PyEcore. These might differ but maybe we can uniformize them?

ftomassetti · February 1, 2021, 11:15am

In the last few weeks I have experimented with using EMF from Kotlin, and I found that to be easier than I anticipated! No OSGi, just adding 3 dependencies to a Gradle script and I was ready to go. I then had to interrupt my experiments, as this week I have many commitments.

I think that using Ecore “as XMI” or as JSON as an interchange format is a first valuable goal that is relatively easy to achieve. I agree with @meinte.boersma that defining the JSON version of XMI would be valuable.

We could build also exporters from MPS to XMI or JSON/XMI, or importers from it.

Regarding the runtime, obviously we have a great Java runtime that we could use immediately. I personally would like to use it in combination with generated Kotlin data classes. Generating such classes from an Ecore metamodel seems relatively easy.

I would like to explore the other runtimes (Javascript and Python).

For defining metamodels I am currently playing with a simple Kotlin internal DSL and it seems to work relatively well. I wondered if there are external DSLs to generate Ecore metamodels. I guess someone has attempted that.
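To show what such an internal DSL might look like, here is a minimal sketch built on Kotlin's lambdas with receivers. It is not the DSL I am experimenting with; the builder names (metamodel, concept, attribute, containment, reference) are invented for this example.

```kotlin
// Hypothetical metamodel-building DSL; the names are illustrative, not an existing library.
enum class FeatureKind { ATTRIBUTE, CONTAINMENT, REFERENCE }

data class Feature(val name: String, val type: String, val kind: FeatureKind, val many: Boolean = false)
data class MetaClass(val name: String, val superType: String?, val features: List<Feature>)
data class Metamodel(val name: String, val classes: List<MetaClass>)

// Prevents calling outer-builder functions from inside a nested block by accident.
@DslMarker
annotation class MetamodelDsl

@MetamodelDsl
class ClassBuilder(private val className: String, private val superType: String?) {
    private val features = mutableListOf<Feature>()

    fun attribute(name: String, type: String) {
        features += Feature(name, type, FeatureKind.ATTRIBUTE)
    }

    fun containment(name: String, type: String, many: Boolean = false) {
        features += Feature(name, type, FeatureKind.CONTAINMENT, many)
    }

    fun reference(name: String, type: String) {
        features += Feature(name, type, FeatureKind.REFERENCE)
    }

    fun build(): MetaClass = MetaClass(className, superType, features.toList())
}

@MetamodelDsl
class MetamodelBuilder(private val modelName: String) {
    private val classes = mutableListOf<MetaClass>()

    fun concept(name: String, extends: String? = null, body: ClassBuilder.() -> Unit = {}) {
        classes += ClassBuilder(name, extends).apply(body).build()
    }

    fun build(): Metamodel = Metamodel(modelName, classes.toList())
}

fun metamodel(name: String, body: MetamodelBuilder.() -> Unit): Metamodel =
    MetamodelBuilder(name).apply(body).build()

fun main() {
    val ast = metamodel("ast") {
        concept("Expression")
        concept("IntLiteral", extends = "Expression") {
            attribute("value", "Int")
        }
        concept("EqualityExpr", extends = "Expression") {
            containment("left", "Expression")
            containment("right", "Expression")
        }
    }
    ast.classes.forEach { println("${it.name}: ${it.features.map { f -> f.name }}") }
}
```

Types are referred to by name as plain strings to keep the sketch short; a real DSL would probably resolve them and report dangling names.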

Yes, I agree that would be useful!

ftomassetti · February 1, 2021, 11:19am

I think this is a good first work package. I agree that unifying them would be very valuable. Who would be interested in working on this?

ftomassetti · February 3, 2021, 7:26am

Also relevant for this discussion: JSOI: An EMF model interchange format for efficient model management (posted here: JSOI and Type based access · Discussion #2 · dslmeinte/polyglot-emf · GitHub)