DotExpressions with labels: Do I miss something?

Niko · October 28, 2021, 5:12pm

Assume we have some system with entities that might relate to other entities, and we map that to SQL. We’d like to provide users with a simplified dot-expression query language.

The idea below seems obvious, but I haven’t seen it before – so probably I’m missing something, and ask for your thoughts.

Basically, we write paths through our object graph, and optionally can label any entity we come across. These labels can be used in constraints, and in more specific queries.

Account<acc>
  .owner.address<addr>
  .transactions<trx>
where addr.country = 'NL'
where trx.amount > 15

richStreets = addr.street
  where acc.amount > 1000

shoppingAddress = Address['FR', 'Champs-Élysées'] // id: 23

mediumTransactionAtShoppingAccounts = acc
  where trx.amount < 50
  where addr = shoppingAddress

(Please disregard parsing, the syntax is not set in stone)

I hope for the following advantages:

Easier to understand for users
Users don’t have to think about our internal mapping to SQL
Complex queries can easily be reused, as we can reach “in the middle” and just get the entity we care for
Optimization potential, e.g:
- we might skip some joins if the result doesn’t need access to
- add constraints at the most efficient place in the query
- create temporary views for the base query

My main concern is whether the query language can always be translated to SQL unambiguously.

Above’s example would translate to SQL like:

-- richStreets =
select addr.street
  from
    Account acc,
    Person p,
    Address addr,
    Transaction trx
  where
    acc.amount > 1000
    and acc.owner = p.id
    and p.address = addr.id
    and trx.account = acc.id
    and addr.country = 'NL'
    and trx.amount > 15

-- mediumTransactionAtShoppingAccounts =
select
    acc.id,
    acc.balance,
    acc.owner
  from
    Account acc,
    Person p,
    Address addr,
    Transaction trx
  where
    acc.owner = 23
    and trx.amount < 50
    and acc.owner = p.id
    and p.address = addr.id
    and trx.account = acc.id
    and addr.country = 'NL'
    and trx.amount > 15

Example entities:

entity Account {
  balance: decimal
  owner: 1 Person
  transactions: * Transaction
}

entity Person {
  name: string
  address: 1 Address
}

entity Address {
  country: string
  street: string
}

entity Transaction {
  amount: decimal
}

Example tables (skipping constraints):

create table Account (
  id int,
  balance decimal,
  owner int references Person.id
);

create table Person (
  id int,
  name varchar,
  address int references Address.id
);

create table Address (
  id int,
  country: char(2),
  street: varchar
);

create table Transaction (
  id int,
  amount decimal,
  account int references Account.id
);

pjmolina · October 29, 2021, 6:40am

Looks good!

My main concern is whether the query language can always be translated to SQL unambiguously.

It seems you are building something very similar to an ORM (Object Relational Mapper). Not trivial to deal with in general, (can have many edge cases) but doable (ORMs works decently well).

I am not aware of unambiguous problems on ORMs.

Niko · October 29, 2021, 7:22am

Thank you (-:

I’m aware of ORMs. The aspect I’ve not seen before are the labels in the middle of the expression; so I’m wary whether we can map them (and their use) to SQL correctly.

pjmolina · October 29, 2021, 7:59am

The so called labels are usually called “aliases” in query languages.
A well-known concept in similar query languages like:

SQL, e.g. select * from customer as c where c.country == 'CR'
ORMs e.g. Hibernate
and Cypher (the query lang over Neo4J, a graph db). e.g. (cust:Customer)-[:ISSUED]->(o:Order)-[:CONTAINS]->(prod:Product) see https://neo4j.com/developer/cypher/guide-sql-to-cypher/ for more examples.

I see no problem with aliases, as long as they are unique in your query.

steffen.zschaler · October 29, 2021, 7:59am

Looks nice. I wasn’t totally sure about the labels though: as you moved further down in the expressions, I struggled to build a mental model of what was going on.

For example, in

mediumTransactionAtShoppingAccounts = acc
  where trx.amount < 50
  where addr = shoppingAddress

acc refers to the whole query rather than just any accounts, which messed with the way my mind thinks about dot expressions as progressively refining/ filtering if that makes sense.

Might just be me, of course…

Niko · October 29, 2021, 9:09am

Thanks, now I know the right term to look for.

Niko · October 29, 2021, 9:20am

Me neither … but I wonder whether this is programmer’s thinking.

The original context currently has a DSL that deals with data from a database. Lots of scripts start with the exact same loops:

foreach (account in SomeComplicatedQuery()) {
  foreach (person in SomeOtherComplicatedQuery(account)) {
    foreach (transaction in SomeConvolutedQuery(account)) {
      ... do something with account, person, and transaction
    }
  }
}

As this is present all over the place, and it’s not a DSL (but GPL), it cries for a better abstraction. These queries are the currently best approach – whether it works is to be seen …
The idea behind this was:

Imagine we’re standing in front of a huge wall with all entities written to it. You get a red marker and color all the attributes and references you care for.
The system will make sure they will be available, and handle all required combination (aka loops).

steffen.zschaler · October 29, 2021, 9:34am

Makes sense. I think this may be related to your uneasiness about “ambiguity”. This could be me mentally trying to translate into SQL to make sense of what I’m seeing, and it may not be a problem at all for your domain users. I just wonder whether for more complex queries this could become tricky.

Then again, perhaps I’m misunderstanding your proposed language. I think this would potentially work better if there was a clear conceptual separation between the initial query – the only place where labels / aliases are allowed – and the subsequent sub-queries, which are to be understood as strict refinements of the initial query?

Niko · October 29, 2021, 10:14am

You could actually reuse labels arbitrarily:

Account<acc>
  owner<own>.address<addr>
  transactions<trx>
  where owner = myself

acc<dutchAcc>
  where addr.country = 'NL'

richestDutch = dutchAcc.owner
  top 1 by dutchAcc.balance desc

suspciousDutchTrx = dutchAcc.transactions<t>
  where t.amount > 10000

// pretty much the same as
dutchAcc.transactions<suspciousDutchTrx>
  where suspciousDutchTrx.amount > 10000

Semantics would be “all aliases within the same query refer to entities that satisfy the whole query”.
The queries can be combined, so we could always fold subsequent (direct or indirect) alias references into one big query (or not, if we wanted to do some optimization with temporary views).

The real crux is whether we can simplify something like

foreach (account in SomeComplicatedQuery()) {
  foreach (person in SomeOtherComplicatedQuery(account)) {
    foreach (transaction in SomeConvolutedQuery(account)) {
      if (transaction.currency != account.currency) {
        if (person.canMixCurrencies) {
          process(transaction)
        } else {
          error "currency mix"
        }
      }
   }
}

to

if (trx.currency != acc.currency) {
  if (pers.canMixCurrencies) {
    process(transaction)
  } else {
    error "currency mix"
  }
}

and infer all the required loops.

Even more interesting if we mixed several queries:

if (acc.transactions contains any suspiciousDutchTrx) {
  checkForMoneyLaundering()
}

alessio.stalla · November 4, 2021, 8:42am

This doesn’t seem to be consistent to me. Some of these look like aliases while others look like you’re actually defining named queries. “acc” is not an alias for the Accounts table, it’s a name for the query returning Dutch transactions with an amount > 15 (in the first example). However, “addr” sometimes is merely an alias for an element in that query, while other times it looks like it’s another named query (as in richStreets), and in the latter case, it’s not clear whether it inherits the where condition from “acc” or not. I think you should disambiguate the two cases.

Also joins are nontrivial. In the example, you’re refining a base query by adding new conditions (presumably in AND with the original WHERE condition). But what about adding a new condition that needs a join with another table? E.g. acc where transactionHistory.last < yesterday, where transactionHistory is a join to another table.

Niko · November 5, 2021, 2:16pm

Thanks for the detailed remarks!

I think (read: hope), these ambiguous cases can be sorted out (technically) by “folding” all referenced queries into one. I agree structurally the cases are different, but from a user perspective, I want to get some feedback from prototype users to see if this is a problem for them.

This is my main area of concern. My gut feeling says I could get away with it as long as I don’t have to write complex joins, but can use the where variant of joins. But I definitely have to try it out before we really decide to go for this language.
Do you have specific nasty cases in mind?