TOPIC: An Interesting Database Query
#632
An Interesting Database Query 3 Years, 6 Months ago
In the "book" for our main application, I wanted to find all the names that were used more than once (e.g. "History Report"), that had different code (text).

I started with:

Code:

eta
     summarize name, count
     where count > 1
     remove count



This gave me all the names used more than once. ("eta" is the name of the book)

Unfortunately, summarize "loses" all the other fields so to get them back I joined with the original book. Then I could summarize again, but this time by both name and text and only keep ones with a count of 1 (the ones with different text).

Here is the final query:

Code:

eta
     summarize name, count
     where count > 1
     remove count
     join eta
     summarize name, text, count
     where count = 1
     remove text, count
     sort name



This query probably is not the most efficient, but for my purpose that did not matter. I think it is a good example of the power of a relational algebra query language like Suneido's that allows easy composition of complex queries.
 
 
andrew
 
#637
Re:An Interesting Database Query 3 Years, 6 Months ago
Hi Andrew. I think you can optimize your query in this way:

eta project name, text
summarize name, count
where count > 1


The trick in the above query is that the project removes duplicated rows. ;)

Post edited by: Mauro, at: 2007/02/04 12:23
 
 
Mauro
 
#638
Re:An Interesting Database Query 3 Years, 6 Months ago
Much better! Thank you!

There is one problem because project makes an index from the fields (if it can not use an existing one). But some of the text fields are too long to index. (Index nodes are 4 k and normally should hold multiple keys.)

This is easily fixed by truncating the text field. (This is not "perfect" because two text fields could become equal after truncation, but for my purposes it was ok.) Unfortunately, queries do not allow calling methods so you can not just extend with a Substr. Instead, you have to use either a rule or a global function. For example:

Code:

eta
    extend text2 = Trunc(text)
    project name, text2
    summarize name, count
    where count > 1



where Trunc is defined as:

Code:

function (s)
    {
    return s.Substr(0, 100)
    }



PS. I really should get around to making queries handle method calls!
 
 
andrew
 
#639
Re:An Interesting Database Query 3 Years, 6 Months ago
Maybe you could use this too:

Code:


eta
    extend texthash = Md5(text)
    project name, texthash
    summarize name, count
    where count > 1



In this way it will be slower but more accurate.

andrew wrote:
PS. I really should get around to making queries handle method calls!Yes, it would be very nice. :)

Post edited by: Mauro, at: 2007/02/04 21:40
 
 
Mauro
 
#641
Re:An Interesting Database Query 3 Years, 6 Months ago
Another good idea.

Or the implementation of project could be changed to use hashing (with comparison of actual values in the case of duplicate hash values). Not sure if this would be faster or not - it might depend how many duplicates there were (since looking up the original value would slow it down). But this is a bigger job!
 
 
andrew