Simplifying Complicated SQL queries in LINQ

I have been working pretty heavily in LINQ the past few days and ran into a few problems when dealing with some more complicated queries. While LINQ is fairly simple and intuitive when working with objects in memory, the LINQ-SQL bridge (or SQLite in my case) is a different machine. The reason it is different is that your queries must be translated into the appropriate SQL, and this SQL can sometimes turn into brutally inefficient queries.

In my case i was working on a query that inner joined three tables together, then grouped by one of the columns in the joined tables. Now LINQ can do this no problem with Join/Join/GroupBy. The difficulty lies in how you project your grouped columns. By default LINQ will just include the whole collection of grouped columns per group row. While this works great in memory, the SQL bridge butchers your call. The reason is that SQL (SQLite at least) can’t do what you asked so the translation is to group by, then do a select all for each group row. As you can imagine, the computational power required grows exponentially with the data size. This is quite clearly very bad.

When working in pure SQL, we can only include columns that are in the group key and columns with an aggregation function applied to them. The aggregation functions are not super obvious on how to use them, so this is what i will be simplifying in this post. Usually what you want is the Sum, Count, First Row, Last Row, etc… from your grouped rows. The main problem is that you can’t just call the LINQ aggregate functions normally, because it doesn’t make sense in the context of your group by projection. I’ll start off with some sample code and move on from there explaining as i go.

First we join the tables together, and project the join into an anonymous type with both values
var q1 = Table1.Join(Table2, x => x.t2key, x => x.key, (t1, t2) => new { t1, t2 })

Now we can group by (say by t1) so that we end up with t1, and an enumerable of t2’s.
var q2 = q1.GroupBy(x => x.t1.key, x => x.t2, (k, x) => new { k.Name, Timestamp = x.Max(y => y.Timestamp), Value = x.Max(y => y.Value) })
so the first lambda tells us what to group on (our parent key), the second lambda describes how to project the grouped data (we will just project the whole t2 table), and the third lambda describes how to project the result set. The third lambda is where things get interesting, we can include members of the key as is (k.Name), but for members of the grouped data, we must apply the aggregate functions or risk a horribly inefficient query. In the above example, i am using the Max aggregate function, which (in SQLite land at least, but others as well im sure) will select the last item in the group as determined by the order by clause (which i didn’t include). Because i didn’t include an order by then its just the last item in the grouped data however the grouped data was built. The order by must appear before the group by in order for it to affect the aggregation functions, so above i would do something like q1 = q1.OrderBy(x => x.t2.Timestamp) in order for the Max function to return the most recent value in the grouped data. If you specify an order by after the group by, it will order the projected group results, so you could do something like q2 = q2.OrderBy(x => x.Value) in order to show the values in ascending order, without affecting the group by aggregation. This is executed by a wrapper query, that uses the projected grouped data as input, so you execute a second query, but that’s it (and this one is generally quite fast)

So the important bits to take a way from this post is that any aggregated columns you want to project must be individually included in your group by result projection, and if you include any enumerations in your projection then they will be converted into subqueries. While my example may not be all that complicated, it is certainly more complicated than simple a select/join/where/order type of LINQ query, which are far more common. By understanding the group by projection core concepts, it should be relatively easy to construct much more complicated (AND efficient) LINQ queries. And remember, when in doubt play around in LINQPad, that program is the best for testing out LINQ queries.

Some alternative Zip extension methods

I ran into a small problem today, i had ‘n’ arrays of correlated data. I wanted to effectively “transpose” the rows into columns so that i would end up with the same number of arrays but containing the items of the same index of all arrays. to simplify that terrible explanation, here are some visuals:

start : [a0, a1, a2], [b0, b1, b2], [c0, c1, c2] –> transpose –> [a0, b0, c0], [a1, b1, c1], [a2, b2, c2]

Now the Zip function in .net4 can do this very easily, but only for TWO enumerables, not any number. Its basic function is to interleave two enumerables and project the result. I wanted to do something similar but without the silly two enumerable limitation. It’s very easy to extend Zip to operate on > 2 enumerables statically, you just need to add more and more params to the extension method, but i wanted to have true dynamic interleaving functionality. So i wrote two extension methods (really just one, and then a simplification for the standard projection) that would do this. One consequence to note is that this will be slower than the built-in Zip, quite obviously because we’re dealing with more enumerables, but also because we can’t run the same type of optimizations that Zip does (this is due to the dynamic nature of this function). So here is some code to look at:

public static IEnumerable<TResult> ZipMany<TItem, TResult>(
    this IEnumerable<IEnumerable<TItem>> source,
    Func<IEnumerable<TItem>, TResult> func)
{
    var iters = source.Select(x => x.GetEnumerator()).ToArray();
    while (iters.All(x => x.MoveNext()))
    {
        yield return func(iters.Select(x => x.Current));
    }
}

public static IEnumerable<List<TItem>> ZipMany<TItem>(
    this IEnumerable<IEnumerable<TItem>> source)
{
    return ZipMany(source, x => x.ToList());
}

So the idea is we build an array of enumerators for each of the input source lists. We have to call ToArray to invoke the execution of the projection, otherwise some strange and undesirable stuff happens. Then we just iterate each enumerator at the same pace and project the index aligned projection of the results. It’s actually pretty simple code for what it is really doing, but it works quite well and i thought i would share it. The second method is just a helper that simplifies the first by removing the need to project the results, this way you can project the results into any custom type if you want, but the default is to simply project them into a list (which is how i can imaging myself and most others wanting the results returned).

I am curious if anyone can point out any improvements or optimizations on this code. Mostly just so i can learn from them myself. Other than that, i am quite happy with how this turned out and it works wonderfully with the application that i wanted to use it with.

This same post is available at the msdn forums if you want to follow the comments there as well: http://social.msdn.microsoft.com/Forums/eu/linqprojectgeneral/thread/40cc2410-adad-4da6-80cd-1736140e885e

Get Bitcoin Email Alerts on Mt. Gox Price Changes

I got fed up with watching mtgoxlive.com all day long so I wrote a simple web service to monitor the Mt. Gox prices for me. Then I wrote a simple engine to send out email alerts when the price (or volume) goes above or below a specified threshold. The result is http://btcalerts.errorok.com/ which now just tells me when I should pay attention to the exchange, or when my open orders have likely been filled. The site is designed to eventually track multiple exchanges, but since I do all my trading on Mt. Gox I have only coded its prices in so far. The site also lacks any flare, but im not one for flare, but rather sustenance. Anyways, check it out and let me know if you like it.

Next Page »