I've been banging my head against the wall on this for a bit, and I'm not sure what the solution is.
I have a table of "jobs." These jobs have a one-to-many relationship with "tasks." I need to be able to query for jobs having a certain number of tasks.
This would be a trivial query that anybody knowing SQL could write if the tasks were in one table. However, I have 3 tables, for three types of tasks. These tables can be joined with the "jobs" table via a "job_id" column. However, that's really the only thing I have - the jobs table has a "task-type" column that tells me which tasks to look in if I know a job-id ahead of time, but it isn't a join value on the task tables.
My problem is this - I don't know how to query:
"Give me all jobs with count(task) between min and max."
If this were one table, I could simply join jobs and tasks on job_id = jobs.id and count, group by, having.
However, these 3 tables pose a challenge - I need to basically query the counts for the "right" task type without having a way of knowing which table to join.
Previous attempts have involved counting each type, unioning the pairs of job_id, count and selecting jobs with the id in that derived table. This doesn't quite work because I get 3 pairs of (job_id, task_type) for each job (because I'm counting from 3 tables). As a result, any one of those could incorrectly satisfy the criteria.
I would *like* to have a solution that gets the count from the correct table, but am definitely open to solutions where I aggregate the "correct" count instead. The tasks types are exclusive, a job cannot have tasks from multiple tasks at this time.
2 Replies - 642 Views - Last Post: 18 December 2012 - 06:44 PM
#1
General approach to aggregating counts from multiple related tables
Posted 18 December 2012 - 04:47 PM
Replies To: General approach to aggregating counts from multiple related tables
#2
Re: General approach to aggregating counts from multiple related tables
Posted 18 December 2012 - 04:56 PM
Examples of queries I've tried:
This result set includes everything that matches what I'm looking for, AND extra ids. For example, if I want items that have a count of 0, I'll get all the items that have a count of 0. I'll also get every other job in the set, because if it has any number of tasks in taska, it has 0 in taskb and that is part of the result set. I need the left join because I need to have job.id, 0 in there for jobs that truly have 0 tasks.
Edit: One thing I can do is select id, MAX(task_count) from the aggregated table.
SELECT id FROM job WHERE id IN
(SELECT id FROM
((SELECT job.id, COUNT(taska.id) AS task_count FROM job LEFT JOIN taska ON job.id = job_id GROUP BY job.id HAVING task_count BETWEEN 0 AND 1)
UNION
(SELECT job.id, COUNT(taskb.id) AS task_count FROM job LEFT JOIN taskb ON job.id = job_id GROUP BY job.id HAVING task_count BETWEEN 0 AND 1))
AS qualified);
This result set includes everything that matches what I'm looking for, AND extra ids. For example, if I want items that have a count of 0, I'll get all the items that have a count of 0. I'll also get every other job in the set, because if it has any number of tasks in taska, it has 0 in taskb and that is part of the result set. I need the left join because I need to have job.id, 0 in there for jobs that truly have 0 tasks.
Edit: One thing I can do is select id, MAX(task_count) from the aggregated table.
This post has been edited by xclite: 18 December 2012 - 05:04 PM
#3
Re: General approach to aggregating counts from multiple related tables
Posted 18 December 2012 - 06:44 PM
I believe what you're describing is an intentional denormalising of the data. While normalisation is something that should be strived for, there are several scenarios where denormalising is the best technique. It's a technique that usually favours code, but there can be a number of reasons for doing this.
To achieve the result you're after, there's a few tricks you have to put in place. Firstly, you need to flatten the 3 task tables into one table using a UNION query. This is the way I would do it, and though there a number a ways, this will probably be easiest on the engine:
You can see that I've aggregated the number of tasks against each job_id and also put in an identifier for task_type. This identifier should match the identifier you use in the column jobs.task_type.
Now that we have a "table" that unites the three task tables, we can use that in a subquery, like so:
A couple of points to make here:
See how that query goes, without knowing your exact table setup, it's hard for me to pinpoint it, but it should be correct, at least in theory.
If you need to get the query running faster, an index on jobs(id,task_type) and further indices on taska(job_id), taskb(job_id) and taskc(job_id) should help.
To achieve the result you're after, there's a few tricks you have to put in place. Firstly, you need to flatten the 3 task tables into one table using a UNION query. This is the way I would do it, and though there a number a ways, this will probably be easiest on the engine:
SELECT job_id, COUNT(*) AS task_count, 'a' AS task_type FROM taska UNION SELECT job_id, COUNT(*) AS task_count, 'b' AS task_type FROM taskb UNION SELECT job_id, COUNT(*) AS task_count, 'c' AS task_type FROM taskc
You can see that I've aggregated the number of tasks against each job_id and also put in an identifier for task_type. This identifier should match the identifier you use in the column jobs.task_type.
Now that we have a "table" that unites the three task tables, we can use that in a subquery, like so:
SELECT jobs.id, COALESCE(SUM(tasks.task_count), 0) AS task_count FROM jobs LEFT OUTER JOIN ( SELECT job_id, COUNT(*) AS task_count, 1 AS task_type FROM taska UNION SELECT job_id, COUNT(*) AS task_count, 2 AS task_type FROM taskb UNION SELECT job_id, COUNT(*) AS task_count, 3 AS task_type FROM taskc ) AS tasks ON tasks.job_id = jobs.id AND jobs.task_type = tasks.task_type GROUP BY jobs.id
A couple of points to make here:
- I have started with the jobs table and then LEFT OUTER JOINed the tasks subquery. This means that we will still get a result row even when there are no tasks for a job.
- There is a GROUP BY on the jobs.id column, as we are aggregating by job.
- The JOIN clause on the tasks subquery means that we will get as many as 3 rows per job_id, since each subsubquery in the subquery could produce a row. i.e. - there could be taska, taskb and taskc rows for a particular job.
- In the SELECT part of the query, I have SUMmed the individual task_counts, but if there are no instances of task_count for a job_id, then this SUM will be NULL. To prevent this, I have used the COALESCE function, which returns the first non-NULL argument that is passed to it. In other words, if SUM(tasks.task_count) is NULL, then it looks at the next argument, which is 0, which is not NULL, so it returns 0 instead.
See how that query goes, without knowing your exact table setup, it's hard for me to pinpoint it, but it should be correct, at least in theory.
If you need to get the query running faster, an index on jobs(id,task_type) and further indices on taska(job_id), taskb(job_id) and taskc(job_id) should help.
Page 1 of 1
|
|

New Topic/Question
Reply





MultiQuote



|