2 Replies - 1033 Views - Last Post: 18 December 2012 - 06:44 PM Rate Topic: -----

#1 xclite  Icon User is offline

  • LIKE A BOSS
  • member icon


Reputation: 916
  • View blog
  • Posts: 3,205
  • Joined: 12-May 09

General approach to aggregating counts from multiple related tables

Posted 18 December 2012 - 04:47 PM

I've been banging my head against the wall on this for a bit, and I'm not sure what the solution is.

I have a table of "jobs." These jobs have a one-to-many relationship with "tasks." I need to be able to query for jobs having a certain number of tasks.

This would be a trivial query that anybody knowing SQL could write if the tasks were in one table. However, I have 3 tables, for three types of tasks. These tables can be joined with the "jobs" table via a "job_id" column. However, that's really the only thing I have - the jobs table has a "task-type" column that tells me which tasks to look in if I know a job-id ahead of time, but it isn't a join value on the task tables.

My problem is this - I don't know how to query:
"Give me all jobs with count(task) between min and max."

If this were one table, I could simply join jobs and tasks on job_id = jobs.id and count, group by, having.

However, these 3 tables pose a challenge - I need to basically query the counts for the "right" task type without having a way of knowing which table to join.

Previous attempts have involved counting each type, unioning the pairs of job_id, count and selecting jobs with the id in that derived table. This doesn't quite work because I get 3 pairs of (job_id, task_type) for each job (because I'm counting from 3 tables). As a result, any one of those could incorrectly satisfy the criteria.

I would *like* to have a solution that gets the count from the correct table, but am definitely open to solutions where I aggregate the "correct" count instead. The tasks types are exclusive, a job cannot have tasks from multiple tasks at this time.

Is This A Good Question/Topic? 0
  • +

Replies To: General approach to aggregating counts from multiple related tables

#2 xclite  Icon User is offline

  • LIKE A BOSS
  • member icon


Reputation: 916
  • View blog
  • Posts: 3,205
  • Joined: 12-May 09

Re: General approach to aggregating counts from multiple related tables

Posted 18 December 2012 - 04:56 PM

Examples of queries I've tried:
SELECT id FROM job WHERE id IN 
  (SELECT id FROM 
    ((SELECT job.id, COUNT(taska.id) AS task_count FROM job LEFT JOIN taska ON job.id = job_id GROUP BY job.id HAVING task_count BETWEEN 0 AND 1) 
    UNION
    (SELECT job.id, COUNT(taskb.id) AS task_count FROM job LEFT JOIN taskb ON job.id = job_id GROUP BY job.id HAVING task_count BETWEEN 0 AND 1)) 
  AS qualified);


This result set includes everything that matches what I'm looking for, AND extra ids. For example, if I want items that have a count of 0, I'll get all the items that have a count of 0. I'll also get every other job in the set, because if it has any number of tasks in taska, it has 0 in taskb and that is part of the result set. I need the left join because I need to have job.id, 0 in there for jobs that truly have 0 tasks.

Edit: One thing I can do is select id, MAX(task_count) from the aggregated table.

This post has been edited by xclite: 18 December 2012 - 05:04 PM

Was This Post Helpful? 0
  • +
  • -

#3 e_i_pi  Icon User is online

  • = -1
  • member icon

Reputation: 801
  • View blog
  • Posts: 1,700
  • Joined: 30-January 09

Re: General approach to aggregating counts from multiple related tables

Posted 18 December 2012 - 06:44 PM

I believe what you're describing is an intentional denormalising of the data. While normalisation is something that should be strived for, there are several scenarios where denormalising is the best technique. It's a technique that usually favours code, but there can be a number of reasons for doing this.

To achieve the result you're after, there's a few tricks you have to put in place. Firstly, you need to flatten the 3 task tables into one table using a UNION query. This is the way I would do it, and though there a number a ways, this will probably be easiest on the engine:
SELECT
	job_id,
	COUNT(*) AS task_count,
	'a' AS task_type
FROM taska

UNION

SELECT
	job_id,
	COUNT(*) AS task_count,
	'b' AS task_type
FROM taskb

UNION

SELECT
	job_id,
	COUNT(*) AS task_count,
	'c' AS task_type
FROM taskc


You can see that I've aggregated the number of tasks against each job_id and also put in an identifier for task_type. This identifier should match the identifier you use in the column jobs.task_type.

Now that we have a "table" that unites the three task tables, we can use that in a subquery, like so:
SELECT
	jobs.id,
	COALESCE(SUM(tasks.task_count), 0) AS task_count
FROM jobs
LEFT OUTER JOIN (
	SELECT
		job_id,
		COUNT(*) AS task_count,
		1 AS task_type
	FROM taska

	UNION

	SELECT
		job_id,
		COUNT(*) AS task_count,
		2 AS task_type
	FROM taskb

	UNION

	SELECT
		job_id,
		COUNT(*) AS task_count,
		3 AS task_type
	FROM taskc
) AS tasks
	ON tasks.job_id = jobs.id
	AND jobs.task_type = tasks.task_type
GROUP BY jobs.id


A couple of points to make here:
  • I have started with the jobs table and then LEFT OUTER JOINed the tasks subquery. This means that we will still get a result row even when there are no tasks for a job.
  • There is a GROUP BY on the jobs.id column, as we are aggregating by job.
  • The JOIN clause on the tasks subquery means that we will get as many as 3 rows per job_id, since each subsubquery in the subquery could produce a row. i.e. - there could be taska, taskb and taskc rows for a particular job.
  • In the SELECT part of the query, I have SUMmed the individual task_counts, but if there are no instances of task_count for a job_id, then this SUM will be NULL. To prevent this, I have used the COALESCE function, which returns the first non-NULL argument that is passed to it. In other words, if SUM(tasks.task_count) is NULL, then it looks at the next argument, which is 0, which is not NULL, so it returns 0 instead.

See how that query goes, without knowing your exact table setup, it's hard for me to pinpoint it, but it should be correct, at least in theory.

If you need to get the query running faster, an index on jobs(id,task_type) and further indices on taska(job_id), taskb(job_id) and taskc(job_id) should help.
Was This Post Helpful? 1
  • +
  • -

Page 1 of 1