Page 1 of 1

Developing robust hierarchical data in MSSQL

#1 e_i_pi  Icon User is offline

  • = -1
  • member icon

Reputation: 801
  • View blog
  • Posts: 1,700
  • Joined: 30-January 09

Posted 06 March 2012 - 02:14 AM

*
POPULAR

Preamble
This tutorial is intended for MSSQL, but the code contents could be modified to work with Oracle, PostgreSQL, and any other database that handles CTEs (Common Table Expressions). Unfortunately, MySQL does not, so handling hierarchical data in MySQL must be achieved using different methodology. To find out if your DB handles CTEs, Wikipedia has an excellent resource available.


What you will need
  • An intermediate knowledge of SQL, including use of the WITH clause, indexes, and variables.
  • A database to work with where you have CREATE permissions for both tables and views.
  • A coffee and a cushy chair.



The problem
When we want to create relational (normalised) data in a database, we create multiple tables to accomplish this. Take, for example, the scenario where we want to store the countries of the world, divided into continents. We could use a table schema such as this:
    Continents
    ID
    Name

    Countries
    ID
    ContinentID
    Name

This is all well and good if we know that we only ever want to store continents and countries, but what happens if we want to store states/provinces also? Well, we could create another table for these, but then what happens if we need districts, cities, suburbs, streets, street addresses, buildings, rooms, etc? You can see how the data schema might become extremely cumbersome very quickly.

A table-by-table data schema does not hit the mark when it comes to storing dynamically nested hierarchical data. Also, when it comes to querying this data, it can become extremely problematic for the SQL coder, and even the smallest schema change may have drastic trickle-down effects on applications that utilise the data schema.


The solution
We can avoid the problems associated with table-by-table relations by mapping out a single table schema that is self-referential. By this, I mean that instead of having a "Countries" table that references a "Continents" table (and so on), we simply have a "Locations" table that references itself.

For this, we need to structure our data in a specific manner. Each data node will require these attributes as a minimum:
  • An ID
  • Its parent's ID
  • A name
  • A sortkey (optional - the name could be used as the sort key)

Once we have alloted these values to our nodes, we will be able to create an extensive self-referential tree structure that can have dynamic depth. Note that this data is kept in one simple table, so to work with the data we can use Views, CTEs, and even table variables to quickly and easily query the hierarchy structure.

Let's begin by creating the Locations table in the database, along with a foreign key constraint against itself:
CREATE TABLE [dbo].[Locations](
	[ID] [int] NOT NULL,
	[ParentID] [int] NULL,
	[Name] [nvarchar](1024) NOT NULL
PRIMARY KEY CLUSTERED 
(
	[ID] ASC
) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
GO

ALTER TABLE [dbo].[Locations]  WITH CHECK ADD CONSTRAINT [FK_Locations_Locations] FOREIGN KEY([ParentID])
REFERENCES [dbo].[Locations] ([ID])
GO

ALTER TABLE [dbo].[Locations] CHECK CONSTRAINT [FK_Locations_Locations]
GO



Next, we'll insert some values into the table, so we have a nice structure to work with:
INSERT INTO LocationEntities VALUES (1, NULL, 'North America');
INSERT INTO LocationEntities VALUES (2, NULL, 'Europe');
INSERT INTO LocationEntities VALUES (3, 1, 'USA');
INSERT INTO LocationEntities VALUES (4, 1, 'Canada');
INSERT INTO LocationEntities VALUES (5, 3, 'California');
INSERT INTO LocationEntities VALUES (6, 2, 'United Kingdom');
INSERT INTO LocationEntities VALUES (7, 6, 'England');
INSERT INTO LocationEntities VALUES (8, 7, 'East Midlands');
INSERT INTO LocationEntities VALUES (9, 8, 'Derbyshire');
INSERT INTO LocationEntities VALUES (10, 9, 'Derby');


Running that query will produce the simple data structure that this tutorial will continue to work with. Note that I have set the ParentID of North America and Europe as NULL. This is because we are defining these two Locations as root nodes. Here's a visualisation of that data:
+---North America
| |
| +---USA
| | |
| | +---California
| |
| +---Canada
|
+---Europe
  |
  +---United Kingdom
    |
	+---England
	  |
	  +---East Midlands
	    |
		+---Derbyshire
		  |
		  +---Derby




Creating Views for the hierarchy
So now that we have the data that we are going to work with, how do we work with it?

Well, there are two main tools you will need to work with this data effectively - a Tree View, and an Entity Parents View. Both of these Views will utilise CTEs, but in different manners. One will use top-down recursion, and the other will use bottom-up recursion.

Let's start by creating the Tree View:
-- "Top down" recursion
CREATE VIEW [dbo].[LocationTreeView]
AS
WITH LocationTree(
	ParentID,
	ID,
	Depth,
	Name,
	SortKey,
	Indent
) AS (
	-- Start with all the root node Locations
	SELECT
		ParentID,
		ID,
		0 AS Depth,
		Name,
		CONVERT(nvarchar(1024), Name) AS SortKey,
		CONVERT(nvarchar(1024), '') AS Indent
	FROM Locations
	WHERE ParentID IS NULL
	UNION ALL
	-- then recursively add the children
	SELECT
		Children.ParentID,
		Children.ID,
		Parents.Depth + 1,
		Children.Name,
		CONVERT(nvarchar(1024), RTRIM(Parents.SortKey) + '|' + Children.Name) AS SortKey, 
		CONVERT(nvarchar(1024), Parents.Indent + '  ') AS Indent
	FROM Locations AS Children
	INNER JOIN LocationTree AS Parents ON Children.ParentID = Parents.ID
)

SELECT TOP (100) PERCENT
	ParentID,
	ID,
	Depth,
	Name,
	Indent + Name AS IndentedName,
	SortKey
FROM LocationTree
ORDER BY SortKey
GO


And there we have it, a View that contains the Locations data as a hierarchical tree. SELECTing this view will show you the entire hierarchy, with one row per node sorted by the sort key, which in this case is alphabetical by name over node.

The next View we will create is the Entity Parents View. This View, when combined with the TreeView, enables a vast array of functionality, including SELECTing a part of the hierarchy, determining the depth (distance) between two Locations, and determining traversal paths between Locations:
-- "Bottom up" recursion
CREATE VIEW [dbo].[LocationsEntityParentsView]
AS
WITH EntityParents(
	ID,
	ParentID,
	Depth,
	IsLocalRootNode
) AS (
	-- Add an entry for each entity, with itself as it's 0-depth parent
	SELECT
		ID,
		ID AS ParentID,
		0 AS Depth,
		1 AS IsLocalRootNode
	FROM Locations
	UNION ALL
	-- Add an entry for each entity, with it's ParentID as it's 1-depth parent
	SELECT
		ID,
		ParentID,
		1 AS Depth,
		0 AS IsLocalRootNode
	FROM Locations AS Locations_1
	UNION ALL
	-- Finally, recurse up EntityParents to get the furthest grandparent, which should be the nodes with ParentID IS NULL
	-- Note that because the root nodes have ParentID IS NULL, this is where we stop the recursion
	SELECT
		Children.ID,
		Parent.ParentID,
		Parent.Depth + 1 AS Depth,
		0 AS IsLocalRootNode
	FROM Locations AS Children
	INNER JOIN EntityParents AS Parent ON Parent.ID = Children.ParentID
	WHERE Parent.IsLocalRootNode = 0
)

SELECT TOP (100) PERCENT
	ID,
	ParentID,
	Depth
FROM EntityParents
GO




Querying against the Views
Now that we have our Views written, we can start looking into some common queries. The most common queries tend to be selecting the tree under a particular node, the parenting above a particular node, and the parents and children of a particular node. Note that I have not included a query to select the entire tree, since we can just SELECT * FROM LocationTree
-- Select a node and it's children
SELECT
	Tree.IndentedLabel
FROM LocationTreeView Tree
INNER JOIN ReportingLocationsEntityParentsView EP
	ON EP.ParentID = @Node
	AND EP.EntityID = Tree.ID
ORDER BY SortKey


-- Select a node and it's parents
SELECT
	Tree.IndentedLabel
FROM LocationTreeView Tree
INNER JOIN ReportingLocationsEntityParentsView EP
	ON EP.EntityID = @Node
	AND EP.ParentID = Tree.ID
ORDER BY SortKey


-- Select a node and it's children and parents (slow method)
SELECT
	Tree.IndentedLabel
FROM LocationTreeView Tree
-- Note that this method is slow due to the use of the OR operand in the INNER JOIN.
-- Disjunctions (ORs) are generally slower than conjunctions (ANDs) and unions.
INNER JOIN ReportingLocationsEntityParentsView EP
	ON (EP.ParentID = @Node AND EP.EntityID = Tree.ID)
	OR (EP.EntityID = @Node AND EP.ParentID = Tree.ID)
ORDER BY SortKey


-- Select a node and it's children and parents (fast method)
-- We need to SELECT the sortkey here, as it's being used to ORDER a UNION set.
-- The slowdown caused by SELECTing two columns is far outweighed by the speed
-- increase from utilising a UNION instead of a disjunction.
SELECT
	Tree.IndentedLabel,
	Tree.SortKey
FROM LocationTreeView Tree
INNER JOIN ReportingLocationsEntityParentsView EP
	ON (EP.ParentID = @Node AND EP.EntityID = Tree.ID)
UNION
SELECT
	Tree.IndentedLabel,
	Tree.SortKey
FROM LocationTreeView Tree
INNER JOIN ReportingLocationsEntityParentsView EP
	ON (EP.EntityID = @Node AND EP.ParentID = Tree.ID)
ORDER BY SortKey




Optimising data retrieval
Once your tree gets up to several thousand nodes, these CTE queries can start running slower. The first method of optimising your queries should be creating indexes on the appropriate columns. I have found that no matter what additional attributes you have on your Locations table, there is one index that you will almost certainly benefit from - a unique non-clustered index on ID and ParentID:
CREATE UNIQUE NONCLUSTERED INDEX IX_Locations_IDParentID
ON Locations (ID, ParentID)



Indexes are a great way of easily speeding up your queries, but sometimes an index isn't enough. In cases like this, the main problem is that, even though we have indexed the table, we really need an index on the CTEs.

We have a problem here, because CTEs cannot be indexed, even if we create the index using the WITH SCHEMABINDING clause. This is very unfortunate, as CTEs are prime candidates for indexing, given that a single query may JOIN against a CTE multiple times.

There is a workaround though! It involves the use of something called a table variable. Just like you can have integer or varchar variables, you can also have table variables. Table variables are good because they can be indexed, though the index must be clustered. Table variables also automatically get flushed at the end of a script execution, so there's no need to do your own garbage collection at the conclusion of a query.

By using iterators wisely, we can circumvent the usage of CTEs by building our own table in a table variable, with full indexing, and all the information that a CTE ordinarily provides:
-- Set @StartNode to the ID of the absolute parent you want to capture
-- Set @StartNode to NULL to capture the tree from the absolute root nodes
DECLARE @StartNode AS int = NULL;

-- Set @MaxRecursions to the relative depth you want to traverse to from the @StartNode
-- Set @MaxRecusrions to NULL to capture the entire depth of the tree
DECLARE @MaxRecursions AS int = NULL;

-- @Counter is used during the recursive iterator to map out the depth of nodes
DECLARE @Counter tinyint = 0;

-- @Indent is the string you wish to indent with.  I have chosen a double space as the indent
DECLARE @Indent AS nvarchar(8) = '  ';

-- @SortKeySeparator is the character used as a separator between node descriptions
-- The pipe symbol is a good choice, as it resides after most common ASCII characters, and is rarely used in descriptions
DECLARE @SortKeySeparator AS nvarchar(1) = '|';

DECLARE @ParentIDPair table
(
    ID					int					NOT NULL,
	ParentID			int					NULL,
	Depth				tinyint				NOT NULL,
	PRIMARY KEY CLUSTERED (ID, Depth)
)

DECLARE @LocationTree table (
	ID					int					NOT NULL,
	ParentID			int					NULL,
	Depth				int					NOT NULL,
	Name				nvarchar(1024)		NULL,
	SortKey				nvarchar(1024)		NULL,
	Indent				nvarchar(64)		NOT NULL,
	PRIMARY KEY CLUSTERED (ID, Depth)
)

-- Set the root nodes
INSERT INTO @LocationTree (
	ID,
	ParentID,
	Depth,
	Name,
	IndentedLabel,
	SortKey,
	Indent
)
SELECT
	ID,
	ParentID,
	0,
	CONVERT(nvarchar(1024), Name),
	CONVERT(nvarchar(1024), Name),
	CONVERT(nvarchar(1024), Name),
	@Indent
FROM Locations
WHERE (
	@StartNode IS NULL AND ParentID IS NULL
) OR (
	@StartNode IS NOT NULL AND ParentID = @StartNode
)

-- Set the root node parenting relationships
INSERT INTO @ParentIDPair(ID, ParentID, Depth)
SELECT
	Locations.ID,
	Locations.ParentID,
	@Counter
FROM @LocationTree AS Parent
INNER JOIN Locations ON Locations.ParentID = Parent.ID
GROUP BY Locations.ID, Locations.ParentID

-- Recursively populate the tree
WHILE @@ROWCOUNT > 0 AND (@Counter < @MaxRecursions - 1 OR @MaxRecursions IS NULL)
	BEGIN
    SET @Counter += 1

    INSERT INTO @LocationTree (
		ID,
		ParentID,
		Depth,
		Name,
		IndentedLabel,
		SortKey,
		Indent
	)
    SELECT
		Locations.ID,
		Locations.ParentID,
		@Counter,
		CONVERT(nvarchar(1024), Locations.Name),
		CONVERT(nvarchar(1024), Parent.Indent) + CONVERT(nvarchar(1024), Locations.Name),
		CONVERT(nvarchar(1024), Parent.SortKey) + @SortKeySeparator + CONVERT(nvarchar(1024), Locations.Name),
		Parent.Indent + @Indent
    FROM Locations
    INNER JOIN @ParentIDPair AS Pairing ON (Locations.ID = Pairing.ID)
    LEFT OUTER JOIN @LocationTree AS Parent ON (Parent.ID = Locations.ParentID)
    WHERE Parent.Depth = @Counter - 1

    INSERT INTO @ParentIDPair (ID, ParentID, Depth)
	SELECT Locations.ID, Locations.ParentID, @Counter
	FROM @LocationTree AS Parent
	INNER JOIN Locations ON Locations.ParentID = Parent.ID
	WHERE Parent.Depth = @Counter
	GROUP BY Locations.ID, Locations.ParentID
END

SELECT *
FROM @LocationTree
ORDER BY SortKey



I have found that, in practice, this table variable usage tends to speed up tree-building by around 75%. I normally work with trees that have thousands to hundreds of thousands of rows. On my development box, this query can build a tree with 5000 nodes, 15 attributes per node (5 calculated) in around 500ms. Compare this with the 2500-3000ms build time of a CTE, it is well worth it the trade-off of having to embed the table variable generation before a large query that utilises the tree.

One thing that has to be kept in mind with the use of table variables, is that they have to be explicitly cast during JOINs by using the AS keyword. If we don't rename the table variable, MSSQL has a bit of a whinge, so we need to do something like this:
SELECT
	-- Whatever rows you require
FROM Persons
INNER JOIN @LocationTree AS LocationTree
	ON LocationTree.ID = Persons.LocationID



Hope you've enjoyed the tutorial, and happy hierarchical querying ^^

Is This A Good Question/Topic? 5
  • +

Replies To: Developing robust hierarchical data in MSSQL

#2 Jstall  Icon User is offline

  • Lurker
  • member icon

Reputation: 434
  • View blog
  • Posts: 1,042
  • Joined: 08-March 09

Posted 04 July 2012 - 04:39 AM

Excellent tutorial, very informative and well written. Well done!
Was This Post Helpful? 0
  • +
  • -

#3 Goodfix86  Icon User is offline

  • New D.I.C Head

Reputation: 4
  • View blog
  • Posts: 12
  • Joined: 01-July 09

Posted 13 July 2012 - 04:57 AM

Nice work!
Was This Post Helpful? 0
  • +
  • -

Page 1 of 1