This tutorial is intended for MSSQL, but the code contents could be modified to work with Oracle, PostgreSQL, and any other database that handles CTEs (Common Table Expressions). Unfortunately, MySQL does not, so handling hierarchical data in MySQL must be achieved using different methodology. To find out if your DB handles CTEs, Wikipedia has an excellent resource available.
What you will need
- An intermediate knowledge of SQL, including use of the WITH clause, indexes, and variables.
- A database to work with where you have CREATE permissions for both tables and views.
- A coffee and a cushy chair.
When we want to create relational (normalised) data in a database, we create multiple tables to accomplish this. Take, for example, the scenario where we want to store the countries of the world, divided into continents. We could use a table schema such as this:
This is all well and good if we know that we only ever want to store continents and countries, but what happens if we want to store states/provinces also? Well, we could create another table for these, but then what happens if we need districts, cities, suburbs, streets, street addresses, buildings, rooms, etc? You can see how the data schema might become extremely cumbersome very quickly.
A table-by-table data schema does not hit the mark when it comes to storing dynamically nested hierarchical data. Also, when it comes to querying this data, it can become extremely problematic for the SQL coder, and even the smallest schema change may have drastic trickle-down effects on applications that utilise the data schema.
We can avoid the problems associated with table-by-table relations by mapping out a single table schema that is self-referential. By this, I mean that instead of having a "Countries" table that references a "Continents" table (and so on), we simply have a "Locations" table that references itself.
For this, we need to structure our data in a specific manner. Each data node will require these attributes as a minimum:
- An ID
- Its parent's ID
- A name
- A sortkey (optional - the name could be used as the sort key)
Once we have alloted these values to our nodes, we will be able to create an extensive self-referential tree structure that can have dynamic depth. Note that this data is kept in one simple table, so to work with the data we can use Views, CTEs, and even table variables to quickly and easily query the hierarchy structure.
Let's begin by creating the Locations table in the database, along with a foreign key constraint against itself:
CREATE TABLE [dbo].[Locations]( [ID] [int] NOT NULL, [ParentID] [int] NULL, [Name] [nvarchar](1024) NOT NULL PRIMARY KEY CLUSTERED ( [ID] ASC ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY] ) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY] GO ALTER TABLE [dbo].[Locations] WITH CHECK ADD CONSTRAINT [FK_Locations_Locations] FOREIGN KEY([ParentID]) REFERENCES [dbo].[Locations] ([ID]) GO ALTER TABLE [dbo].[Locations] CHECK CONSTRAINT [FK_Locations_Locations] GO
Next, we'll insert some values into the table, so we have a nice structure to work with:
INSERT INTO LocationEntities VALUES (1, NULL, 'North America'); INSERT INTO LocationEntities VALUES (2, NULL, 'Europe'); INSERT INTO LocationEntities VALUES (3, 1, 'USA'); INSERT INTO LocationEntities VALUES (4, 1, 'Canada'); INSERT INTO LocationEntities VALUES (5, 3, 'California'); INSERT INTO LocationEntities VALUES (6, 2, 'United Kingdom'); INSERT INTO LocationEntities VALUES (7, 6, 'England'); INSERT INTO LocationEntities VALUES (8, 7, 'East Midlands'); INSERT INTO LocationEntities VALUES (9, 8, 'Derbyshire'); INSERT INTO LocationEntities VALUES (10, 9, 'Derby');
Running that query will produce the simple data structure that this tutorial will continue to work with. Note that I have set the ParentID of North America and Europe as NULL. This is because we are defining these two Locations as root nodes. Here's a visualisation of that data:
+---North America | | | +---USA | | | | | +---California | | | +---Canada | +---Europe | +---United Kingdom | +---England | +---East Midlands | +---Derbyshire | +---Derby
Creating Views for the hierarchy
So now that we have the data that we are going to work with, how do we work with it?
Well, there are two main tools you will need to work with this data effectively - a Tree View, and an Entity Parents View. Both of these Views will utilise CTEs, but in different manners. One will use top-down recursion, and the other will use bottom-up recursion.
Let's start by creating the Tree View:
-- "Top down" recursion CREATE VIEW [dbo].[LocationTreeView] AS WITH LocationTree( ParentID, ID, Depth, Name, SortKey, Indent ) AS ( -- Start with all the root node Locations SELECT ParentID, ID, 0 AS Depth, Name, CONVERT(nvarchar(1024), Name) AS SortKey, CONVERT(nvarchar(1024), '') AS Indent FROM Locations WHERE ParentID IS NULL UNION ALL -- then recursively add the children SELECT Children.ParentID, Children.ID, Parents.Depth + 1, Children.Name, CONVERT(nvarchar(1024), RTRIM(Parents.SortKey) + '|' + Children.Name) AS SortKey, CONVERT(nvarchar(1024), Parents.Indent + ' ') AS Indent FROM Locations AS Children INNER JOIN LocationTree AS Parents ON Children.ParentID = Parents.ID ) SELECT TOP (100) PERCENT ParentID, ID, Depth, Name, Indent + Name AS IndentedName, SortKey FROM LocationTree ORDER BY SortKey GO
And there we have it, a View that contains the Locations data as a hierarchical tree. SELECTing this view will show you the entire hierarchy, with one row per node sorted by the sort key, which in this case is alphabetical by name over node.
The next View we will create is the Entity Parents View. This View, when combined with the TreeView, enables a vast array of functionality, including SELECTing a part of the hierarchy, determining the depth (distance) between two Locations, and determining traversal paths between Locations:
-- "Bottom up" recursion CREATE VIEW [dbo].[LocationsEntityParentsView] AS WITH EntityParents( ID, ParentID, Depth, IsLocalRootNode ) AS ( -- Add an entry for each entity, with itself as it's 0-depth parent SELECT ID, ID AS ParentID, 0 AS Depth, 1 AS IsLocalRootNode FROM Locations UNION ALL -- Add an entry for each entity, with it's ParentID as it's 1-depth parent SELECT ID, ParentID, 1 AS Depth, 0 AS IsLocalRootNode FROM Locations AS Locations_1 UNION ALL -- Finally, recurse up EntityParents to get the furthest grandparent, which should be the nodes with ParentID IS NULL -- Note that because the root nodes have ParentID IS NULL, this is where we stop the recursion SELECT Children.ID, Parent.ParentID, Parent.Depth + 1 AS Depth, 0 AS IsLocalRootNode FROM Locations AS Children INNER JOIN EntityParents AS Parent ON Parent.ID = Children.ParentID WHERE Parent.IsLocalRootNode = 0 ) SELECT TOP (100) PERCENT ID, ParentID, Depth FROM EntityParents GO
Querying against the Views
Now that we have our Views written, we can start looking into some common queries. The most common queries tend to be selecting the tree under a particular node, the parenting above a particular node, and the parents and children of a particular node. Note that I have not included a query to select the entire tree, since we can just SELECT * FROM LocationTree
-- Select a node and it's children SELECT Tree.IndentedLabel FROM LocationTreeView Tree INNER JOIN ReportingLocationsEntityParentsView EP ON EP.ParentID = @Node AND EP.EntityID = Tree.ID ORDER BY SortKey -- Select a node and it's parents SELECT Tree.IndentedLabel FROM LocationTreeView Tree INNER JOIN ReportingLocationsEntityParentsView EP ON EP.EntityID = @Node AND EP.ParentID = Tree.ID ORDER BY SortKey -- Select a node and it's children and parents (slow method) SELECT Tree.IndentedLabel FROM LocationTreeView Tree -- Note that this method is slow due to the use of the OR operand in the INNER JOIN. -- Disjunctions (ORs) are generally slower than conjunctions (ANDs) and unions. INNER JOIN ReportingLocationsEntityParentsView EP ON (EP.ParentID = @Node AND EP.EntityID = Tree.ID) OR (EP.EntityID = @Node AND EP.ParentID = Tree.ID) ORDER BY SortKey -- Select a node and it's children and parents (fast method) -- We need to SELECT the sortkey here, as it's being used to ORDER a UNION set. -- The slowdown caused by SELECTing two columns is far outweighed by the speed -- increase from utilising a UNION instead of a disjunction. SELECT Tree.IndentedLabel, Tree.SortKey FROM LocationTreeView Tree INNER JOIN ReportingLocationsEntityParentsView EP ON (EP.ParentID = @Node AND EP.EntityID = Tree.ID) UNION SELECT Tree.IndentedLabel, Tree.SortKey FROM LocationTreeView Tree INNER JOIN ReportingLocationsEntityParentsView EP ON (EP.EntityID = @Node AND EP.ParentID = Tree.ID) ORDER BY SortKey
Optimising data retrieval
Once your tree gets up to several thousand nodes, these CTE queries can start running slower. The first method of optimising your queries should be creating indexes on the appropriate columns. I have found that no matter what additional attributes you have on your Locations table, there is one index that you will almost certainly benefit from - a unique non-clustered index on ID and ParentID:
CREATE UNIQUE NONCLUSTERED INDEX IX_Locations_IDParentID ON Locations (ID, ParentID)
Indexes are a great way of easily speeding up your queries, but sometimes an index isn't enough. In cases like this, the main problem is that, even though we have indexed the table, we really need an index on the CTEs.
We have a problem here, because CTEs cannot be indexed, even if we create the index using the WITH SCHEMABINDING clause. This is very unfortunate, as CTEs are prime candidates for indexing, given that a single query may JOIN against a CTE multiple times.
There is a workaround though! It involves the use of something called a table variable. Just like you can have integer or varchar variables, you can also have table variables. Table variables are good because they can be indexed, though the index must be clustered. Table variables also automatically get flushed at the end of a script execution, so there's no need to do your own garbage collection at the conclusion of a query.
By using iterators wisely, we can circumvent the usage of CTEs by building our own table in a table variable, with full indexing, and all the information that a CTE ordinarily provides:
-- Set @StartNode to the ID of the absolute parent you want to capture -- Set @StartNode to NULL to capture the tree from the absolute root nodes DECLARE @StartNode AS int = NULL; -- Set @MaxRecursions to the relative depth you want to traverse to from the @StartNode -- Set @MaxRecusrions to NULL to capture the entire depth of the tree DECLARE @MaxRecursions AS int = NULL; -- @Counter is used during the recursive iterator to map out the depth of nodes DECLARE @Counter tinyint = 0; -- @Indent is the string you wish to indent with. I have chosen a double space as the indent DECLARE @Indent AS nvarchar(8) = ' '; -- @SortKeySeparator is the character used as a separator between node descriptions -- The pipe symbol is a good choice, as it resides after most common ASCII characters, and is rarely used in descriptions DECLARE @SortKeySeparator AS nvarchar(1) = '|'; DECLARE @ParentIDPair table ( ID int NOT NULL, ParentID int NULL, Depth tinyint NOT NULL, PRIMARY KEY CLUSTERED (ID, Depth) ) DECLARE @LocationTree table ( ID int NOT NULL, ParentID int NULL, Depth int NOT NULL, Name nvarchar(1024) NULL, SortKey nvarchar(1024) NULL, Indent nvarchar(64) NOT NULL, PRIMARY KEY CLUSTERED (ID, Depth) ) -- Set the root nodes INSERT INTO @LocationTree ( ID, ParentID, Depth, Name, IndentedLabel, SortKey, Indent ) SELECT ID, ParentID, 0, CONVERT(nvarchar(1024), Name), CONVERT(nvarchar(1024), Name), CONVERT(nvarchar(1024), Name), @Indent FROM Locations WHERE ( @StartNode IS NULL AND ParentID IS NULL ) OR ( @StartNode IS NOT NULL AND ParentID = @StartNode ) -- Set the root node parenting relationships INSERT INTO @ParentIDPair(ID, ParentID, Depth) SELECT Locations.ID, Locations.ParentID, @Counter FROM @LocationTree AS Parent INNER JOIN Locations ON Locations.ParentID = Parent.ID GROUP BY Locations.ID, Locations.ParentID -- Recursively populate the tree WHILE @@ROWCOUNT > 0 AND (@Counter < @MaxRecursions - 1 OR @MaxRecursions IS NULL) BEGIN SET @Counter += 1 INSERT INTO @LocationTree ( ID, ParentID, Depth, Name, IndentedLabel, SortKey, Indent ) SELECT Locations.ID, Locations.ParentID, @Counter, CONVERT(nvarchar(1024), Locations.Name), CONVERT(nvarchar(1024), Parent.Indent) + CONVERT(nvarchar(1024), Locations.Name), CONVERT(nvarchar(1024), Parent.SortKey) + @SortKeySeparator + CONVERT(nvarchar(1024), Locations.Name), Parent.Indent + @Indent FROM Locations INNER JOIN @ParentIDPair AS Pairing ON (Locations.ID = Pairing.ID) LEFT OUTER JOIN @LocationTree AS Parent ON (Parent.ID = Locations.ParentID) WHERE Parent.Depth = @Counter - 1 INSERT INTO @ParentIDPair (ID, ParentID, Depth) SELECT Locations.ID, Locations.ParentID, @Counter FROM @LocationTree AS Parent INNER JOIN Locations ON Locations.ParentID = Parent.ID WHERE Parent.Depth = @Counter GROUP BY Locations.ID, Locations.ParentID END SELECT * FROM @LocationTree ORDER BY SortKey
I have found that, in practice, this table variable usage tends to speed up tree-building by around 75%. I normally work with trees that have thousands to hundreds of thousands of rows. On my development box, this query can build a tree with 5000 nodes, 15 attributes per node (5 calculated) in around 500ms. Compare this with the 2500-3000ms build time of a CTE, it is well worth it the trade-off of having to embed the table variable generation before a large query that utilises the tree.
One thing that has to be kept in mind with the use of table variables, is that they have to be explicitly cast during JOINs by using the AS keyword. If we don't rename the table variable, MSSQL has a bit of a whinge, so we need to do something like this:
SELECT -- Whatever rows you require FROM Persons INNER JOIN @LocationTree AS LocationTree ON LocationTree.ID = Persons.LocationID
Hope you've enjoyed the tutorial, and happy hierarchical querying