15
votes

I have a table like

create table site
(
site_Id int(5),
parent_Id int(5),
site_desc varchar2(100)
);

Significance of the fields:

  • site_Id : Id of the sites
  • parent_Id : Parent id of the site
  • site_desc : though not relevant to the question but it has the description of the site

The requirement is that if I have a site_id as an input, and I need all the ids tagged below the site. For Example:

                    A
                   / \
                  B   C
                / | \ /\
               D  E F G H
              /\
             I  J

All the nodes are the site_Id.

The table contains data like this:

Site_id  | Parent_ID  |  site_desc
_________|____________|___________
 A       |   -1       |   
 B       |    A       |
 C       |    A       |
 D       |    B       |
 E       |    B       |
 F       |    B       |
 I       |    D       |
 J       |    D       |

......

A is the parent of B and C and so on.

If B is the input given then the query need to fetch D, E, I, F, J

It is currently achieved through multiple queries in a loop, but I was thinking to achieve this in a minimum number of queries.

What I am currently doing is::

down vote

The algorithm goes like this :

Initially create a data set object which you will populate, by fetching data from the data base. 
Create a method which takes the parent id as parameter and returns its child nodes if present, and returns -1, if it doesnt have a child. 
Step1: Fetch all the rows, which doesn't have a parent(root) node. 
Step2: Iterate through this result. For example if prod1 and prod2 are the initial returned nodes, in the resultset. 
Iterating this RS we get prod1, and we insert a row in our DataSET obj. 
Then we send the id of prod1 to getCHILD method, to get its child, and then again we iterate the returned resultset, and again call the getCHILD method, till we dont get the lowest node.

I need the best optimized technique within my data model constraint. Feel free to answer if you have any suggestion.
Please suggest anything. Thanks in advance.

10
can't understand the question about "ids tagged below the site"hjpotter92
@T-ShirtDude OP means all of the child (or descendant) sites for a given site.siride
In general, tree models are "difficult" in SQL, but not impossible. Depending on how big your tree is, if you are doing these queries at runtime it could be very slow. In the past I've pre-tagged things on insert or overnight-batch.Sean
As MySQL does not support recursive queries you will have a hard time working with this design. You should consider changing your datamodel using nested sets or a closure table (search for those terms and you'll find loads of information on them)a_horse_with_no_name
@a_horse_with_no_name: I can understand that, but changing the data model is not possible with me, as this is referred in many solution team in my project, and there are as much 500 instances where this table is refereed, I am only asking for the best technique to get the resultsetSashi Kant

10 Answers

10
votes

Unfortunately, if you can't change the data model, and you're using MySQL, you're stuck in a situation where you need recursive queries and you're using a DBMS that doesn't support recursive queries.

Quassnoi wrote an interesting series of blog articles, showing techniques for querying hierarchical data. His solutions are quite clever, but very complex. http://explainextended.com/2009/03/17/hierarchical-queries-in-mysql/

PostgreSQL is another open-source RDBMS, which does support recursive queries, so you could fetch a whole tree stored in the way you show. But if you can't change the data model, I'd assume you can't switch to a different RDBMS.

There are several alternative data models that make it much easier to fetch arbitrarily-deep trees:

  • Closure Table
  • Nested Sets aka Modified Preorder Tree Traversal
  • Path Enumeration aka Materialized Path

I cover these in my presentation Models for Hierarchical Data with SQL and PHP, and in my book SQL Antipatterns: Avoiding the Pitfalls of Database Programming.

Finally, there's another solution that I've seen used in the code for Slashdot, for their comments hierarchies: They store "parent_id" like in Adjacency List, but they also store a "root_id" column. Every member of a given tree has the same value for root_id, which is the highest ancestor node in its tree. Then it's easy to fetch a whole tree in one query:

SELECT * FROM site WHERE root_id = 123;

Then your application fetches all the nodes back from the database into an array, and you have to write the code to loop over this array, inserting the nodes into a tree data structure in memory. This is a good solution if you have many separate trees, and each tree has relatively few entries. It's good for Slashdot's case.

8
votes

Yesterday, I had answered this question which is exactly related to your problem you've described: Out of a given adjacency list, you want to get all child nodes of a particular parent - and perhaps in a one-dimensional array that you can easily iterate over.

You can do this using only one call to the database, but there's sort of a catch: you have to return all rows from the table. MySQL doesn't support recursive queries, so instead, you essentially have to do the SELECTing in the application code.

I'm simply going to reiterate my answer that I linked to above, but basically if you return a result-set (perhaps from PDOStatement->fetchAll(PDO::FETCH_ASSOC) or other methods) in the format of something like:

Array
(
    [0] => Array
    (
        [site_id] => A
        [parent_id] => -1
        [site_desc] => testtext
    )
    [1] => Array
    (
        [site_id] => B
        [parent_id] => A
        [site_desc] => testtext
    )
    [2] => Array
    (
        [site_id] => C
        [parent_id] => A
        [site_desc] => testtext
    )
    [3] => Array
    (
        [site_id] => D
        [parent_id] => B
        [site_desc] => testtext
    )
    [4] => Array
    (
        [site_id] => E
        [parent_id] => B
        [site_desc] => testtext
    )
    [5] => Array
    (
        [site_id] => F
        [parent_id] => B
        [site_desc] => testtext
    )
    [6] => Array
    (
        [site_id] => I
        [parent_id] => D
        [site_desc] => testtext
    )
    [7] => Array
    (
        [site_id] => J
        [parent_id] => D
        [site_desc] => testtext
    )
)

You can retrieve all children/grandchildren/greatgrandchildren/so-on of any site_id (provided you know the id) using this recursive function:

function fetch_recursive($src_arr, $id, $parentfound = false, $cats = array())
{
    foreach($src_arr as $row)
    {
        if((!$parentfound && $row['site_id'] == $id) || $row['parent_id'] == $id)
        {
            $rowdata = array();
            foreach($row as $k => $v)
                $rowdata[$k] = $v;
            $cats[] = $rowdata;
            if($row['parent_id'] == $id)
                $cats = array_merge($cats, fetch_recursive($src_arr, $row['site_id'], true));
        }
    }
    return $cats;
}

For example, say you wanted to retrieve all children of site_id D, you would use the function like so:

$nodelist = fetch_recursive($pdostmt->fetchAll(PDO::FETCH_ASSOC), 'D');
print_r($nodelist);

Would output:

[0] => Array
(
    [site_id] => D
    [parent_id] => B
    [site_desc] => testtext
)
[1] => Array
(
    [site_id] => I
    [parent_id] => D
    [site_desc] => testtext
)
[2] => Array
(
    [site_id] => J
    [parent_id] => D
    [site_desc] => testtext
)

Notice we retain the information of the parent, along with its children, grandchildren, and so on (however deep the nesting goes).

5
votes

Check out the nested set model, if you want to be able to do this in single queries: http://mikehillyer.com/articles/managing-hierarchical-data-in-mysql/

Another alternative is to include all relationships in a linking table. So every site would have a link to its parent, its grandparent, etc. Every relationship is explicit. Then you just query that linkage table to get all descendants.

3
votes

To begin with, i would recommend a bit different method for storing a tree: Closure Table. If you want to know more about it, you could find SQL Antipatterns book quite interesting.

That said. The easiest way, in my opinion, to generate such structure is this: http://jsbin.com/omexix/3/edit#javascript

I hope you have no problems reading JavaScript code. I used it because creation of unclassified objects in JavaScript doesn't look so hack-ish. It is possible to implement same without relaying on object (or references) by using multidimensional array, but it kinda looks confusing.

Here is what algorithm does:

  • we loop through the list of nodes, once
  • if node's parent does not exits, placeholder is created in the array
  • if node has no parent, it is place in list of root nodes
  • if node has no placeholder in array, the placeholder is created
  • values from node are assigned to placeholder
  • node is registered to the parent, if it has a parent

This is about it. Basically you generate two lists: with all the nodes, and with only root node.

3
votes

You might want to have a look at the closure table pattern. I found this site informative. As far as I have seen, ther are also several StackOverflow questions about this concept, e.g., here.

2
votes

If you do not update your site table often you can use the following strategy:

create table site
(
site_Id int(5),
parent_Id int(5),
site_desc varchar2(100),
parents_path varchar(X)
);

parents_path equals path to selected node from root. For example, for leaf J it should be |A|B|D|.

Pros: - you will need single query to get the result;

Cons: - more query during updates (but you can do updates wisely);

Hope it helps

2
votes

Others have already proposed how to do this with slight modifications to the table structure.

If you do not want to modify the structure (even if this would be best), then you can do like this:

  • SELECT * FROM site ORDER BY Parent_ID, Site_id;

It can usually be safely assumed that once assigned, IDs do not change; if the IDs do not get shuffled around, i.e., node C is not moved under node B, then it will be true that child nodes have always higher IDs than their parents, and the sorting above will guarantee that all parents gets fetched before their children.

So these are the hypotheses:

- we prefer not to change the table layout
- we never change the IDs once assigned
- we never reorder the tree, moving IDs around

Therefore, it becomes possible to create the tree in memory (and even reduce the query itself adding a WHERE Site_ID >= B).

The first node to come through will be B's and will be put into the tree.

All subsequent nodes may be stored in their Parent_ID-th node, which has surely been loaded before.

This would go quite well in Python (you directly modify the parent node).

The request "Get all descendants of B" may be answered in PHP like this:

$nodes  = array( $parent_id );

$cursor = SQLQuery("SELECT * FROM site WHERE Site_ID > ? "
        .  "ORDER BY Parent_ID, Site_Id ;", $parent_id);

while ($tuple = SQLFetchTuple($cursor))
    if (in_array($tuple['Parent_ID'], $nodes))
        $nodes[] = $tuple['Site_Id'];
SQLFree($cursor);

// The first node is the global parent, and may be array_shift'ed away
    // if desired.

Another way
quite brute force

Another possibility is to store the "descendant_of" relation recursively in another table:

    TRUNCATE descendants;
    INSERT INTO descendants ( node, of ) VALUES ( -1, NULL );

    INSERT INTO descendants SELECT SiteId, ParentId FROM site JOIN
           descendants ON ( site.ParentId = descendants.of );

And repeat the INSERTs until number of inserted rows is equal to zero (or total number of rows in descendants stops increasing; querying table size is very fast in most DBs).

At this point you will have stored all one-level relations. Now:

INSERT IGNORE INTO descendants SELECT s1.node, s2.of FROM
           descendants AS s1 JOIN descendants AS s2 ON (s1.of = s2.node);

...again until descendants stops increasing (it will require a number of inserts equal to the maximum number of levels). Total number of JOINs will be twice the number of levels.

Now if you want to get all descendants of node 16, you simply query

SELECT node FROM descendants WHERE of = 16;
2
votes

You can create a stored procedure for this.

Here is my implementation in mysql

DROP PROCEDURE IF EXISTS SearchTree;
DELIMITER go

CREATE PROCEDURE SearchTree( IN root CHAR(1) )
BEGIN
  DECLARE rows SMALLINT DEFAULT 0;
  DROP TABLE IF EXISTS reached;
  CREATE TABLE reached (
    site_Id CHAR(1) PRIMARY KEY
  ) ENGINE=HEAP;
  INSERT INTO reached VALUES (root);
  SET rows = ROW_COUNT();
  WHILE rows > 0 DO
    INSERT IGNORE INTO reached 
      SELECT DISTINCT s.site_Id 
      FROM site AS s 
      INNER JOIN reached AS r ON s.parent_Id = r.site_Id;
    SET rows = ROW_COUNT();
    DELETE FROM reached 
      WHERE site_Id = root;
  END WHILE;
  SELECT * FROM reached;
  DROP TABLE reached;
END;
go
DELIMITER ;
CALL SearchTree('B');

It returns the expected result.

2
votes

Based on your comments here I am assuming you are unwilling to change the existing data model because hundreds of applications are using that (and will break if you replace it with something else).

The root of the problem is that for any site, we only know it's direct parent, so we need to lookup the parent of that parent recursively until we found the root site.

If you can get away with a limit to the depth / level to which sites can be nested, you can write one great query that does all the work for you and probably isn't even that slow to boot. Most overhead from firing queries comes from setting up the connection, network bandwidth etc. MySQL can be very quick.

Firing multiple queries multiplies all the overhead, so we don't want that. Doing a SELECT * and then computing in the application logic means you will fetch all data every time, maximizing network overhead, so we don't want that.

If a limit to the depth of the tree is acceptable, you can combine multiple queries into one huge query that does all the work and returns the exact resultset you need. As an example I used your data, but with A, B, C etc replaced with 1, 2, 3 (as your columns are int).

To get all direct children of the root node (with site_id = 1) do this:

select site_id from site where parent_id = 1

To get the grandchildren of the root node, do this:

select grandchild.site_id 
from site grandchild, site child 
where grandchild.parent_id = child.site_id 
and child.parent_id = 1

To get the great-grandchildren of the root node, do this:

select greatgrandchild.site_id 
from site greatgrandchild, site grandchild, site child 
where greatgrandchild.parent_id = grandchild.site_id 
and grandchild.parent_id = child.site_id 
and child.parent_id = 1

To get all descendants of the root node, just combine the above queries into one huge query, like this:

select site_id
from site
where site_id in (
    select site_id 
    from site 
    where parent_id = 1
)
or site_id in (
    select grandchild.site_id 
    from site grandchild, site child 
    where grandchild.parent_id = child.site_id 
    and child.parent_id = 1
)
or site_id in (
    select greatgrandchild.site_id 
    from site greatgrandchild, site grandchild, site child 
    where greatgrandchild.parent_id = grandchild.site_id 
    and grandchild.parent_id = child.site_id 
    and child.parent_id = 1
)

I think you see how this is working. For each extra level, create a query that finds the nodes which are that many levels away from the site you are searching descendants for and add that query to the super-query with an extra 'or site_id in ()'...

Now as you can see, just for three levels, this already becomes a big query. If you need to support, say, 10 levels, this query will become huge and all the OR's and IN's in it will slow it down... However, it still will probably be faster then just getting everything or using multiple queries. If you need to support an arbitrary amount of possible levels than this query cannot help you. It would have to become infinitely large. In that case all that remains is to use a better way...

That said, and before you copy paste this and start coding, there is a way that avoids such huge queries, supporting arbitrary depths and without breaking backward compatibility. It does require a change to the data model but it's a small one that won't hurt the other programs using this data model. There is in short...

A BETTER WAY

Add an extra column parent_paths, using something like ravnur mentioned in his answer to encode the full path from each node all the way up to the root

Fill that column dynamically using triggers on insert, update and delete. You are now maintaining redundant data. It won't hurt other programs but can give a significant performance benefit for yours. Make sure your triggers are bullet-proof though (that's the hardest part probably) as the data in the extra column should always be in synch with the regular data in the table

Use a short and sweet query like the one ravnur showed that looks for the occurance of the site_id at any place in the parent_paths column to directly get all descendants of the site with that site_id without any recursion.

1
votes

i also asked myself how to query relationships recursively, and my brain generated this solution (:

SELECT * FROM
(
    SELECT t2.* FROM table t1, table t2 where t2.parent = t1.id OR t2.parent 0 GROUP BY t2.id, t2.parent
) as all_relations
WHERE all_relations.parent >= '_the_id_'

# if you dont want a subtree use only the inner select

I am not 100% sure, but i think as long as the id is auto incremented and a child never has a smaller id as his parent (that should be the normal case), then this could be a solution?