
I ran a test on a query to compare performance, and I found that the query without a clustered index is faster. Why?

The query is below

select A.col1, B.col2, B.col3 from table1 A inner join table2 B on A.col1 = B.col1

The performance is

Hash Match (no index on either table, or an index on only one)

(913271 row(s) affected)
Table 'Table B'. Scan count 5, logical reads 18681, physical reads 193, read-ahead reads 18681, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Table B'. Scan count 5, logical reads 57798, physical reads 4, read-ahead reads 57798, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

(1 row(s) affected)

SQL Server Execution Times: CPU time = 3665 ms, elapsed time = 9391 ms.

Total time: 9 sec


Merge Join (both tables have a unique nonclustered index)

(913271 row(s) affected)
Table 'Table B'. Scan count 1, logical reads 18723, physical reads 6, read-ahead reads 18727, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Table B'. Scan count 1, logical reads 56811, physical reads 21, read-ahead reads 56921, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

(1 row(s) affected)

SQL Server Execution Times: CPU time = 1466 ms, elapsed time = 14881 ms.

Total time: 14 sec


Update statistics, rebuild indexes, and try again. Note that in both cases you're using an index scan; you're not really using the index as an index. Also, the first query is parallelized, while the second isn't. – Luaan
Please post the actual execution plan XML using Pastebin or some link. – TheGameiswar

2 Answers


The relative performance of the two index types depends heavily on the distribution of values in the tables involved. Both index types favor situations that let them avoid reading new blocks, re-use blocks that are already cached, profitably exploit "read-ahead" strategies, and so on. But their practical ability to do this depends on the data, as well as on the particular operation(s) being performed.

Part of your application design should be a pragmatic examination of whether one approach (if any) is clearly superior to the other in your situation. There is really no de facto winner; if there were, the other index type would have been abandoned long ago.

A single, isolated resource-consumption test is not sufficient: you must consider all angles, including the time required to perform various operations (not just one), the influence of data volume, and so on.

1
votes

I feel there are two questions here. To understand what a merge join versus a hash join is, simply consult the MSDN documentation. The second question, as I see it, is that you're misusing SQL Server while trying to understand the difference between query plans.

I'll answer the second one (Google "Hash Join" to answer the first). Query performance depends on the data types of the joining columns and on how many rows fit in each page.

BUT! The big issue here is that you are dumping the whole table (the parallelism icon in your plan tips me off to a bad query). SQL Server is simply trying to find the fastest way to swim through the whole thing and pump out the data. My question to you: does your app really mean to dump all the data? Or will there be more to the JOIN or WHERE clause? You are attempting to optimize for an unrealistic query.
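For example, a more realistic query would filter rather than return every row. A sketch, reusing the question's placeholder table and column names (the range predicate is purely illustrative):

```sql
-- With a selective predicate, the optimizer can consider an index seek
-- instead of scanning both tables end to end.
SELECT A.col1, B.col2, B.col3
FROM table1 AS A
INNER JOIN table2 AS B
    ON A.col1 = B.col1
WHERE A.col1 BETWEEN 1000 AND 2000;  -- example range; use your real workload's filter
```

Tuning against a query shaped like your actual workload will produce very different (and more useful) plans than tuning a full-table dump.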

What you are seeing is a different query plan being generated based on the statistics (histograms) of the data distribution. For some reason, SQL Server "thinks" that plan is best. There is apparently skew in the data, and SQL Server believes it is faster (on average) to scan the whole table to get the job done (lower I/O cost). If the data set is small, SQL Server believes it is faster to dump the data than to chase indexes. Or no available index looks helpful; in this case, since you're dumping all the data, SQL Server will most likely prefer the clustered index if present, and in some cases the narrowest index, because its I/O will be smallest.

When indexes are missing (a heap table), SQL Server has nothing to go on; it dumps the tables and does the work blind. Try creating a clustered index on your primary key. Although, in this specific case, it probably won't help, because you're dumping all the data.
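A minimal sketch of that suggestion, assuming table1's col1 is its key (the constraint name is hypothetical; the table and column names are the question's placeholders):

```sql
-- Declaring the primary key as clustered turns the heap into an ordered
-- structure the optimizer can use (CLUSTERED is also the default for a PK
-- on a table that has no clustered index yet).
ALTER TABLE table1
    ADD CONSTRAINT PK_table1 PRIMARY KEY CLUSTERED (col1);
```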

Other things to consider: are the values in col1 unique (1:1 or 1:*), or is the relationship n:n? You need to declare this when creating the table (with a unique index or a primary key). This information is a way for SQL Server to "learn" something about the future data. Everything you do is a way to communicate your intentions to SQL Server so that it can do the right thing.
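For instance, if col1 in table2 really is unique, declaring that (the index name here is hypothetical) tells the optimizer that each outer row can match at most one inner row:

```sql
-- A unique index documents the 1:* relationship for the optimizer,
-- which influences join-strategy and cardinality estimates.
CREATE UNIQUE NONCLUSTERED INDEX UX_table2_col1
    ON table2 (col1);
```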

For now, I'd keep defining your tables and indexes (only as needed; start with the primary key), write some real queries (maybe add an index), and then look at your results. Otherwise you are prematurely optimizing.

The general rule for indexes is: as few as possible, with as few columns in them as possible, thus maximizing their usage. Indexes are structures that must be updated whenever data is added to or modified in the table, so more indexes, with lots of columns in them, will ultimately slow you down. You want only as many as you need, and no more. See also: the story of the three bears.

And another rule for SQL Server: have a clustered index. Heap tables are considered "bad." Yes, there are arguments for having a heap, but IMHO that is a 400-level discussion. To get started, declare a primary key and cluster on it.

Good luck.