The Bix Blog

Monday, October 4, 2010

Left Outer Join

I am not going in detail about the definition of 'Left Outer join' , hopefully every one is aware of that but wanted to mention in this post about a mistake(bug) in a query I had seen somewhere on the net recently.Explained below is a similar situation .

The user wanted to know the total number of employees in each department and used the below query (Q1) for this. If we observe the query-Q1 , Its clear that GROUP BY will be done AFTER the Left Outer JOIN is done in this case and also observe that the Count(*) was used in this query. Actually uisng Count(*) in this context was a bug and becuase of this -we would definitely get AT LEAST one row though were no employees at all in that department in the emp table.The solution is to rewrite Query1 as Query2 - just replace '*' by e.dept_no in the count() .this is a sort of a logical mistake we all often do .

Q1 (Wrong):
SQL> select d.dept_no, count(*) from dept d, emp e
2 where d.dept_no=e.dept_no(+)
3 group by d.dept_no;

   DEPT_NO   COUNT(*)
---------- ----------
         1          1
         6          1
         2        547
         4        456
         5          1
         3       452

6 rows selected.
Elapsed: 00:00:00.04
Execution Plan
----------------------------------------------------------
Plan hash value: 2197210491
---------------------------------------------------------------------------------------
| Id | Operation              | Name         | Rows | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT         |                            |         3 |    18 |    43   (7)| 00:00:01 |
|   1 | HASH GROUP BY                |                             |        3 |    18 |    43   (7)| 00:00:01 |
|* 2 |   HASH JOIN OUTER          |                            | 65535 |   383K|    41   (3)| 00:00:01 |
|   3 |    INDEX FULL SCAN          | SYS_C0024561 |         6 |    18 |     1   (0)| 00:00:01 |
|   4 |    INDEX FAST FULL SCAN| DEPT_NO_IDX | 65535 |   191K|    39   (0)| 00:00:01 |
---------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("D"."DEPT_NO"="E"."DEPT_NO"(+))

Statistics
----------------------------------------------------------
          0 recursive calls
          0 db block gets
         25 consistent gets
          0 physical reads
          0 redo size
        167 bytes sent via SQL*Net to client
        237 bytes received via SQL*Net from client
          2 SQL*Net roundtrips to/from client
          0 sorts (memory)
          0 sorts (disk)
          6 rows processed

Q2(Correct):
SQL> select d.dept_no, count(e.dept_no) from dept d, emp e
2 where d.dept_no=e.dept_no(+)
3 group by d.dept_no;

   DEPT_NO COUNT(E.DEPT_NO)
---------- ----------------
         1                0
         6                0
         2             547
         4             456
         5                0
         3             452
6 rows selected.
Elapsed: 00:00:00.06
Execution Plan
----------------------------------------------------------
Plan hash value: 2197210491
---------------------------------------------------------------------------------------
| Id | Operation              | Name         | Rows | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT         |                           |     3 |    18 |    43   (7)| 00:00:01 |
|   1 | HASH GROUP BY              |                            |     3 |    18 |    43   (7)| 00:00:01 |
|* 2 |   HASH JOIN OUTER         |                              | 65535 |   383K|    41   (3)| 00:00:01 |
|   3 |    INDEX FULL SCAN          | SYS_C0024561   |     6 |    18 |     1   (0)| 00:00:01 |
|   4 |    INDEX FAST FULL SCAN| DEPT_NO_IDX | 65535 |   191K|    39   (0)| 00:00:01 |
---------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("D"."DEPT_NO"="E"."DEPT_NO"(+))

Statistics
----------------------------------------------------------
          0 recursive calls
          0 db block gets
        25 consistent gets
          0 physical reads
          0 redo size
        167 bytes sent via SQL*Net to client
        237 bytes received via SQL*Net from client
          2 SQL*Net roundtrips to/from client
          0 sorts (memory)
          0 sorts (disk)
          6 rows processed

By the way - there is another way of writing this query Q2 using 'Subquery' as shown in Q3:

Q3:
SQL> select dept_no,(select count(*) from emp where emp.dept_no=dept.dept_no) count
2 from dept ;

   DEPT_NO      COUNT
---------- ----------
         3             452
         1                0
         5                0
         6                0
         2             547
         4             456

6 rows selected.
Elapsed: 00:00:00.06
Execution Plan
----------------------------------------------------------
Plan hash value: 2535931830
--------------------------------------------------------------------------------------
| Id | Operation             | Name         | Rows | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT         |                             |     6 |    18 |     1   (0)| 00:00:01 |
|   1 | SORT AGGREGATE             |                           |     1 |     3 |            |          |
|* 2 |   INDEX FAST FULL SCAN| DEPT_NO_IDX | 21845 | 65535 |    39   (0)| 00:00:01 |
|   3 | INDEX FULL SCAN             | SYS_C0024561 |     6 |    18 |     1   (0)| 00:00:01 |
--------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("EMP"."DEPT_NO"=:B1)

Statistics
----------------------------------------------------------
          0 recursive calls
          0 db block gets
       112 consistent gets
          0 physical reads
          0 redo size
       287 bytes sent via SQL*Net to client
       189 bytes received via SQL*Net from client
          2 SQL*Net roundtrips to/from client
          0 sorts (memory)
          0 sorts (disk)
          6 rows processed

See the logical i/os - so which query is efficient ???
By looking at the 'nature' of the two queries Q2 and Q3 - its clear that
Q3: would be 'nice and efficient' if dept is large and we want the 'first records' as soon as possible .
(It would take one department_no from dept table and go to employee table , get the count of employees for the department_no and give to us , then again take one more department_no from dept table and go to employee table ,get the count and give to us and so on ..........)
Q2 :would be efficient - if we want ALL the rows at a time as soon as possible regardless the table size.
(It will get the two tables first ,JOIN the two tables (left outer join) and then count the count the no of employees and output ALL the rows at a time)

Initial response (initial rows) is quick with Q3 but whole response (all rows) is fast with Q2.

In the above example - dept is very small and we were looking for ALL rows really -
so Q2 (left outer join ) is efficient than Q3( subquery).So there is no 'one for all rule' in Oracle - it all 'depends' on many factors which one is good approach and which one is a bad one .

ORACLE and ANSI Syntax for Left Outer JOIN :

ORACLE Syntax :

Q1:

select * from emp e, dept d

where e.dept_no=d.dept_no and d.dept_no=7

Q2:

select * from emp e , dept d

where e.dept_no=d.dept_no (+) and d.dept_no=7

Q3:

select * from emp e , dept d

where e.dept_no=d.dept_no (+) and d.dept_no(+)=7

ANSI Syntax:

Q4:

select * from emp e left outer join dept d

on (e.dept_no = d.dept_no and d.dept_no = 7)

Q5:

select * from emp e left outer join dept d

on (e.dept_no = d.dept_no) where d.dept_no = 7

Observations after looking at the test results that :

1.Q1=Q2 ->Pretty clear that you will get the records from EMP and DEPT where the
                 dept_no=7 only

2. Q3-> Will filter the DEPT table first(dept_no=7) then EMP table will be OUTER
          JOINed with DEPT (every record from EMP table will appear in the out put
        but only the recods with dep_no=7 appear from DEPT table and the rest will be NULLs from DEPT)

3.Q3=Q4

4. Q1=Q2=Q5

I always recommend -Use ANSI Syntax instead of Oracle's own Syntax as ANSI's syntax is Standardised and is supposed to be used irrespective of the Database Vendor , portable and universally accepted and recognised .

Thursday, September 23, 2010

About: Count(*)

I just thought to mention 4 imp points about COUNT(*) function that we use very regularly on a table/data set. I have tested this on 11gR2.

1. How Count(*) works :

This is a function works on a set of rows- to count the no of ALL rows in a data set. When we ask Oracle to get count of the no of rows from a table using Count(*),Oracle will use the index on the primary key (or) index on non-null column of the table to get the count .If there are indexes only (unique /non-unique) on nullable columns (columns with no constraint ' NOT NULL' ) (OR) if there are no indexes at all on the table-Oracle will FULL SCAN the table to get the count of rows.
Basically it does not matter whether the index is unique or non-unique but the 'NOT NULL' constraint matters here when using Count(*).Oracle will use the index created on a column with non-null constraint when to find the count using Count(*) but it is different when we want to find the count of a coloumn values-such as Count(emp_name) [assume emp_name is not primary key] it will simply use the index on emp_name (whether the coloumn is defined as null or not null) if it is there else will do the full scan though there is primary key index.I believe all this is bit of 'common sence' related stuff.

SQL> select count(*) from emp;
COUNT(*)
----------
65535
Elapsed: 00:00:00.01
Execution Plan
----------------------------------------------------------
Plan hash value: 3050108756
------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Cost (%CPU)| Time |
------------------------------------------------------------------------------
| 0 | SELECT STATEMENT       | | 1 | 41 (0) | 00:00:01 |
| 1 | SORT AGGREGATE            | | 1 | | |
| 2 | INDEX FAST FULL SCAN | SYS_C0024562 | 65535 | 41 (0)| 00:00:01 |
------------------------------------------------------------------------------

---Disable the constraint.
SQL> ALTER TABLE emp
2 disable CONSTRAINT SYS_C0024562;
SQL> select count(*) from emp;
COUNT(*)
----------
65535
Elapsed: 00:00:00.03
Execution Plan
----------------------------------------------------------
Plan hash value: 2083865914
-------------------------------------------------------------------
| Id | Operation | Name | Rows | Cost (%CPU)| Time |
-------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 102 (0)| 00:00:02 |
| 1 | SORT AGGREGATE    | | 1 | | |
| 2 | TABLE ACCESS FULL | EMP | 65535 | 102 (0)| 00:00:02 |
-------------------------------------------------------------------
Note that : The Optimizer does not take the help of indexes created on nullable coloumns to get the count of ALL rows in a table .

2. SQL Rewrite :

The optimizer is more intelligent sometimes than us . See below I have written a query :
select count(*) from emp a , dept b where a.dept_no=b.dept_no
(dept_no is fk in the table emp and pk in the table dept)
The Optimizer then said to me that -'Hey , you are kidding , you really want the no of rows in just the emp table only - no need of join with parent'. The optimizer has simply re-written the query as
select count(dept_no) from emp (Note - it is not count(*) but count(dept_no)-join key)

See the below two plans:

SQL> select count(*) from emp a , dept b where a.dept_no=b.dept_no;
COUNT(*)
----------
65535
Elapsed: 00:00:00.03
Execution Plan
----------------------------------------------------------
Plan hash value: 2083865914
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT   | | 1 | 3 | 103 (1)| 00:00:02 |
| 1 | SORT AGGREGATE        | | 1 | 3 | | |
|* 2 | TABLE ACCESS FULL | EMP | 65535 | 191K| 103 (1)| 00:00:02 |
---------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - filter("A"."DEPT_NO" IS NOT NULL)

SQL> select count(dept_no) from emp;
COUNT(DEPT_NO)
--------------
65535
Elapsed: 00:00:00.03
Execution Plan
----------------------------------------------------------
Plan hash value: 2083865914
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 3 | 102 (0)| 00:00:02 |
| 1 | SORT AGGREGATE     | | 1 | 3 | | |
| 2 | TABLE ACCESS FULL | EMP | 65535 | 191K| 102 (0)| 00:00:02 |
---------------------------------------------------------------------------
If there is an index (unique/non-unique) on the join key (dept_no)- It will be used by the Optimizer to get the count

One more point is that -
select count(*) from (select * from emp a , dept b where a.dept_no=b.dept_no)
is same as
select count(*) from emp a, dept b where a.dept_no=b.dept_no
The Optimizer serves them in the same way.

3. Count(any_column) :
(This is column level )

select count(dept_no) from emp;
this will get the count of rows with dept_no column not NULL -the Optimizer will use the index to get the count if exists on that column else do the full scan.

SQL> create index dept_no_idx on emp(dept_no);
Index created.
SQL> select count(dept_no) from emp;
COUNT(DEPT_NO)
--------------
65535
Elapsed: 00:00:00.03
Execution Plan
----------------------------------------------------------
Plan hash value: 2843651152
-------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT       | | 1 | 3 | 36 (0)| 00:00:01 |
| 1 | SORT AGGREGATE           | | 1 | 3 | | |
| 2 | INDEX FAST FULL SCAN | DEPT_NO_IDX | 65535 | 191K| 36 (0)| 00:00:01 |
-------------------------------------------------------------------------------------

4 . Count(*),Count(1),Count(rowid),Count(pk) --> Which is faster ??

The Fact is -All are SAME internally , The Optimizer is smart enough to treat them 'same'.
See below one by one - the COST is same (logical i/o) in all the four cases .
Just use which suites you 'semantically' correct in your context.
I prefer count(*) - when I need the no of all the rows in a set.

Count(*)
----------

SQL> select count(*) from emp;
COUNT(*)
----------
65535
Elapsed: 00:00:00.01
Execution Plan
----------------------------------------------------------
Plan hash value: 3050108756
------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Cost (%CPU)| Time |
------------------------------------------------------------------------------
| 0 | SELECT STATEMENT       | | 1 | 41 (0)| 00:00:01 |
| 1 | SORT AGGREGATE          | | 1 | | |
| 2 | INDEX FAST FULL SCAN | SYS_C0024562 | 65535 | 41 (0)| 00:00:01 |
------------------------------------------------------------------------------
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
153 consistent gets
0 physical reads
0 redo size
337 bytes sent via SQL*Net to client
337 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed

Count(rowid)
----------------

SQL> select count(rowid) from emp;
COUNT(ROWID)
------------
65535
Elapsed: 00:00:00.01
Execution Plan
----------------------------------------------------------
Plan hash value: 3050108756
--------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT     | | 1 | 12 | 41 (0)| 00:00:01 |
| 1 | SORT AGGREGATE          | | 1 | 12 | | |
| 2 | INDEX FAST FULL SCAN| SYS_C0024562 | 65535 | 767K| 41 (0)| 00:00:01 |
--------------------------------------------------------------------------------------
Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
153 consistent gets
0 physical reads
0 redo size
341 bytes sent via SQL*Net to client
337 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed

Count(1)
------------

SQL> select count(1) from emp;
COUNT(1)
----------
65535
Elapsed: 00:00:00.01
Execution Plan
----------------------------------------------------------
Plan hash value: 3050108756
------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Cost (%CPU)| Time |
------------------------------------------------------------------------------
| 0 | SELECT STATEMENT     | | 1 | 41 (0)| 00:00:01 |
| 1 | SORT AGGREGATE          | | 1 | | |
| 2 | INDEX FAST FULL SCAN| SYS_C0024562 | 65535 | 41 (0)| 00:00:01 |
------------------------------------------------------------------------------
Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
153 consistent gets
0 physical reads
0 redo size
337 bytes sent via SQL*Net to client
337 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed

Count(pk)
-------------

SQL> select count(emp_no) from emp;
COUNT(EMP_NO)
-------------
65535
Elapsed: 00:00:00.01
Execution Plan
----------------------------------------------------------
Plan hash value: 3050108756
------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Cost (%CPU)| Time |
------------------------------------------------------------------------------
| 0 | SELECT STATEMENT       | | 1 | 41 (0)| 00:00:01 |
| 1 | SORT AGGREGATE           | | 1 | | |
| 2 | INDEX FAST FULL SCAN| SYS_C0024562 | 65535 | 41 (0)| 00:00:01 |
------------------------------------------------------------------------------
Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
153 consistent gets
0 physical reads
0 redo size
342 bytes sent via SQL*Net to client
337 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed

Tuesday, September 21, 2010

THINK IN 'SETS'

Before posting technical stuff in coming days, I want to mention couple of basic points here .
See below -Tomkyte's Simple Mantra .

You should do it in a single SQL statement if at all possible. And believe it or not, it is almost always possible.
If you can’t do it in a single SQL Statement, do it in PL/SQL—as little PL/SQL as possible! Follow the saying that goes "more code = more bugs, less code = less bugs."
If you can’t do it in PL/SQL, try a Java stored procedure. The times this is necessary are extremely rare nowadays with Oracle9i and above.
If you can’t do it in Java, do it in a C external procedure. This is most frequently the approach when raw speed or using a third-party API written in C is needed.
If you can’t do it in a C external routine, you might want to seriously think about why it is you need to do it.

I feel this is something important specially for a developer .When talking about sql and pl/sql - the point here is to avoid doing something *procedurally* in pl/sql if that can be done in a single SQL statement .
I see many times similar type of the below pseudocode:

For each x in Cursor_C
loop
Select count(*) into v_count from t2 where colm1=x
If v_count >0 then do something .
End loop;

Look at the above code -Firstly - LOOP - this is something to be avoided if it can be ……….because this causes the 'context switches' between SQL engine and Pl/SQL engine . The switch may consume negligible time (approx -0.0001sec) but just assume if the loop repeats 1 million times …definitely the time taken for switches matters. In most of the cases -it is possible to avoid the loops- if we think the results in 'sets' .Secondly -the developer wanted to check the *existence* for a row where colm1=x , he should not have used here count(*).Count(*) is to count the no of rows ,might not be used to check for the existence of a row and moreover the optimizer in this case may need to full scan the table /index to get the count .But in this example -if we use 'exists' operator - it will nicely check for the existence of the row by using a smart technique called- 'semi join' with less effort .This type of statements could be re-written using - EXISTS /IN , MERGE etc depends on the requirement in a single SQL statement in most of the cases .

We can really do magic with SQL if we think the results in 'SETS'...
SO -
THINK IN 'SETS' …………………………..
THINK IN 'SETS '
..
THINK IN 'SETS '
and Avoid procedural logic if it can be done in SQL …………..

If you finally have come to conclusion that something can not be done in SQL - yes , Undoubtedly this is the time to go for PL/SQL but again - here also think in smart way .For example : when I want to retrieve a single row - why should I use explicit cursor ? I use sweet implicit cursor because I feel Oracle introduced implicit cursor just to avoid explicit cursor which may need more code (and which may probably introduce bugs sometimes) and to give some relief to developers when the developer wants to retrieve a single row.
Can I say something here - actually there is no concept of 'implicit' or 'explicit' in oracle database engine point of view, every SQL is an explicit for Oracle database engine. This implicit cursor concept is a pl/sql programming feature -this is a smart technique given to the developers by Pl/sql just to avoid more coding (explicit cursors) when the developer wants to retrieve a single row - once we submit our cursor to pl/sql (implicit or explicit) - every thing will be converted to explicit and sent to Oracle database engine.

Efficient coding depends on how we understand the programming constructs/features and tools and where and when we have to use them ..There is no one for all rule and most of the times it 'depends'... …………………..
I will be discussing some real technical stuff in coming days .