Tuesday, July 8, 2008

Retrieving data faster in Oracle: how useful can indexes be?

It is not as straightforward as it seems: indexes are NOT a magic solution to speed up queries. Of course, they are quite useful and, of course, they may (and probably will) speed up your query. But...
The most common type of index is the B*-tree index (no, "B-tree" does not stand for "binary tree". Please refer to http://en.wikipedia.org/wiki/B_tree and http://en.wikipedia.org/wiki/Binary_tree). As a matter of fact, in the Oracle world, when someone talks about an "index" without any further specification, he or she is probably referring to a B*-tree index. So, from now on, I'll be using "index" as a synonym for "B*-tree index".
An index is a data structure that presents the database information in a particular order, regardless of how it is actually stored. Since relational tables are physically stored in a heap area, reading through an index forces a scattered read of the data. An ordered view of data that is stored unordered causes several redundant reads, because a single data block will probably be retrieved several times. That makes sense if you keep in mind that a single data block may (and probably will) store many rows. In fact, on average, Oracle will read as many blocks as the number of rows it has to retrieve, even if many of those rows are stored in the same block. In Oracle's jargon, a single block will be touched many times. This is not quite intuitive, but it will help you understand the way the optimizer makes decisions.
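If you want to watch this block touching happen, SQL*Plus can report it. The sketch below is just a minimal illustration: the table T, the index T_IX and the column C are hypothetical names, and the INDEX hint is only there to make sure the optimizer actually goes through the index.

SET AUTOTRACE TRACEONLY STATISTICS

-- T, T_IX and C are hypothetical names; the hint forces the index range scan
SELECT /*+ INDEX(t t_ix) */ *
FROM   t
WHERE  c BETWEEN 1 AND 20000;

-- In the statistics output, the "consistent gets" figure approximates how
-- many times data blocks were touched to answer the query.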
As a consequence of all of the above, the efficiency of an index will necessarily be linked to three factors: the size of the table in rows and in data blocks, the size of the block in bytes and the size of the row in bytes. Too many variables for those who, just like me, strongly believed that indexes simply made things happen faster! I do not know about you, but if I can model something through an equation, I will, because it may simplify the analysis. So, I'll present a mathematical perspective on the subject.
Let nrT be the number of rows in the table, nbT the number of blocks, nrA the number of rows to be retrieved, srT the size of a row in bytes (assumed to be constant) and sbT the size of a data block in bytes. From these definitions, Equation 1 arises naturally.

(1) nbT = nrT / (sbT/srT) = (nrT · srT) / sbT

As stated before, since the heap area stores rows in no particular order and retrieving them in index order requires a scattered read, we may safely assume that the database will touch as many blocks as the number of rows it has to retrieve, despite the fact that a single block could hold several rows we are interested in.
The ratio sbT/srT is the number of rows per data block in the table. Let's say a given query must return a subset nrA (number of rows affected) of all the rows available (e.g., because some constraint is specified in the WHERE clause). From the above, the number of blocks affected (nbA) will probably be the same as nrA (or, at least, very close to it).

(2) nbA = nrA

Now, let's take a look at Equation 1. It is clear from this algebraic model that the smaller the ratio sbT/srT, the greater the number of blocks (nbT) the table will have to have. For the same block size (sbT), if you increase the size of the row (srT), you will need more blocks (nbT) to store the same number of rows (nrT).
These simple mathematical facts lead to an amazing conclusion, as shown below.
Borrowing an example from the Oracle guru Tom Kyte, let's say you have a block size of 8K (sbT = 8K), a row size of 800 bytes (srT = 800) and 100,000 rows in a given table (nrT = 100,000). Suppose you have to retrieve 20% of the rows (nrA = 20,000 rows). You will have a ratio sbT/srT = 8K/800 = 10 rows per block, resulting in 100,000/10 = 10,000 blocks to store all the rows in the table. Remember: that is a very simplified scenario, because every block has an overhead (http://download-uk.oracle.com/docs/cd/B19306_01/server.102/b14220/logical.htm). As shown in Equation 2, in order to retrieve 20,000 rows the database will have to read ("touch") 20,000 data blocks (nbA = 20,000). As the table has only 10,000 blocks, it is as if every block were touched 2 times.
In the same situation, if the row size were decreased to, say, 80 bytes, the scenario becomes singularly different: the ratio sbT/srT would grow to 8K/80 = 100 rows per block, and we would have only 1,000 blocks in the entire table. Again, by Equation 2, the database will touch 20,000 data blocks. However, this time there are only 1,000 data blocks in the table; hence, it is as if every block were touched 20 times.
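Just to double-check the arithmetic, the whole example condenses into a single query you can run against DUAL (a sketch only; sbT is approximated as 8,000 bytes to match the rounded figures above):

SELECT 8000/800                  AS rows_per_block_wide,     -- 10 rows/block (srT = 800)
       100000/(8000/800)         AS table_blocks_wide,       -- 10,000 blocks in the table
       20000/(100000/(8000/800)) AS touches_per_block_wide,  -- each block touched ~2 times
       8000/80                   AS rows_per_block_narrow,   -- 100 rows/block (srT = 80)
       100000/(8000/80)          AS table_blocks_narrow,     -- 1,000 blocks in the table
       20000/(100000/(8000/80))  AS touches_per_block_narrow -- each block touched ~20 times
FROM   dual;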
The conclusion is that there is no "magic number" of rows that leads to the use of an index (something like "if you are going to retrieve less than X% of the rows, you have to have an indexed field"), because you also have to take the size of the row into account. Is it amazing or not? For those who used to think of indexes as if they were a kind of throttle lever, as I did for a long time, yes, it is amazing. But remember: Equation 2 is taken as a premise for all this reasoning, BUT IT MAY NOT BE THE CASE if rows with contiguous key values are stored together, as is likely to happen with auto-incremented primary keys. In that case, a single block may store rows whose index key values are contiguous and, since data blocks are cached in the buffer cache, Equation 2 becomes a false assumption. Hence, there is one more conclusion: do not trust heuristic rules too much; test them in your environment instead.
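By the way, Oracle itself keeps a measure of how far a given index is from the Equation 2 premise: the CLUSTERING_FACTOR column of USER_INDEXES (statistics must be up to date for the numbers to mean anything). A simple sketch, assuming a hypothetical table named T:

SELECT i.index_name,
       i.clustering_factor,
       t.blocks,
       t.num_rows
FROM   user_indexes i
       JOIN user_tables t ON t.table_name = i.table_name
WHERE  i.table_name = 'T';

-- If CLUSTERING_FACTOR is close to BLOCKS, rows with contiguous key values
-- live in the same blocks and Equation 2 overestimates the work; if it is
-- close to NUM_ROWS, the reasoning above holds and each retrieved row costs
-- roughly one block visit.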