Thomas 的个人资料Thomas Kejsers Blog日志列表网络 工具 帮助

日志


2006/7/10

Doing FIRST and LAST aggregates in SQL Server 2005

Users of Microsoft Access may be familiar with the aggregation functions FIRST and LAST. Basically, what you want from these aggregates is to scan the tables in a sorted order. The first or last value encountered in each group is send to the output - much like the existing SQL Server MIN/MAX function.

 

Let me illustrate with some pseudo code:

 

SELECT Col

  , FIRST(Value) AS FirstV

  , LAST(Value) AS LastV

  , SUM(Value) AS SumV

FROM Table

GROUP BY Col

 

On a table with these rows:

Col Value
A 3
A 2
A 6
B 5
B 1

...You want this output:

Col FirstV LastV SumV
A 3 6 11
B 5 1 12

 

Notice how there in an implicit assumption of a row ordering. You probably want some sort of identity column to order your rows. Implicitly - this is a very ISAM way of looking at the world. Now, how do we do this inside SQL Server?

 

Lets make some test data based on AdventureWorksDW:

 

SELECT s.*, IDENTITY(int, 1,1) AS ID

INTO #Agg

FROM dbo.FactInternetSales s

CROSS JOIN (SELECT TOP 30 * FROM sys.objects) scaleUpFactor

 

CREATE

UNIQUE INDEX IX_FirstLast ON #Agg (ProductKey, ID, OrderQuantity)

 

Well... My colleage Jesper Rasmussion came up with a brilliant answer (try this on the testdata from above):

 

SELECT

I.ProductKey

  , F.OrderQuantity AS FirstQuantity

  , L.OrderQuantity AS LastQuantity

  , I.SumQuantity AS SumQuantity

FROM

  (SELECT ProductKey

   , min(ID) AS FirstID

   , max(ID) AS LastID

   , sum(OrderQuantity) AS SumQuantity

   FROM #Agg

   GROUP BY ProductKey) I

INNER

JOIN #Agg F ON F.ID = I.FirstID AND F.ProductKey = I.ProductKey

INNER

JOIN #Agg L ON L.ID = I.LastID AND L.ProductKey = I.ProductKey

 

This query has some very interesting properties. If you check the executions statistics you will see that SQL Server only does very few I/O requests in the index created (not more than 10-20% more than it takes for a full scan - probably around log(size(#agg) ). Since all rows HAVE to be visited to answer the requiest - this means that the query is very efficient.

 

Now, if only we could make a first and last aggregate that does just one table scan... :-)

 

How fast can SQL Server 2005 possibly be

I am currently making a relatively deep (read: nerdy) study of how to optimally load a huge data warehouse.

Based on my experience with ETL tools, SQL tuning and index optimization I know that there are several approaches one may take when loading a billion row data warehouse with million row dimension tables. However, very few of them are optimal or even viable...

On question that must be answered to find the optimal stratagy is the following:



Let us assume we have an optimally configured SQL Server system (all sp_configure settings and trace flags optimized for our hardware).

Futhermore, assume we have the following:

  • A 1 GB Full-Duplex network card with the best drivers available
  • The best disk system money can buy (Real spindles - not Solid state)
  • The fastest CPU of each class: x86, x64 or IA-64 architecture (there is no parallel execution happening here - yet)
  • Plenty of RAM allocated to sqlserv.exe

Now, let us run these three simple statements against our warehouse database:

  • SELECT SurrogateKey FROM DimensionTable WHERE EntityKey = @ek
  • UPDATE DimensionTable SET EntityKey = @ek WHERE SurrogateKey = @sk
  • INSERT INTO DimensionTable (SurrogateKey, EntityKey) VALUES (@sk, @ek)

These are (simplified) version of the queries used to respectively make key lookups, type 1 and type 2 dimension changes

Assume the following about the execution of the above statements

  • We are using the OLEDB driver subsystem to communicate with SQL Server
  • Optimal indexes are in place to support all queries
  • No page splitting occurs
  • The Fillfactor is 100% in the indexes
  • The query plan for all statements is in the plan cache
  • The SELECT can be serviced from the buffer pool by looking at only 3 pages in the index B-tree (optimistic guess)
  • The UPDATE and the INSERT each require only one I/O operation (optimistic guess)

For our warehouse architecture we can consider two viable scenarios:

  • Statements executed from another machine on the network to the server (crossing the 1GB Full-Duplex link)
  • Statements executed from same machine (doing in memory movement where possible)

Now my question is: How fast can these statements be if we have the best software, tuning and hardware available?...

Or in Data Warehouse terms:

"What is the absolute lowest time used pr. row when loading a dimension table using a naïve (read: straightforward) approach to dimension loading"


BTW: (An example of such a naïve approach is the one employed by the SSIS 2005 SCD transformation)

In a later post I will explore a non-naïve approach which i consider optimal for dimension loading for dimension of any size.