Thomas 的个人资料Thomas Kejsers Blog日志列表网络 工具 帮助

日志


2007/12/3

Moving to SQL CAT Team

I have been very silent on this blog for some time now.
 
The reason is that I have been hired to work for the the SQL Server CAT team. In the future I will be working with (even more of) the most challenging SQL Server installations in the world. My focus on BI will probably continue but I will also apply my OLTP tuning skills again.
 
My technical blog will now be found on: http://sqlcat.com
 
I would like to thank you for your comments and feedback and I look forward to hearing from you in the SQL CAT blog community
 
Sincerely
Thomas
2007/4/18

SQL Server 2005 Standard cannot lock pages in memory

Today, I by accident discovered that SQL Server 2005 cannot lock pages in memory on the Standard Edition. The documentation for this is well hidden away. Look at the knowledge base article:
 
 
Look Under Applies to: you will only see Enterprise Edition. Today I validated this to be true, you cannot prevent SQL Server from being pages out on a Standard Edition.
 
Locking pages i memory is a recommended configuration of SQL Server. You do it by granting the SQL Server Service account the privilege "Lock Pages in Memory" (using gpedit.msc). The effect is that SQL Server will not be swapped out by the operating system - an operation that, if it occurs, can mean a very serious performance degradation.
 
You can validate that your SQL Server is running with locked pages by looking for the message: Using locked pages for buffer pool in the SQL Server log.
2006/12/18

IO Complexity of processing aggregation phase

In my last post I missed the IO complexities of the aggregation phase. My memory of external sorting has become a bit rusty and I once again needed to bring Knuth from my bookshelf.

First of all - if you have not yet tried running perfmon while processing this would be a good time. It is worth noticing that the aggregations phase does not consume much IO. Even on a less than well tuned IO subsystem the aggregation phase is typically CPU bound. For those of you only interested optimizing the speed of the aggregation phase - take another look at your attribute relationships and properties (you can use some of previous posts as inspiration).

From my mail exchange with Eric Jacobsen I interpret that the external sort used by analysis services is a variant of balanced merge. The basic idea of the algorithm is

  1. Read rows from input (in this case the nd rows from the read phase) until memory full
  2. Sort records in memory (for example using quicksort)
  3. Write sorted records to a temp file on disk.
  4. Are there still rows left in the input? If so - goto 1
  5. Read all temp files and merge these to a new - sorted file

Steps 1-3 require nd reads and nd writes. Steps 4-5 require the same amount of reads and writes - furthermore, there is a small lg(q) CPU overhead for merging, with q being the number of temp files.

From the algorithm above we can also deduce that we can save a significant amount of IO if we have plenty of memory for the sort in phase 2. Not surpring - but it is worth noticing that the records size in analysis services tend to be very big. 64-bit memory spaces clearly have a big advantage in the aggregation phase.

Summarizing, we have:

IOREAD(Aggregation phase) = O(2 nd)

IOWRITE(Aggregation phase) = O(2 nd)

2006/12/15

Time complexity of processing in Analysis Services

When I build enterprise scale solution it is crucial for me to know what sort of scalability can be expected from a server product. Is it linear? It is exponential? Should I scale-out or scale-up? Are there hard limits to what the architecture can do?

To understand these limitations and provide valid statements about scalability - you have to know the time complexity of the operations performed by your implementation.

A major challenge for large scale Analysis Services installations is to reduce the processing time for a cube. If we are understand to what constitutes a cube processing we must first look at how processing is done. Processing proceeds in three phases

  1. Read phase - Rows are read from the data source and into Analysis Services.
  2. Aggregation phase - Based on the rows read - aggregations are build.
  3. Index build phase - The aggregations are indexed to provide fast query access

I need to get a deeper understanding of the index phase before embarrassing myself completely (as opposed to partially) in my blog.

For now - lets look at the first two phases in more detail:

Read Phase

The purpose of the read phase is - as the name implies - to read the rows from the relational data source. The rows read form what I call the "base aggregations" - I.e. the leaf level (or key level if you like) data of the cube. It is these base aggregations that are used to answer queries at leaf level. The base aggregations also form the "base" of higher level aggregates in the next phase.

Each row from the relational source is read once. Assuming there are n rows in the source. We find that our number of IO reads must be proportional with:

IOREAD(Read Phase) = O(n)

Each row takes about the same amount of CPU time to read. Hence, our CPU usage is:

CPU(Read Phase) = O(n)

How about the writes, once could assume O(n) - but this would not be completely true:

As you may be aware - a relational fact table can contain "duplicate" data at the leaf level. Let me provide you with a few examples:

Example 1: I go into a supermarket today and buy a bottle of coke. Being a caffeine junkie - I later visit the same store - buying yet another bottle of coke. This will generate two rows in the Point Of Sales system - one of each bottle. Depending on your warehouse design - you may see both of these rows in your relational source - hence observing a "duplicate" row. The interesting data from the cube perspective (assuming no drillthrough) is one row with 2 bottles of coke - not two rows with 1 each

Example 2: You have a relational fact table with 5 dimension columns. However, for you analysis services cube you only want to use 4 of these dimensions. When you read the fact table you will almost certainly read duplicate leaf level data. Why? Compare the two statements:

(1) SELECT DISTINCT D1, D2, D3, D4 , D5 FROM <FactTable>

(2) SELECT DISTINCT D1, D2, D3, D4 FROM <FactTable>

The row count of (2) is most likely smaller than (1) - why else would you even bother with D5 in the first place?

Because the two examples happens a lot in the OLAP world, Analysis Services implements a "merge buffer" . While reading rows from the relational source- this buffer is used to merge such duplicate rows into single rows - thus reducing the storage space needed for the leaf level aggregates.

The merge buffer will not eliminate all duplicates. It has a limited size and will thus only on a subset of the fact table. In the ideal world - where the fact rows enter Analysis Services in sorted order and the merge buffer is large enough, we will eliminate all duplicates.

This brings us back to the number of IO operations needed to write base aggregation. Let nd be the distinct count of the dimension columns we use in the cube. Assuming an optimal reading of the fact table we have:

IOWRITE(Read Phase) = O(nd)

The reduction of the writes compared with the read can be quite significant for later phases of processing. Essentially we are seeing a kind of "pre aggregation" here. I like to think of nd as the size of the "non empty cube space".

Also, notice that there is an upper limit to nd that is unaffected by you fact table size. Assuming your cube dimension, D1, ... , Dn have the following leaf level sizes: |D1|, ... , |Dn|. The upper limit for nd is:

nd <= |D1| *, ... ,* |Dn|

For cubes with a small number of dimensions this upper limit can work to your advantage - as we shall see later.

Aggregation Phase

Once analysis services has stored the data read it must build higher level aggregates.

If you have ever tried managing you own aggregate tables in a relational database you will know that there are two ways to build an aggregate:

  1. Directly from the leaf level (in a relational database this would be the fact table)
  2. Derive the aggregate from another aggregate.

 This is where things get a little complicated - but I will give it a shot. Let's start with the first case - leaf level aggregation building:

Every attribute in you cube is a potential candidate for an aggregation. You control which ones you want to aggregate by a combination of:

  • Setting the attribute properties (see my previous post)
  • Running the storage design wizard. This will create some "statistically correct choices"
  • Using a custom aggregation tool to manually specify which aggregates you want

For the moment - lets keep attribute relationships out of the equation. Let's just assume you have a cube with aggregates designed, one way or the other, for a subset, AAgg, of your attributes.

During the aggregation phase analysis services has to read your leaf level rows (remember, there were nd of these). Reading these rows takes:

IOREAD(Read Leaf level data) = O(nd)

From my conversations with Eric Jacobsen I gather that the aggregation algorithm in AS 2005 is an external disk sort (as compared to the other common alternative - a hash/bucket sort). What does this tell us? Well, for those you that don't already own it - get a copy of Donald E. Knuths masterpiece: "The Art of Computer Programming". From this work we can lookup the complexity of different sort algoritms and how to implement them. I fair assumption is that the external sort uses quicksort on its memory buffer the time complexiy: O(n lg(n)). Each attribute that is aggregated must be sorted - so we have:

CPU(Aggregation Phase) = O( AAgg * nd * lg(nd))

Aggregation phase - summary so far

Time complexity: Not surprisingly - CPU usage in the aggregation phase is proportional to the number of attributes you are aggregating. Notice however, that it is not proportional to the amount of rows in your relational database. Its proportional to the DISTINCT count of dimension values (assuming your merging in the read phase is optimal).

Possible optimization (untested): This is pure speculation - but there may be some benefit of sorting your fact table using a clustered index to optimize the "pre aggregation" in the read phase.

Cubes with a small number of dimension: If you cube has a very small number of dimension with relatively few manbers you may be able to leverage the fact that nd has an upper limit. This is pure theory - but it seems to be that some cubes could always be fully aggregated in reasonable time - irrespective of the number of source rows in the underlying relational table

 

It's time to post this first part. Next part: Attribute relationships and how they are used to build higher level aggregates. 

Creating custom aggregations in Analysis Services 2005

Most of you are probably using the aggregation wizard to create aggregations for your cubes. If you done your homework and set attribute relationships, counts and properties right (please refer to my previous post) - the aggregation wizard will generally do a good job creating the right aggregations.

The exact workings of the storage design wizard are probably best left to a separate blog entry. But there are a few things you should be aware of: To determine which aggregations to build, the storage design wizard will apply the "30% rule". Roughly speaking this means is that an aggregation is only considered for building if it represents the sum of at least 30% of level below it in a natural hierarchy.

As an example, take this hierarchy:

Ordinary Hierarchy

The top level is considered for aggregation: The 3 rows each aggregate an average of 20 / 3 - around: 6.6 rows - this is more than 30% of the 20 rows just below the top level.

The second level is not considered for aggregation: It is almost as fast to scan the 30 rows as it is to scan the almost equal sized 20 rows above them.

There are rare cases when the aggregations designed by the wizard using the 30% rule are not good enough. A real life customer example had a hierarchy with these characteristics:

Skewed hierarchy

The middle level, with 1 million members, is not considered for aggregation by the storage design wizard:  it does not meet the criteria of the 30% rule. However, in this particular case - a large fraction of the customers queries will access the middle level. Furthermore, the distribution between the middle and the bottom level is skewed: there might only be an average of 1.2 members the lowest level for every members at the middle level - but some members have thousands of children - whereas some only have one.

For this particular customer - adding a custom aggregation to the middle level with 1 million members boosted query performance significantly. An optimization not achievable with the storage design wizard.

A word of warning: Creating the right custom aggregations is a task for cube experts. Creating every possible aggregation in your cube is not a good way to boost query performance. In the worst case your cube processing will never finish. Lets just, theoretically, assume you could process your cube with all aggregations design. Locating the right aggregation to use in the universe sized space of aggregations will still take eternity for Analysis Services. The words "eternity" and "universe" are fitting in this context - cube space can be BIG - very very BIG - you really don't want to aggregate all of it.

Assuming you have heeded my words of warning - how do you go about manually adding aggregations to your cube? Though the API is there Microsoft does not supply a tool for inspecting and creating custom aggregations. One way is of course to edit the cube XML files or use XML/A. A daunting task.

But guess what? The developers at Microsoft probably needed a tool for their own internal use - so they build one. This tool is called AggDesign - and I have had the pleasure of using it. According to Eric Jacobsen from Microsoft, we may see this tool shipped with SQL Server 2005 SP2. AggDesign allows you to visually inspect your cube aggregations and add new ones a you see fit. The tools allows you to assign one or more "AggDesigns" to each partition in your cube. These AggDesign add extra aggregations to the ones already defined by the storage wizard. If you are working with a large cube and feel very confident that you need custom aggregations (and you really should be sure) you might be able to get a copy of this tool from your local Microsoft product support.

2006/11/29

Automating build of Integration Services projects

After our experiences with the build process in Analysis Services - expectations were low when we turned our gaze towards Integrations Services (SSIS).

Our overall experience with SSIS has generally been positive - it's an extremely powerful tool with some interesting performance characteristics. There are some annoying flaws - especially in the Visual Studio integration. When I work with SSIS projects - three or four daily crashes of Visual Studio is not uncommon. Never fear - the power provided by the tool more than makes up for this when it comes to choosing the ETL platform. And buggyness is what service packs are for ... isn't it? :)

The first thing you notice when you choose "build" in Visual Studio on an SSIS projects is that Visual Studio doesn't do anything - except of course copy files to the bin directory. Why? Well, guess you could say that the *.dtsx files contain all the information need to run the packages - there is "nothing to be done".

But, when you try to automate a build there actually is something to be done. First of all: you have to mark which files contain the build output. That is, the files you wish to copy to your deployment target. With this in mind - we tried using msbuild on the project - hoping that it would at least mark the *.dtsx files as build output (it could just take a look at the bin directory for us) so Team Foundation can copy them to the release directory. No surprises - integration is non existent and there are no free meals here.

Guess you could say that the command line to automate build of SSIS is simple: just copy the bin\*.dtsx files to your build output directory.

We could have stopped here and be done with it - but we like build logs (if you can use the about the word: "like" in this context). Those of you who have deployed a big BI-installation by hand may have noticed the excellent command line tool; dtexec.exe used to run SSIS packagesThis tool has some very useful parameters that come in handy for the structured developer.

When you execute an SSIS package - two things happen:

  • The package is validated. During this phase, connection managers and logs providers are tested and metadata for all tasks and flow components is checked for inconsistencies
  • The package is executed. There seems to be some sort of compilation going on - part of this is probably done in the validate phase. After this compilation - the package is executed. The compilation part is an educated guess - I cannot believe SSIS can be as fast as it it without first compiling the package to machine code.

Here's the trick: The first phase produces both warnings and errors if you have made a mistakes in your package design. This is a very useful output and can be considered a build log of the package. Fortunately - its possible to invoke the first phase without executing the entire package. You simply do this:

dtexec /F MyPackage.dtsx /Va /Rep EW

/Va makes the package validate but not execute

/Rep EW tells SSIS that you only want to see warnings and errors on stdout. If you don't use the /Rep switch you will clutter your "build log" with useless informational messages.

Summary: to build an integration services project you do two things:

  • Execute dtexec with proper command line switches on all your *.dtsx files. Save stdout from these command lines to your build log.
    • If the build log contains any errors - the sadistic among you may want to email notificaty to the responsible developer. It gives him a chance to get up in the middle of the night and fix the bug :-)
  • Copy all bin/*.dtsx files to you build output directory
2006/11/28

Automating build of Analysis Services Projects

As you may be aware, an Analysis Services project consists of the following files:

  • A project file (*.dwProj)
  • One or more  Data sources (*.ds)
  • One or more Data source views (*.dsv)
  • A database file (*.database)
  • One or more cube files (*.cube) with their partition (*.partitions)
  • One or more dimension files (*.dim)

You can right click the project in Visual Studio and choose "build". What happens now seems to be a lot of called to DLL files executed by the integration between Visual Studio and the Analysis Services. The output of this build is the bin folder in your project directory. This folder contains a database file (*.asdatabase). The .asdatabase file is the one you want - you will see why later.

Unlike other projects - which you can build using their proper compiler (Example: CSC.EXE, VBC.EXE) - Analysis Services does not seem to have a compiler executable.

You could be tempted to run msbuild. Unlike most temptation this will result in disappointment - msbuild does not understand Analysis Services files.

The only way I have found to automate the building of Analysis Services projects is to invoke Visual Studio from the command line. Say you have a project called: MyAsProject.dwProj. To build it and obtain the asdatabase file you execute the following:

devenv MyAsProject.dwProj /rebuild Relase /project MyAsProject

So far so good - now you have automated the createion of the .asdatabase file.

You are not done yet. The asdatabase file provides the input to the Analysis Services Deployment Wizard. The spell cast by this wizard allows you to transform an asdatabase filen into an XML/A file. This XML/A file can be run on your deployment target - which will create the cube database.

From the command line the Deployment Wizard can be run in two modes.

The answer mode:

Microsoft.AnalysisServices.Deployment.exe /a

In this mode, you get to choose all the server specific settings and write them back to your bin directory. Check out the BOL documentation to find out exactly how this is implemented (You will find no defense for the naming of the executable in the documentation)

Once you have your answers stored (which you will only do once) you want to automate the creation of the XML/A file. Execute:

Microsoft.AnalysisServices.Deployment.exe MyAsProject.asdatabase /d /o:MyProject.xmla

Voila! You have your xmla file, which you can copy to your deployment target and run. By the way. xmla files can be run from the command line using ascmd.exe which you can find here: http://msdn2.microsoft.com/en-us/library/ms365187....

Automating the building of BI-Solutions

Coming from a "Old fashioned" development environment (that is: C/C++, CSV, Unix and the likes) I have always wondered why BI-projects don't seem to get the idea of a structured software development.

There are some technological cornerstones of a good, well structured software development process:

  • Source control systems - Never loose code again, unlimited undo. Tag your releases
  • Automated, daily builds - Make sure you project is ready to deploy and release. Never again spend hours on manually building your project
  • Bug tracking and issue logs - Speaks for itself. A necessary tool for proper project management
  • Multiple environments - At the very least: Development, Test and Production.
  • Change management - Well documented and scripted procedures for moving your code into production

I don't know about you - but I have yet to see a Business Intelligence projects use all of the above. Some, actually a lot, of the examples I have worked on did not even use ANY of the above. My theory is that BI-development community is in a sense "immature" to the ways of structured software development.

With Team Foundation Services and the new Visual Studio integration of the Microsoft BI-suite this is about to change. On my current development project I have decided to try a fully structured approach to BI-development with my team.

Team Foundation gives us the following for free (or rather - for a very cheap license):

  • Source control - Microsoft finally got it right. Source Safes successor is brilliant and has the multiple checkout features and proper merge functionality I missed from CVS (another, very good, source control system). Ohh, and did I mention that the source control database does not seem to spontaneously break anymore?
  • Bug tracking and issue logs - Working as intended - and working well. Integrates seamlessly with both Visual Studio and the source control. This means that my developer team can actually be bothered to use it.

We have set up multiple environments and I have borrowed a few change management procedures from my past.

Now, Team Foundations promises to automate our daily build too. It turns out that this is not really working with BI-projects. Team foundation uses a tool called msbuild - which it will execute automatically on solution files. However, msbuild doesn't have any plug-ins for BI-projects :-(

Guess it was a bit too much to ask for anyway - back to basics: scripted builds. We are experimenting with daily build using command line script. Once we have the "building blocks" to build a BI-solution we plan to integrate it with the Team Foundation Build.

2006/11/17

The Many Pitfalls of Attribute Properties in AS 2005

In AS 2000 dimension design, once the database has been loaded, is a 10 minute task. Declare your hierarchies, set up a few member properties, a short look at aggregations settings and off you go. There really was no such thing as a multiple hierarchies. Member properties was a design afterthought - as anyone that has ever designed a virtual dimension can testify. Though MDX feature powerful query abilities - there were still queries that were best served by the relations database.

Enter AS 2005. Dimensions design has become much more powerful. We now have the full flexibility of the relational model build into the UDM. However, this power comes with a heavy price tag.

When you design a dimension in AS 2005, every column in the relational source is a potential attribute. If you leave the properties of these attributes at their default  value - you have set yourself up for a nasty performance surprise. Not only will performance suffer - your interface, should you use any version of Excel pre-2007 will be a mess. 

For each attribute in your dimension you must manually adjust each property to fit your needs. Books Online is a bit vague on the subject - so here's a quick summary:

AttributeHierarchyEnabled

The name of the property is misleading - it really should be called: EnableAggregations. When set to true, the default value, this attribute is considered part of the aggregated cube space and will influence the aggregations design. This also means that you cannot set this value to false if your attribute participates in a user hierarchy - AS 2005 will raise an error if you do.

However, you don't want all attributes to be aggregatable. For example, the street address of a customer is generally not a good candidates for aggregations - its solely there for display purposes. In AS 2000 you would implement this as a member property.

Settings the value of AttributeHierarchyEnable to false tells Analysis Services that this attribute is used only for display purposes and that storage and processing time should not be wasted on building aggregations. If you really don't need these aggregations - setting this value to false helps the aggregation wizard do its work

AttributeHierarchyVisible

If you have set AttributeHierarchEnable to false this setting should be set to false. However, its default value is true. If set to true you attribute will, as the name says, be displayed as a hierarchy in client tools.

What does this mean? Even though the attribute already participates in user hierarchies it will also have its own, separate, hierarchy. In Excel 2003 and earlier this means that you will see one extra dimension for each attribute with this property set to true. This tends to create unnecessary clutter in you interface.

If you are designing solely for older client tools you should consider settings this property to false. If you are blessed with Enterprise Edition you can use the perspective feature. With perspectives this property can be true but you can still hide the hierarchy to older clients.

OptimizedState

This innocent looking property can have a big influence no your processing time.

My understanding is not complete, but let me try to illustrate:

Assume you have a three attribubutes A0, A1, A2 that are related (through an attribute relationship) in a natural hierarchy:

A0 -> A1 -> A2

A0 is the leaf (key) level of the dimension.

During the aggregation phase, Analysis Services will first build aggregation on the leaf level A0. Using the leaf level aggregations the aggregations for A1 can be build (A1 is directly related to the leaf). Since A2 is related to A1 an optimization can be made: The aggregations for  A2 can be now be build using only the aggregation from A1 - without scanning the aggregations for A0.

Assume OptimizedState is set to FullyOptimized (the default) for A2. This instructs Analysis Services to build aggregations specifically for this attribute. The means that A0 is scanned to build aggregations A2. This is of course an expensive operations - the leaf level can be quite big in large cubes. Setting OptimizedState to NotOptimzed will disable this behavior.

One of my customers has reduced his process time from 14 hours to 5 just by disabling this property for selected attributes. It may be based on my incomplete understanding - but I think the default property for this attribute is wrong and you generally would use NotOptimized instead of FullyOptimzed.

What I really don't get about this setting is why it is even needed - especially when you consider the property:

AggregationUsage

This property is used to force a specific behavior in the aggregation wizard for this attribute. The Books Online documentation is pretty good:

ms-help://MS.SQLCC.v9/MS.SQLSVR.v9.en/uas9/html/57...

One question remains though: If I can use AggregationUsage to fine tune my which properties are build with the aggregation wizard - why do I even need OptimizedState?

2006/11/16

Arbitrary Shapes in AS 2005

 

I have recently returned from a performance tuning assignment at one of the largest AS-cube installations in the world. Working on such large installations always teach you a thing or two about the internal workings of the server products in use. Learning this new stuff is what I love the most about performance tuning. My colleague and me had a conference call with the people at Microsoft - and they have been really helpful about sharing their knowledge. I wanted to pass on some of it here.

First of all - let me highly recommend the new ability to trace a query in SQL-Profiler. MDX coders: you will love this feature, it will really boost your efficiency while tuning MDX. One of the events you want to watch out for is "Query Subcube Verbose". This event tells you a lot about the internal working of the AS engine. My knowledge is still pretty sparse - but this is what I have gathered:

MDX queries are evaluated by the Formula Engine which interprets the dimensions axes and combine aggregations as necessary to answer the query. The aggregations are stored in the aggregation data registry which can consult the Storage Engine if it needs new aggregations to populate the cache.

Aggregations are returned from the Aggregation data cache using subcube queries - these are the ones you can trace in profiler. To determine which subcubes to request the Formula Engine uses and a mechanism called "SONAR" (I need to see if any literature is written about this).

Now, the amount of subcubes requested from the aggregation data registry has a large impact on query performance. Subcube queries can either be cached in memory or need retrieval from disk. If cached, a subquery is pretty fast - typically a few millisecond. Of course, these can add up and take time if you have many of them. On the other hand, if your subcube queries need to access the Storage Engine for aggregations they can be quite slow - since this requires I/O requests. The more subcubes you need to retrieve from disk - the larger your I/O load will become - eventually resulting in bad performance. You can use SQL-Profiler (look at the event subclass) to see if a subcube is retrieved from disk or memory.

In summary, it is important that you avoid too many subcubes and make sure you have good cache hits on the ones you have. Fortunately, you are partially in control of this process.

How does this tie into my previous post about multi-select? Well, try to execute and trace the subcubes in the following MDX statement against AdventureWorks:

SELECT
  {[Reseller Order Count]}
ON 0,
  NON EMPTY CROSSJOIN(
    Product.[Product Categories].members
    , DESCENDANTS([Sales Territory].[Sales Territory], [Sales Territory].[Sales Territory].levels(3), SELF_AND_BEFORE )
  )
ON 1
FROM
  (SELECT
    {Product.[Product Categories].levels(1).item(0)
     ,Product.[Product Categories].levels(2).item(1)}
ON 0
FROM [Channel Sales])

Notice the high amount of subcubes needed? Now try to execute it again...

What's wrong here? A lot of the subcubes are still being fetched from disk. Even though the query has already run before.

Before we can understand what is happening here I must define what the people at Microsoft call "An Arbitrary Shape". To understand what an arbitrary shape is, it is perhaps easier to define what a "non-arbitrary shape is"

A Non-Arbitrary Shape is a set, S, of dimension members that match these criteria:

  • All members of S must be in the same dimension hierarchy
  • Let Lmin be the lowest level of the members in S.
  • Let Lmax be the highest level of the members in S.
  • For all members in the, M, in level Lmax the following must be true:
    • All ascendants of the M, up to and including the level Lmin must be in S

Put another way: The value of every member in S can be expressed as the aggregation over the members only from Lmin

I have tried to create a taxonomy of all possible non-arbitrary shapes. These are illustrated below:

Notice that NA1, ..., NA3 can all be expressed using the DESCENDANTS function in MDX. Case NA4 is an entire hierarchy (which is of course non-arbitrary). Case NA5, NA6 is the special variant where LMax = LMin. There is an entire family of MDX functions that allow you to specify NA6 variants.

An arbitrary shape is anything that is NOT a "non-arbitrary shape"

Now, the Formula Engine can evaluate non-arbitrary shapes a lot more efficiently than arbitrary shapes. The work required to evaluate an arbitrary shape in a sub select generally involves a lot more subcube queries than when evaluating a non-arbitrary shape. Furthermore, the arbitrary shapes cause a lot of trouble with the caching of aggregations.

I do not know the exact algorithm used by the formula engine. But one could suspect that non-arbitrary shapes allows the formula engine to build the higher aggregations directly from the lower ones using information about the hierarchies in the cube. Arbitrary shapes clearly prevent this optimization.

There are many arbitrary shapes, but here are a few of them:

So what can you learn from this?

First, since arbitrary shapes are difficult to optimize you should seek to eliminate them your subcube queries. This is often an easy goal to accomplish; try this simple rewrite of the query above:

SELECT
  {[Reseller Order Count]}
ON 0,
  NON EMPTY CROSSJOIN(
    Product.[Product Categories].members
    , DESCENDANTS([Sales Territory].[Sales Territory], [Sales Territory].[Sales Territory].levels(3), SELF_AND_BEFORE )
  )
ON 1
FROM
  (SELECT
    {Product.[Product Categories].levels(1).item(0).children
    ,Product.[Product Categories].levels(2).item(1)}
ON 0
FROM [Channel Sales])

Trace it, and notice the difference.

Secondly, by experimenting with arbitrary shapes we can learn a lot about he internals of the query engine. There is not much documentation to be found, I invite the reader to share his experiences. Even though we would expect all non-arbitrary shape to allow "smart caching" this is not the case. Try the following:

SELECT
  {[Reseller Order Count]}
ON 0,
  NON EMPTY CROSSJOIN(
    Product.[Product Categories].members
    , DESCENDANTS([Sales Territory].[Sales Territory], [Sales Territory].[Sales Territory].levels(3), SELF_AND_BEFORE )
   )
ON 1
FROM
  (SELECT Product.[Product Categories].members
ON 0
FROM [Channel Sales])

Though the sub select quite easily qualifies as a non-arbitrary shapes (perhaps the most simple non-arbitrary shape I can think of) the non-cached disk access still happens - even on a warm cache.

Having done some experimentation I have found that only non-arbitrary shapes that contain members from a single level only are optimized by the query engine. Even though any non-arbitrary shape in theory should trigger a smart caching.

2006/8/7

Factors to consider when performance tuning cubes and MDX

Before we move on the the wonders of subselects I would like to give you a few pointers on tuning MDX queries and cubes in general. During my time as a performance tuner on Analysis Services - these are the most important factors you should consider in a 2005 environment:
 
  • Use NON_EMPTY_BEHAVIOR whenver possible in your calculated members. This has a huge impact on your overall calculation performance. You should be aware of the semantic implications that NON_EMPTY_BEHAVIOR has on you calculation - so think carefully about applying this correctly. I should probably make a post on this later
  • Aggregation design. In most scenarioes in Analysis Services 2000 - this it not hard: Get you dimension counts right, avoid changing and parent/child dimensions, do as many aggregations as you feel comfortable storing and processing and extend as needed with usage based aggregation. In Analysis Services 2005 there are many more settings to consider and tune. I cannot stress enough how important it is to get yourattribute relationsships right. Use rigid relationships whenever possible and make sure you hierarchies are supported. Probably enough matter here for a seperate post
  • Watch our for CPU-hogs in your calculations. The MDX SUM function is the classical example - if you sum over too large a set you are essentially creating a "mega loop" in your calculations. I often hear people complain that Analysis Services is very bad at calculations. I think there are improvements to be made on the server side - even in 2005. But, most of the time, programmers just need to think harder about how they express their calculations before they complain about performance.
  • Minimize subcube operations. This is closely related to my previous post about multi-selects. I have some experiences from a big installation to share with you
2006/8/5

Doing Multi-Selects in Analysis Services 2000 and 2005

One of the major challenges for users of Analysis Services 2000 are queries that answer question of the form:
 
  • Return a hierarchized, drilled down, subset of dimension members
  • Limit the members of the above set to selected parts of the dimension

For example. Your interface has a product drop down list like this:

    • All Product
      • Accessories
        • Bike Racks
        • Bike Stands
        • Bottles and Cages
        • etc...
      • Bikes
      • Clothing
      • Components

You allow you user to select one or more subcategories at the same time. For the purposes of illustration, lets assume the users selects  "Bike Racks" and "Bike Stands". The user then asks for the order count of the hierarchized product dimension - but filtered by his selection.

Put differently, the users wants:

  • The Order Count for Bike Racks and Bike Stands
  • The Order Count of all parent members for the above two (ie: Accessories and All Product)

Naïvely you execute this MDX statement:

SELECT

[Measures].[Internet Order Count] ON 0,

HIERARCHIZE

(

GENERATE(

{[Product].[Product Categories].[Subcategory].&[26], [Product].[Product Categories].[Subcategory].&[27]}

,

ASCENDANTS([Product].[Product Categories])

)

)

ON 1

FROM

[Direct Sales]

(Try it out in Adventureworks)

What happens?... You get:

All Products  27,659
Accessories   18,208
Bike Racks       328
Bike Stands      249

The values for All Products and Accessories are not right! You want the value for accessories and All Products to be 328 + 249 (the sum of the selected descendants), not 18208 and 27659.

This is the essence of the visual totals problem: Analysis Services will happily scan the cube space for you and return the values of the parent. But you get the Order Count for the entire parent member - not the sum of the values you have selected. Why? Because this is the value that has been preaggregated in the cube.

There is something intuitively wrong with the MDX query above. You are essentially mixing up two different questións:

  • What is the shape of your desired return value? (The hierarchy of the product dimension)
  • Which subspace of the cube do you want the shape calculated over? (only calculate on Bike Rack and Bike Stands)

Aha!... You need to seperate shape and subspace - just like you seperate the columns and WHERE parts of an SQL statement. Applying our SQL knowledge to MDX (generally a bad idea) we try:

SELECT

[Measures].[Internet Order Count] ON 0,

DESCENDANTS

([Product].[Product Categories].[All Products], [Product].[Product Categories].[Subcategory], SELF_AND_BEFORE) ON 1

FROM

[Direct Sales]

WHERE

{[Product].[Product Categories].[Subcategory].&[26], [Product].[Product Categories].[Subcategory].&[27]}

This generates an error. Even if we reduced our WHERE critieria to a single select (simplifying the problem):

WHERE

([Product].[Product Categories].[Subcategory].&[26])

... we still get the infamous error:

The Product Categories hierarchy already appears in the Axis1 axis

This, deceptively simple, multi-select problem have caused many headaches for MDX programmers. There ARE workarounds in Analysis Services 2000 - but they take a syntactic form that is disgusting to behold and I will not quote them here.

Analysis Services 2005 to the rescue! Finally, we are able to seperate shape and subspace in our query. The above query can be written, correctly, like this in AS2005;

SELECT

[Measures].[Internet Order Count] ON 0,

DESCENDANTS

([Product].[Product Categories].[All Products], [Product].[Product Categories].[Subcategory], SELF_AND_BEFORE) ON 1

FROM

(

SELECT {[Product].[Product Categories].[Subcategory].&[26]

, [Product].[Product Categories].[Subcategory].&[27]} ON 0

FROM [Direct Sales])

Notice the user of the "subselect". The outer select specifies the shape of space to return - the inner select specifies the subspace that is the scope of the query. Syntactically elegant - and we get the desired result:

All Products  577
Accessories   577
Bike Racks    328
Bike Stands   249

Now, every coin has a back side. Allowing the user to specify any subspace in his query has som implications on how aggregations are being accessed. Programmers familiar with visual totals will probably have spotted this by now.

My next post will deal with the subspaces you can define and how they affect your queries.

2006/7/10

Doing FIRST and LAST aggregates in SQL Server 2005

Users of Microsoft Access may be familiar with the aggregation functions FIRST and LAST. Basically, what you want from these aggregates is to scan the tables in a sorted order. The first or last value encountered in each group is send to the output - much like the existing SQL Server MIN/MAX function.

 

Let me illustrate with some pseudo code:

 

SELECT Col

  , FIRST(Value) AS FirstV

  , LAST(Value) AS LastV

  , SUM(Value) AS SumV

FROM Table

GROUP BY Col

 

On a table with these rows:

Col Value
A 3
A 2
A 6
B 5
B 1

...You want this output:

Col FirstV LastV SumV
A 3 6 11
B 5 1 12

 

Notice how there in an implicit assumption of a row ordering. You probably want some sort of identity column to order your rows. Implicitly - this is a very ISAM way of looking at the world. Now, how do we do this inside SQL Server?

 

Lets make some test data based on AdventureWorksDW:

 

SELECT s.*, IDENTITY(int, 1,1) AS ID

INTO #Agg

FROM dbo.FactInternetSales s

CROSS JOIN (SELECT TOP 30 * FROM sys.objects) scaleUpFactor

 

CREATE

UNIQUE INDEX IX_FirstLast ON #Agg (ProductKey, ID, OrderQuantity)

 

Well... My colleage Jesper Rasmussion came up with a brilliant answer (try this on the testdata from above):

 

SELECT

I.ProductKey

  , F.OrderQuantity AS FirstQuantity

  , L.OrderQuantity AS LastQuantity

  , I.SumQuantity AS SumQuantity

FROM

  (SELECT ProductKey

   , min(ID) AS FirstID

   , max(ID) AS LastID

   , sum(OrderQuantity) AS SumQuantity

   FROM #Agg

   GROUP BY ProductKey) I

INNER

JOIN #Agg F ON F.ID = I.FirstID AND F.ProductKey = I.ProductKey

INNER

JOIN #Agg L ON L.ID = I.LastID AND L.ProductKey = I.ProductKey

 

This query has some very interesting properties. If you check the executions statistics you will see that SQL Server only does very few I/O requests in the index created (not more than 10-20% more than it takes for a full scan - probably around log(size(#agg) ). Since all rows HAVE to be visited to answer the requiest - this means that the query is very efficient.

 

Now, if only we could make a first and last aggregate that does just one table scan... :-)

 

How fast can SQL Server 2005 possibly be

I am currently making a relatively deep (read: nerdy) study of how to optimally load a huge data warehouse.

Based on my experience with ETL tools, SQL tuning and index optimization I know that there are several approaches one may take when loading a billion row data warehouse with million row dimension tables. However, very few of them are optimal or even viable...

On question that must be answered to find the optimal stratagy is the following:



Let us assume we have an optimally configured SQL Server system (all sp_configure settings and trace flags optimized for our hardware).

Futhermore, assume we have the following:

  • A 1 GB Full-Duplex network card with the best drivers available
  • The best disk system money can buy (Real spindles - not Solid state)
  • The fastest CPU of each class: x86, x64 or IA-64 architecture (there is no parallel execution happening here - yet)
  • Plenty of RAM allocated to sqlserv.exe

Now, let us run these three simple statements against our warehouse database:

  • SELECT SurrogateKey FROM DimensionTable WHERE EntityKey = @ek
  • UPDATE DimensionTable SET EntityKey = @ek WHERE SurrogateKey = @sk
  • INSERT INTO DimensionTable (SurrogateKey, EntityKey) VALUES (@sk, @ek)

These are (simplified) version of the queries used to respectively make key lookups, type 1 and type 2 dimension changes

Assume the following about the execution of the above statements

  • We are using the OLEDB driver subsystem to communicate with SQL Server
  • Optimal indexes are in place to support all queries
  • No page splitting occurs
  • The Fillfactor is 100% in the indexes
  • The query plan for all statements is in the plan cache
  • The SELECT can be serviced from the buffer pool by looking at only 3 pages in the index B-tree (optimistic guess)
  • The UPDATE and the INSERT each require only one I/O operation (optimistic guess)

For our warehouse architecture we can consider two viable scenarios:

  • Statements executed from another machine on the network to the server (crossing the 1GB Full-Duplex link)
  • Statements executed from same machine (doing in memory movement where possible)

Now my question is: How fast can these statements be if we have the best software, tuning and hardware available?...

Or in Data Warehouse terms:

"What is the absolute lowest time used pr. row when loading a dimension table using a naïve (read: straightforward) approach to dimension loading"


BTW: (An example of such a naïve approach is the one employed by the SSIS 2005 SCD transformation)

In a later post I will explore a non-naïve approach which i consider optimal for dimension loading for dimension of any size.