| Thomas 的个人资料Thomas Kejsers Blog日志列表网络 | 帮助 |
|
2007/12/3 Moving to SQL CAT TeamI have been very silent on this blog for some time now.
The reason is that I have been hired to work for the the SQL Server CAT team. In the future I will be working with (even more of) the most challenging SQL Server installations in the world. My focus on BI will probably continue but I will also apply my OLTP tuning skills again.
My technical blog will now be found on: http://sqlcat.com
I would like to thank you for your comments and feedback and I look forward to hearing from you in the SQL CAT blog community
Sincerely Thomas 2007/4/18 SQL Server 2005 Standard cannot lock pages in memoryToday, I by accident discovered that SQL Server 2005 cannot lock pages in memory on the Standard Edition. The documentation for this is well hidden away. Look at the knowledge base article:
Look Under Applies to: you will only see Enterprise Edition. Today I validated this to be true, you cannot prevent SQL Server from being pages out on a Standard Edition.
Locking pages i memory is a recommended configuration of SQL Server. You do it by granting the SQL Server Service account the privilege "Lock Pages in Memory" (using gpedit.msc). The effect is that SQL Server will not be swapped out by the operating system - an operation that, if it occurs, can mean a very serious performance degradation.
You can validate that your SQL Server is running with locked pages by looking for the message: Using locked pages for buffer pool in the SQL Server log. 2006/12/18 IO Complexity of processing aggregation phaseIn my last post I missed the IO complexities of the aggregation phase. My memory of external sorting has become a bit rusty and I once again needed to bring Knuth from my bookshelf. First of all - if you have not yet tried running perfmon while processing this would be a good time. It is worth noticing that the aggregations phase does not consume much IO. Even on a less than well tuned IO subsystem the aggregation phase is typically CPU bound. For those of you only interested optimizing the speed of the aggregation phase - take another look at your attribute relationships and properties (you can use some of previous posts as inspiration). From my mail exchange with Eric Jacobsen I interpret that the external sort used by analysis services is a variant of balanced merge. The basic idea of the algorithm is
Steps 1-3 require nd reads and nd writes. Steps 4-5 require the same amount of reads and writes - furthermore, there is a small lg(q) CPU overhead for merging, with q being the number of temp files. From the algorithm above we can also deduce that we can save a significant amount of IO if we have plenty of memory for the sort in phase 2. Not surpring - but it is worth noticing that the records size in analysis services tend to be very big. 64-bit memory spaces clearly have a big advantage in the aggregation phase. Summarizing, we have:
2006/12/15 Time complexity of processing in Analysis ServicesWhen I build enterprise scale solution it is crucial for me to know what sort of scalability can be expected from a server product. Is it linear? It is exponential? Should I scale-out or scale-up? Are there hard limits to what the architecture can do? To understand these limitations and provide valid statements about scalability - you have to know the time complexity of the operations performed by your implementation. A major challenge for large scale Analysis Services installations is to reduce the processing time for a cube. If we are understand to what constitutes a cube processing we must first look at how processing is done. Processing proceeds in three phases
I need to get a deeper understanding of the index phase before embarrassing myself completely (as opposed to partially) in my blog. For now - lets look at the first two phases in more detail: Read PhaseThe purpose of the read phase is - as the name implies - to read the rows from the relational data source. The rows read form what I call the "base aggregations" - I.e. the leaf level (or key level if you like) data of the cube. It is these base aggregations that are used to answer queries at leaf level. The base aggregations also form the "base" of higher level aggregates in the next phase. Each row from the relational source is read once. Assuming there are n rows in the source. We find that our number of IO reads must be proportional with:
Each row takes about the same amount of CPU time to read. Hence, our CPU usage is:
How about the writes, once could assume O(n) - but this would not be completely true: As you may be aware - a relational fact table can contain "duplicate" data at the leaf level. Let me provide you with a few examples: Example 1: I go into a supermarket today and buy a bottle of coke. Being a caffeine junkie - I later visit the same store - buying yet another bottle of coke. This will generate two rows in the Point Of Sales system - one of each bottle. Depending on your warehouse design - you may see both of these rows in your relational source - hence observing a "duplicate" row. The interesting data from the cube perspective (assuming no drillthrough) is one row with 2 bottles of coke - not two rows with 1 each Example 2: You have a relational fact table with 5 dimension columns. However, for you analysis services cube you only want to use 4 of these dimensions. When you read the fact table you will almost certainly read duplicate leaf level data. Why? Compare the two statements: (1) SELECT DISTINCT D1, D2, D3, D4 , D5 FROM <FactTable> (2) SELECT DISTINCT D1, D2, D3, D4 FROM <FactTable> The row count of (2) is most likely smaller than (1) - why else would you even bother with D5 in the first place? Because the two examples happens a lot in the OLAP world, Analysis Services implements a "merge buffer" . While reading rows from the relational source- this buffer is used to merge such duplicate rows into single rows - thus reducing the storage space needed for the leaf level aggregates. The merge buffer will not eliminate all duplicates. It has a limited size and will thus only on a subset of the fact table. In the ideal world - where the fact rows enter Analysis Services in sorted order and the merge buffer is large enough, we will eliminate all duplicates. This brings us back to the number of IO operations needed to write base aggregation. Let nd be the distinct count of the dimension columns we use in the cube. Assuming an optimal reading of the fact table we have:
The reduction of the writes compared with the read can be quite significant for later phases of processing. Essentially we are seeing a kind of "pre aggregation" here. I like to think of nd as the size of the "non empty cube space". Also, notice that there is an upper limit to nd that is unaffected by you fact table size. Assuming your cube dimension, D1, ... , Dn have the following leaf level sizes: |D1|, ... , |Dn|. The upper limit for nd is:
For cubes with a small number of dimensions this upper limit can work to your advantage - as we shall see later. Aggregation PhaseOnce analysis services has stored the data read it must build higher level aggregates. If you have ever tried managing you own aggregate tables in a relational database you will know that there are two ways to build an aggregate:
This is where things get a little complicated - but I will give it a shot. Let's start with the first case - leaf level aggregation building: Every attribute in you cube is a potential candidate for an aggregation. You control which ones you want to aggregate by a combination of:
For the moment - lets keep attribute relationships out of the equation. Let's just assume you have a cube with aggregates designed, one way or the other, for a subset, AAgg, of your attributes. During the aggregation phase analysis services has to read your leaf level rows (remember, there were nd of these). Reading these rows takes:
From my conversations with Eric Jacobsen I gather that the aggregation algorithm in AS 2005 is an external disk sort (as compared to the other common alternative - a hash/bucket sort). What does this tell us? Well, for those you that don't already own it - get a copy of Donald E. Knuths masterpiece: "The Art of Computer Programming". From this work we can lookup the complexity of different sort algoritms and how to implement them. I fair assumption is that the external sort uses quicksort on its memory buffer the time complexiy: O(n lg(n)). Each attribute that is aggregated must be sorted - so we have:
Aggregation phase - summary so farTime complexity: Not surprisingly - CPU usage in the aggregation phase is proportional to the number of attributes you are aggregating. Notice however, that it is not proportional to the amount of rows in your relational database. Its proportional to the DISTINCT count of dimension values (assuming your merging in the read phase is optimal). Possible optimization (untested): This is pure speculation - but there may be some benefit of sorting your fact table using a clustered index to optimize the "pre aggregation" in the read phase. Cubes with a small number of dimension: If you cube has a very small number of dimension with relatively few manbers you may be able to leverage the fact that nd has an upper limit. This is pure theory - but it seems to be that some cubes could always be fully aggregated in reasonable time - irrespective of the number of source rows in the underlying relational table
It's time to post this first part. Next part: Attribute relationships and how they are used to build higher level aggregates. Creating custom aggregations in Analysis Services 2005Most of you are probably using the aggregation wizard to create aggregations for your cubes. If you done your homework and set attribute relationships, counts and properties right (please refer to my previous post) - the aggregation wizard will generally do a good job creating the right aggregations. The exact workings of the storage design wizard are probably best left to a separate blog entry. But there are a few things you should be aware of: To determine which aggregations to build, the storage design wizard will apply the "30% rule". Roughly speaking this means is that an aggregation is only considered for building if it represents the sum of at least 30% of level below it in a natural hierarchy. As an example, take this hierarchy: The top level is considered for aggregation: The 3 rows each aggregate an average of 20 / 3 - around: 6.6 rows - this is more than 30% of the 20 rows just below the top level. The second level is not considered for aggregation: It is almost as fast to scan the 30 rows as it is to scan the almost equal sized 20 rows above them. There are rare cases when the aggregations designed by the wizard using the 30% rule are not good enough. A real life customer example had a hierarchy with these characteristics: The middle level, with 1 million members, is not considered for aggregation by the storage design wizard: it does not meet the criteria of the 30% rule. However, in this particular case - a large fraction of the customers queries will access the middle level. Furthermore, the distribution between the middle and the bottom level is skewed: there might only be an average of 1.2 members the lowest level for every members at the middle level - but some members have thousands of children - whereas some only have one. For this particular customer - adding a custom aggregation to the middle level with 1 million members boosted query performance significantly. An optimization not achievable with the storage design wizard. A word of warning: Creating the right custom aggregations is a task for cube experts. Creating every possible aggregation in your cube is not a good way to boost query performance. In the worst case your cube processing will never finish. Lets just, theoretically, assume you could process your cube with all aggregations design. Locating the right aggregation to use in the universe sized space of aggregations will still take eternity for Analysis Services. The words "eternity" and "universe" are fitting in this context - cube space can be BIG - very very BIG - you really don't want to aggregate all of it. Assuming you have heeded my words of warning - how do you go about manually adding aggregations to your cube? Though the API is there Microsoft does not supply a tool for inspecting and creating custom aggregations. One way is of course to edit the cube XML files or use XML/A. A daunting task. But guess what? The developers at Microsoft probably needed a tool for their own internal use - so they build one. This tool is called AggDesign - and I have had the pleasure of using it. According to Eric Jacobsen from Microsoft, we may see this tool shipped with SQL Server 2005 SP2. AggDesign allows you to visually inspect your cube aggregations and add new ones a you see fit. The tools allows you to assign one or more "AggDesigns" to each partition in your cube. These AggDesign add extra aggregations to the ones already defined by the storage wizard. If you are working with a large cube and feel very confident that you need custom aggregations (and you really should be sure) you might be able to get a copy of this tool from your local Microsoft product support. 2006/11/29 Automating build of Integration Services projectsAfter our experiences with the build process in Analysis Services - expectations were low when we turned our gaze towards Integrations Services (SSIS). Our overall experience with SSIS has generally been positive - it's an extremely powerful tool with some interesting performance characteristics. There are some annoying flaws - especially in the Visual Studio integration. When I work with SSIS projects - three or four daily crashes of Visual Studio is not uncommon. Never fear - the power provided by the tool more than makes up for this when it comes to choosing the ETL platform. And buggyness is what service packs are for ... isn't it? :) The first thing you notice when you choose "build" in Visual Studio on an SSIS projects is that Visual Studio doesn't do anything - except of course copy files to the bin directory. Why? Well, guess you could say that the *.dtsx files contain all the information need to run the packages - there is "nothing to be done". But, when you try to automate a build there actually is something to be done. First of all: you have to mark which files contain the build output. That is, the files you wish to copy to your deployment target. With this in mind - we tried using msbuild on the project - hoping that it would at least mark the *.dtsx files as build output (it could just take a look at the bin directory for us) so Team Foundation can copy them to the release directory. No surprises - integration is non existent and there are no free meals here. Guess you could say that the command line to automate build of SSIS is simple: just copy the bin\*.dtsx files to your build output directory. We could have stopped here and be done with it - but we like build logs (if you can use the about the word: "like" in this context). Those of you who have deployed a big BI-installation by hand may have noticed the excellent command line tool; dtexec.exe used to run SSIS packages. This tool has some very useful parameters that come in handy for the structured developer. When you execute an SSIS package - two things happen:
Here's the trick: The first phase produces both warnings and errors if you have made a mistakes in your package design. This is a very useful output and can be considered a build log of the package. Fortunately - its possible to invoke the first phase without executing the entire package. You simply do this:
/Va makes the package validate but not execute /Rep EW tells SSIS that you only want to see warnings and errors on stdout. If you don't use the /Rep switch you will clutter your "build log" with useless informational messages. Summary: to build an integration services project you do two things:
2006/11/28 Automating build of Analysis Services ProjectsAs you may be aware, an Analysis Services project consists of the following files:
You can right click the project in Visual Studio and choose "build". What happens now seems to be a lot of called to DLL files executed by the integration between Visual Studio and the Analysis Services. The output of this build is the bin folder in your project directory. This folder contains a database file (*.asdatabase). The .asdatabase file is the one you want - you will see why later. Unlike other projects - which you can build using their proper compiler (Example: CSC.EXE, VBC.EXE) - Analysis Services does not seem to have a compiler executable. You could be tempted to run msbuild. Unlike most temptation this will result in disappointment - msbuild does not understand Analysis Services files. The only way I have found to automate the building of Analysis Services projects is to invoke Visual Studio from the command line. Say you have a project called: MyAsProject.dwProj. To build it and obtain the asdatabase file you execute the following:
So far so good - now you have automated the createion of the .asdatabase file. You are not done yet. The asdatabase file provides the input to the Analysis Services Deployment Wizard. The spell cast by this wizard allows you to transform an asdatabase filen into an XML/A file. This XML/A file can be run on your deployment target - which will create the cube database. From the command line the Deployment Wizard can be run in two modes. The answer mode:
In this mode, you get to choose all the server specific settings and write them back to your bin directory. Check out the BOL documentation to find out exactly how this is implemented (You will find no defense for the naming of the executable in the documentation) Once you have your answers stored (which you will only do once) you want to automate the creation of the XML/A file. Execute:
Voila! You have your xmla file, which you can copy to your deployment target and run. By the way. xmla files can be run from the command line using ascmd.exe which you can find here: http://msdn2.microsoft.com/en-us/library/ms365187.... Automating the building of BI-SolutionsComing from a "Old fashioned" development environment (that is: C/C++, CSV, Unix and the likes) I have always wondered why BI-projects don't seem to get the idea of a structured software development. There are some technological cornerstones of a good, well structured software development process:
I don't know about you - but I have yet to see a Business Intelligence projects use all of the above. Some, actually a lot, of the examples I have worked on did not even use ANY of the above. My theory is that BI-development community is in a sense "immature" to the ways of structured software development. With Team Foundation Services and the new Visual Studio integration of the Microsoft BI-suite this is about to change. On my current development project I have decided to try a fully structured approach to BI-development with my team. Team Foundation gives us the following for free (or rather - for a very cheap license):
We have set up multiple environments and I have borrowed a few change management procedures from my past. Now, Team Foundations promises to automate our daily build too. It turns out that this is not really working with BI-projects. Team foundation uses a tool called msbuild - which it will execute automatically on solution files. However, msbuild doesn't have any plug-ins for BI-projects :-( Guess it was a bit too much to ask for anyway - back to basics: scripted builds. We are experimenting with daily build using command line script. Once we have the "building blocks" to build a BI-solution we plan to integrate it with the Team Foundation Build. 2006/11/17 The Many Pitfalls of Attribute Properties in AS 2005In AS 2000 dimension design, once the database has been loaded, is a 10 minute task. Declare your hierarchies, set up a few member properties, a short look at aggregations settings and off you go. There really was no such thing as a multiple hierarchies. Member properties was a design afterthought - as anyone that has ever designed a virtual dimension can testify. Though MDX feature powerful query abilities - there were still queries that were best served by the relations database. Enter AS 2005. Dimensions design has become much more powerful. We now have the full flexibility of the relational model build into the UDM. However, this power comes with a heavy price tag. When you design a dimension in AS 2005, every column in the relational source is a potential attribute. If you leave the properties of these attributes at their default value - you have set yourself up for a nasty performance surprise. Not only will performance suffer - your interface, should you use any version of Excel pre-2007 will be a mess. For each attribute in your dimension you must manually adjust each property to fit your needs. Books Online is a bit vague on the subject - so here's a quick summary: AttributeHierarchyEnabledThe name of the property is misleading - it really should be called: EnableAggregations. When set to true, the default value, this attribute is considered part of the aggregated cube space and will influence the aggregations design. This also means that you cannot set this value to false if your attribute participates in a user hierarchy - AS 2005 will raise an error if you do. However, you don't want all attributes to be aggregatable. For example, the street address of a customer is generally not a good candidates for aggregations - its solely there for display purposes. In AS 2000 you would implement this as a member property. Settings the value of AttributeHierarchyEnable to false tells Analysis Services that this attribute is used only for display purposes and that storage and processing time should not be wasted on building aggregations. If you really don't need these aggregations - setting this value to false helps the aggregation wizard do its work AttributeHierarchyVisibleIf you have set AttributeHierarchEnable to false this setting should be set to false. However, its default value is true. If set to true you attribute will, as the name says, be displayed as a hierarchy in client tools. What does this mean? Even though the attribute already participates in user hierarchies it will also have its own, separate, hierarchy. In Excel 2003 and earlier this means that you will see one extra dimension for each attribute with this property set to true. This tends to create unnecessary clutter in you interface. If you are designing solely for older client tools you should consider settings this property to false. If you are blessed with Enterprise Edition you can use the perspective feature. With perspectives this property can be true but you can still hide the hierarchy to older clients. OptimizedStateThis innocent looking property can have a big influence no your processing time. My understanding is not complete, but let me try to illustrate: Assume you have a three attribubutes A0, A1, A2 that are related (through an attribute relationship) in a natural hierarchy: A0 -> A1 -> A2 A0 is the leaf (key) level of the dimension. During the aggregation phase, Analysis Services will first build aggregation on the leaf level A0. Using the leaf level aggregations the aggregations for A1 can be build (A1 is directly related to the leaf). Since A2 is related to A1 an optimization can be made: The aggregations for A2 can be now be build using only the aggregation from A1 - without scanning the aggregations for A0. Assume OptimizedState is set to FullyOptimized (the default) for A2. This instructs Analysis Services to build aggregations specifically for this attribute. The means that A0 is scanned to build aggregations A2. This is of course an expensive operations - the leaf level can be quite big in large cubes. Setting OptimizedState to NotOptimzed will disable this behavior. One of my customers has reduced his process time from 14 hours to 5 just by disabling this property for selected attributes. It may be based on my incomplete understanding - but I think the default property for this attribute is wrong and you generally would use NotOptimized instead of FullyOptimzed. What I really don't get about this setting is why it is even needed - especially when you consider the property: AggregationUsageThis property is used to force a specific behavior in the aggregation wizard for this attribute. The Books Online documentation is pretty good: ms-help://MS.SQLCC.v9/MS.SQLSVR.v9.en/uas9/html/57... One question remains though: If I can use AggregationUsage to fine tune my which properties are build with the aggregation wizard - why do I even need OptimizedState? 2006/11/16 Arbitrary Shapes in AS 2005
I have recently returned from a performance tuning assignment at one of the largest AS-cube installations in the world. Working on such large installations always teach you a thing or two about the internal workings of the server products in use. Learning this new stuff is what I love the most about performance tuning. My colleague and me had a conference call with the people at Microsoft - and they have been really helpful about sharing their knowledge. I wanted to pass on some of it here. First of all - let me highly recommend the new ability to trace a query in SQL-Profiler. MDX coders: you will love this feature, it will really boost your efficiency while tuning MDX. One of the events you want to watch out for is "Query Subcube Verbose". This event tells you a lot about the internal working of the AS engine. My knowledge is still pretty sparse - but this is what I have gathered: MDX queries are evaluated by the Formula Engine which interprets the dimensions axes and combine aggregations as necessary to answer the query. The aggregations are stored in the aggregation data registry which can consult the Storage Engine if it needs new aggregations to populate the cache. Aggregations are returned from the Aggregation data cache using subcube queries - these are the ones you can trace in profiler. To determine which subcubes to request the Formula Engine uses and a mechanism called "SONAR" (I need to see if any literature is written about this). Now, the amount of subcubes requested from the aggregation data registry has a large impact on query performance. Subcube queries can either be cached in memory or need retrieval from disk. If cached, a subquery is pretty fast - typically a few millisecond. Of course, these can add up and take time if you have many of them. On the other hand, if your subcube queries need to access the Storage Engine for aggregations they can be quite slow - since this requires I/O requests. The more subcubes you need to retrieve from disk - the larger your I/O load will become - eventually resulting in bad performance. You can use SQL-Profiler (look at the event subclass) to see if a subcube is retrieved from disk or memory. In summary, it is important that you avoid too many subcubes and make sure you have good cache hits on the ones you have. Fortunately, you are partially in control of this process. How does this tie into my previous post about multi-select? Well, try to execute and trace the subcubes in the following MDX statement against AdventureWorks:
Notice the high amount of subcubes needed? Now try to execute it again... What's wrong here? A lot of the subcubes are still being fetched from disk. Even though the query has already run before. Before we can understand what is happening here I must define what the people at Microsoft call "An Arbitrary Shape". To understand what an arbitrary shape is, it is perhaps easier to define what a "non-arbitrary shape is" A Non-Arbitrary Shape is a set, S, of dimension members that match these criteria:
Put another way: The value of every member in S can be expressed as the aggregation over the members only from Lmin I have tried to create a taxonomy of all possible non-arbitrary shapes. These are illustrated below: Notice that NA1, ..., NA3 can all be expressed using the An arbitrary shape is anything that is NOT a "non-arbitrary shape" Now, the Formula Engine can evaluate non-arbitrary shapes a lot more efficiently than arbitrary shapes. The work required to evaluate an arbitrary shape in a sub select generally involves a lot more subcube queries than when evaluating a non-arbitrary shape. Furthermore, the arbitrary shapes cause a lot of trouble with the caching of aggregations. I do not know the exact algorithm used by the formula engine. But one could suspect that non-arbitrary shapes allows the formula engine to build the higher aggregations directly from the lower ones using information about the hierarchies in the cube. Arbitrary shapes clearly prevent this optimization. There are many arbitrary shapes, but here are a few of them: So what can you learn from this? First, since arbitrary shapes are difficult to optimize you should seek to eliminate them your subcube queries. This is often an easy goal to accomplish; try this simple rewrite of the query above:
Trace it, and notice the difference. Secondly, by experimenting with arbitrary shapes we can learn a lot about he internals of the query engine. There is not much documentation to be found, I invite the reader to share his experiences. Even though we would expect all non-arbitrary shape to allow "smart caching" this is not the case. Try the following:
Though the sub select quite easily qualifies as a non-arbitrary shapes (perhaps the most simple non-arbitrary shape I can think of) the non-cached disk access still happens - even on a warm cache. Having done some experimentation I have found that only non-arbitrary shapes that contain members from a single level only are optimized by the query engine. Even though any non-arbitrary shape in theory should trigger a smart caching. 2006/8/7 Factors to consider when performance tuning cubes and MDXBefore we move on the the wonders of subselects I would like to give you a few pointers on tuning MDX queries and cubes in general. During my time as a performance tuner on Analysis Services - these are the most important factors you should consider in a 2005 environment:
2006/8/5 Doing Multi-Selects in Analysis Services 2000 and 2005One of the major challenges for users of Analysis Services 2000 are queries that answer question of the form:
For example. Your interface has a product drop down list like this:
You allow you user to select one or more subcategories at the same time. For the purposes of illustration, lets assume the users selects "Bike Racks" and "Bike Stands". The user then asks for the order count of the hierarchized product dimension - but filtered by his selection. Put differently, the users wants:
Naïvely you execute this MDX statement: [Measures].[Internet Order Count] ON 0, HIERARCHIZE (GENERATE( {[Product].[Product Categories].[Subcategory].&[26], [Product].[Product Categories].[Subcategory].&[27]} , ASCENDANTS([Product].[Product Categories])) ) ON 1FROM [Direct Sales](Try it out in Adventureworks) What happens?... You get:
The values for All Products and Accessories are not right! You want the value for accessories and All Products to be 328 + 249 (the sum of the selected descendants), not 18208 and 27659. This is the essence of the visual totals problem: Analysis Services will happily scan the cube space for you and return the values of the parent. But you get the Order Count for the entire parent member - not the sum of the values you have selected. Why? Because this is the value that has been preaggregated in the cube. There is something intuitively wrong with the MDX query above. You are essentially mixing up two different questións:
Aha!... You need to seperate shape and subspace - just like you seperate the columns and WHERE parts of an SQL statement. Applying our SQL knowledge to MDX (generally a bad idea) we try: [Measures].[Internet Order Count] ON 0, DESCENDANTS ([Product].[Product Categories].[All Products], [Product].[Product Categories].[Subcategory], SELF_AND_BEFORE) ON 1FROM [Direct Sales]WHERE {[Product].[Product Categories].[Subcategory].&[26], [Product].[Product Categories].[Subcategory].&[27]}This generates an error. Even if we reduced our WHERE critieria to a single select (simplifying the problem): ([Product].[Product Categories].[Subcategory].&[26]) ... we still get the infamous error:
This, deceptively simple, multi-select problem have caused many headaches for MDX programmers. There ARE workarounds in Analysis Services 2000 - but they take a syntactic form that is disgusting to behold and I will not quote them here. Analysis Services 2005 to the rescue! Finally, we are able to seperate shape and subspace in our query. The above query can be written, correctly, like this in AS2005; [Measures].[Internet Order Count] ON 0, DESCENDANTS ([Product].[Product Categories].[All Products], [Product].[Product Categories].[Subcategory], SELF_AND_BEFORE) ON 1FROM (SELECT {[Product].[Product Categories].[Subcategory].&[26] , [Product].[Product Categories].[Subcategory].&[27]} ON 0 FROM [Direct Sales]) Notice the user of the "subselect". The outer select specifies the shape of space to return - the inner select specifies the subspace that is the scope of the query. Syntactically elegant - and we get the desired result:
Now, every coin has a back side. Allowing the user to specify any subspace in his query has som implications on how aggregations are being accessed. Programmers familiar with visual totals will probably have spotted this by now. My next post will deal with the subspaces you can define and how they affect your queries. 2006/7/10 Doing FIRST and LAST aggregates in SQL Server 2005Users of Microsoft Access may be familiar with the aggregation functions FIRST and LAST. Basically, what you want from these aggregates is to scan the tables in a sorted order. The first or last value encountered in each group is send to the output - much like the existing SQL Server MIN/MAX function.
Let me illustrate with some pseudo code:
SELECT Col , FIRST(Value) AS FirstV , LAST(Value) AS LastV , SUM(Value) AS SumV FROM Table GROUP BY Col
On a table with these rows:
...You want this output:
Notice how there in an implicit assumption of a row ordering. You probably want some sort of identity column to order your rows. Implicitly - this is a very ISAM way of looking at the world. Now, how do we do this inside SQL Server?
Lets make some test data based on AdventureWorksDW:
SELECT s.*, IDENTITY(int, 1,1) AS IDINTO #AggFROM dbo.FactInternetSales sCROSS JOIN (SELECT TOP 30 * FROM sys.objects) scaleUpFactor
CREATE UNIQUE INDEX IX_FirstLast ON #Agg (ProductKey, ID, OrderQuantity)
Well... My colleage Jesper Rasmussion came up with a brilliant answer (try this on the testdata from above):
SELECT I.ProductKey, F.OrderQuantity AS FirstQuantity , L.OrderQuantity AS LastQuantity , I.SumQuantity AS SumQuantity FROM (SELECT ProductKey , min(ID) AS FirstID , max(ID) AS LastID , sum(OrderQuantity) AS SumQuantity FROM #Agg GROUP BY ProductKey) I INNER JOIN #Agg F ON F.ID = I.FirstID AND F.ProductKey = I.ProductKeyINNER JOIN #Agg L ON L.ID = I.LastID AND L.ProductKey = I.ProductKey
This query has some very interesting properties. If you check the executions statistics you will see that SQL Server only does very few I/O requests in the index created (not more than 10-20% more than it takes for a full scan - probably around log(size(#agg) ). Since all rows HAVE to be visited to answer the requiest - this means that the query is very efficient.
Now, if only we could make a first and last aggregate that does just one table scan... :-)
How fast can SQL Server 2005 possibly beI am currently making a relatively deep (read: nerdy) study of how to optimally load a huge data warehouse. Based on my experience with ETL tools, SQL tuning and index optimization I know that there are several approaches one may take when loading a billion row data warehouse with million row dimension tables. However, very few of them are optimal or even viable... On question that must be answered to find the optimal stratagy is the following: Let us assume we have an optimally configured SQL Server system (all sp_configure settings and trace flags optimized for our hardware). Futhermore, assume we have the following:
Now, let us run these three simple statements against our warehouse database:
These are (simplified) version of the queries used to respectively make key lookups, type 1 and type 2 dimension changes Assume the following about the execution of the above statements
For our warehouse architecture we can consider two viable scenarios:
Now my question is: How fast can these statements be if we have the best software, tuning and hardware available?...
In a later post I will explore a non-naïve approach which i consider optimal for dimension loading for dimension of any size. |
|
|