| Thomas 的个人资料Thomas Kejsers Blog日志列表网络 | 帮助 |
|
2006/11/29 Automating build of Integration Services projectsAfter our experiences with the build process in Analysis Services - expectations were low when we turned our gaze towards Integrations Services (SSIS). Our overall experience with SSIS has generally been positive - it's an extremely powerful tool with some interesting performance characteristics. There are some annoying flaws - especially in the Visual Studio integration. When I work with SSIS projects - three or four daily crashes of Visual Studio is not uncommon. Never fear - the power provided by the tool more than makes up for this when it comes to choosing the ETL platform. And buggyness is what service packs are for ... isn't it? :) The first thing you notice when you choose "build" in Visual Studio on an SSIS projects is that Visual Studio doesn't do anything - except of course copy files to the bin directory. Why? Well, guess you could say that the *.dtsx files contain all the information need to run the packages - there is "nothing to be done". But, when you try to automate a build there actually is something to be done. First of all: you have to mark which files contain the build output. That is, the files you wish to copy to your deployment target. With this in mind - we tried using msbuild on the project - hoping that it would at least mark the *.dtsx files as build output (it could just take a look at the bin directory for us) so Team Foundation can copy them to the release directory. No surprises - integration is non existent and there are no free meals here. Guess you could say that the command line to automate build of SSIS is simple: just copy the bin\*.dtsx files to your build output directory. We could have stopped here and be done with it - but we like build logs (if you can use the about the word: "like" in this context). Those of you who have deployed a big BI-installation by hand may have noticed the excellent command line tool; dtexec.exe used to run SSIS packages. This tool has some very useful parameters that come in handy for the structured developer. When you execute an SSIS package - two things happen:
Here's the trick: The first phase produces both warnings and errors if you have made a mistakes in your package design. This is a very useful output and can be considered a build log of the package. Fortunately - its possible to invoke the first phase without executing the entire package. You simply do this:
/Va makes the package validate but not execute /Rep EW tells SSIS that you only want to see warnings and errors on stdout. If you don't use the /Rep switch you will clutter your "build log" with useless informational messages. Summary: to build an integration services project you do two things:
2006/11/28 Automating build of Analysis Services ProjectsAs you may be aware, an Analysis Services project consists of the following files:
You can right click the project in Visual Studio and choose "build". What happens now seems to be a lot of called to DLL files executed by the integration between Visual Studio and the Analysis Services. The output of this build is the bin folder in your project directory. This folder contains a database file (*.asdatabase). The .asdatabase file is the one you want - you will see why later. Unlike other projects - which you can build using their proper compiler (Example: CSC.EXE, VBC.EXE) - Analysis Services does not seem to have a compiler executable. You could be tempted to run msbuild. Unlike most temptation this will result in disappointment - msbuild does not understand Analysis Services files. The only way I have found to automate the building of Analysis Services projects is to invoke Visual Studio from the command line. Say you have a project called: MyAsProject.dwProj. To build it and obtain the asdatabase file you execute the following:
So far so good - now you have automated the createion of the .asdatabase file. You are not done yet. The asdatabase file provides the input to the Analysis Services Deployment Wizard. The spell cast by this wizard allows you to transform an asdatabase filen into an XML/A file. This XML/A file can be run on your deployment target - which will create the cube database. From the command line the Deployment Wizard can be run in two modes. The answer mode:
In this mode, you get to choose all the server specific settings and write them back to your bin directory. Check out the BOL documentation to find out exactly how this is implemented (You will find no defense for the naming of the executable in the documentation) Once you have your answers stored (which you will only do once) you want to automate the creation of the XML/A file. Execute:
Voila! You have your xmla file, which you can copy to your deployment target and run. By the way. xmla files can be run from the command line using ascmd.exe which you can find here: http://msdn2.microsoft.com/en-us/library/ms365187.... Automating the building of BI-SolutionsComing from a "Old fashioned" development environment (that is: C/C++, CSV, Unix and the likes) I have always wondered why BI-projects don't seem to get the idea of a structured software development. There are some technological cornerstones of a good, well structured software development process:
I don't know about you - but I have yet to see a Business Intelligence projects use all of the above. Some, actually a lot, of the examples I have worked on did not even use ANY of the above. My theory is that BI-development community is in a sense "immature" to the ways of structured software development. With Team Foundation Services and the new Visual Studio integration of the Microsoft BI-suite this is about to change. On my current development project I have decided to try a fully structured approach to BI-development with my team. Team Foundation gives us the following for free (or rather - for a very cheap license):
We have set up multiple environments and I have borrowed a few change management procedures from my past. Now, Team Foundations promises to automate our daily build too. It turns out that this is not really working with BI-projects. Team foundation uses a tool called msbuild - which it will execute automatically on solution files. However, msbuild doesn't have any plug-ins for BI-projects :-( Guess it was a bit too much to ask for anyway - back to basics: scripted builds. We are experimenting with daily build using command line script. Once we have the "building blocks" to build a BI-solution we plan to integrate it with the Team Foundation Build. 2006/11/17 The Many Pitfalls of Attribute Properties in AS 2005In AS 2000 dimension design, once the database has been loaded, is a 10 minute task. Declare your hierarchies, set up a few member properties, a short look at aggregations settings and off you go. There really was no such thing as a multiple hierarchies. Member properties was a design afterthought - as anyone that has ever designed a virtual dimension can testify. Though MDX feature powerful query abilities - there were still queries that were best served by the relations database. Enter AS 2005. Dimensions design has become much more powerful. We now have the full flexibility of the relational model build into the UDM. However, this power comes with a heavy price tag. When you design a dimension in AS 2005, every column in the relational source is a potential attribute. If you leave the properties of these attributes at their default value - you have set yourself up for a nasty performance surprise. Not only will performance suffer - your interface, should you use any version of Excel pre-2007 will be a mess. For each attribute in your dimension you must manually adjust each property to fit your needs. Books Online is a bit vague on the subject - so here's a quick summary: AttributeHierarchyEnabledThe name of the property is misleading - it really should be called: EnableAggregations. When set to true, the default value, this attribute is considered part of the aggregated cube space and will influence the aggregations design. This also means that you cannot set this value to false if your attribute participates in a user hierarchy - AS 2005 will raise an error if you do. However, you don't want all attributes to be aggregatable. For example, the street address of a customer is generally not a good candidates for aggregations - its solely there for display purposes. In AS 2000 you would implement this as a member property. Settings the value of AttributeHierarchyEnable to false tells Analysis Services that this attribute is used only for display purposes and that storage and processing time should not be wasted on building aggregations. If you really don't need these aggregations - setting this value to false helps the aggregation wizard do its work AttributeHierarchyVisibleIf you have set AttributeHierarchEnable to false this setting should be set to false. However, its default value is true. If set to true you attribute will, as the name says, be displayed as a hierarchy in client tools. What does this mean? Even though the attribute already participates in user hierarchies it will also have its own, separate, hierarchy. In Excel 2003 and earlier this means that you will see one extra dimension for each attribute with this property set to true. This tends to create unnecessary clutter in you interface. If you are designing solely for older client tools you should consider settings this property to false. If you are blessed with Enterprise Edition you can use the perspective feature. With perspectives this property can be true but you can still hide the hierarchy to older clients. OptimizedStateThis innocent looking property can have a big influence no your processing time. My understanding is not complete, but let me try to illustrate: Assume you have a three attribubutes A0, A1, A2 that are related (through an attribute relationship) in a natural hierarchy: A0 -> A1 -> A2 A0 is the leaf (key) level of the dimension. During the aggregation phase, Analysis Services will first build aggregation on the leaf level A0. Using the leaf level aggregations the aggregations for A1 can be build (A1 is directly related to the leaf). Since A2 is related to A1 an optimization can be made: The aggregations for A2 can be now be build using only the aggregation from A1 - without scanning the aggregations for A0. Assume OptimizedState is set to FullyOptimized (the default) for A2. This instructs Analysis Services to build aggregations specifically for this attribute. The means that A0 is scanned to build aggregations A2. This is of course an expensive operations - the leaf level can be quite big in large cubes. Setting OptimizedState to NotOptimzed will disable this behavior. One of my customers has reduced his process time from 14 hours to 5 just by disabling this property for selected attributes. It may be based on my incomplete understanding - but I think the default property for this attribute is wrong and you generally would use NotOptimized instead of FullyOptimzed. What I really don't get about this setting is why it is even needed - especially when you consider the property: AggregationUsageThis property is used to force a specific behavior in the aggregation wizard for this attribute. The Books Online documentation is pretty good: ms-help://MS.SQLCC.v9/MS.SQLSVR.v9.en/uas9/html/57... One question remains though: If I can use AggregationUsage to fine tune my which properties are build with the aggregation wizard - why do I even need OptimizedState? 2006/11/16 Arbitrary Shapes in AS 2005
I have recently returned from a performance tuning assignment at one of the largest AS-cube installations in the world. Working on such large installations always teach you a thing or two about the internal workings of the server products in use. Learning this new stuff is what I love the most about performance tuning. My colleague and me had a conference call with the people at Microsoft - and they have been really helpful about sharing their knowledge. I wanted to pass on some of it here. First of all - let me highly recommend the new ability to trace a query in SQL-Profiler. MDX coders: you will love this feature, it will really boost your efficiency while tuning MDX. One of the events you want to watch out for is "Query Subcube Verbose". This event tells you a lot about the internal working of the AS engine. My knowledge is still pretty sparse - but this is what I have gathered: MDX queries are evaluated by the Formula Engine which interprets the dimensions axes and combine aggregations as necessary to answer the query. The aggregations are stored in the aggregation data registry which can consult the Storage Engine if it needs new aggregations to populate the cache. Aggregations are returned from the Aggregation data cache using subcube queries - these are the ones you can trace in profiler. To determine which subcubes to request the Formula Engine uses and a mechanism called "SONAR" (I need to see if any literature is written about this). Now, the amount of subcubes requested from the aggregation data registry has a large impact on query performance. Subcube queries can either be cached in memory or need retrieval from disk. If cached, a subquery is pretty fast - typically a few millisecond. Of course, these can add up and take time if you have many of them. On the other hand, if your subcube queries need to access the Storage Engine for aggregations they can be quite slow - since this requires I/O requests. The more subcubes you need to retrieve from disk - the larger your I/O load will become - eventually resulting in bad performance. You can use SQL-Profiler (look at the event subclass) to see if a subcube is retrieved from disk or memory. In summary, it is important that you avoid too many subcubes and make sure you have good cache hits on the ones you have. Fortunately, you are partially in control of this process. How does this tie into my previous post about multi-select? Well, try to execute and trace the subcubes in the following MDX statement against AdventureWorks:
Notice the high amount of subcubes needed? Now try to execute it again... What's wrong here? A lot of the subcubes are still being fetched from disk. Even though the query has already run before. Before we can understand what is happening here I must define what the people at Microsoft call "An Arbitrary Shape". To understand what an arbitrary shape is, it is perhaps easier to define what a "non-arbitrary shape is" A Non-Arbitrary Shape is a set, S, of dimension members that match these criteria:
Put another way: The value of every member in S can be expressed as the aggregation over the members only from Lmin I have tried to create a taxonomy of all possible non-arbitrary shapes. These are illustrated below: Notice that NA1, ..., NA3 can all be expressed using the An arbitrary shape is anything that is NOT a "non-arbitrary shape" Now, the Formula Engine can evaluate non-arbitrary shapes a lot more efficiently than arbitrary shapes. The work required to evaluate an arbitrary shape in a sub select generally involves a lot more subcube queries than when evaluating a non-arbitrary shape. Furthermore, the arbitrary shapes cause a lot of trouble with the caching of aggregations. I do not know the exact algorithm used by the formula engine. But one could suspect that non-arbitrary shapes allows the formula engine to build the higher aggregations directly from the lower ones using information about the hierarchies in the cube. Arbitrary shapes clearly prevent this optimization. There are many arbitrary shapes, but here are a few of them: So what can you learn from this? First, since arbitrary shapes are difficult to optimize you should seek to eliminate them your subcube queries. This is often an easy goal to accomplish; try this simple rewrite of the query above:
Trace it, and notice the difference. Secondly, by experimenting with arbitrary shapes we can learn a lot about he internals of the query engine. There is not much documentation to be found, I invite the reader to share his experiences. Even though we would expect all non-arbitrary shape to allow "smart caching" this is not the case. Try the following:
Though the sub select quite easily qualifies as a non-arbitrary shapes (perhaps the most simple non-arbitrary shape I can think of) the non-cached disk access still happens - even on a warm cache. Having done some experimentation I have found that only non-arbitrary shapes that contain members from a single level only are optimized by the query engine. Even though any non-arbitrary shape in theory should trigger a smart caching. |
|
|