Extending the Data Profiling Task

 

SQL Server 2008 RTM has been out a while now and yet, I am just now making my way to using the Data Profiling Task, new to SSIS in this release. Time and client demands are always a factor governing when and what I get to look at, however I had read up on the features and was hoping that what I had read was not all the new task could do.

The new task is a good start, but unfortunately, it is a little more limited than I would have hoped. Fortunately though, with the SSIS APIs open to us, we can use the class object as we please.

So first -- what did I find limiting about the Data Profiling Task?

It appears that its intended design was to serve as a checkpoint during loads/transforms and that conditional logic can be applied based upon results; this is incredibly useful. My major concern is with activities that happen before the design phase – and quite directly, before any development begins at all.

Reference this great little tip from Kimball here on data profiling and its necessity according to his framework – profiling is tantamount to understanding the system that you are getting ready to work with in the course of a project.

To get to the heart of what I wanted to see the Data Profiling task do: I want to use it to generate an aggregate view of a given system so that I can view the results and learn more about the data within the database. Many of us have been given a database, custom built or extended far from the original software vendor's spec, with little documentation.

Even for those who are not preparing to move this data for warehousing or reporting, anything that provides a reference to what the data is, how it is stored, how unique, etc. is a boon for developers.

The Data Profiling task can deliver this data -- but only one table at a time?

There are no expression settings or variables that you can use to set which table and where the output goes. For two of the profiling tasks, I can see this as a bit of a problem – both the Candidate Key and Value Inclusion profile do not lend themselves to being easily generated dynamically via iterating over all the tables and views in a database.

But all the others are useful in this aggregated manner, so why, if I am looking at a client database that has 96 tables, must I build 96 tasks via Visual Studio and still not end up with an aggregate report (just 96 xml files)? I would like to be able to run this on the entire database, and start looking at the hotspots or data profile results that are the fat tails in the distribution chart.

My solution is to build the SSIS package programmatically, run it and then read and aggregate the results. I still need to clean up the solution a bit and finish some unit tests, so I will post it soon for everyone to download and enjoy.

My envisioned endgame for this project is for the output to be configurable for user desired thresholds and for that output to render in HTML or Word – still a little ways away from that, but I have the bulk of the SSIS work done. If anyone is interested in contributing, let me know and I will see about getting this on CodePlex.

Notes for anyone who wants to work with this project or attempt this on their own:

  • This effort makes use of the Microsoft.DataDebugger.DataProfiling class, which is not necessarily a supported API. An example of using this class is on the SSIS Team Blog here.
  • DLL references necessary for this project come from multiple places. For example, the Microsoft.SqlServer.DataProfilingTask is in ~\Program Files\Microsoft SQL Server\100\DTS\Tasks and Microsoft.SqlServer.DataProfiler is in ~\Program Files\Microsoft SQL Server\100\DTS\Binn
  • It is a good idea to read up on building a package programmatically on MSDN as there are class objects, such as TaskHost, which are not immediately intuitive as they are not seen when using the designer
Tags: , , , ,
Categories: .NET | SSIS

March 31, 2009 09:19 by Sid
E-mail | Permalink | Comments (4) | Comment RSSRSS comment feed

Yes Virginia, PerformancePoint does work with SQL 2008

No sense in repeating the article, full steps available in the Technet posting "Install PerformancePoint Monitoring Server with SQL Server 2008".

The instruction detail is great, although I do not see a lot of our customers going for this set up as it is a bit of a hack.

The other thing to note is that this instruction set concerns the Monitoring portion of PerformancePoint only – for most people, this should not be a concern, as MS did not see a big uptake on the Planning piece anyway (and none of my direct clients have show interest either).

For the most part, I would really only recommend doing this if:

  • You want to take advantage of some of the new features in SSAS 2008
  • Existing SQL 2008 farm
  • Memory management headaches in SSRS 2005 are making you want to move to 2008

 

The list is not comprehensive by any means, but it is just a few of the reasons I can directly see. Feel free to post or e-mail me with any challenges, questions or scenarios!

Tags: ,
Categories: Performance Point

March 24, 2009 15:52 by Sid
E-mail | Permalink | Comments (2) | Comment RSSRSS comment feed

Consistency in Measurement

The ubiquitous mantra surrounding business intelligence is "the right information, at the right time, in the right format".

There is nothing wrong with this phrasing. Information put in the right hands and in an immediately usable format can be very empowering. One of the finer points is around "right information" -- how do we know what is right?

An example of this issue (and to pick on something in the news a lot right now), are the U.S. unemployment figures batted about in the news. Are we better than the Great Depression era? Where does the current economy sit in relation to the Great Depression? Where are we in relation to other recessions or depressions?

On the surface when we attempt to answer this, one can say that the U.S. employment metric of 8.1% from February 2009 is still far below the 1933 peak of 24.9%. Even removing seasonal adjustments on the current number puts the "official" unemployment metric at 8.9%. Does this mean we are better off? What relief can one take from this measure?

As with any measure, it is important to know how it is calculated – this becomes even more important when trying to compare across time period, and not just one quarter from the next, but in this instance, decades. Here are a few quick facts to keep in mind when looking at and trying to compare these unemployment measures:

  • Before 1940, unemployment statistics were gathered in a variety of decentralized means, with various ways to measure employed versus unemployed
  • The Works Progress Administration owned the survey between 1940 and 1942, before transferring to the Census Bureau
  • In 1959, the Bureau of Labor Stats took over analyzing and publishing the data, but the Census Bureau still collects it
  • In 1994, the CPS (Current population survey) chose a different measure for the "official" unemployment metric. Article here

What do these various facts tell us?

Our current unemployment measures are broken into several categories, U1-U6. U3 is the current official metric that is quoted in the news and plastered across news papers. U3 is the unemployment measure mentioned above, currently sitting at 8.1% for end of February 2009. Historical economists however, recognize that the surveys that were done during the 1930s are closer to our current U6 unemployment measure. Given this, can we really compare our current official metric to the official metric of the Great Depression era? To be accurate, no.

Compounding the issue in looking at even more recent recessions in the 1970s and 1980s is the fourth bullet point: the BLS switched officially reported unemployment metrics from 1993 to the 1994 year. Pre-1994, a modified version of U5 was the official measure, while post 1993 and forward, the BLS has switched to U3.

Luckily for the savvy numbers enthusiast, the Bureau of Labor Stats still published all six unemployment measures. So how do we really compare? Still a tricky answer; the accuracy of the pre-1940 surveys are still called into question, however, if one works off of the assumption that our current U6 unemployment measure is the closest, then we are at 14.8%!

Seasonally adjusted of course.

References:

Tags: , , , ,
Categories: Business Intelligence

March 18, 2009 16:00 by Sid
E-mail | Permalink | Comments (0) | Comment RSSRSS comment feed

I have been remiss

I have been remiss in paying attention to my visitors – my sincerest apologies to those who use FireFox.

It was not until earlier this week that I realized my previous blog format and CSS was not friendly to FireFox. Personally, I do not use the browser, switching only between IE and Google's Chrome (and Safari on my iPod touch). The problem came to my attention when I was assisting a friend with some queries and we referenced something on one of my prior posts – the ugly scene came alive in his FireFox window and I immediately resolved to correct this oversight.

What makes this more of a travesty is that I have information at my fingertips that removes any excuses. Here is a visitor chart for a given timeframe (Google Analytics):

Why even mention this? Who cares? Just change the template and don't tell me why….(a questioning voice other than my own)

I think it is important for those of us in the business intelligence/analytics field to recognize when there is information readily available, yet not taken advantage of for any number of reasons. It is always important to question the basics, turn over assumptions and look at what is really there.

As for me, I had simply forgotten that people use other browsers. My laptop comes pre-loaded with everything for work, and since I spend most of my time in the lower tiers of the application, I do not often work in the layers of a project where UI validation is an important subject matter. My assumption? That my readers are like me (bad assumption).

The important lesson? In my case, voice of the customer – I need to pay attention to what my readers are doing and how they access my site so that I do not alienate those seeking information or to post a comment.

Have a good weekend!

Tags: ,
Categories: Business Intelligence | General

March 7, 2009 15:43 by Sid
E-mail | Permalink | Comments (0) | Comment RSSRSS comment feed

PerformancePoint still lives

 

Microsoft posted to their BI blog back in January about the future of PerformancePoint as now linked to SharePoint. Given how widespread this announcement was, and that all the associated product teams timed posts to blogs to coincide with Guy Weismantel's video announcement, I am still surprised about how much disinformation is out there.

Bottom Line: PerformancePoint is still a viable, enterprise class product that customers can design, develop and deploy solutions on.

But what has changed?

For those of you accessing this information late, it appears that Guy's video is inaccessible; whether this is designed or not, I do not know. I was able view it in February, but the past few days I have not been as lucky.

Here are the pertinent facts:

  • PerformancePoint will not be a standalone product, it will be part of SharePoint
  • The Planning component will not be there in the next release, so if SP3 does not cover your planning needs, then using the Planning component of PerformancePoint 2007 is not for you
  • Customers using the Monitoring and Analytics will not be affected

In some ways, I view this as giving more importance to BI and its associated processes. The Information Worker segment is one of the largest and most profitable segments for Microsoft – it is also at the core in how PerformancePoint is distributed to the masses. Planning, development and sales wise, this, to me, brings a more consolidated message and usage strategy to customers.

In the end, there will be both critics and proponents, but what is done is done – and Microsoft made a move that they believe will support PerformancePoint as an enterprise level BI solution, inside the information worker space.

Obviously there are a lot of details still to come, but this move is not surprising. Remember Microsoft's purchase of Stratature (an MDM software company)? MDM will be part of Office 14 – this was the plan from the beginning of the acquisition; from my perspective, the question really is: why wasn't PerformancePoint released as part of SharePoint in the beginning?

Short and sweet, but this is the bottom line: PerformancePoint is here to stay.

References:

 

Tags: ,
Categories: Performance Point

March 3, 2009 12:59 by Sid
E-mail | Permalink | Comments (5) | Comment RSSRSS comment feed

Sponsors

Author

Tags

Recent Posts