Extending the Data Profiling Task

 

SQL Server 2008 RTM has been out a while now and yet, I am just now making my way to using the Data Profiling Task, new to SSIS in this release. Time and client demands are always a factor governing when and what I get to look at, however I had read up on the features and was hoping that what I had read was not all the new task could do.

The new task is a good start, but unfortunately, it is a little more limited than I would have hoped. Fortunately though, with the SSIS APIs open to us, we can use the class object as we please.

So first -- what did I find limiting about the Data Profiling Task?

It appears that its intended design was to serve as a checkpoint during loads/transforms and that conditional logic can be applied based upon results; this is incredibly useful. My major concern is with activities that happen before the design phase – and quite directly, before any development begins at all.

Reference this great little tip from Kimball here on data profiling and its necessity according to his framework – profiling is tantamount to understanding the system that you are getting ready to work with in the course of a project.

To get to the heart of what I wanted to see the Data Profiling task do: I want to use it to generate an aggregate view of a given system so that I can view the results and learn more about the data within the database. Many of us have been given a database, custom built or extended far from the original software vendor's spec, with little documentation.

Even for those who are not preparing to move this data for warehousing or reporting, anything that provides a reference to what the data is, how it is stored, how unique, etc. is a boon for developers.

The Data Profiling task can deliver this data -- but only one table at a time?

There are no expression settings or variables that you can use to set which table and where the output goes. For two of the profiling tasks, I can see this as a bit of a problem – both the Candidate Key and Value Inclusion profile do not lend themselves to being easily generated dynamically via iterating over all the tables and views in a database.

But all the others are useful in this aggregated manner, so why, if I am looking at a client database that has 96 tables, must I build 96 tasks via Visual Studio and still not end up with an aggregate report (just 96 xml files)? I would like to be able to run this on the entire database, and start looking at the hotspots or data profile results that are the fat tails in the distribution chart.

My solution is to build the SSIS package programmatically, run it and then read and aggregate the results. I still need to clean up the solution a bit and finish some unit tests, so I will post it soon for everyone to download and enjoy.

My envisioned endgame for this project is for the output to be configurable for user desired thresholds and for that output to render in HTML or Word – still a little ways away from that, but I have the bulk of the SSIS work done. If anyone is interested in contributing, let me know and I will see about getting this on CodePlex.

Notes for anyone who wants to work with this project or attempt this on their own:

  • This effort makes use of the Microsoft.DataDebugger.DataProfiling class, which is not necessarily a supported API. An example of using this class is on the SSIS Team Blog here.
  • DLL references necessary for this project come from multiple places. For example, the Microsoft.SqlServer.DataProfilingTask is in ~\Program Files\Microsoft SQL Server\100\DTS\Tasks and Microsoft.SqlServer.DataProfiler is in ~\Program Files\Microsoft SQL Server\100\DTS\Binn
  • It is a good idea to read up on building a package programmatically on MSDN as there are class objects, such as TaskHost, which are not immediately intuitive as they are not seen when using the designer
Tags: , , , ,
Categories: .NET | SSIS

March 31, 2009 09:19 by Sid
E-mail | Permalink | Comments (4) | Comment RSSRSS comment feed

Comments

March 31. 2009 14:57

John Welch

Hi, Sid, nice post. There is a property that you can set to customize what it profiled, but it's hidden. You can only set it through an expression. Take a look here for an example of a package built only in the designer that profiles all the tables in a database.

agilebi.com/.../...l-the-tables-in-a-database.aspx

Still, using the API to create the package is cool - looking forward to seeing the end result.

John Welch

March 31. 2009 18:12

Sid

John -

Thanks for the comment -- looks like we both have the same end goal, we just worked out the resolution in two different paths. I will definitely post the project once I get it cleaned up.

Thanks for the comment and the link!

Sid

April 10. 2009 12:14

trackback

Trackback from Sid Atkinson Jr.

CodePlex project for Data Profiling

Sid Atkinson Jr.

July 18. 2009 11:23

Steve J. Laye

Thanks for the good post. Will use the link.

Steve J. Laye

Add comment


(Will show your Gravatar icon)

  Country flag

biuquote
  • Comment
  • Preview
Loading



Sponsors

Author

Tags

Recent Posts