Extracting SharePoint Data using SSIS

!!!!   Follow-up to this posted September 19, 2008: SQL 2008 SSIS Adapter for SharePoint from Microsoft Dev team posted to Codeplex !!!!

 

SharePoint and WSS are fantastic in that almost everything is driven by Web Services. For the data and reporting gurus, our job is a lot easier when it comes to extracting data from the mini-applications and workflows that the business community creates inside a WSS or SharePoint Portal framework – though not completely straight forward, so that is what we’ll discuss today: using SSIS to extract data from WSS lists using the out of the box WSS web services.

 

In this example, we have a user community that has data in a WSS list that needs to be extracted so it can be reported on. The data in this list represents a collection of information about a business process.

 

One of our major limitations in using out of the box SSIS features is the Web Service Task. If you look at Books Online, article: http://msdn2.microsoft.com/en-us/library/ms140114.aspx, the first note indicates that the Web Service task supports primitive types only. If you look at the WSDL for our particular method we’ll be using, GetListItems, you can see from the WSDL below that we will have a problem:

 

To get around this we’ll use a proxy since the Web Service task is the only item currently that can use a web reference. For our first step, we will use the same tool the VS 2005 uses when you add a web reference to a web or windows project: wsdl.exe. We’ll create our class by executing the following from the command line (wsdl.exe should be located in your C:\Program Files\Microsoft Visual Studio 8\SDK\v2.0\Bin directory):

 Wsdl.exe /language:VB /out:SharePointListService.vb http://SharePointURL/TargetSite/_vti_bin/lists.asmx?wsdl 

Note that you will need to do this again for any sub-site that you are targeting for your import as part of the WSDL description is the location of the service.

 

Our SSIS package will have three simple control flow steps:

-         Script Task

-         XML Task

-         Data Flow Task

Script Task

After you run this command, open your newly created vb file (I normally code C#, but the SSIS script task only uses VB). In your script task, we are going to make a new class reference. Right click on the solution, go to Add > Add Class. You can use the same name as newly created file from wsdl.exe or something that will more appropriately fit given your project. Once you create the new class file, erase any text/code automatically generated and then paste the content of your wsdl.exe outputted file into this code file in your project. Add references to the System.XML and System.Web.Services.

After you have completed this step, you are ready to consume your new class and mimic the Web Service SSIS task.

 

Here is sample code from the default ScriptMain file. My code in this sample uses the proxy class Lists (generated by wsdl.exe) to call the service, pass the necessary parameters and then save the output to an xml file. Note that for the view name for a list, you have to pass the GUID. It seems that the method was not set to return values using the common view name.

One of the easiest ways to grab the GUID is to go to modify the view you want to use to pull the data from and look in the URL, the portion of the query string after View= has the GUID for this view.

Another way to do this is to use the Web Service Task against the Views web service in WSS/SharePoint, call the method GetViewCollection, pass the List name as a parameter and you can get an XML output with all the views and GUIDs associated with a list item.

 

After implementing the code above, your Script Task is complete and your output will be an XML file of all the data from the list given the view you hit. An important point to note is to set your rowLimit parameter. The others can be blank or null, but if the rowLimit is not set, then by default you will only get the data that displays in the default view paging size, which is usually 100 records. By setting it to a figure larger than the collection, you will get the entire collection of records.

 

XML Task

 

The SSIS XML Source in the Data Flow section of your project does not like XML files that have multiple namespaces, which is almost a guarantee when you are retrieving data from a SharePoint List.

 

The XML Task is pretty basic in that it strips the extraneous namespaces from the XML file. I got this from Ashvini Sharma’s blog when trying to solve the same problem in hitting RSS feeds for a real estate project. Here is the link to his post:

http://sqljunkies.com/WebLog/ashvinis/archive/2005/05/25/15654.aspx

 

Right click on the XML Task, choose edit, and then we will set the properties on the task as such:

 

 

Under the SecondOperand property, paste the following XSLT:

 

 <?xml version="1.0" encoding="utf-8" ?>

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:output method="xml" indent="no" />

<xsl:template match="/|comment()|processing-instruction()">

<xsl:copy>

<xsl:apply-templates />                        

</xsl:copy>           

</xsl:template>           

<xsl:template match="*">                                    

<xsl:element name="{local-name()}">                       

<xsl:apply-templates select="@*|node()" />                       

</xsl:element>           

</xsl:template>           

<xsl:template match="@*">                       

<xsl:attribute name="{local-name()}">                                   

<xsl:value-of select="." />                        

</xsl:attribute>           

</xsl:template>

</xsl:stylesheet>

 

I have the FileConnection set to the file that my first task outputted, and the operation saves the results so that I end up with a clean version of my outputted WSS/SharePoint list data.

 

 

Data Flow Task

 

Setting up the import can be one of the more painful tasks, depending on the number of columns you have on the view and list your are hitting. The XML that you will have after pulling the data will have GUIDs for all of the user defined columns in your list, which probably is where most of your data is.

To know exactly how the GUIDs in your XML export map up to columns on your list, you can export the schema for SharePoint. In the Lists web service, hit the method GetList and pass it the name of the List you are using. Your output will have the schema of the list. Here is a sample for one column:

 

 <Field ReadOnly="TRUE" DisplayName="TestColumn" Name="xd__x007b_8CEEC711_x002d_0B7C_x002d_4C6A_x002d_9F82_x002d_5584303A997F_x007d_" XName="{8CEEC711-0B7C-4C6A-9F82-5584303A997F}" Node="/my:myFields/my:TestList/my:TestColumn" Type="Text" ColName="nvarchar18" /> 

The XName element has the GUID for the column that will show up in the data export. You can use the output from the GetList method to build mappings for your export.

 

Your Data Flow Task at a minimum will have two objects: an XML Source and your OLE DB Destination. I would recommend putting a Data Conversion object between the two – that way you can save on space on the insert into the database and have a way to redirect failed rows for analysis should you get something really strange in your incoming dataset.

  

That rounds it out; please feel free to post any questions.

Tags:
Categories: SSRS | SharePoint | SSIS

October 25, 2007 04:04 by Sid
E-mail | Permalink | Comments (16) | Comment RSSRSS comment feed

Comments

May 10. 2008 01:13

Jason McLean

A great aritcle and tutorial.  
Thanks,
Jason

Jason McLean

August 20. 2008 20:59

Kevin Idzi

You can now connect to Sharepoint easier - a public ssis adapter for sharepoint (a source and destination adapter) has been released:
http://www.codeplex.com/SQLSrvIntegrationSrv

Kevin Idzi

August 26. 2008 08:08

Sid

Kevin,

Thanks for your post -- I never got the notification event, so sorry for the delayed response. I'll take a look at the samples and see what they are using to connect to SharePoint. One of the reasons for the sample posted above was the lack of access to the SharePoint APIs, so we had to use web services. Again, thanks for the post, I'll check out the CodePlex project.

Sid

August 29. 2008 22:38

Kevin Idzi

The samples also use the web service.  They provide some interesting features:
- You can craft a custom CAML expression to filter the data you want
- Custom batch size - the component does all of the paging for you, but depending on the width of the row of data, sometimes you can pull a few thousand at a time, sometimes a few hundred, so you can easily change it around
- Only pulls the queries used in the output columns, so if you remove unnecessary output columns, it will be faster (removing the basic ones almost allows 2x the batch size in my simplified tests)
- Works in a data flow, so you can integrate from other sources. And has a destination adapter so you can adjust and then update the sharepoint site afterwards with changes - or delete rows if you wish.

Some upcoming features are on the codeplex site, and you can post any comments / issues on there as you find them. Enjoy.

Kevin Idzi

September 17. 2008 08:07

trackback

Trackback from Sid Atkinson Jr.

One year later

Sid Atkinson Jr.

September 29. 2008 00:47

Robin Mathew

Hi,
Thanks for this article.
I am trying to get the list of data from sharepoint list. I have
created 5 columns and in the view I selected(ticked) only these 5
columns. I am using the code
Dim ndViewFields As XmlNode = xmlDoc.CreateNode(XmlNodeType.Element,
"ViewFields", "")
ndViewFields.InnerXml = "<FieldRef Name='Field1'/><FieldRef
Name='Field2'/>"
xmlnode xnode =<Proxy List>.GetListItems("Project Master", Nothing,
Nothing, ndViewFields, Nothing, Nothing, Nothing)
I able to retrive all those columns. But one of the column is
"Multiple lines of text". So while retrieving the xml data of that
column, it is giving the data in the HTML format.

So because of this html data, I am not able to get all the records.
Please help me to get only the value of that column?

Thanks in Advance
Robin Mathew

Robin Mathew

October 9. 2008 11:05

Kevin I

If you are interested in this, please use the general Releases tab instead of a direct link:
www.codeplex.com/.../ProjectReleases.aspx

Otherwise, you'll always get that release and miss out if any changes occur.

Bundled in that release is a SharepointUtility DLL as well which provides rich access to the Lists.asmx from code (or a script task).  In addition to list items, it also will handle uploading/deleting files from a document library and much more flexibility over all aspects of SharePoint - without writing any XML.  The Adapter uses that DLL extensively, and although it should probably be its own project (as it is overlooked), it is pretty cool.

Kevin

Kevin I

October 10. 2008 10:00

Sid

Kevin,
Many thanks for continuing to follow up -- this information is great.  Maybe we can get you and the other developers of these SSIS extentions to present at the 2009 BI conference? The work you guys have posted on CodePlex is incredibly practical, useful, and covers a lot of things we see our clients asking for in SSIS.

Sid

Add comment


(Will show your Gravatar icon)

  Country flag

biuquote
  • Comment
  • Preview
Loading



Sponsors

Author

Tags

Recent Posts