If you need to extract a sizeable amount of data from a Microsoft Content Management Server 2002 (MCMS) database you have two options:
- Using CMS 2002 API (PAPI)
- Interacting directly with the CMS 2002 database
Although the second option it is totally discouraged due to the fact that the database schema is not published and CMS server executes complex procedures against this database, we are going to use it, but we are only going to do simple read-only queries against the CMS database. Under no circumstances should write operations be performed directly against the database. The PAPI is the appropriate interface for writing to the CMS database.
DISCLAIMER: All described in this post was used for me exporting out of MCMS 2002 migrating to Wordpress and Liferay. It worked for me, but that does not mean it has to be all right.
Database Schema
My MCMS 2002 database had the following tables:
CMS data is stored in various formats within these tables. Finding information stored in a particular placeholder is difficult. Placeholder data is often stored in binary large object (BLOB) format and cannot be easily found or read. Resources within placeholders are stored in the blob table, but placeholder data is stored in the NodePlaceholderContent table. It is difficult to find placeholder data, because CMS assembles various pieces of content from this table.
The tables storing most of the data are Node and NodePlaceholderContent
Node types
Node table contains the core information about the CMS system. There are different note types. You can execute the following select to check the different node types.
SELECT
\[Type\]
,CASE \[Type\]
WHEN 1 THEN 'Server'
WHEN 4 THEN 'Channel'
WHEN 16 THEN 'Post / Page'
WHEN 64 THEN 'Resource Gallery'
WHEN 256 THEN 'Resource Gallery Item (images/documents)'
WHEN 16384 THEN 'Template Gallery'
WHEN 65536 THEN 'Template'
WHEN 262144 THEN 'NT Domain User'
WHEN 524288 THEN 'Role Gallery'
WHEN 1048576 THEN 'Role'
WHEN 2097152 THEN 'Archive Folder / Folders / Deleted Items'
WHEN 16777216 THEN 'Schemas'
WHEN 33554432 THEN 'Page Definition' END as \[Description\]
,COUNT(\[Type\]) as \[Count\]
FROM dbo.Node
GROUP BY \[Type\]
ORDER BY \[Type\]
The main columns in table Node are Type, isShortcut, NodeGUID, ParentGUID, FollowGUID, Name, TemplateGUID and PageDefinitionGUID. There are more columns, but they are properties what must be used depending of node type.
Pages and Postings
When a page content and the page properties in CMS 2002 are submitted or posted, the combination is referred to in CMS as a posting. Both terms, pages and posting are very often referred to interchangeably.
From database point of view, page is a node which Type column is equals to 16 and which isShorcut column is equals to 0. Pages have its column FollowGUID set to null. Postings have its column isShortcut set to 1. Postings and Pages are related by column FollowGUID.
A page is published if ApprovalStatus = 1
Given a page, its postings are nodes which FollowGUID is equal to page's NodeGUID.
Parent pages are usually folders and parent postings are channels (Type=4).
Templates, Page Definitions and Schemas
Given a page, if you want to know its template, you have to read its posting. Next you can reach the template using the value of TemplateGUID column from posting record. In TemplateAspPath column you will find the path of the template. Template is usually an aspx page. Finally, if you want to know the fields (placeholders) of the template, you have to read the page definition record, using the value of PageDefinitionGUID from posting record. Schema xml of the page definition is in SchemaXML column. Parent node of a page definition record is a node called Schemas. The root node or main node is usually called Server.
The following code shows the connection between all nodes.
/* Page */
SELECT isShortcut, ApprovalStatus, Type, NodeGUID , ParentGUID, FollowGUID, TemplateGUID, Name, PageDefinitionGUID , SchemaXML
FROM dbo. Node
where NodeGuid = '{928B8DA2-xxx}'
and isShortcut = 0
/* Posting */
SELECT isShortcut, ApprovalStatus, Type, NodeGUID , ParentGUID, FollowGUID, TemplateGUID, Name, PageDefinitionGUID , SchemaXML
FROM dbo. Node
where FollowGuid = '{928B8DA2-xxx}'
and isShortcut = 1
/* Template */
SELECT isShortcut, ApprovalStatus, Type, NodeGUID , ParentGUID, FollowGUID, TemplateGUID, Name, PageDefinitionGUID , SchemaXML, TemplateAspPath
FROM dbo. Node
where NodeGuid = '{8971902E-xxx}'
/* PageDefinitionGUID */
SELECT isShortcut, ApprovalStatus, Type, NodeGUID , ParentGUID, FollowGUID, TemplateGUID, Name, PageDefinitionGUID , SchemaXML
FROM dbo. Node
where NodeGuid = '{8BE8D823-xxx}'
/* Schemas */
SELECT isShortcut, ApprovalStatus, Type, NodeGUID , ParentGUID, FollowGUID, TemplateGUID, Name, PageDefinitionGUID , SchemaXML
FROM dbo. Node
where NodeGuid = '{128FE591-xxx}'
/* Root node - Server */
SELECT isShortcut, ApprovalStatus, Type, NodeGUID , ParentGUID, FollowGUID, TemplateGUID, Name, PageDefinitionGUID , SchemaXML
FROM dbo. Node
where NodeGuid = '{9673B503-xxx}'
The following picture shows the result of the code above.
In SchemaXML column you will find the schema of the page definition.
An example of page definition is:
<PageDefinition>
<PlaceholderDefinitions>
<PlaceholderDefinition>
<Name>PH_DESCRIPTION</Name>
<PlaceholderAssembly>Microsoft.ContentManagement.Publishing.Extensions.Placeholders,
Version=5.0.1200.0, Culture=neutral,
PublicKeyToken=31bf3856ad364e35</PlaceholderAssembly>
<PlaceholderClass>Microsoft.ContentManagement.Publishing.Extensions.Placeholders.HtmlPlaceholderDefinition</PlaceholderClass>
<PlaceholderSettings>
<HtmlPlaceholderDefinition>
<Description />
<Name>PH_DESCRIPTION</Name>
<Formatting>NoFormatting</Formatting>
<AllowHyperlinks>false</AllowHyperlinks>
<MustUseResourceGallery>false</MustUseResourceGallery>
<AllowAttachments>false</AllowAttachments>
<AllowImages>false</AllowImages>
<UseGeneratedIcon>false</UseGeneratedIcon>
<AllowLineBreaks>true</AllowLineBreaks>
</HtmlPlaceholderDefinition>
</PlaceholderSettings>
</PlaceholderDefinition>
<PlaceholderDefinition>
<Name>PH_TEXTIMAGE</Name>
<PlaceholderAssembly>Microsoft.ContentManagement.Publishing.Extensions.Placeholders,
Version=5.0.1200.0, Culture=neutral,
PublicKeyToken=31bf3856ad364e35</PlaceholderAssembly>
<PlaceholderClass>Microsoft.ContentManagement.Publishing.Extensions.Placeholders.HtmlPlaceholderDefinition</PlaceholderClass>
<PlaceholderSettings>
<HtmlPlaceholderDefinition>
<Description />
<Name>PH_TEXTIMAGE</Name>
<Formatting>FullFormatting</Formatting>
<AllowHyperlinks>false</AllowHyperlinks>
<MustUseResourceGallery>false</MustUseResourceGallery>
<AllowAttachments>true</AllowAttachments>
<AllowImages>true</AllowImages>
<UseGeneratedIcon>true</UseGeneratedIcon>
<AllowLineBreaks>true</AllowLineBreaks>
</HtmlPlaceholderDefinition>
</PlaceholderSettings>
</PlaceholderDefinition>
<PlaceholderDefinition>
<Name>PH_BIGIMAGE</Name>
<PlaceholderAssembly>Microsoft.ContentManagement.Publishing.Extensions.Placeholders,
Version=5.0.1200.0, Culture=neutral,
PublicKeyToken=31bf3856ad364e35</PlaceholderAssembly>
<PlaceholderClass>Microsoft.ContentManagement.Publishing.Extensions.Placeholders.ImagePlaceholderDefinition</PlaceholderClass>
<PlaceholderSettings>
<ImagePlaceholderDefinition>
<Description />
<Name>PH_BIGIMAGE</Name>
<MustUseResourceGallery>false</MustUseResourceGallery>
<AllowHyperlinks>false</AllowHyperlinks>
</ImagePlaceholderDefinition>
</PlaceholderSettings>
</PlaceholderDefinition>
</PlaceholderDefinitions>
</PageDefinition>
A page definition is built using Template Manager, configuring PlaceholderDefinitions property as is shown in the following image.
There are several types of placeholder definitions, such as HtmlPlaceholderDefinition, XmlPlaceholderDefinition, AttachmentPlaceholderDefinition, ImagePlaceholderDefinition, OfficeHtmlPlaceholderDefinition and OfficeAttachmentPlaceholderDefinition. In Template Manager you select the type of placeholder definition, when you create a new placeholder definition.
Information about pageholder information is also stored in NodePlaceholder. In our case, our template has three placeholders.
/* Page */
SELECT id, isShortcut, ApprovalStatus, Type, NodeGUID , ParentGUID, FollowGUID, TemplateGUID, Name, PageDefinitionGUID , SchemaXML
FROM node
where nodeguid = '{928B8DA2-0D0A-4A3B-921D-39642A766FB2}'
/* Placeholders */
select *
from dbo.NodePlaceholder
where nodeid = 699289
Properties
You can reach out the properties of a node reading NodeProperty table by NodeId. You have to use page node id to get properties of a page.
select id
from dbo.node
where nodeguid = '{F1D91EA1-}'
and isShortcut = 0
select *
from dbo.NodeProperty
where NodeId = 696567
You can do the same, for example, with channel nodes.
Placeholder contents and resources
The content is stored in NodePlaceholderContent. You can get placeholder contents following these sentences:
/* Posting by TemplateGUID */
SELECT id, isShortcut, ApprovalStatus, Type, NodeGUID , ParentGUID, FollowGUID, TemplateGUID, Name, PageDefinitionGUID , SchemaXML
from dbo.Node where TemplateGUID='{8971902E-xxx}'
/* Page */
SELECT id, isShortcut, ApprovalStatus, Type, NodeGUID , ParentGUID, FollowGUID, TemplateGUID, Name, PageDefinitionGUID , SchemaXML
FROM node
where nodeguid = '{928B8DA2-xxx}'
/* Placeholder contents */
select *
from dbo.NodePlaceholderContent
where nodeid = 699289
/* Resources */
select *
from dbo.NodeResource
where nodeid = 699289
/* Archive */
select *
from dbo.BlobTable
where BlobId = 150676
You can analyze the relation between records in more detail in the next picture.
You need the page's id for getting the records of placeholder contents. The value of every placeholder is stored in records with value 'HTML' in PropName column. In our case, the third placeholder is an image placeholder. So we need get its related resource. As you can see in the image, in the value <img src="" alt="Black horse" border="0"> the virtual identifier for the resource is ControlProp0. Then you need to look for the value ControlProp0 in PropName column in records with the same NodeId and PlaceholderId, that is in our case, NodeId = 699289 and PlaceholderId = 3. Once located the record, you get the value in ResourceId, in our case 1, and next you look for the record with this Id and the same NodeId (699289) in NodeResource table. Finally, with the value of ResourceBlobId (150676) of this table you will find the bits of the resource in BlobData in BlobTable table. For the name of the resource, you will need to concatenate the value of Name column in NodeResource with BlobFileExt ins BlobTable table. In our case, the name is blackhorse.png.
When the html page is built, the html code for image is converted to:
<img src="/NR/rdonlyres/069FA0F9-xxx/150676/blackhorse.png" alt="Black horse" >
As you can see, the url for the image source is built with the posting's node guid.
In placeholders where properties AllowAttachments or AllowImages you could find inside its values patterns similar to <a href="">Report.pdf. You will have to procede in the same way that was described in the paragraph above in order to get the resource.
Last but not least, you can reach out resources concatenating hexadecimal value of the BlobId with BlobSecret. For example, in ower case
BlobId -> 150676 (dec) = 00024C94 (hex)
<img src="/NR/rdonlyres/00024C94/krwjbyrpflxq.../blackhorse.png" alt="Black horse" >