Imagine you are given a table with a column of XML in it and a few million rows of data. Further imagine that you want to write an efficient query using an attribute of that XML in the WHERE clause of your query.
A short test scenario describes how I recently dealt with the problem. First, a basic table definition. The XML will end up in the SomeContent field.
CREATE TABLE [dbo].[MarcTest]
(
[ID] [int] NOT NULL,
[TypeCode] [varchar](20) NOT NULL,
[SomeContent] [varchar](MAX) NOT NULL
CONSTRAINT [PK_MarcTest] PRIMARY KEY CLUSTERED
([ID] ASC)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
Now for some test data. As in my real-life scenario, the schema of the XML data varies from row to row.
INSERT INTO marctest (ID, TypeCode, SomeContent) VALUES(1, ‘CODE1’, ‘<MyDoc><MetaData><EmployeeID>13791</EmployeeID><Name>Marc Moore</Name></MetaData><ContentData><ContentField1>This is some content.</ContentField1><ContentField2>This is some additional content.</ContentField2></ContentData></MyDoc>’)
INSERT INTO marctest (ID, TypeCode, SomeContent) VALUES(2, ‘CODE2’, ‘<MyDoc><MetaData><EmployeeID>4</EmployeeID><Name>Xi Zhang</Name></MetaData><ContentData><ContentField1>This is some content.</ContentField1><ContentField2>This is some additional content.</ContentField2></ContentData></MyDoc>’)
INSERT INTO marctest (ID, TypeCode, SomeContent) VALUES(1001, ‘OTHER1’, ‘<SomeOtherData><Data1>111111</Data1><Data2>222222</Data2>333333<Data3></Data3></SomeOtherData>’)
GO
It’s not possible to create an index directly on an XML column’s XPath expression and the XML indexes offered by SQL Server 2008 have a bad reputation for being bloated. So what to do? The solution that follows was blessed by David, our DBA, as a valid one.
Based on this question and answer about SQL, XML, and XPath, what I first wanted to do was create a computed column based on an XPATH expression like this one:
Unfortunately that does not work. There are a couple of reason for this: The SomeContent column is not an XML column, which doesn’t help, nor does the fact that SQL considers this expression non-deterministic.
Various articles led me to create this scalar function to help solve the problem:
FUNCTION MarcTestEmployeeID
(
@SomeXml xml
)
RETURNS INT
AS
BEGIN
DECLARE @ret INT = @SomeXml.value(‘(/MyDoc/MetaData/EmployeeID/node())[1]’, ‘int’)
IF @ret IS NULL
SET @ret = 0
RETURN @ret
END
Now SQL was happy and let me create the computed column
ALTER TABLE MarcTest ADD EmployeeID as ([dbo].[MarcTestEmployeeID]([SomeContent]))
Unfortunately, when I attempted to create my index:
CREATE INDEX MarcTestEmployeeID_IDX ON dbo.MarcTest(TypeCode, EmployeeID)
SQL became unhappy once more:
Msg 2729, Level 16, State 1, Line 2
Column ‘EmployeeID’ in table ‘dbo.MarcTest’ cannot be used in an index or statistics or as a partition key because it is non-deterministic.
I’m sure SQL has its own reasons for doing this, but its logic was clearly incorrect as far as my function goes. Given a blob of XML, dbo.MarcTestEmployeeID will always return the same answer for that XML; ergo, it is deterministic. Yet SQL remained unhappy.
Happily, this question and answer about deterministic scalar functions solved the issue with a one-line change to the function definition:
FUNCTION MarcTestEmployeeID
(
@SomeXml xml
)
RETURNS INT
WITH SCHEMABINDING
…
After creating my index, my sample “SELECT *” query returned the following, which is just what I would expect:
Next, I inserted a couple of thousand more records to make things more interesting and ran this query, which is similar to the one I plan to run in the real world:
SELECT * FROM MarcTest WHERE TypeCode = ‘CODE1’ AND EmployeeID = 13791
The execution plan indicates that my new index that’s built on the computed column using the UDF and XML data is being used:
Exactly what was hoped for!