Click here to monitor SSC
  • Av rating:
  • Total votes: 17
  • Total comments: 7
Joe Sack

Exploring Semantic Search Key Term Relevance

09 October 2012

SQL Server's 'Semantic Search' feature seemed an exciting feature when first shown. Was it really true that Microsoft had come up with a system to rival the industry-leaders, one that could extract the contextual meaning of terms in text, or automatically categorise the subject matter of text? On first inspection, it seems unlikely.

Semantic Search is the new feature introduced in SQL Server 2012 that allows you to extract statistically relevant words, referred to by Microsoft as ‘key terms’, from unstructured documents stored in SQL Server. In addition to key term extraction, you can also find documents that are similar to one another based on the shared relevant key terms.

Note: I say “key terms” but you may see official Microsoft documentation call them “key phrases.” This is a bit of a misnomer because as of SQL Server 2012 RTM, these “phrases” are unigrams (single word terms). I do hope we’ll see this expand in future versions to support bigrams, trigrams and beyond.

If you’re wondering whether you can use Semantic Search for your own application requirements, I strongly recommend you test this functionality using your own data. The viability of this new feature depends on the nature of the data being explored. For example, if you’re looking to create an application that pulls unigrams from a set of customer documents, you’ll want to make sure that the terms that were extracted are truly relevant to the subject matter and not assume that all subject areas have equivalent support by the Semantic Search feature.

I’ve experimented with various types of SQL Server related documents in the past and I do see that in some cases the key terms that are extracted are what I would expect. However in several cases I’ve also found that terms extracted are not what I expected.

Let’s take an example of exploring a library of technical articles published by Simple-Talk. My Red Gate editor provided me with a database containing 1,388 articles. For the table containing these articles, I created an associated full-text index and enabled the Semantic Search feature. The table name containing the article text is called dbo.xlaANMarticles.

So to begin with, I asked a question about which terms were most significant across the entire library of articles? I asked this question using the following query, leveraging the SEMANTICKEYPHRASETABLE function:

SELECT TOP (10)
        
[skp].[keyphrase],
        
SUM([skp].[score]) AS [totalscore]
FROM    [dbo].[xlaANMarticles] AS wp
INNER JOIN SEMANTICKEYPHRASETABLE(xlaANMarticles, article) AS [skp]
        
ON [wp].[articleid] = [skp].[document_key]
GROUP BY [skp].[keyphrase]
ORDER BY SUM([skp].[score]) DESC;
GO

This query returned the following 10 rows:

keyphrase totalscore
font-family 883.2972255
font 844.5126802
font-weight 789.5226674
margin-left 775.4817398
font-size 763.5196039
span 720.4189129
nbsp 701.175603
background-color 689.4987194
padding-left 687.912422
padding 682.9183152

As you can see – we’re getting noise. The articles are stored in the table with the original article text and associated HTML metadata. So before we can even see the relevant keywords, I needed to filter out the noise. For example, I created a full-text stoplist and added several “noise” terms:

CREATE FULLTEXT STOPLIST [slArticles];
ALTER FULLTEXT STOPLIST [slArticles] ADD 'font-family' LANGUAGE 'English';
ALTER FULLTEXT STOPLIST [slArticles] ADD 'font' LANGUAGE 'English';
ALTER FULLTEXT STOPLIST [slArticles] ADD 'font-weight' LANGUAGE 'English';
ALTER FULLTEXT STOPLIST [slArticles] ADD 'margin-left' LANGUAGE 'English';
ALTER FULLTEXT STOPLIST [slArticles] ADD 'span' LANGUAGE 'English';
ALTER FULLTEXT STOPLIST [slArticles] ADD 'nbsp' LANGUAGE 'English';
ALTER FULLTEXT STOPLIST [slArticles] ADD 'background-color' LANGUAGE 'English';
ALTER FULLTEXT STOPLIST [slArticles] ADD 'padding-left' LANGUAGE 'English';
ALTER FULLTEXT STOPLIST [slArticles] ADD 'padding' LANGUAGE 'English';
GO

I ended up having to add several stopwords, manually reviewing what I knew was noise given the common subject-matter of the articles. Some terms like “table” or “column” were ambiguous. Were we talking about HTML or SQL in that context? There was not reliable way to know programmatically.

Also, as a short-cut, I scripted out the scripting of the ALTER FULLTEXT STOPLIST and then added the (more) obvious metadata choices:

SELECT TOP (100)
        
'ALTER FULLTEXT STOPLIST [slArticles] ADD ''' + skp.keyphrase +
        
''' LANGUAGE ''English'';',
        
SUM([skp].[score]) AS [totalscore]
FROM    [dbo].[xlaANMarticles] AS wp
INNER JOIN SEMANTICKEYPHRASETABLE([xlaANMarticles], [article]) AS skp
        
ON [wp].[articleid] = [skp].[document_key]
GROUP BY [skp].[keyphrase]
ORDER BY SUM([skp].[score]) DESC;
GO

After adding the various stopwords, I made sure that my stoplist was associated with my full-text index. For example:

CREATE FULLTEXT INDEX ON [dbo].[xlaANMarticles]
    ([article] LANGUAGE 1033 STATISTICAL_SEMANTICS)
    
KEY INDEX [PK_articles]
  
ON [ft_SimpleTalkArticles]
  
WITH STOPLIST = [slArticles];
GO

After adding the stoplist and rebuilding the catalog, I saw the following top results:

keyphrase totalscore
sql 483.0557612
database 224.5216756
select 156.4048312
query 126.7041208
css 124.2681781
type 110.4776265
databases 99.8270269
varchar 96.96654268
your 96.69461793
microsoft 94.31538226

As you can see, we’re getting closer to “relevant” terms – but there was still some noise. Including “CSS”, “your” and “Microsoft”. I decided to spend a few more minutes removing additional noise terms, including that noise that was related to the subject matter, for example – “Microsoft” may not be noise in some scenarios, but most certainly isn’t a helpful term when looking for statistically significant terms from a Microsoft-centric publication website.

My new top 10 was as follows:

keyphrase totalscore
sql 483.2933343
database 225.7992599
select 158.2329725
query 127.7730246
databases 100.7824847
script 70.22023614
log 61.72487602
column 61.24853854
developers 60.87272082

Is this helpful compared to a standard search engine? It seems unlikely at this point.

What about the top significant terms for a specific article? I tested the following query against an article on Availability Group wait statistics I did back in May of 2012:

SELECT TOP (15)
        
[skp].[keyphrase],
        
SUM([skp].[score]) AS [totalscore]
FROM    [dbo].[xlaANMarticles] AS [wp]
INNER JOIN SEMANTICKEYPHRASETABLE([xlaANMarticles], [article]) AS [skp]
        
ON [wp].[articleid] = [skp].[document_key]
WHERE   [headline] = 'A first look at SQL Server 2012 Availability Group Wait Statistics'
GROUP BY [skp].[keyphrase]
ORDER BY SUM([skp].[score]) DESC;
GO

In this case I saw the following terms scoped to just my article:

keyphrase totalscore
nvarchar 0.437133938
sql 0.40759635
xel 0.391592562
failover 0.38552931
sqlserver 0.367718428
waitfor 0.352442741
replica 0.348848522
synchronous 0.347228736
duration 0.330882519
concurrency 0.321891427
asynchronous 0.311569184
wait 0.30677563
availability 0.299674034
desc 0.297606289
replicas 0.296193898

How relevant were these terms? Certainly “replica”, “synchronous”, “asynchronous”, “wait”, “availability”, and “replicas” fall into that category. But conspicuously missing are the various wait types themselves. For example, I would have expected to see wait types such as HADR_WORK_QUEUE, HADR_SYNC_COMMIT, and WRITELOG. Looking through the top 100 key phrases (the capped term list), I didn’t see these phrases surfaced as meaningful, even though from a search perspective people are likely to use them as key terms.

In addition to the missing relevant terms, we’re seeing a key limitation with using unigrams because we cannot represent the key phrase “wait statistics” as something that is likely the most relevant term in the entire paper. You can also see how subjective “relevant” is when we’re talking about specific industries or subject areas. How do you output terms that are as relevant for SQL Server as they are for the construction, retail, and medical industries? I’d see this as a very difficult task to achieve without some kind of human intervention.

Another example includes a query for the top terms from Jonathan Kehayias’ “Optimizing tempdb configuration with SQL Server 2012 Extended Events” article:

keyphrase totalscore
sql 0.640196383
sqlserver 0.543729305
in-memory 0.534120202
contention 0.512095034
latch 0.488780767
allocation 0.485334814
gam 0.473063797
predicate 0.468191713
xel 0.466187418

Inability to have the phrase “Extended Events” is a key limiting factor – and with that aside, there was no identification of “tempdb” as a statistically significant word. There were plenty of other relevant words like “allocation”, “bitmap”, “contention”, “gam” (but no “sgam” or “pfs” terms interestingly enough given the frequency), “in-memory”, “histogram” – but would they be relevant in absence of the core subject of the paper?

In conclusion, I feel Semantic Search is an interesting first “alpha” version and I really do hope it is further cultivated by the Microsoft product team. I’m also interesting in hearing how it is applied in real-world scenarios. There may be applications that can benefit from it today, and I’d like to understand where the current boundaries of utility exist. If there is a next version, ideally it includes support for true key phrases. I’d also like to see the ability to extend the Semantic Search model to support industry terminology and custom term relevance weighting. That involves human-intervention of course, which isn’t ideal for various scenarios, but I’d still be willing to do some manual intervention in order in order to find relevant phrases in a library of content.

Joe Sack

Author profile:

Joe Sack (Joe@SQLskills.com, twitter @josephsack) is a Principal Consultant, author and trainer at SQLskills.com. Prior to joining SQLskills, Joe worked at Microsoft as a Premier Field Engineer for large enterprise environments. He is a SQL Server MVP and Microsoft Certified Master (MCM) for versions 2005 & 2008. For his last few years at Microsoft, he was responsible for the SQL Server MCM Program (2009 to 2011). He has written several books, articles and whitepapers including SQL Server 2008 Transact-SQL Recipes (Apress, 2008) and SQL Server 2005 T-SQL Recipes (Apress, 2005).

Search for other articles by Joe Sack

Rate this article:   Avg rating: from a total of 17 votes.


Poor

OK

Good

Great

Must read
Have Your Say
Do you have an opinion on this article? Then add your comment below:
You must be logged in to post to this forum

Click here to log in.


Subject: Multiple Word Key Phrases Needed
Posted by: Steve K (not signed in)
Posted on: Monday, October 15, 2012 at 8:45 AM
Message: I agree with Joe. In my experimentation for a business subject domain, I identified the lack of multi-word key phrases as the most significant limitation of this version of semantic search. I strongly encourage Microsoft to rectify this in an upcoming Service Pack for SQL Server 2012.

Subject: Multiple Word Key Phrases Needed
Posted by: Joe Sack (not signed in)
Posted on: Monday, October 15, 2012 at 8:50 AM
Message: Thanks Steve. I definitely hope they grow this feature. Lots of (unrealized) potential.

Subject: Multiple words, key phrase, Sentiment analysis and Relevance
Posted by: kevinkunz2000 (view profile)
Posted on: Thursday, October 18, 2012 at 7:32 AM
Message: As Joe clearly pointed out, this is an Alpha product; and IMO barely as it is nearly impossible to develop a practical business use for the feature. That said there is an interesting application out on CodePlex: http://mysemanticsearch.codeplex.com/, but really we have not progressed very far from standard search features provided in something like SharePoint. I know that comment may create an uproar for many reasons but when we look at semantic search from a pure functional perspective…well it doesn’t provide much. To be truly useful, yes we need to be able to search actual phrases, compare those as well to a series of lexicons, and perhaps provide weighting (The correlation score) based on mappings relevant to say emotion, confidence, etc… Then we have progressed into the area of sentiment analysis and rudimentary computational linguistics, where we can begin to provide true business value.

Subject: Not quite semantic - yet
Posted by: Ralph (not signed in)
Posted on: Thursday, October 18, 2012 at 9:13 AM
Message: Joe, Thanks for the article and good review of a nice new feature is SQL 2012. Statistical relevance can be useful but is not yet quite "semantic" in the sense of conveying words with meaning. I think that's what kevin was referring to with the statement "...Then we have progressed into the area of sentiment analysis and rudimentary computational linguistics...". I agree. It would seem that there would be lots of challenges. Most linguistic communication comes with some cultural context which would have suppositions weighted for impact based on phrase use. Wouldn't it be a hoot configuring a linquistic interpretive engine to read, searching for relevent concepts from the perspective of a financially astute investor or a medical research doctor or a left wing politician all reading from the same document?

Subject: use filters to strip the html tags
Posted by: jesse g (not signed in)
Posted on: Thursday, October 18, 2012 at 11:07 PM
Message: Add two computed columns, one with a static value of '.html', and one with the body casted as varbinary(max), and recreate your fulltext index specifying a type column. Then you won't need a custom stop list.

The main problem with statistical semantics is that domain specific terms aren't incorporated into the statistical db.

There are two main inputs to calculate the key terms in a document, first is the relative freq of each term in the document, second is the relative freq of the term across all documents.

MS provides the semanticsdb which contains a couple hundred thousand terms and their frequency. This allows you to index one document and pull out key phrases. If the term isn't in the semanticsdb, it will never show up as a key phrase.

Until they introduce a way to incorporate domain specific terms into the semanticsdb, I'll have limited use for it. However, for general topics (e.g. newspaper articles), it works quite well.

This is definitely a feature to keep your eye on over the next couple years.

Subject: Follow-ups
Posted by: Joe Sack (not signed in)
Posted on: Friday, October 19, 2012 at 9:10 AM
Message: Kevin, Ralph and Jesse - thanks for the comments!

Jesse - thanks for the filter suggestion. Agreed that this will eliminate the HTML noise. Also agreed that the viability of this feature depends on the incorporation of domain specific terms. I truly hope this happens and doesn't go the way of "English Query".

Thanks

Subject: thanks Joe and everyone
Posted by: Mike D (not signed in)
Posted on: Friday, October 19, 2012 at 11:45 AM
Message: Joe and all subsequent commenters, thanks for the article and follow-ons, most helpful. jesse g, thanks in particular!

Having now experimented with Semantic Search, I agree with everyone and must admit my surprise that so much noise has been made about such a quarter-baked "feature." My (rapidly fading) hope has been to analyze the content of customer communication using Statistical Semantic Search [SSS].

Particularly annoying are the many references to "keyphrases" when only unigrams are currently supported. Does this point to internal versions that do in fact support ngrams?

Now searching for hacks to extend SSS so that it does something interesting. Intuitively it feels like there must be ways to put even the existing limited functionality to effective use. kevin, you encourage me! Maybe we can give SSS some prosthetic appendages, so to speak.

MySemanticSearch on CodePlex is a trailmarker perhaps. Way back in the day I worked on the AT&T text-to-speech engine and adaptive parsers for help and training systems. Ralph, what you describe can be and has been done, but your implication that completeness in a given domain is non-trivial (impossible in the general case, although we can get close enough to be useful) is correct.

Perhaps Microsoft can open SSS up a bit, that would be a great start. Ah yes, dream on.

 

Phil Factor
Searching for Strings in SQL Server Databases

Sometimes, you just want to do a search in a SQL Server database as if you were using a search engine like Google.... Read more...

 View the blog

Top Rated

SQL Server XML Questions You Were Too Shy To Ask
 Sometimes, XML seems a bewildering convention that offers solutions to problems that the average... Read more...

Continuous Delivery and the Database
 Continuous Delivery is fairly generally understood to be an effective way of tackling the problems of... Read more...

The SQL Server Sqlio Utility
 If, before deployment, you need to push the limits of your disk subsystem in order to determine whether... Read more...

The PoSh DBA - Reading and Filtering Errors
 DBAs regularly need to keep an eye on the error logs of all their SQL Servers, and the event logs of... Read more...

MySQL Compare: The Manual That Time Forgot, Part 1
 Although SQL Compare, for SQL Server, is one of Red Gate's best-known products, there are also 'sister'... Read more...

Most Viewed

Beginning SQL Server 2005 Reporting Services Part 1
 Steve Joubert begins an in-depth tour of SQL Server 2005 Reporting Services with a step-by-step guide... Read more...

Ten Common Database Design Mistakes
 If database design is done right, then the development, deployment and subsequent performance in... Read more...

SQL Server Index Basics
 Given the fundamental importance of indexes in databases, it always comes as a surprise how often the... Read more...

Reading and Writing Files in SQL Server using T-SQL
 SQL Server provides several "standard" techniques by which to read and write to files but, just... Read more...

Concatenating Row Values in Transact-SQL
 It is an interesting problem in Transact SQL, for which there are a number of solutions and... Read more...

Why Join

Over 400,000 Microsoft professionals subscribe to the Simple-Talk technical journal. Join today, it's fast, simple, free and secure.