Exploring Semantic Search Key Term Relevance

SQL Server's 'Semantic Search' feature seemed exciting when it was first shown. Was it really true that Microsoft had come up with a system to rival the industry leaders, one that could extract the contextual meaning of terms in text, or automatically categorise the subject matter of text? On first inspection, it seems unlikely.

Semantic Search is a new feature introduced in SQL Server 2012 that allows you to extract statistically relevant words, referred to by Microsoft as ‘key terms’, from unstructured documents stored in SQL Server. In addition to key term extraction, you can also find documents that are similar to one another, based on the relevant key terms they share.

Note: I say “key terms”, but you may see official Microsoft documentation call them “key phrases.” This is a bit of a misnomer because, as of SQL Server 2012 RTM, these “phrases” are unigrams (single-word terms). I do hope we’ll see this expand in future versions to support bigrams, trigrams and beyond.

If you’re wondering whether you can use Semantic Search for your own application requirements, I strongly recommend you test this functionality using your own data. The viability of this new feature depends on the nature of the data being explored. For example, if you’re looking to create an application that pulls unigrams from a set of customer documents, you’ll want to make sure that the extracted terms are truly relevant to the subject matter, rather than assuming that all subject areas are equally well supported by the Semantic Search feature.

I’ve experimented with various types of SQL Server-related documents in the past, and in some cases the key terms that are extracted are what I would expect. However, in several other cases, the extracted terms were not what I expected.

Let’s take the example of exploring a library of technical articles published by Simple-Talk. My Red Gate editor provided me with a database containing 1,388 articles. For the table containing these articles, dbo.xlaANMarticles, I created an associated full-text index and enabled the Semantic Search feature.
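For reference, enabling Semantic Search is done at full-text index creation time via the STATISTICAL_SEMANTICS option (and it requires the semantic language statistics database to be installed and registered on the instance). A minimal sketch, assuming the article text lives in a column called ArticleText and the primary key index is called PK_xlaANMarticles; both names, and the catalog name, are my own illustration:

```sql
CREATE FULLTEXT CATALOG ftcArticles;
GO

-- STATISTICAL_SEMANTICS enables key phrase extraction and document
-- similarity for the column, in addition to regular full-text indexing.
CREATE FULLTEXT INDEX ON dbo.xlaANMarticles
(
    ArticleText LANGUAGE 1033 STATISTICAL_SEMANTICS
)
KEY INDEX PK_xlaANMarticles
ON ftcArticles;
GO
```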

To begin with, I asked which terms were most significant across the entire library of articles. I asked this question using the following query, leveraging the SEMANTICKEYPHRASETABLE function:
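The original query isn’t reproduced here, but it was along these lines: sum the per-document scores for each key phrase across all full-text indexed columns (the wildcard covers every indexed column), and take the highest-scoring terms. The TOP (10) and the totalscore alias are my reconstruction, matching the result set shown below:

```sql
SELECT TOP (10)
    sskt.keyphrase,
    SUM(sskt.score) AS totalscore
FROM SEMANTICKEYPHRASETABLE(dbo.xlaANMarticles, *) AS sskt
GROUP BY sskt.keyphrase
ORDER BY totalscore DESC;
```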

This query returned the following 10 rows:

keyphrase totalscore
font-family 883.2972255
font 844.5126802
font-weight 789.5226674
margin-left 775.4817398
font-size 763.5196039
span 720.4189129
nbsp 701.175603
background-color 689.4987194
padding-left 687.912422
padding 682.9183152

As you can see, we’re getting noise. The articles are stored in the table with the original article text and associated HTML metadata, so before we could even see the relevant keywords, I needed to filter out the noise. To do so, I created a full-text stoplist and added several “noise” terms:
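The stoplist itself looked something like the following; the stoplist name is my own, and the terms are taken from the noisy results above:

```sql
-- Start from the system stoplist so the standard noise words are kept
CREATE FULLTEXT STOPLIST ArticleStoplist FROM SYSTEM STOPLIST;

ALTER FULLTEXT STOPLIST ArticleStoplist ADD 'font-family' LANGUAGE 1033;
ALTER FULLTEXT STOPLIST ArticleStoplist ADD 'font-weight' LANGUAGE 1033;
ALTER FULLTEXT STOPLIST ArticleStoplist ADD 'nbsp' LANGUAGE 1033;
-- ...and so on for the other CSS/HTML metadata terms.
```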

I ended up having to add quite a few stopwords, manually reviewing what I knew was noise given the common subject matter of the articles. Some terms, like “table” or “column”, were ambiguous: were we talking about HTML or SQL in that context? There was no reliable way to know programmatically.

Also, as a shortcut, I scripted out the ALTER FULLTEXT STOPLIST statements and then added the (more) obvious metadata choices:
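One way of scripting that out is to generate the ALTER FULLTEXT STOPLIST statements directly from the noisy key phrase results, then hand-prune the generated list before executing it (the stoplist name is illustrative, as before):

```sql
-- Generate candidate ADD statements for the current top-scoring terms;
-- review the output by hand, since some terms are genuinely relevant.
SELECT TOP (50)
    'ALTER FULLTEXT STOPLIST ArticleStoplist ADD '''
        + sskt.keyphrase + ''' LANGUAGE 1033;'
FROM SEMANTICKEYPHRASETABLE(dbo.xlaANMarticles, *) AS sskt
GROUP BY sskt.keyphrase
ORDER BY SUM(sskt.score) DESC;
```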

After adding the various stopwords, I made sure that my stoplist was associated with my full-text index. For example:
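Associating the stoplist is a one-line change, after which the index needs to be repopulated for the change to take effect (catalog name illustrative):

```sql
ALTER FULLTEXT INDEX ON dbo.xlaANMarticles
    SET STOPLIST = ArticleStoplist;

-- Setting the stoplist kicks off a full population by default;
-- alternatively, rebuild the catalog explicitly:
ALTER FULLTEXT CATALOG ftcArticles REBUILD;
```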

After adding the stoplist and rebuilding the catalog, I saw the following top results:

keyphrase totalscore
sql 483.0557612
database 224.5216756
select 156.4048312
query 126.7041208
css 124.2681781
type 110.4776265
databases 99.8270269
varchar 96.96654268
your 96.69461793
microsoft 94.31538226

As you can see, we’re getting closer to “relevant” terms, but there was still some noise, including “css”, “your” and “microsoft”. I decided to spend a few more minutes removing additional noise terms, including noise related to the subject matter. For example, “Microsoft” may not be noise in some scenarios, but it most certainly isn’t a helpful term when looking for statistically significant terms on a Microsoft-centric publication website.

My new top 10 was as follows:

keyphrase totalscore
sql 483.2933343
database 225.7992599
select 158.2329725
query 127.7730246
databases 100.7824847
script 70.22023614
log 61.72487602
column 61.24853854
developers 60.87272082

Is this helpful compared to a standard search engine? It seems unlikely at this point.

What about the top significant terms for a specific article? I tested the following query against an article on Availability Group wait statistics that I wrote back in May 2012:
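The per-article query adds the document key as the third argument to SEMANTICKEYPHRASETABLE. A sketch, assuming an integer key column ArticleID and a Title column to locate the article (both column names are my own):

```sql
DECLARE @ArticleID int;

-- Locate the key for the article in question (Title column assumed)
SELECT @ArticleID = ArticleID
FROM dbo.xlaANMarticles
WHERE Title LIKE N'%Availability Group%wait%';

-- Restrict the key phrase scores to just that one document
SELECT TOP (15)
    sskt.keyphrase,
    SUM(sskt.score) AS totalscore
FROM SEMANTICKEYPHRASETABLE(dbo.xlaANMarticles, *, @ArticleID) AS sskt
GROUP BY sskt.keyphrase
ORDER BY totalscore DESC;
```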

In this case I saw the following terms scoped to just my article:

keyphrase totalscore
nvarchar 0.437133938
sql 0.40759635
xel 0.391592562
failover 0.38552931
sqlserver 0.367718428
waitfor 0.352442741
replica 0.348848522
synchronous 0.347228736
duration 0.330882519
concurrency 0.321891427
asynchronous 0.311569184
wait 0.30677563
availability 0.299674034
desc 0.297606289
replicas 0.296193898

How relevant were these terms? Certainly “replica”, “synchronous”, “asynchronous”, “wait”, “availability”, and “replicas” fall into that category. But conspicuously missing are the various wait types themselves. For example, I would have expected to see wait types such as HADR_WORK_QUEUE, HADR_SYNC_COMMIT, and WRITELOG. Looking through the top 100 key phrases (the capped term list), I didn’t see these phrases surfaced as meaningful, even though from a search perspective people are likely to use them as key terms.

In addition to the missing relevant terms, we’re seeing a key limitation with using unigrams because we cannot represent the key phrase “wait statistics” as something that is likely the most relevant term in the entire paper. You can also see how subjective “relevant” is when we’re talking about specific industries or subject areas. How do you output terms that are as relevant for SQL Server as they are for the construction, retail, and medical industries? I’d see this as a very difficult task to achieve without some kind of human intervention.

Another example is a query for the top terms from Jonathan Kehayias’ “Optimizing tempdb configuration with SQL Server 2012 Extended Events” article:

keyphrase totalscore
sql 0.640196383
sqlserver 0.543729305
in-memory 0.534120202
contention 0.512095034
latch 0.488780767
allocation 0.485334814
gam 0.473063797
predicate 0.468191713
xel 0.466187418

The inability to represent the phrase “Extended Events” is a key limiting factor; and even setting that aside, there was no identification of “tempdb” as a statistically significant word. There were plenty of other relevant words, like “allocation”, “bitmap”, “contention” and “gam” (but, interestingly, no “sgam” or “pfs” terms, given their frequency), plus “in-memory” and “histogram”. But would they be relevant in the absence of the core subject of the paper?

In conclusion, I feel Semantic Search is an interesting first “alpha” version, and I really do hope it is further cultivated by the Microsoft product team. I’m also interested in hearing how it is applied in real-world scenarios. There may be applications that can benefit from it today, and I’d like to understand where the current boundaries of utility lie. If there is a next version, ideally it will include support for true key phrases. I’d also like to see the ability to extend the Semantic Search model to support industry terminology and custom term-relevance weighting. That involves human intervention, of course, which isn’t ideal for various scenarios, but I’d still be willing to do some manual work in order to find relevant phrases in a library of content.



  • Steve K

    Multiple Word Key Phrases Needed
    I agree with Joe. In my experimentation for a business subject domain, I identified the lack of multi-word key phrases as the most significant limitation of this version of semantic search. I strongly encourage Microsoft to rectify this in an upcoming Service Pack for SQL Server 2012.

  • Joe Sack

    Multiple Word Key Phrases Needed
    Thanks Steve. I definitely hope they grow this feature. Lots of (unrealized) potential.

  • kevinkunz2000

    Multiple words, key phrase, Sentiment analysis and Relevance
As Joe clearly pointed out, this is an Alpha product; and IMO barely that, as it is nearly impossible to develop a practical business use for the feature. That said, there is an interesting application out on CodePlex: http://mysemanticsearch.codeplex.com/, but really we have not progressed very far from the standard search features provided in something like SharePoint. I know that comment may create an uproar for many reasons, but when we look at semantic search from a pure functional perspective…well, it doesn’t provide much. To be truly useful, yes, we need to be able to search actual phrases, compare those as well to a series of lexicons, and perhaps provide weighting (the correlation score) based on mappings relevant to, say, emotion, confidence, etc… Then we will have progressed into the area of sentiment analysis and rudimentary computational linguistics, where we can begin to provide true business value.

  • Ralph

    Not quite semantic – yet
Joe, Thanks for the article and a good review of a nice new feature in SQL 2012. Statistical relevance can be useful, but it is not yet quite "semantic" in the sense of conveying words with meaning. I think that’s what kevin was referring to with the statement "…Then we have progressed into the area of sentiment analysis and rudimentary computational linguistics…". I agree. It would seem that there would be lots of challenges. Most linguistic communication comes with some cultural context, which would have suppositions weighted for impact based on phrase use. Wouldn’t it be a hoot configuring a linguistic interpretive engine to read, searching for relevant concepts from the perspective of a financially astute investor or a medical research doctor or a left-wing politician, all reading from the same document?

  • jesse g

    use filters to strip the html tags
Add two computed columns, one with a static value of ‘.html’, and one with the body cast as varbinary(max), and recreate your full-text index specifying a type column. Then you won’t need a custom stop list.
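A sketch of what this suggestion describes, with hypothetical column and index names; the ‘.html’ type column tells the built-in HTML filter to strip the markup before indexing:

```sql
-- Drop the existing full-text index on the table first, then:
ALTER TABLE dbo.xlaANMarticles ADD
    DocType AS CAST('.html' AS nvarchar(8)),           -- document "extension"
    BodyBin AS CAST(ArticleText AS varbinary(max));    -- body as binary

-- Recreate the index over the binary column; TYPE COLUMN selects the
-- IFilter (here the HTML filter) used to extract the text.
CREATE FULLTEXT INDEX ON dbo.xlaANMarticles
(
    BodyBin TYPE COLUMN DocType LANGUAGE 1033 STATISTICAL_SEMANTICS
)
KEY INDEX PK_xlaANMarticles;
```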

    The main problem with statistical semantics is that domain specific terms aren’t incorporated into the statistical db.

    There are two main inputs to calculate the key terms in a document, first is the relative freq of each term in the document, second is the relative freq of the term across all documents.

    MS provides the semanticsdb which contains a couple hundred thousand terms and their frequency. This allows you to index one document and pull out key phrases. If the term isn’t in the semanticsdb, it will never show up as a key phrase.

    Until they introduce a way to incorporate domain specific terms into the semanticsdb, I’ll have limited use for it. However, for general topics (e.g. newspaper articles), it works quite well.

    This is definitely a feature to keep your eye on over the next couple years.

  • Joe Sack

    Kevin, Ralph and Jesse – thanks for the comments!

    Jesse – thanks for the filter suggestion. Agreed that this will eliminate the HTML noise. Also agreed that the viability of this feature depends on the incorporation of domain specific terms. I truly hope this happens and doesn’t go the way of "English Query".


  • Mike D

    thanks Joe and everyone
    Joe and all subsequent commenters, thanks for the article and follow-ons, most helpful. jesse g, thanks in particular!

    Having now experimented with Semantic Search, I agree with everyone and must admit my surprise that so much noise has been made about such a quarter-baked "feature." My (rapidly fading) hope has been to analyze the content of customer communication using Statistical Semantic Search [SSS].

    Particularly annoying are the many references to "keyphrases" when only unigrams are currently supported. Does this point to internal versions that do in fact support ngrams?

    Now searching for hacks to extend SSS so that it does something interesting. Intuitively it feels like there must be ways to put even the existing limited functionality to effective use. kevin, you encourage me! Maybe we can give SSS some prosthetic appendages, so to speak.

    MySemanticSearch on CodePlex is a trailmarker perhaps. Way back in the day I worked on the AT&T text-to-speech engine and adaptive parsers for help and training systems. Ralph, what you describe can be and has been done, but your implication that completeness in a given domain is non-trivial (impossible in the general case, although we can get close enough to be useful) is correct.

    Perhaps Microsoft can open SSS up a bit, that would be a great start. Ah yes, dream on.

  • kranthi kumar

Hi, is there any update on support for bigrams or n-grams in semantic search, for phrases like “New York” or “SQL Server”?