
Skills Search Engine

This document gives an overview of the Solr-based Skills search design and implementation.

In a Skills search, the keywords are matched against Solr-indexed Course fields such as title, description, categories and tags. Matching results are grouped by score, with an aggregated Tag, Category or Course Title presented as the Skill and the associated courses available for drill-down.

The Solr-based search applies weighting/boosting to the primary attributes:

  • Tags 3.0
  • Categories 2.0
  • Title 1.0
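The effect of these boosts can be illustrated with a simplified additive scoring sketch in plain Ruby. This is not Solr's actual Lucene scoring, which also factors in tf-idf and field norms, and the course data is hypothetical:

```ruby
# Simplified illustration of field boosting: a keyword hit on a tag
# outweighs the same hit on a category or title.
BOOSTS = { :tags => 3.0, :categories => 2.0, :title => 1.0 }

def boosted_score(keyword, fields)
  BOOSTS.sum do |field, boost|
    values = Array(fields[field])
    values.any? { |v| v.downcase.include?(keyword.downcase) } ? boost : 0.0
  end
end

course = { :title => 'Clean Code', :tags => ['programming', 'Java'],
           :categories => ['Information Technology'] }

puts boosted_score('programming', course) # tag hit -> 3.0
puts boosted_score('clean', course)       # title hit -> 1.0
```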

The Solr Product searchable fields are indexed as follows:

  searchable :auto_index => false do
    text :title, :boost => 1.0
    text :description
    text :first_name do |product|
      product.owner.firstname
    end
    text :category_name, :boost => 2.0 do |product|
      product.associated.categories
    end
    text :tag_name, :boost => 3.0 do |product|
      product.associated.tags
    end

    string :title do
      title.downcase
    end
    integer :product_id
    integer :associated_id
    string  :associated_type
    integer :company_id
    integer :owner_id
    integer :category_ids, :multiple => true, :references => Category
    integer :group_ids, :multiple => true, :references => Group
    integer :tag_ids, :multiple => true, :references => Tag
  ...

A sample query and its resulting hash look like this:

>> Product.skills 'science' {
  'responseHeader' => {
    'status' => 0,
    'QTime' => 10,
    'params' => {
      'facet' => 'true',
      'f.tag_ids_im.facet.mincount' => '1',
      'fl' => '* score',
      'start' => '0',
      'q' => 'science',
      'qf' => 'description_text tag_name_text category_name_text first_name_text title_text',
      'qs' => '2',
      'facet.field' => ['category_ids_im', 'tag_ids_im'],
      'f.category_ids_im.facet.mincount' => '1',
      'wt' => 'ruby',
      'fq' => 'type:Product',
      'defType' => 'dismax',
      'rows' => '30'
    }
  }, 'response' => {
    'numFound' => 4,
    'start' => 0,
    'maxScore' => 353.23367,
    'docs' => [{
      'id' => 'Product 2647',
      'score' => 353.23367
    }, {
      'id' => 'Product 734',
      'score' => 0.61053926
    }, {
      'id' => 'Product 1785',
      'score' => 0.43171644
    }, {
      'id' => 'Product 443',
      'score' => 0.15263481
    }]
  }, 'facet_counts' => {
    'facet_queries' => {},
    'facet_fields' => {
      'category_ids_im' => ['4', 1, '8', 1],
      'tag_ids_im' => ['35', 1, '36', 1, '37', 1]
    },
    'facet_dates' => {},
    'facet_ranges' => {}
  }
}

Use Cases

Use Case #1:

When tags match and are returned with higher scores, Courses are grouped by Tags as skills in score order. Category facets can be presented to users as questions to narrow down the search.

In the following use case, the 'Programming' tag scored highest. The Course 'The art of innovation' also has the tags 'computer' & 'science'.

>> Product.skills 'programming'

Programming
144.40782 -> The art of innovation [Cats: Information Technology, Operations]
61.889065 -> Clean Code [Cats: Information Technology]

computer
144.40782 -> The art of innovation [Cats: Information Technology, Operations]

science
144.40782 -> The art of innovation [Cats: Information Technology, Operations]

Java
61.889065 -> Clean Code [Cats: Information Technology]
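The tag-as-skill grouping above can be sketched in plain Ruby, with hits reduced to hashes carrying a score and tags (hypothetical in-memory data, not the real Sunspot hit objects):

```ruby
hits = [
  { :title => 'The art of innovation', :score => 144.40782,
    :tags => ['Programming', 'computer', 'science'] },
  { :title => 'Clean Code', :score => 61.889065,
    :tags => ['Programming', 'Java'] }
]

# Group courses under each tag ("skill"), highest-scored course first.
skills = Hash.new { |h, k| h[k] = [] }
hits.sort_by { |hit| -hit[:score] }.each do |hit|
  hit[:tags].each { |tag| skills[tag] << hit[:title] }
end

skills.each { |tag, titles| puts "#{tag}: #{titles.join(', ')}" }
```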

Use Case #2:

When categories match and are returned with higher scores, Courses are still grouped by Tags as skills in score order, and Category facets can be listed for users to narrow down the search. When no Categories or Tags are available, the Courses themselves are highlighted as skills by score.

>> Product.skills 'operations'

computer
64.10782 -> The art of innovation [Cats: Information Technology, Operations]

// The following items are listed in the top Tag group: they match no Category or Tag, but their titles match with a high score
12.210758 -> Operations Training Course [Cats: ]
9.768606 -> Fundamentals of Operations Valuation [Cats: eLearning]

science
64.10782 -> The art of innovation [Cats: Information Technology, Operations]

Programming
64.10782 -> The art of innovation [Cats: Information Technology, Operations]

Use Case #3:

When no categories or tags match, the Course titles are grouped by score and the top-scored course is highlighted as the skill. The word 'Course' can be trimmed from the titles to improve readability.

>> Product.skills 'sales'

Counter Sales
7.367684 -> Counter Sales Course [Cats: ], Salesforce Training [Cats: ]
0.45790341 -> PPTX course IE11 [Cats: ]
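Trimming the 'Course' keyword, as suggested, could be done with a simple regular expression (a possible implementation; the document does not specify one):

```ruby
def trim_course_keyword(title)
  # Strip the standalone words "course"/"courses" and collapse whitespace.
  title.gsub(/\bcourses?\b/i, ' ').squeeze(' ').strip
end

puts trim_course_keyword('Counter Sales Course') # => "Counter Sales"
puts trim_course_keyword('PPTX course IE11')     # => "PPTX IE11"
```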

Use Case #4:

The search engine presents the user with a highly relevant set of search results, which can be narrowed down by applying second- and third-level criteria: for instance, a second click on a category and a third click on a title.

First, search on the keyword 'wellness'.

>> Product.skills 'wellness'

Skills:
Health             -> Art of living [Cats: Entertainment, Health Science]
Programming        -> web [Cats: Information Technology, Computer Science]
Fitness            -> Slim Fit [Cats: ], Art of living [Cats: Entertainment, Health Science]
Science            -> web [Cats: Information Technology, Computer Science]
Computer           -> web [Cats: Information Technology, Computer Science]

Second, narrow down the original results with the selected categories 'Information Technology' & 'Health Science':

>> Product.skills 'wellness', { :cats => [1162,4] }

Skills:
Health             -> Art of living [Cats: Entertainment, Health Science]
Programming        -> web [Cats: Information Technology, Computer Science]
Fitness            -> Art of living [Cats: Entertainment, Health Science]
Science            -> web [Cats: Information Technology, Computer Science]
Computer           -> web [Cats: Information Technology, Computer Science]

Then narrow further with the title 'web':

>> Product.skills 'wellness', { :cats => [1162,4], :titles => ['web'] }

Skills:
Programming        -> web [Cats: Information Technology, Computer Science]
Science            -> web [Cats: Information Technology, Computer Science]
Computer           -> web [Cats: Information Technology, Computer Science]
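The second- and third-level drill-down can be modeled as successive filters over the first result set. In production these would map to Solr filter queries; the sketch below works on hypothetical in-memory data:

```ruby
results = [
  { :title => 'Art of living', :cats => ['Entertainment', 'Health Science'] },
  { :title => 'web', :cats => ['Information Technology', 'Computer Science'] },
  { :title => 'Slim Fit', :cats => [] }
]

# 2nd level: keep courses belonging to any of the selected categories.
def filter_by_cats(results, cats)
  results.select { |r| (r[:cats] & cats).any? }
end

# 3rd level: keep courses matching one of the selected titles.
def filter_by_titles(results, titles)
  results.select { |r| titles.include?(r[:title]) }
end

narrowed = filter_by_cats(results, ['Information Technology', 'Health Science'])
narrowed = filter_by_titles(narrowed, ['web'])
puts narrowed.map { |r| r[:title] }.inspect # => ["web"]
```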

In a Skills search, the result categories are grouped on the basis of the top facets and highlighted as options/questions to narrow down the results.

  def self.skills(k, opt = {})
  
    qry = solr_search do
      keywords k,  :query_phrase_slop => 2
      facet :category_ids
    end
  ...

Result Products (Courses) are grouped by score and listed in descending score order.

    ..

    # Collect titles per score; same-scored results share a group.
    h = Hash.new { |hash, key| hash[key] = [] }
    qry.each_hit_with_result do |hit, post|
      h[hit.score] << post.title #+ " [ID: #{post.id}]"
    end

    h.sort_by { |score, _| score }.reverse.each do |score, titles|
      puts score.to_s + ' -> ' + titles.join(', ')
    end
    ...
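A self-contained, runnable version of the grouping logic above, with Sunspot hit objects replaced by plain hashes (the course data is hypothetical):

```ruby
hits = [
  { :score => 144.40782, :title => 'The art of innovation' },
  { :score => 61.889065, :title => 'Clean Code' },
  { :score => 61.889065, :title => 'Refactoring' }
]

# Collect titles per score, then print groups in descending score order.
by_score = Hash.new { |h, k| h[k] = [] }
hits.each { |hit| by_score[hit[:score]] << hit[:title] }

by_score.sort_by { |score, _| -score }.each do |score, titles|
  puts "#{score} -> #{titles.join(', ')}"
end
```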

A sample formatted output for the Skills search on 'science' (the same result hash as shown above):
Results:
  By Score:
  353.23367  -> Bunny Videos Courses system [Cats: Information Technology, Operations]
  0.61053926 -> In - Person Event [Cats: ]
  0.43171644 -> Test Course LION [Cats: ]
  0.15263481 -> GNAT LIVE EVENT [Cats: ]

Search Query Log

The Skills query log is a good source for training/tuning the search engine. It stores the query the user entered and the result/product the user clicked afterwards, and we use it to derive the categories with the highest likelihood of being relevant. A script pre-processes the query log into the format that the engine accepts.

Users' search patterns are recorded for search trend analysis. The following data model is used to store the search trend records:

Table search_pattern: 
  - user_id
  - client_id
  - keywords
  - skill_viewed
Table located_course: 
  - course_selected
  - search_pattern_id
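The one-to-many relation between the two tables can be illustrated with plain Ruby structs (an in-memory sketch only: the located_course foreign key search_pattern_id is replaced here by nesting, and the sample values are hypothetical):

```ruby
# One recorded search pattern has many located (clicked) courses.
SearchPattern = Struct.new(:user_id, :client_id, :keywords,
                           :skill_viewed, :located_courses)
LocatedCourse = Struct.new(:course_selected)

pattern = SearchPattern.new(42, 7, 'wellness', 'Fitness', [])
pattern.located_courses << LocatedCourse.new('Art of living')
pattern.located_courses << LocatedCourse.new('Slim Fit')

puts pattern.located_courses.map(&:course_selected).inspect
```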

Recommendation

We recommend upgrading the Learnexa Solr version from 1.4 to 4. Solr 4 ships with useful built-in features for optimal performance.

NLP Tools Summary

  • Apache OpenNLP - The Apache OpenNLP library is a machine learning based toolkit for the processing of natural language text. It supports the most common NLP tasks - https://opennlp.apache.org/
  • Stanford CoreNLP - Stanford CoreNLP provides a set of natural language analysis tools written in Java. It makes it easy to apply a bunch of linguistic analysis tools to a piece of text, and its analyses provide the foundational building blocks for higher-level and domain-specific text understanding applications: https://stanfordnlp.github.io/CoreNLP/. Understanding demo: http://nlp.stanford.edu:8080/corenlp/process

IBM Watson NLP is the closest match to our requirements. On the open-source side, Stanford CoreNLP would be the selection based on its performance, popularity and activity. For a quick comparison: https://java.libhunt.com/project/corenlp/vs/apache-opennlp

Product information classification using Open-source software

Automated product information classification using open-source tools avoids the manual entry of product tags & categories. We tried to automate the process by auto-detecting keywords and auto-classifying products based on their product information. The Ruby gems 'stanford-core-nlp', 'tf-idf-similarity' & 'categorize' are applied for this purpose, using an unsupervised document classification technique (document clustering) to classify the product content.

POC #1 - Auto detection of product keywords:

The Stanford CoreNLP gem is used to identify keywords with stop words removed, and the keywords are then weighted via the tf-idf algorithm (the 'tf-idf-similarity' gem).

require 'stanford-core-nlp'
require 'matrix'
require 'tf-idf-similarity'
require 'unicode_utils'

Sample text used for keyword detection:

text = "This intensive and hands-on series of courses gives you the skills to ensure your projects are completed on time and on budget while giving the user the product they expect. You will gain a strong working knowledge of the basics of project management and be able to immediately use that knowledge to effectively manage work projects. At the end of the series, you will be able to identify and manage the product scope, build a work breakdown structure, create a project plan, create the project budget, define and allocate resources, manage the project development, identify and manage risks, and understand the project procurement process."

Raw text is put into an Annotation object and then a sequence of Annotators add information in an analysis pipeline. The resulting Annotation, containing all the analysis information added by the Annotators, can be parsed further to list out the keywords.

pipeline = StanfordCoreNLP.load(:tokenize, :ssplit, :pos, :lemma, :parse, :ner, :dcoref)
text = StanfordCoreNLP::Annotation.new(text)
pipeline.annotate(text)
...

Iterating by sentence and tokenizing words:

text.get(:sentences).each do |sentence|
  # Syntactic dependencies
  puts sentence.get(:basic_dependencies).to_s

  sentence.get(:tokens).each do |token|
    # Default annotations for all tokens
    tag = token.get(:named_entity_tag).to_s
    ...

    puts token.get(:original_text).to_s
  end
end

To weigh the keywords:

document1 = TfIdfSimilarity::Document.new(text, :term_counts => term_counts, :size => size)
corpus = [document1]
model = TfIdfSimilarity::TfIdfModel.new(corpus)
...
tfidf_by_term = {}
document1.terms.each do |term|
  tfidf_by_term[term] = model.tfidf(document1, term)
end

puts tfidf_by_term.inspect

The output is compared with Watson results as a benchmark.

["intensive", "hands-on", "hands-on series", "courses", "skills", "ensure", "projects", "completed", "time", "budget", "giving", "user", "product", "will", "will gain", "strong", "strong working", "strong working knowledge", "basics", "project", "project management", "be", "be able", "immediately", "immediately use", "knowledge", "effectively", "effectively manage", "effectively manage work", "effectively manage work projects", "end", "series", "will be", "will be able", "identify", "manage", "product scope", "build", "work", "work breakdown", "work breakdown structure", "create", "project plan", "project budget", "define", "allocate", "allocate resources", "project development", "manage risks", "understand", "project procurement", "project procurement process"]

{" "=>3.114215990939937, "intensive"=>0.3068528194400547, "hands"=>0.3068528194400547, ""=>1.0177156679453347, "series"=>0.4339554189045479, "courses"=>0.3068528194400547, "gives"=>0.3068528194400547, "skills"=>0.3068528194400547, "ensure"=>0.3068528194400547, "projects"=>0.4339554189045479, "completed"=>0.3068528194400547, "time"=>0.3068528194400547, "budget"=>0.4339554189045479, "giving"=>0.3068528194400547, "user"=>0.3068528194400547, "product"=>0.4339554189045479, "expect"=>0.3068528194400547, "gain"=>0.3068528194400547, "strong"=>0.3068528194400547, "working"=>0.3068528194400547, "knowledge"=>0.4339554189045479, "basics"=>0.3068528194400547, "project"=>0.6861437633554314, "management"=>0.3068528194400547, "able"=>0.4339554189045479, "immediately"=>0.3068528194400547, "use"=>0.3068528194400547, "effectively"=>0.3068528194400547, "manage"=>0.6137056388801094, "work"=>0.4339554189045479, "end"=>0.3068528194400547, "identify"=>0.4339554189045479, "scope"=>0.3068528194400547, "build"=>0.3068528194400547, "breakdown"=>0.3068528194400547, "structure"=>0.3068528194400547, "create"=>0.4339554189045479, "plan"=>0.3068528194400547, "define"=>0.3068528194400547, "allocate"=>0.3068528194400547, "resources"=>0.3068528194400547, "development"=>0.3068528194400547, "risks"=>0.3068528194400547, "understand"=>0.3068528194400547, "procurement"=>0.3068528194400547, "process"=>0.3068528194400547}
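The tf-idf weighting itself is easy to sketch in plain Ruby. This uses the common tf * log(N/df) form; the exact smoothing applied by the 'tf-idf-similarity' gem may differ, and the two tokenized documents are hypothetical:

```ruby
# Tiny tf-idf over a two-document corpus: terms appearing in fewer
# documents get a higher idf, so corpus-wide words score 0.
docs = [
  %w[project management project plan],
  %w[project budget]
]

def tf_idf(term, doc, docs)
  tf = doc.count(term)                        # term frequency in this doc
  df = docs.count { |d| d.include?(term) }    # document frequency
  tf * Math.log(docs.size.to_f / df)
end

puts tf_idf('management', docs[0], docs) # in 1 of 2 docs -> positive weight
puts tf_idf('project', docs[0], docs)    # in both docs -> idf 0 -> 0.0
```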

POC #2 - Auto classification of products:

The Ruby gems 'categorize' & 'ai4ruby' are used for unsupervised document classification.

require 'categorize'
include Categorize

The sample text used in POC #1 is submitted as separate statements for iteration purposes:

documents = [
%q(This intensive and hands-on series of courses gives you the skills to ensure your projects are completed on time and on budget while giving the user the product they expect.),
%q(You will gain a strong working knowledge of the basics of project management and be able to immediately use that knowledge to effectively manage work projects.),
%q(At the end of the series you will be able to identify and manage the product scope, build a work breakdown structure, create a project plan, create the project budget, define and allocate resources, manage the project development, identify and manage risks, and understand the project procurement process.)]

The unsupervised document classification technique (document clustering) is applied:

hierarchical_model = Models::HierarchicalCluster.new
y = Model.make_model('plan', documents, hierarchical_model)
puts y.inspect

Per-statement output:

[[["manage project", [1, 0, 2]]], [["manage project", [2]], ["working knowledge", [1, 0]]], [["intensive hands", [0]], ["working knowledge", [1]], ["manage project", [2]]]]

This is raw output that can be further processed to frame categories. Moreover, these keywords can be used to train a machine-learning algorithm to classify similar text in the future.

Product information classification using Watson NLU

We secured an account on IBM Cloud - https://console.bluemix.net/registration/

POC - Watson API to analyze any given text:

require 'rest-client'
require 'json'
..
pload = {"text" => stext, "features" => { "concepts" => {}, "categories" => {}, "keywords" => {}}}

begin
  response = RestClient::Request.execute :method => :post,
    :url => wurl,
    :user => uname,
    :password => pword,
    :payload => pload.to_json,
    :headers => { :content_type => :json }

  profile = JSON.parse(response.body)
rescue RestClient::ExceptionWithResponse => e
  # Handle HTTP errors returned by the Watson endpoint
  profile = nil
end
..

The Categories, Keywords and Concepts extracted by the API are stored in the database for quick Solr searching.

class ProductInsight < ActiveRecord::Base
	belongs_to :associate, :polymorphic => true

	searchable :auto_index => false do
		text :categories, :boost => 3.0
		text :keywords, :boost => 2.0
		text :concepts, :boost => 1.0
	end

	def self.skills(k, opt = {})
          qry = solr_search do
            keywords k,  :query_phrase_slop => 2
          end
..
pi = ProductInsight.new(:keywords => kwds, :concepts => cpts, :categories => cats)
pi.associate = Course.first
pi.save

For example, a search for 'programming' skills:

ProductInsight.skills 'programming'

will fetch the possible skills and their respective courses:

"/art and entertainment - 568"
"/education/distance learning - 1302"
"/education/language learning - 1301"
"/technology and computing/computer certification - 9,12,16,19,22,25,28,33,38,43,46,57,66"
"/technology and computing/programming languages - 357"
"/technology and computing/programming languages/javascript - 769,30,48,49,336,96,888,445,1255,1256,242"
"/technology and computing/software/shareware and freeware - 60"
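Each line above is a Watson category path followed by the matching course ids. Splitting such lines into a skill-to-courses map is straightforward (a sketch over two of the sample strings, using the last path segment as the skill label):

```ruby
lines = [
  '/technology and computing/programming languages - 357',
  '/technology and computing/programming languages/javascript - 769,30,48'
]

skills = lines.map do |line|
  path, ids = line.split(' - ')
  # Use the last path segment as the skill label.
  [path.split('/').last, ids.split(',').map(&:to_i)]
end.to_h

puts skills.inspect
```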


skills_search.txt · Last modified: 2018/08/31 16:16 (external edit)