This document gives an overview of the design and implementation of the Solr-based Skills search.
On a Skills search, the keywords are applied against the Solr-indexed Course fields (title, description, categories and tags) to find matching results. The matches are grouped by score, with an aggregated Tag, Category or Course title presented as the Skill and the associated courses available for drill-down.
The Solr-based search weights/boosts the primary attributes as follows:
The Solr Product searchable fields are indexed as:
searchable :auto_index => false do
  text :title, :boost => 1.0
  text :description
  text :first_name do |product|
    product.owner.firstname
  end
  text :category_name, :boost => 2.0 do |product|
    product.associated.categories
  end
  text :tag_name, :boost => 3.0 do |product|
    product.associated.tags
  end
  string :title do
    title.downcase
  end
  integer :product_id
  integer :associated_id
  string :associated_type
  integer :company_id
  integer :owner_id
  integer :category_ids, :multiple => true, :references => Category
  integer :group_ids, :multiple => true, :references => Group
  integer :tag_ids, :multiple => true, :references => Tag
  ...
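The field names that appear in the raw Solr parameters (e.g. title_text, tag_ids_im) follow Sunspot's dynamic-field naming convention: the declared Ruby type determines the Solr field suffix, and :multiple adds the multi-valued marker. A minimal illustrative sketch of that convention (the helper and suffix table below are a simplification, not Sunspot's actual internals):

```ruby
# Simplified view of Sunspot's dynamic-field suffixes:
# :text => _text, :string => _s, :integer => _i (and _im when multi-valued).
SUFFIXES = { text: 'text', string: 's', integer: 'i' }

def solr_field_name(name, type, multiple: false)
  suffix = SUFFIXES.fetch(type)
  suffix += 'm' if multiple && type != :text # multi-valued non-text fields
  "#{name}_#{suffix}"
end

solr_field_name(:title, :text)                      # => "title_text"
solr_field_name(:tag_ids, :integer, multiple: true) # => "tag_ids_im"
```

These generated names are what show up in the qf and facet.field parameters of the sample query below.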
A sample query and its result hash look like:
>> Product.skills 'science'
{
'responseHeader' => {
'status' => 0,
'QTime' => 10,
'params' => {
'facet' => 'true',
'f.tag_ids_im.facet.mincount' => '1',
'fl' => '* score',
'start' => '0',
'q' => 'science',
'qf' => 'description_text tag_name_text category_name_text first_name_text title_text',
'qs' => '2',
'facet.field' => ['category_ids_im', 'tag_ids_im'],
'f.category_ids_im.facet.mincount' => '1',
'wt' => 'ruby',
'fq' => 'type:Product',
'defType' => 'dismax',
'rows' => '30'
}
}, 'response' => {
'numFound' => 4,
'start' => 0,
'maxScore' => 353.23367,
'docs' => [{
'id' => 'Product 2647',
'score' => 353.23367
}, {
'id' => 'Product 734',
'score' => 0.61053926
}, {
'id' => 'Product 1785',
'score' => 0.43171644
}, {
'id' => 'Product 443',
'score' => 0.15263481
}]
}, 'facet_counts' => {
'facet_queries' => {},
'facet_fields' => {
'category_ids_im' => ['4', 1, '8', 1],
'tag_ids_im' => ['35', 1, '36', 1, '37', 1]
},
'facet_dates' => {},
'facet_ranges' => {}
}
}
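The facet_fields arrays in the response above come back flat, alternating value and count. A small helper (the facet_counts name is mine) to fold them into hashes:

```ruby
# Solr (wt=ruby) returns facet counts as a flat array:
# ['4', 1, '8', 1] means value '4' occurred once and '8' once.
def facet_counts(flat)
  flat.each_slice(2).to_h
end

facet_counts(['35', 1, '36', 1, '37', 1])
# => {"35"=>1, "36"=>1, "37"=>1}
```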
Use Case #1:
When tags match and are returned with higher scores, Courses are grouped by Tags as skills in score order. Category facets can be presented to users as questions to narrow down the search.
In the following example, the 'Programming' tag scored highest; the Course 'The art of innovation' also carries the tags 'computer' & 'science'.
>> Product.skills 'programming'
Programming
144.40782 -> The art of innovation [Cats: Information Technology, Operations]
61.889065 -> Clean Code [Cats: Information Technology]
computer
144.40782 -> The art of innovation [Cats: Information Technology, Operations]
science
144.40782 -> The art of innovation [Cats: Information Technology, Operations]
Java
61.889065 -> Clean Code [Cats: Information Technology]
Use Case #2:
When categories match and are returned with higher scores, Courses are grouped by Tags as skills in score order. Category facets can be listed for users to narrow down the search. When no Categories or Tags are available, the Courses themselves are highlighted as skills by score.
>> Product.skills 'operations'
computer
64.10782 -> The art of innovation [Cats: Information Technology, Operations]
// The following items are listed in the top Tag group with no matching Category or Tag, but by matching title with a high score
12.210758 -> Operations Training Course [Cats: ]
9.768606 -> Fundamentals of Operations Valuation [Cats: eLearning]
science
64.10782 -> The art of innovation [Cats: Information Technology, Operations]
Programming
64.10782 -> The art of innovation [Cats: Information Technology, Operations]
Use Case #3:
When no categories or tags match, the Course titles are grouped by score and the top-scored course is highlighted as the skill. The word 'course' can be trimmed off the title to improve readability.
>> Product.skills 'sales'
Counter Sales
7.367684 -> Counter Sales Course [Cats: ], Salesforce Training [Cats: ]
0.45790341 -> PPTX course IE11 [Cats: ]
Use Case #4:
The search engine presents the user a highly relevant set of search results, which can then be narrowed by applying second- and third-level criteria: for instance, a second click on a category and a third click on a title.
First, the search on the keyword 'wellness':
>> Product.skills 'wellness'
Skills:
Health -> Art of living [Cats: Entertainment, Health Science]
Programming -> web [Cats: Information Technology, Computer Science]
Fitness -> Slim Fit [Cats: ], Art of living [Cats: Entertainment, Health Science]
Science -> web [Cats: Information Technology, Computer Science]
Computer -> web [Cats: Information Technology, Computer Science]
Second, narrow down the original result with the selected categories 'Information Technology' & 'Health Science':
>> Product.skills 'wellness', { :cats => [1162,4] }
Skills:
Health -> Art of living [Cats: Entertainment, Health Science]
Programming -> web [Cats: Information Technology, Computer Science]
Fitness -> Art of living [Cats: Entertainment, Health Science]
Science -> web [Cats: Information Technology, Computer Science]
Computer -> web [Cats: Information Technology, Computer Science]
Further narrowed with the title 'web':
>> Product.skills 'wellness', { :cats => [1162,4], :titles => ['web'] }
Skills:
Programming -> web [Cats: Information Technology, Computer Science]
Science -> web [Cats: Information Technology, Computer Science]
Computer -> web [Cats: Information Technology, Computer Science]
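The two-step narrowing above can be mimicked client-side over an in-memory result set; a sketch in which the data, the narrow helper and its field names are all hypothetical:

```ruby
# Each result row: skill name, course title and category ids of the course.
results = [
  { skill: 'Health',      title: 'Art of living', cat_ids: [7, 4] },
  { skill: 'Programming', title: 'web',           cat_ids: [1162, 9] },
  { skill: 'Fitness',     title: 'Slim Fit',      cat_ids: [] }
]

# Keep rows that intersect the selected categories and match a selected title.
def narrow(results, cats: nil, titles: nil)
  results.select do |r|
    (cats.nil?   || (r[:cat_ids] & cats).any?) &&
      (titles.nil? || titles.include?(r[:title]))
  end
end

narrow(results, cats: [1162, 4]).map { |r| r[:skill] }
# => ["Health", "Programming"]
narrow(results, cats: [1162, 4], titles: ['web']).map { |r| r[:skill] }
# => ["Programming"]
```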
On a Skills search, the result categories are grouped on the basis of the top facets and highlighted as options/questions to narrow down the results.
def self.skills(k)
  qry = solr_search do
    keywords k, :query_phrase_slop => 2
    facet :category_ids
  end
  ...
Result Products (Courses) are grouped by score and listed in descending score order.
..
h = Hash.new { |hash, key| hash[key] = [] }
qry.each_hit_with_result do |hit, post|
  h[hit.score] << post.title # + " [ID: #{post.id}]"
end
h.sort_by { |key, _val| key }.reverse.each do |k, v|
  puts k.to_s + ' -> ' + v.join(', ')
end
...
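With stubbed hits in place of the Sunspot result objects, the grouping behaves as follows (the sample data is hypothetical):

```ruby
# Stand-ins for Sunspot hits: each has a relevance score and a course title.
hits = [
  { score: 353.23367, title: 'Bunny Videos Courses system' },
  { score: 0.61053926, title: 'In - Person Event' },
  { score: 0.61053926, title: 'Test Event' }
]

# Group titles by score, then print groups in descending score order.
grouped = Hash.new { |h, k| h[k] = [] }
hits.each { |hit| grouped[hit[:score]] << hit[:title] }

grouped.sort_by { |score, _| -score }.each do |score, titles|
  puts "#{score} -> #{titles.join(', ')}"
end
```

Using a block default (rather than Hash.new(0)) guarantees each new score key starts as an empty array, so hits with equal scores land in the same group regardless of their order in the result set.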
A sample output on Skills search
>> Product.skills 'science'
(The raw Solr response hash is identical to the earlier sample.)
Results:
By Score:
353.23367 -> Bunny Videos Courses system [Cats: Information Technology, Operations]
0.61053926 -> In - Person Event [Cats: ]
0.43171644 -> Test Course LION [Cats: ]
0.15263481 -> GNAT LIVE EVENT [Cats: ]
The Skills query log is a good source for training/tuning the search engine. It stores the query the user entered and the result/product the user clicked afterward; from it we can derive the categories with the highest likelihood of a click. A script pre-processes the query log into the format the engine accepts.
Users' search patterns are recorded for search trend analysis. The following data model stores the search trend records:
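A sketch of such a pre-processing step (the log row format here is hypothetical): aggregate, per keyword, the category the clicked courses belonged to most often.

```ruby
# Each log row: the keyword searched and the category of the course clicked.
log = [
  { keywords: 'science', clicked_category: 'Information Technology' },
  { keywords: 'science', clicked_category: 'Information Technology' },
  { keywords: 'science', clicked_category: 'Operations' },
  { keywords: 'sales',   clicked_category: 'eLearning' }
]

# For each keyword, pick the most frequently clicked category.
top_category = log.group_by { |row| row[:keywords] }.map do |kw, rows|
  counts = rows.map { |r| r[:clicked_category] }.tally
  [kw, counts.max_by { |_, n| n }.first]
end.to_h
# => {"science"=>"Information Technology", "sales"=>"eLearning"}
```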
Table search_pattern:
- user_id
- client_id
- keywords
- skill_viewed

Table located_course:
- course_selected
- search_pattern_id
Recommendation: upgrade the Learnexa Solr version from 1.4 to 4. Solr 4 ships with useful built-in features for optimal performance.
IBM Watson NLP is the closest match to our requirements. On the open source side, Stanford CoreNLP would be the selection based on its performance, popularity and activity. For a quick evaluation: https://java.libhunt.com/project/corenlp/vs/apache-opennlp
Automated Product information classification using open-source tools aims to avoid manual entry of product tags & categories; we tried to automate the process by auto-detecting keywords and auto-classifying products based on product information. The Ruby gems 'stanford-core-nlp', 'tf-idf-similarity' & 'categorize' are applied for this purpose, using an unsupervised document classification technique (document clustering) to classify the product content.
The Stanford CoreNLP gem is used to identify keywords with stop words removed, and the keywords are weighted through the Tf-Idf algorithm/gem.
require 'stanford-core-nlp'
require 'matrix'
require 'tf-idf-similarity'
require 'unicode_utils'
Sample text used for keyword detection:
text = "This intensive and hands-on series of courses give you the skills to ensure your projects are completed on time and on budget while giving the user the product they expect. You will gain a strong working knowledge of the basics of project management and be able to immediately use that knowledge to effectively manage work projects. At the end of the series, you will be able to identify and manage the product scope, build a work breakdown structure, create a project plan, create the project budget, define and allocate resources, manage the project development, identify and manage risks, and understand the project procurement process."
Raw text is put into an Annotation object and then a sequence of Annotators add information in an analysis pipeline. The resulting Annotation, containing all the analysis information added by the Annotators, can be parsed further to list out the keywords.
pipeline = StanfordCoreNLP.load(:tokenize, :ssplit, :pos, :lemma, :parse, :ner, :dcoref)
text = StanfordCoreNLP::Annotation.new(text)
pipeline.annotate(text)
...
Iterating by sentence and tokenizing words
text.get(:sentences).each do |sentence|
  # Syntactic dependencies
  puts sentence.get(:basic_dependencies).to_s
  sentence.get(:tokens).each do |token|
    # Default annotations for all tokens
    tag = token.get(:named_entity_tag).to_s
    ...
    puts token.get(:original_text).to_s
To weigh the keywords
document1 = TfIdfSimilarity::Document.new(text, :term_counts => term_counts, :size => size)
corpus = [document1]
model = TfIdfSimilarity::TfIdfModel.new(corpus)
...
tfidf_by_term = {}
document1.terms.each do |term|
tfidf_by_term[term] = model.tfidf(document1, term)
end
puts tfidf_by_term.inspect
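For reference, the term_counts and size arguments passed to TfIdfSimilarity::Document above can be built from a token stream; a minimal sketch using naive regex tokenization in place of the CoreNLP pipeline (variable names chosen to match the snippet above):

```ruby
# Naive tokenizer standing in for the CoreNLP token stream.
text = 'create a project plan, create the project budget'
tokens = text.downcase.scan(/[a-z\-]+/)

term_counts = tokens.tally # term => occurrence count
size = tokens.size         # total number of tokens
# term_counts['project'] => 2, size => 8
```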
The output is compared with Watson results as a benchmark.
["intensive", "hands-on", "hands-on series", "courses", "skills", "ensure", "projects", "completed", "time", "budget", "giving", "user", "product", "will", "will gain", "strong", "strong working", "strong working knowledge", "basics", "project", "project management", "be", "be able", "immediately", "immediately use", "knowledge", "effectively", "effectively manage", "effectively manage work", "effectively manage work projects", "end", "series", "will be", "will be able", "identify", "manage", "product scope", "build", "work", "work breakdown", "work breakdown structure", "create", "project plan", "project budget", "define", "allocate", "allocate resources", "project development", "manage risks", "understand", "project procurement", "project procurement process"]
{" "=>3.114215990939937, "intensive"=>0.3068528194400547, "hands"=>0.3068528194400547, ""=>1.0177156679453347, "series"=>0.4339554189045479, "courses"=>0.3068528194400547, "gives"=>0.3068528194400547, "skills"=>0.3068528194400547, "ensure"=>0.3068528194400547, "projects"=>0.4339554189045479, "completed"=>0.3068528194400547, "time"=>0.3068528194400547, "budget"=>0.4339554189045479, "giving"=>0.3068528194400547, "user"=>0.3068528194400547, "product"=>0.4339554189045479, "expect"=>0.3068528194400547, "gain"=>0.3068528194400547, "strong"=>0.3068528194400547, "working"=>0.3068528194400547, "knowledge"=>0.4339554189045479, "basics"=>0.3068528194400547, "project"=>0.6861437633554314, "management"=>0.3068528194400547, "able"=>0.4339554189045479, "immediately"=>0.3068528194400547, "use"=>0.3068528194400547, "effectively"=>0.3068528194400547, "manage"=>0.6137056388801094, "work"=>0.4339554189045479, "end"=>0.3068528194400547, "identify"=>0.4339554189045479, "scope"=>0.3068528194400547, "build"=>0.3068528194400547, "breakdown"=>0.3068528194400547, "structure"=>0.3068528194400547, "create"=>0.4339554189045479, "plan"=>0.3068528194400547, "define"=>0.3068528194400547, "allocate"=>0.3068528194400547, "resources"=>0.3068528194400547, "development"=>0.3068528194400547, "risks"=>0.3068528194400547, "understand"=>0.3068528194400547, "procurement"=>0.3068528194400547, "process"=>0.3068528194400547}
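To turn weights like the ones above into candidate tags, one option is to keep only the top-weighted terms; a sketch (the weights below are abbreviated, made-up values):

```ruby
# Hypothetical tf-idf weights per term (abbreviated sample).
tfidf_by_term = {
  'project' => 0.686, 'manage' => 0.614, 'series' => 0.434,
  'time' => 0.307, 'budget' => 0.420
}

# Keep the N highest-weighted terms as tag candidates.
top_terms = tfidf_by_term.sort_by { |_, w| -w }.first(3).map(&:first)
# => ["project", "manage", "series"]
```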
The Ruby gems 'categorize' & 'ai4ruby' are used for unsupervised document classification.
require 'categorize'
include Categorize
The sample text used in POC #1 is submitted as separate statements for iteration purposes:
documents = [
  %q(This intensive and hands-on series of courses gives you the skills to ensure your projects are completed on time and on budget while giving the user the product they expect.),
  %q(You will gain a strong working knowledge of the basics of project management and be able to immediately use that knowledge to effectively manage work projects.),
  %q(At the end of the series you will be able to identify and manage the product scope, build a work breakdown structure, create a project plan, create the project budget, define and allocate resources, manage the project development, identify and manage risks, and understand the project procurement process.)
]
Applied the unsupervised document classification technique of document clustering:
hierarchical_model = Models::HierarchicalCluster.new
y = Model.make_model('plan', documents, hierarchical_model)
puts y.inspect
Statement-wise output:
[[["manage project", [1, 0, 2]]], [["manage project", [2]], ["working knowledge", [1, 0]]], [["intensive hands", [0]], ["working knowledge", [1]], ["manage project", [2]]]]
This is raw output. It can be further processed to frame categories; moreover, these keywords can be used to train a machine-learning algorithm to classify similar text in the future.
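One way to frame categories from that raw output (structure exactly as printed above): count how often each cluster label appears across the groupings and keep the dominant ones. A sketch:

```ruby
# Raw clustering output: per statement, [label, member indices] pairs.
raw = [[['manage project', [1, 0, 2]]],
       [['manage project', [2]], ['working knowledge', [1, 0]]],
       [['intensive hands', [0]], ['working knowledge', [1]], ['manage project', [2]]]]

# Count label occurrences across all groupings.
label_counts = raw.flatten(1).map(&:first).tally
# => {"manage project"=>3, "working knowledge"=>2, "intensive hands"=>1}

# Keep labels that recur as candidate categories.
candidate_categories = label_counts.select { |_, n| n > 1 }.keys
# => ["manage project", "working knowledge"]
```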
Secured an account in IBM Cloud - https://console.bluemix.net/registration/
require 'rest-client'
require 'json'
..
pload = {"text" => stext, "features" => { "concepts" => {}, "categories" => {}, "keywords" => {}}}
begin
  response = RestClient::Request.execute :method => :post,
    :url => wurl,
    :user => uname,
    :password => pword,
    :payload => pload.to_json,
    :headers => { :content_type => :json }
  profile = JSON.parse(response.body)
..
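The parsed profile can then be reduced to the flat category/keyword/concept lists that get indexed. A sketch assuming an NLU-style response shape (the sample JSON body below is abbreviated and hypothetical):

```ruby
require 'json'

# Abbreviated, hypothetical NLU-style response body.
body = '{"categories":[{"label":"/technology and computing","score":0.93}],' \
       '"keywords":[{"text":"project management","relevance":0.91}],' \
       '"concepts":[{"text":"Project plan","relevance":0.86}]}'

profile = JSON.parse(body)
cats = profile['categories'].map { |c| c['label'] }
kwds = profile['keywords'].map { |k| k['text'] }
cpts = profile['concepts'].map { |c| c['text'] }
# cats => ["/technology and computing"]
# kwds => ["project management"]
# cpts => ["Project plan"]
```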
The Categories, Keywords and Concepts extracted by the API are stored in the database for quick Solr search.
class ProductInsight < ActiveRecord::Base
  belongs_to :associate, :polymorphic => true
  searchable :auto_index => false do
    text :categories, :boost => 3.0
    text :keywords, :boost => 2.0
    text :concepts, :boost => 1.0
  end
  def self.skills(k, opt = {})
    qry = solr_search do
      keywords k, :query_phrase_slop => 2
    end
    ..
pi = ProductInsight.new
pi.update_attributes(:keywords => kwds, :concepts => cpts, :categories => cats)
pi.associate = Course.first
pi.save
For example, a search for 'programming' skills:
ProductInsight.skills 'programming'
This fetches the possible skills and the respective courses:
"/art and entertainment - 568"
"/education/distance learning - 1302"
"/education/language learning - 1301"
"/technology and computing/computer certification - 9,12,16,19,22,25,28,33,38,43,46,57,66"
"/technology and computing/programming languages - 357"
"/technology and computing/programming languages/javascript - 769,30,48,49,336,96,888,445,1255,1256,242"
"/technology and computing/software/shareware and freeware - 60"
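Each line above packs the category path and the matching course ids into one string. A small parser for that format (the helper shape is mine; the format is as shown above):

```ruby
# Lines in the format "/path/to/category - id1,id2,...".
lines = [
  '/technology and computing/programming languages - 357',
  '/technology and computing/computer certification - 9,12,16',
  '/technology and computing/software/shareware and freeware - 60'
]

# Map the last path segment (the skill label) to the list of course ids.
skills = lines.map do |line|
  path, ids = line.split(' - ', 2)
  [path.split('/').last, ids.split(',').map(&:to_i)]
end.to_h
# => {"programming languages"=>[357],
#     "computer certification"=>[9, 12, 16],
#     "shareware and freeware"=>[60]}
```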