====== Skills Search Engine ======

This document gives an overview of the Solr-based Skills search design and implementation.

On a Skills search, the keywords are applied to indexed Solr Course fields such as title, description, categories and tags to find matching results. The results are grouped by score, with an aggregated Tag, Category or Course Title presented as the Skill and the associated courses available for drill-down.

The Solr search applies weighting/boosting on the primary attributes as follows:

  * Tags: 3.0
  * Categories: 2.0
  * Title: 1.0

The searchable Solr Product fields are indexed as:

<code ruby>
searchable :auto_index => false do
  text :title, :boost => 1.0
  text :description
  text :first_name do |product|
    product.owner.firstname
  end
  text :category_name, :boost => 2.0 do |product|
    product.associated.categories
  end
  text :tag_name, :boost => 3.0 do |product|
    product.associated.tags
  end
  string :title do
    title.downcase
  end
  integer :product_id
  integer :associated_id
  string :associated_type
  integer :company_id
  integer :owner_id
  integer :category_ids, :multiple => true, :references => Category
  integer :group_ids, :multiple => true, :references => Group
  integer :tag_ids, :multiple => true, :references => Tag
  ...
</code>
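As a side note, the per-field boosts above are expressed to Solr's dismax parser through the ''qf'' parameter using the ''field^boost'' syntax. The following is a minimal illustrative sketch (not taken from the application; the ''_text'' field names follow Sunspot's dynamic-field convention, and the sample query later in this document happens to list ''qf'' without explicit boost suffixes):

```ruby
# Illustrative sketch: how per-field boosts map to the dismax 'qf' parameter.
# Field names follow Sunspot's <name>_text dynamic-field convention.
BOOSTS = {
  'tag_name_text'      => 3.0,
  'category_name_text' => 2.0,
  'title_text'         => 1.0,
  'description_text'   => nil,  # no explicit boost
  'first_name_text'    => nil
}

# Fields with a boost are written as "field^boost"; unboosted fields are bare.
def qf_param(boosts)
  boosts.map { |field, boost| boost ? "#{field}^#{boost}" : field }.join(' ')
end

puts qf_param(BOOSTS)
# tag_name_text^3.0 category_name_text^2.0 title_text^1.0 description_text first_name_text
```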
A sample query and the resulting hash look like:

<code ruby>
>> Product.skills 'science'
{
  'responseHeader' => {
    'status' => 0, 'QTime' => 10,
    'params' => {
      'facet' => 'true',
      'f.tag_ids_im.facet.mincount' => '1',
      'fl' => '* score',
      'start' => '0',
      'q' => 'science',
      'qf' => 'description_text tag_name_text category_name_text first_name_text title_text',
      'qs' => '2',
      'facet.field' => ['category_ids_im', 'tag_ids_im'],
      'f.category_ids_im.facet.mincount' => '1',
      'wt' => 'ruby',
      'fq' => 'type:Product',
      'defType' => 'dismax',
      'rows' => '30'
    }
  },
  'response' => {
    'numFound' => 4, 'start' => 0, 'maxScore' => 353.23367,
    'docs' => [
      { 'id' => 'Product 2647', 'score' => 353.23367 },
      { 'id' => 'Product 734',  'score' => 0.61053926 },
      { 'id' => 'Product 1785', 'score' => 0.43171644 },
      { 'id' => 'Product 443',  'score' => 0.15263481 }
    ]
  },
  'facet_counts' => {
    'facet_queries' => {},
    'facet_fields' => {
      'category_ids_im' => ['4', 1, '8', 1],
      'tag_ids_im' => ['35', 1, '36', 1, '37', 1]
    },
    'facet_dates' => {},
    'facet_ranges' => {}
  }
}
</code>

===== Use Cases =====

**Use Case #1:** When tags match and are returned with higher scores, Courses are grouped by Tags as skills in order of score. Categories, via facets, can be given to users as questions to narrow down the search. In the following example, the '//Programming//' tag scored high; note that the Course '//The art of innovation//' also has the tags '//computer//' and '//science//'.

<code>
>> Product.skills 'programming'
Programming
  144.40782 -> The art of innovation [Cats: Information Technology, Operations]
  61.889065 -> Clean Code [Cats: Information Technology]
computer
  144.40782 -> The art of innovation [Cats: Information Technology, Operations]
science
  144.40782 -> The art of innovation [Cats: Information Technology, Operations]
Java
  61.889065 -> Clean Code [Cats: Information Technology]
</code>

**Use Case #2:** When categories match and are returned with higher scores, Courses are grouped by Tags as skills in order of score.
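The category questions offered to the user come from Solr's facet counts, which (as in the sample response above) arrive as flat arrays of alternating values and counts. A minimal sketch, using the sample data, of pairing them into usable hashes:

```ruby
# Solr returns facet_fields as flat [value, count, value, count, ...] arrays,
# as in the sample response above. Pairing them up yields id => count hashes
# that can drive the "narrow down" questions.
facet_fields = {
  'category_ids_im' => ['4', 1, '8', 1],
  'tag_ids_im'      => ['35', 1, '36', 1, '37', 1]
}

facet_counts = facet_fields.transform_values { |flat| flat.each_slice(2).to_h }
puts facet_counts.inspect
# {"category_ids_im"=>{"4"=>1, "8"=>1}, "tag_ids_im"=>{"35"=>1, "36"=>1, "37"=>1}}
```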
Categories, via facets, can be listed for users to narrow down the search. When no Categories or Tags are available, the Courses themselves are highlighted as skills by score.

<code>
>> Product.skills 'operations'
computer
  64.10782 -> The art of innovation [Cats: Information Technology, Operations]
  (The following items are listed in the top Tag group with no matching
   Category or Tag, but with a high-scoring title match)
  12.210758 -> Operations Training Course [Cats: ]
  9.768606 -> Fundamentals of Operations Valuation [Cats: eLearning]
science
  64.10782 -> The art of innovation [Cats: Information Technology, Operations]
Programming
  64.10782 -> The art of innovation [Cats: Information Technology, Operations]
</code>

**Use Case #3:** When no categories or tags match, the Course titles are grouped by score and the top-scoring course is highlighted as the skill. The '//course//' keyword can be trimmed off to improve readability.

<code>
>> Product.skills 'sales'
Counter Sales
  7.367684 -> Counter Sales Course [Cats: ], Salesforce Training [Cats: ]
  0.45790341 -> PPTX course IE11 [Cats: ]
</code>

**Use Case #4:** The search engine presents the user with a highly relevant set of search results, which can be narrowed down by applying second- and third-level criteria; for instance, a 2nd click on a category and a 3rd click on a title. First, search on the keyword //'wellness'//:
<code>
>> Product.skills 'wellness'
Skills:
Health -> Art of living [Cats: Entertainment, Health Science]
Programming -> web [Cats: Information Technology, Computer Science]
Fitness -> Slim Fit [Cats: ], Art of living [Cats: Entertainment, Health Science]
Science -> web [Cats: Information Technology, Computer Science]
Computer -> web [Cats: Information Technology, Computer Science]
</code>

Second, narrow down the original result with the selected categories //'Information Technology'// and //'Health Science'//:

<code>
>> Product.skills 'wellness', { :cats => [1162,4] }
Skills:
Health -> Art of living [Cats: Entertainment, Health Science]
Programming -> web [Cats: Information Technology, Computer Science]
Fitness -> Art of living [Cats: Entertainment, Health Science]
Science -> web [Cats: Information Technology, Computer Science]
Computer -> web [Cats: Information Technology, Computer Science]
</code>

Further narrow with the title //'web'//:

<code>
>> Product.skills 'wellness', { :cats => [1162,4], :titles => ['web'] }
Skills:
Programming -> web [Cats: Information Technology, Computer Science]
Science -> web [Cats: Information Technology, Computer Science]
Computer -> web [Cats: Information Technology, Computer Science]
</code>

===== POC - Sample Skills Search =====

On a Skills search, the result categories are grouped on the basis of the top facets and highlighted as options/questions to narrow down the results.

<code ruby>
def self.skills(k)
  qry = solr_search do
    keywords k, :query_phrase_slop => 2
    facet :category_ids
  end
  ...
</code>

Result Products (Courses) are grouped by score and listed in descending order of score:

<code ruby>
  ...
  h = Hash.new { |hash, key| hash[key] = [] }
  qry.each_hit_with_result do |hit, post|
    h[hit.score] << post.title #+ " [ID: #{post.id}]"
  end
  h.sort_by { |key, val| key }.reverse.each do |k, v|
    puts k.to_s + ' -> ' + v.join(', ')
  end
  ...
</code>
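The grouping step can be exercised outside Solr. The following is a self-contained sketch in which hypothetical ''Struct'' stubs stand in for Sunspot's (hit, result) pairs; the titles and scores are sample data only:

```ruby
# Self-contained sketch of the score-grouping step, with stubbed hits
# standing in for Sunspot's (hit, result) pairs.
Hit  = Struct.new(:score)
Post = Struct.new(:title)

hits = [
  [Hit.new(144.40782), Post.new('The art of innovation')],
  [Hit.new(61.889065), Post.new('Clean Code')],
  [Hit.new(61.889065), Post.new('Java Basics')]
]

# Group titles by score, then print in descending score order.
h = Hash.new { |hash, key| hash[key] = [] }
hits.each { |hit, post| h[hit.score] << post.title }

h.sort_by { |score, _| -score }.each do |score, titles|
  puts "#{score} -> #{titles.join(', ')}"
end
# 144.40782 -> The art of innovation
# 61.889065 -> Clean Code, Java Basics
```

Using a ''Hash'' with a block default avoids the score-tracking variable: titles sharing a score land in one array regardless of hit order.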
A sample output of a Skills search:

<code ruby>
>> Product.skills 'science'
</code>

The raw Solr response hash is identical to the sample shown earlier. The grouped results:

<code>
Results:

By Score:
353.23367 -> Bunny Videos Courses system [Cats: Information Technology, Operations]
0.61053926 -> In - Person Event [Cats: ]
0.43171644 -> Test Course LION [Cats: ]
0.15263481 -> GNAT LIVE EVENT [Cats: ]
</code>

===== Search Query Log =====

The Skills query log is a good source for training/tuning a search engine. It stores the query that the user entered and the result/product that the user clicked afterwards. From this we try to derive the categories with the highest likelihood of relevance; a script pre-processes the query log into the format that the engine accepts.

Users' search patterns are recorded for search-trend analysis. The following data model stores the search-trend records:

<code>
Table search_pattern:
  - user_id
  - client_id
  - keywords
  - skill_viewed

Table located_course:
  - course_selected
  - search_pattern_id
</code>

===== Recommendation =====

The recommendation is to upgrade the Learnexa Solr version from 1.4 to 4.
Solr 4 comes with several useful built-in features for optimal performance.

===== NLP Tools Summarization =====

  * **IBM Watson Natural Language Understanding** - Demo: http://bit.ly/2w0m2Zd. Access to the Watson Natural Language Understanding service is free for up to 30,000 units per month; pricing beyond that is listed at https://console.bluemix.net/catalog/services/natural-language-understanding
  * **Apache UIMA** - UIMA is a framework supporting open-source, plug-and-play components. These components do the actual work of analyzing unstructured information in text, audio and video. Users can create their own components, and some are available as part of the project: https://uima.apache.org/sandbox.html#uima-addons-annotators. Of these, the AlchemyAPI Annotator comes closest to our requirements and, interestingly, AlchemyAPI was acquired by IBM under Watson: https://www.ibm.com/watson/alchemy-api.html
  * **Apache OpenNLP** - The Apache OpenNLP library is a machine-learning-based toolkit for processing natural-language text. It supports the most common NLP tasks: https://opennlp.apache.org/
  * **Stanford CoreNLP** - Stanford CoreNLP provides a set of natural-language analysis tools written in Java. It makes it easy to apply a range of linguistic analysis tools to a piece of text, and its analyses provide the foundational building blocks for higher-level and domain-specific text-understanding applications: https://stanfordnlp.github.io/CoreNLP/. Demo: http://nlp.stanford.edu:8080/corenlp/process

IBM Watson NLP is the closest match to our requirements. On the open-source side, Stanford CoreNLP would be the selection based on its performance, popularity and activity.
For a quick evaluation: https://java.libhunt.com/project/corenlp/vs/apache-opennlp

===== Product Information Classification Using Open-Source Software =====

Automated product-information classification with open-source tools avoids manual entry of product tags and categories; we tried to automate the process by auto-detecting keywords and auto-classifying products based on their product information. The latest Ruby gems **'stanford-core-nlp'**, **'tf-idf-similarity'** and **'categorize'** were applied for this purpose, using an unsupervised document-classification technique (document clustering) to classify the product content.

==== POC #1 - Auto Detection of Product Keywords ====

The stanford-core-nlp gem is used to identify keywords with stop words removed, and the keywords are then weighted with the Tf-Idf algorithm/gem.

<code ruby>
require 'stanford-core-nlp'
require 'matrix'
require 'tf-idf-similarity'
require 'unicode_utils'
</code>

Sample text used for keyword detection:

//"This intensive and hands-on series of courses gives you the skills to ensure your projects are completed on time and on budget while giving the user the product they expect. You will gain a strong working knowledge of the basics of project management and be able to immediately use that knowledge to effectively manage work projects. At the end of the series, you will be able to identify and manage the product scope, build a work breakdown structure, create a project plan, create the project budget, define and allocate resources, manage the project development, identify and manage risks, and understand the project procurement process."//

Raw text is put into an Annotation object, and a sequence of Annotators then adds information in an analysis pipeline. The resulting Annotation, containing all the analysis information added by the Annotators, can be parsed further to list the keywords.
<code ruby>
pipeline = StanfordCoreNLP.load(:tokenize, :ssplit, :pos, :lemma, :parse, :ner, :dcoref)
text = StanfordCoreNLP::Annotation.new(text)
pipeline.annotate(text)
...
</code>

Iterating by sentence and tokenizing words:

<code ruby>
text.get(:sentences).each do |sentence|
  # Syntactical dependencies
  puts sentence.get(:basic_dependencies).to_s
  sentence.get(:tokens).each do |token|
    # Default annotations for all tokens
    tag = token.get(:named_entity_tag).to_s
    ...
    puts token.get(:original_text).to_s
</code>

To weight the keywords:

<code ruby>
document1 = TfIdfSimilarity::Document.new(text, :term_counts => term_counts, :size => size)
corpus = [document1]
model = TfIdfSimilarity::TfIdfModel.new(corpus)
...
tfidf_by_term = {}
document1.terms.each do |term|
  tfidf_by_term[term] = model.tfidf(document1, term)
end
puts tfidf_by_term.inspect
</code>

The output is compared with Watson's results as a benchmark:

<code>
["intensive", "hands-on", "hands-on series", "courses", "skills", "ensure", "projects",
 "completed", "time", "budget", "giving", "user", "product", "will", "will gain", "strong",
 "strong working", "strong working knowledge", "basics", "project", "project management",
 "be", "be able", "immediately", "immediately use", "knowledge", "effectively",
 "effectively manage", "effectively manage work", "effectively manage work projects",
 "end", "series", "will be", "will be able", "identify", "manage", "product scope",
 "build", "work", "work breakdown", "work breakdown structure", "create", "project plan",
 "project budget", "define", "allocate", "allocate resources", "project development",
 "manage risks", "understand", "project procurement", "project procurement process"]
</code>

The Tf-Idf weights by term:

<code>
{" "=>3.114215990939937, "intensive"=>0.3068528194400547, "hands"=>0.3068528194400547,
 ""=>1.0177156679453347, "series"=>0.4339554189045479, "courses"=>0.3068528194400547,
 "gives"=>0.3068528194400547, "skills"=>0.3068528194400547, "ensure"=>0.3068528194400547,
 "projects"=>0.4339554189045479, "completed"=>0.3068528194400547, "time"=>0.3068528194400547,
 "budget"=>0.4339554189045479, "giving"=>0.3068528194400547, "user"=>0.3068528194400547,
 "product"=>0.4339554189045479, "expect"=>0.3068528194400547, "gain"=>0.3068528194400547,
 "strong"=>0.3068528194400547, "working"=>0.3068528194400547, "knowledge"=>0.4339554189045479,
 "basics"=>0.3068528194400547, "project"=>0.6861437633554314, "management"=>0.3068528194400547,
 "able"=>0.4339554189045479, "immediately"=>0.3068528194400547, "use"=>0.3068528194400547,
 "effectively"=>0.3068528194400547, "manage"=>0.6137056388801094, "work"=>0.4339554189045479,
 "end"=>0.3068528194400547, "identify"=>0.4339554189045479, "scope"=>0.3068528194400547,
 "build"=>0.3068528194400547, "breakdown"=>0.3068528194400547, "structure"=>0.3068528194400547,
 "create"=>0.4339554189045479, "plan"=>0.3068528194400547, "define"=>0.3068528194400547,
 "allocate"=>0.3068528194400547, "resources"=>0.3068528194400547, "development"=>0.3068528194400547,
 "risks"=>0.3068528194400547, "understand"=>0.3068528194400547, "procurement"=>0.3068528194400547,
 "process"=>0.3068528194400547}
</code>

==== POC #2 - Auto Classification of Products ====

The latest Ruby gems 'categorize' and 'ai4ruby' are used for unsupervised document classification:

<code ruby>
require 'categorize'
include Categorize
</code>

The sample text from POC #1 is submitted as individual statements for iteration purposes:

<code ruby>
documents = [
  %q(This intensive and hands-on series of courses gives you the skills to ensure your projects are completed on time and on budget while giving the user the product they expect.),
  %q(You will gain a strong working knowledge of the basics of project management and be able to immediately use that knowledge to effectively manage work projects.),
  %q(At the end of the series you will be able to identify and manage the product scope, build a work breakdown structure, create a project plan, create the project budget, define and allocate resources, manage the project development, identify and manage risks, and understand the project procurement process.)]
</code>

We then applied the unsupervised
document-classification technique (document clustering):

<code ruby>
hierarchical_model = Models::HierarchicalCluster.new
y = Model.make_model('plan', documents, hierarchical_model)
puts y.inspect
</code>

Statement-wise output:

<code>
[[["manage project", [1, 0, 2]]],
 [["manage project", [2]], ["working knowledge", [1, 0]]],
 [["intensive hands", [0]], ["working knowledge", [1]], ["manage project", [2]]]]
</code>

This is raw output; it can be processed further to frame categories. Moreover, these keywords can be used to train a machine-learning algorithm to classify similar text in the future.

===== Product Information Classification Using Watson NLU =====

Secured an account in IBM Cloud: https://console.bluemix.net/registration/

==== POC - Watson API to Analyze a Given Text ====

<code ruby>
require 'rest-client'
require 'json'
..
pload = { "text" => stext,
          "features" => { "concepts" => {}, "categories" => {}, "keywords" => {} } }
begin
  response = RestClient::Request.execute :method => :post, :url => wurl,
    :user => uname, :password => pword,
    :payload => pload.to_json, :headers => { :content_type => :json }
  profile = JSON.parse(response.body)
..
</code>

The Categories, Keywords and Concepts extracted by the API are stored in the database for quick Solr search:

<code ruby>
class ProductInsight < ActiveRecord::Base
  belongs_to :associate, :polymorphic => true

  searchable :auto_index => false do
    text :categories, :boost => 3.0
    text :keywords, :boost => 2.0
    text :concepts, :boost => 1.0
  end

  def self.skills(k, opt = {})
    qry = solr_search do
      keywords k, :query_phrase_slop => 2
    end
    ..
</code>

<code ruby>
pi = ProductInsight.new
pi.update_attributes(:keywords => kwds, :concepts => cpts, :categories => cats)
pi.associate = Course.first
pi.save
</code>

For example, a search for 'programming' skills:

<code ruby>
ProductInsight.skills 'programming'
</code>

will fetch the possible skills and respective courses:
<code>
"/art and entertainment - 568"
"/education/distance learning - 1302"
"/education/language learning - 1301"
"/technology and computing/computer certification - 9,12,16,19,22,25,28,33,38,43,46,57,66"
"/technology and computing/programming languages - 357"
"/technology and computing/programming languages/javascript - 769,30,48,49,336,96,888,445,1255,1256,242"
"/technology and computing/software/shareware and freeware - 60"
</code>

===== References =====

  - https://en.wikipedia.org/wiki/Web_query_classification
  - https://emmaespina.wordpress.com/2011/08/23/enhancing-search-results-using-machine-learning/
  - http://rubynlp.org/