5
votes

I am trying to create a product search for my site, where a user can search for products in multiple languages and (hopefully) get fuzzy search results if there is no exact match.

  • I have a pro_search table that has columns id, pro_id, en, de, es, fr, it.
  • The pro_id column refers to the id of the products in their own table.
  • The en, de, es, fr, it columns have the translated meta of each product in various languages.
  • The meta is just keywords seperated by spaces
  • $term is the search term.
  • $lang refers to the users chosen language

So first I do a basic 'LIKE' SQL query to see if there are matches, if there are no results from this I query all the products and create an array sorted by their similarity using the similar_text() function

For example I search 'shirt' this is fine if the meta for this product just includes the word 'shirt', but if the meta includes 'blue branded tshirt' this is being more descriptive and gives the user a chance to search by brand but means that the search will more than likely go fuzzy rather than be found with a LIKE SQL query.

This is kind of working but I was wondering how this could be improved, is there a better way of searching or how do people normally do it? Should I be splitting the meta into each individual keywords and try to see how many words match rather than matching the term to the whole meta?

    $ids = [];

    $params = ['%'.$term.'%'];
    $sql = "SELECT * FROM pro_search WHERE $lang LIKE ?";
    $stmt = DB::run($sql,$params);

    $count = $stmt->rowCount();
    if($count > 0){

        // product search
        while ($row = $stmt->fetch(PDO::FETCH_ASSOC)){
            $id = $row["pro_id"];
            array_push($ids,$id);
        }
        show_products($ids);

    }else{

        // product fuzzy search
        $sql = "SELECT * FROM pro_search";
        $stmt = DB::run($sql);
        while ($row = $stmt->fetch(PDO::FETCH_ASSOC)){
            $id = $row["pro_id"];
            $result = $row[$lang];
            similar_text($term,$result,$similarity);
            $similar_array[$similarity][] = $id;
        }

        $closest_match = array_keys($similar_array);
        rsort($closest_match);
        $match_count = count($closest_match);

        for($i=0; $i<$match_count; $i++){
            foreach($similar_array[$closest_match[$i]] as $id){
                array_push($ids,$id);
            }
        }
        show_products($ids);
    }

I have asked similar questions before and people have pointed me to different ways of comparing the term against the meta (such as levenshtein), but everything I've seen has been comparing two simple words (like apples and oranges) and this just isn't good enough for a real life application with thousands of products and a user could search for literally anything (as in $term='literally anything';)

Key Questions:

  • Should my meta have just the product name or multiple relevant keywords (too many keywords means an individual word is less similar to the whole)?
  • If I have multiple keywords in the meta should I be taking each individual keyword and comparing it against the search term?
  • Also would it be possible to maybe have negative keywords for individual products.
3
Have you thought about using a database only for this requirement? Elasticsearch might be able to handle all this with higher performance than writing custom code. Having multiple concurrent visits on that search will otherwise result in really high loading timesNico Haase
"How do people normally do it?" - the best choice is to build an Elastic index, it has a lot of features you simply cannot achieve in SQL. We've tried SQL in our project, but struggled with typos, aliases, rankings, etc, Elastic has built-in features for natural language processing.gaborsch
I've never used elastic search before but will look into it although I don't like having too many dependenciesPaddy Hallihan
Which RDBMS are you using? Some engines offer functionalities that can help you out if you don't want to pull in and pay for 3rd parties. SQL Server has Full-Text Search, so does PostgreSQL, and MySQL.MarcinJ
I am using mySQL, thanks I will look Full-Text Search as well. I will probably end up testing a combination of thingsPaddy Hallihan

3 Answers

4
votes

You are looking for Full-Text Searches WITH QUERY EXPANSION

MySQL supports text searching by using the LIKE operator and regular expression. However, when the text column is large and the number of rows in a table is increased, using these methods has some limitations:

  • Performance: MySQL has to scan the whole table to find the exact text based on a pattern in the LIKE statement or pattern in the regular expressions.
  • Flexible search: with the LIKE operator and regular expression searches, it is difficult to have a flexible search query e.g., to find product whose description contains car but not classic.
  • Relevance ranking: there is no way to specify which row in the result set is more relevant to the search terms.

Because of these limitations, MySQL extended a very nice feature so-called full-text search. Technically, MySQL creates an index from the words of the enabled full-text search columns and performs searches on this index. MySQL uses a sophisticated algorithm to determine the rows matched against the search query.

To do that, the columns that will be used for search must be in TEXT type and index of type FULLTEXT, index can be given using ALTER TABLE or CREATE INDEX and if you are using phpMyAdmin to manage your databases, you can do that by going to the Structure of that table, then click on More under Action of that column and choose Fulltext.

After that you can performe a search using MATCH AGAINST syntax. MATCH() takes the columns to be searched. AGAINST takes a string to search for, and an optional modifier that indicates what type of search to perform.

Full-Text Searches WITH QUERY EXPANSION:

In some cases, users want to search for information based on the knowledge that they have. Users use their experience to define keywords to search for information, and typically those keywords are too short.

To help users to find information based on the too-short keywords, MySQL full-text search engine introduces a concept called query expansion.

The query expansion is used to widen the search result of the full-text searches based on automatic relevance feedback (or blind query expansion). Technically, MySQL full-text search engine performs the following steps when the query expansion is used:

  • First, MySQL full-text search engine looks for all rows that match the search query.
  • Second, it checks all rows in the search result and finds the relevant words.
  • Third, it performs a search again based on the relevant words instead of the original keywords provided by the users.

The following example shows you how to search for a product whose product name or meta contains at least one word (shirt tshirt).

SELECT * FROM products WHERE MATCH(product_name,product_meta) AGAINST('shirt tshirt' WITH QUERY EXPANSION)

You can read more info in MYSQL document (the link at the beginning of the answer) and here

Also don't miss How Fine-Tuning MySQL Full-Text Search

3
votes

You can think a bit differently if you are still designing the system. In terms of search just do exact search and do in db as suggested earlier because that is much faster - but "learn from each interaction".

  • User types in some term
  • You do exact search, if found great.
  • if not do fuzzy search for each part of the term entered. Still not found, you do soundex. You are trying to find something! But present a long list to the user to filter through.
  • Eventually user selects one. As soon as they do, you add the term they entered to the product they selected.

So basic idea being that you are learning from each interaction and enriching your search set. Also whenever a term is user and user actually clicks on your item, you keep a count on that term-product association as your confidence in that term-product association improved.

Similarly whenever you are present an option, very easily user should be able to say "Not this" and then guide them through a pre-selected hierarchy of your items and eventually when they select one, you save their search term to that product.

So over a few months if you are getting enough users you will have a rich organic search term dataset for your product category with confidence level on each term.

2
votes

You can use SOUNDEX in sql

SELECT * FROM users 
           WHERE SOUNDEX(job) 
LIKE CONCAT('%',SUBSTRING(SOUNDEX('Manual worker'),2),'%');

And things like Manual worka will work. You just need to ajust the value (currently 2) to match your need.

I see that you already tried the Levenshtein algorithm but you should have an eye on this adaptation (which is also compatible with UTF-8 strings)

For my case, the soundex was more efficient, it will depend on how your users will interact with your application.


But as said in the comment, third parties like ElasticSearch or Algolia can be much more efficient.

For ma part I never used it because the company do not allow us to use third parties softwares. That's why I tried both Levensthein and Soundex