Fuzzy Search Algorithm Using Python

When searching data, common problems arise due to spelling and phonetic errors in queries, incorrect input, and inconsistent transcription standards for foreign languages. Because of these issues, exact match searches alone cannot fully solve the problem. As a result, the development of fuzzy search methods and technologies becomes crucial.

Fuzzy search is a type of information search that finds items matching a given pattern approximately rather than exactly, and scores them by how similar they are to that pattern.

This workflow solves a fuzzy search problem in Megaladata using Python libraries.

To run the Megaladata Python node, you may need to pre-configure Megaladata and install Python. This demo uses the fuzzywuzzy library.

Library installation instructions.

Note: When run on the demo stand (Launch Demo), the example cannot demonstrate all of its capabilities. We recommend installing the example locally.

Launch demo

Download example

Algorithm Description

1. Import data

The Data Import supernode loads data from a text file containing search queries for comparison.

a) Loading data

Table Data for comparison:

Name    Caption
Text_1  Text 1
Text_2  Text 2

2. Fuzzy search

The Fuzzy Search supernode implements the fuzzy search algorithm using a Python handler.

a) Fuzzy search methods

The Python library we employ offers various fuzzy search methods. The choice of a particular method is determined by the specific task or data type.

Let's look at each method in more detail:

  1. fuzz.ratio compares strings character by character, assigning a score of 100 only for exact matches. Any difference, including punctuation, capitalization, or word order, reduces the score.

  2. fuzz.token_sort_ratio lowercases both strings, strips punctuation, splits them into words, sorts the words alphabetically, and compares the results with ratio. Word order therefore does not affect the score, and a higher score indicates greater similarity in word composition.

  3. fuzz.token_set_ratio is similar to token_sort_ratio, but works with the sets of unique words: duplicates are dropped, and the words the strings share are compared separately from the words that differ. This is useful when the strings contain repeated words or when one string contains extra words.

  4. fuzz.partial_ratio finds the part of the longer string that best matches the shorter one and scores that fragment with ratio. It returns a high score when one string is (approximately) contained in the other, regardless of the surrounding text. Unlike the token-based methods, it remains sensitive to case, punctuation, and word order.

  5. fuzz.WRatio runs several of the methods above (ratio, the token-based ratios, and, when the string lengths differ considerably, their partial variants), applies a weight to each result, and returns the highest weighted score. It serves as a general-purpose measure when it is unclear in advance which single method suits the data.
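To make the differences between the first four methods concrete, here is a minimal sketch that imitates them with the standard-library difflib module. It is not the fuzzywuzzy implementation: the function names, the tokens helper, and the sliding-window approach in partial_ratio are simplifying assumptions made for illustration, and the scores may differ from those the library returns.

```python
import re
from difflib import SequenceMatcher


def ratio(a: str, b: str) -> int:
    """Similarity of two character sequences on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def tokens(s: str) -> list[str]:
    """Lowercase, turn non-alphanumeric characters into spaces, split into words."""
    return re.sub(r"[^0-9a-z]+", " ", s.lower()).split()


def token_sort_ratio(a: str, b: str) -> int:
    """Compare the strings after sorting their words alphabetically."""
    return ratio(" ".join(sorted(tokens(a))), " ".join(sorted(tokens(b))))


def token_set_ratio(a: str, b: str) -> int:
    """Compare shared and differing unique words separately; keep the best score."""
    ta, tb = set(tokens(a)), set(tokens(b))
    common = " ".join(sorted(ta & tb))
    with_a = (common + " " + " ".join(sorted(ta - tb))).strip()
    with_b = (common + " " + " ".join(sorted(tb - ta))).strip()
    return max(ratio(common, with_a), ratio(common, with_b), ratio(with_a, with_b))


def partial_ratio(a: str, b: str) -> int:
    """Best score of the shorter string against any equally long slice of the longer one."""
    short, long_ = sorted((a, b), key=len)
    starts = range(len(long_) - len(short) + 1)
    return max(ratio(short, long_[i:i + len(short)]) for i in starts)


a = "fuzzy was a bear"
b = "a bear, fuzzy WAS"
print(ratio(a, b))             # penalized for word order, case, and the comma
print(token_sort_ratio(a, b))  # 100: same words once case and order are ignored
print(token_set_ratio("fuzzy fuzzy was a bear", a))  # 100: duplicates are dropped
print(partial_ratio("Eiffel Tower", "Where is the Eiffel Tower located?"))  # 100: exact substring
```

In the workflow itself the corresponding fuzz functions from fuzzywuzzy are used; the sketch only shows why the same pair of strings can score very differently depending on the method.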

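To select a method, set the fuzchanger variable to the number of the required algorithm. This can be done in the variable port settings of the Fuzzy Search supernode. The numbers are the keys of the fuzz_function_map dictionary in the node script. Note that in this dictionary 2 selects token_set_ratio and 3 selects token_sort_ratio, which differs from the order of the list above.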

b) Fuzzy search node settings

In the Python node configuration wizard, we manually set the following columns:

Name          Caption
Text_1        Text 1
Text_2        Text 2
fuzzy_search  Fuzzy Search Score

The checkbox Allow creating output columns in script is checked.

To access port data and other built-in objects in the context of code execution, the following are provided:

  • Input datasets (InputTable)
  • Input variables (InputVariables)
  • Output dataset (OutputTable)
  • Required enumerations (DataType)

The above objects are imported from the built-in module builtin_data. By default, an import line is added to the text of the code executed by the node.

from builtin_data import InputTable, InputVariables, OutputTable, DataType
from fuzzywuzzy import fuzz

We create a dictionary that maps each method number to the corresponding fuzzy search function:

fuzz_function_map = {     
    1: fuzz.ratio,
    2: fuzz.token_set_ratio,
    3: fuzz.token_sort_ratio,
    4: fuzz.partial_ratio,
    5: fuzz.WRatio
}

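We assign the function selected by the fuzchanger variable to fuz_value (falling back to fuzz.ratio if the value is not a key of the dictionary), then copy each input row to the output table:

fuz_value = fuzz_function_map.get(InputVariables.Items["fuzchanger"].Value, fuzz.ratio)

for i in range(InputTable.RowCount):
    # Read the two strings to compare from the current input row
    text1 = InputTable.Get(i, 0)
    text2 = InputTable.Get(i, 1)
    # Add a new output row and copy both strings into it
    OutputTable.Append()
    OutputTable.Set(0, text1)
    OutputTable.Set(1, text2)

Inside the same loop, we apply the selected fuzzy search function to the two strings and write the score to the third output column:

    OutputTable.Set(2, fuz_value(text1, text2))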
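The selection-and-scoring logic of the node can also be tried outside Megaladata. The sketch below is an illustration under stated assumptions, not the node's code: plain Python lists replace InputTable and OutputTable, the names scorer_map and score_rows are hypothetical, and two difflib-based stand-ins replace the fuzzywuzzy functions so that the example runs without extra packages.

```python
from difflib import SequenceMatcher


def char_ratio(a: str, b: str) -> int:
    """Stand-in for fuzz.ratio: character-level similarity, 0-100."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def sorted_words_ratio(a: str, b: str) -> int:
    """Stand-in for fuzz.token_sort_ratio: ignore case and word order."""
    def normalize(s: str) -> str:
        return " ".join(sorted(s.lower().split()))
    return char_ratio(normalize(a), normalize(b))


# Hypothetical counterpart of fuzz_function_map, with only two entries
scorer_map = {1: char_ratio, 2: sorted_words_ratio}


def score_rows(rows, selector):
    """Return (text1, text2, score) for every row, using the selected scorer."""
    # Unknown selectors fall back to the default, as dict.get does in the node script
    scorer = scorer_map.get(selector, char_ratio)
    return [(t1, t2, scorer(t1, t2)) for t1, t2 in rows]


rows = [("new york mets", "mets new york"), ("hello", "hello")]
by_chars = score_rows(rows, 1)
by_words = score_rows(rows, 2)
fallback = score_rows(rows, 99)  # no such key, so char_ratio is used

for row in by_words:
    print(row)
```

Because the lookup uses dict.get with a default, an unexpected selector value does not raise an error; it silently produces scores from the default method, which is worth keeping in mind when checking results.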

Here is the result from the fuzz.ratio method:

Text 1 | Text 2 | Fuzz Ratio
What is the capital of France? | Where is the Eiffel Tower located? | 67
Who painted the Mona Lisa? | Who is the author of 'Hamlet'? | 33
Who is the author of 'Pride and Prejudice? | Which famous English novelist wrote 'Sense and Sensibility? | 80
Is the company's annual revenue increasing or decreasing? | How is the company's annual revenue trending? | 90
What is the best way to make a homemade pizza? | How can I improve my cooking skills? | 57

Download and open the file in Megaladata. If necessary, you can install the free Megaladata Community Edition.

Download example
