Occasionally, I come across a project that presents an interesting challenge or two. In one of these instances, I was working on an automation project for a customer who wanted to ingest all of their quotes from different suppliers into their Salesforce application. The quote documents were highly unstructured PDFs and Excel files, so we needed to design a bot that would extract data from these files using OCR technology and ensure that all mandatory information was loaded into Salesforce.
This automation helps the customer address current challenges in their Quote-to-Cash process, where most quotes are consolidated at a high level without importing individual line items into Salesforce. This is due to the sheer volume of quote line items and the time spent matching quote line-item details with the products in Salesforce, which is particularly challenging when dealing with large datasets and variations in product names across suppliers.
At Infomentum, I designed the RPA to account for variations in document formats and changes in field names used by suppliers. However, to tackle the challenge of comparing various quote line-item details with the products in Salesforce, I decided to use the Levenshtein Algorithm. This algorithm has proven effective in past projects involving other automation tools for creating master data.
The Levenshtein Algorithm offers a powerful solution to the comparison problem by measuring the similarity between two strings and providing the nearest matching product name, thereby allowing the bot to pick the product from Salesforce. This blog will guide you through implementing the Levenshtein Algorithm as part of an automation solution to compare and match quote line items with products.
Importance of matching products with quote line items for business operations
Matching quote line items to products accurately is crucial for generating precise quotes and maintaining data integrity. The solution we designed using the Levenshtein algorithm enhances this process by:
- Improving accuracy in matching similar product names.
- Reducing manual efforts in data cleaning and matching.
- Enhancing automation in quote generation workflows.
The quote line items eventually lead to asset creation for the opportunity; therefore, having all line items loaded accurately is essential.
Introduction to the Damerau-Levenshtein Algorithm
The Levenshtein algorithm, also known as the edit distance algorithm, was developed by Soviet scientist Vladimir Levenshtein in 1965. It calculates the minimum number of single-character edits (insertions, deletions or substitutions) required to transform one string into another. This metric is particularly useful for identifying similar strings despite minor differences, such as typos or variations in naming conventions.
The Damerau-Levenshtein distance is an extension of the Levenshtein distance algorithm, incorporating additional operations to measure string similarity more accurately by specifically addressing common human typing errors. These could include transpositions in addition to the three operations considered by Levenshtein.
Improvements
The Damerau-Levenshtein distance improves on the Levenshtein distance by considering transpositions of adjacent characters as a single operation. As a result, it provides better accuracy in certain scenarios, especially where character swaps are prevalent.
Key Operations:
- Insertion: Adding a character.
- Deletion: Removing a character.
- Substitution: Replacing one character with another.
- Transposition: Swapping two adjacent characters.
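To make the first three operations concrete, here is a minimal Levenshtein distance implementation, sketched in Python purely for illustration (the actual bot runs inside MuleSoft RPA):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions
    needed to turn string a into string b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```

The classic "kitten" to "sitting" example needs three edits: substitute 'k' with 's', substitute 'e' with 'i', and insert a 'g'.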
Example
Consider the strings "cat" and "act":
- Levenshtein Distance:
  - Transforming "cat" to "act" requires 2 substitutions (changing 'c' to 'a' and 'a' to 'c').
  - Distance = 2.
- Damerau-Levenshtein Distance:
  - Transforming "cat" to "act" requires 1 transposition (swapping the adjacent 'c' and 'a').
  - Distance = 1.
By accounting for transpositions, the Damerau-Levenshtein distance more accurately reflects the true edit distance between strings, particularly in the context of common typographical errors.
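The transposition operation slots into the same dynamic-programming table as the other three edits. Below is an illustrative Python sketch of the Damerau-Levenshtein distance (the restricted, optimal-string-alignment variant), reproducing the "cat"/"act" result above:

```python
def damerau_levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, substitutions
    and transpositions of adjacent characters."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i  # deleting i characters from a
    for j in range(len(b) + 1):
        d[0][j] = j  # inserting j characters of b
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
            # transposition of two adjacent characters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]

print(damerau_levenshtein("cat", "act"))  # 1
```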
In my particular automation project, I had to consider variations in product names from suppliers that could mean the same thing and fall under the same Salesforce product category. For example, “Router 2GHz” and “Router 5GHz” both fall under the “Hardware-Routers” category in Salesforce. The algorithm gives a 58% match on the “Hardware-Routers” category for both of these supplier product names, which is a good indication of a match. A design decision was then made on what percentage to treat as a close match.
Understanding the stages in the algorithm
- Calculate the distance: The distance checker function calculates the Levenshtein distance between two strings. This distance represents the minimum number of single-character edits (insertions, deletions or substitutions) needed to transform one string into another.
- Convert to percentage for a single comparison: The percentage match function calculates the similarity between two strings based on the Levenshtein distance. It converts the distance into a percentage, representing how similar the two strings are.
- Extend the comparison to a list/array of strings and sort by closest match: The third function compares a single input string with multiple strings, returning the percentage match for each comparison. In our case, where each extracted product string is compared with the set of products from Salesforce, this allows us to find the closest match.
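The three stages could be sketched in Python as follows. The function names are my own, and the percentage formula here divides the distance by the longer string's length, which is one common convention, not necessarily the exact formula used in the project:

```python
def levenshtein(a: str, b: str) -> int:
    """Stage 1: minimum insertions, deletions and substitutions to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (0 if ca == cb else 1)))
        prev = curr
    return prev[-1]

def percentage_match(a: str, b: str) -> float:
    """Stage 2: convert the distance into a similarity percentage (100 = identical)."""
    if not a and not b:
        return 100.0
    return (1 - levenshtein(a.lower(), b.lower()) / max(len(a), len(b))) * 100

def closest_matches(name: str, candidates: list[str]) -> list[tuple[str, float]]:
    """Stage 3: score one extracted name against every candidate, best first."""
    return sorted(((c, percentage_match(name, c)) for c in candidates),
                  key=lambda pair: pair[1], reverse=True)

print(round(percentage_match("colour", "color"), 1))                 # 83.3
print(closest_matches("Swich", ["Router", "Switch", "Firewall"])[0][0])  # Switch
```

The sorted output means the bot only has to look at the head of the list to decide whether the best candidate clears the agreed threshold.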
Implementing the D-L Algorithm with MuleSoft automation
In the automation, the robot executes a sequence of steps to identify matches. Below is how the process is run through MuleSoft RPA Automation:
- The bot first downloads the quote document to process it.
- It then extracts all the line items, including key mandatory details like product name, description, quantity and unit cost, using OCR technology. PDF quotes are processed slightly differently from XLSX quotes.
- It creates a list of product names that need to be compared with the products from Salesforce.
- It loops through the product list from the quote and executes the Levenshtein algorithm one by one to get the closest matches.
- The bot then creates a JSON/CSV document that can be imported into the Salesforce application through REST API calls.
- The bot repeats the same process for all of the quotes and creates the quote line items in the system.
- If there is no close match, or if the matched percentage is lower than a set value, the bot picks a default product category of Unassigned, which then becomes a manual task to update the corresponding category.
- The lookup product list in Salesforce will inevitably grow, so the solution improves comparison speed by moving the strings and array lists into file-based processing.
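The threshold fallback described in the steps above could be sketched like this. The 60% cutoff, product names and function names are illustrative assumptions, not the project's actual values:

```python
def edit_distance(a: str, b: str) -> int:
    # standard Levenshtein dynamic-programming recurrence
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (0 if ca == cb else 1)))
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    # similarity percentage relative to the longer string
    return (1 - edit_distance(a.lower(), b.lower()) / max(len(a), len(b), 1)) * 100

MATCH_THRESHOLD = 60.0        # illustrative cutoff, agreed with the business
DEFAULT_CATEGORY = "Unassigned"

def assign_category(name: str, salesforce_products: list[str]) -> str:
    """Pick the closest Salesforce product, or fall back to the default
    category when no candidate clears the threshold."""
    best = max(salesforce_products, key=lambda p: similarity(name, p))
    return best if similarity(name, best) >= MATCH_THRESHOLD else DEFAULT_CATEGORY

print(assign_category("Swich", ["Router", "Switch", "Firewall"]))            # Switch
print(assign_category("Unknown item XYZ", ["Router", "Switch", "Firewall"])) # Unassigned
```

Anything routed to the default category then surfaces as a manual task, as described above, so nothing is silently mis-assigned.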
Conclusion
The Damerau-Levenshtein algorithm offers a more nuanced approach to measuring string similarity by incorporating transpositions, making it particularly useful where typographical errors are common and variations of the same name exist. As a result, it is a valuable tool in fields including text processing, data cleaning and natural language processing, and it can be incorporated into any comparison-based automation requirement. Other algorithms could certainly be used for such comparisons, but given the use case and the project timelines, this one suited the requirements very well.
By understanding and implementing the Damerau-Levenshtein distance in Salesforce or other platforms, you can achieve more accurate and efficient data matching, ultimately improving the quality and reliability of your automations. In the case discussed, testing showed that the algorithm successfully matched 75% of the total product line items from the quotes. The customer team is continually updating the list of products in Salesforce to make the matching more reliable for the bot and bring this percentage closer to 100%.