Fuzzy Search using Java

Fuzzy search helps you find content in your data that is similar to your query, not just exactly the same. It’s super handy when the data contains typos, misspellings, or spelling variations. This article shows how to perform a fuzzy search in Java across files scattered over multiple folders.

Here are a few examples of where fuzzy search helps:

  • You might not be sure if the document uses the spelling “color” or “colour.”
  • When looking for “John,” it could actually be spelled as “Jon” or perhaps “Jhon.”
  • Trying to find “USA” even if someone types “U.S.A.”
  • If you make a “mistaek,” oh wait, it’s actually a “mistake.”

The solution to finding such content is Fuzzy Search.
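To make "kind of similar" concrete, here is a minimal sketch that measures how far apart the example words above are, using the classic Levenshtein edit distance. This is an illustration in plain Java only; the class and method names are made up for this example, and nothing here claims to be how any particular search library compares words internally.

```java
final class EditDistanceDemo {

    /**
     * Classic Levenshtein distance: the minimum number of single-character
     * insertions, deletions, and substitutions needed to turn a into b.
     */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // building the first j chars of b from "" takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // reducing the first i chars of a to "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1), // insertion, deletion
                        prev[j - 1] + cost);                     // match or substitution
            }
            int[] tmp = prev; // reuse the two rows instead of allocating a full table
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        String[][] pairs = {
            {"color", "colour"}, {"John", "Jon"}, {"John", "Jhon"},
            {"USA", "U.S.A."}, {"mistaek", "mistake"}
        };
        for (String[] pair : pairs) {
            System.out.println(pair[0] + " -> " + pair[1] + ": " + distance(pair[0], pair[1]));
        }
    }
}
```

Note that plain Levenshtein distance counts swapped neighbouring letters (as in “Jhon” or “mistaek”) as two edits. Variants such as Damerau-Levenshtein count a swap as a single edit.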

Java Fuzzy Search Library

To perform fuzzy searches in Java, we’ll use the GroupDocs.Search for Java API. The API is flexible and offers a customizable degree of error tolerance, which is useful when dealing with typos and with language variations such as British and American English.

With this library, fuzzy search can be performed across a wide variety of file formats: Word documents (DOC, DOCX), spreadsheets (XLS, XLSX), presentations (PPT, PPTX), PDFs, markup languages (HTML, XML), Markdown (MD), eBooks (EPUB, CHM, FB2), emails (MSG, EML), OneNote notes, and even ZIP archives.

If you want to know all the file types it can handle, just peek at the documentation.

To get started, you can grab the API from the download section, or just add the latest repository and dependency configurations to your Maven-based Java application.

Let’s Fuzzy Search in Files using Java

Follow these steps to perform a fuzzy search in multiple files of various file formats within folders using Java:

  1. Start by creating an Index, specifying the folder where the index will be stored.
  2. Add the path of the main folder that contains your files to the index.
  3. Provide the search query that you want to look for.
  4. Turn on Fuzzy Search in the search options so that small mistakes are tolerated.
  5. Set the Similarity Level in the Fuzzy Algorithm.
  6. Execute the search using the search method to get the search results.
  7. Now, you can traverse the SearchResults to create or print the output as you like.
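To show the shape of this workflow without depending on any library, here is a plain-Java stand-in. It is not the GroupDocs.Search API: it has no index, it only reads .txt files, and it uses a simple edit-distance cutoff in place of a similarity level. All class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Stream;

final class FolderFuzzySearchSketch {

    /** Levenshtein distance between a and b (insertions, deletions, substitutions). */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    /** Counts every distinct word in text that is within maxEdits of the query. */
    static Map<String, Integer> countMatches(String text, String query, int maxEdits) {
        Map<String, Integer> counts = new TreeMap<>();
        String q = query.toLowerCase(Locale.ROOT);
        // Split on anything that is not a letter or digit to get lowercase words.
        for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!word.isEmpty() && distance(q, word) <= maxEdits) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Walks root and its subfolders, searching every .txt file it finds. */
    static Map<String, Map<String, Integer>> searchFolder(Path root, String query, int maxEdits) {
        Map<String, Map<String, Integer>> results = new TreeMap<>();
        try (Stream<Path> paths = Files.walk(root)) {
            paths.filter(Files::isRegularFile)
                 .filter(p -> p.getFileName().toString().endsWith(".txt"))
                 .forEach(p -> {
                     try {
                         String text = new String(Files.readAllBytes(p), StandardCharsets.UTF_8);
                         Map<String, Integer> hits = countMatches(text, query, maxEdits);
                         if (!hits.isEmpty()) {
                             results.put(root.relativize(p).toString(), hits);
                         }
                     } catch (IOException e) {
                         throw new UncheckedIOException(e);
                     }
                 });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return results;
    }

    public static void main(String[] args) {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        searchFolder(root, "nulla", 1)
                .forEach((file, hits) -> System.out.println(file + " " + hits));
    }
}
```

A real search library replaces the per-file scan with an index built once and queried many times, which is what makes searching large folders practical.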

The Java code for these steps looks through all the files and subfolders for content that approximately matches your query. The similarity level is set to 0.75, which is equivalent to a 75% match, so the search tolerates spelling mistakes of up to 25%. If you want to fine-tune the search, just change the similarity level in the code.
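As a rough way to reason about what a similarity level means for a given word, the sketch below turns it into a mistake budget. It assumes the budget is the share (1 - similarity level) of the word length, rounded down. That rule is an assumption made for illustration; the library's actual algorithm is not described here and may differ. The class and method names are hypothetical.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

final class SimilarityLevelSketch {

    /**
     * Hypothetical helper: number of mistakes tolerated for a word of the given
     * length, ASSUMING the budget is (1 - similarityLevel) * length, rounded down.
     */
    static int maxMistakes(int wordLength, double similarityLevel) {
        if (wordLength < 0 || similarityLevel < 0.0 || similarityLevel > 1.0) {
            throw new IllegalArgumentException(
                    "wordLength must be >= 0 and similarityLevel must be in [0, 1]");
        }
        // BigDecimal avoids binary floating-point surprises: in double arithmetic,
        // 5 * (1 - 0.8) comes out slightly below 1 and would round down to 0.
        BigDecimal errorShare = BigDecimal.ONE.subtract(BigDecimal.valueOf(similarityLevel));
        return errorShare.multiply(BigDecimal.valueOf(wordLength))
                         .setScale(0, RoundingMode.FLOOR)
                         .intValue();
    }

    public static void main(String[] args) {
        for (int length = 1; length <= 12; length++) {
            System.out.println("length " + length + " -> "
                    + maxMistakes(length, 0.75) + " mistake(s) tolerated");
        }
    }
}
```

Under this assumption, a five-letter query at level 0.75 gets a budget of one mistake, and very short words get none, which is why lowering the similarity level tends to matter most for longer words.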

After you run the code, you’ll get the fuzzy search results. If you want to see how to print them, keep reading this article. For the query “nulla”, the printed results look like this:

Query: nulla
 Documents: 2
 Occurrences: 135

     Document: Lorem ipsum.docx
     Occurrences: 132
         Field: content
         Occurrences: 132
             nulla               98
             nullam              34

     Document: EnglishText.txt
     Occurrences: 3
         Field: content
         Occurrences: 3
             dull                1
             full                1
             fully               1

Printing Search Results

In Java, you can present your search results in two ways:

  • Highlight all the approximate matches.
  • Print the results in a readable and analyzable format.
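For the second option, the sketch below formats per-document match counts into an indented report similar to the sample output shown earlier. It works on a plain nested Map, which is a hypothetical stand-in for whatever result objects your search library returns; the class and method names are made up for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ResultReportSketch {

    /** Formats term counts per document as an indented plain-text report. */
    static String format(String query, Map<String, Map<String, Integer>> hitsByDocument) {
        int total = 0;
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, Map<String, Integer>> doc : hitsByDocument.entrySet()) {
            int docTotal = 0;
            StringBuilder terms = new StringBuilder();
            for (Map.Entry<String, Integer> term : doc.getValue().entrySet()) {
                docTotal += term.getValue();
                // Left-align the term in a 20-character column, then the count.
                terms.append(String.format("        %-20s%d%n", term.getKey(), term.getValue()));
            }
            total += docTotal;
            body.append(String.format("%n    Document: %s%n    Occurrences: %d%n",
                    doc.getKey(), docTotal));
            body.append(terms);
        }
        String header = String.format("Query: %s%nDocuments: %d%nOccurrences: %d%n",
                query, hitsByDocument.size(), total);
        return header + body;
    }

    public static void main(String[] args) {
        // Counts copied from the sample output for the query "nulla".
        Map<String, Integer> lorem = new LinkedHashMap<>();
        lorem.put("nulla", 98);
        lorem.put("nullam", 34);
        Map<String, Integer> english = new LinkedHashMap<>();
        english.put("dull", 1);
        english.put("full", 1);
        english.put("fully", 1);

        Map<String, Map<String, Integer>> hits = new LinkedHashMap<>();
        hits.put("Lorem ipsum.docx", lorem);
        hits.put("EnglishText.txt", english);

        System.out.print(format("nulla", hits));
    }
}
```

Using LinkedHashMap keeps documents and terms in insertion order, so the report lists them in the order the results were collected.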

Getting a Free License or a Free Trial

Free License

Obtain a temporary license for free to explore this library without constraints.

Free Trial

You can download the free trial from the downloads section.

Conclusion

In this article, we explored how to perform a fuzzy search programmatically in Java. Fuzzy search finds words that approximately match your query, even when they contain small mistakes. This feature is handy for dealing with differences between British and American English, typos, name variations, and similar-sounding words.

For more about the API, check out the documentation.

If you have questions or want to discuss more, head to the forum.

