Fuzzy search helps you find content in your data that is similar to your query, not just exactly the same. It’s super handy when the data contains typos, misspellings, or spelling variations. This article shows the Java way to perform a fuzzy search across files that are scattered around folders.
Here are a few examples of where fuzzy search helps:
- You might not be sure if the document uses the spelling “color” or “colour.”
- When looking for “John,” it could actually be spelled as “Jon” or perhaps “Jhon.”
- Trying to find “USA” even if someone types “U.S.A.”
- If you make a “mistaek,” oh wait, it’s actually a “mistake.”
The solution to finding such content is Fuzzy Search.
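To make “kind of similar” concrete, one common measure is the Levenshtein edit distance: the minimum number of single-character insertions, deletions, or substitutions needed to turn one word into another. The snippet below is a minimal sketch of that measure applied to the examples above. It assumes plain Levenshtein distance for illustration only; this article does not state which measure GroupDocs.Search uses internally, and the class name `EditDistanceSketch` is hypothetical.

```java
/** Library-independent sketch (hypothetical class): plain Levenshtein edit distance. */
class EditDistanceSketch {

    /** Minimum number of single-character insertions, deletions, or substitutions turning a into b. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // building the first j characters of b from "" takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // reducing the first i characters of a to "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1], prev[j]) + 1;
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] swap = prev; // the row just filled becomes the previous row
            prev = curr;
            curr = swap;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        String[][] pairs = {
            {"color", "colour"}, {"John", "Jon"}, {"John", "Jhon"},
            {"USA", "U.S.A."}, {"mistaek", "mistake"}
        };
        for (String[] pair : pairs) {
            System.out.println(pair[0] + " vs " + pair[1] + ": " + distance(pair[0], pair[1]));
        }
    }
}
```

Under this measure, “color” and “colour” are one edit apart, while swapped letters such as “mistaek” count as two edits, because plain Levenshtein distance has no transposition operation.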
Java Fuzzy Search Library
To run fuzzy searches in Java, we’ll use the GroupDocs.Search for Java API. The API is flexible and lets you customize the degree of error tolerance, which is useful when dealing with typos and with language variations such as British and American English.
With this library, fuzzy search can be performed across a wide variety of file formats, including Word documents (DOC, DOCX), spreadsheets (XLS, XLSX), presentations (PPT, PPTX), PDFs, markup languages (HTML, XML), Markdown (MD), eBooks (EPUB, CHM, FB2), emails (MSG, EML), OneNote notes, and even ZIP archives.
If you want to know all the file types this magic can handle, just peek at the documentation.
To get started, you can grab the API from the download section, or just add the latest Maven repository and dependency configuration to your Java application.
Let’s Fuzzy Search in Files using Java
Follow these steps to perform a fuzzy search in multiple files of various file formats within folders using Java:
- Start by creating an Index, specifying the folder where the index will be stored.
- Add the main folder that holds your files to the index.
- Provide the search query.
- Turn on the magic of Fuzzy Search so it understands small mistakes.
- Set the Similarity Level in the Fuzzy Algorithm.
- Execute the search using the search method to get the search results.
- Now, you can traverse the SearchResults to create or print the output as you like.
With these settings, the search looks through all the files and subfolders for content that approximately matches your query. The similarity level is set to 0.75, which is equivalent to a 75% match, so spelling mistakes of up to 25% are tolerated. If you want to fine-tune the search, just change the similarity level.
Running the search for the query “nulla” returns fuzzy search results like the ones below. If you want to see how to print the search results, keep reading this article.
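The steps above can be sketched without the library as follows. This is a stand-in for illustration, not the GroupDocs.Search API: the class `FolderFuzzySearchSketch` and its methods are hypothetical, it reads only UTF-8 `.txt` files (Java 11+), and it assumes similarity is computed as one minus the Levenshtein distance divided by the longer word's length. The library's own algorithm may count mistakes differently.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical stand-in for the steps above; it does not call GroupDocs.Search. */
class FolderFuzzySearchSketch {

    /** Plain Levenshtein distance (insertions, deletions, substitutions). */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1], prev[j]) + 1, prev[j - 1] + cost);
            }
            int[] swap = prev;
            prev = curr;
            curr = swap;
        }
        return prev[b.length()];
    }

    /** Assumed similarity measure in [0, 1]: 1.0 for identical words, lower as edits accumulate. */
    static double similarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        return longest == 0 ? 1.0 : 1.0 - (double) distance(a, b) / longest;
    }

    /** Counts the words in text whose similarity to the query reaches the given level. */
    static Map<String, Integer> findMatches(String text, String query, double level) {
        String needle = query.toLowerCase(Locale.ROOT);
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!word.isEmpty() && similarity(word, needle) >= level) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Walks root recursively and searches every .txt file; other formats would need a parser. */
    static Map<String, Map<String, Integer>> searchFolder(Path root, String query, double level) {
        Map<String, Map<String, Integer>> byFile = new TreeMap<>();
        try (Stream<Path> paths = Files.walk(root)) {
            List<Path> files = paths
                .filter(p -> Files.isRegularFile(p) && p.toString().endsWith(".txt"))
                .collect(Collectors.toList());
            for (Path file : files) {
                Map<String, Integer> hits = findMatches(Files.readString(file), query, level);
                if (!hits.isEmpty()) {
                    byFile.put(root.relativize(file).toString(), hits);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return byFile;
    }

    public static void main(String[] args) {
        Path root = Path.of(args.length > 0 ? args[0] : "."); // steps 1-2: folder to search
        String query = "mistake";                              // step 3: the search query
        double similarityLevel = 0.75;                         // steps 4-5: fuzzy tolerance
        searchFolder(root, query, similarityLevel)             // step 6: run the search
            .forEach((file, hits) -> System.out.println(file + " " + hits)); // step 7: traverse
    }
}
```

With this particular measure, “mistke” scores about 0.86 against “mistake” and passes the 0.75 level, while the swapped-letter “mistaek” scores about 0.71 (two edits over seven characters) and would need a lower level or a transposition-aware distance.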
Query: nulla
Documents: 2
Occurrences: 135

Document: Lorem ipsum.docx
Occurrences: 132
    Field: content
    Occurrences: 132
        nulla 98
        nullam 34

Document: EnglishText.txt
Occurrences: 3
    Field: content
    Occurrences: 3
        dull 1
        full 1
        fully 1
Printing Search Results
You can present the search results in two ways:
- Highlight all the approximate matches.
- Print the results in a readable and analyzable format.
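Both presentation styles can be sketched in plain Java. The snippet below is a hedged illustration rather than the GroupDocs.Search result API: `ResultPresentationSketch`, `highlight`, and `report` are hypothetical names, matches are marked with square brackets instead of real document highlighting, and the report uses a simplified layout without the per-field breakdown. The counts in `main` are taken from the sample output shown earlier.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

/** Hypothetical highlighter and printer; not the GroupDocs.Search result API. */
class ResultPresentationSketch {

    /** Wraps every whole-word, case-insensitive occurrence of a matched term in [ ]. */
    static String highlight(String text, Set<String> matchedTerms) {
        if (matchedTerms.isEmpty()) {
            return text;
        }
        String alternatives = matchedTerms.stream()
            .map(Pattern::quote) // treat each term literally
            .collect(Collectors.joining("|"));
        Pattern pattern = Pattern.compile("\\b(" + alternatives + ")\\b", Pattern.CASE_INSENSITIVE);
        return pattern.matcher(text).replaceAll("[$1]");
    }

    /** Sums the occurrence counts of all terms found in one document. */
    static int total(Map<String, Integer> terms) {
        return terms.values().stream().mapToInt(Integer::intValue).sum();
    }

    /** Formats term counts per document in a simplified version of the sample layout. */
    static String report(String query, Map<String, Map<String, Integer>> termsByDocument) {
        int occurrences = termsByDocument.values().stream()
            .mapToInt(ResultPresentationSketch::total)
            .sum();
        StringBuilder out = new StringBuilder();
        out.append("Query: ").append(query).append('\n');
        out.append("Documents: ").append(termsByDocument.size()).append('\n');
        out.append("Occurrences: ").append(occurrences).append('\n');
        for (Map.Entry<String, Map<String, Integer>> doc : termsByDocument.entrySet()) {
            out.append("Document: ").append(doc.getKey()).append('\n');
            out.append("Occurrences: ").append(total(doc.getValue())).append('\n');
            doc.getValue().forEach((term, count) ->
                out.append("    ").append(term).append(' ').append(count).append('\n'));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Counts copied from the article's sample output for the query "nulla".
        Map<String, Integer> lorem = new LinkedHashMap<>();
        lorem.put("nulla", 98);
        lorem.put("nullam", 34);
        Map<String, Integer> english = new LinkedHashMap<>();
        english.put("dull", 1);
        english.put("full", 1);
        english.put("fully", 1);
        Map<String, Map<String, Integer>> results = new LinkedHashMap<>();
        results.put("Lorem ipsum.docx", lorem);
        results.put("EnglishText.txt", english);

        System.out.print(report("nulla", results));
        System.out.println(highlight("Nulla facilisi. Nullam id dolor.", lorem.keySet()));
    }
}
```

Keeping the highlighter separate from the search means the same matched terms can feed either view: a marked-up copy of the text for reading, or a per-document count for analysis.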
Getting a Free License or a Free Trial
Free License
Obtain a temporary license for free to explore this library without constraints.
Free Trial
You can download the free trial from the downloads section.
Conclusion
In this article, we explored how to perform a fuzzy search programmatically in Java. Fuzzy search finds approximately matching words, even when they contain small mistakes. This feature is handy for dealing with differences between British and American English, typos, name variations, and similar-sounding words.
For more about the API, check out the documentation.
If you have questions or want to discuss more, head to the forum.