Full-text search is a way to search a text/query within a collection of documents. This approach quickly finds all instances of a term/phrase and it works by using text indexes. In this article, we will learn how to programmatically search full-text in documents using Java.

Full Text Search

After this, you can implement various search techniques and build your search solution for word-processing documents, spreadsheets, presentations, HTML files, PDF files, eBooks, email messages, ZIP archives, and many other document formats.

The following topics are covered below:

Java API for Full-Text Searching

GroupDocs.Search provides a full-text search Java API that can be integrated into any application without any third-party tool and software dependency. It allows you to search over a large list of document formats. Some of the search techniques that can be performed using the API are as follows:

  • Case Sensitive Search
  • Regular Expression Search
  • Faceted Search
  • Fuzzy Search
  • Homophone Search
  • Synonym Search

Download or Configure

You may download the JAR file from the downloads section, or just get the latest repository and dependency configurations for the pom.xml of your maven-based Java applications.

<repository>
	<id>GroupDocsJavaAPI</id>
	<name>GroupDocs Java API</name>
	<url>http://repository.groupdocs.com/repo/</url>
</repository>
<dependency>
        <groupId>com.groupdocs</groupId>
        <artifactId>groupdocs-search</artifactId>
        <version>21.3</version> 
</dependency>

There are two steps to perform the search within files stored in a folder.

  • Indexing
  • Perform Search

Index files using Java

An index possesses scanned text of all the documents. Therefore, when you are going to perform a search operation, only the index is referenced, rather than the text of the original documents. To make it possible to search instantly across thousands of documents with the same or different file formats, you need to create an index and add these documents to it. When documents are indexed, the index is ready to handle the search queries.

The following simple two lines create an index and also add the documents folder to the index.

Index index = new Index("indexingFolderPath");
index.add("documentsFolderPath");

After indexing multiple documents of the same or different formats like (Word, PDF, Excel, and HTML), we can move ahead to process a specific search query (search term “Draw”) over them. The following are the steps for how to perform text search on multiple documents within a folder using Java:

  • Specify the source folder of documents and index folder.
  • Create Index using the index folder.
  • Add the source folder to the index.
  • Prepare the query string.
  • Perform a search using the search method of Index class.
  • Traverse each search results for the properties of each document.

The following source code performs text search in Java on all the documents of the provided folder.

// Search specified text in multiple PDF, Word, Excel, HTML documents with in a folder using Java
Index index = new Index("path/indexingFolder");
index.add("path/documentsFolderPath");
// Searching in index for specified text
SearchResult result = index.search("Draw");
for (int i = 0; i < result.getDocumentCount(); i++) {
FoundDocument document = result.getFoundDocument(i);
System.out.println("Document Path: " + document.getDocumentInfo().getFilePath());
System.out.println("Occurrence : " + document.getOccurrenceCount());
}

We will get the document path and the number of occurrences of the search terms in all the documents with that specified folder. Here is the screenshot for visualization.

Full Search Text Output

Highlight Text Search Results in Java

Let’s now perform the same full-text search and also highlight all the occurrences that match your query.

The following steps show how to highlight the text search results:

  • Create Indexand add the documents folder to the index.
  • Prepare the query string.
  • Search the document folder using the search method.
  • While traversing the results, create the highlighter using the HtmlHighlighter.
  • Use the highlight method to highlight the search results.

The following code generates the HTML output with highlighted search results using Java.

// Highlight the full text search results of multiple documents in a folder in Java
Index index = new Index("path/indexingFolder");
index.add("path/documentsFolderPath"); // Synchronous indexing documents from the specified folder
String query = "draw"; // Specify a search query
SearchResult result = index.search(query); // Searching in the index
for (int i = 0; i < result.getDocumentCount(); i++)
{
FoundDocument document = result.getFoundDocument(i);
String path = "path/Highlighted-"+ i +".html";
OutputAdapter outputAdapter = new FileOutputAdapter(path);
HtmlHighlighter highlighter = new HtmlHighlighter(outputAdapter); // Creating the highlighter
index.highlight(document, highlighter); // Generates HTML formatted output document with highlighted search results
}

As an output, we will get multiple HTML files. Each file will show the content of a separate document (e.g. excel.xlsx, source.docx, target.docx) with highlighted search terms/words. Given below is the highlighted HTML output of a DOCX file, TXT file, and PDF file obtained using the above code.

Highlight full text search results in content using Java

Get a Free API License

You can get a free temporary license in order to use the API without the evaluation limitations.

Conclusion

In this article, we have learned to search text within multiple documents of a folder in Java. Further, we discussed how to programmatically highlight the text of search results in HTML format for MS Word files, TXT files, and PDF files using GroupDocs.Search for Java.

You may learn more about the API using documentation. Many more examples are available at GitHub. For queries, contact us via the forum.

See Also