Searching in a Large Number of Documents

Search software usually achieves fast response times because it searches an index instead of scanning the text directly. This is like finding the pages of a book related to a keyword by looking in the index at the back, rather than reading every word on every page.
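To make the book-index analogy concrete, here is a minimal sketch of an inverted index in plain Java. It is only an illustration of the general idea and not how GroupDocs.Search is implemented; the class name InvertedIndexSketch and its methods are hypothetical names chosen for this example.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical toy class: maps each term to the ids of the documents containing it.
public class InvertedIndexSketch {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Split the text into lowercase alphanumeric tokens and record the document id for each.
    public void add(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // A lookup is a single map access; no document text is scanned at query time.
    public Set<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "Quarterly sales report");
        index.add(2, "Sales presentation for the board");
        index.add(3, "Meeting notes");
        System.out.println(index.search("sales")); // prints [1, 2]
    }
}
```

The cost of answering a query depends on the size of the index entry for the term, not on the total amount of text, which is why index-based search scales to large collections.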

Using GroupDocs.Search for Indexing and Searching

Problem: Suppose you have 10 million documents in different file formats, e.g. MS Word documents, spreadsheets, presentations, etc. Because memory is limited, you cannot hold more than 5% of the entire data in memory at once. The main question is how to index and search the documents under this constraint.

Solution: GroupDocs.Search for .NET provides several ways to perform search operations on document collections of any size. It can index various types of documents and run searches against them. The API supports searching for:

  • Text occurrences
  • Basic metadata fields
  • File names
  • Document types

Moreover, it supports different search query types, including advanced searches (e.g. fuzzy search, synonym search, and boolean search).
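As a rough illustration of what two of these query types mean, the sketch below implements a boolean search as set operations on lists of matching document ids, and a fuzzy search as matching within a small Levenshtein edit distance. This is a generic textbook approach written in plain Java, not the GroupDocs.Search implementation or API; the class QueryTypesSketch and all of its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical toy helpers illustrating boolean and fuzzy matching.
public class QueryTypesSketch {
    // Boolean AND: documents that appear in both result sets.
    static Set<Integer> and(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.retainAll(b);
        return result;
    }

    // Boolean OR: documents that appear in either result set.
    static Set<Integer> or(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.addAll(b);
        return result;
    }

    // Levenshtein distance: minimum number of single-character edits turning s into t.
    static int editDistance(String s, String t) {
        int[] prev = new int[t.length() + 1];
        for (int j = 0; j <= t.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= s.length(); i++) {
            int[] cur = new int[t.length() + 1];
            cur[0] = i;
            for (int j = 1; j <= t.length(); j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            prev = cur;
        }
        return prev[t.length()];
    }

    // Fuzzy match: indexed terms within maxDistance edits of the query term.
    static List<String> fuzzy(String query, Collection<String> vocabulary, int maxDistance) {
        List<String> matches = new ArrayList<>();
        for (String term : vocabulary) {
            if (editDistance(query, term) <= maxDistance) {
                matches.add(term);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        Set<Integer> sales = new TreeSet<>(Arrays.asList(1, 2));
        Set<Integer> report = new TreeSet<>(Arrays.asList(2, 3));
        System.out.println(and(sales, report)); // prints [2]
        System.out.println(or(sales, report));  // prints [1, 2, 3]
        System.out.println(fuzzy("docment", Arrays.asList("document", "spreadsheet"), 1)); // prints [document]
    }
}
```

A synonym search could be sketched in the same spirit by expanding the query term through a lookup table of synonyms before running an OR over the results.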

Creating Index

Let’s use the GroupDocs.Search API to index a large batch of documents in different file formats (see the supported formats list). Although the index can be created in memory, here we will create it on disk. Just follow these simple steps:

  • Create a directory for the index.
  • Create another directory and copy all the required documents into it.
  • In the code, first initialize an Index object by passing the path of the index directory.
  • Add the documents using the AddToIndex("Documents_Folder_Path") method of the Index object.

The C# code will look like this:

Java developers can write the code like this:

After the index is created, you will see the generated files in the index folder, as in the following screenshot:

Searching for Terms

GroupDocs.Search supports various kinds of queries for search operations, along with more advanced features. Please see this article for details.

Let’s move on to the code.

Suppose the index has already been created as described in the section above. Let’s simply search for a term. Follow the steps below:

  • Instantiate Index by passing the index folder path.
  • Search for the term using the Index.Search() method, which returns a SearchResults object.
  • Show the list of files in which the term was found.

The C# code will look like this:

Java developers can write the code like this:

The output will look like the following screenshot:

The complete, ready-to-run code sample is available on GitHub.