Six Apache Solr Search Best Practices for Optimal Performance and Cost Saving

Apache Solr is exceptionally reliable, scalable, and fault-tolerant. It provides everything from distributed indexing, replication, and load-balanced queries to centralized configuration, and automated failover and recovery. Solr powers the search and navigation features of many of the world's largest internet sites.

What many companies fail to recognize, however, is that an improperly configured Solr cloud can quickly cause costs to soar. By taking advantage of the best practices featured in this blog, you can achieve top-notch performance while saving costs.

Indexed Fields

You can save your company a considerable amount of money in storage and cloud usage fees by controlling the number of indexed fields you incorporate in your search implementation. Indexing represents a trade-off between quality, performance, and cost. Indexing too many fields may end up degrading performance while increasing costs. The best strategy is to index the minimum number of fields needed to support your performance target.

The number of indexed fields increases the following:

Memory usage while indexing
Segment merge times
Optimization times
Index size

Do not mark fields as indexed=true if they are not used in a query.
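
As a minimal sketch of this rule, the snippet below builds a payload for Solr's Schema API "add-field" command in which a display-only field is stored but not indexed. The field names and the helper function are hypothetical, and the field types (text_general, string) are assumed to exist in your schema; the payload is only built here, not sent to a server.

```python
import json


def field_definition(name, field_type, used_in_queries, returned_in_results):
    """Describe one field for a Schema API "add-field" command (hypothetical helper)."""
    return {
        "name": name,
        "type": field_type,
        # Index only the fields that queries actually use.
        "indexed": used_in_queries,
        # Store only the fields that must come back in search results.
        "stored": returned_in_results,
    }


# Hypothetical product fields: the title is searched and displayed,
# while the thumbnail URL is only displayed, so it is not indexed.
payload = {
    "add-field": [
        field_definition("title", "text_general", True, True),
        field_definition("thumbnail_url", "string", False, True),
    ]
}

# This JSON body would be POSTed to the collection's /schema endpoint.
print(json.dumps(payload, indent=2))
```

Keeping the indexed and stored decisions as two explicit arguments makes each field's purpose visible in one place when you review the schema.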

Stored Fields

Retrieving the stored fields of a query result can be a significant expense. This cost is affected by the number of bytes stored per document. The higher the byte count, the more sparsely the documents are distributed on disk, which requires more I/O bandwidth to retrieve the desired fields.

These costs start to add up on cloud-based implementations. As with indexed fields, however, you are again confronted with a trade-off between quality, performance, and cost. As you define more stored fields, Solr can return more of each result directly, without a second lookup elsewhere, but you incur higher storage and I/O costs.
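
To get a feel for how quickly stored bytes grow, here is a rough sketch that averages the UTF-8 size of the stored field values over a few sample documents. The sample documents and field names are made up, and the figure is the raw size before any compression Solr applies on disk, so treat it as a relative comparison rather than a prediction of index size.

```python
def stored_bytes_per_doc(docs, stored_fields):
    """Average raw UTF-8 size of the stored field values across sample documents."""
    if not docs:
        return 0.0
    total = 0
    for doc in docs:
        for field in stored_fields:
            # Fields missing from a document contribute nothing.
            total += len(str(doc.get(field, "")).encode("utf-8"))
    return total / len(docs)


# Hypothetical sample documents with one large "body" field.
sample_docs = [
    {"id": "1", "name": "alpha", "body": "x" * 1000},
    {"id": "2", "name": "beta", "body": "y" * 3000},
]

lean = stored_bytes_per_doc(sample_docs, ["id", "name"])
full = stored_bytes_per_doc(sample_docs, ["id", "name", "body"])
print(f"id+name: {lean} bytes/doc, with body: {full} bytes/doc")
```

In this made-up sample, storing the large body field multiplies the per-document byte count several hundred times, which is the kind of difference worth measuring on your own data before marking a field as stored.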

Utilizing Solr Cache

Solr caches several distinct types of information to ensure that similar queries do not repeat work unnecessarily. There are three major cache types, as summarized in Table 1.

Cache Type	Description
Query Cache	Stores sets of document IDs returned by queries
Filter Cache	Stores filters built by Solr in response to filters added to queries
Document Cache	Stores the requested document fields when showing query results

Here is a little more information on the Document Cache. When asking Solr to search, you may request one or more stored fields to be returned with your results. The Document Cache reduces the time required to service subsequent requests that include the same document.

Configure Cache Autowarm

You can give your search performance a significant boost by activating ‘autowarming’ for the query and filter caches listed in Table 1. (The document cache cannot be autowarmed, because the internal document IDs it relies on can change when the index changes.) By setting autowarmCount above zero, you instruct Solr to pre-populate, or ‘autowarm’, the cache of a new searcher with entries carried over from the old searcher's cache. The autowarmCount is the number of cached items that will be copied into the new searcher.
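
The sketch below renders the three cache elements as they might appear in the query section of solrconfig.xml, with autowarmCount set on the filter and query result caches. The sizes and counts are illustrative assumptions, not recommendations, and the cache class shown (solr.CaffeineCache) is assumed to be available in your Solr release; older releases use other classes such as solr.LRUCache.

```python
import xml.etree.ElementTree as ET


def cache_element(tag, size, autowarm_count, cache_class="solr.CaffeineCache"):
    """Build one cache element for solrconfig.xml (hypothetical helper)."""
    elem = ET.Element(tag)
    elem.set("class", cache_class)
    elem.set("size", str(size))
    elem.set("initialSize", str(size))
    # Number of entries copied from the old searcher's cache into the new one.
    elem.set("autowarmCount", str(autowarm_count))
    return elem


caches = [
    cache_element("filterCache", 512, 128),
    cache_element("queryResultCache", 512, 64),
    # Left at 0 in this sketch: Solr does not autowarm the document cache.
    cache_element("documentCache", 512, 0),
]

snippet = "\n".join(ET.tostring(c, encoding="unicode") for c in caches)
print(snippet)
```

Larger autowarmCount values make a new searcher faster on its first queries but slow down the moment it becomes available, so it is worth testing a few values against your commit frequency.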

Streaming Expressions

Solr offers a simple yet powerful stream processing language for SolrCloud. Streaming expressions are a suite of functions that can be combined to perform many parallel computing tasks. These functions are the basis for the Parallel SQL Interface.

The example shown here dumps all rows from ‘some_collection’ and allows parallel processing of data using ‘year’ as the partition key.

expr=search(some_collection,
       zkHost="solr.gspann.com:9999",
       qt="/export",
       q="*:*",
       fl="id,name,addr,status,year,month,day",
       sort="year desc, name asc",
       partitionKeys="year"
)
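
A streaming expression is sent to a collection's /stream handler as the expr request parameter, so it has to be URL-encoded first. The following sketch builds such a request URL without sending it. The base URL (http://localhost:8983) and the helper function are assumptions for illustration, and zkHost is left out on the assumption that the collection lives in the same cluster as the handler receiving the request.

```python
from urllib.parse import urlencode


def stream_url(base_url, collection, expr):
    """Build the /stream request URL for a streaming expression (hypothetical helper)."""
    return f"{base_url}/solr/{collection}/stream?" + urlencode({"expr": expr})


# The same search as above, written as a single string.
expr = (
    'search(some_collection, qt="/export", q="*:*", '
    'fl="id,name,addr,status,year,month,day", '
    'sort="year desc, name asc", partitionKeys="year")'
)

# Assumed local Solr address; replace with your own node.
url = stream_url("http://localhost:8983", "some_collection", expr)
print(url)
```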

JVM Memory Heap Settings

The Java Virtual Machine (JVM) memory heap is the pool of memory used by Java to store objects created by Java applications at runtime. The heap settings affect system resource utilization. In a cloud environment, resource usage is metered and costs can quickly add up. In addition, these settings impact performance and directly affect the success or failure of your search implementation.

Adding Enough Headroom

Various tools, including jconsole, which is usually included with standard Java installations, show the minimum amount of memory consumed by the JVM when running Solr. If you do not allocate enough memory to the heap, the ‘garbage collection’ process runs too often and drives up resource consumption. We recommend adding 25% to 50% over the minimum required memory to operate at peak performance.
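
The headroom recommendation is simple arithmetic, sketched below. The 8 GB minimum is an assumed measurement used only for illustration; substitute the figure you observe in your own monitoring tool.

```python
def heap_with_headroom(min_required_gb, headroom):
    """Heap size after adding headroom (a fraction between 0.25 and 0.50)."""
    if not 0.25 <= headroom <= 0.50:
        raise ValueError("headroom should be between 0.25 and 0.50")
    return min_required_gb * (1 + headroom)


# Assumed observation: the JVM needs at least 8 GB when running Solr.
low = heap_with_headroom(8, 0.25)
high = heap_with_headroom(8, 0.50)
print(f"Allocate between {low} GB and {high} GB of heap")
```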

OS Memory

At first, you might be inclined to allocate as much memory as possible to the JVM memory heap. You need to exercise caution because Solr makes extensive use of the org.apache.lucene.store.MMapDirectory component. This component relies on the operating system (OS) and, by extension, the CPU's memory management.

As part of its normal operation, Solr maps index files into memory that is drawn from the OS kernel, and not from the memory heap used by the JVM. You should strive to minimize the amount of memory allocated to the JVM memory heap to free OS memory, which is used by the ‘MMapDirectory’ component.

Yes, this seems to contradict the advice on headroom, but it’s important to recognize that the more memory you allocate to the JVM memory heap, the less memory there is available to the OS. When the OS has too little memory left for the memory-mapped index files, access to the index slows down and search performance suffers as well. It’s important to strike a balance between the JVM memory heap and OS memory.
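
One rough way to reason about this balance is to subtract the heap from the machine's total memory and compare what is left with the size of the index on disk. The sketch below does that under stated assumptions: the numbers are made up, 1 GB is reserved for other processes, and the check treats "the whole index fits in OS memory" as the ideal case, even though an index does not strictly need to fit entirely to perform well.

```python
def os_memory_left_gb(total_ram_gb, heap_gb, other_processes_gb=1.0):
    """Memory left to the OS (and memory-mapped index files) after the JVM heap."""
    return total_ram_gb - heap_gb - other_processes_gb


def index_fits_in_os_memory(index_size_gb, total_ram_gb, heap_gb):
    """Rough check: could the OS keep the whole index in memory?"""
    return index_size_gb <= os_memory_left_gb(total_ram_gb, heap_gb)


# Hypothetical 64 GB machine with a 40 GB index.
print(index_fits_in_os_memory(40, 64, 16))  # 47 GB left for the OS
print(index_fits_in_os_memory(40, 64, 32))  # 31 GB left for the OS
```

In this example, doubling the heap from 16 GB to 32 GB leaves less OS memory than the index occupies, which illustrates why a larger heap is not automatically better.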

What Should Be the JVM Memory Heap Size?

A range of 8 to 16 GB is common. It’s best to start with the smaller value and run tests while monitoring memory usage. Gradually tune the heap size up until you achieve optimal performance. The more extensive your tests, the better your chance of a successful deployment.
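
The tuning loop described above can be prepared as a list of candidate settings to test one by one, starting from the smallest. This sketch generates the standard -Xms and -Xmx JVM flags for each step; the 2 GB step and the choice to set both flags to the same value are assumptions for illustration, not part of the recommendation.

```python
def candidate_heap_settings(start_gb=8, stop_gb=16, step_gb=2):
    """JVM heap flags to try in order, from the smallest heap to the largest."""
    return [
        f"-Xms{gb}g -Xmx{gb}g"
        for gb in range(start_gb, stop_gb + 1, step_gb)
    ]


# Run your load test once per setting while monitoring memory usage.
for flags in candidate_heap_settings():
    print(flags)
```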

Garbage Collection

‘Garbage collection’ refers to the process by which memory that a Java program no longer uses is reclaimed and returned to the general memory pool for reuse. Over the years, there has been considerable improvement in garbage collection components. The default garbage collector prior to Java 9 was ParallelGC. The G1GC garbage collector does much of its work concurrently with the application, using multiple threads, and addresses latency issues encountered in ParallelGC. Thus, our team recommends that if you are using Java 9 or later, you use the G1GC garbage collector.
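
The recommendation can be expressed as a small helper that picks JVM flags by Java version. -XX:+UseG1GC and -XX:MaxGCPauseMillis are standard HotSpot options; the helper itself and the 250 ms pause target are assumptions for illustration and should be tuned against your own latency measurements.

```python
def gc_flags(java_major_version, max_pause_ms=250):
    """Suggest garbage collector flags for a given Java major version."""
    if java_major_version >= 9:
        return [
            # Select the G1 collector explicitly.
            "-XX:+UseG1GC",
            # Pause-time goal for G1; the value here is only a starting point.
            f"-XX:MaxGCPauseMillis={max_pause_ms}",
        ]
    # On older versions, this sketch keeps that JVM's default collector.
    return []


print(" ".join(gc_flags(11)))
```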

Conclusion

The key takeaway from this article is that when it comes to Solr search, it’s important to strike a balance between quality, performance, and cost. Finer quality can be achieved by manipulating indexed and stored fields.

Higher performance comes from configuring the Solr caches, autowarming, using streaming expressions, and allocating memory carefully. High quality and performance can translate to higher costs due to memory, resource, and bandwidth usage. You can achieve optimal search by carefully monitoring memory, resource, and bandwidth usage as you conduct extensive testing using a large volume of realistic data.

Navaneeth Aripekatta Kocha

Practice - Application Development

Published Mar 07 2023