Effective search plays an important role in the success of any website. In the age of big data, however, websites struggle to deliver low latency and high throughput when scaling search platforms across enormous numbers of documents. Apache Solr has gained traction among developers as a viable and cost-effective enterprise search platform for such high-end search. It has continued to grow in popularity ever since it was introduced in 2004, and it is today one of the most popular enterprise search engines. There are several reasons for this popularity.
Compatibility: Apache Solr is written in Java, with Apache Lucene at the backend. Solr wraps Lucene in a REST-like API that can be called over HTTP from any platform or language. This style of communication delivers a simple stateless architecture and facilitates cacheable client-server communication without using much bandwidth.
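To make the HTTP interface concrete, here is a minimal sketch in Python that only builds a search URL for Solr's `/select` handler, without sending it. The host, port, and the collection name `products` are assumptions chosen for illustration; `q`, `rows`, and `wt` are standard Solr query parameters.

```python
from urllib.parse import urlencode

# Assumed values for illustration: a local Solr node and a
# hypothetical collection named "products".
SOLR_BASE = "http://localhost:8983/solr"
COLLECTION = "products"


def build_select_url(query, rows=10):
    """Return the URL of a keyword search against the /select handler."""
    params = {"q": query, "rows": rows, "wt": "json"}
    return f"{SOLR_BASE}/{COLLECTION}/select?{urlencode(params)}"


url = build_select_url("title:laptop", rows=5)
print(url)
```

Any HTTP client in any language can issue a GET request for such a URL and parse the JSON response, which is what makes integration language-agnostic.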
Scalability: The Solr infrastructure, which relies on Apache ZooKeeper for cluster coordination, supports heterogeneous workloads and is extremely scalable. Search servers can be added or removed dynamically, and workloads can be isolated at the application and pipeline level.
Speed and Performance: Solr delivers on speed and performance. It uses the Lucene Java search library at its core for full-text indexing and search, and offers near-real-time search, streaming, and indexing with latency guarantees. It backs this up with robust features such as context-aware replication, application-specific performance tuning, and disaster recovery, with a no-downtime assurance for all consumers. HTTP interfaces with flexible I/O formats and extensive support for query parsing make it very easy to add and find the required data.
One noteworthy feature that improves performance is automatic shard and replica rebalancing. Adding nodes does not by itself add indexing capacity, but adding replicas of shards lets the software take advantage of the extra hardware at search time. With such an infrastructure, Solr is capable of searching millions of documents with millisecond response times, at loads between 100K and 120K queries per second (QPS). Replication to other Solr servers ensures high availability.
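As a rough illustration of how shards and replicas are specified, the sketch below builds a Collections API `CREATE` request URL and computes how many cores the cluster would host. The base URL and the collection name `articles` are hypothetical; `action`, `name`, `numShards`, and `replicationFactor` are parameters of the SolrCloud Collections API. The URL is only constructed, not sent.

```python
from urllib.parse import urlencode


def build_create_collection_url(base, name, num_shards, replication_factor):
    """Build a Collections API CREATE URL (constructed only, never sent)."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "replicationFactor": replication_factor,
    }
    return f"{base}/admin/collections?{urlencode(params)}"


def total_cores(num_shards, replication_factor):
    """Each shard is stored replication_factor times across the cluster."""
    return num_shards * replication_factor


# Hypothetical cluster address and collection name.
create_url = build_create_collection_url(
    "http://localhost:8983/solr", "articles", num_shards=2, replication_factor=3
)
print(create_url)
print(total_cores(2, 3))
```

With two shards and a replication factor of three, six cores share the work: the index is split in two, and each half can be served by three copies at search time.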
Ease of Use and Flexibility: Solr supports both schemaless and schema-based data modes. Its dynamic fields allow new fields to be added on the fly, mapping them automatically to existing field types based on their names. Field types also make it easy to mix and match Lucene analyzers without writing code.
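The name-based mapping of dynamic fields can be pictured with a small simulation. This is a simplified sketch of the idea, not Solr's actual resolution code: the patterns and field type names below are assumptions loosely modeled on common suffix conventions such as `*_i` and `*_s`.

```python
from fnmatch import fnmatchcase

# Hypothetical dynamic field rules: (name pattern, field type).
DYNAMIC_FIELDS = [
    ("*_i", "int"),
    ("*_s", "string"),
    ("*_dt", "date"),
    ("*_txt", "text_general"),
]


def resolve_field_type(field_name):
    """Return the type of the first pattern matching field_name, else None."""
    for pattern, field_type in DYNAMIC_FIELDS:
        if fnmatchcase(field_name, pattern):
            return field_type
    return None


for name in ("price_i", "author_s", "created_dt", "summary_txt", "unknown"):
    print(name, "->", resolve_field_type(name))
```

A document can therefore carry a field such as `price_i` that was never declared explicitly, and it is still indexed with a sensible type.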
Customization: Solr’s powerful external configuration makes it possible to tailor it to most application types without writing code. The REST-like interfaces provide for easy integration with any language, and a plugin architecture supports more advanced customization.
Features
Solr is open source, highly scalable, fault tolerant, and optimized for high-volume traffic, all of which makes it popular. What makes it the search platform of choice, however, is the host of high-end features on offer.
The following features make performing accurate searches easy, fast, and efficient:
• Full-text search capability: The Lucene library delivers powerful matching capabilities including phrases, wildcards, joins, groupings, and more, across data types. The query language supports both structured and textual search.
• Hit highlighting: Highlighting the search phrase in the search results is available as a native feature. The equivalent is hard to set up in SQL Server.
• Built-in faceted navigation or search: The faceted navigation model leverages metadata fields, allowing users to clarify and refine queries. Users start with a classic keyword search and scan the results to refine the search incrementally. The end result is equivalent to what a sophisticated Boolean query generates.
• Spatial search: Solr supports spatial and geo-spatial search. Spatial search makes it possible to index points or other shapes, filter search results by circle, bounding box, or other shapes, sort or boost scoring by distance between points, index and search multi-value time or other numeric durations, and do much more.
• Ability to handle rich documents, such as Word, PDF, PowerPoint and more, making for more comprehensive search.
• Dynamic clustering: Solr can group search results into clusters of similar documents at query time, without requiring categories to be defined in advance. This helps users navigate large result sets.
• NoSQL features: NoSQL, or “Not only SQL”, is all the rage in the database world now, as it offers a mechanism for storing and retrieving data modeled in ways other than tabular relations. This facilitates simplicity of design, horizontal scaling, and finer control over availability.
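Several of the features listed above are switched on through request parameters. The following sketch assembles the parameters for a single query that combines faceting, hit highlighting, and a spatial filter. The field names (`category`, `description`, `store`) and the coordinates are hypothetical; the parameter names (`facet`, `facet.field`, `hl`, `hl.fl`, `fq` with `{!geofilt}`, `sfield`, `pt`, `d`) are standard Solr query parameters, with `d` being a distance in kilometers.

```python
from urllib.parse import urlencode


def build_feature_params(keyword, facet_field, highlight_field,
                         location_field, lat, lon, radius_km):
    """Combine faceting, highlighting, and a geo filter in one request."""
    return [
        ("q", keyword),
        ("facet", "true"),             # enable faceted navigation
        ("facet.field", facet_field),  # count hits per value of this field
        ("hl", "true"),                # enable hit highlighting
        ("hl.fl", highlight_field),    # field to generate snippets from
        ("fq", "{!geofilt}"),          # keep only hits within a circle
        ("sfield", location_field),    # spatial field holding the point
        ("pt", f"{lat},{lon}"),        # center of the circle
        ("d", str(radius_km)),         # radius in kilometers
    ]


# Hypothetical field names and coordinates.
params = build_feature_params("laptop", "category", "description",
                              "store", 45.15, -93.85, 5)
query_string = urlencode(params)
print(query_string)
```

Appending this query string to a `/select` URL would return matching documents together with facet counts and highlighted snippets, limited to the given radius.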
Solr 5.0
The release of Solr 5.0 in February 2015 marks a watershed in the history of Solr, for this is the first release packaged as a standalone application. Previous versions of Solr came packaged as a WAR file, which is essentially a collection of resources that together constitute a web application. Packaging the software this way has the big disadvantage that no changes can be made at runtime. Any change requires regenerating and redeploying the entire WAR file.
Apache Solr 5.0 has also matured in many ways compared to past releases. It is now more scalable and easier to use than before, and it brings many ease-of-use improvements.
• Improved Security: Solr 4.0 introduced SolrCloud, a new design and architecture. Solr 5.0 builds on this to effect plenty of hardening and usability improvements, such as bash scripts to configure APIs and splitting of the ClusterState to improve scalability. Hardening is the process of securing a system by reducing its surface of vulnerability: removing unnecessary software and code, tying up loose ends, and ensuring code integrity.
• Improved Scalability: Solr 5.0 splits ClusterState to give every collection its own ClusterState by default. This eliminates the need for every node to watch state changes in collections it does not host.
• Improved Stability: Solr 5.0 improves stability by giving the replication handler an option to throttle the speed of replication, and by tweaking timeouts.
• Better Management: The ZooKeeper server helps manage the overall structure of Solr, so that both indexing and search requests are routed properly. The new version makes it easier to manage ZooKeeper.
• Ease of Use and Performance Enhancements: Posting documents is easier than in previous versions. This is made possible by an improved SimplePostTool and the bin/post script that wraps it. The scripts are richer, easier to use, and faster.
Solr 5.0 also brings forth some interesting new features:
• Distributed IDF (inverse document frequency): Solr 5.0 can compute IDF across all shards of a collection rather than per shard, so that relevance scores are comparable across shards. This builds on SolrCloud, which provides a truly distributed set of features, with support for leader election, optimistic concurrency, automatic routing, and other checks expected from a distributed system.
• Additions to the Stats component: The Stats component returns simple statistics for numeric, string, and date fields within the document set. It can now also generate statistics over the numeric results of arbitrary functions. In addition, the new DateRangeField makes it easy to index date ranges.
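To see why distributed IDF matters, consider a term that is rare on one shard and common on another. The sketch below compares per-shard IDF with a global IDF computed from pooled statistics. The formula is the classic Lucene-style `1 + ln(N / (df + 1))` and is an assumption made for illustration; the values Solr actually computes depend on the configured similarity. The shard statistics are made up.

```python
import math


def idf(doc_freq, num_docs):
    """Classic Lucene-style inverse document frequency (assumed formula)."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Made-up statistics for one term on two shards of the same collection.
shards = {
    "shard1": {"num_docs": 1000, "doc_freq": 9},    # term is rare here
    "shard2": {"num_docs": 1000, "doc_freq": 499},  # term is common here
}

# Without distributed IDF, each shard scores with its own local statistics.
local_idf = {
    name: idf(stats["doc_freq"], stats["num_docs"])
    for name, stats in shards.items()
}

# With distributed IDF, statistics are pooled across shards before scoring.
global_idf = idf(
    sum(stats["doc_freq"] for stats in shards.values()),
    sum(stats["num_docs"] for stats in shards.values()),
)

for name, value in local_idf.items():
    print(f"{name}: local idf = {value:.3f}")
print(f"global idf = {global_idf:.3f}")
```

With local statistics, the same term is weighted very differently on the two shards, so scores for otherwise identical documents are not comparable when results are merged. Using the pooled value on every shard removes that skew.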
Solr has already carved out a niche for itself, and is the search platform of choice for many major websites, content management systems, and more. Its users include big names such as Twitter, LinkedIn, CNET, Netflix, and many others. Many users, notably LinkedIn's SNA (Search, Network, and Analytics) team, contribute to making this open-source platform more resilient and relevant with every passing day.