Posted by manning_pubs
on October 12, 2012 at 10:30 AM PDT
Searching at Scale
by Trey Grainger and Timothy Potter, authors of Solr in Action
One of the most appealing aspects of Solr, beyond its speed, relevancy, and powerful text searching features, is how well it scales. In this article, based on chapter 3 of Solr in Action, the authors explain how Solr is able to scale to handle billions of documents while still maintaining lightning-fast search response times.
Solr can scale to handle billions of documents and ever-larger query volumes by adding servers. This article lays the groundwork for operating a scalable search engine. Specifically, we will discuss the nature of Solr documents as “denormalized” documents and why this enables linear scaling across servers, how distributed (federated) searching works, the conceptual shift from thinking about servers to thinking about clusters of servers, and some of the limits of scaling Solr.
The denormalized document
Central to Solr is the concept that all documents are denormalized. A completely denormalized document is one in which all fields related to the document are contained within it; no data needed by a query lives only in some other document that the query would also have to consult. If you have any training whatsoever in building normalized tables for relational databases, please leave that training at the door when thinking about modeling content into Solr. Figure 1 demonstrates a traditional normalized database table model, with a big X over it to make it obvious that this is not the kind of data-modeling strategy you will use with Solr.
Figure 1 Solr documents do not follow the traditional normalized model of a relational database. This figure demonstrates how NOT to think of Solr Documents. Instead of thinking in terms of multiple entities with relationships to each other, a Solr Document is modeled as a flat, denormalized data structure, as shown in listing 1.
Notice that the information in figure 1 represents two users who are working at a company called Code Monkeys R Us, LLC. While this figure shows the data nicely normalized into separate tables for the employees' personal information, location, and company, this is not how we would represent these users in a Solr Document. Listing 1 shows the denormalized representation for each of these employees as mapped to a Solr Document.
Listing 1 Two denormalized user documents
<doc>
<field name="about">I'm a real monkey</field>
<field name="companyname">Code Monkeys R Us, LLC</field>
<field name="companydescription">we write lots of code</field>
</doc>
<doc>
<field name="username">John Doe</field>
<field name="about">Senior Software Engineer with 10 years of
experience with java, ruby, and .net</field>
<field name="companyname">Code Monkeys R Us, LLC</field>
<field name="companydescription">we write lots of code</field>
</doc>
Notice that all of the company information is repeated in both the first and second user's documents in listing 1, which seems to go against the principles of normalized database design for reducing data redundancy and minimizing data dependency. In a traditional relational database, you can construct a query that joins data from multiple tables. While some basic join functionality does now exist in Solr, it is recommended only for cases where it is impractical to actually denormalize content. Solr knows about terms that map to documents but does not natively know about any relationships between documents. That is, if you wanted to search for all users who work for companies in Decatur, GA, you would need to ensure the companycity and companystate fields are populated on every user document for that lookup to succeed.
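To make the trade-off concrete, here is a minimal Python sketch of flattening normalized records into self-contained documents. The record layout, the ids, the second user's name, and the company's location are assumptions made for illustration; only the companyname, companydescription, companycity, and companystate field names come from the text above.

```python
# Hypothetical normalized records, loosely mirroring figure 1.
companies = {
    1: {
        "name": "Code Monkeys R Us, LLC",
        "description": "we write lots of code",
        "city": "Decatur",   # assumed location, for illustration only
        "state": "GA",
    },
}
users = [
    {"id": "u1", "username": "John Doe", "company_id": 1},
    {"id": "u2", "username": "Jane Roe", "company_id": 1},  # hypothetical user
]


def denormalize(user, companies):
    """Copy the company fields into the user record so it stands alone."""
    company = companies[user["company_id"]]
    return {
        "id": user["id"],
        "username": user["username"],
        "companyname": company["name"],
        "companydescription": company["description"],
        "companycity": company["city"],
        "companystate": company["state"],
    }


docs = [denormalize(user, companies) for user in users]

# Finding users who work in Decatur, GA needs no join: every document
# already carries the company fields it is filtered on.
matches = [
    doc["username"]
    for doc in docs
    if doc["companycity"] == "Decatur" and doc["companystate"] == "GA"
]
print(matches)
```

The company fields are duplicated in every document, which is exactly the redundancy a normalized schema would avoid, but each document can now be matched without consulting any other.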
While this denormalized document data model may sound limiting, it also provides a sizable advantage: extreme scalability. Because we can assume that each document is self-contained, we can also partition documents across multiple servers without worrying about keeping documents on the same server as related documents for querying needs. This fundamental assumption of document independence allows queries to be parallelized across multiple partitions of documents and multiple servers to improve query performance, and this ultimately allows Solr to scale horizontally to handle querying billions of documents. This ability to scale across multiple partitions and servers is called federated or distributed searching.
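Because documents are independent, any rule that assigns each document to exactly one partition will do. The sketch below uses a stable hash of a hypothetical id field; this is an illustrative choice on our part, not a description of how Solr itself assigns documents.

```python
import zlib


def shard_for(doc_id, num_shards):
    """Pick a partition from a stable hash of the document id."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def partition(docs, num_shards):
    """Place every document in exactly one partition."""
    shards = [[] for _ in range(num_shards)]
    for doc in docs:
        shards[shard_for(doc["id"], num_shards)].append(doc)
    return shards


# Hypothetical self-contained documents; no document refers to another,
# so it does not matter which partition any of them lands in.
docs = [
    {"id": f"user-{i}", "companyname": "Code Monkeys R Us, LLC"}
    for i in range(1000)
]
shards = partition(docs, 4)
print([len(shard) for shard in shards])
```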
The world would be a much simpler place if every important data operation could be run using a single server. It would also be a much more boring world. In reality, sometimes your search servers become overloaded by either too many queries at a time or by too much data needing to be searched through for a single server to handle.
In the latter case, it is necessary to break your content into two or more separate Solr indexes, each of which contains a separate partition of your data. Then, every time a search is run, it is sent to every partition, and the partial results are aggregated before being returned from the search engine.
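The query-everywhere-then-aggregate pattern can be sketched in a few lines of Python. Everything here is a toy stand-in: the in-memory partitions, the naive term-frequency score, and the function names are assumptions for illustration, not Solr internals.

```python
from concurrent.futures import ThreadPoolExecutor
import heapq


def search_shard(shard, term, rows):
    """Score each document on one partition by naive term frequency."""
    hits = []
    for doc in shard:
        score = doc["text"].lower().split().count(term)
        if score > 0:
            hits.append((score, doc["id"]))
    return heapq.nlargest(rows, hits)


def distributed_search(shards, term, rows=3):
    """Query every partition in parallel, then merge the partial top-N lists."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda s: search_shard(s, term, rows), shards))
    merged = [hit for partial in partials for hit in partial]
    return heapq.nlargest(rows, merged)


shard_a = [{"id": "a1", "text": "java java ruby"}, {"id": "a2", "text": "python"}]
shard_b = [{"id": "b1", "text": "java"}, {"id": "b2", "text": "java java java"}]
print(distributed_search([shard_a, shard_b], "java"))
```

Each partition only needs to return its own best matches; the merge step then picks the overall best from those short lists, which is why the aggregation cost stays small relative to the search itself.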
Solr includes this federated searching capability out of the box; within the Solr community it is usually just called distributed search. Conceptually, each Solr index (called a Solr core) is available through its own unique URL, and each of those Solr cores can be told to perform an aggregated search across other Solr cores using the following syntax:
Notice three features about the example shards parameter above:
- The searched Solr core (box1, core1) is also included in the list of shards—it will not automatically search itself unless explicitly requested as above.
- This distributed search is searching across multiple servers.
- Multiple Solr cores (core2 and core3) are on the same box (box2). There is no requirement that Solr cores be located on separate machines.
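A request with the three properties listed above can be assembled as in the following Python sketch. The host and core names follow the bullets and the port follows listing 2; the match-all query q=*:* and the helper function name are assumptions made for illustration.

```python
from urllib.parse import urlencode


def distributed_search_url(entry_core, shards, query="*:*"):
    """Build a select URL that asks entry_core to search every listed shard."""
    params = {"q": query, "shards": ",".join(shards)}
    # safe="/:,*" leaves those characters unescaped so the URL stays readable.
    return f"http://{entry_core}/select?{urlencode(params, safe='/:,*')}"


shards = [
    "box1:8983/solr/core1",  # the core receiving the request lists itself too
    "box2:8983/solr/core2",
    "box2:8983/solr/core3",  # two cores on the same box
]
url = distributed_search_url("box1:8983/solr/core1", shards)
print(url)
```

Note that the entries in the shards parameter are written as host:port/path without a scheme, separated by commas.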
The important take-away here is the nature of scaling Solr. In theory, Solr should scale linearly, because a federated search across multiple Solr cores runs in parallel on each of those index partitions. Thus, if you divide one Solr index into two Solr indexes that each hold half of the documents, the federated search across the two indexes should take roughly half as long, plus any aggregation overhead.
This should also theoretically scale to any other number of servers (in reality, you will eventually hit a limit at some point). The conceptual formula for estimating total query time (assuming the same total number of documents, divided evenly) is thus as follows:
(Query Time on N indexes) = Aggregation Overhead
+ (Query Time on 1 index)/N
This formula is useful for estimating the benefit you can expect from increasing the number of partitions into which your data is evenly divided. Since Solr scales nearly linearly, you should be able to reduce your query times roughly in proportion to the number of Solr cores (partitions) you add, assuming you are not constrained by server resources due to heavy load.
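As a rough worked example, the estimate can be read as the single-index query time divided by the number of equal partitions, plus a fixed aggregation overhead. The timings below (800 ms on one index, 20 ms of overhead) are invented for illustration and are not measurements.

```python
def estimated_query_time(single_index_ms, partitions, overhead_ms):
    """Estimate query time when the same documents are split evenly."""
    if partitions < 1:
        raise ValueError("partitions must be at least 1")
    if partitions == 1:
        return single_index_ms  # nothing to aggregate on a single index
    return overhead_ms + single_index_ms / partitions


# Hypothetical numbers: 800 ms on one index, 20 ms to aggregate results.
for n in (1, 2, 4, 8):
    print(n, estimated_query_time(800, n, 20))
```

With these assumed numbers, going from 4 to 8 partitions saves far less time than going from 1 to 2, because the fixed overhead makes up a growing share of the total. That is one way to see why scaling is only nearly linear.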
Clusters vs. servers
In the last section, we introduced the concept of federated searching to enable scaling to handle large document sets. It is also possible to add multiple essentially identical servers into your system to balance the load of high query volumes.
Both of these scaling strategies rely on a conceptual shift away from thinking about servers and toward thinking about “clusters” of servers. A cluster of servers is essentially defined as multiple servers, working in concert, to perform the same function.
Take the following example:
This example performs a federated search across two Solr cores, core1 on box1 and core2 on box2. When running this distributed search, what happens to queries hitting box1 if box2 goes down? Listing 2 shows Solr's response under this scenario, which includes the error message from the failed connection to box2.
Listing 2 A failed distributed search (RemoteServer down)
occured when talking to server at: http://box1:8983/solr/core2
Notice that the servers, for this use case, are mutually dependent. If one becomes unavailable for searching, they all become unavailable for searching and begin failing, as indicated in the exception in listing 2. It is important to think in terms of clusters of servers instead of single servers when building out Solr solutions which must scale beyond a single box, as those servers are essentially combining to serve as a single computing resource.
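The mutual dependency can be mimicked with a toy model in Python. The ShardDown exception, the function names, and the shard labels are hypothetical stand-ins; the point is only that a request which must hear from every partition fails as soon as one partition cannot answer.

```python
class ShardDown(Exception):
    """Raised when a shard cannot be reached (hypothetical error type)."""


def query_shard(name, available):
    """Return one fake hit from a reachable shard, or fail if it is down."""
    if name not in available:
        raise ShardDown(f"cannot reach {name}")
    return [f"{name}-doc"]


def distributed_search(shards, available):
    """Collect results from every shard; any failure fails the whole request."""
    results = []
    for name in shards:
        results.extend(query_shard(name, available))
    return results


shards = ["box1/core1", "box2/core2"]
print(distributed_search(shards, available={"box1/core1", "box2/core2"}))

try:
    # box2 is down, so the search entering through box1 fails as well.
    distributed_search(shards, available={"box1/core1"})
except ShardDown as err:
    print("search failed:", err)
```

In practice this is why each partition is usually backed by more than one server: the cluster, not any single box, is the unit that has to stay available.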
As we wrap up our discussion of the key concepts behind searching at scale, we should be clear that Solr does have its limitations.
The limits of Solr
Solr is an incredibly powerful document-based NoSQL datastore that supports full-text searching and data analytics. Among its most powerful features are its inverted index and its complex keyword-based Boolean query capabilities. We've seen that Solr can scale essentially linearly across multiple servers to handle additional content or query volumes. What, then, are the use cases where Solr is not a good solution? What are the limits of Solr?
One limit is that Solr is NOT relational in any way across documents. It is not well suited for joining significant amounts of data across different fields on different documents, and it cannot perform join operations across multiple servers.
We have also already discussed the denormalized nature of Solr documents—data that is redundant must be repeated across each document for which that data applies. This can be particularly problematic when the data in one field that is shared across many documents changes.
For example, let's say that you were creating a search engine of social networking user profiles, and one of your users, John Doe, becomes friends with another user named Coco. Now you not only need to update John's and Coco's profiles, but you also need to update the second-level connections field for all of John's and Coco's friends, which could represent hundreds of document updates for one simple operation: two users becoming friends. This harkens back to the notion that Solr is not relational in any way.
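The size of that fan-out is easy to estimate with a small Python sketch. The friend lists below are invented, and the rule that every direct friend's document carries a second-level connections field is taken from the example above.

```python
def docs_to_update(friends, a, b):
    """Profiles whose connection fields change when a and b become friends."""
    # a and b gain a first-level connection; each of their existing friends
    # gains a new second-level connection.
    return {a, b} | friends[a] | friends[b]


# Hypothetical friend lists, kept tiny for readability.
friends = {
    "john": {"f1", "f2", "f3"},
    "coco": {"f3", "f4"},
}
affected = docs_to_update(friends, "john", "coco")
print(len(affected), sorted(affected))
```

With a few hundred friends on each side instead of a handful, the same single event touches several hundred documents.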
An additional limitation of Solr is that it currently serves primarily as a document storage mechanism: you can insert, delete, and update whole documents, but you cannot easily update single fields. Solr does currently have some minimal capability to update a single field, but only if the field is attributed in such a way that its original value is stored in the index, which can be very wasteful. Even then, Solr is internally still updating the entire document by re-indexing all of its stored fields. What this means is that, whenever a new field is added to Solr or the contents of an existing field have changed, every single document in the Solr index must be reprocessed in its entirety before the data will be populated for that field. Many other NoSQL systems suffer from this same problem, but it is worth noting that data updates across the corpus require a non-trivial amount of document management to ensure the updates make it to Solr in a timely fashion.
Solr is also optimized for a very specific use case: taking search queries with small numbers of search terms, rapidly looking up each of those terms to find matching documents, calculating relevancy scores and ordering them all, and then returning only a few actual results for display. Solr is not optimized, however, for processing very long queries (thousands of terms) or for returning very large result sets to users.
One final limitation of Solr worth mentioning is its elastic scalability. While Solr scales very well across servers, it does not yet elastically scale by itself in a fully automatic way. Recent work with SolrCloud, utilizing Apache ZooKeeper for cluster management, is a great first step in this direction, but there are still many features to be worked out, such as automatic content resharding and pluggable sharding strategies, which are under discussion but not yet fully implemented.
We discussed key concepts for how Solr scales, including content denormalization within documents and federated (distributed) searching, which lets query execution be parallelized to maintain or decrease search times even as content grows beyond what can reasonably be handled by a single machine. We then looked at scaling our search architectures over clusters as opposed to servers, and we ended by discussing some of the limitations of Solr and use cases for which Solr may not be a great fit.