Java & ElasticSearch – Getting up and running

ElasticSearch is a great Java-based search solution. When I started researching it, I quickly became enthusiastic. The search capabilities that come out of the box, combined with a solid clustering architecture, seem to make for a terrific scalable search solution.

However, when I started to use it, I noticed that there are not that many tutorials on the use of the Java API. Also, the documentation on ElasticSearch, although quite complete, is somewhat vague. It is perfectly possible to figure out the details by comparing the JSON API with the Java API and by trial and error, but I want to document some of my findings. Therefore I plan to write them down here on Dev Discoveries, so that I have a future reference. And maybe it will help others as well.

As I work on this material, I am using a project that I host on GitHub: elasticsearch-java-demo. You can use this project as a reference for some of the material explained here. The project might not contain all the code that is shown here, but it should give some insight into how to use ElasticSearch.

This article was written using ElasticSearch 0.90.10.

Quick overview of ElasticSearch

When looking for a Java-based search solution, one quickly discovers that ElasticSearch is a major player. It is a very fast search engine, based on the Lucene libraries, with solid clustering built in from the ground up. One can find use cases on the web where ElasticSearch enables fast and feature-rich searching over many terabytes of data. I will not go into the Solr versus ElasticSearch debate. Both are viable choices, but for me the ease of configuration and the clustering capabilities of ElasticSearch make the difference.

An ElasticSearch cluster typically contains multiple nodes, each running in its own JVM. Every node contains a part of the indexed data in so-called shards. Out of the box, ElasticSearch is configured with five shards that are also replicated once, so there are ten shards of data evenly distributed over the nodes of the cluster. Distribution happens completely automatically as nodes connect to and disconnect from the cluster. One can configure the number of shards and replicas to match the specific use case.
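As a sketch, the shard and replica counts can be set for newly created indices in elasticsearch.yml; the values below are the out-of-the-box defaults made explicit:

```yaml
# elasticsearch.yml – the out-of-the-box defaults, shown explicitly
index.number_of_shards: 5
index.number_of_replicas: 1
```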

ElasticSearch supports advanced term-based queries, location queries, range filters, faceting, suggesters (“Did you mean…”), “more like this” queries, fuzzy queries, stemming, scripts, several analyzers, and more. Percolate queries and data rivers are also interesting concepts that are well worth investigating.

Basic installation

Installing ElasticSearch is as simple as downloading the latest distribution from the site and unpacking it on your machine. Then you run it by launching elasticsearch (elasticsearch.bat on Windows) from the bin directory. In the config directory you will find the elasticsearch.yml file that can be used to configure the started nodes, although ElasticSearch will run fine on a local machine without changing anything.

You can start as many nodes as you want; they will automatically run on different ports. The started nodes will immediately search the network for other ElasticSearch nodes with the same cluster name to form a cluster. It is fun to try this out with a few colleagues.

Installing the elasticsearch-head plugin

Before we dive in, we will install one plugin that allows us to easily see what's going on in the ElasticSearch cluster. Installing it is easy. Just run the following on the command line:
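For the 0.90 releases, the install command looks like this (run from the root of your ElasticSearch installation):

```shell
# Install the elasticsearch-head plugin from its GitHub repository
bin/plugin -install mobz/elasticsearch-head
```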

plugin is an executable in the bin directory of your ElasticSearch installation. More information can be found on the elasticsearch-head website.

The elasticsearch-head plugin showing one master node and one client (application) node.

Once you have started ElasticSearch after installing the plugin, you can go to: http://localhost:9200/_plugin/head/. You will see something like this:

You see one master node (called Maestro) and one application node (My beautiful node). Notice that the client node does not contain any data. The cluster health is yellow, because shards 0–4 are not replicated. The other tabs show details on the data in the search index and allow for querying ElasticSearch.

Connecting with a Java client

There are several ways to connect your application to an ElasticSearch cluster. Each method has its advantages and drawbacks, which are explained per option below.

Run a local client node in your application

The first way to connect a Java application to an ElasticSearch cluster is to run a local node inside your application. It basically boils down to starting a local node, which then searches for other nodes and joins the cluster. From this node one can obtain a client. This works in the following way:
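A minimal sketch, using the NodeBuilder API as it exists in the 0.90 releases (the class name LocalNodeExample is mine):

```java
import org.elasticsearch.client.Client;
import org.elasticsearch.node.Node;

import static org.elasticsearch.node.NodeBuilder.nodeBuilder;

public class LocalNodeExample {
    public static void main(String[] args) {
        // Start a node; it discovers other nodes on the network
        // and joins the default cluster ("elasticsearch")
        Node node = nodeBuilder().node();

        // Obtain the client that is used for all further API calls
        Client client = node.client();

        // ... perform index and search operations with the client ...

        client.close();
        node.close();
    }
}
```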

This code launches a new node. It is the simplest possible code that will work on your local development environment (provided that you run ElasticSearch out of the box). From the node one can obtain a client; all API calls are made through this client reference.

When doing more serious development, you probably want to configure ElasticSearch a bit more. For instance, you want to set a cluster name, to make sure that your node does not join the wrong cluster by mistake. A slightly longer version of the code above looks like this:
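A sketch of that configured variant, again against the 0.90 NodeBuilder API ("mycluster" is just an example cluster name):

```java
import org.elasticsearch.client.Client;
import org.elasticsearch.node.Node;

import static org.elasticsearch.node.NodeBuilder.nodeBuilder;

public class ConfiguredNodeExample {
    public static void main(String[] args) {
        Node node = nodeBuilder()
                .clusterName("mycluster") // do not join the default "elasticsearch" cluster
                .client(true)             // not master eligible, holds no data
                .local(true)              // only discover nodes within this JVM process
                .node();
        Client client = node.client();

        // ... perform index and search operations with the client ...

        client.close();
        node.close();
    }
}
```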

It sets the cluster name so it is no longer the default (“elasticsearch”). The node is also configured as a client node. Among other things, this means that the node will not be eligible to become master and will not hold data.

In this example, the node is also set to be local. Therefore, it will only look for other cluster nodes within the same JVM process. This is a useful setting for unit tests.

When your application shuts down, you need to close the client and shut the node down. You can use a shutdown hook for this. For example, if you are using the Spring framework, you could have a bean like this:
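A sketch of such a bean; the class name and cluster name are mine, and the @PreDestroy annotation is the standard Spring hook for cleanup on context shutdown:

```java
import javax.annotation.PreDestroy;

import org.elasticsearch.client.Client;
import org.elasticsearch.node.Node;
import org.springframework.stereotype.Component;

import static org.elasticsearch.node.NodeBuilder.nodeBuilder;

@Component
public class ElasticSearchClientBean {

    private final Node node;
    private final Client client;

    public ElasticSearchClientBean() {
        // Start the local client node and obtain the client once
        node = nodeBuilder().clusterName("mycluster").client(true).node();
        client = node.client();
    }

    public Client getClient() {
        return client;
    }

    // Called by Spring when the application context shuts down
    @PreDestroy
    public void shutdown() {
        client.close();
        node.close();
    }
}
```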

Because the client node is a full-fledged ElasticSearch node, you will need the transitive Lucene dependencies in your project. This can be a disadvantage when dependencies clash.

Use a transport client to connect

The second way to connect to ElasticSearch is to create a TransportClient to perform the API calls. When the application starts, you initialize a transport client like this:
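A sketch using the 0.90 TransportClient API; "mycluster", the host and the class name are examples:

```java
import org.elasticsearch.client.Client;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

public class TransportClientExample {
    public static void main(String[] args) {
        Settings settings = ImmutableSettings.settingsBuilder()
                .put("cluster.name", "mycluster")    // only needed when not using the default
                .put("client.transport.sniff", true) // discover the rest of the cluster
                .build();

        // 9300 is the default transport port of a locally running node
        Client client = new TransportClient(settings)
                .addTransportAddress(new InetSocketTransportAddress("localhost", 9300));

        // ... perform index and search operations with the client ...

        client.close();
    }
}
```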

You will need to set the cluster name in the settings of the TransportClient if you are not using the default. Port 9300 is the default transport port of ElasticSearch.

Using the setting client.transport.sniff allows the TransportClient to sniff the rest of the cluster, so you do not need to provide the full set of hosts/ports at startup. However, it is important to note that after a restart, at least one of the configured transport addresses must be online in order to sniff the rest of the cluster.

In this case too, you need to shut down the client when your application shuts down. You can use the same mechanism as in the local node alternative, but you only close the client (there is no node running in your application):
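As a sketch, the shutdown hook of the Spring bean shown earlier then shrinks to a single call (fragment of a bean holding a TransportClient in a client field):

```java
// Called by Spring when the application context shuts down;
// there is no node to stop, only the client to close.
@PreDestroy
public void shutdown() {
    client.close();
}
```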

Use the ElasticSearch REST API

The third way to connect to ElasticSearch is to use its REST API to perform all operations. We will not discuss this method further, because we want to focus on the Java API.

Comparing the connection methods

Here is a quick overview of the main pros and cons of these three connection methods. The three methods have their own strengths and weaknesses. You must choose the one that best fits your requirements and environment.

Local client

Pros:
* Queries are more efficient because all data shards are queried directly. This is possible because the client is a real node that has information on all other nodes and shards.
* Less configuration is needed to connect to a cluster. You just need the broadcasting to work in your network.
* No hostnames and ports of the cluster need to be configured, which makes deployment and disaster recovery easier.
* The node in your application can hold data. This could be useful in specific cases, although this is not advisable in most cases.

Cons:
* This solution uses more resources in your application than the other solutions.
* There is no security, since this is not yet supported by ElasticSearch.
* The Lucene libraries are pulled as dependencies in your project.

Transport client

Pros:
* This solution uses less resources than the local client method.
* Broadcasting in your network is not needed, since the transport client is provided with an initial list of nodes.
* A transport client launches quicker than a local client.
* The Lucene dependencies can be excluded (for example with Maven).

Cons:
* There is no security, since this is not yet supported by ElasticSearch.
* An initial list of nodes is needed to connect to the cluster, which means that these nodes should be running when redeploying/restarting your application. When using a large cluster of tens or hundreds of nodes, you either need to provide the complete list, or you need to treat some nodes in the cluster as special, since at least one of them needs to be available for transport clients to connect.
* More configuration is needed than with the local client method.

Rest API

Pros:
* Security can be added by using a proxy. No direct connection to the cluster is needed for this method to work.
* It is a lightweight standard solution.
* There are no ElasticSearch specific dependencies needed in your project.

Cons:
* You will need a proxy/load balancer to make your application use the whole cluster.
* There is no access to the Java API, which means building JSON requests yourself. This can become pretty complicated for large queries.

Summary

This ends the introduction to starting ElasticSearch and connecting to it from a Java application. We covered basic installation of ElasticSearch and the three ways to connect to an ElasticSearch cluster.

I hope to continue with more articles concerning indices, mappings and searching. Please feel free to contribute in the comments.