Data Exploration
Now that we’ve gotten a glimpse of the basics, let’s try to work on a more realistic dataset. We have prepared a sample of fictitious JSON documents of customer bank account information. Each document has the following form:
{
"accountNumber": 121931,
"balance": 221.46,
"employer": "Globoil",
"name": {
"firstName": "Michael",
"lastName": "Lee"
},
"age": 24,
"gender": "male",
"contact": {
"address": "630 Victor Road",
"city": "Leyner",
"state": "Indiana",
"postcode": "61952",
"phone": "+1 (924) 594-3216",
"email": "michael.lee@globoil.co.uk"
},
"checkin": {
"_point": {
"_longitude": -95.63079,
"_latitude": 31.76212
}
},
"favoriteFruit": "lemon",
"eyeColor": "blue",
"style": {
"clothing": {
"pants": "khakis",
"shirt": "t-shirt"
},
"hairstyle": "slick back"
},
"personality": "A lot can be assumed..."
}
For the curious, this data was generated using Faker, using the generator.py script, so please ignore the actual values and semantics of the data as these are all randomly generated.
Loading the Sample Dataset
After downloading the sample dataset, let’s load it into our cluster as follows:
RESTORE /bank/
Content-Type: application/x-ndjson
@accounts.ndjson
When using curl, make sure to use --data-binary
, not -d
or --data
More information about RESTORE
method can be found in the
Backups section.
After having loaded the dataset, you can then use the Info API to get information about the new index:
INFO /bank/
Response should be something like:
{
"endpoint": "bank",
"doc_count": 1000,
"last_id": 1000,
"doc_del": 0,
"av_length": 200.023,
"doc_len_lower": 176,
"doc_len_upper": 238,
"has_positions": true,
"shards": [
"bank/.__1",
"bank/.__2",
"bank/.__3",
"bank/.__4",
"bank/.__5"
]
}
Which means that we just successfully bulk indexed 1000 documents into the bank index.
The Search API
Now let’s start with some simple searches. There are two basic ways to run searches: one is by sending search parameters through the REST request URI and the other by sending them through the REST request body. The request body method allows you to be more expressive and also to define your searches in a more readable JSON format. We’ll try one example of the request URI method but for the remainder of this guide, we will exclusively be using the request body method.
The REST API for search is accessible using the SEARCH
method. This example
returns all documents in the bank index:
SEARCH /bank/?sort=accountNumber&pretty
Let’s first dissect the search call. We are searching in the /bank/
index,
and the q=*
query parameter instructs Xapiand to match all documents in the
index (the default if not specified). The sort=accountNumber
parameter
indicates to sort the results using the accountNumber
field of each document
in an ascending order. The pretty
parameter just tells Xapiand to return
pretty-printed JSON results, a similar effect can be achieved by using the
Accept
header as in: Accept: application/json; indent: 2
.
And the response (partially shown):
{
"total": 1000,
"count": 10,
"hits": [
{
"accountNumber": 100123,
"balance": 10073.05,
"employer": "Affluex",
"name": {
"firstName": "Margaret",
"lastName": "Anderson"
},
"age": 24,
"gender": "female",
"contact": {
"address": "756 Strauss Street",
"city": "Fairview",
"state": "Virgin Islands",
"postcode": "06099",
"phone": "+1 (919) 400-3616",
"email": "margaret.anderson@affluex.net"
},
"checkin": {
"_point": {
"_longitude": -122.39168,
"_latitude": 40.58654
}
},
"favoriteFruit": "lemon",
"eyeColor": "brown",
"style": {
"clothing": {
"pants": "mini-skirt",
"shirt": "jersey",
"footwear": "sneakers"
}
},
"personality": "There's a lot to say about Margaret...",
"_id": 233,
"_version": 1,
"#docid": 233,
"#shard": 2,
"#rank": 0,
"#weight": 0.0,
"#percent": 100
}, ...
],
"took": "39.889ms"
}
As for the response, we see the following parts:
total
- Number of estimated documents that match the query.count
- Total number of returned hits.hits
- Search results.took
- Time it took to execute the query.
Introducing the Query Language
Xapiand provides a JSON-style domain-specific language that you can use to execute queries. This is referred to as the Query DSL. The query language is quite comprehensive and can be intimidating at first glance but the best way to actually learn it is to start with a few basic examples.
The Query DSL method for searching is much more efficient.
Going back to our last example, we executed a query to retrieve all documents
using q=*
. Here is the same exact search using the alternative request body
method:
SEARCH /bank/
{
"_query": "*",
"_sort": "accountNumber"
}
The difference here is that instead of passing q=*
in the URI, we POST
a
JSON-style query request body to the SEARCH
method.
Dissecting the above, the query part tells us what our query definition is and
the _query
part is simply the type of query that we want to run. The
*
query is simply a search for all documents in the specified index (the
default).
In addition to the query parameter, we also can pass other parameters to
influence the search results. In the example in the section above we passed in
sort, here we pass in limit
:
SEARCH /bank/
{
"_query": "*",
"_limit": 1
}
Note that if limit
is not specified, it defaults to 10.
This example does a match all and returns documents 10 through 19:
SEARCH /bank/
{
"_query": "*",
"_offset": 10,
"_limit": 10
}
The offset
parameter (0
-based) specifies which document index to start from
and the limit
parameter specifies how many documents to return starting at the
given offset
. This feature is useful when implementing paging of search
results. Note that if offset
is not specified, it defaults to 0.
This example does a match all and sorts the results by account balance in
descending order and returns the top 10 (default for limit
) documents.
SEARCH /bank/
{
"_query": "*",
"_sort": {
"balance": {
"_order": "desc"
}
}
}
Executing Searches
Now that we have seen a few of the basic search parameters, let’s dig in some
more into the Query DSL. Let’s first take a look at the returned document
fields. By default, the full JSON document is selected and returned as part of
all searches. If we don’t want the entire document returned, we have the ability
to request only a few fields from within it to be returned by selecting them
by using _selector
field during the search.
There are two types of selectors (which can be mixed):
- Field Selector
- Drill Selector
Field Selector
It takes the form of "{field1,field2}"
, and it selects only field1
and
field2
to be returned.
This example shows how to return two fields using the Field Selector,
accountNumber
and balance
, from the search:
SEARCH /bank/
{
"_query": "*",
"_selector": "{accountNumber,balance}"
}
Drill Selector
It takes the form of "field.sub_field.sub_sub_field"
, and it brings the
innermost field to the top level.
This example shows how to return a list of emails using the Drill Selector from the search:
SEARCH /bank/
{
"_query": "*",
"_selector": "contact.email"
}
Executing Filters
SEARCH /bank/
{
"_limit": 1000,
"_query": {
"balance": {
"_in": {
"_range": {
"_from": 2000,
"_to": 3000
}
}
}
}
}
Executing Aggregations
Aggregations provide the ability to group and extract statistics from your data.
The easiest way to think about aggregations is by roughly equating it to the
SQL GROUP BY
and the SQL aggregate functions. In Xapiand, you have the ability
to execute searches returning hits and at the same time return aggregated
results separate from the hits all in one response. This is very powerful and
efficient in the sense that you can run queries and multiple aggregations and
get the results back of both (or either) operations in one shot avoiding network
roundtrips using a concise and simplified API.
To start with, this example groups all the accounts by state, and then returns the count of accounts by state:
SEARCH /bank/
{
"_query": "*",
"_limit": 0,
"_check_at_least": 1000,
"_aggregations": {
"group_by_state": {
"_values": {
"_field": "contact.state",
"_keyed": true
}
}
}
}
In SQL, the above aggregation is similar in concept to:
SELECT state, COUNT(*) FROM bank GROUP BY state;
And the response (partially shown):
{
"total": 1000,
"count": 0,
"aggregations": {
"_doc_count": 1000,
"group_by_state": {
"Texas": {
"_doc_count": 15
},
"Pennsylvania": {
"_doc_count": 20
},
"Oklahoma": {
"_doc_count": 16
},
...
"Nebraska": {
"_doc_count": 17
}
}
},
"hits": [],
"took": "14.83ms"
}
There are many other aggregations capabilities that we won’t go into detail here. The Aggregations Reference Guide is a great starting point if you want to do further experimentation.