Text Datatype
A field to index full-text values, such as the body of an email or the description of a product. These fields are analyzed: they are passed through an analyzer that converts the string into a list of individual terms before indexing. The analysis process allows Xapiand to search for individual words within each full-text field. Text fields are not used for sorting and are seldom used for aggregations (the significant text aggregation is a notable exception).
If you need to index structured content such as email addresses, hostnames, status codes, or tags, you should probably use the Keyword Datatype instead.
By default, every field in a document that has a text value is interpreted as the Text Datatype:
UPDATE /bank/1
{
  "resume": {
    "_type": "text",
    "_language": "en",
    "_value": "Four years experience in early childhood development with a diverse background in the care of special needs children and adults. OBJECTIVE: To begin my post-graduate career in an insignificant, entry-level position that will provide me with income and a sense of self-worth. EDUCATION: Small College you Haven't Heard Of, BS in Early Childhood Development. EXPERIENCE: None really, but please let me articulate the many reasons why I think my minimum-wage work history is extremely relevant and has adequately prepared me for this job."
  }
}
Stemmers
A common form of normalisation is stemming. This process converts different forms of a word to a single form: for example, a plural (“birds”) and a singular (“bird”) are both converted to the same term, “bird”.
Note that the output of a stemmer is not necessarily a valid word; what is important is that words with closely related meaning are converted to the same form, allowing a search to find them. For example, both the word “happy” and the word “happiness” are converted to the form “happi”, so if a document contained “happiness”, a search for “happy” would find that document.
The rules applied by a stemmer depend on the language of the text; Xapian includes stemmers for more than a dozen languages (and for some languages there is a choice of stemmers), built using the Snowball language. We’d like to add stemmers for more languages; see the Snowball site for information on how to contribute.
Caution
By default, Xapiand doesn’t apply any stemming to text fields. This feature is only enabled when the parameter _language (or otherwise _stem_language) is specified in the Schema.
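As a minimal sketch of the second option in the caution above, the request below specifies only _stem_language. The field name "summary" and its value are illustrative, and it is an assumption that "en" is accepted for _stem_language in the same way as for _language.

```json
UPDATE /bank/1
{
  "summary": {
    "_type": "text",
    "_stem_language": "en",
    "_value": "Experienced in the care of special needs children and adults."
  }
}
```

Going by the Parameters table, _language is not set here, so it should keep its default of "none" and only stemming is configured.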
Stem Strategy
The default _stem_strategy is "stem_some", but you can choose another one.
The available stemming strategies are:
| Strategy | Description |
|---|---|
| stem_none, none | Don’t perform any stemming. |
| stem_some, some | Stem all terms except for those which start with a capital letter, or are followed by certain characters (currently: (, /, \, @, <, >, =, *, [, {, "), or are used with operators which need positional information. Stemmed terms are prefixed with ‘Z’. (This is the default mode.) |
| stem_all, all | Stem all terms. No ‘Z’ prefix is added. |
| stem_all_z, all_z | Stem all terms. A ‘Z’ prefix is added. |
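A non-default stemming strategy can be requested alongside the language. The following is a sketch only: the field name "cover_letter" and its value are illustrative, and it assumes _stem_strategy can be given in the same object as _language, as the Parameters table suggests.

```json
UPDATE /bank/1
{
  "cover_letter": {
    "_type": "text",
    "_language": "en",
    "_stem_strategy": "stem_all",
    "_value": "Looking for entry-level positions in early childhood development."
  }
}
```

With "stem_all", every term is stemmed and no ‘Z’ prefix is added, as described in the table above.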
Stop Strategy
The default _stop_strategy is "stop_stemmed", so stemmed forms of stop words aren’t indexed, but unstemmed forms still are, so that searches for phrases including stop words still work.
The available stop strategies are:
| Strategy | Description |
|---|---|
| stop_none, none | Don’t use the stopper. |
| stop_all, all | If a word is identified as a stop word, skip it completely. |
| stop_stemmed, stemmed | If a word is identified as a stop word, index its unstemmed form but skip the stem. Unstemmed forms are indexed with positional information by default, so this allows searches for phrases containing stop words to be supported. (This is the default mode.) |
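The stop strategy is selected in the same way. In this sketch the field name "notes" and its value are illustrative, and it is an assumption that _stop_strategy can be combined with _language in a single field definition.

```json
UPDATE /bank/1
{
  "notes": {
    "_type": "text",
    "_language": "en",
    "_stop_strategy": "stop_all",
    "_value": "To be or not to be in an entry-level position."
  }
}
```

With "stop_all", stop words are skipped completely. The descriptions above imply a trade-off: phrase searches that rely on those words may no longer match.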
Parameters
The following parameters are accepted by Text fields:
| Parameter | Description |
|---|---|
| _language | The language to use for stemming and stop words. (The default is "none".) |
| _stop_strategy | The strategy the stopper uses. (The default is "stop_stemmed".) |
| _stem_language | The language the stemming algorithm uses. (The default is the value of _language.) |
| _stem_strategy | The strategy the stemming algorithm uses. (The default is "stem_some".) |
| _value | The value for the field. (Only used at index time.) |
| _index | The mode the field is indexed in: "none", "field_terms", "field_values", "field_all", "field", "global_terms", "global_values", "global_all", "global", "terms", "values", or "all". (The default is "field_all".) |
| _slot | The slot number. (Calculated by default.) |
| _prefix | The prefix the term is indexed with. (Calculated by default.) |
| _weight | The weight the term is indexed with. |