3.3.8 Search engine services
cmsWorks comes with an integrated search engine service. It powers on-page searches of CMS content on live pages as well as the document search within the editor's desktop.
Why dedicated search services
cmsWorks stores all content, such as text, pictures and link structures, in a database. However, running search queries directly against the database can degrade performance for the rest of the system. Therefore, cmsWorks additionally stores the searchable part of the content in specialized index files. These index files are independent of the database, and searches against them respond faster than equivalent queries against a standard SQL database.
The service Search
This service encapsulates the operations on search indexes and provides access to them via HTTP requests. A search index consists of one or more files in several folders of the server's file system. Deleting the folder that contains an index erases that index. The Search service delivers the editor's desktop search results from one index containing all edited documents, and the online search results from a separate index containing only published documents.
The service stores entries identified by a unique ID (the document ID). For each entry it stores the document type, the date, the content that is searchable and the content that is returned as the result.
To create, update and delete these entries, the service is derived from the AbstractHttpServer (a JSP engine) and includes the corresponding JSPs for these CRUD functions. These JSPs are the interface that the SearchCollector services call.
Property file and property key paths are:
Service properties file
<cmsWorks-installdir>/run/properties/search.properties
Service property key paths
/app/cmsworks/service/search/lucene3/Search/
Entry | Value | Description |
Port | 8091 | The port that the Search service opens on the server for HTTP requests |
Htdocs | ../htdocs/search/cms | The file system path, relative to the installation directory, where the JSP files of the Search service ("CRUD") reside. This is the standard value. The JSP files in that directory provide the interface for the SearchCollector services and must not be altered or deleted, otherwise full functionality is not guaranteed. |
DefaultRootIndexFolder | ../index/cms | The file system path under which the search indexes, with their folders and files, are stored. |
IndexFields | searchId,yes,un_tokenized,true; | The index fields defined in this parameter configure the structure of the search index. The value reflects the common use of cmsWorks document searches (ID, type, date, searchable content, returned content). The fields of one search entry are searchId, searchType, searchDate, searchable and searchContent; the full value is shown in the configuration example below. The value of this parameter must not be changed. |
# ###################################################################################
#
# Properties for the core search engine service
#
# ###################################################################################
# hook service /app/cmsworks/service/search/lucene3,SearchLucene
# service create /app/cmsworks/service/search/lucene3,SearchLucene,Search
/app/cmsworks/service/search/lucene3/Search/StartTimeout=10000
/app/cmsworks/service/search/lucene3/Search/StopTimeout=10000
/app/cmsworks/service/search/lucene3/Search/AccessTimeout=1000
/app/cmsworks/service/search/lucene3/Search/StopInformExtern=1
/app/cmsworks/service/search/lucene3/Search/LogLevel=fatal error info
/app/cmsworks/service/search/lucene3/Search/Port=8091
/app/cmsworks/service/search/lucene3/Search/Htdocs=../htdocs/search/cms
/app/cmsworks/service/search/lucene3/Search/DefaultRootIndexFolder=../index/cms
/app/cmsworks/service/search/lucene3/Search/IndexFields=searchId,yes,un_tokenized,true;searchType,yes,un_tokenized,false;searchDate,yes,un_tokenized,false;searchable,no,tokenized,false;searchContent,yes,no,false
Configuration example of the Search service for internal search
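The IndexFields value above is a list of field definitions separated by ";", each with four comma-separated columns. As a minimal sketch, the value could be split up like this for inspection. Only the first column (the field name) is described in this document; the attribute names stored, index_mode and flag are assumptions made for illustration, and the function is not part of cmsWorks.

```python
from typing import NamedTuple


class IndexField(NamedTuple):
    """One field definition taken from the IndexFields value."""
    name: str        # field name, e.g. "searchId"
    stored: str      # ASSUMPTION: second column, "yes" or "no"
    index_mode: str  # ASSUMPTION: third column, e.g. "un_tokenized"
    flag: str        # ASSUMPTION: fourth column, "true" or "false"


def parse_index_fields(value: str) -> list[IndexField]:
    """Split an IndexFields value into its field definitions."""
    fields = []
    for chunk in value.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue  # tolerate a trailing ';'
        parts = [part.strip() for part in chunk.split(",")]
        if len(parts) != 4:
            raise ValueError(f"expected 4 columns, got: {chunk!r}")
        fields.append(IndexField(*parts))
    return fields


if __name__ == "__main__":
    value = (
        "searchId,yes,un_tokenized,true;"
        "searchType,yes,un_tokenized,false;"
        "searchDate,yes,un_tokenized,false;"
        "searchable,no,tokenized,false;"
        "searchContent,yes,no,false"
    )
    for field in parse_index_fields(value):
        print(field)
```

Such a helper is only meant for reading and checking the configuration; the value itself must stay unchanged.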
The service SearchCollector
The SearchCollector is a service that updates the search index whenever documents in the project change.
The service SearchCollectorWebUI is configured to send all document content to the search index indexintern, so that the search features of the editor's desktop can find every document. The service SearchCollectorOnline is configured to update the search index indexonline only when documents are published. In addition, that index does not contain all content but only the relevant fields of the relevant document types.
The parameters of the SearchCollector services are configured in the properties file <cmsWorks-installdir>/run/properties/search.properties. The entries in the following table are located under the property key path /app/cmsworks/service/search/collect/SearchCollectorWebUI/ or /app/cmsworks/service/search/collect/SearchCollectorOnline/.
Entry | Value | Description |
RemoteHost | 127.0.0.1 | The host name or IP address of the server on which the Search service runs |
RemotePort | 8091 | The port of the Search service |
CMSService | CMSCore | The name of the CMS service from which document content is fetched and sent to the Search service |
EventProviderService | CMSCore | The name of a service that implements the interface CMSEventProviderable and sends events for changed documents |
PublishedOnly | false | If true, this service sends updates to the Search service only when documents are published or deleted. Otherwise, any document change event triggers an update of the search index. |
ResourcePropertyConfig | all | Defines the content that is sent to the search index. The rules for this value are described below this table. |
ResourceAttributeConfig | noname; noreferences; notypes | If nothing is declared, a document's name, its outgoing references to other documents and the special type properties of the document are sent to the search index. Each of these can be suppressed with the corresponding value: noname (document name), noreferences (outgoing references), notypes (special type properties). The optional values are separated by ";". |
NonSearchablePaths | /config;/testpages | Documents stored under one of these paths are not sent to the search index. The paths are separated by ";". |
LuceneSearchIndexName | indexintern | The name of the search index in the Search service. It is also the name under which the Search service stores the search index in the file system. This name has to be used when searching for documents in that search index. |
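To illustrate how PublishedOnly and NonSearchablePaths work together, the following sketch decides whether a document change would be forwarded to the search index. It is a simplified model of the rules in the table, not the actual cmsWorks implementation. The event names "changed", "published" and "deleted", the function names and the example document paths are assumptions made for this example.

```python
def parse_path_list(value: str) -> list[str]:
    """Split a ';'-separated NonSearchablePaths value into clean paths."""
    return [p.strip().rstrip("/") for p in value.split(";") if p.strip()]


def is_under(path: str, root: str) -> bool:
    """True if path equals root or lies below it (whole path segments only)."""
    return path == root or path.startswith(root + "/")


def should_forward(doc_path: str, event: str, published_only: bool,
                   non_searchable_paths: list[str]) -> bool:
    """Decide whether a document event leads to a search index update.

    ASSUMPTION: events are modelled as plain strings; the real event
    types of the CMSEventProviderable interface are not shown here.
    """
    # Documents below an excluded path are never sent to the index.
    if any(is_under(doc_path, root) for root in non_searchable_paths):
        return False
    # With PublishedOnly=true only publish and delete events count.
    if published_only:
        return event in ("published", "deleted")
    # Otherwise every document change triggers an update.
    return True


if __name__ == "__main__":
    excluded = parse_path_list("/config;/testpages")
    print(should_forward("/news/article1", "changed", False, excluded))
    print(should_forward("/config/site", "published", True, excluded))
```

Matching on whole path segments means that a hypothetical folder such as /configuration is not excluded by the entry /config.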
ResourcePropertyConfig
This service parameter contains the description of the content that will be sent to the search service.
The value all is mostly used in the configuration of SearchCollectorWebUI to send all available content to the search index. This way, the search features of the editor's desktop work properly and find any content stored in your project.
If a search index is built for an online web page, however, the content that can be found should be restricted. To do so, declare each property of each document type to be indexed like this:
4:date(optional),category(searchableonly),keywords(searchableonly),headline(searchableonly),text(searchableonly);
8:date(optional),category(searchableonly),keywords(searchableonly),description(searchableonly optional)
This declaration defines
Value | Description |
4: | Document type with ID 4 (the article document type) |
date(optional) | The property date of the document type article may be filled or not. If it is not filled, the document's last-changed date is used instead. In this case the date property is the only sorting criterion for the articles in the search results. If the date field is not declared at all, the document's last-changed date is always used. |
category(searchableonly) | The category of the article is a document reference to a category document (another document type). The value category[documentId]a will be written into the search index. Because this property is not declared optional, the article is not sent to the search index if the property is not filled. The property content is not stored in the search index and not returned in the search result. |
keywords(searchableonly) | The keywords property of the article must be filled; otherwise the document is not sent to the search index. The property content is not stored in the search index and not returned in the search result. |
headline(searchableonly) | The headline property of the article must be filled; otherwise the document is not sent to the search index. The property content is not stored in the search index and not returned in the search result. |
text(searchableonly) | The text property of the article must be filled; otherwise the document is not sent to the search index. The property content is not stored in the search index and not returned in the search result. |
When declaring more document types use ; as separator. When listing the properties of a document type use , as separator.
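The separator rules above can be made concrete with a small parser. The sketch below turns a ResourcePropertyConfig value into a mapping from document type ID to property names and their control values. The function name and the returned data structure are choices made for this example only, and it assumes that several control values inside one pair of parentheses are separated by spaces, as in description(searchableonly optional).

```python
import re
from typing import Optional

# A property declaration: a name, optionally followed by "(flags)".
_PROPERTY = re.compile(r"^(?P<name>[^()\s]+)(?:\((?P<flags>[^()]*)\))?$")


def parse_resource_property_config(
        value: str) -> Optional[dict[int, dict[str, frozenset[str]]]]:
    """Parse a ResourcePropertyConfig value.

    Returns None for the value "all" (no restriction), otherwise a dict
    {document type ID: {property name: set of control values}}.
    """
    value = value.strip()
    if value == "all":
        return None
    config: dict[int, dict[str, frozenset[str]]] = {}
    for type_decl in value.split(";"):      # ';' separates document types
        type_decl = type_decl.strip()
        if not type_decl:
            continue
        type_id, _, props = type_decl.partition(":")
        properties: dict[str, frozenset[str]] = {}
        for prop in props.split(","):       # ',' separates properties
            prop = prop.strip()
            if not prop:
                continue
            match = _PROPERTY.match(prop)
            if match is None:
                raise ValueError(f"cannot parse property: {prop!r}")
            flags = (match.group("flags") or "").split()
            properties[match.group("name")] = frozenset(flags)
        config[int(type_id)] = properties
    return config


if __name__ == "__main__":
    example = ("4:date(optional),category(searchableonly),"
               "keywords(searchableonly),headline(searchableonly),"
               "text(searchableonly);"
               "8:date(optional),category(searchableonly),"
               "keywords(searchableonly),"
               "description(searchableonly optional)")
    for type_id, props in parse_resource_property_config(example).items():
        print(type_id, {name: sorted(flags) for name, flags in props.items()})
```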
The content control values (the names in parentheses after the property names) are:
Value | Description |
searchable | The field value is sent to the search index. The field value can be found and is returned in the search result. |
searchableonly | The field value is sent to the search index. The field value can be found but is not returned in the search result. Furthermore the field value is not stored in the search index so less disk space is required. |
optional | The document is sent to the search index whether or not this field is filled. |
required | The field content is not sent to the search index, but the document is only sent to the search index if the field is filled. |
(no value) | Using no value means that the content is not searchable but is returned in the search result. |
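The interplay of the content control values can be modelled in a few lines. The following sketch applies the rules from the table to one document and returns either the data that would go into the search index or None if the document would be skipped. It is a simplified reading of the rules, not the cmsWorks implementation: the function name, the shape of the result, and the property names teaser, author and image are assumptions for this example, and the special fallback of the date property to the last-changed date is left out.

```python
from typing import Mapping, Optional


def build_index_entry(values: Mapping[str, str],
                      properties: Mapping[str, set]) -> Optional[dict]:
    """Apply the content control values to one document.

    values:     property name -> field content of the document
    properties: property name -> control values declared for it
    Returns None if a non-optional property is empty, which means the
    document would not be sent to the search index.
    """
    searchable_parts = []
    returned = {}
    for name, flags in properties.items():
        value = (values.get(name) or "").strip()
        if not value:
            if "optional" in flags:
                continue        # an empty optional field is acceptable
            return None         # an empty mandatory field blocks the document
        if "required" in flags:
            continue            # only checked for presence, content not sent
        if "searchable" in flags:
            searchable_parts.append(value)   # can be found ...
            returned[name] = value           # ... and is returned
        elif "searchableonly" in flags:
            searchable_parts.append(value)   # can be found, not returned
        else:
            returned[name] = value           # no value: returned only
    return {"searchable": " ".join(searchable_parts), "returned": returned}


if __name__ == "__main__":
    declared = {
        "date": {"optional"},
        "headline": {"searchableonly"},
        "teaser": {"searchable"},   # hypothetical property
        "author": {"required"},     # hypothetical property
        "image": set(),             # hypothetical property, no control value
    }
    document = {"headline": "Hello", "teaser": "World",
                "author": "me", "image": "a.png"}
    print(build_index_entry(document, declared))
```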
Here is a full example of the SearchCollector configuration:
# ###################################################################################
#
# Properties for a service listening to cms events to collect data from cms documents
# and send them to a search service
#
# The SearchCollectorWebUI is the most common searchcollector service used in a cms
# project. This service collects nearly all data to provide a search within the
# editors web access.
#
# ###################################################################################
# hook service /app/cmsworks/service/search/collect,SearchCollector
# service create /app/cmsworks/service/search/collect,SearchCollector,SearchCollectorWebUI
/app/cmsworks/service/search/collect/SearchCollectorWebUI/StartTimeout=10000
/app/cmsworks/service/search/collect/SearchCollectorWebUI/StopTimeout=10000
/app/cmsworks/service/search/collect/SearchCollectorWebUI/AccessTimeout=1000
/app/cmsworks/service/search/collect/SearchCollectorWebUI/StopInformExtern=1
/app/cmsworks/service/search/collect/SearchCollectorWebUI/LogLevel=fatal error info
/app/cmsworks/service/search/collect/SearchCollectorWebUI/RemoteHost=127.0.0.1
/app/cmsworks/service/search/collect/SearchCollectorWebUI/RemotePort=8091
/app/cmsworks/service/search/collect/SearchCollectorWebUI/CMSService=CMSCore
/app/cmsworks/service/search/collect/SearchCollectorWebUI/EventProviderService=CMSCore
/app/cmsworks/service/search/collect/SearchCollectorWebUI/PublishedOnly=false
/app/cmsworks/service/search/collect/SearchCollectorWebUI/ResourcePropertyConfig=all
/app/cmsworks/service/search/collect/SearchCollectorWebUI/ResourceAttributeConfig=
/app/cmsworks/service/search/collect/SearchCollectorWebUI/NonSearchablePaths=
/app/cmsworks/service/search/collect/SearchCollectorWebUI/LuceneSearchIndexName=indexintern
# ###################################################################################
#
# Properties for a service listening to cms events to collect data from cms documents
# and send them to a search service
#
# ###################################################################################
# service create /app/cmsworks/service/search/collect,SearchCollector,SearchCollectorOnline
/app/cmsworks/service/search/collect/SearchCollectorOnline/StartTimeout=10000
/app/cmsworks/service/search/collect/SearchCollectorOnline/StopTimeout=10000
/app/cmsworks/service/search/collect/SearchCollectorOnline/AccessTimeout=1000
/app/cmsworks/service/search/collect/SearchCollectorOnline/StopInformExtern=1
/app/cmsworks/service/search/collect/SearchCollectorOnline/LogLevel=fatal error info
/app/cmsworks/service/search/collect/SearchCollectorOnline/RemoteHost=127.0.0.1
/app/cmsworks/service/search/collect/SearchCollectorOnline/RemotePort=8091
/app/cmsworks/service/search/collect/SearchCollectorOnline/CMSService=CMSCore
/app/cmsworks/service/search/collect/SearchCollectorOnline/EventProviderService=CMSCore
/app/cmsworks/service/search/collect/SearchCollectorOnline/PublishedOnly=true
/app/cmsworks/service/search/collect/SearchCollectorOnline/ResourcePropertyConfig=4:date(optional),category(searchableonly),keywords(searchableonly),headline(searchableonly),text(searchableonly);8:date(optional),category(searchableonly),keywords(searchableonly),description(searchableonly optional)
/app/cmsworks/service/search/collect/SearchCollectorOnline/ResourceAttributeConfig=
/app/cmsworks/service/search/collect/SearchCollectorOnline/NonSearchablePaths=/config;/testpages
/app/cmsworks/service/search/collect/SearchCollectorOnline/LuceneSearchIndexName=indexonline
Configuration example of the services SearchCollectorWebUI and SearchCollectorOnline
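Because every key in search.properties starts with the property key path of its service, the settings of a single service can be extracted by filtering on that prefix. The sketch below shows one way to do this when checking a configuration. It is an illustration only: the function name is an assumption, and it handles just the simple key=value lines and # comments used in the listings above, not the full Java properties syntax.

```python
def load_service_properties(text: str, service_path: str) -> dict[str, str]:
    """Collect the entries of one service from properties file content.

    text:         content of a properties file such as search.properties
    service_path: property key path of the service, with or without
                  a trailing '/'
    """
    prefix = service_path.rstrip("/") + "/"
    result: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue                      # skip blank lines and comments
        key, separator, value = line.partition("=")
        if not separator or not key.startswith(prefix):
            continue                      # not a key of this service
        entry = key[len(prefix):]
        if "/" in entry:
            continue                      # belongs to a nested key path
        result[entry] = value.strip()
    return result


if __name__ == "__main__":
    sample = """\
# shortened sample based on the listing above
/app/cmsworks/service/search/collect/SearchCollectorOnline/PublishedOnly=true
/app/cmsworks/service/search/collect/SearchCollectorOnline/LuceneSearchIndexName=indexonline
/app/cmsworks/service/search/collect/SearchCollectorWebUI/PublishedOnly=false
"""
    path = "/app/cmsworks/service/search/collect/SearchCollectorOnline/"
    print(load_service_properties(sample, path))
```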
Use cases
The internal search provides a search window in the editor's desktop. It finds all documents, ordered by the date of their latest change. It is also needed to find the documents that reference a selected document, answering the question "Who references me?".
The online search should only find content documents such as articles or slideshows when a user searches on the web page. It can also be used to show the newest articles of a category.
A special online keyword search index may make only keywords searchable, so that an automatic search for related articles can be built.
Re-indexing the search index
The search services of cmsWorks depend on an internal index that is updated whenever changes are made. If the index becomes invalid or is lost (e.g. after an installation without the index files, or because index files were deleted or corrupted), it can be rebuilt from the existing data.
For this purpose, cmsWorks provides two commands via the telnet server: searchcollector and directsearchcollector. The searchcollector command is the more flexible one and can handle remote search servers. The directsearchcollector command is by far the faster collector and should be used to build an index from scratch, but it only works if the search services are local services of this cmsWorks instance.
For example, directsearchcollector can be used to completely reindex all search indexes. The simple command
directsearchcollector reindex
will do the reindexing. If you need more fine-grained behavior, please refer to the help pages of the commands (by typing "help directsearchcollector" or "help searchcollector" into the telnet server shell).
