4 Using the cmsWorks document search

cmsWorks comes with a built-in big data search engine based on Lucene. Whenever a cmsWorks document changes, its contents are updated in the index of the search service. The search service can be queried with search keys and returns a list of document IDs as the result.

Usually there is more than one search index configured in cmsWorks.

The index indexintern contains all information about all documents in all states and is used by the cmsWorks desktop for all document search features.

The index indexonline can be configured so that it contains only published versions of documents. Every document type must be named, and every property that contains content relevant to the online search must be declared. If the type of an updated document is not declared, indexonline is not updated. The configuration can also declare validity rules: for example, if an article title is not filled in, the article is not valid and therefore cannot be found in indexonline. All details of the configuration are described in the administrator's guide.
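
The rules above can be pictured with a small plain-Java model. This is only a sketch under assumptions: the class OnlineIndexRulesSketch, its methods, and the type and property names are hypothetical and do not reflect the real cmsWorks configuration format, which is described in the administrator's guide.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Plain-Java model of the indexonline rules. All names are hypothetical;
// this is not the cmsWorks configuration or API.
class OnlineIndexRulesSketch {
    // document type -> properties declared as relevant for the online search
    private final Map<String, Set<String>> declared = new HashMap<>();
    // document type -> properties that must be filled for a valid document
    private final Map<String, Set<String>> required = new HashMap<>();

    void declare(String type, String... properties) {
        declared.put(type, new HashSet<>(Arrays.asList(properties)));
    }

    void require(String type, String... properties) {
        required.put(type, new HashSet<>(Arrays.asList(properties)));
    }

    // Decides whether a document version would appear in the online index.
    boolean isIndexed(String type, boolean published, Map<String, String> values) {
        if (!published || !declared.containsKey(type)) {
            return false; // unpublished version or undeclared type
        }
        Set<String> mustBeFilled = required.get(type);
        if (mustBeFilled != null) {
            for (String property : mustBeFilled) {
                String value = values.get(property);
                if (value == null || value.isEmpty()) {
                    return false; // e.g. an article without a title is not valid
                }
            }
        }
        return true;
    }

    // Helper that builds a property map from alternating keys and values.
    static Map<String, String> values(String... keyValuePairs) {
        Map<String, String> map = new HashMap<>();
        for (int i = 0; i + 1 < keyValuePairs.length; i += 2) {
            map.put(keyValuePairs[i], keyValuePairs[i + 1]);
        }
        return map;
    }

    public static void main(String[] args) {
        OnlineIndexRulesSketch rules = new OnlineIndexRulesSketch();
        rules.declare("article", "title", "text");
        rules.require("article", "title");
        System.out.println(rules.isIndexed("article", true, values("title", "Hello"))); // true
        System.out.println(rules.isIndexed("article", true, values("text", "body")));   // false
        System.out.println(rules.isIndexed("image", true, values("title", "Logo")));    // false
    }
}
```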

The search service is accessible via HTTP requests. The Generator service has a configured property that lists remote hosts. A JSP of the Generator service that produces a website can therefore fetch the host information of the search service and send it a query for documents.

The search example

In this example a component JSP is created that uses the search service. The component searches for the latest news and produces a list of links, using the article headlines as link text.

<%@page import="
                app.cmsworks.service.generator.Generator,
                app.cmsworks.cms.document.DocumentModel,
                app.cmsworks.cms.document.ErrorView,
                app.cmsworks.cms.document.Link,
                app.cmsworks.cms.document.HTMLErrorView,
                app.cmsworks.util.uilink.UILink,
                
                app.cmsworks.util.search.SearchResultIdIterator,
                app.cmsworks.util.search.term.SearchUtil,
                app.cmsworks.util.search.term.SearchTerm
               "
        session="false"
        contentType="text/html;charset=UTF-8"
%><%@include file="includes/documentmodel.jsf"
%><%@include file="includes/articletype.jsf"
%><%
DocumentModel dmCmp = null;
ErrorView errors = new HTMLErrorView();
UILink uiLink = new UILink(request);
// fetch the sensible data
try {
  dmCmp = new DocumentModel(request, new Types());
  Generator generator = (Generator) dmCmp.getMyService();
  errors.setPreview(dmCmp);
  
  // create a search utility object
  SearchUtil searchUtil = new SearchUtil();
  // set the search host
  searchUtil.setHost(generator.getHostData("search"));
  // set the name of the index to search in
  searchUtil.setIndex(SearchUtil.INDEX_ONLINE);
  // one search call will produce a maximum of 50 results
  searchUtil.setIteratorPageSize(50);
  
  // only find documents of type article
  searchUtil.filterDocumentType(Types.RT_ARTICLE);
  
  // add the search keywords
  SearchTerm searchTerm = searchUtil.createSearchTermAnd();
  // filter all articles by the article type news
  searchUtil.addIDs(searchTerm, new int[]{ArticleType.NEWS}, Types.PT_ARTICLE_TYPE);
  
  StringBuffer sb = new StringBuffer();
  // execute the search
  SearchResultIdIterator srii = searchUtil.search(searchTerm);
  int cnt = 0;
  // only produce 10 valid news links
  while(srii.hasNextId() && cnt < 10) {
    int id = srii.getNextId();
    DocumentModel dmNews = new DocumentModel(id, dmCmp);
    if (dmNews.isType(Types.RT_ARTICLE)) {
      int articleType = dmNews.getInt(Types.PT_ARTICLE_TYPE);
      String headline = dmNews.getString(Types.PT_ARTICLE_HEADLINE);
      Link link = dmNews.toLink();
      if (articleType == ArticleType.NEWS && headline.length() > 0 && link != null) {
        sb.append("<li><a " + link.createAnchorTarget() + ">" + headline + "</a></li>");
        cnt++;
      }
    }
  }
  
  String htmlNewsList = sb.toString();
  
%>
<div class="news">
  <%= uiLink.getPageLink() %>
  <h2>News</h2>
  <ul><%= htmlNewsList %></ul>
</div>  
<%  
}
catch (Throwable t) {
  if (!errors.exit(response, t, dmCmp, this.getClass().getName())) {
    return;
  }
}
%><%= errors.render() %>

Using the Search service to find news for a news component in cmp-news.jsp

This example relies on different includes we created beforehand:

  • documentmodel.jsf - containing constants for all document types and property names of the project
  • articletype.jsf - containing constants for types of articles

Step-by-step walk through the search that creates ten news links to articles

The search support object is created:

SearchUtil searchUtil = new SearchUtil();

SearchUtil is the main object; it collects the information on where and what to search. First, the information for the HTTP request to the search service is filled in. This requires the host and port of the host configured in the Generator service configuration (here: "search"). In addition, the name of the search index has to be set:

searchUtil.setHost(generator.getHostData("search"));
searchUtil.setIndex(SearchUtil.INDEX_ONLINE);

The maximum number of results returned by a single search call can be set. If it is not set, the default value is 100:

searchUtil.setIteratorPageSize(50);

Restricting the search in this way does not mean that the iteration stops after 50 results; it merely keeps the memory footprint as low as possible.

Next, declare that only documents of the document type article should be found:

searchUtil.filterDocumentType(Types.RT_ARTICLE);

The search term in our case is just an expression saying that only articles of type News should be returned. No further restrictions are needed.

The SearchTerm is used to build the query. The search uses a query language that is wrapped by the SearchTerm object. Conditions like the following can be created:

  • word1 - find all documents containing word1
  • word1 and word2 - find all documents containing word1 and word2
  • word1 or word2 - find all documents containing word1, word2, or both
  • (word1 or word2) and word3 - find all documents containing word3 and at least one of word1 and word2

So a tree can be built from the words to be found and the conditions AND and OR. The next line of code creates a term with the condition AND.

SearchTerm searchTerm = searchUtil.createSearchTermAnd();

The SearchTerm can now collect words under this condition, as well as other SearchTerms with other conditions.
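
To illustrate how such a tree of words and AND/OR conditions behaves, here is a minimal plain-Java sketch. It does not use the cmsWorks SearchTerm class; the names TermTreeSketch, Node, word, and, or and doc are hypothetical and only model the logic, with each document reduced to the set of words it contains.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of a term tree; not the cmsWorks SearchTerm API.
class TermTreeSketch {
    // A node decides whether a document (given as its set of words) is a hit.
    interface Node {
        boolean matches(Set<String> documentWords);
    }

    // Leaf: the document must contain the word.
    static Node word(String w) {
        return documentWords -> documentWords.contains(w);
    }

    // AND: every child condition must match.
    static Node and(Node... children) {
        return documentWords -> Arrays.stream(children).allMatch(c -> c.matches(documentWords));
    }

    // OR: at least one child condition must match.
    static Node or(Node... children) {
        return documentWords -> Arrays.stream(children).anyMatch(c -> c.matches(documentWords));
    }

    static Set<String> doc(String... words) {
        return new HashSet<>(Arrays.asList(words));
    }

    public static void main(String[] args) {
        // (word1 or word2) and word3
        Node query = and(or(word("word1"), word("word2")), word("word3"));
        System.out.println(query.matches(doc("word1", "word3"))); // true
        System.out.println(query.matches(doc("word2", "word3"))); // true
        System.out.println(query.matches(doc("word1", "word2"))); // false, word3 is missing
    }
}
```

Nesting an OR node inside an AND node is the same shape as a group of alternative values placed under one AND term.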

To create the condition to find only articles of type news:

searchUtil.addIDs(searchTerm, new int[]{ArticleType.NEWS}, Types.PT_ARTICLE_TYPE);

This method internally creates a new SearchTerm with the condition OR, adds all ID words to it, and adds the created SearchTerm to the given searchTerm. This means that if more than one ID is added:

searchUtil.addIDs(searchTerm, new int[]{ArticleType.NEWS, ArticleType.BACKGROUND_STORY, ArticleType.GOSSIP}, Types.PT_ARTICLE_TYPE);

the search will find articles of type News or of type background or ...

Now the search is prepared and the search keywords are defined. The execution of the search can be started and the search results can be fetched.

SearchResultIdIterator srii = searchUtil.search(searchTerm);
while (srii.hasNextId()) {
  int id = srii.getNextId();
}

This loop requests all available search results. Assuming there are 63 hits, a second search request is sent after the first 50 hits, and it returns the remaining 13 hits. This is handled by the engine, so the user does not have to worry about it.
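
The paging behaviour described above can be sketched in plain Java. This is not the implementation of SearchResultIdIterator; the class PagedIdIteratorSketch, its PageSource interface and the simulated hit list are assumptions made for illustration. The sketch only shows how an iterator can fetch one page at a time and count the requests it sends.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical stand-in for a paging ID iterator; not the cmsWorks API.
class PagedIdIteratorSketch {
    // Simulates one search request: at most pageSize IDs starting at start.
    interface PageSource {
        List<Integer> fetch(int start, int pageSize);
    }

    private final PageSource source;
    private final int pageSize;
    private List<Integer> page = new ArrayList<>();
    private int posInPage = 0;
    private int nextStart = 0;
    private boolean exhausted = false;
    private int requests = 0;

    PagedIdIteratorSketch(PageSource source, int pageSize) {
        this.source = source;
        this.pageSize = pageSize;
    }

    boolean hasNextId() {
        if (posInPage < page.size()) {
            return true;
        }
        if (exhausted) {
            return false;
        }
        // current page is used up: send the next request
        page = source.fetch(nextStart, pageSize);
        requests++;
        nextStart += page.size();
        posInPage = 0;
        if (page.size() < pageSize) {
            exhausted = true; // a short page means nothing is left
        }
        return !page.isEmpty();
    }

    int getNextId() {
        if (!hasNextId()) {
            throw new NoSuchElementException();
        }
        return page.get(posInPage++);
    }

    int requestCount() {
        return requests;
    }

    // Builds an iterator over the simulated hit IDs 1..totalHits.
    static PagedIdIteratorSketch overFakeHits(int totalHits, int pageSize) {
        List<Integer> hits = new ArrayList<>();
        for (int id = 1; id <= totalHits; id++) {
            hits.add(id);
        }
        return new PagedIdIteratorSketch((start, count) -> {
            int from = Math.min(start, hits.size());
            int to = Math.min(start + count, hits.size());
            return new ArrayList<>(hits.subList(from, to));
        }, pageSize);
    }

    public static void main(String[] args) {
        PagedIdIteratorSketch it = overFakeHits(63, 50);
        int count = 0;
        while (it.hasNextId()) {
            it.getNextId();
            count++;
        }
        System.out.println(count + " hits in " + it.requestCount() + " requests"); // 63 hits in 2 requests
    }
}
```

In this model a loop that stops after ten IDs never triggers the second request, which is why a bounded loop stays cheap.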

The rest of the code within the loop reads the properties of the search results defensively. Is the document really an article? Is it of the correct type? Does it have a filled headline? If so, a link is produced. Ten valid articles should be found within the first 50 hits of the search.
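
The defensive checks can be separated from the JSP and looked at on their own. The following sketch uses a hypothetical Doc class in place of DocumentModel and a plain href string in place of the Link object; it only mirrors the decision logic of the loop. The null check on the headline is an extra assumption, since the listing does not show whether getString can return null.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the validity checks; not the cmsWorks DocumentModel.
class NewsFilterSketch {
    static final class Doc {
        final boolean isArticle;
        final boolean isNews;
        final String headline;
        final String href; // stands in for the Link; null means "no link"

        Doc(boolean isArticle, boolean isNews, String headline, String href) {
            this.isArticle = isArticle;
            this.isNews = isNews;
            this.headline = headline;
            this.href = href;
        }
    }

    // A hit is usable only if every check from the walkthrough passes.
    static boolean isValid(Doc d) {
        return d.isArticle
                && d.isNews
                && d.headline != null && !d.headline.isEmpty()
                && d.href != null;
    }

    // Collects list items for the first max valid hits and then stops.
    static List<String> firstValidLinks(List<Doc> hits, int max) {
        List<String> items = new ArrayList<>();
        for (Doc d : hits) {
            if (items.size() >= max) {
                break;
            }
            if (isValid(d)) {
                items.add("<li><a href=\"" + d.href + "\">" + d.headline + "</a></li>");
            }
        }
        return items;
    }

    public static void main(String[] args) {
        List<Doc> hits = new ArrayList<>();
        hits.add(new Doc(true, true, "", "/empty-headline"));
        hits.add(new Doc(true, true, "First news", "/news-1"));
        System.out.println(firstValidLinks(hits, 10));
    }
}
```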

Conditions of the search

The content of a document is basically not typed in the search index.

When a document is returned as a hit for a word, it is still unknown which property contains this word.

Hits are not weighted.

If a word occurs multiple times in one document and only once in another, this makes no difference to the order of hits.

The order of hits is always based on the configured date.

Either a document property of type date or the date of the latest change of the document is the criterion for the order of results.

It cannot be configured that a word found in a keyword field is more important than a word found in a headline field or in a text field. If a search is needed where only hits in a keyword field are acceptable, create a special keyword search index for that purpose.
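
The ordering rules above can be made concrete with a small sketch. The Hit class and its fields are hypothetical, and sorting newest first is an assumption based on the example's goal of finding the latest news; the text does not state the direction. The point of the sketch is that the number of occurrences plays no role in the order.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical model of date-based ordering; not the cmsWorks search engine.
class DateOrderSketch {
    static final class Hit {
        final int id;
        final int occurrences; // how often the word occurs; ignored for ordering
        final LocalDate date;  // configured date property or date of last change

        Hit(int id, int occurrences, LocalDate date) {
            this.id = id;
            this.occurrences = occurrences;
            this.date = date;
        }
    }

    // Returns the hit IDs ordered by date only, newest first (assumed direction).
    static List<Integer> orderIds(List<Hit> hits) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparing((Hit h) -> h.date).reversed());
        List<Integer> ids = new ArrayList<>();
        for (Hit h : sorted) {
            ids.add(h.id);
        }
        return ids;
    }

    public static void main(String[] args) {
        List<Hit> hits = Arrays.asList(
                new Hit(1, 9, LocalDate.of(2020, 1, 1)),
                new Hit(2, 1, LocalDate.of(2021, 6, 15)),
                new Hit(3, 3, LocalDate.of(2020, 12, 31)));
        System.out.println(orderIds(hits)); // [2, 3, 1]
    }
}
```

Document 1 contains the word most often but is listed last, because only its date counts.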