Build a search engine with an auto-complete feature using Apache Nutch and Solr
This article walks through how to set up Apache Nutch and Apache Solr to build a search engine that includes an auto-complete feature (a.k.a. auto-suggest or incremental search). The idea is to extract certain specified elements from the indexed pages (e.g. the heading elements <h1>, <h2>, etc.) and use their contents for an auto-complete search index.
The project makes use of the following two Nutch plugins:
- index-blacklist-whitelist: this plugin filters out unwanted HTML content from the search index
- nutch-custom-search: this plugin provides a number of features, one of which is called the Extractor. The Extractor extracts certain specified HTML elements into separately indexable fields. We'll use it to set up the auto-complete search index.
Note about versions: the info in this article is based on the following product versions:
- Apache Nutch v1.9
- Apache Solr v4.10
Installing the index-blacklist-whitelist plugin
Configuring the index-blacklist-whitelist plugin
Installing the Extractor plugin
Configuring the Extractor plugin
Enabling the index-blacklist-whitelist and extractor plugins
Running the Nutch crawler
Querying the auto-complete index in Solr
Querying the full text index in Solr
Troubleshooting
The link for the index-blacklist-whitelist plugin points to a patch file. For instructions on how to apply the patch and build the plugin, see the HowToContribute article on the Nutch wiki.
Once the plugin is built, copy the jar file and plugin.xml file into the new directory nutch/plugins/index-blacklist-whitelist/.
The index-blacklist-whitelist plugin provides two filtering options:
- "whitelisting": selecting which HTML elements to INclude
- "blacklisting": selecting which HTML elements to EXclude
You can use one or the other but not both at the same time. We'll use the "blacklisting" option to filter out the HTML elements we don't want to index. These are elements like navigation bars, page footers, and other elements that are common across all pages on the site.
To define the blacklist, add the "parser.html.blacklist" property to nutch/conf/nutch-site.xml:
<property>
  <name>parser.html.blacklist</name>
  <value>div.skip,div#nav-bar,div#footer,span#copyright</value>
  <description>
    A comma-delimited list of css-like tags to identify which elements 
    should NOT be parsed.  The remaining content (with these elements
    removed) is stored in the "strippedContent" field in the NutchDocument
  </description>
</property>
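To make the effect of the blacklist more concrete, here's a rough Python sketch of the idea. This is NOT the plugin's actual implementation (the plugin is Java and runs inside Nutch's parser). It only assumes that each entry has the form tag.class or tag#id, as in the example value above; the names parse_selector, BlacklistStripper and strip_blacklisted are made up for illustration.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


def parse_selector(selector):
    """Split 'div#footer' into ('div', 'id', 'footer'), 'div.skip' into ('div', 'class', 'skip')."""
    for separator, attribute in (("#", "id"), (".", "class")):
        if separator in selector:
            tag, value = selector.split(separator, 1)
            return tag, attribute, value
    return selector, None, None


class BlacklistStripper(HTMLParser):
    """Collects the text of a page, skipping everything inside blacklisted elements."""

    def __init__(self, blacklist):
        super().__init__()
        self.rules = [parse_selector(s.strip()) for s in blacklist.split(",")]
        self.skip_depth = 0  # > 0 while we are inside a blacklisted element
        self.text = []

    def _is_blacklisted(self, tag, attrs):
        attrs = dict(attrs)
        for rule_tag, attribute, value in self.rules:
            if rule_tag != tag:
                continue
            if attribute is None:
                return True
            if attribute == "id" and attrs.get("id") == value:
                return True
            if attribute == "class" and value in (attrs.get("class") or "").split():
                return True
        return False

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth or self._is_blacklisted(tag, attrs):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.text.append(data.strip())


def strip_blacklisted(html, blacklist):
    """Return the page text that would remain after removing blacklisted elements."""
    parser = BlacklistStripper(blacklist)
    parser.feed(html)
    parser.close()
    return " ".join(parser.text)
```

The sketch assumes well-formed HTML, since it tracks nesting by counting start and end tags. Calling strip_blacklisted() with a page and the blacklist value shown above returns only the text outside the navigation bar, footer and other excluded elements, which is roughly what ends up in "strippedContent".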
The plugin stores the remaining content (everything that wasn't excluded by the blacklist) in a field called "strippedContent" in the NutchDocument. We need to tell Nutch about this field so that it knows to copy the data over to the SolrDocument for indexing in Solr. The Nutch-to-Solr field mapping is defined in nutch/conf/solrindex-mapping.xml. This file maps which NutchDocument fields (source) are copied over to which SolrDocument fields (dest).
<!-- field gen'ed by the index-blacklist-whitelist plugin -->
<field dest="strippedContent" source="strippedContent"/>
We also need to tell Solr about the "strippedContent" field so that Solr knows how to index it. This is done by adding the field to the Solr schema, solr/collection1/schema.xml:
<!-- field gen'ed by the index-blacklist-whitelist plugin -->
<field name="strippedContent" 
       type="text_general" 
       stored="true" 
       indexed="true"/>
The type="text_general" is a built-in field type for basic full-text indexing. The content is stored so that the actual text can be returned in the search results.
You can download the binary package for the Extractor plugin here. Just download the zip file and extract it to the nutch/plugins directory. It will create a new directory named nutch/plugins/extractor. Under that directory is a file named plugin.xml. This file defines which run-time extension points the Extractor will get plugged into. There are several extension points already defined in the file. You don't need them all, only the first two - the HtmlParseFilter and IndexingFilter:
<extension id="ir.co.bayan.simorq.zal.extractor.nutch.parseFilter"
           name="Extractor XML/HTML Parser filter"
           point="org.apache.nutch.parse.HtmlParseFilter">
  <implementation
      id="ir.co.bayan.simorq.zal.extractor.nutch.ExtractorParseFilter"
      class="ir.co.bayan.simorq.zal.extractor.nutch.ExtractorParseFilter" />
</extension>
<extension id="ir.co.bayan.simorq.zal.extractor.nutch.indexingFilter" 
           name="Extractor Indexing Filter" 
           point="org.apache.nutch.indexer.IndexingFilter">
  <implementation 
      id="ir.co.bayan.simorq.zal.extractor.nutch.ExtractorIndexingFilter"
      class="ir.co.bayan.simorq.zal.extractor.nutch.ExtractorIndexingFilter"/>
</extension>
Make sure all other extensions are deleted or commented out.
First we need to tell the Extractor which html elements we want to extract. By "extract" I simply mean to copy those elements into their own field in the NutchDocument/SolrDocument, so that we can handle them separately and apply special indexing to them. This does NOT mean the elements are removed from the rest of the content. They're still included in the main content (the "strippedContent" field) and still turn up in searches on that field.
To configure the Extractor, create a new file called nutch/conf/extractors.xml. In this file we select which HTML elements to extract, and which NutchDocument field(s) to copy the extracted content into. HTML elements can be identified using css-like notation. Here's an example:
<config xmlns="http://bayan.ir" 
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
        xsi:schemaLocation="http://bayan.ir http://raw.github.com/BayanGroup/nutch-custom-search/master/zal.extractor/src/main/resources/extractors.xsd">
  <fields>
    <field name="headings" multi="true"/>
  </fields>
  <documents>
    <document url="." engine="css">
      <extract-to field="headings">
        <text>
          <expr value="h2.heading-1" />
        </text>
      </extract-to>
      <extract-to field="headings">
        <text>
          <expr value="h3.heading-1-1" />
        </text>
      </extract-to>
    </document>
  </documents>
</config>
A brief description of the elements in this file:
- The <field> element defines which NutchDocument field receives the extracted content -- in this case a new field called "headings". It also tells the extractor that this is a multi-valued field (multi="true"), since multiple headings may be parsed from a single page.
- The <extract-to> elements indicate which HTML elements to extract. In this example we're extracting <h2 class="heading-1"> elements and <h3 class="heading-1-1"> elements. The extracted content is copied into the "headings" field in the NutchDocument. Each extracted element is copied into its own entry in the multi-valued field.
For more information about the format of extractors.xml, see here.
We need to tell Nutch about this new "headings" field in the Nutch-to-Solr field-mapping file so that it gets copied over from the NutchDocument to the SolrDocument, for indexing in Solr (just as we did for the "strippedContent" field gen'ed by the index-blacklist-whitelist plugin). Add the following to nutch/conf/solrindex-mapping.xml:
<!-- field for the extractor plugin -->
<field dest="headings" source="headings"/>
We also need to tell Solr about the new "headings" field in the Solr schema, solr/collection1/schema.xml:
<!-- field for the extractor plugin -->
<field name="headings" 
       type="text_autocomplete" 
       stored="true" 
       indexed="true" 
       multiValued="true"/>
The field type is set to "text_autocomplete". This is a custom type that we'll define next.
Finally, we need to create a new field type and configure it so that the content is indexed in a way that supports auto-complete searches. Custom field types are also defined in the Solr schema, solr/collection1/schema.xml:
<!-- for autocomplete searches -->
<fieldType name="text_autocomplete" 
           class="solr.TextField" 
           omitNorms="true" 
           positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" 
            ignoreCase="true" 
            words="stopwords.txt" 
            enablePositionIncrements="true" />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" 
            minGramSize="2" 
            maxGramSize="50" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
Let's go over some of the settings here.
- omitNorms=true will disable length normalization on the field. The length of a multi-valued field includes all entries, but we want to consider each entry (each heading) in isolation. Therefore we should disable length normalization, so that a page with many headings (i.e. a lengthier "headings" field) does not get punished (assigned a lower search rank) compared to a page with fewer headings (i.e. a shorter "headings" field).
- positionIncrementGap="100" is the word-position-increment inserted between each entry in the multi-valued field. This prevents a phrase match (matching multiple words in close proximity) from incorrectly matching the phrase across entries in the field (i.e. matching the end of one heading and the beginning of another).
- The EdgeNGramFilterFactory is what will provide the "autocomplete" support. This filter generates and indexes the "edge ngrams" for each word in the field. Ngrams are substrings of words. "Edge" means that only substrings/ngrams that start at the beginning of the word are kept (substrings within words are ignored). The minGramSize=2 attribute specifies the smallest substring that will be indexed, maxGramSize=50 specifies the largest. For example, if the field contains the term "apache", the filter indexes not only the word "apache", but also the ngrams: "ap", "apa", "apac", and "apach". This way, the search string "apa" will match on the word "apache" (along with any other word that begins with "apa").
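The edge ngram behavior is easy to see in a few lines of Python. This is only a simplified model of the analyzer chain above, not Solr code: it splits on whitespace instead of using the StandardTokenizer, skips the stop word filter, and the function names are made up for illustration.

```python
def edge_ngrams(token, min_gram=2, max_gram=50):
    """Return the prefixes of `token` with a length between min_gram and max_gram."""
    longest = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, longest + 1)]


def index_terms(heading, min_gram=2, max_gram=50):
    """Simplified index-time analysis: split, lowercase, expand into edge ngrams."""
    terms = set()
    for word in heading.lower().split():
        terms.update(edge_ngrams(word, min_gram, max_gram))
    return terms


def query_terms(text):
    """Simplified query-time analysis: split and lowercase only, no ngram expansion."""
    return text.lower().split()


if __name__ == "__main__":
    print(edge_ngrams("apache"))
    indexed = index_terms("Apache Solr")
    # Every partial word typed by the user is looked up as-is in the indexed ngrams.
    print(all(term in indexed for term in query_terms("Apa so")))
```

Note that only the index-time analyzer generates ngrams. The query-time analyzer leaves the user's partial input alone, so "apa" is matched directly against the indexed prefix "apa".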
Now that the plugins have been installed and configured, enable them by adding them to the list of plugins in nutch/conf/nutch-site.xml:
<!-- add index-blacklist-whitelist and extractor plugins -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|index-blacklist-whitelist|extractor</value>
</property>
Now we're ready to point Nutch at our website and let it roll.
$ nutch/bin/crawl {urls.dir} {crawl.dir} {solr.url}/solr/collection1 1
- {urls.dir} contains the seed URL in seed.txt
- {crawl.dir} can be empty to start. Nutch will fill it up with crawl data (the crawl DB and links DB)
- {solr.url} is the http://{host}:{port} of the Solr instance. Nutch sends crawled content to Solr for indexing
- The final parameter, 1, is the number of crawl rounds, which sets the depth of the crawl. A depth of 1 means only known links (those already in the crawl DB) are crawled. So the first time the command is run (assuming the crawl DB is empty), only the seed URL is crawled. All links parsed from the seed URL are added to the crawl DB. Subsequent invocations will crawl those links as well, and even more links will be added to the DB as they are parsed from the crawled content. Eventually the crawler should reach every page on the website.
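The way repeated depth-1 runs gradually cover a site can be illustrated with a toy model in Python. This is not how Nutch is implemented (it ignores URL filters, fetch intervals, scoring and so on), and the site structure in SITE_LINKS is entirely hypothetical.

```python
def crawl_round(known, fetched, links):
    """Fetch every known-but-unfetched URL, then record the links found on those pages."""
    to_fetch = sorted(known - fetched)
    for url in to_fetch:
        fetched.add(url)
        known.update(links.get(url, []))
    return to_fetch


def simulate(seed, links, rounds):
    """Run `rounds` depth-1 crawls and return the list of URLs fetched in each one."""
    known, fetched = {seed}, set()
    return [crawl_round(known, fetched, links) for _ in range(rounds)]


# Hypothetical site: each page maps to the links found on it.
SITE_LINKS = {
    "/": ["/docs", "/blog"],
    "/docs": ["/docs/install", "/"],
    "/blog": ["/blog/post-1"],
}

if __name__ == "__main__":
    for number, urls in enumerate(simulate("/", SITE_LINKS, 4), start=1):
        print("run", number, urls)
```

In this model the first run fetches only the seed, the second run fetches the pages linked from the seed, and so on, until a run finds nothing new to fetch.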
Now that Nutch has crawled our website, parsed the HTML, removed the blacklisted elements, extracted the headings, and pushed it all into Solr for indexing, we can use Solr to run search queries.
First let's look at how to query the auto-complete content. We'll use the basic DisMax query parser with the following options:
- qf: "headings" - specifies which field to query. We only want to query the auto-complete content in the "headings" field
- fl: "url,id,title" - specifies which fields to include in the results. The "headings" field isn't included here but is included in the highlighted results (see hl.fl)
- mm: "75%" - indicates that at least 75% of search terms must be matched by the search results
- ps: "4" - sets the phrase slop to 4, meaning that for phrase matches (matching multiple search terms in close proximity), the terms must be within 4 words of each other
- pf: "headings^100" - boosts results where all search terms are within close proximity of each other
- hl: "true" - include highlighted search results
- hl.fl: "headings" - highlight matched search terms in the "headings" field
- hl.simple.pre: "<em>" - prefix highlighted terms with "<em>"
- hl.simple.post: "</em>" - suffix highlighted terms with "</em>"
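To make the mm setting concrete, here's the arithmetic as a small Python sketch. It assumes Solr rounds a positive percentage down to a whole number of terms, which is my reading of the Solr documentation for the mm parameter. min_should_match is a made-up helper name, not a Solr API.

```python
def min_should_match(num_terms, percent=75):
    """Number of query terms that must match, assuming the percentage is rounded down."""
    return (num_terms * percent) // 100


if __name__ == "__main__":
    for num_terms in (1, 2, 3, 4, 8):
        print(num_terms, "terms ->", min_should_match(num_terms), "must match")
```

So with three search terms, two of them must match, and with four terms, three must match. For a single term the calculation gives 0, but as far as I understand a document still has to match at least one term to be returned at all.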
So in summary, the query is configured to search the auto-complete content (the "headings" field), return minimal data (basically just the "url" and "title" fields, along with the highlighted matches in the "headings" field), restrict the results to good matches (75% of search terms must match), and boost phrase matches, where the search terms are within close proximity of each other.
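As a rough sketch, the options above can be sent to Solr as ordinary request parameters. The Python below just builds the request URL. The host, port and the /select handler are assumptions about a default local Solr install, and the defType and wt parameters are additions of mine to select the DisMax parser and JSON output; adjust all of these to match your setup.

```python
from urllib.parse import urlencode

# Assumption: Solr runs locally on its default port with the collection1 core.
SOLR_SELECT_URL = "http://localhost:8983/solr/collection1/select"


def autocomplete_url(user_input, base_url=SOLR_SELECT_URL):
    """Build a DisMax query URL against the "headings" field for the given partial input."""
    params = {
        "q": user_input,
        "defType": "dismax",      # use the DisMax query parser
        "qf": "headings",         # search only the auto-complete field
        "fl": "url,id,title",     # fields returned for each result
        "mm": "75%",
        "ps": "4",
        "pf": "headings^100",
        "hl": "true",
        "hl.fl": "headings",
        "hl.simple.pre": "<em>",
        "hl.simple.post": "</em>",
        "wt": "json",             # assumption: the client wants JSON back
    }
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    print(autocomplete_url("apa so"))
```

The same parameters could instead be set as defaults on a request handler in solrconfig.xml, so that the client only needs to send q.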
If the user doesn't find what they're looking for in the auto-complete results, they can submit a full text search against the rest of the content. Again we'll use the DisMax query, but with slightly different options:
- qf: "title headings strippedContent" - specifies which fields to query
- fl: "url,id,title" - specifies which fields to include in the results. Note that the "headings" and "strippedContent" fields are not included here, but are included in the highlighted results (see hl.fl)
- mm: "75%" - indicates that at least 75% of search terms must be matched by the search results
- ps: "4" - sets the phrase slop to 4, meaning that for phrase matches (matching multiple search terms in close proximity), the terms must be within 4 words of each other
- pf: "title headings strippedContent" - boosts results where all search terms are within close proximity of each other
- stopWords: "true" - remove common terms like "the" from the query
- hl: "true" - include highlighted search results
- hl.fl: "strippedContent,headings" - highlight matched search terms in the "strippedContent" and "headings" fields
- hl.simple.pre: "<em>" - prefix highlighted terms with "<em>"
- hl.simple.post: "</em>" - suffix highlighted terms with "</em>"
- hl.snippets: "4" - return a maximum of 4 highlighted snippets from each result
This query is similar to the one above for the auto-complete index. The main differences are which fields are queried and which fields are highlighted in the results.
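Since the "headings" and "strippedContent" text only comes back in the highlighting section, a client has to pair each result with its snippets. Here's a minimal Python sketch of that step. It assumes the response was requested as JSON (wt=json) and that "id" is the unique key, so that the highlighting section is keyed by document id. The SAMPLE response is made up for illustration and is not output from a real index.

```python
def merge_highlights(solr_response, fields=("headings", "strippedContent")):
    """Attach each document's highlighted snippets to its url and title."""
    highlighting = solr_response.get("highlighting", {})
    results = []
    for doc in solr_response.get("response", {}).get("docs", []):
        snippets = highlighting.get(doc["id"], {})
        results.append({
            "url": doc.get("url"),
            "title": doc.get("title"),
            "snippets": {f: snippets[f] for f in fields if f in snippets},
        })
    return results


# Made-up response for illustration only.
SAMPLE = {
    "response": {
        "numFound": 1,
        "docs": [
            {"id": "http://example.com/install",
             "url": "http://example.com/install",
             "title": "Install guide"},
        ],
    },
    "highlighting": {
        "http://example.com/install": {
            "headings": ["Installing <em>Apache</em> Solr"],
        },
    },
}

if __name__ == "__main__":
    for result in merge_highlights(SAMPLE):
        print(result["title"], result["url"], result["snippets"])
```

The same function works for the auto-complete query, where only the "headings" snippets will be present.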
If you're having trouble getting things to work, you can enable full logging (including trace messages) for the plugins by adding the following to nutch/conf/log4j.properties:
# Logging for index-blacklist-whitelist
log4j.logger.at.scintillation.nutch=ALL
# Logging for extractor plugin
log4j.logger.ir.co.bayan.simorq.zal.extractor.nutch=ALL