Apache Nutch 2.3, Hbase 0.94.14 & Solr 5.2.1 Tutorial

A guide on how to install Apache Nutch v2.3 with Hbase as data storage and search indexing via Solr 5.2.1.

Apache Nutch is an open source, extensible web crawler. It lets us crawl a page, extract all the out-links on that page, and then crawl those pages on later crawls. It also handles the frequency of the calls and many other aspects which would be cumbersome to set up by hand.

Let’s have a look at setting up Apache Nutch with Hbase.

Getting all the tools

First we need to get all the tools. Create a working directory on your machine and download the archives into it. If you fetch them from other mirrors, ensure you get the correct versions:

# Create our working dir
mkdir ~/Desktop/Nutch
cd ~/Desktop/Nutch

# Grab all our files (correct versions!)
curl -O http://archive.apache.org/dist/nutch/2.3/apache-nutch-2.3-src.tar.gz
curl -O http://archive.apache.org/dist/hbase/hbase-0.94.14/hbase-0.94.14.tar.gz
curl -O http://archive.apache.org/dist/lucene/solr/5.2.1/solr-5.2.1.tgz

# Unzip and rename them (for convenience!)
tar -xzf apache-nutch-2.3-src.tar.gz && mv apache-nutch-2.3 nutch
tar -xzf hbase-0.94.14.tar.gz && mv hbase-0.94.14 hbase
tar -xzf solr-5.2.1.tgz && mv solr-5.2.1 solr

# Delete command if you wanna start again!
# rm -rf nutch hbase solr

# Move our source files to another directory to keep things clean!
mkdir gzips && mv apache-nutch-2.3-src.tar.gz hbase-0.94.14.tar.gz solr-5.2.1.tgz gzips

# Your working directory should now look like this:
# .
# ├── gzips
# ├── hbase
# ├── nutch
# └── solr
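Before moving on, it can help to confirm the three directories really are in place. Here is a minimal sketch, assuming the directory names used above. `check_layout` is a hypothetical helper name of my own and not part of any of the tools.

```shell
# check_layout DIR: report whether DIR contains the hbase, nutch and solr
# directories created above. Returns non-zero if any of them is missing.
check_layout() {
  layout_root=$1
  layout_missing=0
  for layout_name in hbase nutch solr; do
    if [ -d "$layout_root/$layout_name" ]; then
      echo "ok: $layout_name"
    else
      echo "missing: $layout_name"
      layout_missing=1
    fi
  done
  return "$layout_missing"
}

# Usage (path taken from this tutorial):
# check_layout ~/Desktop/Nutch
```

The gzips directory is deliberately not checked, since nothing later in the guide depends on it.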

Make sure the JAVA_HOME and NUTCH_JAVA_HOME environment variables are set

In your ~/.bash_profile or ~/.bashrc, ensure JAVA_HOME is set correctly. We also need to set NUTCH_JAVA_HOME to point to the same Java home directory. The /usr/libexec/java_home helper used below is specific to macOS.

export JAVA_HOME="$(/usr/libexec/java_home -v 1.8)"
export NUTCH_JAVA_HOME="$(/usr/libexec/java_home -v 1.8)"
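After reloading your shell, a quick sanity check of both variables can save a confusing build failure later. This is only a sketch: `check_java_env` is a hypothetical helper, and it assumes a usable Java home is one that contains an executable bin/java.

```shell
# check_java_env: verify that JAVA_HOME and NUTCH_JAVA_HOME are set and that
# each points at a directory containing an executable bin/java.
check_java_env() {
  java_env_status=0
  for java_env_var in JAVA_HOME NUTCH_JAVA_HOME; do
    # Indirect lookup: read the value of the variable whose name we hold.
    eval "java_env_value=\${$java_env_var:-}"
    if [ -z "$java_env_value" ]; then
      echo "$java_env_var is not set"
      java_env_status=1
    elif [ ! -x "$java_env_value/bin/java" ]; then
      echo "$java_env_var=$java_env_value has no bin/java"
      java_env_status=1
    else
      echo "$java_env_var=$java_env_value looks usable"
    fi
  done
  return "$java_env_status"
}

# Usage:
# check_java_env || echo "fix your ~/.bash_profile first"
```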

Setup Hbase

Open ~/Desktop/Nutch/hbase/conf/hbase-site.xml and add the following two <property> nodes. We need to tell hbase the rootdir of the install and also specify a data directory for zookeeper. Replace /Users/anil with the path to your own home directory.

open ~/Desktop/Nutch/hbase/conf/hbase-site.xml
<configuration>
    <property>
        <name>hbase.rootdir</name>
        <value>file:///Users/anil/Desktop/Nutch/hbase</value>
    </property>
    <property>
        <name>hbase.zookeeper.property.dataDir</name>
        <value>/Users/anil/Desktop/Nutch/zookeeper</value>
    </property>
</configuration>
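If you would rather not hand-edit the paths, the same file can be generated from a base directory. This is a sketch under the assumption that the file should contain only the two properties shown above. `write_hbase_site` is a hypothetical helper name, and it overwrites whatever is already in the target file.

```shell
# write_hbase_site BASE_DIR OUT_FILE: write an hbase-site.xml whose two
# properties point inside BASE_DIR (the working directory of this tutorial).
# BASE_DIR must be an absolute path so that file://BASE_DIR is a valid URL.
write_hbase_site() {
  hbase_base=$1
  hbase_out=$2
  cat > "$hbase_out" <<EOF
<configuration>
    <property>
        <name>hbase.rootdir</name>
        <value>file://$hbase_base/hbase</value>
    </property>
    <property>
        <name>hbase.zookeeper.property.dataDir</name>
        <value>$hbase_base/zookeeper</value>
    </property>
</configuration>
EOF
}

# Usage (back up the existing file first if it has other settings):
# write_hbase_site "$HOME/Desktop/Nutch" "$HOME/Desktop/Nutch/hbase/conf/hbase-site.xml"
```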

Next, we need to tell gora to use Hbase as its default data store.

open ~/Desktop/Nutch/nutch/conf/gora.properties
# open ~/Desktop/Nutch/nutch/runtime/local/conf/gora.properties

# Add this line under `HBaseStore properties` (to keep things organised)
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
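To double check which store is actually active, you can read the value back out of the file. Here is a small sketch, assuming that a commented-out line should be ignored and that the last uncommented assignment wins (as it would in a Java properties file). `gora_default_store` is a hypothetical helper name.

```shell
# gora_default_store FILE: print the value of gora.datastore.default in FILE,
# ignoring lines that are commented out. Prints nothing if the key is absent.
gora_default_store() {
  sed -n 's/^[[:space:]]*gora\.datastore\.default[[:space:]]*=[[:space:]]*//p' "$1" | tail -n 1
}

# Usage, for the source copy and (once it exists) the runtime copy:
# gora_default_store ~/Desktop/Nutch/nutch/conf/gora.properties
# gora_default_store ~/Desktop/Nutch/nutch/runtime/local/conf/gora.properties
```

Both commands should print org.apache.gora.hbase.store.HBaseStore if the line was added as described.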

We need to uncomment the gora-hbase dependency in our ivy.xml (around line 118).

open ~/Desktop/Nutch/nutch/ivy/ivy.xml
# Find and uncomment this line (approx. line 118)
<dependency org="org.apache.gora" name="gora-hbase" rev="0.5" conf="*->default" />

Testing it all works! (and some common useful commands)

# Start it up! (Outputs: starting master, logging to /Users/../../Macbook-Pro.out)
~/Desktop/Nutch/hbase/bin/start-hbase.sh

# Stop it (Outputs: stopping hbase......) - Can take a while, be patient
~/Desktop/Nutch/hbase/bin/stop-hbase.sh

# Access the shell
~/Desktop/Nutch/hbase/bin/hbase shell

# list               = list all tables
# disable 'webpage'  = disable the table (before dropping)
# drop 'webpage'     = drop the table (webpage is created & used by nutch)
# exit               = exit from hbase

# For the next part, we need to start hbase
~/Desktop/Nutch/hbase/bin/start-hbase.sh

Configure nutch

Nutch requires a user agent name to identify your crawler, so we must provide one. We also need to configure the data store to use HBaseStore. We'll add our plugin list here too; note that indexer-solr is included in the regular expression.

open ~/Desktop/Nutch/nutch/conf/nutch-site.xml
<configuration>
    <property>
        <name>http.agent.name</name>
        <value>your-crawler-name</value>
    </property>
    <property>
        <name>storage.data.store.class</name>
        <value>org.apache.gora.hbase.store.HBaseStore</value>
        <description>Default class for storing data</description>
    </property>
    <property>
        <name>plugin.includes</name>
        <value>protocol-httpclient|urlfilter-regex|index-(basic|more)|query-(basic|site|url|lang)|indexer-solr|nutch-extensionpoints|protocol-httpclient|urlfilter-regex|parse-(text|html|msexcel|msword|mspowerpoint|pdf)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|more|metadata)</value>
    </property>
</configuration>


Compiling nutch

We need to compile Apache Nutch using ant. This is straightforward but can take a while the first time (about 10 minutes on my 2012 MacBook Pro). Make sure you have hbase and solr running.

# Start hbase and solr
~/Desktop/Nutch/hbase/bin/start-hbase.sh
~/Desktop/Nutch/solr/bin/solr start

# Build Nutch
cd ~/Desktop/Nutch/nutch
ant runtime

Once complete you’ll see BUILD SUCCESSFUL.


Configure Solr

We need to create a new core, update our schema.xml and start Solr so we can access it via a web browser.

~/Desktop/Nutch/solr/bin/solr start

# Create a new core
~/Desktop/Nutch/solr/bin/solr create_core -c demo

You should now be able to access Solr in your web browser at http://localhost:8983/solr/


Fire it all up!

Finally we can play!

Make sure you have hbase and solr running. We'll also set up our regular expression filter and initial seed list here.

# Ensure hbase and solr are started!
~/Desktop/Nutch/hbase/bin/start-hbase.sh
~/Desktop/Nutch/solr/bin/solr start

# We'll edit the regex filter to only crawl the apache nutch domain
open ~/Desktop/Nutch/nutch/runtime/local/conf/regex-urlfilter.txt

# Add the following line before the `accept anything else +.`
+^http://nutch.apache.org

# Insert the url into our `seed` list
mkdir -p ~/Desktop/Nutch/nutch/runtime/local/urls/
echo "http://nutch.apache.org" > ~/Desktop/Nutch/nutch/runtime/local/urls/seed.txt
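It's easy to get the filter rules subtly wrong, so here is a rough way to try a URL against them without running a crawl. This is a sketch only, and it rests on several assumptions: rules are checked top to bottom, the first matching rule decides, a URL matching no rule is rejected, and `grep -E` is a close enough stand-in for the Java regular expressions Nutch uses. `url_filter` is a hypothetical helper name.

```shell
# url_filter RULES_FILE URL: print "accept" or "reject" for URL, using the
# first rule in RULES_FILE whose pattern matches. Lines starting with '+'
# accept, lines starting with '-' reject, '#' lines and blanks are skipped.
url_filter() {
  filter_rules=$1
  filter_url=$2
  while IFS= read -r filter_line || [ -n "$filter_line" ]; do
    case $filter_line in
      '+'*) filter_verdict=accept ;;
      '-'*) filter_verdict=reject ;;
      *) continue ;;
    esac
    # Drop the leading sign; the rest of the line is the pattern.
    filter_pattern=${filter_line#?}
    if printf '%s\n' "$filter_url" | grep -Eq -- "$filter_pattern"; then
      echo "$filter_verdict"
      return 0
    fi
  done < "$filter_rules"
  echo reject
}

# Usage:
# url_filter ~/Desktop/Nutch/nutch/runtime/local/conf/regex-urlfilter.txt http://nutch.apache.org/
```

One thing this model suggests: as long as the final `+.` rule is still in the file, every other URL is accepted as well, so limiting the crawl to a single domain would probably also mean changing that last rule to `-.`. Treat that as an observation from the sketch and check it against your own crawl.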

Finally, let's start our first crawl!

# Change our working dir
cd ~/Desktop/Nutch/nutch/runtime/local/

# Inject the URLs from our seed list (in the directory named `urls`)
# This injects into our Hbase database (as that's what we have specified)
bin/nutch inject urls

# Generate a batch of URLs that are due for fetching.
# With the Hbase store, Nutch 2.x marks the selected rows in the `webpage` table
# with a batch id rather than writing a segment directory (as Nutch 1.x does).
# -topN = the maximum number of pages to select in this round
# This will only contain 1 URL the first round we execute
bin/nutch generate -topN 10

# Fetch the URLs we generated in the previous step.
# Nutch stores the fetched content back in the `webpage` table in Hbase.
bin/nutch fetch -all

# Instruct Nutch to parse the fetched data; the parsed text and out-links
# are written back to the same rows of the `webpage` table.
# We can write a custom ParserFilter extension to parse content how we like
bin/nutch parse -all

# Update the Database with the new urls
bin/nutch updatedb -all

# Finally let's push everything to solr (and to our demo core)
bin/nutch solrindex http://localhost:8983/solr/demo -all
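The six commands above can be wrapped in a small function so that several rounds run back to back. This is a sketch rather than a replacement for anything Nutch ships. `crawl_rounds` and the `NUTCH` variable are names I have made up, and stopping at the first failing command is my own choice.

```shell
# Path to the nutch launcher; set NUTCH beforehand to point somewhere else.
NUTCH=${NUTCH:-bin/nutch}

# crawl_rounds ROUNDS TOPN SOLR_URL: inject the seed list once, then repeat
# generate -> fetch -> parse -> updatedb ROUNDS times, and index at the end.
# Stops as soon as one of the commands fails.
crawl_rounds() {
  crawl_total=$1
  crawl_top_n=$2
  crawl_solr_url=$3
  "$NUTCH" inject urls || return 1
  crawl_round=1
  while [ "$crawl_round" -le "$crawl_total" ]; do
    echo "round $crawl_round of $crawl_total"
    "$NUTCH" generate -topN "$crawl_top_n" || return 1
    "$NUTCH" fetch -all || return 1
    "$NUTCH" parse -all || return 1
    "$NUTCH" updatedb -all || return 1
    crawl_round=$((crawl_round + 1))
  done
  "$NUTCH" solrindex "$crawl_solr_url" -all
}

# Usage, from ~/Desktop/Nutch/nutch/runtime/local:
# crawl_rounds 2 10 http://localhost:8983/solr/demo
```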

If you go to your Solr web admin, under the demo core overview you will see NumDocs: 1.

Go to the query tab and hit the execute button to view the crawled page data.


Other: example output

[14:24] ~/Desktop/Nutch/nutch/runtime/local $ bin/nutch inject urls
InjectorJob: starting at 2015-06-27 14:46:49
InjectorJob: Injecting urlDir: urls
InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 1
Injector: finished at 2015-06-27 14:46:52, elapsed: 00:00:03

[14:24] ~/Desktop/Nutch/nutch/runtime/local $ bin/nutch generate -topN 10
GeneratorJob: starting at 2015-06-27 14:47:00
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: true
GeneratorJob: normalizing: true
GeneratorJob: topN: 10
GeneratorJob: finished at 2015-06-27 14:47:02, time elapsed: 00:00:02
GeneratorJob: generated batch id: 1435412820-1932223658 containing 1 URLs

[14:24] ~/Desktop/Nutch/nutch/runtime/local $ bin/nutch fetch -all
FetcherJob: starting at 2015-06-27 14:47:29
FetcherJob: fetching all
FetcherJob: threads: 10
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : -1
Using queue mode : byHost
Fetcher: threads: 10
QueueFeeder finished: total 1 records. Hit by time limit :0
fetching http://nutch.apache.org/ (queue crawl delay=5000ms)
-finishing thread FetcherThread1, activeThreads=1
-finishing thread FetcherThread2, activeThreads=1
-finishing thread FetcherThread3, activeThreads=1
-finishing thread FetcherThread4, activeThreads=1
-finishing thread FetcherThread5, activeThreads=1
-finishing thread FetcherThread6, activeThreads=1
-finishing thread FetcherThread7, activeThreads=1
-finishing thread FetcherThread8, activeThreads=1
Fetcher: throughput threshold: -1
-finishing thread FetcherThread9, activeThreads=1
Fetcher: throughput threshold sequence: 5
-finishing thread FetcherThread0, activeThreads=0
0/0 spinwaiting/active, 1 pages, 0 errors, 0.2 0 pages/s, 84 84 kb/s, 0 URLs in 0 queues
-activeThreads=0
FetcherJob: finished at 2015-06-27 14:47:37, time elapsed: 00:00:07

[14:24] ~/Desktop/Nutch/nutch/runtime/local $ bin/nutch parse -all
ParserJob: starting at 2015-06-27 14:47:41
ParserJob: resuming:        false
ParserJob: forced reparse:  false
ParserJob: parsing all
Parsing http://nutch.apache.org/
ParserJob: success
ParserJob: finished at 2015-06-27 14:47:44, time elapsed: 00:00:02

[14:24] ~/Desktop/Nutch/nutch/runtime/local $ bin/nutch updatedb -all
DbUpdaterJob: starting at 2015-06-27 14:48:00
DbUpdaterJob: updatinging all
DbUpdaterJob: finished at 2015-06-27 14:48:02, time elapsed: 00:00:02


Troubleshooting

No IndexWriters activated - check your configuration

If you get this error whilst indexing to Solr:

bin/nutch solrindex http://localhost:8983/solr/demo -all
# IndexingJob: starting
# No IndexWriters activated - check your configuration
# IndexingJob: done.

# You need to add the plugin.includes property to nutch-site.xml (BOTH copies: the src one and the runtime one)
open ~/Desktop/Nutch/nutch/conf/nutch-site.xml
open ~/Desktop/Nutch/nutch/runtime/local/conf/nutch-site.xml

<property>
    <name>plugin.includes</name>
    <value>protocol-httpclient|urlfilter-regex|index-(basic|more)|query-(basic|site|url|lang)|indexer-solr|nutch-extensionpoints|protocol-httpclient|urlfilter-regex|parse-(text|html|msexcel|msword|mspowerpoint|pdf)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|more|metadata)</value>
</property>

# Recompile
cd ~/Desktop/Nutch/nutch/
ant runtime

# Try solr again!
cd ~/Desktop/Nutch/nutch/runtime/local
bin/nutch solrindex http://localhost:8983/solr/demo -all

# Example Output:
# IndexingJob: starting
# Active IndexWriters :
# SOLRIndexWriter
    #   solr.server.url : URL of the SOLR instance (mandatory)
    #   solr.commit.size : buffer size when sending to SOLR (default 1000)
    #   solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
    #   solr.auth : use authentication (default false)
    #   solr.auth.username : username for authentication
    #   solr.auth.password : password for authentication
# IndexingJob: done.
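Since the property has to be present in two places, a quick check of each file shows which copy is missing it. This is a sketch: `has_indexer_solr` is a hypothetical helper, and it assumes the <value> element follows the <name> element of the plugin.includes property, as in the snippet above.

```shell
# has_indexer_solr FILE: succeed if FILE has a plugin.includes property whose
# value mentions indexer-solr, fail otherwise.
has_indexer_solr() {
  awk '
    /<name>plugin\.includes<\/name>/ { in_prop = 1 }
    in_prop && /<value>/ { found = ($0 ~ /indexer-solr/); in_prop = 0 }
    END { exit (found ? 0 : 1) }
  ' "$1"
}

# Usage, from ~/Desktop/Nutch/nutch:
# for f in conf/nutch-site.xml runtime/local/conf/nutch-site.xml; do
#   has_indexer_solr "$f" && echo "ok: $f" || echo "missing indexer-solr: $f"
# done
```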

java.util.NoSuchElementException

If you're getting this error, you probably have not configured your data store in gora.properties correctly.

java.util.NoSuchElementException
    at java.util.TreeMap.key(TreeMap.java:1323)
    at java.util.TreeMap.firstKey(TreeMap.java:290)
    at org.apache.gora.memory.store.MemStore.execute(MemStore.java:125)
    at org.apache.gora.query.impl.QueryBase.execute(QueryBase.java:73)
    at org.apache.gora.mapreduce.GoraRecordReader.executeQuery(GoraRecordReader.java:68)
    at org.apache.gora.mapreduce.GoraRecordReader.nextKeyValue(GoraRecordReader.java:110)
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:531)
    at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

CheatCode

Running this for the second and further times will result in more pages being added to the index. For the Apache Nutch website it found 3 pages on the second crawl (total 4 shown in Solr admin).

# Running everything again
cd ~/Desktop/Nutch/nutch/runtime/local/
bin/nutch inject urls
bin/nutch generate -topN 10
bin/nutch fetch -all
bin/nutch parse -all
bin/nutch updatedb -all
bin/nutch solrindex http://localhost:8983/solr/demo -all

# One liner
bin/nutch inject urls && bin/nutch generate -topN 10 && bin/nutch fetch -all && bin/nutch parse -all && bin/nutch updatedb -all && \
bin/nutch solrindex http://localhost:8983/solr/demo -all

# Clearing the solr database
rm -rf ~/Desktop/Nutch/solr/server/solr/demo
cd ~/Desktop/Nutch/solr && bin/solr restart
bin/solr create_core -c demo

# Clearing the hbase database
# The hbase db stores the URLs for nutch, you may want to clear this also
~/Desktop/Nutch/hbase/bin/hbase shell
disable 'webpage'
drop 'webpage'
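If you clear the table often, typing into the shell gets tedious. The sketch below feeds the same commands to the hbase shell on standard input. It assumes the shell reads commands from a pipe in your version. `drop_webpage_table` and the `HBASE` variable are names I have made up.

```shell
# Path to the hbase launcher; set HBASE beforehand to point somewhere else.
HBASE=${HBASE:-$HOME/Desktop/Nutch/hbase/bin/hbase}

# drop_webpage_table: send the disable and drop commands (followed by exit)
# to the hbase shell instead of typing them interactively.
drop_webpage_table() {
  printf '%s\n' "disable 'webpage'" "drop 'webpage'" "exit" | "$HBASE" shell
}

# Usage (hbase must be running):
# drop_webpage_table
```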