Apache Nutch 2.2, MySQL & Solr 5.2.1 Tutorial

Apache Nutch is an open-source, extensible web crawler. It lets us crawl a page, extract all the out-links on that page, and then crawl those pages on subsequent rounds. It also handles the frequency of the calls and many other aspects that would be cumbersome to set up ourselves.

Let’s have a look at setting up Apache Nutch with MySQL.


Getting all the tools

First we need to get all the tools. Create a working directory on your machine and download them into it. If you fetch them from other mirrors, ensure you get the correct versions:

# Create our working dir
mkdir ~/Desktop/Nutch
cd ~/Desktop/Nutch

# Grab all our files (correct versions!)
curl -O http://archive.apache.org/dist/nutch/2.2/apache-nutch-2.2-src.tar.gz
curl -O http://archive.apache.org/dist/lucene/solr/5.2.1/solr-5.2.1.tgz

# Unzip and rename them (for convenience!)
tar -xzf apache-nutch-2.2-src.tar.gz && mv apache-nutch-2.2 nutch
tar -xzf solr-5.2.1.tgz && mv solr-5.2.1 solr

# Delete command if you want to start again!
# rm -rf nutch solr

# Move our downloaded archives to another directory to keep things clean!
mkdir gzips && mv apache-nutch-2.2-src.tar.gz solr-5.2.1.tgz gzips

# Your working directory should now look like this:
# .
# ├── gzips
# ├── nutch
# └── solr

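Make sure the JAVA_HOME and NUTCH_JAVA_HOME environment variables are set

In your ~/.bash_profile or ~/.bashrc ensure your JAVA_HOME is set correctly. We also need to set NUTCH_JAVA_HOME to point to our Java home directory. The /usr/libexec/java_home helper used below is macOS-specific; on Linux, point both variables at your JDK directory instead.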

export JAVA_HOME="$(/usr/libexec/java_home -v 1.8)"
export NUTCH_JAVA_HOME="$(/usr/libexec/java_home -v 1.8)"
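
A quick sanity check before building can save a failed Ant run. The sketch below is a hypothetical helper (the name `missing_java_vars` is not part of Nutch) that reports which of the two variables are unset, empty, or not pointing at a directory.

```python
import os

REQUIRED_VARS = ("JAVA_HOME", "NUTCH_JAVA_HOME")


def missing_java_vars(env=None, path_exists=os.path.isdir):
    """Return the required variables that are unset, empty, or not a directory."""
    env = os.environ if env is None else env
    problems = []
    for name in REQUIRED_VARS:
        value = env.get(name, "")
        if not value or not path_exists(value):
            problems.append(name)
    return problems


# Check the current shell environment and report the result.
print(missing_java_vars() or "Java environment looks fine")
```

Run it from a fresh terminal so it sees the values exported by your profile file.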

Set up MySQL

We need to add some simple MySQL configuration to get everything running.

Configure MySQL

Add the following lines to the [mysqld] section of your my.cnf.

# My MySQL location was here:
# open /usr/local/mysql/my.cnf

innodb_file_format=barracuda
innodb_file_per_table=true
innodb_large_prefix=true
character-set-server=utf8mb4
collation-server=utf8mb4_unicode_ci
max_allowed_packet=500M

Create our database and the webpage table

CREATE DATABASE nutch DEFAULT CHARACTER SET utf8mb4 DEFAULT COLLATE utf8mb4_unicode_ci;
use nutch;
CREATE TABLE `webpage` (
`id` varchar(767) NOT NULL,
`headers` blob,
`text` longtext DEFAULT NULL,
`status` int(11) DEFAULT NULL,
`markers` blob,
`parseStatus` blob,
`modifiedTime` bigint(20) DEFAULT NULL,
`prevModifiedTime` bigint(20) DEFAULT NULL,
`score` float DEFAULT NULL,
`typ` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
`batchId` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
`baseUrl` varchar(767) DEFAULT NULL,
`content` longblob,
`title` varchar(2048) DEFAULT NULL,
`reprUrl` varchar(767) DEFAULT NULL,
`fetchInterval` int(11) DEFAULT NULL,
`prevFetchTime` bigint(20) DEFAULT NULL,
`inlinks` mediumblob,
`prevSignature` blob,
`outlinks` mediumblob,
`fetchTime` bigint(20) DEFAULT NULL,
`retriesSinceFetch` int(11) DEFAULT NULL,
`protocolStatus` blob,
`signature` blob,
`metadata` blob,
PRIMARY KEY (`id`)
) ENGINE=InnoDB
ROW_FORMAT=COMPRESSED
DEFAULT CHARSET=utf8mb4;
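
The `varchar(767)` primary key only fits because of the my.cnf settings above. A rough check of the arithmetic, assuming InnoDB's documented index key limits of 767 bytes by default and 3072 bytes with `innodb_large_prefix` and a COMPRESSED or DYNAMIC row format (the helper names are made up for this sketch):

```python
UTF8MB4_MAX_BYTES_PER_CHAR = 4      # utf8mb4 uses up to 4 bytes per character
DEFAULT_PREFIX_LIMIT_BYTES = 767    # assumed limit without innodb_large_prefix
LARGE_PREFIX_LIMIT_BYTES = 3072     # assumed limit with innodb_large_prefix


def index_bytes(chars, bytes_per_char=UTF8MB4_MAX_BYTES_PER_CHAR):
    """Worst-case bytes needed to index a varchar of the given length."""
    return chars * bytes_per_char


def fits(chars, limit_bytes):
    """True if a varchar(chars) key fits within the index key limit."""
    return index_bytes(chars) <= limit_bytes


print(index_bytes(767))                        # 3068 bytes in the worst case
print(fits(767, DEFAULT_PREFIX_LIMIT_BYTES))   # False: too large by default
print(fits(767, LARGE_PREFIX_LIMIT_BYTES))     # True: fits with large prefix
print(DEFAULT_PREFIX_LIMIT_BYTES // UTF8MB4_MAX_BYTES_PER_CHAR)  # 191 chars max by default
```

This is why the table is created with ROW_FORMAT=COMPRESSED: without the larger limit, a utf8mb4 key of this length would be rejected.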


Add the dependency to our ivy.xml file

open ~/Desktop/Nutch/nutch/ivy/ivy.xml
# Change the following line (approx line 102) - all these are grouped together
<dependency org="org.apache.gora" name="gora-core" rev="0.3" conf="*->default"/>
# to
<dependency org="org.apache.gora" name="gora-core" rev="0.2.1" conf="*->default"/>

# And uncomment the following gora-sql line
<dependency org="org.apache.gora" name="gora-sql" rev="0.1.1-incubating" conf="*->default" />

# Also uncomment the mysql connector
# <!-- Uncomment this to use MySQL as database with SQL as Gora store. -->
<dependency org="mysql" name="mysql-connector-java" rev="5.1.18" conf="*->default"/>

Replace the default SQL store with MySQL in gora.properties

Open ~/Desktop/Nutch/nutch/conf/gora.properties. Delete or comment out the Default SqlStore Properties using #, then add the MySQL properties below, replacing root with the user and password you set up when installing MySQL earlier.

# Comment these lines out
#gora.sqlstore.jdbc.driver=org.hsqldb.jdbc.JDBCDriver
#gora.sqlstore.jdbc.url=jdbc:hsqldb:hsql://localhost/nutchtest
#gora.sqlstore.jdbc.user=sa
#gora.sqlstore.jdbc.password=

# Add the following MySQL Properties

###############################
# MySQL properties            #
###############################
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch?createDatabaseIfNotExist=true
gora.sqlstore.jdbc.user=root
gora.sqlstore.jdbc.password=root
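
A misconfigured gora.properties is a common source of trouble later on, so it can help to check the file before compiling. The following is a minimal sketch with a deliberately simple parser (it ignores line continuations and escapes); `check_gora_properties` is a hypothetical helper, not a Gora API.

```python
REQUIRED_KEYS = (
    "gora.sqlstore.jdbc.driver",
    "gora.sqlstore.jdbc.url",
    "gora.sqlstore.jdbc.user",
    "gora.sqlstore.jdbc.password",
)


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # or ! comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        # Split on the first '=' only: the JDBC URL itself contains '='.
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def check_gora_properties(text):
    """Return a list of human-readable problems (empty when all looks fine)."""
    props = parse_properties(text)
    problems = ["missing " + key for key in REQUIRED_KEYS if key not in props]
    url = props.get("gora.sqlstore.jdbc.url", "")
    if url and not url.startswith("jdbc:mysql:"):
        problems.append("jdbc.url does not point at MySQL")
    return problems


SAMPLE = """
#gora.sqlstore.jdbc.driver=org.hsqldb.jdbc.JDBCDriver
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch?createDatabaseIfNotExist=true
gora.sqlstore.jdbc.user=root
gora.sqlstore.jdbc.password=root
"""
print(check_gora_properties(SAMPLE))  # [] when nothing is wrong
```

To check the real file, pass the contents of conf/gora.properties to `check_gora_properties` instead of `SAMPLE`.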

Add the SQL mapping

open ~/Desktop/Nutch/nutch/conf/gora-sql-mapping.xml

# Change : <primarykey column="id" length="512"/>
# To     : <primarykey column="id" length="767"/>

# It's in 2 locations! - Change them both! (Line 21 + Line 54)
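
If you would rather not edit both locations by hand, the change can be scripted. This is a sketch only: it assumes both primary key elements are written exactly as shown above, and `widen_primary_keys` is a made-up helper name.

```python
import re

# Matches <primarykey column="id" length="512"/> and captures the text around 512.
PRIMARY_KEY_512 = re.compile(r'(<primarykey\s+column="id"\s+length=")512(")')


def widen_primary_keys(mapping_xml):
    """Return (new_text, replacements) with every 512-length id key set to 767."""
    return PRIMARY_KEY_512.subn(r"\g<1>767\g<2>", mapping_xml)


SAMPLE = """
<class name="org.apache.nutch.storage.WebPage" keyClass="java.lang.String" table="webpage">
  <primarykey column="id" length="512"/>
</class>
<class name="org.apache.nutch.storage.Host" keyClass="java.lang.String" table="host">
  <primarykey column="id" length="512"/>
</class>
"""
updated, count = widen_primary_keys(SAMPLE)
print(count)  # number of primary keys changed
```

The returned count should be 2 for the real mapping file; anything else means the file differs from what this tutorial expects and is worth checking manually.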

Configure Solr

We need to start Solr and create a new core (updating its schema.xml if required) so we can access it via a web browser.

~/Desktop/Nutch/solr/bin/solr start

# Create a new core
~/Desktop/Nutch/solr/bin/solr create_core -c demo

You should now be able to access Solr in your web browser at:

http://localhost:8983/solr/

Configure Apache Nutch

We need to add our default Apache Nutch configuration to nutch-site.xml. In particular, we need to add or change http.agent.name and the storage.data.store.class for our Gora SQL store.

I’m adding the plugins here also.

open ~/Desktop/Nutch/nutch/conf/nutch-site.xml

# Place the following contents into it
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
    <property>
        <name>http.agent.name</name>
        <value>nutch-spider-demo</value>
    </property>

    <property>
        <name>parser.character.encoding.default</name>
        <value>utf-8</value>
        <description>The character encoding to fall back to when no other information is available</description>
    </property>

    <property>
        <name>storage.data.store.class</name>
        <value>org.apache.gora.sql.store.SqlStore</value>
        <description>The Gora DataStore class for storing and retrieving data.</description>
    </property>

    <property>
        <name>plugin.includes</name>
        <value>protocol-httpclient|urlfilter-regex|index-(basic|more)|query-(basic|site|url|lang)|indexer-solr|nutch-extensionpoints|protocol-httpclient|urlfilter-regex|parse-(text|html|msexcel|msword|mspowerpoint|pdf)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|more|metadata)</value>
    </property>
</configuration>
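
The plugin.includes value is a regular expression, so a small typo can silently disable a plugin. Here is a rough way to test which plugin IDs a value enables, assuming Nutch requires the expression to match the whole plugin ID (an assumption of this sketch, as are the shortened sample value and the helper names):

```python
import re
import xml.etree.ElementTree as ET


def read_properties(site_xml):
    """Map each <property> name to its value in a nutch-site.xml string."""
    root = ET.fromstring(site_xml)
    return {p.findtext("name"): p.findtext("value") for p in root.iter("property")}


def plugin_enabled(plugin_includes, plugin_id):
    """True if the plugin.includes expression matches the whole plugin ID."""
    return re.fullmatch(plugin_includes, plugin_id) is not None


# Shortened, hypothetical configuration used only for this illustration.
SAMPLE_SITE_XML = """
<configuration>
    <property>
        <name>http.agent.name</name>
        <value>nutch-spider-demo</value>
    </property>
    <property>
        <name>plugin.includes</name>
        <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|urlnormalizer-(pass|regex|basic)</value>
    </property>
</configuration>
"""

props = read_properties(SAMPLE_SITE_XML)
for plugin in ("indexer-solr", "parse-tika", "protocol-httpclient"):
    print(plugin, plugin_enabled(props["plugin.includes"], plugin))
```

In this sample `protocol-httpclient` is reported as disabled, because only `protocol-http` is listed and a prefix match is not enough.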

Compile Apache Nutch using Ant

Now that we have all the configuration set up, we have to compile Apache Nutch using Apache Ant.

cd ~/Desktop/Nutch/nutch
ant runtime

# Outputs: (On a Macbook Pro Mid-2012)
# BUILD SUCCESSFUL
# Total time: 7 minutes 2 seconds

This may take several minutes to complete. The first build usually takes the longest while all the dependencies are fetched; subsequent builds should be dramatically faster.



Fire it all up!

Finally we can play!

Make sure you have Solr running. We’ll also set up our regular expressions and initial seed list here.

# Ensure MySQL and Solr are started!
~/Desktop/Nutch/solr/bin/solr start

# We'll edit the regex filter to only crawl the apache nutch domain
open ~/Desktop/Nutch/nutch/runtime/local/conf/regex-urlfilter.txt

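# Add the following line before the final `+.` (accept anything else) rule.
# The first matching rule wins, so also change `+.` to `-.` if other domains should be rejected.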
+^http://nutch.apache.org

# Insert the url into our `seed` list
mkdir -p ~/Desktop/Nutch/nutch/runtime/local/urls/
echo "http://nutch.apache.org" > ~/Desktop/Nutch/nutch/runtime/local/urls/seed.txt
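
To see how the order of rules in regex-urlfilter.txt affects the result, here is a small simulation. It assumes the filter applies the first rule whose pattern is found in the URL and rejects URLs that match no rule; Python's `re` stands in for Java's regex engine, and the rule set is a trimmed-down example rather than the full default file.

```python
import re


def parse_rules(text):
    """Turn '+pattern' / '-pattern' lines into (accept, compiled_regex) pairs."""
    rules = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Apply the first matching rule; reject the URL if nothing matches."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


RULES = r"""
# skip file:, ftp: and mailto: urls
-^(file|ftp|mailto):
# accept the Apache Nutch site
+^http://nutch\.apache\.org
# reject anything else
-.
"""
# Same rules, but keeping the default "accept anything else" line at the end.
OPEN_RULES = RULES.replace("\n-.\n", "\n+.\n")

strict = parse_rules(RULES)
print(accepts(strict, "http://nutch.apache.org/downloads.html"))  # True
print(accepts(strict, "http://example.com/"))                     # False
print(accepts(parse_rules(OPEN_RULES), "http://example.com/"))    # True
```

The last line shows why the final rule matters: with `+.` left in place, other domains are still accepted even though the Nutch rule comes first.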

Finally, let’s start our first crawl.

# Change our working dir
cd ~/Desktop/Nutch/nutch/runtime/local/

# Inject the URLs from our seed list (in the directory named `urls`)
# This injects into our MySQL database (as that's what we have specified)
bin/nutch inject urls

# Generate a batch of URLs to fetch.
# Nutch 2.x marks the selected rows in the webpage table with a batch ID
# (the segment directories used by Nutch 1.x do not apply here).
# -topN = How many pages to crawl this round
# This will only contain 1 URL the first round we execute
bin/nutch generate -topN 10

# Fetch the URLs we generated in the previous step.
# Nutch stores the fetched content back into the webpage table.
bin/nutch fetch -all

# Instruct Nutch to parse the fetched data; the parsed text, title and outlinks
# are written back to the webpage table.
# We can write a custom parse filter extension to parse content how we like
bin/nutch parse -all

# Update the Database with the new urls
bin/nutch updatedb

# Finally let's push everything to solr (and to our demo core)
bin/nutch solrindex http://localhost:8983/solr/demo -all
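
The six commands above always run in the same order, so they are easy to wrap in a script. Below is a minimal sketch; the function names and the `nutch_home` parameter are inventions of this example, and it simply stops at the first command that fails.

```python
import os
import subprocess


def crawl_commands(solr_url, top_n=10, seed_dir="urls"):
    """Build the command lines for one full crawl round, in execution order."""
    return [
        ["bin/nutch", "inject", seed_dir],
        ["bin/nutch", "generate", "-topN", str(top_n)],
        ["bin/nutch", "fetch", "-all"],
        ["bin/nutch", "parse", "-all"],
        ["bin/nutch", "updatedb"],
        ["bin/nutch", "solrindex", solr_url, "-all"],
    ]


def run_crawl(solr_url, nutch_home, top_n=10):
    """Run one crawl round from nutch_home, raising if any step fails."""
    for cmd in crawl_commands(solr_url, top_n):
        print("Running:", " ".join(cmd))
        # Resolve the script path explicitly, then run inside runtime/local.
        full_cmd = [os.path.join(nutch_home, cmd[0])] + cmd[1:]
        subprocess.run(full_cmd, cwd=nutch_home, check=True)
```

For this tutorial's layout the call would be `run_crawl("http://localhost:8983/solr/demo", os.path.expanduser("~/Desktop/Nutch/nutch/runtime/local"))`.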

If you go to your Solr web admin panel, under the demo core overview you should see NumDocs: 1.

Go to the query tab and hit the execute button to view the crawled page data.
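
The same check can be made without the admin UI by asking the core's select handler for a count. A sketch, assuming Solr's standard JSON response format where the total sits under `response.numFound` (the function names are made up for this example):

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def count_query_url(core_url):
    """Build a match-all query URL that returns no rows, only the count."""
    params = urlencode({"q": "*:*", "rows": 0, "wt": "json"})
    return core_url.rstrip("/") + "/select?" + params


def num_found(response_body):
    """Extract the total number of matching documents from a JSON response."""
    return json.loads(response_body)["response"]["numFound"]


def count_documents(core_url):
    """Query a running Solr core and return how many documents it holds."""
    with urlopen(count_query_url(core_url)) as resp:
        return num_found(resp.read().decode("utf-8"))
```

With Solr running, `count_documents("http://localhost:8983/solr/demo")` should agree with the number shown in the core overview.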

You should also see the webpage table in your MySQL nutch database filled with all the data from the crawl.

If you’re getting any errors, try using the following schema.xml file: http://pastebin.com/GAVBQ6ug


Troubleshooting

No IndexWriters activated - check your configuration

If you get this error whilst indexing to Solr:

bin/nutch solrindex http://localhost:8983/solr/demo -all
# IndexingJob: starting
# No IndexWriters activated - check your configuration
# IndexingJob: done.

# You need to add the plugins property to nutch-site.xml (both copies: the source one and the runtime one)
open ~/Desktop/Nutch/nutch/conf/nutch-site.xml
open ~/Desktop/Nutch/nutch/runtime/local/conf/nutch-site.xml
<property>
    <name>plugin.includes</name>
    <value>protocol-httpclient|urlfilter-regex|index-(basic|more)|query-(basic|site|url|lang)|indexer-solr|nutch-extensionpoints|protocol-httpclient|urlfilter-regex|parse-(text|html|msexcel|msword|mspowerpoint|pdf)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|more|metadata)</value>
</property>
# Recompile
cd ~/Desktop/Nutch/nutch/
ant runtime

# Try solr again!
cd ~/Desktop/Nutch/nutch/runtime/local
bin/nutch solrindex http://localhost:8983/solr/demo -all

# Example Output:
# IndexingJob: starting
# Active IndexWriters :
# SOLRIndexWriter
    #   solr.server.url : URL of the SOLR instance (mandatory)
    #   solr.commit.size : buffer size when sending to SOLR (default 1000)
    #   solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
    #   solr.auth : use authentication (default false)
    #   solr.auth.username : username for authentication
    #   solr.auth.password : password for authentication
# IndexingJob: done.

java.util.NoSuchElementException

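If you’re getting this error, you’ve probably not configured your data store correctly. The MemStore frame in the trace below suggests Nutch is using Gora’s in-memory store instead of the SQL store, so check gora.properties and the storage.data.store.class property in nutch-site.xml.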

java.util.NoSuchElementException
    at java.util.TreeMap.key(TreeMap.java:1323)
    at java.util.TreeMap.firstKey(TreeMap.java:290)
    at org.apache.gora.memory.store.MemStore.execute(MemStore.java:125)
    at org.apache.gora.query.impl.QueryBase.execute(QueryBase.java:73)
    at org.apache.gora.mapreduce.GoraRecordReader.executeQuery(GoraRecordReader.java:68)
    at org.apache.gora.mapreduce.GoraRecordReader.nextKeyValue(GoraRecordReader.java:110)
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:531)
    at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

“Cheat Code”

Running the crawl steps a second time will result in more pages being added to the index. For the Apache Nutch website it found 3 pages on the second crawl (total 4 shown in the Solr admin UI).

# Running everything again
cd ~/Desktop/Nutch/nutch/runtime/local/
bin/nutch inject urls
bin/nutch generate -topN 10
bin/nutch fetch -all
bin/nutch parse -all
bin/nutch updatedb
bin/nutch solrindex http://localhost:8983/solr/demo -all

# One liner
bin/nutch inject urls && bin/nutch generate -topN 10 && bin/nutch fetch -all && bin/nutch parse -all && bin/nutch updatedb && bin/nutch solrindex http://localhost:8983/solr/demo -all

# Clearing the solr database
rm -rf ~/Desktop/Nutch/solr/server/solr/demo/data && ~/Desktop/Nutch/solr/bin/solr restart
cd ~/Desktop/Nutch/solr && bin/solr restart
bin/solr create_core -c demo

# Clearing the MySQL database
# The webpage table stores the URLs for Nutch, you may want to clear this also
mysql -u root -p -e "TRUNCATE TABLE nutch.webpage;"
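
To see why a second round adds pages, it helps to model the generate, fetch and updatedb cycle on a tiny in-memory link graph. This is a toy simulation, not Nutch code: the page names and links are invented, and real Nutch also applies scoring, filters and fetch intervals.

```python
def crawl_round(link_graph, status, top_n):
    """Run one simplified generate/fetch/updatedb round and return the batch."""
    # generate: pick up to top_n pages that have not been fetched yet
    batch = sorted(url for url, s in status.items() if s == "unfetched")[:top_n]
    for url in batch:
        # fetch + parse: mark the page as fetched
        status[url] = "fetched"
        # updatedb: newly discovered out-links become unfetched entries
        for out in link_graph.get(url, []):
            status.setdefault(out, "unfetched")
    return batch


# Hypothetical site: the home page links to three others.
LINK_GRAPH = {
    "home": ["about", "docs", "download"],
    "about": ["home"],
    "docs": ["home", "faq"],
}

status = {"home": "unfetched"}  # inject: only the seed URL is known
first = crawl_round(LINK_GRAPH, status, top_n=10)
second = crawl_round(LINK_GRAPH, status, top_n=10)
print(len(first), len(second))  # 1 page in round one, 3 in round two
print(sum(1 for s in status.values() if s == "fetched"))  # 4 fetched in total
```

The first round can only fetch the seed, because its out-links are not known until updatedb has run; the second round then picks up those links, which mirrors the 1 then 3 pattern seen on the Apache Nutch site.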

Apache Nutch Status Codes

/** Page was not fetched yet. */
public static final byte STATUS_DB_UNFETCHED      = 0x01;

/** Page was successfully fetched. */
public static final byte STATUS_DB_FETCHED        = 0x02;

/** Page no longer exists. */
public static final byte STATUS_DB_GONE           = 0x03;

/** Page temporarily redirects to other page. */
public static final byte STATUS_DB_REDIR_TEMP     = 0x04;

/** Page permanently redirects to other page. */
public static final byte STATUS_DB_REDIR_PERM     = 0x05;

/** Page was successfully fetched and found not modified. */
public static final byte STATUS_DB_NOTMODIFIED    = 0x06;

1   unfetched (links not yet fetched due to limits set in regex-urlfilter.txt, -topN crawl parameters, etc.)
2   fetched (page was successfully fetched)
3   gone (that page no longer exists)
4   redir_temp (temporary redirection -- see the reprUrl column for more details)
5   redir_perm (permanent redirection -- see the reprUrl column for more details)
34  retry (0x22)
38  not modified (0x26)
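
When inspecting the status column of the webpage table, a small lookup makes the numbers easier to read. The mapping below is taken from the list above; the helper names are hypothetical and unknown codes are reported rather than guessed.

```python
from collections import Counter

# Status codes as they appear in the `status` column, per the list above.
STATUS_NAMES = {
    0x01: "unfetched",
    0x02: "fetched",
    0x03: "gone",
    0x04: "redir_temp",
    0x05: "redir_perm",
    0x22: "retry",          # decimal 34
    0x26: "not modified",   # decimal 38
}


def status_name(code):
    """Return the readable name for a status code, or flag it as unknown."""
    return STATUS_NAMES.get(code, "unknown ({})".format(code))


def summarise(codes):
    """Count how many rows fall under each readable status name."""
    return Counter(status_name(code) for code in codes)


# Example: codes as they might come back from SELECT status FROM webpage
print(summarise([2, 1, 1, 1, 34]))
```

Feeding it the result of `SELECT status FROM webpage` gives a quick overview of how far the crawl has progressed.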