
Installing and Running Apache Nutch and Apache Solr for Crawling and Indexing Web Content

May 14, 2013


In our work, we needed an open source web crawler to gather unstructured data.

We used

A> Apache Nutch for web crawling and

B> Apache Solr for indexing the unstructured web data

The steps we followed to set up the complete environment are -

1> Download Apache Solr (3.x)

2> Download Apache Nutch (1.x)

We followed the tutorial from here.

We give all the steps here, with emphasis on the points that cost us the most time. Our OS was Mac OS X.

In the terminal, two environment variables must be set.

1> JAVA_HOME

export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home/

2> NUTCH_JAVA_HOME

export NUTCH_JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home/

Check that the variables are set properly.
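One way to do that check is sketched below. The helper name check_java_var is ours, not part of Nutch; it only reports whether a variable is set and points to an existing directory.

```shell
#!/bin/sh
# check_java_var NAME: report whether the variable NAME is set
# and points to an existing directory. (Hypothetical helper, not part of Nutch.)
check_java_var() {
    name=$1
    # Look up the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
        echo "$name is not set"
        return 1
    fi
    if [ ! -d "$value" ]; then
        echo "$name points to a missing directory: $value"
        return 1
    fi
    echo "$name=$value"
}

check_java_var JAVA_HOME || true
check_java_var NUTCH_JAVA_HOME || true
```

If either line reports a problem, run the export commands again in the same terminal session before starting Nutch.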

Apache Nutch Installation -

Unzip Apache Nutch.

Run ‘bin/nutch’ to check whether Nutch is installed properly.

Go to the <<apache nutch installation directory>>/conf folder.

Open the nutch-site.xml file.

Save it with the following XML snippet:

<configuration>
<property>
 <name>http.agent.name</name>
 <value>My Nutch Spider</value>
</property>
</configuration>
  • Create a folder named urls in <<apache nutch installation directory>>
  • Create a file seed.txt inside the urls folder
  • Enter the URL of the site you want to crawl. In our case it was http://www.phloxblog.in
  • Execute the following command in the terminal -

      bin/nutch crawl urls -depth 5 -topN 20

     Here,

     depth – the link depth from the main URL that should be crawled.

     topN – the maximum number of pages that will be retrieved at each level up to the depth.
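The seed and crawl steps above can be put together roughly as follows. This is a sketch that assumes it is run from the Nutch installation directory; the guard around bin/nutch is our addition so the script fails gently when run from the wrong folder.

```shell
#!/bin/sh
# Run from the Nutch installation directory.
SEED_URL="http://www.phloxblog.in"   # the site crawled in this post

# Create the urls folder and write the seed URL into urls/seed.txt.
mkdir -p urls
printf '%s\n' "$SEED_URL" > urls/seed.txt

# Only attempt the crawl if the nutch launcher is present and executable.
if [ -x bin/nutch ]; then
    bin/nutch crawl urls -depth 5 -topN 20
else
    echo "bin/nutch not found; run this from the Nutch installation directory"
fi
```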

During crawling, Nutch creates a folder named crawl-(timestamp).

Within the crawl-(timestamp) folder, there will be 3 folders named crawldb, linkdb and segments.

In the crawldb folder, there will be a ‘current’ folder. In the current folder, there will be a ‘part-00000’ folder, and within that two files: data and index.

So all the crawled data will be stored in the crawl-(timestamp) folder, if everything goes fine.
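To confirm that layout without clicking through folders, something like the following can help. The function name check_crawl_dir is hypothetical, and the script assumes the crawl-(timestamp) folder sits in the current directory.

```shell
#!/bin/sh
# check_crawl_dir DIR: verify that crawldb, linkdb and segments exist in DIR.
# (Hypothetical helper, not part of Nutch.)
check_crawl_dir() {
    dir=$1
    rc=0
    for sub in crawldb linkdb segments; do
        if [ -d "$dir/$sub" ]; then
            echo "ok: $dir/$sub"
        else
            echo "missing: $dir/$sub"
            rc=1
        fi
    done
    return $rc
}

# Pick the most recently modified crawl-* folder, if any exists.
latest=$(ls -dt crawl-*/ 2>/dev/null | head -n 1) || true
if [ -n "$latest" ]; then
    check_crawl_dir "${latest%/}" || true
else
    echo "no crawl-* folder found in $(pwd)"
fi
```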

Troubleshooting – If you get an UnknownHostException in the data file, then

1> Check the internet connection.

2> Follow the versions and steps again as we have mentioned.

Note that the command we used,

bin/nutch crawl urls -depth 5 -topN 20

is slightly different from the one in the Nutch wiki documentation.

So change the command as per your requirement, after reading the nutch command usage here.

Apache Solr Installation -

  • Unzip Apache Solr
  • Run java -jar start.jar from the <<Apache Solr Directory>>/example directory.
  • In a default Apache Solr installation, this starts Solr on port 8983.
  • Open the URL http://localhost:8983/solr/admin/ to check whether the Solr installation is fine.
  • Close Apache Solr.

Note: for a detailed Apache Solr installation you can refer here.

Integration of Apache Nutch with Apache Solr -

To integrate Solr with Nutch:

  1. Copy schema.xml from the conf folder of <<apache-nutch installation directory>>.
  2. Go to the <<apache solr installation directory>>. In that folder, open the path example>solr>conf.
  3. Replace schema.xml with the copied one.
  4. Open the schema.xml file. Change the line for the content field to
     <field name="content" type="text" stored="true" indexed="true"/> so that the crawled content is stored as well as indexed.

Start Solr again (java -jar start.jar from the example directory), then execute the following command from the terminal in <<Apache Nutch Installation Directory>>

bin/nutch solrindex http://127.0.0.1:8983/solr/ <<the folder name that has been created during crawling-crawl-timestamp>>/crawldb -linkdb <<the folder name that has been created during crawling-crawl-timestamp>>/linkdb <<the folder name that has been created during crawling-crawl-timestamp>>/segments/*

This completes the indexing of all the data crawled by Nutch.
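Because the crawl folder name appears several times in that command, it is easy to mistype. The sketch below prints the full command for a given folder. The function name solrindex_cmd and the folder name crawl-20130514120000 are made up for illustration, and we assume that crawldb, linkdb and segments all live in the same crawl-(timestamp) folder.

```shell
#!/bin/sh
# solrindex_cmd SOLR_URL CRAWL_DIR: print the indexing command for one crawl folder.
# (Hypothetical helper; it only prints the command and does not run Nutch.)
solrindex_cmd() {
    solr_url=$1
    crawl_dir=${2%/}   # drop a trailing slash, if any
    echo "bin/nutch solrindex $solr_url $crawl_dir/crawldb -linkdb $crawl_dir/linkdb $crawl_dir/segments/*"
}

SOLR_URL="http://127.0.0.1:8983/solr/"
CRAWL_DIR="crawl-20130514120000"   # hypothetical name; use the folder Nutch created

solrindex_cmd "$SOLR_URL" "$CRAWL_DIR"
```

Copy the printed line into the terminal from the Nutch installation directory once Solr is running.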

To check whether the above process completed properly, query Solr.

It will show all the data that have been indexed after crawling.
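One possible way to run such a check from the terminal is a match-all query against Solr's standard select handler. This is a sketch: the helper solr_select_url is ours, and the rows and wt=json parameters are just convenient choices, assuming a default Solr on port 8983.

```shell
#!/bin/sh
# solr_select_url BASE QUERY ROWS: build a URL for Solr's /select handler.
# (Hypothetical helper, not part of Solr or Nutch.)
solr_select_url() {
    base=${1%/}   # drop a trailing slash, if any
    echo "$base/select?q=$2&rows=$3&wt=json"
}

# q=*:* matches every indexed document; rows limits how many are returned.
url=$(solr_select_url "http://localhost:8983/solr/" "*:*" 5)
echo "$url"

# Fetch the first few documents if curl is installed and Solr is reachable.
if command -v curl >/dev/null 2>&1; then
    curl -s --max-time 5 "$url" || echo "Solr did not respond at $url"
fi
```

The same URL can also be pasted into a browser while Solr is running.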

So happy crawling. Comments are welcome.
