In our work, we needed an open source web crawler for gathering unstructured data. Here we have used
A> Apache Nutch for web crawling and
B> Apache Solr for unstructured web data indexing
The steps we followed to set up the complete environment are -
1> Downloaded Apache Solr (3.x)
2> Downloaded Apache Nutch (1.x)
We followed the tutorial from here.
We give all the steps below, highlighting the points on which we spent the most time. Our OS was Mac OS X.
In the terminal, setting two environment variables is important.
Check that the variables are set properly.
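The text does not name the two variables; for Nutch 1.x they are typically JAVA_HOME and NUTCH_JAVA_HOME. A minimal sketch, assuming those names (the JDK path below is hypothetical; on Mac OS X, /usr/libexec/java_home prints the real one):

```shell
# Assumed variable names; adjust the path to your own JDK location.
export JAVA_HOME="/Library/Java/Home"   # hypothetical JDK path
export NUTCH_JAVA_HOME="$JAVA_HOME"     # Nutch 1.x also reads this

# Check that the variables are set properly:
echo "JAVA_HOME=$JAVA_HOME"
echo "NUTCH_JAVA_HOME=$NUTCH_JAVA_HOME"
```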
Apache Nutch Installation -
Unzip Apache Nutch.
Run ‘bin/nutch’ to check whether Nutch is installed properly.
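The two steps above can be sketched in the terminal as follows (the archive name, version number, and install path are assumptions for a 1.x binary release):

```shell
# Assumed install location for a Nutch 1.x binary release.
NUTCH_HOME="$HOME/apache-nutch-1.7"

# Unpack only if the (hypothetically named) archive was already downloaded.
if [ -f "apache-nutch-1.7-bin.tar.gz" ]; then
  tar -xzf apache-nutch-1.7-bin.tar.gz -C "$HOME"
fi

# Running bin/nutch with no arguments prints a usage message
# when the installation is fine.
if [ -x "$NUTCH_HOME/bin/nutch" ]; then
  "$NUTCH_HOME/bin/nutch"
fi
echo "NUTCH_HOME=$NUTCH_HOME"
```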
Go to the <<apache nutch installation directory>>/conf folder.
Open the nutch-site.xml file and save it with the following XML snippet:
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>My Nutch Spider</value>
  </property>
</configuration>
- Create folder urls in <<apache nutch installation directory>>
- Create a file seed.txt inside the urls folder
- Enter the site you want to crawl, one URL per line. In our case it was http://www.phloxblog.in
- Execute following command in terminal -
bin/nutch crawl urls -depth 5 -topN 20
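The steps above, as terminal commands run from the Nutch installation directory (the seed URL is the one from the text):

```shell
# Run from <<apache nutch installation directory>>.
mkdir -p urls
echo "http://www.phloxblog.in" > urls/seed.txt

# Kick off the crawl (guarded, so this is a no-op outside a Nutch install).
if [ -x bin/nutch ]; then
  bin/nutch crawl urls -depth 5 -topN 20
fi
cat urls/seed.txt
```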
depth – the link depth from the seed URL up to which pages should be crawled.
topN – the maximum number of pages that will be retrieved at each level, up to the given depth. For example, with -depth 5 -topN 20 at most 20 pages are fetched per round, for up to 5 rounds.
During crawling, a folder named like crawl-(timestamp) will be created.
Within crawl-(timestamp) folder, there will be 3 folders named – crawldb, linkdb and segments.
In the crawldb folder, there will be a ‘current’ folder. Inside it, there will be a ‘part-00000’ folder containing another two files – data & index.
So all the crawled data will be stored in crawl-(timestamp) folder, if everything goes fine.
Troubleshooting – If you get an UnknownHostException in the data file,
1> check the internet connection.
2> follow the versions and steps again as we have mentioned.
Note that we have changed the command -
bin/nutch crawl urls -depth 5 -topN 20
which is slightly different from the Nutch wiki documentation.
So change the command as per your requirement after reading the Nutch command usage here.
Apache Solr Installation -
- Unzip Apache Solr
- Run java -jar start.jar from <<Apache Solr Directory>>/example directory.
- It will start Solr on port 8983 for a default Apache Solr installation.
- Hit the URL http://localhost:8983/solr/admin/ to check whether the Solr installation is fine.
- Close Apache Solr.
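A sketch of the same steps (the install path and version number are assumptions; Solr 3.x ships a Jetty-based server under the example directory):

```shell
# Assumed install location for a Solr 3.x release.
SOLR_HOME="$HOME/apache-solr-3.6.2"

# Start the bundled Jetty server; Solr listens on port 8983 by default.
# Guarded, so this is a no-op if Solr is not unpacked at that path.
if [ -d "$SOLR_HOME/example" ]; then
  (cd "$SOLR_HOME/example" && java -jar start.jar) &
fi

# Sanity check: the admin page should respond once Solr is up.
ADMIN_URL="http://localhost:8983/solr/admin/"
# curl -s "$ADMIN_URL" >/dev/null && echo "Solr is up"
echo "Admin UI: $ADMIN_URL"
```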
Note: for a detailed Apache Solr installation you can refer here.
Integration of Apache Nutch with Apache Solr -
To integrate Solr with Nutch:
- Copy schema.xml from conf folder of <<apache-nutch installation directory>>.
- Go to the <<apache solr installation directory>> and open the path example/solr/conf.
- Replace schema.xml with the copied one.
- Open the schema.xml file. Change the content field line to
<field name="content" type="text" stored="true" indexed="true"/>
so that the crawled content is stored along with the index.
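Concretely, in stock Nutch 1.x schemas the content field usually ships with stored="false" (the exact default in your copy is an assumption); the edit flips it to true so Solr can return the page body in search results:

```xml
<!-- Before (typical Nutch default; the stock value is an assumption): -->
<!-- <field name="content" type="text" stored="false" indexed="true"/> -->

<!-- After: store the crawled content so searches can return it. -->
<field name="content" type="text" stored="true" indexed="true"/>
```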
Execute the following command from terminal in <<Apache Nutch Installation Directory>>
bin/nutch solrindex http://127.0.0.1:8983/solr/ <<crawl-timestamp folder>>/crawldb -linkdb <<crawl-timestamp folder>>/linkdb <<crawl-timestamp folder>>/segments/*
This will complete indexing of all the crawled data from nutch.
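With a concrete (hypothetical) timestamped folder name substituted in, the command looks like this:

```shell
# Hypothetical crawl output folder -- substitute your own crawl-(timestamp).
CRAWL_DIR="crawl-20130501120000"

# Run from the Nutch installation directory, with Solr running on 8983.
# Guarded, so this is a no-op outside a real crawl setup.
if [ -d "$CRAWL_DIR" ] && [ -x bin/nutch ]; then
  bin/nutch solrindex http://127.0.0.1:8983/solr/ \
    "$CRAWL_DIR/crawldb" -linkdb "$CRAWL_DIR/linkdb" "$CRAWL_DIR"/segments/*
fi
echo "Indexing from $CRAWL_DIR"
```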
To be sure whether the above process completed successfully,
- Open http://localhost:8983/solr/admin/
- Click on the ‘Search’ button.
It will show all the data that has been indexed after crawling.
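Equivalently, you can query Solr directly from the terminal; the *:* query matches every indexed document (the select handler is the Solr 3.x default):

```shell
# Match-all query against Solr's default select handler.
QUERY_URL='http://localhost:8983/solr/select?q=*:*'

# Uncomment once Solr is running:
# curl -s "$QUERY_URL"
echo "Query URL: $QUERY_URL"
```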
So happy crawling. Comments are welcome.