09 October, 2012

Apache Nutch 2.1


Developer: Apache Software Foundation
Website: nutch.apache.org
License / Price: Apache License
Platforms: Windows / Linux / Mac OS / BSD / Solaris
Databases: N/A
Language: Java
Last Updated: October 9th, 2012, 15:57 GMT
Category: Search Engines

Nutch builds on Lucene Java, adding web-specific components such as a crawler, a link-graph database, and parsers for HTML and other document formats.

Nutch can run on a single machine, but it works better on a Hadoop cluster.

Plugins are available to extend its functionality.

What's New in This Release:

· Renamed HTMLParseFilter to ParseFilter.
· Remove remaining robots/IP blocking code in lib-http.
· Port logging to slf4j.
· External parser supports encoding attribute.
· Ivy configuration settings don't include Gora.
· Injector should add the metadata before calling injectedScore.
· Port Nutch benchmark to Nutchbase.
· Add parse-html back.
· MoreIndexingFilter missing date format.
· Timeout for Parser.
· Retry interval in crawl date is set to 0.
· Generate log output for solr indexer and dedup.
· Improved NutchConfiguration.
· SolrDeleteDuplicates needs to clone the SolrRecord objects.
· Native hadoop libs not available through maven.
· Separate the build and runtime environments.

