Nutch 2.1: building and debugging in Eclipse on Windows with MySQL as the storage backend

2022-06-29 20:03:00 【Brother Xing plays with the clouds】

Step 1: Prepare Eclipse, the Eclipse SVN plug-in, and MySQL. Configure MySQL to use UTF-8 encoding.

Step 2: Create the database and the table in MySQL:

CREATE DATABASE nutch;
USE nutch;

CREATE TABLE `webpage` (
  `id` varchar(767) NOT NULL,
  `headers` blob,
  `text` mediumtext DEFAULT NULL,
  `status` int(11) DEFAULT NULL,
  `markers` blob,
  `parseStatus` blob,
  `modifiedTime` bigint(20) DEFAULT NULL,
  `score` float DEFAULT NULL,
  `typ` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
  `baseUrl` varchar(767) DEFAULT NULL,
  `content` longblob,
  `title` varchar(2048) DEFAULT NULL,
  `reprUrl` varchar(767) DEFAULT NULL,
  `fetchInterval` int(11) DEFAULT NULL,
  `prevFetchTime` bigint(20) DEFAULT NULL,
  `inlinks` mediumblob,
  `prevSignature` blob,
  `outlinks` mediumblob,
  `fetchTime` bigint(20) DEFAULT NULL,
  `retriesSinceFetch` int(11) DEFAULT NULL,
  `protocolStatus` blob,
  `signature` blob,
  `metadata` blob,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB ROW_FORMAT=COMPRESSED DEFAULT CHARSET=utf8mb4;
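How long the `id` primary key can be depends on InnoDB's index key limit, which is counted in bytes, not characters. The sketch below is only rough arithmetic. It assumes the commonly documented limits (767 bytes for the older COMPACT and REDUNDANT row formats, 3072 bytes when large index prefixes are available with DYNAMIC or COMPRESSED) and up to 4 bytes per character for utf8mb4. The class and method names are made up for illustration and are not part of Nutch or MySQL.

```java
// Rough arithmetic for how many characters fit in an InnoDB index key.
// The byte limits below are assumptions taken from general MySQL documentation;
// check them against your own server version and settings.
class KeyPrefixCheck {
    static final int SMALL_PREFIX_BYTES = 767;   // COMPACT / REDUNDANT row formats
    static final int LARGE_PREFIX_BYTES = 3072;  // DYNAMIC / COMPRESSED with large prefixes

    // Largest VARCHAR length (in characters) that can be fully indexed.
    static int maxIndexedChars(int limitBytes, int bytesPerChar) {
        return limitBytes / bytesPerChar;
    }

    // True if a VARCHAR(varcharLength) key column fits under the byte limit.
    static boolean fits(int varcharLength, int bytesPerChar, int limitBytes) {
        return (long) varcharLength * bytesPerChar <= limitBytes;
    }

    public static void main(String[] args) {
        int utf8mb4 = 4; // maximum bytes per character
        System.out.println(maxIndexedChars(SMALL_PREFIX_BYTES, utf8mb4)); // 191
        System.out.println(fits(767, utf8mb4, SMALL_PREFIX_BYTES));       // false
        System.out.println(fits(767, utf8mb4, LARGE_PREFIX_BYTES));       // true
        System.out.println(fits(100, utf8mb4, SMALL_PREFIX_BYTES));       // true
    }
}
```

Under these assumptions, a varchar(767) key is accepted only when the server actually applies ROW_FORMAT=COMPRESSED with large prefixes enabled. If it does not, a much shorter key is the safe fallback.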

The definition `id` varchar(767) NOT NULL did not work on my machine, where the length could be set to at most 100, so I used this instead: `id` varchar(100) NOT NULL

Step 3: Check out the code from https://svn.apache.org/repos/asf/nutch/tags/release-2.1 and create a local Java project from it. I had already made several attempts, so my project is named test.

Step 4: Add the source folders. In Project Explorer, right-click the project and choose Properties. Go to Java Build Path and, on the Source tab, remove the src folder. Then choose "Add Folder" and add conf, src/bin, src/java, src/test, src/testresources, and the src and test folders of every plug-in under src/plugin. The result looks like the following (test is the project name):

Every Eclipse project folder contains a .classpath file. Opening it shows content that looks roughly like this:

<classpathentry kind="src" path="conf"/>
<classpathentry kind="src" path="src/java"/>
<classpathentry kind="src" path="src/test"/>
<classpathentry kind="src" path="src/plugin/protocol-file/src/test"/>
<classpathentry kind="src" path="src/plugin/protocol-httpclient/src/test"/>
<classpathentry kind="src" path="src/plugin/subcollection/src/test"/>
<classpathentry kind="src" path="src/plugin/parse-html/src/test"/>
<classpathentry kind="src" path="src/plugin/urlfilter-automaton/src/test"/>
<classpathentry kind="src" path="src/plugin/parse-html/src/java"/>
<classpathentry kind="src" path="src/plugin/parse-tika/src/test"/>
<classpathentry kind="src" path="src/plugin/lib-http/src/test"/>
<classpathentry kind="src" path="src/plugin/parse-tika/src/java"/>
<classpathentry kind="src" path="src/plugin/urlfilter-regex/src/java"/>
<classpathentry kind="src" path="src/plugin/urlfilter-domain/src/java"/>
<classpathentry kind="src" path="src/plugin/scoring-link/src/java"/>
<classpathentry kind="src" path="src/plugin/index-anchor/src/test"/>
<classpathentry kind="src" path="src/plugin/protocol-http/src/java"/>
<classpathentry kind="src" path="src/plugin/urlnormalizer-regex/src/test"/>
<classpathentry kind="src" path="src/plugin/urlfilter-prefix/src/java"/>
<classpathentry kind="src" path="src/plugin/scoring-opic/src/java"/>
<classpathentry kind="src" path="src/plugin/urlfilter-domain/src/test"/>
<classpathentry kind="src" path="src/plugin/protocol-file/src/java"/>
<classpathentry kind="src" path="src/plugin/urlnormalizer-regex/src/java"/>
<classpathentry kind="src" path="src/plugin/urlfilter-suffix/src/java"/>
<classpathentry kind="src" path="src/plugin/language-identifier/src/java"/>
<classpathentry kind="src" path="src/plugin/lib-regex-filter/src/test"/>
<classpathentry kind="src" path="src/plugin/language-identifier/src/test"/>
<classpathentry kind="src" path="src/plugin/subcollection/src/java"/>
<classpathentry kind="src" path="src/plugin/urlnormalizer-basic/src/test"/>
<classpathentry kind="src" path="src/plugin/index-basic/src/java"/>
<classpathentry kind="src" path="src/plugin/urlnormalizer-pass/src/test"/>
<classpathentry kind="src" path="src/plugin/creativecommons/src/java"/>
<classpathentry kind="src" path="src/bin"/>
<classpathentry kind="src" path="src/plugin/protocol-httpclient/src/java"/>
<classpathentry kind="src" path="src/plugin/tld/src/java"/>
<classpathentry kind="src" path="src/plugin/urlnormalizer-basic/src/java"/>
<classpathentry kind="src" path="src/plugin/index-basic/src/test"/>
<classpathentry kind="src" path="src/plugin/lib-http/src/java"/>
<classpathentry kind="src" path="src/plugin/protocol-ftp/src/java"/>
<classpathentry kind="src" path="src/plugin/index-anchor/src/java"/>
<classpathentry kind="src" path="src/plugin/urlfilter-validator/src/java"/>
<classpathentry kind="src" path="src/plugin/index-more/src/java"/>
<classpathentry kind="src" path="src/plugin/urlfilter-suffix/src/test"/>
<classpathentry kind="src" path="src/plugin/creativecommons/src/test"/>
<classpathentry kind="src" path="src/plugin/microformats-reltag/src/java"/>
<classpathentry kind="src" path="src/plugin/urlfilter-regex/src/test"/>
<classpathentry kind="src" path="src/plugin/lib-regex-filter/src/java"/>
<classpathentry kind="src" path="src/plugin/index-more/src/test"/>
<classpathentry kind="src" path="src/plugin/urlnormalizer-pass/src/java"/>
<classpathentry kind="src" path="src/plugin/urlfilter-automaton/src/java"/>
<classpathentry kind="src" path="src/testresources"/>
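Adding every plug-in folder by hand is tedious, so the entries can also be generated and pasted into .classpath. The following is a minimal sketch, assuming the checkout has the src/plugin/<name>/src/java and src/plugin/<name>/src/test layout shown above. ClasspathEntries is a hypothetical helper, not something shipped with Nutch or Eclipse.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: prints <classpathentry> lines for each plug-in source folder.
class ClasspathEntries {

    // Formats one source entry the way it appears in .classpath.
    static String entry(String relativePath) {
        return "<classpathentry kind=\"src\" path=\"" + relativePath + "\"/>";
    }

    // Collects existing src/java and src/test folders under <projectRoot>/src/plugin.
    static List<String> pluginEntries(Path projectRoot) {
        List<String> entries = new ArrayList<>();
        Path pluginRoot = projectRoot.resolve("src").resolve("plugin");
        try (DirectoryStream<Path> plugins = Files.newDirectoryStream(pluginRoot)) {
            for (Path plugin : plugins) {
                for (String sub : new String[] {"java", "test"}) {
                    if (Files.isDirectory(plugin.resolve("src").resolve(sub))) {
                        entries.add(entry("src/plugin/" + plugin.getFileName() + "/src/" + sub));
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Collections.sort(entries); // stable output regardless of directory order
        return entries;
    }

    public static void main(String[] args) {
        // First argument: project root (defaults to the current directory).
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        for (String line : pluginEntries(root)) {
            System.out.println(line);
        }
    }
}
```

Run it from the project root and paste the output between the existing entries in .classpath. Eclipse picks up the change after a project refresh.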

Step 5: Add the library JARs. Switch to the Libraries tab and choose "Add Library" -> "IvyDE Managed Dependencies" -> "Next". Select "Project", pick the ivy\ivy.xml file, and click OK. Eclipse then downloads the dependent JARs automatically.

Errors may be reported during this step. In my case the error message showed that the org.restlet.jse packages could not be downloaded. To work around this, find the following entries in ivy\ivy.xml and comment them out:

<dependency org="org.restlet.jse" name="org.restlet" rev="2.0.5" conf="*->default" />
<dependency org="org.restlet.jse" name="org.restlet.ext.jackson" rev="2.0.5" conf="*->default" />

Then find these two JARs online, put them in the lib folder, and add them on the Libraries tab.

Next, add the ivy.xml file of each plug-in under the plugin folder in the same way. These have to be added manually, one by one.

Step 6: On the "Order and Export" tab, move conf to the top.

Step 7: Database and other configuration. Open /conf/gora.properties, delete everything in the file, and write the MySQL configuration:

###############################
# MySQL properties            #
###############################
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch?createDatabaseIfNotExist=true
gora.sqlstore.jdbc.user=root
gora.sqlstore.jdbc.password=123456
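A typo in one of these keys tends to surface only later as an obscure storage error, so it can help to check the file before the first crawl. The sketch below uses only java.util.Properties and the four key names from the configuration above. GoraPropertiesCheck and the default path conf/gora.properties are assumptions for illustration, and the check does not open a database connection.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical helper: reports which MySQL-related Gora keys are missing or empty.
class GoraPropertiesCheck {
    static final String[] REQUIRED = {
        "gora.sqlstore.jdbc.driver",
        "gora.sqlstore.jdbc.url",
        "gora.sqlstore.jdbc.user",
        "gora.sqlstore.jdbc.password"
    };

    // Parses the text of a .properties file and returns the required keys it lacks.
    static List<String> missingKeys(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<String> missing = new ArrayList<>();
        for (String key : REQUIRED) {
            String value = props.getProperty(key);
            if (value == null || value.trim().isEmpty()) {
                missing.add(key);
            }
        }
        return missing;
    }

    public static void main(String[] args) throws IOException {
        // Assumed location relative to the project root; pass another path as the first argument.
        String path = args.length > 0 ? args[0] : "conf/gora.properties";
        String text = new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.ISO_8859_1);
        List<String> missing = missingKeys(text);
        System.out.println(missing.isEmpty() ? "All required keys present" : "Missing: " + missing);
    }
}
```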

In /conf/gora-sql-mapping.xml, change the primary key entry to:

<primarykey column="id" length="240"/>

In /conf/nutch-site.xml, add:

<property>
  <name>http.agent.name</name>
  <value>Your Nutch Spider</value>
</property>

<property>
  <name>http.accept.language</name>
  <value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
  <description>Value of the "Accept-Language" request header field. This allows selecting non-English language as default one to retrieve. It is a useful setting for search engines build for certain national group.</description>
</property>

<property>
  <name>parser.character.encoding.default</name>
  <value>utf-8</value>
  <description>The character encoding to fall back to when no other information is available</description>
</property>

<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
  <description>Regular expression naming plugin directory names to include. Any plugin not matching this expression is excluded. In any case you need at least include the nutch-extensionpoints plugin. By default Nutch includes crawling just HTML and plain text via HTTP, and basic indexing and search plugins. In order to use HTTPS please enable protocol-httpclient, but be aware of possible intermittent problems with the underlying commons-httpclient library.</description>
</property>

<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.sql.store.SqlStore</value>
  <description>The Gora DataStore class for storing and retrieving data. Currently the following stores are available: ….</description>
</property>

<property>
  <name>plugin.folders</name>
  <value>./src/plugin</value>
  <description>Directories where nutch plugins are located. Each element may be a relative or absolute path. If absolute, it is used as is. If relative, it is searched for on the classpath.</description>
</property>
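Because plugin.includes is a regular expression, it is easy to enable or exclude a plug-in by accident. The sketch below tests a list of plug-in directory names against the expression using java.util.regex. It assumes a name is included only when the whole name matches, which is my reading of the property description rather than a statement about Nutch's internal code. PluginIncludesCheck is a hypothetical helper.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: shows which plug-in names a plugin.includes expression selects.
class PluginIncludesCheck {

    // Returns the names that match the expression as a whole (assumed semantics).
    static List<String> included(String includesRegex, List<String> pluginNames) {
        Pattern pattern = Pattern.compile(includesRegex);
        List<String> result = new ArrayList<>();
        for (String name : pluginNames) {
            if (pattern.matcher(name).matches()) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String includes = "protocol-httpclient|protocol-http|urlfilter-regex|parse-(html|tika)"
            + "|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic";
        List<String> names = Arrays.asList(
            "protocol-http", "protocol-file", "parse-tika", "index-more", "scoring-opic");
        // Prints [protocol-http, parse-tika, scoring-opic]
        System.out.println(included(includes, names));
    }
}
```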

In build.xml in the project root, find the following target:

<target name="resolve-default" depends="clean-lib, init" description="--> resolve and retrieve dependencies with ivy">
  <ivy:resolve file="${ivy.file}" conf="default" log="download-only" />
  <ivy:retrieve pattern="${build.lib.dir}/[artifact]-[revision].[ext]" symlink="false" log="quiet" />
  <antcall target="copy-libs" />
</target>

Replace pattern="${build.lib.dir}/[artifact]-[revision].[ext]" with pattern="${build.lib.dir}/[artifact]-[type]-[revision].[ext]".

Step 8: Configure the URLs to crawl. Create a folder named urls under the test project, create a file named seeds.txt in it, and write the sites you want to crawl, one per line. I used http://www.163.com.

Step 9: Run org.apache.nutch.crawl.Crawler. Open the Crawler file and choose "Run As" -> "Run Configurations". On the "Arguments" tab, under "Program Arguments", enter "urls -depth 3 -topN 5" and click "Run".

The first run failed with an error similar to "Failed to set permissions of path: \tmp\Hadoop-Administrator\mapred\staging\Administrator1712398257\.". This is a Hadoop problem on Windows. The fix is to edit checkReturnValue in /hadoop-1.0.2/src/core/org/apache/hadoop/fs/FileUtil.java and comment out the check inside it. The easiest way, of course, is to find an already patched JAR online and replace FileUtil.class. After that, run again and the crawl completes successfully.
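The seed folder from step 8 can also be created from code, which is convenient when the project is set up repeatedly. This is a minimal sketch, assuming the urls/seeds.txt layout described in step 8 and one URL per line. SeedFile is a hypothetical helper and uses only java.nio, not any Nutch API.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: writes <projectRoot>/urls/seeds.txt with one URL per line.
class SeedFile {

    // Creates the urls folder if needed and overwrites seeds.txt with the given URLs.
    static Path writeSeeds(Path projectRoot, List<String> urls) {
        try {
            Path dir = Files.createDirectories(projectRoot.resolve("urls"));
            Path seeds = dir.resolve("seeds.txt");
            Files.write(seeds, urls, StandardCharsets.UTF_8);
            return seeds;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // First argument: project root (defaults to the current directory).
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        Path seeds = writeSeeds(root, Arrays.asList("http://www.163.com"));
        System.out.println("Wrote " + seeds.toAbsolutePath());
    }
}
```

The "urls" program argument in step 9 is then a path relative to the working directory of the run configuration, which in Eclipse is normally the project root.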

Good luck to you .

Problems encountered:

1. The run fails with: Exception in thread "main" java.lang.RuntimeException: job failed: name=parse, jobid=job_local_0004

Searching online shows that this can have many causes. First, check that plugin.folders is configured in nutch-default.xml: <name>plugin.folders</name><value>./src/plugin</value>. Second, look in the hadoop.log file for the underlying error.
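Since "job failed" hides the real cause, the useful lines in hadoop.log are usually the ERROR entries and the exception names. The sketch below filters a log for such lines. The default path logs/hadoop.log and the markers it searches for (" ERROR ", " FATAL ", "Exception", "Caused by:") are assumptions about a typical log4j layout, so adjust them to what your log actually contains. LogScan is a hypothetical helper.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: keeps only the log lines that look like errors.
class LogScan {

    // Returns lines with an ERROR/FATAL level marker or an exception name.
    static List<String> errorLines(List<String> lines) {
        List<String> hits = new ArrayList<>();
        for (String line : lines) {
            if (line.contains(" ERROR ") || line.contains(" FATAL ")
                    || line.contains("Exception") || line.startsWith("Caused by:")) {
                hits.add(line);
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // Assumed log location; pass the real path as the first argument.
        String path = args.length > 0 ? args[0] : "logs/hadoop.log";
        List<String> lines = Files.readAllLines(Paths.get(path), StandardCharsets.UTF_8);
        for (String hit : errorLines(lines)) {
            System.out.println(hit);
        }
    }
}
```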

Copyright notice
This article was written by [Brother Xing plays with the clouds]. Please include the original link when reposting. Thank you.
https://yzsam.com/2022/180/202206291956462069.html
