Data collection and management [4]
2022-06-29 03:44:00 [Star drawing question bank]
1. In incremental crawlers, () refers to: the crawler visits all web pages with the same frequency, regardless of how often the pages change.
A. Uniform update method
B. Classification-based update method
C. Random update method
D. Individual update method
2. () targets pages that cannot be reached through static links and are hidden behind search forms, obtainable only after a user submits keywords.
A. Focused crawler
B. Deep web crawler
C. General-purpose web crawler
D. Incremental web crawler
3. In an HTTP request, the () field may contain information such as "Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit…".
A. Cookie
B. Host
C. User-Agent
D. Connection
4. Which of the following statements about crawling strategies is incorrect? ()
A. The breadth-first strategy can effectively control the crawling depth of pages and avoids getting stuck when an infinitely deep branch is encountered.
B. The depth-first strategy is better suited to vertical search or site-specific search, but crawling sites with deep content hierarchies wastes enormous resources.
C. A drawback of the depth-first strategy is that it takes a long time to reach pages deep in the directory hierarchy.
D. Common crawling strategies for general-purpose web crawlers are the depth-first strategy and the breadth-first strategy.
5. In the data preprocessing acronym ETL, the L stands for ().
A. Cleaning
B. Extraction
C. Loading
D. Transformation
6. The basic approach of () is to follow web page links level by level, in order of increasing depth, until no further progress can be made.
A. PageRank-priority strategy
B. Random crawling strategy
C. Breadth-first strategy
D. Depth-first strategy
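The breadth-first and depth-first strategies in questions 4 and 6 differ only in how the frontier of unvisited links is managed. A minimal sketch, with a made-up in-memory link graph standing in for a real site (the `crawl` function and the `site` dict are illustrative, not from any crawler library):

```python
from collections import deque

def crawl(link_graph, seed, strategy="bfs"):
    """Traverse a site's link graph from a seed URL.

    strategy="bfs": frontier is a FIFO queue -> shallow pages crawled first.
    strategy="dfs": frontier is a LIFO stack -> each branch is followed to
    its deepest page before backtracking.
    """
    frontier = deque([seed])
    visited = []
    seen = {seed}
    while frontier:
        url = frontier.popleft() if strategy == "bfs" else frontier.pop()
        visited.append(url)
        for link in link_graph.get(url, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited

# Hypothetical site: /a and /b at depth 1, /a/x at depth 2.
site = {"/": ["/a", "/b"], "/a": ["/a/x"], "/b": []}
print(crawl(site, "/", "bfs"))  # visits both depth-1 pages before /a/x
print(crawl(site, "/", "dfs"))  # dives to /a/x before finishing depth 1... per branch
```

The single `strategy` switch is the whole difference: the same loop becomes breadth-first or depth-first depending on which end of the deque is popped.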
7. To collect specified data you need a (): because it only crawls pages related to its topic, it greatly saves hardware and network resources, and since the saved pages are few, they are refreshed quickly.
A. Incremental web crawler
B. Deep web crawler
C. General-purpose web crawler
D. Focused crawler
8. The HTTP () method can be used to query the server's capabilities, or to query options and requirements related to a resource.
A. TRACE
B. OPTIONS
C. PUT
D. DELETE
9. Resources requested over HTTP or HTTPS are identified by ().
A. TCP
B. FTP
C. URL
D. HTML
10. The () search strategy follows web page links level by level, in order of increasing depth, until no further progress can be made; it is better suited to vertical search or site-specific search.
A. Depth-first
B. Based on target characteristics
C. Breadth-first
D. Based on domain
11. Which of the following statements about web crawlers is incorrect? ()
A. A web crawler is essentially a program that "browses the web automatically", a kind of internet robot.
B. Web crawlers are widely used by internet search engines and other similar websites.
C. At present, most information classification on the internet is done manually.
D. A traditional crawler starts from the URLs of one or more initial web pages; while fetching pages, it continually extracts new URLs into a queue until certain stop conditions of the system are met.
12. A single HTTP transaction consists of ().
A. One request
B. Two requests
C. One request and one response
D. One response
13. Which of the following is a deep web page? ()
A. A page whose content is visible only after the user registers
B. A home page
C. A static page reachable through hyperlinks
D. A site navigation page
14. In an HTTP request, the () method sends data by appending it to the URL: a ? separates the URL from the data, and parameters are joined by &.
A. POST
B. GET
C. TRACE
D. PUT
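The ?-and-& layout described in question 14 can be built and taken apart with the standard library; the host and parameter names below are made up for illustration:

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# GET parameters ride in the URL itself: "?" separates the path from the
# query string, "&" separates the key=value pairs.
params = {"q": "web crawler", "page": "2"}
url = "https://example.com/search?" + urlencode(params)
print(url)  # https://example.com/search?q=web+crawler&page=2

# The receiving side can split the same URL back into its parameters.
query = parse_qs(urlsplit(url).query)
print(query)  # {'q': ['web crawler'], 'page': ['2']}
```

Note that `urlencode` also percent-escapes characters (here the space becomes `+`), which is why query data should never be pasted into a URL by hand.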
15. Pursuing high data quality is an important requirement of big data. Eliminating the unpredictability of some data and removing "noisy", "dirty" data involves () technology.
A. Data forecasting
B. Data collection
C. Data cleaning
D. Data statistics
16. Which of the following does NOT match the regular expression "^[\d]+$"? ()
A. 123
B. 10
C. 12
D. 12abc
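Question 16 can be checked directly: the anchors `^` and `$` force the whole string to be digits, so a trailing letter breaks the match. A quick sketch:

```python
import re

pattern = re.compile(r"^[\d]+$")  # one or more digits, anchored at both ends

# Only all-digit strings match; "12abc" fails because of the trailing letters.
for s in ["123", "10", "12", "12abc"]:
    print(s, bool(pattern.match(s)))
```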
17. Data transformation does not include ().
A. Conversion of inconsistent data
B. Incomplete data
C. Calculation of business rules
D. Transformation of data granularity
18. In regular expression syntax, () matches any non-whitespace character.
A. \d
B. \S
C. \w
D. \W
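The four classes in question 18 are easy to confuse; a small demonstration of how they partition single characters (the sample characters are arbitrary):

```python
import re

# \d = digit, \w = word character (letter, digit, underscore),
# \S = any NON-whitespace character, \W = any NON-word character.
assert re.fullmatch(r"\S", "x")          # a letter is not whitespace
assert re.fullmatch(r"\S", "#")          # punctuation is not whitespace either
assert re.fullmatch(r"\S", " ") is None  # a space IS whitespace
assert re.fullmatch(r"\w", "#") is None  # '#' is not a word character
assert re.fullmatch(r"\W", "#")          # ...so \W accepts it
```

So `\S` is the broadest of the four: it rejects only spaces, tabs, and newlines.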
19. An HTTP response status code of 403 means ().
A. The server connection timed out
B. The request succeeded
C. The server is busy
D. Access to the requested page is forbidden
20. In an HTTP response, () gives the length of the entity body, expressed as a decimal number of bytes.
A. Content-Range
B. Content-Length
C. Content-Encoding
D. Content-Language
21. The PageRank-priority strategy is typically used by ().
A. All web crawlers
B. Deep web crawlers
C. Incremental web crawlers
D. General-purpose web crawlers
22. Which of the following does NOT match the regular expression "^[\w]+$"? ()
A. S1
B. S+1
C. S_1
D. 12
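Question 22 hinges on what `\w` covers: letters, digits, and the underscore, but not `+`. Checking the four options:

```python
import re

word_only = re.compile(r"^[\w]+$")  # letters, digits, underscore only

candidates = ["S1", "S+1", "S_1", "12"]
matches = [s for s in candidates if word_only.match(s)]
print(matches)  # "S+1" drops out: "+" is not in \w
```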
23. The () of data quality indicates whether the data correctly represents a real or verifiable source.
A. Consistency
B. Completeness
C. Integrity
D. Correctness
24. Which of the following is an HTTP request header? ()
A. User-Agent
B. Content-Length
C. Expires
D. Accept-Ranges
25. The HTTP request header () keeps the client's connection to the server alive, so that subsequent requests to the server avoid establishing or re-establishing a connection.
A. Host
B. Referer
C. Cookie
D. Keep-Alive
26. A deep web crawler that fills in forms based on web page structure analysis generally represents the form as a () and extracts the value of each form field from it.
A. Image
B. BOM tree
C. DOM tree
D. Text
27. The task of data () is to filter out data that does not meet requirements.
A. Transformation
B. Extraction
C. Cleaning
D. Loading
28. Which of the following is NOT a standardization or integrity requirement of data quality? ()
A. Information from legacy systems is consistent with other modules
B. Referential integrity is not violated: data never fails to find what it references
C. The data is internally consistent
D. There are no cross-system matching violations; the data is well integrated
29. A client that initiates an HTTP request to a specified port on a server, using a web browser, a web crawler, or another tool, is called ().
A. A user agent
B. An origin server
C. A player
D. A database
30. In an HTTP request, the () header field contains information about the requesting user.
A. User-Agent
B. Referer
C. Cookie
D. Authorization
31. In an HTTP response message, status code 200 means ().
A. The requested page has moved to a new URL
B. Login failed
C. The request succeeded
D. Access is forbidden
32. If an HTTP request returns 404, which of the following measures should be taken? ()
A. Check again whether the requested page address is correct
B. Report the fault to the network administrator
C. Check browser permissions
D. Ask the administrator for a user name and password
33. The regular expression [a-z] matches ().
A. Any lowercase character in the range "a" to "z"
B. The alphabetic character "a" or "z"
C. Any alphabetic character in the range "a" to "z"
D. The lowercase character "a" or "z"
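The character class in question 33 matches exactly one character from the range, which rules out options B, C, and D. A quick check:

```python
import re

lower = re.compile(r"[a-z]")  # character class: the RANGE a through z

assert lower.fullmatch("m")             # any lowercase letter in the range
assert lower.fullmatch("a") and lower.fullmatch("z")
assert lower.fullmatch("M") is None     # uppercase lies outside the range
assert lower.fullmatch("ab") is None    # the class matches ONE character only
```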
34. Which of the following is NOT an HTTP request method? ()
A. TRACE
B. SUBMIT
C. POST
D. GET
35. The HTTP POST method asks the server to store a resource, using the Request-URI as its identifier.
A. True
B. False
36. In an HTTP request, the Range header field contains information about the requesting user.
A. True
B. False
37. The basic approach of the depth-first strategy is to follow web page links level by level, in order of increasing depth.
A. True
B. False
38. Network data collection cannot process unstructured data.
A. True
B. False
39. HTTP status code 500 indicates that the request failed because of the client.
A. True
B. False
40. Focused web crawlers place relatively low requirements on the order in which pages are crawled.
A. True
B. False
41. A web server keeps no information about the browser process that sent a request.
A. True
B. False
42. For systems with large amounts of data, one-time full extraction is also generally used.
A. True
B. False
43. In an HTTP request, Cookie indicates the client type.
A. True
B. False
44. Focused crawlers are also called topic crawlers.
A. True
B. False
45. The volume of accessible information in surface pages (Surface Web) is […] that of deep pages (Deep Web).
A. True
B. False
46. In an HTTP request, the Referer header field contains information about the requesting user.
A. True
B. False
47. Lazy (non-greedy) mode in regular expressions matches as much text as possible.
A. True
B. False
48. Data quality manifests concretely as correctness, integrity, consistency, completeness, validity, timeliness, availability, and so on.
A. True
B. False
49. A web crawler is essentially a program that "browses the web automatically".
A. True
B. False
50. The HTTP TRACE method queries the server's capabilities, or queries options and requirements related to a resource.
A. True
B. False
51. The depth-first strategy is better suited to vertical search or site-specific search, but crawling sites with deep content hierarchies wastes enormous resources.
A. True
B. False
52. Regular expressions consist of ordinary characters and metacharacters.
A. True
B. False
53. The HTTP PUT method asks the server to send back the request it received, and is mainly used for testing or diagnosis.
A. True
B. False
54. The information stored in deep web pages accounts for only a small fraction of the information on the internet.
A. True
B. False
55. A focused crawler only needs to crawl pages related to its topic.
A. True
B. False
56. The size of data submitted via POST is limited, to at most 1024 bytes.
A. True
B. False
57. HTTP request header fields may include Authorization, Referer, Content-Type, Content-Encoding, and others.
A. True
B. False
58. In an HTTP request, the Range header field can request one or more sub-ranges of an entity.
A. True
B. False
59. The breadth-first strategy crawls pages according to the depth of the content directory; pages at shallower directory levels are crawled first.
A. True
B. False
60. Data cleaning is a one-time process.
A. True
B. False
61. A web crawler always starts from some starting point, which is called a seed.
A. True
B. False
62. General-purpose web crawlers place high demands on crawling speed and storage space.
A. True
B. False
63. Regular expressions support matching boundaries, for example word boundaries and the beginning or end of a text.
A. True
B. False
64. In HTTP, the status code is an important piece of information in the response.
A. True
B. False
65. In regular expressions, $ matches the beginning of a line.
A. True
B. False
66. Crawler tools can only be written in the Java language.
A. True
B. False
67. The HTTP OPTIONS method queries the server's capabilities, or queries options and requirements related to a resource.
A. True
B. False
68. A practical web crawler system is usually a combination of several crawling techniques.
A. True
B. False
69. HTTP status code 400 indicates that the request succeeded.
A. True
B. False
70. Some information can be learned from an HTTP request, for example: the requesting client, the language of the request, keep-alive status (keep-alive), and so on.
A. True
B. False
71. The HTTP DELETE method requests that the server delete the resource identified by the Request-URI.
A. True
B. False
72. The Java language does not support regular expressions.
A. True
B. False
73. Greedy mode in regular expressions matches as much text as possible.
A. True
B. False
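The greedy/lazy distinction behind questions 47, 73, and 92 shows up clearly when matching HTML tags; the sample string is made up:

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: ".*" grabs as much as possible, overshooting to the LAST ">".
greedy = re.search(r"<.*>", html).group()
# Lazy (reluctant, non-greedy): ".*?" stops at the FIRST ">".
lazy = re.search(r"<.*?>", html).group()

print(greedy)  # <b>bold</b> and <i>italic</i>
print(lazy)    # <b>
```

Both patterns are valid; greedy quantifiers match as much text as possible, while appending `?` makes them match as little as possible.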
74. The accuracy (Accuracy) of data quality refers to whether the data correctly represents a real or verifiable source.
A. True
B. False
75. GET requests the resource identified by the Request-URI.
A. True
B. False
76. Common crawling strategies for focused web crawlers are the depth-first and breadth-first strategies.
A. True
B. False
77. Surface pages are those whose content mostly cannot be reached through static links and is hidden behind search forms.
A. True
B. False
78. Deep web pages contain far less information than surface pages and are not worth crawling.
A. True
B. False
79. Web crawlers are widely used by internet search engines and other similar websites to obtain or update those sites' content and indexes.
A. True
B. False
80. At present, most information classification on the internet is done manually.
A. True
B. False
81. The regular expression quantifier + means 0 or more occurrences.
A. True
B. False
82. In regular expressions, $ matches the end of a line.
A. True
B. False
83. Incremental web crawlers have a large data download volume and high time and space costs.
A. True
B. False
84. Hypertext Transfer Protocol usually works by an HTTP client initiating a request, establishing a TCP connection to a specified port on the server.
A. True
B. False
85. HTTP responses include Content-Encoding, Content-Length, Content-Type, and so on.
A. True
B. False
86. Deep web pages are far fewer in number than surface pages.
A. True
B. False
87. Compared with general-purpose web crawlers, focused crawlers add a link evaluation module and a content evaluation module.
A. True
B. False
88. DELETE requests that the server delete the resource identified by the Request-URI.
A. True
B. False
89. The timeliness of data quality refers to whether the data is within the acceptable range defined by the enterprise.
A. True
B. False
90. The depth-first strategy is better suited to vertical search or site-specific search.
A. True
B. False
91. Different enterprises have different business rules and different data indicators; these indicators can be completed simply by adding, subtracting, and combining.
A. True
B. False
92. Lazy (non-greedy) mode in regular expressions matches as little text as possible.
A. True
B. False
93. General-purpose web crawlers are suited to search engines that search broad topics, and have strong application value.
A. True
B. False
94. The regular expression [abc] represents the string "abc".
A. True
B. False
95. HTTP status code 500 indicates that the request failed because of the server.
A. True
B. False
96. General-purpose web crawlers usually work in parallel, but refreshing crawled pages takes a long time.
A. True
B. False
97. Common HTTP request methods include GET, HEAD, and POST.
A. True
B. False
98. HtmlParser is an HTML parsing library written in Java.
A. True
B. False
99. The regular expression quantifier ? means 0 or more occurrences.
A. True
B. False
100. An HTTP request consists of three parts: the request line, the message headers, and the request body.
A. True
B. False
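The three-part layout in question 100 is visible if a request is written out as raw text; the host and path below are made up:

```python
# A minimal GET request as raw text: request line, then header lines,
# then a blank line; GET normally carries no request body.
request = (
    "GET /index.html HTTP/1.1\r\n"  # request line: method, URI, version
    "Host: example.com\r\n"         # message headers follow, one per line
    "Connection: keep-alive\r\n"
    "\r\n"                          # blank line marks the end of the headers
)

# Pulling the request line back apart recovers its three fields.
request_line = request.split("\r\n")[0]
method, uri, version = request_line.split()
print(method, uri, version)  # GET /index.html HTTP/1.1
```

A POST request would add its form data after the blank line, which is exactly the body slot that GET leaves empty.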