Data collection and management [6]
2022-06-29 03:44:00 【Star drawing question bank】
1. The collection targets of a () are pages that cannot be reached through static links and are hidden behind search forms; they can only be obtained after a user submits certain keywords.
A. Focused crawler
B. Incremental web crawler
C. General-purpose web crawler
D. Deep web crawler
2. The HTTP () method appends new data to the resource identified by the Request-URI.
A.GET
B.POST
C.PUT
D.TRACE
3. In an HTTP request, data sent with the () method is placed after the URL, separated from the URL by a ?, with parameters joined by &.
A.GET
B.POST
C.PUT
D.TRACE
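For illustration, a minimal sketch of how such parameters end up in the URL, using the third-party requests library (httpbin.org is only a placeholder test endpoint, not part of the question):

```python
import requests

# requests encodes the params dict into the query string, so the final URL
# becomes https://httpbin.org/get?keyword=crawler&page=2
# ("?" separates the URL from the data, "&" joins the parameters).
resp = requests.get("https://httpbin.org/get",
                    params={"keyword": "crawler", "page": 2})
print(resp.url)  # shows the URL with the appended query string
```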
4. In regular expression syntax, () matches any non-whitespace character.
A.\S
B.\d
C.\W
D.\w
5. In regular expressions, () matches any character except a newline.
A.^
B.\d
C..
D.\w
6. Which of the following statements about quantifiers in regular expressions is incorrect? ()
A. X+ means X can appear 0, 1, or more times
B. X means X must appear exactly once
C. X* means X can appear 0, 1, or more times
D. X+ means X can appear 1 or more times
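A small sketch with Python's standard re module showing the behaviour of these quantifiers (the patterns and sample strings are invented for illustration):

```python
import re

# "+" requires at least one occurrence; "*" allows zero or more.
print(re.fullmatch(r"ab+c", "ac"))     # None: "+" needs at least one b
print(re.fullmatch(r"ab*c", "ac"))     # match: "*" allows zero b's
print(re.fullmatch(r"ab+c", "abbbc"))  # match: "+" allows many b's
```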
7. Causes of data quality problems do not include ().
A. Time differences in the data
B. Diversity of data acquisition methods
C. Data instability
D. Data dependency
8. In an HTTP response, the () header fields specify the time the message was sent and the document expiration time, respectively.
A.Date,Expires
B.Date,Allow
C.Last-Modified,Allow
D.Last-Modified,Expires
9. The form-filling approach of deep web crawlers based on web page structure analysis generally represents the web form as a () and extracts the value of each form field.
A. DOM tree
B. BOM tree
C. Image
D. Text
10. Data transformation does not include ().
A. Conversion of inconsistent data
B. Data granularity transformation
C. Calculation of business rules
D. Incomplete data
11. Network data collection generally obtains data from websites through () or public website APIs.
A. Web crawler
B. Website log
C.HTTP
D. Forms
12. Data () refers to whether the data is easy to obtain, easy to understand, and easy to use.
A. Completeness
B. Su Shi
C. Guan Hanqing
D. Li Qingzhao
13. In the crawling process of a (), the most important part is form filling and processing.
A. Focused crawler
B. Incremental web crawler
C. General-purpose web crawler
D. Deep web crawler
14. Under the (), pages are crawled according to the depth of their directory level: pages at shallower levels are crawled first, and once all pages at the same level have been crawled, the crawler moves one level deeper and continues crawling. (A sketch of this idea follows the options.)
A. Depth-first strategy
B. Breadth-first strategy
C. PageRank-priority strategy
D. Random crawling strategy
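The level-by-level idea can be sketched roughly as follows; this is only an assumption-laden toy example, where fetch_links is a hypothetical helper that downloads a page and returns its hyperlinks:

```python
from collections import deque

def bfs_crawl(seed_urls, fetch_links, max_pages=100):
    """Breadth-first crawl sketch: shallow pages are visited before deeper ones.

    fetch_links(url) is assumed to download the page and return the
    hyperlinks it contains; it is a placeholder, not a real library call.
    """
    queue = deque(seed_urls)       # FIFO queue drives the level-by-level order
    seen = set(seed_urls)
    crawled = []
    while queue and len(crawled) < max_pages:
        url = queue.popleft()      # take the shallowest unvisited page first
        crawled.append(url)
        for link in fetch_links(url):
            if link not in seen:   # avoid revisiting pages
                seen.add(link)
                queue.append(link)
    return crawled
```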
15. In data preprocessing, the L in ETL stands for ().
A. Extract
B. Transform
C. Load
D. Cleaning
16. The HTTP () method can be used to request or query the capabilities of the server, or to query options and requirements related to a resource.
A.OPTIONS
B.DELETE
C.PUT
D.TRACE
17. A single HTTP transaction consists of ().
A. One request
B. One response
C. One request and one response
D. Two requests
18. Regular expressions support boundary matching. For example, () matches the beginning of a line.
A.^
B.\d
C.\w
D.$
19. The two main goals of a () are to keep the pages stored in the local page set up to date and to improve the quality of the pages in that set.
A. Focused crawler
B. Incremental web crawler
C. General-purpose web crawler
D. Deep web crawler
20. The () search strategy searches in order of depth from shallow to deep, following links level by level into deeper pages until it can go no further; it is better suited to vertical search or site-specific search.
A. Breadth-first
B. Depth-first
C. Based on target characteristics
D. Based on domain
21. The task of data () is to filter out data that does not meet requirements.
A. Extract
B. Transform
C. Load
D. Cleaning
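As a hedged sketch of the "filter out records that do not meet requirements" idea (field names and rules are invented for illustration):

```python
# Toy cleaning step: drop records with missing fields or out-of-range values.
raw_records = [
    {"id": 1, "age": 25},
    {"id": 2, "age": None},   # missing value -> rejected
    {"id": 3, "age": -7},     # out of range  -> rejected
]

def is_valid(rec):
    return rec["age"] is not None and 0 <= rec["age"] <= 120

cleaned = [r for r in raw_records if is_valid(r)]
print(cleaned)  # [{'id': 1, 'age': 25}]
```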
22. Regarding surface web pages and deep web pages, which of the following statements is not true? ()
A. Surface web pages are pages that can be indexed by traditional search engines; they mainly consist of static web pages reachable through hyperlinks.
B. Deep web pages are pages whose content largely cannot be reached through static links and is hidden behind search forms; they can only be obtained after a user submits certain keywords.
C. Deep web pages contain far less information than surface web pages.
D. Deep web crawlers are mainly used to crawl the deep web pages behind search forms.
23. The server that answers an issued HTTP request stores resources such as HTML files and images. This answering server is called the ().
A. Browser
B. Player
C. User agent
D. Origin server
24. In data preprocessing, the E in ETL stands for ().
A. Extract
B. Transform
C. Load
D. Cleaning
25. The HTTP request header field () contains information about the agent making the request, such as the name and version of the client software.
A.Host
B.User-Agent
C.Cookie
D.Referer
26. () describes a pattern for string matching and is usually used to find or replace text that matches the pattern (rule).
A. Web crawler
B. Data collection
C. Character set
D. Regular expressions
27. Regarding the difference between GET and POST, which of the following statements is incorrect? ()
A. Data submitted with GET is placed after the URL
B. Data submitted with POST is placed after the URL
C. With the GET method, Request.QueryString is used to get a variable's value
D. With the POST method, Request.Form is used to get a variable's value
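A small client-side sketch of the GET/POST difference using the requests library (Request.QueryString and Request.Form in options C and D are classic ASP identifiers; httpbin.org below is only a placeholder echo endpoint):

```python
import requests

# GET: data travels in the URL query string.
r1 = requests.get("https://httpbin.org/get", params={"q": "etl"})
print(r1.url)              # ...?q=etl -> visible in the URL

# POST: data travels in the request body, not the URL.
r2 = requests.post("https://httpbin.org/post", data={"q": "etl"})
print(r2.url)              # no query string appended
print(r2.json()["form"])   # {'q': 'etl'} echoed back from the body
```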
28. Resources requested over the HTTP or HTTPS protocol are identified by a ().
A.HTML
B.URL
C.TCP
D.FTP
29. In an HTTP response message, a status code of 404 means ().
A. The requested page was not found
B. Login failed
C. The requested page has been moved to a new URL
D. Access is forbidden
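A minimal sketch of checking the status code in a fetching script (the URL is a placeholder):

```python
import requests

resp = requests.get("https://example.com/maybe-missing")
if resp.status_code == 404:
    print("page not found")            # the requested page does not exist
elif resp.status_code == 200:
    print("request succeeded")
else:
    print("other status:", resp.status_code)
```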
30. Which of the following is an HTTP request header field? ()
A.User-Agent
B.Content-Length
C.Accept-Ranges
D.Expires
31. Which of the following is not an HTTP request method? ()
A.GET
B.POST
C.TRACE
D.SUBMIT
32. Data cleaning is a one-time process. (1 point)
33. The timeliness dimension of data quality refers to whether the data is within the acceptable range defined by the enterprise. (1 point)
34. A disadvantage of the depth-first strategy is that it takes a long time to crawl pages at deep directory levels. (1 point)
35. The HTTP TRACE method asks the server to send back the request information it received, and is mainly used for testing or diagnosis. (1 point)
36. The HTTP response header Content-Type indicates the natural language of the response entity. (1 point)
37. The HTTP PUT method asks the server to store a resource and to use the Request-URI as its identifier. (1 point)
38. Surface web pages are pages whose content largely cannot be reached through static links and is hidden behind search forms. (1 point)
39. A web crawler always starts from a certain starting point; this starting point is called a seed. (1 point)
40. The "hungry" (lazy) mode in regular expressions matches as much text as possible. (1 point)
41. Web crawler technology does not support collecting files or attachments such as images, audio, and video. (1 point)
42. The regular expression quantifier + means 0 or more occurrences. (1 point)
43. A focused crawler filters out links irrelevant to its topic according to a web page analysis algorithm, keeps the useful links, and places them in the queue of URLs waiting to be crawled. (1 point)
44. In regular expressions, \d represents any word character. (1 point)
45. Surface web pages are pages that can be indexed by traditional search engines. (1 point)
46. The widespread use of web crawlers may lead to the disclosure of personal privacy. (1 point)
47. HTTP status code 500 indicates that the request failed because of the client. (1 point)
48. The "hungry" (lazy) mode in regular expressions matches as little text as possible. (1 point)
49. Web crawlers are widely used by Internet search engines and similar websites to obtain or update the content and retrieval methods of these sites. (1 point)
50. General-purpose web crawlers have high requirements for crawling speed and storage space. (1 point)
51. The Range header field in an HTTP request can request one or more sub-ranges of an entity. (1 point)
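A hedged sketch of requesting a sub-range of an entity with the Range header (whether the server honours it depends on the server; the URL is a placeholder):

```python
import requests

# Ask only for the first 1024 bytes of the resource.
resp = requests.get("https://example.com/big-file.bin",
                    headers={"Range": "bytes=0-1023"})
# 206 Partial Content would mean the server honoured the sub-range request.
print(resp.status_code, len(resp.content))
```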
52. The regular expression [abc] represents the string abc. (1 point)
53. The Keep-Alive feature in an HTTP request makes the connection between the client and the server persistent. (1 point)
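As a sketch of what persistent connections look like in client code: requests.Session keeps a connection pool, so later requests can reuse the same underlying TCP connection, which is what Keep-Alive enables (URLs are placeholders):

```python
import requests

with requests.Session() as session:      # connection pool stays open
    for path in ("/a", "/b", "/c"):
        resp = session.get("https://example.com" + path)
        print(path, resp.status_code)    # later requests reuse the connection
```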
54. Regular expressions cannot match special characters. (1 point)
55. In regular expressions, quantifiers can match multiple occurrences of an expression. (1 point)
56. A web crawler crawls along web pages and their hyperlinks; each time it reaches a page it fetches it with a scraping program, extracts the content, and at the same time extracts the hyperlinks as clues for further crawling. (1 point)
57. The breadth-first strategy cannot avoid the problem of crawling never ending when an infinitely deep branch is encountered. (1 point)
58. Languages such as Java and Python also support regular expressions. (1 point)
59. Surface web pages mainly consist of static web pages that can be reached through hyperlinks. (1 point)
60. For systems with large amounts of data, a one-time data extraction is also commonly performed. (1 point)
61. The Cookie header in an HTTP request indicates the client type. (1 point)
62. Regular expressions are often used to find or replace text that matches a given pattern (rule). (1 point)
63. Data transformation mainly involves conversion of inconsistent data, transformation of data granularity, and calculation of some business rules. (1 point)
64. The HTTP DELETE method asks the server to delete the resource identified by the Request-URI. (1 point)
65. The surface web (Surface Web) is the largest and fastest-growing new type of information resource on the Internet. (1 point)
66. At present, mainstream web development languages do not support regular expressions. (1 point)
67. The regular expression [abc] matches the character a, b, or c. (1 point)
68. A web crawler can automatically collect the content of all pages it is able to access. (1 point)
69. GET requests the resource identified by the Request-URI. (1 point)
70. The HTTP TRACE method queries the capabilities of the server, or queries options and requirements related to a resource. (1 point)
71. Regular expressions consist of ordinary characters and metacharacters. (1 point)
72. The correctness (accuracy) dimension of data quality refers to whether the data correctly represents a real or verifiable source. (1 point)
73. In data transformation, null values can be loaded as-is or replaced with other meaningful data, and records can be routed to different target databases based on a field's null value. (1 point)
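A small, assumption-laden sketch of the null-value handling described in statement 73, using pandas (the column names and replacement values are invented):

```python
import pandas as pd

df = pd.DataFrame({"customer": ["a", "b", None], "amount": [10.0, None, 3.5]})

# Option 1: replace nulls with other meaningful data before loading.
filled = df.fillna({"customer": "unknown", "amount": 0.0})

# Option 2: route records to different targets depending on null fields.
complete_rows = df[df["amount"].notna()]   # load into the main target table
problem_rows = df[df["amount"].isna()]     # load into a quarantine table instead
print(filled, complete_rows, problem_rows, sep="\n\n")
```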
74. The HTTP PUT method asks the server to delete the resource identified by the Request-URI. (1 point)
75. The HTTP response header Content-Type indicates the media type and character set of the response entity. (1 point)
76. Keep-Alive in HTTP requests has no negative impact on the server. (1 point)
77. HTTP status code 200 indicates that the request was successful. (1 point)
78. HTTP status code 500 indicates that the request failed because of the server. (1 point)
79. General-purpose web crawlers usually work in parallel, but it still takes a long time to refresh pages. (1 point)
80. Greedy matching in regular expressions matches as little text as possible. (1 point)
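A sketch of the greedy/lazy contrast that statements 40, 48, and 80 refer to, using Python's re module (the sample string is invented):

```python
import re

html = "<b>bold</b> and <i>italic</i>"
print(re.findall(r"<.+>", html))   # greedy: ['<b>bold</b> and <i>italic</i>']
print(re.findall(r"<.+?>", html))  # lazy:   ['<b>', '</b>', '<i>', '</i>']
```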
81. HTTP request header fields may include Authorization, Referer, Content-Type, Content-Encoding, and others. (1 point)
82. With the rapid development of the network, continuously optimized web crawler technology is effectively addressing various challenges and provides strong support for efficient search in the specific fields and topics users care about. (1 point)
83. The Referer header field in an HTTP request contains information about the requesting user. (1 point)
84. A focused web crawler has relatively low requirements on the order in which pages are crawled. (1 point)
85. The breadth-first strategy crawls pages according to the depth of their directory level, crawling pages at shallower levels first. (1 point)
86. Business systems generally store very detailed data, so business system data is usually aggregated to the granularity of the data warehouse. (1 point)
87. In regular expressions, square brackets [] denote a character class. (1 point)
88. The most important part of Deep Web crawler crawling is link extraction. (1 point)
89. The depth-first strategy is better suited to vertical search or site-specific search. (1 point)
90. The amount of information stored in deep web pages accounts for only a small part of the information on the Internet. (1 point)
91. Small websites receive no crawler visits. (1 point)
92. Data quality manifests itself in correctness, integrity, consistency, completeness, validity, timeliness, availability, and so on. (1 point)
93. Crawler tools can only be written in the Java language. (1 point)
94. Regular expressions support boundary matching, for example word boundaries and the beginning or end of text. (1 point)
95. The HTTP POST method asks the server to store a resource and to use the Request-URI as its identifier. (1 point)
96. HTTP is a stateless protocol. (1 point)
97. An incremental web crawler crawls only newly generated or updated pages, and only when needed. (1 point)
98. Besides collecting information, web crawlers can even be used to implant malware, damage web content, and even hijack websites and servers. (1 point)
99. The status code is an important piece of information in an HTTP response. (1 point)
100. After a web crawler has crawled various resources, the information is organized with appropriate indexing techniques and made available for users to query. (1 point)