What is a Python crawler?

A Python crawler is a web spider written in Python: a program or script that automatically fetches information from the World Wide Web according to a set of rules. Less common names include ant, automatic indexer, emulator, and worm. Put simply, it is a program that retrieves the data you want from web pages automatically. A web crawler, also known as a web spider, is a bot that systematically browses the World Wide Web, usually for the purpose of building an index.

Search engines and some other websites use crawler software to update their own content or their indexes of other sites' content. Crawlers can save copies of the pages they visit so that a search engine can index them for users to search later.

Crawling consumes resources on the target system, and crawlers often visit sites without being invited. A crawler that fetches a large number of pages therefore needs to consider scheduling, load, and "politeness". Owners of public sites who do not want them crawled can say so in a robots.txt file, among other methods. The file can ask robots to index only part of the site, or none of it.
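
As a rough illustration of the robots.txt check described above, the sketch below uses the standard library's urllib.robotparser. The robots.txt content, the crawler name "example-crawler", and the example.com URLs are assumptions made up for this example; a real crawler would download the file from the target site first.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# Ask whether our (hypothetical) crawler may fetch each URL.
allowed_public = parser.can_fetch("example-crawler", "https://example.com/jobs/1")
allowed_private = parser.can_fetch("example-crawler", "https://example.com/private/x")

# Seconds the site asks crawlers to wait between requests, or None if unset.
delay = parser.crawl_delay("example-crawler")
```

A polite crawler would skip any URL for which can_fetch returns False and sleep for the requested delay between requests to the same host.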

The number of pages on the Internet is so large that even the largest crawlers cannot build a complete index. For this reason, search engines in the early years of the World Wide Web, before 2000, often failed to return many relevant results. Today's search engines have improved greatly and return high-quality results almost instantly.

Crawlers can also be used to validate hyperlinks and HTML code, and for web scraping.

Python crawler

Python crawler architecture

A Python crawler typically consists of five parts: a scheduler, a URL manager, a web page downloader, a web page parser, and an application (which uses the valuable data that was captured).

Scheduler: comparable to a computer's CPU; it coordinates the URL manager, the downloader, and the parser.

URL manager: keeps track of the URLs still to be crawled and the URLs already crawled, so that no URL is fetched twice and the crawler does not loop. It is usually implemented in one of three ways: in memory, in a database, or in a cache database.
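
The in-memory variant can be sketched roughly as follows. The class name UrlManager and its method names are hypothetical, chosen for this example; stripping the #fragment before comparing URLs is an added assumption, not something the description above requires.

```python
from collections import deque
from urllib.parse import urldefrag


class UrlManager:
    """In-memory URL manager: a queue of pending URLs plus a set of seen URLs."""

    def __init__(self):
        self._pending = deque()  # URLs waiting to be crawled, in arrival order
        self._seen = set()       # every URL ever added (pending or crawled)

    def add(self, url: str) -> bool:
        """Queue a URL unless it was seen before; return True if it was queued."""
        url = urldefrag(url).url  # "#section" fragments point at the same page
        if url in self._seen:
            return False
        self._seen.add(url)
        self._pending.append(url)
        return True

    def has_pending(self) -> bool:
        return bool(self._pending)

    def next_url(self) -> str:
        """Return the oldest pending URL and remove it from the queue."""
        return self._pending.popleft()
```

The database and cache-database variants mentioned above would keep the same interface but store the two collections outside the process, so that a crawl can survive a restart or be shared between workers.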

Web page downloader: downloads the page at a given URL and converts it into a string. Options include urllib2 (part of the Python 2 standard library, with support for logins, proxies, and cookies; in Python 3 its functionality lives in urllib.request) and requests (a third-party package).
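
A minimal downloader along these lines, using only urllib.request from the Python 3 standard library, might look like the sketch below. The function name download, the User-Agent string, and the UTF-8 fallback are assumptions for illustration.

```python
import urllib.request


def download(url: str, timeout: float = 10.0) -> str:
    """Fetch a URL and return the response body as a string."""
    request = urllib.request.Request(
        url,
        headers={"User-Agent": "example-crawler/0.1"},  # hypothetical crawler name
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Use the charset declared by the server; fall back to UTF-8 (an assumption).
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

urllib.request also accepts data: URLs, so a call such as download("data:text/html,%3Ch1%3Ehello%3C/h1%3E") can be used to try the function without network access. Error handling (timeouts, HTTP error statuses, retries) is left out of this sketch.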

Web page parser: parses the page string and extracts the information we need, either by pattern matching or by walking a DOM tree. Options include regular expressions (intuitive: they treat the page as a string and extract values by fuzzy matching, but become very hard to use when the document is complex), html.parser (included with Python), BeautifulSoup (a third-party package that can use either html.parser or lxml as its underlying parser, with lxml generally the faster and more capable of the two), and lxml (a third-party package that parses both XML and HTML). BeautifulSoup and lxml build a DOM-style tree; html.parser is event-driven and reports tags as it encounters them.
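
As one possible illustration of the parsing step, the sketch below collects the links on a page with the standard library's html.parser. The class name LinkExtractor, the helper extract_links, and the example.com base URL are hypothetical; BeautifulSoup or lxml would do the same job with less code.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser calls this once per opening tag; attrs is a list of pairs.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(urljoin(self.base_url, value))


def extract_links(html: str, base_url: str) -> list:
    """Return absolute URLs for all links found in the HTML string."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

Calling extract_links with a page string and the URL it was downloaded from returns absolute URLs that can be handed back to the URL manager.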

Application: the program that makes use of the useful data extracted from web pages.
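
To show how the five parts fit together, here is a small sketch of a scheduler loop. Everything in it is an assumption for illustration: FAKE_SITE is a made-up in-memory "website" so the example runs without network access, the downloader is passed in as a function, and the parser is reduced to two regular expressions that only suit such simple markup.

```python
import re
from collections import deque
from urllib.parse import urljoin

# Hypothetical in-memory "website" so the sketch runs without network access.
FAKE_SITE = {
    "https://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "https://example.com/a": '<title>Page A</title> <a href="/">home</a>',
    "https://example.com/b": '<title>Page B</title> <a href="/a">A</a>',
}

HREF_RE = re.compile(r'href="([^"]+)"')
TITLE_RE = re.compile(r"<title>(.*?)</title>")


def crawl(start_url, download, max_pages=10):
    """Scheduler: coordinate URL manager, downloader, and parser."""
    pending = deque([start_url])  # URL manager: URLs still to be crawled
    seen = {start_url}            # URL manager: URLs already known
    results = {}                  # application data: page URL -> title (or None)

    while pending and len(results) < max_pages:
        url = pending.popleft()
        html = download(url)                       # downloader
        match = TITLE_RE.search(html)              # parser: extract the data
        results[url] = match.group(1) if match else None
        for href in HREF_RE.findall(html):         # parser: extract new links
            link = urljoin(url, href)
            if link not in seen:                   # avoid repeats and loops
                seen.add(link)
                pending.append(link)
    return results


titles = crawl("https://example.com/", lambda url: FAKE_SITE.get(url, ""))
```

In a real crawler the lambda would be replaced by a function that performs HTTP requests, and the regular expressions by a proper HTML parser.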

What can crawlers do?

You can use a crawler to collect pictures, videos, and other data you want. Broadly, if the data can be reached through a browser, a crawler can obtain it.

What is the essence of a crawler?

A crawler simulates a browser opening a web page and extracts the data we want from that page.

The process of opening a web page with a browser:

When you enter an address in the browser, the browser looks up the server host through a DNS server and sends a request to that server. The server processes the request and sends the result back to the user's browser, including HTML, JS, CSS, and other files. The browser parses these files and finally renders the result for the user.
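
The first two steps of that process can be sketched in Python as follows. This is a simplified illustration for plain HTTP only (HTTPS adds a TLS handshake that is not shown); the function names and the User-Agent string are assumptions, and resolve_host needs network access to run.

```python
import socket
from urllib.parse import urlsplit


def resolve_host(hostname: str, port: int = 80) -> list:
    """DNS step: look up the IP addresses of the server host (needs network)."""
    infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def build_get_request(url: str) -> bytes:
    """Request step: build the text of an HTTP/1.1 GET request for the URL."""
    parts = urlsplit(url)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {parts.netloc}",
        "User-Agent: example-crawler/0.1",  # hypothetical crawler name
        "Accept: text/html",
        "Connection: close",
        "",  # an empty line ends the header section
        "",
    ]
    return "\r\n".join(lines).encode("ascii")
```

Libraries such as urllib.request and requests perform both steps internally, which is why crawler code rarely builds requests by hand.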

The page a user sees in the browser is therefore built from HTML code. A crawler fetches the same content and obtains the resources we want by analyzing and filtering that HTML.
