Learning Python web crawling: essential libraries and how to choose a framework

Writing crawlers is a good way to start learning Python. Resources on the topic are plentiful on the Internet, and there are many open source projects to learn from.

Writing a web crawler in Python breaks down into three major parts: crawling, parsing, and storage.

When we enter a URL in the browser and press Enter, what happens behind the scenes?

In simple terms, the process takes the following four steps:

Find the IP address corresponding to the domain name.

Send a request to the server that corresponds to the IP.

The server responds to the request and sends back the web page content.

The browser parses the web page content.
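The four steps above can be sketched with the standard library alone. This is a minimal illustration, not how a browser is actually implemented: the function names (resolve, fetch, parse_title) are made up for this example, and "parsing" is reduced to pulling out the page title.

```python
import socket
from html.parser import HTMLParser
from urllib.parse import urlsplit
from urllib.request import urlopen


class TitleParser(HTMLParser):
    """Collects the text inside the <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def resolve(url):
    """Step 1: find the (IPv4) address for the URL's host name."""
    host = urlsplit(url).hostname
    return socket.gethostbyname(host)


def fetch(url, timeout=10):
    """Steps 2 and 3: send the request and read what the server sends back."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def parse_title(html):
    """Step 4: parse the page content (here, only the title)."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()
```

With network access, calling resolve(url) and then parse_title(fetch(url)) on a URL of your choice walks through all four steps. urlopen performs its own DNS lookup internally, so resolve is there only to make step 1 visible.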

Which libraries do you need to know to write crawlers?

General:

Urllib - Network library (stdlib).

Requests - network library.

Grab - network library (based on pycurl).

Pycurl - Network library (a binding to libcurl).

Urllib3 - Python HTTP library with thread-safe connection pooling and file post support.

Httplib2 - Network library.

RoboBrowser - A simple, Pythonic library for browsing the web without a standalone browser.

MechanicalSoup - A Python library for automating interaction with websites.

Mechanize - A stateful, programmable Web browsing library.

Socket - The underlying network interface (stdlib).

Unirest for Python - Unirest is a set of lightweight HTTP libraries available in multiple languages.

Hyper – Python HTTP/2 client.

PySocks - Updated and actively maintained version of SocksiPy, with bug fixes and some extra features. Acts as a drop-in replacement for the socket module.
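As a small taste of the standard-library entry in the list above, the sketch below builds a GET request with urllib. The helper names, the User-Agent string, and the example.com URL are placeholders chosen for this example; third-party libraries such as Requests wrap the same idea in a friendlier API.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_request(base_url, params, user_agent="demo-crawler/0.1"):
    """Build a GET request with a query string and a custom User-Agent."""
    url = base_url + "?" + urlencode(params)
    return Request(url, headers={"User-Agent": user_agent})


def download(request, timeout=10):
    """Send the request and return the raw response body as bytes."""
    with urlopen(request, timeout=timeout) as response:
        return response.read()


# Building the request does not touch the network; only download() does.
request = build_request("http://example.com/search", {"q": "python", "page": 1})
print(request.full_url)  # http://example.com/search?q=python&page=1
```

Keeping request construction separate from sending makes it easy to inspect or log exactly what the crawler is about to ask for.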

Web crawler frameworks

Full-featured crawlers

Grab - web crawler framework (based on pycurl/multicurl).

Scrapy - web crawler framework (based on Twisted). Python 3 has been supported since Scrapy 1.1.

Pyspider - A powerful crawler system.

Cola - A distributed crawler framework.

Other

Portia - Scrapy-based visual crawler.

Restkit - HTTP resource kit for Python. It lets you easily access HTTP resources and build objects around them.

Demiurge - A PyQuery-based crawler micro-framework.

HTML/XML parser

General

Lxml - Efficient HTML/XML processing library written in C. Supports XPath.

Cssselect - Queries the DOM tree with CSS selectors.

Pyquery - Queries the DOM tree with jQuery-like selectors.

BeautifulSoup - Slower HTML/XML processing library, implemented in pure Python.

Html5lib - Builds the DOM of HTML/XML documents according to the WHATWG specification, which all current browsers follow.

Feedparser - Parse RSS/ATOM feeds.

MarkupSafe - Safely escapes strings for use in XML/HTML/XHTML.

Xmltodict - A Python module that makes working with XML feel like working with JSON.

Xhtml2pdf - Convert HTML/CSS to PDF.

Untangle - Easily convert XML files to Python objects.
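To show what XPath-style extraction looks like, here is a sketch using the standard library's xml.etree.ElementTree, which supports a limited XPath subset. It is a stand-in chosen to avoid a dependency: real-world HTML is often not well-formed XML, which is where lxml or BeautifulSoup come in. The sample markup and the extract_links name are invented for this example.

```python
import xml.etree.ElementTree as ET

# A small, well-formed document standing in for a downloaded page.
DOCUMENT = """
<html>
  <body>
    <ul id="links">
      <li><a href="/page/1">First</a></li>
      <li><a href="/page/2">Second</a></li>
    </ul>
    <a href="/about">About</a>
  </body>
</html>
"""


def extract_links(markup):
    """Return (href, text) pairs for the links inside <ul id="links">."""
    root = ET.fromstring(markup.strip())
    anchors = root.findall(".//ul[@id='links']/li/a")
    return [(a.get("href"), a.text) for a in anchors]


print(extract_links(DOCUMENT))
# [('/page/1', 'First'), ('/page/2', 'Second')]
```

The "About" link is skipped because the path expression only matches anchors inside the list with id "links".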

Clean up

Bleach - Clean up HTML (html5lib required).

Sanitize - Brings sanity to the world of messy data.

Text processing

Libraries for parsing and manipulating plain text.

General

Difflib - (Python standard library) Helps compute differences between sequences.

Levenshtein - Quickly calculate Levenshtein distance and string similarity.

Fuzzywuzzy - Fuzzy string match.

Esmre - Regular expression accelerator.

Ftfy - Automatically fixes broken Unicode text and makes it more consistent.
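String similarity, which several of the libraries above deal with, can be sketched with the standard library. The example uses difflib for a similarity ratio and close matches, plus a plain-Python Levenshtein distance written for illustration. The function names are this example's own, and the Levenshtein package listed above is a much faster C implementation with its own API.

```python
import difflib


def similarity(a, b):
    """Similarity ratio between 0.0 and 1.0, as computed by difflib."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
print(difflib.get_close_matches("appel", ["ape", "apple", "peach", "puppy"]))
# ['apple', 'ape']
```

In a crawler, measures like these are handy for spotting near-duplicate titles or matching slightly misspelled names.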

Natural language processing

Libraries for processing human language.

NLTK - A leading platform for writing Python programs that work with human language data.

Pattern - A web mining module for Python, with tools for natural language processing, machine learning and more.

TextBlob - Provides a consistent API for common natural language processing tasks. It stands on the shoulders of NLTK and Pattern.

Jieba – Chinese word segmentation tool.

SnowNLP - Chinese text processing library.

Loso - Another Chinese word segmentation library.

Browser Automation and Simulation

Selenium - automates real browsers (Chrome, Firefox, Opera, IE).

Ghost.py - Wrapper around PyQt's WebKit (requires PyQt).

Spynner - Wrapper around PyQt's WebKit (requires PyQt).

Splinter - Generic API for browser emulators (Selenium WebDriver, Django client, Zope).

Multiprocessing

Threading - Threading in the Python standard library. Useful for I/O-bound tasks, but of little help for CPU-bound tasks because of Python's GIL.

Multiprocessing - Standard Python library for running multiple processes.

Celery – Asynchronous task queue/job queue based on distributed messaging.

Concurrent-futures - The concurrent.futures module provides a high-level interface for asynchronously executing callables.
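The point about threads suiting I/O-bound work can be sketched with concurrent.futures. To keep the example runnable offline, fake_download is a made-up stand-in that simply sleeps instead of performing a real request; in a real crawler it would be a network call.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(url, delay=0.2):
    """Stand-in for a network request: waits, then returns a fake page size."""
    time.sleep(delay)  # while one thread waits, the others keep running
    return url, len(url)


def crawl_threaded(urls, max_workers=4):
    """Download all URLs using a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as the input URLs.
        return list(pool.map(fake_download, urls))


urls = [f"http://example.com/page/{n}" for n in range(4)]
start = time.perf_counter()
pages = crawl_threaded(urls)
print(pages[0], f"{time.perf_counter() - start:.2f}s")
```

With four workers the four simulated downloads overlap, so the whole run should take roughly as long as one download rather than four. For CPU-bound work, swapping in ProcessPoolExecutor sidesteps the GIL at the cost of process overhead.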

Asynchronous

Asynchronous network programming libraries.

Asyncio - (Python standard library, Python 3.4+) Asynchronous I/O, event loop, coroutines, and tasks.

Twisted - Event-driven network engine framework.

Tornado - A network framework and an asynchronous network library.

Pulsar - Python event-driven concurrency framework.

Diesel - Greenlet-based event I/O framework for Python.

Gevent - A coroutine-based Python networking library that uses greenlet.

Eventlet - Asynchronous framework with WSGI support.

Tomorrow - Magic decorator syntax for asynchronous code.
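The event-loop-and-coroutine model these libraries share can be sketched with asyncio from the standard library. As before, fake_fetch only sleeps in place of a real request, and the concurrency limit of 2 is an arbitrary value picked for the example. The code uses asyncio.run, which requires Python 3.7 or later.

```python
import asyncio


async def fake_fetch(url, delay=0.1):
    """Stand-in for an asynchronous HTTP request."""
    await asyncio.sleep(delay)  # yields control to the event loop while waiting
    return f"fetched {url}"


async def crawl_async(urls, limit=2):
    """Fetch all URLs concurrently, with at most `limit` requests in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(url):
        async with semaphore:
            return await fake_fetch(url)

    # gather() returns results in the order the coroutines were passed in.
    return await asyncio.gather(*(bounded(url) for url in urls))


results = asyncio.run(crawl_async(["a", "b", "c"]))
print(results)  # ['fetched a', 'fetched b', 'fetched c']
```

A single thread handles every request here; the semaphore is one simple way to avoid hammering a site with too many simultaneous connections.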

Queue

Celery – Asynchronous task queue/job queue based on distributed messaging.

Huey - Small multi-threaded task queue.

Mrq - Mr. Queue, a distributed worker task queue in Python using Redis and gevent.

RQ - Redis-based lightweight task queue manager.

Simpleq - A simple, infinitely scalable, Amazon SQS-based queue.

Python-gearman – Gearman's Python API.
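The libraries above are distributed task queues backed by a broker such as Redis or Amazon SQS. The underlying pattern, producers putting jobs on a queue and workers taking them off, can be sketched in a single process with the standard library's queue and threading modules. This is only an in-process analogy under that assumption, not the API of any library listed; run_workers and its parameters are invented for the example.

```python
import queue
import threading


def run_workers(jobs, handler, worker_count=3):
    """Process jobs with a pool of worker threads fed from a shared queue."""
    tasks = queue.Queue()
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            job = tasks.get()
            if job is None:  # sentinel: no more work for this worker
                tasks.task_done()
                break
            try:
                outcome = handler(job)
            except Exception as exc:  # record failures instead of crashing
                outcome = exc
            with lock:
                results.append((job, outcome))
            tasks.task_done()

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for job in jobs:
        tasks.put(job)
    for _ in threads:
        tasks.put(None)  # one sentinel per worker
    tasks.join()
    for thread in threads:
        thread.join()
    return results


print(sorted(run_workers(["a", "bb", "ccc"], len)))
# [('a', 1), ('bb', 2), ('ccc', 3)]
```

Results arrive in whatever order the workers finish, which is why the example sorts them before printing. A distributed queue adds persistence and lets the workers run on other machines.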

Cloud computing

Picloud - Executes Python code in the cloud.

Dominoup.com - Executes R, Python and MATLAB code in the cloud.

Web content extraction

Libraries for extracting web page content.

HTML page text and metadata

Newspaper - News extraction, article extraction and content curation using Python.

Html2text - Turn HTML into Markdown format text.

Python-goose - HTML content/article extractor.

Lassie - User-friendly web content retrieval tool.

WebSocket

Libraries for working with WebSockets.

Crossbar - Open source application messaging router (a Python implementation of WebSocket and WAMP for Autobahn).

AutobahnPython - Provides a Python implementation of the WebSocket and WAMP protocols and is open source.

WebSocket-for-Python - WebSocket client and server library for Python 2 and 3 as well as PyPy.

DNS resolution

Dnsyo - Check your DNS on more than 1500 DNS servers worldwide.

Pycares - Interface to c-ares, a C library that performs DNS requests and asynchronous name resolution.

Computer vision

OpenCV - Open Source Computer Vision Library.

SimpleCV - A concise, readable interface for cameras, image processing, feature extraction, and format conversion (based on OpenCV).

Mahotas - Fast computer vision algorithms (implemented entirely in C++) that operate on NumPy arrays.

Some frameworks for web development

1. Django

Django is an open source web application framework written in Python. It supports many database engines, makes web development quick and scalable, and is kept up to date with new Python releases. If you are a novice programmer, it is a good framework to start with.

2. Flask

Flask is a lightweight web application framework written in Python. It is based on the Werkzeug WSGI toolkit and the Jinja2 template engine, and is released under a BSD license.

Flask is also called a "microframework" because it keeps a simple core and adds functionality through extensions. Flask has no default database layer or form validation tool. It stays flexible, however: Flask extensions can add an ORM, form validation, file uploads, and various open authentication technologies.
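Flask sits on top of WSGI, the standard interface between Python web servers and applications. To give a feel for what that layer looks like underneath a microframework, here is a bare WSGI application using only the standard library. This is not Flask's API: the app and call_app names, the single route, and the response text are all made up for the illustration.

```python
from wsgiref.util import setup_testing_defaults


def app(environ, start_response):
    """A bare WSGI application with one route and no framework."""
    if environ.get("PATH_INFO") == "/":
        status, body = "200 OK", b"Hello, crawler!"
    else:
        status, body = "404 Not Found", b"Not found"
    headers = [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response(status, headers)
    return [body]


def call_app(path):
    """Invoke the app the way a WSGI server would, without opening a socket."""
    environ = {"PATH_INFO": path}
    setup_testing_defaults(environ)  # fills in the remaining required keys
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    body = b"".join(app(environ, start_response))
    return captured["status"], body


print(call_app("/"))  # ('200 OK', b'Hello, crawler!')
```

To serve it for real, wsgiref.simple_server.make_server("", 8000, app).serve_forever() should be enough for local experiments. Routing, templates, and request objects are the conveniences a framework adds on top of this interface.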

3. Web2py

Web2py is a free, open source web framework written in Python for agile, rapid development of fast, scalable, secure, and portable database-driven web applications. It is released under the LGPLv3 license.

Web2py provides a one-stop solution: the entire development process can be carried out in the browser, including online code editing, HTML template writing, static file uploads, and database administration. It also includes logging and an automatically generated admin interface.

4. Tornado

Tornado is a web server (not covered in detail in this article) and also a micro-framework in the style of web.py. As a framework, Tornado draws its ideas mainly from web.py. On the web.py home page you can find a quote from Bret Taylor, one of the people behind Tornado (the FriendFeed framework he mentions can be regarded as the same thing as Tornado):

"[web.py inspired the] Web framework we use at FriendFeed [and] the webapp framework that ships with App Engine..."

Because of this relationship, Tornado is not discussed separately here.

5. CherryPy

CherryPy is a simple and very useful web framework for Python. Its main job is to connect the web server to Python code with as little work as possible. Its features include built-in profiling, a flexible plugin system, and the ability to run multiple HTTP servers at once. It runs on the latest versions of Python, Jython, and Android.

Misconceptions about choosing a framework

When choosing a framework, many people fall into the following two misconceptions without realizing it.

Which framework is the best - There is no best framework in the world, only the framework that best suits you and your team. The same goes for the choice of programming language: if your team is best at Python, use Python; if you are most familiar with Ruby, use Ruby. Programming languages and frameworks are just tools, and whichever lets you get the job done faster and better is the right one.

Excessive attention to performance - In fact, most people do not need to care much about framework performance, because the sites they develop are mostly small. Few sites reach 10,000 unique IPs, and sites with more than 100,000 are rarer still. Talking about performance makes little sense before you have a certain amount of traffic, because until then your CPU and memory sit idle most of the time.
