🔥 MediaCrawler - Social Media Platform Crawler 🕷️
Disclaimer:
Please use this repository for learning purposes only ⚠️⚠️⚠️⚠️ (see: cases of illegal web scraping).
All content in this repository is for learning and reference purposes only, and commercial use is prohibited. No person or organization may use the content of this repository for illegal purposes or infringe upon the legitimate rights and interests of others. The web scraping technology involved in this repository is only for learning and research, and may not be used for large-scale crawling of other platforms or other illegal activities. This repository assumes no legal responsibility for any legal liability arising from the use of the content of this repository. By using the content of this repository, you agree to all terms and conditions of this disclaimer.
For a more detailed disclaimer, see the Disclaimer section below.
📖 Project Introduction
A powerful multi-platform social media data collection tool that supports crawling public information from mainstream platforms including Xiaohongshu, Douyin, Kuaishou, Bilibili, Weibo, Tieba, Zhihu, and more.
🔧 Technical Principles
- Core Technology: Uses the Playwright browser automation framework to log in and keep the login state
- No JS Reverse Engineering Required: Evaluates JS expressions inside the browser context, with the login state preserved, to obtain signature parameters
- Advantages: No need to reverse-engineer complex encryption algorithms, which significantly lowers the technical barrier
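The idea behind these points can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the profile directory, the target URL, and the page-side function name `hypotheticalSign` are placeholders, and each platform exposes its own signing logic under its own name.

```python
import json


def build_sign_expression(function_name: str) -> str:
    """Return a JS arrow function that forwards its arguments to a page-side signer."""
    # function_name is a placeholder; real platforms use their own function names.
    return f"([uri, payload]) => window.{function_name}(uri, payload)"


def fetch_signature(user_data_dir: str, url: str, uri: str, payload: dict) -> object:
    """Open a persistent browser profile and evaluate the signing expression in the page."""
    # Imported lazily so the helper above can be used without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        # A persistent context stores cookies on disk, so an earlier login is reused.
        context = p.chromium.launch_persistent_context(user_data_dir, headless=False)
        page = context.pages[0] if context.pages else context.new_page()
        page.goto(url)
        # The page's own JS computes the signature; nothing is reverse-engineered here.
        signature = page.evaluate(
            build_sign_expression("hypotheticalSign"),
            [uri, json.dumps(payload)],
        )
        context.close()
        return signature
```

Because the signing code runs inside the real page, the crawler only needs to know which function to call, not how the signature is computed.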
✨ Features
| Platform | Keyword Search | Specific Post ID Crawling | Secondary Comments | Specific Creator Homepage | Login State Cache | IP Proxy Pool | Generate Comment Word Cloud |
|---|---|---|---|---|---|---|---|
| Xiaohongshu | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Douyin | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Kuaishou | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Bilibili | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Weibo | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Tieba | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Zhihu | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
🚀 MediaCrawlerPro Major Release! More features, better architectural design!
Focus on learning mature project architectural design, not just crawling technology. The code design philosophy of the Pro version is equally worth in-depth study!
MediaCrawlerPro core advantages over the open-source version:
🎯 Core Feature Upgrades
- ✅ Resume crawling functionality (Key feature)
- ✅ Multi-account + IP proxy pool support (Key feature)
- ✅ Remove Playwright dependency, easier to use
- ✅ Complete Linux environment support
🏗️ Architectural Design Optimization
- ✅ Code refactoring optimization, more readable and maintainable (decoupled JS signature logic)
- ✅ Enterprise-level code quality, suitable for building large-scale crawler projects
- ✅ Perfect architectural design, high scalability, greater source code learning value
🎁 Additional Features
- ✅ Social media video downloader desktop app (suitable for learning full-stack development)
- ✅ Multi-platform homepage feed recommendations (HomeFeed)
- AI Agent based on social media platforms is under development 🚀🚀
See the MediaCrawlerPro Project Homepage for more information
🚀 Quick Start
💡 Open source is not easy. If this project helps you, please give it a ⭐ Star to support it!
📋 Prerequisites
🚀 uv Installation (Recommended)
Before proceeding with the next steps, please ensure that uv is installed on your computer:
- Installation Guide: uv Official Installation Guide
- Verify Installation: Run `uv --version` in the terminal. If a version number is printed, the installation was successful
- Why uv: uv is a fast Python package management tool with accurate dependency resolution
🟢 Node.js Installation
The project depends on Node.js. Download and install it from the official website:
- Download Link: https://nodejs.org/en/download/
- Version Requirement: >= 16.0.0
📦 Python Package Installation
# Enter project directory
cd MediaCrawler
# Use uv sync command to ensure consistency of python version and related dependency packages
uv sync
🌐 Browser Driver Installation
# Install browser driver
uv run playwright install
💡 Tip: MediaCrawler now supports using playwright to connect to your local Chrome browser, solving some issues caused by Webdriver.
Currently, `xhs` and `dy` can use CDP mode to connect to a local browser. If needed, check the configuration items in `config/base_config.py`.
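For orientation, connecting Playwright to an already running Chrome over CDP looks roughly like the sketch below. It assumes Chrome was started with `--remote-debugging-port=9222`; the host, port, and helper names are illustrative assumptions, and the project's own CDP settings live in `config/base_config.py`.

```python
from typing import Optional


def cdp_endpoint(host: str = "localhost", port: int = 9222) -> str:
    """Build the HTTP endpoint of a Chrome instance exposing the DevTools protocol."""
    # Host and port are assumptions; use whatever your local Chrome was started with.
    return f"http://{host}:{port}"


def page_title_over_cdp(url: str, endpoint: Optional[str] = None) -> str:
    """Attach to a running Chrome over CDP, open a page, and return its title."""
    # Imported lazily so cdp_endpoint() works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.connect_over_cdp(endpoint or cdp_endpoint())
        # Reuse the browser's existing context so its cookies and logins are available.
        context = browser.contexts[0] if browser.contexts else browser.new_context()
        page = context.new_page()
        page.goto(url)
        title = page.title()
        page.close()
        return title
```

Attaching to a regular local browser, rather than one launched by the automation driver, is what avoids some of the Webdriver-related issues mentioned in the tip.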
🚀 Run Crawler Program
# The project does not enable comment crawling mode by default. If you need comments, please modify the ENABLE_GET_COMMENTS variable in config/base_config.py
# Other supported options can also be viewed in config/base_config.py with Chinese comments
# Read keywords from configuration file to search related posts and crawl post information and comments
uv run main.py --platform xhs --lt qrcode --type search
# Read specified post ID list from configuration file to get information and comment information of specified posts
uv run main.py --platform xhs --lt qrcode --type detail
# Open corresponding APP to scan QR code for login
# For other platform crawler usage examples, execute the following command to view
uv run main.py --help
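To make the structure of these commands concrete, here is a hypothetical sketch of how flags like the ones above could be parsed with `argparse`. Only the flag names and the example values (`xhs`, `qrcode`, `search`, `detail`) come from this README; the function names and help texts are assumptions, and the project's real parser may differ. Run `uv run main.py --help` for the authoritative list.

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags shown in the commands above."""
    parser = argparse.ArgumentParser(description="Crawler CLI sketch (illustrative only)")
    parser.add_argument("--platform", required=True, help="platform code, e.g. xhs or dy")
    parser.add_argument("--lt", help="login type, e.g. qrcode")
    # dest avoids an attribute named like the built-in `type`.
    parser.add_argument("--type", dest="crawler_type", help="crawl mode, e.g. search or detail")
    parser.add_argument("--save_data_option", help="storage backend, e.g. sqlite or db")
    return parser


def parse_cli(argv: List[str]) -> argparse.Namespace:
    """Parse an explicit argument list so the sketch can be tried without a shell."""
    return build_parser().parse_args(argv)


args = parse_cli(["--platform", "xhs", "--lt", "qrcode", "--type", "search"])
print(args.platform, args.lt, args.crawler_type)
```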
Using Python native venv environment management (Not recommended)
Create and activate Python virtual environment
To crawl Douyin or Zhihu, install Node.js in advance, version 16 or later.
# Enter project root directory
cd MediaCrawler
# Create virtual environment
# My python version is: 3.9.6, the libraries in requirements.txt are based on this version
# If using other python versions, the libraries in requirements.txt may not be compatible, please resolve on your own
python -m venv venv
# macOS & Linux activate virtual environment
source venv/bin/activate
# Windows activate virtual environment
venv\Scripts\activate
Install dependency libraries
pip install -r requirements.txt
Install playwright browser driver
playwright install
Run crawler program (native environment)
# The project does not enable comment crawling mode by default. If you need comments, please modify the ENABLE_GET_COMMENTS variable in config/base_config.py
# Other supported options can also be viewed in config/base_config.py with Chinese comments
# Read keywords from configuration file to search related posts and crawl post information and comments
python main.py --platform xhs --lt qrcode --type search
# Read specified post ID list from configuration file to get information and comment information of specified posts
python main.py --platform xhs --lt qrcode --type detail
# Open corresponding APP to scan QR code for login
# For other platform crawler usage examples, execute the following command to view
python main.py --help
💾 Data Storage
Supports multiple data storage methods:
- SQLite Database: Lightweight, serverless database, ideal for personal use (recommended)
  - Parameter: `--save_data_option sqlite`
  - Database file created automatically
- MySQL Database: Supports saving to a MySQL relational database (the database must be created in advance)
  - Execute `python db.py` to initialize the database table structure (only on first run)
- CSV Files: Supports saving to CSV (under the `data/` directory)
- JSON Files: Supports saving to JSON (under the `data/` directory)
Usage Examples:
# Use SQLite (recommended for personal users)
uv run main.py --platform xhs --lt qrcode --type search --save_data_option sqlite
# Use MySQL
uv run main.py --platform xhs --lt qrcode --type search --save_data_option db
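The general idea of switching storage backends by option can be illustrated with a small standard-library sketch. Everything here is hypothetical: the function name, the file names, the `posts` table, the record fields, and the `csv`/`json` option values are assumptions for illustration and do not reflect the project's actual schema or storage code.

```python
import csv
import json
import sqlite3
from contextlib import closing
from pathlib import Path
from typing import Dict, List


def save_records(records: List[Dict[str, str]], option: str, out_dir: str = "data") -> Path:
    """Write records with the chosen backend and return the path of the file written."""
    if not records:
        raise ValueError("no records to save")
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    if option == "csv":
        path = target / "posts.csv"
        with path.open("w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=list(records[0].keys()))
            writer.writeheader()
            writer.writerows(records)
    elif option == "json":
        path = target / "posts.json"
        path.write_text(json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8")
    elif option == "sqlite":
        path = target / "posts.db"
        # The database file is created automatically on first connect.
        with closing(sqlite3.connect(path)) as conn, conn:
            conn.execute("CREATE TABLE IF NOT EXISTS posts (post_id TEXT PRIMARY KEY, title TEXT)")
            conn.executemany(
                "INSERT OR REPLACE INTO posts (post_id, title) VALUES (:post_id, :title)",
                records,
            )
    else:
        raise ValueError(f"unsupported option: {option}")
    return path
```

SQLite needs no server or prior setup, which is why it suits personal use, while a MySQL backend requires the database and tables to exist before the first run.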
🤝 Community & Support
💬 Discussion Groups
- WeChat Discussion Group: Click to join
📚 Documentation & Tutorials
- Online Documentation: MediaCrawler Complete Documentation
- Crawler Tutorial: CrawlerTutorial Free Tutorial
The online documentation covers usage, common questions, and how to join the project discussion groups: MediaCrawler Online Documentation
Author's Knowledge Services
If you want to get started quickly with this project, study its source code architecture, learn programming techniques, or understand the source code design of MediaCrawlerPro, you can check out my paid knowledge column.
Author's Paid Knowledge Column Introduction
⭐ Star Trend Chart
If this project helps you, please give a ⭐ Star to support and let more people see MediaCrawler!
💰 Sponsor Display
Exclusive discount code: GHB5 Get 10% off instantly!
Sider - The hottest ChatGPT plugin on the web, amazing experience!
🤝 Become a Sponsor
Become a sponsor and showcase your product here, getting massive exposure daily!
Contact Information:
- WeChat: yzglan
- Email: relakkes@gmail.com
📚 References
- Xiaohongshu Client: ReaJason's xhs repository
- SMS Forwarding: SmsForwarder reference repository
- Intranet Penetration Tool: ngrok official documentation
Disclaimer
🙏 Acknowledgments
JetBrains Open Source License Support
Thanks to JetBrains for providing free open source license support for this project!