戒酒的李白

Initialization of the New Sentiment Recognition Model

# BCAT Model: Parallel Chinese Offensive Language Detection with Synergistic Semantics and Topic Modeling

## Overview

The BCAT (BERT-CTM Attention-based Text Classifier) model is designed for Chinese sentiment recognition, with a particular focus on detecting offensive and aggressive language. The model leverages BERT-generated contextual word embeddings and CTM (Combined Topic Model) topic vectors to capture both semantic and thematic features of the text. BCAT integrates a multi-head attention mechanism to enhance feature representation and applies two convolutional networks (DPCNN and TextCNN) in parallel for feature extraction.
  6 +
## Features

- **BERT and CTM Fusion**: BCAT combines BERT embeddings with CTM topic vectors. BERT captures the context of each word in a sentence, while CTM identifies overarching themes, improving the model's ability to detect nuanced sentiment.
- **Multi-Head Attention Mechanism**: This component attends to different aspects of the input representation, ensuring that critical features are emphasized during classification.
- **TextCNN and DPCNN**: These two convolutional networks operate in parallel to extract local (TextCNN) and global (DPCNN) features, improving the model's robustness across different linguistic structures.
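The fusion idea can be sketched in a few lines: a BERT sentence embedding and a CTM topic distribution are concatenated into one feature vector carrying both semantics and theme. This is an illustrative sketch, not the project's actual code; the embedding size (768), the topic count (K = 10), and the random values are placeholder assumptions.

```python
import numpy as np

# Illustrative BERT + CTM fusion: concatenate a 768-d contextual embedding
# (assumed BERT-base size) with a K-d topic distribution (K = 10 assumed).
rng = np.random.default_rng(0)
bert_embedding = rng.standard_normal(768)           # stands in for a BERT sentence vector
topic_logits = rng.standard_normal(10)              # stands in for CTM topic scores
topic_vector = np.exp(topic_logits) / np.exp(topic_logits).sum()  # softmax to a distribution

fused = np.concatenate([bert_embedding, topic_vector])
print(fused.shape)  # (778,)
```

Downstream layers then see one vector per sentence that encodes both what the words mean in context and which themes the comment touches.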
  12 +
## Model Architecture

The BCAT model consists of the following components:

1. **Embedding Layer**: Text is transformed into contextual embeddings using the BERT model.

2. **Feature Extraction Layer**:

   - TextCNN extracts local features (word and phrase combinations).
   - DPCNN captures global text structure and long-range dependencies.

3. **Feature Fusion and Attention**: The outputs of both networks are combined and processed by the multi-head attention mechanism to highlight relevant information.

4. **Classification Layer**: A fully connected layer with Softmax outputs the predicted sentiment classes.
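The pipeline above can be sketched as a minimal PyTorch module, assuming BERT embeddings are computed upstream. All dimensions, kernel sizes, and head counts here are illustrative assumptions, and the DPCNN branch is heavily simplified (one pool/conv block rather than a full repeated pyramid); this is a sketch of the architecture's shape, not the authors' implementation.

```python
import torch
import torch.nn as nn

class BCATSketch(nn.Module):
    """Simplified sketch of the BCAT forward pass (dims are assumptions)."""
    def __init__(self, emb_dim=768, n_filters=128, n_heads=4, n_classes=2):
        super().__init__()
        # TextCNN branch: parallel convolutions over n-gram windows (local features)
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, n_filters, k, padding=k // 2) for k in (2, 3, 4)
        )
        # DPCNN-style branch, reduced to one downsampling block (global features)
        self.dp_conv = nn.Conv1d(emb_dim, n_filters, 3, padding=1)
        self.dp_block = nn.Sequential(
            nn.MaxPool1d(3, stride=2, padding=1),
            nn.Conv1d(n_filters, n_filters, 3, padding=1),
            nn.ReLU(),
        )
        # Multi-head attention over the fused feature set
        self.attn = nn.MultiheadAttention(n_filters, n_heads, batch_first=True)
        self.fc = nn.Linear(n_filters, n_classes)

    def forward(self, x):                 # x: (batch, seq_len, emb_dim) BERT embeddings
        xc = x.transpose(1, 2)            # (batch, emb_dim, seq_len) for Conv1d
        local = [torch.relu(c(xc)).max(dim=2).values for c in self.convs]
        local = torch.stack(local, dim=1)           # (batch, 3, n_filters)
        g = torch.relu(self.dp_conv(xc))
        g = self.dp_block(g)                        # downsampled global features
        g = g.max(dim=2).values.unsqueeze(1)        # (batch, 1, n_filters)
        fused = torch.cat([local, g], dim=1)        # (batch, 4, n_filters)
        attended, _ = self.attn(fused, fused, fused)
        return self.fc(attended.mean(dim=1))        # (batch, n_classes) logits

model = BCATSketch()
logits = model(torch.randn(2, 20, 768))  # 2 sentences, 20 tokens each
print(logits.shape)
```

Each TextCNN kernel size contributes one pooled feature vector, the DPCNN branch contributes another, and attention re-weights these four "views" before classification.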
  27 +
## Data

BCAT is trained on the **COLD (Chinese Offensive Language Dataset)**, a publicly available dataset that includes offensive and safe comments across various categories. The model also uses real-time data collected from social platforms such as Weibo through a custom web crawler.

### Dataset Statistics

- **COLD Dataset**: Contains 37,480 comments with binary labels indicating whether a comment is offensive or safe.
- **Weibo Data**: Supplementary real-world data gathered through a web crawler to ensure model robustness in practical applications.
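A binary-labeled corpus like this can be loaded with the standard library. The column names (`TEXT`, `label`) and the in-memory sample below are assumptions for illustration, not verified against the released COLD files; adapt them to the actual CSV header.

```python
import csv
import io

# Hypothetical COLD-style CSV (column names assumed): 1 = offensive, 0 = safe.
sample = io.StringIO(
    "topic,label,TEXT\n"
    "gender,1,example offensive comment\n"
    "region,0,example safe comment\n"
)

# Parse into (text, label) pairs for training.
rows = [(r["TEXT"], int(r["label"])) for r in csv.DictReader(sample)]
offensive = sum(label for _, label in rows)
print(len(rows), offensive)  # 2 1
```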
  36 +
## Training and Testing

- **Training**: The model was trained on a dataset split into training, validation, and test sets. Accuracy, precision, recall, and F1-score were used to evaluate performance.
- **Testing**: The model underwent extensive evaluation on the validation and test sets, showing strong results in offensive language detection.
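For reference, the binary precision, recall, and F1 used above can be computed directly. This is a generic sketch of the standard definitions (the prediction lists are made up), not the project's evaluation script.

```python
# Binary precision / recall / F1 for offensive-language detection (1 = offensive).
def precision_recall_f1(y_true, y_pred, positive=1):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = [1, 0, 1, 1, 0, 1]   # illustrative gold labels
y_pred = [1, 0, 1, 0, 0, 1]   # illustrative model predictions
p, r, f1 = precision_recall_f1(y_true, y_pred)
print(f"P={p:.2f} R={r:.2f} F1={f1:.2f}")  # P=1.00 R=0.75 F1=0.86
```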
  41 +
### Key Performance Metrics

| Component Configuration                   | Precision | Recall | F1 Score |
|-------------------------------------------|-----------|--------|----------|
| BCAT (BERT + CTM + DPCNN + TextCNN + MHA) | 87.35%    | 86.81% | 87.34%   |
| BERT + DPCNN + TextCNN + MHA              | 85.85%    | 85.34% | 85.35%   |
| BERT + CTM + TextCNN + MHA                | 84.66%    | 85.14% | 84.97%   |
  49 +
## How to Use

1. **Dependencies**:

   - Python 3.8+
   - PyTorch
   - Transformers (Hugging Face)
   - Contextualized Topic Models (CTM)
   - Jieba for Chinese tokenization

2. **Installation**:

   ```bash
   pip install -r requirements.txt
   ```

3. **Training**: To train the BCAT model on your own dataset:

   ```bash
   python train_model.py --data_path <path_to_data> --save_path <path_to_save_model>
   ```

4. **Inference**: To predict sentiment for new text:

   ```bash
   python predict.py --model_path <path_to_model> --input_text "Your input text here"
   ```