# BCAT Model: Parallel Chinese Offensive Language Detection with Synergistic Semantics and Topic Modeling

## Overview

The BCAT (BERT-CTM Attention-based Text Classifier) model is designed for Chinese sentiment recognition, with a particular focus on detecting offensive and aggressive language. The model leverages BERT-generated contextual word embeddings and CTM (Combined Topic Model) vectors to capture both semantic and thematic features of the text. BCAT integrates a multi-head attention mechanism to enhance feature representation and applies two convolutional networks (DPCNN and TextCNN) for feature extraction.

## Features

- **BERT and CTM Fusion**: BCAT combines BERT embeddings with CTM topic vectors. BERT captures the context of words in a sentence, while CTM identifies overarching themes, improving the model's ability to detect nuanced sentiment.
- **Multi-Head Attention Mechanism**: This component attends to different aspects of the input, ensuring that critical features are emphasized during classification.
- **TextCNN and DPCNN**: These two convolutional networks operate in parallel to extract local (TextCNN) and global (DPCNN) features, improving the model's robustness across different linguistic structures.

## Model Architecture

The BCAT model consists of the following components:

1. **Embedding Layer**: Text is transformed into contextual embeddings by BERT.

2. **Feature Extraction Layer**:

   - TextCNN extracts local features (word and phrase combinations).
   - DPCNN captures global text structure and long-range dependencies.

3. **Feature Fusion and Attention**: The outputs of both networks are combined and processed by the multi-head attention mechanism to highlight relevant information.

4. **Classification Layer**: A fully connected layer with Softmax outputs the predicted sentiment classes.
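The data flow above can be sketched in PyTorch. This is a minimal illustration, not the released implementation: the class name, layer sizes, the shallow stand-in for DPCNN, and the concatenation-then-attention fusion order are all assumptions made for the sketch.

```python
import torch
import torch.nn as nn

class BCATSketch(nn.Module):
    """Illustrative forward pass: BERT embeddings + CTM topic vector ->
    parallel TextCNN / DPCNN-style branches -> multi-head attention -> softmax."""

    def __init__(self, hidden=768, n_topics=50, n_classes=2, n_filters=128):
        super().__init__()
        # Branch 1: TextCNN-style 1-D convolutions over the token axis (local features).
        self.textcnn = nn.ModuleList(
            nn.Conv1d(hidden, n_filters, kernel_size=k, padding=k // 2) for k in (2, 3, 4)
        )
        # Branch 2: a shallow stand-in for DPCNN (global features via conv + downsampling;
        # the real DPCNN repeats its conv/pool block many times).
        self.dpcnn = nn.Sequential(
            nn.Conv1d(hidden, n_filters, 3, padding=1), nn.ReLU(),
            nn.MaxPool1d(2), nn.Conv1d(n_filters, n_filters, 3, padding=1),
        )
        fused = 3 * n_filters + n_filters + n_topics
        self.proj = nn.Linear(fused, hidden)
        self.attn = nn.MultiheadAttention(hidden, num_heads=8, batch_first=True)
        self.classifier = nn.Linear(hidden, n_classes)

    def forward(self, bert_out, topic_vec):
        # bert_out: (batch, seq_len, hidden) BERT token embeddings
        # topic_vec: (batch, n_topics) CTM document-topic distribution
        x = bert_out.transpose(1, 2)  # (batch, hidden, seq_len)
        local = torch.cat([conv(x).max(dim=2).values for conv in self.textcnn], dim=1)
        global_ = self.dpcnn(x).max(dim=2).values
        fused = torch.cat([local, global_, topic_vec], dim=1)  # feature fusion
        h = self.proj(fused).unsqueeze(1)  # (batch, 1, hidden)
        h, _ = self.attn(h, h, h)          # multi-head attention
        return torch.softmax(self.classifier(h.squeeze(1)), dim=-1)
```

Running a batch of 2 sequences of length 16 through this sketch produces a `(2, 2)` tensor of class probabilities; the actual hidden sizes and number of heads should be taken from the training configuration.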

## Data

BCAT is trained on the **COLD (Chinese Offensive Language Dataset)**, a publicly available dataset of offensive and safe comments across various categories. The model also uses real-world data collected from social platforms such as Weibo via a custom web crawler.

### Dataset Statistics

- **COLD Dataset**: Contains 37,480 comments with binary labels indicating whether a comment is offensive or safe.
- **Weibo Data**: Supplementary real-world data gathered through a web crawler to ensure robustness in practical applications.
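A binary-labeled CSV split like the one described above can be read with the standard library alone. The column names `TEXT` and `label`, and the 1 = offensive / 0 = safe encoding, are assumptions here; adjust them to the actual file header.

```python
import csv
import io

def load_cold_split(csv_file, text_col="TEXT", label_col="label"):
    """Read a COLD-style CSV split into parallel (texts, labels) lists.
    Column names and label encoding are assumptions, not the dataset spec."""
    texts, labels = [], []
    for row in csv.DictReader(csv_file):
        texts.append(row[text_col])
        labels.append(int(row[label_col]))  # assumed: 1 = offensive, 0 = safe
    return texts, labels

# Tiny in-memory stand-in for a real split file such as train.csv:
sample = io.StringIO("label,TEXT\n0,这家店的服务很好\n1,你这个想法太蠢了\n")
texts, labels = load_cold_split(sample)  # 2 comments, labels [0, 1]
```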

## Training and Testing

- **Training**: The model was trained on a split of training, validation, and test sets. Accuracy, precision, recall, and F1 score were used to evaluate performance.
- **Testing**: The model was evaluated extensively on the validation and test sets, reaching an F1 score of 87.34% for offensive language detection.
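For reference, the precision, recall, and F1 metrics used above reduce to simple counts over the predictions; the small helper below is an illustration, not part of the project's evaluation code.

```python
def binary_prf(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the positive (offensive) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# 3 offensive comments, 2 recovered, 1 false alarm:
p, r, f = binary_prf([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])  # each ≈ 0.667
```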

### Key Performance Metrics

| Component Configuration                   | Precision | Recall | F1 Score |
|-------------------------------------------|-----------|--------|----------|
| BCAT (BERT + CTM + DPCNN + TextCNN + MHA) | 87.35%    | 86.81% | 87.34%   |
| BERT + DPCNN + TextCNN + MHA              | 85.85%    | 85.34% | 85.35%   |
| BERT + CTM + TextCNN + MHA                | 84.66%    | 85.14% | 84.97%   |

## How to Use

1. **Dependencies**:

   - Python 3.8+
   - PyTorch
   - Transformers (Hugging Face)
   - Contextualized Topic Models (CTM)
   - Jieba for Chinese tokenization

2. **Installation**:

   ```bash
   pip install -r requirements.txt
   ```

3. **Training**: To train the BCAT model on your own dataset:

   ```bash
   python train_model.py --data_path <path_to_data> --save_path <path_to_save_model>
   ```

4. **Inference**: To predict the sentiment of new text:

   ```bash
   python predict.py --model_path <path_to_model> --input_text "Your input text here"
   ```