# BCAT Model: Parallel Chinese Offensive Language Detection with Synergistic Semantics and Topic Modeling
## Overview
The BCAT (BERT-CTM Attention-based Text Classifier) model is designed for Chinese sentiment recognition, particularly focusing on offensive and aggressive language detection. The model leverages BERT-generated contextual word embeddings and CTM (Combined Topic Modeling) to capture both semantic and thematic features from text data. BCAT integrates a multi-head attention mechanism to enhance feature representation and applies convolutional networks (DPCNN and TextCNN) for feature extraction.
## Features
-**BERT and CTM Fusion**: BCAT effectively combines BERT embeddings with CTM topic vectors. BERT captures the context of words in a sentence, while CTM identifies overarching themes, improving the model's ability to detect nuanced sentiments.
-**Multi-Head Attention Mechanism**: This component focuses on different aspects of the input data, ensuring that critical features are emphasized during classification.
-**TextCNN and DPCNN**: These two convolutional networks operate in parallel to extract both local (TextCNN) and global (DPCNN) features, improving the robustness of the model for different linguistic structures.
## Model Architecture
The BCAT model is divided into the following components:
1.**Embedding Layer**: Text data is transformed into embeddings using the BERT model.
2. Feature Extraction Layer:
- TextCNN extracts local features (word and phrase combinations).
- DPCNN captures global text structure and long-range dependencies.
3.**Feature Fusion and Attention**: The output from both networks is combined and processed by the multi-head attention mechanism to highlight relevant information.
4.**Classification Layer**: A fully connected Softmax layer outputs the predicted sentiment classes.
## Data
BCAT is trained on the **COLD (Chinese Offensive Language Dataset)**, a publicly available dataset that includes offensive and safe comments across various categories. The model also uses real-time data collected from social platforms like Weibo through a custom web crawler.
### Dataset Statistics
-**COLD Dataset**: Contains 37,480 comments with binary labels indicating whether a comment is offensive or safe.
-**Weibo Data**: Supplementary real-world data gathered through a web crawler to ensure model robustness in practical applications.
## Training and Testing
-**Training**: The model was trained on a dataset split into training, validation, and test sets. Key metrics such as accuracy, precision, recall, and F1-score were used to evaluate performance.
-**Testing**: The model underwent extensive testing with the validation and test datasets, showing excellent results in offensive language detection.