You type a perfectly innocent prompt into ChatGPT or Claude, and it refuses. "I can't help with that request." You weren't asking for anything harmful. You just wanted to write a crime fiction scene, discuss a medical topic, or generate an image that happened to trigger some invisible rule.
AI content filters exist for good reasons, but they're blunt instruments. They rely on classification models that score content across harm categories, and sometimes those classifiers get it wrong. Understanding how they work helps you rephrase requests when you hit false positives, and gives you context for when restrictions feel arbitrary.
How AI Content Filters Actually Work
Every major AI platform (OpenAI, Anthropic, Google, Microsoft) uses a similar approach. Your input goes through a classification model before it ever reaches the main AI. That classifier scores it across several harm categories. If the score is too high, the request gets blocked before the AI generates anything.
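You can see this scoring step directly with OpenAI's standalone moderation endpoint, which returns per-category scores for any text you send it. A minimal sketch, assuming the current openai Python SDK and the omni-moderation-latest model; the hosted chat products run similar checks internally but don't expose them:

```python
# Score a prompt across harm categories before any generation happens.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

result = client.moderations.create(
    model="omni-moderation-latest",
    input="How do I describe a sword fight in my fantasy novel?",
)

flags = result.results[0]
print(flags.flagged)                   # True if any category crossed its threshold
print(flags.category_scores.violence)  # raw per-category score between 0 and 1
```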
The four main harm categories
| Category | What It Catches | Example False Positive |
|---|---|---|
| Hate/discrimination | Slurs, dehumanization, bias | Historical discussion of racism |
| Sexual content | Explicit material, suggestive content | Medical anatomy questions |
| Violence | Graphic descriptions, weapon instructions | Writing a thriller novel |
| Self-harm | Suicide methods, eating disorders | Mental health research |
Each category has severity levels, usually scored as negligible, low, medium, or high. Most platforms block medium and high by default. Some let you adjust these thresholds (Google's Vertex AI, for example), but consumer products like ChatGPT and Claude give you no control over the settings.
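As a rough mental model, think of the severity bands as cut-offs on a raw classifier score. The numbers below are purely illustrative; the real thresholds are proprietary:

```python
# Toy severity mapping -- the 0.25/0.50/0.75 cut-offs are invented
# for illustration, not taken from any real platform.
def severity(score: float) -> str:
    if score < 0.25:
        return "negligible"
    if score < 0.50:
        return "low"
    if score < 0.75:
        return "medium"
    return "high"

def blocked(score: float, block_at=("medium", "high")) -> bool:
    # Most platforms block medium and above by default.
    return severity(score) in block_at

print(blocked(0.4))  # False: "low" passes
print(blocked(0.6))  # True: "medium" is blocked by default
```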
Input vs output filtering
Filters run twice: once on your prompt (input filtering) and once on the AI's response (output filtering). This means even if your question passes the input check, the AI's answer can still get caught by the output filter. That's why you sometimes see the AI start generating a response and then suddenly stop or backtrack with "I can't continue with this."
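Put together, the pipeline looks roughly like this sketch, where classify and generate are placeholders for the platform's internal classifier and main model:

```python
# Two-pass filtering: the same classifier gates both the prompt and the reply.
def moderated_chat(prompt: str, classify, generate, threshold: float = 0.5) -> str:
    # Input filtering: the prompt is scored before the main model ever runs.
    if max(classify(prompt).values()) >= threshold:
        return "I can't help with that request."

    reply = generate(prompt)

    # Output filtering: the answer is scored too, which is why a response
    # can start streaming and then stop mid-generation.
    if max(classify(reply).values()) >= threshold:
        return "I can't continue with this."
    return reply
```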
Why Filters Feel Dumb
The filter models are separate from the main AI. They're smaller, faster classifiers trained specifically to detect harmful patterns. They don't understand context the way the main model does, which is why false positives happen.
Why You Get Blocked for Normal Requests
False positives are the biggest frustration. Here are the common triggers that catch innocent requests:
- Medical terms. Words related to anatomy, drugs, or bodily functions often trigger the sexual or self-harm classifiers.
- Fiction writing. Violence in a story context looks the same to a classifier as a genuine threat.
- Historical topics. Discussing wars, atrocities, or discrimination can trigger hate and violence filters.
- Security research. Asking about vulnerabilities, exploits, or penetration testing often gets flagged.
- Certain keywords in combination. Individual words are fine, but specific combinations push the classifier score over the threshold.
The frustrating thing is that these classifiers don't understand intent. A medical student asking about drug interactions and someone with harmful intentions can use exactly the same words. The classifier can't tell the difference, so it errs on the side of caution.
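A toy scorer makes that concrete. Nothing below resembles a production classifier; it only shows how individually harmless words can sum past a block threshold regardless of who's asking:

```python
# Invented word weights and threshold, for illustration only.
WEIGHTS = {"pick": 0.20, "lock": 0.20, "bypass": 0.30, "alarm": 0.25}
BLOCK_THRESHOLD = 0.5

def score(prompt: str) -> float:
    return sum(WEIGHTS.get(word, 0.0) for word in prompt.lower().split())

print(score("how do I pick a song"))              # 0.20 -> passes
print(score("pick a lock and bypass the alarm"))  # 0.95 -> blocked
```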
What to Do When Your Request Gets Blocked
If you hit a filter on a legitimate request, these approaches help:
Rephrase with context
Add context that signals legitimate intent. Instead of "how to pick a lock," try "I'm writing a mystery novel where a character picks a lock. How would I describe this scene realistically?" The extra context can push the classifier score below the threshold.
Break the request into parts
Sometimes a complex request triggers filters because it combines multiple borderline elements. Split it into smaller, less ambiguous parts. Ask about each element separately, then combine the responses yourself.
Try a different model
Each platform calibrates its filters differently. What Claude blocks, ChatGPT might allow, and vice versa. If one model refuses your request and you know it's legitimate, try another. Merlio's chat platform lets you switch between multiple models from one interface, which makes this easier than juggling separate accounts.
Use the API with adjustable settings
Some APIs let you control filter sensitivity. Google's Vertex AI lets you set harm thresholds per category. Azure OpenAI Service lets enterprises customize content filtering. These options aren't available in consumer chat products, but they exist for developers and businesses.
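For example, here's roughly what per-category thresholds look like through the Vertex AI Python SDK. A sketch only: the enum, model, and project names reflect the SDK at the time of writing and are placeholders to adapt:

```python
# Loosen one category while keeping another at the usual default.
import vertexai
from vertexai.generative_models import (
    GenerativeModel,
    HarmBlockThreshold,
    HarmCategory,
    SafetySetting,
)

vertexai.init(project="your-project-id", location="us-central1")

safety_settings = [
    SafetySetting(
        category=HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT,
        threshold=HarmBlockThreshold.BLOCK_ONLY_HIGH,  # let low/medium through
    ),
    SafetySetting(
        category=HarmCategory.HARM_CATEGORY_HATE_SPEECH,
        threshold=HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,  # typical default
    ),
]

model = GenerativeModel("gemini-1.5-pro")
response = model.generate_content(
    "Describe realistic lock-picking for a mystery novel scene.",
    safety_settings=safety_settings,
)
print(response.text)
```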
How Strict Are Different AI Platforms?
| Platform | Content Filter Strictness | Adjustable? | Notes |
|---|---|---|---|
| ChatGPT | High | No | Strict on violence, sexual content. Some creative writing allowed |
| Claude | High | No | Very cautious, especially on harmful instructions |
| Gemini | Very high | Yes (API only) | Most restrictive consumer product |
| Vertex AI (Google) | Configurable | Yes | Developers can set thresholds per category |
| Azure OpenAI | Configurable | Yes | Enterprise customers can customize filters |
| Local models (Llama, Mistral) | None to low | Full control | No filters unless you add them |
The general pattern is clear: consumer products are strict with no controls, APIs give some flexibility, and local models give you complete control. If content filters are a persistent problem for your use case (like medical writing, security research, or mature fiction), running a local model is the only way to eliminate them entirely.
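For example, a local model served through the llama-cpp-python bindings has no filter in front of it unless you build one yourself. A sketch, with a hypothetical model path; point it at whatever GGUF file you've downloaded:

```python
# No input or output classifier here -- the prompt goes straight to the model.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/mistral-7b-instruct.Q4_K_M.gguf",  # hypothetical path
    n_ctx=4096,
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Walk me through how SQL injection works."}],
    max_tokens=512,
)
print(out["choices"][0]["message"]["content"])
```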
Platform Picking Tip
For creative writing, Claude tends to be more flexible than ChatGPT for fiction that involves conflict or complex themes. For technical security content, local models like Mistral or Llama are the practical choice.
The Ethics Side of This
Content filters exist because AI can genuinely be misused. They prevent real harm: generating instructions for weapons, creating non-consensual imagery, automating harassment. These are real problems that companies have a responsibility to address.
The tension is between preventing genuine harm and restricting legitimate use. Right now, the filters lean heavily toward restriction because the reputational cost of a harmful output is much higher than the cost of a false positive. That calculus might change as the technology improves, but for now, false positives are the price of the safety net.
Sources
- Google Cloud: Safety and Content Filters - how configurable filters work
- Microsoft Azure: AI Content Filtering - harm categories and severity levels
- IBM: HAP Filtering Against Harmful Content - technical overview of content safety systems