OpenAI releases open source inference model 'gpt-oss-safeguard', allowing developers to set their own censorship rules and surpassing GPT-5 in ability to follow the rules

OpenAI released the open source inference model ' gpt-oss-safeguard ' on October 29, 2025. gpt-oss-safeguard is a model that allows developers to add content policies during inference, allowing them to set their own rules, such as 'prohibiting discussion of fraudulent behavior' or 'prohibiting the generation of fake reviews.'
Introducing gpt-oss-safeguard | OpenAI
Technical report | OpenAI
https://openai.com/index/gpt-oss-safeguard-technical-report/
gpt-oss-safeguard is a model based on gpt-oss , designed to allow developers to set their own policies. While setting AI content policies typically requires the use of classifiers that can detect prohibited words and topics, gpt-oss-safeguard allows developers to specify policies in natural language without the need for classifiers.

OpenAI has released two models under the Apache 2.0 license: ' gpt-oss-safeguard-120b ' with 120 billion parameters and ' gpt-oss-safeguard-20b ' with 20 billion parameters. They also internally use 'Safety Reasoner,' which is a version of gpt-oss-safeguard that has undergone additional training.
Below is a graph comparing systems that follow the policies of 'gpt-oss-safeguard-120b', 'gpt-oss-safeguard-20b', 'Safety Reasoner', 'gpt-5-thinking', 'gpt-oss-120b', and 'gpt-oss-20b'. It can be seen that gpt-oss-safeguard-120b and gpt-oss-safeguard-20b have a higher rate of policy compliance than gpt-5-thinking.

Furthermore, the results of measuring compliance with safety policies using OpenAI's

On the other hand, tests using the toxicity dataset '

The model data for gpt-oss-safeguard-120b and gpt-oss-safeguard-20b is distributed at the following links.
openai/gpt-oss-safeguard-120b · Hugging Face
https://huggingface.co/openai/gpt-oss-safeguard-120b

openai/gpt-oss-safeguard-20b · Hugging Face
https://huggingface.co/openai/gpt-oss-safeguard-20b

Related Posts:
in AI, Posted by log1o_hf







