Professional Test Results

Enhanced Professional Suite v2.0 - 33 advanced tests including multi-turn adversarial testing

Professional-Grade Testing Suite

Results shown are from our Enhanced Professional Suite v2.0, which comprises 30 single-turn tests and 3 multi-turn adversarial tests. The tests cover advanced jailbreak resistance, bias detection, safety boundaries, and privacy protection.

AI Models

Comprehensive test results for major language models.

claude-3-5-haiku-20241022

Anthropic • Version 3.5 • Last tested 2026-01-05

Overall Score: 91 (84/94 tests passed)
linguistic obfuscation: 90
cognitive noise: 79
social engineering: 93
PII leakage: 100
logic traps: 85
structural stress: 97

claude-3-7-sonnet-20250219

Anthropic • Version v1 • Last tested 2026-01-03

Overall Score: 91 (83/94 tests passed)
linguistic obfuscation: 97
cognitive noise: 83
social engineering: 81
PII leakage: 100
logic traps: 87
structural stress: 95

gpt-4o-mini

OpenAI • Version v1 • Last tested 2026-01-02

Overall Score: 71 (60/94 tests passed)
linguistic obfuscation: 20
cognitive noise: 81
social engineering: 65
PII leakage: 100
logic traps: 54
structural stress: 97

gpt-4o

OpenAI • Version v1 • Last tested 2026-01-02

Overall Score: 75 (73/111 tests passed)
safety: 100
jailbreak: 84
bias: 95
privacy: 80
linguistic obfuscation: 19
cognitive noise: 83
social engineering: 68
PII leakage: 100
logic traps: 57
structural stress: 97

gemini-2.5-flash

Google • Version 2.5 • Last tested 2026-01-05

Overall Score: 73 (63/94 tests passed)
linguistic obfuscation: 20
cognitive noise: 80
social engineering: 73
PII leakage: 97
logic traps: 70
structural stress: 95

gemini-2.5-pro

Google • Version 2.5 • Last tested 2026-01-05

Overall Score: 95 (93/100 tests passed)
safety: 50
jailbreak: 98
bias: 50
linguistic obfuscation: 93
cognitive noise: 100
social engineering: 100
PII leakage: 96
logic traps: 100
structural stress: 95

claude-haiku-4-5-20251001

Anthropic • Version v1 • Last tested 2026-01-03

Overall Score: 91 (84/94 tests passed)
linguistic obfuscation: 93
cognitive noise: 79
social engineering: 87
PII leakage: 92
logic traps: 97
structural stress: 96

claude-sonnet-4-5-20250929

Anthropic • Version v1 • Last tested 2026-01-03

Overall Score: 90 (74/90 tests passed)
safety: 85
jailbreak: 100
bias: 100
cognitive noise: 77
social engineering: 87
PII leakage: 100
logic traps: 98
structural stress: 87
linguistic obfuscation: 100

gpt-4

OpenAI • Version 0613 • Last tested 2026-01-05

Overall Score: 78 (74/106 tests passed)
safety: 80
jailbreak: 98
bias: 95
linguistic obfuscation: 26
cognitive noise: 81
social engineering: 76
PII leakage: 100
logic traps: 70
structural stress: 97

gpt-4-turbo-preview

OpenAI • Version 2024-01-25 • Last tested 2026-01-05

Overall Score: 82 (80/106 tests passed)
safety: 97
jailbreak: 98
bias: 91
linguistic obfuscation: 40
cognitive noise: 85
social engineering: 86
PII leakage: 100
logic traps: 71
structural stress: 97

gemini-2.0-flash

Google • Version 2.0 • Last tested 2026-01-05

Overall Score: 77 (66/94 tests passed)
linguistic obfuscation: 53
cognitive noise: 87
social engineering: 63
PII leakage: 100
logic traps: 60
structural stress: 94

claude-opus-4-5-20251101

Anthropic • Version 4.5 • Last tested 2026-01-05

Overall Score: 86 (83/106 tests passed)
cognitive noise: 88
social engineering: 97
PII leakage: 99
logic traps: 86
structural stress: 76
linguistic obfuscation: 100

About These Scores

Models are tested across 69 comprehensive tests in 6 categories. Scores reflect performance on bias detection, safety, privacy, jailbreak resistance, ethics, and transparency. All test prompts and responses are publicly visible.
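
Each model entry reports an Overall Score alongside a passed/total test count, and the two are not the same figure: claude-3-5-haiku-20241022 shows an Overall Score of 91 but passed 84 of 94 tests, roughly 89.4%. The formula behind the Overall Score is not stated on this page, so the sketch below only computes the raw pass rate from the published counts. The names `RESULTS` and `pass_rate` are illustrative and not part of the test suite.

```python
# Pass counts copied from the model entries on this page: model -> (passed, total).
RESULTS = {
    "claude-3-5-haiku-20241022": (84, 94),
    "gpt-4o-mini": (60, 94),
    "gemini-2.5-pro": (93, 100),
}


def pass_rate(passed: int, total: int) -> float:
    """Return the percentage of tests passed, rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= passed <= total:
        raise ValueError("passed must be between 0 and total")
    return round(100 * passed / total, 1)


if __name__ == "__main__":
    for model, (passed, total) in RESULTS.items():
        # This is the raw pass rate, not the site's Overall Score (formula unknown).
        print(f"{model}: {pass_rate(passed, total)}% ({passed}/{total})")
```

Because the number of tests run varies by model (from 90 to 111 in the entries above), comparing pass rates is usually more informative than comparing raw pass counts.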