Assess the evidence.
Inherited risk and data quality. Step 4 pivots from structural analysis to evidence assessment: track record depth, sample size, Sharpe ratio realism, and the reasons live performance alone does not guarantee reliability.
- The pivot from structural analysis to evidence assessment and the concept of inherited risk.
- Equity curve texture and Sharpe ratio realism as diagnostic tools for data quality.
- The dual-sufficiency requirement: track record length and sample size must both be adequate.
- The live-versus-backtest reframing — why live status is necessary but not sufficient.
Steps 2 and 3 examine a system's structure. Step 4 marks a pivot: from structural analysis to evidence assessment. The question is no longer whether the system is built soundly, but whether the evidence supporting its performance claims deserves the weight being placed on it.
This step deploys the Performance Validation pillar toolkit and introduces the concept of inherited risk — the risk an investor absorbs by trusting a track record whose evidentiary foundation may not warrant that trust. A system can pass both structural tests and still present evidence that is too thin, too short, or too favorable to support reliable conclusions.
The pivot from structure to evidence.
A structurally sound system operating on live markets with genuine risk management can still present inherited risk through its evidence base. The track record may be too short to encompass different market regimes. The sample size may be too small for statistical conclusions. The performance data may come from a backtest rather than live trading.
Each of these evidence-quality dimensions introduces a distinct form of inherited risk. An investor who deploys capital on a track record that falls short of evidentiary standards inherits the risk that the record does not represent what the system will actually deliver.
Diagnostic checks for evidence quality.
A system operating in real market conditions produces an equity curve with texture: meaningful drawdowns, flat periods, varying rates of recovery, and visible differences between strong and weak performance periods. An equity curve that rises too smoothly raises a specific evidentiary question: does this curve reflect real market conditions, or parameters optimized against historical data?
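The texture described above can be summarized with a few simple statistics. The following Python sketch is illustrative only: the function name texture_summary, the three statistics it reports, and the sample curves are assumptions made for this example, not part of the Institute's toolkit. It takes an equity curve as a list of account values and reports the maximum drawdown, the share of periods spent below a prior peak, and the share of losing periods.

```python
def texture_summary(equity):
    """Summarize the drawdown texture of an equity curve (a list of account values)."""
    if len(equity) < 2:
        raise ValueError("need at least two equity values")
    peak = equity[0]
    max_drawdown = 0.0
    periods_below_peak = 0
    for value in equity:
        if value >= peak:
            peak = value  # new high-water mark
        else:
            periods_below_peak += 1
            max_drawdown = max(max_drawdown, (peak - value) / peak)
    # Count period-to-period declines.
    losing_periods = sum(1 for prev, cur in zip(equity, equity[1:]) if cur < prev)
    return {
        "max_drawdown": max_drawdown,
        "share_below_peak": periods_below_peak / len(equity),
        "share_losing_periods": losing_periods / (len(equity) - 1),
    }


# Hypothetical curves, invented for illustration.
smooth = [100, 101, 102, 103, 104, 105]
textured = [100, 104, 98, 99, 107, 103]

smooth_stats = texture_summary(smooth)
textured_stats = texture_summary(textured)
```

A curve whose statistics are all at or near zero, like the smooth example, is not proof of overfitting, but it is the pattern that prompts the question of whether the curve reflects real conditions.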
The dual-sufficiency requirement.
Both track record length and sample size must be independently sufficient. These are separate conditions, and meeting one does not compensate for failing the other.
The evidence base is only as strong as the weaker of these two dimensions. A five-year track record with 30 total trades has temporal length but an insufficient statistical foundation. A three-month track record with 500 trades has statistical volume but insufficient temporal diversity, because three months is too short to span different market regimes.
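The dual-sufficiency requirement is a logical AND, not an average. A minimal Python sketch follows. The numeric minimums (MIN_YEARS, MIN_TRADES) are hypothetical placeholders chosen for illustration, since the text does not state specific thresholds, and the names EvidenceBase and dual_sufficiency are likewise assumptions.

```python
from dataclasses import dataclass

# Hypothetical thresholds for illustration only; the text gives no numeric minimums.
MIN_YEARS = 3.0
MIN_TRADES = 200


@dataclass
class EvidenceBase:
    years: float  # track record length in years
    trades: int   # total number of trades in the record


def dual_sufficiency(evidence, min_years=MIN_YEARS, min_trades=MIN_TRADES):
    """Check each dimension separately; neither can compensate for the other."""
    length_ok = evidence.years >= min_years
    sample_ok = evidence.trades >= min_trades
    return {
        "length_ok": length_ok,
        "sample_ok": sample_ok,
        "sufficient": length_ok and sample_ok,
    }


# The two cases from the text: long but thin, and dense but short.
long_but_thin = dual_sufficiency(EvidenceBase(years=5.0, trades=30))
dense_but_short = dual_sufficiency(EvidenceBase(years=0.25, trades=500))
```

Under these assumed thresholds, both cases fail: the first passes on length only, the second on sample size only.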
Reframing the live vs. backtest question.
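The common assumption is binary: live performance is reliable, and backtested performance is not. The Institute's framework challenges this simplification. Live performance is preferable, all else being equal, but the live-versus-backtest distinction does not by itself determine evidence quality. The determining question is whether the evidence base is sufficient: a live record that is too short or too thinly sampled carries inherited risk just as a backtest does.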
Frequently asked questions.
The Algo Institute's Step 4 examines four dimensions: equity curve texture, Sharpe ratio realism (sustained ratios above 3.0 are virtually nonexistent in audited data), dual sufficiency of track record length and sample size, and whether performance comes from live trading or backtesting. Live performance alone does not guarantee evidence reliability.
Live performance is not necessarily reliable evidence. Without sufficient track record length and sample size, it is just as unreliable as a backtest. A system that has traded live for three months with very few trades has not generated enough evidence to establish reliability. The question is whether the evidence base is sufficient, not whether it is live.
A sustained Sharpe ratio above 3.0 is virtually nonexistent in audited performance data. A system claiming sustained ratios above this threshold warrants scrutiny of its evidence base — the ratio may reflect a favorable sample period, overfitted parameters, or backtest optimization rather than genuine risk-adjusted performance.
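The 3.0 threshold can be turned into a simple screen. The Python sketch below computes an annualized Sharpe ratio from periodic returns and flags values above 3.0. The function names, the zero risk-free rate default, the annualization factor of 12 for monthly returns, and the sample return series are all illustrative assumptions, not part of the Institute's framework.

```python
import math
import statistics

# From the text: sustained ratios above 3.0 warrant scrutiny.
SHARPE_SCRUTINY_THRESHOLD = 3.0


def annualized_sharpe(returns, periods_per_year=12, risk_free_per_period=0.0):
    """Annualized Sharpe ratio from periodic returns (needs at least two observations)."""
    excess = [r - risk_free_per_period for r in returns]
    spread = statistics.stdev(excess)  # sample standard deviation
    if spread == 0:
        raise ValueError("returns have zero variance; Sharpe ratio is undefined")
    return statistics.mean(excess) / spread * math.sqrt(periods_per_year)


def warrants_scrutiny(returns, **kwargs):
    """True when the annualized Sharpe ratio exceeds the scrutiny threshold."""
    return annualized_sharpe(returns, **kwargs) > SHARPE_SCRUTINY_THRESHOLD


# Hypothetical monthly return series, invented for illustration.
plausible = [0.01, -0.01, 0.02, 0.0]
too_smooth = [0.02, 0.021, 0.019, 0.02]

plausible_flag = warrants_scrutiny(plausible)
too_smooth_flag = warrants_scrutiny(too_smooth)
```

Four observations are used only to keep the example short; a sample that small would itself fail the sample-size test. A flag here is a prompt to examine the sample period and the possibility of overfitting, not a verdict.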