Overview
- Talkie-1930-13b is a 13-billion-parameter model trained on about 260 billion tokens from books, newspapers, journals, patents, and case law published before 1931.
- The model is publicly available, with downloads on GitHub and Hugging Face and a web chat interface; the team warns that outputs may reflect outdated and offensive views.
- Benchmarks show it trails an identically sized model trained on modern web data, though it keeps pace on basic language understanding and numeracy.
- Tests point to noisy optical character recognition (OCR) as a main bottleneck: models trained on conventionally OCR'd text reach roughly 30% of the performance of models trained on human-transcribed text, and simple cleaning lifts that to about 70%.
- The team reports leakage of post-1930 facts and is building better filters, a vintage post-training pipeline, and a larger multilingual corpus; a GPT-3-level vintage model is targeted for release this summer.