Optimise your PDFs for Josef Q
This article will help you optimise your PDFs for Josef Q.
Josef Q performs best when documents are complete and well-organised, with a logical structure and a clear information hierarchy. Below we share some key points on what makes a good PDF, along with best-practice tips for preparing your PDFs. Josef will be able to handle your PDFs as-is, but if you run into issues, such as not getting the exact sources you want for a question, one of the factors below could be the cause.
Please feel free to share any feedback or questions you have with our product team. You can send your feedback to support@joseflegal.com or to your Josef account manager.
What makes a great PDF?
1. PDF Creation (Source vs. Scanned)
💡 Tip: How your PDF is made matters most. If you're creating a PDF for Josef, always “Save as PDF” or “Export to PDF” — never “Print to PDF”.
- Export your PDFs from Microsoft Word or another structured editor (best): retains text hierarchy (headings, paragraphs, lists, tables).
- Printed-to-PDF / flattened (bad): destroys PDF structure by turning text into images or untagged characters.
- Scanned PDFs (worst): scanning introduces recognition errors, missing lines, and broken words.
2. Structure and Tagging
💡 Tip: Use structured styles (Heading 1, Heading 2, Normal, List) before exporting. Don’t style by font size alone.
Tagged PDFs (like those created by Microsoft Word or Adobe InDesign with accessibility settings on) include semantic information such as headings, lists, and table structures.
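If you want a quick way to see whether an exported PDF carries tags, the sketch below may help. It is a rough heuristic and not how Josef Q inspects documents: it only looks for the `/StructTreeRoot` key that tagged PDFs normally declare in their document catalog. The function names `looks_tagged` and `file_looks_tagged` are hypothetical, and a PDF that stores its catalog in a compressed object stream could be reported as untagged even when it is tagged.

```python
from pathlib import Path

# Tagged PDFs normally reference a structure tree from the document catalog.
STRUCTURE_MARKER = b"/StructTreeRoot"


def looks_tagged(pdf_bytes: bytes) -> bool:
    """Return True if the raw bytes mention a structure tree root.

    Heuristic only: a False result is not conclusive, because the
    catalog may sit inside a compressed object stream.
    """
    return STRUCTURE_MARKER in pdf_bytes


def file_looks_tagged(path: str) -> bool:
    """Apply the same heuristic to a file on disk (hypothetical helper)."""
    return looks_tagged(Path(path).read_bytes())
```

A True result suggests the export kept its headings, lists, and table structure; a False result is a prompt to re-check the export settings rather than proof that tags are missing.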
3. Layout and Flow
💡 Tip: Use single-column formatting for documents intended for digital reading or AI ingestion.
The visual design strongly affects both comprehension and machine segmentation:
- Two-column layouts confuse reading order.
- Sidebars or footnotes may be included with main text during processing.
- Overlapping text boxes or “floating” elements may break logical flow.
- Large gaps or hidden tables can make reading difficult.
4. Fonts and Encoding
💡 Tip: Use standard fonts, and avoid decorative fonts or text embedded as images.
If a PDF uses non-standard fonts or glyph encoding:
- Text may not copy or parse correctly (e.g. ligatures like “fi” become symbols).
- Characters may show as “?” or as garbled sequences (e.g. ’ → â€™).
- Tables using monospaced fonts can distort when parsed.
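The first two symptoms above can be reproduced in a few lines, which may help when you are trying to recognise them in extracted text. This is an illustrative sketch using only the Python standard library; it does not describe how Josef Q processes text. It shows a curly apostrophe being garbled when UTF-8 bytes are read as Windows-1252, and an “fi” ligature character being expanded back to two plain letters.

```python
import unicodedata

# A right single quotation mark (U+2019), as typed in a word processor.
original = "\u2019"

# Reading its UTF-8 bytes as Windows-1252 produces three junk characters.
garbled = original.encode("utf-8").decode("cp1252")
print(garbled)  # â€™

# The single ligature character U+FB01 stands in for the letters "f" and "i".
ligature_word = "\ufb01le"
print(len(ligature_word))  # 3

# NFKC normalisation expands the ligature back into ordinary letters.
normalised = unicodedata.normalize("NFKC", ligature_word)
print(normalised)  # file
```

If copied or extracted text from your PDF contains sequences like these, re-exporting with a standard font is usually a simpler fix than cleaning the text afterwards.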
5. Tables and Lists
💡 Tip: Always use Microsoft Word or Google Docs’ built-in table and list tools — not manual formatting.
The visual representation of tables and lists affects how they’re parsed:
- Tables made from shapes or tabs are harder to detect.
- Lists made manually with “1.” or “–” instead of real list styles often fail to nest.
6. Images and Scans
💡 Tip: Avoid placing key text in images; add “alt text” for accessibility.
Images add complexity when processing your documents:
- Mixed image/text layers can confuse reading order.
- Background watermarks may be mistaken for text.
7. File Size, Compression, and Object Complexity
💡 Tip: Aim for cleanly generated, text-based PDFs of 5 MB or less, with minimal embedded objects.
Large, highly compressed or object-heavy PDFs (e.g. many embedded fonts, vector shapes) can:
- Slow down processing;
- Cause incomplete text extraction.
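If you are preparing many files at once, a short script can flag the ones above the suggested size. The sketch below is one possible approach, not a Josef tool: the function name `oversized_pdfs` is hypothetical, and it assumes 5 MB means 5 × 1024 × 1024 bytes.

```python
import os

# Assumption: the 5 MB guideline is read as 5 * 1024 * 1024 bytes.
MAX_BYTES = 5 * 1024 * 1024


def oversized_pdfs(folder: str, limit: int = MAX_BYTES) -> list[str]:
    """Return the names of PDF files in `folder` larger than `limit` bytes."""
    flagged = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        is_pdf = name.lower().endswith(".pdf") and os.path.isfile(path)
        if is_pdf and os.path.getsize(path) > limit:
            flagged.append(name)
    return flagged
```

Calling `oversized_pdfs("my_documents")` on a folder of your own returns a sorted list of file names to review. File size alone says nothing about object complexity, so treat it as a first pass only.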
8. Metadata and Document Properties
💡 Tip: Add descriptive metadata and bookmarks before exporting.
Good metadata improves indexing and retrieval:
- Titles, subjects, and author fields inform document classification.
- Tagged outlines and bookmarks help the AI segment and contextualise sections.