Content Extraction

Extract and analyze file text and metadata with the Document Filters SDK

Document Filters uses deep inspection technology to extract and analyze all of the text and metadata in a file, including hidden content. With Document Filters, software developers can enable their applications to extract and process content from hundreds of file formats without needing the source application.

Document Filters can:

  • Extract all text and metadata from more than 550 file formats, including Word, Excel, PowerPoint, PDF, AutoCAD, ZIP, MSG and Visio
  • Extract hidden information such as tracked changes, comments, notes, annotations and embedded links
  • Perform optical character recognition (OCR) of document images to extract textual data
  • Extract contents of packaged, archived, compressed, and other container files
  • Run natively on 27 platforms, with flexible APIs that let you integrate it with your application in the language of your choice
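
To make the container-file capability in the list above concrete, here is a minimal sketch of recursive extraction. It uses only Python's standard zipfile module and is not the Document Filters API; the names walk_container and build_demo_archive are hypothetical and exist only for this illustration. It assumes every non-archive member can be treated as UTF-8 text, which a real extractor would not assume.

```python
import io
import zipfile


def walk_container(data: bytes, path: str = "") -> dict[str, str]:
    """Recursively collect text from a ZIP archive held in memory.

    Hypothetical helper for illustration only (not part of any SDK).
    Nested ZIP members are opened in turn; every other member is decoded
    as UTF-8 text, with undecodable bytes replaced.
    """
    found: dict[str, str] = {}
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            member_path = f"{path}/{info.filename}" if path else info.filename
            payload = archive.read(info)
            # Inspect the bytes, not the file name, to decide whether to recurse.
            if zipfile.is_zipfile(io.BytesIO(payload)):
                found.update(walk_container(payload, member_path))
            else:
                found[member_path] = payload.decode("utf-8", errors="replace")
    return found


def build_demo_archive() -> bytes:
    """Build a small ZIP that contains a text file and a nested ZIP."""
    inner = io.BytesIO()
    with zipfile.ZipFile(inner, "w") as z:
        z.writestr("notes.txt", "inner note")
    outer = io.BytesIO()
    with zipfile.ZipFile(outer, "w") as z:
        z.writestr("readme.txt", "outer text")
        z.writestr("attachments.zip", inner.getvalue())
    return outer.getvalue()


if __name__ == "__main__":
    for member, text in walk_container(build_demo_archive()).items():
        print(f"{member}: {text}")
```

The sketch decides whether to recurse by inspecting each member's bytes rather than trusting its file extension, which is the general idea behind handling nested or mislabeled containers.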

For more than 25 years, Document Filters has powered leading software products in areas such as email archiving, antivirus protection, content management, business intelligence, document imaging and intelligent capture.

Document Filters is also at the forefront of content mining and intelligence gathering in applications such as compliance systems, eDiscovery, text analytics and Lucene deployments, which makes it a powerful and proven alternative to open-source options and other OEM SDKs.