Dedoc
This sample demonstrates the use of Dedoc in combination with LangChain as a DocumentLoader.
Overview
Dedoc is an open-source library/service that extracts texts, tables, attached files and document structure (e.g., titles, list items, etc.) from files of various formats.
Dedoc supports DOCX, XLSX, PPTX, EML, HTML, PDF, images and more.
The full list of supported formats can be found in the Dedoc documentation.
Integration details
Class | Package | Local | Serializable | JS support |
---|---|---|---|---|
DedocFileLoader | langchain_community | β | beta | β |
DedocPDFLoader | langchain_community | β | beta | β |
DedocAPIFileLoader | langchain_community | β | beta | β |
Loader features
Methods for lazy loading and async loading are available, but in fact, document loading is executed synchronously.
Source | Document Lazy Loading | Async Support |
---|---|---|
DedocFileLoader | ❌ | ❌ |
DedocPDFLoader | ❌ | ❌ |
DedocAPIFileLoader | ❌ | ❌ |
Setup
- To access the DedocFileLoader and DedocPDFLoader document loaders, you'll need to install the dedoc integration package.
- To access DedocAPIFileLoader, you'll need to run the Dedoc service, e.g. as a Docker container (please see the documentation for more details):
docker pull dedocproject/dedoc
docker run -p 1231:1231 dedocproject/dedoc
Installation instructions for Dedoc are given in its documentation.
# Install package
%pip install --quiet "dedoc[torch]"
Note: you may need to restart the kernel to use updated packages.
Instantiation
from langchain_community.document_loaders import DedocFileLoader
loader = DedocFileLoader("./example_data/state_of_the_union.txt")
Load
docs = loader.load()
docs[0].page_content[:100]
'\nMadam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and t'
Lazy Load
docs = loader.lazy_load()
for doc in docs:
    print(doc.page_content[:100])
    break
Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and t
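The loop above stops after the first document. As a minimal sketch of the same idea for the first few documents, the helper below takes any iterator of objects with a page_content attribute and previews only the first n of them. The names preview, FakeDoc and fake_lazy_load are hypothetical stand-ins introduced here for illustration; they are not part of Dedoc or LangChain.

```python
from itertools import islice
from typing import Iterable, Iterator, List


class FakeDoc:
    """Hypothetical stand-in exposing only the page_content attribute."""

    def __init__(self, text: str) -> None:
        self.page_content = text


def fake_lazy_load() -> Iterator[FakeDoc]:
    """Hypothetical stand-in for loader.lazy_load(): yields documents one by one."""
    for i in range(1000):
        yield FakeDoc(f"document {i} " + "x" * 200)


def preview(docs: Iterable, n: int = 3, width: int = 100) -> List[str]:
    """Return the first `width` characters of the first `n` documents.

    islice stops pulling from the iterator after `n` items, so the
    remaining documents are never requested.
    """
    return [doc.page_content[:width] for doc in islice(docs, n)]


previews = preview(fake_lazy_load(), n=2, width=12)
print(previews)
```

With a real loader, the call would presumably look like preview(loader.lazy_load()).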
API reference
For detailed information on configuring and calling Dedoc loaders, please see the API references:
- https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.dedoc.DedocFileLoader.html
- https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.pdf.DedocPDFLoader.html
- https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.dedoc.DedocAPIFileLoader.html
Loading any file
For automatic handling of any file in a supported format, DedocFileLoader can be useful.
The loader detects the file type automatically, provided the file has a correct extension.
The file parsing process can be configured through dedoc_kwargs during DedocFileLoader class initialization.
Basic examples of some of these options are given here; please see the documentation of DedocFileLoader and the dedoc documentation for more details about configuration parameters.
Basic example
from langchain_community.document_loaders import DedocFileLoader
loader = DedocFileLoader("./example_data/state_of_the_union.txt")
docs = loader.load()
docs[0].page_content[:400]
'\nMadam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. \n\n\n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\n\n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n\n\n\nWith a duty to one another to the American people to '
Modes of split
DedocFileLoader supports different types of document splitting into parts (each part is returned separately).
For this purpose, the split parameter is used with the following options:
- document (default value): document text is returned as a single langchain Document object (don't split);
- page: split document text into pages (works for PDF, DJVU, PPTX, PPT, ODP);
- node: split document text into Dedoc tree nodes (title nodes, list item nodes, raw text nodes);
- line: split document text into textual lines.
loader = DedocFileLoader(
    "./example_data/layout-parser-paper.pdf",
    split="page",
    pages=":2",
)
docs = loader.load()
len(docs)
2
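The pages=":2" value above yields two documents, and a later example on this page uses pages="2:2" to read a single page. One way to read both results is a 1-based range with inclusive bounds, where an empty bound means "first page" or "last page". The sketch below only illustrates that reading; select_pages is a hypothetical helper, not a Dedoc function, and the total of 10 pages is an arbitrary value chosen for the demonstration.

```python
from typing import List


def select_pages(pages: str, total: int) -> List[int]:
    """Return the 1-based page numbers selected by a 'start:end' string.

    Assumption (not taken from the Dedoc source): both bounds are
    inclusive, and an empty bound means the first or the last page.
    """
    start_text, sep, end_text = pages.partition(":")
    if not sep:
        raise ValueError(f"expected 'start:end', got {pages!r}")
    start = int(start_text) if start_text.strip() else 1
    end = int(end_text) if end_text.strip() else total
    # Clamp the range to the pages that actually exist.
    start = max(start, 1)
    end = min(end, total)
    return list(range(start, end + 1))


first_two = select_pages(":2", total=10)
only_second = select_pages("2:2", total=10)
print(first_two)    # [1, 2]
print(only_second)  # [2]
```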
Handling tables
DedocFileLoader supports table handling when the with_tables parameter is set to True during loader initialization (with_tables=True by default).
Tables are not split: each table corresponds to one langchain Document object.
For tables, the Document object has the additional metadata fields type="table" and text_as_html, which holds the table's HTML representation.
loader = DedocFileLoader("./example_data/mlb_teams_2012.csv")
docs = loader.load()
docs[1].metadata["type"], docs[1].metadata["text_as_html"][:200]
('table',
'<table border="1" style="border-collapse: collapse; width: 100%;">\n<tbody>\n<tr>\n<td colspan="1" rowspan="1">Team</td>\n<td colspan="1" rowspan="1"> "Payroll (millions)"</td>\n<td colspan="1" r')
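To work with the text_as_html value as rows of cells rather than raw markup, the HTML can be parsed with Python's standard html.parser module. The following is a minimal sketch under the assumption that the table is a plain grid of tr/td elements like the output above; merged cells (colspan or rowspan greater than 1) are not expanded. TableRows and table_rows are names introduced here, and the sample string is made-up data in the same shape, not real loader output.

```python
from html.parser import HTMLParser
from typing import List, Optional


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._cell: Optional[List[str]] = None  # text pieces of the open cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            if not self.rows:  # tolerate a cell that appears outside any row
                self.rows.append([])
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self.rows[-1].append("".join(self._cell).strip())
            self._cell = None


def table_rows(html: str) -> List[List[str]]:
    """Parse an HTML table string into a list of rows of cell texts."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


# Made-up sample shaped like the text_as_html output shown above.
sample = (
    '<table border="1"><tbody>'
    '<tr><td colspan="1" rowspan="1">Team</td>'
    '<td colspan="1" rowspan="1"> "Payroll (millions)"</td></tr>'
    '<tr><td colspan="1" rowspan="1">Example Team</td>'
    '<td colspan="1" rowspan="1">10.0</td></tr>'
    '</tbody></table>'
)
rows = table_rows(sample)
print(rows)
```

With a loaded table document, the call would be table_rows(docs[1].metadata["text_as_html"]).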
Handling attached files
DedocFileLoader supports handling of attached files when with_attachments is set to True during loader initialization (with_attachments=False by default).
Attachments are split according to the split parameter.
For attachments, the langchain Document object has an additional metadata field type="attachment".
loader = DedocFileLoader(
    "./example_data/fake-email-attachment.eml",
    with_attachments=True,
)
docs = loader.load()
docs[1].metadata["type"], docs[1].page_content
('attachment',
'\nContent-Type\nmultipart/mixed; boundary="0000000000005d654405f082adb7"\nDate\nFri, 23 Dec 2022 12:08:48 -0600\nFrom\nMallori Harrell <mallori@unstructured.io>\nMIME-Version\n1.0\nMessage-ID\n<CAPgNNXSzLVJ-d1OCX_TjFgJU7ugtQrjFybPtAMmmYZzphxNFYg@mail.gmail.com>\nSubject\nFake email with attachment\nTo\nMallori Harrell <mallori@unstructured.io>')
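Because tables and attachments are marked through the type metadata field, a loaded list can be separated into groups with a few lines of plain Python. This is a sketch, not loader functionality: Doc is a hypothetical stand-in that mimics only the page_content and metadata attributes, and the "text" label for documents without a type key is an assumption made here.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, List


@dataclass
class Doc:
    """Hypothetical stand-in with the two attributes used below."""

    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def group_by_type(docs: Iterable[Any], default: str = "text") -> Dict[str, List[Any]]:
    """Group documents by metadata["type"].

    Documents without a "type" key are filed under `default`
    (an assumed label, chosen for this sketch).
    """
    groups: Dict[str, List[Any]] = defaultdict(list)
    for doc in docs:
        groups[doc.metadata.get("type", default)].append(doc)
    return dict(groups)


docs = [
    Doc("message body"),
    Doc("cell text", {"type": "table"}),
    Doc("first attachment", {"type": "attachment"}),
    Doc("second attachment", {"type": "attachment"}),
]
groups = group_by_type(docs)
counts = {name: len(items) for name, items in groups.items()}
print(counts)  # {'text': 1, 'table': 1, 'attachment': 2}
```

The same function should work on the list returned by loader.load(), since it relies only on the metadata attribute.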
Loading PDF file
If you want to handle only PDF documents, you can use DedocPDFLoader, which supports only PDF.
The loader supports the same parameters for document splitting and for table and attachment extraction.
Dedoc can extract text from PDFs with or without a textual layer, and can automatically detect whether a textual layer is present and correct.
Several PDF handlers are available; use the pdf_with_text_layer parameter to choose one of them.
Please see the parameters description for more details.
For PDFs without a textual layer, Tesseract OCR and its language packages should be installed.
In this case, the corresponding installation instructions can be useful.
from langchain_community.document_loaders import DedocPDFLoader
loader = DedocPDFLoader(
    "./example_data/layout-parser-paper.pdf", pdf_with_text_layer="true", pages="2:2"
)
docs = loader.load()
docs[0].page_content[:400]
'\n2\n\nZ. Shen et al.\n\n37], layout detection [38, 22], table detection [26], and scene text detection [4].\n\nA generalized learning-based framework dramatically reduces the need for the\n\nmanual speciﬁcation of complicated rules, which is the status quo with traditional\n\nmethods. DL has the potential to transform DIA pipelines and beneﬁt a broad\n\nspectrum of large-scale document digitization projects.\n'
Dedoc API
If you want to get up and running with less setup, you can use Dedoc as a service.
DedocAPIFileLoader can be used without installing the dedoc library.
The loader supports the same parameters as DedocFileLoader and also detects input file types automatically.
To use DedocAPIFileLoader, you should run the Dedoc service, e.g. as a Docker container (please see the documentation for more details):
docker pull dedocproject/dedoc
docker run -p 1231:1231 dedocproject/dedoc
Please do not use our demo URL https://dedoc-readme.hf.space in your code; it appears in the example here for demonstration only.
from langchain_community.document_loaders import DedocAPIFileLoader
loader = DedocAPIFileLoader(
    "./example_data/state_of_the_union.txt",
    url="https://dedoc-readme.hf.space",
)
docs = loader.load()
docs[0].page_content[:400]
'\nMadam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. \n\n\n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\n\n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n\n\n\nWith a duty to one another to the American people to '
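Since the demo URL is not meant for real code, one option is to read the service address from configuration and fall back to a locally running container. The sketch below assumes the service started with the docker commands above is reachable at http://localhost:1231; the DEDOC_URL environment variable and the resolve_dedoc_url helper are hypothetical names chosen for this example.

```python
import os
from typing import Mapping, Optional

# Assumed address of a local container published on port 1231.
DEFAULT_DEDOC_URL = "http://localhost:1231"


def resolve_dedoc_url(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the Dedoc service URL from DEDOC_URL, or the local default.

    DEDOC_URL is a hypothetical variable name; a trailing slash is
    removed so the result has one consistent form.
    """
    env = os.environ if env is None else env
    url = env.get("DEDOC_URL", "").strip() or DEFAULT_DEDOC_URL
    return url.rstrip("/")


local_url = resolve_dedoc_url({})
custom_url = resolve_dedoc_url({"DEDOC_URL": "http://dedoc.internal:1231/"})
print(local_url)   # http://localhost:1231
print(custom_url)  # http://dedoc.internal:1231
```

The result can then be passed as the url argument, e.g. DedocAPIFileLoader("./example_data/state_of_the_union.txt", url=resolve_dedoc_url()).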