The plugin parses HTML files and extracts data from the `<table>` elements within them, converting it into a format that Flatfile can process.
The plugin can handle multiple tables within a single HTML file, creating a separate sheet for each one. It can interpret complex table layouts that use `colspan` and `rowspan` attributes to merge cells, so that the data stays correctly aligned. Use cases include importing data from legacy systems that export reports as HTML pages, scraping data from web pages, and processing any structured data provided as an HTML table.
Parameter | Type | Default | Description |
---|---|---|---|
handleColspan | boolean | true | When true, the plugin will correctly handle cells with a colspan attribute by duplicating the cell’s value across the specified number of columns |
handleRowspan | boolean | true | When true, the plugin will attempt to handle cells with a rowspan attribute by carrying the cell’s value down into the subsequent rows |
maxDepth | number | 3 | Defines the maximum depth for parsing nested tables (Note: not currently implemented) |
debug | boolean | false | When set to true, the plugin will output detailed logs to the console during the parsing process |
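
To make the `handleColspan` and `handleRowspan` descriptions concrete, here is a minimal TypeScript sketch of the behavior the table describes: a cell's value is duplicated across the columns it spans and carried down into the rows it spans. This is not the plugin's implementation. The `SpanCell` type and the `rowAt` and `expandSpans` helpers are hypothetical names, and the input is assumed to be already-parsed cells rather than raw HTML.

```typescript
// Hypothetical cell shape for illustration; not a type exported by the plugin.
interface SpanCell {
  text: string;
  colspan?: number; // defaults to 1 when omitted
  rowspan?: number; // defaults to 1 when omitted
}

// Return the grid row at `index`, creating it if it does not exist yet.
function rowAt(grid: string[][], index: number): string[] {
  let row = grid[index];
  if (row === undefined) {
    row = [];
    grid[index] = row;
  }
  return row;
}

// Expand merged cells so that every row holds one value per column.
function expandSpans(rows: SpanCell[][]): string[][] {
  const grid: string[][] = [];
  rows.forEach((row, r) => {
    const current = rowAt(grid, r);
    let c = 0;
    for (const cell of row) {
      // Skip slots already filled by a rowspan from an earlier row.
      while (current[c] !== undefined) c++;
      const colspan = cell.colspan ?? 1;
      const rowspan = cell.rowspan ?? 1;
      for (let dr = 0; dr < rowspan; dr++) {
        const target = rowAt(grid, r + dr);
        for (let dc = 0; dc < colspan; dc++) {
          target[c + dc] = cell.text; // duplicate the value into each spanned slot
        }
      }
      c += colspan;
    }
  });
  return grid;
}

const expanded = expandSpans([
  [{ text: "Region", rowspan: 2 }, { text: "Sales", colspan: 2 }],
  [{ text: "Q1" }, { text: "Q2" }],
]);
// expanded is [["Region", "Sales", "Sales"], ["Region", "Q1", "Q2"]]
```

Filling a rectangular grid first keeps later rows aligned, because a value carried down by `rowspan` occupies its column before that row's own cells are placed.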
By default, the plugin runs with `handleColspan` and `handleRowspan` enabled, meaning it will attempt to correctly structure data from cells that span multiple columns or rows. Debug logging is disabled, and the nesting depth for tables is notionally set to 3.
Set `debug: true` in the configuration to see a step-by-step log of the parsing process.
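
As a rough illustration of the configurations described here, the following TypeScript sketch merges caller overrides onto the defaults listed in the parameter table. The `HtmlTableOptions` interface, the `DEFAULTS` constant, and the `resolveOptions` helper are hypothetical names used only for this example. They are not part of the plugin's API, and the sketch does not show how options are actually passed to the plugin.

```typescript
// Option names and default values come from the parameter table;
// the interface and helper names are assumptions made for illustration.
interface HtmlTableOptions {
  handleColspan?: boolean;
  handleRowspan?: boolean;
  maxDepth?: number; // documented, but noted as not currently implemented
  debug?: boolean;
}

const DEFAULTS: Required<HtmlTableOptions> = {
  handleColspan: true,
  handleRowspan: true,
  maxDepth: 3,
  debug: false,
};

// Any option the caller omits falls back to its documented default.
function resolveOptions(
  overrides: HtmlTableOptions = {}
): Required<HtmlTableOptions> {
  return { ...DEFAULTS, ...overrides };
}

// Default configuration: span handling on, debug logging off.
const defaultConfig = resolveOptions();

// Debug configuration: only `debug` changes; everything else keeps its default.
const debugConfig = resolveOptions({ debug: true });
```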
The plugin works best with standard `<table>` elements that use `<th>` tags for headers and `<td>` tags for data cells. The plugin’s effectiveness is highly dependent on the quality of the input HTML.
- The plugin processes files with the `.html` extension.
- The plugin runs on the `listener.on('file:created')` event.
- Each `<table>` element found in the HTML document will be extracted into its own separate sheet within the Flatfile workbook. Sheets are named sequentially: `Table_1`, `Table_2`, and so on.
- Headers are taken from `<th>` elements. If a table has no `<th>` elements, the `headers` array for that sheet will be empty, and data rows will likely not be mapped correctly.
- `maxDepth` Limitation: The `maxDepth` configuration option is defined in the options type but is not currently implemented in the parsing logic. Nested tables are processed, but their depth is not limited by this setting.
- `rowspan` Implementation: The current implementation for `handleRowspan` may not function as expected because it attempts to re-parse the trimmed text content of a cell to find an attribute, which is not possible. This feature should be considered unreliable.
- Set the `debug` option to `true`. This will print detailed logs of the extraction process, including tables found, headers extracted, and cell data.
- If the HTML structure (`<table>`, `<tr>`, `<th>`, `<td>`) is invalid, the function may return an empty object or partially extracted data.
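
The sheet naming and header behavior noted above can be sketched in TypeScript as follows. This is an illustration under stated assumptions, not the plugin's code: the `ParsedTable` and `Sheet` shapes and the `toSheets` helper are hypothetical, and the real output structure may differ. The sketch keys rows by header name, which shows why a table without `<th>` cells yields rows that are not mapped to any field.

```typescript
// Hypothetical intermediate shape for one parsed <table>.
interface ParsedTable {
  headers: string[]; // text of the <th> cells; empty if the table has none
  rows: string[][]; // text of the <td> cells, one array per <tr>
}

// Hypothetical per-sheet result; the plugin's actual structure may differ.
interface Sheet {
  headers: string[];
  data: Record<string, string>[];
}

// Build one sheet per table, named Table_1, Table_2, and so on.
function toSheets(tables: ParsedTable[]): Record<string, Sheet> {
  const sheets: Record<string, Sheet> = {};
  tables.forEach((table, i) => {
    const data = table.rows.map((row) => {
      const record: Record<string, string> = {};
      // Values are keyed by header, so no headers means nothing is mapped.
      table.headers.forEach((header, col) => {
        record[header] = row[col] ?? "";
      });
      return record;
    });
    sheets[`Table_${i + 1}`] = { headers: table.headers, data };
  });
  return sheets;
}

const sheets = toSheets([
  { headers: ["Name", "Age"], rows: [["Ada", "36"]] },
  { headers: [], rows: [["orphaned", "values"]] }, // table without <th> cells
]);
// sheets has the keys "Table_1" and "Table_2"; the single row in Table_2
// becomes an empty record because there are no headers to key it by.
```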