75 Commits

Author SHA1 Message Date
Ilayaperumal Gopinathan
fcd2803f36 Release version 1.0.0 2025-05-19 10:40:35 +01:00
Christian Tzolov
5d6bbd9368 refactor: update dependencies to use spring-ai-model and spring-ai-commons
Replace spring-ai-client-chat dependency with spring-ai-model in model implementations
and memory repositories, and with spring-ai-commons in document readers. This change
improves the dependency structure by having components depend on the appropriate
abstraction level.

Additional changes:
- Add slf4j-api dependency to pdf-reader and spring-ai-retry
- Move spring-ai-client-chat to test scope in spring-ai-ollama
- Fix XML formatting in some pom.xml files

Signed-off-by: Christian Tzolov <christian.tzolov@broadcom.com>
2025-05-16 19:58:11 +02:00
Ilayaperumal Gopinathan
f2940cffce Next development version 2025-05-13 19:06:16 +01:00
Ilayaperumal Gopinathan
30a9638de8 Release version 1.0.0-RC1 2025-05-13 19:05:52 +01:00
Ilayaperumal Gopinathan
3acc206eb2 Next development version 2025-04-30 17:51:20 +01:00
Ilayaperumal Gopinathan
b657cf3bae Release version 1.0.0-M8 2025-04-30 17:51:07 +01:00
David Frizelle
c95d5c05ba Bump org.apache.tika to 3.1.0 (#2900)
Signed-off-by: David Frizelle <david.frizelle@gmail.com>
2025-04-28 11:00:29 +01:00
Soby Chacko
81b715b3d2 Update dependencies and cleanup misc version management
- Extract common dependency versions to properties in root pom
  - Added jsoup.version property
  - Added mockk-jvm.version property
  - Added neo4j-cypher-dsl-bom.version property
- Update dependency versions:
  - djl from 0.30.0 to 0.32.0
  - oci-sdk from 3.51.0 to 3.63.1
  - azure-identity from 1.14.0 to 1.15.4
- Remove hardcoded versions for consistency
- Minor polishing

Signed-off-by: Soby Chacko <soby.chacko@broadcom.com>
2025-04-22 19:10:36 -04:00
Soby Chacko
53a7af500b Addressing the remaining checkstyle failures
Signed-off-by: Soby Chacko <soby.chacko@broadcom.com>
2025-04-15 11:04:03 +01:00
Soby Chacko
d9e7ace996 Miscellaneous checkstyle fixes
Signed-off-by: Soby Chacko <soby.chacko@broadcom.com>
2025-04-14 09:48:03 +01:00
Ilayaperumal Gopinathan
bda702e8e1 Next development version 2025-04-10 20:23:38 +01:00
Ilayaperumal Gopinathan
584138af28 Release version 1.0.0-M7 2025-04-10 20:23:07 +01:00
Soby Chacko
717e419515 Rename spring-ai parent from spring-ai to spring-ai-parent
Signed-off-by: Soby Chacko <soby.chacko@broadcom.com>
2025-04-04 12:43:22 -04:00
Mark Pollack
e3f9b7c16b rename spring-ai-core to spring-ai-client-chat 2025-04-04 11:44:38 -04:00
Alexandros Pappas
29002dfc0d chore: remove unused imports (#2542)
Signed-off-by: Alexandros Pappas <apappascs@gmail.com>
2025-03-21 12:20:29 +00:00
gongzhongqiang
f167fd8b3c fix: Update testNonExistingUrl to testNonExistingHtmlResource, use a non-existent classpath resource
Signed-off-by: gongzhongqiang <gongzhongqiang@apache.org>
2025-03-19 15:43:59 +00:00
Alexandros Pappas
82b46d2182 feat: add JSoup HTML document reader
This commit introduces the `JsoupDocumentReader` and `JsoupDocumentReaderConfig` classes, which provide functionality to read and parse HTML documents using the JSoup library.

The reader supports:
- Extracting text from specific HTML elements using CSS selectors.
- Extracting all text from the body of the document.
- Grouping text by element.
- Extracting metadata, including the document title, meta tags, and link URLs.
- Reading from various resource types (files, URLs, byte arrays).
- Configurable character encoding, selector, separator, and metadata extraction.

This new reader enhances Spring AI's ability to process web content and other HTML-based data sources.

Signed-off-by: Alexandros Pappas <apappascs@gmail.com>
2025-03-10 11:34:25 +00:00
shahbazaamir
2394ac82ad Added test cases to cover usage of ExtractedTextFormatter
Signed-off-by: shahbazaamir <shahbaz07dbit@gmail.com>
2025-03-03 13:30:50 +00:00
Ilayaperumal Gopinathan
2932769883 Switch back to use slf4j logging
- Revert the switch to Apache Commons Logging and restore the previously used slf4j logging
2025-02-03 15:31:43 -05:00
Ilayaperumal Gopinathan
8303a52611 Use Apache Commons Logging
- Remove existing spring-boot-starter-logging
- Update to use Spring Framework's LogAccessor to use commons logging

Resolves #2095
2025-01-28 11:00:05 +00:00
Ilayaperumal Gopinathan
977500f7a1 Remove deprecated classes and methods in spring-ai-core
* Remove use of Document.getContent method from spring-ai-core, use getText
* Remove deprecated ChatOptionsBuilder class
* Remove deprecated FunctionCallingOptionsBuilder class
2025-01-06 16:57:55 -05:00
Mark Pollack
d7fe07b0f1 Next development version 2024-12-23 14:25:21 -05:00
Mark Pollack
ab022fa956 Release version 1.0.0-M5 2024-12-23 14:24:55 -05:00
WonJun Lee
636f3aee4f GH-1913: Add line separator override for text formatting
Fixes: #1913

Issue: https://github.com/spring-projects/spring-ai/issues/1913

- Add lineSeparator field to ExtractedTextFormatter with configurable override
- Update deleteTopTextLines and deleteBottomTextLines methods to use custom separator
- Mark old methods as deprecated in favor of new ones with separator parameter
- Update PDF test to use explicit line separator for Windows compatibility
2024-12-20 20:19:28 -05:00
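The line-separator override described in 636f3aee4f can be illustrated with a minimal sketch. It is not the `ExtractedTextFormatter` code: the class name `LineSeparatorSketch` and the exact method signature are assumptions made for illustration. The point it shows is that splitting and re-joining on an explicit separator, rather than the platform default, gives the same result on Windows and Unix input.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class LineSeparatorSketch {

    // Hypothetical helper, not the Spring AI implementation: drops the first
    // numberOfLines lines, splitting and re-joining on the given separator.
    public static String deleteTopTextLines(String text, int numberOfLines, String lineSeparator) {
        if (text == null || text.isEmpty() || numberOfLines <= 0) {
            return text;
        }
        // Pattern.quote treats the separator literally; -1 keeps trailing empty lines.
        String[] lines = text.split(Pattern.quote(lineSeparator), -1);
        if (numberOfLines >= lines.length) {
            return "";
        }
        return String.join(lineSeparator, Arrays.copyOfRange(lines, numberOfLines, lines.length));
    }

    public static void main(String[] args) {
        // Text with Windows line endings, handled identically on any platform.
        String windowsText = "header\r\nfirst\r\nsecond";
        System.out.println(deleteTopTextLines(windowsText, 1, "\r\n"));
    }
}
```

Passing the separator explicitly is what lets a test pin the behaviour regardless of the operating system it runs on.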
Mark Pollack
5b11501cbe Update usage of Document::getContent to getText 2024-12-12 14:43:51 -05:00
Mark Pollack
dfbc394f83 Make Document support single text or media content
The Document class previously allowed multiple media entries while also having a
text field, leading to ambiguity in content handling. This change enforces a
clear separation between text and media documents to prevent content type
confusion and simplify document processing.

A Document now must contain either text content or a single media entry, but
never both. This aligns with the class's primary use in ETL pipelines where
clear content type boundaries are essential for proper embedding generation and
vector database storage.

Additional architectural changes:
- Document now implements a cleaner API by removing deprecated methods
- Removed MediaContent interface implementation from Document class
- Document.getMedia() now returns a single Media object instead of Collection
- Removed EMPTY_TEXT constant in favor of proper null handling
- Constructor signatures simplified and streamlined
- Builder pattern improved to enforce single content type constraint

The breaking changes include:
- Media is now a single entry instead of a collection
- Content field renamed to text for clarity
- Removed support for mixed content types
- Simplified builder API to prevent ambiguous construction

Prefer using text-related methods over deprecated content methods to
better reflect the actual content type being handled and improve API clarity.
2024-12-09 23:25:38 -05:00
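The "either text or a single media entry, never both" rule from dfbc394f83 can be sketched with a small builder. This is an illustration only: `Doc`, `Builder`, and the use of `byte[]` as a stand-in for a media entry are hypothetical names, not Spring AI's `Document` or `Media` types.

```java
public class SingleContentDocumentSketch {

    // Hypothetical document type for illustration; not Spring AI's Document class.
    public static final class Doc {
        private final String text;   // set for text documents, otherwise null
        private final byte[] media;  // stand-in for a single media entry, otherwise null

        private Doc(String text, byte[] media) {
            this.text = text;
            this.media = media;
        }

        public static Builder builder() { return new Builder(); }
        public boolean isText() { return this.text != null; }
        public String getText() { return this.text; }
        public byte[] getMedia() { return this.media; }
    }

    public static final class Builder {
        private String text;
        private byte[] media;

        public Builder text(String text) { this.text = text; return this; }
        public Builder media(byte[] media) { this.media = media; return this; }

        public Doc build() {
            // Reject both "neither set" and "both set".
            if ((this.text == null) == (this.media == null)) {
                throw new IllegalArgumentException("exactly one of text or media must be specified");
            }
            return new Doc(this.text, this.media);
        }
    }

    public static void main(String[] args) {
        Doc textDoc = Doc.builder().text("hello").build();
        System.out.println(textDoc.isText());
        try {
            Doc.builder().text("hello").media(new byte[] {1, 2, 3}).build();
        } catch (IllegalArgumentException ex) {
            System.out.println("rejected: " + ex.getMessage());
        }
    }
}
```

Validating in `build()` keeps the constraint in one place, so no constructor overload can produce a mixed-content document.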
Thomas Vitale
fe58fd30eb Support similarity scores in Document API
Document
* Introduced “score” attribute in Document API. It stores the similarity score.
* Consolidated “distance” metadata for Documents. It stores the distance measurement.
* Adopted prefix-less naming convention in Document.Builder and deprecated old methods.
* Deprecated the many overloaded Document constructors in favour of Document.Builder.

Vector Stores
* Every vector store implementation now configures a “score” attribute with the similarity score of the Document embedding. It also includes the “distance” metadata with the distance measurement.
* Fixed error in Elasticsearch where distance and similarity were mixed up.
* Added missing integration tests for SimpleVectorStore.
* The Azure Vector Store and HanaDB Vector Store do not include those measurements because the product documentation does not include information about how the similarity score is returned, and without access to the cloud products I could not verify that via debugging.
* Improved tests to actually assert the result of the similarity search based on the returned score.

Signed-off-by: Thomas Vitale <ThomasVitale@users.noreply.github.com>
2024-12-02 14:54:28 -05:00
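The difference between the “score” and “distance” values mentioned in fe58fd30eb is easy to mix up, so here is a minimal sketch of how the two relate. It assumes a cosine metric, where distance is defined as one minus similarity; the metric a given vector store actually uses may differ, and the class and method names are hypothetical.

```java
public class ScoreDistanceSketch {

    // Cosine similarity: 1.0 for vectors pointing the same way, 0.0 for orthogonal ones.
    public static double cosineSimilarity(double[] a, double[] b) {
        if (a.length != b.length || a.length == 0) {
            throw new IllegalArgumentException("vectors must be non-empty and of equal length");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            throw new IllegalArgumentException("zero vector has no direction");
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Cosine distance is the complement: higher similarity means lower distance.
    public static double cosineDistance(double[] a, double[] b) {
        return 1.0 - cosineSimilarity(a, b);
    }

    public static void main(String[] args) {
        double[] query = {1.0, 2.0};
        double[] stored = {2.0, 4.0};
        System.out.println("score (similarity): " + cosineSimilarity(query, stored));
        System.out.println("distance: " + cosineDistance(query, stored));
    }
}
```

Under this assumption a higher score means a closer match while a higher distance means a worse one, which is why swapping the two silently inverts the ranking.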
Mark Pollack
67a8896422 Next development version 2024-11-20 18:03:30 -05:00
Mark Pollack
33c05c399c Release version 1.0.0-M4 2024-11-20 18:02:47 -05:00
Christian Tzolov
018257a605 fix: Resolve javadoc and maven configuration issues 2024-11-16 12:43:27 +01:00
Christian Tzolov
0ca91b2ed9 fix: Resolve various javadoc and checkstyle issues 2024-11-16 11:06:14 +01:00
dafriz
5e86583679 Bump org.apache.tika to 3.0.0 2024-11-11 11:32:39 +00:00
d050150
78a2a2788b GH-1689 Handle StringIndexOutOfBoundsException in PagePdfDocumentReader
- Add test coverage to TextLine
- Use char[] instead of String for TextLine
- Optimise index handling when reading text lines

Resolves #1689
2024-11-08 17:28:32 +00:00
Soby Chacko
66f58d2d70 Change default build setting to disable Checkstyle enforcement
- Disable project-wide Checkstyle checks to unblock development
- Add documentation for enabling Checkstyle locally
- Fix remaining checkstyle violations in current codebase

Fixes #1669
2024-11-05 10:43:38 -05:00
Soby Chacko
e72ab6ba25 Addressing more checkstyle violations
- Enable checkstyle on more modules and address violations
review
2024-10-31 01:04:41 -04:00
Soby Chacko
8e758dbd00 Introduce checkstyle plugin
- Based on https://github.com/spring-io/spring-javaformat
- In this iteration, Checkstyle is only enabled for spring-ai-core
2024-10-24 16:43:59 -04:00
Mark Pollack
4c83fe8302 Guard against NPE in ZhiPu embedding model
- Update retry test to pass - needs investigation
2024-10-08 23:37:00 +02:00
Mark Pollack
4a892b5269 Release version 1.0.0-M3 2024-10-08 23:18:50 +02:00
Mark Pollack
e1884d1d92 Next development version 2024-08-23 18:47:37 -04:00
Mark Pollack
43ad2bdb97 Release version 1.0.0-M2 2024-08-23 18:46:58 -04:00
Mark Pollack
a89b938def Make PDF Reader classes more customizable for assigning custom metadata 2024-08-23 12:23:18 -04:00
Fu Cheng
8469d7dc27 Upgrade to Apache Tika 3.0.0-BETA2 2024-08-23 10:03:17 -04:00
Piotr Olaszewski
56e678c487 Add Markdown document reader with enhanced features
This commit introduces a new Markdown document reader with several
key features and improvements:

* Add support for text with various formatting elements
* Implement handling for horizontal rules and hard line breaks
* Add functionality for inline and block code sections
* Incorporate blockquote handling
* Support ordered and unordered lists
* Introduce additional metadata capabilities
* Include JavaDocs

Update ETL documentation to reflect these new features and usage.

Fixes #105
2024-08-22 13:35:44 -04:00
Fu Cheng
0ebf4abc2c Fix not-null assertion 2024-07-17 11:03:26 +02:00
Mark Pollack
da4b26f74c Update TikaDocumentReaderTests to use easier to parse web site as input 2024-07-02 12:37:25 -04:00
Lorenzo Caenazzo
237feb3437 ⬆️ bump tika to 3.0.0-BETA to avoid pdfbox version conflicts 2024-07-02 16:17:59 +02:00
wubo
ccddc53f71 Upgrade Apache Tika to version 2.9.2 to address security vulnerabilities 2024-06-15 18:04:05 +02:00
Mark Pollack
ac91302eed Next development version 2024-05-28 13:53:04 -04:00
Mark Pollack
0670575f3e Release version 1.0.0-M1 2024-05-28 13:49:11 -04:00
Eddú Meléndez
8fa675b145 Update spring boot version to 3.2.4 2024-04-04 13:43:01 +02:00