Fredhopper reads XML files to build its indexes.
File format
Input data for the FAS Data API must be presented as XML, encoded in the UTF-8 character set.
FAS input XML needs to comply with the content-acquisition.dtd. It should contain the root tag <items>, under which there are one or more <item> tags, which in turn contain one or more <attribute> tags:
    <?xml version="1.0" encoding="utf-8"?>
    <items>
      <item>
        <attribute>
          <value>VALUE</value>
        </attribute>
        <attribute> ... </attribute>
      </item>
      <item> ... </item>
      ...
    </items>
A Fredhopper input XML file should never contain more than 5000 items, and a maximum of 1000 items per file is recommended. When Fredhopper is also fed large chunks of unstructured text, it may be necessary to reduce the number of items per file further.
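As an illustration of the per-file limit, the following is a minimal sketch of splitting a data set into files of at most 1000 items. The record layout (`identifier`, `category_id`, `category_name`), the function names and the fixed `en_US` locale are assumptions made for this example, not part of the FAS API, and the sketch does not emit a DOCTYPE declaration.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Iterator, List

RECOMMENDED_ITEMS_PER_FILE = 1000  # recommended maximum per file
HARD_LIMIT_ITEMS_PER_FILE = 5000   # a file should never exceed this


def chunk_items(records: List[Dict[str, str]],
                size: int = RECOMMENDED_ITEMS_PER_FILE) -> Iterator[List[Dict[str, str]]]:
    """Yield consecutive slices of at most `size` records."""
    if not 0 < size <= HARD_LIMIT_ITEMS_PER_FILE:
        raise ValueError("size must be between 1 and 5000")
    for start in range(0, len(records), size):
        yield records[start:start + size]


def build_items_xml(chunk: List[Dict[str, str]]) -> bytes:
    """Serialize one chunk as a UTF-8 encoded <items> document."""
    root = ET.Element("items")
    for record in chunk:
        # Hypothetical record layout: one item with a single top-level category.
        item = ET.SubElement(root, "item",
                             identifier=record["identifier"], operation="add")
        categories = ET.SubElement(item, "attribute",
                                   identifier="categories", type="hierarchical")
        value = ET.SubElement(categories, "value",
                              identifier=record["category_id"], locale="en_US")
        value.text = record["category_name"]
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)
```

Each chunk returned by `chunk_items` can then be passed to `build_items_xml` and written to its own file. A smaller `size` can be passed when items carry large text or asset values.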
Operations
As mentioned in the Data API overview, there are four operations that can be performed through the Data API:
| Operation | Description |
|---|---|
| add | Adds items to the FAS index. By default, adding an item that already exists replaces the existing item. Fredhopper can also be configured to ignore such an item instead. |
| delete | Removes items from the FAS index. Note that delete removes the entire item, including all its locales. |
| update | Updates an attribute of an item in the FAS index. The whole attribute is always updated, i.e. it is not possible to add or remove a single value in a set. |
| replace | Updates the complete item in the FAS index. |
The operation is specified as an attribute of the item tag, e.g.
<item operation="add" ...>
When adding an item, at least the categories attribute should be specified, as this attribute defines to which Universe the item should be added.
| Operation / Item already in index? | Yes | No |
|---|---|---|
| add | replace by default; ignored if ignore-add-for-existing-items is configured | add |
| update | update | ignored |
| delete | delete | ignored |
| replace | replace | add |
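The table above can be read as a small decision function. The sketch below models it against a toy in-memory dictionary; it is not how FAS implements its index, and the function name and the `ignore_add_for_existing_items` parameter are illustrative stand-ins for the configuration option named in the table.

```python
from typing import Dict, Optional

Index = Dict[str, Dict[str, object]]


def apply_operation(index: Index, operation: str, item_id: str,
                    attributes: Optional[Dict[str, object]] = None,
                    ignore_add_for_existing_items: bool = False) -> str:
    """Apply one operation to a toy index and return the resulting action."""
    attributes = dict(attributes or {})
    exists = item_id in index
    if operation == "add":
        if exists and ignore_add_for_existing_items:
            return "ignored"
        index[item_id] = attributes  # an existing item is replaced by default
        return "replace" if exists else "add"
    if operation == "replace":
        index[item_id] = attributes
        return "replace" if exists else "add"
    if operation == "update":
        if not exists:
            return "ignored"
        # Each supplied attribute is overwritten as a whole; others are kept.
        index[item_id].update(attributes)
        return "update"
    if operation == "delete":
        if not exists:
            return "ignored"
        del index[item_id]
        return "delete"
    raise ValueError(f"unknown operation: {operation}")
```

For example, an `update` for an unknown item returns `"ignored"` and leaves the index unchanged, while a `replace` for an unknown item behaves like an `add`.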
Content rules
Content must follow these rules.
Localization
Data that is loaded into FAS can be multilingual. This means that items and attributes do not have to be loaded separately for different locales.
The values of attributes of type list(64), set(64) and asset can be defined in multiple locales in the input data, e.g.
    <attribute identifier="color" type="set">
      <value identifier="red" locale="en_US">Red</value>
      <value identifier="red" locale="fr_FR">Rouge</value>
      <value identifier="red" locale="nl_NL">Rood</value>
    </attribute>
To also set the multilingual name of the attribute itself (not for assets), use the following XML, or set the name via the Business Manager / System Manager / localisation.
    <attribute identifier="color" type="set">
      <name locale="en_US">Color</name>
      <name locale="fr_FR">Couleur</name>
      <name locale="nl_NL">Kleur</name>
      <value identifier="red" locale="en_US">Red</value>
      <value identifier="red" locale="fr_FR">Rouge</value>
      <value identifier="red" locale="nl_NL">Rood</value>
    </attribute>
Locales should follow RFC 3066, i.e. language_COUNTRY, where
- language is an ISO 639-1 code
- COUNTRY is an ISO 3166-1 alpha-2 code
int, float and text attributes are not localized. Their values are the same across all locales.
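A localized attribute like the `color` example can be generated programmatically. The sketch below uses Python's standard `xml.etree.ElementTree`; the helper names are hypothetical, and the locale check is a simplified reading of the language_COUNTRY form (two lowercase letters, an underscore, two uppercase letters) that does not verify the codes against the ISO lists.

```python
import re
import xml.etree.ElementTree as ET
from typing import Dict

# Simplified language_COUNTRY shape, e.g. en_US; an assumption for this sketch.
LOCALE_PATTERN = re.compile(r"[a-z]{2}_[A-Z]{2}")


def check_locale(locale: str) -> str:
    """Return the locale unchanged, or raise if it is not language_COUNTRY."""
    if not LOCALE_PATTERN.fullmatch(locale):
        raise ValueError(f"not a language_COUNTRY locale: {locale!r}")
    return locale


def localized_attribute(identifier: str, attr_type: str, value_id: str,
                        names: Dict[str, str],
                        values: Dict[str, str]) -> ET.Element:
    """Build an <attribute> with one <name> and one <value> per locale."""
    attribute = ET.Element("attribute", identifier=identifier, type=attr_type)
    for locale, text in names.items():
        ET.SubElement(attribute, "name", locale=check_locale(locale)).text = text
    for locale, text in values.items():
        value = ET.SubElement(attribute, "value",
                              identifier=value_id, locale=check_locale(locale))
        value.text = text
    return attribute
```

Note that the value identifier stays the same across locales; only the `locale` attribute and the display text change.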
Identifiers
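Identifiers are the global (non-localized) way to identify items, attributes and attribute values. There are three kinds of identifiers in the FAS input XML: item identifiers, attribute identifiers and attribute value identifiers, where the form of a value identifier depends on the attribute type:
- Item identifiers (e.g. <item identifier="1234ab" ...>)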
- Attribute identifiers (e.g. <attribute identifier="att1" ...>)
- List / Set / Hierarchical attribute value identifiers (e.g. <value identifier="val1" ...>)
- Int / Float attribute value identifiers (e.g. <value>123</value>)
- Text / Asset attribute value identifiers (e.g. <value>Hello World</value>)
- Ref attribute value identifiers (e.g. <value>id_of_an_item</value>)
Below is a list of allowed characters in these identifiers:
| name | value range | note |
|---|---|---|
| int | 0 .. +MAXINT | |
| float | 0.001 .. +MAXFLOAT | |
| list64 | [a-z][0-9]_ (first character must be a letter or underscore) | |
| set64 | [a-z][0-9]_ (first character must be a letter or underscore) | |
| list | [a-z][0-9]_ (first character must be a letter or underscore) | |
| set | [a-z][0-9]_ (first character must be a letter or underscore) | |
| hierarchical | [a-z0-9]+ (we advise the first character to be a letter) | categories must have unique identifiers |
| text | any UTF-8 string | has no identifier tag attribute, it has only a multilingual part |
| asset | | has no identifier tag attribute, it has only a multilingual part |
| ref | [a-z][0-9]_ | must refer to an existing item identifier |
| item identifier | [a-z][0-9]_ | is the unique id of an item, also called "secondid" |
| attribute type name | [a-z][0-9]_ (first character must be a letter or underscore) | is the name of an attribute type, e.g. color or brand |
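The character rules in the table can be checked before the XML is generated. The sketch below is one possible reading of the table, and that reading is an assumption: `[a-z][0-9]_` is taken to mean "lowercase letters, digits and underscore", with the first-character restriction applied only where the table states it. The dictionary keys and function name are hypothetical.

```python
import re

# Lowercase letters, digits, underscore; first character a letter or underscore.
FIRST_LETTER_OR_UNDERSCORE = re.compile(r"[a-z_][a-z0-9_]*")
# Lowercase letters, digits, underscore; no first-character restriction stated.
LETTERS_DIGITS_UNDERSCORE = re.compile(r"[a-z0-9_]+")
# Lowercase letters and digits only (no underscore).
LETTERS_DIGITS = re.compile(r"[a-z0-9]+")

IDENTIFIER_PATTERNS = {
    "list64": FIRST_LETTER_OR_UNDERSCORE,
    "set64": FIRST_LETTER_OR_UNDERSCORE,
    "list": FIRST_LETTER_OR_UNDERSCORE,
    "set": FIRST_LETTER_OR_UNDERSCORE,
    "attribute type name": FIRST_LETTER_OR_UNDERSCORE,
    "hierarchical": LETTERS_DIGITS,
    "ref": LETTERS_DIGITS_UNDERSCORE,
    "item identifier": LETTERS_DIGITS_UNDERSCORE,
}


def is_valid_identifier(kind: str, identifier: str) -> bool:
    """Check an identifier against the pattern assumed for its kind."""
    pattern = IDENTIFIER_PATTERNS.get(kind)
    if pattern is None:
        raise KeyError(f"no identifier rule for kind: {kind!r}")
    return pattern.fullmatch(identifier) is not None
```

Under this reading, `1234ab` is accepted as an item identifier but `1red` is rejected as a set value identifier, because set values must start with a letter or underscore.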
Illegal characters
Illegal XML characters have to be replaced by entity references. A character like "<" inside an XML element will generate an error, because the parser interprets it as the start of a new element. There are 5 predefined entity references in XML 1.0:
| Character | Entity reference | Unicode code point | Description |
|---|---|---|---|
| " | quot | U+0022 (34) | quotation mark |
| & | amp | U+0026 (38) | ampersand |
| ' | apos | U+0027 (39) | apostrophe |
| < | lt | U+003C (60) | less-than sign |
| > | gt | U+003E (62) | greater-than sign |
| Note: Only the characters "<" and "&" are strictly illegal in XML. Apostrophes, quotation marks and greater than signs are legal text content, but it is better to replace them. They must be replaced in attribute content. |
Other entities can only be used in the numeric character reference format, e.g. &#163; instead of &pound;. The numeric character reference format covers UTF-8 characters.
Non-printing characters (code points below U+0020) are not allowed in the XML, with the exception of Line Feed (U+000A), Carriage Return (U+000D) and Horizontal Tab (U+0009). However, the XML parser used by the loader converts a Carriage Return into a Line Feed. To avoid this, translate Carriage Returns into numeric character references (&#13;).
CDATA sections ("<![CDATA[" text, e.g. asset content "]]>") can only be used in the PCDATA of an element, for example the content (value) of an asset. Inside a CDATA section illegal XML characters no longer need to be escaped, but the rules described above still apply: non-printing characters must be filtered out.
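The escaping and filtering rules above can be combined into a single clean-up step applied to every value before it is written. The following is a minimal sketch using `xml.sax.saxutils.escape` from the Python standard library; the function names are hypothetical, and it targets element content, not CDATA sections.

```python
from xml.sax.saxutils import escape

# Control characters below U+0020 that are allowed in the XML.
ALLOWED_CONTROL = {"\t", "\n", "\r"}


def clean_text(text: str) -> str:
    """Filter non-printing characters and escape the five predefined entities."""
    # Drop every character below U+0020 except tab, line feed, carriage return.
    filtered = "".join(ch for ch in text if ch >= " " or ch in ALLOWED_CONTROL)
    # escape() handles &, < and >; quotes are added through the extra mapping.
    escaped = escape(filtered, {'"': "&quot;", "'": "&apos;"})
    # Keep carriage returns from being normalized to line feeds by the parser.
    return escaped.replace("\r", "&#13;")


def numeric_reference(char: str) -> str:
    """Return the decimal numeric character reference for one character."""
    return f"&#{ord(char)};"
```

With UTF-8 output, characters such as the pound sign can be written directly. `numeric_reference` is only needed where a named entity outside the five predefined ones would otherwise be used.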
Hierarchical attributes
The hierarchical categorisation of an item is put into the categories attribute of type hierarchical.
When inserting new items using the add operation, at least the categories attribute has to be specified, as the top level category defines in which universe(s) the item will be inserted.
    <?xml version="1.0" encoding="utf-8"?>
    <!DOCTYPE items SYSTEM "content-acquisition.dtd">
    <items>
      <item identifier="911targa">
        <attribute identifier="categories" type="hierarchical">
          <value identifier="cars" locale="en_US">Cars</value>
          <attribute identifier="categories" type="hierarchical">
            <value identifier="sports" locale="en_US">Sports Cars</value>
            <attribute identifier="categories" type="hierarchical">
              <value identifier="gt" locale="en_US">GT</value>
            </attribute>
          </attribute>
        </attribute>
      </item>
    </items>
- The top level category defines the universe (cars in the example above)
- The hierarchical attributes have a short name - the identifier of the current hierarchy value (sports, gt)
- The hierarchical attributes are uniquely identifiable by their long name, composed of the short names of the parent components in the hierarchy plus the short name of the current component (e.g. on the level of gt the long name will be cars_sports_gt).
- The long names are not stored in the item store; they are constructed on the fly.
- The delimiter between the path components in the long names is the underscore character. This is the reason why we do not allow '_' in the short names.
- If an illegal character (anything other than a-z or 0-9) is used in the short name, it is encoded with its hexadecimal representation as a UTF-8 character when stored in the item store. A configurable prefix in system.default.xml, com.fredhopper.util.Util/generated-identifier-prefix, allows the loader to always generate Python-friendly identifiers (e.g. when an identifier starts with a number).
- In the enricher the long names will be used to identify the values of hierarchical type.
- In the frontend XML both the short and the long names will be presented.
- When items are requested for a hierarchy value by short name, the result includes items for all values whose long names have that short name as the last component.
- Depending on a system setting, the frontend uses short or long names in links that involve categories. The default behavior is to use long names.
| Note: The indexer only warns about use of '_' in the short category names, however we do not advise use of underscore (_) in the short name, because it may result in ambiguity in the dynamically constructed long name. |
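The relationship between short and long names can be sketched in a few lines. This is an illustration of the naming scheme described above, not FAS code; the function names are hypothetical, and the encoding of illegal characters is deliberately left out.

```python
from typing import List

DELIMITER = "_"  # delimiter between path components in long names


def long_names(path: List[str]) -> List[str]:
    """Return the long name for every level of a category path."""
    return [DELIMITER.join(path[:depth]) for depth in range(1, len(path) + 1)]


def matches_short_name(long_name: str, short_name: str) -> bool:
    """True if the last component of the long name equals the short name."""
    return long_name.split(DELIMITER)[-1] == short_name


# The path from the example above: cars > sports > gt.
print(long_names(["cars", "sports", "gt"]))
# An underscore inside a short name produces the same long name for a
# different path, which is the ambiguity the note warns about.
print(long_names(["cars_sports", "gt"])[-1] == long_names(["cars", "sports", "gt"])[-1])
```

The second call shows why underscores in short names are discouraged: the two-level path `cars_sports > gt` and the three-level path `cars > sports > gt` cannot be told apart from the long name alone.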
Example
A sample data set is available in every FAS release in the folder doc/sample/data/.
| Fredhopper recommends using the Fredhopper Data Manager to generate FAS XML. |