
pyf.splitter.inputsplitter

Low-level item flow splitting, used to save data flows to disk in separate reusable buckets.

class pyf.splitter.inputsplitter.InputSplitterByAttribute(input_flow, max_items, split_attribute, force_split)

This splitter guarantees that any two items with the same value for a certain attribute (called split_attribute) will not be separated.

The normal behavior is to try to group items that have different split_attribute values into the same bucket when possible, unless the force_split parameter is set to True.

If that makes a given group bigger than the max_items limit, it raises InputSplitterError.

Parameters:
  • input_flow (generator) – Flow of items to split
  • max_items (Integer) – Maximum number of items in a group
  • split_attribute (String) – The attribute name for which equal values will not be separated
  • force_split (boolean) – If True, do not group items with different split_attribute values
split()

Launches the splitting, returning the bucket filenames as a list.
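
A minimal usage sketch (the Line record type and its field values are hypothetical; any objects exposing the split attribute should work the same way):

from collections import namedtuple

from pyf.splitter.inputsplitter import InputSplitterByAttribute

Line = namedtuple('Line', ['order_id', 'amount'])

def order_lines():
    # Lines sharing the same order_id are guaranteed to land in the same bucket.
    yield Line(order_id=1, amount=10)
    yield Line(order_id=1, amount=20)
    yield Line(order_id=2, amount=5)

splitter = InputSplitterByAttribute(order_lines(),
                                    max_items=200,
                                    split_attribute='order_id',
                                    force_split=False)
bucket_filenames = splitter.split()  # list of bucket filenames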

class pyf.splitter.inputsplitter.InputSplitterSingle(input_flow=None, max_items=200)

Simple splitter, without special rules

finalize()

Finalizes the splitter, and returns the buckets

get_new_bucket()

This method shouldn’t be used outside of the implementation.

push(input_item)

Pushes a single item into the splitter, returning the corresponding bucket

split()

Launches the splitting and finalizes it, returning the buckets
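
A sketch of the incremental interface, assuming items are plain picklable objects; push() feeds items one at a time, and finalize() flushes whatever remains:

from pyf.splitter.inputsplitter import InputSplitterSingle

splitter = InputSplitterSingle(max_items=2)
for item in ['a', 'b', 'c']:
    splitter.push(item)        # returns the bucket the item was pushed to
buckets = splitter.finalize()  # with max_items=2, 'c' ends up in a second bucket

The same result can be obtained in one shot by passing the flow at construction time and calling split():

splitter = InputSplitterSingle(input_flow=iter(['a', 'b', 'c']), max_items=2)
buckets = splitter.split()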

pyf.splitter.inputsplitter.chain_splitters(input_flow, splitters_params, separator='\x00')

Allows chaining splitters. Each splitter except the first takes as input the output of the previous one, much like commands piped in a Unix shell.

Parameters:
  • input_flow (iterator) – the flow of items to split
  • separator (string) – the item separator in all splitters
  • splitters_params (list of dictionaries) – the parameters to pass to each splitter

The separator must be the same for all splitters, since each one takes as input the output of the previous one.

The splitters_params argument is a list of dictionaries; each dictionary holds the parameters to pass to one splitter, and they are the same as those accepted by the get_splitter function.

As an example, take the following list:

[{'split_attribute': 'field1'},
 {'split_attribute': 'field2', 'force_split': True}]

This means that the input_flow will first be split by an InputSplitterByAttribute based on the field1 attribute of each item, and then each resulting flow will be split again by a second InputSplitterByAttribute based on the field2 attribute, this time also making sure that any two items with a different value for their field2 attribute will not stay together.
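
A sketch of that two-stage split end to end (the Record type and its field values are illustrative only):

from collections import namedtuple

from pyf.splitter.inputsplitter import chain_splitters

Record = namedtuple('Record', ['field1', 'field2'])

flow = iter([Record('a', 'x'), Record('a', 'y'), Record('b', 'x')])

# First stage groups by field1; the second stage then separates by field2.
buckets = chain_splitters(flow,
                          [{'split_attribute': 'field1'},
                           {'split_attribute': 'field2', 'force_split': True}])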

pyf.splitter.inputsplitter.get_input_item_flow(source_file, separator='\x00', chunk_size=32768)

Yields input items from the source_file file.

Parameters:
  • source_file (FileObject) – The open file that contains pickled input items

The open file must be closed by the caller.
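
A sketch of reading items back out of bucket files, assuming the buckets returned by split() are filenames on disk (as with InputSplitterByAttribute.split()); note that the caller opens and closes each file:

from pyf.splitter.inputsplitter import get_input_item_flow, get_splitter

splitter = get_splitter(input_flow=iter(range(10)), max_items=4)
for bucket_filename in splitter.split():
    source_file = open(bucket_filename, 'rb')
    try:
        for item in get_input_item_flow(source_file):
            print(item)  # replace with real per-item processing
    finally:
        source_file.close()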

pyf.splitter.inputsplitter.get_input_item_str_flow(source_file, separator='\x00', chunk_size=32768)

Low-level function that should not be used as part of the public API.

Used to read the raw data in the chunk files and feed the real get_input_item_flow function so that it can deserialize the data.

pyf.splitter.inputsplitter.get_splitter(input_flow=None, max_items=200, split_attribute=None, force_split=False)

Function to instantiate the adequate splitter, depending on whether you want to:

  • just split the input_flow of items into buckets of at most max_items items with the InputSplitterSingle
  • use the InputSplitterByAttribute to guarantee that two items from the input_flow with the same value for their split_attribute attribute will not be separated into different buckets. You might optionally want to guarantee, with force_split, that two items with a different value for their split_attribute attribute will be separated.
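
A sketch of the dispatch described above, assuming get_splitter returns an InputSplitterSingle when no split_attribute is given and an InputSplitterByAttribute otherwise (the Item type is hypothetical):

from collections import namedtuple

from pyf.splitter.inputsplitter import (InputSplitterByAttribute,
                                        InputSplitterSingle, get_splitter)

# No split_attribute: plain bucketing by size only.
plain = get_splitter(input_flow=iter(range(500)), max_items=200)
assert isinstance(plain, InputSplitterSingle)

# With split_attribute: equal values stay together; force_split=True also
# separates items whose split_attribute values differ.
Item = namedtuple('Item', ['customer_id'])
grouped = get_splitter(input_flow=iter([Item(1), Item(1), Item(2)]),
                       max_items=200,
                       split_attribute='customer_id',
                       force_split=True)
assert isinstance(grouped, InputSplitterByAttribute)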