|
1 | 1 | --- |
2 | | -date: 2020-05-21 |
| 2 | +date: 2020-05-25 |
3 | 3 | title: "Scanning Files" |
4 | 4 | linkTitle: "Scanning Files" |
5 | 5 | weight: 30 |
6 | 6 | description: > |
7 | 7 | The common configuration for Connect File Pulse. |
8 | 8 | --- |
9 | 9 |
|
10 | | -The connector can be configured with a specific [FSDirectoryWalker](https://github.com/streamthoughts/kafka-connect-file-pulse/blob/master/connect-file-pulse-plugin/src/main/java/io/streamthoughts/kafka/connect/filepulse/scanner/local/FSDirectoryWalker.java) |
11 | | -implementation that will be responsible to scan an input directory looking for files to stream into Kafka. |
| 10 | +The connector must be configured with a specific [FSDirectoryWalker](https://github.com/streamthoughts/kafka-connect-file-pulse/blob/master/connect-file-pulse-plugin/src/main/java/io/streamthoughts/kafka/connect/filepulse/scanner/local/FSDirectoryWalker.java) |
| 11 | +that will be responsible for scanning an input directory to find files eligible to be streamed into Kafka. |
12 | 12 |
|
13 | 13 | The default `FSDirectoryWalker` implementation is : |
14 | 14 |
|
15 | 15 | `io.streamthoughts.kafka.connect.filepulse.scanner.local.LocalFSDirectoryWalker`. |
16 | 16 |
|
17 | | -When scheduled, the `LocalFSDirectoryWalker` will recursively scan the input directory configured via `input.directory.path`. |
18 | | -The SourceConnector will run a background-thread to periodically trigger a file system scan using the configured FSDirectoryWalker. |
| 17 | +The `FilePulseSourceConnector` periodically triggers a file system scan of the directory specified in the `input.directory.path` |
| 18 | +connector property. The scan runs in a background thread that invokes the configured `FSDirectoryWalker`. |
19 | 19 |
|
20 | | -## Connector Configuration |
| 20 | +## Configuring Directory Scan (using `LocalFSDirectoryWalker`) |
21 | 21 |
|
22 | 22 | | Configuration | Description | Type | Default | Importance | |
23 | 23 | | --------------| --------------|-----------| --------- | ------------- | |
24 | 24 | |`fs.scanner.class` | The class used to scan file system | class | *io.streamthoughts.kafka.connect.filepulse.scanner.local.LocalFSDirectoryWalker* | medium | |
25 | 25 | |`fs.scan.directory.path` | The input directory to scan | string | *-* | high | |
26 | 26 | |`fs.scan.interval.ms` | Time interval in milliseconds at which the input directory is scanned | long | *10000* | high | |
| 27 | +|`fs.scan.filters` | The comma-separated list of fully qualified class names of the filters used to list eligible input files | list | *-* | medium | |
| 28 | +|`fs.recursive.scan.enable` | Boolean indicating whether the local directory should be scanned recursively | boolean | *true* | medium | |
27 | 29 |
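As a minimal sketch, the properties from the table above could be combined as follows. The directory path is a hypothetical placeholder; the other values simply restate the defaults listed in the table.

```properties
# Walker used to scan the file system (the documented default)
fs.scanner.class=io.streamthoughts.kafka.connect.filepulse.scanner.local.LocalFSDirectoryWalker
# Hypothetical input directory; replace with your own path
fs.scan.directory.path=/tmp/connect-file-pulse/input
# Scan the directory every 10 seconds (the documented default)
fs.scan.interval.ms=10000
# Also scan sub-directories (the documented default)
fs.recursive.scan.enable=true
```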
|
28 | | -## Filter files |
| 30 | +## Filtering input files |
29 | 31 |
|
30 | | -Files can be filtered to determine if they need to be scheduled or ignored. Files which are filtered are simply skipped and |
31 | | -keep untouched on the file system until next scan. On the next scan, previously filtered files will be evaluate again to determine if there are now eligible to be processing. |
| 32 | +You can configure one or more `FileFilter` implementations to determine whether a file should be scheduled for processing or ignored. |
| 33 | +All files that are filtered out are simply ignored and remain untouched on the file system until the next scan. |
| 34 | +At the next scan, previously filtered files will be evaluated again to determine if they are now eligible for processing. |
32 | 35 |
|
33 | | -These filters are available for use with Kafka Connect File Pulse: |
| 36 | +FilePulse ships with the following built-in filters : |
34 | 37 |
|
35 | | -| Filter | Description | |
36 | | -|--- | --- | |
37 | | -| IgnoreHiddenFileFilter | Filters hidden files from being read. | |
38 | | -| LastModifiedFileFilter | Filters files that been modified to recently based on their last modified date property | |
39 | | -| RegexFileFilter | Filter file that do not match the specified regex | |
| 38 | +### IgnoreHiddenFileFilter |
40 | 39 |
|
| 40 | +The `IgnoreHiddenFileFilter` can be used to filter hidden files from being read. |
| 41 | + |
| 42 | +**Configuration example** |
| 43 | + |
| 44 | +```properties |
| 45 | +fs.scan.filters=io.streamthoughts.kafka.connect.filepulse.scanner.local.filter.IgnoreHiddenFileListFilter |
| 46 | +``` |
| 47 | + |
| 48 | +### LastModifiedFileFilter |
| 49 | + |
| 50 | +The `LastModifiedFileFilter` can be used to filter out files that have been modified too recently, based on their last modified date property. |
| 51 | + |
| 52 | +```properties |
| 53 | +fs.scan.filters=io.streamthoughts.kafka.connect.filepulse.scanner.local.filter.LastModifiedFileFilter |
| 54 | +# Minimum time in milliseconds since a file was last modified before it is accepted (default: 5000) |
| 55 | +file.filter.minimum.age.ms=10000 |
| 56 | +``` |
| 57 | + |
| 58 | +### RegexFileFilter |
| 59 | + |
| 60 | +The `RegexFileFilter` can be used to filter files that do not match the specified regex. |
| 61 | + |
| 62 | +```properties |
| 63 | +fs.scan.filters=io.streamthoughts.kafka.connect.filepulse.scanner.local.filter.RegexFileFilter |
| 64 | +# The regex pattern used to match input files |
| 65 | +file.filter.regex.pattern="\\.log$" |
| 66 | +``` |
41 | 67 |
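Because `fs.scan.filters` is described above as a comma-separated list, several filters can presumably be chained in a single property. The sketch below is an assumption rather than a documented configuration: it reuses the class names exactly as they appear in the examples above and assumes each filter reads its own `file.filter.*` settings from the connector configuration.

```properties
# Assumed combination: skip hidden files and files modified within the last 10 seconds
fs.scan.filters=io.streamthoughts.kafka.connect.filepulse.scanner.local.filter.IgnoreHiddenFileListFilter,io.streamthoughts.kafka.connect.filepulse.scanner.local.filter.LastModifiedFileFilter
# Setting used by the last-modified filter (value taken from the example above)
file.filter.minimum.age.ms=10000
```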
|
42 | 68 | ## Supported File types |
43 | 69 |
|
44 | | -`LocalFSDirectoryWalker` will try to detect if a file needs to be decompressed by probing its content type or its extension (javadoc : [Files#probeContentType](https://docs.oracle.com/javase/8/docs/api/java/nio/file/Files.html#probeContentType-java.nio.file.Path) |
| 70 | +`LocalFSDirectoryWalker` will try to detect if a file needs to be decompressed by probing its content type or its extension (javadoc : [Files#probeContentType](https://docs.oracle.com/javase/8/docs/api/java/nio/file/Files.html#probeContentType-java.nio.file.Path)) |
45 | 71 |
|
46 | 72 | The connector supports the following content types : |
47 | 73 |
|
|