logax and onceler: parse text files with regex strings and output as json.

User Documentation

Use Case

Log data mining. Say you have a bunch of web/app log files, HTML files, or any other kind of text file lying around. You want to "grep" those files for many different search strings and output the matches as JSON, so you can subsequently insert the data into a database for further processing and reporting. If so, this is the tool for you.

Example

1. Find a text file you want to mine.

$ cat joblog1.log

Begin job log at Tue Nov 26 13:50:43 EST 2013
This is just some random log file you might get from an application.
JobID: 12345
email: aaa@aaa.com

2. Create a parser config file like this:

$ cat my_parser.js

...
exports.searchStrings = function() {
    return [ {
        "searchFor" : /^Begin job log at (.*)/,
        "outputField" : "startTime"
    },
    {
        "searchFor" : /^JobID: ([0-9]*)$/,
        "converter" : function(captures) {
            return parseInt(captures[1], 10);
        },
        "outputField" : "jobId"
    },
    {
        "searchFor" : "^email: (.*)$",
        "default" : null,
        "outputField" : "email"
    } ];
};
// Note: output is a regex "capture"; "searchFor" can be a string or regex type.

3. Run `logax` like this:

$ logax --parserFile my_parser.js \
    --input joblog1.log \
    --outputDir /some/dir

4. You get JSON output like this:

$ cat /some/dir/joblog1.json

[
    {
        "email": "aaa@aaa.com",
        "startTime": "Tue Nov 26 13:50:43 EST 2013",
        "jobId": 12345
    }
]

Awesome! You can find more examples in the test directory.
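
To make the matching rules in the example concrete, here is a minimal sketch of how a config of this shape could be applied to the lines of a file. It is not logax's actual implementation: `parseLines` is a hypothetical helper, and the first-match-wins behavior and the handling of "default" are assumptions made for illustration.

```javascript
// Hypothetical helper, not part of logax: applies a searchStrings-style
// config to the lines of one log and returns a single output object.
function parseLines(lines, searchStrings) {
    const result = {};
    for (const spec of searchStrings) {
        // "searchFor" may be a RegExp or a string; normalize to a RegExp.
        const regex = spec.searchFor instanceof RegExp
            ? spec.searchFor
            : new RegExp(spec.searchFor);
        let value; // stays undefined when no line matches
        for (const line of lines) {
            const captures = regex.exec(line);
            if (captures) {
                // Use the converter when given, else the first capture group.
                value = spec.converter ? spec.converter(captures) : captures[1];
                break; // assumption: the first matching line wins
            }
        }
        // Assumption: "default" is used only when nothing matched.
        if (value === undefined && "default" in spec) {
            value = spec["default"];
        }
        if (value !== undefined) {
            result[spec.outputField] = value;
        }
    }
    return result;
}

// Same rules as my_parser.js above.
const searchStrings = [
    { searchFor: /^Begin job log at (.*)/, outputField: "startTime" },
    {
        searchFor: /^JobID: ([0-9]*)$/,
        converter: function (captures) { return parseInt(captures[1], 10); },
        outputField: "jobId"
    },
    { searchFor: "^email: (.*)$", "default": null, outputField: "email" }
];

// Contents of joblog1.log, one entry per line.
const lines = [
    "Begin job log at Tue Nov 26 13:50:43 EST 2013",
    "This is just some random log file you might get from an application.",
    "JobID: 12345",
    "email: aaa@aaa.com"
];

const parsed = parseLines(lines, searchStrings);
console.log(JSON.stringify([parsed], null, 4));
```

The sketch handles a single job per file only; delimiters, terminators, and gz input are left out.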

Installation

Install Node
npm install logax
npm will symlink logax and onceler into the PATH for you! You can also use $(npm bin)/logax in your project directory.

onceler

onceler is a Node.js command-line program that processes each file only once. You provide a JSON config file with the file name globs you want to process. onceler uses dates to keep track of which files have already been processed, and it searches for new files working forward in time. onceler is intended to be run from a cron job or a scheduled task. It can handle gz files!

TODO: Add example config and run of onceler

Functionality

Search for hundreds of regex strings.
JSON array of objects output
Capture one or many data value(s) for each regex!
Provide custom converters to transform and perform calculations on captured value(s)!
Supply a default when there is no match.
Using onceler and logax together, you can search many files in parallel.
Configure your own log parser with onceler.
Use a wrapper around logax that also inserts the json into your db!
Supports *.gz files.
Supports terminators, so you can stop parsing after certain regexes. (Summary sections...)
Can parse a single job log or multiple jobs in one log (with a delimiter).
Values captured before a delimiter go into each object.
Parsed file name is available in the output JSON.
Pass retObj into the converter so you can add multiple fields with one regex match.
If you don't have delimiters, the output will be one object.
Test coverage
See examples of this functionality in the test directory.
Available from npm: https://npmjs.org/package/logax
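
As an illustration of the converter features listed above, the sketch below shows one converter that combines several capture groups and one that adds an extra field through retObj. The `converter(captures)` signature comes from the example config; passing retObj as a second argument is an assumption, and the log lines, field names, and driver code are hypothetical stand-ins for what the parser would do.

```javascript
// One regex, two capture groups, combined into a single calculated value.
const durationSpec = {
    searchFor: /^Elapsed: (\d+)m (\d+)s$/,
    converter: function (captures) {
        return parseInt(captures[1], 10) * 60 + parseInt(captures[2], 10);
    },
    outputField: "elapsedSeconds"
};

// ASSUMPTION: retObj (the output object being built) arrives as the
// second argument, so one match can populate more than one field.
const hostPortSpec = {
    searchFor: /^Server: ([^:]+):(\d+)$/,
    converter: function (captures, retObj) {
        retObj.port = parseInt(captures[2], 10); // extra field
        return captures[1];                      // value for "host"
    },
    outputField: "host"
};

// Hypothetical driver standing in for the parser.
const retObj = {};
const m1 = durationSpec.searchFor.exec("Elapsed: 3m 25s");
retObj[durationSpec.outputField] = durationSpec.converter(m1);
const m2 = hostPortSpec.searchFor.exec("Server: db01:5432");
retObj[hostPortSpec.outputField] = hostPortSpec.converter(m2, retObj);

console.log(JSON.stringify(retObj));
```

The tests in the test directory are the authoritative reference for the real converter arguments.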

Caveats

This tool was developed for unstructured log files. There is no problem using it for any kind of text file regex parsing, but other tools may do a better job. For example, if you want to parse html, xml or some other structured file format, you may want to try a parser for that markup. It's your call.

In your parser it is preferable to use a regex "searchFor" instead of a string. With a regex, a literal * in your search is escaped as \*. With a string "searchFor", you have to double escape it as \\*. Double escaping is just annoying.
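
The difference is easy to see in plain JavaScript. This sketch assumes a string "searchFor" ends up being passed to the `RegExp` constructor; the log line is made up for illustration.

```javascript
const line = "status: *** FAILED ***";

// Regex literal: a single backslash escapes each asterisk.
const fromLiteral = /^status: \*\*\* (\w+)/;

// String form: every backslash must itself be escaped inside the string
// literal, so that the RegExp constructor receives \* and \w.
const fromString = new RegExp("^status: \\*\\*\\* (\\w+)");

// Both forms compile to the same pattern and capture the same value.
const samePattern = fromLiteral.source === fromString.source;
const literalCapture = fromLiteral.exec(line)[1];
const stringCapture = fromString.exec(line)[1];

console.log(samePattern, literalCapture, stringCapture);
```

Forgetting the second backslash in the string form does not just change the match: `new RegExp("^status: *** ")` throws a SyntaxError, because the string reaches the constructor with no escapes at all and the repeated `*` quantifiers are invalid.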

Developer Documentation

Contributions are welcome. Make sure changes have tests.

Future Enhancements

This is roughly in priority order.

mkdir $workingDir if it does not exist.
Uncompress .Z Files.
Have a template or some way of generating an onceler or logax config file.
Implement winston or something else cool for logging.
Add optimization when only searching for a few regexes. Grep or some other cross platform search would be more efficient.
Support Windows (Using *nix find command right now).
Support calculated fields (Based on the values of already captured fields. Post row processing step.)
Handle duplicate log messages such that you can specify which one you want. (nth duplicate)
Output CSV or JSON. Only json is supported right now.
Crazy idea. Onceler could concatenate multiple files together prior to parsing. (Many job logs to JSON array output ;)
Optimize using pipes/streams.
Intelligently process truncated log files
Search for strings on more than one line. (Containing newlines)
Allow search strings to be in xpath for true XML parsing?
Test the parser for valid regexes. (Maybe low priority. The error message right now is decent)

Tools

Created with Nodeclipse (Eclipse Marketplace, site)

Nodeclipse is a free open-source project that grows with your contributions.

A Note on Egit

I am using EGit with Eclipse, and I have gone against the EGit recommended setting of having the .git folder in a parent folder of the logax project. Doing so mostly prevents adding more Eclipse projects later, but that's OK with me; this is only intended to be one Eclipse project. I created the git project in ~/git/logax, with logax being the project, so I manually ran git init there and imported the project into my Eclipse workspace.
