Skip to content

Conversation

@Mikadore
Copy link
Contributor

The head utility is in a pretty poor state. This rewrite is to address that.

Newly supported features:

  • moved to clap
  • support all CLI arguments + NUM syntax (-c 5b, -1c, -n 4kiB) (better output for --help too).
  • Some bug fixes & feature support, e.g. the following tests fail on the current master:
    • test_head::test_byte_syntax
    • test_head::test_line_syntax
    • test_head::test_negative_zero_lines
    • test_head::test_zero_terminated_syntax_2
  • Support for arbitrary data (the current version relies on utf8 encoding when it filters by lines)

We're close to 100% compatibility, albeit I messed up tracking all the features, so there are some obscurities & edge cases.
The most notable of these is that head will write a newline if not present as the last character of output if in terminal.
It won't do this if the output is piped, or redirected to a file. There is a wrapper for isatty on crates.io (+ the windows counterpart), which could be helpful.

On performance & memory:

The performance of head can certainly be improved.
Parsing argument syntax is the most obvious here, I will also do some more detailed benchmarking.
Here's a small table of my local measurements (using hyperfine for perf and valgrind for memory (heap)):

Noop:

head -c 0 Time Memory
GNU 0.5 ms 44 bytes
Current 0.8 ms 25,575 bytes
Rewrite 1.6 ms 489,631 bytes

Bytes (on data from /dev/random):

First 512 bytes of a random file (size 1k)

head -c 512 {random_file} Time Memory
GNU 0.5 ms 1k
Current 1.5 ms 17k
Rewrite 1.6 ms 534k

First 32 mega bytes of a random file (size ~513M)

head -c 33554432 {random_file} > /dev/zero Time Memory
GNU 7.5 ms 4k
Current 3074 ms 17k
Rewrite 6.3 ms 534k

All but the last 32 mega bytes of a random file (size ~513M)

head -c -33554432 {random_file} > /dev/zero Time Memory
GNU 99.5 ms 4k
Current FAILED FAILED
Rewrite 63.2 ms 534k

Note: I couldn't get the current version to work, the error was

head: error: Argument to option 'c' missing

Lines (on The complete works of Shakespeare):

The whole file is ~5.4M

First 2048 lines:

head -n 2048 {text} > /dev/zero Time Memory
GNU 0.6ms 4k
Current 1.8ms 100k
Rewrite 1.5ms 534k

All but the last 4096 lines

head -n -4096 {text} > /dev/zero Time Memory
GNU 1.9ms 4k
Current 52.5ms 5942k
Rewrite 2.3ms 534k

Note: this benchmark only includes data read from files, which the rewrite optimizes for. This also means that memory usage isn't really put to test, better benchmarks can be made

In summary, the rewrite has some constant startup and memory issues but is superior to both GNU and the current version in terms of scalability, and medium+ sized workloads, for run time, and superior to just the current version in terms of memory.
I think the startup time could be improved and will be doing some work in this area in the following days/weeks.

Tests:

I un-ignored most tests and added one. I think testing can & should be improved.
Some tests were not correct (i.e. they'd assert output not compatible with GNU), and were fixed.

Copy link
Contributor

@sylvestre sylvestre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great. Just some requests to be consistent with the rest of the programs

@@ -0,0 +1,89 @@

pub const EXIT_FAILURE: i32 = 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do I understand you correctly that I should move the content of app.rs to head.rs?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah and probably merge the two files

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move constants to app.rs and instead of functions use an options mod.

Instead of this:

#[inline(always)]
pub fn bytes_name() -> &'static str {
    "BYTES"
}

#[inline(always)]
pub fn lines_name() -> &'static str {
    "LINES"
}

Use this:

pub mod Options {
    pub static LINES: &str = "lines";
    pub static BYTES: &str = "bytes";
}

Instead of this:

#[inline(always)]
pub fn usage() -> &'static str {
    "head [FLAG]... [FILE]..."
}

Use this:

static USAGE: &str = "head [FLAG]... [FILE]...";

etc.

@@ -0,0 +1,254 @@
use crate::constants;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please rename this file to head.rs to match the other programs

use uucore::{executable, exit};

pub fn app<'a>() -> App<'a, 'a> {
App::new(executable!())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please follow the other programs example (see below)

Copy link
Contributor

@ycd ycd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, just a small nitpick, if it is working can you add a test for #1799?

}
Err(e) => match e {
parse::ParseError::Overflow => {
eprintln!(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

crash! macro used widely for this operation in codebase, it 'd be better to use it for consistency.

exit!(constants::EXIT_FAILURE);
}
parse::ParseError::Syntax => {
eprintln!("head: bad number: '{}'", arg_str);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

@ycd
Copy link
Contributor

ycd commented Mar 25, 2021

@Mikadore benchmarks looks interesting and the difference in memory usage looks huge, which version of GNU coreutils did you used?

@Mikadore
Copy link
Contributor Author

@ycd head (GNU coreutils) 8.32

The memory usage would be more interesting for piped data, where it is necessary to buffer some of it, imo.
Tbh, it's just late here and I didn't have the energy to run more of these.

Overall, I think benchmarking this project as a whole (and comparing it against GNU) would be interesting to see, maybe I'll get to it some time.

@Mikadore
Copy link
Contributor Author

Mikadore commented Mar 25, 2021

Looks great, just a small nitpick, if it is working can you add a test for #1799?

I can't add a test for this as the testing API can't handle invalid strings for stdout.

Copy link
Collaborator

@Arcterus Arcterus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm reviewing this on my phone, so I'll see if I can read the code more in-depth later.

verbose: false,
zero_terminated: false,
loop {
let read = input.read(&mut readbuf)?;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This (and basically everywhere else using read()) should check for ErrorKind::Interrupted.

}
}
b'm' | b'M' => {
if is_base_2 {
Copy link
Collaborator

@Arcterus Arcterus Mar 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since all of these branches are the same, restructure it like so:

let exp = match blah {
    b'k' | b'K' => 1,
    b'm' | b'M' => 2,
    // and so on...
};

if is_base_2 {
    1024u128.pow(exp)
} else {
    1000u128.pow(exp)
}

I think there's also a .to_ascii_lowercase() or something function you can use to avoid checking both the lowercase and uppercase versions of the letters.

}
}
b'm' | b'M' => {
if is_base_2 {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since all of these branches are the same, restructure it like so:

let exp = match blah {
    b'k' | b'K' => 1,
    b'm' | b'M' => 2,
    // and so on...
};

if is_base_2 {
    1024u128.pow(exp)
} else {
    1000u128.pow(exp)
}

I think there's also .to_ascii_lowercase() or something function you can use to avoid checking both the lowercase and uppercase versions of the letters.

// (3) is any of the other units
// (4) is `i`
// (5) is `B`
let re = regex::Regex::new(r"^-?(\d+)?(?:(b)|([kKmMgGtTpPeEzZyY])(?:(i)B|(B))?)?$").unwrap();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would personally prefer this not use a regex, as I think they're hard to read. Same for parse_obsolete().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would personally prefer this not use a regex, as I think they're hard to read. Same for parse_obsolete().

I was planning to rewrite this anyway for performance, however, I'm not sure about the readability aspect of it

match res {
Ok((n, b)) => {
if b {
self.mode = Some(Modes::Bytes(n));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is pretty deeply nested, so it'd be nice if it was split into functions.

@ycd
Copy link
Contributor

ycd commented Mar 26, 2021

@ycd head (GNU coreutils) 8.32

Ah i see, as far as i remember old head (GNU coreutils <= 8.21) had a similar memory usage because it was allocating the memory up front based on the value passed to -c.

@Mikadore
Copy link
Contributor Author

I just realized that head supports arguments in the form of head -123zv, this means that parse_obsolete needs some adjustments in the API + new tests

@Mikadore
Copy link
Contributor Author

I'm happy to report a significant speedup with the rewritten version of the argument parsing. From my basic measurements, basic argument parsing is about 30-60% faster.

We also now completely support the obsolete syntax (e.g. -123zvqvcvmk (- this is bad, please don't do this)).

I am however unhappy with the implementation side of things (it's ugly), and the fact that we have some (big? - need to profile) overhead for supporting the obsolete syntax. The only way for this to be resolved, really, (imo) is to either

  • stop using clap
  • get clap to support -123abc

The former is very unappealing in regards to consistency and simplicity, and the latter might not be wanted in clap.

However, it's also worth mentioning that our performance is really good, that is, ~0.5ms slower than GNU for very small workloads, and even faster beyond that! Of course, the faster the better, but I think if left as is even the current performance is satisfactory.

Copy link
Contributor

@sylvestre sylvestre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool stuff :)
bravo

let arg_str = &oss.clone().into_string().unwrap_or_else(|_|"".to_owned());
if let Some(res) = parse::parse_obsolete(&arg_str) {
match res {
Ok((n, b)) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you please rename b for something more explicit

@@ -0,0 +1,89 @@

pub const EXIT_FAILURE: i32 = 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah and probably merge the two files

#[test]
#[cfg(target_pointer_width = "32")]
fn test_parse_obsolete_overflow_x32() {
assert_eq!(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you please fix the errors here?

const BUF_SIZE: usize = 65536;

mod options {
pub const VERSION: &str = env!("CARGO_PKG_VERSION");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please move usage, about and version out of "mod options" (like in the other programs)

@ycd
Copy link
Contributor

ycd commented Mar 27, 2021

get clap to support -123abc

Can you elaborate this? which cases are failing without parse_obsolete?

We had a similar case in echo, i fixed it by allowing the leading hyphens (see: #1887)

@Mikadore
Copy link
Contributor Author

get clap to support -123abc

Can you elaborate this? which cases are failing without parse_obsolete?

We had a similar case in echo, i fixed it by allowing the leading hyphens (see: #1887)

This isn't the same (as far as I see), in head -123zvc, the meaning of zvc changes than if it were head -123 -zvc (let alone that -zvc needs an argument for the c part).
head -123zvc is parsed as (in order) "print the first 123" (default is lines), "use \0 as a line terminator", "set verbose mode", "print the first 123 bytes". The c isn't the same as --bytes here.

So, clap would need some sort of NUM type of argument, in the form of -NUM{modifiers}, it would then also need to parse flags correctly, and sort out ambiguities (NUM c vs normal c).

If there is something like this in clap, I'd be more than happy to use it, but I don't know of it.

@sylvestre
Copy link
Contributor

it is ready to be reviewed?
btw, bravo for

 codecov/project Successful in 1s — 63.48% (+0.59%) compared to 52997b6 

@Mikadore
Copy link
Contributor Author

it is ready to be reviewed?
btw, bravo for

 codecov/project Successful in 1s — 63.48% (+0.59%) compared to 52997b6 

Thanks!

And yes, I think it's ready.

Copy link
Contributor

@sylvestre sylvestre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing critical but a few requests :)


mod options {
pub const BYTES_NAME: &str = "BYTES";
pub const BYTES_HELP: &str = "\
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for consistency, please move that in the clap call

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done (I'm assuming you're referring only to the HELP messages)

NUM bytes of each file\
";
pub const LINES_NAME: &str = "LINES";
pub const LINES_HELP: &str = "\
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

{
match parse::parse_num(src) {
Ok((n, last)) => Ok((closure(n), last)),
Err(reason) => match reason {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it be possible to add a test covering this case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

if let Some(s) = second.to_str() {
match parse::parse_obsolete(s) {
Some(Ok(iter)) => Ok(Box::new(vec![first].into_iter().chain(iter).chain(args))),
Some(Err(e)) => match e {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

FilterMode::Bytes(count) => {
for byte in reader.bytes().take(count) {
print!("{}", byte.unwrap() as char);
fn head_backwards_file(input: &mut std::fs::File, options: &HeadOptions) -> std::io::Result<()> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the number of warnings, it doesn't seem tested ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, yeah. Looking at the tests, all inputs are piped.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the number of warnings, it doesn't seem tested ?

btw, where do you see these warnings?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh nice, didn't know that

@sylvestre
Copy link
Contributor

Bravo for this impressive work.

@sylvestre sylvestre merged commit 8320b1e into uutils:master Mar 29, 2021
@Mikadore
Copy link
Contributor Author

Bravo for this impressive work.

Thank you very much! This was my first contribution to a real open source project, and it's been a blast!

Looking forward to work on other things as well

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants