Rewrote head #1911

Mikadore · 2021-03-25T20:51:47Z

The head utility is in a pretty poor state. This rewrite is to address that.

Newly supported features:

moved to clap
support all CLI arguments + NUM syntax (-c 5b, -1c, -n 4kiB) (better output for --help too).
Some bug fixes & feature support, e.g. the following tests fail on the current master:
- test_head::test_byte_syntax
- test_head::test_line_syntax
- test_head::test_negative_zero_lines
- test_head::test_zero_terminated_syntax_2
Support for arbitrary data (the current version relies on utf8 encoding when it filters by lines)

We're close to 100% compatibility, albeit I messed up tracking all the features, so there are some obscurities & edge cases.
The most notable of these is that head will write a newline if not present as the last character of output if in terminal.
It won't do this if the output is piped, or redirected to a file. There is a wrapper for isatty on crates.io (+ the windows counterpart), which could be helpful.

On performance & memory:

The performance of head can certainly be improved.
Parsing argument syntax is the most obvious here, I will also do some more detailed benchmarking.
Here's a small table of my local measurements (using hyperfine for perf and valgrind for memory (heap)):

Noop:

`head -c 0`	Time	Memory
GNU	0.5 ms	44 bytes
Current	0.8 ms	25,575 bytes
Rewrite	1.6 ms	489,631 bytes

Bytes (on data from /dev/random):

First 512 bytes of a random file (size 1k)

`head -c 512 {random_file}`	Time	Memory
GNU	0.5 ms	1k
Current	1.5 ms	17k
Rewrite	1.6 ms	534k

First 32 mega bytes of a random file (size ~513M)

`head -c 33554432 {random_file} > /dev/zero`	Time	Memory
GNU	7.5 ms	4k
Current	3074 ms	17k
Rewrite	6.3 ms	534k

All but the last 32 mega bytes of a random file (size ~513M)

`head -c -33554432 {random_file} > /dev/zero`	Time	Memory
GNU	99.5 ms	4k
Current	FAILED	FAILED
Rewrite	63.2 ms	534k

Note: I couldn't get the current version to work, the error was

head: error: Argument to option 'c' missing

Lines (on The complete works of Shakespeare):

The whole file is ~5.4M

First 2048 lines:

`head -n 2048 {text} > /dev/zero`	Time	Memory
GNU	0.6ms	4k
Current	1.8ms	100k
Rewrite	1.5ms	534k

All but the last 4096 lines

`head -n -4096 {text} > /dev/zero`	Time	Memory
GNU	1.9ms	4k
Current	52.5ms	5942k
Rewrite	2.3ms	534k

Note: this benchmark only includes data read from files, which the rewrite optimizes for. This also means that memory usage isn't really put to test, better benchmarks can be made

In summary, the rewrite has some constant startup and memory issues but is superior to both GNU and the current version in terms of scalability, and medium+ sized workloads, for run time, and superior to just the current version in terms of memory.
I think the startup time could be improved and will be doing some work in this area in the following days/weeks.

Tests:

I un-ignored most tests and added one. I think testing can & should be improved.
Some tests were not correct (i.e. they'd assert output not compatible with GNU), and were fixed.

sylvestre

Looks great. Just some requests to be consistent with the rest of the programs

sylvestre · 2021-03-25T21:05:41Z

src/uu/head/src/constants.rs

@@ -0,0 +1,89 @@
+
+pub const EXIT_FAILURE: i32 = 1;


I value a lot consistency as we have a lot of binaries.
Please move that into head.rs and follow what is done elsewhere.
example:
https://github.com/uutils/coreutils/blob/master/src/uu/ls/src/ls.rs#L43-L52
and
https://github.com/uutils/coreutils/blob/master/src/uu/ls/src/ls.rs#L78-L82

or
https://github.com/uutils/coreutils/blob/master/src/uu/chroot/src/chroot.rs#L21-L40

thanks and sorry for the extra work

Do I understand you correctly that I should move the content of app.rs to head.rs?

yeah and probably merge the two files

Move constants to app.rs and instead of functions use an options mod.

Instead of this:

#[inline(always)] pub fn bytes_name() -> &'static str { "BYTES" } #[inline(always)] pub fn lines_name() -> &'static str { "LINES" }

Use this:

pub mod Options { pub static LINES: &str = "lines"; pub static BYTES: &str = "bytes"; }

Instead of this:

#[inline(always)] pub fn usage() -> &'static str { "head [FLAG]... [FILE]..." }

Use this:

static USAGE: &str = "head [FLAG]... [FILE]...";

etc.

tests/by-util/test_head.rs

sylvestre · 2021-03-25T21:06:26Z

src/uu/head/src/app.rs

@@ -0,0 +1,254 @@
+use crate::constants;


please rename this file to head.rs to match the other programs

sylvestre · 2021-03-25T21:06:56Z

src/uu/head/src/app.rs

+use uucore::{executable, exit};
+
+pub fn app<'a>() -> App<'a, 'a> {
+    App::new(executable!())


please follow the other programs example (see below)

ycd

Looks great, just a small nitpick, if it is working can you add a test for #1799?

ycd · 2021-03-25T21:11:44Z

src/uu/head/src/app.rs

+                                }
+                                Err(e) => match e {
+                                    parse::ParseError::Overflow => {
+                                        eprintln!(


crash! macro used widely for this operation in codebase, it 'd be better to use it for consistency.

ycd · 2021-03-25T21:11:57Z

src/uu/head/src/app.rs

+                                        exit!(constants::EXIT_FAILURE);
+                                    }
+                                    parse::ParseError::Syntax => {
+                                        eprintln!("head: bad number: '{}'", arg_str);


ycd · 2021-03-25T21:30:40Z

@Mikadore benchmarks looks interesting and the difference in memory usage looks huge, which version of GNU coreutils did you used?

Mikadore · 2021-03-25T21:42:58Z

@ycd head (GNU coreutils) 8.32

The memory usage would be more interesting for piped data, where it is necessary to buffer some of it, imo.
Tbh, it's just late here and I didn't have the energy to run more of these.

Overall, I think benchmarking this project as a whole (and comparing it against GNU) would be interesting to see, maybe I'll get to it some time.

Mikadore · 2021-03-25T21:54:44Z

Looks great, just a small nitpick, if it is working can you add a test for #1799?

I can't add a test for this as the testing API can't handle invalid strings for stdout.

Arcterus

I'm reviewing this on my phone, so I'll see if I can read the code more in-depth later.

Arcterus · 2021-03-25T22:41:29Z

src/uu/head/src/head.rs

-            verbose: false,
-            zero_terminated: false,
+    loop {
+        let read = input.read(&mut readbuf)?;


This (and basically everywhere else using read()) should check for ErrorKind::Interrupted.

Arcterus · 2021-03-25T22:51:36Z

src/uu/head/src/parse.rs

+                        }
+                    }
+                    b'm' | b'M' => {
+                        if is_base_2 {


Since all of these branches are the same, restructure it like so:

let exp = match blah { b'k' | b'K' => 1, b'm' | b'M' => 2, // and so on... }; if is_base_2 { 1024u128.pow(exp) } else { 1000u128.pow(exp) }

I think there's also a .to_ascii_lowercase() or something function you can use to avoid checking both the lowercase and uppercase versions of the letters.

Arcterus · 2021-03-25T22:51:37Z

src/uu/head/src/parse.rs

+                        }
+                    }
+                    b'm' | b'M' => {
+                        if is_base_2 {


Since all of these branches are the same, restructure it like so:

let exp = match blah { b'k' | b'K' => 1, b'm' | b'M' => 2, // and so on... }; if is_base_2 { 1024u128.pow(exp) } else { 1000u128.pow(exp) }

I think there's also .to_ascii_lowercase() or something function you can use to avoid checking both the lowercase and uppercase versions of the letters.

Arcterus · 2021-03-25T22:53:53Z

src/uu/head/src/parse.rs

+    // (3) is any of the other units
+    // (4) is `i`
+    // (5) is `B`
+    let re = regex::Regex::new(r"^-?(\d+)?(?:(b)|([kKmMgGtTpPeEzZyY])(?:(i)B|(B))?)?$").unwrap();


I would personally prefer this not use a regex, as I think they're hard to read. Same for parse_obsolete().

I would personally prefer this not use a regex, as I think they're hard to read. Same for parse_obsolete().

I was planning to rewrite this anyway for performance, however, I'm not sure about the readability aspect of it

Arcterus · 2021-03-25T22:55:48Z

src/uu/head/src/app.rs

+                            match res {
+                                Ok((n, b)) => {
+                                    if b {
+                                        self.mode = Some(Modes::Bytes(n));


This code is pretty deeply nested, so it'd be nice if it was split into functions.

ycd · 2021-03-26T07:03:04Z

@ycd head (GNU coreutils) 8.32

Ah i see, as far as i remember old head (GNU coreutils <= 8.21) had a similar memory usage because it was allocating the memory up front based on the value passed to -c.

Mikadore · 2021-03-26T17:26:53Z

I just realized that head supports arguments in the form of head -123zv, this means that parse_obsolete needs some adjustments in the API + new tests

Mikadore · 2021-03-26T23:42:28Z

I'm happy to report a significant speedup with the rewritten version of the argument parsing. From my basic measurements, basic argument parsing is about 30-60% faster.

We also now completely support the obsolete syntax (e.g. -123zvqvcvmk (- this is bad, please don't do this)).

I am however unhappy with the implementation side of things (it's ugly), and the fact that we have some (big? - need to profile) overhead for supporting the obsolete syntax. The only way for this to be resolved, really, (imo) is to either

stop using clap
get clap to support -123abc

The former is very unappealing in regards to consistency and simplicity, and the latter might not be wanted in clap.

However, it's also worth mentioning that our performance is really good, that is, ~0.5ms slower than GNU for very small workloads, and even faster beyond that! Of course, the faster the better, but I think if left as is even the current performance is satisfactory.

sylvestre

cool stuff :)
bravo

sylvestre · 2021-03-25T21:08:04Z

src/uu/head/src/app.rs

+                        let arg_str = &oss.clone().into_string().unwrap_or_else(|_|"".to_owned());
+                        if let Some(res) = parse::parse_obsolete(&arg_str) {
+                            match res {
+                                Ok((n, b)) => {


could you please rename b for something more explicit

sylvestre · 2021-03-25T21:52:05Z

src/uu/head/src/constants.rs

@@ -0,0 +1,89 @@
+
+pub const EXIT_FAILURE: i32 = 1;


yeah and probably merge the two files

sylvestre · 2021-03-27T07:51:27Z

src/uu/head/src/parse.rs

+    #[test]
+    #[cfg(target_pointer_width = "32")]
+    fn test_parse_obsolete_overflow_x32() {
+        assert_eq!(


could you please fix the errors here?

sylvestre · 2021-03-27T07:52:48Z

src/uu/head/src/head.rs

+const BUF_SIZE: usize = 65536;
+
+mod options {
+    pub const VERSION: &str = env!("CARGO_PKG_VERSION");


please move usage, about and version out of "mod options" (like in the other programs)

ycd · 2021-03-27T11:14:42Z

get clap to support -123abc

Can you elaborate this? which cases are failing without parse_obsolete?

We had a similar case in echo, i fixed it by allowing the leading hyphens (see: #1887)

Mikadore · 2021-03-27T14:56:17Z

get clap to support -123abc

Can you elaborate this? which cases are failing without parse_obsolete?

We had a similar case in echo, i fixed it by allowing the leading hyphens (see: #1887)

This isn't the same (as far as I see), in head -123zvc, the meaning of zvc changes than if it were head -123 -zvc (let alone that -zvc needs an argument for the c part).
head -123zvc is parsed as (in order) "print the first 123" (default is lines), "use \0 as a line terminator", "set verbose mode", "print the first 123 bytes". The c isn't the same as --bytes here.

So, clap would need some sort of NUM type of argument, in the form of -NUM{modifiers}, it would then also need to parse flags correctly, and sort out ambiguities (NUM c vs normal c).

If there is something like this in clap, I'd be more than happy to use it, but I don't know of it.

sylvestre · 2021-03-28T09:37:02Z

it is ready to be reviewed?
btw, bravo for

 codecov/project Successful in 1s — 63.48% (+0.59%) compared to 52997b6

Mikadore · 2021-03-28T12:07:36Z

it is ready to be reviewed?
btw, bravo for
 codecov/project Successful in 1s — 63.48% (+0.59%) compared to 52997b6 

Thanks!

And yes, I think it's ready.

sylvestre

Nothing critical but a few requests :)

sylvestre · 2021-03-28T13:15:44Z

src/uu/head/src/head.rs

+
+mod options {
+    pub const BYTES_NAME: &str = "BYTES";
+    pub const BYTES_HELP: &str = "\


for consistency, please move that in the clap call

Done (I'm assuming you're referring only to the HELP messages)

sylvestre · 2021-03-28T13:15:48Z

src/uu/head/src/head.rs

+        NUM bytes of each file\
+        ";
+    pub const LINES_NAME: &str = "LINES";
+    pub const LINES_HELP: &str = "\


src/uu/head/src/head.rs

sylvestre · 2021-03-28T13:17:02Z

src/uu/head/src/head.rs

+{
+    match parse::parse_num(src) {
+        Ok((n, last)) => Ok((closure(n), last)),
+        Err(reason) => match reason {


would it be possible to add a test covering this case?

sylvestre · 2021-03-28T13:17:11Z

src/uu/head/src/head.rs

+        if let Some(s) = second.to_str() {
+            match parse::parse_obsolete(s) {
+                Some(Ok(iter)) => Ok(Box::new(vec![first].into_iter().chain(iter).chain(args))),
+                Some(Err(e)) => match e {


sylvestre · 2021-03-28T13:18:00Z

src/uu/head/src/head.rs

-        FilterMode::Bytes(count) => {
-            for byte in reader.bytes().take(count) {
-                print!("{}", byte.unwrap() as char);
+fn head_backwards_file(input: &mut std::fs::File, options: &HeadOptions) -> std::io::Result<()> {


Looking at the number of warnings, it doesn't seem tested ?

Oh, yeah. Looking at the tests, all inputs are piped.

Looking at the number of warnings, it doesn't seem tested ?

btw, where do you see these warnings?

Here: https://github.com/uutils/coreutils/pull/1911/files
when the coverage run are completed

You can also what it here:
https://codecov.io/gh/uutils/coreutils/commit/4ee61a9e033609b16e233e6e61ce34ab63b4ee58/

and here: https://codecov.io/gh/uutils/coreutils/src/3b8dcbf0cfe99657ce7e9b59c8d4f73db33ff5b6/src/uu/head/src/head.rs

oh nice, didn't know that

sylvestre · 2021-03-29T11:07:58Z

Bravo for this impressive work.

Mikadore · 2021-03-29T11:11:06Z

Bravo for this impressive work.

Thank you very much! This was my first contribution to a real open source project, and it's been a blast!

Looking forward to work on other things as well

Mikadore added 2 commits March 25, 2021 19:46

pushed changes on new fork

4c327cf

readded accidentally deleted tests

5c401b9

sylvestre requested changes Mar 25, 2021

View reviewed changes

ycd reviewed Mar 25, 2021

View reviewed changes

Mikadore added 2 commits March 25, 2021 23:01

replaced eprintln!+exit! with crash!

892166d

cargo fmted head tests

4fbf818

Arcterus requested changes Mar 25, 2021

View reviewed changes

retry on ErrorKind::Interrupted when .read()ing

31238b6

Mikadore mentioned this pull request Mar 26, 2021

Tests: Replace String with Vec<u8> for stdout (stderr?) #1916

Closed

Mikadore added 2 commits March 26, 2021 14:48

cargo fmt

f04b563

rewrote parse_num

78f1e0e

Mikadore added 3 commits March 27, 2021 00:04

rewrote parse_obsolete

983c11c

added obsolete_extras test

8facca0

fixed bug: app dependant on argv[0] being exactly 'head'

acdf617

Mikadore added 6 commits March 27, 2021 00:47

refactored app.rs into head.rs

2cc45c2

removed app.rs

b019a2e

moved constants.rs into head.rs and renamed it to options

8505823

removed constants.rs

56b01e6

removed regex dependency

1042b69

fixed clippy warnings

d250904

sylvestre reviewed Mar 27, 2021

View reviewed changes

moved VERSION, ABOUT, USAGE out of options

8e1e4ec

Mikadore added 2 commits March 27, 2021 23:46

fixed errors in x32 tests for parse_obsolete

5e2c1d0

added std:: before usize::MAX

3b8dcbf

sylvestre requested changes Mar 28, 2021

View reviewed changes

Mikadore added 11 commits March 28, 2021 15:57

test head_backwards_file

8978403

added tests for parse_mode

2bac18b

added tests for arg_iter

5fbd4b5

inlined help messages

67c2bd6

added test case for invalid encoding for arg_iterate

49a420e

added case for arg_iter

a783cd3

improved coverage

3d77c9d

added overflow test for arg_iterate

57be94a

added coverage for HeadOptions::new and its default impl

680b3b9

added tests to conver -0 for bytes

081ca27

removed bad test on windows

7437dc2

sylvestre merged commit 8320b1e into uutils:master Mar 29, 2021

Uh oh!

Rewrote head #1911

Rewrote head #1911

Uh oh!

Conversation

Mikadore commented Mar 25, 2021

Newly supported features:

On performance & memory:

Noop:

Bytes (on data from /dev/random):

Lines (on The complete works of Shakespeare):

Tests:

Uh oh!

sylvestre left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ycd left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ycd commented Mar 25, 2021

Uh oh!

Mikadore commented Mar 25, 2021

Uh oh!

Mikadore commented Mar 25, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Arcterus left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Arcterus Mar 25, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ycd commented Mar 26, 2021

Uh oh!

Mikadore commented Mar 26, 2021

Uh oh!

Mikadore commented Mar 26, 2021

Uh oh!

sylvestre left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ycd commented Mar 27, 2021

Uh oh!

Mikadore commented Mar 27, 2021

Mikadore commented Mar 25, 2021 •

edited

Loading

Arcterus Mar 25, 2021 •

edited

Loading