-
-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Rewrote head #1911
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rewrote head #1911
Conversation
sylvestre
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great. Just some requests to be consistent with the rest of the programs
src/uu/head/src/constants.rs
Outdated
| @@ -0,0 +1,89 @@ | |||
|
|
|||
| pub const EXIT_FAILURE: i32 = 1; | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I value a lot consistency as we have a lot of binaries.
Please move that into head.rs and follow what is done elsewhere.
example:
https://github.com/uutils/coreutils/blob/master/src/uu/ls/src/ls.rs#L43-L52
and
https://github.com/uutils/coreutils/blob/master/src/uu/ls/src/ls.rs#L78-L82
thanks and sorry for the extra work
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do I understand you correctly that I should move the content of app.rs to head.rs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah and probably merge the two files
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move constants to app.rs and instead of functions use an options mod.
Instead of this:
#[inline(always)]
pub fn bytes_name() -> &'static str {
"BYTES"
}
#[inline(always)]
pub fn lines_name() -> &'static str {
"LINES"
}Use this:
pub mod Options {
pub static LINES: &str = "lines";
pub static BYTES: &str = "bytes";
}Instead of this:
#[inline(always)]
pub fn usage() -> &'static str {
"head [FLAG]... [FILE]..."
}Use this:
static USAGE: &str = "head [FLAG]... [FILE]...";etc.
src/uu/head/src/app.rs
Outdated
| @@ -0,0 +1,254 @@ | |||
| use crate::constants; | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please rename this file to head.rs to match the other programs
src/uu/head/src/app.rs
Outdated
| use uucore::{executable, exit}; | ||
|
|
||
| pub fn app<'a>() -> App<'a, 'a> { | ||
| App::new(executable!()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please follow the other programs example (see below)
ycd
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, just a small nitpick, if it is working can you add a test for #1799?
src/uu/head/src/app.rs
Outdated
| } | ||
| Err(e) => match e { | ||
| parse::ParseError::Overflow => { | ||
| eprintln!( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
crash! macro used widely for this operation in codebase, it 'd be better to use it for consistency.
src/uu/head/src/app.rs
Outdated
| exit!(constants::EXIT_FAILURE); | ||
| } | ||
| parse::ParseError::Syntax => { | ||
| eprintln!("head: bad number: '{}'", arg_str); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here
|
@Mikadore benchmarks looks interesting and the difference in memory usage looks huge, which version of GNU coreutils did you used? |
|
@ycd The memory usage would be more interesting for piped data, where it is necessary to buffer some of it, imo. Overall, I think benchmarking this project as a whole (and comparing it against GNU) would be interesting to see, maybe I'll get to it some time. |
I can't add a test for this as the testing API can't handle invalid strings for stdout. |
Arcterus
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm reviewing this on my phone, so I'll see if I can read the code more in-depth later.
src/uu/head/src/head.rs
Outdated
| verbose: false, | ||
| zero_terminated: false, | ||
| loop { | ||
| let read = input.read(&mut readbuf)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This (and basically everywhere else using read()) should check for ErrorKind::Interrupted.
src/uu/head/src/parse.rs
Outdated
| } | ||
| } | ||
| b'm' | b'M' => { | ||
| if is_base_2 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since all of these branches are the same, restructure it like so:
let exp = match blah {
b'k' | b'K' => 1,
b'm' | b'M' => 2,
// and so on...
};
if is_base_2 {
1024u128.pow(exp)
} else {
1000u128.pow(exp)
}I think there's also a .to_ascii_lowercase() or something function you can use to avoid checking both the lowercase and uppercase versions of the letters.
src/uu/head/src/parse.rs
Outdated
| } | ||
| } | ||
| b'm' | b'M' => { | ||
| if is_base_2 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since all of these branches are the same, restructure it like so:
let exp = match blah {
b'k' | b'K' => 1,
b'm' | b'M' => 2,
// and so on...
};
if is_base_2 {
1024u128.pow(exp)
} else {
1000u128.pow(exp)
}I think there's also .to_ascii_lowercase() or something function you can use to avoid checking both the lowercase and uppercase versions of the letters.
src/uu/head/src/parse.rs
Outdated
| // (3) is any of the other units | ||
| // (4) is `i` | ||
| // (5) is `B` | ||
| let re = regex::Regex::new(r"^-?(\d+)?(?:(b)|([kKmMgGtTpPeEzZyY])(?:(i)B|(B))?)?$").unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would personally prefer this not use a regex, as I think they're hard to read. Same for parse_obsolete().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would personally prefer this not use a regex, as I think they're hard to read. Same for
parse_obsolete().
I was planning to rewrite this anyway for performance, however, I'm not sure about the readability aspect of it
src/uu/head/src/app.rs
Outdated
| match res { | ||
| Ok((n, b)) => { | ||
| if b { | ||
| self.mode = Some(Modes::Bytes(n)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code is pretty deeply nested, so it'd be nice if it was split into functions.
Ah i see, as far as i remember old head (GNU coreutils <= 8.21) had a similar memory usage because it was allocating the memory up front based on the value passed to |
|
I just realized that head supports arguments in the form of |
|
I'm happy to report a significant speedup with the rewritten version of the argument parsing. From my basic measurements, basic argument parsing is about 30-60% faster. We also now completely support the obsolete syntax (e.g. I am however unhappy with the implementation side of things (it's ugly), and the fact that we have some (big? - need to profile) overhead for supporting the obsolete syntax. The only way for this to be resolved, really, (imo) is to either
The former is very unappealing in regards to consistency and simplicity, and the latter might not be wanted in clap. However, it's also worth mentioning that our performance is really good, that is, ~0.5ms slower than GNU for very small workloads, and even faster beyond that! Of course, the faster the better, but I think if left as is even the current performance is satisfactory. |
sylvestre
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cool stuff :)
bravo
src/uu/head/src/app.rs
Outdated
| let arg_str = &oss.clone().into_string().unwrap_or_else(|_|"".to_owned()); | ||
| if let Some(res) = parse::parse_obsolete(&arg_str) { | ||
| match res { | ||
| Ok((n, b)) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you please rename b for something more explicit
src/uu/head/src/constants.rs
Outdated
| @@ -0,0 +1,89 @@ | |||
|
|
|||
| pub const EXIT_FAILURE: i32 = 1; | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah and probably merge the two files
src/uu/head/src/parse.rs
Outdated
| #[test] | ||
| #[cfg(target_pointer_width = "32")] | ||
| fn test_parse_obsolete_overflow_x32() { | ||
| assert_eq!( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you please fix the errors here?
src/uu/head/src/head.rs
Outdated
| const BUF_SIZE: usize = 65536; | ||
|
|
||
| mod options { | ||
| pub const VERSION: &str = env!("CARGO_PKG_VERSION"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please move usage, about and version out of "mod options" (like in the other programs)
Can you elaborate this? which cases are failing without We had a similar case in |
This isn't the same (as far as I see), in So, clap would need some sort of NUM type of argument, in the form of If there is something like this in clap, I'd be more than happy to use it, but I don't know of it. |
|
it is ready to be reviewed? |
Thanks! And yes, I think it's ready. |
sylvestre
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nothing critical but a few requests :)
src/uu/head/src/head.rs
Outdated
|
|
||
| mod options { | ||
| pub const BYTES_NAME: &str = "BYTES"; | ||
| pub const BYTES_HELP: &str = "\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for consistency, please move that in the clap call
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done (I'm assuming you're referring only to the HELP messages)
src/uu/head/src/head.rs
Outdated
| NUM bytes of each file\ | ||
| "; | ||
| pub const LINES_NAME: &str = "LINES"; | ||
| pub const LINES_HELP: &str = "\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
| { | ||
| match parse::parse_num(src) { | ||
| Ok((n, last)) => Ok((closure(n), last)), | ||
| Err(reason) => match reason { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would it be possible to add a test covering this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| if let Some(s) = second.to_str() { | ||
| match parse::parse_obsolete(s) { | ||
| Some(Ok(iter)) => Ok(Box::new(vec![first].into_iter().chain(iter).chain(args))), | ||
| Some(Err(e)) => match e { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| FilterMode::Bytes(count) => { | ||
| for byte in reader.bytes().take(count) { | ||
| print!("{}", byte.unwrap() as char); | ||
| fn head_backwards_file(input: &mut std::fs::File, options: &HeadOptions) -> std::io::Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the number of warnings, it doesn't seem tested ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, yeah. Looking at the tests, all inputs are piped.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the number of warnings, it doesn't seem tested ?
btw, where do you see these warnings?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here: https://github.com/uutils/coreutils/pull/1911/files
when the coverage run are completed
You can also what it here:
https://codecov.io/gh/uutils/coreutils/commit/4ee61a9e033609b16e233e6e61ce34ab63b4ee58/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh nice, didn't know that
|
Bravo for this impressive work. |
Thank you very much! This was my first contribution to a real open source project, and it's been a blast! Looking forward to work on other things as well |
The head utility is in a pretty poor state. This rewrite is to address that.
Newly supported features:
-c 5b,-1c,-n 4kiB) (better output for--helptoo).We're close to 100% compatibility, albeit I messed up tracking all the features, so there are some obscurities & edge cases.
The most notable of these is that head will write a newline if not present as the last character of output if in terminal.
It won't do this if the output is piped, or redirected to a file. There is a wrapper for
isattyon crates.io (+ the windows counterpart), which could be helpful.On performance & memory:
The performance of
headcan certainly be improved.Parsing argument syntax is the most obvious here, I will also do some more detailed benchmarking.
Here's a small table of my local measurements (using hyperfine for perf and
valgrindfor memory (heap)):Noop:
head -c 0Bytes (on data from /dev/random):
First 512 bytes of a random file (size 1k)
head -c 512 {random_file}First 32 mega bytes of a random file (size ~513M)
head -c 33554432 {random_file} > /dev/zeroAll but the last 32 mega bytes of a random file (size ~513M)
head -c -33554432 {random_file} > /dev/zeroNote: I couldn't get the current version to work, the error was
Lines (on The complete works of Shakespeare):
The whole file is ~5.4M
First 2048 lines:
head -n 2048 {text} > /dev/zeroAll but the last 4096 lines
head -n -4096 {text} > /dev/zeroNote: this benchmark only includes data read from files, which the rewrite optimizes for. This also means that memory usage isn't really put to test, better benchmarks can be made
In summary, the rewrite has some constant startup and memory issues but is superior to both GNU and the current version in terms of scalability, and medium+ sized workloads, for run time, and superior to just the current version in terms of memory.
I think the startup time could be improved and will be doing some work in this area in the following days/weeks.
Tests:
I un-ignored most tests and added one. I think testing can & should be improved.
Some tests were not correct (i.e. they'd assert output not compatible with GNU), and were fixed.