17 changes: 14 additions & 3 deletions core/src/main/scala/org/apache/spark/SparkContext.scala
@@ -816,14 +816,25 @@ class SparkContext(config: SparkConf) extends Logging {
* Hadoop-supported file system URI, and return it as an RDD of Strings.
* @param path path to the text file on a supported file system
* @param minPartitions suggested minimum number of partitions for the resulting RDD
* @param charsetName name of the charset to be used to decode the bytes from the text file
* @return RDD of lines of the text file
*/
def textFile(
path: String,
minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
minPartitions: Int = defaultMinPartitions,
charsetName: String = null): RDD[String] = withScope {
assertNotStopped()
hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
minPartitions).map(pair => pair._2.toString).setName(path)
val textRdd = hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
minPartitions).map(pair => pair._2)
val stringRdd = if (charsetName == null) {
textRdd.map(text => text.toString)
} else {
// Text.getBytes returns the internal memory buffer, whose size is usually larger than the
// actual byte count of the string content, for performance reasons. We must be careful to
// limit the index range to [0, Text.getLength).
textRdd.map(text => new String(text.getBytes, 0, text.getLength, charsetName))
Member:
How about newlines? I think newlines are not decoded with the specified charset. Also, I think we should just focus on adding this support in the data sources.

Author:
Line separator bytes are handled before the line contents, and the separator can be specified via the config item textinputformat.record.delimiter. I think Spark is more like an application-level tool, and user-friendliness is its strength. In typical data-cleaning cases, it's a big help to have an option to specify the target encoding.
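The delimiter the author refers to is a Hadoop TextInputFormat setting rather than a Spark API. A minimal sketch of setting it before reading, assuming `sc` is an existing `SparkContext` and the input path is hypothetical:

```scala
// Sketch: configure a custom record delimiter (here "\r\n") on the underlying
// Hadoop configuration before calling textFile. `sc` and the path are assumptions,
// not part of the patch under review.
sc.hadoopConfiguration.set("textinputformat.record.delimiter", "\r\n")
val lines = sc.textFile("hdfs://namenode/data/input.txt")
```

This is the "manual" setting the reviewer calls a bit odd below: it applies globally to the Hadoop configuration rather than per call.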

Member:
Yeah, but I meant having it in a data source, like the text data source. textinputformat.record.delimiter has to be set manually, which is a bit odd, by the way. SparkContext is a very old API, and I wonder whether we would usually extend support there. I would just focus on fixing this on the text data source side.

}
stringRdd.setName(path)
}

/**
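The buffer caveat in the patch's comment can be illustrated without Hadoop. A self-contained sketch, with plain JVM byte arrays standing in for `Text`'s oversized internal buffer, showing why decoding must stop at the content length:

```scala
import java.nio.charset.StandardCharsets
import java.util.Arrays

object CharsetDecodeSketch {
  def main(args: Array[String]): Unit = {
    // Hadoop's Text reuses an internal byte buffer that is often larger than the
    // current content; we simulate that with an oversized copy of the bytes.
    val content = "héllo".getBytes(StandardCharsets.UTF_8) // 6 valid bytes
    val buffer  = Arrays.copyOf(content, 16)               // oversized backing buffer

    // Decoding the whole buffer drags in trailing NUL bytes...
    val wrong = new String(buffer, StandardCharsets.UTF_8)
    // ...while limiting the range to [0, length) recovers the original string,
    // mirroring `new String(text.getBytes, 0, text.getLength, charsetName)` in the patch.
    val right = new String(buffer, 0, content.length, StandardCharsets.UTF_8)

    assert(right == "héllo")
    assert(wrong.length > right.length) // trailing '\u0000' characters
    println(right)
  }
}
```

The same reasoning explains why the patched `textFile` passes `text.getLength` rather than decoding `text.getBytes` wholesale.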