Geeky Character Set Question
Aug. 23rd, 2003 03:27 am
Probably no one knows this since it is way geeky. Oh well. But any help would be appreciated.
Anyhow, here's the sitch:
With Java, one can use InputStreamReader and OutputStreamWriter to convert files from one character set to another. In my case, from cp1252 (aka Windows) to MacRoman or vice versa.
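For reference, a minimal sketch of that conversion step. The class and method names are made up for illustration, and it assumes the JDK in use ships the "x-MacRoman" charset (standard JDK builds normally do):

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class; the names are illustrative only.
final class CharsetConverter {
    // Java's names for the two encodings. "x-MacRoman" is assumed to be present.
    static final Charset MAC_ROMAN = Charset.forName("x-MacRoman");
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Decodes source with the "from" charset and re-encodes the text into
    // target with the "to" charset. source and target must be different files.
    static void convert(Path source, Path target, Charset from, Charset to) throws IOException {
        try (Reader in = new BufferedReader(
                 new InputStreamReader(Files.newInputStream(source), from));
             Writer out = new BufferedWriter(
                 new OutputStreamWriter(Files.newOutputStream(target), to))) {
            char[] buffer = new char[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        }
    }
}
```

Calling CharsetConverter.convert(in, out, CharsetConverter.MAC_ROMAN, CharsetConverter.CP1252) would go from Mac to Windows; swap the last two arguments for the other direction. Characters that exist in one set but not the other should come out as '?' with this approach, so it may be worth spot-checking the output.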
But the real question is, how to tell which files to bother converting. Let's say I have a bunch of files derived from another process, and some of them are of Windows origin and some of Macintosh origin. But I don't know which, and want to programmatically decide which ones to bother converting from, say, MacRoman to cp1252. And not necessarily using Java, but using any tool which will work, Perl, PHP, grep, whatever.
So e.g. this sort of thing (in pseudo code of indeterminate language):
while (looping through the files) {
    if (HasMacRomanCharacters($thisFile)) {
        runJavaProgramToConvertToCp1252($thisFile);
    }
}
???? Help!
If no one can help, I will devise some really hacky thing.
Visually I can tell -- if I use 'less' to look at a file and see <8F>, I know I have an è (e-grave) in MacRoman. If I use 'less' to look at a file and see <E8>, I know I have an è in cp1252. And I know what the hex equivalents of these are. And I know emacs displays them differently again. But I can't use grep or something like that to search for these successfully, so I am kind of stumped.
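One possible way to do the "HasMacRomanCharacters" test is to look at the raw bytes instead of grepping for characters. The sketch below is a guess-based heuristic, not a reliable detector, and every name in it is invented for illustration. It leans on two facts about the code pages: cp1252 leaves bytes 0x81, 0x8D, 0x8F, 0x90 and 0x9D unassigned (MacRoman uses them for Å, ç, è, ê, ù), and lowercase accented letters sit in 0x87-0x9F in MacRoman but in 0xE0-0xFF in cp1252. Bytes that cp1252 commonly uses for curly quotes, dashes and the ellipsis are skipped, which means some genuine MacRoman letters (such as ñ at 0x96) do not get counted.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical heuristic; it votes on byte ranges and can guess wrong.
final class MacRomanSniffer {
    private static boolean isAsciiLetter(int b) {
        return (b >= 'a' && b <= 'z') || (b >= 'A' && b <= 'Z');
    }

    // The five byte values that cp1252 leaves unassigned.
    private static boolean undefinedInCp1252(int b) {
        return b == 0x81 || b == 0x8D || b == 0x8F || b == 0x90 || b == 0x9D;
    }

    // cp1252 ellipsis, curly quotes and dashes; too ambiguous to count as letters.
    private static boolean commonCp1252Punctuation(int b) {
        return b == 0x85 || (b >= 0x91 && b <= 0x94) || b == 0x96 || b == 0x97;
    }

    static boolean looksLikeMacRoman(byte[] data) {
        int macVotes = 0;
        int winVotes = 0;
        for (int i = 0; i < data.length; i++) {
            int b = data[i] & 0xFF;
            if (b < 0x80) {
                continue; // plain ASCII is identical in both encodings
            }
            if (undefinedInCp1252(b)) {
                return true; // not a valid cp1252 byte, so treat as MacRoman
            }
            boolean besideLetter =
                (i > 0 && isAsciiLetter(data[i - 1] & 0xFF))
                || (i + 1 < data.length && isAsciiLetter(data[i + 1] & 0xFF));
            if (!besideLetter || commonCp1252Punctuation(b)) {
                continue;
            }
            if (b <= 0x9F) {
                macVotes++; // accented letter position in MacRoman
            } else if (b >= 0xE0) {
                winVotes++; // accented lowercase letter position in cp1252
            }
        }
        return macVotes > winVotes;
    }

    static boolean looksLikeMacRoman(Path file) throws IOException {
        return looksLikeMacRoman(Files.readAllBytes(file));
    }
}
```

A file with no high bytes at all returns false, which is fine here since pure ASCII needs no conversion. Files that only contain ambiguous bytes will also return false, so a manual check of a few samples is probably still a good idea.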
(no subject)
Date: 2003-08-23 06:58 pm (UTC)
Besides, the application in question gets data piped in from a web form.
Therefore I was thinking of just checking $_SERVER["HTTP_USER_AGENT"] with php or the User-Agent header with Java, to see what the client is, embedding that string in a predictable place inside the results transcript, and then using that as a boolean test in the next script.
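A rough sketch of the User-Agent idea in Java. The class name, the enum and the "X-Client-Platform:" marker line are all invented for illustration, and the substring checks are an assumption about what typical User-Agent strings contain. Note also that the client platform is only a proxy for the encoding, so this may not always match what the browser actually sent.

```java
import java.util.Locale;

// Hypothetical helper: guesses the client platform from a User-Agent string.
final class ClientPlatform {
    enum Guess { MACINTOSH, WINDOWS, UNKNOWN }

    static Guess fromUserAgent(String userAgent) {
        if (userAgent == null) {
            return Guess.UNKNOWN;
        }
        String ua = userAgent.toLowerCase(Locale.ROOT);
        // Covers tokens such as "Macintosh", "Mac_PowerPC" and "Mac OS X".
        if (ua.contains("mac")) {
            return Guess.MACINTOSH;
        }
        if (ua.contains("windows")) {
            return Guess.WINDOWS;
        }
        return Guess.UNKNOWN;
    }

    // Builds the marker line to embed at a predictable place in the transcript.
    // The "X-Client-Platform:" prefix is made up for this sketch.
    static String tagLine(String userAgent) {
        return "X-Client-Platform: " + fromUserAgent(userAgent);
    }
}
```

The next script could then look for the marker line and only run the MacRoman-to-cp1252 conversion when it says MACINTOSH.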